make json preprocess version numbers available external to jsons #2

@jeremydouglass

We cut down substantially on rerun times, but just opening, parsing, and closing each JSON still takes a long time, even when the second check of an article is a no-op. This is something like 90 seconds for 300 articles, even if every article is read-only and then ignored (spacy never runs, zip never saved).

In the future we might want to move to keeping a list of article names with version numbers in the root of the zip -- like we are keeping a list of processed zips. Then we don't have to open them (which takes a surprisingly long time).

For example, we might have a single json file in the root that helps us write conditions to decide what to update / reprocess, based on the preprocess version, the hash, and/or when it was last processed.

{
  "article1.json": {
    "ppversion": "0.1",
    "hashversion": "0.1",
    "processed": "20190503123114"
  },
  "article2.json": {
    "ppversion": "0.1",
    "hashversion": "0.1",
    "processed": "20190503123114"
  }
}
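
As a rough sketch of how such an index could be used, the snippet below reads one small JSON file from the zip root and decides which articles to reprocess without opening any article JSON. The file name `index.json`, the constant `PP_VERSION`, and all three function names are assumptions made for illustration; nothing here is existing project code, and the check covers only `ppversion` (a hash or timestamp condition could be added the same way).

```python
import io
import json
import zipfile

INDEX_NAME = "index.json"  # hypothetical name for the root-level index file
PP_VERSION = "0.1"         # hypothetical current preprocess version


def load_index(zip_path_or_file):
    """Return the root-level index as a dict, or {} if the zip has none."""
    with zipfile.ZipFile(zip_path_or_file) as zf:
        if INDEX_NAME not in zf.namelist():
            return {}
        with zf.open(INDEX_NAME) as fh:
            return json.load(fh)


def needs_reprocess(index, article_name, current_ppversion=PP_VERSION):
    """True if the article is missing from the index or its ppversion is stale."""
    entry = index.get(article_name)
    return entry is None or entry.get("ppversion") != current_ppversion


def articles_to_update(zip_path_or_file, article_names):
    """Filter article names down to those that still need preprocessing."""
    index = load_index(zip_path_or_file)
    return [name for name in article_names if needs_reprocess(index, name)]


if __name__ == "__main__":
    # Build a tiny in-memory zip that contains only the index file.
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr(INDEX_NAME, json.dumps({
            "article1.json": {"ppversion": "0.1"},
            "article2.json": {"ppversion": "0.0"},
        }))
    print(articles_to_update(buf, ["article1.json", "article2.json", "article3.json"]))
```

With this layout, a rerun over an unchanged zip costs one zip open and one small JSON parse instead of one parse per article. Keeping the index in sync whenever an article is rewritten is the part this sketch leaves out.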
