Skip to content

Add oonidata package + reprocessing from jsonl#377

Open
hellais wants to merge 51 commits into
masterfrom
oonidata
Open

Add oonidata package + reprocessing from jsonl#377
hellais wants to merge 51 commits into
masterfrom
oonidata

Conversation

@hellais

@hellais hellais commented Feb 11, 2022

Copy link
Copy Markdown
Member

This is related to making it easier for third parties to consume OONI data (see: ooni/backend#514).

As part of this PR I have also done a bit of refactoring to the s3feeder with the following goals:

  • Make the code in there more re-usable by tools such as oonidata
  • Add support for fetching data from the jsonl buckets instead of the cans and minicans

Currently the jsonl fetchers are not wired up to the fastpath, but it should just be a matter of changing the functions that call stream_cans to stream_jsonl as the API is fully compatible.
We should probably first discuss with @FedericoCeratto when and how exactly this should be done.

In doing the switch from reprocessing from cans to jsonl, I think we should consider the following:

  • The new jsonl bucket format (the one that starts with the jsonl/ prefix), is very good if a user is interested in accessing just measurements for a given set of test_names and countries, yet it doesn't work very well if you want to process ALL data for a given time range. This may or may not be an issue for our data processing pipeline, I think that when we are to reprocess data, we probably also want to do it on a per test_name basis and we care to reprocess it from the beginning of time to date.
  • The average size of each individual jsonl is significantly smaller than a can. This means that batch retrieval might be slower.

Regarding the oonidata tool it currently supports the sync command, with command line options that are taken directly from the netanalysis.ooni.data.sync_measurements tool of @fortuna.

I would actually like to change the CLI API a bit, so it should not be considered stable ATM.

The main changes I would like to do are:

  • Support passing a list of country codes or omitting the country code entirely for retrieving all data for a given time range
  • Support passing a list of test_names and not defaulting to web_connectivity

Other minor things I am considering as well are:

  • Small cosmetic changes to the CLI args (ex. replace _ with - and align the naming conventions with those used elsewhere in the pipeline)

The reason why I am opening this PR is that I would like to start getting feedback on early on, so that this can inform future iterations.

This is very heavily based on the work of @fortuna in
https://github.com/Jigsaw-Code/net-analysis/.
The code from there has been adapted to maxmise the usage of existing
ooni/pipeline code.
* Fetch from JSONL buckets instead of cans
* Minimise duplication between can fetch and jsonl fetch code
* Improve efficiency in how jsonl files are listed
* Add support for passing lists of test_names and country codes
* Use - instead of _ for CLI flags
* Write some integration tests for the s3feeder
* Using parallelisation and sharing of the s3 client we get a 10x
performance boost
* Add progress bar to oonidata sync command via tqdm
@hellais hellais marked this pull request as ready for review February 18, 2022 17:34
* Estimate ETA for stream_measurements
* origin/master:
  fix(get_http_header): sync description w/ impl
  RiseUp VPN: bugfix (#379)
  0.54 Make clickhouse dependency optional
  Stunreachability fix (#378)
@hellais hellais requested a review from FedericoCeratto March 15, 2022 13:02
@hellais

hellais commented Mar 24, 2022

Copy link
Copy Markdown
Member Author

TODO:

  • Benchmark using ThreadPool
  • Set MAX_POOL_SIZE to a factor of os.cpu_count()

@fortuna fortuna left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments based on my experience with building the netanalysis ooni_client. My impression is that it may be easier to just adopt our code instead of incorporating the changes into s3feeder. Also keep in mind the use case of using this as a library, not only a command line tool.

Comment thread oonidata/oonidata/s3feeder.py Outdated
def create_s3_client():
return boto3.client("s3", config=botoConfig(signature_version=botoSigUNSIGNED))

s3 = create_s3_client()

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a potential consumer of this library, I see a few issues:

  • You have a a bunch of standalone methods here. It's hard to follow them and figure out what I need.
  • You are relying on globals, which invariably cause problems down the line.
  • It uses alien concepts like "cans" and "minicans". A consumer doesn't need to be exposed to that.

Instead, offer a class instead that I can instantiate and not rely on globals, using concepts the consumer knows about (files, measurements), exposing a clear interface. This is what I came up with:
https://github.com/Jigsaw-Code/net-analysis/blob/0f9c75cbc5dbd80f6082aaf5a290be6eb7db0171/netanalysis/ooni/data/ooni_client.py#L42

@hellais hellais Apr 6, 2022

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This part of the library will most likely not be exposed to end users. These methods are internal methods that we use inside of the data processing pipeline.

I agree with you that having globals is probably not ideal. What I did from the perspective of this PR was to minimise the amount of changes made to the existing codebase so that the likelyhood of breaking our pipeline is reduced.

Some refactoring is probably in order further down the line.

I also think that for end users we probably want to have a different, more user-friendly API, that abstracts away all of these alien concepts and just exposes a very simple "give me a generator of measurements for this search criteria".

legacy cans
"""
# TODO: split this and handle legacy cans and post/minicans independently
if fn.endswith(".tar.lz4"):

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd recommend handling the legacy format in a separate code. In netanalysis I have _2020OoniClient and _LegacyOoniClient. Note that they don't have to have the same interface.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are not functions that are meant to be used by end users. We should actually never be giving end users the legacy formats, but just the published JSONLs. That way there is no need for them to ever call code dealing with the various old formats.

We do, however, need these in our codebase because we need to handle them as part of our data pipeline.

@hellais

hellais commented Apr 6, 2022

Copy link
Copy Markdown
Member Author

Sharing here some of the results from the benchmark of ThreadPool vs ProcessPool.

These tests were run on a 32 core server with 1 Gbit link, using the default configuration of ThreadPool and ProcessPool (it defaults to using os.cpu_count()).

By running:

time oonidata sync --since 2022-01-01 --until 2022-01-05 --country-codes RU --test-names webconnectivity --output-dir output

This command in particular will download 8.5GB worth of data.

With ProcessPool:

real    1m24.508s
user    0m45.710s
sys     0m18.767s
---
real    1m23.868s
user    0m46.850s
sys     0m20.405s
---
real    1m23.433s
user    0m46.757s
sys     0m20.322s

With ThreadPool:

real    1m24.009s
user    0m58.021s
sys     0m28.024s
---
real    1m24.002s
user    0m57.935s
sys     0m27.903s
---
real    1m24.141s
user    0m58.933s
sys     0m29.225s

By running on a test type that has smaller files on a larger time range:

time oonidata sync --since 2022-01-01 --until 2022-03-01 --country-codes RU --test-names whatsapp --output-dir output

This range of data had an overall size of 526M.

with ThreadPool:

real    0m30.137s
user    4m18.437s
sys     0m4.370s
---
real    0m29.380s
user    4m16.164s
sys     0m4.374s

with ProcessPool:

real    0m28.351s
user    4m15.623s
sys     0m4.514s
---
real    0m28.075s
user    4m17.353s
sys     0m4.277s

Based on these stats, it seems like there is very little difference between the two, with just an extremely marginal improvement when using ProcessPool.

hellais and others added 4 commits April 6, 2022 18:08
Co-authored-by: Vinicius Fortuna <fortuna@users.noreply.github.com>
* 'oonidata' of github.com:ooni/pipeline:
  Update oonidata/oonidata/s3feeder.py
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants