Add oonidata package + reprocessing from jsonl by hellais · Pull Request #377 · ooni/pipeline

hellais · 2022-02-11T12:37:37Z

This is related to making it easier for third parties to consume OONI data (see: ooni/backend#514).

As part of this PR I have also done a bit of refactoring to the s3feeder with the following goals:

* Make the code there more reusable by tools such as oonidata
* Add support for fetching data from the jsonl buckets instead of the cans and minicans

The jsonl fetchers are not yet wired up to the fastpath, but doing so should just be a matter of switching the call sites from stream_cans to stream_jsonl, since the two APIs are fully compatible.
We should probably first discuss with @FedericoCeratto when and how exactly this should be done.

In switching reprocessing from cans to jsonl, I think we should consider the following:

* The new jsonl bucket format (the one that starts with the jsonl/ prefix) is very good if a user is interested in just the measurements for a given set of test_names and countries, but it doesn't work very well if you want to process ALL data for a given time range. This may or may not be an issue for our data processing pipeline: when we reprocess data, we probably also want to do it on a per-test_name basis, from the beginning of time to date.
* The average size of an individual jsonl file is significantly smaller than that of a can. This means that batch retrieval might be slower.
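
To make the first point concrete, here is a minimal sketch of how listing prefixes multiply. It assumes a key layout of the form jsonl/<test_name>/<country>/<YYYYMMDD>/, which is an assumption for illustration and is not spelled out in this PR. The function names are hypothetical, and the date range is treated as half-open.

```python
from datetime import date, timedelta
from itertools import product
from typing import Iterator, List


def iter_days(since: date, until: date) -> Iterator[date]:
    """Yield each day in the half-open range [since, until)."""
    day = since
    while day < until:
        yield day
        day += timedelta(days=1)


def jsonl_prefixes(test_names: List[str], country_codes: List[str],
                   since: date, until: date) -> List[str]:
    """Build one listing prefix per (test_name, country, day) combination.

    The layout jsonl/<test_name>/<cc>/<YYYYMMDD>/ is assumed for illustration.
    """
    return [
        f"jsonl/{tn}/{cc}/{day:%Y%m%d}/"
        for tn, cc, day in product(test_names, country_codes,
                                   iter_days(since, until))
    ]


# A narrow query (one test, one country, four days) needs four listings.
narrow = jsonl_prefixes(["webconnectivity"], ["RU"],
                        date(2022, 1, 1), date(2022, 1, 5))
print(len(narrow), narrow[0])
```

Under this assumed layout, fetching everything for a time range means enumerating every test name and country first, so the number of listings grows as tests × countries × days.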

Regarding the oonidata tool: it currently supports the sync command, with command line options taken directly from the netanalysis.ooni.data.sync_measurements tool by @fortuna.

I would actually like to change the CLI API a bit, so it should not be considered stable ATM.

The main changes I would like to make are:

* Support passing a list of country codes, or omitting the country code entirely to retrieve all data for a given time range
* Support passing a list of test_names instead of defaulting to web_connectivity

Other minor things I am also considering:

* Small cosmetic changes to the CLI args (e.g. replace _ with - and align the naming conventions with those used elsewhere in the pipeline)
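
The proposed CLI shape could look roughly like the sketch below. The flag names match those used in the benchmark commands later in this thread. The use of argparse, the comma-separated list syntax, and the helper names are assumptions for illustration and not necessarily how oonidata implements it.

```python
import argparse
from datetime import date
from typing import List


def comma_list(value: str) -> List[str]:
    """Split 'RU,IT' into ['RU', 'IT'], dropping empty items."""
    return [item.strip() for item in value.split(",") if item.strip()]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="oonidata")
    sub = parser.add_subparsers(dest="command", required=True)
    sync = sub.add_parser("sync", help="download measurements to a directory")
    sync.add_argument("--since", type=date.fromisoformat, required=True)
    sync.add_argument("--until", type=date.fromisoformat, required=True)
    # Omitting --country-codes means "all countries" in this sketch.
    sync.add_argument("--country-codes", type=comma_list, default=None)
    # No implicit web_connectivity default: omitting means "all tests".
    sync.add_argument("--test-names", type=comma_list, default=None)
    sync.add_argument("--output-dir", required=True)
    return parser


args = build_parser().parse_args(
    ["sync", "--since", "2022-01-01", "--until", "2022-01-05",
     "--country-codes", "RU,IT", "--output-dir", "output"]
)
print(args.country_codes, args.test_names)
```

Using None rather than an empty list for an omitted filter keeps "no filter" distinguishable from "filter that matches nothing".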

I am opening this PR now because I would like to start getting feedback early on, so that it can inform future iterations.

@fortuna

This is very heavily based on the work of @fortuna in https://github.com/Jigsaw-Code/net-analysis/. The code from there has been adapted to maximise the usage of existing ooni/pipeline code.


hellais · 2022-03-24T14:33:39Z

TODO:

* Benchmark using ThreadPool
* Set MAX_POOL_SIZE to a factor of os.cpu_count()
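
A minimal sketch of the second item, assuming a simple integer multiplier. POOL_SIZE_FACTOR and compute_max_pool_size are hypothetical names, and the PR leaves the actual factor open.

```python
import os

# Hypothetical multiplier; the right value would come from benchmarking.
POOL_SIZE_FACTOR = 2


def compute_max_pool_size(factor: int = POOL_SIZE_FACTOR) -> int:
    """Return factor * CPU count, never less than 1."""
    # os.cpu_count() may return None when the count cannot be determined.
    cpus = os.cpu_count() or 1
    return max(1, cpus * factor)


MAX_POOL_SIZE = compute_max_pool_size()
```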

fortuna

Some comments based on my experience with building the netanalysis ooni_client. My impression is that it may be easier to just adopt our code instead of incorporating the changes into s3feeder. Also keep in mind the use case of using this as a library, not only a command line tool.

fortuna · 2022-03-24T14:57:48Z

+def create_s3_client():
+    return boto3.client("s3", config=botoConfig(signature_version=botoSigUNSIGNED))
+
+s3 = create_s3_client()


As a potential consumer of this library, I see a few issues:

* You have a bunch of standalone methods here. It's hard to follow them and figure out what I need.
* You are relying on globals, which invariably cause problems down the line.
* It uses alien concepts like "cans" and "minicans". A consumer doesn't need to be exposed to that.

Instead, offer a class that I can instantiate without relying on globals, using concepts the consumer knows about (files, measurements) and exposing a clear interface. This is what I came up with:
https://github.com/Jigsaw-Code/net-analysis/blob/0f9c75cbc5dbd80f6082aaf5a290be6eb7db0171/netanalysis/ooni/data/ooni_client.py#L42

hellais · 2022-04-06

This part of the library will most likely not be exposed to end users. These are internal methods that we use inside the data processing pipeline.

I agree with you that having globals is not ideal. In this PR I tried to minimise the changes made to the existing codebase, so that the likelihood of breaking our pipeline is reduced.

Some refactoring is probably in order further down the line.

I also think that for end users we probably want a different, more user-friendly API that abstracts away all of these alien concepts and just exposes a very simple "give me a generator of measurements for these search criteria".
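
As a rough illustration of what such an API could look like, here is a sketch of a class that owns its storage backend instead of using a module-level client. Every name in it (FileEntry, MeasurementSource, list_files, read_lines) is hypothetical and not part of this PR or of netanalysis. The backend is injected as two callables so the example runs without network access.

```python
import json
from dataclasses import dataclass
from datetime import date
from typing import Callable, Iterable, Iterator, List, Optional


@dataclass(frozen=True)
class FileEntry:
    """Metadata for one measurement file (hypothetical shape)."""
    path: str
    test_name: str
    country_code: str
    day: date


class MeasurementSource:
    """Yields measurements matching search criteria, with no global state."""

    def __init__(self,
                 list_files: Callable[[], Iterable[FileEntry]],
                 read_lines: Callable[[str], Iterable[str]]) -> None:
        # The backend (e.g. an S3 client wrapper) belongs to the instance.
        self._list_files = list_files
        self._read_lines = read_lines

    def measurements(self, since: date, until: date,
                     test_names: Optional[List[str]] = None,
                     country_codes: Optional[List[str]] = None
                     ) -> Iterator[dict]:
        """Lazily yield parsed measurements for files in [since, until)."""
        for entry in self._list_files():
            if not (since <= entry.day < until):
                continue
            if test_names and entry.test_name not in test_names:
                continue
            if country_codes and entry.country_code not in country_codes:
                continue
            for line in self._read_lines(entry.path):
                yield json.loads(line)


# Usage with an in-memory fake backend.
FILES = {
    "a.jsonl": (FileEntry("a.jsonl", "whatsapp", "RU", date(2022, 1, 2)),
                ['{"id": 1}', '{"id": 2}']),
    "b.jsonl": (FileEntry("b.jsonl", "whatsapp", "IT", date(2022, 1, 2)),
                ['{"id": 3}']),
}
source = MeasurementSource(
    list_files=lambda: [entry for entry, _ in FILES.values()],
    read_lines=lambda path: FILES[path][1],
)
ids = [m["id"] for m in source.measurements(
    date(2022, 1, 1), date(2022, 1, 5), country_codes=["RU"])]
print(ids)
```

Because the consumer only sees files and measurements, the can and minican handling can stay behind whatever implements list_files and read_lines.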

fortuna · 2022-03-24T14:59:44Z

+    legacy cans
+    """
+    # TODO: split this and handle legacy cans and post/minicans independently
+    if fn.endswith(".tar.lz4"):


I'd recommend handling the legacy format in separate code. In netanalysis I have _2020OoniClient and _LegacyOoniClient. Note that they don't have to have the same interface.

hellais · 2022-04-06

These functions are not meant to be used by end users. We should never give end users the legacy formats, only the published JSONLs. That way there is no need for them to ever call code dealing with the various old formats.

We do, however, need these in our codebase because we need to handle them as part of our data pipeline.
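
One way to keep the formats apart inside the pipeline is to dispatch on the filename suffix, as sketched below. Only the .tar.lz4 check comes from the diff under review. The .jsonl.gz suffix and all function names here are assumptions for illustration, and the readers are placeholders rather than real parsers.

```python
from typing import Callable, Dict, Iterator


def read_legacy_can(fn: str) -> Iterator[str]:
    """Placeholder for the legacy .tar.lz4 code path (hypothetical)."""
    yield f"legacy:{fn}"


def read_jsonl(fn: str) -> Iterator[str]:
    """Placeholder for the published jsonl code path (hypothetical)."""
    yield f"jsonl:{fn}"


# One reader per format; the readers need not share any internals.
READERS: Dict[str, Callable[[str], Iterator[str]]] = {
    ".tar.lz4": read_legacy_can,
    ".jsonl.gz": read_jsonl,  # assumed suffix
}


def reader_for(fn: str) -> Callable[[str], Iterator[str]]:
    """Pick the reader whose suffix matches, or fail loudly."""
    for suffix, reader in READERS.items():
        if fn.endswith(suffix):
            return reader
    raise ValueError(f"unsupported file format: {fn}")
```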

hellais · 2022-04-06T16:04:04Z

Sharing here some results from benchmarking ThreadPool against ProcessPool.

These tests were run on a 32-core server with a 1 Gbit link, using the default configuration of ThreadPool and ProcessPool (both default to using os.cpu_count()).

First, by running:

time oonidata sync --since 2022-01-01 --until 2022-01-05 --country-codes RU --test-names webconnectivity --output-dir output

This command downloads 8.5 GB of data.

With ProcessPool:

real    1m24.508s
user    0m45.710s
sys     0m18.767s
---
real    1m23.868s
user    0m46.850s
sys     0m20.405s
---
real    1m23.433s
user    0m46.757s
sys     0m20.322s

With ThreadPool:

real    1m24.009s
user    0m58.021s
sys     0m28.024s
---
real    1m24.002s
user    0m57.935s
sys     0m27.903s
---
real    1m24.141s
user    0m58.933s
sys     0m29.225s

Then, by running on a test type that has smaller files, over a larger time range:

time oonidata sync --since 2022-01-01 --until 2022-03-01 --country-codes RU --test-names whatsapp --output-dir output

This range of data had an overall size of 526M.

With ThreadPool:

real    0m30.137s
user    4m18.437s
sys     0m4.370s
---
real    0m29.380s
user    4m16.164s
sys     0m4.374s

With ProcessPool:

real    0m28.351s
user    4m15.623s
sys     0m4.514s
---
real    0m28.075s
user    4m17.353s
sys     0m4.277s

Based on these numbers, there is very little difference between the two: wall-clock times are nearly identical on the large download, and ProcessPool is only 1 to 2 seconds faster on the run with many small files.
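
For reference, a comparison like this can be set up with a small harness that takes the pool class as a parameter. The sketch below is an illustration, not the benchmarking code from this PR. fetch_size is a stand-in for a real download and all names are hypothetical. It uses functools.partial on a module-level function because a process pool has to pickle the task, which does not work with a closure.

```python
import time
from functools import partial
from multiprocessing.pool import ThreadPool
from typing import Callable, Iterable, List, Tuple


def fetch_size(prefix: str, name: str) -> int:
    """Stand-in for a download: returns the length of the would-be key."""
    return len(prefix + name)


def run_with_pool(pool_factory: Callable, names: Iterable[str],
                  prefix: str) -> Tuple[List[int], float]:
    """Map fetch_size over names with the given pool class and time it."""
    # A partial of a module-level function can be pickled, which a process
    # pool requires; a closure defined inside this function could not.
    task = partial(fetch_size, prefix)
    start = time.perf_counter()
    with pool_factory() as pool:
        results = pool.map(task, list(names))
    return results, time.perf_counter() - start


results, elapsed = run_with_pool(ThreadPool, ["a", "bb", "ccc"], "jsonl/")
print(results, f"{elapsed:.3f}s")
```

Passing multiprocessing.Pool as pool_factory runs the same tasks in worker processes. In that case the call should sit under an `if __name__ == "__main__":` guard.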


hellais added 30 commits February 1, 2022 20:08

Move oonidata related functions into a separate package

d53e637

Implement sync functionality

dc164f7

This is very heavily based on the work of @fortuna in https://github.com/Jigsaw-Code/net-analysis/. The code from there has been adapted to maximise the usage of existing ooni/pipeline code.

Fix broken metrics import

6f2f602

Use mocked statsd client when import is not available

1c51f5a

Fixup setup.py

d2c302d

Fix script path

3b32917

Fix timer mock

cd28117

Add support for trimming strings longer than a certain length

f795b5e

Sync command now fetches data from s3

df42f2a

* Fetch from JSONL buckets instead of cans

More refactoring in preparation for supporting jsonl in fastpath

85dbab0

Refactoring of the jsonl related functionality

72fc33d

* Minimise duplication between can fetch and jsonl fetch code * Improve efficiency in how jsonl files are listed

Fix bug in logic for determining ranges

f2e5094

Improvements to oonidata CLI

9eda084

* Add support for passing lists of test_names and country codes * Use - instead of _ for CLI flags

TMP commit

e078658

Refactor all code related to can and jsonl listing

ed605cc

Reflow using black

8ece3f5

Adjust oonidata CLI based on changes in s3feeder

f779ad6

Small cosmetic improvements to the CLI

64fdcdc

Reflow with black

b450860

Fix typo

a271bed

Simplify jsonl listing

589f9c8

Use day instead of timestamp

9e78b6a

Fix typo

849610f

Fix parsing in s3feeder

bbf8e82

Bugfix related to inconsistent filename in legacy jsonl vs new jsonl …

f9c2cf2

…filenames

Fix log line

6138e11

Include in listing yaml.lz4 files

0af8a04

Bugfixing of listing for legacy cans

f76d8fe

Use XX as unknown country code as key for cans

25d9bd7

Don't display warning for non jsonl

8f2afc2

hellais added 5 commits February 17, 2022 11:53

Boost performance of the jsonl_in_range function

cea7bb0

* Write some integration tests for the s3feeder

Drop TransferConfig

bee5679

Don't perform listing optimisations for ranges larger than 20 days

b92ea29

Add support for parallel listing and download of data

b6be947

* Using parallelisation and sharing of the s3 client we get a 10x performance boost * Add progress bar to oonidata sync command via tqdm

Fix bug in minican listing

a371ce4

hellais marked this pull request as ready for review February 18, 2022 17:34

hellais added 7 commits February 18, 2022 18:37

Fix fastpath tests

6fe7656

Fix bug in unit test

76ad460

Adjust the listing heuristic

95e7298

Fix bug spotted via unit tests

98b1c76

Don't parallelise stream_measurements

917d65d

* Estimate ETA for stream_measurements

Fix typo in stream_jsonl_measurements

8a96844

Merge remote-tracking branch 'origin/master' into oonidata

afdde0d

* origin/master: fix(get_http_header): sync description w/ impl RiseUp VPN: bugfix (#379) 0.54 Make clickhouse dependency optional Stunreachability fix (#378)

hellais requested a review from FedericoCeratto March 15, 2022 13:02

fortuna reviewed Mar 24, 2022


hellais added 5 commits April 6, 2022 17:14

Only look inside the jsonl tree if we need to

79dd249

Remove invalid import

30a8319

Add support for benchmarking threadpool vs processpool

73a5ca2

Put the closure outside of the function

08b321f

Use a partial instead of closure to get process pool to work

eb97646

hellais and others added 4 commits April 6, 2022 18:08

Update oonidata/oonidata/s3feeder.py

590cf4e

Co-authored-by: Vinicius Fortuna <fortuna@users.noreply.github.com>

Add metadata for publication of pypi

150fa29

Add .gitignore

3543ae6

Merge branch 'oonidata' of github.com:ooni/pipeline into oonidata

9cfc53f

* 'oonidata' of github.com:ooni/pipeline: Update oonidata/oonidata/s3feeder.py

hellais wants to merge 51 commits into master from oonidata
