Conversation
This is very heavily based on the work of @fortuna in https://github.com/Jigsaw-Code/net-analysis/. The code from there has been adapted to maxmise the usage of existing ooni/pipeline code.
* Fetch from JSONL buckets instead of cans
* Minimise duplication between can fetch and jsonl fetch code * Improve efficiency in how jsonl files are listed
* Add support for passing lists of test_names and country codes * Use - instead of _ for CLI flags
* Write some integration tests for the s3feeder
* Using parallelisation and sharing of the s3 client we get a 10x performance boost * Add progress bar to oonidata sync command via tqdm
* Estimate ETA for stream_measurements
|
TODO:
|
fortuna
left a comment
There was a problem hiding this comment.
Some comments based on my experience with building the netanalysis ooni_client. My impression is that it may be easier to just adopt our code instead of incorporating the changes into s3feeder. Also keep in mind the use case of using this as a library, not only a command line tool.
| def create_s3_client(): | ||
| return boto3.client("s3", config=botoConfig(signature_version=botoSigUNSIGNED)) | ||
|
|
||
| s3 = create_s3_client() |
There was a problem hiding this comment.
As a potential consumer of this library, I see a few issues:
- You have a a bunch of standalone methods here. It's hard to follow them and figure out what I need.
- You are relying on globals, which invariably cause problems down the line.
- It uses alien concepts like "cans" and "minicans". A consumer doesn't need to be exposed to that.
Instead, offer a class instead that I can instantiate and not rely on globals, using concepts the consumer knows about (files, measurements), exposing a clear interface. This is what I came up with:
https://github.com/Jigsaw-Code/net-analysis/blob/0f9c75cbc5dbd80f6082aaf5a290be6eb7db0171/netanalysis/ooni/data/ooni_client.py#L42
There was a problem hiding this comment.
This part of the library will most likely not be exposed to end users. These methods are internal methods that we use inside of the data processing pipeline.
I agree with you that having globals is probably not ideal. What I did from the perspective of this PR was to minimise the amount of changes made to the existing codebase so that the likelyhood of breaking our pipeline is reduced.
Some refactoring is probably in order further down the line.
I also think that for end users we probably want to have a different, more user-friendly API, that abstracts away all of these alien concepts and just exposes a very simple "give me a generator of measurements for this search criteria".
| legacy cans | ||
| """ | ||
| # TODO: split this and handle legacy cans and post/minicans independently | ||
| if fn.endswith(".tar.lz4"): |
There was a problem hiding this comment.
I'd recommend handling the legacy format in a separate code. In netanalysis I have _2020OoniClient and _LegacyOoniClient. Note that they don't have to have the same interface.
There was a problem hiding this comment.
These are not functions that are meant to be used by end users. We should actually never be giving end users the legacy formats, but just the published JSONLs. That way there is no need for them to ever call code dealing with the various old formats.
We do, however, need these in our codebase because we need to handle them as part of our data pipeline.
|
Sharing here some of the results from the benchmark of ThreadPool vs ProcessPool. These tests were run on a 32 core server with 1 Gbit link, using the default configuration of ThreadPool and ProcessPool (it defaults to using os.cpu_count()). By running: This command in particular will download 8.5GB worth of data. With ProcessPool: With ThreadPool: By running on a test type that has smaller files on a larger time range: This range of data had an overall size of 526M. with ThreadPool: with ProcessPool: Based on these stats, it seems like there is very little difference between the two, with just an extremely marginal improvement when using ProcessPool. |
Co-authored-by: Vinicius Fortuna <fortuna@users.noreply.github.com>
* 'oonidata' of github.com:ooni/pipeline: Update oonidata/oonidata/s3feeder.py
This is related to making it easier for third parties to consume OONI data (see: ooni/backend#514).
As part of this PR I have also done a bit of refactoring to the s3feeder with the following goals:
oonidatajsonlbuckets instead of thecansandminicansCurrently the
jsonlfetchers are not wired up to the fastpath, but it should just be a matter of changing the functions that callstream_canstostream_jsonlas the API is fully compatible.We should probably first discuss with @FedericoCeratto when and how exactly this should be done.
In doing the switch from reprocessing from cans to jsonl, I think we should consider the following:
jsonl/prefix), is very good if a user is interested in accessing just measurements for a given set of test_names and countries, yet it doesn't work very well if you want to process ALL data for a given time range. This may or may not be an issue for our data processing pipeline, I think that when we are to reprocess data, we probably also want to do it on a per test_name basis and we care to reprocess it from the beginning of time to date.Regarding the
oonidatatool it currently supports thesynccommand, with command line options that are taken directly from thenetanalysis.ooni.data.sync_measurementstool of @fortuna.I would actually like to change the CLI API a bit, so it should not be considered stable ATM.
The main changes I would like to do are:
Other minor things I am considering as well are:
_with-and align the naming conventions with those used elsewhere in the pipeline)The reason why I am opening this PR is that I would like to start getting feedback on early on, so that this can inform future iterations.