A log processing pipeline for parsing JSONL files, validating structure, and aggregating user/feature metrics.
backend-builder/
├── part0_generation/ # Generate sample log data
├── part1_parsing/ # Parse and validate logs
├── part2_aggregation/ # Compute aggregated metrics
├── utils/ # Shared utilities
├── tests/ # Test suite
└── output/ # Generated output files
pip install -r requirements.txt
Run the full pipeline in order:
# 1. Generate sample log data
python3 part0_generation/generate_practice_file.py
# 2. Parse and validate logs
python3 part1_parsing/parse_logs.py
# 3. Aggregate metrics
python3 part2_aggregation/aggregate_metrics.py
- Start: The process begins with the input file output/logs.jsonl.
- Part 1 (Parsing):
  - Part1_Parser (parse_logs.py) reads the file.
  - It validates each line into a RawLog object (checking timestamps, required fields).
  - It groups these logs and saves them to output/parsed_summary.json.
- Part 2 (Aggregation):
  - Part2_Aggregator (aggregate_metrics.py) reads the intermediate JSON.
  - It pairs "start" and "end" logs to create Session objects (calculating duration).
  - It aggregates these sessions by User, Feature, and Time.
- End: The final metrics are saved to output/final_summary.json.
Script: part0_generation/generate_practice_file.py
Generates a sample JSONL file with ~100 lines for testing the pipeline. Includes deliberate error cases (malformed JSON, missing fields, invalid timestamps) to validate error handling.
python3 part0_generation/generate_practice_file.py
- output/logs.jsonl: Sample log data
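A minimal sketch of how a generator like this might mix valid records with the deliberate error cases listed above. The four field names come from the validation rules in Part 1; the function name `generate_lines`, the user/feature values, the error rate, and the ISO 8601 timestamp format are assumptions for illustration and may not match the actual script.

```python
import json
import random
from datetime import datetime, timedelta, timezone

# Hypothetical values; the real script's users, features, and actions may differ.
USERS = ["u1", "u2", "u3"]
FEATURES = ["search", "checkout"]
ACTIONS = ["start", "end"]


def generate_lines(count=100, error_rate=0.1, seed=0):
    """Return `count` JSONL lines, some of them deliberately broken."""
    rng = random.Random(seed)
    base = datetime(2024, 1, 1, tzinfo=timezone.utc)
    lines = []
    for i in range(count):
        record = {
            "user_id": rng.choice(USERS),
            "feature": rng.choice(FEATURES),
            # Assumption: timestamps are ISO 8601 strings.
            "timestamp": (base + timedelta(seconds=30 * i)).isoformat(),
            "action": rng.choice(ACTIONS),
        }
        if rng.random() < error_rate:
            kind = rng.choice(["malformed", "missing_field", "bad_timestamp"])
            if kind == "malformed":
                # Truncate the serialized JSON so it no longer parses.
                lines.append(json.dumps(record)[:-5])
                continue
            if kind == "missing_field":
                del record["feature"]
            else:
                record["timestamp"] = "not-a-timestamp"
        lines.append(json.dumps(record))
    return lines
```

Writing the result with `"\n".join(generate_lines())` to output/logs.jsonl would produce a file of the shape the parser expects. Seeding the random generator keeps the sample reproducible between runs.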
Script: part1_parsing/parse_logs.py
Parses JSONL files, validates structure, groups data by user and feature, and produces clean statistics.
- Error Handling: Gracefully handles malformed JSON, missing fields, and invalid data
- Structure Validation: Validates required fields (user_id, feature, timestamp, action)
- Data Grouping: Groups logs by user_id and feature for analysis
- Summary Generation: Creates comprehensive statistics and saves to JSON
python3 part1_parsing/parse_logs.py
- Console: Parsing statistics, user/feature/action statistics
- output/parsed_summary.json: Complete structured data
- Parse Errors: Malformed JSON (broken strings)
- Validation Errors:
  - Missing required fields (e.g., missing "feature")
  - Invalid timestamp format
  - Invalid action or feature values
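The parse and validation steps described above could be sketched roughly as follows. This is an illustration, not the contents of parse_logs.py: the function names `parse_line` and `group_by_user_and_feature` are hypothetical, timestamps are assumed to be ISO 8601 strings, and the check for allowed action or feature values is left out because the permitted values are not stated here.

```python
import json
from collections import defaultdict
from datetime import datetime

REQUIRED_FIELDS = ("user_id", "feature", "timestamp", "action")


def parse_line(line):
    """Return (record, None) on success or (None, error_message) on failure."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return None, f"parse error: {exc.msg}"
    if not isinstance(record, dict):
        return None, "validation error: expected a JSON object"
    missing = [name for name in REQUIRED_FIELDS if name not in record]
    if missing:
        return None, f"validation error: missing {', '.join(missing)}"
    try:
        # Assumption: timestamps are ISO 8601 strings.
        datetime.fromisoformat(record["timestamp"])
    except (TypeError, ValueError):
        return None, "validation error: invalid timestamp"
    return record, None


def group_by_user_and_feature(lines):
    """Group valid records by (user_id, feature) and collect per-line errors."""
    groups = defaultdict(list)
    errors = []
    for number, line in enumerate(lines, start=1):
        record, error = parse_line(line)
        if error:
            errors.append((number, error))
        else:
            groups[(record["user_id"], record["feature"])].append(record)
    return dict(groups), errors
```

Returning errors alongside the grouped records, rather than raising, lets one bad line be reported without aborting the rest of the file.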
Script: part2_aggregation/aggregate_metrics.py
Builds on Part 1 to compute higher-level metrics involving time intervals and related entities.
- Session Duration Calculation: Matches start/end actions to calculate how long users spend in each feature
- Time Interval Grouping: Groups sessions by hourly intervals to identify usage patterns
- Feature-Level Aggregation: Higher-level metrics grouped by feature (total sessions, durations, unique users)
- User-Level Aggregation: Higher-level metrics grouped by user (total sessions, durations, features used)
- Final Summary: Combines Part 1 and Part 2 results into a comprehensive report
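One plausible way to match start/end actions into sessions is sketched below. It assumes each record is a dict with the four validated fields and an ISO 8601 timestamp; `pair_sessions` and the session dict layout are hypothetical names chosen for this example, and the actual script may pair records differently.

```python
from datetime import datetime


def pair_sessions(records):
    """Pair each "start" with the next "end" for the same (user_id, feature).

    Returns (sessions, incomplete), where incomplete counts unmatched records.
    """
    open_starts = {}  # (user_id, feature) -> start datetime
    sessions = []
    incomplete = 0
    ordered = sorted(records, key=lambda r: datetime.fromisoformat(r["timestamp"]))
    for record in ordered:
        key = (record["user_id"], record["feature"])
        moment = datetime.fromisoformat(record["timestamp"])
        if record["action"] == "start":
            if key in open_starts:
                incomplete += 1  # the previous start never received an end
            open_starts[key] = moment
        elif record["action"] == "end":
            start = open_starts.pop(key, None)
            if start is None:
                incomplete += 1  # an end without a matching start
                continue
            sessions.append({
                "user_id": key[0],
                "feature": key[1],
                "start": start,
                "duration_seconds": (moment - start).total_seconds(),
            })
    incomplete += len(open_starts)  # starts still open at the end of input
    return sessions, incomplete
```

Sorting by timestamp first means the pairing does not depend on the order in which lines appear in the file.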
python3 part2_aggregation/aggregate_metrics.py
- Console: Aggregated metrics by feature, user, and time intervals
- output/final_summary.json: Complete aggregated data
- Session Durations: Time between start and end actions for each user-feature pair
- Feature Aggregation:
  - Total sessions per feature
  - Total and average duration per feature
  - Min/max durations
  - Unique users per feature
- User Aggregation:
  - Total sessions per user
  - Total and average duration per user
  - Features used by each user
  - Sessions breakdown by feature
- Time Interval Aggregation:
  - Sessions grouped by hourly intervals
  - Total and average durations per time period
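The feature-level and hourly metrics listed above might be computed along these lines. The sketch assumes each session is a dict with `user_id`, `feature`, `start` (a `datetime`), and `duration_seconds`; that shape, the function names, and the output key names are assumptions for illustration rather than the script's actual schema.

```python
from collections import defaultdict
from datetime import datetime


def aggregate_by_feature(sessions):
    """Summarize sessions per feature: counts, durations, unique users."""
    buckets = defaultdict(list)
    for session in sessions:
        buckets[session["feature"]].append(session)
    summary = {}
    for feature, items in buckets.items():
        durations = [s["duration_seconds"] for s in items]
        summary[feature] = {
            "total_sessions": len(items),
            "total_duration": sum(durations),
            "average_duration": sum(durations) / len(durations),
            "min_duration": min(durations),
            "max_duration": max(durations),
            "unique_users": len({s["user_id"] for s in items}),
        }
    return summary


def aggregate_by_hour(sessions):
    """Group sessions by the hour in which they started."""
    buckets = defaultdict(list)
    for session in sessions:
        # Truncate the start time to the top of its hour.
        hour = session["start"].replace(minute=0, second=0, microsecond=0)
        buckets[hour.isoformat()].append(session["duration_seconds"])
    return {
        hour: {
            "sessions": len(durations),
            "total_duration": sum(durations),
            "average_duration": sum(durations) / len(durations),
        }
        for hour, durations in buckets.items()
    }
```

User-level aggregation follows the same pattern with `user_id` as the bucket key and the set of features in place of the set of users.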
- Incomplete sessions (unmatched start/end pairs) are tracked but excluded from duration calculations
- All durations are calculated in seconds and formatted for human readability
- The script reuses the parsed_summary.json output from Part 1
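As an illustration of the human-readable formatting mentioned in the notes above, a duration in seconds could be rendered like this. The helper name `format_duration` and the `1h 2m 5s` output style are assumptions; the script's actual format may differ.

```python
def format_duration(seconds):
    """Format a duration in seconds as a short string such as "1h 2m 5s"."""
    total = int(round(seconds))
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    parts = []
    if hours:
        parts.append(f"{hours}h")
    if minutes:
        parts.append(f"{minutes}m")
    # Always show seconds when nothing else was emitted, so 0 renders as "0s".
    if secs or not parts:
        parts.append(f"{secs}s")
    return " ".join(parts)
```

For example, `format_duration(3725)` returns `"1h 2m 5s"` and `format_duration(0)` returns `"0s"`.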
pytest