backend-builder

A log processing pipeline for parsing JSONL files, validating structure, and aggregating user/feature metrics.

Project Structure

backend-builder/
├── part0_generation/     # Generate sample log data
├── part1_parsing/        # Parse and validate logs
├── part2_aggregation/    # Compute aggregated metrics
├── utils/                # Shared utilities
├── tests/                # Test suite
└── output/               # Generated output files

Setup

pip install -r requirements.txt

Quick Start

Run the full pipeline in order:

# 1. Generate sample log data
python3 part0_generation/generate_practice_file.py

# 2. Parse and validate logs
python3 part1_parsing/parse_logs.py

# 3. Aggregate metrics
python3 part2_aggregation/aggregate_metrics.py

Flow Explanation

Start: The process begins with the input file output/logs.jsonl.
Part 1 (Parsing):
- Part1_Parser (parse_logs.py) reads the file.
- It validates each line into a RawLog object (checking timestamps, required fields).
- It groups these logs and saves them to output/parsed_summary.json.
Part 2 (Aggregation):
- Part2_Aggregator (aggregate_metrics.py) reads the intermediate JSON.
- It pairs "start" and "end" logs to create Session objects (calculating duration).
- It aggregates these sessions by User, Feature, and Time.
End: The final metrics are saved to output/final_summary.json.

Part 0: Log Generation

Script: part0_generation/generate_practice_file.py

Generates a sample JSONL file with ~100 lines for testing the pipeline. Includes deliberate error cases (malformed JSON, missing fields, invalid timestamps) to validate error handling.
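
As a rough illustration of the idea (not the actual generator), the sketch below builds about 100 JSONL lines and injects the three error cases at fixed positions. The feature names, the user ID format, the ISO 8601 timestamp format, and the function name make_sample_lines are all assumptions made for this example.

```python
import json
import random

# Hypothetical feature names; the real generator may use different ones.
FEATURES = ["search", "export", "dashboard"]


def make_sample_lines(n=100, seed=0):
    """Return n JSONL lines, mostly valid, with a few deliberate error cases."""
    rng = random.Random(seed)  # seeded so the sample is reproducible
    lines = []
    for i in range(n):
        record = {
            "user_id": f"user_{rng.randint(1, 5)}",
            "feature": rng.choice(FEATURES),
            # Assumed timestamp format: ISO 8601 without a timezone.
            "timestamp": f"2024-01-01T{rng.randint(0, 23):02d}:{rng.randint(0, 59):02d}:00",
            "action": rng.choice(["start", "end"]),
        }
        if i % 25 == 5:
            # Malformed JSON: the line is cut off mid-object.
            lines.append('{"user_id": "user_1", "feature": ')
            continue
        if i % 25 == 10:
            del record["feature"]  # missing required field
        elif i % 25 == 15:
            record["timestamp"] = "not-a-timestamp"  # invalid timestamp
        lines.append(json.dumps(record))
    return lines
```

Writing the result with "\n".join(lines) to output/logs.jsonl would give Part 1 a file that exercises every error path.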

Usage

python3 part0_generation/generate_practice_file.py

Output

output/logs.jsonl — Sample log data

Part 1: Parsing and Understanding Log Structure

Script: part1_parsing/parse_logs.py

Parses JSONL files, validates the structure of each line, groups valid logs by user and feature, and produces summary statistics.

Features

Error Handling: Gracefully handles malformed JSON, missing fields, and invalid data
Structure Validation: Validates required fields (user_id, feature, timestamp, action)
Data Grouping: Groups logs by user_id and feature for analysis
Summary Generation: Creates comprehensive statistics and saves to JSON
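
The grouping step can be sketched as follows. This is a minimal example that assumes each validated log is a plain dict with the four required fields; the function name group_logs is hypothetical, and the real script may group RawLog objects instead.

```python
from collections import defaultdict


def group_logs(logs):
    """Group validated log dicts by user_id and by feature."""
    by_user = defaultdict(list)
    by_feature = defaultdict(list)
    for log in logs:
        by_user[log["user_id"]].append(log)
        by_feature[log["feature"]].append(log)
    # Convert to plain dicts so the result serializes cleanly to JSON.
    return dict(by_user), dict(by_feature)
```

Keeping two separate indexes means per-user and per-feature statistics can each be computed in a single pass over their own dictionary.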

Usage

python3 part1_parsing/parse_logs.py

Output

Console: Parsing statistics, user/feature/action statistics
output/parsed_summary.json — Complete structured data

Error Types Handled

Parse Errors: Malformed JSON (broken strings)
Validation Errors:
- Missing required fields (e.g., missing "feature")
- Invalid timestamp format
- Invalid action or feature values
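
One way to separate these error types is to check a line in stages and report the first failure. The sketch below is an assumption about how that could look, not the code in parse_logs.py: the error labels, the function name check_line, the ISO 8601 timestamp format, and the set of valid actions are all illustrative. The set of valid feature values is not given here, so the sketch does not check it.

```python
import json
from datetime import datetime

REQUIRED_FIELDS = ("user_id", "feature", "timestamp", "action")
VALID_ACTIONS = ("start", "end")  # assumption: only these two actions exist


def check_line(line):
    """Return (record, None) on success or (None, error_kind) on failure."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return None, "parse_error"
    if not isinstance(record, dict):
        # Valid JSON, but not an object with named fields.
        return None, "not_an_object"
    if any(field not in record for field in REQUIRED_FIELDS):
        return None, "missing_field"
    try:
        # Assumed format: ISO 8601. Before Python 3.11, fromisoformat
        # rejects a trailing "Z" suffix.
        datetime.fromisoformat(record["timestamp"])
    except (TypeError, ValueError):
        return None, "invalid_timestamp"
    if record["action"] not in VALID_ACTIONS:
        return None, "invalid_value"
    return record, None
```

Counting the returned labels with collections.Counter while reading the file yields the per-type parsing statistics.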

Part 2: Aggregation and Time-Based Metrics

Script: part2_aggregation/aggregate_metrics.py

Builds on the Part 1 output to compute time-based metrics: session durations, plus aggregates per feature, per user, and per hourly interval.

Features

Session Duration Calculation: Matches start/end actions to calculate how long users spend in each feature
Time Interval Grouping: Groups sessions by hourly intervals to identify usage patterns
Feature-Level Aggregation: Higher-level metrics grouped by feature (total sessions, durations, unique users)
User-Level Aggregation: Higher-level metrics grouped by user (total sessions, durations, features used)
Final Summary: Combines Part 1 and Part 2 results into a comprehensive report
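
The start/end matching can be sketched as below. This is one plausible pairing rule, assumed for illustration: logs are sorted by time, and each "end" closes the most recent open "start" for the same user and feature. The dict-based session layout, the ISO 8601 timestamps, and the name pair_sessions are hypothetical; the real script builds Session objects and may resolve overlapping starts differently.

```python
from datetime import datetime


def pair_sessions(logs):
    """Pair each 'start' with the next 'end' for the same (user_id, feature).

    Returns (sessions, incomplete_count).
    """
    open_starts = {}  # (user_id, feature) -> start datetime
    sessions = []
    incomplete = 0
    ordered = sorted(logs, key=lambda log: datetime.fromisoformat(log["timestamp"]))
    for log in ordered:
        key = (log["user_id"], log["feature"])
        ts = datetime.fromisoformat(log["timestamp"])
        if log["action"] == "start":
            if key in open_starts:
                incomplete += 1  # the earlier start never received an end
            open_starts[key] = ts
        elif log["action"] == "end":
            start = open_starts.pop(key, None)
            if start is None:
                incomplete += 1  # an end with no matching start
                continue
            sessions.append({
                "user_id": log["user_id"],
                "feature": log["feature"],
                "start": start,
                "duration_seconds": (ts - start).total_seconds(),
            })
    incomplete += len(open_starts)  # starts still open at end of input
    return sessions, incomplete
```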

Usage

python3 part2_aggregation/aggregate_metrics.py

Output

Console: Aggregated metrics by feature, user, and time intervals
output/final_summary.json — Complete aggregated data

Metrics Calculated

Session Durations: Time between start and end actions for each user-feature pair
Feature Aggregation:
- Total sessions per feature
- Total and average duration per feature
- Min/max durations
- Unique users per feature
User Aggregation:
- Total sessions per user
- Total and average duration per user
- Features used by each user
- Sessions breakdown by feature
Time Interval Aggregation:
- Sessions grouped by hourly intervals
- Total and average durations per time period
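
The feature-level and hourly metrics listed above could be computed along these lines. The sketch assumes each session is a dict with user_id, feature, start (a datetime), and duration_seconds; that layout, the output key names, and the function names are illustrative assumptions, not the schema of final_summary.json.

```python
from collections import defaultdict
from datetime import datetime


def summarize(durations):
    """Basic statistics for a non-empty list of durations in seconds."""
    return {
        "sessions": len(durations),
        "total_seconds": sum(durations),
        "average_seconds": sum(durations) / len(durations),
        "min_seconds": min(durations),
        "max_seconds": max(durations),
    }


def aggregate(sessions):
    """Return (feature_metrics, hourly_metrics) for a list of session dicts."""
    by_feature = defaultdict(list)
    by_hour = defaultdict(list)
    users_per_feature = defaultdict(set)
    for s in sessions:
        by_feature[s["feature"]].append(s["duration_seconds"])
        users_per_feature[s["feature"]].add(s["user_id"])
        # Bucket by the hour in which the session started.
        by_hour[s["start"].strftime("%Y-%m-%d %H:00")].append(s["duration_seconds"])
    feature_metrics = {
        feature: {**summarize(d), "unique_users": len(users_per_feature[feature])}
        for feature, d in by_feature.items()
    }
    hourly_metrics = {hour: summarize(d) for hour, d in by_hour.items()}
    return feature_metrics, hourly_metrics
```

User-level aggregation follows the same pattern with user_id as the grouping key and a set of features in place of the set of users.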

Notes

Incomplete sessions (a start with no matching end, or an end with no matching start) are tracked but excluded from duration calculations
All durations are calculated in seconds and formatted for human readability
The script reuses the parsed_summary.json output from Part 1
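
For the human-readable durations mentioned above, a small helper along these lines would do. The exact output format used by the script is not specified here, so the "1h 02m 05s" style and the name format_duration are assumptions; the sketch also assumes non-negative input.

```python
def format_duration(seconds):
    """Format a non-negative duration in seconds, e.g. 3725 -> '1h 02m 05s'."""
    total = int(round(seconds))
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    if hours:
        return f"{hours}h {minutes:02d}m {secs:02d}s"
    if minutes:
        return f"{minutes}m {secs:02d}s"
    return f"{secs}s"
```

Storing raw seconds in the JSON output and formatting only for console display keeps the saved data easy to re-aggregate.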

Tests

pytest
