amantyagi22/log-processor

High-Performance Log Processor in Node.js 🚀

Week 1 Deliverable — Node.js Streaming + Cluster + Benchmark

Introduction

As a backend developer, I aimed to dive deep into Node.js internals while tackling real-world large data processing. This week, I built a high-performance log processor capable of handling 1GB+ log files without crashing, leveraging:

  • Node.js streams → memory-efficient processing
  • Node.js cluster module → parallel processing across CPU cores
  • Sync vs Async benchmarking → measure event loop performance

This project provided insights into Node.js under-the-hood mechanics and prepared me for building scalable backend systems.

Problem Statement

Most Node.js file processing tutorials either:

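  1. Use fs.readFileSync → blocks the event loop and loads the whole file into memory, so it fails on large files (decoding to a string throws past V8's string length limit, roughly 512MB on recent Node.js versions).
  2. Ignore CPU utilization → run single-threaded, leaving cores idle.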

I wanted a solution that is:

  • Memory-efficient (never load entire file into memory)
  • CPU-efficient (use all cores for heavy workloads)
  • Measurable (compare async vs sync performance)

Architecture

Master Process (Node.js)
       │
       │ forks N workers (cluster)
       ▼
[Worker 0] Chunk 1 of file  → async streaming + optional sync benchmark
[Worker 1] Chunk 2 of file  → async streaming
...
[Worker N-1] Chunk N of file  → async streaming
  • Each worker processes a unique chunk of the file to prevent duplication.
  • Only worker 0 runs a sync benchmark using chunked reading (memory safe).
  • Master aggregates lines processed for a final total.
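
The split-and-aggregate flow above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual master.js: the names splitRanges and runPrimary, the START/END environment variables, and the { lines } message shape are all assumptions made for the example.

```javascript
const cluster = require('cluster');
const os = require('os');

// Split [0, totalLines) into `parts` contiguous half-open ranges.
function splitRanges(totalLines, parts) {
  const size = Math.ceil(totalLines / parts);
  const ranges = [];
  for (let i = 0; i < parts; i++) {
    const start = Math.min(i * size, totalLines);
    const end = Math.min(start + size, totalLines);
    ranges.push({ start, end });
  }
  return ranges;
}

// Hypothetical primary: fork one worker per range and sum the counts
// each worker reports. Assumes a worker reads process.env.START / END,
// calls process.send({ lines }) when done, and then exits normally.
function runPrimary(totalLines, workerCount = os.cpus().length) {
  return new Promise((resolve) => {
    let total = 0;
    let pending = workerCount;
    for (const range of splitRanges(totalLines, workerCount)) {
      // Environment values are strings, so the worker must parse them.
      const worker = cluster.fork({
        START: String(range.start),
        END: String(range.end),
      });
      worker.on('message', (msg) => { total += msg.lines; });
      worker.on('exit', () => {
        if (--pending === 0) resolve(total);
      });
    }
  });
}
```

With 50,000,000 lines and 8 workers, splitRanges yields ranges of 6,250,000 lines each, which matches the ranges shown in the sample output further down. Half-open ranges keep adjacent workers from processing the same boundary line twice.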

Key Features

  1. Memory-efficient streaming

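    const fs = require('fs');
    const readline = require('readline');

    async function processFile(filePath) {
      const rl = readline.createInterface({
        input: fs.createReadStream(filePath),
        crlfDelay: Infinity, // treat \r\n as a single line break
      });
      for await (const line of rl) {
        /* process line */
      }
    }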
    • Can handle files >1GB
    • Processes line by line, minimal memory overhead
  2. Cluster-based parallel processing

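    // cluster.isPrimary replaces the deprecated cluster.isMaster (Node.js 16+)
    if (cluster.isPrimary) for (const _cpu of os.cpus()) cluster.fork();
    • Utilizes all CPU cores
    • Crashed workers can be restarted, but only if the primary re-forks them from the cluster 'exit' event; the cluster module does not restart workers on its own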
  3. Chunked sync benchmarking

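    const fd = fs.openSync(filePath, 'r');
    const buffer = Buffer.alloc(1024 * 1024); // 1 MiB, reused for every read
    while (fs.readSync(fd, buffer, 0, buffer.length, null) > 0) { /* scan chunk */ }
    fs.closeSync(fd);
    • Performs real blocking I/O while keeping memory bounded to one reusable 1 MiB buffer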
    • Compares async vs sync performance
  4. Monitoring memory usage

    console.log(process.memoryUsage());
    • Tracks RSS and HeapUsed per worker
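
To make the streaming feature concrete, here is one way a worker could restrict itself to a line range while still reading through readline. It is a sketch under assumptions: countLinesInRange is a hypothetical helper, and the repository's worker.js may split the work differently.

```javascript
const fs = require('fs');
const readline = require('readline');

// Count (and, in a real worker, parse) the lines whose zero-based index
// falls in the half-open range [start, end).
async function countLinesInRange(filePath, start, end) {
  const input = fs.createReadStream(filePath);
  const rl = readline.createInterface({ input, crlfDelay: Infinity });
  let index = 0;
  let processed = 0;
  try {
    for await (const line of rl) {
      if (index >= end) break; // past this worker's range
      if (index >= start) processed++; // `line` would be parsed here
      index++;
    }
  } finally {
    rl.close();
    input.destroy(); // stop reading the file once the range is done
  }
  return processed;
}
```

For example, countLinesInRange('logs/huge-log-file.log', 0, 6250000) would resolve to 6250000 on a file with at least that many lines. One trade-off of splitting by line number is that a worker still has to scan from the start of the file to reach its first line; splitting by byte offset would avoid that rescan, but it needs extra care at chunk boundaries and is not shown here.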

Sample Output

Master 27979 is running
Forking 8 workers...
[Worker 27980] Starting async stream lines 0-6250000
[Worker 27981] Starting async stream lines 6250000-12500000
...
[Worker 27980] Async processing done. Lines: 6250000
Memory Usage (MB) - RSS: 74.34, HeapUsed: 7.74
[Worker 27980] Starting chunked sync read benchmark
[Worker 27980] Chunked sync read done in 129 ms
Memory Usage (MB) - RSS: 75.08, HeapUsed: 8.17
...
All workers finished. Total lines processed: 50000000
  • Each worker processes its chunk independently.
  • Async streaming uses very low memory.
  • Sync benchmark runs safely on one worker.
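
The memory lines in the output above could be produced by a small helper like the one below. The format string is copied from the sample output; the helper name formatMemoryUsage is an assumption for illustration.

```javascript
// Format the fields of process.memoryUsage() (reported in bytes) as MB
// with two decimals, matching the log lines in the sample output.
function formatMemoryUsage(usage = process.memoryUsage()) {
  const toMB = (bytes) => (bytes / 1024 / 1024).toFixed(2);
  return `Memory Usage (MB) - RSS: ${toMB(usage.rss)}, HeapUsed: ${toMB(usage.heapUsed)}`;
}

console.log(formatMemoryUsage());
```

RSS covers the whole process (heap, code, stack, and Buffers), so it is usually much larger than HeapUsed, as the sample figures show.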

Generate Huge Log File

To generate a large log file for testing (e.g., 50 million lines), run the command below. The shell expands $(date) once, so every line carries the same timestamp:

yes "INFO: User logged in at $(date)" | head -n 50000000 > huge-log-file.log
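
If a per-line timestamp or a portable alternative to the shell pipeline is needed, the same file could be generated from Node.js. This is only a sketch: generateLog is a hypothetical helper that is not part of the repository, and it will be noticeably slower than the yes pipeline for 50 million lines.

```javascript
const fs = require('fs');
const { once } = require('events');

// Write `lineCount` log lines to `filePath`, pausing whenever the
// stream's internal buffer is full so memory stays bounded.
async function generateLog(filePath, lineCount) {
  const out = fs.createWriteStream(filePath);
  for (let i = 0; i < lineCount; i++) {
    const line = `INFO: User logged in at ${new Date().toISOString()}\n`;
    // write() returns false when the buffer is full; wait for 'drain'.
    if (!out.write(line)) await once(out, 'drain');
  }
  out.end();
  await once(out, 'finish'); // all data has been flushed to the file
}
```

Usage would look like generateLog('huge-log-file.log', 50000000). Waiting for 'drain' is what keeps the generator itself memory-safe; writing in a tight loop without it would queue the whole file in memory.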

Lessons Learned

  • Node.js streams are powerful for large file processing.
  • Clusters enable parallelization of CPU-bound tasks, but work must be explicitly split.
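  • Sync vs Async benchmarking: reading a huge file in one synchronous call (fs.readFileSync) can crash; chunked synchronous reads keep memory bounded, though they still block the event loop.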
  • Memory profiling is critical to avoid surprises in production.
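
As an illustration of the chunked synchronous read mentioned above, the sketch below counts newline bytes with a single reusable buffer. The function name countNewlinesSync and the default chunk size are assumptions, and the repository's benchmark may measure something different.

```javascript
const fs = require('fs');

// Count '\n' bytes by reading the file synchronously in fixed-size
// chunks, so memory use stays at one buffer regardless of file size.
function countNewlinesSync(filePath, chunkSize = 1024 * 1024) {
  const fd = fs.openSync(filePath, 'r');
  const buffer = Buffer.alloc(chunkSize);
  let newlines = 0;
  try {
    let bytesRead;
    // A null position means "continue from the current file position".
    while ((bytesRead = fs.readSync(fd, buffer, 0, buffer.length, null)) > 0) {
      for (let i = 0; i < bytesRead; i++) {
        if (buffer[i] === 0x0a) newlines++;
      }
    }
  } finally {
    fs.closeSync(fd);
  }
  return newlines;
}
```

Wrapping a call in process.hrtime.bigint() readings before and after gives a timing figure to compare against the async stream. The loop only inspects the first bytesRead bytes of the buffer, because the final chunk is usually shorter than the buffer and the remainder still holds stale data from the previous read.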

Next Steps

  • Integrate Docker for containerized deployment
  • Add CI/CD pipeline for automated build and deployment
  • Expose a REST API for dynamic log uploading and processing
  • Add metrics endpoint for real-time monitoring

This will evolve into the Month 1 capstone project, combining streaming, clustering, memory monitoring, and production readiness.

Code Repository Structure

high-perf-log-processor/
├─ src/
│   ├─ master.js
│   ├─ worker.js
│   └─ utils.js
├─ logs/
│   └─ huge-log-file.log
├─ package.json
└─ README.md

Outcome

By the end of Week 1, I have a working Node.js log processor that streams huge log files with low memory use, spreads the work across CPU cores, and benchmarks sync vs async reads. Production hardening is left to the next steps listed above.

About

High-performance Node.js log processor built for 1GB+ files using streaming to process logs line-by-line with minimal memory. Uses the Node.js cluster module to split work across CPU cores and aggregate results for fast throughput. Includes safe sync-vs-async benchmarking plus per-worker memory usage monitoring for performance insights.
