MiniSpark

Overview

MiniSpark is a miniature implementation of a distributed data processing framework in the style of Spark and MapReduce, created as a project for CS 537 (Introduction to Operating Systems). It processes data through a directed acyclic graph (DAG) of transformations, using multi-threading to achieve parallelism on a single node.

Distributed data processing frameworks are crucial for performing data analytics on large quantities of data. Frameworks like MapReduce and Spark are powerful and relatively simple to program. Users write declarative queries to manipulate data, and the framework processes the data, automatically handling the difficult problems of distributed computing: parallel processing, inter-process communication, and fault tolerance.

Learning Objectives

To learn about data processing pipelines
To implement a correct MiniSpark framework with several common data processing operators
To efficiently process data in parallel using threads

Project Structure

├── applications/   # Example applications using MiniSpark
├── lib/            # Core library code
├── solution/       # Implementation of MiniSpark
├── sample-files/   # Test input files
├── tests/          # Test suite
└── Makefile        # Build system

Background

To make progress on any project that involves concurrency, you should understand the basics of thread creation, mutual exclusion (with locks), and signaling/waiting (with condition variables).
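The three basics named above can be seen together in one small sketch. It assumes POSIX threads and is not taken from the MiniSpark sources; the names `producer` and `wait_for_value` are hypothetical and exist only for illustration. One thread publishes a value under a lock and signals, while the caller waits on a condition variable until the value is ready.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Shared state guarded by `lock`; `ready` is the condition being waited on. */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static bool ready = false;
static int  value = 0;

/* Hypothetical producer: publish a value under the lock, then signal. */
static void *producer(void *arg) {
    int v = *(int *)arg;
    pthread_mutex_lock(&lock);
    value = v;
    ready = true;
    pthread_cond_signal(&cond);
    pthread_mutex_unlock(&lock);
    return NULL;
}

/* Hypothetical helper: start a producer thread and wait for its value. */
int wait_for_value(int v) {
    pthread_t tid;
    ready = false;                       /* no other thread exists yet */
    int rc = pthread_create(&tid, NULL, producer, &v);
    assert(rc == 0);
    (void)rc;

    pthread_mutex_lock(&lock);
    while (!ready)                       /* loop guards against spurious wakeups */
        pthread_cond_wait(&cond, &lock); /* releases the lock while sleeping */
    int result = value;
    pthread_mutex_unlock(&lock);

    pthread_join(tid, NULL);             /* `v` stays alive until the join */
    return result;
}
```

Checking the condition in a `while` loop rather than an `if` is the usual pattern, since `pthread_cond_wait` may return without a matching signal.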

Key Concepts

RDDs (Resilient Distributed Datasets)

Data is represented as immutable RDDs that form the nodes in a DAG where edges represent transformations.
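As a rough illustration of the DAG described above, an RDD node can be modeled as a record that points at the RDDs it was derived from. The type and function names below (`rdd`, `rdd_kind`, `rdd_depth`) are assumptions made for this sketch and are not the project's actual definitions.

```c
#include <stddef.h>

/* Hypothetical node kinds: a source plus the transformations listed in this README. */
typedef enum { RDD_SOURCE, RDD_MAP, RDD_FILTER, RDD_JOIN, RDD_PARTITIONBY } rdd_kind;

/* Hypothetical RDD node: immutable once built; `deps` are the incoming edges. */
typedef struct rdd {
    rdd_kind kind;
    const struct rdd *deps[2];  /* join has two parents, the others at most one */
    int ndeps;
} rdd;

/* Length of the longest chain of transformations leading to `r`. */
int rdd_depth(const rdd *r) {
    int deepest = 0;
    for (int i = 0; i < r->ndeps; i++) {
        int d = 1 + rdd_depth(r->deps[i]);
        if (d > deepest)
            deepest = d;
    }
    return deepest;  /* a source node has depth 0 */
}
```

Because the graph is acyclic, the recursion always terminates at source nodes.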

Transformations

map: Apply a function to each element
filter: Keep elements that satisfy a predicate
join: Combine elements from two RDDs with matching keys
partitionBy: Redistribute data across partitions

Actions

count: Return the number of elements in an RDD
print: Display each element in an RDD
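To make the operators concrete, here is a minimal single-threaded sketch of map, filter, and count over plain integer arrays. It only illustrates the semantics; the function names (`map_ints`, `filter_ints`, `count_even_squares`) are hypothetical, and the real framework operates on RDD partitions rather than raw arrays.

```c
#include <stddef.h>
#include <stdlib.h>

typedef int (*map_fn)(int);   /* function applied to each element */
typedef int (*pred_fn)(int);  /* nonzero means "keep this element" */

/* map: write f(in[i]) to out[i]; returns the number of elements produced. */
size_t map_ints(const int *in, size_t n, map_fn f, int *out) {
    for (size_t i = 0; i < n; i++)
        out[i] = f(in[i]);
    return n;
}

/* filter: copy only the elements that satisfy `keep`; returns how many were kept. */
size_t filter_ints(const int *in, size_t n, pred_fn keep, int *out) {
    size_t kept = 0;
    for (size_t i = 0; i < n; i++)
        if (keep(in[i]))
            out[kept++] = in[i];
    return kept;
}

static int square(int x)  { return x * x; }
static int is_even(int x) { return x % 2 == 0; }

/* Pipeline map(square) -> filter(is_even) -> count. */
size_t count_even_squares(const int *in, size_t n) {
    int *mapped = malloc(n * sizeof *mapped);
    int *kept   = malloc(n * sizeof *kept);
    size_t count = 0;
    if (mapped != NULL && kept != NULL) {
        map_ints(in, n, square, mapped);
        count = filter_ints(mapped, n, is_even, kept);
    }
    free(mapped);
    free(kept);
    return count;
}
```

The transformations only describe new data, while the action at the end (the returned count) is what forces a result to be computed.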

Building the Project

To build MiniSpark and its example applications:

make

Running Tests

cd tests
./run-tests.sh

Parallelism

MiniSpark achieves parallelism through:

Materializing partitions of an RDD in parallel
Computing independent parts of the DAG concurrently
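The first point can be sketched with one thread per partition. This is an assumption made for illustration, since this README does not say how the project schedules its threads; `partition_task`, `materialize`, and `parallel_doubled_sum` are hypothetical names. Each thread applies a map (doubling) to its own slice and records a partial sum, so no locking is needed until the results are combined after the joins.

```c
#include <pthread.h>
#include <stddef.h>

#define NUM_PARTITIONS 4  /* assumed partition count for this sketch */

/* Hypothetical unit of work: one contiguous slice of the input. */
typedef struct {
    const int *data;
    size_t len;
    long sum;             /* written only by the thread that owns this task */
} partition_task;

/* Materialize one partition: map x -> 2x and accumulate the result. */
static void *materialize(void *arg) {
    partition_task *t = arg;
    t->sum = 0;
    for (size_t i = 0; i < t->len; i++)
        t->sum += 2L * t->data[i];
    return NULL;
}

/* Split `data` into partitions, process them in parallel, combine the sums. */
long parallel_doubled_sum(const int *data, size_t n) {
    pthread_t tids[NUM_PARTITIONS];
    int started[NUM_PARTITIONS];
    partition_task tasks[NUM_PARTITIONS];
    size_t chunk = (n + NUM_PARTITIONS - 1) / NUM_PARTITIONS;

    for (int p = 0; p < NUM_PARTITIONS; p++) {
        size_t start = (size_t)p * chunk;
        if (start > n) start = n;
        size_t end = start + chunk;
        if (end > n) end = n;
        tasks[p] = (partition_task){ data + start, end - start, 0 };
        started[p] = pthread_create(&tids[p], NULL, materialize, &tasks[p]) == 0;
        if (!started[p])
            materialize(&tasks[p]);  /* fall back to running inline */
    }

    long total = 0;
    for (int p = 0; p < NUM_PARTITIONS; p++) {
        if (started[p])
            pthread_join(tids[p], NULL);
        total += tasks[p].sum;       /* safe: the owning thread has finished */
    }
    return total;
}
```

Giving each thread its own `partition_task` avoids shared mutable state during the parallel phase; a thread pool with a work queue would be a natural alternative when there are many more partitions than cores.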

About

This project was completed as part of CS 537 (Introduction to Operating Systems) to demonstrate understanding of concurrent programming and distributed data processing concepts.
