PySpark Tutorial for Beginners - VS Code Edition

Introduction

This repository provides a hands-on PySpark tutorial using VS Code as the primary development environment. It covers Spark installation, SparkContext, SparkSession, RDD transformations and actions, Spark DataFrames, Spark SQL, and more. The included notebooks and scripts allow you to follow along, experiment, and practice your PySpark skills.

Environment & Setup

Recommended Environment:

✅ Java 17
✅ Python 3.11 venv
✅ Spark 4.1.1
✅ winutils configured (Windows only)
✅ Spark master set to local[1] (a single local worker thread), which runs stably in this setup
✅ VS Code interpreter + runtime aligned
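
A quick way to confirm the first items on this checklist is a small script run from the activated environment. The sketch below is not part of the repository; the helper name `check_environment` and the choice of checks are illustrative assumptions. It only reports what it finds and does not verify the Java or Spark version.

```python
import shutil
import sys


def check_environment(expected_python=(3, 11)):
    """Return a dict describing the current Python and Java setup."""
    java_path = shutil.which("java")  # None when java is not on PATH
    return {
        "python_version": sys.version_info[:2],
        "python_matches": sys.version_info[:2] == tuple(expected_python),
        "java_on_path": java_path is not None,
        "java_path": java_path,
    }


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

If `python_matches` is False, the script is probably running under a different interpreter than the Python 3.11 venv.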

Quick Start

Clone this repository:

git clone https://github.com/bobydo/pyspark-tutorial

Create a Python 3.11 virtual environment (venv uses whichever interpreter runs it, so the python on your PATH must be Python 3.11):
```
python -m venv pysparkenv
```

Activate the environment and install dependencies (Windows syntax shown; on macOS or Linux, activate with source pysparkenv/bin/activate):

pysparkenv\Scripts\activate
pip install -r requirements.txt
pip install pyspark==4.1.1 findspark

Configure all environment variables and paths in the .env file at the project root.
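
The repository lists python-dotenv in requirements.txt, and its `load_dotenv()` function is the usual way to read a .env file. To show what that step amounts to, here is a stdlib-only sketch that parses simple KEY=VALUE lines and copies them into `os.environ`. The function name `load_env_file` is hypothetical, and the real .env in this project may use different keys and quoting.

```python
import os


def load_env_file(path=".env", override=False):
    """Parse simple KEY=VALUE lines and export them to os.environ.

    Blank lines and lines starting with '#' are skipped. This is a
    simplified stand-in for python-dotenv's load_dotenv(), not a
    full implementation of the .env format.
    """
    loaded = {}
    with open(path, encoding="utf-8") as handle:
        for raw_line in handle:
            line = raw_line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            key = key.strip()
            # Drop surrounding whitespace and simple quotes.
            value = value.strip().strip('"').strip("'")
            loaded[key] = value
            # Existing variables win unless override is requested.
            if override or key not in os.environ:
                os.environ[key] = value
    return loaded
```

With python-dotenv installed, `from dotenv import load_dotenv` followed by `load_dotenv()` covers the same step and handles more of the format.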

Native Hadoop Binaries for Windows (Parquet/ORC support)

If you want to use Parquet or other advanced Spark features on Windows, you need native Hadoop binaries (DLLs) in addition to winutils.exe.

How to use the provided hadoop.zip:

Unzip hadoop.zip so that all files (DLLs and winutils.exe) are extracted to D:\hadoop\bin.
- The folder should contain files like winutils.exe, hadoop.dll, hadoop-native.dll, etc.
Set the following environment variables (in your terminal, .env, or launch.json):
- HADOOP_HOME=D:/hadoop
- Add D:/hadoop/bin to your PATH
Restart VS Code and your terminal to ensure the new environment variables are loaded.
You can now use Spark features that require native Hadoop support (e.g., writing Parquet files) on Windows.
Online Parquet Viewers: Websites like https://parquet-viewer-online.com/ let you upload and view Parquet files.

Example PowerShell commands:

$env:HADOOP_HOME="D:/hadoop"
$env:PATH="D:/hadoop/bin;${env:PATH}"
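
Before launching Spark, it can help to confirm that the folder layout described above is in place. The following sketch checks only for winutils.exe and hadoop.dll under the bin folder of HADOOP_HOME; the helper names `missing_hadoop_files` and `bin_on_path` and the required-file list are illustrative, so adjust them to the contents of your hadoop.zip.

```python
import os
from pathlib import Path

# Assumed minimum set of native files; extend to match your hadoop.zip.
REQUIRED_FILES = ("winutils.exe", "hadoop.dll")


def missing_hadoop_files(hadoop_home=None, required=REQUIRED_FILES):
    """Return the required files that are absent from <hadoop_home>/bin."""
    home = hadoop_home or os.environ.get("HADOOP_HOME")
    if not home:
        raise ValueError("HADOOP_HOME is not set and no path was given")
    bin_dir = Path(home) / "bin"
    return [name for name in required if not (bin_dir / name).is_file()]


def bin_on_path(hadoop_home):
    """Check whether <hadoop_home>/bin appears in the PATH variable."""
    target = os.path.normcase(os.path.normpath(os.path.join(hadoop_home, "bin")))
    entries = os.environ.get("PATH", "").split(os.pathsep)
    return any(
        os.path.normcase(os.path.normpath(entry)) == target
        for entry in entries
        if entry
    )
```

An empty list from `missing_hadoop_files("D:/hadoop")` together with `bin_on_path("D:/hadoop")` returning True suggests the variables and files are set up as this section describes.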

If you encounter errors, make sure all DLLs are present in D:/hadoop/bin and that your Spark/Hadoop version matches the DLLs in the zip.

Open VS Code and select the correct Python interpreter:

Load the project folder in VS Code.
Press Ctrl + Shift + P and search for "Python: Select Interpreter".
Choose: Python 3.11.x (pysparkenv)
Confirm the physical path: \pyspark-tutorial\pysparkenv\Scripts\python.exe
Press Ctrl + Shift + P and select "Reload Window" to apply changes.
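
To double-check from inside a notebook cell or script that the selected interpreter is the one from the virtual environment, you can inspect `sys.executable` and compare `sys.prefix` with `sys.base_prefix`. This is a generic Python check rather than something the repository ships; the function name `interpreter_info` is illustrative.

```python
import sys


def interpreter_info():
    """Describe the interpreter that is running this code."""
    return {
        "executable": sys.executable,  # full path to the python binary
        "prefix": sys.prefix,          # root of the active environment
        # In a venv, sys.prefix differs from sys.base_prefix.
        "in_virtualenv": sys.prefix != sys.base_prefix,
    }


if __name__ == "__main__":
    info = interpreter_info()
    print("Interpreter:", info["executable"])
    print("Inside a virtual environment:", info["in_virtualenv"])
```

If the printed path does not point into the pysparkenv folder, the notebook kernel and the VS Code interpreter are not aligned.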

Python Environment & Libraries

The following key Python packages are required and already listed in requirements.txt:

pyspark==4.1.1
findspark==2.0.1
python-dotenv==1.2.1
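
To compare what is actually installed against these pins, the standard library's `importlib.metadata` can report package versions. The sketch below is an illustration only; the names `EXPECTED`, `installed_versions`, and `version_mismatches` are not from the repository.

```python
from importlib import metadata

# Versions as pinned in this README's package list.
EXPECTED = {
    "pyspark": "4.1.1",
    "findspark": "2.0.1",
    "python-dotenv": "1.2.1",
}


def installed_versions(packages=EXPECTED):
    """Map each package name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def version_mismatches(packages=EXPECTED):
    """Return {name: (expected, installed)} for packages that differ."""
    installed = installed_versions(packages)
    return {
        name: (expected, installed[name])
        for name, expected in packages.items()
        if installed[name] != expected
    }
```

An empty dict from `version_mismatches()` means the three pinned packages are present at the listed versions.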

Other libraries commonly present in the environment (for notebook and VS Code compatibility):

ipykernel
jupyter
traitlets
tornado
pygments
stack_data
nest_asyncio
setuptools
six
typing_extensions
decorator
pyzmq (imported as zmq)

These are typically installed automatically when using Jupyter or the VS Code Python extensions, and do not need to be listed in requirements.txt for basic PySpark functionality.

Notebook Descriptions

01-PySpark-Get-Started: PySpark environment setup and configuration.
02-Create-SparkContext: Creating SparkContext objects in different PySpark versions.
03-Create-SparkSession: Creating SparkSession objects in PySpark.
04-RDD-Operations: RDD transformations and actions.
05-DataFrame-Intro: Introduction to Spark DataFrames and how they differ from RDDs.
06-DataFrame-from-various-data-source: Creating Spark DataFrames from various data sources.
07-DataFrame-Operations: DataFrame operations like filtering, aggregation, etc.
08-Spark-SQL: Using Spark SQL for DataFrame queries.
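
To give a sense of what the later notebooks cover, the sketch below strings together a SparkSession, a DataFrame filter, and a Spark SQL query on a tiny in-memory dataset. It is not taken from the notebooks: the sample rows, the view name `people`, the app name, and the function name `run_demo` are made up for illustration. It assumes pyspark and Java are installed as described under Environment & Setup, and it uses the `local[1]` master from that checklist.

```python
# Hypothetical sample data; the notebooks use their own datasets.
SAMPLE_ROWS = [("Alice", 34), ("Bob", 28), ("Cara", 41)]
COLUMNS = ["name", "age"]
QUERY = "SELECT name FROM people WHERE age > 30 ORDER BY name"


def run_demo(rows=SAMPLE_ROWS):
    """Run the same filter through the DataFrame API and Spark SQL."""
    # Imported here so this module loads even without pyspark installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local[1]")  # single local worker thread
        .appName("readme-demo")
        .getOrCreate()
    )
    try:
        df = spark.createDataFrame(rows, schema=COLUMNS)

        # DataFrame API: keep rows where age is above 30.
        api_names = sorted(
            row["name"] for row in df.filter(df["age"] > 30).collect()
        )

        # Spark SQL: register a temporary view and query it.
        df.createOrReplaceTempView("people")
        sql_names = [row["name"] for row in spark.sql(QUERY).collect()]
        return api_names, sql_names
    finally:
        spark.stop()
```

Calling `run_demo()` from a notebook cell returns two lists of names, one per approach, which should match each other.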

Feel free to explore and run these notebooks/scripts at your own pace in VS Code.

Prerequisites

Basic knowledge of Python programming.
Understanding of data processing concepts (no prior PySpark experience required).

Usage

These notebooks/scripts are meant for self-learning and practice. Follow along with the tutorial video to gain a deeper understanding of PySpark concepts. Experiment with the code, modify it, and try additional exercises to solidify your skills.

License

This project is licensed under the MIT License.
