The Reddit Topic Tracker app collects textual data from Reddit. It uses a modular, microservice-based architecture for scalability and flexibility. Together, its repositories form a Reddit data ETL (Extract, Transform, Load) tool that extracts text data from Reddit, transforms it, and loads it into MongoDB. Apache Airflow orchestrates the tool, and Redis serves as intermediate storage to speed up data processing.
- Data Ingestion: A Python microservice collects posts and comments from Reddit through the Reddit API. The service fetches data for specified subreddits, keywords, or other criteria and stores it in a temporary data store.
- ETL Pipeline: Once the data is ingested, it undergoes the ETL process:
  - Extract: Another microservice extracts the data from the temporary store and prepares it for transformation.
  - Transform: The extracted data is processed to derive insights such as sentiment, keywords, and user engagement metrics. Python libraries like NLTK and spaCy can be used for natural language processing tasks.
  - Load: The transformed data is stored in a MongoDB database for easy retrieval and analysis.
- Redis for Optimization: Redis is used within the ETL pipeline to cache frequently accessed data and optimize processing speed. This helps reduce the load on the Reddit API and improves overall system performance.
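The stages above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the project's actual code: the function names, the record fields (`id`, `body`, `score`), and the stop-word list are hypothetical, the `staging` dict and `collection` list are in-memory stand-ins for Redis and a MongoDB collection, and the keyword step is a naive word count rather than NLTK or spaCy.

```python
import json
import re
from collections import Counter

# Hypothetical stop-word list; a real pipeline would likely use NLTK or spaCy.
STOP_WORDS = {"the", "a", "an", "is", "it", "and", "to", "of", "in", "this"}


def extract(raw_comments, staging):
    """Serialize raw comments into the staging store (stand-in for Redis)."""
    for comment in raw_comments:
        staging[f"comment:{comment['id']}"] = json.dumps(comment)
    return sorted(staging)  # keys handed to the transform step


def transform(keys, staging, top_n=3):
    """Read staged comments and attach simple keyword and engagement fields."""
    documents = []
    for key in keys:
        comment = json.loads(staging[key])
        words = re.findall(r"[a-z']+", comment["body"].lower())
        counts = Counter(w for w in words if w not in STOP_WORDS)
        documents.append({
            "_id": comment["id"],
            "body": comment["body"],
            "keywords": [word for word, _ in counts.most_common(top_n)],
            "score": comment["score"],
        })
    return documents


def load(documents, collection):
    """Append transformed documents to the collection (stand-in for MongoDB)."""
    collection.extend(documents)
    return len(documents)


def run_pipeline(raw_comments):
    """Chain extract, transform, and load over fresh in-memory stores."""
    staging, collection = {}, []
    keys = extract(raw_comments, staging)
    load(transform(keys, staging), collection)
    return collection
```

Keeping each stage as a separate function that communicates only through the staging store mirrors the split between services: with real backends, `staging` would be replaced by a Redis client and `collection` by a MongoDB collection, while the stage boundaries stay the same.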
The tool uses the following technologies:
- Apache Airflow: Orchestrates the ETL pipeline and schedules its tasks.
- Python Microservices: Handle data ingestion, transformation, and loading.
- MongoDB: Stores the transformed Reddit text data.
- Redis: Caches data to optimize ETL processes.
- Python Flask: Provides a RESTful API for interaction with the application.
- Docker: Runs the applications in containers.
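To make the caching role concrete, here is a minimal cache-aside sketch. It is an illustration only: `TTLCache` is an in-memory stand-in for Redis, `cached_fetch` and its `fetch` argument are hypothetical names (with `fetch` standing in for a Reddit API call), and the key format `posts:<subreddit>` is an assumption.

```python
import time


class TTLCache:
    """In-memory stand-in for Redis: values expire after ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store = {}  # key -> (expiry time, value)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self._store[key]  # expired: behave as a cache miss
            return None
        return value

    def set(self, key, value):
        self._store[key] = (self.clock() + self.ttl, value)


def cached_fetch(subreddit, cache, fetch):
    """Return cached posts when present; otherwise call fetch and cache the result."""
    key = f"posts:{subreddit}"
    posts = cache.get(key)
    if posts is None:
        posts = fetch(subreddit)
        cache.set(key, posts)
    return posts
```

Repeated requests for the same subreddit within the expiry window are served from the cache, so the upstream API is called only once per window.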
The Reddit Data ETL Tool consists of the following repositories, each with its specific functionality:
- db-API Repository:
  - Starts the MongoDB and API containers.
  - MongoDB persists the transformed text data.
  - The API container provides the interface for registering Reddit comments with the external API.
- Collector-API Repository:
  - Hosts the Collector API, which offers functions for extracting and transforming Reddit data.
  - Uses Redis to speed up data transfer between the extract and transform steps.
  - Runs both the API and Redis containers.
- Collector-Airflow Repository:
  - Contains the Apache Airflow DAGs that schedule the ETL tasks.
  - Orchestrates the data extraction, transformation, and loading processes.
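The ordering that orchestration enforces can be illustrated without an Airflow installation. The sketch below is not an Airflow DAG and does not reflect the contents of the collector-airflow repository; it uses Python's standard-library `graphlib` (Python 3.9+) to show how declared dependencies determine run order. The task names and the `run_in_order` helper are assumptions for illustration.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks that must finish before it starts.
# Task names are hypothetical stand-ins for the real DAG's tasks.
DEPENDENCIES = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}


def run_in_order(dependencies, tasks):
    """Run each task callable only after all of its upstream tasks have run."""
    executed = []
    for name in TopologicalSorter(dependencies).static_order():
        tasks[name]()
        executed.append(name)
    return executed
```

In Airflow itself, the same ordering is typically declared between task objects with the `>>` operator, for example `extract >> transform >> load`, and the scheduler takes care of timing and retries.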
To run the Reddit Data ETL Tool, follow these steps:
- db-API:
  - Clone the repository:
        $ git clone https://github.com/RTT-app/db-api.git
  - Navigate to the repository:
        $ cd db-api
  - Use the Makefile to start the MongoDB and API containers:
        $ make run
- Collector-API:
  - Clone the repository:
        $ git clone https://github.com/RTT-app/collector-api.git
  - Navigate to the repository:
        $ cd collector-api
  - Use the Makefile to start the API and Redis containers:
        $ make run
- Collector-Airflow:
  - Clone the repository:
        $ git clone https://github.com/RTT-app/collector-airflow.git
  - Navigate to the repository:
        $ cd collector-airflow
  - Use the Makefile to run Apache Airflow and schedule the ETL tasks:
        $ make docker-up
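If you prefer to script the steps above, the following sketch shows one way to do it. The repository URLs and make targets come from the steps above; the helper names `build_commands` and `start_all`, and the choice to run the repositories in the order listed, are assumptions. By default it only prints the plan and runs nothing.

```python
import subprocess

# (clone URL, directory created by the clone, make target), as listed above.
REPOSITORIES = [
    ("https://github.com/RTT-app/db-api.git", "db-api", "run"),
    ("https://github.com/RTT-app/collector-api.git", "collector-api", "run"),
    ("https://github.com/RTT-app/collector-airflow.git", "collector-airflow", "docker-up"),
]


def build_commands(repositories=REPOSITORIES):
    """Return (command, working directory) pairs in the order they should run."""
    commands = []
    for url, directory, target in repositories:
        commands.append((["git", "clone", url], None))
        commands.append((["make", target], directory))
    return commands


def start_all(dry_run=True):
    """Print the plan; execute it only when dry_run is False."""
    for command, cwd in build_commands():
        print(f"[{cwd or '.'}] {' '.join(command)}")
        if not dry_run:
            subprocess.run(command, cwd=cwd, check=True)
```

Call `start_all()` to review the commands, then `start_all(dry_run=False)` to run them; `check=True` stops at the first command that fails.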
You can customize each repository to suit your specific Reddit data processing needs. Modify the ETL logic, API endpoints, and Airflow DAGs as required.
Feel free to contribute to any of the repositories by opening issues, proposing improvements, or submitting pull requests.