A real-time web application that scrapes tweets via Selenium, streams them into Kafka, processes them with Apache Spark, and performs sentiment prediction using a pre-trained logistic regression model.
Note: The sentiment_BERT/ directory contains large model files (≈420 MB) and is not included in this repository. Please download it manually before running the app: https://drive.google.com/drive/folders/1RqGCpUjVUT0-F05LE1pulAeueogXflrS
kafka_2.12-3.5.0/ # Kafka distribution
├── bin/ # Kafka CLI scripts
├── config/
├── libs/
├── ...
static/
└── style.css # CSS for the web UI
templates/
└── index.html # HTML template
app.py # Flask + Spark + Kafka integration
scraper.py # Selenium scraper & Kafka producer
sentiment_BERT/
├── config.json
├── model.safetensors
├── special_tokens_map.json
├── tokenizer_config.json
├── training_args.bin
└── vocab.txt
logreg_sentiment140_model.pkl # Pre-trained sentiment model
README.md # This file
- Java 8+ (for Spark & Kafka)
- Kafka & Zookeeper
- Python 3.8+ and pip
- Google Chrome (for Selenium)
pip install flask pyspark kafka-python selenium webdriver-manager joblib
- Start Zookeeper & Kafka
  # In one terminal
  bin/zookeeper-server-start.sh config/zookeeper.properties
  # In another
  bin/kafka-server-start.sh config/server.properties
- Delete & Recreate Topic (optional, to purge old data)
  # Delete existing topic
  bin/kafka-topics.sh \
    --bootstrap-server localhost:9092 \
    --delete --topic tweets
  # Recreate with 4 partitions
  bin/kafka-topics.sh \
    --bootstrap-server localhost:9092 \
    --create \
    --topic tweets \
    --partitions 4 \
    --replication-factor 1
- (Optional) View Topic Contents
  bin/kafka-console-consumer.sh \
    --bootstrap-server localhost:9092 \
    --topic tweets \
    --from-beginning
- Run the App
  python app.py
- The Flask server will launch on http://127.0.0.1:5000.
- Use the Fetch Tweets button to start the scraper (opens headless Chrome, scrapes and pushes tweets to Kafka).
- Stop Fetching Tweets stops the scraper process.
- Start Prediction stops scraping, consumes all tweets from the topic, runs sentiment prediction, and displays results.
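The optional topic reset step above can also be scripted. The following is a minimal sketch, not part of the repository: the helper names `topic_reset_commands` and `reset_topic` are hypothetical, and only the `kafka-topics.sh` flags already shown in the commands above are used. It assumes a POSIX shell environment and that `kafka_dir` points at the Kafka distribution directory.

```python
import subprocess
from typing import List

# Path of the CLI script relative to the Kafka distribution directory.
KAFKA_TOPICS = "bin/kafka-topics.sh"


def topic_reset_commands(topic: str = "tweets",
                         bootstrap: str = "localhost:9092",
                         partitions: int = 4,
                         replication_factor: int = 1) -> List[List[str]]:
    """Build the delete and create commands as argv lists (hypothetical helper)."""
    base = [KAFKA_TOPICS, "--bootstrap-server", bootstrap]
    delete = base + ["--delete", "--topic", topic]
    create = base + ["--create", "--topic", topic,
                     "--partitions", str(partitions),
                     "--replication-factor", str(replication_factor)]
    return [delete, create]


def reset_topic(kafka_dir: str, **kwargs) -> None:
    """Run both commands from kafka_dir; raises on the first failing command."""
    for argv in topic_reset_commands(**kwargs):
        subprocess.run(argv, cwd=kafka_dir, check=True)
```

Calling `reset_topic("kafka_2.12-3.5.0")` would run the delete command and then the create command. Because `check=True` is set, the delete step raises `CalledProcessError` if the command fails, for example when the topic does not exist yet.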
- scraper.py: Uses Selenium to scroll through Twitter search results (x.com) and produces tweet text messages into the Kafka topic tweets.
- app.py:
  - Spins up a SparkSession with the Kafka SQL connector.
  - Broadcasts a pre-trained logistic regression model for inference.
  - Provides Flask endpoints:
    - /fetch_tweets → spawns scraper.py as a subprocess
    - /stop_fetch → terminates the scraper and fetches any remaining tweets
    - /start_prediction → reads the entire topic from the earliest offset, applies the model via a Spark UDF, and stores predictions
    - /get_tweets → returns a JSON list of { tweet, prediction }
- index.html + style.css: A simple UI with controls and a live table that polls /get_tweets every 2 seconds, animates new rows, and color-codes predictions.
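The spawn and terminate behaviour behind /fetch_tweets and /stop_fetch could look roughly like the sketch below. This is an illustration using only the standard library, not the code in app.py: the class name `ScraperProcess` and its methods are hypothetical, and the default command simply assumes the scraper is started as `python scraper.py`.

```python
import subprocess
import sys
from typing import List, Optional


class ScraperProcess:
    """Start and stop a single child process (hypothetical helper)."""

    def __init__(self, argv: Optional[List[str]] = None) -> None:
        # Assumption: the scraper is launched with the current interpreter.
        self.argv = argv or [sys.executable, "scraper.py"]
        self._proc: Optional[subprocess.Popen] = None

    def is_running(self) -> bool:
        # poll() returns None while the child is still alive.
        return self._proc is not None and self._proc.poll() is None

    def start(self) -> bool:
        """Spawn the child unless one is already running."""
        if self.is_running():
            return False
        self._proc = subprocess.Popen(self.argv)
        return True

    def stop(self, timeout: float = 5.0) -> bool:
        """Ask the child to terminate; kill it if it does not exit in time."""
        if not self.is_running():
            return False
        self._proc.terminate()
        try:
            self._proc.wait(timeout=timeout)
        except subprocess.TimeoutExpired:
            self._proc.kill()
            self._proc.wait()
        return True
```

A Flask handler could call `start()` for /fetch_tweets and `stop()` for /stop_fetch. Returning `False` on a repeated `start()` keeps a double click from launching two scrapers. Note that terminating the Python process may leave a headless Chrome instance behind unless the scraper itself quits its driver on shutdown.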
- Fetch Tweets → begins scraping & streaming to Kafka, table updates live.
- Stop Fetching Tweets → stops scraper but table continues polling Kafka for completeness.
- Start Prediction → stops polling (optionally), runs batch prediction over all messages, updates table cells with sentiment.
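The flow above (rows appear first without a sentiment, then get filled in after Start Prediction) can be sketched with a small in-memory store. This is a simplified illustration, not the app's implementation: `TweetStore` and `toy_predict` are hypothetical names, the keyword rule stands in for the pre-trained model, and the "positive"/"negative" labels are assumed for the example.

```python
import json
import threading
from typing import Callable, Dict, List, Optional


class TweetStore:
    """Thread-safe list of {tweet, prediction} rows (hypothetical helper)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._rows: List[Dict[str, Optional[str]]] = []

    def add(self, tweet: str) -> None:
        """Append a tweet that has not been scored yet."""
        with self._lock:
            self._rows.append({"tweet": tweet, "prediction": None})

    def predict_all(self, predict: Callable[[List[str]], List[str]]) -> None:
        """Score every stored tweet in one batch and fill in the predictions."""
        with self._lock:
            texts = [row["tweet"] for row in self._rows]
            labels = predict(texts)
            for row, label in zip(self._rows, labels):
                row["prediction"] = label

    def to_json(self) -> str:
        """Serialize the rows in the shape described for /get_tweets."""
        with self._lock:
            return json.dumps(self._rows)


def toy_predict(texts: List[str]) -> List[str]:
    # Stand-in for the real model: NOT the logistic regression from the repo.
    return ["positive" if "good" in t.lower() else "negative" for t in texts]
```

With this shape, the polling UI can render a row as soon as its tweet arrives and only needs to update the prediction cell once `predict_all` has run.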