A production-ready Python pipeline for automatically collecting and structuring publicly available information about National School of Drama (NSD) graduates (2014–2019).
The system takes a list of NSD alumni names and graduation years, searches the internet, extracts relevant information using a local LLM via Ollama, and generates a structured database.
- CSV input
- Excel (.xlsx) input
Expected columns:
| Name | Graduation Year |
|---|---|
| Amanpreet Kaur | 2014 |
| Chirag Garg | 2014 |
- Full name
- Alternative names
- Graduation year
- Current profession
- Current organization
- Current city/location
- Personal website
- Portfolio website
- YouTube
- IMDb
- Twitter/X
- Other public profiles
- Plays acted in
- Plays directed
- Plays written
- Theatre groups
- Theatre companies
- Repertory associations
- Film appearances
- TV appearances
- Workshops conducted
- Awards
Only publicly available information is collected:
- Public phone numbers
- Public email addresses
- Public contact pages
- Management contacts
This project intentionally avoids:
- Guessing phone numbers
- Guessing email addresses
- Inferring personal information
- Accessing private content
- Bypassing logins
- Circumventing website restrictions
- Collecting non-public data
The pipeline searches across multiple publicly available sources:
- Google/DuckDuckGo
- Facebook pages
- IMDb
- Theatre websites
- Festival websites
- News articles
- YouTube interviews
- Portfolio websites
- Personal websites
- NSD pages
- Repertory companies
- Medium articles
- Public PDFs
- Wikipedia
project_root/
│
├── input/
│ └── nsd_graduates.xlsx
│
├── output/
│ ├── alumni_data.csv
│ ├── alumni_data.xlsx
│ └── alumni_data.json
│
├── cache/
│
├── logs/
│
├── models/
│
├── main.py
├── search.py
├── extractor.py
├── llm_parser.py
├── utils.py
├── requirements.txt
│
└── README.md
Input CSV/Excel
↓
Generate contextual search queries
↓
Collect search results
↓
Extract webpage text
↓
Send data to local LLM
↓
Parse structured JSON
↓
Confidence filtering
↓
Deduplication
↓
Output generation
This project uses a local LLM through Ollama.
Default:
gemma3Fallback:
llama3:8bInstall Ollama:
Pull model:
ollama pull gemma3or
ollama pull llama3:8bTest:
ollama run gemma3Clone repository:
git clone https://github.com/yourusername/nsd-alumni-intelligence.git
cd nsd-alumni-intelligenceCreate virtual environment:
Windows:
python -m venv venv
venv\Scripts\activateLinux/Mac:
python3 -m venv venv
source venv/bin/activateInstall dependencies:
pip install -r requirements.txtpandas
openpyxl
requests
beautifulsoup4
duckduckgo-search
ollama
tqdm
retry
loguru
rapidfuzz
lxml
Place your Excel or CSV file inside:
input/
Example:
Name,Graduation Year
Amanpreet Kaur,2014
Chirag Garg,2014
Jayanta Narzary,2014Run:
python main.pyThe pipeline will:
- Read names
- Generate contextual search queries
- Search the web
- Download webpage content
- Extract structured information
- Run local LLM analysis
- Generate final outputs
Generated files:
output/
├── alumni_data.csv
├── alumni_data.xlsx
└── alumni_data.json
Example output:
{
"name":"Amanpreet Kaur",
"graduation_year":"2014",
"current_profession":"Actor",
"current_location":"Mumbai",
"website":null,
"portfolio":null,
"linkedin":"...",
"instagram":"...",
"plays_acted_in":[
"Play A",
"Play B"
],
"plays_directed":[
"Play C"
],
"public_emails":[
"contact@example.com"
],
"sources":[
"https://..."
]
}- Retry handling
- Request timeouts
- Network failure handling
- Logging
- Duplicate detection
- Ambiguous name handling
- Confidence scoring
- Low-confidence filtering
- Structured JSON validation
- Caching
- Rate limiting
- Reduced duplicate searches
Logs are stored in:
logs/
Example:
2026-05-26 18:20:41 INFO Searching: Amanpreet Kaur NSD
2026-05-26 18:20:44 INFO Parsing results
2026-05-26 18:20:48 INFO Confidence Score: 0.91
Open folder:
code .Select interpreter:
Ctrl + Shift + P
Python: Select Interpreter
Choose:
venv
Run:
python main.py- Parallel processing
- GPU acceleration
- Vector database integration
- RAG-based retrieval
- Semantic duplicate detection
- Theatre-specific source prioritization
- Dashboard interface
This tool only collects publicly available information.
Users are responsible for complying with website terms of service, privacy laws, and ethical data usage practices.
MIT License