Create Document Report is a Python-based automated research report generator that collects information from multiple online sources, processes the data, downloads relevant images, builds a structured knowledge base, and generates a professional PDF report.
The project is designed to automatically create detailed reports for people, companies, technologies, products, events, or any topic available online.
Unlike traditional summarizers, this project performs:
- Wikipedia research
- News aggregation
- Web crawling
- Article extraction
- Image collection
- Knowledge organization
- PDF generation
The final output is a well-structured PDF report containing:
- Executive Summary
- Background Information
- Timeline
- Key Facts
- Recent Developments
- Analysis
- References
- Images
The system collects information from multiple sources:
From Wikipedia, it retrieves:
- Topic overview
- Detailed sections
- Historical information
- Background knowledge
From news feeds, it collects:
- Latest news articles
- Recent developments
- Trending updates
The crawler visits multiple pages and extracts:
- Related articles
- Additional information
- Supporting content
The image crawler downloads:
- Topic-related images
- Article images
- Reference images
These images are later inserted into the generated PDF.
User Topic
│
▼
Wikipedia Source
│
▼
News Source
│
▼
Crawler
│
▼
Article Extractor
│
▼
Image Crawler
│
▼
Knowledge Builder
│
▼
PDF Report Generator
│
▼
Generated PDF
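The flow above can be read as a chain of stages, each adding its output to a shared result. The following is a minimal sketch under that assumption. The `run_pipeline` helper and the three placeholder stages are hypothetical and are not the project's actual functions.

```python
from typing import Callable, Dict, List

# Hypothetical stage signature: read the shared context, return new keys to merge in.
Stage = Callable[[Dict], Dict]


def run_pipeline(topic: str, stages: List[Stage]) -> Dict:
    """Run each stage in order, merging its output into one shared context."""
    context: Dict = {"topic": topic}
    for stage in stages:
        context.update(stage(context))
    return context


# Placeholder stages standing in for the real source modules.
def wikipedia_stage(ctx: Dict) -> Dict:
    return {"wiki": f"summary of {ctx['topic']}"}


def news_stage(ctx: Dict) -> Dict:
    return {"news": [{"title": f"{ctx['topic']} in the news"}]}


def knowledge_stage(ctx: Dict) -> Dict:
    # Later stages can read what earlier stages produced.
    return {"knowledge": {"wiki": ctx["wiki"], "news": ctx["news"]}}


result = run_pipeline("Tesla", [wikipedia_stage, news_stage, knowledge_stage])
print(sorted(result))
```

Keeping every stage behind the same signature makes it easy to add or reorder sources without touching the runner.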
Create-Document-Report
│
├── main.py
│
├── pdf
│ └── report_builder.py
│
├── sources
│ ├── wikipedia_source.py
│ ├── news_source.py
│ ├── crawler.py
│ ├── article_extractor.py
│ ├── image_crawler.py
│ └── knowledge_builder.py
│
├── images
│
├── reports
│
├── requirements.txt
│
└── README.md
main.py is the main entry point of the application.
Responsibilities:
- Reads user topic
- Starts data collection
- Coordinates all modules
- Builds knowledge base
- Generates final PDF
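As an illustration of the first responsibility, reading the topic from the command line could look like the sketch below. It assumes `argparse` is used, which the project may not do, and the `parse_args` and `main` names are hypothetical.

```python
import argparse
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse the single positional topic argument."""
    parser = argparse.ArgumentParser(
        description="Generate a PDF research report for a topic."
    )
    parser.add_argument("topic", help='Topic to research, e.g. "Andrej Karpathy"')
    return parser.parse_args(argv)


def main(argv: Optional[List[str]] = None) -> str:
    args = parse_args(argv)
    topic = args.topic.strip()
    # A real entry point would now call the source modules, the knowledge
    # builder and the PDF generator; this sketch only returns the topic.
    return topic


print(main(["Andrej Karpathy"]))
```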
wikipedia_source.py is responsible for collecting Wikipedia information.
Extracts:
- Topic summary
- Sections
- Subsections
- Page URL
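Sections and subsections form a tree, so collecting them usually comes down to a recursive walk. The sketch below flattens such a tree into (level, title, text) rows. It assumes each section exposes `title`, `text`, and `sections` attributes, which is how the wikipedia-api package from the requirements models sections. The `Section` dataclass and the `flatten_sections` helper are hypothetical stand-ins so the example runs offline.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Section:
    """Stand-in for a Wikipedia section node (hypothetical, for illustration)."""
    title: str
    text: str = ""
    sections: List["Section"] = field(default_factory=list)


def flatten_sections(sections: List[Section], level: int = 1) -> List[Tuple[int, str, str]]:
    """Depth-first walk that records each section with its nesting level."""
    rows: List[Tuple[int, str, str]] = []
    for section in sections:
        rows.append((level, section.title, section.text))
        rows.extend(flatten_sections(section.sections, level + 1))
    return rows


tree = [
    Section("Early life", "Section text", [Section("Education", "Subsection text")]),
    Section("Career", "Section text"),
]
rows = flatten_sections(tree)
for level, title, _ in rows:
    print("  " * (level - 1) + title)
```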
Example:
wiki = get_wikipedia_content("Andrej Karpathy")
news_source.py is responsible for collecting news articles.
Features:
- RSS feed retrieval
- News article extraction
- Recent developments
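To make the RSS step concrete, here is a minimal sketch that turns an RSS 2.0 feed into title/summary/url entries. The requirements list feedparser; this sketch uses the standard library XML parser instead so that it runs without extra packages, and it does not handle Atom feeds. The `parse_rss` name and the sample feed are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List


def parse_rss(xml_text: str, limit: int = 10) -> List[Dict[str, str]]:
    """Extract title, summary and url from each <item> of an RSS 2.0 feed."""
    root = ET.fromstring(xml_text)
    items: List[Dict[str, str]] = []
    for item in root.iter("item"):
        items.append({
            "title": (item.findtext("title") or "").strip(),
            "summary": (item.findtext("description") or "").strip(),
            "url": (item.findtext("link") or "").strip(),
        })
        if len(items) >= limit:
            break
    return items


SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0"><channel>
  <title>Example feed</title>
  <item>
    <title>First headline</title>
    <description>Short summary.</description>
    <link>https://example.com/first</link>
  </item>
  <item>
    <title>Second headline</title>
    <link>https://example.com/second</link>
  </item>
</channel></rss>"""

articles = parse_rss(SAMPLE_FEED)
print(articles)
```

Missing fields fall back to an empty string, so every entry has the same keys.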
Returns:
[
  {
    "title": "...",
    "summary": "...",
    "url": "..."
  }
]
crawler.py performs deep crawling.
Features:
- URL discovery
- Related page extraction
- Link traversal
- Content collection
The crawler expands the amount of information available beyond Wikipedia and news sources.
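URL discovery and link traversal can be sketched as a breadth-first crawl with a page and depth limit. The version below is an assumption about how such a crawler might work, not the project's code. It stays on the starting host, and it takes the `fetch` function as a parameter so the example can run against an in-memory site instead of the network. `LinkParser`, `crawl`, and `SITE` are hypothetical names.

```python
from collections import deque
from html.parser import HTMLParser
from typing import Callable, Dict, List, Optional
from urllib.parse import urldefrag, urljoin, urlparse


class LinkParser(HTMLParser):
    """Collects href values from <a> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(start_url: str, fetch: Callable[[str], Optional[str]],
          max_pages: int = 10, max_depth: int = 2) -> Dict[str, str]:
    """Breadth-first crawl restricted to the start URL's host."""
    host = urlparse(start_url).netloc
    pages: Dict[str, str] = {}
    seen = {start_url}
    queue = deque([(start_url, 0)])
    while queue and len(pages) < max_pages:
        url, depth = queue.popleft()
        html = fetch(url)
        if html is None:  # failed request: skip this page, keep crawling
            continue
        pages[url] = html
        if depth >= max_depth:
            continue
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop #fragments before comparing.
            link = urldefrag(urljoin(url, href)).url
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return pages


# Offline demo: a dict stands in for the network.
SITE = {
    "https://example.com/": '<a href="/a">A</a> <a href="https://other.org/x">X</a>',
    "https://example.com/a": '<a href="/">home</a> <a href="/b#top">B</a>',
    "https://example.com/b": "<p>leaf</p>",
}
pages = crawl("https://example.com/", SITE.get)
print(list(pages))
```

In real use, `fetch` would wrap an HTTP client and return None on errors, so a failed request skips one page without stopping the crawl.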
article_extractor.py uses Newspaper3k to extract:
- Full article text
- Authors
- Publish dates
- Keywords
- Top image
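A minimal sketch of this step with Newspaper3k's `Article` class follows. The standalone `extract` function and the `article_to_dict` helper are hypothetical and may differ from the project's extractor. The `newspaper` import is done lazily so the helper can be tried without the package, and `nlp()` additionally needs NLTK tokenizer data.

```python
from types import SimpleNamespace
from typing import Any, Dict, Optional


def article_to_dict(article: Any) -> Dict[str, Any]:
    """Copy the fields of a parsed article object into a plain dict."""
    return {
        "title": article.title,
        "text": article.text,
        "summary": getattr(article, "summary", ""),
        "top_image": article.top_image,
    }


def extract(url: str) -> Optional[Dict[str, Any]]:
    """Download and parse one article; return None on any failure."""
    try:
        from newspaper import Article  # lazy import: requires newspaper3k

        article = Article(url)
        article.download()
        article.parse()
        article.nlp()  # fills summary and keywords
        return article_to_dict(article)
    except Exception:
        # Assumption: a failed article is skipped rather than raised.
        return None


# Offline check with a stand-in object instead of a real download.
fake = SimpleNamespace(title="T", text="Body", summary="S",
                       top_image="https://example.com/i.png")
print(article_to_dict(fake))
```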
Example:
extractor.extract(url)
Returns:
{
  "title": "...",
  "text": "...",
  "summary": "...",
  "top_image": "..."
}
image_crawler.py is responsible for:
- Discovering images
- Downloading images
- Filtering invalid formats
- Removing unsupported SVG files
Supported formats:
- JPG
- JPEG
- PNG
- WEBP
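Format filtering can be done on the URL before anything is downloaded. The sketch below keeps only URLs whose path ends in one of the supported extensions, which also drops SVG files. `SUPPORTED_EXTENSIONS` and `filter_image_urls` are hypothetical names, and checking the extension alone is an assumption; it does not verify the file contents.

```python
import os
from typing import Iterable, List
from urllib.parse import urlparse

SUPPORTED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".webp"}


def filter_image_urls(urls: Iterable[str]) -> List[str]:
    """Keep URLs whose path ends in a supported image extension."""
    kept: List[str] = []
    for url in urls:
        # Use the URL path only, so query strings do not hide the extension.
        ext = os.path.splitext(urlparse(url).path)[1].lower()
        if ext in SUPPORTED_EXTENSIONS:
            kept.append(url)
    return kept


candidates = [
    "https://example.com/photo.JPG?width=800",
    "https://example.com/logo.svg",
    "https://example.com/chart.png",
    "https://example.com/page.html",
]
print(filter_image_urls(candidates))
```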
Downloaded images are stored inside:
images/
knowledge_builder.py creates a structured knowledge base.
Combines:
- Wikipedia content
- News articles
- Crawled articles
- Images
Also generates:
- Keywords
- Timeline
- Facts
- Statistics
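One simple way to derive keywords and a timeline from the combined text is word-frequency counting and a search for sentences that mention a year. The sketch below shows that approach. It is an assumption, not the project's method, it does not cover facts or statistics, and `build_knowledge`, its parameters, and the helper names are hypothetical.

```python
import re
from collections import Counter
from typing import Dict, List

STOPWORDS = {"the", "and", "for", "was", "with", "that", "this", "from"}


def extract_keywords(text: str, top_n: int = 5) -> List[str]:
    """Most frequent words of three or more letters, ignoring stopwords."""
    words = re.findall(r"[a-z]{3,}", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return [word for word, _ in counts.most_common(top_n)]


def extract_timeline(text: str) -> List[Dict[str, str]]:
    """Sentences that mention a four-digit year, sorted by that year."""
    events = []
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        match = re.search(r"\b(1[89]\d{2}|20\d{2})\b", sentence)
        if match:
            events.append({"year": match.group(1), "event": sentence.strip()})
    return sorted(events, key=lambda e: e["year"])


def build_knowledge(topic: str, wiki_text: str, news: List[Dict],
                    articles: List[Dict], images: List[str]) -> Dict:
    """Combine all sources into one dictionary for the report generator."""
    combined = " ".join([wiki_text] + [a.get("text", "") for a in articles])
    return {
        "topic": topic,
        "summary": wiki_text,
        "news": news,
        "articles": articles,
        "images": images,
        "keywords": extract_keywords(combined),
        "timeline": extract_timeline(combined),
    }


sample = "The lab released a model in 2020. The company opened a lab in 2015. The model was widely used."
knowledge = build_knowledge("Example topic", sample, [], [], [])
print(knowledge["keywords"], [e["year"] for e in knowledge["timeline"]])
```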
Example:
knowledge = builder.build(...)
report_builder.py generates the final PDF report.
Sections:
- Cover Page
- Contents
- Executive Summary
- Image Gallery
- Background Information
- Timeline
- Key Facts
- Recent Developments
- Analysis
- References
Uses:
- ReportLab
- Pillow
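The sketch below shows one way the text sections could be laid out with ReportLab's platypus module. It is an illustration under assumptions: `plan_story`, `render_pdf`, and `SECTION_ORDER` are hypothetical names, and the cover page, contents, and image gallery are left out. The ReportLab import is lazy, so the planning step can be tried without the package installed.

```python
from typing import Dict, List, Tuple
from xml.sax.saxutils import escape

SECTION_ORDER = [
    "Executive Summary",
    "Background Information",
    "Timeline",
    "Key Facts",
    "Recent Developments",
    "Analysis",
    "References",
]


def plan_story(topic: str, sections: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return (style_name, text) pairs in report order, skipping empty sections."""
    # Paragraph text is parsed as markup, so &, < and > must be escaped.
    story = [("Title", escape(f"{topic} Report"))]
    for name in SECTION_ORDER:
        body = sections.get(name, "").strip()
        if body:
            story.append(("Heading1", escape(name)))
            story.append(("BodyText", escape(body)))
    return story


def render_pdf(path: str, plan: List[Tuple[str, str]]) -> None:
    """Render the plan to a PDF file with ReportLab."""
    from reportlab.lib.styles import getSampleStyleSheet  # requires reportlab
    from reportlab.platypus import Paragraph, SimpleDocTemplate, Spacer

    styles = getSampleStyleSheet()
    flowables = []
    for style_name, text in plan:
        flowables.append(Paragraph(text, styles[style_name]))
        flowables.append(Spacer(1, 12))
    SimpleDocTemplate(path).build(flowables)


plan = plan_story("Tesla", {
    "Executive Summary": "Costs & revenue at a glance.",
    "Timeline": "",
})
print(plan)
```

With ReportLab installed, `render_pdf("reports/Tesla_Report.pdf", plan)` would write the file, provided the reports directory exists.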
Output:
reports/Topic_Report.pdf
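Create a virtual environment:
python -m venv .venv
Activate:
Windows:
.venv\Scripts\activate
Linux/macOS:
source .venv/bin/activate
Install the dependencies:
pip install -r requirements.txt
The requirements.txt file lists:
reportlab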
wikipedia-api
requests
beautifulsoup4
feedparser
pillow
newspaper3k
lxml
lxml_html_clean
python-dateutil
tldextract
cssselect
feedfinder2
jieba3k
Generate a report:
python main.py "Andrej Karpathy"
Example:
python main.py "Tesla"
Example:
python main.py "OpenAI"
Example generated file:
reports/
└── Andrej_Karpathy_Report.pdf
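The file name above suggests that spaces in the topic become underscores. A minimal sketch of such a mapping follows. The exact rule the project uses is not documented here, so the `report_path` helper and its handling of punctuation are assumptions.

```python
import re
from pathlib import Path


def report_path(topic: str, out_dir: str = "reports") -> Path:
    """Turn a topic into a filesystem-friendly report path."""
    # Collapse every run of non-alphanumeric characters into one underscore.
    slug = re.sub(r"[^A-Za-z0-9]+", "_", topic).strip("_")
    return Path(out_dir) / f"{slug}_Report.pdf"


print(report_path("Andrej Karpathy").as_posix())
```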
The report contains:
- Research information
- Images
- Timeline
- News analysis
- References
The project handles common failures without stopping:
- Wikipedia failures are handled gracefully.
- Corrupt images are skipped automatically.
- SVG images are ignored.
- Failed requests do not terminate the application.
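A common way to get this behavior is to wrap each step so that an exception is logged and replaced by a default value. The sketch below illustrates that pattern. The `safe_call` helper and the failing `flaky_wikipedia` function are hypothetical, and the project may handle errors differently.

```python
import logging
from typing import Callable, TypeVar

T = TypeVar("T")
logger = logging.getLogger("report")


def safe_call(func: Callable[..., T], *args, default: T,
              label: str = "step", **kwargs) -> T:
    """Run func; on any exception, log a warning and return the default."""
    try:
        return func(*args, **kwargs)
    except Exception as exc:
        logger.warning("%s failed: %s", label, exc)
        return default


def flaky_wikipedia(topic: str) -> dict:
    # Simulates a source that is unavailable.
    raise ConnectionError("network unreachable")


wiki = safe_call(flaky_wikipedia, "Tesla", default={}, label="wikipedia")
images = safe_call(lambda: ["a.png"], default=[], label="images")
print(wiki, images)
```

Because every step returns a usable value, later stages can still build a report from whatever sources succeeded.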
Potential enhancements:
- Face detection
- Duplicate removal
- Image captions
- Topic relevance scoring
- Domain filtering
- Link prioritization
Add support for:
- ArXiv
- GitHub
- Company websites
- Research papers
- Custom themes
- Charts
- Tables
- Infographics
Support additional export formats:
- DOCX
- HTML
- Markdown
- PowerPoint
Language: Python
Libraries:
- ReportLab
- Newspaper3k
- BeautifulSoup
- Requests
- Pillow
- FeedParser
- Wikipedia API
This project is intended for educational, research, and personal use.
Always respect website terms of service and copyright regulations when collecting content from external sources.
Dhiraj Kumar
Automated Research Report Generator using Python, Web Crawling, Knowledge Extraction, and PDF Generation.