Skip to content

dhiraj7kr/Create-Document-Report

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

12 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Create Document Report

Overview

Create Document Report is a Python-based automated research report generator that collects information from multiple online sources, processes the data, downloads relevant images, builds a structured knowledge base, and generates a professional PDF report.

The project is designed to automatically create detailed reports for people, companies, technologies, products, events, or any topic available online.

Unlike traditional summarizers, this project performs:

  • Wikipedia research
  • News aggregation
  • Web crawling
  • Article extraction
  • Image collection
  • Knowledge organization
  • PDF generation

The final output is a well-structured PDF report containing:

  • Executive Summary
  • Background Information
  • Timeline
  • Key Facts
  • Recent Developments
  • Analysis
  • References
  • Images

Features

Multi-Source Research

The system collects information from multiple sources:

Wikipedia

Retrieves:

  • Topic overview
  • Detailed sections
  • Historical information
  • Background knowledge

News Sources

Collects:

  • Latest news articles
  • Recent developments
  • Trending updates

Web Crawling

The crawler visits multiple pages and extracts:

  • Related articles
  • Additional information
  • Supporting content

Image Collection

Downloads:

  • Topic-related images
  • Article images
  • Reference images

These images are later inserted into the generated PDF.


Project Architecture

User Topic
    │
    ▼
Wikipedia Source
    │
    ▼
News Source
    │
    ▼
Crawler
    │
    ▼
Article Extractor
    │
    ▼
Image Crawler
    │
    ▼
Knowledge Builder
    │
    ▼
PDF Report Generator
    │
    ▼
Generated PDF

Project Structure

Create-Document-Report
│
├── main.py
│
├── pdf
│   └── report_builder.py
│
├── sources
│   ├── wikipedia_source.py
│   ├── news_source.py
│   ├── crawler.py
│   ├── article_extractor.py
│   ├── image_crawler.py
│   └── knowledge_builder.py
│
├── images
│
├── reports
│
├── requirements.txt
│
└── README.md

Component Details

main.py

Main entry point of the application.

Responsibilities:

  • Reads user topic
  • Starts data collection
  • Coordinates all modules
  • Builds knowledge base
  • Generates final PDF

wikipedia_source.py

Responsible for collecting Wikipedia information.

Extracts:

  • Topic summary
  • Sections
  • Subsections
  • Page URL

Example:

wiki = get_wikipedia_content("Andrej Karpathy")

news_source.py

Responsible for collecting news articles.

Features:

  • RSS feed retrieval
  • News article extraction
  • Recent developments

Returns:

[
    {
        "title": "...",
        "summary": "...",
        "url": "..."
    }
]

crawler.py

Performs deep crawling.

Features:

  • URL discovery
  • Related page extraction
  • Link traversal
  • Content collection

The crawler expands the amount of information available beyond Wikipedia and news sources.


article_extractor.py

Uses Newspaper3k to extract:

  • Full article text
  • Authors
  • Publish dates
  • Keywords
  • Top image

Example:

extractor.extract(url)

Returns:

{
    "title": "...",
    "text": "...",
    "summary": "...",
    "top_image": "..."
}

image_crawler.py

Responsible for:

  • Discovering images
  • Downloading images
  • Filtering invalid formats
  • Removing unsupported SVG files

Supported formats:

  • JPG
  • JPEG
  • PNG
  • WEBP

Downloaded images are stored inside:

images/

knowledge_builder.py

Creates a structured knowledge base.

Combines:

  • Wikipedia content
  • News articles
  • Crawled articles
  • Images

Also generates:

  • Keywords
  • Timeline
  • Facts
  • Statistics

Example:

knowledge = builder.build(...)

report_builder.py

Generates the final PDF report.

Sections:

  1. Cover Page
  2. Contents
  3. Executive Summary
  4. Image Gallery
  5. Background Information
  6. Timeline
  7. Key Facts
  8. Recent Developments
  9. Analysis
  10. References

Uses:

  • ReportLab
  • Pillow

Output:

reports/Topic_Report.pdf

Installation

Create Virtual Environment

python -m venv .venv

Activate:

Windows

.venv\Scripts\activate

Linux / Mac

source .venv/bin/activate

Install Dependencies

pip install -r requirements.txt

Requirements

reportlab
wikipedia-api
requests
beautifulsoup4
feedparser
pillow
newspaper3k
lxml
lxml_html_clean
python-dateutil
tldextract
cssselect
feedfinder2
jieba3k

Usage

Generate a report:

python main.py "Andrej Karpathy"

Example:

python main.py "Tesla"

Example:

python main.py "OpenAI"

Output

Example generated file:

reports/
└── Andrej_Karpathy_Report.pdf

The report contains:

  • Research information
  • Images
  • Timeline
  • News analysis
  • References

Error Handling

The project handles:

Missing Pages

Wikipedia failures are handled gracefully.

Invalid Images

Corrupt images are skipped automatically.

Unsupported Formats

SVG images are ignored.

Network Failures

Failed requests do not terminate the application.


Future Improvements

Potential enhancements:

Better Image Selection

  • Face detection
  • Duplicate removal
  • Image captions

Smarter Crawling

  • Topic relevance scoring
  • Domain filtering
  • Link prioritization

More Sources

Add support for:

  • ArXiv
  • GitHub
  • Company websites
  • Research papers

Better PDF Design

  • Custom themes
  • Charts
  • Tables
  • Infographics

Export Formats

Support:

  • DOCX
  • HTML
  • Markdown
  • PowerPoint

Technologies Used

Python

Libraries:

  • ReportLab
  • Newspaper3k
  • BeautifulSoup
  • Requests
  • Pillow
  • FeedParser
  • Wikipedia API

License

This project is intended for educational, research, and personal use.

Always respect website terms of service and copyright regulations when collecting content from external sources.


Author

Dhiraj Kumar

Automated Research Report Generator using Python, Web Crawling, Knowledge Extraction, and PDF Generation.

About

Generate a Document Report on the Particular Topic By just giving the Topic Name

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages