Skip to content

YoshiAligina/rfp-processor

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

29 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

RFP Processor: AI-Powered Request for Proposal Analysis System

Technical Architecture Overview

This system implements a sophisticated, production-grade RFP analysis pipeline utilizing advanced machine learning techniques, natural language processing, and hybrid AI architectures. The solution combines transformer-based neural networks with traditional statistical methods to provide accurate proposal scoring and decision support.

Core Architecture Components

┌─────────────────────────────────────────────────────────────────┐
│                    RFP Processor Architecture                   │
├─────────────────────────────────────────────────────────────────┤
│  Web Interface Layer (Flask/Jinja2)                           │
│  ├── templates/database.html    - Database management UI       │
│  ├── templates/upload.html      - Document upload interface    │
│  ├── templates/index.html       - Dashboard & analytics        │
│  └── templates/base.html        - Base template with Bootstrap │
├─────────────────────────────────────────────────────────────────┤
│  Application Logic Layer                                       │
│  ├── web_app.py                 - Flask routes & controllers   │
│  └── upload_handler.py          - File processing pipeline     │
├─────────────────────────────────────────────────────────────────┤
│  Machine Learning Core                                         │
│  ├── model_utils.py             - Longformer fine-tuning       │
│  ├── data_utils.py              - Data preprocessing & storage │
│  └── document_utils.py          - Text extraction & parsing    │
├─────────────────────────────────────────────────────────────────┤
│  Document Processing Pipeline                                  │
│  ├── ocr_utils.py               - OCR & scanned PDF handling   │
│  ├── safe_csv_utils.py          - Atomic database operations   │
│  └── database_utils.py          - Database schema management   │
├─────────────────────────────────────────────────────────────────┤
│  Analytics & Reporting                                         │
│  ├── rfp_counter.py             - Statistical analysis         │
│  ├── generate_html_report.py    - HTML report generation       │
│  └── evaluate_accuracy.py       - Model performance metrics    │
└─────────────────────────────────────────────────────────────────┘

Deep Technical Implementation Analysis

1. Machine Learning Core (model_utils.py)

Primary Algorithm: Fine-tuned Longformer Transformer Architecture

Longformer Implementation Details

MODEL_NAME = "allenai/longformer-base-4096"  # Base pretrained model
MODEL_SAVE_PATH = "fine_tuned_longformer"    # Custom fine-tuned checkpoint

Technical Specifications:

  • Architecture: Longformer-base with sliding window attention mechanism
  • Context Length: 4,096 tokens (8x longer than standard BERT)
  • Attention Pattern: Sparse attention with sliding window + global attention
  • Parameters: ~149M parameters in base configuration
  • Fine-tuning Strategy: Task-specific adaptation for binary classification

Hybrid Prediction System

The system implements a sophisticated hybrid approach combining:

  1. Transformer-based Neural Predictions

    def predict_document_probability(text, auto_finetune=True):
        # Process text through fine-tuned Longformer
        chunks = chunk_text(text, max_tokens=4000, stride=500)
        base_probs = []
        
        for chunk in chunks:
            inputs = tokenizer(chunk, return_tensors="pt", 
                             truncation=True, max_length=4096)
            logits = model(**inputs).logits
            prob = torch.softmax(logits, dim=1)[0][1].item()
            base_probs.append(prob)
  2. Historical Similarity Learning

    def calculate_similarity_score(current_text, historical_df):
        # TF-IDF vectorization with n-gram analysis
        vectorizer = TfidfVectorizer(max_features=1000, 
                                   stop_words='english', 
                                   ngram_range=(1, 2))
        # Cosine similarity computation against historical decisions
        similarities = cosine_similarity(current_vector, tfidf_matrix[1:])
  3. Adaptive Weight Distribution

    # Dynamic weighting based on model confidence and data availability
    if has_finetuned_model:
        weight_base = 0.7 if num_historical >= 20 else 0.5
        weight_similarity = 1.0 - weight_base
    else:
        weight_similarity = 0.8 if num_historical >= 10 else 0.5

Fine-tuning Process

Multi-Stage Training Pipeline:

  1. Data Preparation Stage

    def prepare_training_data():
        # Load historical decisions from CSV database
        # Filter for meaningful content (>20 characters)
        # Convert categorical decisions to binary labels
        # Balance approved/denied distribution
  2. Training Configuration

    optimizer = AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=0, num_training_steps=total_steps
    )
  3. Background Training Architecture

    • Robust Error Handling: Automatic recovery and checkpoint saving
    • Memory Management: GPU cache clearing and garbage collection
    • Progress Persistence: Auto-save after each epoch
    • Tab-Switch Resilience: Continues training regardless of UI focus

2. Document Processing Pipeline (document_utils.py)

Multi-Format Document Extraction System:

PDF Processing

def extract_text_from_pdf(pdf_path: str) -> str:
    # Primary: PyMuPDF for standard PDFs
    # Fallback: OCR processing for scanned documents
    # Handle encrypted/protected PDFs
    # Extract text with formatting preservation

OCR Integration (ocr_utils.py)

def extract_text(image_path: str) -> str:
    # EasyOCR implementation with GPU acceleration
    # Multi-language support (English optimized)
    # Confidence thresholding for text extraction
    # Scanned document detection and processing

Microsoft Office Integration

def extract_text_from_docx(docx_path: str) -> str:
    # python-docx library for Word document parsing
    # Extract paragraphs, tables, and headers
    # Handle embedded objects and images
    # Preserve document structure

3. Web Application Architecture (web_app.py)

Flask-based RESTful API with Production Optimizations:

Route Architecture

@app.route('/upload', methods=['GET', 'POST'])
def upload_page():
    # Handle multipart file uploads
    # Validate file types and sizes
    # Secure filename processing
    # Asynchronous document processing

Database Integration

@app.route('/database')
def database_page():
    # Dynamic filtering and sorting
    # Pagination for large datasets
    # Real-time search capabilities
    # Bulk operations support

Production Deployment (run_production.py)

if __name__ == '__main__':
    # Disable debug mode for production
    # Remove file watching to prevent processing interruption
    # Configure logging and error handling
    app.run(host='0.0.0.0', port=5000, debug=False, use_reloader=False)

4. Data Management System

CSV-based Database with ACID Properties (safe_csv_utils.py)

Atomic Operations:

class SafeCSVManager:
    def __init__(self, csv_file: str):
        # Implement file locking for concurrent access
        # Automatic backup creation before modifications
        # Schema validation and type checking
        # Rollback capability for failed operations

Data Schema:

SCHEMA = {
    "filename": {"type": "string", "required": True},
    "title": {"type": "string", "required": True},
    "sender": {"type": "string", "required": True},
    "probability": {"type": "float", "min": 0.0, "max": 1.0},
    "decision": {"type": "string", "allowed": ["Pending", "Approved", "Denied"]},
    "summary": {"type": "string", "required": True},
    "file_list": {"type": "string", "required": False}
}

5. Analytics and Reporting System

Statistical Analysis (rfp_counter.py)

def count_unique_rfps(df):
    # Project-level counting and aggregation
    # Handle both individual and grouped RFP submissions
    # Smart duplicate detection for multi-file projects
    # Decision status distribution analysis

def count_unique_rfps_by_decision(df, decision_type):
    # Filter and count RFPs by specific decision status
    # Support for Pending, Approved, and Denied statuses
    # Project-based counting for accurate metrics

HTML Report Generation (generate_html_report.py)

def generate_comprehensive_report():
    # Create responsive HTML reports with Bootstrap
    # Interactive charts and visualizations
    # Export capabilities (PDF, CSV)
    # Real-time metric calculations

6. Model Evaluation Framework (evaluate_accuracy.py)

Comprehensive Model Assessment:

Cross-Validation Testing

def evaluate_model_accuracy(threshold=0.5, test_size=0.3):
    # Stratified train/test splitting
    # Multiple threshold optimization
    # Confusion matrix generation
    # Precision/Recall/F1-score calculation
    # ROC curve analysis

Performance Metrics

  • Overall Accuracy: Percentage of correct predictions
  • Precision: True positives / (True positives + False positives)
  • Recall: True positives / (True positives + False negatives)
  • F1-Score: Harmonic mean of precision and recall
  • AUC-ROC: Area under the receiver operating characteristic curve

7. Background Processing Architecture

Model Training Integration (model_utils.py)

def fine_tune_model(epochs=2, batch_size=1, learning_rate=2e-5):
    # Background-compatible training with progress tracking
    # Automatic saving after each epoch
    # Memory management and cleanup
    # Error recovery and checkpoint handling

Auto-Training System Integration

def auto_training_check(historical_df, auto_finetune=True):
    # Progressive training thresholds: 5 entries first, then every 10
    num_decisions = len(historical_df)
    has_model = os.path.exists(MODEL_SAVE_PATH)
    last_trained_count = get_last_training_count()
    
    # First training at 5 entries, subsequent training every 10 entries
    should_train = (not has_model and num_decisions >= 5) or \
                  (has_model and num_decisions >= last_trained_count + 10)
    
    if should_train:
        fine_tune_model(epochs=2, batch_size=1)
        save_training_count(num_decisions)

Advanced Features Implementation

Auto-Training System

  • Trigger Conditions: Automatic training at 5 entries initially, then every 10 entries thereafter
  • Training Strategy: Progressive fine-tuning with 2 epochs for optimal learning
  • Model Updates: Automatic model reload and improved predictions after training
  • Data Requirements: Balanced dataset with both approved and denied decisions
  • Progress Tracking: Persistent tracking of training milestones to prevent over-training

Text Chunking Algorithm

def chunk_text(text, max_tokens=4000, stride=500):
    # Sliding window approach with configurable overlap
    # Token-level splitting with context preservation
    # Boundary-aware chunking to maintain semantic coherence
    # Optimized for Longformer's attention mechanism

Feature Extraction Pipeline

def extract_rfp_features(text):
    # Regex-based pattern matching for RFP-specific elements
    # Budget information extraction
    # Deadline and timeline identification
    # Scope of work classification
    # Qualification requirements parsing

Technical Requirements and Dependencies

Core Dependencies

# Web Framework
Flask==2.3.3
Werkzeug==2.3.7
gunicorn>=20.1.0

# AI/ML Core
torch>=1.9.0
transformers>=4.20.0
scikit-learn>=1.0.2
datasets>=2.0.0
accelerate>=0.20.0

# Data Processing
pandas>=1.3.0
numpy>=1.21.0
langchain>=0.0.200

# Document Processing
PyPDF2>=2.10.0
pymupdf>=1.20.0
python-docx>=0.8.11
openpyxl>=3.0.9
xlrd>=2.0.1
easyocr>=1.6.0

# Additional Dependencies
Pillow>=8.0.0
requests>=2.25.0

System Requirements

  • Python: 3.8+ (tested with 3.12+)
  • Memory: 4GB RAM minimum, 8GB recommended for model training
  • Storage: 1GB free space for models and data
  • GPU: Optional CUDA-compatible GPU for accelerated training
  • Network: Internet connection for initial model download (~500MB)

Deployment Architecture

Production Mode

python run_production.py
# Optimized for stability with disabled debugging and file watching

Development Mode

python web_app.py
# Runs with debug mode, hot reloading, and verbose logging

Performance Optimization Strategies

Memory Management

  • GPU Memory: Automatic cache clearing between operations
  • System Memory: Garbage collection optimization
  • Model Loading: Lazy loading and efficient checkpoint management

Processing Optimization

  • Text Chunking: Optimized for transformer attention patterns
  • Batch Processing: Configurable batch sizes for memory constraints
  • Caching: Intelligent caching of frequently accessed data

Scalability Considerations

  • Horizontal Scaling: Stateless design enables load balancing
  • Database Scaling: CSV-based storage with migration path to SQL
  • Model Serving: Support for model versioning and A/B testing

Security Implementation

File Upload Security

  • Type Validation: Whitelist-based file type checking
  • Size Limits: Configurable upload size restrictions
  • Path Sanitization: Secure filename processing
  • Virus Scanning: Integration points for antivirus solutions

Data Protection

  • Input Sanitization: SQL injection and XSS prevention
  • Access Control: Role-based permission system ready
  • Audit Logging: Comprehensive activity tracking
  • Data Encryption: Preparation for at-rest encryption

This technical documentation provides a comprehensive overview of the RFP Processor's architecture, implementation details, and operational characteristics. Each component has been designed with production deployment, scalability, and maintainability in mind.

Quick Start Guide

Installation

# Clone the repository
git clone <repository-url>
cd rfp-processor-main

# Install dependencies
pip install -r requirements.txt

# Run the Flask web application
python run_production.py

Usage

  1. Navigate to http://localhost:5000 in your browser
  2. Upload RFP documents through the interface:
    • Individual Mode: Each file creates a separate database entry
    • Project Mode: Multiple files are combined into a single database entry with one score
  3. Review AI-generated scores and summaries
  4. Make approval/denial decisions to improve the model
  5. Use the database interface to manage processed RFPs
  6. Download individual files or complete project ZIP archives
  7. Fine-tune the model after accumulating 2+ decisions

Architecture Summary

This system represents a production-ready implementation of transformer-based document analysis with a professional Flask web interface. The hybrid approach combining Longformer neural networks with TF-IDF similarity learning provides robust performance across diverse RFP types. The system features comprehensive document processing, intelligent fine-tuning triggers, and modular architecture designed for maintainability and extensibility.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors