This system implements a sophisticated, production-grade RFP analysis pipeline utilizing advanced machine learning techniques, natural language processing, and hybrid AI architectures. The solution combines transformer-based neural networks with traditional statistical methods to provide accurate proposal scoring and decision support.
┌─────────────────────────────────────────────────────────────────┐
│ RFP Processor Architecture │
├─────────────────────────────────────────────────────────────────┤
│ Web Interface Layer (Flask/Jinja2) │
│ ├── templates/database.html - Database management UI │
│ ├── templates/upload.html - Document upload interface │
│ ├── templates/index.html - Dashboard & analytics │
│ └── templates/base.html - Base template with Bootstrap │
├─────────────────────────────────────────────────────────────────┤
│ Application Logic Layer │
│ ├── web_app.py - Flask routes & controllers │
│ └── upload_handler.py - File processing pipeline │
├─────────────────────────────────────────────────────────────────┤
│ Machine Learning Core │
│ ├── model_utils.py - Longformer fine-tuning │
│ ├── data_utils.py - Data preprocessing & storage │
│ └── document_utils.py - Text extraction & parsing │
├─────────────────────────────────────────────────────────────────┤
│ Document Processing Pipeline │
│ ├── ocr_utils.py - OCR & scanned PDF handling │
│ ├── safe_csv_utils.py - Atomic database operations │
│ └── database_utils.py - Database schema management │
├─────────────────────────────────────────────────────────────────┤
│ Analytics & Reporting │
│ ├── rfp_counter.py - Statistical analysis │
│ ├── generate_html_report.py - HTML report generation │
│ └── evaluate_accuracy.py - Model performance metrics │
└─────────────────────────────────────────────────────────────────┘
Primary Algorithm: Fine-tuned Longformer Transformer Architecture
MODEL_NAME = "allenai/longformer-base-4096" # Base pretrained model
MODEL_SAVE_PATH = "fine_tuned_longformer" # Custom fine-tuned checkpointTechnical Specifications:
- Architecture: Longformer-base with sliding window attention mechanism
- Context Length: 4,096 tokens (8x longer than standard BERT)
- Attention Pattern: Sparse attention with sliding window + global attention
- Parameters: ~149M parameters in base configuration
- Fine-tuning Strategy: Task-specific adaptation for binary classification
The system implements a sophisticated hybrid approach combining:
-
Transformer-based Neural Predictions
def predict_document_probability(text, auto_finetune=True): # Process text through fine-tuned Longformer chunks = chunk_text(text, max_tokens=4000, stride=500) base_probs = [] for chunk in chunks: inputs = tokenizer(chunk, return_tensors="pt", truncation=True, max_length=4096) logits = model(**inputs).logits prob = torch.softmax(logits, dim=1)[0][1].item() base_probs.append(prob)
-
Historical Similarity Learning
def calculate_similarity_score(current_text, historical_df): # TF-IDF vectorization with n-gram analysis vectorizer = TfidfVectorizer(max_features=1000, stop_words='english', ngram_range=(1, 2)) # Cosine similarity computation against historical decisions similarities = cosine_similarity(current_vector, tfidf_matrix[1:])
-
Adaptive Weight Distribution
# Dynamic weighting based on model confidence and data availability if has_finetuned_model: weight_base = 0.7 if num_historical >= 20 else 0.5 weight_similarity = 1.0 - weight_base else: weight_similarity = 0.8 if num_historical >= 10 else 0.5
Multi-Stage Training Pipeline:
-
Data Preparation Stage
def prepare_training_data(): # Load historical decisions from CSV database # Filter for meaningful content (>20 characters) # Convert categorical decisions to binary labels # Balance approved/denied distribution
-
Training Configuration
optimizer = AdamW(model.parameters(), lr=2e-5, weight_decay=0.01) scheduler = get_linear_schedule_with_warmup( optimizer, num_warmup_steps=0, num_training_steps=total_steps )
-
Background Training Architecture
- Robust Error Handling: Automatic recovery and checkpoint saving
- Memory Management: GPU cache clearing and garbage collection
- Progress Persistence: Auto-save after each epoch
- Tab-Switch Resilience: Continues training regardless of UI focus
Multi-Format Document Extraction System:
def extract_text_from_pdf(pdf_path: str) -> str:
# Primary: PyMuPDF for standard PDFs
# Fallback: OCR processing for scanned documents
# Handle encrypted/protected PDFs
# Extract text with formatting preservationdef extract_text(image_path: str) -> str:
# EasyOCR implementation with GPU acceleration
# Multi-language support (English optimized)
# Confidence thresholding for text extraction
# Scanned document detection and processingdef extract_text_from_docx(docx_path: str) -> str:
# python-docx library for Word document parsing
# Extract paragraphs, tables, and headers
# Handle embedded objects and images
# Preserve document structureFlask-based RESTful API with Production Optimizations:
@app.route('/upload', methods=['GET', 'POST'])
def upload_page():
# Handle multipart file uploads
# Validate file types and sizes
# Secure filename processing
# Asynchronous document processing@app.route('/database')
def database_page():
# Dynamic filtering and sorting
# Pagination for large datasets
# Real-time search capabilities
# Bulk operations supportif __name__ == '__main__':
# Disable debug mode for production
# Remove file watching to prevent processing interruption
# Configure logging and error handling
app.run(host='0.0.0.0', port=5000, debug=False, use_reloader=False)Atomic Operations:
class SafeCSVManager:
def __init__(self, csv_file: str):
# Implement file locking for concurrent access
# Automatic backup creation before modifications
# Schema validation and type checking
# Rollback capability for failed operationsData Schema:
SCHEMA = {
"filename": {"type": "string", "required": True},
"title": {"type": "string", "required": True},
"sender": {"type": "string", "required": True},
"probability": {"type": "float", "min": 0.0, "max": 1.0},
"decision": {"type": "string", "allowed": ["Pending", "Approved", "Denied"]},
"summary": {"type": "string", "required": True},
"file_list": {"type": "string", "required": False}
}def count_unique_rfps(df):
# Project-level counting and aggregation
# Handle both individual and grouped RFP submissions
# Smart duplicate detection for multi-file projects
# Decision status distribution analysis
def count_unique_rfps_by_decision(df, decision_type):
# Filter and count RFPs by specific decision status
# Support for Pending, Approved, and Denied statuses
# Project-based counting for accurate metricsdef generate_comprehensive_report():
# Create responsive HTML reports with Bootstrap
# Interactive charts and visualizations
# Export capabilities (PDF, CSV)
# Real-time metric calculationsComprehensive Model Assessment:
def evaluate_model_accuracy(threshold=0.5, test_size=0.3):
# Stratified train/test splitting
# Multiple threshold optimization
# Confusion matrix generation
# Precision/Recall/F1-score calculation
# ROC curve analysis- Overall Accuracy: Percentage of correct predictions
- Precision: True positives / (True positives + False positives)
- Recall: True positives / (True positives + False negatives)
- F1-Score: Harmonic mean of precision and recall
- AUC-ROC: Area under the receiver operating characteristic curve
def fine_tune_model(epochs=2, batch_size=1, learning_rate=2e-5):
# Background-compatible training with progress tracking
# Automatic saving after each epoch
# Memory management and cleanup
# Error recovery and checkpoint handlingdef auto_training_check(historical_df, auto_finetune=True):
# Progressive training thresholds: 5 entries first, then every 10
num_decisions = len(historical_df)
has_model = os.path.exists(MODEL_SAVE_PATH)
last_trained_count = get_last_training_count()
# First training at 5 entries, subsequent training every 10 entries
should_train = (not has_model and num_decisions >= 5) or \
(has_model and num_decisions >= last_trained_count + 10)
if should_train:
fine_tune_model(epochs=2, batch_size=1)
save_training_count(num_decisions)- Trigger Conditions: Automatic training at 5 entries initially, then every 10 entries thereafter
- Training Strategy: Progressive fine-tuning with 2 epochs for optimal learning
- Model Updates: Automatic model reload and improved predictions after training
- Data Requirements: Balanced dataset with both approved and denied decisions
- Progress Tracking: Persistent tracking of training milestones to prevent over-training
def chunk_text(text, max_tokens=4000, stride=500):
# Sliding window approach with configurable overlap
# Token-level splitting with context preservation
# Boundary-aware chunking to maintain semantic coherence
# Optimized for Longformer's attention mechanismdef extract_rfp_features(text):
# Regex-based pattern matching for RFP-specific elements
# Budget information extraction
# Deadline and timeline identification
# Scope of work classification
# Qualification requirements parsing# Web Framework
Flask==2.3.3
Werkzeug==2.3.7
gunicorn>=20.1.0
# AI/ML Core
torch>=1.9.0
transformers>=4.20.0
scikit-learn>=1.0.2
datasets>=2.0.0
accelerate>=0.20.0
# Data Processing
pandas>=1.3.0
numpy>=1.21.0
langchain>=0.0.200
# Document Processing
PyPDF2>=2.10.0
pymupdf>=1.20.0
python-docx>=0.8.11
openpyxl>=3.0.9
xlrd>=2.0.1
easyocr>=1.6.0
# Additional Dependencies
Pillow>=8.0.0
requests>=2.25.0- Python: 3.8+ (tested with 3.12+)
- Memory: 4GB RAM minimum, 8GB recommended for model training
- Storage: 1GB free space for models and data
- GPU: Optional CUDA-compatible GPU for accelerated training
- Network: Internet connection for initial model download (~500MB)
python run_production.py
# Optimized for stability with disabled debugging and file watchingpython web_app.py
# Runs with debug mode, hot reloading, and verbose logging- GPU Memory: Automatic cache clearing between operations
- System Memory: Garbage collection optimization
- Model Loading: Lazy loading and efficient checkpoint management
- Text Chunking: Optimized for transformer attention patterns
- Batch Processing: Configurable batch sizes for memory constraints
- Caching: Intelligent caching of frequently accessed data
- Horizontal Scaling: Stateless design enables load balancing
- Database Scaling: CSV-based storage with migration path to SQL
- Model Serving: Support for model versioning and A/B testing
- Type Validation: Whitelist-based file type checking
- Size Limits: Configurable upload size restrictions
- Path Sanitization: Secure filename processing
- Virus Scanning: Integration points for antivirus solutions
- Input Sanitization: SQL injection and XSS prevention
- Access Control: Role-based permission system ready
- Audit Logging: Comprehensive activity tracking
- Data Encryption: Preparation for at-rest encryption
This technical documentation provides a comprehensive overview of the RFP Processor's architecture, implementation details, and operational characteristics. Each component has been designed with production deployment, scalability, and maintainability in mind.
# Clone the repository
git clone <repository-url>
cd rfp-processor-main
# Install dependencies
pip install -r requirements.txt
# Run the Flask web application
python run_production.py- Navigate to
http://localhost:5000in your browser - Upload RFP documents through the interface:
- Individual Mode: Each file creates a separate database entry
- Project Mode: Multiple files are combined into a single database entry with one score
- Review AI-generated scores and summaries
- Make approval/denial decisions to improve the model
- Use the database interface to manage processed RFPs
- Download individual files or complete project ZIP archives
- Fine-tune the model after accumulating 2+ decisions
This system represents a production-ready implementation of transformer-based document analysis with a professional Flask web interface. The hybrid approach combining Longformer neural networks with TF-IDF similarity learning provides robust performance across diverse RFP types. The system features comprehensive document processing, intelligent fine-tuning triggers, and modular architecture designed for maintainability and extensibility.