NexusOps is an enterprise-grade, horizontally scalable Machine Learning architecture designed to automate deep learning training and hyperparameter tuning. It is specifically engineered to run complex ML workloads (like Kaggle datasets) on highly constrained cloud environments (e.g., 512MB RAM free-tier instances) by separating computation from the API layer using a distributed worker-node cluster.
Training Deep Learning models on 512MB RAM servers usually results in instant Out-Of-Memory (OOM) crashes. NexusOps solves this through Horizontal Scaling and Memory Orchestration. By deploying multiple independent worker nodes connected via an Upstash Redis Message Broker, the system dynamically distributes hyperparameter tuning trials. If one node fails, the system recovers. If more power is needed, more free-tier nodes are attached. It brings data-center level distributed computing to zero-cost cloud hosting.
NexusOps operates on a fully automated, 7-step sequential and distributed pipeline:
- The user interacts with a responsive Streamlit UI.
- Inputs include the raw CSV dataset, the target column to predict, and a natural language description of the "Use Case".
- The UI packages this into an immutable Input State and sends it to the FastAPI backend, tracking progress via a unique Submission ID.
- The Problem: Blindly building neural networks can explode memory.
- The Solution: The backend sends the dataset metadata and user "Use Case" to the Google Gemini API. The LLM acts as an architect, generating strict JSON constraints (e.g., max hidden layers, specific activation functions, dropout boundaries) tailored to the specific dataset size to prevent memory exhaustion during training.
- Workers download the dataset and perform automated preprocessing.
- Memory Optimization: Downcasts
float64tofloat32to instantly halve the RAM footprint. - Handles missing values, performs categorical encoding (One-Hot), and splits data into Train/Val/Test subsets. PyTorch DataLoaders are configured with optimized batch sizes to stream data into memory efficiently.
- This is the core engine. FastAPI pushes the training task to the Redis Task Queue.
- Multiple Celery Worker Web Services actively listen to this queue. They pull the task and begin parallel executions.
- Optuna utilizes
JournalRedisStorageto maintain a centralized state. Worker A and Worker B can simultaneously run deep learning trials (e.g., 10 trials each) without overriding each other, dramatically reducing total execution time.
- Once all distributed trials conclude, the system fetches the absolute best hyperparameter dictionary from the Redis database.
- A final, comprehensive PyTorch neural network is built using these best parameters and trained on the full dataset.
- Post-training, feature importance (e.g., finding out that 'smoker' dictates 'insurance charges') is calculated.
- This raw mathematical data is sent back to the Gemini API via a Safe Fallback REST Call to generate a human-readable Executive Summary.
- The trained model weights (
.pth), configuration (.json), and metrics are permanently saved. The MongoDB Atlas document is updated to aCOMPLETEDstate.
- The Streamlit frontend constantly polls the MongoDB database. Upon completion, it dynamically unlocks the dashboard.
- Users can view live
$R^2$ scores, RMSE, and the AI-generated insights. - Inference Mode: Users can upload a test dataset (without the target column), and the UI will run live predictions using the newly distributed-trained model.
- Frontend: Streamlit
- REST API: FastAPI, Uvicorn
- Distributed Queue & Broker: Celery, Upstash Redis
- Database: MongoDB Atlas (NoSQL for state tracking)
- AI & Machine Learning: PyTorch, Optuna, Scikit-Learn, Pandas
- Generative AI: Google Gemini (Constraint Engine & Insight Generation)
- Cloud Infrastructure: Render (Multiple Web Services for API and Worker Nodes)
- Uptime Management: Cron-job / UptimeRobot (Bypassing free-tier sleep restrictions via dummy HTTP servers).
This project employs several hardcore software engineering tactics:
- Thread Throttling:
torch.set_num_threads(1)ensures multi-core spikes don't trigger silent memory kills on fractional vCPUs. - Aggressive Garbage Collection:
gc.collect()and explicit cache clearing after every Optuna trial. - Decoupled Fallbacks: If the Gemini API fails (e.g., 503 Server Error), the system uses a
try-exceptboundary to inject default text, ensuring the 20-minute training pipeline doesn't crash at the finish line.
To run this distributed cluster on your local machine:
1. Clone the repository
git clone [https://github.com/Yogesh-max2123/NexusOps.git](https://github.com/Yogesh-max2123/NexusOps.git)
cd NexusOps2. Create a Virtual Environment & Install Dependencies
python -m venv venv
source venv/bin/activate # On Windows use `venv\Scripts\activate`
pip install -r requirements.txt3. Environment Variables (.env) Create a .env file in the root directory:
MONGODB_URL=your_mongodb_atlas_connection_string
REDIS_URL=your_upstash_redis_url
GEMINI_API_KEY=your_google_gemini_api_key
CLOUDINARY_URL=your_cloudinary_url (if used for file hosting)4.Start the Cluster (Open 3 separate terminals)
Terminal 1: Start the MLflow Tracking Server (The Control Room)
mlflow ui
# Accessible at [http://127.0.0.1:5000](http://127.0.0.1:5000)Terminal 2: Start the FastAPI Backend
uvicorn app.main:app --reloadTerminal 3: Start the Celery Worker (The ML Engine)
celery -A app.celery_config worker --loglevel=info --pool=soloTerminal 4: Start the Streamlit UI
streamlit run ui/app.pyEngineered out of pure curiosity to crack cloud constraints, master MLOps, and build some seriously cool stuff.