Machine Learning System

Sales Forecasting Platform

End-to-End MLOps System

Enterprise-grade retail demand forecasting platform — from raw transaction data to production API. PySpark on AWS EMR for 125M+ rows, XGBoost training, MLflow tracking, FastAPI serving, Evidently drift monitoring, and automated CI/CD to Amazon ECR.

GitHub Repository
Revenue Forecasting Monthly Sales Trend
125M+
Transaction Rows
PySpark
AWS EMR
250
Optuna Trials
FastAPI
Inference Service
Drift
Evidently Monitor
CI/CD
GitHub Actions
Problem

The Retail Forecasting Challenge

Inaccurate demand forecasts erode margins, increase costs, and degrade customer experience. Traditional forecasting fails to scale across thousands of SKUs and store locations.

Key business consequences:

  • Stockouts cause lost revenue and customer churn
  • Overstock ties up capital and creates waste
  • Poor promotion planning leads to inefficient spend
  • Inefficient staffing from disconnected demand signals
  • Reactive logistics instead of predictive replenishment

At scale: 125M+ rows of transaction data require distributed processing. Production data drift goes undetected without monitoring.

This system solves forecasting as a production ML engineering challenge. It demonstrates PySpark on AWS EMR, train-serve parity, S3 artifact management, FastAPI inference, Evidently drift monitoring, and CI/CD automation.

Project Overview

From Raw Data to Production API

An end-to-end ML forecasting service that transforms raw retail data into production-ready demand predictions.

Built for scalability and reliability: PySpark feature engineering, XGBoost training, S3 storage, FastAPI inference, and automated drift detection.

Python 3.11 PySpark XGBoost FastAPI MLflow Evidently Docker AWS S3 / EMR / ECR
End-to-End Workflow (10 Stages)
  1. Raw Data Ingestion → S3/Parquet historical sales
  2. Distributed Feature Engineering → PySpark on EMR
  3. Model Training → XGBoost with Optuna tuning
  4. Model Artifact Storage → S3 versioning
  5. Inference API → FastAPI REST service
  6. Online Feature Engineering → Train-serve parity
  7. Prediction Logging → S3 structured JSON
  8. Drift Monitoring → Evidently reports
  9. User Interface → Streamlit frontend
  10. CI/CD Quality Gates → GitHub Actions to ECR
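The ten stages above can be sketched as a plain-Python stage runner; the stage names and the shared-state dict are illustrative stand-ins, not the repository's actual code:

```python
# Minimal sketch of chaining pipeline stages as plain functions that
# enrich a shared state dict; stage bodies here are stubs.
from typing import Callable

def ingest(state: dict) -> dict:
    state["raw"] = ["sales.parquet"]              # stand-in for S3 reads
    return state

def engineer_features(state: dict) -> dict:
    state["features"] = len(state["raw"])         # stand-in for Spark ETL
    return state

def train(state: dict) -> dict:
    state["model"] = f"xgb-{state['features']}"   # stand-in for XGBoost training
    return state

PIPELINE: list[Callable[[dict], dict]] = [ingest, engineer_features, train]

def run(stages, state=None):
    state = state or {}
    for stage in stages:
        state = stage(state)                      # each stage passes state forward
    return state

result = run(PIPELINE)
```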
Business Impact

How This System Enables Stakeholder Decisions

Supporting operational decision-making across retail functions

📊
Retail Planners
What should we stock next week?
  • Store-item level demand predictions
  • Seasonal uplift quantification
  • Safety stock optimization signals
  • Slow-mover markdown triggers
🚚
Supply Chain
How do we optimize replenishment?
  • Cross-dock capacity planning
  • Vendor order sizing
  • Distribution center allocation
  • Forward demand sharing
⚙️
Operations
How do we staff optimally?
  • Labor scheduling from demand forecasts
  • Peak period staffing allocation
  • Equipment capacity planning
  • Shift optimization
🎯
Marketing
How do we plan promotions?
  • Demand uplift estimates
  • Markdown velocity forecasts
  • Campaign ROI budgeting
  • Promo-sensitive SKU targeting
ROI Analysis

ROI Simulation — Forecasting Impact

Economic rationale for a mid-sized retailer with 50 stores and 3,000 SKUs:

Baseline Problem Cost

Cost Component                       Annual Impact
Excess inventory holding costs       $2.5M
Stockout-related lost revenue        $1.8M
Waste and shrinkage from overstock   $1.2M
Expedited freight costs              $0.6M
Manual forecasting labor             $0.4M
Total Annual Cost                    $6.5M

Estimated Annual Savings

Savings Category                     Annual Benefit
Inventory cost reduction             $250K
Stockout revenue recovery            $450K
Waste reduction                      $144K
Freight cost avoidance               $120K
Analyst labor redeployment           $280K
Total Annual Benefit                 $1.24M

Payback period: ~3 months
Year 1 Net Benefit: $978K
3-Year NPV (10% discount rate): ~$2.8M
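The arithmetic above can be checked quickly. The one-time implementation cost is not stated, but the Year 1 figures imply roughly $262K ($1.24M - $978K); a back-of-envelope sketch under that assumption:

```python
# Back-of-envelope check of the ROI figures; the ~$262K one-time cost is
# inferred from Year 1 net benefit ($1.24M - $978K), not stated directly.
annual_benefit = 1.24e6
one_time_cost = annual_benefit - 0.978e6          # ~$262K implied

rate = 0.10  # discount rate from the 3-year NPV line
npv = sum(annual_benefit / (1 + rate) ** t for t in (1, 2, 3)) - one_time_cost
payback_months = 12 * one_time_cost / annual_benefit
```

This reproduces the stated figures: NPV comes out near $2.8M and payback under three months.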

System Architecture Overview

Cloud-native ML system with microservices architecture

Data Integration and Model Training Layer

MLOps Pipeline

ML Pipeline

13 sequential stages from data ingestion to deployment

01

Data Ingestion

Historical sales transactions, store metadata, and item attributes ingested from S3 and local storage in CSV/Parquet format.

02

Data Validation

CI/CD pipeline validates reference datasets for schema correctness, row volume sanity, and Parquet readability before deployment.
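A minimal sketch of those checks in plain Python, assuming an illustrative schema and row threshold (the real pipeline reads Parquet, e.g. via pyarrow, rather than in-memory rows):

```python
# Sketch of pre-deployment dataset checks: schema correctness and row
# volume sanity. EXPECTED_SCHEMA and MIN_ROWS are illustrative assumptions.
EXPECTED_SCHEMA = {"date": str, "store": int, "item": int, "sales": float}
MIN_ROWS = 3  # volume sanity threshold for this toy sample

def validate(rows: list[dict]) -> list[str]:
    errors = []
    if len(rows) < MIN_ROWS:
        errors.append(f"row volume {len(rows)} below minimum {MIN_ROWS}")
    for i, row in enumerate(rows):
        if set(row) != set(EXPECTED_SCHEMA):
            errors.append(f"row {i}: columns {sorted(row)} do not match schema")
            continue
        for col, typ in EXPECTED_SCHEMA.items():
            if not isinstance(row[col], typ):
                errors.append(f"row {i}: {col} is not {typ.__name__}")
    return errors

sample = [
    {"date": "2024-01-01", "store": 1, "item": 10, "sales": 12.0},
    {"date": "2024-01-02", "store": 1, "item": 10, "sales": 9.5},
    {"date": "2024-01-03", "store": 1, "item": 10, "sales": 11.0},
]
```

In CI these checks would run as a pytest job and fail the build on any non-empty error list.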

03

Spark Feature Engineering

PySpark on AWS EMR constructs temporal, lag, rolling window, and store-level aggregation features with anti-leakage constraints.
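A pandas sketch of the same lag/rolling logic (the production job runs in PySpark on EMR; column names and window sizes here are assumptions). Shifting by one day before rolling keeps every feature strictly historical, which is the anti-leakage constraint:

```python
# pandas sketch of leakage-safe lag and rolling-window features,
# computed per store-item group on daily sales.
import pandas as pd

df = pd.DataFrame({
    "store": [1] * 8,
    "item": [10] * 8,
    "sales": [5.0, 7.0, 6.0, 8.0, 9.0, 4.0, 6.0, 7.0],
})
grp = df.groupby(["store", "item"])["sales"]
df["lag_1"] = grp.shift(1)   # yesterday's sales
df["lag_7"] = grp.shift(7)   # same day last week
df["roll_7_mean"] = grp.transform(
    lambda s: s.shift(1).rolling(7, min_periods=1).mean()
)  # 7-day average excluding the current day, so today's target never leaks
```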

04

Model Training

XGBoost regression models trained with baseline and Optuna-tuned variants. Temporal train-validation split enforced.
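The temporal split can be sketched in a few lines; the cutoff date is an illustrative assumption. The invariant is that every validation date lies strictly after every training date, so the model is never evaluated on the past:

```python
# Sketch of a temporal train/validation split by date cutoff.
from datetime import date

rows = [(date(2024, 1, d), float(d)) for d in range(1, 11)]  # (day, sales)
CUTOFF = date(2024, 1, 8)

train = [(d, y) for d, y in rows if d <= CUTOFF]
valid = [(d, y) for d, y in rows if d > CUTOFF]

# the invariant a random split would violate:
assert max(d for d, _ in train) < min(d for d, _ in valid)
```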

05

Experiment Tracking

MLflow tracks training runs, hyperparameters, evaluation metrics (RMSE, MAE), and model artifacts for reproducibility.

06

Model Versioning

Trained models versioned via S3 key naming conventions with date-stamped filenames. Supports rollback and A/B testing.
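A minimal sketch of the key convention, with a hypothetical prefix; because filenames are date-stamped, lexicographic order is chronological order, and rollback is just pointing the API at the previous key:

```python
# Sketch of date-stamped S3 key naming for model versioning;
# the "models/xgb" prefix is an illustrative assumption.
from datetime import date

def model_key(train_date: date, prefix: str = "models/xgb") -> str:
    return f"{prefix}/sales_model_{train_date:%Y-%m-%d}.json"

keys = sorted(model_key(d) for d in [date(2024, 3, 1), date(2024, 2, 1)])
latest, previous = keys[-1], keys[-2]  # rollback: serve `previous` instead
```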

07

Model Packaging

XGBoost models serialized to JSON format for cross-language compatibility. CI/CD validates format integrity.

08

Model Deployment

Containerized deployment using Docker and Compose. GitHub Actions builds, tests, and pushes to Amazon ECR.

09

Inference Pipeline

FastAPI service loads model from S3. Includes request validation, online feature engineering, and graceful error handling.

10

Prediction Logging

Every prediction logged as structured JSON to date/hour partitioned S3 paths for audit trails and retraining data.
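A sketch of what one logged record and its date/hour partitioned key could look like; the field names and path prefix are assumptions:

```python
# Sketch of a structured prediction log record and its partitioned S3 key.
import json
from datetime import datetime, timezone

def log_record(store: int, item: int, prediction: float,
               ts: datetime) -> tuple[str, str]:
    # date/hour partitioned path keeps drift queries and audits cheap
    key = f"logs/predictions/{ts:%Y/%m/%d/%H}/{ts:%Y%m%dT%H%M%S}.json"
    body = json.dumps({
        "timestamp": ts.isoformat(),
        "store": store,
        "item": item,
        "prediction": prediction,
    })
    return key, body  # in production: boto3 put_object with this key and body

key, body = log_record(1, 10, 42.5,
                       datetime(2024, 3, 1, 14, 30, 5, tzinfo=timezone.utc))
```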

11

Drift Monitoring

Evidently compares production logs against reference datasets. Generates HTML reports with drift share alerts.
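The idea behind a drift-share alert can be sketched without Evidently: flag each feature whose distribution shifted (here via a Population Stability Index), then alert on the share of flagged features. The 0.2 threshold is a common rule of thumb, not Evidently's exact default:

```python
# Toy sketch of per-feature drift detection and a drift-share summary.
import math

def psi(reference: list[float], current: list[float], bins: int = 4) -> float:
    # bin edges come from the reference window only
    lo, hi = min(reference), max(reference)
    edges = [lo + (hi - lo) * i / bins for i in range(1, bins)]

    def frac(xs):
        counts = [0] * bins
        for x in xs:
            counts[sum(x > e for e in edges)] += 1
        return [max(c / len(xs), 1e-4) for c in counts]  # avoid log(0)

    return sum((c - r) * math.log(c / r)
               for r, c in zip(frac(reference), frac(current)))

ref = [float(i % 10) for i in range(100)]   # stable feature
shifted = [x + 6.0 for x in ref]            # drifted feature
drifted = [psi(ref, cur) > 0.2 for cur in (ref, shifted)]
drift_share = sum(drifted) / len(drifted)   # fraction of features that drifted
```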

12

Retraining Readiness

System architecture supports manual retraining workflows. Prediction logs and drift reports provide retraining signals.

13

CI/CD Automation

GitHub Actions executes S3 checks, model format validation, prediction quality tests, and Docker builds to ECR.

Engineered Features

Temporal, lag, and aggregation features for seasonality and trends

Temporal & Calendar

  • Year, Month, Day of Month
  • Day of Week, Week of Year
  • Is Weekend indicator

Captures weekly and seasonal patterns

Holiday Features

  • Thanksgiving, Christmas, New Year
  • Independence Day, Labor Day
  • Before/after holiday indicators

Explicit holiday demand uplift signals

Lag Features

  • Lag 1, 7, 14, 28 days
  • Short-term momentum
  • Weekly and monthly cycles

Auto-correlation and temporal dependency

Rolling Windows

  • 7-day & 30-day avg/std
  • Demand trend smoothing
  • Volatility quantification

Smoothed signals and uncertainty metrics

Store Aggregations

  • Mean sales by store
  • Mean sales by item
  • Category-level priors

Location and item baseline features

Train-Serve Parity

  • Spark ETL logic
  • Inference transformer matching
  • Graceful fallback handling

Prevents production feature skew

Tech Stack

Machine Learning
XGBoost scikit-learn Optuna MLflow
MLOps & Monitoring
MLflow Tracking Evidently Amazon S3
Data Engineering
Apache Spark (PySpark) pandas PyArrow boto3
Backend & API
FastAPI Pydantic Uvicorn
Cloud & Infrastructure
Amazon S3 Amazon ECR AWS EMR AWS IAM
DevOps & CI/CD
Docker Docker Compose GitHub Actions AWS CLI

Want to explore the full system?

The full infrastructure code, PySpark ETL, Docker configuration, and API implementation are available on GitHub.

View on GitHub