Machine Learning System

Sales Forecasting Platform

End-to-End MLOps System

Enterprise-grade retail demand forecasting platform — from raw transaction data to production API. PySpark on AWS EMR for 125M+ rows, XGBoost training, MLflow tracking, FastAPI serving, Evidently drift monitoring, and automated CI/CD to Amazon ECR.

GitHub Repository
Revenue Forecasting Monthly Sales Trend
125M+
Transaction Rows
PySpark
AWS EMR
250
Optuna Trials
FastAPI
Inference Service
Drift
Evidently Monitor
CI/CD
GitHub Actions
Problem

The Retail Forecasting Challenge

Inaccurate demand forecasts erode margins, increase costs, and degrade customer experience. Traditional forecasting fails to scale across thousands of SKUs and store locations.

Key business consequences:

  • Stockouts cause lost revenue and customer churn
  • Overstock ties up capital and creates waste
  • Poor promotion planning leads to inefficient spend
  • Inefficient staffing from disconnected demand signals
  • Reactive logistics instead of predictive replenishment

At scale: 125M+ rows of transaction data require distributed processing. Production data drift goes undetected without monitoring.

This system solves forecasting as a production ML engineering challenge. It demonstrates PySpark on AWS EMR, train-serve parity, S3 artifact management, FastAPI inference, Evidently drift monitoring, and CI/CD automation.

Project Overview

From Raw Data to Production API

An end-to-end ML forecasting service that transforms raw retail data into production-ready demand predictions.

Built for scalability and reliability: PySpark feature engineering, XGBoost training, S3 storage, FastAPI inference, and automated drift detection.

Python 3.11 PySpark XGBoost FastAPI MLflow Evidently Docker AWS S3 / EMR / ECR
End-to-End Workflow (10 Stages)
  1. Raw Data Ingestion → S3/Parquet historical sales
  2. Distributed Feature Engineering → PySpark on EMR
  3. Model Training → XGBoost with Optuna tuning
  4. Model Artifact Storage → S3 versioning
  5. Inference API → FastAPI REST service
  6. Online Feature Engineering → Train-serve parity
  7. Prediction Logging → S3 structured JSON
  8. Drift Monitoring → Evidently reports
  9. User Interface → Streamlit frontend
  10. CI/CD Quality Gates → GitHub Actions to ECR
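The ten stages above can be sketched as a plain-Python stage runner; the stage names and the shared-state dict are illustrative stand-ins, not the repository's actual code:

```python
# Minimal sketch of chaining pipeline stages as plain functions that
# enrich a shared state dict; stage bodies here are stubs.
from typing import Callable

def ingest(state: dict) -> dict:
    state["raw"] = ["sales.parquet"]              # stand-in for S3 reads
    return state

def engineer_features(state: dict) -> dict:
    state["features"] = len(state["raw"])         # stand-in for Spark ETL
    return state

def train(state: dict) -> dict:
    state["model"] = f"xgb-{state['features']}"   # stand-in for XGBoost training
    return state

PIPELINE: list[Callable[[dict], dict]] = [ingest, engineer_features, train]

def run(stages, state=None):
    state = state or {}
    for stage in stages:
        state = stage(state)                      # each stage passes state forward
    return state

result = run(PIPELINE)
```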
Business Impact

How This System Enables Stakeholder Decisions

Supporting operational decision-making across retail functions

📊
Retail Planners
What should we stock next week?
  • Store-item level demand predictions
  • Seasonal uplift quantification
  • Safety stock optimization signals
  • Slow-mover markdown triggers
🚚
Supply Chain
How do we optimize replenishment?
  • Cross-dock capacity planning
  • Vendor order sizing
  • Distribution center allocation
  • Forward demand sharing
⚙️
Operations
How do we staff optimally?
  • Labor scheduling from demand forecasts
  • Peak period staffing allocation
  • Equipment capacity planning
  • Shift optimization
🎯
Marketing
How do we plan promotions?
  • Demand uplift estimates
  • Markdown velocity forecasts
  • Campaign ROI budgeting
  • Promo-sensitive SKU targeting
ROI Analysis

ROI Simulation — Forecasting Impact

Economic rationale for a mid-sized retailer with 50 stores and 3,000 SKUs:

Baseline Problem Cost

Cost Component                       Annual Impact
Excess inventory holding costs       $2.5M
Stockout-related lost revenue        $1.8M
Waste and shrinkage from overstock   $1.2M
Expedited freight costs              $0.6M
Manual forecasting labor             $0.4M
Total Annual Cost                    $6.5M

Estimated Annual Savings

Savings Category                     Annual Benefit
Inventory cost reduction             $250K
Stockout revenue recovery            $450K
Waste reduction                      $144K
Freight cost avoidance               $120K
Analyst labor redeployment           $280K
Total Annual Benefit                 $1.24M

Payback period: ~3 months
Year 1 Net Benefit: $978K
3-Year NPV (10% discount rate): ~$2.8M
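The arithmetic above can be checked quickly. The one-time implementation cost is not stated, but the Year 1 figures imply roughly $262K ($1.24M - $978K); a back-of-envelope sketch under that assumption:

```python
# Back-of-envelope check of the ROI figures; the ~$262K one-time cost is
# inferred from Year 1 net benefit ($1.24M - $978K), not stated directly.
annual_benefit = 1.24e6
one_time_cost = annual_benefit - 0.978e6          # ~$262K implied

rate = 0.10  # discount rate from the 3-year NPV line
npv = sum(annual_benefit / (1 + rate) ** t for t in (1, 2, 3)) - one_time_cost
payback_months = 12 * one_time_cost / annual_benefit
```

This reproduces the stated figures: NPV comes out near $2.8M and payback under three months.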

System Architecture Overview

Cloud-native ML system with microservices architecture

Data Integration and Model Training Layer

MLOps Pipeline

ML Pipeline

13 sequential stages from data ingestion to deployment

01

Data Ingestion

Historical sales transactions, store metadata, and item attributes ingested from S3 and local storage in CSV/Parquet format.

02

Data Validation

CI/CD pipeline validates reference datasets for schema correctness, row volume sanity, and Parquet readability before deployment.
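A minimal sketch of those checks in plain Python, assuming an illustrative schema and row threshold (the real pipeline reads Parquet, e.g. via pyarrow, rather than in-memory rows):

```python
# Sketch of pre-deployment dataset checks: schema correctness and row
# volume sanity. EXPECTED_SCHEMA and MIN_ROWS are illustrative assumptions.
EXPECTED_SCHEMA = {"date": str, "store": int, "item": int, "sales": float}
MIN_ROWS = 3  # volume sanity threshold for this toy sample

def validate(rows: list[dict]) -> list[str]:
    errors = []
    if len(rows) < MIN_ROWS:
        errors.append(f"row volume {len(rows)} below minimum {MIN_ROWS}")
    for i, row in enumerate(rows):
        if set(row) != set(EXPECTED_SCHEMA):
            errors.append(f"row {i}: columns {sorted(row)} do not match schema")
            continue
        for col, typ in EXPECTED_SCHEMA.items():
            if not isinstance(row[col], typ):
                errors.append(f"row {i}: {col} is not {typ.__name__}")
    return errors

sample = [
    {"date": "2024-01-01", "store": 1, "item": 10, "sales": 12.0},
    {"date": "2024-01-02", "store": 1, "item": 10, "sales": 9.5},
    {"date": "2024-01-03", "store": 1, "item": 10, "sales": 11.0},
]
```

In CI these checks would run as a pytest job and fail the build on any non-empty error list.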

03

Spark Feature Engineering

PySpark on AWS EMR constructs temporal, lag, rolling window, and store-level aggregation features with anti-leakage constraints.
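A pandas sketch of the same lag/rolling logic (the production job runs in PySpark on EMR; column names and window sizes here are assumptions). Shifting by one day before rolling keeps every feature strictly historical, which is the anti-leakage constraint:

```python
# pandas sketch of leakage-safe lag and rolling-window features,
# computed per store-item group on daily sales.
import pandas as pd

df = pd.DataFrame({
    "store": [1] * 8,
    "item": [10] * 8,
    "sales": [5.0, 7.0, 6.0, 8.0, 9.0, 4.0, 6.0, 7.0],
})
grp = df.groupby(["store", "item"])["sales"]
df["lag_1"] = grp.shift(1)   # yesterday's sales
df["lag_7"] = grp.shift(7)   # same day last week
df["roll_7_mean"] = grp.transform(
    lambda s: s.shift(1).rolling(7, min_periods=1).mean()
)  # 7-day average excluding the current day, so today's target never leaks
```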

04

Model Training

XGBoost regression models trained with baseline and Optuna-tuned variants. Temporal train-validation split enforced.
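The temporal split can be sketched in a few lines; the cutoff date is an illustrative assumption. The invariant is that every validation date lies strictly after every training date, so the model is never evaluated on the past:

```python
# Sketch of a temporal train/validation split by date cutoff.
from datetime import date

rows = [(date(2024, 1, d), float(d)) for d in range(1, 11)]  # (day, sales)
CUTOFF = date(2024, 1, 8)

train = [(d, y) for d, y in rows if d <= CUTOFF]
valid = [(d, y) for d, y in rows if d > CUTOFF]

# the invariant a random split would violate:
assert max(d for d, _ in train) < min(d for d, _ in valid)
```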

05

Experiment Tracking

MLflow tracks training runs, hyperparameters, evaluation metrics (RMSE, MAE), and model artifacts for reproducibility.

06

Model Versioning

Trained models versioned via S3 key naming conventions with date-stamped filenames. Supports rollback and A/B testing.
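A minimal sketch of the key convention, with a hypothetical prefix; because filenames are date-stamped, lexicographic order is chronological order, and rollback is just pointing the API at the previous key:

```python
# Sketch of date-stamped S3 key naming for model versioning;
# the "models/xgb" prefix is an illustrative assumption.
from datetime import date

def model_key(train_date: date, prefix: str = "models/xgb") -> str:
    return f"{prefix}/sales_model_{train_date:%Y-%m-%d}.json"

keys = sorted(model_key(d) for d in [date(2024, 3, 1), date(2024, 2, 1)])
latest, previous = keys[-1], keys[-2]  # rollback: serve `previous` instead
```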

07

Model Packaging

XGBoost models serialized to JSON format for cross-language compatibility. CI/CD validates format integrity.

08

Model Deployment

Containerized deployment using Docker and Compose. GitHub Actions builds, tests, and pushes to Amazon ECR.

09

Inference Pipeline

FastAPI service loads model from S3. Includes request validation, online feature engineering, and graceful error handling.

10

Prediction Logging

Every prediction logged as structured JSON to date/hour partitioned S3 paths for audit trails and retraining data.
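A sketch of what one logged record and its date/hour partitioned key could look like; the field names and path prefix are assumptions:

```python
# Sketch of a structured prediction log record and its partitioned S3 key.
import json
from datetime import datetime, timezone

def log_record(store: int, item: int, prediction: float,
               ts: datetime) -> tuple[str, str]:
    # date/hour partitioned path keeps drift queries and audits cheap
    key = f"logs/predictions/{ts:%Y/%m/%d/%H}/{ts:%Y%m%dT%H%M%S}.json"
    body = json.dumps({
        "timestamp": ts.isoformat(),
        "store": store,
        "item": item,
        "prediction": prediction,
    })
    return key, body  # in production: boto3 put_object with this key and body

key, body = log_record(1, 10, 42.5,
                       datetime(2024, 3, 1, 14, 30, 5, tzinfo=timezone.utc))
```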

11

Drift Monitoring

Evidently compares production logs against reference datasets. Generates HTML reports with drift share alerts.
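The idea behind a drift-share alert can be sketched without Evidently: flag each feature whose distribution shifted (here via a Population Stability Index), then alert on the share of flagged features. The 0.2 threshold is a common rule of thumb, not Evidently's exact default:

```python
# Toy sketch of per-feature drift detection and a drift-share summary.
import math

def psi(reference: list[float], current: list[float], bins: int = 4) -> float:
    # bin edges come from the reference window only
    lo, hi = min(reference), max(reference)
    edges = [lo + (hi - lo) * i / bins for i in range(1, bins)]

    def frac(xs):
        counts = [0] * bins
        for x in xs:
            counts[sum(x > e for e in edges)] += 1
        return [max(c / len(xs), 1e-4) for c in counts]  # avoid log(0)

    return sum((c - r) * math.log(c / r)
               for r, c in zip(frac(reference), frac(current)))

ref = [float(i % 10) for i in range(100)]   # stable feature
shifted = [x + 6.0 for x in ref]            # drifted feature
drifted = [psi(ref, cur) > 0.2 for cur in (ref, shifted)]
drift_share = sum(drifted) / len(drifted)   # fraction of features that drifted
```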

12

Retraining Readiness

System architecture supports manual retraining workflows. Prediction logs and drift reports provide retraining signals.

13

CI/CD Automation

GitHub Actions executes S3 checks, model format validation, prediction quality tests, and Docker builds to ECR.

Engineered Features

Temporal, lag, and aggregation features for seasonality and trends

Temporal & Calendar

  • Year, Month, Day of Month
  • Day of Week, Week of Year
  • Is Weekend indicator

Captures weekly and seasonal patterns

Holiday Features

  • Thanksgiving, Christmas, New Year
  • Independence Day, Labor Day
  • Before/after holiday indicators

Explicit holiday demand uplift signals

Lag Features

  • Lag 1, 7, 14, 28 days
  • Short-term momentum
  • Weekly and monthly cycles

Auto-correlation and temporal dependency

Rolling Windows

  • 7-day & 30-day avg/std
  • Demand trend smoothing
  • Volatility quantification

Smoothed signals and uncertainty metrics

Store Aggregations

  • Mean sales by store
  • Mean sales by item
  • Category-level priors

Location and item baseline features

Train-Serve Parity

  • Spark ETL logic
  • Inference transformer matching
  • Graceful fallback handling

Prevents production feature skew

Tech Stack

Machine Learning
XGBoost scikit-learn Optuna MLflow
MLOps & Monitoring
MLflow Tracking Evidently Amazon S3
Data Engineering
Apache Spark (PySpark) pandas PyArrow boto3
Backend & API
FastAPI Pydantic Uvicorn
Cloud & Infrastructure
Amazon S3 Amazon ECR AWS EMR AWS IAM
DevOps & CI/CD
Docker Docker Compose GitHub Actions AWS CLI

Want to explore the full system?

The full infrastructure code, PySpark ETL, Docker configuration, and API implementation are available on GitHub.

View on GitHub