FunPact/DAY_14_Final_Production_Ready_System

This repository contains the final implementation for Day 14, Final Production-Ready System, of the Databricks 14 Days AI Challenge – 2 (Advanced). The objective of this stage was to combine the previously developed data pipeline and machine learning pipeline into a unified workflow that generates predictions.

The pipeline loads the processed Delta dataset, performs user-level feature engineering, generates purchase labels, and constructs the training dataset. A logistic regression model is then trained and evaluated, with AUC as the performance metric.
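As a rough illustration of what the AUC metric measures, the sketch below computes it in plain Python as the fraction of (purchaser, non-purchaser) pairs in which the purchaser receives the higher score. This is not the evaluator used in the pipeline; the function name `roc_auc` and the toy inputs are assumptions made for the example.

```python
def roc_auc(labels, scores):
    """Pairwise AUC: chance that a random positive outranks a random negative."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one example of each class")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Hypothetical labels (1 = purchased) and model scores for four users.
example_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

A value of 1.0 means every purchaser is ranked above every non-purchaser, while 0.5 corresponds to random ordering. The quadratic loop is only suitable for small samples.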

After validation, the model is saved to a Unity Catalog volume and logged with MLflow for experiment tracking. Batch inference then generates purchase probability predictions for all users. The results are written to a Gold Delta table, which makes it possible to identify high-probability buyers.

This stage demonstrates how the individual data engineering and ML components developed across the challenge can be assembled into a complete production-style predictive pipeline.

Output Analysis

To validate the final production pipeline, sample rows from the intermediate and output DataFrames were reviewed. The feature vector DataFrame confirms that the engineered behavioral features were combined into a single Spark ML vector using VectorAssembler. Each row contains the feature vector, the purchase label, and the corresponding user identifier.

The train and test DataFrames retain the same schema after the dataset split, so the model receives a consistent input structure during both training and evaluation. Both purchase and non-purchase labels are present in the datasets, so the model was trained and evaluated on examples of both classes.
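The check described above can be sketched in plain Python: split the rows at random, then confirm that each side still contains both labels. The 80/20 fraction, the seed, and the `user_id` and `label` field names are assumptions for illustration, not values taken from the notebook.

```python
import random
from collections import Counter


def random_split(rows, train_fraction=0.8, seed=42):
    """Assign each row to train or test independently (not stratified)."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test


def label_counts(rows):
    """Count how many rows carry each label value."""
    return Counter(row["label"] for row in rows)


def has_both_classes(rows):
    """True when at least one purchase (1) and one non-purchase (0) row exist."""
    counts = label_counts(rows)
    return counts[0] > 0 and counts[1] > 0


# Hypothetical labelled rows: every other user is a purchaser.
rows = [{"user_id": i, "label": i % 2} for i in range(100)]
train_rows, test_rows = random_split(rows)
```

Because a random split is not stratified, a rare class can vanish from the smaller side, so checking `has_both_classes` on both `train_rows` and `test_rows` before training is a cheap safeguard.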

The scoring DataFrame contains the full feature set for each user: total events, number of purchases, total spending, and average price. These variables are aggregated behavioral signals used to estimate purchase likelihood.
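A minimal plain-Python sketch of how such per-user aggregates could be derived from raw events follows. The event fields (`user_id`, `event_type`, `price`) and the exact definitions are assumptions: here total spending sums the price of purchase events only, and average price is the mean price over all of a user's events. The actual pipeline may define them differently.

```python
from collections import defaultdict


def build_user_features(events):
    """Aggregate raw events into one feature record per user."""
    totals = defaultdict(
        lambda: {"total_events": 0, "purchases": 0, "total_spend": 0.0, "price_sum": 0.0}
    )
    for event in events:
        agg = totals[event["user_id"]]
        agg["total_events"] += 1
        agg["price_sum"] += event["price"]
        if event["event_type"] == "purchase":  # assumed event type name
            agg["purchases"] += 1
            agg["total_spend"] += event["price"]

    features = {}
    for user_id, agg in totals.items():
        features[user_id] = {
            "total_events": agg["total_events"],
            "purchases": agg["purchases"],
            "total_spend": agg["total_spend"],
            "avg_price": agg["price_sum"] / agg["total_events"],
        }
    return features


# Hypothetical events for two users.
sample_events = [
    {"user_id": "u1", "event_type": "view", "price": 10.0},
    {"user_id": "u1", "event_type": "purchase", "price": 30.0},
    {"user_id": "u2", "event_type": "view", "price": 5.0},
]
user_features = build_user_features(sample_events)
```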

Finally, the prediction DataFrame includes additional columns generated by the trained model: rawPrediction, probability, and prediction. The probability vector holds the likelihood of each class, where the second value is the probability of purchase. These results were then transformed into a final output containing the user identifier, purchase probability, and predicted label.
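The relationship between the three model columns can be sketched in plain Python for a single user. The layout is modeled on what Spark ML produces for binary logistic regression (a two-element raw score, a two-element probability vector, and a label), with the 0.5 threshold assumed as the default; the helper names and output field names are hypothetical.

```python
import math


def binary_lr_outputs(margin, threshold=0.5):
    """Derive rawPrediction, probability, and prediction from a linear margin."""
    p_purchase = 1.0 / (1.0 + math.exp(-margin))  # sigmoid of the margin
    return {
        "rawPrediction": [-margin, margin],
        "probability": [1.0 - p_purchase, p_purchase],  # [no purchase, purchase]
        "prediction": 1.0 if p_purchase > threshold else 0.0,
    }


def to_output_row(user_id, outputs):
    """Keep only the fields described for the final output."""
    return {
        "user_id": user_id,
        "purchase_probability": outputs["probability"][1],  # second value
        "predicted_label": int(outputs["prediction"]),
    }


# Hypothetical margin for one user; positive margins favor the purchase class.
row = to_output_row("u1", binary_lr_outputs(2.0))
```

Taking index 1 of the probability vector is what turns the model output into a single purchase probability per user, which can then be sorted or thresholded downstream.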

The sample rows show that the model distinguishes users with strong purchasing behavior from those with minimal interaction signals. This confirms that the integrated pipeline, from feature engineering to batch inference, operates as expected and produces interpretable predictions suitable for downstream analytics.
