IsmaelMousa/khamsat-predictor

Khamsat Predictor

Using web scraping and machine learning techniques to predict service prices on https://khamsat.com.

Data-Driven Optimization of Pricing Strategies on Microservice Platforms: Insights From Khamsat

Overview

This project presents the Khamsat Predictor, a machine learning model aimed at addressing pricing challenges in the freelance marketplace. By analyzing data from https://khamsat.com, the largest platform for Arab freelancers, the model provides price estimates based on service characteristics. While the current version faces challenges such as data limitations and overfitting, it offers a foundation to improve transparency and help alleviate price-related concerns for both sellers and clients. The model employs classical machine learning techniques and follows structured methodologies for data collection, training, and deployment, with plans for future enhancements through better data acquisition and more advanced techniques.

Live.trial.for.khamsat.predictor.mp4

Objective

The goal is to develop an AI-based pricing model that predicts accurate prices for freelance services listed on Khamsat. By analyzing historical data from the platform, the model helps both sellers and clients make informed decisions.

Workflow

The project follows a structured strategy with five key phases:

  1. Data Collection: Data was scraped from Khamsat using Selenium to handle dynamic elements and navigate through menus.
  2. Exploratory Data Analysis (EDA): Key patterns were uncovered in the data, identifying correlations and resolving format inconsistencies.
  3. Data Preprocessing: Placeholder values were cleaned, and categorical features were encoded.
  4. Modeling: Various classical machine learning models (such as SoftMax Regression, Support Vector Classifier, and Random Forest Classifier) were trained and optimized.
  5. Deployment: The trained models were deployed using FastAPI, providing a user-friendly interface for price predictions.
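
Phase 3 can be illustrated with a minimal sketch, assuming that placeholder strings are treated as missing values and that categorical values are mapped to integers through a JSON lookup similar in spirit to the files under mappers/to_numeric. The placeholder strings, the mapping contents, and the function names below are hypothetical and are not taken from the repository.

```python
import json

# Hypothetical placeholder strings; the real dataset may use different ones.
PLACEHOLDERS = {"", "-", "N/A", "unknown"}

# Hypothetical mapping, shaped like a mappers/to_numeric/*.json file.
OWNER_LEVEL_JSON = '{"new seller": 0, "active seller": 1, "distinguished seller": 2}'


def clean_placeholder(value):
    """Return None for placeholder strings, otherwise the stripped value."""
    if value is None:
        return None
    stripped = value.strip()
    return None if stripped in PLACEHOLDERS else stripped


def encode(value, mapping, default=-1):
    """Map a cleaned categorical value to its integer code (default if unseen)."""
    cleaned = clean_placeholder(value)
    if cleaned is None:
        return default
    return mapping.get(cleaned.lower(), default)


owner_level = json.loads(OWNER_LEVEL_JSON)
rows = ["Active seller", " N/A ", "distinguished seller", "VIP"]
codes = [encode(r, owner_level) for r in rows]
print(codes)  # [1, -1, 2, -1]
```

Mapping unseen or missing values to a sentinel such as -1 is only one option; dropping or imputing those rows may suit the real data better.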

Note

You can find the exploratory data analysis (EDA), data preprocessing, and modeling phases in the notebooks directory.

Modules

This shows the project's skeleton.

khamsat-predictor
 ├── charts
 ├── data
 │   ├── raw
 │   ├── balanced.csv
 │   ├── clean.csv
 │   ├── raw.csv
 │   └── scraper.py
 ├── experiments 
 │   ├── Random Forest Classifier
 │   │   └── 0
 │   │       ├── balanced
 │   │       ├── imbalanced
 │   │       └── meta.yaml
 │   ├── SoftMax Regression
 │   │   └── 0
 │   │       ├── balanced
 │   │       ├── imbalanced
 │   │       └── meta.yaml
 │   └── SVC
 │       └── 0
 │           ├── balanced
 │           ├── imbalanced
 │           └── meta.yaml
 ├── mappers 
 │   ├── to_categorical
 │   │   └── feature_names.json
 │   ├── to_numeric
 │   │   ├── category_name.json
 │   │   ├── duration.json
 │   │   ├── offer_response_time.json
 │   │   ├── owner_level.json
 │   │   ├── owner_response_time.json
 │   │   └── service_name.json
 │   ├── __init__.py
 │   ├── features.py
 │   ├── load_and_map.py
 │   └── one_hot.py
 │   
 ├── models 
 │   ├── __init__.py
 │   └── offer.py
 ├── notebooks
 │   ├── eda.ipynb
 │   ├── preprocessing.ipynb
 │   └── modeling.ipynb
 ├── routers
 │   ├── __init__.py
 │   └── offer.py
 ├── views
 │   ├── images
 │   ├── index.html
 │   ├── index.js
 │   └── style.css     
 ├── .gitignore
 ├── LICENSE.md
 ├── main.py
 ├── paper.pdf
 ├── README.md
 ├── requirements.txt
 └── requirements-dev.txt

The table below summarizes the purpose of each major module or component in the project.

| Module | Purpose |
|---|---|
| data | Contains the scraper utility for extracting data and the files representing datasets used in the project. |
| experiments | Stores the results of experiments, including trained models, hyperparameter configurations, and the metrics associated with their performance. |
| mappers | Handles the transformation of text-based data into numerical formats, including custom encoding techniques for model compatibility. |
| models | Contains Pydantic models (schemas) used for data validation and serialization between different layers of the application. |
| notebooks | Includes Jupyter notebooks that illustrate the workflow across three phases: data exploration, preprocessing, and modeling. |
| routers | Manages API route definitions, linking frontend requests to backend functionality, including data processing and prediction endpoints. |
| views | Holds the frontend templates and static files that provide the visual interface for interacting with the application. |
| main.py | Serves as the project's entry point, initializing the application and orchestrating its components. |
| paper.pdf | Provides a detailed overview of the project, including its objectives, methodology, experiments, results, and conclusions, serving as the primary documentation. |
| requirements.txt | Lists the dependencies required to run the application. |
| requirements-dev.txt | Specifies additional dependencies for development purposes. |
| charts | Contains the images of all charts and plots. |

Technologies

Technologies and tools used.

Selenium FastAPI Scikit-Learn Pandas NumPy Matplotlib Seaborn MLflow Optuna

The table below summarizes the purpose of each tool used in the project.

| Dependency | Usage | Phase |
|---|---|---|
| selenium | Interacts with the Khamsat website and scrapes dynamic content. | Data Collection |
| mlflow | Manages the machine learning lifecycle (MLOps), including experiment tracking and model management. | Modeling |
| optuna | Tunes and optimizes model hyperparameters for better performance. | Modeling |
| scikit-learn | Provides the implementations of the models. | Modeling |
| pandas | Handles datasets, cleaning, and preprocessing. | EDA, Preprocessing |
| numpy | Supports array and mathematical operations in data processing. | Preprocessing |
| matplotlib | Creates plots and graphs for visualizing trends in the data. | EDA |
| seaborn | Produces more advanced and aesthetically pleasing plots. | EDA |
| arabic_reshaper | Reshapes Arabic text so that it displays correctly in plots and graphs. | EDA |
| python-bidi | Provides bidirectional text rendering, useful for displaying Arabic script. | EDA |
| fastapi | Builds the interface for price prediction, allowing users to interact with the model in real time. | Deployment |

Experiments

In this project, we adopt a multi-model approach, evaluating and optimizing several machine learning algorithms through hyperparameter tuning. The experiments evaluated model performance on both the original imbalanced dataset and a balanced version created by random oversampling. Both experimental setups were tracked with the MLflow library to support reproducibility and optimization of the model workflows.
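
Random oversampling can be sketched in a few lines of plain Python. This is an illustration of the general technique under simple assumptions (minority rows are duplicated at random until every class matches the largest one); it is not the project's own code, and the function name, class labels, and toy data are hypothetical.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority-class rows until all classes match the largest."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(group) for group in by_class.values())

    out_rows, out_labels = [], []
    for label, group in by_class.items():
        # Draw the missing rows with replacement from the same class.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for row in group + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


# Toy data: three hypothetical price classes with a 6/3/1 imbalance.
rows = [[i] for i in range(10)]
labels = ["low"] * 6 + ["mid"] * 3 + ["high"]
balanced_rows, balanced_labels = random_oversample(rows, labels)
print(Counter(labels))           # Counter({'low': 6, 'mid': 3, 'high': 1})
print(Counter(balanced_labels))  # every class now has 6 rows
```

Because the added rows are exact copies, the balanced dataset carries no new information; it only changes how often each class is seen during training.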

The Optuna library was used for hyperparameter optimization, running up to 50 trials per model to maximize the F1-score. Tables 1 and 2 summarize the hyperparameters tuned for each model.

Table 1: Hyperparameters tuned on the original clean imbalanced dataset.

| Hyperparameter | SoftMax Regression | SVC | Random Forest Classifier |
|---|---|---|---|
| Class Weight | balanced | balanced | balanced |
| Multi Class | multinomial | - | - |
| Solver | lbfgs | - | - |
| Kernel | - | linear | - |
| Gamma | - | 0.48 | - |
| C | 0.53 | 0.79 | - |
| Max Iterations | 6972 | 8170 | - |
| Decision Function | - | ovo | - |
| Estimators | - | - | 395 |
| Depth | - | - | 14 |
| Criterion | - | - | log loss |

Table 2: Hyperparameters tuned on the balanced dataset using random oversampling.

| Hyperparameter | SoftMax Regression | SVC | Random Forest Classifier |
|---|---|---|---|
| Class Weight | balanced | balanced | balanced |
| Multi Class | multinomial | - | - |
| Solver | lbfgs | - | - |
| Kernel | - | rbf | - |
| Gamma | - | 0.33 | - |
| C | 0.88 | 0.57 | - |
| Max Iterations | 7356 | 9189 | - |
| Decision Function | - | ovo | - |
| Estimators | - | - | 133 |
| Depth | - | - | 22 |
| Criterion | - | - | log loss |

Challenges

  1. Overfitting: The model exhibited signs of overfitting, where it performed well on the training data but struggled to generalize to unseen data. This was primarily due to the random oversampling technique used to address class imbalance. While oversampling can balance the dataset, it also introduces redundancy and can cause data leakage between the training and unseen data. This redundancy might lead the model to memorize specific patterns from the training set, which hampers its ability to generalize.

  2. Small and Imbalanced Data: A significant limitation was the small size of the dataset, which constrained the model's ability to learn meaningful patterns. Coupled with this, the data imbalance made it harder for the model to learn representations for the underrepresented classes, further compounding the overfitting issue. The imbalance, when combined with the small data size, made it difficult for the model to build a reliable generalization from the training data.

  3. Challenges in Data Acquisition: The ability to gather enough high-quality data is critical. Web scraping has been limited in our case, making it challenging to acquire a sufficient volume of diverse examples. This limitation is expected to persist, and will need to be addressed in the future by either collaborating with Khamsat to obtain more data or exploring data augmentation techniques, including synthetic data generation, to simulate additional varied examples that better represent the target distribution.
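
Returning to the overfitting challenge in item 1, the leakage risk depends on whether oversampling happens before or after the train/test split. The sketch below compares the two orderings on toy data by counting test rows that also appear verbatim in the training set. It illustrates the general effect and is not the project's pipeline; all names and the toy data are hypothetical.

```python
import random


def oversample(rows, seed=0):
    """Randomly duplicate minority-class rows until every class matches the largest."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(row[1], []).append(row)
    target = max(len(group) for group in by_class.values())
    out = []
    for group in by_class.values():
        out += group + [rng.choice(group) for _ in range(target - len(group))]
    return out


def split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and cut it into (train, test)."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]


def leaked(train, test):
    """Count test rows that also appear verbatim in the training set."""
    seen = set(train)
    return sum(1 for row in test if row in seen)


# Toy rows are (row_id, label) tuples: 40 majority rows and 8 minority rows.
data = [(i, "cheap") for i in range(40)] + [(i, "expensive") for i in range(40, 48)]

# Ordering A: oversample first, then split; duplicates can land on both sides.
train_a, test_a = split(oversample(data))

# Ordering B: split first, then oversample only the training part.
train_b, test_b = split(data)
train_b = oversample(train_b)

print("leaked test rows, oversample-then-split:", leaked(train_a, test_a))
print("leaked test rows, split-then-oversample:", leaked(train_b, test_b))  # 0
```

With ordering B the test set contains only rows the model has never seen, so its metrics are a more honest estimate of generalization.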

Results

The performance of the models is assessed using a set of standard evaluation metrics, including Log Loss, Accuracy, Precision, Recall, and the F1-score.

The results for each model, presented in Tables 3 and 4, highlight the trade-offs between these metrics under different conditions, offering insight into the strengths and weaknesses of each approach.

Table 3: Performance metrics on the imbalanced dataset.

| Model | Loss | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| SoftMax Regression | 1.62 | 40% | 17% | 23% | 17% |
| SVC | 1.18 | 42% | 16% | 18% | 16% |
| Random Forest Classifier | 1.28 | 56% | 22% | 25% | 22% |

Table 4: Performance metrics on the balanced dataset.

| Model | Loss | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| SoftMax Regression | 0.95 | 68% | 66% | 68% | 67% |
| SVC | 0.05 | 98% | 98% | 98% | 98% |
| Random Forest Classifier | 0.23 | 97% | 97% | 97% | 97% |
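
The gap between accuracy and F1 in Table 3 is what one would expect on imbalanced data if precision, recall, and F1 are macro-averaged, meaning each class counts equally regardless of its size. The averaging scheme is an assumption here, since it is not stated above. The sketch below computes these metrics from scratch on hypothetical toy labels to show how a model can reach high accuracy while scoring poorly on macro F1.

```python
def macro_scores(y_true, y_pred):
    """Return (accuracy, macro precision, macro recall, macro F1)."""
    classes = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        # Undefined ratios (no predictions or no true samples) are scored as 0.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(classes)
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    return accuracy, sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Toy example: 8 of 10 services are in class 0, and the model always predicts 0.
y_true = [0] * 8 + [1, 2]
y_pred = [0] * 10
accuracy, precision, recall, f1 = macro_scores(y_true, y_pred)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# accuracy=0.80 precision=0.27 recall=0.33 f1=0.30
```

Always predicting the majority class yields 80% accuracy here but a macro F1 of only about 0.30, because the two minority classes each contribute a score of zero.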

Usage

Follow the instructions below to use the Khamsat Predictor:

  1. Clone this repository to your local machine:
     git clone git@github.com:IsmaelMousa/khamsat-predictor.git
  2. Navigate to the khamsat-predictor directory:
     cd khamsat-predictor
  3. Set up a virtual environment:
     python3 -m venv .venv
  4. Activate the virtual environment:
     source .venv/bin/activate
  5. Install the required dependencies:
     pip install -r requirements.txt
  6. Run the server:
     uvicorn main:app --host localhost --port 8080
  7. Navigate to http://localhost:8080 and start using it.

Ownership

The data utilized in this project is the property of the https://khamsat.com platform, with all associated rights reserved to them. This project, however, is an independent endeavor and solely owned by me, with no affiliation to any institution or organization.
