Utilizing web scraping and machine learning techniques to accurately predict service prices.
Data-Driven Optimization of Pricing Strategies on Microservice Platforms: Insights From Khamsat
We present the Khamsat Predictor, a machine learning model aimed at addressing pricing challenges in the freelance marketplace. By analyzing data from Khamsat, the largest platform for Arab freelancers, the model provides price estimates based on service characteristics. While the current version faces challenges such as data limitations and overfitting, it offers a foundation for improving transparency and alleviating price-related concerns for both sellers and clients. The model employs classical machine learning techniques and follows structured methodologies for data collection, training, and deployment, with plans for future enhancements through better data acquisition and more advanced techniques.
(Demo video: `Live.trial.for.khamsat.predictor.mp4`)
The goal is to develop an AI-based pricing model that predicts accurate prices for freelance services listed on Khamsat. By analyzing historical data from the platform, the model helps both sellers and clients make informed decisions.
The project follows a structured strategy with five key phases:
- Data Collection: Data was scraped from Khamsat using Selenium to handle dynamic elements and navigate through menus.
- Exploratory Data Analysis (EDA): Key patterns were uncovered in the data, identifying correlations and resolving format inconsistencies.
- Data Preprocessing: Placeholder values were cleaned, and categorical features were encoded.
- Modeling: Various classical machine learning models (such as SoftMax Regression, Support Vector Classifier, and Random Forest Classifier) were trained and optimized.
- Deployment: The trained models were deployed using FastAPI, providing a user-friendly interface for price predictions.
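The preprocessing and modeling phases above can be sketched end to end on toy data. This is a minimal illustration, not the project's actual pipeline; the column names (`category_name`, `owner_level`, `price_range`) are invented for the example.

```python
# Minimal sketch of the preprocessing + modeling phases on toy data.
# Column names are illustrative, not the project's actual schema.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

df = pd.DataFrame({
    "category_name": ["design", "writing", "design", "programming"],
    "owner_level": ["new", "trusted", "pro", "trusted"],
    "price_range": [0, 1, 2, 1],  # target: encoded price bucket
})

# Encode categorical features as one-hot columns (preprocessing phase)
X = pd.get_dummies(df[["category_name", "owner_level"]])
y = df["price_range"]

# Train a Random Forest classifier (modeling phase)
model = RandomForestClassifier(n_estimators=10, random_state=42).fit(X, y)
preds = model.predict(X)
```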
> **Note:** You can find the exploratory data analysis (EDA), data preprocessing, and modeling phases in the notebooks.
This shows the project's skeleton.
```
khamsat-predictor
├── charts
├── data
│   ├── raw
│   ├── balanced.csv
│   ├── clean.csv
│   ├── raw.csv
│   └── scraper.py
├── experiments
│   ├── Random Forest Classifier
│   │   └── 0
│   │       ├── balanced
│   │       ├── imbalanced
│   │       └── meta.yaml
│   ├── SoftMax Regression
│   │   └── 0
│   │       ├── balanced
│   │       ├── imbalanced
│   │       └── meta.yaml
│   └── SVC
│       └── 0
│           ├── balanced
│           ├── imbalanced
│           └── meta.yaml
├── mappers
│   ├── to_categorical
│   │   └── feature_names.json
│   ├── to_numeric
│   │   ├── category_name.json
│   │   ├── duration.json
│   │   ├── offer_response_time.json
│   │   ├── owner_level.json
│   │   ├── owner_response_time.json
│   │   └── service_name.json
│   ├── __init__.py
│   ├── features.py
│   ├── load_and_map.py
│   └── one_hot.py
├── models
│   ├── __init__.py
│   └── offer.py
├── notebooks
│   ├── eda.ipynb
│   ├── preprocessing.ipynb
│   └── modeling.ipynb
├── routers
│   ├── __init__.py
│   └── offer.py
├── views
│   ├── images
│   ├── index.html
│   ├── index.js
│   └── style.css
├── .gitignore
├── LICENSE.md
├── main.py
├── paper.pdf
├── README.md
├── requirements.txt
└── requirements-dev.txt
```

Here is a summary of the purpose of each major module or component in the project.
| Module | Purpose |
|---|---|
| data | Contains the scraper utility for extracting data and the files representing the datasets used in the project. |
| experiments | Stores the results of experiments, including trained models, hyperparameter configurations, and the metrics associated with their performance. |
| mappers | Handles the transformation of text-based data into numerical formats, including custom encoding techniques for model compatibility. |
| models | Contains Pydantic models (schemas) used for data validation and serialization between different layers of the application. |
| notebooks | Includes Jupyter notebooks that illustrate the workflow across the three phases: data exploration, preprocessing, and modeling, offering a clear representation of the project's progression. |
| routers | Manages API route definitions, linking frontend requests to backend functionality, including data processing and prediction endpoints. |
| views | Responsible for rendering frontend templates and static files, providing the visual interface for interacting with the application. |
| main.py | Serves as the project's entry point, initializing the application and orchestrating its components. |
| paper.pdf | Provides a detailed overview of the project, including its objectives, methodology, experiments, results, and conclusions, serving as the primary documentation. |
| requirements.txt | Lists the dependencies required to run the application, ensuring that all necessary libraries and tools are installed. |
| requirements-dev.txt | Specifies additional dependencies for development purposes. |
| charts | Contains the images of all charts/plots. |
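As a sketch of what the mappers module does, a `to_numeric` mapper can be as simple as a JSON file that assigns each categorical value an integer code. The values below are invented for illustration; the real mapping files live under `mappers/to_numeric`.

```python
# Illustrative sketch of a to_numeric mapper: a JSON document maps each
# categorical value to an integer code. The codes are invented here; in
# the project they would be loaded from files such as owner_level.json.
import json

owner_level_map = json.loads('{"new": 0, "verified": 1, "trusted": 2}')

def to_numeric(values, mapping):
    """Replace each categorical value with its integer code."""
    return [mapping[v] for v in values]

codes = to_numeric(["trusted", "new", "verified"], owner_level_map)
print(codes)  # [2, 0, 1]
```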
Technologies and tools used.
Here is a summary of the purpose of each tool used in the project.
| Dependency | Usage | Phase |
|---|---|---|
| selenium | Used to interact with the Khamsat website and collect (scrape) dynamic content. | Data Collection |
| mlflow | Manages the machine learning lifecycle and operations (MLOps), including experiment tracking and model management. | Modeling |
| optuna | Used to tune and optimize model hyperparameters for better performance. | Modeling |
| scikit-learn | Provides the mathematical formulations and implementations of the models. | Modeling |
| pandas | Used for handling datasets, cleaning, and preprocessing the data. | EDA, Preprocessing |
| numpy | Used for working with arrays and mathematical operations in data processing. | Preprocessing |
| matplotlib | Creates plots and graphs for visualizing trends in the data. | EDA |
| seaborn | Used for more advanced and aesthetically pleasing plots. | EDA |
| arabic_reshaper | Reshapes Arabic text, ensuring that it displays correctly when visualized in plots or graphs. | EDA |
| python-bidi | Facilitates bidirectional text rendering, useful for displaying Arabic script. | EDA |
| fastapi | Used to build the interface for price prediction, allowing the model to interact with users in real time. | Deployment |
In this project, we adopt a multi-model approach, evaluating and optimizing several machine learning algorithms through hyperparameter tuning. The experiments aimed to evaluate model performance on both the original imbalanced dataset and a balanced version created by random oversampling. Both experimental setups were tracked with the MLflow library, ensuring reproducibility of the model workflows.
The Optuna library was used for hyperparameter optimization, conducting up to 50 trials per model to maximize the F1-score. Tables 1 and 2 summarize the hyperparameters tuned for each model.
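The tuning loop can be illustrated with a simple stand-in: sample hyperparameters for a fixed number of trials and keep the configuration with the best macro F1-score on a validation split. Optuna's samplers are smarter than the pure random search shown here, and the search ranges below are invented for the sketch; only the overall shape of the loop matches the project.

```python
# A minimal stand-in for the Optuna tuning loop: random search over a
# hypothetical Random Forest search space, maximizing macro F1.
import random
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, n_classes=3, n_informative=5,
                           random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

random.seed(0)
best_f1, best_params = -1.0, None
for trial in range(10):  # the project ran up to 50 trials per model
    params = {"n_estimators": random.randint(50, 400),
              "max_depth": random.randint(5, 25)}
    model = RandomForestClassifier(random_state=0, **params).fit(X_tr, y_tr)
    score = f1_score(y_val, model.predict(X_val), average="macro")
    if score > best_f1:
        best_f1, best_params = score, params

print(best_params)
```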
Table 1: Hyperparameters tuned on the original clean imbalanced dataset.
| Hyperparameter | SoftMax Regression | SVC | Random Forest Classifier |
|---|---|---|---|
| Class Weight | balanced | balanced | balanced |
| Multi Class | multinomial | - | - |
| Solver | lbfgs | - | - |
| Kernel | - | linear | - |
| Gamma | - | 0.48 | - |
| C | 0.53 | 0.79 | - |
| Max Iterations | 6972 | 8170 | - |
| Decision Function | - | ovo | - |
| Estimators | - | - | 395 |
| Depth | - | - | 14 |
| Criterion | - | - | log loss |
Table 2: Hyperparameters tuned on the balanced dataset using random oversampling.
| Hyperparameter | SoftMax Regression | SVC | Random Forest Classifier |
|---|---|---|---|
| Class Weight | balanced | balanced | balanced |
| Multi Class | multinomial | - | - |
| Solver | lbfgs | - | - |
| Kernel | - | rbf | - |
| Gamma | - | 0.33 | - |
| C | 0.88 | 0.57 | - |
| Max Iterations | 7356 | 9189 | - |
| Decision Function | - | ovo | - |
| Estimators | - | - | 133 |
| Depth | - | - | 22 |
| Criterion | - | - | log loss |
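To make the tables concrete, here is one way the Table 2 (balanced-dataset) configurations map onto scikit-learn constructors. This is a sketch, not the project's training code: with the lbfgs solver, `LogisticRegression` fits a multinomial (SoftMax) model for multiclass targets by default, "ovo" corresponds to `decision_function_shape="ovo"`, and "log loss" to `criterion="log_loss"`.

```python
# Sketch: Table 2 hyperparameters expressed as scikit-learn estimators.
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

softmax = LogisticRegression(solver="lbfgs", C=0.88, max_iter=7356,
                             class_weight="balanced")
svc = SVC(kernel="rbf", gamma=0.33, C=0.57, max_iter=9189,
          decision_function_shape="ovo", class_weight="balanced")
forest = RandomForestClassifier(n_estimators=133, max_depth=22,
                                criterion="log_loss",
                                class_weight="balanced")
```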
- Overfitting: The model exhibited signs of overfitting, where it performed well on the training data but struggled to generalize to unseen data. This was primarily due to the random oversampling technique used to address class imbalance. While oversampling can balance the dataset, it also introduces redundancy and can cause data leakage between the training and unseen data. This redundancy might lead the model to memorize specific patterns from the training set, which hampers its ability to generalize.
- Small and Imbalanced Data: A significant limitation was the small size of the dataset, which constrained the model's ability to learn meaningful patterns. The data imbalance compounded this, making it harder for the model to learn representations for the underrepresented classes and aggravating the overfitting issue. Together, the imbalance and the small data size made it difficult for the model to generalize reliably from the training data.
- Challenges in Data Acquisition: The ability to gather enough high-quality data is critical. Web scraping was limited in our case, making it challenging to acquire a sufficient volume of diverse examples. This limitation is expected to persist and will need to be addressed in the future, either by collaborating with Khamsat to obtain more data or by exploring data augmentation techniques, including synthetic data generation, to simulate additional varied examples that better represent the target distribution.
The performance of the models is assessed using a set of standard evaluation metrics, including Log Loss, Accuracy, Precision, Recall, and the F1-score.
The results for each model, presented in Tables 3 and 4, highlight the trade-offs between these metrics under different conditions, offering insight into the strengths and weaknesses of each approach.
Table 3: Performance metrics on the imbalanced dataset.
| Model | Loss | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| SoftMax Regression | 1.62 | 40% | 17% | 23% | 17% |
| SVC | 1.18 | 42% | 16% | 18% | 16% |
| Random Forest Classifier | 1.28 | 56% | 22% | 25% | 22% |
Table 4: Performance metrics on the balanced dataset.
| Model | Loss | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| SoftMax Regression | 0.95 | 68% | 66% | 68% | 67% |
| SVC | 0.05 | 98% | 98% | 98% | 98% |
| Random Forest Classifier | 0.23 | 97% | 97% | 97% | 97% |
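All five metrics can be computed with scikit-learn. The tables above do not state which averaging strategy was used for the per-class metrics; macro averaging is shown here as one common choice, on invented toy labels and probabilities.

```python
# Sketch: computing the reported metrics with scikit-learn on toy data.
from sklearn.metrics import (accuracy_score, f1_score, log_loss,
                             precision_score, recall_score)

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
y_prob = [[0.8, 0.1, 0.1], [0.3, 0.5, 0.2], [0.2, 0.7, 0.1],
          [0.1, 0.8, 0.1], [0.1, 0.2, 0.7], [0.5, 0.2, 0.3]]

acc = accuracy_score(y_true, y_pred)                      # 4/6 correct
prec = precision_score(y_true, y_pred, average="macro")
rec = recall_score(y_true, y_pred, average="macro")
f1 = f1_score(y_true, y_pred, average="macro")
loss = log_loss(y_true, y_prob)  # uses predicted probabilities

print(round(acc, 2))  # 0.67
```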
- Clone this repository to your local machine:

```
git clone git@github.com:IsmaelMousa/khamsat-predictor.git
```

- Navigate to the khamsat-predictor directory:

```
cd khamsat-predictor
```

- Set up a virtual environment:

```
python3 -m venv .venv
```

- Activate the virtual environment:

```
source .venv/bin/activate
```

- Install the required dependencies:

```
pip install -r requirements.txt
```

- Run the server program:

```
uvicorn main:app --host localhost --port 8080
```

- Navigate to http://localhost:8080, and start using it.
The data utilized in this project is the property of the Khamsat platform, with all associated rights reserved to them. This project, however, is an independent endeavor solely owned by me, with no affiliation to any institution or organization.