This project analyzes fraudulent transactions in a large-scale financial transaction dataset using PySpark.
The solution focuses on data transformation, aggregation, fraud pattern analysis, date and time operations, and distributed data processing techniques in Apache Spark.
Key features:
- Distributed transaction analysis using PySpark
- Large-scale CSV data processing
- Descriptive statistics for transaction data
- Fraud transaction pattern analysis
- Gender and category-based transaction analysis
- Time-based fraud activity analysis
- Customer and geographic aggregation
- Distance calculations using geolocation data
- Age-based transaction filtering
- Data transformation and preprocessing
The project calculates:
- Minimum transaction values
- Maximum transaction values
- Average transaction values
- Summary statistics for numerical columns
The analysis includes:
- Transaction counts by gender
- Average transaction amount by gender
- Fraud rate comparison between genders
The project analyzes:
- Transaction counts by category
- Average transaction amount by category
- Fraud distribution across categories
The solution identifies:
- Cities with the highest number of female customers
- States with the lowest number of male customers
- Customers living closest to Warsaw using latitude and longitude calculations
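One common way to measure distance from latitude and longitude is the haversine (great-circle) formula. The sketch below is plain Python and assumes Warsaw's coordinates as roughly 52.2297 N, 21.0122 E and a mean Earth radius of 6371 km. The notebook may use a different formula or express it with Spark column functions.

```python
import math

# Approximate coordinates of Warsaw (assumed reference point).
WARSAW_LAT, WARSAW_LON = 52.2297, 21.0122
EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def distance_to_warsaw_km(lat: float, lon: float) -> float:
    """Distance from a customer's location to Warsaw."""
    return haversine_km(lat, lon, WARSAW_LAT, WARSAW_LON)


# Hypothetical check: Krakow is about 250 km from Warsaw.
krakow_km = distance_to_warsaw_km(50.0647, 19.9450)
```

Sorting customers by this distance in ascending order and taking the first few rows would give those living closest to Warsaw.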
The project investigates:
- Fraud occurrence by time of day
- Fraud activity during morning, afternoon, evening, and night periods
- Time span between the first and last fraud event
The analysis includes:
- Underage customer transaction detection
- Job-based transaction deviation analysis
- Customer-level aggregation using distinct credit card identifiers
The project uses the following Kaggle dataset:
Credit Card Transactions Fraud Detection Dataset
https://www.kaggle.com/datasets/kartik2112/fraud-detection
Required file:
- fraudTrain.csv.gz
The dataset should be uploaded to the Colab session storage before running the notebook.
The dataset is not included in this repository because of its size.
Technologies and concepts:
- PySpark
- Apache Spark
- Python
- Google Colab
- Distributed Data Processing
- Big Data Analytics
- Data Transformation
- CSV Processing
- Aggregation Functions
- Window and Time Operations
The goal of this project is to demonstrate practical skills in distributed data processing, large-scale transaction analysis, and fraud-oriented analytical workflows using PySpark.
The solution demonstrates:
- Distributed CSV processing
- Large-scale transaction aggregation
- Fraud pattern analysis
- Geographic and demographic analysis
- Time-based transaction analysis
- PySpark transformation workflows
- Data preprocessing and analytical reporting
- Practical usage of Spark DataFrames
Author: Paulina Broda