PySpark Fraud Detection Analysis

This project analyzes fraudulent transactions in a large-scale financial transaction dataset using distributed processing in PySpark.

The solution focuses on data transformation, aggregation, fraud pattern analysis, date and time operations, and distributed data processing techniques in Apache Spark.

Project Overview

Distributed transaction analysis using PySpark
Large-scale CSV data processing
Descriptive statistics for transaction data
Fraud transaction pattern analysis
Gender and category-based transaction analysis
Time-based fraud activity analysis
Customer and geographic aggregation
Distance calculations using geolocation data
Age-based transaction filtering
Data transformation and preprocessing

Analytical Tasks

Descriptive Statistics

The project calculates:

Minimum transaction values
Maximum transaction values
Average transaction values
Summary statistics for numerical columns
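
As a rough illustration of the figures listed above, the following plain-Python sketch computes the same five values that Spark's `DataFrame.describe()` reports for a numeric column (count, mean, sample standard deviation, min, max). Whether the notebook uses `describe()` or explicit aggregations is not stated here, and the helper name `describe_values` and the sample amounts are hypothetical.

```python
import statistics


def describe_values(values):
    """Summarise a numeric column: count, mean, sample stddev, min, max."""
    values = list(values)
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        # Sample standard deviation needs at least two observations.
        "stddev": statistics.stdev(values) if len(values) > 1 else float("nan"),
        "min": min(values),
        "max": max(values),
    }


# Hypothetical transaction amounts, for illustration only.
summary = describe_values([10.0, 20.0, 30.0, 40.0])
print(summary)
```

On a Spark DataFrame the per-column work is distributed across partitions, but the quantities themselves are defined the same way.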

Gender-Based Analysis

The analysis includes:

Transaction counts by gender
Average transaction amount by gender
Fraud rate comparison between genders
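
The group-level comparison above can be sketched in plain Python as follows. The column names `gender`, `amt`, and `is_fraud` are assumptions about the dataset schema, and `fraud_rate_by_group` is a hypothetical helper, not a function from the notebook. In PySpark the same result would typically come from a `groupBy` on the key column followed by count and average aggregations.

```python
from collections import defaultdict


def fraud_rate_by_group(rows, key="gender", label="is_fraud"):
    """Return {group: (transaction_count, fraud_rate)} for dict-like rows."""
    counts = defaultdict(int)
    frauds = defaultdict(int)
    for row in rows:
        counts[row[key]] += 1
        frauds[row[key]] += int(row[label])  # label assumed to be 0 or 1
    return {group: (counts[group], frauds[group] / counts[group]) for group in counts}


# Hypothetical rows, for illustration only.
sample = [
    {"gender": "F", "amt": 10.0, "is_fraud": 0},
    {"gender": "F", "amt": 250.0, "is_fraud": 1},
    {"gender": "M", "amt": 40.0, "is_fraud": 0},
    {"gender": "M", "amt": 15.0, "is_fraud": 0},
]
rates = fraud_rate_by_group(sample)
print(rates)
```

Passing `key="category"` applies the same computation to merchant categories.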

Category Analysis

The project analyzes:

Transaction counts by category
Average transaction amount by category
Fraud distribution across categories

Geographic Analysis

The solution identifies:

Cities with the highest number of female customers
States with the lowest number of male customers
Customers living closest to Warsaw using latitude and longitude calculations
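
One common way to measure distance from latitude and longitude is the haversine (great-circle) formula, sketched below in plain Python. This is an assumption about the approach, not a statement of what the notebook does. The Warsaw coordinates are approximate, and the `lat` and `long` keys are assumed column names.

```python
import math

WARSAW_LAT, WARSAW_LON = 52.2297, 21.0122  # approximate city-centre coordinates
EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot before asin.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def closest_to_warsaw(customers, n=3):
    """Return the n customers with the smallest distance to Warsaw."""
    return sorted(
        customers,
        key=lambda c: haversine_km(c["lat"], c["long"], WARSAW_LAT, WARSAW_LON),
    )[:n]


# Hypothetical customers, for illustration only.
customers = [
    {"name": "A", "lat": 41.8781, "long": -87.6298},  # Chicago
    {"name": "B", "lat": 40.7128, "long": -74.0060},  # New York
]
nearest = closest_to_warsaw(customers, n=1)
print(nearest[0]["name"])
```

In PySpark the same expression can be built column-wise from `radians`, `sin`, `cos`, `asin`, and `sqrt` in `pyspark.sql.functions`, then sorted ascending.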

Fraud Time Analysis

The project investigates:

Fraud occurrence by time of day
Fraud activity during morning, afternoon, evening, and night periods
Time span between the first and last fraud event
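
A minimal sketch of the time-based steps above, in plain Python. The period boundaries (morning 06:00 to 11:59, afternoon 12:00 to 17:59, evening 18:00 to 21:59, night otherwise) are an assumption, as are the column names `trans_date_trans_time` and `is_fraud` and the timestamp format. The helper names are hypothetical.

```python
from collections import Counter
from datetime import datetime

TS_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp layout


def part_of_day(hour):
    """Map an hour (0-23) to a named period; boundaries are an assumption."""
    if 6 <= hour < 12:
        return "morning"
    if 12 <= hour < 18:
        return "afternoon"
    if 18 <= hour < 22:
        return "evening"
    return "night"


def fraud_time_summary(rows):
    """Count fraud events per period and return the first-to-last time span."""
    times = [
        datetime.strptime(r["trans_date_trans_time"], TS_FORMAT)
        for r in rows
        if int(r["is_fraud"]) == 1
    ]
    by_period = Counter(part_of_day(t.hour) for t in times)
    span = max(times) - min(times) if times else None
    return by_period, span


# Hypothetical rows, for illustration only.
sample = [
    {"trans_date_trans_time": "2019-01-02 01:06:37", "is_fraud": 1},
    {"trans_date_trans_time": "2019-01-02 13:30:00", "is_fraud": 0},
    {"trans_date_trans_time": "2019-01-03 23:15:00", "is_fraud": 1},
    {"trans_date_trans_time": "2019-01-05 09:00:00", "is_fraud": 1},
]
by_period, span = fraud_time_summary(sample)
print(dict(by_period), span)
```

In PySpark, `pyspark.sql.functions.hour` extracts the hour from a timestamp column, and the span follows from the minimum and maximum timestamps of the fraud rows.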

Customer and Occupation Analysis

The analysis includes:

Underage customer transaction detection
Job-based transaction deviation analysis
Customer-level aggregation using distinct credit card identifiers
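
The underage check can be sketched as an age calculation at transaction time, collecting distinct card numbers. The threshold of 18 years and the column names `cc_num`, `dob`, and `trans_date_trans_time` are assumptions, and the helper names are hypothetical.

```python
from datetime import date


def age_on(dob, on_date):
    """Whole years between a date of birth and a reference date."""
    years = on_date.year - dob.year
    # Subtract one year if the birthday has not yet occurred in on_date's year.
    if (on_date.month, on_date.day) < (dob.month, dob.day):
        years -= 1
    return years


def underage_cards(rows, min_age=18):
    """Distinct card numbers with at least one transaction made under min_age."""
    return {
        r["cc_num"]
        for r in rows
        if age_on(
            date.fromisoformat(r["dob"]),
            # Keep only the YYYY-MM-DD part of the transaction timestamp.
            date.fromisoformat(r["trans_date_trans_time"][:10]),
        ) < min_age
    }


# Hypothetical rows, for illustration only.
sample = [
    {"cc_num": "1111", "dob": "2005-06-15", "trans_date_trans_time": "2020-01-10 08:00:00"},
    {"cc_num": "1111", "dob": "2005-06-15", "trans_date_trans_time": "2020-02-11 09:30:00"},
    {"cc_num": "2222", "dob": "1980-03-01", "trans_date_trans_time": "2020-01-10 10:00:00"},
]
flagged = underage_cards(sample)
print(flagged)
```

Collecting into a set mirrors the customer-level view: a card appears once no matter how many of its transactions match.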

Dataset

The project uses the following Kaggle dataset:

Credit Card Fraud Detection Dataset

https://www.kaggle.com/datasets/kartik2112/fraud-detection

Required file:

fraudTrain.csv.gz

Upload this file to the Colab session storage before running the notebook.

The dataset is not included in this repository because of its size.

Technologies

PySpark
Apache Spark
Python
Google Colab
Distributed Data Processing
Big Data Analytics
Data Transformation
CSV Processing
Aggregation Functions
Window and Time Operations

Goal

The goal of this project is to demonstrate practical skills in distributed data processing, large-scale transaction analysis, and fraud-oriented analytical workflows using PySpark.

Results

The solution demonstrates:

Distributed CSV processing
Large-scale transaction aggregation
Fraud pattern analysis
Geographic and demographic analysis
Time-based transaction analysis
PySpark transformation workflows
Data preprocessing and analytical reporting
Practical usage of Spark DataFrames

Author

Paulina Broda
