polabroda/PySpark_fraud_detection_analysis

PySpark Fraud Detection Analysis

This project analyzes fraudulent transactions in a large-scale financial transaction dataset using distributed processing with PySpark.

The solution focuses on data transformation, aggregation, fraud pattern analysis, date and time operations, and distributed data processing techniques in Apache Spark.


Project Overview

  • Distributed transaction analysis using PySpark
  • Large-scale CSV data processing
  • Descriptive statistics for transaction data
  • Fraud transaction pattern analysis
  • Gender and category-based transaction analysis
  • Time-based fraud activity analysis
  • Customer and geographic aggregation
  • Distance calculations using geolocation data
  • Age-based transaction filtering
  • Data transformation and preprocessing

Analytical Tasks

Descriptive Statistics

The project calculates:

  • Minimum transaction values
  • Maximum transaction values
  • Average transaction values
  • Summary statistics for numerical columns

Gender-Based Analysis

The analysis includes:

  • Transaction counts by gender
  • Average transaction amount by gender
  • Fraud rate comparison between genders

Category Analysis

The project analyzes:

  • Transaction counts by category
  • Average transaction amount by category
  • Fraud distribution across categories
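
The gender and category breakdowns above share one shape: group by a column, then compute a count, an average amount, and a fraud rate. The following is a minimal sketch of that pattern, not the notebook's actual code. It assumes the column names `amt`, `is_fraud`, `gender`, and `category` from the Kaggle file, and the helper names `load_transactions` and `summarize_by` are hypothetical.

```python
def load_transactions(spark, path="fraudTrain.csv.gz"):
    """Read the gzipped CSV into a DataFrame; Spark decompresses .gz files on read."""
    return spark.read.csv(path, header=True, inferSchema=True)


def summarize_by(df, key):
    """Count, average amount, and fraud rate for each value of `key`."""
    # Imported lazily so this module can be loaded without Spark installed.
    from pyspark.sql import functions as F

    return (
        df.groupBy(key)
        .agg(
            F.count("*").alias("n_transactions"),
            F.avg("amt").alias("avg_amount"),
            # Assumption: is_fraud is a 0/1 flag, so its mean is the fraud rate.
            F.avg(F.col("is_fraud").cast("double")).alias("fraud_rate"),
        )
        .orderBy(F.desc("n_transactions"))
    )
```

With an active `SparkSession` named `spark`, a call such as `summarize_by(load_transactions(spark), "gender").show()` would print one row per gender, and passing `"category"` instead gives the category view.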

Geographic Analysis

The solution identifies:

  • Cities with the highest number of female customers
  • States with the lowest number of male customers
  • Customers living closest to Warsaw using latitude and longitude calculations
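
One common way to rank customers by distance to Warsaw from latitude and longitude is the haversine (great-circle) formula. The sketch below shows the formula in plain Python and the same expression built from Spark column functions. The README does not state which distance formula the notebook uses, so treat this as an illustration only. The Warsaw coordinates are approximate, the column names `lat`, `long`, and `cc_num` are assumed from the Kaggle file, and `closest_to_warsaw` is a hypothetical helper.

```python
import math

WARSAW_LAT, WARSAW_LON = 52.2297, 21.0122  # approximate city-centre coordinates
EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def closest_to_warsaw(df, n=10, lat_col="lat", lon_col="long"):
    """Return the n customers (one row per cc_num) nearest to Warsaw."""
    # Imported lazily so haversine_km can be used without Spark installed.
    from pyspark.sql import functions as F

    lat1 = F.radians(F.col(lat_col))
    lat2 = F.lit(math.radians(WARSAW_LAT))
    dlat = lat2 - lat1
    dlon = F.lit(math.radians(WARSAW_LON)) - F.radians(F.col(lon_col))
    a = F.pow(F.sin(dlat / 2), 2) + F.cos(lat1) * F.cos(lat2) * F.pow(F.sin(dlon / 2), 2)
    dist = F.asin(F.sqrt(a)) * (2 * EARTH_RADIUS_KM)
    return (
        df.select("cc_num", lat_col, lon_col)
        .dropDuplicates(["cc_num"])  # assumption: one location per card number
        .withColumn("dist_to_warsaw_km", dist)
        .orderBy("dist_to_warsaw_km")
        .limit(n)
    )
```

Building the distance from column functions keeps the computation inside Spark's execution engine, which avoids the serialization overhead of a Python UDF.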

Fraud Time Analysis

The project investigates:

  • Fraud occurrence by time of day
  • Fraud activity during morning, afternoon, evening, and night periods
  • Time span between the first and last fraud event
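
The time-of-day analysis above can be sketched as a bucketing step followed by a grouped count. The hour boundaries below (morning 6-12, afternoon 12-18, evening 18-22, night otherwise) are an assumption, since the README does not define them. The column names `trans_date_trans_time` and `is_fraud` are taken from the Kaggle file, and the function names are hypothetical.

```python
def time_of_day(hour):
    """Map an hour (0-23) to a period; the boundaries are an assumption."""
    if 6 <= hour < 12:
        return "morning"
    if 12 <= hour < 18:
        return "afternoon"
    if 18 <= hour < 22:
        return "evening"
    return "night"


def fraud_by_time_of_day(df, ts_col="trans_date_trans_time"):
    """Count fraudulent transactions per period, using the same boundaries."""
    # Imported lazily so time_of_day can be used without Spark installed.
    from pyspark.sql import functions as F

    hour = F.hour(F.to_timestamp(F.col(ts_col)))
    period = (
        F.when((hour >= 6) & (hour < 12), "morning")
        .when((hour >= 12) & (hour < 18), "afternoon")
        .when((hour >= 18) & (hour < 22), "evening")
        .otherwise("night")
    )
    return (
        df.filter(F.col("is_fraud") == 1)
        .withColumn("period", period)
        .groupBy("period")
        .count()
    )


def fraud_time_span(df, ts_col="trans_date_trans_time"):
    """First and last fraud timestamps and the number of days between them."""
    from pyspark.sql import functions as F

    ts = F.to_timestamp(F.col(ts_col))
    return df.filter(F.col("is_fraud") == 1).agg(
        F.min(ts).alias("first_fraud"),
        F.max(ts).alias("last_fraud"),
        F.datediff(F.max(ts), F.min(ts)).alias("span_days"),
    )
```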

Customer and Occupation Analysis

The analysis includes:

  • Underage customer transaction detection
  • Job-based transaction deviation analysis
  • Customer-level aggregation using distinct credit card identifiers
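
Underage transaction detection comes down to computing the customer's age on the transaction date and filtering on a threshold. The sketch below assumes a threshold of 18 years and the column names `dob` and `trans_date_trans_time` from the Kaggle file. Neither the threshold nor the helper names come from the notebook.

```python
from datetime import date

ADULT_AGE = 18  # assumed threshold; the README does not state one


def age_on(dob, on_date):
    """Age in whole years on a given date."""
    years = on_date.year - dob.year
    # Subtract one year if the birthday has not yet occurred in on_date's year.
    if (on_date.month, on_date.day) < (dob.month, dob.day):
        years -= 1
    return years


def underage_transactions(df, dob_col="dob", ts_col="trans_date_trans_time"):
    """Keep transactions made by customers younger than ADULT_AGE at the time."""
    # Imported lazily so age_on can be used without Spark installed.
    from pyspark.sql import functions as F

    trans_date = F.to_date(F.col(ts_col))
    birth_date = F.to_date(F.col(dob_col))
    age = F.floor(F.months_between(trans_date, birth_date) / 12)
    return (
        df.withColumn("age_at_transaction", age)
        .filter(F.col("age_at_transaction") < ADULT_AGE)
    )
```

Computing age from `months_between` rather than from a year difference avoids counting a customer as 18 before their birthday has passed.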

Dataset

The project uses the following Kaggle dataset:

Credit Card Fraud Detection Dataset

https://www.kaggle.com/datasets/kartik2112/fraud-detection

Required file:

  • fraudTrain.csv.gz

The dataset should be uploaded to the Colab session storage before running the notebook.

The dataset is not included in this repository because of its size.


Technologies

  • PySpark
  • Apache Spark
  • Python
  • Google Colab
  • Distributed Data Processing
  • Big Data Analytics
  • Data Transformation
  • CSV Processing
  • Aggregation Functions
  • Window and Time Operations

Goal

The goal of this project is to demonstrate practical skills in distributed data processing, large-scale transaction analysis, and fraud-oriented analytical workflows using PySpark.


Results

The solution successfully demonstrates:

  • Distributed CSV processing
  • Large-scale transaction aggregation
  • Fraud pattern analysis
  • Geographic and demographic analysis
  • Time-based transaction analysis
  • PySpark transformation workflows
  • Data preprocessing and analytical reporting
  • Practical usage of Spark DataFrames

Author

Paulina Broda

About

PySpark project focused on distributed fraud transaction analysis, aggregations, and large-scale data processing.
