Skip to content
This repository was archived by the owner on Dec 16, 2025. It is now read-only.
This repository was archived by the owner on Dec 16, 2025. It is now read-only.

UTS #944

@salsaovn4-ux

Description

@salsaovn4-ux

Pastikan jalankan di Google Colab

1) Upload file dataset CSV lewat sidebar Colab (atau mount Google Drive)

2) Ganti 'amazon_sales.csv' dengan nama file kamu jika berbeda

import pandas as pd
import numpy as np
import random

-------------------------

0. Load dataset

-------------------------

Jika kamu upload lewat Colab UI, gunakan:

from google.colab import files

uploaded = files.upload()

filename = list(uploaded.keys())[0]

df = pd.read_csv(filename, low_memory=False)

Jika file ada di drive: df = pd.read_csv('/content/drive/MyDrive/amazon_sales.csv', low_memory=False)

df = pd.read_csv('amazon_sales.csv', low_memory=False) # ganti nama jika perlu
pd.set_option('display.max_columns', 50)

Jika kolom punya nama berbeda, sesuaikan contoh:

df = df.rename(columns={'price': 'actual_price', 'discount': 'discount_percentage', ...})

Periksa 10 baris pertama

print("Preview dataset:")
display(df.head())

-------------------------

1. Standardisasi kolom penting (pastikan kolom ada)

-------------------------

expected_cols = {
'product_name': ['product_name', 'productTitle', 'title', 'product'],
'category': ['category', 'product_category', 'categories'],
'actual_price': ['actual_price', 'price', 'actualPrice', 'list_price'],
'discount_percentage': ['discount_percentage', 'discount', 'discount_percent'],
'sales': ['sales', 'total_sales', 'units_sold', 'sales_amount'],
'rating': ['rating', 'average_rating', 'stars'],
'rating_count': ['rating_count', 'ratings_count', 'review_count']
}

helper untuk cari kolom yang cocok

def find_col(candidates, df):
for c in candidates:
if c in df.columns:
return c
return None

col_map = {}
for key, cand in expected_cols.items():
found = find_col(cand, df)
col_map[key] = found

print("Mapping kolom yang digunakan:", col_map)

Jika kolom penting tidak ditemukan, beri peringatan

missing = [k for k,v in col_map.items() if v is None]
if missing:
print("Peringatan: kolom berikut tidak ditemukan otomatis:", missing)
print("Silakan cek nama kolom di dataset dan ubah mapping di kode jika perlu.")

Buat kolom standar pada dataframe untuk dipakai selanjutnya (jika ada)

for std, col in col_map.items():
if col is not None:
df[std] = df[col]
else:
# jika tidak ada kolom, isi NaN atau nilai default
df[std] = np.nan

-------------------------

2. Preprocessing dasar: bersihkan dan konversi tipe

-------------------------

Hapus spasi berlebih di nama produk dan kategori

df['product_name'] = df['product_name'].astype(str).str.strip()
df['category'] = df['category'].astype(str).str.strip()

Konversi numeric (hapus simbol jika perlu)

def to_numeric_clean(series):
return pd.to_numeric(series.astype(str).str.replace(r'[^\d\.\-]', '', regex=True), errors='coerce')

df['actual_price'] = to_numeric_clean(df['actual_price'])
df['discount_percentage'] = to_numeric_clean(df['discount_percentage'])
df['sales'] = to_numeric_clean(df['sales'])
df['rating'] = to_numeric_clean(df['rating'])
df['rating_count'] = to_numeric_clean(df['rating_count'])

Beberapa dataset menyimpan discount sebagai fraction (0-1) atau percent (0-100)

Standardisasi: kita asumsikan jika nilai max > 1 then it's percent; else fraction.

if df['discount_percentage'].max(skipna=True) is not None:
max_disc = df['discount_percentage'].max(skipna=True)
if max_disc <= 1:
# fraction -> convert to percent
df['discount_percentage_percent'] = df['discount_percentage'] * 100
else:
df['discount_percentage_percent'] = df['discount_percentage']
else:
df['discount_percentage_percent'] = np.nan

-------------------------

3. (c) actual_price_rupiah = actual_price * 185

-------------------------

FX = 185 # kurs sesuai soal
df['actual_price_rupiah'] = df['actual_price'] * FX

-------------------------

4. (d) discount_price_rupiah = actual_price_rupiah * (discount_percentage_percent / 100)

-------------------------

df['discount_price_rupiah'] = df['actual_price_rupiah'] * (df['discount_percentage_percent'] / 100.0)

Jika discount kosong, isikan 0

df['discount_price_rupiah'] = df['discount_price_rupiah'].fillna(0)

-------------------------

5. (f) tambah kolom terikat_sales = random int 1-100

-------------------------

random.seed(42)
np.random.seed(42)
df['terikat_sales'] = np.random.randint(1, 101, size=len(df))

-------------------------

6. (a) Kelompokkan 10 category berdasarkan sales terbesar (jumlah sales per kategori)

-------------------------

Pastikan kolom 'sales' merepresentasikan angka penjualan (jumlah/nominal)

category_sales = df.groupby('category', dropna=False)['sales'].sum().reset_index().sort_values('sales', ascending=False)
top10_category_by_sales = category_sales.head(10)
print("\nTop 10 kategori berdasarkan total sales:")
display(top10_category_by_sales)

-------------------------

7. (b) Kelompokkan 10 category berdasarkan rating terkecil (rata-rata rating terendah)

-------------------------

category_rating = df.groupby('category', dropna=False)['rating'].mean().reset_index().sort_values('rating', ascending=True)
bottom10_category_by_rating = category_rating.head(10)
print("\n10 kategori dengan rata-rata rating terendah:")
display(bottom10_category_by_rating)

-------------------------

8. (e) Tampilkan 10 produk dengan rating terbesar, sertakan discount_price_rupiah

-------------------------

top10_products_by_rating = df.sort_values(['rating', 'rating_count'], ascending=[False, False]).drop_duplicates(subset=['product_name']).head(10)
cols_to_show = ['product_name', 'category', 'rating', 'rating_count', 'actual_price', 'actual_price_rupiah', 'discount_percentage_percent', 'discount_price_rupiah']
print("\n10 produk dengan rating terbesar (termasuk discount_price_rupiah):")
display(top10_products_by_rating[cols_to_show])

-------------------------

9. (g) & (h) & (i) Buat contoh preprocessing hasil akhir sesuai contoh soal

- Saya buat 3 tabel contoh: tabel ringkasan produk, tabel top kategori, tabel product detail

-------------------------

Contoh tabel ringkasan (seperti tampilan di soal)

summary_table = df[['product_name','category','actual_price_rupiah','discount_price_rupiah','rating','rating_count','terikat_sales']].copy()
summary_table = summary_table.sort_values('discount_price_rupiah', ascending=False).head(20) # contoh 20 baris
print("\nContoh hasil preprocessing - summary table (20 baris teratas berdasarkan discount_price_rupiah):")
display(summary_table)

Tabel top kategori (gabungan sales dan avg rating)

category_summary = df.groupby('category').agg(
total_sales=('sales','sum'),
avg_rating=('rating','mean'),
avg_price_rp=('actual_price_rupiah','mean'),
total_products=('product_name','nunique')
).reset_index().sort_values('total_sales', ascending=False)
print("\nCategory summary (contoh):")
display(category_summary.head(20))

Tabel product detail untuk beberapa product tertentu (contoh untuk export)

product_detail = df[['product_name','category','actual_price','actual_price_rupiah','discount_percentage_percent','discount_price_rupiah','rating','rating_count','terikat_sales']]
print("\nProduct detail sample:")
display(product_detail.head(20))

-------------------------

10. Simpan hasil preprocessing ke file (opsional)

-------------------------

summary_table.to_csv('preprocessing_summary_table.csv', index=False)
category_summary.to_csv('category_summary.csv', index=False)
product_detail.to_csv('product_detail.csv', index=False)
print("\nFile CSV hasil preprocessing disimpan: preprocessing_summary_table.csv, category_summary.csv, product_detail.csv")

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions