# -*- coding: utf-8 -*-
"""code.ipynb
Automatically generated by Colab.
Original file is located at
https://colab.research.google.com/drive/1mhjkS1jETnNjFOpimfaDIKuFMjJpnGhp
# Student ID: 2400570
**Your student_id is your 7/8 digit FASER number.**
This is a sample format for CE807: Assignment. You must follow the format.
The code will have three broad sections, and an additional section if needed:
1. Common Codes
2. Method/model 1 Specific Codes
3. Method/model 2 Specific Codes
4. Other Method/model Codes, if any
**You must have `train_unsup`, `test_unsup` for the Unsupervised method and `train_dis`, `test_dis` for the Discriminative method to perform full training and testing. This will be evaluated automatically; without this your code will fail and will not be marked.**
Your code should be properly indented, print as much as possible, and follow standard coding (https://peps.python.org/pep-0008/) and documentation (https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/01.01-Help-And-Documentation.ipynb) practices.
Before each `code block/function`, you must have a `text block` which explains what the code block/function is going to do. For each function/class, you need to properly document its inputs, functionality and output.
If you are using any non-standard library, you must have command to install that, for example `pip install datasets`.
You must print `train`, `validation` and `test` performance measures.
You must also print `train` and `validation` loss in each `epoch`, wherever you are using `epoch`, say in any deep learning algorithms.
Your code must
* For reproducibility of the results you must use a `seed`; you have to set the seed in `torch`, `numpy`, etc., use the same seed everywhere, **and your Student ID should be your seed**.
* read the dataset from './data/number/', where number is the last digit of your student_id; this folder will have 3 files [`train.csv`, `val.csv`, `test.csv`]
* save model after finishing the training in './model/student_id/Model_Unsup/' and './model/student_id/Model_Dis/' for Unsupervised and Discriminative model respectively.
* at testing time you will load models from './model/student_id/Model_Unsup/' and './model/student_id/Model_Dis/' for the Unsupervised and Discriminative model respectively.
* after testing, your output file will be named “test.csv” and you will add/modify the “out_label_model_unsup” and “out_label_model_dis” columns in the existing columns from test.csv. These outputs will be generated from your trained models.
**Install and import all required libraries first before starting to code.**
# Declaring ``student_id`` as a variable to use in different places
"""
student_id = 2400570
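The output-column requirement above (add/modify “out_label_model_unsup” and “out_label_model_dis” in test.csv) can be sketched as follows. This is a minimal illustration of the writing step that `test_unsup`/`test_dis` must perform; `write_predictions` and the toy DataFrame are illustrative assumptions, not part of the assignment template.

```python
import pandas as pd


def write_predictions(test_df, predictions, column_name):
    """
    Add or overwrite a prediction column on a copy of the test DataFrame.

    Parameters:
        test_df (pd.DataFrame): DataFrame loaded from test.csv.
        predictions (list): One predicted label per row.
        column_name (str): e.g. 'out_label_model_unsup' or 'out_label_model_dis'.

    Returns:
        pd.DataFrame: test_df with the prediction column added/modified.
    """
    out_df = test_df.copy()  # keep the original columns intact
    out_df[column_name] = predictions
    return out_df


# Toy stand-in for test.csv; a real run would use pd.read_csv(test_file)
toy = pd.DataFrame({"text": ["good film", "bad film"],
                    "sentiment": ["positive", "negative"]})
out = write_predictions(toy, ["positive", "negative"], "out_label_model_unsup")
print(out.columns.tolist())
```

The same helper would be called a second time with `"out_label_model_dis"` before saving the frame back with `out.to_csv(..., index=False)`.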
"""# Reproducibility and Environment Setup
This section ensures **reproducibility** by installing dependencies, importing necessary libraries, and setting random seeds to achieve consistent results across different runs.
**1. Install Dependencies**
Install all required libraries for text preprocessing, topic modeling, and machine learning tasks.
**2. Import Libraries**
Import the necessary libraries that are required for data processing, model building, and evaluation.
**3. Set Random Seeds for Reproducibility**
Set random seeds for various libraries (Python, NumPy, PyTorch) to ensure that random operations and model initializations yield consistent results across runs.
## Install Dependencies
**Why is this needed?**
This installation ensures that you have **all necessary dependencies** for BERTopic to function fully, including:
- **Embedding models**:
- For creating document embeddings, which are essential for understanding the semantic structure of your text. This is typically achieved using models from `sentence-transformers`.
- **Visualization libraries**:
- Libraries like `plotly`, `umap-learn`, and `hdbscan` are used for **topic visualizations** and **dimensionality reduction**, which help in interpreting and presenting the topics generated by the model.
- **Clustering algorithms**:
- For **clustering** the document embeddings (e.g., using `hdbscan`), which is crucial in identifying patterns and groups of similar topics within your text.
- **Support for various functionalities**:
- This ensures that BERTopic provides **interactive visualizations**, **topic modeling**, and **efficient processing** of text data, offering a complete pipeline for topic modeling and analysis.
"""
# Install BERTopic with all optional dependencies for extended features like visualization, vectorization, and clustering support
!pip install bertopic[all]
"""## Import libraries"""
# ==== System & Utilities ====
import os # File and directory operations
import re # Regular expressions for pattern matching
import string # String constants and utility functions
import pickle # Object serialization and deserialization
import joblib # Saving/loading models efficiently
import warnings # To filter or suppress warnings
from collections import Counter # Count elements in iterables
import random # Import the random module
# ==== Numerical & Data Handling ====
import numpy as np # Numerical operations
import pandas as pd # Data manipulation and analysis
# ==== NLP: Text Preprocessing ====
import nltk # Natural Language Toolkit
from nltk.corpus import stopwords # Stopword list
from nltk.stem import WordNetLemmatizer # Lemmatizer to reduce words to root form
from nltk.tokenize import word_tokenize # Tokenization
from nltk.corpus import cmudict # Pronunciation dictionary
# ==== Deep Learning: PyTorch ====
import torch # Main PyTorch library
import torch.nn as nn # Neural network components
import torch.optim as optim # Optimizers for training
from torch.utils.data import Dataset, DataLoader # Data pipeline tools
# ==== Deep Learning: TensorFlow / Keras ====
from tensorflow.keras.preprocessing.text import Tokenizer # Tokenizes text into sequences
from tensorflow.keras.preprocessing.sequence import pad_sequences # Pads sequences to same length
# ==== Evaluation Metrics ====
from sklearn.metrics import (
    accuracy_score,
    precision_recall_fscore_support,
    confusion_matrix,
    ConfusionMatrixDisplay,
    silhouette_score,
    silhouette_samples,
    adjusted_rand_score
)
# ==== Clustering & Vectorization ====
from sklearn.feature_extraction.text import CountVectorizer # Bag-of-words vectorizer
from sklearn.cluster import KMeans # KMeans clustering
import hdbscan # HDBSCAN clustering
# ==== Embeddings & Topic Modeling ====
from sentence_transformers import SentenceTransformer # Sentence-level embeddings
from bertopic import BERTopic # BERTopic for topic modeling
# ==== Class Balancing ====
from imblearn.over_sampling import ADASYN # Oversampling minority class to balance data
# ==== Visualization ====
import matplotlib.pyplot as plt # General plotting
import seaborn as sns # Statistical plots with themes
"""## NLTK Downloads for Text Preprocessing
1. **warnings.filterwarnings("ignore")**
- **Why is this needed?**
- This is used to suppress warnings that might appear during execution, helping to keep the output clean and uncluttered.
2. **Stopwords**
- **Why is this needed?**
- The **stopwords** dataset contains a list of common words (such as "the", "is", "in", etc.) that are typically considered unimportant and are **removed** during text preprocessing. Removing stopwords helps in focusing on the more meaningful words in a text, which can improve the performance of downstream tasks like topic modeling or text classification.
3. **WordNet**
- **Why is this needed?**
- **WordNet** is a lexical database of the English language used for **lemmatization** and **synonym mapping**. Lemmatization reduces words to their root form (e.g., "running" to "run"), making it easier for models to understand the meaning of different forms of a word. This helps in improving consistency and understanding across variations of the same word.
4. **Open Multilingual WordNet (omw-1.4)**
- **Why is this needed?**
- The **Open Multilingual WordNet (omw-1.4)** extends the **WordNet** corpus to include non-English words. This provides **multilingual support**, allowing for better processing of texts in multiple languages. It is useful if you're working with multilingual datasets or need to perform NLP tasks on text in languages other than English.
5. **CMU Pronouncing Dictionary (cmudict)**
- **Why is this needed?**
- The **CMU Pronouncing Dictionary** is a dictionary that maps words to their phonetic transcriptions. It is particularly useful for **phonetic analysis**, such as **syllable counting**, which can be an important feature in linguistic and sentiment analysis tasks.
6. **Punkt Sentence Tokenizer**
- **Why is this needed?**
- **Punkt** is a pre-trained unsupervised tokenizer model that segments text into **sentences and words**. It is essential for many preprocessing tasks where proper sentence and word boundaries are required, such as **sentence length analysis**, **token-level transformations**, or **custom feature extraction**.
"""
# ==== Suppress warnings to keep output clean ====
warnings.filterwarnings("ignore")
# ==== NLTK Downloads for Text Preprocessing ====
# Stopwords: Common words (like "the", "is", "in") often removed during preprocessing
nltk.download("stopwords")
# WordNet: A large lexical database for English, used with lemmatization
nltk.download("wordnet")
# omw-1.4: Open Multilingual WordNet, enhances WordNet with multilingual support
nltk.download("omw-1.4")
# CMU Pronouncing Dictionary: Useful for phonetic analysis and syllable counts
nltk.download("cmudict")
# Punkt Sentence Tokenizer: Pretrained tokenizer for sentence and word splitting
nltk.download("punkt")
# punkt_tab: Tabular Punkt data required by newer NLTK releases for tokenization
nltk.download('punkt_tab')
"""## Setting Random Seeds for Reproducibility
In order to ensure **reproducibility** across different runs of the code and to make sure the results are consistent each time the script is executed, we set the random seed for various libraries. Here's why each one is needed:
1. **Python's Built-in Random Module**:
- **Why is this needed?**
- This sets the seed for Python's built-in random module. By using a fixed seed (in this case, `student_id`), we ensure that any random operations (like shuffling or selecting random numbers) yield the same result every time the code is run.
2. **NumPy's Random Number Generator**:
- **Why is this needed?**
- This sets the seed for NumPy's random number generator, ensuring that random operations like generating random arrays or sampling are reproducible. By fixing the seed, the same random numbers will be generated across runs, which is critical for reproducibility in experiments.
3. **PyTorch's CPU Operations**:
- **Why is this needed?**
- This sets the random seed for PyTorch's CPU operations. This is important for ensuring that model parameters are initialized the same way every time the script is executed, resulting in consistent training outcomes.
4. **PyTorch's GPU Operations (if CUDA is available)**:
- **Why is this needed?**
- This sets the seed for all GPU operations if CUDA is available. PyTorch uses this to initialize random number generators on the GPU, ensuring that results are consistent and reproducible even when running on different hardware setups.
"""
# Set the random seed for Python's built-in random module using the student_id for reproducibility
random.seed(student_id)
# Set the random seed for NumPy's random number generator, ensuring reproducible results across runs
np.random.seed(student_id)
# Set the random seed for PyTorch's CPU operations to ensure the same initialization of model parameters across runs
torch.manual_seed(student_id)
# Set the random seed for PyTorch's GPU operations, if CUDA is available, to ensure reproducibility on GPUs as well
torch.cuda.manual_seed_all(student_id)
"""# Data Access and Path Setup
This section focuses on the steps involved in accessing datasets, declaring paths, and modifying data as required for your analysis.
**1. Dataset Access**
Here, we define how to access and load the datasets, ensuring that the data is correctly imported and ready for processing.
**2. Path Declarations**
Paths to important files (such as datasets, configuration files, or models) are declared and set up. This allows for easy referencing and consistent file access throughout the code.
"""
# Mount Google Drive
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)
# Add your code to initialize GDrive and data and models paths
GOOGLE_DRIVE_PATH_AFTER_MYDRIVE = './CE807-25-SP/Assignment/'
GOOGLE_DRIVE_PATH = os.path.join('gdrive', 'MyDrive', GOOGLE_DRIVE_PATH_AFTER_MYDRIVE)
print('List files: ', os.listdir(GOOGLE_DRIVE_PATH))
DATA_PATH = os.path.join(GOOGLE_DRIVE_PATH, 'data', '20') # 20 data id
train_file = os.path.join(DATA_PATH, 'train.csv')
print('Train file: ', train_file)
val_file = os.path.join(DATA_PATH, 'valid.csv')
print('Validation file: ', val_file)
test_file = os.path.join(DATA_PATH, 'test.csv')
print('Test file: ', test_file)
MODEL_PATH = os.path.join(GOOGLE_DRIVE_PATH, 'model', str(student_id))  # Student registration number
MODEL_Dis_DIRECTORY = os.path.join(MODEL_PATH, 'model_dis')  # Discriminative model directory
print('Model Discriminative directory: ', MODEL_Dis_DIRECTORY)
MODEL_unsup_DIRECTORY = os.path.join(MODEL_PATH, 'model_unsup')  # Unsupervised model directory
print('Model Unsupervised directory: ', MODEL_unsup_DIRECTORY)
os.makedirs(MODEL_Dis_DIRECTORY, exist_ok=True)
os.makedirs(MODEL_unsup_DIRECTORY, exist_ok=True)
"""# Data Exploration
In this section, we focus on the essential steps of loading, cleaning, and understanding the dataset. This includes loading the CSV data, previewing its contents, dropping unnecessary columns, and analyzing the distribution of sentiment categories.
## Loading and Previewing CSV Data
In this section, we load a CSV file into a DataFrame and display the first few rows to get an initial look at the dataset.
"""
def load_and_preview_csv(file_path, n=5):
    """
    Loads a CSV file into a DataFrame and returns it after previewing the first `n` rows.

    Parameters:
        file_path (str): Path to the CSV file.
        n (int): Number of rows to display. Default is 5.

    Returns:
        pd.DataFrame: The loaded DataFrame.
    """
    df = pd.read_csv(file_path)
    print(df.head(n))
    return df
train_df = load_and_preview_csv(train_file) # Load and preview the training data
val_df = load_and_preview_csv(val_file) # Load and preview the validation data
test_df = load_and_preview_csv(test_file) # Load and preview the test data
"""## Dropping Unnecessary Columns
In this step, we remove the `data_id` column from the dataset. This column does not contribute to the analysis or model training, so it is dropped to clean the data and focus on the relevant features.
"""
def preview_and_drop_columns(file_path, columns_to_drop=None, n=5):
    """
    Load a CSV file into a DataFrame, preview the first `n` rows, and drop specified columns.

    Parameters:
        file_path (str): The path to the CSV file.
        columns_to_drop (list): List of column names to drop from the DataFrame. Default is None.
        n (int): The number of rows to preview. Default is 5.

    Returns:
        pd.DataFrame: The DataFrame after dropping specified columns (if any).
    """
    # Load the CSV file into a DataFrame
    df = pd.read_csv(file_path)
    # Preview the first `n` rows of the DataFrame
    print(f"Preview of the first {n} rows of {file_path}:")
    print(df.head(n))  # Show the first n rows
    # If there are columns to drop, remove them
    if columns_to_drop:
        print(f"Dropping columns: {columns_to_drop}")
        df = df.drop(columns=columns_to_drop)
    # Return the modified DataFrame
    return df
# Preview and drop specified columns
train_df = preview_and_drop_columns(train_file, columns_to_drop=['data_id'])
# Preview and drop specified columns
val_df = preview_and_drop_columns(val_file, columns_to_drop=['data_id'])
"""## Distribution of Sentiment Categories
This section visualizes the distribution of sentiment categories (e.g., "positive" and "negative") in the dataset. It helps in understanding the balance of sentiment labels and can highlight potential class imbalances, which may require addressing during model training to ensure fair performance across all classes.
"""
def plot_sentiment_pie_chart(df):
    """
    Plot a pie chart of the sentiment class distribution.

    Parameters:
        df (pd.DataFrame): DataFrame containing a 'sentiment' column.
    """
    # Value counts of 'sentiment'
    count = df['sentiment'].value_counts()
    # Create a figure for the pie chart
    fig, ax = plt.subplots(figsize=(3, 3))
    # Create a pie chart
    palette = sns.color_palette("coolwarm", len(count))
    sns.set_palette(palette)
    ax.pie(count, labels=count.index, autopct='%1.1f%%', startangle=140, colors=palette)
    # Title
    ax.set_title('Distribution of Sentiment Categories', fontsize=15, fontweight='bold')
    # Display the plot
    plt.tight_layout()
    plt.show()
plot_sentiment_pie_chart(train_df)
"""**Insights on Sentiment Distribution**
The sentiment distribution of the dataset is highly imbalanced:
- **20.4% Negative Sentiment**: Only a small portion of the dataset expresses negative sentiment.
- **79.6% Positive Sentiment**: The majority of the dataset shows positive sentiment.
This imbalance could impact model performance, as models may become biased towards the majority class (positive sentiment). It might be necessary to apply techniques like **oversampling**, **undersampling**, or **class-weight adjustments** to address the imbalance for more accurate predictions.
# Feature Analysis
This section focuses on extracting and analyzing key linguistic features that can help in understanding text structure and sentiment patterns. These features are useful for enhancing model performance by providing more informative inputs.
**Syllable Count**
Analyzing the number of syllables helps assess text complexity and readability, which may correlate with sentiment intensity.
**Sentence Length Distribution**
Understanding the variation in sentence lengths can reveal stylistic differences and emotional tone in text data.
**Root Words Analysis**
Examining root words (lemmas) highlights the core meaning of sentences and reduces dimensionality by grouping word variants.
**Extracting Word Formation Features**
Capturing structural elements like prefixes, suffixes, or word shapes can provide cues about word function and sentiment tendencies.
"""
# Store the 'text' column in a variable
text_data = train_df['text']
# Ensure text_data has no NaN values and is in string format
text_data = text_data.fillna('').astype(str)
"""## Syllable Count
**Syllable Counting:**
- For each word in the text, the script uses the **CMU Pronouncing Dictionary** to find the number of syllables (based on phonemes ending in digits, e.g., `AH1`).
- If a word is not found in the dictionary, it defaults to **1 syllable**.
**Sentence Syllable Count:**
- The total syllables for each sentence are calculated by **summing** the syllable counts of all words in that sentence.
**Visualization:**
- A **histogram** is plotted using **Seaborn** to show the frequency of sentences with varying syllable counts.
- Custom styling is applied to enhance visual appeal.
- The plot provides insights into the **complexity of the text data**, helping to analyze **sentence structure** or **readability**.
**Why are we doing this?**
Understanding syllable distribution helps assess the **linguistic complexity and readability** of the text. Sentences with higher syllable counts often indicate more complex vocabulary and structure, which is important in **sentiment analysis**, **readability scoring**, or **text simplification tasks**.
"""
# Load CMU Pronouncing Dictionary
d = cmudict.dict()
# Syllable count function using CMU Pronouncing Dictionary
def syllable_count(word):
    word = word.lower()
    if word in d:
        # Count syllables as the max over the word's listed pronunciations
        return max(len([y for y in pron if y[-1].isdigit()]) for pron in d[word])
    else:
        return 1  # Default to 1 if word is not in CMU dict
# Apply syllable count to words in each sentence
syllable_counts = text_data.apply(
    lambda x: sum(syllable_count(word) for word in word_tokenize(x))
)
# Plot syllable count distribution using Seaborn
plt.figure(figsize=(6, 4))
plt.gcf().set_facecolor('#b1d8fe')  # Set figure background color (after creating the figure)
sns.histplot(syllable_counts, bins=30, kde=False, color='#ff864f', edgecolor='white')
plt.title('Distribution of Syllable Count in Text', fontsize=14)
plt.xlabel('Syllable Count', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
# Set the axis background color to a light blue
plt.gca().set_facecolor('#e2f0fe')
plt.show()
"""**Insights on Syllable Count Distribution**
- **Distribution Shape**: The syllable count distribution is heavily **right-skewed**, indicating that most text samples have relatively low syllable counts.
- **Peak of the Distribution**: The **mode** is observed at the lower end, with many samples having syllable counts under 100. This suggests that the majority of sentences in the dataset are relatively simple in terms of syllable complexity.
- **Outliers/Long-Tail Texts**: There are **outliers** in the dataset with syllable counts reaching up to 3500, but these instances are **rare**. These long-tail samples could represent complex, longer texts or perhaps some noise in the data.
This distribution insight helps understand the general complexity of the text data and the presence of both simple and exceptionally complex samples, which could influence further text analysis and model training.
## Sentence Length Distribution
1. **Sentence Length Calculation:**
It calculates the length of each sentence by counting the number of words in it. This is done by tokenizing each sentence using `nltk.word_tokenize()`, then calculating the length of the tokenized sentence.
2. **Data Preparation:**
The sentence lengths are stored in a DataFrame (`sentence_lengths_df`) for easier handling, especially when plotting using Seaborn.
3. **Plotting:**
A histogram is created to show the distribution of sentence lengths. The `sns.histplot()` function is used, which provides a histogram along with an optional Kernel Density Estimate (KDE) curve. This is done with a light green color for the bars.
4. **Title and Labels:**
The plot is given a title ("Sentence Length Distribution") and labeled axes: the x-axis represents the "Sentence Length (Number of Words)", and the y-axis represents the "Frequency".
**Why are we doing this?**
Understanding the **distribution of sentence lengths** helps identify the **complexity and variability of the text**. This insight can inform decisions on **text truncation, padding, or segmentation**, especially when preparing text data for machine learning models like RNNs, LSTMs, or transformers, which may have input length constraints.
"""
# Calculate sentence length (number of words in each sentence)
sentence_lengths = [len(nltk.word_tokenize(sentence)) for sentence in text_data]
# Convert to a DataFrame for easier handling in Seaborn
sentence_lengths_df = pd.DataFrame(sentence_lengths, columns=["Sentence Length"])
# Plot sentence length distribution using Seaborn
plt.figure(figsize=(6, 4))
sns.histplot(sentence_lengths_df['Sentence Length'], bins=30, kde=True, color="#83ca89")
plt.title('Sentence Length Distribution')
plt.xlabel('Sentence Length (Number of Words)')
plt.ylabel('Frequency')
# Set x-axis limit to 40
plt.xlim(0, 40)
plt.tight_layout()
plt.show()
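The truncation/padding decision mentioned above can be informed directly by this distribution: since most sentences fall well under 40 words, a cutoff near the upper percentiles covers almost all data without padding everything to the longest outlier. The sketch below is illustrative; `pad_or_truncate`, the toy lengths, and the 95th-percentile choice are our assumptions, not part of the notebook.

```python
import numpy as np


def pad_or_truncate(seq, max_len, pad_value=0):
    """Right-pad a token-id sequence with `pad_value`, or truncate it, to exactly `max_len`."""
    return list(seq[:max_len]) + [pad_value] * max(0, max_len - len(seq))


# Toy stand-in for the computed sentence_lengths list
lengths = [3, 5, 15, 18, 22, 40]
# Choose a cutoff covering ~95% of sentences rather than the long tail
max_len = int(np.percentile(lengths, 95))
padded = [pad_or_truncate([1] * n, max_len) for n in lengths]
print(max_len, {len(p) for p in padded})  # every padded sequence has length max_len
```

In the notebook itself, `tensorflow.keras.preprocessing.sequence.pad_sequences` (already imported) performs the same job on batches of tokenized sequences via its `maxlen` parameter.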
"""**Insights on Sentence Length Distribution**
- **Peak Around 15 Words**: The distribution peaks at **15 words per sentence**, indicating that most sentences in the dataset are around this length. This is typical for natural language, where sentences tend to be of moderate length.
- **Decline in Frequency**: There is a **steady decline** in frequency as the sentence length increases beyond 15 words, suggesting that **longer sentences** are less common.
- **Short Sentences**: Sentences with fewer than **5 words** are also relatively **less common** compared to sentences in the range of **10–20 words**. This shows that short sentences are less frequent in this dataset.
- **Right-Skewed Distribution**: The chart displays a **slightly right-skewed distribution**, which is typical for natural language. This implies that **short and medium-length sentences** dominate the data, while longer sentences become progressively rarer.
These insights offer a better understanding of the text's sentence structure, indicating a tendency towards more concise, moderate-length sentences in the dataset.
## Root Words Analysis
Lemmatization is a **natural language processing** technique that reduces words to their base or root form (e.g., "running" → "run"). This script uses the **WordNet Lemmatizer** to:
1. Tokenize and lemmatize words in the text dataset.
2. Compare original words with their lemmatized forms.
3. Visualize the count of words that need lemmatization versus those already in their root form.
**Why are we doing this?**
Lemmatization helps in **reducing lexical variation** and ensures that different forms of the same word are treated uniformly. This is especially important for **text classification, clustering, and topic modeling**, where consistency in word forms can significantly improve model performance. The bar chart provides a quick visual check of how much preprocessing is required for the dataset.
"""
# Initialize lemmatizer
lemmatizer = WordNetLemmatizer()
# Tokenize and lemmatize words in the text data
all_words = ' '.join(text_data).split()
lemmatized_words = [lemmatizer.lemmatize(word) for word in all_words]
# Compare original words and lemmatized words
comparison_df = pd.DataFrame({
'Original Word': all_words,
'Lemmatized Word': lemmatized_words
})
comparison_df['Needs Lemmatization'] = comparison_df['Original Word'] != comparison_df['Lemmatized Word']
# Count words that require lemmatization
lemmatization_counts = comparison_df['Needs Lemmatization'].value_counts()
# Plot the data using Seaborn
plt.figure(figsize=(6, 4))
sns.barplot(
x=lemmatization_counts.index.map({False: 'Already Root', True: 'Needs Lemmatization'}),
y=lemmatization_counts.values,
palette=["#ff864f", "#83ca89"]
)
# Set the background and labels
plt.gcf().set_facecolor('#b1d8fe') # Set figure background color
plt.gca().set_facecolor('#e2f0fe') # Set axis background color
plt.title('Words Needing Lemmatization vs. Already in Root Form', fontsize=14)
plt.xlabel('Lemmatization Status', fontsize=12)
plt.ylabel('Count of Words', fontsize=12)
plt.tight_layout()
plt.show()
"""**Insights on Lemmatization vs. Root Form Words**
- **Already in Root Form**: A significant majority of the words in the dataset, more than **140,000 words**, are already in their **root form**. This indicates that a large portion of the text is composed of base words without requiring further transformation.
- **Need Lemmatization**: Around **10,000 words** in the dataset require **lemmatization**. These are words in inflected or derived forms (e.g., "running" to "run"), which need to be reduced to their base form for consistency and better model performance.
This analysis highlights that most of the text is already in a usable format for NLP tasks, but there is still a smaller subset that will benefit from lemmatization, potentially improving downstream tasks like text classification or topic modeling.
## Extracting Word Formation Features
This script identifies and counts **word formation patterns** in the text data:
1. **Root Words** – Words already in their base form.
2. **Suffixes** – Words ending with **"ing"** or **"ed"**, indicating progressive or past tense forms.
3. **Prefixes** – Words starting with **"un"**, often indicating negation.
4. **Pluralization** – Words that differ from their lemmatized root, suggesting plural forms.
**Why are we doing this?**
Analyzing word formation helps us understand **morphological complexity** in the dataset. It reveals how often certain grammatical forms (e.g., tense, negation, plurality) appear, which can provide insights into **text style, sentiment**, or **author intent**. These features can also improve performance in downstream tasks like **sentiment analysis, topic modeling**, or **readability scoring**. Visualizing them in a bar chart makes it easy to observe their distribution and compare frequencies.
"""
# Initialize lemmatizer
lemmatizer = WordNetLemmatizer()
# Extract word formation features
features = {
    "Root Words": 0,
    "Prefixes": 0,
    "Suffixes": 0,
    "Pluralization": 0,
}
for sentence in text_data:
    for word in word_tokenize(sentence):
        root = lemmatizer.lemmatize(word.lower())
        if word.endswith("ing") or word.endswith("ed"):
            features["Suffixes"] += 1
        elif word.startswith("un"):
            features["Prefixes"] += 1
        elif root != word.lower():
            features["Pluralization"] += 1
        else:
            features["Root Words"] += 1
# Convert features dictionary to DataFrame
features_df = pd.DataFrame(list(features.items()), columns=["Feature", "Count"])
# Plot using Seaborn
plt.figure(figsize=(6, 4))
plt.gcf().set_facecolor('#b1d8fe')  # Set figure background color (after the figure is created, so it actually applies)
sns.barplot(x="Feature", y="Count", data=features_df, palette=["#ff864f", "#83ca89", "#fa5477", '#ac7cd1'])
plt.title("Distribution of Word Formation Features", fontsize=14)
# Calculate the maximum count for annotation positioning
max_count = features_df["Count"].max()
plt.xlabel("Word Formation Feature", fontsize=12)
plt.ylabel("Count", fontsize=12)
plt.xticks(rotation=45)
# Set the axis background color to a light blue
plt.gca().set_facecolor('#e2f0fe')
plt.tight_layout()
plt.show()
"""**Insights on Word Formation Features**
- **Root Words**: The majority of the words in the dataset, approximately **150,000 words**, are already in their **root form**. This suggests that the text predominantly consists of base words that don't require significant transformation.
- **Suffixes and Pluralization**: Both **suffixes** (words ending in "ing" or "ed") and **pluralization** (words differing from their lemmatized root) together account for fewer than **25,000 words**. This indicates that a smaller portion of the dataset involves word variations that can be identified through suffix patterns or pluralization.
This distribution highlights that while most words are in their base form, a smaller but significant portion of words might need attention through processes like stemming or lemmatization to ensure uniformity for further analysis or model building.
# Data Preprocessing
## Variable Initialization for Model Evaluation
These lists help track the **training and validation performance** across different learning approaches (unsupervised and discriminative models). By storing and analyzing these values, we can evaluate how well the models generalize to unseen data, compare the effectiveness of different methods, and identify potential improvements in the model architecture or training process.
"""
# For unsupervised learning phase
average_training_unsup = [] # Stores average metrics (accuracy, precision, recall, f1) for unsupervised training
average_val_unsup = [] # Stores average metrics for validation during unsupervised training
metrics_unsup = [] # Stores individual confusion metrics during unsupervised training
# For discriminative learning phase
average_training_dis = [] # Stores average metrics for discriminative training
average_val_dis = [] # Stores average metrics for validation during discriminative training
metrics_dis = [] # Stores individual confusion metrics during discriminative training
"""## Preprocessing Function for Text Data
**Text Preprocessing Function Explanation:**
---
**Step-by-Step Breakdown**
1. Lemmatization
- **What it does:** Reduces words to their root form (e.g., "running" → "run").
- **Why it's important:** Helps the model treat variations of a word as a single concept, improving generalization.
2. Stopwords Removal
- **What it does:** Removes common words like “the”, “is”, “in” which add little meaning.
- **Why it's important:** Reduces noise, improves signal in data, especially for tasks like classification and clustering.
3. Text Cleaning
- **What it does:**
- Converts text to lowercase.
- Removes URLs, punctuation, extra spaces, and square brackets.
- **Why it's important:**
- Ensures uniform formatting.
- Strips irrelevant elements that don't contribute to sentiment or topic.
4. Splitting and Filtering
- **What it does:** Splits text into tokens, lemmatizes each, and filters stopwords.
- **Why it's important:** Leaves only meaningful, normalized words for downstream NLP tasks.
---
---
**Preprocessing Justification: Theoretical and Practical Foundations**
| Justification | Theoretical Basis | Practical Impact |
|----------------------|------------------------------------------------------------------------|------------------------------------------------------------------------|
| **Lemmatization** | Linguistics theory: transforms inflected forms to base forms | Helps reduce dimensionality while preserving semantics |
| **Stopword Removal** | Zipf’s Law: high-frequency words carry less information | Reduces feature space and computational cost, enhances model focus |
| **Lowercasing** | Normalization in IR and NLP tasks | Ensures tokens like “Apple” and “apple” are treated the same |
| **URL & Symbol Removal** | Information noise filtering | Eliminates non-linguistic content that confuses models |
| **Token Filtering** | Cognitive load reduction, semantic clarity | Improves interpretability and feature quality for clustering/classification |
---
**Comparison of Common Preprocessing Techniques (With Justification)**
| Technique | Description & Variants | Included in Our Pipeline? | Why This Matters | Justification for Our Choice / Exclusion |
|---------------------------|-----------------------------------------------------------------------------------------------------------|----------------------------|----------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------|
| **Lowercasing** | Converts all text to lowercase (e.g., “Happy” → “happy”) | Yes | Ensures uniformity; avoids treating “Happy” and “happy” as different tokens. | Essential for reducing feature space and avoiding case-based duplication. |
| **Stopword Removal** | Removes common non-informative words (e.g., "the", "is", "in") | Yes | Helps reduce noise and focus on meaningful words. | We used NLTK's stopword list — a practical choice that improves classification performance. |
| **Lemmatization** | Converts words to their base form. Variants: <br>• WordNetLemmatizer (our choice) <br>• spaCy <br>• Stemming (Porter, Snowball) | Yes (WordNet) | Reduces inflectional forms and standardizes meaning. | WordNet is slower but more linguistically accurate than stemming. Better for semantic tasks. |
| **Stemming** | Truncates to root (e.g., “running” → “run” or “runn”) | No | Crude and may distort meaning; less accurate than lemmatization. | We preferred lemmatization for better language representation — correct for semantic analysis. |
| **Tokenization** | Splits text into individual words or subwords | Yes | Needed for nearly all NLP tasks; basis for feature extraction. | A standard and necessary step — well-handled in our preprocessing function. |
| **URL Removal** | Removes hyperlinks from the text | Yes | URLs carry no semantic sentiment; removing reduces irrelevant noise. | A good choice — it improves clarity, especially for user-generated text. |
| **Punctuation Removal** | Removes symbols like . , ? ! etc. | Yes | Helps simplify text and reduce token clutter. | Important for traditional models like TF-IDF or clustering. |
| **Special Char & Emoji Removal** | Removes emojis, hashtags, etc. (often used in informal/social media) | No | Emojis can carry sentiment (😊, 😡), useful in emotion/sentiment tasks. | Keeping emojis could boost performance in informal datasets (e.g., tweets, reviews). We excluded this for now, but would reconsider if analyzing social media text. |
| **Spelling Correction** | Fixes typos and variants (e.g., “hapy” → “happy”) | No | Enhances consistency and reduces sparsity — but computationally expensive. | We skipped this due to resource constraints, though it could also help prediction quality on noisy text. |
| **TF-IDF Weighting** | Converts cleaned text to feature vectors (numerical) | Yes (in our model) | Helps highlight rare and informative terms. | A well-chosen vectorizer for traditional clustering/classification pipelines. |
| **Embeddings (e.g., BERT)** | Transforms sentences into dense vectors using models like SentenceTransformer | Yes (BERTopic step) | Captures deep semantic relationships beyond word-level features. | A strong addition for topic modeling and unsupervised sentiment detection. |
---
**Summary:**
- We prioritized **clean, semantically meaningful representation** of text for clustering and sentiment modeling.
- Chose **lemmatization (WordNet)** over stemming — good for interpretability and semantic coherence.
- Excluded things like emoji removal and spelling correction — valid if the dataset is formal and clean.
- Combined **traditional vectorization (TF-IDF)** with **semantic embeddings** — a strong hybrid approach.
---
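Since the pipeline summary above leans on TF-IDF weighting, a minimal sketch of the textbook tf-idf formula may help make the "rare and informative terms" claim concrete. This is an illustrative stand-in, not the notebook's vectorizer: scikit-learn's `TfidfVectorizer` uses a smoothed idf and L2 normalization, so its numbers differ.

```python
import math

def tf_idf(docs):
    # tf-idf(t, d) = tf(t, d) * log(N / df(t))  — a common textbook variant
    # tf(t, d): term frequency in the document; df(t): number of docs containing t
    N = len(docs)
    df = {}
    for doc in docs:
        for t in set(doc):
            df[t] = df.get(t, 0) + 1
    return [
        {t: doc.count(t) / len(doc) * math.log(N / df[t]) for t in set(doc)}
        for doc in docs
    ]

docs = [["movie", "great"], ["movie", "bad"]]
weights = tf_idf(docs)
# "movie" appears in every document, so idf = log(2/2) = 0 and its weight vanishes,
# while the rarer "great" and "bad" keep positive weight
```

This is exactly the behavior the table describes: ubiquitous terms are down-weighted toward zero, rare discriminative terms dominate the feature vector.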
"""
# Define a function to return a preprocessor for text data
def get_preprocessor():
    """
    Returns a text preprocessing function that:
    1. Converts text to lowercase
    2. Removes URLs and square bracketed content
    3. Strips punctuation and extra whitespace
    4. Lemmatizes words and removes common stopwords
    This function uses the NLTK library's stopwords and WordNetLemmatizer for processing text.
    Returns:
        preprocess_text (function): A function that takes a text string as input and returns the preprocessed string.
    """
    # Initialize the lemmatizer from NLTK (used to reduce words to their root form)
    lemmatizer = WordNetLemmatizer()
    # Load the list of common stop words in English from NLTK
    stop_words = set(stopwords.words("english"))

    # Define the inner function that will preprocess the text
    def preprocess_text(text):
        """
        Preprocesses a given text string by conditionally:
        1. Converting to lowercase.
        2. Removing content inside square brackets.
        3. Removing URLs.
        4. Removing punctuation.
        5. Normalizing whitespace.
        6. Lemmatizing and removing stopwords.
        Args:
            text (str): The input text to be preprocessed.
        Returns:
            str: The cleaned and processed text.
        """
        if pd.isna(text):
            return ""
        # Step 1: Convert to lowercase if not already lowercase
        if any(char.isupper() for char in text):
            text = text.lower()
        # Step 2: Remove text inside square brackets if brackets exist
        if "[" in text and "]" in text:
            text = re.sub(r"\[.*?\]", "", text)
        # Step 3: Remove URLs if any common URL patterns are found
        if "http" in text or "www." in text:
            text = re.sub(r"https?://\S+|www\.\S+", "", text)
        # Step 4: Remove punctuation if punctuation characters exist
        if any(char in string.punctuation for char in text):
            text = text.translate(str.maketrans("", "", string.punctuation))
        # Step 5: Normalize whitespace if needed (double spaces or stray leading/trailing space)
        if "  " in text or text != text.strip():
            text = re.sub(r"\s+", " ", text).strip()
        # Step 6: Lemmatize and remove stopwords if there are words
        words = text.split()
        if words:
            words = [lemmatizer.lemmatize(w) for w in words if w not in stop_words]
            text = " ".join(words)
        return text

    # Return the preprocess_text function
    return preprocess_text
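To see the effect of the cleaning order, here is a minimal self-contained sketch that mirrors the same six steps with a tiny hand-made stopword list and an identity lemmatizer, so it runs without NLTK's downloads. `toy_preprocess` and its stopword set are illustrative stand-ins, not the notebook's actual function.

```python
import re
import string

stop_words = {"the", "a", "is", "in", "to", "see"}  # stand-in for NLTK's list
lemmatize = lambda w: w  # stand-in for WordNetLemmatizer().lemmatize

def toy_preprocess(text):
    text = text.lower()                                               # Step 1: lowercase
    text = re.sub(r"\[.*?\]", "", text)                               # Step 2: bracketed content
    text = re.sub(r"https?://\S+|www\.\S+", "", text)                 # Step 3: URLs
    text = text.translate(str.maketrans("", "", string.punctuation))  # Step 4: punctuation
    text = re.sub(r"\s+", " ", text).strip()                          # Step 5: whitespace
    words = [lemmatize(w) for w in text.split() if w not in stop_words]  # Step 6
    return " ".join(words)

print(toy_preprocess("The movie [spoiler] is GREAT!! See http://example.com"))
# → "movie great"
```

Note that the order matters: URLs must be stripped before punctuation removal, otherwise `http://example.com` degrades into the spurious token `httpexamplecom`.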
"""# Method Unsupervised Start
In this section you will write all details of your Method 1.
You will have to enter multiple `code` and `text` cells.
Your code should follow the standard ML pipeline:
* Data reading
* Data cleaning, if any
* Convert data to vectors (tokenization/vectorization)
* Model declaration/initialization/building
* Training and validation of the model using the training and validation datasets
* Save the trained model
* Load and test the model on the testing set
* Save the output of the model
You could add any other step(s) based on your method's requirements.
After finishing the above, you need to use the split data as defined in the assignment and then do the same for all 4 sets. Your code should not be copy-pasted 4 times; make use of `function`s.
## Function for Model Evaluation and Analysis
This section covers the functions used for evaluating and analyzing model performance, including text preprocessing, classification evaluation, and cluster quality assessment.
### Preprocessing for Training and Validation Datasets
**Purpose:**
This function is designed to load and preprocess both the training and validation datasets, ensuring that the data is cleaned and ready for analysis or model training. The function performs essential steps, such as handling missing values and applying a custom preprocessing function to the text data.
**Why is this needed?**
Preprocessing text data is a critical step in natural language processing (NLP) workflows. Raw text data often contains noise, such as missing values or irrelevant content, which can impact model performance. This function:
1. Ensures that datasets are free from missing or incomplete text entries.
2. Applies custom preprocessing (like tokenization, lemmatization, or stopword removal) to make the text consistent and suitable for further analysis or training.
**What does this function do?**
- **Prints the shape of the input data**: This provides a quick overview of the number of records in the training and validation datasets.
- **Drops rows with missing text**: Any records in the datasets where the text is missing are removed to ensure the quality of the data.
- **Applies preprocessing**: A custom preprocessing function is applied to the "text" column to transform the data (e.g., removing stopwords, stemming, or lemmatization) into a form suitable for model input.
**Output:**
The function returns the preprocessed training and validation datasets, which can now be used for further analysis or model building.
"""
def load_and_preprocess(train_df, val_df, preprocess_fn):
    """
    Loads and preprocesses training and validation data by:
    1. Printing the shape of the input data.
    2. Dropping rows with missing text in both the training and validation data.
    3. Applying the provided preprocessing function to the "text" column.
    Args:
        train_df (pd.DataFrame): The training dataset, which must contain a "text" column.
        val_df (pd.DataFrame): The validation dataset, which must contain a "text" column.
        preprocess_fn (function): A function that will preprocess the "text" column.
    Returns:
        pd.DataFrame: The preprocessed training and validation datasets.
    """
    # Print the shape of the training and validation datasets to provide an overview of the data
    print(f"Train data shape: {train_df.shape}")
    print(f"Validation data shape: {val_df.shape}")
    # Iterate over both training and validation datasets
    for df in [train_df, val_df]:
        # Print the number of records being processed
        print(f"Preprocessing {df.shape[0]} records...")
        # Drop rows where the 'text' column is NaN
        df.dropna(subset=["text"], inplace=True)
        # Apply the preprocessing function to the 'text' column
        df["text"] = df["text"].apply(preprocess_fn)
    # Return the preprocessed training and validation datasets
    return train_df, val_df
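For instance, the function's drop-then-apply pattern can be exercised on toy DataFrames. This is a sketch: `toy_preprocess` here stands in for the real preprocessor returned by `get_preprocessor()`.

```python
import pandas as pd

train_df = pd.DataFrame({"text": ["Great movie!", None, "  Bad plot.  "]})
val_df = pd.DataFrame({"text": ["Loved it", None]})

toy_preprocess = lambda t: t.strip().lower()  # stand-in for the real preprocessor

# Same pattern as load_and_preprocess: drop missing text, then clean the rest
for df in (train_df, val_df):
    df.dropna(subset=["text"], inplace=True)
    df["text"] = df["text"].apply(toy_preprocess)

print(train_df["text"].tolist())  # → ['great movie!', 'bad plot.']
```

Dropping NaN rows before `apply` is what keeps the preprocessor from ever seeing a non-string value.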
"""### Plotting Confusion Matrix
**Purpose:**
This function generates and visualizes a confusion matrix to assess the performance of a classification model. It shows how predicted labels compare to true labels, helping to identify misclassifications.
**Why is this needed?**
A confusion matrix is a valuable tool for evaluating classification models. It provides insight into the model’s accuracy, precision, recall, and other metrics by visualizing true positive, false positive, true negative, and false negative values.
"""
def plot_confusion_matrix(true_labels, predicted_labels, labels=None, title="Confusion Matrix", n_clusters=2):
    """
    Plots a confusion matrix using a heatmap.
    Args:
        true_labels (array-like): Array of true labels (actual values).
        predicted_labels (array-like): Array of predicted labels (predicted by the model).
        labels (array-like, optional): List of class labels to display on the axes. If None, the labels will be inferred from the data.
        title (str, optional): The title of the confusion matrix plot. Default is "Confusion Matrix".
        n_clusters (int, optional): Number of clusters, recorded alongside the stored confusion matrix. Default is 2.
    """
    # Compute the confusion matrix based on true labels and predicted labels
    cm = confusion_matrix(true_labels, predicted_labels, labels=labels)
    # Store the full matrix
    metrics_unsup.append({
        "n_clusters": n_clusters,
        "confusion_matrix": cm.tolist()
    })
    # Create a figure with a specific size
    plt.figure(figsize=(3, 3))
    # Create a heatmap using Seaborn, with annotations for each cell (the count), formatted as integers
    sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", xticklabels=labels, yticklabels=labels, cbar=False)
    # Labeling the x-axis and y-axis
    plt.xlabel('Predicted Labels')
    plt.ylabel('True Labels')
    # Set the title for the plot
    plt.title(title)
    # Adjust the layout so everything fits properly
    plt.tight_layout()
    # Display the plot
    plt.show()
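To make the cell layout of the matrix concrete, here is a minimal pure-Python sketch of how the counts are accumulated; `confusion_counts` is a hypothetical helper for illustration, not part of the notebook (the function above delegates this to scikit-learn's `confusion_matrix`).

```python
def confusion_counts(true_labels, predicted_labels, labels):
    # cm[i][j] = number of samples whose true label is labels[i]
    # and whose predicted label is labels[j]
    idx = {lab: i for i, lab in enumerate(labels)}
    cm = [[0] * len(labels) for _ in labels]
    for t, p in zip(true_labels, predicted_labels):
        cm[idx[t]][idx[p]] += 1
    return cm

cm = confusion_counts([1, 0, 1, 1, 0], [1, 0, 0, 1, 1], labels=[0, 1])
print(cm)  # → [[1, 1], [1, 2]]
```

For the binary case, with rows as true labels and columns as predictions, the cells read: `cm[0][0]` true negatives, `cm[0][1]` false positives, `cm[1][0]` false negatives, `cm[1][1]` true positives.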
"""### Silhouette Score Visualization for Clusters
**Purpose:**
This section calculates and visualizes the silhouette score, which helps in evaluating the quality of clustering.
**Why is this needed?**
The silhouette score measures how similar each point is to its own cluster compared to other clusters. A higher score indicates that the samples are well-clustered, while a lower score suggests that the clustering might need adjustments.
"""
def plot_silhouette(X, labels, title="Silhouette Scores for Clusters"):
    """
    Plots the silhouette scores distribution for clustering results.
    Args:
        X (array-like): The feature data (input data for clustering).
        labels (array-like): The cluster labels (predicted by the clustering algorithm).
        title (str, optional): The title of the plot. Default is "Silhouette Scores for Clusters".
    """
    # Calculate the overall silhouette score for the clustering
    sil_score = silhouette_score(X, labels)
    # Print the silhouette score (excluding noise points) for feedback
    print(f"Silhouette Score (excluding noise): {sil_score:.4f}")
    # Calculate the silhouette score for each sample
    sil_samples = silhouette_samples(X, labels)
    # Set up the style for the plot (whitegrid style from seaborn for a clean look)
    sns.set(style="whitegrid")
    # Create a new figure with a specific size
    plt.figure(figsize=(6, 4))
    # Plot a histogram of the silhouette scores for individual samples, with Kernel Density Estimation (KDE)
    sns.histplot(sil_samples, bins=25, kde=True)
    # Set the title of the plot
    plt.title(title)
    # Label the x-axis (Silhouette Coefficient)
    plt.xlabel("Silhouette Coefficient")
    # Label the y-axis (Count of samples)
    plt.ylabel("Count")
    # Adjust the layout to ensure everything fits within the figure
    plt.tight_layout()
    # Display the plot
    plt.show()
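The silhouette coefficient of a sample is s(i) = (b - a) / max(a, b), where a is the mean distance to the other points in its own cluster and b is the mean distance to the nearest other cluster. As a minimal illustration (a toy 1-D implementation, not what `silhouette_score` does internally; it assumes every cluster has at least two points):

```python
def silhouette_1d(points, labels):
    # s(i) = (b - a) / max(a, b)
    # a: mean distance to same-cluster points; b: mean distance to nearest other cluster
    scores = []
    for i, (x, lab) in enumerate(zip(points, labels)):
        same = [abs(x - y) for j, (y, l) in enumerate(zip(points, labels))
                if l == lab and j != i]
        a = sum(same) / len(same)
        b = min(
            sum(abs(x - y) for y, l in zip(points, labels) if l == other)
            / labels.count(other)
            for other in set(labels) if other != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

score = silhouette_1d([0.0, 0.1, 5.0, 5.1], [0, 0, 1, 1])
# Two tight, well-separated clusters, so the score is close to 1
```

Scores near 1 indicate tight, well-separated clusters; near 0, overlapping clusters; negative values suggest samples assigned to the wrong cluster.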
"""## Training Unsupervised Method Code
Your training code should be a stand-alone script that takes `train_file`, `val_file`, and `model_dir` as input. You could have other inputs as well, but these three are required. Load both files, train using the `train_file`, and validate using the `val_file`. `print` / `display` / `plot` all performance metrics and loss (if available), and save the output model in `model_dir`.
Note that at testing time you need to use the same preprocessing and model, so it is best to factor those out as separate functions or a pipeline, whichever suits your method. Don't copy-paste the same code twice; make it a function or class, whichever fits best.
This function implements an unsupervised learning pipeline for sentiment analysis. The pipeline utilizes a combination of **BERTopic** for topic modeling, **HDBSCAN** for clustering, **KMeans** for further clustering analysis, and **ADASYN** for handling class imbalance. The steps include topic modeling, clustering, and data balancing, followed by model evaluation on both training and validation sets.
**1. Why KMeans for Clustering?**
Since our goal is to predict sentiment (positive or negative) in an unsupervised manner, we use **KMeans** for clustering the text data. Here's why it's appropriate:
- **Text Data Clustering**: The text data contains inherent patterns related to sentiment, and KMeans helps to partition this data into clusters. While we don't have explicit sentiment labels, **KMeans allows us to divide the data into two main clusters** (positive and negative sentiments), aligning with the sentiment prediction task.
- **Fixed Number of Clusters**: KMeans requires us to define the number of clusters upfront. In sentiment analysis, we can assume that there are two primary sentiments (positive and negative), making it a natural fit for this task. KMeans will group the data accordingly and can help us understand the distribution of sentiments based on clustering.
- **Efficiency**: KMeans is fast and works well on large datasets, which is common for text analysis tasks. Its simplicity in clustering makes it suitable for this unsupervised learning scenario where no labeled data is available.
**2. Why HDBSCAN for Clustering?**
**HDBSCAN** is chosen for density-based clustering, which is useful in this context for the following reasons:
- **Handling Noise**: In sentiment analysis, some documents may not fit neatly into either positive or negative sentiment (e.g., ambiguous or neutral content). **HDBSCAN excels in identifying and labeling these "noisy" documents** that don’t belong to any clear cluster, providing a more refined clustering approach compared to traditional methods like KMeans.
- **No Predefined Number of Clusters**: Unlike KMeans, HDBSCAN does not require specifying the number of clusters in advance. Since the sentiment data may not always follow a strict binary classification, **HDBSCAN's ability to discover clusters of varying shapes and densities is beneficial** for uncovering hidden structures in the data.
- **Better for Imbalanced Data**: The dataset might have an imbalance between positive and negative sentiments (e.g., more positive reviews). **HDBSCAN is robust to varying cluster sizes** and can identify clusters even in imbalanced datasets, making it ideal for handling cases where the sentiment distribution is skewed.
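The partitioning idea behind KMeans can be sketched in a few lines. This is a toy Lloyd's-algorithm run on 1-D values for illustration only, not the notebook's actual TF-IDF/embedding pipeline (which uses scikit-learn's `KMeans`):

```python
def kmeans_1d(points, k=2, iters=20):
    # Lloyd's algorithm: alternate between assigning points to the nearest
    # centroid and recomputing each centroid as its cluster mean
    centroids = [min(points), max(points)]  # simple spread-out init for k=2
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in points:
            nearest = min(range(k), key=lambda c: abs(x - centroids[c]))
            clusters[nearest].append(x)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

centroids, clusters = kmeans_1d([0.0, 0.2, 0.1, 5.0, 5.2, 5.1])
print([round(c, 1) for c in centroids])  # → [0.1, 5.1]
```

With k fixed at 2, the two resulting partitions play the role of the candidate "positive" and "negative" groups; which cluster corresponds to which sentiment must still be decided afterwards, e.g. by inspecting representative documents.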
**3. Why ADASYN for Data Balancing?**