-
Notifications
You must be signed in to change notification settings - Fork 1
Expand file tree
/
Copy pathIntroduction to Machine Learning
More file actions
299 lines (243 loc) · 9.38 KB
/
Introduction to Machine Learning
File metadata and controls
299 lines (243 loc) · 9.38 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
Machine Learning
------------------
Building a model from example inputs to make data-driven predictions versus following strictly static program instructions.
Application:
1. Email a spam?
2. How can cars drive themselves?
3. What will people buy?
Machine Learning
-----------------
2 categories
a. Supervised;
- Value Prediction
- Needs training data containing value being predicted, the trained model predicts value in the new model;
b. Unsupervised;
- Identify clusters of like data;
- Data does not contain cluster membership, but model provides access to data by cluster;
url -> https://www.continuum.io/downloads
Machine Learning WorkFlow:
--------------------------
An orchestrated and repeatable pattern which systematically transforms and processes information to create prediction solutions.
1. Asking the right question;
2. Preparing data;
3. Selecting the algorithm;
4. Training the model;
5. Testing the model;
------------------------------------------------------------------------------------------------------------------------
1. Asking the Right Question
-----------------------------
a. Define scope (including data sources);
- Using Pima Indian Diabetes data, predict which people will develop diabetes.
b. Define target performance;
- Using Pima Indian Diabetes data, predict with 70% or grater accuracy, which people will develop diabetes.
c. Define context for usage;
- Using Pima Indian Diabetes data, predict with 70% or greater accuracy which people are likely to develop diabetes.
d. Define how solution is created;
- Use the Machine Learning Workflow to process and transform Pima Indian data to create a prediction model. This model
must predict whih people are likely to develop diabetes with 70% or greater accuracy.
---------------------------------------------------------------------------------------------------------------------------
2. Preparing data
---------------------
a. Tidy Data
- Tidy datasets are easy to manipulate, model and visualize,and have a specific structure:
* each variable is a column;
* each observation is a row;
* each type of observational unit is a table;
** 50 - 80% of a ML project is spent getting, cleaning, and organizing data;
Data Rule #1:
---------------
- Closer the data is to what you are predicting, the better;
Data Rule #2:
--------------
- Data will never be in the format you need;
* Columns to eliminate - Not used, no values, duplicates;
* Correlated columns - Same information in different format, add little value, and cause algorithm to get confused;
* Modling Data - Adjusting data types, creating columns, if required;
* Dealing with missing data -
- Ignore it - Algorithms may fail;
- Impute it - update to "reasonable" values - Most frequent, Mean, Median, Expert reasonable value;
Data Rule #3:
----------------
Accurately predicting rare events is difficult;
Data Rule #4:
--------------
Track how to manipulate data;
-------------------------------------------------------------------------------------------------------------------------
3. Selecting the algorithm:
------------------------------
Role of the Algorithm
- fit the training set and predict on the read data;
- (fit()) training data -> Algorithm -> model;
- (predict()) real data -> Model -> result;
Over 50 algorithms
- algorithm selection
*. Compare factors;
*. Difference of opinions about which factors are important;
*. Develop your own factors;
Algorithm Decision Factors
--------------------------
i. Learning Type
ii. Result
iii. Complexity
iv. Basic vs Enhanced
i. Learning Type:
"Use the Machine Learning Workflow to process and transform Pima Indian data to create a "prediction model". This model must
predict which people are likely to develop diabetes with 70% or greater accuracy."
-> Prediction Model => Supervised machine learning;
Over 28 algorithms
ii. Result
a. Regression - constinuous vales;
b. Classification - discrete values;
"Use the Machine Learning Workflow to process and transform Pima Indian data to create a prediction model. This model must
"predict which people are likely to develop diabetes" with 70% or greater accuracy."
- Diabetes
- Binary (True/False)
- Algorithm must support classification - Binary classification;
** Over 20 algorithms;
iii. Complexity
- Keep it simple;
- Eliminate ensemble algorithms - Container algorithm; Multiple child algorithm, boost performance, Can be difficult to debug;
** Over 14 algorithm;
iv. Enhanced vs. Basic
- Enhanced - variation of basic, performance improvements, additional functionality, more complex;
- Basic - simpler, easier to understand;
Candidate Algorithms
--------------------
a. Naive Bayes;
b. Logistics Regression;
c. Decision Tree;
a. Naive Bayes - Based on likelihood and probability; every feature has same weight; requires smaller amount of data;
b. Logistic Regression - Binary result, relation between features are weighted;
c. Decision Tree - Binary tree, node contains decision, requires enough data to determine nodes and splits;
Selected algorithm - Naive Bayes
-------------------------------
Simple - easy to understand;
Fast - up to 100X faster;
Stable to data changes;
Overview
----------
Lots of algorithms available
Selected based on
- Learning = Supervised
- Result = Binary classification
- Non-ensemble
- Basic
----------------------------------------------------------------------------------------------------------------------------------
4. Training the Model
---------------------
* Machine Learning Training : Letting specific data teach a Machine Learning algorithm to create a specific prediction model;
* Training Overview
i. Prepared data - split - 70% Training data; 30% test data;
ii. Training data - algorithm (train model) - model;
iii. Test data - model - forecast output;
* Selecting Training Features
------------------------------
Want minimum features (columns)
* Selected features
i. Glucose onc;
ii. Blood pressure
iii. skin thickness;
iv. insulin level;
v. Body Mass Index;
vi. Diabetes Predisposition;
vii. Age;
Steps:
a. Split the data into training and test data sets;
b. Perform post split data preparation;
c. Train with initial algorithm;
Missing data
-------------
- Ignore
- Drop observations (rows)
- Replace values (Impute)
Using mean imputing;
-------------------------------------------------------------------------------------------------------------------------------
5. Tesing the Model's Accuracy
-------------------------------
* Statistics are only data. We define what is good or bad;
* Performance Improvement Options
-------------------------------
a. Adjust current algorithm;
b. Get more data or improve data;
c. Improve training;
d. Switch algorithms;
* Random Forest
----------------
-> Ensemble Algorithm;
-> Fits multiple trees with subsets of data;
-> Average tree results to improve performance and control overfitting;
- Train with training data: y = x1 + w2 * (x2)^3 + w3 * (x3)^8
- complex decision boundary;
- good fit of training data;
- poor fit of test data;
- Overfitting;
Fixing Overfitting
------------------
* Regularization hyperparameter
y = x1 + w2 * (x2)^3 + w3 * (x3)^8 - f(W)/(lambda)
𝑦=𝑥1+𝑤2 * 𝑥2^3+𝑤3 * 𝑥3^8−𝑓(𝑊)/𝝀
- Cross validation;
- Bias - variance trade-off;
- Sacrifice some perfection for better overall performance;
Unbalanced Classes
--------------------
More of one class than the others.
- Our Data - 65% No Diabetes, 35% Diabetes;
- Can be causing biases estimation yielding poor prediction results;
Fixing Unbalanced Classes
--------------------------
- Using the hyperparameter, to shift the predicted class boundary;
Division of the Data for better testing
---------------
- Training - Test split;
- How we can evaluate training without using Testing Data?
- the data must be split into -> Training - Validation - Testing;
- In case we don't have enough data, the training data is split in folds;
- Each fold of the training data is used for validation until all folds are examined and validated;
- For each fold, determine best hyperparameter value;
- Set model hyperparameter value to average best;
Algorithm CV Variants
----------------------
Algorithm + Cross Validation; Runs the algorithm K times;
Overview
----------
a. Evaluated Naive Bayes model
- predict()
- confusionMatrix()
b. Tried using Random Forest Algorithm
- Overfit
c. Improved performance with Logistics Regression
- Regularization;
- Achieved performance goal;
d. Logistic Regression Cross Validation
- Slightly below 70% target
- Better performance on real data;
-----------------------------------------------------------------------------------------------------------------------
Key Points
-------------
-> Machine Learning is Data driven;
-> Follow the Machine Learning WorkFlow;
* Machine Learning Workflow
----------------------------
i. Asking the right question
- Started with question;
- Used requiredments and knowledge to transform;
- Resulted in solution statement;
ii. Preparing the data
- Retrieved diabetes data;
- Cleaned data;
- Molded data;
iii. Selecting the algorithm
- Learning type
- Result type
- Complexity
-Basic vs Enhanced
iv. Training the model
- Split data - 70% / 30%;
- Trained with training data;
v. Testing the model
- Evaluated prediction;
- Selected Logistic Regression
- Achieved success
- Used Cross Validation version for better general performance;
-------------------------------------------------------------------------------------------------------------------------------