[Udemy] Course Notes – Machine Learning A-Z: Hands on Python & R in Data Science


These are my notes from studying machine learning on Udemy. They are updated continuously.

Table of contents

Part1: Data preprocessing

Part2: Regression

Section4: Simple linear regression

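ordinary least squares: in regression analysis, given the observed values (yi), fit a regression line that yields predicted values (yihat) such that sum((yi - yihat)^2), the sum of squared residuals, is as small as possible. This method is called ordinary least squares.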
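
As a minimal sketch of the definition above, the closed-form least-squares line for one feature can be computed with NumPy. The Level/Salary values are the ones from the `Position_Salaries.csv` data shown later in these notes; the variable names and the 10% slope perturbation are only illustrative assumptions.

```python
import numpy as np

# Level / Salary values from the Position_Salaries data used later in these notes
x = np.arange(1, 11, dtype=float)
y = np.array([45000, 50000, 60000, 80000, 110000,
              150000, 200000, 300000, 500000, 1000000], dtype=float)

# Closed-form OLS for a single feature: slope = cov(x, y) / var(x)
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()

# Sum of squared residuals, the quantity OLS minimises
y_hat = intercept + slope * x
sse = np.sum((y - y_hat) ** 2)

# A slightly different line (slope changed by 10%, an arbitrary choice) fits worse
sse_other = np.sum((y - (intercept + 1.1 * slope * x)) ** 2)
print(slope, intercept, sse < sse_other)
```

The square root of `sse / 10` should agree with the `lm` RMSE reported in the R comparison further down, since both describe the same straight-line fit.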

Section5: Multiple linear regression

Section6: Polynomial regression

Section7: Support vector regression

71. SVR intuition

  • Support Vector Regression(SVR) is a type of support vector machine that supports linear and non-linear regression
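  • In other words, it handles both linear and non-linear regression; support vector machines can also be used for classification.
  • SVR is conceptually close to linear regression: the goal is still a regression line, here one that keeps as many data points as possible within a margin of width epsilon around it. There is still a fundamental difference:
    • Linear regression: minimise the difference between predicted and actual values
    • SVR: keep the difference between predicted and actual values within a set threshold (epsilon)
  • How to run SVR
    • Collect a training set
    • Choose a kernel (e.g. Gaussian), its parameters, and regularisation (e.g. to account for noise)
    • Form the correlation matrix
    • Train to get the contraction coefficients, then predict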
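
The steps above can be sketched in Python with scikit-learn. This is not the course's exact script: the Level/Salary values are typed in from the data shown below, and the scaling step, `epsilon=0.1`, and the query level 6.5 are assumptions made for illustration. Scaling is included because scikit-learn's `SVR` does not scale features on its own.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Level / Salary values from the Position_Salaries data
X = np.arange(1, 11, dtype=float).reshape(-1, 1)
y = np.array([45000, 50000, 60000, 80000, 110000,
              150000, 200000, 300000, 500000, 1000000], dtype=float)

# Scale both the feature and the target before fitting
sc_X = StandardScaler()
sc_y = StandardScaler()
X_scaled = sc_X.fit_transform(X)
y_scaled = sc_y.fit_transform(y.reshape(-1, 1)).ravel()

# Gaussian (RBF) kernel; epsilon is the width of the tolerated error margin
regressor = SVR(kernel='rbf', epsilon=0.1)
regressor.fit(X_scaled, y_scaled)

# Predict for level 6.5 and map the result back to the salary scale
level = np.array([[6.5]])
pred_scaled = regressor.predict(sc_X.transform(level))
y_pred = sc_y.inverse_transform(pred_scaled.reshape(-1, 1)).ravel()
print(y_pred)
```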

72 & 73. SVR in python & R

# Set the env
rm(list=ls())
library(e1071)
library(ggplot2)

# Importing the dataset
setwd("~/Section 7 - Support Vector Regression (SVR)")
dataset = read.csv('Position_Salaries.csv')
dataset = dataset[2:3]

dataset
#    Level  Salary
# 1      1   45000
# 2      2   50000
# 3      3   60000
# 4      4   80000
# 5      5  110000
# 6      6  150000
# 7      7  200000
# 8      8  300000
# 9      9  500000
# 10    10 1000000

# Try three regressors: svm, lm, poly

## svm
regressor = svm(formula = Salary ~ .,
                data = dataset,
                type = 'eps-regression',
                kernel = 'radial')

## lm
regressor.lm <- lm(formula = Salary ~., data = dataset)

## polynomial
regressor.poly <- lm(Salary ~ poly(Level, 2), data = dataset)

# check the performance of model
rmse <- function(error) {
  sqrt(mean(error^2))
}

paste0("RMSE in SVR / lm / poly are: ", rmse(regressor$residuals), " / ", rmse(regressor.lm$residuals), " / ", rmse(regressor.poly$residuals))

# [1] "RMSE in SVR / lm / poly are: 142425.39810616 / 163388.735192726 / 82212.1240045125"

# Visualising the SVR results
ggplot() +
  geom_point(aes(x = dataset$Level, y = dataset$Salary), colour = 'red', lwd = 1.3) +
  geom_line(aes(x = dataset$Level, y = predict(regressor, newdata = dataset)), colour = 'blue', lwd = 1.1) +
  geom_line(aes(x = dataset$Level, y = predict(regressor.lm, newdata = dataset)), colour = 'black', lwd = 1.1) + 
  geom_line(aes(x = dataset$Level, y = predict(regressor.poly, newdata = dataset)), colour = 'orange', lwd = 1.1) + 
  ggtitle('Truth or Bluff (SVR)') +
  xlab('Level') +
  ylab('Salary')

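Some notes
1. In R, I compared the regression lines and the fit of SVR, linear regression, and polynomial regression side by side (using RMSE on the training residuals as the measure).
2. For statistical analysis, R is more convenient than Python.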

Section8: Decision tree

75. Decision tree regression intuition

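Commonly known as `CART (Classification And Regression Tree)`.

How a regression decision tree works

  • Plot the two variables (X1, X2)
  • Using information entropy, compute the split points that divide the data into leaves. A good split makes the points within a leaf as similar as possible and the leaves as different from each other as possible
  • The predicted y for a leaf is the mean of the y values of the points that fall in it
  • Draw the split points and the data as a tree-shaped flowchart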
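
A rough sketch of how one split could be chosen follows. Note the swap: the notes mention information entropy, while this sketch scores a split by the squared error within each leaf, which is also the default criterion of scikit-learn's `DecisionTreeRegressor`. The function name `best_split` and the toy data are hypothetical.

```python
import numpy as np

def best_split(x, y):
    """Return (threshold, sse) for the split with the lowest within-leaf squared error."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    best_threshold, best_sse = None, np.inf
    for i in range(1, len(x)):
        if x[i] == x[i - 1]:
            continue  # no threshold can separate identical x values
        left, right = y[:i], y[i:]
        sse = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        if sse < best_sse:
            # Place the threshold halfway between the two neighbouring points
            best_threshold, best_sse = (x[i - 1] + x[i]) / 2, sse
    return best_threshold, best_sse

# Hypothetical toy data with two obvious groups
x = np.array([1, 2, 3, 10, 11, 12], dtype=float)
y = np.array([5, 6, 7, 50, 52, 54], dtype=float)

threshold, sse = best_split(x, y)

# Each leaf predicts the mean y of the points that fall in it
left_mean = y[x <= threshold].mean()
right_mean = y[x > threshold].mean()
print(threshold, left_mean, right_mean)
```

A full tree repeats this search inside each leaf until a stopping rule (for example a minimum number of points per leaf) is met.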

77. Decision tree regression in python

This uses the dataset provided with the course.
[Figure: the course dataset]

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import os

# Importing the dataset
os.chdir(r'path')
dataset = pd.read_csv('Position_Salaries.csv')
X = dataset.iloc[:, 1:2].values
y = dataset.iloc[:, 2].values

# Fitting Decision Tree Regression to the dataset
from sklearn.tree import DecisionTreeRegressor
regressor = DecisionTreeRegressor(random_state = 0)
regressor.fit(X, y)

# Predicting a new result
y_pred = regressor.predict([[6.5]])

# Visualising the Decision Tree Regression results (higher resolution)
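# A decision tree is a non-continuous model: it maps each interval of X to one constant y,
# so X must be sampled on a fine grid for the step-shaped prediction to be drawn correctly.
X_grid = np.arange(X.min(), X.max(), 0.01)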
X_grid = X_grid.reshape((len(X_grid), 1))
print(X_grid)
plt.scatter(X, y, color = 'red')
plt.plot(X_grid, regressor.predict(X_grid), color = 'blue')
plt.title('Truth or Bluff (Decision Tree Regression)')
plt.xlabel('Position level')
plt.ylabel('Salary')
plt.show()

78. Decision tree regression in R

# Decision Tree Regression

# Importing the dataset
dataset = read.csv('Position_Salaries.csv')
dataset = dataset[2:3]

library(rpart)
regressor = rpart(formula = Salary ~ .,
                  data = dataset,
                  control = rpart.control(minsplit = 4))

# Predicting a new result with Decision Tree Regression
y_pred = predict(regressor, data.frame(Level = 6.5))

# Visualising the Decision Tree Regression results (higher resolution)
library(ggplot2)
x_grid = seq(min(dataset$Level), max(dataset$Level), 0.01)
ggplot() +
  geom_point(aes(x = dataset$Level, y = dataset$Salary),
             colour = 'red') +
  geom_line(aes(x = x_grid, y = predict(regressor, newdata = data.frame(Level = x_grid))),
            colour = 'blue') +
  ggtitle('Truth or Bluff (Decision Tree Regression)') +
  xlab('Level') +
  ylab('Salary')

# Plotting the tree
plot(regressor)
text(regressor)

Section9: Random forest regression

Section10: Evaluating regression models performance

Part3: Classification

Section12: Logistic regression

Section13: K-Nearest Neighbors(KNN)

Section14: Support vector machine(SVM)

Section15: Kernel SVM

Section16: Naive Bayes

Section17: Decision tree classification

Section18: Random forest classification

Section19: Evaluating classification models performance

Part4: Clustering

Section21: K-Means clustering

Section22: Hierarchical clustering

Part5: Association rule learning

Section24: Apriori

Section25: Eclat

Part6: Reinforcement learning

Section27: Upper confidence bound

Section28: Thompson sampling

Part7: Natural language processing

Part8: Deep learning

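Deep learning in one picture: think of the computer as a human brain. Many inputs go in, the hidden layers repeatedly transform them during learning, and the result is passed to the output layer to produce the best output.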
Machine learning model

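Here is a passage from a TechOrange (科技報橘) article explaining what neural networks and deep learning are:

In 1956, the American cognitive psychologist Frank Rosenblatt invented a way to simulate neurons, based on the theory of neurons. At its core is a collection of small units called neurons. Each is a small computing unit, but together they can mimic the way the human brain computes. Just as we take in data through our senses, these neurons take in incoming data and learn from it, so the neural network can make decisions over time.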

Section31: Artificial neural networks(ANN)

218. Plan of attack

This section covers the following:

  1. The neuron
  2. Activation function
  3. How do neural networks work
  4. How do neural networks learn
  5. Gradient descent
  6. Stochastic gradient descent
  7. Backpropagation

219. The neuron

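Deep learning borrows many terms from neuroscience, because a neural network imitates the way human neurons work in order to predict or classify data.

[Figure: simple schematic of a neural network]

Terminology

input: like the dendrites of a neuron; it receives a stimulus and passes it to the neuron
w: weight
neuron: the unit that makes up a hidden layer (aka the black box); it performs three steps to do its part of the learning
  1. Multiply each input by its weight w and sum the results
  2. Pass the sum through an activation function
  3. Decide whether to pass the result on to the output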
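
The three steps can be sketched as a single function. The input values, weights, and the choice of a sigmoid activation are assumptions picked so that the result is easy to check by hand.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron(inputs, weights, activation):
    # Step 1: weighted sum of the inputs
    z = np.dot(inputs, weights)
    # Steps 2 and 3: the activation function decides what is passed on
    return activation(z)

# Hypothetical inputs and weights; the weighted sum is 0.5 - 0.5 + 0 = 0
inputs = np.array([1.0, 2.0, 3.0])
weights = np.array([0.5, -0.25, 0.0])

output = neuron(inputs, weights, sigmoid)
print(output)  # sigmoid(0) = 0.5
```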

220. The activation function

The four main activation functions

  1. Threshold function (like a yes/no function)

  2. Sigmoid function (as in logistic regression)

  3. Rectifier function (the most commonly used)

     An article linked here explains why the rectifier function is so commonly used.

  4. Hyperbolic tangent (similar to sigmoid, but ranging from -1 to 1)

The hidden layer and the output layer can each use a different activation function.
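
For reference, the four functions listed above can be written in a few lines of NumPy. Treating an input of exactly 0 as "yes" in the threshold function is an assumption; conventions differ.

```python
import numpy as np

def threshold(z):
    # 1 when the input is non-negative, otherwise 0
    return np.where(z >= 0, 1.0, 0.0)

def sigmoid(z):
    # Smooth curve between 0 and 1
    return 1.0 / (1.0 + np.exp(-z))

def rectifier(z):
    # 0 for negative inputs, identity for positive inputs
    return np.maximum(0.0, z)

def hyperbolic_tangent(z):
    # Smooth curve between -1 and 1
    return np.tanh(z)

z = np.array([-2.0, 0.0, 2.0])
for fn in (threshold, sigmoid, rectifier, hyperbolic_tangent):
    print(fn.__name__, fn(z))
```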

221. How do neural networks work

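Taking house-price prediction as an example:

[Figure: a neural network predicting house prices]

The four input variables can be combined in different ways, and each neuron in the hidden layer (H1 – H5) picks up its own combination. Together, H1-H5 are used to predict the house price accurately.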

Neuron in hidden layer


H1: Area + Distance
H2: Bedrooms + Distance
H3: Area + Bedrooms + Age
H4: Area + Bedrooms + Distance + Age
H5: Age

# Note that the feature captured by each hidden-layer neuron can be hard to interpret
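
A forward pass through such a network might look like the sketch below. The mask encodes which inputs each of H1 to H5 uses, following the list above; the input values, the random weights, and the rectifier activation are all assumptions. In a trained network the zero weights would be learned, not fixed in advance.

```python
import numpy as np

# Input order: Area, Bedrooms, Distance, Age (hypothetical scaled values)
x = np.array([0.8, 0.5, 0.2, 0.1])

# Which inputs each hidden neuron uses, one row per neuron
mask = np.array([
    [1, 0, 1, 0],  # H1: Area + Distance
    [0, 1, 1, 0],  # H2: Bedrooms + Distance
    [1, 1, 0, 1],  # H3: Area + Bedrooms + Age
    [1, 1, 1, 1],  # H4: Area + Bedrooms + Distance + Age
    [0, 0, 0, 1],  # H5: Age
])

rng = np.random.default_rng(0)
W_hidden = rng.normal(size=(5, 4)) * mask  # unused inputs get weight 0
w_out = rng.normal(size=5)

# Hidden layer: weighted sum followed by the rectifier activation
hidden = np.maximum(0.0, W_hidden @ x)

# Output layer: weighted sum of the hidden activations gives the price estimate
price = float(w_out @ hidden)
print(hidden, price)
```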

222. How do neural network learn

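perceptron: made up of inputs, weights, a neuron, and yhat

The prediction computed by the neuron differs from the actual value, and that gap is measured by a cost function (C) [the most basic form: `sum(1/2(yhat – y)^2)`]. The goal of learning is to minimise the cost function. After the first prediction, the cost is fed back to the neuron, the weights w1, w2 … wm are adjusted, and the inputs are passed through the neuron again. This repeats until the cost function reaches its minimum, at which point learning is complete. This repeated process of minimising the cost function is called `back propagation`.

[Figures: step 1 and step 2 of the learning loop]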

223. Gradient descent

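Gradient descent: plot the cost function against yhat to get a curve.
[Figure: cost function curve]

How do we find the minimum of the cost function? Start from some point on the curve and compute the slope of the cost function there. The sign of the slope tells us which way is downhill; step in that direction and repeat until the slope is close to zero, which means we have converged to the minimum of the cost function.
[Figure: stepping down the cost curve]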
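
The procedure can be sketched for a model with a single weight, yhat = w * x, using the cost function from the previous lecture. The notes plot the cost against yhat; descending on the weight w instead, the toy data, and the learning rate are assumptions of this sketch.

```python
import numpy as np

# Hypothetical data generated with a true weight of 2
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x

def cost(w):
    # C = sum(1/2 * (yhat - y)^2) with yhat = w * x
    return np.sum(0.5 * (w * x - y) ** 2)

def gradient(w):
    # Slope of the cost with respect to w
    return np.sum((w * x - y) * x)

w = 0.0
learning_rate = 0.01
history = []
for step in range(200):
    history.append(cost(w))
    # Move against the slope: downhill on the cost curve
    w -= learning_rate * gradient(w)

print(w, history[0], history[-1])
```

If the learning rate is too large the steps overshoot the minimum and the cost grows instead of shrinking, so it usually needs some tuning.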

224. Stochastic gradient descent

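The previous lecture showed that the minimum of the cost function can be found by following the slope, but this rests on one condition: the relation between yhat and C must be a convex function (opening upward). In a case like the figure below, gradient descent may only find a local minimum rather than the global minimum.

See also: concave function (concave and convex functions)

[Figure: a non-convex cost curve]

How is this solved? With stochastic gradient descent, the topic of this lecture. By contrast, the method from the previous lecture is also called batch gradient descent. The figure below shows the difference between the two gradient descent methods.
[Figure: batch vs. stochastic gradient descent]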
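
The difference in mechanics can be sketched as follows: batch gradient descent updates the weight once per pass over all rows, while stochastic gradient descent updates it after every single row. The toy data, learning rate, and number of epochs are assumptions, and because this toy cost is convex the sketch only shows the update pattern, not the escape from a local minimum.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data generated with a true weight of 3
x = rng.uniform(-1.0, 1.0, size=100)
y = 3.0 * x

def batch_gd(w, lr=0.5, epochs=100):
    for _ in range(epochs):
        # One update per epoch, using the gradient averaged over all rows
        grad = np.mean((w * x - y) * x)
        w -= lr * grad
    return w

def stochastic_gd(w, lr=0.5, epochs=100):
    for _ in range(epochs):
        # One update per row, visiting the rows in random order
        for i in rng.permutation(len(x)):
            w -= lr * (w * x[i] - y[i]) * x[i]
    return w

w_batch = batch_gd(0.0)
w_sgd = stochastic_gd(0.0)
print(w_batch, w_sgd)
```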

225. Backpropagation

Reference link

229. ANN in python

First, the frameworks that will be used

  1. Theano
  2. Tensorflow
  3. Keras

Deep learning with Theano or Tensorflow directly takes more code. Keras is more of a shortcut: it gets a model running with only a few commands.

pip install theano
pip install tensorflow
pip install keras

`dummy variable` and the `dummy variable trap`
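
As a small illustration of the two terms, a categorical column can be turned into dummy variables with pandas. The `Country` column and its values are hypothetical. Keeping every dummy column makes them sum to 1 in each row, which is perfectly collinear with an intercept (the dummy variable trap); dropping one column avoids it.

```python
import pandas as pd

# Hypothetical categorical feature
df = pd.DataFrame({"Country": ["France", "Spain", "Germany", "Spain"]})

# One dummy column per category: the columns always sum to 1 in every row
all_dummies = pd.get_dummies(df["Country"])

# Dropping the first category removes the redundancy
safe_dummies = pd.get_dummies(df["Country"], drop_first=True)

print(all_dummies)
print(safe_dummies)
```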

Section32: Convolutional neural networks(CNN)

Part9: Dimensionality reduction

Section34: Principal component analysis(PCA)

Section35: Linear discriminant analysis(LDA)

Section36: Kernel PCA

Part10: Model selection & boosting

Section38: Model selection

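In data science, how do we evaluate a model's predictive power? The usual approach is to split the data at random into a `training set` and a `testing set` (common `training:testing` ratios are 7:3, 8:2, 9:1, etc.), fit the model on the training set, then compare its predictions with the testing set to measure performance (common metrics include `RMSE/MAE/MAPE, AUC`, etc.; see also the linked reference).

Suppose a model built on the training set reaches 100% accuracy on the testing set. Can we really call it accurate? Could this particular testing set just happen to look exactly like the training data? To guard against this, the statistician Seymour Geisser proposed cross validation. Common variants include `Holdout`, `leave one out`, `Resubstitution`, and `k-fold`; only the most widely used, `k-fold` cross validation, is covered here.

The figure below, taken from the official scikit-learn documentation, shows the idea clearly. All data is first split into training data (green) and test data (blue). The training data is then split into k folds (k=5 here) and the procedure runs k times (split 1, split 2 …); in each split, the model is fit on k-1 folds and validated on the remaining one. The k evaluation scores are averaged to give the cross-validated estimate of performance, and the model is finally checked against the test data set aside at the start.
[Figure: k-fold cross validation diagram from the scikit-learn documentation]

Another figure shows that each iteration (k=1, k=2, … k=5) produces one evaluation score. The k scores are averaged into a single score, which is then used to choose the best model.
[Figure: one evaluation score per fold, averaged at the end]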
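
To make the loop explicit, here is a minimal sketch that runs 5-fold cross validation by hand with scikit-learn's `KFold`. The synthetic data from `make_classification` stands in for a real training set, and the RBF-kernel `SVC` is chosen only to match the example that follows.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.svm import SVC

# Synthetic stand-in for a training set: 200 rows, 2 features, 2 classes
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
scores = []
fold_sizes = []
for train_idx, val_idx in kfold.split(X):
    # Fit on k-1 folds, validate on the fold that was held out
    model = SVC(kernel="rbf")
    model.fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[val_idx], y[val_idx]))
    fold_sizes.append(len(val_idx))

# The averaged score is the cross-validated estimate of accuracy
mean_score = np.mean(scores)
print(scores, mean_score)
```

`cross_val_score`, used in the full code below, wraps this same loop in one call.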

284. K-fold cross validation in python

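With the concept covered, on to the code.
The `Social_Network_Ads.csv` file used in class has the format shown below. The example uses an SVM to predict whether each user ID will purchase the product. The dependent variable is `Purchased`; the independent variables are `Age` and `EstimatedSalary` (for simplicity, `Gender` is not used in this example).

[Figure: format of `Social_Network_Ads.csv`]

Full code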

# k-Fold Cross Validation

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import os
os.chdir(r'path_you_put_data')

# Importing the dataset
dataset = pd.read_csv('Social_Network_Ads.csv')
X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Fitting Kernel SVM to the Training set
from sklearn.svm import SVC
classifier = SVC(kernel = 'rbf', random_state = 0)
classifier.fit(X_train, y_train)

# Predicting the Test set results
y_pred = classifier.predict(X_test)

# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Applying k-Fold Cross Validation
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10)
print('mean accuracy is: ', accuracies.mean())
print('sd is: ', accuracies.std())

285. K-fold cross validation in R

# Importing the dataset
setwd('path')
dataset = read.csv('Social_Network_Ads.csv')
dataset = dataset[3:5]

# Encoding the target feature as factor
dataset$Purchased = factor(dataset$Purchased, levels = c(0, 1))

# Splitting the dataset into the Training set and Test set
library(caTools)
set.seed(123)
split = sample.split(dataset$Purchased, SplitRatio = 0.75) # create a T&F vector based on the rowno of dataset
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)

# Feature Scaling
training_set[,-3] = scale(training_set[,-3])
test_set[,-3] = scale(test_set[-3])

# training_set$EstimatedSalary = scale(training_set$EstimatedSalary)
# test_set$EstimatedSalary = scale(test_set$EstimatedSalary)

# Fitting Kernel SVM to the Training set
library(e1071)
classifier = svm(formula = Purchased ~ Age + EstimatedSalary,
                 data = training_set,
                 type = 'C-classification',
                 kernel = 'radial')

# Predicting the Test set results
y_pred = predict(classifier, newdata = test_set)

# Making the Confusion Matrix
cm = table(test_set$Purchased, y_pred)

# Applying k-Fold Cross Validation
library(caret)
folds = createFolds(training_set$Purchased, k = 10) # list of k folds, each holding row indices of training_set
cv = lapply(folds, function(x) {
  training_fold = training_set[-x, ] # drop the current fold; the other k-1 folds form the training fold
  test_fold = training_set[x, ] # the current fold is held out as the test fold
  classifier = svm(formula = Purchased ~ Age + EstimatedSalary,
                   data = training_fold,
                   type = 'C-classification',
                   kernel = 'radial')
  y_pred = predict(classifier, newdata = test_fold)
  cm = table(test_fold$Purchased, y_pred)
  accuracy = (cm[1,1] + cm[2,2]) / (cm[1,1] + cm[2,2] + cm[1,2] + cm[2,1])
  return(accuracy)
})
accuracy = mean(as.numeric(cv))
accuracy

286. Grid search in python 1 & 2

288. Grid search in R

Section39: XGBoost

Section40: Bonus lectures
