These are my notes from studying machine learning online on Udemy. Continuously updated.
Table of Contents
Part1: Data preprocessing
Part2: Regression
Section4: Simple linear regression
ordinary least squares
: In regression analysis, given the observed values (yi), we fit a regression line that produces predicted values (yihat); the line that minimizes sum((yi - yihat)^2), the sum of squared residuals, is the ordinary least squares fit.
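A quick illustration (my own toy example, not from the course): the OLS closed form for simple linear regression in NumPy, where the fitted line is the one that minimizes the sum of squared residuals.
```python
import numpy as np

# Toy data: y is roughly 2*x + 1 plus noise
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.1, 10.8])

# Closed-form OLS estimates for y = b0 + b1*x
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

y_hat = b0 + b1 * x
sse = np.sum((y - y_hat) ** 2)   # the quantity OLS minimizes
print(b0, b1, sse)
```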
Section5: Multiple linear regression
Section6: Polynomial regression
Section7: Support vector regression
71. SVR intuition
- Support Vector Regression(SVR) is a type of support vector machine that supports linear and non-linear regression
- In other words, SVR supports both linear and non-linear regression; the broader SVM family can also be used for classification problems.
- SVR is conceptually similar to linear regression: the goal is to produce a regression line such that as many observations as possible fall inside an epsilon-wide tube around it, but there is still a fundamental difference:
  - Linear regression: minimize the difference between the predicted values and the actual values
  - SVR: ensure the difference between the predicted values and the actual values does not exceed a set threshold (epsilon)
- How to run SVR (see the Python and R code in the next lecture):
  - collect the training set
  - choose the kernel for the analysis (e.g. Gaussian), the parameters, and the regularization (e.g. for noise)
  - build the correlation matrix
  - train, obtain the contraction coefficients, and predict
72 & 73. SVR in python & R
```R
# Set the env
rm(list=ls())
library(e1071)
library(ggplot2)
# Importing the dataset
setwd("~/Section 7 - Support Vector Regression (SVR)")
dataset = read.csv('Position_Salaries.csv')
dataset = dataset[2:3]
> dataset
> Level Salary
> 1 1 45000
> 2 2 50000
> 3 3 60000
> 4 4 80000
> 5 5 110000
> 6 6 150000
> 7 7 200000
> 8 8 300000
> 9 9 500000
> 10 10 1000000
# Try three regressors: svm, lm, poly
## svm
regressor = svm(formula = Salary ~ .,
                data = dataset,
                type = 'eps-regression',
                kernel = 'radial')
## lm
regressor.lm <- lm(formula = Salary ~., data = dataset)
## polynomial
regressor.poly <- lm(Salary ~ poly(Level, 2), data = dataset)
# check the performance of model
rmse <- function(error) {
  sqrt(mean(error^2))
}
paste0("RMSE in SVR / lm / poly are: ", rmse(regressor$residuals), " / ", rmse(regressor.lm$residuals), " / ", rmse(regressor.poly$residuals))
> [1] "RMSE in SVR / lm / poly are: 142425.39810616 / 163388.735192726 / 82212.1240045125"
# Visualising the SVR results
ggplot() +
  geom_point(aes(x = dataset$Level, y = dataset$Salary), colour = 'red', lwd = 1.3) +
  geom_line(aes(x = dataset$Level, y = predict(regressor, newdata = dataset)), colour = 'blue', lwd = 1.1) +
  geom_line(aes(x = dataset$Level, y = predict(regressor.lm, newdata = dataset)), colour = 'black', lwd = 1.1) +
  geom_line(aes(x = dataset$Level, y = predict(regressor.poly, newdata = dataset)), colour = 'orange', lwd = 1.1) +
  ggtitle('Truth or Bluff (SVR)') +
  xlab('Level') +
  ylab('Salary')
```
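Lectures 72 & 73 cover both languages, but only the R code is reproduced above; a minimal Python sketch of the same idea with scikit-learn's SVR might look like the block below (my own code, not the lecture's). Note that sklearn's SVR does not scale features itself, so X and y are standardized first.
```python
# SVR in Python: a rough counterpart of the R code above
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

dataset = pd.read_csv('Position_Salaries.csv')
X = dataset.iloc[:, 1:2].values   # Level
y = dataset.iloc[:, 2].values     # Salary

# SVR is sensitive to feature scale, so standardize both X and y
sc_X, sc_y = StandardScaler(), StandardScaler()
X_scaled = sc_X.fit_transform(X)
y_scaled = sc_y.fit_transform(y.reshape(-1, 1)).ravel()

regressor = SVR(kernel = 'rbf')   # same radial kernel as the R example
regressor.fit(X_scaled, y_scaled)

# Predict the salary for level 6.5 and transform back to the original scale
pred_scaled = regressor.predict(sc_X.transform([[6.5]]))
print(sc_y.inverse_transform(pred_scaled.reshape(-1, 1)))
```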
Some notes:
1. In R, the regression lines and predictions of SVR, linear regression, and polynomial regression were compared side by side (using RMSE as the metric).
2. For running statistical analyses, R is more convenient than Python.
Section8: Decision tree
75. Decision tree regression intuition
Decision tree regression is commonly referred to as `CART (Classification And Regression Trees)`.
How decision tree regression works:
- Plot the two variables (X1, X2)
- Compute the split points of the data, producing a number of leaves. Each split is chosen so that observations within a leaf are as similar as possible while the leaves differ from each other as much as possible (the course describes this via information entropy; for regression trees the split criterion is usually variance/MSE reduction, with entropy used for classification trees)
- The predicted y value of each leaf is the average of the y values of the observations that fall into it (the green numbers in the course figure)
- Each split and the data are drawn together as a tree-shaped flow chart
77. Decision tree regression in python
This uses the dataset prepared for the class (Position_Salaries.csv), shown in the course figure.
```python
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import os
# Importing the dataset
os.chdir(r'path')
dataset = pd.read_csv('Position_Salaries.csv')
X = dataset.iloc[:, 1:2].values
y = dataset.iloc[:, 2].values
# Fitting Decision Tree Regression to the dataset
from sklearn.tree import DecisionTreeRegressor
regressor = DecisionTreeRegressor(random_state = 0)
regressor.fit(X, y)
# Predicting a new result
y_pred = regressor.predict([[6.5]])
# Visualising the Decision Tree Regression results (higher resolution)
X_grid = np.arange(min(X), max(X), 0.01) # the decision tree is a non-continuous model: it maps each interval of X to a constant Y, so X must be sampled on a fine grid to draw the step-shaped curve correctly
X_grid = X_grid.reshape((len(X_grid), 1))
print(X_grid)
plt.scatter(X, y, color = 'red')
plt.plot(X_grid, regressor.predict(X_grid), color = 'blue')
plt.title('Truth or Bluff (Decision Tree Regression)')
plt.xlabel('Position level')
plt.ylabel('Salary')
plt.show()
```
78. Decision tree regression in R
```R
# Decision Tree Regression
# Importing the dataset
dataset = read.csv('Position_Salaries.csv')
dataset = dataset[2:3]
library(rpart)
regressor = rpart(formula = Salary ~ .,
                  data = dataset,
                  control = rpart.control(minsplit = 4))
# Predicting a new result with Decision Tree Regression
y_pred = predict(regressor, data.frame(Level = 6.5))
# Visualising the Decision Tree Regression results (higher resolution)
library(ggplot2)
x_grid = seq(min(dataset$Level), max(dataset$Level), 0.01)
ggplot() +
  geom_point(aes(x = dataset$Level, y = dataset$Salary),
             colour = 'red') +
  geom_line(aes(x = x_grid, y = predict(regressor, newdata = data.frame(Level = x_grid))),
            colour = 'blue') +
  ggtitle('Truth or Bluff (Decision Tree Regression)') +
  xlab('Level') +
  ylab('Salary')
# Plotting the tree
plot(regressor)
text(regressor)
```
Section9: Random forest regression
Section10: Evaluating regression models performance
Part3: Classification
Section12: Logistic regression
Section13: K-Nearest Neighbors(KNN)
Section14: Support vector machine(SVM)
Section15: Kernel SVM
Section16: Naive Bayes
Section17: Decision tree classification
Section18: Random forest classification
Section19: Evaluating classification models performance
Part4: Clustering
Section21: K-Means clustering
Section22: Hierarchical clustering
Part5: Association rule learning
Section24: Apriori
Section25: Eclat
Part6: Reinforcement learning
Section27: Upper confidence bound
Section28: Thompson sampling
Part7: Natural language processing
Part8: Deep learning
One picture to explain what deep learning is: think of the computer as a human brain. You have many inputs, and during the learning process they are repeatedly computed through the hidden layers and then passed to the output layer, where the best output is found.
Here is a passage from a TechOrange (科技報橘) article explaining what neural networks and deep learning are:
In 1956, the American cognitive psychologist Frank Rosenblatt invented a way of simulating neurons based on the theory of the neuron. Its basic element is a collection of small units called neurons. These are small computational units that can mimic the way the human brain computes. Just as we take in data from our senses, these neurons take in incoming data and learn from it, so a neural network can make decisions over time.
Section31: Artificial neural networks(ANN)
218. Plan of attack
In this section we will learn the following:
- The neuron
- Activation function
- How do neural networks work
- How do neural networks learn
- Gradient descent
- Stochastic gradient descent
- Backpropagation
219. The neuron
In the DL field, many neuroscience terms appear, because DL's neural networks mimic how human neurons work in order to perform prediction, classification, and similar tasks. The figure below is a simple diagram of a neural network.
Terminology
input: like a neuron's dendrites, receiving a stimulus and passing it on to the neuron
w: the weight
neuron: also referred to as the hidden layer (aka the black box); inside this layer, three steps complete the neuron's learning (see the sketch below):
1. multiply the inputs by the weights w and sum them up
2. pass that sum through an activation function
3. decide whether to send the result on to the output
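A minimal sketch of these three steps for a single neuron (the inputs, weights, and the choice of a sigmoid activation are made up for illustration):
```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, 0.2, 0.9])    # inputs, like signals arriving at the dendrites
w = np.array([0.4, -0.6, 0.3])   # weights

z = np.dot(w, x)                 # step 1: multiply inputs by weights and sum
a = sigmoid(z)                   # step 2: apply the activation function
print(z, a)                      # step 3: the value passed on toward the output layer
```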
220. The activation function
The four main activation functions:
- Threshold function (a yes/no step function)
- Sigmoid function (like logistic regression)
- Rectifier function (the commonly used one); there is an article explaining why the rectifier function is used so often
- Hyperbolic tangent (similar to the sigmoid, but ranging from -1 to 1)
Different activation functions can be applied in the hidden layer and in the output layer.
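A quick NumPy sketch of the four functions (my own implementations of the standard formulas):
```python
import numpy as np

def threshold(z):
    return np.where(z >= 0, 1, 0)     # yes/no step function

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # squashes values into (0, 1), like logistic regression

def rectifier(z):
    return np.maximum(0, z)           # ReLU, the commonly used one

def tanh(z):
    return np.tanh(z)                 # like the sigmoid, but ranges from -1 to 1

z = np.linspace(-3, 3, 7)
for f in (threshold, sigmoid, rectifier, tanh):
    print(f.__name__, f(z))
```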
221. How do neural networks work
Taking house price prediction as an example, the figure below shows how a neural network works.
The four input variables can be combined in different ways, and each combination can be picked up by a neuron in the hidden layer (H1 – H5); together, H1-H5 predict the house price very accurately.
Neurons in the hidden layer:
- H1: Area + Distance
- H2: Bedrooms + Distance
- H3: Area + Bedrooms + Age
- H4: Area + Bedrooms + Distance + Age
- H5: Age
# Note that the features each hidden-layer neuron picks up can be hard to interpret. A toy forward pass in this spirit is sketched below.
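A toy forward pass (all weights are invented; a zero weight means that neuron ignores that input, mimicking H1-H5 above):
```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

# One house: [Area, Bedrooms, Distance to city, Age]
x = np.array([120.0, 3.0, 5.0, 20.0])

# Hidden-layer weights: each row is one neuron, zeros mark ignored inputs
W = np.array([
    [0.8, 0.0, -0.5, 0.0],    # H1: Area + Distance
    [0.0, 0.6, -0.4, 0.0],    # H2: Bedrooms + Distance
    [0.5, 0.3, 0.0, -0.2],    # H3: Area + Bedrooms + Age
    [0.4, 0.2, -0.3, -0.1],   # H4: all four variables
    [0.0, 0.0, 0.0, -0.9],    # H5: Age
])
v = np.array([1.0, 0.8, 1.2, 1.5, 0.5])   # output-layer weights

hidden = relu(W @ x)   # each hidden neuron combines its own subset of inputs
price = v @ hidden     # the output layer combines the hidden activations
print(hidden, price)
```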
222. How do neural networks learn
perceptron: made up of the inputs, the weights, a neuron, and the prediction yhat
In a neural network, the prediction produced by the neuron differs from the actual value, which gives a cost function C (in its most basic form `C = sum(1/2(yhat – y)^2)`). The goal of learning is to minimize this cost function: after the first prediction, the cost is fed back into the network, the weights w1, w2 ... wm are adjusted, and the data is run through the neuron again. This is repeated until the cost function is minimized, at which point learning is complete. Feeding the error back through the network to update the weights in this way is called `back propagation`.
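A minimal sketch of this loop for a single linear neuron, using the cost `C = sum(1/2(yhat - y)^2)` and adjusting the weights along the gradient of C (the toy data and learning rate are my own choices):
```python
import numpy as np

# Toy data: one input feature, roughly linear target
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

w = np.zeros(1)
b = 0.0
lr = 0.05

for epoch in range(200):
    y_hat = X @ w + b                 # forward pass through the "neuron"
    error = y_hat - y
    cost = np.sum(0.5 * error ** 2)   # C = sum(1/2 * (yhat - y)^2)
    # propagate the cost back: gradients of C with respect to w and b
    grad_w = X.T @ error
    grad_b = error.sum()
    w -= lr * grad_w                  # adjust the weights
    b -= lr * grad_b

print(w, b, cost)                     # w and b approach 2 and 1; the cost shrinks toward 0
```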
223. Gradient descent
Gradient descent: plot the cost function against yhat to get a curve.
How do we find the minimum of the cost function? Essentially, after drawing the curve above, plug in a value of yhat, compute the slope of the cost curve there, and keep following the slope to check whether it has converged to the minimum cost (see the sketch below).
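A minimal sketch of following the slope downhill on a convex cost curve (the quadratic here is assumed purely for illustration):
```python
# Gradient descent on a convex cost curve C(yhat) = (yhat - 5)^2
def cost(y_hat):
    return (y_hat - 5) ** 2

def slope(y_hat):
    return 2 * (y_hat - 5)        # derivative of the cost curve

y_hat = 0.0                       # starting guess
lr = 0.1                          # step size
for step in range(50):
    y_hat -= lr * slope(y_hat)    # move downhill along the slope

print(y_hat, cost(y_hat))         # converges toward the minimum at y_hat = 5
```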
224. Stochastic gradient descent
As mentioned in the previous lecture, the best cost can be found by following the slope, but this assumes that the relationship between yhat and C is a convex function (opening upward). If it instead looks like the figure below, you may only converge to a local minimum rather than the global minimum.
Bonus: concave functions (as opposed to convex functions).
How do we get around this? That is the stochastic gradient descent covered in this lecture; by contrast, the method from the previous lecture is also called batch gradient descent. The figure below illustrates the difference between these two gradient descent methods, and a toy comparison is sketched after this paragraph.
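A toy comparison (my own data, not the course's): batch gradient descent computes the gradient over the whole training set before every weight update, while stochastic gradient descent updates after every single observation, which is what helps it escape shallow local minima.
```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(100, 1))
y = 3 * X[:, 0] + rng.normal(0, 0.1, size=100)   # true slope is 3

lr = 0.5

# Batch gradient descent: one update per pass over ALL observations
w_batch = 0.0
for epoch in range(100):
    grad = np.mean((X[:, 0] * w_batch - y) * X[:, 0])
    w_batch -= lr * grad

# Stochastic gradient descent: one update per single observation
w_sgd = 0.0
for epoch in range(100):
    for i in rng.permutation(len(y)):
        grad_i = (X[i, 0] * w_sgd - y[i]) * X[i, 0]
        w_sgd -= lr * grad_i

print(w_batch, w_sgd)   # both land near the true slope of 3, SGD with a little extra noise
```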
225. Backpropagation
229. ANN in python
First, the frameworks we will use:
- Theano
- Tensorflow
- Keras
Doing deep learning directly in Theano or TensorFlow takes quite a lot of code; Keras is more like a convenient wrapper that gets deep learning done quickly with only a few commands.
pip install theano
pip install tensorflow
pip install keras
`dummy variable` and `dummy variable trap` (when one-hot encoding a categorical variable, drop one dummy column to avoid perfect multicollinearity)
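A minimal Keras sketch of the kind of classifier built later in this section (the toy data, layer sizes, and training settings are my own illustrative choices, not the lecture's exact values):
```python
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Input

# Toy binary-classification data standing in for the course dataset
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 11))              # e.g. 11 features after dummy encoding
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# A small fully connected network: two hidden layers with the rectifier
# activation and one sigmoid output unit for the binary label
model = Sequential([
    Input(shape=(11,)),
    Dense(6, activation='relu'),
    Dense(6, activation='relu'),
    Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(X, y, batch_size=10, epochs=20, verbose=0)
print(model.predict(X[:5], verbose=0))      # predicted probabilities for the first five rows
```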
Section32: Convolutional neural networks(CNN)
Part9: Dimensionality reduction
Section34: Principal component analysis(PCA)
Section35: Linear discriminant analysis(LDA)
Section36: Kernel PCA
Part10: Model selection & boosting
Section38: Model selection
In data science, how do we evaluate a model's predictive power? The usual approach is to randomly split the data into a `training set` and a `testing set` according to some ratio (common train:test ratios are 7:3, 8:2, 9:1, etc.), build the model on the training set, and then check it against the test set (common evaluation metrics include `RMSE/MAE/MAPE, AUC`, etc.; see the linked reference).
Suppose the model built from the training set is 100% accurate on the test set. Can we really say the model is accurate? Could it simply be that this particular test set happens to look exactly like the training data? To guard against this, the statistician Seymour Geisser proposed cross validation; common variants include `Holdout`, `leave one out`, `Resubstitution`, and `k-fold`. Here we only cover the most commonly used `k-fold` cross validation.
The figure below, taken from the scikit-learn documentation, illustrates the concept clearly. All data is first split into training data (green) and test data (blue); then, according to k (here k = 5), the training data is re-partitioned k times (split 1, split 2, ...). Each re-partition trains on its sub-training folds and validates on its held-out fold, and the k evaluation scores are averaged to give the cross-validated estimate. The chosen model is finally used to predict and validate against the original test data.
Another figure shows that each iteration (k = 1, 2, ... k = 5) produces one evaluation score; averaging the k scores gives the overall estimate, from which the best model configuration is chosen.
284. K-fold cross validation in python
Now that the idea of cross validation is covered, on to the code walkthrough.
The `Social_Network_Ads.csv` file used in class has the format below. This example uses an SVM to predict whether each user will purchase the product. The dependent variable is `Purchased`, and the independent variables are `Age` and `EstimatedSalary` (for simplicity, `User ID` and `Gender` are not used).
Full code
```python
# k-Fold Cross Validation
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import os
os.chdir(r'path_you_put_data')  # change to the folder containing Social_Network_Ads.csv
# Importing the dataset
dataset = pd.read_csv('Social_Network_Ads.csv')
X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
# Fitting Kernel SVM to the Training set
from sklearn.svm import SVC
classifier = SVC(kernel = 'rbf', random_state = 0)
classifier.fit(X_train, y_train)
# Predicting the Test set results
y_pred = classifier.predict(X_test)
# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)
# Applying k-Fold Cross Validation
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10)
print('mean accuracy is: ', accuracies.mean())
print('sd is: ', accuracies.std())
```
285. K-fold cross validation in R
```R
# Importing the dataset
setwd('path')
dataset = read.csv('Social_Network_Ads.csv')
dataset = dataset[3:5]
# Encoding the target feature as factor
dataset$Purchased = factor(dataset$Purchased, levels = c(0, 1))
# Splitting the dataset into the Training set and Test set
library(caTools)
set.seed(123)
split = sample.split(dataset$Purchased, SplitRatio = 0.75) # returns a TRUE/FALSE vector over the rows of the dataset
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)
# Feature Scaling
training_set[,-3] = scale(training_set[,-3])
test_set[,-3] = scale(test_set[-3])
# training_set$EstimatedSalary = scale(training_set$EstimatedSalary)
# test_set$EstimatedSalary = scale(test_set$EstimatedSalary)
# Fitting Kernel SVM to the Training set
library(e1071)
classifier = svm(formula = Purchased ~ Age + EstimatedSalary,
                 data = training_set,
                 type = 'C-classification',
                 kernel = 'radial')
# Predicting the Test set results
y_pred = predict(classifier, newdata = test_set)
# Making the Confusion Matrix
cm = table(test_set$Purchased, y_pred)
# Applying k-Fold Cross Validation
library(caret)
folds = createFolds(training_set$Purchased, k = 10) # a list of k folds of row indices, stratified on Purchased
cv = lapply(folds, function(x) {
  training_fold = training_set[-x, ] # drop the current fold; train on the remaining folds
  test_fold = training_set[x, ]      # use the current fold as the validation set
  classifier = svm(formula = Purchased ~ Age + EstimatedSalary,
                   data = training_fold,
                   type = 'C-classification',
                   kernel = 'radial')
  y_pred = predict(classifier, newdata = test_fold)
  cm = table(test_fold$Purchased, y_pred)
  accuracy = (cm[1,1] + cm[2,2]) / (cm[1,1] + cm[2,2] + cm[1,2] + cm[2,1])
  return(accuracy)
})
accuracy = mean(as.numeric(cv))
accuracy
```