Proprietary + Confidential
Can You Do Machine Learning with Just SQL?
An Introduction to BigQuery ML
Aaron Lee
aaronlee@mitac.com.tw
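Aaron Lee (李東霖)
Current positions
● Google Solutions Consultant, MiTAC Information Technology (神通資訊科技)
● Product Manager for Qlik and Sophos
Credentials
● Google Apps certification
● Google Cloud Platform architect
● PMP (Project Management Professional)
● SAP MM consultant
● Oracle OCP certification
Speaking and teaching experience
Project management professionals association (專案管理師協會), Providence University, 前川科技, 毅太科技, Water Resources Agency, National Defense University, E.SUN Bank, MiTAC Information Technology, Toastmasters International, Taoyuan City tax bureau (桃園巿稅務局), TAITRA, 亞東氣體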
Why BigQuery ML?
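With SQL alone you can create and run ML models and make predictions. SQL users can speed up development with the tools they already have, without moving data and without spending time building TensorFlow models, which makes machine learning accessible to more people.
BigQuery ML is now GA!
And the result...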
Objectives
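● Build a model from sample data that predicts whether an e-commerce visitor places an order
● Use the CREATE MODEL statement to create a binary logistic regression model (yes/no)
● Use the ML.EVALUATE function to evaluate the ML model
● Use the ML.PREDICT function to make predictions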
Always free usage limits
Resource: Storage
  Monthly free usage limit: The first 10 GB per month is free.
  Details: BigQuery ML models and training data stored in BigQuery are included in the storage free tier.
Resource: Queries (analysis)
  Monthly free usage limit: The first 1 TB of query data processed per month is free.
  Details: Queries that use BigQuery ML prediction, inspection, and evaluation functions are included in the analysis free tier. BigQuery ML queries that contain CREATE MODEL statements are not. Flat-rate pricing is also available for high-volume customers that prefer a stable, monthly cost.
Resource: BigQuery ML CREATE MODEL queries
  Monthly free usage limit: The first 10 GB of data processed by queries that contain CREATE MODEL statements per month is free.
  Details: BigQuery ML CREATE MODEL queries are independent of the BigQuery analysis free tier.
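The two free tiers are tracked separately, so usage has to be compared against each allowance on its own. A minimal sketch of that arithmetic follows; the monthly usage figures are hypothetical, and no prices are applied because the table above only states the free allowances.

```python
# Free-tier allowances quoted in the table above.
ANALYSIS_FREE_TB = 1.0        # prediction, inspection, and evaluation queries
CREATE_MODEL_FREE_GB = 10.0   # queries that contain CREATE MODEL


def billable_amount(processed: float, free_allowance: float) -> float:
    """Return the part of the monthly usage that exceeds the free allowance."""
    return max(0.0, processed - free_allowance)


# Hypothetical monthly usage, for illustration only.
create_model_gb = 25.0   # GB processed by CREATE MODEL statements
analysis_tb = 0.4        # TB processed by ML.EVALUATE / ML.PREDICT queries

print(billable_amount(create_model_gb, CREATE_MODEL_FREE_GB))  # GB beyond the free tier
print(billable_amount(analysis_tb, ANALYSIS_FREE_TB))          # TB beyond the free tier
```

Under these assumed figures, only the CREATE MODEL usage exceeds its allowance; the analysis queries stay inside theirs.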
US pricing, but...
Taiwan pricing
Source data: e-commerce visitors and whether they placed an order
1. Create the dataset "bqml_tutorial" (using the new UI)
For the location, choose United States
On the Create dataset page:
● For Dataset ID, enter bqml_tutorial.
● For Data location, choose United States (US). Currently, the public datasets are stored in the US multi-region location. For simplicity, you should place your dataset in the same location.
● Leave all of the other default settings in place and click Create dataset.
2. Create the model
#standardSQL
CREATE MODEL `bqml_tutorial.sample_model`
OPTIONS(model_type='logistic_reg') AS
SELECT
  -- label: 1 if the session had at least one transaction, otherwise 0
  IF(totals.transactions IS NULL, 0, 1) AS label,
  -- features: missing strings become "", missing pageviews become 0
  IFNULL(device.operatingSystem, "") AS os,
  device.isMobile AS is_mobile,
  IFNULL(geoNetwork.country, "") AS country,
  IFNULL(totals.pageviews, 0) AS pageviews
FROM
  `bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE
  -- training window
  _TABLE_SUFFIX BETWEEN '20160801' AND '20170630'
This takes a while...
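To make the SELECT list of the CREATE MODEL query concrete, here is a rough Python sketch of how one session record is turned into a label plus four features. The nested dict layout and the sample sessions are assumptions for illustration; only the field names and the IF / IFNULL logic come from the query.

```python
def if_null(value, default):
    """Mirror SQL IFNULL: use the default only when the value is missing."""
    return default if value is None else value


def to_training_row(session: dict) -> dict:
    """Build the label and features that the CREATE MODEL query selects."""
    totals = session.get("totals", {})
    device = session.get("device", {})
    geo = session.get("geoNetwork", {})
    return {
        # IF(totals.transactions IS NULL, 0, 1) AS label
        "label": 0 if totals.get("transactions") is None else 1,
        "os": if_null(device.get("operatingSystem"), ""),
        "is_mobile": device.get("isMobile"),
        "country": if_null(geo.get("country"), ""),
        "pageviews": if_null(totals.get("pageviews"), 0),
    }


# Hypothetical sessions: one with a purchase, one with missing fields.
buyer = {
    "totals": {"transactions": 2, "pageviews": 14},
    "device": {"operatingSystem": "Android", "isMobile": True},
    "geoNetwork": {"country": "Taiwan"},
}
browser = {
    "totals": {"transactions": None, "pageviews": None},
    "device": {"operatingSystem": None, "isMobile": False},
    "geoNetwork": {"country": None},
}

print(to_training_row(buyer))
print(to_training_row(browser))
```

The label only records whether any transaction happened, not how many, which is what makes this a yes/no classification problem.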
Model types available in BigQuery ML
● Linear regression: linear_reg
● Binary logistic regression: logistic_reg
● Multiclass logistic regression: logistic_reg
● K-means clustering: kmeans
https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-create
3. Get the training results
4. Evaluate your model
#standardSQL
SELECT * FROM
  ML.EVALUATE(MODEL `bqml_tutorial.sample_model`, (
    SELECT
      IF(totals.transactions IS NULL, 0, 1) AS label,
      IFNULL(device.operatingSystem, "") AS os,
      device.isMobile AS is_mobile,
      IFNULL(geoNetwork.country, "") AS country,
      IFNULL(totals.pageviews, 0) AS pageviews
    FROM
      `bigquery-public-data.google_analytics_sample.ga_sessions_*`
    WHERE
      -- evaluation window: dates that were not used for training
      _TABLE_SUFFIX BETWEEN '20170701' AND '20170801'))
Data set
http://www.cs.nthu.edu.tw/~shwu/courses/ml/labs/08_CV_Ensembling/fig-holdout.png
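The training and evaluation queries split the data by date: the _TABLE_SUFFIX filter sends each daily table to one side of a holdout split. A small sketch of that rule, assuming suffixes are 'YYYYMMDD' strings so that plain string comparison behaves like SQL BETWEEN; the function names are illustrative.

```python
# Date windows taken from the two queries in this deck.
TRAIN_RANGE = ("20160801", "20170630")
EVAL_RANGE = ("20170701", "20170801")


def between(value: str, low: str, high: str) -> bool:
    """Inclusive range check, like SQL BETWEEN on fixed-width date strings."""
    return low <= value <= high


def split_for(table_suffix: str) -> str:
    """Say which side of the holdout split a daily table falls on."""
    if between(table_suffix, *TRAIN_RANGE):
        return "train"
    if between(table_suffix, *EVAL_RANGE):
        return "eval"
    return "unused"


for suffix in ("20160801", "20170630", "20170701", "20170801"):
    print(suffix, split_for(suffix))
```

Because the two windows do not overlap, the model is evaluated only on sessions it never saw during training.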
Evaluation results
When the query is complete, click the Results tab below the query text area. The results should look like the following:
Column descriptions
Because you performed a logistic regression, the results include the following columns:
● precision: A metric for classification models. Precision identifies the frequency with which a model was correct when predicting the positive class.
● recall: A metric for classification models that answers the following question: Out of all the possible positive labels, how many did the model correctly identify?
● accuracy: Accuracy is the fraction of predictions that a classification model got right.
● f1_score: A measure of the accuracy of the model. The f1 score is the harmonic average of the precision and recall. An f1 score's best value is 1. The worst value is 0.
● log_loss: The loss function used in a logistic regression. This is the measure of how far the model's predictions are from the correct labels.
● roc_auc: The area under the ROC curve. This is the probability that a classifier is more confident that a randomly chosen positive example is actually positive than that a randomly chosen negative example is positive. For more information, see Classification in the Machine Learning Crash Course.
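Column descriptions: formulas (1)
Because you performed a logistic regression, the results include the following columns:
● precision: TP / (TP + FP), the fraction of individuals judged positive that really are positive
● recall: TP / (TP + FN), the fraction of all actually positive individuals that are correctly judged positive; for example, of the people who placed an order, the fraction correctly predicted to order
● accuracy: (TP + TN) / (TP + TN + FP + FN), the fraction of all individuals that are judged correctly. (TN / (TN + FP), the fraction of actually negative individuals correctly judged negative, is specificity, a different metric.)
● f1_score: the harmonic mean of precision and recall, 2 × precision × recall / (precision + recall)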
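The four metrics can be computed directly from the confusion-matrix counts. The sketch below uses the standard definitions (accuracy as the share of all predictions that are correct) and returns 0.0 when a denominator is zero, which is a convention chosen here rather than something the slides specify. The counts in the example are hypothetical.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute precision, recall, accuracy, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    # Assumption: report 0.0 instead of raising when a denominator is zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1_score": f1}


# Hypothetical counts for 1,000 sessions.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=930)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these counts, precision is 40/50 and recall is 40/60, and F1 falls between the two, closer to the smaller one, as expected of a harmonic mean.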
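Confusion matrix
https://www.ycc.idv.tw/confusion-matrix.html
Confusion matrix example: AIDS test prediction
Columns are the true condition (True: has AIDS, False: does not have AIDS); rows are the predicted outcome (Yes or No).
● Predicted Yes, true condition True (has AIDS, test detects it): True Positive (TP)
● Predicted Yes, true condition False (no AIDS, test reports AIDS): False Positive (FP)
● Predicted No, true condition True (has AIDS, test misses it): False Negative (FN)
● Predicted No, true condition False (no AIDS, test reports none): True Negative (TN)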
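The four cells of the matrix are simply counts of (actual, predicted) pairs. A minimal sketch, assuming labels are encoded as 1 for positive and 0 for negative; the sample lists are made up for illustration.

```python
from collections import Counter

# Map each (actual, predicted) pair to its confusion-matrix cell.
CELL_NAMES = {(1, 1): "TP", (0, 1): "FP", (1, 0): "FN", (0, 0): "TN"}


def confusion_counts(actual: list, predicted: list) -> dict:
    """Count how many examples fall into each confusion-matrix cell."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    counts = Counter(CELL_NAMES[(a, p)] for a, p in zip(actual, predicted))
    return {name: counts.get(name, 0) for name in ("TP", "FP", "FN", "TN")}


# Hypothetical labels for six examples.
actual = [1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0]
print(confusion_counts(actual, predicted))
```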
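Confusion matrix example: AIDS test prediction
Assume 10,000 people are tested and the model predicts that nobody has AIDS.
True condition: 100 people have AIDS (True); 9,900 people do not (False).
● Predicted Yes, true condition True (has AIDS, test detects it): True Positive (TP), 0 people
● Predicted Yes, true condition False (no AIDS, test reports AIDS): False Positive (FP), 0 people
● Predicted No, true condition True (has AIDS, test misses it): False Negative (FN), 100 people
● Predicted No, true condition False (no AIDS, test reports none): True Negative (TN), 9,900 people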
Calculated results
● accuracy: (0 + 9900) / (0 + 0 + 100 + 9900) = 99%
● precision: 0 / (0 + 0), which is undefined and usually reported as 0 ⇒ the accuracy paradox
● recall: 0 / (0 + 100) = 0
● High accuracy is useless here; what matters is detecting the people who actually have AIDS.
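The all-negative model in this example can be checked directly from label lists. The sketch below rebuilds the 10,000-person scenario (100 positives, 9,900 negatives, every prediction negative) and shows the accuracy paradox: accuracy is high while recall is zero. The 1/0 encoding is an assumption for illustration.

```python
# 100 people actually have the disease (1); 9,900 do not (0).
actual = [1] * 100 + [0] * 9900
# The "model" predicts that nobody has it.
predicted = [0] * len(actual)

# Accuracy: share of all predictions that match the true condition.
accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)

# Recall: share of actual positives that the model caught.
true_positives = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
recall = true_positives / sum(actual)

# Precision needs at least one positive prediction; here there are none.
predicted_positives = sum(predicted)

print(f"accuracy = {accuracy:.2%}")
print(f"recall = {recall:.2%}")
print(f"positive predictions = {predicted_positives} (precision is undefined)")
```

This is why recall and precision are worth reading alongside accuracy whenever the positive class is rare, as purchases are among site visits.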
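The confusion matrix applied to this example: does the user place an order?
Columns are the true condition (True: really placed an order, False: did not); rows are the predicted outcome (Yes: model predicts an order, No: model predicts no order).
● Predicted Yes, true condition True (will order, model predicts an order): True Positive (TP)
● Predicted Yes, true condition False (will not order, model predicts an order): False Positive (FP)
● Predicted No, true condition True (will order, model predicts no order): False Negative (FN)
● Predicted No, true condition False (will not order, model predicts no order): True Negative (TN)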
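Column descriptions: formulas (1)
Because you performed a logistic regression, the results include the following columns:
● precision: TP / (TP + FP), the fraction of individuals judged positive that really are positive
● recall: TP / (TP + FN), the fraction of all actually positive individuals that are correctly judged positive; for example, of the people who placed an order, the fraction correctly predicted to order
● accuracy: (TP + TN) / (TP + TN + FP + FN), the fraction of all individuals that are judged correctly
● f1_score: the harmonic mean of precision and recall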
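Column descriptions: formulas (2)
● log_loss: The loss function used in a logistic regression. This is the measure of how far the model's predictions are from the correct labels. In other words, it reflects how close the predictions are to the real data; lower values mean closer.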
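As a rough illustration of what log loss measures, the sketch below implements the usual binary cross-entropy: the average of -log(probability assigned to the correct label). This is the textbook definition, assumed here to correspond to the log_loss column; the probabilities are made up, and the small clipping constant is an implementation choice to avoid log(0).

```python
import math


def log_loss(labels: list, probabilities: list, eps: float = 1e-15) -> float:
    """Average binary cross-entropy between 0/1 labels and predicted P(label = 1)."""
    total = 0.0
    for y, p in zip(labels, probabilities):
        p = min(max(p, eps), 1 - eps)  # keep log() finite
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)


labels = [1, 0]
confident = [0.9, 0.1]   # hypothetical predictions close to the labels
hesitant = [0.6, 0.4]    # hypothetical predictions further from the labels

print(round(log_loss(labels, confident), 4))
print(round(log_loss(labels, hesitant), 4))
```

Predictions that sit closer to the true labels give a smaller loss, which is why a lower log_loss indicates a better fit.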
Column descriptions: formulas (3)
● roc_auc: The area under the ROC curve. This is the probability that a classifier is more confident that a randomly chosen positive example is actually positive than that a randomly chosen negative example is positive. For more information, see Classification in the Machine Learning Crash Course.
AUC = 0.5 (no discrimination)
0.7 ≤ AUC ≤ 0.8 (acceptable discrimination)
0.8 ≤ AUC ≤ 0.9 (excellent discrimination)
0.9 ≤ AUC ≤ 1.0 (outstanding discrimination)
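The probabilistic reading of roc_auc can be computed literally: compare every positive example with every negative example and count how often the positive one gets the higher score. The sketch below does exactly that, counting ties as half a win. It is quadratic in the number of examples, so it is meant for understanding rather than for large data; the scores are hypothetical.

```python
def roc_auc(labels: list, scores: list) -> float:
    """AUC as the chance that a random positive outscores a random negative."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5  # a tie counts as half a win
    return wins / (len(positives) * len(negatives))


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, scores))
```

In this example three of the four positive/negative pairs are ranked correctly, which places the model in the "acceptable" band listed above. A model that gives every example the same score wins no comparisons outright and lands at 0.5.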
5. Use the model to predict purchases by country
#standardSQL
SELECT
  country,
  SUM(predicted_label) AS total_predicted_purchases
FROM
  ML.PREDICT(MODEL `bqml_tutorial.sample_model`, (
    SELECT
      IFNULL(device.operatingSystem, "") AS os,
      device.isMobile AS is_mobile,
      IFNULL(totals.pageviews, 0) AS pageviews,
      IFNULL(geoNetwork.country, "") AS country
    FROM
      `bigquery-public-data.google_analytics_sample.ga_sessions_*`
    WHERE
      _TABLE_SUFFIX BETWEEN '20170701' AND '20170801'))
GROUP BY country
ORDER BY total_predicted_purchases DESC
LIMIT 10
Query results
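The outer part of the query is an ordinary aggregation over the rows that ML.PREDICT returns: sum the 0/1 predicted_label per country, sort descending, keep the top ten. A plain Python sketch of that step, with made-up prediction rows standing in for the function's output:

```python
from collections import defaultdict


def top_countries(rows: list, limit: int = 10) -> list:
    """GROUP BY country, SUM(predicted_label), ORDER BY total DESC, LIMIT n."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["country"]] += row["predicted_label"]
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return ranked[:limit]


# Hypothetical prediction rows, one per session.
rows = [
    {"country": "United States", "predicted_label": 1},
    {"country": "Taiwan", "predicted_label": 0},
    {"country": "United States", "predicted_label": 1},
    {"country": "Taiwan", "predicted_label": 1},
    {"country": "Japan", "predicted_label": 0},
]
print(top_countries(rows))
```

Because predicted_label is 1 for a predicted purchase and 0 otherwise, its sum is the number of sessions the model expects to convert. Grouping by fullVisitorId instead of country gives the per-user variant.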
6. Predict purchases per user
#standardSQL
SELECT
  fullVisitorId,
  SUM(predicted_label) AS total_predicted_purchases
FROM
  ML.PREDICT(MODEL `bqml_tutorial.sample_model`, (
    SELECT
      IFNULL(device.operatingSystem, "") AS os,
      device.isMobile AS is_mobile,
      IFNULL(totals.pageviews, 0) AS pageviews,
      IFNULL(geoNetwork.country, "") AS country,
      fullVisitorId
    FROM
      `bigquery-public-data.google_analytics_sample.ga_sessions_*`
    WHERE
      _TABLE_SUFFIX BETWEEN '20170701' AND '20170801'))
GROUP BY fullVisitorId
ORDER BY total_predicted_purchases DESC
LIMIT 10
Prediction results
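Conclusion
● If you know SQL, you can use it
● The syntax is simple and can be put into practice right away
● The data is stored in the United States
References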
BigQuery Start
https://cloud.google.com/bigquery/docs/bigqueryml-analyst-start
Machine Learning Crash Course
https://developers.google.com/machine-learning/crash-course/
Thank you
Aaron Lee
aaronlee@mitac.com.tw

20190424 Can You Do Machine Learning with Just SQL? An Introduction to BigQuery ML
