Microsoft R can be used with Spark to perform advanced analytics on big data in the cloud or on-premises. Key features include the ability to choose between Spark and other compute contexts, easily deploy analytic models as web services, and process data at scale on HDInsight clusters with hundreds of nodes. R enables building end-to-end AI solutions from data preparation and modeling to operationalizing models for production using services like SQL Server, HDInsight, and Azure.
6. SQL Server on Linux
• Microsoft joins the Eclipse Foundation
• HDInsight managed service on Linux
• Azure Marketplace: 60% of all images in Azure Marketplace are based on Linux/OSS
• Partnership with the Linux Foundation for Linux on Azure certification
• 600 million+ lines of open source code submitted to GitHub by Microsoft engineers
• Microsoft Open Source Hub
• Wim Coekaerts, Oracle's "Mr. Linux", joins Microsoft
• 1 out of 3 VMs on Azure run Linux, and more than half of all new VMs run Linux
• Acquisition
• Jenkins project on Azure
Categories: product, partnership, services offered, culture
• Ross Gardler, President of the Apache Software Foundation
• Partnership
• Run Linux on Windows natively:
C:\Users\markhill> bash
root@localhost: #
13. Pretrained
Image Featurization
featurizeImage()
– Used to identify parts of images
– People, things, animals, etc.
[Diagram: image data sets → featurizer → image contents found]
Text Featurization
featurizeText()
– Returns n-gram digest and counts from many partitions of text data
[Diagram: text data sets → featurizer → n-grams (phrases) and counts]
Pretrained Sentiment Analysis
getSentiment()
– Pretrained to return sentiment score (0-1)
– English only for now
[Diagram: text data sets → featurizer (getSentiment()) → sentiment score]
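As a rough illustration of the idea behind the text featurizer on this slide (n-gram counts gathered from many partitions of text), the following Python sketch counts word n-grams per partition and then sums the counts. It is not the MicrosoftML implementation: the helper names `ngram_counts` and `merge_counts` and the sample sentences are invented for this example.

```python
from collections import Counter


def ngram_counts(text, n=2):
    """Count word n-grams (phrases of n consecutive words) in one text partition."""
    words = text.lower().split()
    grams = (" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    return Counter(grams)


def merge_counts(partitions, n=2):
    """Count n-grams in each partition separately, then sum the per-partition counts."""
    total = Counter()
    for text in partitions:
        total.update(ngram_counts(text, n))
    return total


# Hypothetical sample data: two partitions of text.
partitions = [
    "the flight was late the flight was full",
    "the flight was cancelled",
]
counts = merge_counts(partitions, n=2)
print(counts.most_common(2))
```

Because each partition is counted independently and the results are only added together, the per-partition step could run on separate workers.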
14. rxEnsemble:
– Returns ensembled model combining multiple types
– Ensembling settings balance speed & accuracy
[Diagram: Ensemble Learning. Single or distributed data sets → rxEnsemble → many small models (Model 1, Model 2) → ensemble model]
ManyModels (w/ rxExecBy):
– Used to run model on each of many partitions.
– Returns one model trained per cohort (partition) of data.
[Diagram: data partitioned by cohort (P1, P2, P3) → Model P1, Model P2, Model P3. Returns a set of models]
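The "many models" pattern above (one model per cohort of the data) can be sketched in plain Python. This is only an illustration of the pattern, not rxExecBy itself: `fit_line`, `fit_by_cohort`, and the tiny airport data set are hypothetical, and a simple least-squares line stands in for whatever model would actually be trained.

```python
from collections import defaultdict


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return {"intercept": mean_y - slope * mean_x, "slope": slope}


def fit_by_cohort(rows, key):
    """Partition rows by `key`, then fit one independent model per partition."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return {
        cohort: fit_line([r["x"] for r in part], [r["y"] for r in part])
        for cohort, part in groups.items()
    }


# Hypothetical data: a response y against a predictor x for two origin airports.
rows = [
    {"origin": "SEA", "x": 1.0, "y": 3.0},
    {"origin": "SEA", "x": 2.0, "y": 5.0},
    {"origin": "SEA", "x": 3.0, "y": 7.0},
    {"origin": "JFK", "x": 1.0, "y": 10.0},
    {"origin": "JFK", "x": 2.0, "y": 9.0},
    {"origin": "JFK", "x": 3.0, "y": 8.0},
]
models = fit_by_cohort(rows, key="origin")
for cohort, model in models.items():
    print(cohort, model)
```

The result is a set of models keyed by cohort, which matches the "returns a set of models" label on the slide.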
16. Defines where the processing happens
Current set compute context determines processing location
Write Once Deploy Anywhere (WODA) by changing compute context
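The compute-context idea (analysis code written once, execution location chosen by a single setting) can be mimicked in a few lines of Python. This is a sketch of the pattern only and does not use or reproduce the RevoScaleR API: `set_compute_context`, the `"local"` and `"threads"` back ends, and `chunked_mean` are all hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def _run_local(fn, chunks):
    """Back end 1: process every chunk sequentially in this process."""
    return [fn(chunk) for chunk in chunks]


def _run_threaded(fn, chunks):
    """Back end 2: process chunks on a small thread pool (order is preserved)."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(fn, chunks))


_CONTEXTS = {"local": _run_local, "threads": _run_threaded}
_current = "local"


def set_compute_context(name):
    """Select where subsequent analysis calls are executed."""
    global _current
    if name not in _CONTEXTS:
        raise ValueError(f"unknown compute context: {name}")
    _current = name


def chunked_mean(chunks):
    """Analysis code written once: it never mentions the execution back end."""
    partials = _CONTEXTS[_current](lambda c: (sum(c), len(c)), chunks)
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


chunks = [[1, 2, 3], [4, 5], [6]]
set_compute_context("local")
local_result = chunked_mean(chunks)
set_compute_context("threads")
threaded_result = chunked_mean(chunks)
print(local_result, threaded_result)
```

Only the call to `set_compute_context` changes between the two runs; `chunked_mean` is untouched, which is the point of the "write once, deploy anywhere" slogan.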
20. Predict airline delays from historical flight data and weather information
Data Sets
Airline delay (2009 to 2012) – 44 variables
Weather information – 11 variables
Demo
Data manipulation using sparklyr
Interoperability between sparklyr and RevoScaleR
Supervised learning using RevoScaleR
Predict airline delay per origin airport (small data many models)
Interoperability between H2O and RevoScaleR
21. [Diagram labels: ScaleR, Production, RStudio Server Community/Pro, Microsoft R Server. Numbered steps: 1. Copy, 2. Stream, 3. Send]
22. R Server on HDInsight – scaling to billions of rows
Configuration:
• HDI cluster size: 100 nodes
- All nodes: D4 (8 cores, 28GB)
• Dataset: Airlines dataset
- transformed, and duplicated
• Number of parameters: 370
• Format: CSV
• fs.azure.selfthrottling.read.factor=1
[Chart: rxLogit on a 100 node HDInsight Cluster. x-axis: billions of rows (0 to 25); y-axis: elapsed time in seconds (0 to 1600)]
23. Configuration:
• 1 Edge Node: 16 cores, 112GB
• 4 Worker Nodes: 16 cores, 112GB
• Dataset: Duplicated Airlines data (.csv)
• Number of columns: 26
E2E Process:
• Load Data from .csv
• Transform Features
• Split Data: Train + Test
• Fit Model: Logistic Regression (no regularization)
• Predict and Write Outputs
http://tinyurl.com/Strata2017R/Performance_Comparison
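The five E2E steps on this slide can be walked through with a toy example. The sketch below uses only the Python standard library and synthetic data, so it shows the shape of the pipeline rather than the benchmark itself: the column names `dep_hour` and `delayed`, the data-generating rule, the learning rate, and the iteration count are all assumptions made up for illustration.

```python
import csv
import io
import math
import random

random.seed(0)

# 1. Load data from .csv (an in-memory, synthetic stand-in for a real file).
#    Hypothetical columns: dep_hour (0-23) and delayed (0/1).
lines = ["dep_hour,delayed"]
for _ in range(400):
    hour = random.randint(0, 23)
    # Synthetic rule: later departures are delayed more often.
    delayed = 1 if random.random() < hour / 24 else 0
    lines.append(f"{hour},{delayed}")
rows = list(csv.DictReader(io.StringIO("\n".join(lines))))

# 2. Transform features: scale the hour to [0, 1] and add a bias term.
data = [([1.0, int(r["dep_hour"]) / 23], int(r["delayed"])) for r in rows]

# 3. Split data: train + test.
random.shuffle(data)
cut = int(0.8 * len(data))
train, test = data[:cut], data[cut:]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, x):
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)))


# 4. Fit model: logistic regression without regularization, by gradient descent.
weights = [0.0, 0.0]
for _ in range(500):
    grad = [0.0, 0.0]
    for x, y in train:
        err = predict(weights, x) - y
        for j in range(len(weights)):
            grad[j] += err * x[j]
    weights = [w - 0.5 * g / len(train) for w, g in zip(weights, grad)]

# 5. Predict and write outputs (to an in-memory CSV here).
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["actual", "predicted_probability"])
for x, y in test:
    writer.writerow([y, round(predict(weights, x), 4)])
print(out.getvalue().splitlines()[:3])
```

In the benchmark on the slide the same steps run on a cluster against a duplicated Airlines data set; here everything fits in memory so the flow can be read top to bottom.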
24. An easy way to deploy models!
Operationalizing Analytic Models
25. Instant Deployment
• Turn R analytics into Web services in one line of code
• Swagger-based REST APIs, easy to consume with any programming language, including R!
Deploy to Anywhere
• Deploying web service server to any platform: Windows, SQL, Linux/Hadoop
• On-prem or in cloud
Fast and Scalable
• Fast scoring, real time and batch
• Scaling to a grid for powerful computing with load balancing
• Diagnostic and capacity evaluation tools
Secure and Reliable
• Enterprise authentication: AD/LDAP or AAD
• Secure connection: HTTPS with SSL/TLS 1.2
• Enterprise grade high availability
Unique
27. {mrsdeploy}
Function          Description
publishService    Publish a predictive function as a Web Service
deleteService     Delete a Web Service
getService        Get a Web Service
listServices      List the different published web services
serviceOption     Retrieve, set, and list the different service options
updateService     Updates a Web Service
28. • Seamless integration with authentication solution: LDAP/AD/AAD
• Secure connection: HTTPS encrypted by TLS 1.2/SSL
• Compliance with Microsoft Security Development Lifecycle
R Client
29. [Diagram: Prepare → Model → Operationalize, on-prem and in the cloud, in four numbered scenarios (1 to 4). Models: R & ScaleR models, CRAN R models. Operationalization targets shown: AzureML Web Services, R Server VMs, T-SQL/Stored Procedure (deploy to SQL Server 2017), R Server (deploy to Hadoop / Linux Server / Windows Server)]
{mrsdeploy}, {azureml}, {sqlutils}
32. Image classification deep learning procedure (Learning/Scoring)
Learning: Images → Featurization (using pre-trained ResNet18 neural network model) → Features + Labels → Classification Algorithm (Boosted Tree) → Classifier Model
Scoring: Images → Featurization (using pre-trained ResNet18 neural network model) → Features → Classification → Predictions
Pre-trained models: resnet18, resnet50, resnet101, alexnet
33. Distributed Featurization + Training on HDInsight
[Diagram: CT Scan Images (Azure Blob Storage) → Distributed Featurization (HDInsight-MRS) → Classifier Training → Models Table (SQL Server). Other labels: Edge, Featurization]
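34. Scoring with a deep learning model in SQL Server
[Diagram: a Web App issues a Stored Procedure call to SQL Server. Stored Procedures with R Code perform Featurization and Scoring with the classifier model, using the Model table, Features table, and New Images table. The Web App shows the result, e.g. "Diagnosis: 35% certainty"]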
38. R Server for Hadoop 9.1
[Diagram: clients (R Tools for Visual Studio, BI Tools & Applications, Jupyter Notebooks, Thin Client IDEs) reach the Edge Node over https:// through Web Services (MRSDeploy) or through Remote Execution (ssh). On the cluster, a ScaleR Master Task (Initiator, Finalizer) works with Worker Tasks over Data Frames]
39. Snapshot Functions
createSnapshot        Create a snapshot of the remote session (workspace and working directory)
loadSnapshot          Load a snapshot from the server into the remote session (workspace and working directory)
listSnapshots         Get a list of snapshots for the current user
downloadSnapshot      Download a snapshot from the server
deleteSnapshot        Delete a snapshot from the server
Remote Objects Management
listRemoteFiles       Get a list of files in the working directory of the remote session
deleteRemoteFile      Delete a file from the working directory of the remote R session
getRemoteFile         Copy a file from the working directory of the remote R session
putLocalFile          Copy a file from the local machine to the working directory of the remote R session
getRemoteObject       Get an object from the remote R session
putLocalObject        Put an object from the local R session and load it into the remote R session
getRemoteWorkspace    Take all objects from the remote R session and load them into the local R session
putLocalWorkspace     Take all objects from the local R session and load them into the remote R session
Remote Connection
remoteLogin           Remote login to the R Server with AD or admin credentials
remoteLoginAAD        Remote login to the R Server using Azure AD
remoteLogout          Log out of the remote session on the DeployR Server
Remote Execution
remoteExecute         Remote execution of either R code or an R script
remoteScript          Wrapper function for remote script execution
diffLocalRemote       Generate a 'diff' report between local and remote
pause                 Pause the remote connection and return to the local session
resume                Return the user to the 'REMOTE >' command prompt
42. Cloud AI Stack
[Diagram: layered stack. Layer labels: AI Applications, Services, Tooling, Processing, Frameworks, Infrastructure. Component labels: Cognitive Services, AML Web Services, BOT Framework, Model & Experimentation Management, Data Wrangling & Spark, AI Batch Training, Inferencing, Spark, SQL, Other Engines, DSVM, ACS, Docker, Edge, CPUs. Machine Learning and Deep Learning Toolkits: CNTK, Tensorflow, ML Server, Scikit-Learn, Other Libs. Storage (Azure Data Services) & Hardware (CPU, GPU, FPGA & ASIC)]
Editor's Notes
To help us meet these goals, we have three main products
Cortana Intelligence Suite
Cognitive Services, Bot Framework, Cortana
Power BI
Machine learning, Stream Analytics, HDInsight
Data Lake, SQL, DW
Data Factory, Data Catalog, Event Hubs
SQL Server 2017
SSRS, DataZen, R
SSAS, SQL Server Machine Learning Services
OLTP, DW, Hadoop, EDSs
SSIS, DQS, MDS
Microsoft R
R visualizations
Microsoft R
Hadoop, Teradata, Linux, Windows
Spark SQL/ETL
CRAN: a growing library of more than 10,000 R packages built by a thriving open source community. A huge repository of freely exchanged algorithms, techniques, scripts, adapters, and training material is available.
Enterprise Grade Analytics Platform –
Solve for operationalization challenges using Microsoft R’s ability to write the code once and deploy on multiple platforms.
Get enterprise grade support and safeguard your analytics investments.
Works with what you have –
We understand that your data lives in different environments and that your needs may change over time, which is why Microsoft R supports several platforms, including Hadoop/Spark, Linux, Windows, Teradata, and SQL
Best of Open Source and Microsoft innovation
Parallelized, remote executing algorithms
In-database analytics to take analytics to your data
Machine learning packages from Microsoft
Time to value
Solution templates, tutorials to help you build solutions
Microsoft’s partner ecosystem to help you execute projects.
There are many R products on the market with operationalization capabilities. These 4 pillars separate R Server from other R products.
If a version is not specified, a temp guid endpoint is created – this is mainly for development phase and sharing among team members privately.
I can model in any of those environments, and I can deploy in any of those environments. Interchangeably!
On the left we have R Client, which offers two things to data scientists
First, they can leverage all the Microsoft R packages locally on their workstations
Second, they can push the compute and big data analytics to where the data lives. This gives them access to the power of servers and eliminates the need for data movement, reducing time and increasing security
On the right is our commitment of meeting customers where they are, and where their data lives.
Slide Objective
Show how R Server for Hadoop Spark can interoperate with 4 different methods of development and deployment.
Talking Points
Snapshot functions are very useful for remote execution scenarios. A snapshot saves the whole workspace and working directory so that you can pick up exactly where you left off last time. Think about saving and loading a game.
A snapshot can also be used when publishing a web service to help you handle the environment dependencies. But it might impact the request-response time. For optimal performance, consider the size of the snapshot carefully. Before creating a snapshot, make sure you keep only those workspace objects you need and purge the rest. And, in the event that you only need a single object, consider passing that object alone instead of using a snapshot.
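The size advice in this note can be made concrete with a small experiment. The Python sketch below is an analogy rather than the mrsdeploy snapshot mechanism: it uses `pickle` to stand in for a snapshot, and the `workspace` contents and the `snapshot` helper are hypothetical.

```python
import pickle

# Hypothetical workspace: one small model object plus large leftover data.
workspace = {
    "model": {"intercept": 1.0, "slope": 2.0},
    "raw_training_data": list(range(100_000)),
    "scratch": "x" * 50_000,
}


def snapshot(objects, keep=None):
    """Serialize the workspace, optionally keeping only the named objects."""
    if keep is not None:
        objects = {name: objects[name] for name in keep}
    return pickle.dumps(objects)


full = snapshot(workspace)                     # everything, including leftovers
trimmed = snapshot(workspace, keep=["model"])  # purge what the service does not need
single = pickle.dumps(workspace["model"])      # pass the one object, no snapshot

print(len(full), len(trimmed), len(single))
restored = pickle.loads(trimmed)
```

The full snapshot carries every leftover object with it on each load, while the trimmed snapshot and the single object are a tiny fraction of that size, which is why purging the workspace first helps response time.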