Introduction to Data Science and R language
13 August 2013
Anju Gahlawat
Index
– Introduction to Data Science
– Hidden skills of Data Scientist
– Failure of Current Statistical tools like SAS and Excel
– Introduction to R language
– R Basic Commands
– Running SQL server with R
– Visualizing Data with R
– Introduction to Shiny
– Future of R
Data Science
Data Science is all about telling a STORY from the data.
Data Science deals with….
5 Hidden Skills for Data Scientists
– Be Clear: Is Your Problem Really A Big Data Problem?
– Communicating About Your Data
– Invest in Interactive Analytics, not Reporting
– Understand the Role and Quality of Human Evaluations of Data
– Spend Time on the Plumbing
Difference between Data Science and Big Data
Big data is more concerned with the engineering side of data and with
answering the following questions:
– How do you store it?
– How do you manipulate it?
– How do you run parallelized computations on it?
– How do you access it?
– How do you mine it?
But data science is more than that.
– It looks at the algorithmic and mathematical aspects of
extracting knowledge from data.
– It applies advanced analytical tools and algorithms to generate
predictive insights and new product innovations that are a direct result of
the data.
Shortcomings of current visualization and statistical tools
– The most commonly used statistical software tools either fail completely or are
too slow to be useful on huge data sets
– Limited scalability
– Limited flexibility to adopt new, fast, scalable algorithms
– Problems printing charts in Excel: legend data, or sometimes the x or y axis,
goes missing
– If there is a value in the upper-left cell of the data set (A1 in the slide's example),
Excel fails to chart the data correctly
Introduction to R
– R is a computer language and run-time environment used for
data manipulation, statistics, and graphics
– Base R comes with a wide range of standard statistical and
graphical analyses built in, and user-developed extension packages add more.
– R is an expression-based language.
– It is possible to interface procedures written in C, C++, or Fortran
for efficiency, and to write additional primitives.
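As a small illustration of what "expression-based" means in practice, the sketch below shows that `if`, braced blocks, and function bodies all evaluate to a value that can be assigned. The variable names are made up for this example and do not come from the slides.

```r
# In R, control-flow constructs are expressions that return a value.
x <- 7
parity <- if (x %% 2 == 0) "even" else "odd"

# A braced block evaluates to its last expression.
total <- {
  a <- 2
  b <- 3
  a * b
}

# A function returns the value of its last evaluated expression.
square <- function(v) v^2
squares <- sapply(1:3, square)

print(parity)
print(total)
print(squares)
```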
R, and the Rise of the Best Software Money Can’t Buy
R users rely on functions that have been developed for them
by statistical researchers, but they can also create their own or
modify the existing ones as per their needs.
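A minimal sketch of that idea, assuming nothing beyond base R: a user-defined function that wraps the existing `mean()` and `median()` functions. The name `robust_center` and the sample values are hypothetical and chosen only for illustration.

```r
# Hypothetical helper: combine two existing base R functions into one summary.
robust_center <- function(x, trim = 0.2) {
  c(
    mean   = mean(x, trim = trim),  # trimmed mean drops extreme values at both ends
    median = median(x)
  )
}

# Illustrative values only; 100 is an outlier that the trimmed mean ignores.
values <- c(1, 2, 3, 4, 100)
res <- robust_center(values)
print(res)
```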
Why R?
Getting started with R
▪ Latest version: 3.0.1 for Windows
▪ Link to download the R installer: http://cran.r-project.org/bin/windows/base/
▪ 51.5 MB setup file
▪ GUI for R: RStudio. Latest version: 0.97.551
▪ Link to download RStudio:
http://www.rstudio.com/ide/download/desktop
▪ 32.5 MB exe file
RStudio
Sample R code
– Read a data set into R (from a local file or
network URL).
• bse <- read.csv("bse_table.csv", header = TRUE, sep = ",")
– Examine the basic structure of data
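The slide does not show the commands for examining the data, so here is a minimal sketch using standard base R functions. Since `bse_table.csv` is not available, the snippet builds a tiny stand-in data frame from inline text; the column names and all values are invented for illustration and are not real BSE figures.

```r
# Hypothetical stand-in for bse_table.csv (made-up columns and values).
csv_text <- "Date,Open,Close
2013-01-01,100.0,101.5
2013-01-02,101.5,99.0
2013-01-03,99.0,102.0"

bse <- read.csv(text = csv_text, header = TRUE, sep = ",",
                stringsAsFactors = FALSE)

str(bse)      # column names, types, and the first few values
head(bse, 2)  # first two rows
summary(bse)  # per-column summaries
dim(bse)      # number of rows and columns
```

With a real file, only the `read.csv()` call changes; the inspection functions stay the same.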
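Running SQL Server with R
Install and load the RODBC package
install.packages("RODBC")
library(RODBC)
Create an ODBC connection (replace ODBC_NAME with the name of your ODBC data source)
channel <- odbcConnect("ODBC_NAME")
Tab1 <- sqlQuery(channel, "SELECT * FROM TabName")
odbcClose(channel)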
R code - Plotting graph
• > bse$Date <- as.Date(bse$Date, format = "%Y-%m-%d")
• > plot(x = bse$Date, y = bse$Open, type = "l", main = "BSE Data", col = "blue",
xlab = "Periods", ylab = "Index", lwd = 2)
Stock Analysis - Sample graph
Packages in R….
Some graphs made using R:
Introduction to Shiny – R web UI
• The R package Shiny, from RStudio, provides
– Interactive web applications / dynamic HTML pages written in plain R
– A GUI for your own needs
– Website as server
What makes Shiny so special?
– Very Simple: Ready to Use Components
– Shiny is very slick, achieving interactive and pleasant-looking web UIs.
– Event-driven (reactive programming): input <-> output (without requiring a
reload of the browser)
– Shiny user interfaces can be built entirely using R, or can be written directly
in HTML, CSS, and JavaScript for more flexibility.
– A highly customizable slider widget with built-in support for animation.
– Pre-built output widgets for displaying plots, tables, and printed output of R
objects.
– Fast bidirectional communication between the web browser and R using the
websocket package.
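The reactive input-to-output idea listed above can be sketched as a very small app: a slider input and a plot output that re-renders whenever the slider moves. This is only a sketch under assumptions: it uses function names from current Shiny releases (`fluidPage`, `shinyApp`), which may differ from the version available when these slides were written, and the input ID `n` and output ID `scatter` are made up for the example.

```r
library(shiny)

# UI: one slider input and one plot output.
ui <- fluidPage(
  titlePanel("Slider demo"),
  sliderInput("n", "Number of points:", min = 10, max = 500, value = 100),
  plotOutput("scatter")
)

# Server: the plot is re-rendered whenever input$n changes,
# without reloading the page in the browser.
server <- function(input, output) {
  output$scatter <- renderPlot({
    plot(rnorm(input$n), rnorm(input$n), xlab = "x", ylab = "y")
  })
}

# Build the app object; launch it interactively with runApp(app).
app <- shinyApp(ui = ui, server = server)
```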
Stock Analysis - Using Shiny
Current market trend of statistical languages
Stats related to R - Google hits
R is the most powerful and flexible statistical programming language in the world…
Job trends in Statistical Market
Software     2012   2013  Difference  Ratio
SAS         13234  12272        -961   0.93
SPSS         3299   3289         -10   1
R            1196   1693         497   1.42
Minitab      1769   1615        -154   0.91
Stata         842    898          56   1.07
JMP           644    619         -25   0.96
Statistica     61     71          10   1.17
Systat         14     15           1   1.07
BMDP            6     10           3   1.53
[Bar chart by software package (SAS, SPSS, R, Minitab, Stata, JMP, Statistica, Systat, BMDP), y-axis from -1200 to 600]
Trend of Jobs on Indeed.com in March 2012 and 2013
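The Difference and Ratio columns can be recomputed from the two yearly counts, as in the sketch below. The column names `Y2012` and `Y2013` are chosen here for convenience. Note that recomputing from the counts as printed gives slightly different values for a few rows (for example SAS, Statistica, and BMDP); one possible explanation, which is only a guess, is that the slide's counts were rounded before being shown.

```r
# Job counts as printed on the slide (Indeed.com, March 2012 and March 2013).
jobs <- data.frame(
  Software = c("SAS", "SPSS", "R", "Minitab", "Stata",
               "JMP", "Statistica", "Systat", "BMDP"),
  Y2012 = c(13234, 3299, 1196, 1769, 842, 644, 61, 14, 6),
  Y2013 = c(12272, 3289, 1693, 1615, 898, 619, 71, 15, 10),
  stringsAsFactors = FALSE
)

# Difference = 2013 count minus 2012 count; Ratio = 2013 count over 2012 count.
jobs$Difference <- jobs$Y2013 - jobs$Y2012
jobs$Ratio <- round(jobs$Y2013 / jobs$Y2012, 2)

# Sort by ratio, largest relative growth first.
print(jobs[order(-jobs$Ratio), ])
```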
Final Words of Warning
• “Using R is a bit akin to smoking. The beginning is difficult, one may
get headaches and even gag the first few times. But in the long
run, it becomes pleasurable and even addictive. Yet, deep
down, for those willing to be honest, there is something not
fully healthy in it.” --Francois Pinard
Visualization is only one slice of the R cake…
R deals with
• Machine Learning
• Social Media Analytics
• Sentiment Analysis
• Predictive Modeling
• Network Analysis
• Visualization
• Time series Analysis
• Simulation
• And a lot more
To be continued……….
