Data Manipulation and
Visualization in R
Hyebong Choi
School of ICT and Global Entrepreneurship
Handong Global University
Big Data: Why it matters
3
• In the past, data (records) were expensive and kept only for
very important events or a few privileged people
• e.g. royal family, war records, …
• Now data is for everyone and every moment
Data easily go beyond humans and computers
5
Moore’s Law is a computing term that originated around 1970. The simplified
version of this law states that processor speeds, or the overall processing
power of computers, will double every two years.
6
What we call Big Data
7
3V definition: Volume, Velocity, Variety
Data Science
• Data Science aims to derive knowledge from big data, efficiently and intelligently
• Data Science encompasses the set of activities, tools, and methods that enable
data-driven activities in science, business, medicine, and government
• Machine Learning (or Data Mining) is again one of the core technologies that
enables Data Science
http://www.oreilly.com/data/free/what-is-data-science.csp
8
Data Science is the process of getting actionable insight from Big Data
Night-time Bus by Seoul Local Government
• Effective bus route design with Big Data
• 5 million records of taxi pick-ups and drop-offs
• 3 billion night-time call records (with location) from a telecom company
Singapore Traffic Visualization
• Data from LTA
• Subway, bus, taxi traffic info.
• By the data analytics team at the Institute for Infocomm Research
Big data and Data Science
What We Cover Here
• Data Manipulation and Visualization in R
12
R
• The most commonly used tool among data scientists, other than databases
• Covers a wide range of data scientists: data engineers, statisticians, domain
experts, … rather than just computer engineers
• Provides thousands of ready-to-use, powerful packages for data scientists
• Well documented
13
http://blog.revolutionanalytics.com/2014/01/in-data-scientist-survey-r-is-the-most-used-tool-other-than-databases.html
R: How to Start
• Explanation of the windows in RStudio
14
R
• Getting help
▫ help(command) or ?command
▫ example(command) to see examples
15
Package Installation and loading
• install.packages("package name")
• To load a package:
▫ library("package name")
▫ or require("package name")
16
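The difference between library() and require() only shows up when a package is missing. The sketch below illustrates it with a made-up package name (`notARealPackage123` is an assumption chosen so that loading fails) and uses the built-in `stats` package for the install-if-missing pattern, so nothing is actually downloaded.

```r
# "notARealPackage123" is a made-up name used only to trigger a failed load.
missing_pkg <- "notARealPackage123"

# require() returns FALSE (with a warning) when the package is not installed.
ok <- suppressWarnings(
  require(missing_pkg, character.only = TRUE, quietly = TRUE)
)
ok

# library() stops with an error instead, so it is caught here to keep going.
msg <- tryCatch({
  library(missing_pkg, character.only = TRUE)
  "loaded"
}, error = function(e) "library() raised an error")
msg

# A common pattern: install only when the package is not yet available.
# "stats" ships with R, so install.packages() is not called in this run.
if (!requireNamespace("stats", quietly = TRUE)) {
  install.packages("stats")
}
```

In scripts, library() is usually the safer choice because a missing package fails immediately rather than later, at the first function call.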
Variable
• Variable is a container to hold data (or information) that
we want to work with
• Variable can hold
▫ a single value: 10, 10.5, "abc", factor, NA, NULL
▫ multiple values: vector, matrix, list
▫ specially formatted data (values): data.frame
17
How you assign a value to variable
• var <- value
18
my_first_variable <- 35.121
New variable is now assigned and
available in working environment
Operators
19
a <- 10.5
b <- 20
c <- 4
a + b ## addition
## [1] 30.5
a - c ## subtraction
## [1] 6.5
a * c ## multiplication
## [1] 42
b / c ## division
## [1] 5
a %% c ## remainder
## [1] 2.5
a > b ## inequality
## [1] FALSE
a*2 == b ## equality
## [1] FALSE
!(a > b) ## negation
## [1] TRUE
(b > a) & (b > c) ## logical AND
## [1] TRUE
(a > b) | (a > c) ## logical OR
## [1] TRUE
Data Type – Missing Value (NA)
• Sometimes values are missing, and R represents
missing values as NA
20
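A short illustration of how NA behaves in calculations; the `ages` vector is a made-up example, not data from the slides.

```r
ages <- c(23, NA, 31)        # the second value is missing

is.na(ages)                  # which elements are missing?
mean(ages)                   # NA: arithmetic that touches an NA gives NA
mean(ages, na.rm = TRUE)     # 27: drop missing values before averaging
NA > 1                       # NA: a comparison with NA is unknown as well
```

Many summary functions (sum, mean, max, ...) accept `na.rm = TRUE` for this purpose.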
Vector
• A vector is a sequence of data elements of the same basic type.
• All members must share one data type; mixed values are coerced to a common type
21
numeric_vector <- c(1, 10, 49)
character_vector <- c("a", "b", "c")
boolean_vector <- c(TRUE, FALSE, TRUE)
typeof(numeric_vector)
## [1] "double"
typeof(character_vector)
## [1] "character"
typeof(boolean_vector)
## [1] "logical"
length(numeric_vector) ## number of members in the vector
## [1] 3
new_vector <- c(numeric_vector, 50)
new_vector
## [1] 1 10 49 50
Vector
• R’s vector index starts from 1
▫ 1,2,3,4, …
• A negative index means “except for”
22
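A minimal sketch of both indexing rules, using a made-up vector `letters_vector`:

```r
letters_vector <- c("a", "b", "c", "d")

letters_vector[1]          # "a": the first element has index 1, not 0
letters_vector[2:3]        # "b" "c"
letters_vector[-1]         # "b" "c" "d": everything except the first element
letters_vector[-c(1, 4)]   # "b" "c": everything except the first and fourth
```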
Vector with named elements
• We can give name to each element of vector
• and we can use the name instead of index number
23
some_vector <- c("John Doe", "poker player")
names(some_vector) <- c("Name", "Profession")
some_vector
## Name Profession
## "John Doe" "poker player"
some_vector['Name']
## Name
## "John Doe"
some_vector['Profession']
## Profession
## "poker player"
some_vector[1]
## Name
## "John Doe"
Vector with named elements
• We can give name to each element of vector
• and we can use the name instead of index number
24
weather_vector <- c("Mon" = "Sunny", "Tues" = "Rainy",
"Wed" = "Cloudy", "Thur" = "Foggy",
"Fri" = "Sunny", "Sat" = "Sunny",
"Sun" = "Cloudy")
weather_vector
## Mon Tues Wed Thur Fri Sat Sun
## "Sunny" "Rainy" "Cloudy" "Foggy" "Sunny" "Sunny" "Cloudy"
names(weather_vector)
## [1] "Mon" "Tues" "Wed" "Thur" "Fri" "Sat" "Sun"
Short-cut to make numeric vector
25
a_vector <- 1:10 ## numbers from 1 to 10
b_vector <- seq(1, 10, 2) ## numbers from 1 to 10 increasing by 2
a_vector
## [1] 1 2 3 4 5 6 7 8 9 10
b_vector
## [1] 1 3 5 7 9
c_vector <- rep(1:3, 3)
d_vector <- rep(1:3, each = 3)
c_vector
## [1] 1 2 3 1 2 3 1 2 3
d_vector
## [1] 1 1 1 2 2 2 3 3 3
c(a_vector, b_vector) ## combine vectors to single vector
## [1] 1 2 3 4 5 6 7 8 9 10 1 3 5 7 9
Basic Vector operations
26
a_vector <- c(1,5,2,7,8)
b_vector <- seq(1, 10, 2)
sum(a_vector) ## summation
## [1] 23
mean(a_vector) ## average
## [1] 4.6
# operation of Vector and Scalar
a_vector + 10
## [1] 11 15 12 17 18
a_vector > 4
## [1] FALSE TRUE FALSE TRUE TRUE
sum(a_vector > 4) ## what does this mean?
## [1] 3
# operation of Vector and Vector
a_vector - b_vector
## [1] 0 2 -3 0 -1
a_vector == b_vector
## [1] TRUE FALSE FALSE TRUE FALSE
sum(a_vector == b_vector) ## what does this mean?
## [1] 2
Vector Indexing (Selection)
27
sample_vector <- c(1, 4, NA, 2, 1, NA, 4, NA) ## vector with some missing values
sample_vector[1:5]
## [1] 1 4 NA 2 1
sample_vector[c(1,3,5)]
## [1] 1 NA 1
sample_vector[-1]
## [1] 4 NA 2 1 NA 4 NA
sample_vector[c(-1, -3, -5)]
## [1] 4 2 NA 4 NA
sample_vector[c(T, T, F, T, F, T, F, T)]
## [1] 1 4 2 NA NA
is.na(sample_vector)
## [1] FALSE FALSE TRUE FALSE FALSE TRUE FALSE TRUE
sum(is.na(sample_vector))
## [1] 3
## can you select non-NA elements from the vector?
Selection by numeric vector
Selection by logical vector
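One possible answer to the question above (selecting the non-NA elements) combines is.na() with selection by logical vector. The variable name `non_missing` is just an illustrative choice.

```r
sample_vector <- c(1, 4, NA, 2, 1, NA, 4, NA)

# is.na() marks the missing positions; ! flips it to mark the present ones.
non_missing <- sample_vector[!is.na(sample_vector)]
non_missing
## [1] 1 4 2 1 4

# The same selection by position: which() turns the logical vector into indices.
sample_vector[which(!is.na(sample_vector))]
```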
Data Frame
• Datasets very commonly contain variables of different kinds
▫ e.g. a student dataset may contain name (character), age (integer), major (factor),
gpa (numeric, real number), …
• A vector or a matrix can hold values of only one data type
• A data frame has the variables of a data set as columns and the
observations as rows.
28
mtcars
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
Overview of a Data Frame
• The head function shows the first n (6 by default) observations of a data frame
• The tail function shows the last n (6 by default) observations of a data frame
29
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
head(mtcars, 10) ## try to see what happens
tail(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.7 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.5 0 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.5 0 1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.6 0 1 5 8
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.6 1 1 4 2
tail(mtcars, 10) ## try to see what happens
Overview of a Data Frame
• The str function shows the structure of your data set. It tells you
▫ The total number of observations (e.g. 32 car types)
▫ The total number of variables (e.g. 11 car features)
▫ A full list of the variable names (e.g. mpg, cyl ... )
▫ The data type of each variable (e.g. num)
▫ The first few observations
30
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
Creating Data frame
• The data.frame function combines vectors (of the same length and possibly
different types) into a data frame
31
# Definition of vectors
name <- c("Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune")
type <- c("Terrestrial planet", "Terrestrial planet", "Terrestrial planet",
"Terrestrial planet", "Gas giant", "Gas giant", "Gas giant", "Gas giant")
diameter <- c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883)
rotation <- c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67)
rings <- c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)
# Create a data frame from the vectors
planets_df <- data.frame(name, type, diameter, rotation, rings)
planets_df
## name type diameter rotation rings
## 1 Mercury Terrestrial planet 0.382 58.64 FALSE
## 2 Venus Terrestrial planet 0.949 -243.02 FALSE
## 3 Earth Terrestrial planet 1.000 1.00 FALSE
## 4 Mars Terrestrial planet 0.532 1.03 FALSE
## 5 Jupiter Gas giant 11.209 0.41 TRUE
## 6 Saturn Gas giant 9.449 0.43 TRUE
## 7 Uranus Gas giant 4.007 -0.72 TRUE
## 8 Neptune Gas giant 3.883 0.67 TRUE
Creating Data frame
• You may specify the variables as named arguments
32
my.df <- data.frame(name = c('John', 'Kim', 'Keith'),
                    job = c('Teacher', 'Policeman', 'Secretary'),
                    age = c(32, 25, 28))
my.df
## name job age
## 1 John Teacher 32
## 2 Kim Policeman 25
## 3 Keith Secretary 28
Selection of data frame elements
Similar to vectors and matrices, you select elements from a data frame with
the help of square brackets [ ].
By using a comma, you can indicate what to select from the rows and the
columns respectively.
33
# Print out diameter of Mercury (row 1, column 3)
planets_df[1,3]
## [1] 0.382
# Print out data for Mars (entire fourth row)
planets_df[4, ]
## name type diameter rotation rings
## 4 Mars Terrestrial planet 0.532 1.03 FALSE
# You can also use the variable name directly
# Select first 5 values of diameter column
planets_df[1:5, 'diameter']
## [1] 0.382 0.949 1.000 0.532 11.209
Selection of data frame elements
You will often want to select an entire column, namely one specific variable from a data
frame. If you want to select all elements of the variable diameter, for example, both of
these will do the trick:
planets_df[,3]
planets_df[,"diameter"]
However, there is a short-cut. If your columns have names, you can use the $ sign:
planets_df$diameter
34
Selection of data frame elements
- a tricky part
• You can use a logical vector to select from data frame
35
## find planets with rings
planets_df[planets_df$rings, ]
## name type diameter rotation rings
## 5 Jupiter Gas giant 11.209 0.41 TRUE
## 6 Saturn Gas giant 9.449 0.43 TRUE
## 7 Uranus Gas giant 4.007 -0.72 TRUE
## 8 Neptune Gas giant 3.883 0.67 TRUE
## select names of planets with rings
planets_df[planets_df$rings, 'name']
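## [1] Jupiter Saturn Uranus Neptune
## Levels: Earth Jupiter Mars Mercury Neptune Saturn Uranus Venus
## (factor output from R < 4.0; since R 4.0.0 data.frame() keeps strings as
## character, so this prints "Jupiter" "Saturn" "Uranus" "Neptune")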
## find planets with larger diameter than earth
planets_df$diameter > 1
## [1] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE
planets_df[planets_df$diameter > 1, ]
## name type diameter rotation rings
## 5 Jupiter Gas giant 11.209 0.41 TRUE
## 6 Saturn Gas giant 9.449 0.43 TRUE
## 7 Uranus Gas giant 4.007 -0.72 TRUE
## 8 Neptune Gas giant 3.883 0.67 TRUE
Handling Big Data using R
<2nd> Package “dplyr”, essential for data preprocessing in R
36
HyunJae Jo
School of ICT and Global Entrepreneurship
Handong Global University
Contents
• Data preprocessing
• dplyr packages
• Lab example
37
Why we have to do “Data preprocessing”
• Data preprocessing is the process of making raw data
suitable for data analysis.
38
• Data Cleaning
→ Correct missing values, outliers, and noisy data
• Data Integration
→ Combine data from multiple sources
• Data Reduction
→ Keep only the data needed for analysis
• Data Transformation
→ Transform data to maximize the efficiency of data analytics
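The four steps can be sketched in base R on a tiny made-up population table. Everything here is an assumption for illustration: the data frames `male` and `female`, the value 9999 standing in for an outlier, and median imputation as the cleaning rule.

```r
# Hypothetical toy data: population by region from two separate sources.
male <- data.frame(region = c("A", "B", "C", "D"),
                   male_pop = c(120, NA, 95, 9999))  # NA = missing, 9999 = outlier
female <- data.frame(region = c("A", "B", "C", "D"),
                     female_pop = c(130, 80, 100, 60))

# 1. Data integration: combine the two sources on the shared key.
pop <- merge(male, female, by = "region")

# 2. Data cleaning: treat the implausible value as missing, then fill
#    missing values with the median of the remaining ones (one simple choice).
pop$male_pop[which(pop$male_pop > 1000)] <- NA
pop$male_pop[is.na(pop$male_pop)] <- median(pop$male_pop, na.rm = TRUE)

# 3. Data reduction: keep only the top 2 regions by total population.
pop$total <- pop$male_pop + pop$female_pop
grand_total <- sum(pop$total)
top_regions <- head(pop[order(-pop$total), ], 2)

# 4. Data transformation: express each region as a share of the overall total.
top_regions$share <- top_regions$total / grand_total
top_regions
```

Whether an outlier should be corrected, imputed, or dropped depends on the data; the median fill above is only one option.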
Example of “Data Integration”
39
[Figure labels: Missing value, Noisy data, Outlier]
1. Data Integration
→ Integrate the “Female Population” column
Example of “Data Cleaning”
40
2. Data Cleaning
→ Correct the “Male Pop.” column
→ Remove missing values (“others” column)
→ Correct outliers
Example of “Data Reduction”
41
3. Data Reduction
→ Top 5 regions by total population
Example of “Data Transformation”
42
4. Data Transformation
→ Ratio of region population to total population
What is “dplyr”
• "dplyr" is an R package that specializes in data processing.
• Functions of “dplyr”
▫ mutate() adds new variables that are functions of existing variables
▫ select() picks variables based on their names.
▫ filter() picks cases based on their values.
▫ summarise() reduces multiple values down to a single summary.
▫ arrange() changes the ordering of the rows.
▫ group_by() lets you perform any operation “by group”.
▫ %>% (chain) passes the result of one operation to the next, so several steps run as one expression.
43
How we can use “dplyr”
• Install and load the package.
• To see the package manual in R, use the help() function.
44
imdb dataset
• IMDb (Internet Movie Database) is an online database of information related to
films, television programs, home videos and video games, and streaming content
online -- including cast, production crew and personnel biographies, plot
summaries, trivia, and fan reviews and ratings. As of October 2018, IMDb has
approximately 5.3 million titles (including episodes) and 9.3 million
personalities in its database, as well as 83 million registered users. (Wikipedia)
• We use a reduced imdb dataset, which you can load with the read.csv()
function and this URL:
• https://raw.githubusercontent.com/myhan0710/prac_dplyr/master/imdb.csv
45
Overviewing: imdb dataset
46
(e.g. 15,190 movies)
(e.g. 44 related variable)
(e.g. fn, tid, title,wordsInTitle, …)
(e.g. chr, num, int)
Lab Example
• We will now look at the functions of the dplyr library through the imdb
dataset.
• Familiarize yourself with the roles of the given functions, and just enter
the code in your RStudio.
• Through the Lab, we can finally obtain data from the imdb dataset that
meet the following conditions:
47
dplyr: filter
48
• Use the filter() function to extract the rows
that meet a condition.
• First, we find movies that are recent, highly rated, and
rated by many users, like this:
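A minimal sketch of this step on a made-up data frame. `imdb_toy`, its values, and the cut-offs (2010, 8, 1000) are assumptions for illustration, not the ones used in the lab; the column names follow the slides, with `year` in lower case as in the exercise.

```r
library(dplyr)

# Hypothetical toy data; column names mirror the slides, values are made up.
imdb_toy <- data.frame(
  wordsInTitle = c("alpha", "bravo", "charlie", "delta"),
  imdbRating   = c(8.1, 6.2, 7.9, 8.5),
  ratingCount  = c(250000, 1200, 90000, 400),
  year         = c(2010, 2012, 1995, 2013)
)

# Conditions separated by commas are combined with AND.
imdb_recent <- filter(imdb_toy,
                      year >= 2010, imdbRating >= 8, ratingCount >= 1000)
imdb_recent
```

Only the first row passes all three conditions; use `|` inside a single condition when OR is needed.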
dplyr: mutate
49
• Use the mutate() function to make a new column.
• Second, we make a column that indicates
whether a movie belongs to both the action and adventure
genres:
• If a movie’s Action column value is 1 and its
Adventure column value is 0, the movie
includes the action genre but not the adventure genre.
• So, an ActionAdven column value of 1 means a movie
includes both action and adventure genres.
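The rule above could be written as follows. This is a sketch on made-up data (`imdb_toy` is hypothetical), and ifelse() is one of several ways to build the 0/1 flag.

```r
library(dplyr)

# Hypothetical toy data with the two 0/1 genre columns described above.
imdb_toy <- data.frame(
  wordsInTitle = c("alpha", "bravo", "charlie", "delta"),
  Action       = c(1, 0, 1, 1),
  Adventure    = c(1, 0, 0, 1)
)

# ActionAdven is 1 only when both genre flags are 1.
imdb_genre <- mutate(imdb_toy,
                     ActionAdven = ifelse(Action == 1 & Adventure == 1, 1, 0))
imdb_genre$ActionAdven
## [1] 1 0 0 1
```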
dplyr: select
• Use the select() function to keep only the columns
you name.
• Third, we only need a few columns:
- wordsInTitle
- imdbRating
- ratingCount
- Year
- ActionAdven
50
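A sketch of the column selection on made-up data. `imdb_toy` is hypothetical, and the year column is assumed to be named `year` in lower case, as in the exercise later on.

```r
library(dplyr)

# Hypothetical toy data with a few extra columns we do not need.
imdb_toy <- data.frame(
  tid          = c("t1", "t2"),
  wordsInTitle = c("alpha", "bravo"),
  imdbRating   = c(8.1, 6.2),
  ratingCount  = c(250000, 1200),
  year         = c(2010, 2012),
  duration     = c(5400, 6000),
  ActionAdven  = c(1, 0)
)

# Keep only the named columns, in the order given.
imdb_small <- select(imdb_toy,
                     wordsInTitle, imdbRating, ratingCount, year, ActionAdven)
names(imdb_small)

# A minus sign drops columns instead of keeping them.
imdb_dropped <- select(imdb_toy, -tid, -duration)
names(imdb_dropped)
```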
dplyr: arrange
51
• Use the arrange() function to sort rows in ascending order based on a
specified column.
• Fourth, we sort by year
in ascending order and then by rating in descending order.
• The desc() function sorts a column in descending order.
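A sketch of this two-key sort on made-up data (`imdb_toy` and its values are assumptions for illustration):

```r
library(dplyr)

# Hypothetical toy data: two movies share the year 2010.
imdb_toy <- data.frame(
  wordsInTitle = c("alpha", "bravo", "charlie", "delta"),
  imdbRating   = c(8.1, 6.2, 8.8, 7.9),
  year         = c(2010, 2012, 2010, 1995)
)

# Sort by year (ascending); ties within a year are ordered by rating (descending).
imdb_sorted <- arrange(imdb_toy, year, desc(imdbRating))
imdb_sorted
```

Within 2010, "charlie" (8.8) comes before "alpha" (8.1) because the second key is descending.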
dplyr: summarise
• Use the summarise() function to obtain basic statistics by specifying
functions such as mean(), var(), and median()
• Now, we want to know the average rating, the average year, and the number of
Action-Adventure movies.
52
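A sketch of these three summaries on made-up data. `imdb_toy` and the output column names (`avg_rating`, `avg_year`, `n_action_adven`) are illustrative choices.

```r
library(dplyr)

# Hypothetical toy data; ActionAdven is the 0/1 flag built with mutate().
imdb_toy <- data.frame(
  imdbRating  = c(8, 6, 7, 9),
  year        = c(2010, 2012, 2010, 2014),
  ActionAdven = c(1, 0, 0, 1)
)

# Each argument collapses a whole column into a single value.
imdb_summary <- summarise(imdb_toy,
                          avg_rating     = mean(imdbRating),
                          avg_year       = mean(year),
                          n_action_adven = sum(ActionAdven))
imdb_summary
```

Summing a 0/1 column counts the rows where the flag is 1, which is why sum() gives the number of Action-Adventure movies.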
dplyr: group_by
• Use the group_by() function to group data by the levels of a
specified column.
• After group_by(), you can apply summarise()
to get the results for each level.
53
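A sketch of grouping followed by summarising, again on made-up data (`imdb_toy`, `avg_rating` and `n_movies` are hypothetical names):

```r
library(dplyr)

# Hypothetical toy data; ActionAdven has two levels, 0 and 1.
imdb_toy <- data.frame(
  imdbRating  = c(8, 6, 7, 9),
  ActionAdven = c(1, 0, 0, 1)
)

# group_by() only marks the groups; summarise() then returns one row per group.
by_genre <- group_by(imdb_toy, ActionAdven)
genre_summary <- summarise(by_genre,
                           avg_rating = mean(imdbRating),
                           n_movies   = n())
genre_summary
```

The result has one row for ActionAdven = 0 and one for ActionAdven = 1, each with its own average and count.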
dplyr: %>%
• Use the chain operator (%>%) to write all the steps as a single expression.
• The results are the same (imdb_chain, imdb7).
54
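The sketch below runs the same four steps twice, once with intermediate variables and once as a single chain, and checks that the results agree. The data, the cut-offs, and the names `step1` to `step3` and `imdb_steps` are assumptions for illustration; only `imdb_chain` echoes a name from the slide.

```r
library(dplyr)

# Hypothetical toy data; column names mirror the slides, values are made up.
imdb_toy <- data.frame(
  wordsInTitle = c("alpha", "bravo", "charlie", "delta"),
  imdbRating   = c(8.1, 6.2, 8.4, 8.5),
  year         = c(2010, 2012, 2011, 2013),
  Action       = c(1, 0, 1, 1),
  Adventure    = c(1, 0, 0, 1)
)

# Step by step, storing every intermediate result.
step1 <- filter(imdb_toy, year >= 2010, imdbRating >= 8)
step2 <- mutate(step1, ActionAdven = ifelse(Action == 1 & Adventure == 1, 1, 0))
step3 <- select(step2, wordsInTitle, imdbRating, year, ActionAdven)
imdb_steps <- arrange(step3, year, desc(imdbRating))

# The same work as one chain: each %>% feeds the left-hand result
# into the first argument of the function on its right.
imdb_chain <- imdb_toy %>%
  filter(year >= 2010, imdbRating >= 8) %>%
  mutate(ActionAdven = ifelse(Action == 1 & Adventure == 1, 1, 0)) %>%
  select(wordsInTitle, imdbRating, year, ActionAdven) %>%
  arrange(year, desc(imdbRating))

all.equal(imdb_steps, imdb_chain)
```

The chained form avoids naming throwaway variables and reads top to bottom in the order the operations happen.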
All contents
55
Exercise
① Extract movies released between 1970 and 2000 from the original imdb
dataset.
② Using the chain operator, extract movies that have both the drama
and family genres. Keep only the columns year, duration, and
nrOfphotos.
• Please replace <Fill In> with your solution.
• After writing your solution, check it against the key on the next page.
56
Exercise key
① Key
② Key
57
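The key slides are images and did not survive in this transcript, so the following is only one possible solution, sketched on made-up data. The assumptions are that the genre columns are 0/1 flags named `Drama` and `Family`, that the photo-count column is spelled `nrOfPhotos`, and that both end years are included.

```r
library(dplyr)

# Hypothetical stand-in for the imdb dataset.
imdb_toy <- data.frame(
  year       = c(1965, 1985, 1999, 2005),
  duration   = c(5400, 6000, 7200, 4800),
  nrOfPhotos = c(3, 10, 25, 8),
  Drama      = c(1, 1, 1, 0),
  Family     = c(0, 1, 1, 1)
)

# (1) Movies between 1970 and 2000, both ends included.
ex1 <- filter(imdb_toy, year >= 1970, year <= 2000)

# (2) Drama AND family movies, keeping only three columns, written as a chain.
ex2 <- imdb_toy %>%
  filter(Drama == 1, Family == 1) %>%
  select(year, duration, nrOfPhotos)

ex1
ex2
```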
58
Data Visualization
• An essential part of a data scientist’s skill set
• With the ever-increasing volume of data, it is impossible to tell
stories without visualizations.
• Data visualization is the art of turning “Big Data” into
useful knowledge
59
Data Visualization
60
[Diagram labels: Statistics, Design, Graphical Data Analysis, Communication & Perception]
Data Visualization
• Exploratory Visualization
▫ Help you see what is in the data
• Explanatory Visualization
▫ Shows others what you’ve found in your data
• R supports both types of visualizations
61
Data Visualization
• Exploratory Visualization
▫ Help you see what is in the data
▫ Keep as much detail as possible
▫ Practical Limit: how much can you see and interpret
• Explanatory Visualization
▫ Help us share our understanding with others
▫ Shows others what you’ve found in your data
▫ Requires editorial decisions:
▫ Highlight the key features you want to emphasize
▫ Eliminate extraneous details
62
Exploratory Visualization
63
Explanatory Visualization
64
ggplot2
• Author: Hadley Wickham
• Open Source implementation of the layered grammar of
graphics
• High-level R package for creating publication-quality statistical
graphics
▫ Carefully chosen defaults following basic graphical design rules
• Flexible set of components for creating any type of graphics
• Things you cannot do With ggplot2
▫ 3-dimensional graphics
▫ Graph-theory type graphs (nodes/edges layout)
65
ggplot2 installation
• In R console:
install.packages("ggplot2")
library(ggplot2)
66
Toy examples
67
Grammar of Graphics
68
The quick brown fox jumps over the lazy dog
69
Grammar of Graphics
70
• Plotting Framework
• Leland Wilkinson, Grammar of Graphics, 1999
• 2 principles
▫ Graphics = distinct layers of grammatical elements
▫ Meaningful plots through aesthetic mapping
Essential Grammatical Elements
71
Element | Description
Data | The dataset being plotted.
Aesthetics | The scales onto which we map our data.
Geometries | The visual elements used for our data.
All Grammatical Elements
72
Element | Description
Data | The dataset being plotted.
Aesthetics | The scales onto which we map our data.
Geometries | The visual elements used for our data.
Facets | Plotting small multiples.
Statistics | Representations of our data to aid understanding.
Coordinates | The space on which the data will be plotted.
Themes | All non-data ink.
Diagram
73
Grammar of Graphics
• Building blocks
• Solid, creative, meaningful visualizations
• Essential Layers: Data, Aesthetics, Geometries
• For Enhancement: Facets, Statistics, Coordinates,
Themes
74
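To make the layers concrete, the sketch below builds one plot in which each line corresponds to one grammatical element. The choice of variables from the built-in mtcars dataset (wt, mpg, cyl, am) and of the specific geom, stat, and theme is an assumption for illustration.

```r
library(ggplot2)

p <- ggplot(mtcars,                                      # data
            aes(x = wt, y = mpg,
                colour = factor(cyl))) +                 # aesthetics
  geom_point() +                                         # geometries
  facet_wrap(~ am) +                                     # facets: one panel per am value
  stat_smooth(method = "lm", formula = y ~ x,
              se = FALSE) +                              # statistics: fitted line per colour
  coord_cartesian(ylim = c(10, 35)) +                    # coordinates: visible y range
  theme_minimal()                                        # themes: non-data ink

p
```

Only the first three lines are required for a plot; removing any of the last four still leaves a valid, simpler graphic.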
Example – First trial of ggplot
• To get a first feel for ggplot2, let's try to run some basic ggplot2 commands.
Together, they build a plot of the mtcars dataset that contains information about
32 cars from the 1974 Motor Trend magazine. This dataset is small, intuitive, and
contains a variety of continuous and categorical variables.
75
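The plotting code on this slide is an image and is not in the transcript. A minimal first plot might look like the sketch below; plotting mpg against cyl is an assumption suggested by the next slide's discussion of cyl.

```r
library(ggplot2)

# cyl and mpg are both stored as numeric columns in mtcars.
str(mtcars[, c("cyl", "mpg")])

# Data + aesthetics + one geometry: the three essential layers.
p_first <- ggplot(mtcars, aes(x = cyl, y = mpg)) +
  geom_point()

p_first
```

Because cyl is numeric, the x axis is drawn as a continuous scale even though only the values 4, 6, and 8 occur.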
Example – First trial of ggplot
• The plot from the previous exercise wasn't really satisfying.
Although cyl (the number of cylinders) is categorical, it is classified
as numeric in mtcars. You'll have to explicitly tell ggplot2 that cyl is
a categorical variable.
76
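One way to tell ggplot2 that cyl is categorical is to wrap it in factor() inside aes(). This is a sketch under the same assumption as before, namely that mpg is the variable on the y axis.

```r
library(ggplot2)

# factor() turns the numeric column into a categorical one with three levels.
levels(factor(mtcars$cyl))
## [1] "4" "6" "8"

# The x axis is now discrete: one position per level, no empty 5 or 7.
p_factor <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_point()

p_factor
```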
more examples
77
Practice
86
https://gist.githubusercontent.com/tiangechen/b68782efa49a16edaf07dc2cdaa855ea/raw/0c794a9717f18b094eabab2cd6a6b9a226903577/movies.csv
Practice
90
https://raw.githubusercontent.com/meKIDO/MyanmarData/master/MichelinNY.csv
Practice
95
https://raw.githubusercontent.com/meKIDO/MyanmarData/master/MyanmarDB.csv
Thank you
97

Mais conteúdo relacionado

Mais procurados

第6回 関数とフロー制御
第6回 関数とフロー制御第6回 関数とフロー制御
第6回 関数とフロー制御Wataru Shito
 
第2回 基本演算,データ型の基礎,ベクトルの操作方法(解答付き)
第2回 基本演算,データ型の基礎,ベクトルの操作方法(解答付き)第2回 基本演算,データ型の基礎,ベクトルの操作方法(解答付き)
第2回 基本演算,データ型の基礎,ベクトルの操作方法(解答付き)Wataru Shito
 
第4回 データフレームの基本操作 その2(解答付き)
第4回 データフレームの基本操作 その2(解答付き)第4回 データフレームの基本操作 その2(解答付き)
第4回 データフレームの基本操作 その2(解答付き)Wataru Shito
 
{tidygraph}と{ggraph}による モダンなネットワーク分析(未公開ver)
{tidygraph}と{ggraph}による モダンなネットワーク分析(未公開ver){tidygraph}と{ggraph}による モダンなネットワーク分析(未公開ver)
{tidygraph}と{ggraph}による モダンなネットワーク分析(未公開ver)Takashi Kitano
 
Market Basket Analysis in R
Market Basket Analysis in RMarket Basket Analysis in R
Market Basket Analysis in RRsquared Academy
 
PLOTCON NYC: Behind Every Great Plot There's a Great Deal of Wrangling
PLOTCON NYC: Behind Every Great Plot There's a Great Deal of WranglingPLOTCON NYC: Behind Every Great Plot There's a Great Deal of Wrangling
PLOTCON NYC: Behind Every Great Plot There's a Great Deal of WranglingPlotly
 
第2回 基本演算,データ型の基礎,ベクトルの操作方法
第2回 基本演算,データ型の基礎,ベクトルの操作方法第2回 基本演算,データ型の基礎,ベクトルの操作方法
第2回 基本演算,データ型の基礎,ベクトルの操作方法Wataru Shito
 
第3回 データフレームの基本操作 その1
第3回 データフレームの基本操作 その1第3回 データフレームの基本操作 その1
第3回 データフレームの基本操作 その1Wataru Shito
 
第5回 様々なファイル形式の読み込みとデータの書き出し(解答付き)
第5回 様々なファイル形式の読み込みとデータの書き出し(解答付き)第5回 様々なファイル形式の読み込みとデータの書き出し(解答付き)
第5回 様々なファイル形式の読み込みとデータの書き出し(解答付き)Wataru Shito
 
第5回 様々なファイル形式の読み込みとデータの書き出し
第5回 様々なファイル形式の読み込みとデータの書き出し第5回 様々なファイル形式の読み込みとデータの書き出し
第5回 様々なファイル形式の読み込みとデータの書き出しWataru Shito
 
The Ring programming language version 1.5.3 book - Part 69 of 184
The Ring programming language version 1.5.3 book - Part 69 of 184The Ring programming language version 1.5.3 book - Part 69 of 184
The Ring programming language version 1.5.3 book - Part 69 of 184Mahmoud Samir Fayed
 
Extending Operators in Perl with Operator::Util
Extending Operators in Perl with Operator::UtilExtending Operators in Perl with Operator::Util
Extending Operators in Perl with Operator::UtilNova Patch
 
The Ring programming language version 1.9 book - Part 69 of 210
The Ring programming language version 1.9 book - Part 69 of 210The Ring programming language version 1.9 book - Part 69 of 210
The Ring programming language version 1.9 book - Part 69 of 210Mahmoud Samir Fayed
 
Palestra sobre Collections com Python
Palestra sobre Collections com PythonPalestra sobre Collections com Python
Palestra sobre Collections com Pythonpugpe
 
The arena of national basketball association data
The arena of national basketball association dataThe arena of national basketball association data
The arena of national basketball association dataJillian Harvey
 
第4回 データフレームの基本操作 その2
第4回 データフレームの基本操作 その2第4回 データフレームの基本操作 その2
第4回 データフレームの基本操作 その2Wataru Shito
 
Seistech SQL code
Seistech SQL codeSeistech SQL code
Seistech SQL codeSimon Hoyle
 
Python data structures
Python data structuresPython data structures
Python data structuresHarry Potter
 

Mais procurados (19)

第6回 関数とフロー制御
第6回 関数とフロー制御第6回 関数とフロー制御
第6回 関数とフロー制御
 
第2回 基本演算,データ型の基礎,ベクトルの操作方法(解答付き)
第2回 基本演算,データ型の基礎,ベクトルの操作方法(解答付き)第2回 基本演算,データ型の基礎,ベクトルの操作方法(解答付き)
第2回 基本演算,データ型の基礎,ベクトルの操作方法(解答付き)
 
第4回 データフレームの基本操作 その2(解答付き)
第4回 データフレームの基本操作 その2(解答付き)第4回 データフレームの基本操作 その2(解答付き)
第4回 データフレームの基本操作 その2(解答付き)
 
{tidygraph}と{ggraph}による モダンなネットワーク分析(未公開ver)
{tidygraph}と{ggraph}による モダンなネットワーク分析(未公開ver){tidygraph}と{ggraph}による モダンなネットワーク分析(未公開ver)
{tidygraph}と{ggraph}による モダンなネットワーク分析(未公開ver)
 
Market Basket Analysis in R
Market Basket Analysis in RMarket Basket Analysis in R
Market Basket Analysis in R
 
PLOTCON NYC: Behind Every Great Plot There's a Great Deal of Wrangling
PLOTCON NYC: Behind Every Great Plot There's a Great Deal of WranglingPLOTCON NYC: Behind Every Great Plot There's a Great Deal of Wrangling
PLOTCON NYC: Behind Every Great Plot There's a Great Deal of Wrangling
 
第2回 基本演算,データ型の基礎,ベクトルの操作方法
第2回 基本演算,データ型の基礎,ベクトルの操作方法第2回 基本演算,データ型の基礎,ベクトルの操作方法
第2回 基本演算,データ型の基礎,ベクトルの操作方法
 
第3回 データフレームの基本操作 その1
第3回 データフレームの基本操作 その1第3回 データフレームの基本操作 その1
第3回 データフレームの基本操作 その1
 
第5回 様々なファイル形式の読み込みとデータの書き出し(解答付き)
第5回 様々なファイル形式の読み込みとデータの書き出し(解答付き)第5回 様々なファイル形式の読み込みとデータの書き出し(解答付き)
第5回 様々なファイル形式の読み込みとデータの書き出し(解答付き)
 
Python 1
Python 1Python 1
Python 1
 
第5回 様々なファイル形式の読み込みとデータの書き出し
第5回 様々なファイル形式の読み込みとデータの書き出し第5回 様々なファイル形式の読み込みとデータの書き出し
第5回 様々なファイル形式の読み込みとデータの書き出し
 
The Ring programming language version 1.5.3 book - Part 69 of 184
The Ring programming language version 1.5.3 book - Part 69 of 184The Ring programming language version 1.5.3 book - Part 69 of 184
The Ring programming language version 1.5.3 book - Part 69 of 184
 
Extending Operators in Perl with Operator::Util
Extending Operators in Perl with Operator::UtilExtending Operators in Perl with Operator::Util
Extending Operators in Perl with Operator::Util
 
The Ring programming language version 1.9 book - Part 69 of 210
The Ring programming language version 1.9 book - Part 69 of 210The Ring programming language version 1.9 book - Part 69 of 210
The Ring programming language version 1.9 book - Part 69 of 210
 
Palestra sobre Collections com Python
Palestra sobre Collections com PythonPalestra sobre Collections com Python
Palestra sobre Collections com Python
 
The arena of national basketball association data
The arena of national basketball association dataThe arena of national basketball association data
The arena of national basketball association data
 
第4回 データフレームの基本操作 その2
第4回 データフレームの基本操作 その2第4回 データフレームの基本操作 その2
第4回 データフレームの基本操作 その2
 
Seistech SQL code
Seistech SQL codeSeistech SQL code
Seistech SQL code
 
Python data structures
Python data structuresPython data structures
Python data structures
 

Semelhante a Data manipulation and visualization in r 20190711 myanmarucsy

Introduction to R
Introduction to RIntroduction to R
Introduction to RStacy Irwin
 
Beyond PHP - it's not (just) about the code
Beyond PHP - it's not (just) about the codeBeyond PHP - it's not (just) about the code
Beyond PHP - it's not (just) about the codeWim Godden
 
Writing Readable Code with Pipes
Writing Readable Code with PipesWriting Readable Code with Pipes
Writing Readable Code with PipesRsquared Academy
 
Forecasting Revenue With Stationary Time Series Models
Forecasting Revenue With Stationary Time Series ModelsForecasting Revenue With Stationary Time Series Models
Forecasting Revenue With Stationary Time Series ModelsGeoffery Mullings
 
Data manipulation on r
Data manipulation on rData manipulation on r
Data manipulation on rAbhik Seal
 
MH prediction modeling and validation in r (1) regression 190709
MH prediction modeling and validation in r (1) regression 190709MH prediction modeling and validation in r (1) regression 190709
MH prediction modeling and validation in r (1) regression 190709Min-hyung Kim
 
Beyond php - it's not (just) about the code
Beyond php - it's not (just) about the codeBeyond php - it's not (just) about the code
Beyond php - it's not (just) about the codeWim Godden
 
Beyond PHP - it's not (just) about the code
Beyond PHP - it's not (just) about the codeBeyond PHP - it's not (just) about the code
Beyond PHP - it's not (just) about the codeWim Godden
 
RDataMining slides-r-programming
RDataMining slides-r-programmingRDataMining slides-r-programming
RDataMining slides-r-programmingYanchang Zhao
 
Beyond php - it's not (just) about the code
Beyond php - it's not (just) about the codeBeyond php - it's not (just) about the code
Beyond php - it's not (just) about the codeWim Godden
 
Beyond php - it's not (just) about the code
Beyond php - it's not (just) about the codeBeyond php - it's not (just) about the code
Beyond php - it's not (just) about the codeWim Godden
 
Duplicates everywhere (Kiev)
Duplicates everywhere (Kiev)Duplicates everywhere (Kiev)
Duplicates everywhere (Kiev)Alexey Grigorev
 
Beyond PHP - It's not (just) about the code
Beyond PHP - It's not (just) about the codeBeyond PHP - It's not (just) about the code
Beyond PHP - It's not (just) about the codeWim Godden
 
Data Mining & Analytics for U.S. Airlines On-Time Performance
Data Mining & Analytics for U.S. Airlines On-Time Performance Data Mining & Analytics for U.S. Airlines On-Time Performance
Data Mining & Analytics for U.S. Airlines On-Time Performance Mingxuan Li
 
Beyond php - it's not (just) about the code
Beyond php - it's not (just) about the codeBeyond php - it's not (just) about the code
Beyond php - it's not (just) about the codeWim Godden
 
R Programming: Transform/Reshape Data In R
R Programming: Transform/Reshape Data In RR Programming: Transform/Reshape Data In R
R Programming: Transform/Reshape Data In RRsquared Academy
 
Chapter 2: R tutorial Handbook for Data Science and Machine Learning Practiti...
Data manipulation and visualization in r 20190711 myanmarucsy

  • 1. Data Manipulation and Visualization in R Hyebong Choi School of ICT and Global Entrepreneurship Handong Global University
  • 2. Big Data: Why it matters. In the past, data (records) were kept only for important people and events, e.g. royal families, war records, …
  • 3. Big Data: Why it matters. In the past, data (records) were expensive and reserved for very important events or a few privileged people, e.g. royal families, war records, … Now data is for everyone and every moment.
  • 4. Data easily go beyond Human and …
  • 5. Data easily go beyond Human and computer. Moore’s Law is a computing term which originated around 1970; the simplified version of this law states that processor speeds, or overall processing power for computers, will double every two years.
  • 6.
  • 7. What we call Big Data: the 3V definition (volume, velocity, variety)
  • 8. Data Science. Data Science aims to derive knowledge from big data, efficiently and intelligently. Data Science encompasses the set of activities, tools, and methods that enable data-driven activities in science, business, medicine, and government. Machine Learning (or Data Mining) is again one of the core technologies that enable Data Science. http://www.oreilly.com/data/free/what-is-data-science.csp Data Science is the process of getting actionable insight from big data.
  • 9. Night-time bus by the Seoul local government. Effective bus route design with big data: 5 million records of taxi pick-ups and drop-offs, and 3 billion night-time call records (location) from a telecom company.
  • 10. Singapore traffic visualization. Data from LTA: subway, bus and taxi traffic information. By the data analytics team at the Institute for Infocomm Research.
  • 11. Big data and Data Science
  • 12. What We Cover Here • Data Manipulation and Visualization in R
  • 13. R • The most common tool for data scientists other than a DBMS • Covers a wide range of data scientists: data engineers, statisticians, domain experts, … rather than just computer engineers • Provides thousands of ready-to-use, powerful packages for data scientists • Well documented. http://blog.revolutionanalytics.com/2014/01/in-data-scientist-survey-r-is-the-most-used-tool-other-than-databases.html
  • 14. R: How to Start • Explanation of the windows in RStudio
  • 15. R • Getting help ▫ help(command) or ?command ▫ example(command) to see examples
  • 16. Package installation and loading • install.packages("package name") • To load a package ▫ library("package name") ▫ or require("package name")
  • 17. Variable • A variable is a container that holds data (or information) we want to work with • A variable can hold ▫ a single value: 10, 10.5, "abc", factor, NA, NULL ▫ multiple values: vector, matrix, list ▫ specially formatted data (values): data.frame
  • 18. How to assign a value to a variable • var <- value
      my_first_variable <- 35.121
    The new variable is now assigned and available in the working environment.
  • 19. Operators
      a <- 10.5
      b <- 20
      c <- 4
      a + b              ## addition
      ## [1] 30.5
      a - c              ## subtraction
      ## [1] 6.5
      a * c              ## multiplication
      ## [1] 42
      b / c              ## division
      ## [1] 5
      a %% c             ## remainder
      ## [1] 2.5
      a > b              ## comparison (greater than)
      ## [1] FALSE
      a*2 == b           ## equality
      ## [1] FALSE
      !(a > b)           ## negation
      ## [1] TRUE
      (b > a) & (b > c)  ## logical AND
      ## [1] TRUE
      (a > b) | (a > c)  ## logical OR
      ## [1] TRUE
  • 20. Data Type: Missing Value (NA) • Sometimes values are missing, and R represents each missing value as NA
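A minimal sketch of how NA behaves in base R. The vector `heights` and its values are made up for illustration; only `is.na()`, `sum()` and `mean()` from base R are used.

```r
# Hypothetical measurements with two missing values
heights <- c(170, NA, 165, 180, NA)

# is.na() marks the missing positions with TRUE
missing_flags <- is.na(heights)

# Summing a logical vector counts the TRUE values
n_missing <- sum(missing_flags)

# Any NA in the input makes the plain mean NA as well
plain_mean <- mean(heights)

# na.rm = TRUE drops the missing values before averaging
avg <- mean(heights, na.rm = TRUE)
```

Most summary functions in base R (`sum()`, `max()`, `sd()`, …) accept the same `na.rm` argument.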
  • 21. Vector • A vector is a sequence of data elements of the same basic type. • All members should be of the same data type
      numeric_vector <- c(1, 10, 49)
      character_vector <- c("a", "b", "c")
      boolean_vector <- c(TRUE, FALSE, TRUE)
      typeof(numeric_vector)
      ## [1] "double"
      typeof(character_vector)
      ## [1] "character"
      typeof(boolean_vector)
      ## [1] "logical"
      length(numeric_vector) ## number of members in the vector
      ## [1] 3
      new_vector <- c(numeric_vector, 50)
      new_vector
      ## [1]  1 10 49 50
  • 22. Vector • R’s vector index starts from 1 ▫ 1, 2, 3, 4, … • A minus index means “except for”
  • 23. Vector with named elements • We can give a name to each element of a vector • and we can use the name instead of the index number
      some_vector <- c("John Doe", "poker player")
      names(some_vector) <- c("Name", "Profession")
      some_vector
      ##           Name     Profession
      ##     "John Doe" "poker player"
      some_vector['Name']
      ##       Name
      ## "John Doe"
      some_vector['Profession']
      ##     Profession
      ## "poker player"
      some_vector[1]
      ##       Name
      ## "John Doe"
  • 24. Vector with named elements • Names can also be given when the vector is created
      weather_vector <- c("Mon" = "Sunny", "Tues" = "Rainy", "Wed" = "Cloudy", "Thur" = "Foggy",
                          "Fri" = "Sunny", "Sat" = "Sunny", "Sun" = "Cloudy")
      weather_vector
      ##      Mon     Tues      Wed     Thur      Fri      Sat      Sun
      ##  "Sunny"  "Rainy" "Cloudy"  "Foggy"  "Sunny"  "Sunny" "Cloudy"
      names(weather_vector)
      ## [1] "Mon"  "Tues" "Wed"  "Thur" "Fri"  "Sat"  "Sun"
  • 25. Short-cuts to make a numeric vector
      a_vector <- 1:10          ## numbers from 1 to 10
      b_vector <- seq(1, 10, 2) ## numbers from 1 to 10 increasing by 2
      a_vector
      ## [1]  1  2  3  4  5  6  7  8  9 10
      b_vector
      ## [1] 1 3 5 7 9
      c_vector <- rep(1:3, 3)
      d_vector <- rep(1:3, each = 3)
      c_vector
      ## [1] 1 2 3 1 2 3 1 2 3
      d_vector
      ## [1] 1 1 1 2 2 2 3 3 3
      c(a_vector, b_vector) ## combine vectors into a single vector
      ## [1]  1  2  3  4  5  6  7  8  9 10  1  3  5  7  9
  • 26. Basic Vector operations
      a_vector <- c(1,5,2,7,8)
      b_vector <- seq(1, 10, 2)
      sum(a_vector)  ## summation
      ## [1] 23
      mean(a_vector) ## average
      ## [1] 4.6
      # operation of vector and scalar
      a_vector + 10
      ## [1] 11 15 12 17 18
      a_vector > 4
      ## [1] FALSE  TRUE FALSE  TRUE  TRUE
      sum(a_vector > 4) ## what does this mean?
      ## [1] 3
      # operation of vector and vector
      a_vector - b_vector
      ## [1]  0  2 -3  0 -1
      a_vector == b_vector
      ## [1]  TRUE FALSE FALSE  TRUE FALSE
      sum(a_vector == b_vector) ## what does this mean?
      ## [1] 2
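One way to read the two "what does this mean?" questions: in arithmetic, R treats TRUE as 1 and FALSE as 0, so summing a logical vector counts how many elements satisfy the condition. The sketch below reuses `a_vector` from the slide; the names `above_four`, `count_above` and `share_above` are introduced here only for illustration.

```r
a_vector <- c(1, 5, 2, 7, 8)

# Element-wise comparison gives one logical value per element
above_four <- a_vector > 4

# TRUE counts as 1 and FALSE as 0, so sum() counts the matches
count_above <- sum(above_four)

# mean() of a logical vector is the proportion of matches
share_above <- mean(above_four)

# which() returns the positions of the TRUE values
positions <- which(above_four)
```

The same reading applies to `sum(a_vector == b_vector)`: it counts the positions at which the two vectors hold equal values.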
  • 27. Vector Indexing (Selection)
      sample_vector <- c(1, 4, NA, 2, 1, NA, 4, NA) ## vector with some missing values
      ## Selection by numeric vector
      sample_vector[1:5]
      ## [1]  1  4 NA  2  1
      sample_vector[c(1,3,5)]
      ## [1]  1 NA  1
      sample_vector[-1]
      ## [1]  4 NA  2  1 NA  4 NA
      sample_vector[c(-1, -3, -5)]
      ## [1]  4  2 NA  4 NA
      ## Selection by logical vector
      sample_vector[c(T, T, F, T, F, T, F, T)]
      ## [1]  1  4  2 NA NA
      is.na(sample_vector)
      ## [1] FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE  TRUE
      sum(is.na(sample_vector))
      ## [1] 3
      ## can you select non-NA elements from the vector?
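A possible answer to the closing question, sketched with the slide's own `sample_vector`: negate the logical vector from `is.na()` and use it as the selector. The name `non_missing` is introduced here for illustration.

```r
sample_vector <- c(1, 4, NA, 2, 1, NA, 4, NA)

# !is.na() is TRUE exactly where a value is present,
# so logical selection keeps only the non-missing elements
non_missing <- sample_vector[!is.na(sample_vector)]

non_missing
```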
  • 28. Data Frame • Datasets very commonly contain variables of different kinds ▫ e.g. a student dataset may contain name (character), age (integer), major (factor), gpa (numeric, real number), … • A vector or a matrix can hold values of only one data type • A data frame has the variables of a data set as columns and the observations as rows.
      mtcars
      ##                    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
      ## Mazda RX4         21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
      ## Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
      ## Datsun 710        22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
      ## Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
      ## Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
      ## Valiant           18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
      ## Duster 360        14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
      ## Merc 240D         24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
      ## Merc 230          22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
      ## Merc 280          19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
      ## Merc 280C         17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
      ## Merc 450SE        16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
  • 29. Overview of a data frame • The head function shows the first n (6 by default) observations of a data frame • The tail function shows the last n (6 by default) observations of a data frame
      head(mtcars)
      ##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
      ## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
      ## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
      ## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
      ## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
      ## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
      ## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
      head(mtcars, 10) ## try to see what happens
      tail(mtcars)
      ##                 mpg cyl  disp  hp drat    wt qsec vs am gear carb
      ## Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.7  0  1    5    2
      ## Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.9  1  1    5    2
      ## Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.5  0  1    5    4
      ## Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.5  0  1    5    6
      ## Maserati Bora  15.0   8 301.0 335 3.54 3.570 14.6  0  1    5    8
      ## Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.6  1  1    4    2
      tail(mtcars, 10) ## try to see what happens
  • 30. Overview of a data frame • The str function shows the structure of your data set. It tells you ▫ the total number of observations (e.g. 32 car types) ▫ the total number of variables (e.g. 11 car features) ▫ a full list of the variable names (e.g. mpg, cyl, ...) ▫ the data type of each variable (e.g. num) ▫ the first few observations
      str(mtcars)
      ## 'data.frame': 32 obs. of 11 variables:
      ##  $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
      ##  $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
      ##  $ disp: num 160 160 108 258 360 ...
      ##  $ hp  : num 110 110 93 110 175 105 245 62 95 123 ...
      ##  $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
      ##  $ wt  : num 2.62 2.88 2.32 3.21 3.44 ...
      ##  $ qsec: num 16.5 17 18.6 19.4 17 ...
      ##  $ vs  : num 0 0 1 1 0 1 0 1 1 1 ...
      ##  $ am  : num 1 1 1 0 0 0 0 0 0 0 ...
      ##  $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
      ##  $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
  • 31. Creating a data frame • The data.frame function builds a data frame from vectors (of the same length and possibly different types)
      # Definition of vectors
      name <- c("Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune")
      type <- c("Terrestrial planet", "Terrestrial planet", "Terrestrial planet", "Terrestrial planet",
                "Gas giant", "Gas giant", "Gas giant", "Gas giant")
      diameter <- c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883)
      rotation <- c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67)
      rings <- c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)
      # Create a data frame from the vectors
      planets_df <- data.frame(name, type, diameter, rotation, rings)
      planets_df
      ##      name               type diameter rotation rings
      ## 1 Mercury Terrestrial planet    0.382    58.64 FALSE
      ## 2   Venus Terrestrial planet    0.949  -243.02 FALSE
      ## 3   Earth Terrestrial planet    1.000     1.00 FALSE
      ## 4    Mars Terrestrial planet    0.532     1.03 FALSE
      ## 5 Jupiter          Gas giant   11.209     0.41  TRUE
      ## 6  Saturn          Gas giant    9.449     0.43  TRUE
      ## 7  Uranus          Gas giant    4.007    -0.72  TRUE
      ## 8 Neptune          Gas giant    3.883     0.67  TRUE
  • 32. Creating a data frame • You may also specify the variables as named arguments
      my.df <- data.frame(name = c('John', 'Kim', 'Keith'),
                          job = c('Teacher', 'Policeman', 'Secretary'),
                          age = c(32, 25, 28))
      my.df
      ##    name       job age
      ## 1  John   Teacher  32
      ## 2   Kim Policeman  25
      ## 3 Keith Secretary  28
  • 33. Selection of data frame elements. Similar to vectors and matrices, you select elements from a data frame with the help of square brackets [ ]. By using a comma, you can indicate what to select from the rows and the columns respectively.
      # Print out diameter of Mercury (row 1, column 3)
      planets_df[1,3]
      ## [1] 0.382
      # Print out data for Mars (entire fourth row)
      planets_df[4, ]
      ##   name               type diameter rotation rings
      ## 4 Mars Terrestrial planet    0.532     1.03 FALSE
      # You can also use the variable name directly
      # Select first 5 values of diameter column
      planets_df[1:5, 'diameter']
      ## [1]  0.382  0.949  1.000  0.532 11.209
  • 34. Selection of data frame elements. You will often want to select an entire column, namely one specific variable from a data frame. If you want to select all elements of the variable diameter, for example, both of these will do the trick:
      planets_df[,3]
      planets_df[,"diameter"]
    However, there is a short-cut. If your columns have names, you can use the $ sign:
      planets_df$diameter
  • 35. Selection of data frame elements: a tricky part • You can use a logical vector to select rows from a data frame
      ## find planets with rings
      planets_df[planets_df$rings, ]
      ##      name      type diameter rotation rings
      ## 5 Jupiter Gas giant   11.209     0.41  TRUE
      ## 6  Saturn Gas giant    9.449     0.43  TRUE
      ## 7  Uranus Gas giant    4.007    -0.72  TRUE
      ## 8 Neptune Gas giant    3.883     0.67  TRUE
      ## select names of planets with rings
      ## (factor output as printed by R < 4.0, where data.frame() converts strings to factors by default)
      planets_df[planets_df$rings, 'name']
      ## [1] Jupiter Saturn Uranus Neptune
      ## Levels: Earth Jupiter Mars Mercury Neptune Saturn Uranus Venus
      ## find planets with larger diameter than earth
      planets_df$diameter > 1
      ## [1] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE
      planets_df[planets_df$diameter > 1, ]
      ##      name      type diameter rotation rings
      ## 5 Jupiter Gas giant   11.209     0.41  TRUE
      ## 6  Saturn Gas giant    9.449     0.43  TRUE
      ## 7  Uranus Gas giant    4.007    -0.72  TRUE
      ## 8 Neptune Gas giant    3.883     0.67  TRUE
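Logical selection also works with combined conditions. The sketch below rebuilds a reduced copy of `planets_df` (without the `type` column) so that it runs on its own; `stringsAsFactors = FALSE` is set explicitly so the result is a character vector on any R version. The names `big_retrograde` and `small_planets` are introduced here for illustration.

```r
planets_df <- data.frame(
  name     = c("Mercury", "Venus", "Earth", "Mars",
               "Jupiter", "Saturn", "Uranus", "Neptune"),
  diameter = c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883),
  rotation = c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67),
  rings    = c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

# Combine two conditions with & : larger than Earth AND negative rotation
big_retrograde <- planets_df[planets_df$diameter > 1 & planets_df$rotation < 0, "name"]

# subset() expresses the same kind of row/column selection with less typing
small_planets <- subset(planets_df, diameter < 1, select = c(name, diameter))
```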
  • 36. Handling Big Data using R <2nd> Package “dplyr”, essential for data preprocessing in R. HyunJae Jo, School of ICT and Global Entrepreneurship, Handong Global University
  • 37. Contents • Data preprocessing • dplyr package • Lab example
  • 38. Why we have to do “Data preprocessing” • Data preprocessing is the process of making raw data suitable for data analysis. • Data Cleaning: correction of missing values, outliers and noisy data • Data Integration: combining data from multiple sources • Data Reduction: keeping only the data needed for analysis • Data Transformation: transforming data to maximize the efficiency of data analytics
  • 39. Example of “Data Integration” (figure labels: missing value, noisy data, outlier) 1. Data Integration: integrating a “Female Population” column
  • 40. Example of “Data Cleaning” (figure labels: missing value, noisy data, outlier) 2. Data Cleaning: correcting the “Male Pop.” column, removing the missing values (“others” row), correcting outliers
  • 41. Example of “Data Reduction” (figure labels: missing value, noisy data, outlier) 3. Data Reduction: top 5 regions by total population
  • 42. Example of “Data Transformation” (figure labels: missing value, noisy data, outlier) 4. Data Transformation: ratio of each region's population to the total population
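The four steps above can be sketched in base R. This is only an illustration under stated assumptions: the data frames `pop` and `female_src`, their column names and all numbers are made up (they are not the Myanmar figures from the slides), and the repair rule for the outlier (recomputing the total from male plus female) is one possible choice among several.

```r
# Hypothetical regional population table (made-up numbers, in thousands)
pop <- data.frame(
  region = c("A", "B", "C", "D", "Others"),
  total  = c(500, 300, -200, 800, NA),   # -200 is an outlier, NA is missing
  male   = c(240, 150, 95, 390, NA),
  stringsAsFactors = FALSE
)

# Hypothetical second source holding the female population
female_src <- data.frame(
  region = c("A", "B", "C", "D"),
  female = c(260, 150, 105, 410),
  stringsAsFactors = FALSE
)

# 1. Integration: add the female column from the second source
pop <- merge(pop, female_src, by = "region", all.x = TRUE)

# 2. Cleaning: drop rows with a missing total, then repair the negative total
pop <- pop[!is.na(pop$total), ]
bad <- pop$total < 0
pop$total[bad] <- pop$male[bad] + pop$female[bad]

# 3. Reduction: keep only the three largest regions
pop <- pop[order(pop$total, decreasing = TRUE), ]
top3 <- head(pop, 3)

# 4. Transformation: each region's share of the cleaned overall total
top3$share <- top3$total / sum(pop$total)
```

The order of the steps follows the slides; in practice cleaning is often done before integration when the join key itself is unreliable.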
  • 43. What is “dplyr” • "dplyr" is an R package that specializes in data processing. • Functions of “dplyr” ▫ mutate() adds new variables that are functions of existing variables. ▫ select() picks variables based on their names. ▫ filter() picks cases based on their values. ▫ summarise() reduces multiple values down to a single summary. ▫ arrange() changes the ordering of the rows. ▫ group_by() allows any operation to be performed “by group”. ▫ %>% (chain) connects the operations so they can be written and run as one pipeline.
  • 44. How we can use “dplyr” • Install the package and load it. • If you want to see the package manual in R, you can use the “help” function.
  • 45. imdb dataset • IMDb (Internet Movie Database) is an online database of information related to films, television programs, home videos and video games, and streaming content online, including cast, production crew and personnel biographies, plot summaries, trivia, and fan reviews and ratings. As of October 2018, IMDb has approximately 5.3 million titles (including episodes) and 9.3 million personalities in its database, as well as 83 million registered users. (Wikipedia) • We use a reduced imdb dataset, which you can load using the read.csv() function with this URL: • https://raw.githubusercontent.com/myhan0710/prac_dplyr/master/imdb.csv
  • 46. Overview: imdb dataset. The structure overview shows the number of observations (e.g. 15,190 movies), the number of variables (e.g. 44 related variables), the variable names (e.g. fn, tid, title, wordsInTitle, …) and the data type of each variable (e.g. chr, num, int).
  • 47. Lab Example • We will now look at the functions of the dplyr library through the imdb dataset. • Familiarize yourself with the roles of the given functions, and enter the code in your RStudio. • Through the lab, we will finally obtain data from the imdb dataset that meet the following conditions:
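A minimal sketch of loading the dataset from the URL given on the slide. The helper name `load_imdb` is hypothetical; the download is wrapped in a function so that nothing is fetched until you call it, and the call itself needs a network connection.

```r
# URL as given on the slide
imdb_url <- "https://raw.githubusercontent.com/myhan0710/prac_dplyr/master/imdb.csv"

# Hypothetical helper: read the CSV and keep text columns as character
load_imdb <- function(url = imdb_url) {
  read.csv(url, stringsAsFactors = FALSE)
}

# Uncomment to download and inspect the data (requires internet access):
# imdb <- load_imdb()
# str(imdb)
```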
  • 48. dplyr: filter • Use the filter() function to extract the rows that satisfy a condition. • First, we can find movies that are recent, highly rated and have many ratings, like this:
  • 49. dplyr: mutate • Use the mutate() function to make a new column. • Second, we can make a column that indicates whether a movie belongs to both the action and the adventure genres: • If a movie’s Action column value is 1 and its Adventure column value is 0, the movie belongs to the action genre but not to the adventure genre. • So, an ActionAdven column value of 1 means the movie belongs to both the action and the adventure genres.
  • 50. dplyr: select • Use the select() function to extract specific columns. • We only need a few columns: - wordsInTitle - imdbRating - ratingCount - Year - ActionAdven
  • 51. dplyr: arrange • Use the arrange() function to sort the data from small to large values based on a specified column. • Fourth, we would like to sort by year in ascending order and then by rating in descending order. • The desc() function sorts a column in descending order.
  • 52. dplyr: summarise • Use the summarise() function to obtain basic statistics by specifying functions such as mean(), var(), and median(). • Now, we want to know the average rating, the average year, and the number of Action-Adventure movies.
  • 53. dplyr: group_by • Use the group_by() function to group the data by the levels of a specified column. • After group_by(), you can apply summarise() to view the results for each level.
  • 54. dplyr: %>% • Use the chain operator (%>%) to write all the functions as one pipeline. • The results are the same. (imdb_chain, imdb7)
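The slide's code is not part of the transcript, so the following is only a sketch of `filter()` on a small made-up data frame. The data frame `movies` and its values are invented; the column names `imdbRating`, `ratingCount` and `year` are assumed from the slides, and the thresholds are arbitrary choices for illustration.

```r
library(dplyr)

# Toy stand-in for the imdb data (made-up values, assumed column names)
movies <- data.frame(
  wordsInTitle = c("alpha", "bravo", "charlie", "delta", "echo"),
  imdbRating   = c(8.1, 6.5, 7.9, 8.4, 5.2),
  ratingCount  = c(12000, 300, 4500, 90000, 50),
  year         = c(2012, 1995, 2010, 2008, 2013),
  stringsAsFactors = FALSE
)

# Conditions separated by commas are combined with AND
recent_popular <- filter(movies,
                         year >= 2010,
                         imdbRating >= 7.5,
                         ratingCount >= 1000)
```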
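A sketch of the `mutate()` step on made-up data, since the slide's code is not in the transcript. The 0/1 genre columns `Action` and `Adventure` and the new column name `ActionAdven` follow the slide text; the data frame `movies` and the use of `as.integer()` on a logical condition are assumptions of this example.

```r
library(dplyr)

# Toy data with 0/1 genre indicator columns (made-up values)
movies <- data.frame(
  wordsInTitle = c("alpha", "bravo", "charlie", "delta", "echo"),
  Action       = c(1, 0, 1, 1, 0),
  Adventure    = c(1, 0, 0, 1, 1),
  stringsAsFactors = FALSE
)

# New column: 1 when the movie is both action and adventure, otherwise 0
movies2 <- mutate(movies,
                  ActionAdven = as.integer(Action == 1 & Adventure == 1))
```

`mutate()` returns a new data frame and leaves `movies` unchanged, so the result has to be assigned to keep it.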
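A sketch of `select()` on made-up data. The data frame `movies` is invented and the column names are assumed from the slides; the names `kept` and `dropped` are introduced here for illustration.

```r
library(dplyr)

# Toy data (made-up values, assumed column names)
movies <- data.frame(
  wordsInTitle = c("alpha", "bravo", "charlie"),
  imdbRating   = c(8.1, 6.5, 7.9),
  ratingCount  = c(12000, 300, 4500),
  year         = c(2012, 1995, 2010),
  duration     = c(5400, 6100, 7200),
  stringsAsFactors = FALSE
)

# Keep only the named columns, in the order given
kept <- select(movies, wordsInTitle, imdbRating, year)

# A minus sign drops a column and keeps the rest
dropped <- select(movies, -duration)
```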
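A sketch of `arrange()` with one ascending and one descending key. The data frame `movies` is made up, with repeated years so that the second sort key visibly matters; the column names are assumed from the slides.

```r
library(dplyr)

# Toy data with repeated years (made-up values)
movies <- data.frame(
  wordsInTitle = c("a", "b", "c", "d", "e"),
  year         = c(2012, 2010, 2012, 2010, 2011),
  imdbRating   = c(7.0, 8.5, 8.2, 6.1, 7.7),
  stringsAsFactors = FALSE
)

# Sort by year (ascending), then by rating (descending) within each year
sorted <- arrange(movies, year, desc(imdbRating))
```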
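A sketch of `summarise()` computing the three statistics the slide asks for. The data frame `movies` and its values are made up; the output column names `avg_rating`, `avg_year` and `n_action_adven` are chosen here for illustration.

```r
library(dplyr)

# Toy data (made-up values); ActionAdven is 1 for action-adventure movies
movies <- data.frame(
  imdbRating  = c(8.1, 6.5, 7.9, 8.4, 5.2),
  year        = c(2012, 1995, 2010, 2008, 2013),
  ActionAdven = c(1, 0, 0, 1, 0)
)

# Collapse the whole table into a single row of summary statistics
stats <- summarise(movies,
                   avg_rating     = mean(imdbRating),
                   avg_year       = mean(year),
                   n_action_adven = sum(ActionAdven))
```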
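A sketch of `group_by()` followed by `summarise()`, again on made-up data. Grouping by `ActionAdven` is an assumption of this example (the slide does not say which column is grouped); `n()` counts the rows in each group.

```r
library(dplyr)

# Toy data (made-up values)
movies <- data.frame(
  imdbRating  = c(8.1, 6.5, 7.9, 8.4, 5.2),
  ActionAdven = c(1, 0, 0, 1, 0)
)

# One summary row per level of ActionAdven
grouped <- group_by(movies, ActionAdven)
by_genre <- summarise(grouped,
                      avg_rating = mean(imdbRating),
                      n_movies   = n())
```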
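A sketch comparing step-by-step calls with a single `%>%` pipeline. The data, thresholds and the names `step1` to `step4` and `chained` are made up for illustration; they are not the slide's `imdb7` and `imdb_chain` objects, whose code is not in the transcript.

```r
library(dplyr)

# Toy stand-in for the imdb data (made-up values, assumed column names)
movies <- data.frame(
  wordsInTitle = c("alpha", "bravo", "charlie", "delta", "echo"),
  imdbRating   = c(8.1, 6.5, 7.9, 8.4, 5.2),
  year         = c(2012, 1995, 2010, 2008, 2013),
  Action       = c(1, 0, 1, 1, 0),
  Adventure    = c(1, 0, 0, 1, 1),
  stringsAsFactors = FALSE
)

# Step by step, with one intermediate object per function
step1 <- filter(movies, year >= 2005, imdbRating >= 7.5)
step2 <- mutate(step1, ActionAdven = as.integer(Action == 1 & Adventure == 1))
step3 <- select(step2, wordsInTitle, imdbRating, year, ActionAdven)
step4 <- arrange(step3, year, desc(imdbRating))

# The same operations as one pipeline: each result feeds the next function
chained <- movies %>%
  filter(year >= 2005, imdbRating >= 7.5) %>%
  mutate(ActionAdven = as.integer(Action == 1 & Adventure == 1)) %>%
  select(wordsInTitle, imdbRating, year, ActionAdven) %>%
  arrange(year, desc(imdbRating))

# Both routes give the same data frame
same_result <- isTRUE(all.equal(step4, chained))
```

The pipeline avoids the intermediate objects and reads top to bottom in the order the operations are applied.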
  • 56. Exercise ① Extract the movies released between 1970 and 2000 from the original imdb dataset. ② Using the chain operator, extract the movies that belong to both the drama and the family genres, keeping only the columns year, duration and nrOfphotos. • Please replace <Fill In> with your solution. • After filling in your solution, check it against the answer on the next page.
  • 58.
  • 59. Data Visualization • An essential component of a data scientist's skill set • With the ever increasing volume of data, it is impossible to tell stories without visualizations. • Data visualization is the art of turning “Big Data” into useful knowledge
  • 60. Data Visualization (diagram labels: Statistics, Design, Graphical Data Analysis, Communication & Perception)
  • 61. Data Visualization • Exploratory Visualization ▫ Helps you see what is in the data • Explanatory Visualization ▫ Shows others what you’ve found in your data • R supports both types of visualizations
  • 62. Data Visualization • Exploratory Visualization ▫ Helps you see what is in the data ▫ Keep as much detail as possible ▫ Practical limit: how much you can see and interpret • Explanatory Visualization ▫ Helps us share our understanding with others ▫ Shows others what you’ve found in your data ▫ Requires editorial decisions: ▫ Highlight the key features you want to emphasize ▫ Eliminate extraneous details
  • 65. ggplot2 • Author: Hadley Wickham • Open-source implementation of the layered grammar of graphics • High-level R package for creating publication-quality statistical graphics ▫ Carefully chosen defaults following basic graphical design rules • Flexible set of components for creating any type of graphics • Things you cannot do with ggplot2 ▫ 3-dimensional graphics ▫ Graph-theory type graphs (nodes/edges layout)
  • 66. ggplot2 installation • In the R console:
      install.packages("ggplot2")
      library(ggplot2)
  • 69. The quick brown fox jumps over the lazy dog
  • 70. Grammar of Graphics • Plotting framework • Leland Wilkinson, Grammar of Graphics, 1999 • 2 principles ▫ Graphics = distinct layers of grammatical elements ▫ Meaningful plots through aesthetic mapping
  • 71. Essential Grammatical Elements (Element: Description)
      Data: The dataset being plotted.
      Aesthetics: The scales onto which we map our data.
      Geometries: The visual elements used for our data.
  • 72. All Grammatical Elements (Element: Description)
      Data: The dataset being plotted.
      Aesthetics: The scales onto which we map our data.
      Geometries: The visual elements used for our data.
      Facets: Plotting small multiples.
      Statistics: Representations of our data to aid understanding.
      Coordinates: The space on which the data will be plotted.
      Themes: All non-data ink.
  • 74. Grammar of Graphics • Building blocks • Solid, creative, meaningful visualizations • Essential layers: Data, Aesthetics, Geometries • For enhancement: Facets, Statistics, Coordinates, Themes
  • 75. Example: First trial of ggplot • To get a first feel for ggplot2, let's try to run some basic ggplot2 commands. Together, they build a plot of the mtcars dataset that contains information about 32 cars from a 1973 Motor Trend magazine. This dataset is small, intuitive, and contains a variety of continuous and categorical variables.
  • 76. Example: First trial of ggplot • The plot from the previous exercise wasn't really satisfying. Although cyl (the number of cylinders) is categorical, it is classified as numeric in mtcars. You'll have to explicitly tell ggplot2 that cyl is a categorical variable.
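The seven grammatical elements map onto ggplot2 calls roughly as follows. This is a sketch on the built-in `mtcars` data; the choice of variables (`wt`, `mpg`, `cyl`), the linear smoother, the axis limits and the theme are arbitrary picks for illustration.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg)) +      # Data + Aesthetics
  geom_point() +                                  # Geometries
  facet_wrap(~ cyl) +                             # Facets: one panel per cyl value
  stat_smooth(method = "lm", formula = y ~ x,
              se = FALSE) +                       # Statistics: fitted line
  coord_cartesian(ylim = c(10, 35)) +             # Coordinates: visible y range
  theme_minimal()                                 # Themes: non-data ink

# Printing the object draws the plot
print(p)
```

Only the first three lines are required for a plot; the remaining layers refine it.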
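The slide's commands are not in the transcript. A first plot of `mtcars` along these lines could look like the sketch below, assuming, based on the next slide, that it maps `cyl` to the x axis and `mpg` to the y axis.

```r
library(ggplot2)

# cyl is stored as a number, so ggplot2 treats the x axis as continuous
cyl_class <- class(mtcars$cyl)
cyl_values <- sort(unique(mtcars$cyl))

# Scatter plot of fuel economy against number of cylinders
p_numeric <- ggplot(mtcars, aes(x = cyl, y = mpg)) +
  geom_point()

print(p_numeric)
```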
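One common way to tell ggplot2 that `cyl` is categorical is to wrap it in `factor()` inside `aes()`. The sketch below assumes the same x and y mapping as the previous slide; the object names are introduced here for illustration.

```r
library(ggplot2)

# factor() turns the three numeric cylinder counts into discrete levels
cyl_levels <- levels(factor(mtcars$cyl))

# With a factor on x, the axis shows only the categories 4, 6 and 8
p_factor <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_point()

print(p_factor)
```

Converting inside `aes()` leaves `mtcars` itself unchanged; the column stays numeric for other uses.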

Editor's Notes

  1. Hello everyone, my name is Hyunjae Jo. I'm the instructor for the second lecture of the big data course. In the previous lecture, you heard an introduction to big data and the basic grammar of R. In this lecture, we will learn about data preprocessing and dplyr, one of the most powerful packages for data preprocessing.
  2. This is what we are going to do in this lecture. First, we will learn what data preprocessing is and why we have to do it, and you'll learn four types of data preprocessing. Next, we'll learn about the dplyr package. You will find out what functions are in the package and what their roles are. Lastly, we'll use the dplyr package in a lab example.
  3. Let's start. What is data preprocessing? Data preprocessing is the process of making raw data suitable for data analysis. When you download original data, it may contain unnecessary or incorrect content. The raw data itself is inconvenient to analyze and can be unnecessarily large. You should refine the original data into data that you can use conveniently. We call this process data preprocessing. There are four main types of data preprocessing, and each is used when the data calls for it. Data cleaning is the process of correcting incorrect values. Data integration is the process of combining additional required data with the existing data. Data reduction is the process of removing unnecessary data. Lastly, data transformation is the process of converting raw data into the required form.
  4. Let's take a look at it with an example. The picture on the left is a sample of data related to Myanmar's population. The current data shows the total population and the male population by region. But as you can see, there are a few wrong numbers and empty items in the data. You can see these negative values. When a figure is significantly different from the rest of the data, we call it an outlier. Other rows have no value in some columns; we call these missing values. Lastly, the male population column has some unnecessary values, and we call this noisy data. So, let's preprocess this data step by step. First, I think we need one more column, the female population by region. So, I add one more column, "Female population". We call this process data integration.
  5. Next, we have to correct wrong values such as outliers, missing values, and noisy data. As shown in the picture on the right, the wrong values have been changed to the right ones, and some unnecessary values have been removed. We call this process data cleaning.
  6. Now, we want to select the five most populous regions and analyze only that data. In other words, apart from the top five regions, the rest of the data is not needed. I pick out only the top five regions by population, and the rest of the data is removed. This process is called data reduction.
  7. Finally, with the data for the top five regions, we want to know each region's share of the population. So, I divide the population of each region by the total population. This gives me the percentage of the population in each region. We call this process data transformation. Through these four preprocessing steps, we were able to obtain the percentage of the population in each region from the raw data.
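The four steps described in notes 4 to 7 can be sketched in base R as follows. The data frame below is entirely made up for illustration (placeholder region names and numbers, not the real Myanmar figures from the slide), and dropping bad rows is a simplification of the slide, where wrong values are corrected instead.

```r
# Made-up toy data, NOT real census figures; region names are placeholders.
pop <- data.frame(
  region = c("A", "B", "C", "D", "E", "F", "G", "H"),
  total  = c(500, -120, 300, NA, 800, 650, 400, 250),
  male   = c(240, 60, 150, 100, 390, 320, 190, 120),
  stringsAsFactors = FALSE
)

# 1. Data integration: add a female column derived from total and male.
pop$female <- pop$total - pop$male

# 2. Data cleaning: drop rows whose total is missing or negative
#    (simplification: the slide corrects such values rather than dropping them).
pop <- pop[!is.na(pop$total) & pop$total >= 0, ]

# 3. Data reduction: keep only the five most populous regions.
pop  <- pop[order(pop$total, decreasing = TRUE), ]
top5 <- head(pop, 5)

# 4. Data transformation: each region's share, in percent.
#    Assumption: shares are taken relative to the top-five total.
top5$share_pct <- round(100 * top5$total / sum(top5$total), 1)

print(top5)
```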
  8. So, this is data preprocessing. I believe you are following well. Now, we will learn about the dplyr package, which is one of the most effective ways to preprocess data in R. As you can see here, dplyr supports seven major functions. You will actually use these functions in the following slides.
  9. dplyr was installed on all the computers yesterday, so you don't have to install it. Just load the dplyr package. If you want some guidance on dplyr, you can use the help function. Give it a try!
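Loading the package and opening its help pages might look like the sketch below. It assumes dplyr is already installed; the help calls are wrapped in interactive() only so that the snippet also runs cleanly as a script.

```r
# Load dplyr (assumes it is already installed on this machine).
library(dplyr)

if (interactive()) {
  help(package = "dplyr")            # index of everything the package provides
  help("filter", package = "dplyr")  # help page for a single function
}
```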
  10. Let me introduce the imdb dataset that we will use to apply the dplyr functions. The imdb dataset mainly contains information about movies, TV shows, and so on. It's a very large dataset, so we're going to do the preprocessing on a scaled-down version of it. You can load the dataset with the read.csv function and its URL. Now get the data onto your computer. If there's a problem, raise your hand and get some help.
  11. I think most of you have done it, so now let's look at the structure of the dataset. You can use the str function to look at the structure of the data. With str, you can see the number of observations, the total number of variables, the names of the variables, the type of each variable, and the first few values of each variable. Our data has 15190 observations and 44 variables.
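The notes do not give the dataset URL, so this sketch reads a tiny inline CSV instead; with a real file you would pass its URL or path as the first argument of read.csv. The column names wordsInTitle, imdbRating, ratingCount, and year appear in the notes, while Action and Adventure and all the values are assumptions.

```r
# Tiny stand-in for the real file; every value here is made up.
csv_text <- "wordsInTitle,imdbRating,ratingCount,year,Action,Adventure
movie one,7.9,12000,2011,1,1
movie two,6.1,300,1998,0,1
movie three,8.3,45000,2013,1,0"

# read.csv accepts a file path, a URL, or (as here) literal text.
imdb <- read.csv(text = csv_text, stringsAsFactors = FALSE)

str(imdb)   # observations, variables, types, and first values
dim(imdb)   # number of rows and columns
```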
  12. Now let's start the lab example. We will apply the dplyr functions to the imdb dataset. Familiarize yourself with each function and type in the code for it. After the lab, you will know what data preprocessing is.
  13. First, you can use the filter function. With filter, you can extract the rows that satisfy given conditions. So, the first preprocessing step is to find recent movies with high ratings and many rating counts. I'll give you time to write this code, so for now, just follow along.
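A filter call for this step might look like the following sketch. The toy data frame and the three thresholds (year 2010, rating 7.5, 10000 votes) are hypothetical; the values used on the slide are not in the transcript.

```r
library(dplyr)

# Made-up stand-in for the imdb data.
imdb <- data.frame(
  wordsInTitle = c("alpha", "bravo", "charlie", "delta", "echo"),
  imdbRating   = c(8.1, 6.4, 7.8, 9.0, 7.6),
  ratingCount  = c(52000, 800, 15000, 120, 30000),
  year         = c(2012, 2014, 2003, 2015, 2011),
  stringsAsFactors = FALSE
)

# Keep rows that satisfy all three conditions (hypothetical thresholds).
imdb_recent <- filter(imdb, year >= 2010, imdbRating >= 7.5, ratingCount >= 10000)

print(imdb_recent)
```

Conditions separated by commas are combined with a logical AND, so a row is kept only if it passes every one of them.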
  14. The second function is mutate. You can create a new column with the mutate function. With this function, you can make a new column that shows whether a movie belongs to both the action and the adventure genres. I named the column ActionAdven. If ActionAdven is 1, the movie contains both genres. If ActionAdven is 0, at least one of the two genres is missing.
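A sketch of the mutate step, assuming the genres are stored as 0/1 indicator columns named Action and Adventure (those two column names and the data are assumptions; only ActionAdven is named in the notes):

```r
library(dplyr)

# Made-up stand-in data with assumed 0/1 genre indicator columns.
imdb <- data.frame(
  wordsInTitle = c("alpha", "bravo", "charlie", "delta", "echo"),
  Action       = c(1, 0, 1, 1, 1),
  Adventure    = c(1, 1, 0, 1, 0),
  stringsAsFactors = FALSE
)

# ActionAdven is 1 only when both genre flags are 1, otherwise 0.
imdb_flagged <- mutate(imdb, ActionAdven = ifelse(Action == 1 & Adventure == 1, 1, 0))

print(imdb_flagged)
```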
  15. The third function is select. With the select function, you can extract the columns you name. So, we make a data frame that has only the wordsInTitle, imdbRating, ratingCount, year, and ActionAdven columns. Write the three pieces of code above in R and check that the result matches the pictures.
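The select step could be sketched like this. The five kept column names come from the notes; the toy data and the extra Action and Adventure columns are assumptions.

```r
library(dplyr)

# Made-up stand-in data that already contains the ActionAdven column.
imdb <- data.frame(
  wordsInTitle = c("alpha", "bravo", "charlie"),
  imdbRating   = c(8.1, 6.4, 7.8),
  ratingCount  = c(52000, 800, 15000),
  year         = c(2012, 2014, 2003),
  Action       = c(1, 0, 1),
  Adventure    = c(1, 1, 0),
  ActionAdven  = c(1, 0, 0),
  stringsAsFactors = FALSE
)

# Keep only the five named columns, in the order given.
imdb_small <- select(imdb, wordsInTitle, imdbRating, ratingCount, year, ActionAdven)

names(imdb_small)
```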
  16. So, I think most of you have written the select code. Let's move on. The next function is arrange. With the arrange function, we can sort the data from the smallest to the largest value of a specific column. You can also sort from the largest to the smallest value with the desc function. Now, we would like to sort the data by year.
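Sorting by year in both directions might look like this sketch, again on made-up data:

```r
library(dplyr)

# Made-up stand-in data.
imdb <- data.frame(
  wordsInTitle = c("alpha", "bravo", "charlie", "delta", "echo"),
  year         = c(2012, 2014, 2003, 2015, 2011),
  stringsAsFactors = FALSE
)

# Ascending order: oldest movie first.
imdb_by_year <- arrange(imdb, year)

# Descending order: wrap the column in desc() to put the newest movie first.
imdb_by_year_desc <- arrange(imdb, desc(year))

print(imdb_by_year)
print(imdb_by_year_desc)
```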
  17. Finally, with the most powerful feature, you can write the code for all the functions at once. This is called chaining, and with it you can link all the code you wrote before into a single expression. Please write down the code below and check the results. If it was written correctly, the results of imdb_chain and imdb7 will be the same.
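In dplyr, chaining is commonly written with the %>% pipe operator, which passes the result on its left as the first argument of the function on its right. The sketch below compares a step-by-step version with a chained one. The data, the thresholds, and the intermediate names (step1 to step3, imdb_steps) are hypothetical; only imdb_chain is a name taken from the notes, and whether the slide uses %>% is an assumption.

```r
library(dplyr)

# Made-up stand-in data with assumed 0/1 genre columns.
imdb <- data.frame(
  wordsInTitle = c("alpha", "bravo", "charlie", "delta", "echo"),
  imdbRating   = c(8.1, 6.4, 7.8, 9.0, 7.6),
  ratingCount  = c(52000, 800, 15000, 120, 30000),
  year         = c(2012, 2014, 2003, 2015, 2011),
  Action       = c(1, 0, 1, 1, 1),
  Adventure    = c(1, 1, 0, 1, 0),
  stringsAsFactors = FALSE
)

# Step-by-step version: every step stores an intermediate result.
step1 <- filter(imdb, year >= 2010, imdbRating >= 7.5, ratingCount >= 10000)
step2 <- mutate(step1, ActionAdven = ifelse(Action == 1 & Adventure == 1, 1, 0))
step3 <- select(step2, wordsInTitle, imdbRating, ratingCount, year, ActionAdven)
imdb_steps <- arrange(step3, year)

# Chained version: the same four steps linked with %>%.
imdb_chain <- imdb %>%
  filter(year >= 2010, imdbRating >= 7.5, ratingCount >= 10000) %>%
  mutate(ActionAdven = ifelse(Action == 1 & Adventure == 1, 1, 0)) %>%
  select(wordsInTitle, imdbRating, ratingCount, year, ActionAdven) %>%
  arrange(year)

# Both versions should produce the same data frame.
same_result <- isTRUE(all.equal(imdb_steps, imdb_chain))
print(same_result)
```

The chained form avoids naming intermediate results and reads top to bottom in the order the steps are applied.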
  18. This is all the code from the lab example we just did.
  19. Now let's try to solve some problems. There are two problems. Read them and fill in the blanks. After you solve them, check your answers against the next slide.
  20. I think most of you are done, so I'll finish this session here. You learned what data preprocessing is and why we have to do it. And we looked at how to use dplyr, one of the most powerful data processing packages in R. I hope what you've learned today will be helpful when you analyze big data. Thank you.