Big Data Analysis in Hydrogen Station using Spark and Azure ML

High Performance Information Computing Center
Jongwook Woo
CSULA
Hydrogen Gas Power Plant Data Analysis and Prediction Using Spark
Manvi Chandra, mchandr2@calstatela.edu
Jongwook Woo, PhD, jwoo5@calstatela.edu
High-Performance Information Computing Center (HiPIC)
California State University Los Angeles

Contents
Myself
Introduction to Big Data
Machine Learning
Spark Cores
RDD
Spark SQL, Streaming, ML
Hydrogen Gas Power Plant Prediction Model

Myself
Name: Manvi Chandra
Experience:
2012-2014: Programmer Analyst at Cognizant Technology Solutions
2015-2016 to present: Master's in Information Systems
Exposed to Big Data analytics
Pursuing research in Big Data analytics and machine learning
2007-2011: Bachelor of Technology in Electronics and Communication Engineering

Introduction To Big Data

Data Issues
Large-scale data
Terabytes (10^12 bytes), petabytes (10^15 bytes)
– Because of the web
– Sensor data (IoT), bioinformatics, social computing, streaming data, smartphones, online games…
Cannot be handled with the legacy approach
Too big
Un-/semi-structured data
Too expensive
Need new systems
Inexpensive

Two Cores in Big Data
How to store Big Data
How to compute Big Data
Google
How to store Big Data
– GFS
– On inexpensive commodity computers
How to compute Big Data
– MapReduce
– Parallel computing with multiple inexpensive computers
• Own supercomputers
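
The MapReduce idea above can be sketched in plain Python on a single machine. This is only an illustration of the map, shuffle, and reduce phases, not Hadoop's or Google's implementation; the function names and the sample lines are made up for the example.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over all lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical input; in a real cluster each line could live on a different node.
    sample = ["big data on commodity computers", "big data in parallel"]
    print(word_count(sample))
```

In a real cluster the map and reduce calls run on many machines at once, and the shuffle moves data between them over the network.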

What is Hadoop?
Hadoop Founder:
Doug Cutting
Chief Architect at Cloudera

Definition: Big Data
Inexpensive frameworks that can store large-scale data and process it faster in parallel
Hadoop
– Inexpensive supercomputer
– You can build and run your applications

Alternative to Hadoop MapReduce
Limitations of MapReduce
Hard to program in Java
Batch processing
– Not interactive
Disk storage for intermediate data
– Performance issue
Spark by UC Berkeley AMPLab
In-memory storage for intermediate data
10-100x faster than network and disk

Machine Learning
Subfield of computer science that evolved from
the study of pattern recognition and
computational learning theory in artificial
intelligence.
Explores pattern recognition during data analysis
through computer science and statistics.
Machine learning is a method of data analysis
that automates analytical model building. Using
algorithms that iteratively learn from data,
machine learning allows computers to find
hidden insights without being explicitly
programmed where to look.

Machine Learning Studio
Microsoft Azure Machine Learning Studio is a
collaborative, drag-and-drop tool you can use
to build, test, and deploy predictive analytics
solutions on your data.

Spark
In-Memory Data Computing
Faster than Hadoop MapReduce
Can integrate with Hadoop and its ecosystem
HDFS
HBase, Hive, Sequence files
New programming model with faster data sharing
Good for complex multi-stage applications
– Iterative graph algorithms, Machine Learning
Interactive query

Spark Cores
Spark: RDDs, transformations, and actions
Spark Streaming (real-time): DStreams, streams of RDDs
Spark SQL: SchemaRDDs, DataFrames
MLlib (machine learning): RDD-based matrices
GraphX (graph): RDD-based matrices
SparkR: RDD-based matrices

Spark Drivers and Workers
Drivers
Client
–with SparkContext
• Create RDDs
Workers
Spark Executor
Run on cluster nodes
–Production
Run in local threads
–development

Contents
Myself
Introduction to Big Data
Machine Learning
Spark Cores
RDD
Spark SQL, Streaming, ML
Hydrogen Gas Power Plant Prediction Model

RDD
Resilient Distributed Dataset (RDD)
Distributed collections of objects
– That can be cached in memory
RDD, DStream, SchemaRDD, PairRDD
Immutable
Lineage
– History of the objects
– Automatically and efficiently recomputes lost data

RDD Operations
Transformations
Define new RDDs from existing ones
– Lazy: not computed immediately
map(), filter(), join()
Actions
Return values
count(), collect(), take(), saveAsTextFile()
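
The difference between lazy transformations and actions can be illustrated with a toy class in plain Python. This is a sketch of the idea only, not Spark's API or implementation: `LazyDataset` is a hypothetical name, and it runs on one machine. Each transformation just records a step (a simple form of lineage), and nothing is computed until an action is called.

```python
import itertools


class LazyDataset:
    """Toy stand-in for an RDD: records transformations, runs them on an action."""

    def __init__(self, source, lineage=()):
        self._source = source            # original input, never modified
        self._lineage = tuple(lineage)   # ordered (kind, function) steps

    # Transformations: return a new dataset and compute nothing yet.
    def map(self, fn):
        return LazyDataset(self._source, self._lineage + (("map", fn),))

    def filter(self, predicate):
        return LazyDataset(self._source, self._lineage + (("filter", predicate),))

    def _compute(self):
        """Replay the recorded steps over the source, one element at a time."""
        items = iter(self._source)
        for kind, fn in self._lineage:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    # Actions: trigger the computation and return values.
    def collect(self):
        return list(self._compute())

    def count(self):
        return sum(1 for _ in self._compute())

    def take(self, n):
        return list(itertools.islice(self._compute(), n))


if __name__ == "__main__":
    doubled = LazyDataset([1, 2, 3, 4]).map(lambda x: x * 2)  # nothing runs here
    large = doubled.filter(lambda x: x > 4)                   # still nothing
    print(large.collect())                                    # computation happens now
```

Because every dataset keeps its source and its recorded steps, a result can always be rebuilt by replaying them, which is the intuition behind recomputing lost data from lineage.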

Programming in Spark
Scala
Functional programming
– Functions are the fundamental building blocks
• Inputs and outputs can be functions
No side effects
– No state
Python
Established language with large libraries
Java

Spark
Spark SQL
Turning an RDD into a relation
Querying using SQL
Spark Streaming
DStream
– RDD in streaming
– Windows
• To select a portion of a DStream from streaming data
MLlib
Sparse vector support, decision trees, linear/logistic regression
SVD and PCA

Spark
Hydrogen gas power plant Spark model
o Separating the label column.
o Creating the RDD.
o Splitting the data into training and test sets.
o Training the model using the decision forest regression algorithm.
o Evaluating the result.
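
The steps above can be sketched end to end in plain Python. This is a minimal single-machine illustration, not the Spark or Azure ML code used for the station data: the forest is built from depth-1 trees (stumps) on bootstrap samples, all function names are hypothetical, and the rows in the demo are synthetic values invented for the example.

```python
import random
from statistics import mean


def split_label(rows, label_index):
    """Separate every row into its feature values and its label."""
    features = [r[:label_index] + r[label_index + 1:] for r in rows]
    labels = [r[label_index] for r in rows]
    return features, labels


def train_test_split(features, labels, test_fraction, seed):
    """Shuffle row indices reproducibly, then cut them into train and test."""
    idx = list(range(len(labels)))
    random.Random(seed).shuffle(idx)
    n_test = max(1, int(len(idx) * test_fraction))
    test, train = idx[:n_test], idx[n_test:]

    def pick(ids):
        return [features[i] for i in ids], [labels[i] for i in ids]

    return pick(train), pick(test)


def fit_stump(X, y):
    """Fit a depth-1 regression tree: the one split with the lowest squared error."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [yi for row, yi in zip(X, y) if row[j] <= t]
            right = [yi for row, yi in zip(X, y) if row[j] > t]
            if not left or not right:
                continue
            ml, mr = mean(left), mean(right)
            sse = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
            if best is None or sse < best[0]:
                best = (sse, j, t, ml, mr)
    if best is None:  # no usable split: always predict the overall mean
        overall = mean(y)
        return lambda row: overall
    _, j, t, ml, mr = best
    return lambda row: ml if row[j] <= t else mr


def fit_forest(X, y, n_trees, seed):
    """Train each stump on a bootstrap sample; predict the mean of all stumps."""
    rng = random.Random(seed)
    trees = []
    for _ in range(n_trees):
        ids = [rng.randrange(len(y)) for _ in y]
        trees.append(fit_stump([X[i] for i in ids], [y[i] for i in ids]))
    return lambda row: mean(tree(row) for tree in trees)


def rmse(model, X, y):
    """Root mean squared error of the model on the given rows."""
    return (sum((model(r) - t) ** 2 for r, t in zip(X, y)) / len(y)) ** 0.5


if __name__ == "__main__":
    noise = random.Random(0)
    # Synthetic rows: two made-up features and a made-up numeric label in the last column.
    rows = [[a, b, 5.0 * a + 2.0 * b + noise.gauss(0, 1)]
            for a in range(10) for b in range(10)]
    X, y = split_label(rows, label_index=2)
    (X_train, y_train), (X_test, y_test) = train_test_split(X, y, 0.3, seed=42)
    forest = fit_forest(X_train, y_train, n_trees=20, seed=7)
    print("test RMSE:", rmse(forest, X_test, y_test))
```

Stumps keep the sketch short; a practical forest would grow deeper trees and usually sample features as well as rows.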

Hydrogen Gas Power Plant Prediction Model
The Cal State L.A. Hydrogen Research and Fueling Facility (H2 Station) was formally opened on May 7, 2014.

Prediction Model
The station is capable of producing hydrogen
onsite from renewable energy sources, using the
process known as electrolysis.
Cal State L.A. Hydrogen Research and Fueling
Facility became the first station in the nation
to sell hydrogen fuel by the kilogram to the
public.

Prediction Model
Workflow

Prediction Model
Model

Prediction Model
Results and observations

Prediction Model
According to our research, our model is able to predict Vehicle Pressure (the pressure of hydrogen gas within the vehicle Hydrogen Storage System).
The algorithm used is decision forest regression.
Decision forests are an ensemble learning method for classification, regression, and other tasks. They operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees.
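
The way a forest combines its trees can be shown in a few lines of Python. This is a sketch of the aggregation rule only; the function names and the per-tree outputs in the demo are hypothetical, not results from the station model.

```python
from collections import Counter
from statistics import mean


def aggregate_classification(tree_votes):
    """Return the most common class among the trees' votes (the mode)."""
    return Counter(tree_votes).most_common(1)[0][0]


def aggregate_regression(tree_predictions):
    """Return the average of the trees' numeric predictions (the mean)."""
    return mean(tree_predictions)


if __name__ == "__main__":
    # Made-up outputs of three trees for a single input row.
    print(aggregate_classification(["full", "partial", "full"]))
    print(aggregate_regression([348.0, 352.0, 350.0]))
```

Averaging many trees tends to smooth out the errors of any single tree, which is why the regression variant suits a continuous target such as pressure.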

Prediction Model
State of Charge (SOC):
– Ratio of the hydrogen density within the vehicle storage system to the full-fill density. SOC is expressed as a percentage and is computed from the gas density.
Our model predicts vehicle pressure, which in turn could be used to determine the state of charge.
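
Written out from the definition above, the ratio takes the following form. The symbols are assumptions chosen for this sketch: \rho(P, T) is the hydrogen density in the vehicle storage system at pressure P and temperature T, and \rho_{\text{full}} is the full-fill density.

```latex
\mathrm{SOC} = \frac{\rho(P, T)}{\rho_{\text{full}}} \times 100\%
```

Since the gas density depends on pressure and temperature, a predicted vehicle pressure, together with a temperature reading, gives the density needed for the numerator.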

Questions?

