High Performance Information Computing Center
Jongwook Woo
CSULA
Hydrogen Gas Power Plant Data
Analysis and Prediction Using
Spark
Manvi Chandra, mchandr2@calstatela.edu
Jongwook Woo, PhD, jwoo5@calstatela.edu
High-Performance Information Computing Center (HiPIC)
California State University Los Angeles
Contents
• Myself
• Introduction to Big Data
• Machine Learning
• Spark Core
• RDD
• Spark SQL, Streaming, ML
• Hydrogen Gas Power Plant Prediction Model
Myself
Name: Manvi Chandra
Experience:
• 2012-2014: Programmer Analyst at Cognizant Technology Solutions
• 2015-2016 - Present: Master's in Information Systems
• Exposed to Big Data analytics
• Pursuing research in Big Data analytics and machine learning
• 2007-2011: Bachelor of Technology in Electronics and Communication Engineering
Introduction To Big Data
Data Issues
• Large-scale data
  – Terabytes (10^12 bytes), petabytes (10^15 bytes)
  – Because of the web
  – Sensor data (IoT), bioinformatics, social computing, streaming data, smartphones, online games…
• Cannot be handled with the legacy approach
  – Too big
  – Un-/semi-structured data
  – Too expensive
• Need new systems
  – Inexpensive
Two Cores in Big Data
• How to store Big Data
• How to compute Big Data
Google
• How to store Big Data
  – GFS
  – On inexpensive commodity computers
• How to compute Big Data
  – MapReduce
  – Parallel computing with multiple inexpensive computers
    • Its own supercomputers
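The MapReduce idea named above can be sketched on a single machine. This is a minimal illustration in plain Python, not Hadoop or Google code. The function names (`map_phase`, `shuffle`, `reduce_phase`) and the two input lines are assumptions made for the example. On a real cluster, the map and reduce calls would run in parallel on many commodity machines.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single count."""
    return key, sum(values)


# Hypothetical input; each line could live on a different machine.
lines = ["big data needs storage", "big data needs compute"]

mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
```

Each map call only needs its own line and each reduce call only needs its own key, so the work can be spread across inexpensive computers.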
What is Hadoop?
Hadoop Founder:
Doug Cutting
Chief Architect at Cloudera
Definition: Big Data
Inexpensive frameworks that can store large-scale data and process it faster in parallel
Hadoop
– An inexpensive supercomputer
– You can build and run your applications on it
Alternative to Hadoop MapReduce
• Limitations of MapReduce
  – Hard to program in Java
  – Batch processing: not interactive
  – Disk storage for intermediate data: a performance issue
• Spark, from the UC Berkeley AMPLab
  – In-memory storage for intermediate data
  – 10 to 100x faster than going through the network (N/W) and disk
Machine Learning
Subfield of computer science that evolved from
the study of pattern recognition and
computational learning theory in artificial
intelligence.
Explores pattern recognition during data analysis
through computer science and statistics.
Machine learning is a method of data analysis
that automates analytical model building. Using
algorithms that iteratively learn from data,
machine learning allows computers to find
hidden insights without being explicitly
programmed where to look.
Machine Learning Studio
Microsoft Azure Machine Learning Studio is a
collaborative, drag-and-drop tool you can use
to build, test, and deploy predictive analytics
solutions on your data.
Spark
• In-memory data computing
  – Faster than Hadoop MapReduce
• Can integrate with Hadoop and its ecosystem
  – HDFS
  – HBase, Hive, sequence files
• New programming model with faster data sharing
  – Good for complex multi-stage applications: iterative graph algorithms, machine learning
  – Interactive queries
Spark
[Figure: Spark stack]
• Spark Core: RDDs, transformations, and actions
• Spark Streaming (real-time): DStreams, streams of RDDs
• Spark SQL: SchemaRDDs, DataFrames
• MLlib (machine learning): RDD-based matrices
• GraphX (graph): RDD-based matrices
• SparkR: RDD-based matrices
Spark Drivers and Workers
• Drivers
  – Client with a SparkContext
    • Creates RDDs
• Workers
  – Spark executors
  – Run on cluster nodes: production
  – Run in local threads: development
RDD
• Resilient Distributed Dataset (RDD)
  – Distributed collections of objects that can be cached in memory
  – RDD, DStream, SchemaRDD, PairRDD
• Immutable
• Lineage
  – History of the objects
  – Automatically and efficiently recomputes lost data
RDD Operations
• Transformations
  – Define new RDDs from the current one
  – Lazy: not computed immediately
  – map(), filter(), join()
• Actions
  – Return values
  – count(), collect(), take(), saveAsTextFile()
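The difference between lazy transformations and actions can be sketched without a cluster. The `LazyDataset` class below is a hypothetical toy stand-in written in plain Python. It is not the Spark API and only mimics the behaviour described above. Transformations return a new immutable object that records its lineage, and nothing is computed until an action runs.

```python
from itertools import islice


class LazyDataset:
    """Toy stand-in for an RDD: records steps, runs them only on an action."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)  # lineage: the recorded transformations

    # Transformations: return a new dataset, compute nothing yet.
    def map(self, fn):
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    def _run(self):
        """Replay the lineage over the source as a chain of iterators."""
        items = iter(self._source)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    # Actions: trigger the computation and return values.
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())

    def take(self, n):
        return list(islice(self._run(), n))


calls = []  # records every time the mapped function really executes


def double(x):
    calls.append(x)
    return x * 2


readings = LazyDataset([1, 2, 3, 4, 5])
big = readings.map(double).filter(lambda v: v > 4)
calls_before_action = len(calls)  # still zero: only the lineage exists
result = big.collect()            # the action replays the lineage
```

Because every dataset keeps its lineage, a lost result can be rebuilt by replaying the recorded steps over the source. This is the recovery idea behind RDD lineage.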
Programming in Spark
• Scala
  – Functional programming
    • The fundamental unit of programming is the function
    • Inputs and outputs can be functions
  – No side effects
    • No state
• Python
  – Legacy, large libraries
• Java
Spark
• Spark SQL
  – Turning an RDD into a relation
  – Querying using SQL
• Spark Streaming
  – DStream: RDDs in streaming
  – Windows: select a range of a DStream from the streaming data
• MLlib
  – Sparse vector support, decision trees, linear/logistic regression, SVD and PCA
Spark
Hydrogen gas power plant Spark model
o Separating the label column.
o Creating the RDD.
o Splitting the data into training and test sets.
o Training the model on the training set using the decision forest regression algorithm.
o Evaluating the result.
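The steps above can be sketched end to end in plain Python. This is an illustration under stated assumptions, not the authors' Spark or Azure ML code. The rows are synthetic, and the column meanings (temperature, elapsed seconds, pressure) and all function names are hypothetical. The forest is simplified to bagged one-split trees (stumps) rather than full decision trees.

```python
import random
from statistics import mean


def split_label(rows, label_index):
    """Separate the label column from the feature columns."""
    features = [r[:label_index] + r[label_index + 1:] for r in rows]
    labels = [r[label_index] for r in rows]
    return features, labels


def train_test_split(features, labels, test_fraction, rng):
    """Shuffle row indices, then cut them into training and test sets."""
    idx = list(range(len(features)))
    rng.shuffle(idx)
    cut = len(idx) - int(len(idx) * test_fraction)

    def pick(ids):
        return [features[i] for i in ids], [labels[i] for i in ids]

    return pick(idx[:cut]), pick(idx[cut:])


def fit_stump(X, y):
    """Fit a one-split regression tree by minimizing squared error."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({x[j] for x in X}):
            left = [yi for x, yi in zip(X, y) if x[j] <= t]
            right = [yi for x, yi in zip(X, y) if x[j] > t]
            if not left or not right:
                continue
            ml, mr = mean(left), mean(right)
            err = (sum((v - ml) ** 2 for v in left)
                   + sum((v - mr) ** 2 for v in right))
            if best is None or err < best[0]:
                best = (err, j, t, ml, mr)
    if best is None:  # no usable split: always predict the mean
        m = mean(y)
        return (0, float("inf"), m, m)
    return best[1:]


def fit_forest(X, y, n_trees, rng):
    """Train each stump on a bootstrap sample of the training rows."""
    forest = []
    for _ in range(n_trees):
        ids = [rng.randrange(len(X)) for _ in range(len(X))]
        forest.append(fit_stump([X[i] for i in ids], [y[i] for i in ids]))
    return forest


def predict(forest, x):
    """Regression output: the mean of the individual tree predictions."""
    return mean(ml if x[j] <= t else mr for j, t, ml, mr in forest)


def rmse(forest, X, y):
    """Root mean squared error on a held-out set."""
    total = sum((predict(forest, x) - yi) ** 2 for x, yi in zip(X, y))
    return (total / len(y)) ** 0.5


# Synthetic rows: [temperature, elapsed_seconds, pressure]; pressure is the label.
rows = [[20.0 + i % 5, float(i), 5.0 + 0.7 * i] for i in range(40)]
X, y = split_label(rows, 2)
rng = random.Random(7)
(train_X, train_y), (test_X, test_y) = train_test_split(X, y, 0.2, rng)
forest = fit_forest(train_X, train_y, 10, rng)
error = rmse(forest, test_X, test_y)
```

Evaluating on rows held out from training gives an error estimate for unseen data, which is the purpose of the split step.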
[Figure: Hydrogen gas power plant Spark model]
Hydrogen Gas Power Plant
Prediction Model
The Cal State L.A. Hydrogen Research
and Fueling Facility (H2 Station) was
formally opened on May 7, 2014.
The station is capable of producing hydrogen
onsite from renewable energy sources, using the
process known as electrolysis.
Cal State L.A. Hydrogen Research and Fueling
Facility became the first station in the nation
to sell hydrogen fuel by the kilogram to the
public.
[Figure: Workflow]
[Figure: Model]
Results and observations
According to our research, we are able to predict Vehicle Pressure (the pressure of hydrogen gas within the vehicle Hydrogen Storage System) using our model.
The algorithm used is decision forest regression.
Decision forests are an ensemble learning method for classification, regression, and other tasks. They operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees.
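The aggregation rule described above fits in a few lines. This is a minimal sketch using only the Python standard library. The function names and the per-tree outputs are hypothetical values chosen for illustration, not results from the study.

```python
from statistics import mean, mode


def forest_classify(tree_votes):
    """Classification: return the mode of the classes voted by the trees."""
    return mode(tree_votes)


def forest_regress(tree_predictions):
    """Regression: return the mean of the individual tree predictions."""
    return mean(tree_predictions)


# Hypothetical outputs of individual trees for one input row.
class_votes = ["high", "low", "high", "high", "low"]
pressure_predictions = [34.0, 36.0, 35.0, 35.0]

predicted_class = forest_classify(class_votes)
predicted_pressure = forest_regress(pressure_predictions)
```

Averaging many trees smooths out the errors of any single tree, which is why the regression variant suits a continuous target such as vehicle pressure.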
STATE OF CHARGE (SOC):
– Ratio of the hydrogen density within the vehicle storage system to the full-fill density. SOC is expressed as a percentage and is computed from the gas density:
  SOC (%) = (hydrogen density in the vehicle storage system / full-fill density) × 100
Our model predicts vehicle pressure, which in turn could be used to determine the state of charge.
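The SOC definition above can be written as a small function. This is a sketch of the ratio only. The function name and the density values are hypothetical. Converting a predicted pressure and a temperature into a gas density requires a hydrogen equation of state, which the slides do not give and which is not attempted here.

```python
def state_of_charge(density, full_fill_density):
    """Return SOC in percent: the density ratio described in the text."""
    if full_fill_density <= 0:
        raise ValueError("full-fill density must be positive")
    return 100.0 * density / full_fill_density


# Hypothetical densities in the same unit (for example g/L).
soc = state_of_charge(density=12.0, full_fill_density=24.0)
```

Both densities must use the same unit. The result is a dimensionless percentage.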
Questions?

Big Data Analysis in Hydrogen Station using Spark and Azure ML

  • 1. High Performance Information Computing Center, Jongwook Woo, CSULA. Hydrogen Gas Power Plant Data Analysis and Prediction Using Spark. Manvi Chandra, mchandr2@calstatela.edu; Jongwook Woo, PhD, jwoo5@calstatela.edu. High-Performance Information Computing Center (HiPIC), California State University Los Angeles.
  • 2. Contents: Myself; Introduction to Big Data; Machine Learning; Spark Cores; RDD; Spark SQL, Streaming, ML; Hydrogen Gas Power Plant Prediction Model.
  • 3. Myself. Name: Manvi Chandra. Experience: 2012-2014, Programmer Analyst at Cognizant Technology Solutions; 2015-2016 to present, Master's in Information Systems, exposed to Big Data analytics and pursuing research in Big Data analytics and machine learning; 2007-2011, Bachelor of Technology in Electronics and Communication Engineering.
  • 4. Contents: Myself; Introduction to Big Data; Machine Learning; Spark Cores; RDD; Spark SQL, Streaming, ML; Hydrogen Gas Power Plant Prediction Model.
  • 5. Introduction to Big Data
  • 6. Data Issues. Large-scale data: terabytes (10^12 bytes) and petabytes (10^15 bytes), driven by the web, sensor data (IoT), bioinformatics, social computing, streaming data, smartphones, online games… The legacy approach cannot handle it: the data is too big, un-/semi-structured, and too expensive to process. New, inexpensive systems are needed.
  • 7. Two Cores in Big Data: how to store Big Data and how to compute Big Data. Google's approach: storage with GFS, on inexpensive commodity computers; computation with MapReduce, parallel computing across multiple inexpensive computers (its own supercomputers).
  • 8. What is Hadoop? Hadoop founder: Doug Cutting, Chief Architect at Cloudera.
  • 9. Definition of Big Data: inexpensive frameworks that can store large-scale data and process it faster in parallel. Hadoop: an inexpensive supercomputer on which you can build and run your applications.
  • 10. Alternative to Hadoop MapReduce. Limitations of MapReduce: hard to program in Java; batch processing, not interactive; disk storage for intermediate data, which is a performance issue. Spark, by UC Berkeley AMPLab: in-memory storage for intermediate data, 10 to 100x faster than network and disk.
  • 11. Contents: Myself; Introduction to Big Data; Machine Learning; Spark Cores; RDD; Spark SQL, Streaming, ML; Hydrogen Gas Power Plant Prediction Model.
  • 12. Machine Learning. A subfield of computer science that evolved from the study of pattern recognition and computational learning theory in artificial intelligence. It explores pattern recognition during data analysis through computer science and statistics. Machine learning is a method of data analysis that automates analytical model building: using algorithms that iteratively learn from data, it allows computers to find hidden insights without being explicitly programmed where to look.
  • 13. Machine Learning Studio. Microsoft Azure Machine Learning Studio is a collaborative, drag-and-drop tool you can use to build, test, and deploy predictive analytics solutions on your data.
  • 14. Contents: Myself; Introduction to Big Data; Machine Learning; Spark Cores; RDD; Spark SQL, Streaming, ML; Hydrogen Gas Power Plant Prediction Model.
  • 15. Spark. In-memory data computing: faster than Hadoop MapReduce. Can integrate with Hadoop and its ecosystem: HDFS, HBase, Hive, sequence files. New programming model with faster data sharing: good for complex multi-stage applications (iterative graph algorithms, machine learning) and interactive queries.
  • 16. Spark stack. Spark Core: RDDs, transformations, and actions. Spark Streaming (real-time): DStreams, streams of RDDs. Spark SQL: SchemaRDDs, DataFrames. MLlib (machine learning): RDD-based matrices. GraphX (graph): RDD-based matrices. SparkR: RDD-based matrices.
  • 17. Spark Drivers and Workers. Drivers: the client, with a SparkContext, creates RDDs. Workers: Spark executors, which run on cluster nodes in production and in local threads in development.
  • 18. Contents: Myself; Introduction to Big Data; Machine Learning; Spark Cores; RDD; Spark SQL, Streaming, ML; Hydrogen Gas Power Plant Prediction Model.
  • 19. RDD. Resilient Distributed Dataset (RDD): a distributed collection of objects that can be cached in memory. Variants: RDD, DStream, SchemaRDD, PairRDD. Immutable. Lineage: the history of the objects, used to recompute lost data automatically and efficiently.
  • 20. RDD Operations. Transformations define new RDDs from the current one and are lazy (not computed immediately): map(), filter(), join(). Actions return values: count(), collect(), take(), save().
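The lazy-transformation behaviour described on the RDD Operations slide can be sketched in plain Python. This is a toy stand-in, not the Spark API: the class name `LazyDataset` and its internals are assumptions made for illustration. Transformations only record a step in the lineage and return a new immutable dataset; actions replay that lineage and return values.

```python
from itertools import islice


class LazyDataset:
    """Toy, single-machine stand-in for an RDD (hypothetical, not Spark)."""

    def __init__(self, source, steps=()):
        self._source = list(source)  # base data
        self._steps = tuple(steps)   # recorded transformations (the lineage)

    # Transformations: return a new dataset, compute nothing yet.
    def map(self, fn):
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        return LazyDataset(self._source, self._steps + (("filter", predicate),))

    def _run(self):
        # Replay the lineage over the base data, one element at a time.
        for item in self._source:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                yield item

    # Actions: trigger the computation and return values.
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())

    def take(self, n):
        return list(islice(self._run(), n))


seen = []


def double(x):
    seen.append(x)  # record when the function actually runs
    return x * 2


doubled_big = LazyDataset(range(1, 6)).map(double).filter(lambda x: x > 4)
assert seen == []             # the transformations alone computed nothing
print(doubled_big.collect())  # [6, 8, 10]
```

Because every step is kept in the lineage, a lost result can be rebuilt by replaying the steps over the base data, which is the idea behind the recomputation mentioned on the RDD slide.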
  • 21. Programming in Spark. Scala: functional programming, in which the fundamental unit of programming is the function (inputs and outputs are functions), with no side effects and no state. Python: legacy, large libraries. Java.
  • 22. Contents: Myself; Introduction to Big Data; Machine Learning; Spark Cores; RDD; Spark SQL, Streaming, ML; Hydrogen Gas Power Plant Prediction Model.
  • 23. Spark. Spark SQL: turning an RDD into a relation and querying it using SQL. Spark Streaming: DStream, an RDD in streaming; windows, to select a DStream from streaming data. MLlib: sparse vector support, decision trees, linear/logistic regression, SVD and PCA.
  • 24. Spark. Hydrogen gas power plant Spark model: (1) separate the label column; (2) create the RDD; (3) split the data into training and test sets; (4) train on the dataset using the decision forest regression algorithm; (5) evaluate the result.
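The label separation, train/test split, and evaluation steps listed on this slide can be sketched without Spark. Everything here is an assumption for illustration: the function names, the 70/30 split, the synthetic rows, and RMSE as the metric are not taken from the slides, and a mean-value baseline stands in for the decision forest regressor so the sketch stays dependency-free.

```python
import math
import random


def separate_label(rows, label_index):
    """Split each row into its feature values and its label value."""
    features = [row[:label_index] + row[label_index + 1:] for row in rows]
    labels = [row[label_index] for row in rows]
    return features, labels


def train_test_split(features, labels, test_fraction=0.3, seed=42):
    """Shuffle the row indices reproducibly and hold out a test fraction."""
    indices = list(range(len(labels)))
    random.Random(seed).shuffle(indices)
    n_test = int(len(indices) * test_fraction)

    def pick(selected):
        return [features[i] for i in selected], [labels[i] for i in selected]

    return pick(indices[n_test:]), pick(indices[:n_test])


def fit_mean_baseline(train_labels):
    """Stand-in model (assumption): always predict the training mean."""
    average = sum(train_labels) / len(train_labels)
    return lambda feature_row: average


def rmse(predictions, actuals):
    """Root mean squared error between predictions and true labels."""
    squared = sum((p - a) ** 2 for p, a in zip(predictions, actuals))
    return math.sqrt(squared / len(actuals))


# Synthetic rows (made up): two feature columns and a label in column 2.
rows = [[float(i), float(i % 3), 2.0 * i] for i in range(10)]
features, labels = separate_label(rows, label_index=2)
(train_x, train_y), (test_x, test_y) = train_test_split(features, labels)
model = fit_mean_baseline(train_y)
print("test RMSE:", rmse([model(x) for x in test_x], test_y))
```

Holding out the test rows before training keeps the evaluation honest: the error is measured on data the model never saw.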
  • 25. Spark. Hydrogen gas power plant Spark model.
  • 26. Contents: Myself; Introduction to Big Data; Machine Learning; Spark Cores; RDD; Spark SQL, Streaming, ML; Hydrogen Gas Power Plant Prediction Model.
  • 27. Hydrogen Gas Power Plant Prediction Model. The Cal State L.A. Hydrogen Research and Fueling Facility (H2 Station) was formally opened on May 7, 2014.
  • 28. Hydrogen Gas Power Plant Prediction Model. The station is capable of producing hydrogen onsite from renewable energy sources, using the process known as electrolysis. The Cal State L.A. Hydrogen Research and Fueling Facility became the first station in the nation to sell hydrogen fuel by the kilogram to the public.
  • 29. Hydrogen Gas Power Plant Prediction Model: Workflow
  • 30. Hydrogen Gas Power Plant Prediction Model: Model
  • 31. Hydrogen Gas Power Plant Prediction Model: Results and observations
  • 32. Hydrogen Gas Power Plant Prediction Model: Results and observations. According to our research, we are able to predict vehicle pressure (the pressure of hydrogen gas within the vehicle hydrogen storage system) using our model. The algorithm used is decision forest regression. Decision forests are an ensemble learning method for classification, regression, and other tasks; they operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees.
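The regression case of the ensemble idea above, where the forest outputs the mean prediction of its individual trees, can be sketched with a deliberately tiny forest. This is a simplification under stated assumptions, not the model from the talk: each "tree" is a depth-1 stump on a single feature, trained on a bootstrap sample, and the names `fit_stump`, `fit_forest`, and `predict` are hypothetical.

```python
import random
from statistics import mean


def fit_stump(xs, ys):
    """Fit a depth-1 regression tree on one feature.

    Chooses the threshold that minimises the summed squared error of
    predicting the mean label on each side of the split.
    """
    best = None
    for threshold in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < threshold]
        right = [y for x, y in zip(xs, ys) if x >= threshold]
        left_mean, right_mean = mean(left), mean(right)
        error = (sum((y - left_mean) ** 2 for y in left)
                 + sum((y - right_mean) ** 2 for y in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    if best is None:  # every x is identical: no split is possible
        overall = mean(ys)
        return lambda x: overall
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x < threshold else right_mean


def fit_forest(xs, ys, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample (rows drawn with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    trees = []
    for _ in range(n_trees):
        sample = [rng.randrange(n) for _ in range(n)]
        trees.append(fit_stump([xs[i] for i in sample],
                               [ys[i] for i in sample]))
    return trees


def predict(trees, x):
    """Regression forest output: the mean of the individual tree predictions."""
    return mean(tree(x) for tree in trees)


# Synthetic step-shaped data (made up): label 1.0 below x = 5, 3.0 above.
xs = [float(i) for i in range(10)]
ys = [1.0 if x < 5 else 3.0 for x in xs]
forest = fit_forest(xs, ys)
print(predict(forest, 0.0), predict(forest, 9.0))
```

Averaging over trees trained on different bootstrap samples smooths out the errors of any single tree, which is the motivation for using a forest rather than one decision tree.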
  • 33. Hydrogen Gas Power Plant Prediction Model: Results and observations. State of charge (SOC): the ratio of hydrogen density within the vehicle storage system to the full-fill density. SOC is expressed as a percentage and is computed from the gas density: SOC (%) = (hydrogen density in the storage system / full-fill density) × 100. Our model predicts vehicle pressure, which in turn could be used to determine the state of charge.
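The path from a predicted pressure to a state of charge can be sketched as follows. The SOC ratio follows the definition on the slide; the density step is an assumption added here, using the gas law with a compressibility factor `z` (z = 1.0 is the ideal-gas approximation, which is inaccurate at high storage pressures). The function names and every numeric input below are illustrative placeholders, not values from the talk.

```python
GAS_CONSTANT = 8.314462618       # J/(mol*K)
HYDROGEN_MOLAR_MASS = 2.016e-3   # kg/mol for H2


def hydrogen_density(pressure_pa, temperature_k, z=1.0):
    """Gas density in kg/m^3 from rho = P * M / (z * R * T).

    z is the compressibility factor; the default of 1.0 (ideal gas) is an
    assumption and should be replaced by a real-gas value in practice.
    """
    return (pressure_pa * HYDROGEN_MOLAR_MASS
            / (z * GAS_CONSTANT * temperature_k))


def state_of_charge(density, full_fill_density):
    """SOC (%) = density in the storage system / full-fill density * 100."""
    if full_fill_density <= 0:
        raise ValueError("full_fill_density must be positive")
    return 100.0 * density / full_fill_density


# Placeholder inputs (assumptions): a predicted vehicle pressure, a tank
# temperature, a compressibility factor, and a full-fill density.
predicted_pressure_pa = 20.0e6
tank_temperature_k = 300.0
density = hydrogen_density(predicted_pressure_pa, tank_temperature_k, z=1.1)
print("SOC (%):", state_of_charge(density, full_fill_density=24.0))
```

Keeping the density calculation separate from the ratio makes it easy to swap in a more accurate equation of state without touching the SOC definition.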
  • 34. Questions?
  • 35. References. Hadoop (http://hadoop.apache.org); Apache Spark Word Count Example (http://spark.apache.org); Databricks (http://www.databricks.com).