Big Data tools in practice
Darko Marjanović, darko@thingsolver.com
Miloš Milovanović, milos@thingsolver.com
Agenda
• Hadoop
• Spark
• Python
Hadoop
• Pros
  • Linear scalability.
  • Commodity hardware.
  • Pricing and licensing.
  • Any data type.
  • Analytical queries.
  • Integration with traditional systems.
• Cons
  • Complex implementation.
  • MapReduce is hard to use.
  • Poor fit for intensive calculations on small data.
  • No in-memory processing.
  • Not suited to real-time analytics.
The Apache Hadoop software library is a framework that allows the
distributed processing of large data sets across clusters of computers using
simple programming models.
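The "simple programming model" at the core of Hadoop is MapReduce. As a rough illustration only, the three phases can be sketched in plain Python on a single machine. The function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical and are not part of any Hadoop API; a real job would distribute the map and reduce tasks across the cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    # In Hadoop the map and reduce tasks run in parallel on different nodes;
    # in this sketch they run sequentially in one process.
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["big data tools", "big data in practice"])
print(counts)
```

The programmer supplies only the map and reduce functions; the grouping step in between is what the framework handles.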
Hadoop
• Hadoop Common
• HDFS
• MapReduce
• YARN
Hadoop HDFS
Apache Spark
• Pros
  • Up to 100x faster than MapReduce (claimed for in-memory workloads).
  • Ease of use.
  • Streaming, MLlib, GraphX and SQL.
  • Pricing and licensing.
  • In-memory processing.
  • Integration with Hadoop.
• Cons
  • Integration with traditional systems.
  • Limited memory per machine (GC overhead).
  • Configuration.
Apache Spark is a fast and general engine for big data processing, with
built-in modules for streaming, SQL, machine learning and graph
processing.
Spark
Spark stack
Resilient Distributed Datasets
A distributed memory abstraction that allows programmers to perform in-memory computations
on large clusters while retaining the fault tolerance of data flow models like MapReduce.*
• Immutability
• Lineage (reconstruct lost partitions)
• Fault tolerance through logging the updates made to a dataset (a single operation applied to many records)
• Creation:
  • Reading a dataset from storage (HDFS or any other)
  • From other RDDs
*Technical Report No. UCB/EECS-2011-82, available at: http://www.eecs.berkeley.edu/Pubs/TechRpts/2011/EECS-2011-82.html
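To make immutability and lineage more concrete, here is a toy model in plain Python. ToyRDD is a hypothetical class written for this illustration and is not Spark's implementation: real RDDs are partitioned across machines and track lineage per partition. The sketch only shows the idea that a dataset can be rebuilt by replaying the recorded operations instead of restoring a stored copy.

```python
class ToyRDD:
    """Hypothetical stand-in for an RDD: immutable, and it remembers its lineage."""

    def __init__(self, source, lineage=()):
        self._source = tuple(source)    # base data, never modified
        self._lineage = tuple(lineage)  # ordered (kind, function) steps

    def map(self, f):
        # Returns a NEW ToyRDD; the current one is left untouched.
        return ToyRDD(self._source, self._lineage + (("map", f),))

    def filter(self, f):
        return ToyRDD(self._source, self._lineage + (("filter", f),))

    def compute(self):
        """Replay the lineage against the source data."""
        data = list(self._source)
        for kind, f in self._lineage:
            if kind == "map":
                data = [f(x) for x in data]
            else:
                data = [x for x in data if f(x)]
        return data


base = ToyRDD(range(1, 7))
doubled = base.map(lambda x: x * 2)
large = doubled.filter(lambda x: x > 6)

first = large.compute()
# Simulate losing the computed result: no copy was stored, so recovery
# simply replays the same lineage against the source.
recovered = large.compute()
print(first, recovered, base.compute())
```

Because each step is one coarse-grained operation applied to many records, the log of steps stays small even when the dataset is large.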
RDD operations
• Transformations
  • Lazily evaluated (executed only when an action is called)
  • Reduces wait states
  • Better pipelining
  • Examples: map(f : T ⇒ U), filter(f : T ⇒ Bool), groupByKey(), join()
• Actions
  • Run immediately
  • Return a value to the application or export data to a storage system
  • Examples: count(), collect(), reduce(f : (T, T) ⇒ T), save(path: String)
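The difference between lazy transformations and immediate actions can be illustrated by analogy with Python's built-in iterators. This is not Spark code: map and filter here are the Python builtins, and traced_square is a hypothetical helper that records when it is called. The point is only that building the pipeline does no work until something consumes it.

```python
calls = []


def traced_square(x):
    """Square x and record the call so we can see when work happens."""
    calls.append(x)
    return x * x


numbers = range(1, 6)

# "Transformations": these only describe the pipeline, nothing is computed yet.
squares = map(traced_square, numbers)
evens = filter(lambda v: v % 2 == 0, squares)
calls_before_action = list(calls)

# "Action": consuming the pipeline triggers the computation. Each element
# flows through map and filter in one pass, with no intermediate list.
total = sum(evens)
print(calls_before_action, calls, total)
```

Spark applies the same idea at cluster scale: because it sees the whole chain of transformations before running anything, it can pipeline them within a single stage.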
Spark program lifecycle
Create RDD (from external data or by parallelizing a collection)
Transformation (lazily evaluated)
Cache RDD (for reuse)
Action (executes the computation and returns results)
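A minimal PySpark sketch of this lifecycle follows. It assumes PySpark is installed and that the caller passes in an existing SparkContext as sc; the function and helper names (run_lifecycle, square, is_even) are hypothetical and chosen for this illustration.

```python
def square(x):
    return x * x


def is_even(x):
    return x % 2 == 0


def run_lifecycle(sc):
    """Walk through create, transform, cache, action on an existing SparkContext."""
    # Create an RDD by parallelizing a local collection.
    numbers = sc.parallelize(range(1, 11))
    # Transformations are lazy: nothing is computed at this point.
    even_squares = numbers.map(square).filter(is_even)
    # Mark the RDD for caching so the second action can reuse it.
    even_squares.cache()
    # Actions trigger the computation and return results to the driver.
    return even_squares.count(), even_squares.collect()
```

With a local context, for example sc = SparkContext("local[2]", "lifecycle-demo") after from pyspark import SparkContext, calling run_lifecycle(sc) runs the pipeline once for count() and serves collect() from the cache.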
Spark in a cluster mode
* http://spark.apache.org/docs/latest/img/cluster-overview.png
PySpark
• Python API for Spark
• Easy-to-use programming abstraction and parallel runtime:
• “Here’s an operation, run it on all of the data”
• Dynamically typed (RDDs can hold objects of multiple types)
• Integrates with other Python libraries, such as NumPy, pandas, scikit-learn, Flask
• Run Spark from Jupyter notebooks
Spark Dataframes
DataFrames are a common data science abstraction that spans languages.
A data frame is a table, or two-dimensional array-like structure, in which each column
contains measurements on one variable, and each row contains one case.
A Spark DataFrame is a distributed collection of data organized into named columns. It
can be created:
• from structured data files
• from Hive tables
• from external databases
• from RDDs
Some supported operations:
• slice data
• sort data
• aggregate data
• join with other DataFrames
DataFrame benefits
• Lazy evaluation
• Domain-specific language for distributed data manipulation
• Automatic parallelization and cluster distribution
• Integration with the pipeline API for MLlib
• Query structured data with SQL (using SQLContext)
• Integration with pandas DataFrames (and other Python data libraries)
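# Assumes sc is an existing SparkContext (for example, the one the pyspark shell provides).
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
# Load a JSON file into a DataFrame (the schema is inferred).
df = sqlContext.read.json("data.json")
df.show()
# Select one column, then keep only rows with id > 10.
df.select("id").show()
df.filter(df["id"] > 10).show()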
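from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.read.json("data.json")
# Register the DataFrame as a temporary table so it can be queried with SQL.
# (registerTempTable is the Spark 1.x name; Spark 2.0+ uses createOrReplaceTempView.)
df.registerTempTable("data")
results = sqlContext.sql("SELECT * FROM data WHERE id > 10")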
Pandas DF vs Spark DF
Pandas DataFrame | Spark DataFrame
Single-machine tool (all data needs to fit in memory, except with HDF5) | Distributed (data > memory)
Better API | Good API
No parallelism | Parallel by default
Mutable | Immutable
Some function differences – reading data, counting, displaying, inferring types, statistics, creating new columns
(https://medium.com/@chris_bour/6-differences-between-pandas-and-spark-dataframes-1380cec394d2 )
A very popular benchmark
* https://databricks.com/wp-content/uploads/2015/02/Screen-Shot-2015-02-16-at-9.46.39-AM-1024x457.png
Big Data tools in practice
Darko Marjanović, darko@thingsolver.com
Miloš Milovanović, milos@thingsolver.com

Mais conteúdo relacionado

Mais procurados

Big Data Day LA 2015 - The AWS Big Data Platform by Michael Limcaco of Amazon
Big Data Day LA 2015 - The AWS Big Data Platform by Michael Limcaco of AmazonBig Data Day LA 2015 - The AWS Big Data Platform by Michael Limcaco of Amazon
Big Data Day LA 2015 - The AWS Big Data Platform by Michael Limcaco of AmazonData Con LA
 
Tomer Shiran, MapR_Hadoop&SQL
Tomer Shiran, MapR_Hadoop&SQLTomer Shiran, MapR_Hadoop&SQL
Tomer Shiran, MapR_Hadoop&SQLThe Hive
 
Uber's data science workbench
Uber's data science workbenchUber's data science workbench
Uber's data science workbenchRan Wei
 
Data Science Languages and Industry Analytics
Data Science Languages and Industry AnalyticsData Science Languages and Industry Analytics
Data Science Languages and Industry AnalyticsWes McKinney
 
On the move with Big Data (Hadoop, Pig, Sqoop, SSIS...)
On the move with Big Data (Hadoop, Pig, Sqoop, SSIS...)On the move with Big Data (Hadoop, Pig, Sqoop, SSIS...)
On the move with Big Data (Hadoop, Pig, Sqoop, SSIS...)Stéphane Fréchette
 
Data infrastructure architecture for medium size organization: tips for colle...
Data infrastructure architecture for medium size organization: tips for colle...Data infrastructure architecture for medium size organization: tips for colle...
Data infrastructure architecture for medium size organization: tips for colle...DataWorks Summit/Hadoop Summit
 
Virtualizing Analytics with Apache Spark: Keynote by Arsalan Tavakoli
Virtualizing Analytics with Apache Spark: Keynote by Arsalan Tavakoli Virtualizing Analytics with Apache Spark: Keynote by Arsalan Tavakoli
Virtualizing Analytics with Apache Spark: Keynote by Arsalan Tavakoli Spark Summit
 
Productive Data Tools for Quants
Productive Data Tools for QuantsProductive Data Tools for Quants
Productive Data Tools for QuantsWes McKinney
 
Database Camp 2016 @ United Nations, NYC - Michael Glukhovsky, Co-Founder, Re...
Database Camp 2016 @ United Nations, NYC - Michael Glukhovsky, Co-Founder, Re...Database Camp 2016 @ United Nations, NYC - Michael Glukhovsky, Co-Founder, Re...
Database Camp 2016 @ United Nations, NYC - Michael Glukhovsky, Co-Founder, Re...✔ Eric David Benari, PMP
 
Hadoop data access layer v4.0
Hadoop data access layer v4.0Hadoop data access layer v4.0
Hadoop data access layer v4.0SpringPeople
 
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"Wes McKinney
 
Atlanta MLConf
Atlanta MLConfAtlanta MLConf
Atlanta MLConfQubole
 
Spark Summit EU 2015: Revolutionizing Big Data in the Enterprise with Spark
Spark Summit EU 2015: Revolutionizing Big Data in the Enterprise with SparkSpark Summit EU 2015: Revolutionizing Big Data in the Enterprise with Spark
Spark Summit EU 2015: Revolutionizing Big Data in the Enterprise with SparkDatabricks
 
Big Data Computing Architecture
Big Data Computing ArchitectureBig Data Computing Architecture
Big Data Computing ArchitectureGang Tao
 
Database Camp 2016 @ United Nations, NYC - Brad Bebee, CEO, Blazegraph
Database Camp 2016 @ United Nations, NYC - Brad Bebee, CEO, BlazegraphDatabase Camp 2016 @ United Nations, NYC - Brad Bebee, CEO, Blazegraph
Database Camp 2016 @ United Nations, NYC - Brad Bebee, CEO, Blazegraph✔ Eric David Benari, PMP
 
Using Hadoop to build a Data Quality Service for both real-time and batch data
Using Hadoop to build a Data Quality Service for both real-time and batch dataUsing Hadoop to build a Data Quality Service for both real-time and batch data
Using Hadoop to build a Data Quality Service for both real-time and batch dataDataWorks Summit/Hadoop Summit
 
AI from your data lake: Using Solr for analytics
AI from your data lake: Using Solr for analyticsAI from your data lake: Using Solr for analytics
AI from your data lake: Using Solr for analyticsDataWorks Summit
 

Mais procurados (20)

Big Data Day LA 2015 - The AWS Big Data Platform by Michael Limcaco of Amazon
Big Data Day LA 2015 - The AWS Big Data Platform by Michael Limcaco of AmazonBig Data Day LA 2015 - The AWS Big Data Platform by Michael Limcaco of Amazon
Big Data Day LA 2015 - The AWS Big Data Platform by Michael Limcaco of Amazon
 
Tomer Shiran, MapR_Hadoop&SQL
Tomer Shiran, MapR_Hadoop&SQLTomer Shiran, MapR_Hadoop&SQL
Tomer Shiran, MapR_Hadoop&SQL
 
Intro to Big Data - Spark
Intro to Big Data - SparkIntro to Big Data - Spark
Intro to Big Data - Spark
 
Uber's data science workbench
Uber's data science workbenchUber's data science workbench
Uber's data science workbench
 
Introduction to Dremio
Introduction to DremioIntroduction to Dremio
Introduction to Dremio
 
Data Science Languages and Industry Analytics
Data Science Languages and Industry AnalyticsData Science Languages and Industry Analytics
Data Science Languages and Industry Analytics
 
On the move with Big Data (Hadoop, Pig, Sqoop, SSIS...)
On the move with Big Data (Hadoop, Pig, Sqoop, SSIS...)On the move with Big Data (Hadoop, Pig, Sqoop, SSIS...)
On the move with Big Data (Hadoop, Pig, Sqoop, SSIS...)
 
Data infrastructure architecture for medium size organization: tips for colle...
Data infrastructure architecture for medium size organization: tips for colle...Data infrastructure architecture for medium size organization: tips for colle...
Data infrastructure architecture for medium size organization: tips for colle...
 
Virtualizing Analytics with Apache Spark: Keynote by Arsalan Tavakoli
Virtualizing Analytics with Apache Spark: Keynote by Arsalan Tavakoli Virtualizing Analytics with Apache Spark: Keynote by Arsalan Tavakoli
Virtualizing Analytics with Apache Spark: Keynote by Arsalan Tavakoli
 
Productive Data Tools for Quants
Productive Data Tools for QuantsProductive Data Tools for Quants
Productive Data Tools for Quants
 
Database Camp 2016 @ United Nations, NYC - Michael Glukhovsky, Co-Founder, Re...
Database Camp 2016 @ United Nations, NYC - Michael Glukhovsky, Co-Founder, Re...Database Camp 2016 @ United Nations, NYC - Michael Glukhovsky, Co-Founder, Re...
Database Camp 2016 @ United Nations, NYC - Michael Glukhovsky, Co-Founder, Re...
 
Architecting Your First Big Data Implementation
Architecting Your First Big Data ImplementationArchitecting Your First Big Data Implementation
Architecting Your First Big Data Implementation
 
Hadoop data access layer v4.0
Hadoop data access layer v4.0Hadoop data access layer v4.0
Hadoop data access layer v4.0
 
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
 
Atlanta MLConf
Atlanta MLConfAtlanta MLConf
Atlanta MLConf
 
Spark Summit EU 2015: Revolutionizing Big Data in the Enterprise with Spark
Spark Summit EU 2015: Revolutionizing Big Data in the Enterprise with SparkSpark Summit EU 2015: Revolutionizing Big Data in the Enterprise with Spark
Spark Summit EU 2015: Revolutionizing Big Data in the Enterprise with Spark
 
Big Data Computing Architecture
Big Data Computing ArchitectureBig Data Computing Architecture
Big Data Computing Architecture
 
Database Camp 2016 @ United Nations, NYC - Brad Bebee, CEO, Blazegraph
Database Camp 2016 @ United Nations, NYC - Brad Bebee, CEO, BlazegraphDatabase Camp 2016 @ United Nations, NYC - Brad Bebee, CEO, Blazegraph
Database Camp 2016 @ United Nations, NYC - Brad Bebee, CEO, Blazegraph
 
Using Hadoop to build a Data Quality Service for both real-time and batch data
Using Hadoop to build a Data Quality Service for both real-time and batch dataUsing Hadoop to build a Data Quality Service for both real-time and batch data
Using Hadoop to build a Data Quality Service for both real-time and batch data
 
AI from your data lake: Using Solr for analytics
AI from your data lake: Using Solr for analyticsAI from your data lake: Using Solr for analytics
AI from your data lake: Using Solr for analytics
 

Semelhante a Big Data tools in practice

Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark FundamentalsZahra Eskandari
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekVenkata Naga Ravi
 
Introduction To Hadoop Ecosystem
Introduction To Hadoop EcosystemIntroduction To Hadoop Ecosystem
Introduction To Hadoop EcosystemInSemble
 
Unit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptxUnit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptxRahul Borate
 
Introduction to apache spark
Introduction to apache sparkIntroduction to apache spark
Introduction to apache sparkUserReport
 
Apache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code WorkshopApache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code WorkshopAmanda Casari
 
Big_data_analytics_NoSql_Module-4_Session
Big_data_analytics_NoSql_Module-4_SessionBig_data_analytics_NoSql_Module-4_Session
Big_data_analytics_NoSql_Module-4_SessionRUHULAMINHAZARIKA
 
Advanced Analytics and Big Data (August 2014)
Advanced Analytics and Big Data (August 2014)Advanced Analytics and Big Data (August 2014)
Advanced Analytics and Big Data (August 2014)Thomas W. Dinsmore
 
Liferay & Big Data Dev Con 2014
Liferay & Big Data Dev Con 2014Liferay & Big Data Dev Con 2014
Liferay & Big Data Dev Con 2014Miguel Pastor
 
Programming in Spark using PySpark
Programming in Spark using PySpark      Programming in Spark using PySpark
Programming in Spark using PySpark Mostafa
 
How to use Big Data and Data Lake concept in business using Hadoop and Spark...
 How to use Big Data and Data Lake concept in business using Hadoop and Spark... How to use Big Data and Data Lake concept in business using Hadoop and Spark...
How to use Big Data and Data Lake concept in business using Hadoop and Spark...Institute of Contemporary Sciences
 
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和SparkEtu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和SparkJames Chen
 
Big Data Warsaw v 4 I "The Role of Hadoop Ecosystem in Advance Analytics" - R...
Big Data Warsaw v 4 I "The Role of Hadoop Ecosystem in Advance Analytics" - R...Big Data Warsaw v 4 I "The Role of Hadoop Ecosystem in Advance Analytics" - R...
Big Data Warsaw v 4 I "The Role of Hadoop Ecosystem in Advance Analytics" - R...Dataconomy Media
 
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...DB Tsai
 
Apache Spark for Beginners
Apache Spark for BeginnersApache Spark for Beginners
Apache Spark for BeginnersAnirudh
 

Semelhante a Big Data tools in practice (20)

Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark Fundamentals
 
APACHE SPARK.pptx
APACHE SPARK.pptxAPACHE SPARK.pptx
APACHE SPARK.pptx
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
 
Introduction To Hadoop Ecosystem
Introduction To Hadoop EcosystemIntroduction To Hadoop Ecosystem
Introduction To Hadoop Ecosystem
 
Apache Spark in Industry
Apache Spark in IndustryApache Spark in Industry
Apache Spark in Industry
 
Unit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptxUnit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptx
 
Introduction to apache spark
Introduction to apache sparkIntroduction to apache spark
Introduction to apache spark
 
Apache Spark on HDinsight Training
Apache Spark on HDinsight TrainingApache Spark on HDinsight Training
Apache Spark on HDinsight Training
 
Apache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code WorkshopApache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code Workshop
 
Apache Spark Core
Apache Spark CoreApache Spark Core
Apache Spark Core
 
Big_data_analytics_NoSql_Module-4_Session
Big_data_analytics_NoSql_Module-4_SessionBig_data_analytics_NoSql_Module-4_Session
Big_data_analytics_NoSql_Module-4_Session
 
SQL on Hadoop
SQL on HadoopSQL on Hadoop
SQL on Hadoop
 
Advanced Analytics and Big Data (August 2014)
Advanced Analytics and Big Data (August 2014)Advanced Analytics and Big Data (August 2014)
Advanced Analytics and Big Data (August 2014)
 
Liferay & Big Data Dev Con 2014
Liferay & Big Data Dev Con 2014Liferay & Big Data Dev Con 2014
Liferay & Big Data Dev Con 2014
 
Programming in Spark using PySpark
Programming in Spark using PySpark      Programming in Spark using PySpark
Programming in Spark using PySpark
 
How to use Big Data and Data Lake concept in business using Hadoop and Spark...
 How to use Big Data and Data Lake concept in business using Hadoop and Spark... How to use Big Data and Data Lake concept in business using Hadoop and Spark...
How to use Big Data and Data Lake concept in business using Hadoop and Spark...
 
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和SparkEtu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
 
Big Data Warsaw v 4 I "The Role of Hadoop Ecosystem in Advance Analytics" - R...
Big Data Warsaw v 4 I "The Role of Hadoop Ecosystem in Advance Analytics" - R...Big Data Warsaw v 4 I "The Role of Hadoop Ecosystem in Advance Analytics" - R...
Big Data Warsaw v 4 I "The Role of Hadoop Ecosystem in Advance Analytics" - R...
 
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...
 
Apache Spark for Beginners
Apache Spark for BeginnersApache Spark for Beginners
Apache Spark for Beginners
 

Mais de Darko Marjanovic

Hadoop ekosistem u praksi - socijalne mreže, unapređenje prodaje i servisa
Hadoop ekosistem u praksi - socijalne mreže, unapređenje prodaje i servisaHadoop ekosistem u praksi - socijalne mreže, unapređenje prodaje i servisa
Hadoop ekosistem u praksi - socijalne mreže, unapređenje prodaje i servisaDarko Marjanovic
 
Big Data: Apache Spark -novo pojačanje tradicionalnom BI ili ne?
Big Data: Apache Spark -novo pojačanje tradicionalnom BI ili ne?Big Data: Apache Spark -novo pojačanje tradicionalnom BI ili ne?
Big Data: Apache Spark -novo pojačanje tradicionalnom BI ili ne?Darko Marjanovic
 
Data Science Conference Belgrade
Data Science Conference BelgradeData Science Conference Belgrade
Data Science Conference BelgradeDarko Marjanovic
 
Big data i arkitektura big data aplikacije meetup
Big data i arkitektura big data aplikacije meetupBig data i arkitektura big data aplikacije meetup
Big data i arkitektura big data aplikacije meetupDarko Marjanovic
 
Big data apache spark zamena za hadoop ili ne?
Big data   apache spark zamena za hadoop ili ne?Big data   apache spark zamena za hadoop ili ne?
Big data apache spark zamena za hadoop ili ne?Darko Marjanovic
 
Arhitektura big data aplikacije (tarabica)
Arhitektura big data aplikacije (tarabica)Arhitektura big data aplikacije (tarabica)
Arhitektura big data aplikacije (tarabica)Darko Marjanovic
 
Hadoop and IoT Sinergija 2014
Hadoop and IoT Sinergija 2014Hadoop and IoT Sinergija 2014
Hadoop and IoT Sinergija 2014Darko Marjanovic
 
Hadoop infrastructure for education
Hadoop infrastructure for educationHadoop infrastructure for education
Hadoop infrastructure for educationDarko Marjanovic
 
Hadoop i sveprisutno racunarstvo
Hadoop i sveprisutno racunarstvoHadoop i sveprisutno racunarstvo
Hadoop i sveprisutno racunarstvoDarko Marjanovic
 

Mais de Darko Marjanovic (9)

Hadoop ekosistem u praksi - socijalne mreže, unapređenje prodaje i servisa
Hadoop ekosistem u praksi - socijalne mreže, unapređenje prodaje i servisaHadoop ekosistem u praksi - socijalne mreže, unapređenje prodaje i servisa
Hadoop ekosistem u praksi - socijalne mreže, unapređenje prodaje i servisa
 
Big Data: Apache Spark -novo pojačanje tradicionalnom BI ili ne?
Big Data: Apache Spark -novo pojačanje tradicionalnom BI ili ne?Big Data: Apache Spark -novo pojačanje tradicionalnom BI ili ne?
Big Data: Apache Spark -novo pojačanje tradicionalnom BI ili ne?
 
Data Science Conference Belgrade
Data Science Conference BelgradeData Science Conference Belgrade
Data Science Conference Belgrade
 
Big data i arkitektura big data aplikacije meetup
Big data i arkitektura big data aplikacije meetupBig data i arkitektura big data aplikacije meetup
Big data i arkitektura big data aplikacije meetup
 
Big data apache spark zamena za hadoop ili ne?
Big data   apache spark zamena za hadoop ili ne?Big data   apache spark zamena za hadoop ili ne?
Big data apache spark zamena za hadoop ili ne?
 
Arhitektura big data aplikacije (tarabica)
Arhitektura big data aplikacije (tarabica)Arhitektura big data aplikacije (tarabica)
Arhitektura big data aplikacije (tarabica)
 
Hadoop and IoT Sinergija 2014
Hadoop and IoT Sinergija 2014Hadoop and IoT Sinergija 2014
Hadoop and IoT Sinergija 2014
 
Hadoop infrastructure for education
Hadoop infrastructure for educationHadoop infrastructure for education
Hadoop infrastructure for education
 
Hadoop i sveprisutno racunarstvo
Hadoop i sveprisutno racunarstvoHadoop i sveprisutno racunarstvo
Hadoop i sveprisutno racunarstvo
 

Último

Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfOrbitshub
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024The Digital Insurer
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Angeliki Cooney
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamUiPathCommunity
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...apidays
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...apidays
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesrafiqahmad00786416
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...apidays
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Zilliz
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024The Digital Insurer
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsNanddeep Nachan
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 

Último (20)

Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam

Big Data tools in practice

  • 1. Big Data tools in practice Darko Marjanović, darko@thingsolver.com Miloš Milovanović, milos@thingsolver.com
  • 3. Hadoop
      The Apache Hadoop software library is a framework that allows the distributed processing of large data sets across clusters of computers using simple programming models.
      • Pros
          • Linear scalability.
          • Commodity hardware.
          • Pricing and licensing.
          • Any data types.
          • Analytical queries.
          • Integration with traditional systems.
      • Cons
          • Implementation.
          • MapReduce ease of use.
          • Intense calculations with little data.
          • In-memory processing.
          • Real-time analytics.
  • 4. Hadoop
      • Hadoop Common
      • HDFS
      • MapReduce
      • YARN
  • 9. Apache Spark
      Apache Spark is a fast and general engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing.
      • Pros
          • Up to 100x faster than MapReduce (for in-memory workloads).
          • Ease of use.
          • Streaming, MLlib, Graph and SQL.
          • Pricing and licensing.
          • In-memory processing.
          • Integration with Hadoop.
      • Cons
          • Integration with traditional systems.
          • Limited memory per machine (GC).
          • Configuration.
  • 10. Spark
  • 12. Resilient Distributed Datasets
      A distributed memory abstraction that allows programmers to perform in-memory computations on large clusters while retaining the fault tolerance of data flow models like MapReduce.*
      • Immutability
      • Lineage (reconstruct lost partitions)
      • Fault tolerance through logging updates made to a dataset (single operation applied to many records)
      • Creation:
          • Reading a dataset from storage (HDFS or any other)
          • From other RDDs
      * Technical Report No. UCB/EECS-2011-82, available at: http://www.eecs.berkeley.edu/Pubs/TechRpts/2011/EECS-2011-82.html
  • 13. RDD operations
      • Transformations
          • Lazily evaluated (executed only when an action is called)
          • Reduces wait states
          • Better pipelining
          • Examples: map(f : T ⇒ U), filter(f : T ⇒ Bool), groupByKey(), join()
      • Actions
          • Run immediately
          • Return a value to the application or export data to a storage system
          • Examples: count(), collect(), reduce(f : (T, T) ⇒ T), save(path: String)
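The split between lazy transformations and eager actions can be illustrated without a cluster. The block below is only a toy model in plain Python, not Spark code: `ToyRDD`, `square`, and `seen` are hypothetical names invented for this sketch, and the real RDD implementation is partitioned and distributed. It assumes that recording each transformation as lineage and replaying it on every action is a fair simplification of the behavior the slide describes.

```python
from functools import reduce as fold


class ToyRDD:
    """Hypothetical stand-in for an RDD: immutable, lazy, rebuilt from lineage."""

    def __init__(self, source, lineage=()):
        self._source = tuple(source)    # the data is never modified in place
        self._lineage = tuple(lineage)  # recorded steps, not computed results

    # Transformations: return a new ToyRDD and compute nothing.
    def map(self, f):
        return ToyRDD(self._source, self._lineage + (("map", f),))

    def filter(self, f):
        return ToyRDD(self._source, self._lineage + (("filter", f),))

    def _compute(self):
        # Replay the lineage over the source data.
        data = iter(self._source)
        for kind, f in self._lineage:
            data = map(f, data) if kind == "map" else filter(f, data)
        return data

    # Actions: trigger the computation and return a value.
    def collect(self):
        return list(self._compute())

    def count(self):
        return sum(1 for _ in self._compute())

    def reduce(self, f):
        return fold(f, self._compute())


seen = []


def square(x):
    seen.append(x)  # records when the function actually runs
    return x * x


numbers = ToyRDD(range(1, 6))
evens = numbers.map(square).filter(lambda x: x % 2 == 0)
ran_before_action = list(seen)            # empty: no action has been called yet
result = evens.collect()                  # [4, 16]
total = evens.reduce(lambda a, b: a + b)  # replays the lineage a second time
```

In this toy model each action replays the whole lineage, so `square` runs once per element for `collect()` and again for `reduce()`. That repeated work is what caching an RDD is meant to avoid.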
  • 14. Spark program lifecycle
      1. Create RDD (external data or parallelize collection)
      2. Transformation (lazy evaluated)
      3. Cache RDD (for reuse)
      4. Action (execute computation and return results)
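The four lifecycle steps map onto a handful of PySpark calls. The function below is a minimal sketch, not code from the deck: `run_lifecycle` and its parameters are hypothetical names, and it assumes the caller passes in an already created `SparkContext` as `sc`.

```python
def run_lifecycle(sc, numbers):
    """Walk through the four lifecycle steps on a small collection.

    `sc` is assumed to be an existing pyspark.SparkContext.
    """
    # 1. Create an RDD by parallelizing a local collection.
    rdd = sc.parallelize(numbers)

    # 2. Transformation: only recorded here, nothing is computed yet.
    squares = rdd.map(lambda x: x * x)

    # 3. Mark the RDD for caching so later actions can reuse it.
    squares.cache()

    # 4. Actions: the first one runs the computation, the second can
    #    read the cached partitions instead of recomputing them.
    total = squares.reduce(lambda a, b: a + b)
    n = squares.count()
    return total, n
```

With PySpark installed, one way to try it locally is to create a context with `SparkContext("local[*]", "lifecycle-sketch")`, call `run_lifecycle(sc, range(1, 6))`, and then call `sc.stop()`.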
  • 15. Spark in a cluster mode * http://spark.apache.org/docs/latest/img/cluster-overview.png
  • 16. PySpark
      • Python API for Spark
      • Easy-to-use programming abstraction and parallel runtime:
          • “Here’s an operation, run it on all of the data”
      • Dynamically typed (RDDs can hold objects of multiple types)
      • Integrates with other Python libraries, such as NumPy, Pandas, scikit-learn, Flask
      • Run Spark from Jupyter notebooks
  • 17. Spark DataFrames
      DataFrames are a common data science abstraction that goes across languages. A data frame is a table, or two-dimensional array-like structure, in which each column contains measurements on one variable, and each row contains one case.
      A Spark DataFrame is a distributed collection of data organized into named columns, and can be created:
          • from structured data files
          • from Hive tables
          • from external databases
          • from RDDs
      Some supported operations:
          • slice data
          • sort data
          • aggregate data
          • join with other DataFrames
  • 18. DataFrame benefits
      • Lazy evaluation
      • Domain-specific language for distributed data manipulation
      • Automatic parallelization and cluster distribution
      • Integration with the pipeline API for MLlib
      • Query structured data with SQL (using SQLContext)
      • Integration with Pandas DataFrames (and other Python data libraries)

      DataFrame API:

          from pyspark.sql import SQLContext

          # sc is an existing SparkContext
          sqlContext = SQLContext(sc)
          df = sqlContext.read.json("data.json")
          df.show()
          df.select("id").show()
          df.filter(df["id"] > 10).show()

      SQL query on the same data:

          from pyspark.sql import SQLContext

          # sc is an existing SparkContext
          sqlContext = SQLContext(sc)
          df = sqlContext.read.json("data.json")
          # Register the DataFrame as a table so SQL can refer to it by name
          df.registerTempTable("data")
          results = sqlContext.sql("SELECT * FROM data WHERE id > 10")
  • 19. Pandas DF vs Spark DF

      Pandas DF                                         | Spark DF
      --------------------------------------------------+----------------------------
      Single machine tool (all data needs to fit in     | Distributed (data > memory)
      memory, except with HDF5)                         |
      Better API                                        | Good API
      No parallelism                                    | Parallel by default
      Mutable                                           | Immutable

      Some function differences: reading data, counting, displaying, inferring types, statistics, creating new columns (https://medium.com/@chris_bour/6-differences-between-pandas-and-spark-dataframes-1380cec394d2)
  • 20. A very popular benchmark * https://databricks.com/wp-content/uploads/2015/02/Screen-Shot-2015-02-16-at-9.46.39-AM-1024x457.png
  • 21. Big Data tools in practice Darko Marjanović, darko@thingsolver.com Miloš Milovanović, milos@thingsolver.com