High Performance Processing of
Streaming Data
Workshops on Dynamic Data Driven Applications Systems (DDDAS)
In conjunction with: 22nd International Conference on
High Performance Computing (HiPC), Bengaluru, India
12/16/2015
Supun Kamburugamuve, Saliya Ekanayake, Milinda Pathirage
and Geoffrey Fox December 16, 2015
gcf@indiana.edu
http://www.dsc.soic.indiana.edu/, http://spidal.org/ http://hpc-abds.org/kaleidoscope/
Department of Intelligent Systems Engineering
School of Informatics and Computing, Digital Science Center
Indiana University Bloomington
Software Philosophy
• We use the concept of HPC-ABDS, the High Performance Computing
enhanced Apache Big Data Software Stack, illustrated on the next slide.
• HPC-ABDS is a collection of 350 software systems used in either HPC or
best-practice Big Data applications. The latter include Apache, other
open-source, and commercial systems.
• HPC-ABDS helps ABDS by allowing HPC to add performance to ABDS
software systems.
• HPC-ABDS helps HPC by bringing the rich functionality and software
sustainability model of commercial and open-source software. These bring
a large community and expertise that is reasonably easy to find, as it is
broadly taught both in traditional courses and through community activities
such as Meetup groups, where for example:
– Apache Spark has 107,000 Meetup members in 233 groups
– Hadoop has 40,000 and was installed in 32% of company data systems in 2013
– Apache Storm has 9,400 members
• This talk focuses on Storm: its use and how one can add high performance
Kaleidoscope of (Apache) Big Data Stack (ABDS) and HPC Technologies
Cross-Cutting Functions
1) Message and Data Protocols: Avro, Thrift, Protobuf
2) Distributed Coordination: Google Chubby, Zookeeper, Giraffe, JGroups
3) Security & Privacy: InCommon, Eduroam, OpenStack Keystone, LDAP, Sentry, Sqrrl, OpenID, SAML, OAuth
4) Monitoring: Ambari, Ganglia, Nagios, Inca
17) Workflow-Orchestration: ODE, ActiveBPEL, Airavata, Pegasus, Kepler, Swift, Taverna, Triana, Trident, BioKepler, Galaxy, IPython, Dryad,
Naiad, Oozie, Tez, Google FlumeJava, Crunch, Cascading, Scalding, e-Science Central, Azure Data Factory, Google Cloud Dataflow, NiFi (NSA),
Jitterbit, Talend, Pentaho, Apatar, Docker Compose
16) Application and Analytics: Mahout, MLlib, MLbase, DataFu, R, pbdR, Bioconductor, ImageJ, OpenCV, Scalapack, PetSc, Azure Machine
Learning, Google Prediction API & Translation API, mlpy, scikit-learn, PyBrain, CompLearn, DAAL(Intel), Caffe, Torch, Theano, DL4j, H2O, IBM
Watson, Oracle PGX, GraphLab, GraphX, IBM System G, GraphBuilder(Intel), TinkerPop, Google Fusion Tables, CINET, NWB, Elasticsearch, Kibana,
Logstash, Graylog, Splunk, Tableau, D3.js, three.js, Potree, DC.js
15B) Application Hosting Frameworks: Google App Engine, AppScale, Red Hat OpenShift, Heroku, Aerobatic, AWS Elastic Beanstalk, Azure, Cloud
Foundry, Pivotal, IBM BlueMix, Ninefold, Jelastic, Stackato, appfog, CloudBees, Engine Yard, CloudControl, dotCloud, Dokku, OSGi, HUBzero,
OODT, Agave, Atmosphere
15A) High level Programming: Kite, Hive, HCatalog, Tajo, Shark, Phoenix, Impala, MRQL, SAP HANA, HadoopDB, PolyBase, Pivotal HD/Hawq,
Presto, Google Dremel, Google BigQuery, Amazon Redshift, Drill, Kyoto Cabinet, Pig, Sawzall, Google Cloud DataFlow, Summingbird
14B) Streams: Storm, S4, Samza, Granules, Google MillWheel, Amazon Kinesis, LinkedIn Databus, Facebook Puma/Ptail/Scribe/ODS, Azure Stream
Analytics, Floe
14A) Basic Programming model and runtime, SPMD, MapReduce: Hadoop, Spark, Twister, MR-MPI, Stratosphere (Apache Flink), Reef, Hama,
Giraph, Pregel, Pegasus, Ligra, GraphChi, Galois, Medusa-GPU, MapGraph, Totem
13) Inter process communication Collectives, point-to-point, publish-subscribe: MPI, Harp, Netty, ZeroMQ, ActiveMQ, RabbitMQ,
NaradaBrokering, QPid, Kafka, Kestrel, JMS, AMQP, Stomp, MQTT, Marionette Collective, Public Cloud: Amazon SNS, Lambda, Google Pub Sub,
Azure Queues, Event Hubs
12) In-memory databases/caches: Gora (general object from NoSQL), Memcached, Redis, LMDB (key value), Hazelcast, Ehcache, Infinispan
12) Object-relational mapping: Hibernate, OpenJPA, EclipseLink, DataNucleus, ODBC/JDBC
12) Extraction Tools: UIMA, Tika
11C) SQL(NewSQL): Oracle, DB2, SQL Server, SQLite, MySQL, PostgreSQL, CUBRID, Galera Cluster, SciDB, Rasdaman, Apache Derby, Pivotal
Greenplum, Google Cloud SQL, Azure SQL, Amazon RDS, Google F1, IBM dashDB, N1QL, BlinkDB
11B) NoSQL: Lucene, Solr, Solandra, Voldemort, Riak, Berkeley DB, Kyoto/Tokyo Cabinet, Tycoon, Tyrant, MongoDB, Espresso, CouchDB,
Couchbase, IBM Cloudant, Pivotal Gemfire, HBase, Google Bigtable, LevelDB, Megastore and Spanner, Accumulo, Cassandra, RYA, Sqrrl, Neo4J,
Yarcdata, AllegroGraph, Blazegraph, Facebook Tao, Titan:db, Jena, Sesame
Public Cloud: Azure Table, Amazon Dynamo, Google DataStore
11A) File management: iRODS, NetCDF, CDF, HDF, OPeNDAP, FITS, RCFile, ORC, Parquet
10) Data Transport: BitTorrent, HTTP, FTP, SSH, Globus Online (GridFTP), Flume, Sqoop, Pivotal GPLOAD/GPFDIST
9) Cluster Resource Management: Mesos, Yarn, Helix, Llama, Google Omega, Facebook Corona, Celery, HTCondor, SGE, OpenPBS, Moab, Slurm,
Torque, Globus Tools, Pilot Jobs
8) File systems: HDFS, Swift, Haystack, f4, Cinder, Ceph, FUSE, Gluster, Lustre, GPFS, GFFS
Public Cloud: Amazon S3, Azure Blob, Google Cloud Storage
7) Interoperability: Libvirt, Libcloud, JClouds, TOSCA, OCCI, CDMI, Whirr, Saga, Genesis
6) DevOps: Docker (Machine, Swarm), Puppet, Chef, Ansible, SaltStack, Boto, Cobbler, Xcat, Razor, CloudMesh, Juju, Foreman, OpenStack Heat,
Sahara, Rocks, Cisco Intelligent Automation for Cloud, Ubuntu MaaS, Facebook Tupperware, AWS OpsWorks, OpenStack Ironic, Google Kubernetes,
Buildstep, Gitreceive, OpenTOSCA, Winery, CloudML, Blueprints, Terraform, DevOpSlang, Any2Api
5) IaaS Management from HPC to hypervisors: Xen, KVM, Hyper-V, VirtualBox, OpenVZ, LXC, Linux-Vserver, OpenStack, OpenNebula,
Eucalyptus, Nimbus, CloudStack, CoreOS, rkt, VMware ESXi, vSphere and vCloud, Amazon, Azure, Google and other public Clouds
Networking: Google Cloud DNS, Amazon Route 53
21 layers, over 350 software packages (May 15, 2015)
Green implies HPC integration
High Performance Computing Apache Big Data Software Stack
IOTCloud
• Device → Pub-Sub → Storm → Datastore → Data Analysis
• Apache Storm provides a scalable distributed system for processing
data streams coming from devices in real time.
• For example, the Storm layer can decide to store the data in cloud
storage for further analysis or to send control data back to the devices
• Evaluating Pub-Sub systems: ActiveMQ, RabbitMQ, Kafka, Kestrel
Figure: Turtlebot and Kinect
6 forms of MapReduce cover “all” circumstances
Describes different aspects:
- Problem
- Machine
- Software
If these different aspects match, one gets good performance
Cloud controlled Robot Data Pipeline
Figure labels: Gateway; Message Brokers (RabbitMQ, Kafka); Streaming Workflows (Apache Storm); sending to pub-sub; persisting to storage
A streaming workflow is a stream application with some tasks running in parallel; multiple streaming workflows can run
Apache Storm comes from Twitter and supports the Map-Dataflow-Streaming computing model
Key ideas: Pub-Sub, fault tolerance (Zookeeper), Bolts, Spouts
Simultaneous Localization & Mapping (SLAM)
$p(x_{1:t}, m \mid z_{1:t}, u_{1:t-1}) = p(m \mid x_{1:t}, z_{1:t}) \, p(x_{1:t} \mid z_{1:t}, u_{1:t-1})$
Application: build a map given the distance measurements from the robot to
objects around it and its pose
Streaming workflow: Rao-Blackwellized particle filtering based algorithm for
SLAM. Distribute the particles across parallel tasks and compute in parallel.
Figure annotations: particles are distributed in parallel tasks; map building happens periodically
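The "distribute the particles across parallel tasks" pattern can be sketched as below. This is a minimal illustration and not the authors' SLAM code: the 1-D pose, the Gaussian measurement likelihood, and the names split, weigh_chunk and parallel_step are all assumptions made for the example, and Python threads merely stand in for Storm tasks.

```python
import math
import random
from concurrent.futures import ThreadPoolExecutor


def split(particles, n_tasks):
    """Deal particles round-robin into n_tasks chunks of near-equal size."""
    return [particles[i::n_tasks] for i in range(n_tasks)]


def weigh_chunk(chunk, measurement, sigma=0.5):
    """Per-task work: weight each particle by an assumed Gaussian likelihood."""
    return [(x, math.exp(-0.5 * ((measurement - x) / sigma) ** 2)) for x in chunk]


def parallel_step(particles, measurement, n_tasks, rng):
    """One filter step: scatter particles, weight in parallel, gather, resample."""
    chunks = split(particles, n_tasks)
    with ThreadPoolExecutor(max_workers=n_tasks) as pool:
        results = list(pool.map(lambda c: weigh_chunk(c, measurement), chunks))
    weighted = [pw for chunk in results for pw in chunk]
    poses = [x for x, _ in weighted]
    weights = [w for _, w in weighted]
    # Resampling needs every weight, so it is the synchronization point
    # between the parallel tasks.
    return rng.choices(poses, weights=weights, k=len(poses))


rng = random.Random(0)
particles = [rng.uniform(0.0, 10.0) for _ in range(200)]
resampled = parallel_step(particles, measurement=4.0, n_tasks=4, rng=rng)
```

The gather-and-resample step is where the tasks must exchange data, which is one reason a collective communication step is of interest alongside plain dataflow.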
Parallel SLAM: Simultaneous Localization and Mapping by Particle Filtering
Figure: speedup
Robot Latency: Kafka & RabbitMQ
Figure panels: Kinect with Turtlebot and RabbitMQ; RabbitMQ versus Kafka
SLAM latency variations for 4- or 20-way parallelism
Jitter is due to application or system influences such as network delays, garbage
collection, and scheduling of tasks
Figure panels: no cut; fluctuations decrease after a cut on #iterations per swarm member
Fault Tolerance at Message Broker
• RabbitMQ supports queue replication and persistence to
disk across nodes for fault tolerance
• A cluster of RabbitMQ brokers can be used to achieve high
availability and fault tolerance
• Kafka stores the messages on disk and supports
replication of topics across nodes for fault tolerance.
Kafka's storage-first approach may increase reliability but
can also increase latency
• Multiple Kafka brokers can be used to achieve high
availability and fault tolerance
Parallel Overheads, SLAM (Simultaneous Localization and Mapping): I/O and Garbage Collection
Parallel Overheads, SLAM (Simultaneous Localization and Mapping): Load Imbalance Overhead
Multi-Robot Collision Avoidance
• Second parallel Storm application
• Velocity Obstacles (VOs) along with
other constraints such as acceleration
and max velocity limits,
• Non-holonomic constraints for
differential robots, and localization
uncertainty.
• NPC NPS measure parallelism
Figures: streaming workflow (information from robots; runs in parallel); control latency; # collisions versus number of robots
Lessons from using Storm
• We successfully parallelized Storm as the core software of two
robot planning applications
• We needed to replace Kafka with RabbitMQ to improve
performance
– Kafka had large variations in response time
• We reduced garbage collection overheads
• We see that we need to generalize Storm’s
– Map-Dataflow Streaming architecture to
– Map-Dataflow/Collective Streaming architecture
• Now we use HPC-ABDS to improve Storm communication
performance
Bringing Optimal Communications to Storm
Both process-based and thread-based parallelism are used
Worker and task distribution of Storm: a worker hosts multiple tasks. B-1 is a
task of component B and W-1 is a task of W
Communication links are between workers; these are multiplexed among the tasks
Figure (shown twice): Node-1 hosts two workers, one with B-1 and W-1 and one with W-2 and W-3; Node-2 hosts two workers, one with W-4 and W-5 and one with W-6 and W-7
Memory Mapped File based Communication
• Inter-process communication using shared memory within a
single node
• Multiple-writer, single-reader design
• A memory-mapped file is created for each worker of a node
• The file is created under /dev/shm
• The writer breaks the message into packets and puts them in the file
• The reader reads the packets and assembles the message
• When a file becomes full, move to another file
• PS: all of this is “well known” BUT not deployed
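The multiple-writer, single-reader design above can be sketched with Python's mmap module. This is a simplified illustration under several assumptions, not the Storm implementation: the class name SharedFile, the 4096-byte file size, the 8-byte payload limit and the four-field header are invented for the example; a thread lock stands in for a cross-process atomic increment; a temporary directory is used instead of /dev/shm so the sketch runs anywhere; and the reader runs only after the writers finish.

```python
import mmap
import os
import struct
import tempfile
import threading

FILE_SIZE = 4096                  # assumed capacity of one shared file
PAYLOAD = 8                       # assumed maximum payload bytes per packet
HEADER = struct.Struct("<IIII")   # simplified: msg id, packet count, packet no, length


class SharedFile:
    """One memory-mapped file: many writers append packets, one reader drains them."""

    def __init__(self, path):
        with open(path, "wb") as f:
            f.truncate(FILE_SIZE)             # pre-size the file so it can be mapped
        self._f = open(path, "r+b")
        self.buf = mmap.mmap(self._f.fileno(), FILE_SIZE)
        self._lock = threading.Lock()         # stands in for a cross-process atomic
        self._write_pos = 0

    def _reserve(self, size):
        """Obtain the write location and increment it in one atomic step."""
        with self._lock:
            start = self._write_pos
            if start + size > FILE_SIZE:
                raise BufferError("file full: a full system would switch to a new file")
            self._write_pos = start + size
            return start

    def write_message(self, msg_id, message):
        """Writer side: break the message into packets and put them in the file."""
        chunks = [message[i:i + PAYLOAD] for i in range(0, len(message), PAYLOAD)]
        for no, chunk in enumerate(chunks):
            start = self._reserve(HEADER.size + len(chunk))
            HEADER.pack_into(self.buf, start, msg_id, len(chunks), no, len(chunk))
            body = start + HEADER.size
            self.buf[body:body + len(chunk)] = chunk

    def read_messages(self):
        """Reader side: read packet by packet and reassemble each message."""
        parts, pos = {}, 0
        while pos < self._write_pos:
            msg_id, count, no, length = HEADER.unpack_from(self.buf, pos)
            body = pos + HEADER.size
            parts.setdefault(msg_id, [None] * count)[no] = bytes(self.buf[body:body + length])
            pos = body + length
        return {mid: b"".join(chunks) for mid, chunks in parts.items()}

    def close(self):
        self.buf.close()
        self._f.close()


def demo():
    """Two writer threads share one file; the reader reassembles both messages."""
    with tempfile.TemporaryDirectory() as d:
        shared = SharedFile(os.path.join(d, "worker-0.mmap"))
        messages = {1: b"hello from writer one", 2: b"and a second, longer message"}
        writers = [threading.Thread(target=shared.write_message, args=item)
                   for item in messages.items()]
        for t in writers:
            t.start()
        for t in writers:
            t.join()
        received = shared.read_messages()
        shared.close()
        return messages, received
```

Because each packet carries its message id and packet number, packets from different writers may interleave in the file and the reader can still reassemble every message.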
Optimized Broadcast Algorithms
• Binary tree
– Workers arranged in a binary tree
• Flat tree
– Broadcast from the origin to 1 worker in each node
sequentially. This worker then broadcasts to the other workers in the
node sequentially
• Bidirectional rings
– Workers arranged in a line
– Starts two broadcasts from the origin, and each traverses half
of the line
• All are well known, and we have used similar ideas of basic HPC-
ABDS to improve MPI for machine learning (using Java)
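The three layouts can be sketched as functions that return the (sender, receiver) pairs of one broadcast. This is one possible reading of the bullets above, not the code contributed to Storm: the function names are hypothetical, the origin is assumed to be the first worker, and in the flat tree the origin is assumed to act as the leader of its own node.

```python
def binary_tree(workers):
    """Workers as an array-backed binary tree rooted at workers[0] (the origin)."""
    return [(workers[(i - 1) // 2], workers[i]) for i in range(1, len(workers))]


def flat_tree(nodes):
    """nodes: one list of workers per machine; nodes[0][0] is the origin."""
    origin = nodes[0][0]
    edges = []
    for node in nodes:
        leader = node[0]
        if leader != origin:
            edges.append((origin, leader))            # one network message per node
        edges.extend((leader, w) for w in node[1:])   # leader fans out inside the node
    return edges


def bidirectional_ring(workers):
    """Origin starts two chains; each chain walks one half of the remaining line."""
    rest = workers[1:]
    half = (len(rest) + 1) // 2
    edges = []
    for chain in (rest[:half], rest[half:]):
        previous = workers[0]
        for w in chain:
            edges.append((previous, w))
            previous = w
    return edges


def depth(edges, origin):
    """Longest hop count from the origin; edges must list a sender before it forwards."""
    hops = {origin: 0}
    for sender, receiver in edges:
        hops[receiver] = hops[sender] + 1
    return max(hops.values())
```

For eight workers every layout uses seven messages, but the hop depth differs: 3 for the binary tree, 2 for a flat tree over four nodes of two workers, and 4 for the bidirectional ring. Depth alone ignores the cost of sequential sends from one worker, so it is only a rough guide to latency.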
Java MPI performs better than Threads I
128 24-core Haswell nodes with Java machine learning
Default MPI is much worse than threads
Optimized MPI using shared-memory node-based messaging is much better
than threads
Java MPI performs better than Threads II
128 24-core Haswell nodes
200K Dataset Speedup
Speedups show classic parallel computing structure,
with 48 nodes at 1 core each taken as “sequential”
State-of-the-art dimension reduction routine
Speedups improve as problem size increases
Going from 48 nodes with 1 core to 128 nodes with 24 cores is a potential speedup of 64
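The potential speedup of 64 follows from the ratio of total cores in the two configurations. A small sketch of that arithmetic, with hypothetical function names:

```python
def ideal_speedup(base_nodes, base_cores, nodes, cores):
    """Ratio of total cores in the larger run to total cores in the baseline run."""
    return (nodes * cores) / (base_nodes * base_cores)


def parallel_efficiency(measured_speedup, ideal):
    """Fraction of the ideal speedup that a measured speedup achieves."""
    return measured_speedup / ideal


# 128 nodes x 24 cores = 3072 cores, against a 48 x 1 = 48 core baseline.
potential = ideal_speedup(48, 1, 128, 24)
```

Here potential evaluates to 3072 / 48 = 64.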
Experimental Configuration
• 11 Node cluster
• 1 Node – Nimbus & ZooKeeper
• 1 Node – RabbitMQ
• 1 Node – Client
• 8 Nodes – Supervisors with 4 workers each
• The client sends messages with the current timestamp; the topology returns
a response with the same timestamp. Latency = current time -
timestamp
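The round-trip measurement can be sketched as follows. This is an illustration only: in-memory queues stand in for RabbitMQ, a thread that echoes messages stands in for the Storm topology, and the names echo_topology and measure_latencies are assumptions.

```python
import queue
import threading
import time


def echo_topology(inbox, outbox):
    """Stand-in for the topology: return each message with its timestamp unchanged."""
    while True:
        msg = inbox.get()
        if msg is None:          # sentinel: shut down
            break
        outbox.put(msg)


def measure_latencies(n_messages):
    """Send timestamped messages and compute latency = current time - timestamp."""
    inbox, outbox = queue.Queue(), queue.Queue()
    worker = threading.Thread(target=echo_topology, args=(inbox, outbox))
    worker.start()
    latencies = []
    for i in range(n_messages):
        inbox.put({"id": i, "timestamp": time.perf_counter()})
        reply = outbox.get()
        latencies.append(time.perf_counter() - reply["timestamp"])
    inbox.put(None)
    worker.join()
    return latencies
```

Because the timestamp is both written and read on the client, the measurement uses a single clock and needs no clock synchronization across the cluster.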
Original and new Storm Broadcast Algorithms
Figure: test topology with Client, RabbitMQ, R-1, B-1, W-1, W-5, ..., W-n, G-1
Figure: speedup of latency with both TCP-based and shared-memory-based
communications for different algorithms (Original, Binary Tree, Flat Tree, Bidirectional Ring) and sizes
Future Work
• Memory-mapped communications require continuous
polling by a thread. If this thread also does the processing of
the message, the polling overhead can be reduced.
• Scheduling of tasks should take the communications into
account
• The current processing model has multiple threads
processing a message at different stages. Reduce the
number of threads to achieve predictable performance
• Improve the packet structure to reduce the overhead
• Compare with related Java MPI technology
• Add additional collectives to those supported by Storm
Conclusions on initial HPC-ABDS
use in Apache Storm
• Apache Storm worked well with performance
enhancements
• The binary tree performed the best
• The algorithms reduce the network traffic
• Shared memory communications reduce the
latency further
• Memory-mapped file communications improve
performance
Thank You
• References
– Our software https://github.com/iotcloud
– Apache Storm http://storm.apache.org/
– We will donate software to Storm
– SLAM paper http://dsc.soic.indiana.edu/publications/SLAM_In_the_cloud.pdf
– Collision Avoidance paper http://goo.gl/xdB8LZ
Spare SLAM Slides
• IoTCloud uses Zookeeper,
Storm, Hbase, RabbitMQ
for robot cloud control
• Focus on high performance
(parallel) control functions
• Guaranteed real time
response
Parallel simultaneous localization and mapping (SLAM) in the cloud
Figure panels: latency with RabbitMQ; latency with Kafka (different message sizes in bytes)
Note the change in scales for latency and message size
Spare High Performance Storm Slides
Memory Mapped Communication
Figure: Writer 01 and Writer 02 write packets (Packet 1, Packet 2, Packet 3) to a shared file; the reader reads from it
Writers obtain the write location atomically and increment it
The reader reads packet by packet, sequentially
A new file is used when the file size is reached
The reader deletes the files after it reads them fully
Packet Structure
Field | ID | No of Packets | Packet No | Dest Task | Content Length | Source Task | Stream Length | Stream | Content
Bytes | 16 | 4 | 4 | 4 | 4 | 4 | 4 | variable | variable
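The packet layout can be sketched with Python's struct module. The field order and the 16- and 4-byte sizes come from the slide; the big-endian byte order, the signed 32-bit integers, and the function names pack_packet and unpack_packet are assumptions made for this example.

```python
import struct

# 16-byte ID followed by six 4-byte integers: 16 + 6 * 4 = 40 header bytes.
# Byte order and signedness are assumed, not taken from the slides.
HEADER = struct.Struct(">16s6i")


def pack_packet(msg_id, n_packets, packet_no, dest_task, source_task, stream, content):
    """Build one packet: fixed header, then the stream name, then the content."""
    header = HEADER.pack(msg_id, n_packets, packet_no, dest_task,
                         len(content), source_task, len(stream))
    return header + stream + content


def unpack_packet(data):
    """Parse a packet, using the two length fields to slice the variable parts."""
    (msg_id, n_packets, packet_no, dest_task,
     content_len, source_task, stream_len) = HEADER.unpack_from(data)
    stream_start = HEADER.size
    content_start = stream_start + stream_len
    return {
        "id": msg_id,
        "n_packets": n_packets,
        "packet_no": packet_no,
        "dest_task": dest_task,
        "source_task": source_task,
        "stream": data[stream_start:content_start],
        "content": data[content_start:content_start + content_len],
    }
```

With this layout every packet carries 40 bytes of header before the stream name and content, which is the overhead that a leaner packet structure would reduce.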
Default Broadcast
Figure: Node-1 hosts two workers, one with B-1 and W-1 and one with W-2 and W-3; Node-2 hosts two workers, one with W-4 and W-5 and one with W-6 and W-7
When B-1 wants to broadcast a message to W, it sends 6
messages through 3 TCP communication channels
and sends 1 message to W-1 via shared memory
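The message counts for the default broadcast can be reproduced with a short sketch. The worker names and the placement dictionary are hypothetical labels for the layout in the figure, and one_copy_per_worker is an assumed alternative (one network copy per remote worker, fanned out locally) included only for comparison.

```python
# Which W tasks each worker hosts; worker-1 also hosts the broadcasting task B-1.
PLACEMENT = {
    "worker-1": ["W-1"],
    "worker-2": ["W-2", "W-3"],
    "worker-3": ["W-4", "W-5"],
    "worker-4": ["W-6", "W-7"],
}


def default_broadcast(placement, origin):
    """Default scheme: the origin sends one message per destination task."""
    remote = {w: tasks for w, tasks in placement.items() if w != origin}
    return {
        "tcp_messages": sum(len(tasks) for tasks in remote.values()),
        "tcp_channels": sum(1 for tasks in remote.values() if tasks),
        "local_messages": len(placement[origin]),
    }


def one_copy_per_worker(placement, origin):
    """Assumed alternative: one network copy per remote worker, then local fan-out."""
    remote = {w: tasks for w, tasks in placement.items() if w != origin}
    channels = sum(1 for tasks in remote.values() if tasks)
    return {
        "tcp_messages": channels,
        "tcp_channels": channels,
        "local_messages": sum(len(tasks) for tasks in placement.values()),
    }
```

For this placement the default scheme gives 6 TCP messages over 3 channels plus 1 local message, matching the slide, while sending one copy per worker would need only 3 TCP messages.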
Memory Mapped Communication
A topology with a pipeline going through all the workers
No significant difference because we are using all
the workers in the cluster
beyond the 30-worker capacity
Figure axis: Non Optimized Time
Spare Parallel Tweet Clustering with Storm Slides
Parallel Tweet Clustering with Storm
• Judy Qiu, Emilio Ferrara and Xiaoming Gao
• Storm Bolts coordinated by ActiveMQ to synchronize
parallel cluster center updates – add loops to Storm
• 2 million streaming tweets processed in 40 minutes;
35,000 clusters
Figure panels: Sequential; Parallel (eventually 10,000 bolts)
Parallel Tweet Clustering with Storm
• Speedup on up to 96 bolts on two clusters, Moe and Madrid
• Red curve is the old algorithm;
• green and blue are the new algorithm
• Full Twitter – 1000-way parallelism
• Full Everything – 10,000-way parallelism

 
Big Data and Clouds: Research and Education
Big Data and Clouds: Research and EducationBig Data and Clouds: Research and Education
Big Data and Clouds: Research and EducationGeoffrey Fox
 
High Performance Data Analytics and a Java Grande Run Time
High Performance Data Analytics and a Java Grande Run TimeHigh Performance Data Analytics and a Java Grande Run Time
High Performance Data Analytics and a Java Grande Run TimeGeoffrey Fox
 
Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ...
Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ...Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ...
Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ...Geoffrey Fox
 
HPC-ABDS: The Case for an Integrating Apache Big Data Stack with HPC
HPC-ABDS: The Case for an Integrating Apache Big Data Stack with HPC HPC-ABDS: The Case for an Integrating Apache Big Data Stack with HPC
HPC-ABDS: The Case for an Integrating Apache Big Data Stack with HPC Geoffrey Fox
 
FutureGrid Computing Testbed as a Service
 FutureGrid Computing Testbed as a Service FutureGrid Computing Testbed as a Service
FutureGrid Computing Testbed as a ServiceGeoffrey Fox
 
Big Data Applications & Analytics Motivation: Big Data and the Cloud; Centerp...
Big Data Applications & Analytics Motivation: Big Data and the Cloud; Centerp...Big Data Applications & Analytics Motivation: Big Data and the Cloud; Centerp...
Big Data Applications & Analytics Motivation: Big Data and the Cloud; Centerp...Geoffrey Fox
 
NIST Big Data Public Working Group NBD-PWG
NIST Big Data Public Working Group NBD-PWGNIST Big Data Public Working Group NBD-PWG
NIST Big Data Public Working Group NBD-PWGGeoffrey Fox
 
51 Use Cases and implications for HPC & Apache Big Data Stack
51 Use Cases and implications for HPC & Apache Big Data Stack51 Use Cases and implications for HPC & Apache Big Data Stack
51 Use Cases and implications for HPC & Apache Big Data StackGeoffrey Fox
 
Linking Programming models between Grids, Web 2.0 and Multicore
Linking Programming models between Grids, Web 2.0 and Multicore Linking Programming models between Grids, Web 2.0 and Multicore
Linking Programming models between Grids, Web 2.0 and Multicore Geoffrey Fox
 
CTS Conference Web 2.0 Tutorial Part 2
CTS Conference Web 2.0 Tutorial Part 2CTS Conference Web 2.0 Tutorial Part 2
CTS Conference Web 2.0 Tutorial Part 2Geoffrey Fox
 
CTS Conference Web 2.0 Tutorial Part 1
CTS Conference Web 2.0 Tutorial Part 1CTS Conference Web 2.0 Tutorial Part 1
CTS Conference Web 2.0 Tutorial Part 1Geoffrey Fox
 

Mais de Geoffrey Fox (20)

AI-Driven Science and Engineering with the Global AI and Modeling Supercomput...
AI-Driven Science and Engineering with the Global AI and Modeling Supercomput...AI-Driven Science and Engineering with the Global AI and Modeling Supercomput...
AI-Driven Science and Engineering with the Global AI and Modeling Supercomput...
 
High Performance Computing and Big Data
High Performance Computing and Big Data High Performance Computing and Big Data
High Performance Computing and Big Data
 
Spidal Java: High Performance Data Analytics with Java on Large Multicore HPC...
Spidal Java: High Performance Data Analytics with Java on Large Multicore HPC...Spidal Java: High Performance Data Analytics with Java on Large Multicore HPC...
Spidal Java: High Performance Data Analytics with Java on Large Multicore HPC...
 
Data Science and Online Education
Data Science and Online EducationData Science and Online Education
Data Science and Online Education
 
Big Data HPC Convergence and a bunch of other things
Big Data HPC Convergence and a bunch of other thingsBig Data HPC Convergence and a bunch of other things
Big Data HPC Convergence and a bunch of other things
 
Lessons from Data Science Program at Indiana University: Curriculum, Students...
Lessons from Data Science Program at Indiana University: Curriculum, Students...Lessons from Data Science Program at Indiana University: Curriculum, Students...
Lessons from Data Science Program at Indiana University: Curriculum, Students...
 
Data Science Curriculum at Indiana University
Data Science Curriculum at Indiana UniversityData Science Curriculum at Indiana University
Data Science Curriculum at Indiana University
 
Experience with Online Teaching with Open Source MOOC Technology
Experience with Online Teaching with Open Source MOOC TechnologyExperience with Online Teaching with Open Source MOOC Technology
Experience with Online Teaching with Open Source MOOC Technology
 
Big Data and Clouds: Research and Education
Big Data and Clouds: Research and EducationBig Data and Clouds: Research and Education
Big Data and Clouds: Research and Education
 
High Performance Data Analytics and a Java Grande Run Time
High Performance Data Analytics and a Java Grande Run TimeHigh Performance Data Analytics and a Java Grande Run Time
High Performance Data Analytics and a Java Grande Run Time
 
Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ...
Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ...Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ...
Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ...
 
HPC-ABDS: The Case for an Integrating Apache Big Data Stack with HPC
HPC-ABDS: The Case for an Integrating Apache Big Data Stack with HPC HPC-ABDS: The Case for an Integrating Apache Big Data Stack with HPC
HPC-ABDS: The Case for an Integrating Apache Big Data Stack with HPC
 
Remarks on MOOC's
Remarks on MOOC'sRemarks on MOOC's
Remarks on MOOC's
 
FutureGrid Computing Testbed as a Service
 FutureGrid Computing Testbed as a Service FutureGrid Computing Testbed as a Service
FutureGrid Computing Testbed as a Service
 
Big Data Applications & Analytics Motivation: Big Data and the Cloud; Centerp...
Big Data Applications & Analytics Motivation: Big Data and the Cloud; Centerp...Big Data Applications & Analytics Motivation: Big Data and the Cloud; Centerp...
Big Data Applications & Analytics Motivation: Big Data and the Cloud; Centerp...
 
NIST Big Data Public Working Group NBD-PWG
NIST Big Data Public Working Group NBD-PWGNIST Big Data Public Working Group NBD-PWG
NIST Big Data Public Working Group NBD-PWG
 
51 Use Cases and implications for HPC & Apache Big Data Stack
51 Use Cases and implications for HPC & Apache Big Data Stack51 Use Cases and implications for HPC & Apache Big Data Stack
51 Use Cases and implications for HPC & Apache Big Data Stack
 
Linking Programming models between Grids, Web 2.0 and Multicore
Linking Programming models between Grids, Web 2.0 and Multicore Linking Programming models between Grids, Web 2.0 and Multicore
Linking Programming models between Grids, Web 2.0 and Multicore
 
CTS Conference Web 2.0 Tutorial Part 2
CTS Conference Web 2.0 Tutorial Part 2CTS Conference Web 2.0 Tutorial Part 2
CTS Conference Web 2.0 Tutorial Part 2
 
CTS Conference Web 2.0 Tutorial Part 1
CTS Conference Web 2.0 Tutorial Part 1CTS Conference Web 2.0 Tutorial Part 1
CTS Conference Web 2.0 Tutorial Part 1
 

Último

Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 

Último (20)

Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 

High Performance Processing of Streaming Data with Apache Storm

  • 1. High Performance Processing of Streaming Data. Workshops on Dynamic Data Driven Applications Systems (DDDAS), in conjunction with the 22nd International Conference on High Performance Computing (HiPC), Bengaluru, India, 12/16/2015. Supun Kamburugamuve, Saliya Ekanayake, Milinda Pathirage and Geoffrey Fox, December 16, 2015. gcf@indiana.edu; http://www.dsc.soic.indiana.edu/, http://spidal.org/, http://hpc-abds.org/kaleidoscope/. Department of Intelligent Systems Engineering, School of Informatics and Computing, Digital Science Center, Indiana University Bloomington
  • 2. Software Philosophy
      • We use the concept of HPC-ABDS, the High Performance Computing enhanced Apache Big Data Software Stack, illustrated on the next slide.
      • HPC-ABDS is a collection of 350 software systems used in either HPC or best-practice Big Data applications. The latter include Apache, other open-source and commercial systems.
      • HPC-ABDS helps ABDS by allowing HPC to add performance to ABDS software systems.
      • HPC-ABDS helps HPC by bringing the rich functionality and software sustainability model of commercial and open-source software. These bring a large community and expertise that is reasonably easy to find, as it is broadly taught both in traditional courses and by community activities such as Meetup groups, for example:
          – Apache Spark: 107,000 meet-up members in 233 groups
          – Hadoop: 40,000 members, and installed in 32% of company data systems (2013)
          – Apache Storm: 9,400 members
      • This talk focuses on Storm: its use and how one can add high performance.
  • 3. Kaleidoscope of (Apache) Big Data Stack (ABDS) and HPC Technologies: High Performance Computing Apache Big Data Software Stack. 21 layers, over 350 software packages, May 15 2015. Green implies HPC integration.
      • Cross-Cutting Functions
          – 1) Message and Data Protocols: Avro, Thrift, Protobuf
          – 2) Distributed Coordination: Google Chubby, Zookeeper, Giraffe, JGroups
          – 3) Security & Privacy: InCommon, Eduroam, OpenStack Keystone, LDAP, Sentry, Sqrrl, OpenID, SAML, OAuth
          – 4) Monitoring: Ambari, Ganglia, Nagios, Inca
      • 17) Workflow-Orchestration: ODE, ActiveBPEL, Airavata, Pegasus, Kepler, Swift, Taverna, Triana, Trident, BioKepler, Galaxy, IPython, Dryad, Naiad, Oozie, Tez, Google FlumeJava, Crunch, Cascading, Scalding, e-Science Central, Azure Data Factory, Google Cloud Dataflow, NiFi (NSA), Jitterbit, Talend, Pentaho, Apatar, Docker Compose
      • 16) Application and Analytics: Mahout, MLlib, MLbase, DataFu, R, pbdR, Bioconductor, ImageJ, OpenCV, Scalapack, PetSc, Azure Machine Learning, Google Prediction API & Translation API, mlpy, scikit-learn, PyBrain, CompLearn, DAAL(Intel), Caffe, Torch, Theano, DL4j, H2O, IBM Watson, Oracle PGX, GraphLab, GraphX, IBM System G, GraphBuilder(Intel), TinkerPop, Google Fusion Tables, CINET, NWB, Elasticsearch, Kibana, Logstash, Graylog, Splunk, Tableau, D3.js, three.js, Potree, DC.js
      • 15B) Application Hosting Frameworks: Google App Engine, AppScale, Red Hat OpenShift, Heroku, Aerobatic, AWS Elastic Beanstalk, Azure, Cloud Foundry, Pivotal, IBM BlueMix, Ninefold, Jelastic, Stackato, appfog, CloudBees, Engine Yard, CloudControl, dotCloud, Dokku, OSGi, HUBzero, OODT, Agave, Atmosphere
      • 15A) High level Programming: Kite, Hive, HCatalog, Tajo, Shark, Phoenix, Impala, MRQL, SAP HANA, HadoopDB, PolyBase, Pivotal HD/Hawq, Presto, Google Dremel, Google BigQuery, Amazon Redshift, Drill, Kyoto Cabinet, Pig, Sawzall, Google Cloud DataFlow, Summingbird
      • 14B) Streams: Storm, S4, Samza, Granules, Google MillWheel, Amazon Kinesis, LinkedIn Databus, Facebook Puma/Ptail/Scribe/ODS, Azure Stream Analytics, Floe
      • 14A) Basic Programming model and runtime, SPMD, MapReduce: Hadoop, Spark, Twister, MR-MPI, Stratosphere (Apache Flink), Reef, Hama, Giraph, Pregel, Pegasus, Ligra, GraphChi, Galois, Medusa-GPU, MapGraph, Totem
      • 13) Inter process communication (collectives, point-to-point, publish-subscribe): MPI, Harp, Netty, ZeroMQ, ActiveMQ, RabbitMQ, NaradaBrokering, QPid, Kafka, Kestrel, JMS, AMQP, Stomp, MQTT, Marionette Collective. Public Cloud: Amazon SNS, Lambda, Google Pub Sub, Azure Queues, Event Hubs
      • 12) In-memory databases/caches: Gora (general object from NoSQL), Memcached, Redis, LMDB (key value), Hazelcast, Ehcache, Infinispan
      • 12) Object-relational mapping: Hibernate, OpenJPA, EclipseLink, DataNucleus, ODBC/JDBC
      • 12) Extraction Tools: UIMA, Tika
      • 11C) SQL (NewSQL): Oracle, DB2, SQL Server, SQLite, MySQL, PostgreSQL, CUBRID, Galera Cluster, SciDB, Rasdaman, Apache Derby, Pivotal Greenplum, Google Cloud SQL, Azure SQL, Amazon RDS, Google F1, IBM dashDB, N1QL, BlinkDB
      • 11B) NoSQL: Lucene, Solr, Solandra, Voldemort, Riak, Berkeley DB, Kyoto/Tokyo Cabinet, Tycoon, Tyrant, MongoDB, Espresso, CouchDB, Couchbase, IBM Cloudant, Pivotal Gemfire, HBase, Google Bigtable, LevelDB, Megastore and Spanner, Accumulo, Cassandra, RYA, Sqrrl, Neo4J, Yarcdata, AllegroGraph, Blazegraph, Facebook Tao, Titan:db, Jena, Sesame. Public Cloud: Azure Table, Amazon Dynamo, Google DataStore
      • 11A) File management: iRODS, NetCDF, CDF, HDF, OPeNDAP, FITS, RCFile, ORC, Parquet
      • 10) Data Transport: BitTorrent, HTTP, FTP, SSH, Globus Online (GridFTP), Flume, Sqoop, Pivotal GPLOAD/GPFDIST
      • 9) Cluster Resource Management: Mesos, Yarn, Helix, Llama, Google Omega, Facebook Corona, Celery, HTCondor, SGE, OpenPBS, Moab, Slurm, Torque, Globus Tools, Pilot Jobs
      • 8) File systems: HDFS, Swift, Haystack, f4, Cinder, Ceph, FUSE, Gluster, Lustre, GPFS, GFFS. Public Cloud: Amazon S3, Azure Blob, Google Cloud Storage
      • 7) Interoperability: Libvirt, Libcloud, JClouds, TOSCA, OCCI, CDMI, Whirr, Saga, Genesis
      • 6) DevOps: Docker (Machine, Swarm), Puppet, Chef, Ansible, SaltStack, Boto, Cobbler, Xcat, Razor, CloudMesh, Juju, Foreman, OpenStack Heat, Sahara, Rocks, Cisco Intelligent Automation for Cloud, Ubuntu MaaS, Facebook Tupperware, AWS OpsWorks, OpenStack Ironic, Google Kubernetes, Buildstep, Gitreceive, OpenTOSCA, Winery, CloudML, Blueprints, Terraform, DevOpSlang, Any2Api
      • 5) IaaS Management from HPC to hypervisors: Xen, KVM, Hyper-V, VirtualBox, OpenVZ, LXC, Linux-Vserver, OpenStack, OpenNebula, Eucalyptus, Nimbus, CloudStack, CoreOS, rkt, VMware ESXi, vSphere and vCloud, Amazon, Azure, Google and other public Clouds. Networking: Google Cloud DNS, Amazon Route 53
  • 4. IoTCloud
      • Device → Pub-Sub → Storm → Datastore → Data Analysis
      • Apache Storm provides a scalable distributed system for processing data streams coming from devices in real time.
      • For example, the Storm layer can decide to store the data in cloud storage for further analysis or to send control data back to the devices.
      • Evaluating Pub-Sub systems: ActiveMQ, RabbitMQ, Kafka, Kestrel
      • Figure: Turtlebot and Kinect
  • 5. 6 Forms of MapReduce cover “all” circumstances. Describes different aspects: Problem, Machine, Software. If these different aspects match, one gets good performance.
  • 6. Cloud controlled Robot Data Pipeline
      • Figure labels: Gateway; Sending to pub-sub; Message Brokers (RabbitMQ, Kafka); Sending to; Streaming Workflows (Apache Storm); Persisting to storage; Multiple streaming workflows
      • Streaming workflow: a stream application with some tasks running in parallel
      • Apache Storm comes from Twitter and supports the Map-Dataflow-Streaming computing model
      • Key ideas: Pub-Sub, fault-tolerance (Zookeeper), Bolts, Spouts
  • 7. Simultaneous Localization & Mapping (SLAM)
      • $p(x_{1:t}, m \mid z_{1:t}, u_{1:t-1}) = p(m \mid x_{1:t}, z_{1:t}) \, p(x_{1:t} \mid z_{1:t}, u_{1:t-1})$
      • Application: build a map given the distance measurements from the robot to objects around it and its pose
      • Streaming workflow: Rao-Blackwellized particle filtering based algorithm for SLAM. Distribute the particles across parallel tasks and compute in parallel. Map building happens periodically.
      • Figure: particles are distributed in parallel tasks
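Slide 7 states that the particles are distributed across parallel tasks. As a minimal sketch of what an even split could look like, the function below assigns contiguous blocks of particle indices to tasks. The function name and the contiguous-block rule are assumptions made for illustration; they are not taken from the IoTCloud code.

```python
def partition_particles(num_particles, num_tasks):
    """Split particle indices 0..num_particles-1 into contiguous blocks,
    one block per parallel task, with block sizes differing by at most one."""
    if num_tasks <= 0:
        raise ValueError("num_tasks must be positive")
    base, extra = divmod(num_particles, num_tasks)
    blocks = []
    start = 0
    for task in range(num_tasks):
        # The first `extra` tasks take one additional particle each.
        size = base + (1 if task < extra else 0)
        blocks.append(range(start, start + size))
        start += size
    return blocks
```

Keeping block sizes within one particle of each other equalizes the particle count per task; it does not by itself equalize the work per particle.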
  • 8. Parallel SLAM: Simultaneous Localization and Mapping by Particle Filtering. Figure: speedup
  • 9. Robot Latency: Kafka & RabbitMQ. Figure panels: Kinect with Turtlebot and RabbitMQ; RabbitMQ versus Kafka
  • 10. SLAM latency variations for 4- or 20-way parallelism. Jitter is due to application or system influences such as network delays, garbage collection and scheduling of tasks. Figure panels: No Cut; Fluctuations decrease after Cut on #iterations per swarm member
  • 11. Fault Tolerance at Message Broker
      • RabbitMQ supports queue replication and persistence to disk across nodes for fault tolerance
      • A cluster of RabbitMQ brokers can be used to achieve high availability and fault tolerance
      • Kafka stores the messages on disk and supports replication of topics across nodes for fault tolerance. Kafka's storage-first approach may increase reliability but can introduce increased latency
      • Multiple Kafka brokers can be used to achieve high availability and fault tolerance
  • 12. Parallel Overheads, SLAM (Simultaneous Localization and Mapping): I/O and Garbage Collection
  • 13. Parallel Overheads, SLAM (Simultaneous Localization and Mapping): Load Imbalance Overhead
  • 14. Multi-Robot Collision Avoidance
      • Streaming workflow: information from robots; runs in parallel
      • Second parallel Storm application
      • Velocity Obstacles (VOs) along with other constraints such as acceleration and max velocity limits, non-holonomic constraints for differential robots, and localization uncertainty
      • NPC, NPS measure parallelism
      • Figures: Control Latency; # Collisions versus number of robots
  • 15. Lessons from using Storm
      • We successfully parallelized Storm as core software of two robot planning applications
      • We needed to replace Kafka by RabbitMQ to improve performance
          – Kafka had large variations in response time
      • We reduced garbage collection overheads
      • We see that we need to generalize Storm's Map-Dataflow Streaming architecture to a Map-Dataflow/Collective Streaming architecture
      • Now we use HPC-ABDS to improve Storm communication performance
  • 16. Bringing Optimal Communications to Storm
      • Both process-based and thread-based parallelism are used
      • Worker and task distribution of Storm: a worker hosts multiple tasks. B-1 is a task of component B and W-1 is a task of W
      • Communication links are between workers; these are multiplexed among the tasks
      • Figure: Node-1 and Node-2, each with two workers, hosting tasks B-1 and W-1 through W-7
  • 17. Memory Mapped File based Communication
      • Inter-process communication using shared memory for a single node
      • Multiple-writer, single-reader design
      • A memory mapped file is created for each worker of a node
      • The file is created under /dev/shm
      • The writer breaks the message into packets and puts them into the file
      • The reader reads the packets and assembles the message
      • When a file becomes full, the writer moves to another file
      • PS: all of this is “well known” BUT not deployed
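The write/read cycle on slide 17 can be sketched with Python's standard `mmap` module. This is a single-process illustration under stated assumptions, not the Storm implementation: the packet size, the file size, and the simple 4-byte length prefix per packet are hypothetical, and the file is created in the default temporary directory (the slide places it under /dev/shm).

```python
import mmap
import struct
import tempfile

PACKET_SIZE = 64   # hypothetical maximum payload bytes per packet
FILE_SIZE = 4096   # hypothetical size of the memory mapped file

def write_message(buf, offset, message):
    """Break the message into packets, store each as [length:4][payload],
    and return the offset just past the last packet written."""
    for i in range(0, len(message), PACKET_SIZE):
        chunk = message[i:i + PACKET_SIZE]
        if offset + 4 + len(chunk) > len(buf):
            # A real writer would move to another file at this point.
            raise BufferError("file full")
        struct.pack_into("<I", buf, offset, len(chunk))
        buf[offset + 4:offset + 4 + len(chunk)] = chunk
        offset += 4 + len(chunk)
    return offset

def read_message(buf, offset, total_length):
    """Read packets sequentially and assemble total_length payload bytes."""
    parts = []
    received = 0
    while received < total_length:
        (size,) = struct.unpack_from("<I", buf, offset)
        parts.append(bytes(buf[offset + 4:offset + 4 + size]))
        offset += 4 + size
        received += size
    return b"".join(parts), offset

def demo():
    """Round-trip a 200-byte message through a memory mapped file."""
    with tempfile.NamedTemporaryFile() as f:
        f.truncate(FILE_SIZE)
        with mmap.mmap(f.fileno(), FILE_SIZE) as buf:
            message = bytes(range(200))
            end = write_message(buf, 0, message)
            received, _ = read_message(buf, 0, len(message))
            return received == message, end
```

With multiple writers, the write offset would have to be reserved atomically before each write; the sketch sidesteps that by using one writer.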
  • 18. Optimized Broadcast Algorithms
      • Binary tree
          – Workers arranged in a binary tree
      • Flat tree
          – Broadcast from the origin to 1 worker in each node sequentially. This worker broadcasts to the other workers in the node sequentially
      • Bidirectional rings
          – Workers arranged in a line
          – Starts two broadcasts from the origin, and these traverse half of the line
      • All well known, and we have used similar ideas of basic HPC-ABDS to improve MPI for machine learning (using Java)
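The three broadcast patterns on slide 18 can be compared by listing who sends to whom. The sketch below is one reading of the slide, not the code donated to Storm: the function names are hypothetical, the binary tree uses array-style child positions, and the bidirectional ring is interpreted as two chains that start at the origin and each cover half of the line.

```python
def binary_tree_sends(members):
    """members[0] is the origin; the member at position i forwards to the
    members at positions 2i+1 and 2i+2. Returns (sender, receiver) pairs."""
    sends = []
    for i, sender in enumerate(members):
        for j in (2 * i + 1, 2 * i + 2):
            if j < len(members):
                sends.append((sender, members[j]))
    return sends

def flat_tree_sends(origin, nodes):
    """nodes is a list of per-node worker lists. The origin sends to the first
    worker of each node; that worker forwards to the rest of its node."""
    sends = []
    for node_workers in nodes:
        if not node_workers:
            continue
        leader = node_workers[0]
        sends.append((origin, leader))
        for worker in node_workers[1:]:
            sends.append((leader, worker))
    return sends

def bidirectional_ring_sends(origin, workers):
    """Workers form a line; one chain walks the first half from the front,
    the other walks the second half from the back."""
    half = (len(workers) + 1) // 2
    forward = workers[:half]
    backward = workers[half:][::-1]
    sends = []
    for chain in (forward, backward):
        previous = origin
        for worker in chain:
            sends.append((previous, worker))
            previous = worker
    return sends
```

In all three patterns each worker receives the message exactly once; what changes is how many messages the origin itself has to send and how long the longest forwarding chain is.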
  • 19. Java MPI performs better than Threads I. 128 24-core Haswell nodes with Java machine learning. Default MPI is much worse than threads; optimized MPI using shared-memory node-based messaging is much better than threads.
  • 20. Java MPI performs better than Threads II. 128 24-core Haswell nodes. Figure: speedup, 200K dataset
  • 21. Speedups show classic parallel computing structure, with 48 nodes and a single core each taken as “sequential”. State-of-the-art dimension reduction routine. Speedups improve as problem size increases. Going from 48 nodes with 1 core to 128 nodes with 24 cores gives a potential speedup of (128 × 24) / (48 × 1) = 64.
  • 22. Experimental Configuration
      • 11-node cluster
      • 1 node: Nimbus & ZooKeeper
      • 1 node: RabbitMQ
      • 1 node: Client
      • 8 nodes: Supervisors with 4 workers each
      • The client sends messages with the current timestamp; the topology returns a response with the same timestamp. Latency = current time - timestamp
      • Figure labels: Client, RabbitMQ, R-1, B-1, W-1, W-5, W-n, G-1, RabbitMQ
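The latency measurement on slide 22 amounts to echoing a timestamp. A minimal sketch, assuming a dictionary message format and with `echo_topology` as a stand-in for the round trip through RabbitMQ and Storm (both are hypothetical, chosen only to make the arithmetic concrete):

```python
import time

def make_message(payload):
    """Client side: stamp the message with the current time."""
    return {"timestamp": time.time(), "payload": payload}

def echo_topology(message):
    """Stand-in for the topology: the response carries the same timestamp."""
    return {"timestamp": message["timestamp"], "payload": message["payload"]}

def latency_seconds(response, now=None):
    """Latency = current time - timestamp, as defined on the slide."""
    if now is None:
        now = time.time()
    return now - response["timestamp"]
```

If the same client both stamps the message and reads the response, as the slide's description suggests, both readings come from one clock and no synchronization between machines is needed.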
  • 23. Original and new Storm Broadcast Algorithms. Figure panels: Original, Binary Tree, Flat Tree, Bidirectional Ring. Speedup of latency with both TCP-based and shared-memory-based communications for different algorithms and sizes
  • 24. Future Work
      • Memory mapped communications require continuous polling by a thread. If this thread does the processing of the message, the polling overhead can be reduced.
      • Scheduling of tasks should take the communications into account
      • The current processing model has multiple threads processing a message at different stages. Reduce the number of threads to achieve predictable performance
      • Improve the packet structure to reduce the overhead
      • Compare with related Java MPI technology
      • Add additional collectives to those supported by Storm
  • 25. Conclusions on initial HPC-ABDS use in Apache Storm • Apache Storm worked well with performance enhancements • Binary tree performed the best • The algorithms reduce network traffic • Shared memory communications reduce the latency further • Memory-mapped file communications improve performance
  • 26. Thank You • References – Our software https://github.com/iotcloud – Apache Storm http://storm.apache.org/ – We will donate software to Storm – SLAM paper http://dsc.soic.indiana.edu/publications/SLAM_In_the_cloud.pdf – Collision Avoidance paper http://goo.gl/xdB8LZ
  • 28. Parallel simultaneous localization and mapping (SLAM) in the cloud • IoTCloud uses ZooKeeper, Storm, HBase, RabbitMQ for robot cloud control • Focus on high-performance (parallel) control functions • Guaranteed real-time response
  • 29. [Figures: Latency with RabbitMQ; Latency with Kafka; different message sizes in bytes] • Note the change in scales for latency and message size
  • 30. Robot Latency Kafka & RabbitMQ • [Figure panels: Kinect with Turtlebot and RabbitMQ; RabbitMQ versus Kafka]
  • 31. Parallel SLAM: Simultaneous Localization and Mapping by Particle Filtering
  • 32. Spare High Performance Storm Slides
  • 33. Memory Mapped Communication • Writers (Writer 01, Writer 02) obtain the write location atomically, increment it, and write their packets (Packet 1, Packet 2, Packet 3) to the shared file • The reader reads packet by packet sequentially • A new file is used when the file size is reached • The reader deletes the files after it reads them fully • Packet structure, fields with size in bytes: ID (16), No of Packets (4), Packet No (4), Dest Task (4), Content Length (4), Source Task (4), Stream Length (4), Stream and Content (no fixed size listed)
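The packet layout and the "reserve a write location, then write" step can be sketched as follows. This is not the authors' code: big-endian signed integers, UTF-8 stream names, the class `SharedBuffer`, and all function names are assumptions. A `bytearray` guarded by a lock stands in for the memory-mapped file and its atomic offset; real cross-process writers would need an atomic operation on the mapped file instead.

```python
import struct
import threading
import uuid

# Fixed header from the slide: ID (16 bytes) + six 4-byte fields = 40 bytes.
# Byte order and signedness are assumptions.
HEADER = struct.Struct(">16s6i")


def pack_packet(msg_id, n_packets, packet_no, dest_task, source_task, stream, content):
    """Serialize one packet: fixed header, then stream name, then content."""
    stream_bytes = stream.encode("utf-8")
    header = HEADER.pack(msg_id, n_packets, packet_no, dest_task,
                         len(content), source_task, len(stream_bytes))
    return header + stream_bytes + content


def unpack_packet(buf, offset=0):
    """Parse one packet at offset; return (fields, offset of the next packet)."""
    (msg_id, n_packets, packet_no, dest_task,
     content_len, source_task, stream_len) = HEADER.unpack_from(buf, offset)
    start = offset + HEADER.size
    stream = bytes(buf[start:start + stream_len]).decode("utf-8")
    content = bytes(buf[start + stream_len:start + stream_len + content_len])
    fields = {"id": msg_id, "n_packets": n_packets, "packet_no": packet_no,
              "dest_task": dest_task, "source_task": source_task,
              "stream": stream, "content": content}
    return fields, start + stream_len + content_len


class SharedBuffer:
    """In-process stand-in for the shared file (hypothetical helper)."""

    def __init__(self, size):
        self.data = bytearray(size)
        self._next = 0
        self._lock = threading.Lock()

    def reserve(self, length):
        """Obtain the write location and increment it in one guarded step."""
        with self._lock:
            start = self._next
            if start + length > len(self.data):
                # A real writer would switch to a new file here.
                raise BufferError("buffer full")
            self._next = start + length
            return start

    def write(self, packet):
        start = self.reserve(len(packet))
        self.data[start:start + len(packet)] = packet

    @property
    def used(self):
        return self._next


def read_all(buf):
    """Reader side: walk the buffer packet by packet, sequentially."""
    packets, offset = [], 0
    while offset < buf.used:
        fields, offset = unpack_packet(buf.data, offset)
        packets.append(fields)
    return packets
```

Because each writer reserves a disjoint byte range before copying, concurrent writers never overlap, and the reader can recover packet boundaries from the two length fields alone. The sketch reads only after the writers have finished; a live reader would also need a way to tell that a reserved range has been fully written.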
  • 34. Default Broadcast • When B-1 wants to broadcast a message to W, it sends 6 messages through 3 TCP communication channels and sends 1 message to W-1 via shared memory • [Figure: Node-1 and Node-2, each with two workers; tasks shown are B-1 and W-1 through W-7]
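The message counts on this slide follow from how tasks are placed on workers. The sketch below counts them for a given placement; the placement shown (B-1 sharing a worker with W-1, the other three workers hosting two W tasks each) is inferred from the figure labels and is an assumption, as are the function and key names.

```python
def default_broadcast_cost(source_worker, placement):
    """Count messages when the source sends one copy per target task.

    placement maps a worker name to the list of target tasks it hosts.
    """
    local = len(placement.get(source_worker, []))
    remote = {worker: tasks for worker, tasks in placement.items()
              if worker != source_worker and tasks}
    return {
        "tcp_messages": sum(len(tasks) for tasks in remote.values()),
        "tcp_channels": len(remote),
        "local_messages": local,
    }


# Assumed placement: B-1 runs in worker-1 alongside W-1.
placement = {
    "worker-1": ["W-1"],
    "worker-2": ["W-2", "W-3"],
    "worker-3": ["W-4", "W-5"],
    "worker-4": ["W-6", "W-7"],
}
print(default_broadcast_cost("worker-1", placement))
```

For this placement the result is 6 TCP messages over 3 channels plus 1 local delivery, matching the numbers on the slide.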
  • 35. Memory Mapped Communication • A topology with a pipeline going through all the workers • No significant difference beyond 30 workers, because at that point we are using all the workers in the cluster • [Figure label: Non Optimized Time]
  • 36. Spare Parallel Tweet Clustering with Storm Slides
  • 37. Parallel Tweet Clustering with Storm • Judy Qiu, Emilio Ferrara and Xiaoming Gao • Storm bolts coordinated by ActiveMQ to synchronize parallel cluster center updates – add loops to Storm • 2 million streaming tweets processed in 40 minutes; 35,000 clusters • [Figure labels: Sequential; Parallel – eventually 10,000 bolts]
  • 38. Parallel Tweet Clustering with Storm • Speedup on up to 96 bolts on two clusters, Moe and Madrid • Red curve is the old algorithm; green and blue are the new algorithm • Full Twitter – 1,000-way parallelism • Full Everything – 10,000-way parallelism