Massive-Scale Analytics Applied to Real-World Problems
David A. Bader
• Full Professor and Chair, Computational Science and Engineering
• IEEE Fellow, AAAS Fellow
• High-performance computing and real-world applications: massive-scale data
analytics.
• Over $183M of research awards
• Steering Committees of major HPC conferences: IPDPS and HiPC
• EIC of IEEE Transactions on Parallel and Distributed Systems
• Elected chair of IEEE and SIAM committees on HPC
• 230+ publications, ≥ 7,500 citations, h-index ≥ 52
• National Science Foundation CAREER Award recipient
• Directed: NVIDIA GPU Center of Excellence
• Directed: Sony-Toshiba-IBM Center for the Cell/B.E. Processor
• Founder: Graph500 List benchmarking “Big Data” platforms
• Recognized as a “RockStar” of High Performance Computing by InsideHPC in 2012
and as HPCwire’s People to Watch in 2012 and 2014.
19 July 2018 David A. Bader 2
Innovate.
Collaborate.
Problem Solved.
• Computational Science and
Engineering (CSE) is a diverse,
interdisciplinary innovation
ecosystem composed of award-
winning faculty, researchers and
students that
• Solves real-world problems and
creates future leaders
• Enables breakthroughs in
scientific discovery and
engineering practice
• Uses the most advanced
resources, techniques and ideas
• Is highly collaborative with an
impressive roster of GT and
industry partners
Profile of CSE: History
• Founded in 2005, officially recognized as a
school in 2010.
• Focus on high performance computing, big
data, analytics & visualization, machine
learning, cybersecurity.
• $6.6 million in research expenditures;
approximately $39 million in active awards
(FY 2017)
• NSF South Big Data Hub partnership: $1.25
million over 3 years to support new analysis
projects in line with CSE mission.
Core Research Areas
• High Performance Computing: Devise computing solutions at the absolute limits of scale and speed using efficient, reliable and fast algorithms, software, tools and applications
• Machine Learning: Construct and study algorithms that build models, and make efficient data-driven predictions or decisions
• Analytics & Visualization: Present data in ways that best yield insight and support decisions as problems scale and complexity increase
• Cybersecurity: Design fast theoretic algorithms on large-scale graphs, and detect malicious activity
• Big Data: Develop new methods to analyze large and complex data sets, transforming data into value and solve grand challenges
Profile of CSE: People
• By the numbers (Fall 2017):
• 47 faculty and staff
• (13 tenure-track faculty)
• 115 PhD students
• 148 masters students
• Award-winning research teams: ACM Gordon
Bell Prize awarded to team composed mainly of
CSE faculty and students.
• Other honors include: 1 Regents’ professor, 6 NSF
CAREER awards , 4 IEEE fellows, 2 AAAS fellows,
and 1 SIAM fellow
Strategic Partnership Program
• Georgia Tech is building Coda, a multi-story, 750,000-square-foot HPC building in the
heart of Atlanta (Midtown) with a targeted opening in January 2019
• Devoted to data science and high-performance computing for centralized collaboration
among industry, academia and government
• Location of CSE, IDEAS and the HPC Center, and the South BD Hub
• Georgia Tech is the anchor tenant, taking approximately one-half of the new
development. Remaining space will be for corporate entities and partners.
• The Institute plans to locate academic and leading-edge research programs in
computing and advanced big data analytics there.
A New Home for the Future of CSE
Exascale Streaming Data Analytics: Real-world challenges
All involve analyzing massive
streaming complex networks:
• Health care → disease spread, detection
and prevention of epidemics/pandemics
(e.g. SARS, Avian flu, H1N1 “swine” flu)
• Massive social networks → understanding
communities, intentions, population
dynamics, pandemic spread, transportation
and evacuation
• Intelligence → business analytics, anomaly
detection, security, knowledge discovery
from massive data sets
• Systems Biology → understanding complex
life systems, drug design, microbial research,
unravel the mysteries of the HIV virus;
understand life, disease
• Electric Power Grid → communication,
transportation, energy, water, food supply
• Modeling and Simulation → Perform full-
scale economic-social-political simulations
[Figure: Facebook Active Users, in million users (y-axis 0 to 450), quarterly from Dec-04 to Dec-09]
Exponential growth:
Billions of active users
Sample queries:
Allegiance switching:
identify entities that switch
communities.
Community structure:
identify the genesis and
dissipation of communities
Phase change: identify
significant change in the
network structure
REQUIRES PREDICTING / INFLUENCE CHANGE IN REAL-TIME AT SCALE
Ex: discovered minimal
changes in O(billions)-size
complex network that could
hide or reveal top influencers
in the community
Graphs are pervasive in large-scale data analysis
• Sources of massive data: peta- and exa-scale simulations, experimental
devices, the Internet, scientific applications.
• New challenges for analysis: data sizes, heterogeneity, uncertainty, data
quality.
Astrophysics
Problem: Outlier detection.
Challenges: massive datasets,
temporal variations.
Graph problems: clustering,
matching.
Bioinformatics
Problem: Identifying drug target
proteins.
Challenges: Data heterogeneity,
quality.
Graph problems: centrality,
clustering.
Social Informatics
Problem: Discover emergent
communities, model spread of
information.
Challenges: new analytics routines,
uncertainty in data.
Graph problems: clustering,
shortest paths, flows.
Image sources: (1) http://physics.nmt.edu/images/astro/hst_starfield.jpg
(2,3) www.visualComplexity.com
Network Analysis for Intelligence and Surveillance
• [Krebs ’04] Post 9/11 Terrorist Network
Analysis from public domain
information
• Plot masterminds correctly identified
from interaction patterns: centrality
• A global view of entities is often more
insightful
• Detect anomalous activities by
exact/approximate graph matching
Image Source: http://www.orgnet.com/hijackers.html
Image Source: T. Coffman, S. Greenblatt, S. Marcus, Graph-based technologies
for intelligence analysis, CACM, 47 (3, March 2004): pp 45-47
Characterizing Graph-theoretic computations
Input: Graph abstraction
Problem: Find ***
• paths
• clusters
• partitions
• matchings
• patterns
• orderings
Factors that influence choice of algorithm
• graph sparsity (m/n ratio)
• static/dynamic nature
• weighted/unweighted, weight distribution
• vertex degree distribution
• directed/undirected
• simple/multi/hyper graph
• problem size
• granularity of computation at nodes/edges
• domain-specific characteristics
Graph algorithms
• traversal
• shortest path algorithms
• flow algorithms
• spanning tree algorithms
• topological sort
• …
Graph problems are often recast as sparse
linear algebra (e.g., partitioning) or linear
programming (e.g., matching) computations
Massive Streaming Graph Analytics
(A, B, t1, poke)
(A, C, t2, msg)
(A, D, t3, view wall)
(A, D, t4, post)
(B, A, t2, poke)
(B, A, t3, view wall)
(B, A, t4, msg)
Analysts
Mining Twitter for Social Good
ICPP 2010
Image credit: bioethicsinstitute.org
• CDC / Nation-scale surveillance of
public health
• Cancer genomics and drug design
• computed Betweenness Centrality of
Human Proteome
[Figure: Human Genome core protein interactions, Degree vs. Betweenness Centrality (log-log plot; degree 1 to 100, betweenness centrality 1e-7 to 1e+0)]
Massive Data Analytics: Protecting our Nation
US High Voltage Transmission
Grid (>150,000 miles of line)
Public Health
ENSG00000145332.2: Kelch-like protein implicated in breast cancer
STING Initiative:
Focusing on Globally Significant Grand Challenges
• Many globally-significant grand challenges can be modeled by Spatio-
Temporal Interaction Networks and Graphs (or “STING”).
• Emerging real-world graph problems include
• detecting community structure in large social networks,
• defending the nation against cyber-based attacks,
• discovering insider threats (e.g. Ft. Hood shooter, WikiLeaks),
• improving the resilience of the electric power grid, and
• detecting and preventing disease in human populations.
• Unlike traditional applications in computational science and
engineering, solving these problems at scale often raises new research
challenges because of sparsity and the lack of locality in the massive
data, design of parallel algorithms for massive, streaming data analytics,
and the need for new exascale supercomputers that are energy-
efficient, resilient, and easy-to-program.
Hierarchical Identify Verify Exploit (HIVE)
• SHARP: Software Toolkit for Accelerating
GrapH AlgoRithms on HIVE Processors
• Georgia Tech with University of Southern
California
• Launched Spring 2017
• Performers include Intel, Qualcomm,
PNNL, Northrop Grumman
• Challenges
• Programmers are required to exploit low-level
hardware and operating systems primitives.
• Limited portability of frameworks for new
architectures and accelerators.
TA2 SHARP
[Figure: SHARP (TA2) block diagram. Block labels: Hierarchical Primitive Decomposition (Abstract Model of HIVE Hardware, Decomposition Analysis, Suitable Primitives, Initial Mapping, Primitive Set Evaluation); Dataflow Modelling (Preprocessing, Primitive Unrolling (PU), Database, Dataflow Model); Optimization (ILP Optimization, Communication Cost Model, Access Cost Model, Data Layout); Task to Hardware Mapping and Optimal Data Layout for Input Code; Runtime Scheduler and Resource Manager; STINGER (Graph Algorithms, Dynamic Primitives, Graph Primitives, Data Store). Inputs: Partially Annotated Code via SHARP APIs; Architecture Info & Hardware Primitives from TA1; Graph Primitives from TA3. Outputs: API Definitions to TA3; Primitive Feedback and Dataset Analysis to TA1.]
• Utilizing commodity hardware designed for a different application domain
• Goals
• Design a unique library for a first-of-its-kind graph processor. Fully utilize new hardware features.
• Portable and scalable framework for massive graphs.
• Configurable data layout that will be decided by: input, algorithm, and targeted hardware.
Power Efficiency Revolution for Embedded
Computing Technologies (PERFECT)
• Challenges
• Performance Per Watt is now a metric of concern.
• Data movement (caches, networks, storage devices) is becoming a
dominant factor in both execution time and in power consumption.
• Power consumption is limiting application and architecture scalability.
• Goals
• One approach to reducing power consumption is to reduce execution time.
• Find additional ways to utilize shared memory systems. Better shared
memory implementations can help reduce network size and limit network
data movement.
• Help the user select: graph data layout, programming model (vertex centric vs.
edge centric), ideal accelerators for an application, load-balancing
techniques and much more.
Evaluating Memory-Centric
Architectures for HPDA
• Jason Riedy (PI), David A. Bader, Tom Conte
• High-performance data analysis does not fit current CPU centric
architectures well.
• Need new approaches to achieve high performance.
• Emu Chick: Move threads to data!
• Application areas: Streaming graphs and sparse tensors.
• Needs new programming paradigms, new algorithm optimizations, new ideas!
• Goals:
• Evaluate the Emu migratory thread system, and
• develop new methods optimized for memory-centric architectures.
IARPA
STINGER – Time Frame
STINGER is officially
proposed. May 2009
First prototype, clustering
coefficients. Apr 2010
Structure tracking of
streaming social
networks. Apr 2011
High Performance Data
Structure for Streaming
Graphs. Sep 2012.
HPEC BEST PAPER AWARD
Dynamic betweenness
centrality algorithm,
Sep 2012
Streaming connected
component, Dec 2013
Performance evaluation
of open-source graph
databases, Feb 2014
Community detection in
dynamic networks Sep
2015
PageRank for Streaming
Graphs. May 2016
Streaming graph need
arises (over a decade
ago)
Streaming graph example
• Dynamic/Streaming:
• At time 𝑡:
• 𝑣 and 𝑤 become friends
• Insert(𝑣, 𝑤)
• At time 𝑡:
• 𝑢 upsets 𝑣. 𝑢 and 𝑣 are no
longer friends
• Delete(𝑢, 𝑣)
• small subgraph… (figure: vertices 𝑢, 𝑣, 𝑤)
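The insert and delete events above can be sketched on a tiny in-memory graph. This is only an illustration: the `DynamicGraph` class and its method names are hypothetical and are not the STINGER API.

```python
from collections import defaultdict


class DynamicGraph:
    """Undirected graph kept as adjacency sets (illustrative, not STINGER)."""

    def __init__(self):
        self.adj = defaultdict(set)

    def insert(self, v, w):
        # Record the edge in both directions.
        self.adj[v].add(w)
        self.adj[w].add(v)

    def delete(self, u, v):
        # discard() is a no-op when the edge is already absent.
        self.adj[u].discard(v)
        self.adj[v].discard(u)

    def has_edge(self, a, b):
        # get() avoids creating empty entries for unknown vertices.
        return b in self.adj.get(a, ())


g = DynamicGraph()
g.insert("u", "v")  # friendship that exists before the stream starts
g.insert("v", "w")  # v and w become friends
g.delete("u", "v")  # u upsets v
```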
Streaming Analytics move us
from reporting the news to predictive analytics
Traditional HPC
• Great for “static” data sets.
• Massive scalability at the
cost of programmability.
• Great for dense problems.
• Sparse problems typically
underutilize the system.
Streaming Analytics
• Requires specialized analytics
and data structures.
• Rapidly changing data.
• Low data re-usage.
• Focused on memory operations
and not FLOPS.
STING Extensible Representation (STINGER)
• Design goals:
• Enable algorithm designers to implement dynamic graph
algorithms with ease.
• Portable semantics for various platforms
• Good performance for all types of graph problems and
algorithms - static and dynamic.
• Assumes globally addressable memory access
• Support multiple, parallel readers and a single writer
• One server manages the graph data structures
• Multiple analytics run in background with read-only
permissions.
STING Extensible Representation
• Semi-dense edge list
blocks with free space
• Compactly stores
timestamps, types,
weights
• Maps from
application IDs to
storage IDs
• Deletion by negating
IDs, separate
compaction
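A minimal sketch of the ideas listed above, under stated assumptions: the class `BlockedAdjacency`, the block size, and the slot layout are all hypothetical, and the real STINGER layout differs in detail. Deletion marks a slot by storing the bitwise complement of the neighbor ID (a negative number, so ID 0 stays distinguishable), and a separate pass compacts the blocks.

```python
BLOCK_SIZE = 4  # assumed block capacity, kept small for illustration


class BlockedAdjacency:
    def __init__(self):
        self.id_map = {}  # application ID -> storage ID
        self.blocks = []  # per storage ID: list of blocks, each a list of slots

    def storage_id(self, app_id):
        # Map an arbitrary application ID to a dense storage ID.
        if app_id not in self.id_map:
            self.id_map[app_id] = len(self.blocks)
            self.blocks.append([[]])
        return self.id_map[app_id]

    def insert(self, src, dst, weight, timestamp):
        s, d = self.storage_id(src), self.storage_id(dst)
        blocks = self.blocks[s]
        if len(blocks[-1]) == BLOCK_SIZE:  # last block is full, start a new one
            blocks.append([])
        blocks[-1].append([d, weight, timestamp])

    def delete(self, src, dst):
        s, d = self.storage_id(src), self.storage_id(dst)
        for block in self.blocks[s]:
            for slot in block:
                if slot[0] == d:
                    slot[0] = ~d  # negative value marks the slot as deleted
                    return True
        return False

    def neighbors(self, src):
        s = self.storage_id(src)
        return [slot[0] for block in self.blocks[s] for slot in block if slot[0] >= 0]

    def compact(self):
        # Separate pass that drops deleted slots and repacks the blocks.
        for s, blocks in enumerate(self.blocks):
            live = [slot for block in blocks for slot in block if slot[0] >= 0]
            packed = [live[i:i + BLOCK_SIZE] for i in range(0, len(live), BLOCK_SIZE)]
            self.blocks[s] = packed or [[]]
```

Keeping deletion cheap (one write) and deferring compaction is the design choice this sketch is meant to show; readers simply skip negative IDs in the meantime.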
STINGER Graph & Analytic Update Process
Accumulate recent graph updates in main memory and
create a batch.
Pre-process, Sort,
Reconcile
“Age off” old vertices
Modify STINGER graph
Update metrics (execute streaming analytics)
STINGER
graph
Insertions /
Deletions
Affected vertices
Change detection
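The batch steps above can be sketched as follows. This is an assumption-laden illustration, not STINGER code: updates are hypothetical `(timestamp, op, u, v)` tuples, reconciliation keeps only the latest action per edge, and the "age off" step is omitted. The returned set of affected vertices is what a streaming analytic would then revisit.

```python
def reconcile(batch):
    """Sort a batch by timestamp and keep only the last action per edge."""
    last_action = {}
    for _ts, op, u, v in sorted(batch):
        edge = (min(u, v), max(u, v))  # undirected: normalise endpoint order
        last_action[edge] = op
    return last_action


def apply_batch(adj, batch):
    """Apply a reconciled batch to adjacency sets; return affected vertices."""
    affected = set()
    for (u, v), op in reconcile(batch).items():
        if op == "insert":
            if v not in adj.setdefault(u, set()):
                adj[u].add(v)
                adj.setdefault(v, set()).add(u)
                affected.update((u, v))
        elif v in adj.get(u, set()):  # delete only if the edge exists
            adj[u].discard(v)
            adj[v].discard(u)
            affected.update((u, v))
    return affected
```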
STING: High-level architecture
◮ Server: Graph storage, kernel orchestration
◮ OpenMP + sufficiently POSIX-ish
◮ Multiple processes for resilience
STINGER: as an analysis package
• Streaming edge insertions and deletions:
Performs new edge insertions, updates, and deletions in batches or individually.
Optimized to update at rates of over 3 million edges per second on graphs of one billion edges.
• Streaming clustering coefficients:
Tracks the local and global clustering coefficients of a graph.
• Streaming connected components:
Real time tracking of the connected components.
• Streaming Betweenness Centrality:
Find the key points within information flows and structural vulnerabilities.
• Streaming community detection:
Track and update the community structures within the graph as they change.
• Anything that a static graph package can do (and a whole lot more):
• Parallel agglomerative clustering:
Find clusters that are optimized for a user-defined edge scoring function.
• K-core Extraction:
Extract additional communities and filter noisy high-degree vertices.
• Classic breadth-first search:
Performs a parallel breadth-first search of the graph starting at a given source vertex to find shortest paths.
• Parallel connected components:
Finds the connected components in a static network.
http://www.stingergraph.com/
Streaming Updates
Update process
• Group updates into batches
• Updates can include insertions
and deletions
• Bigger batches ⇒ better
performance
[HPEC; 2012]
Throughput rate
Experiment setup
• 4x10 Intel E7-8870 processors
• RMAT Graph
• Vertices: 16M
• Edges: 128M
• Various batch sizes
• ~93% of updates are insertions
• ~7% of updates are deletions
Takeaway
• STINGER supports extremely fast updates.
• Updates are not the bottleneck for
analytics.
• Analytic computations are the bottleneck!
• Highly scalable
Streaming Clustering
Coefficients & Triangle Counting
Background
• Scores how tightly bound players are in
their local community.
• Looks for common relationships for two
adjacent vertices.
• Hence the term triangle counting
• Complexity for static graph algorithm
(intersection based): $O(v \cdot d_{\max}^{2})$
[MTAAP; 2010]
Multiple streaming
implementations
• Brute-force – straightforward and exact
• Bloom-filter – approximate yet extremely fast
• Sorted-list – uses intersections. Fast and exact.
Larger batches give faster speedups.
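The intersection idea can be sketched as below. When an edge (u, v) is inserted, the number of triangles it closes equals the number of common neighbors of u and v, so the count can be updated without rescanning the graph. The sketch uses Python set intersection in place of sorted-list intersection, and the function names are illustrative.

```python
def count_triangles(adj):
    """Static count: each triangle is found once per edge, so three times."""
    total = 0
    for u in adj:
        for v in adj[u]:
            if u < v:
                total += len(adj[u] & adj[v])
    return total // 3


def insert_edge(adj, u, v):
    """Insert (u, v) and return the number of new triangles it closes."""
    adj.setdefault(u, set())
    adj.setdefault(v, set())
    if u == v or v in adj[u]:
        return 0  # self-loops and duplicate edges create no triangles
    closed = len(adj[u] & adj[v])  # common neighbours before the insert
    adj[u].add(v)
    adj[v].add(u)
    return closed
```

Summing the return values of `insert_edge` over a stream keeps a running triangle count that matches a full static recount.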
Experiment setup
• Executed on two systems
• Cray XMT – 64 nodes
• 2x4 Intel E5530 system
• 8 cores, 16 threads
• Used RMAT synthetic graphs
• 2M vertices, 16 edges
• Hundreds of thousands of updates per second
Streaming Connected Components
Background
• Tracks connected components in
high velocity networks.
• Connected components imply that
players are connected to each
other through some sequence of
relationships
[HiPC; 2013]
Our algorithm
• Takes into account small-world
property
• Diameter is small.
• Most players have numerous
relationships within the connected
component.
• Edge insertions are always easy.
• Very few edge deletions are complex.
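The asymmetry between insertions and deletions can be sketched as follows. This is a deliberately simple stand-in, not the algorithm from the cited paper: an insertion only merges two component labels, while a deletion has to check (here with a plain breadth-first search) whether the two endpoints are still connected. The `Components` class is hypothetical.

```python
from collections import deque


class Components:
    def __init__(self):
        self.adj = {}
        self.label = {}  # vertex -> component label
        self._next_label = 0

    def _fresh_label(self):
        self._next_label += 1
        return self._next_label

    def _add_vertex(self, v):
        if v not in self.adj:
            self.adj[v] = set()
            self.label[v] = self._fresh_label()

    def _reach(self, source):
        # Breadth-first search over the current edges.
        seen, queue = {source}, deque([source])
        while queue:
            x = queue.popleft()
            for y in self.adj[x] - seen:
                seen.add(y)
                queue.append(y)
        return seen

    def insert(self, u, v):
        self._add_vertex(u)
        self._add_vertex(v)
        if self.label[u] != self.label[v]:
            target = self.label[u]
            for x in self._reach(v):  # v's component before the new edge
                self.label[x] = target
        self.adj[u].add(v)
        self.adj[v].add(u)

    def delete(self, u, v):
        if v not in self.adj.get(u, ()):
            return
        self.adj[u].discard(v)
        self.adj[v].discard(u)
        side = self._reach(v)
        if u not in side:  # the component split in two
            new_label = self._fresh_label()
            for x in side:
                self.label[x] = new_label
```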
Takeaway
• Up to 1.26 million updates per second on
4 × 16 AMD (Opteron 6282)
• Hundreds of times faster than static
computation.
• Great for social networks.
• STINGER updates take only 10% of the execution
time; the rest is spent on the analytic update.
• Scalability similar to BFS.
Dynamic Betweenness Centrality
Background
• Used for finding key players in network
based on the number of relationships
that go through them.
• Fastest known algorithm by Brandes
(2002) is still computationally
expensive for large networks:
$O(V \cdot (V + E))$.
[Social Computing; 2012]
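For reference, the static baseline can be sketched as below: Brandes' algorithm for an unweighted, undirected graph, running one breadth-first search per source vertex followed by a dependency accumulation pass. This is the static computation only, not the dynamic algorithm described on this slide, and the function name is illustrative.

```python
from collections import deque


def betweenness(adj):
    """Brandes' algorithm for an unweighted, undirected graph."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        sigma = {v: 0 for v in adj}  # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        preds = {v: [] for v in adj}  # predecessors on shortest paths
        order = []  # vertices in non-decreasing distance from s
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):  # accumulate dependencies back towards s
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Each unordered pair (s, t) was counted from both endpoints.
    return {v: score / 2 for v, score in bc.items()}
```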
Our dynamic graph algorithm
• Supports optimizations:
• Approximation: reduces accuracy,
significantly faster.
• Parallelization: utilizes many core
systems
• Supports: vertex insertions & deletions
and edge insertions & deletions.
Experimental setup and takeaways
• 4x10 Intel E7-8870 processors
• Thousands of times faster than static
recomputations.
• Significantly reduces the amount of
necessary computation.
• Only a small percentage of the graph is
affected by an update.
Streaming Community
Detection and Monitoring
Background
• Communities are typically defined by
groups of vertices with more intra-
relationships than inter-relationships.
• More formally: $Q(C) = \frac{\mathrm{Intra}(C)}{|E|} - \frac{\mathrm{Inter}(C)^{2}}{|E|^{2}}$
• In addition to the graph, an additional
community network is maintained.
• Significantly smaller than full network!
• Updates applied to community network.
[MTAAP; 2013]
An agglomerative approach
• Certain types of updates do not
change the community structure.
• We only need to process updates that
“might” cause change.
• Few updates require a lot of processing
time.
Experimental setup and takeaways
• 4x8 Intel E7-4820 system
• 32 cores, 64 threads
• Easily supports millions of updates per second.
• Bigger batches offer improved performance.
• Dynamic algorithm is thousands of times faster than the static
graph algorithm.
• Real-time tracking of communities within a network.
Streaming Seed-Set Expansion
Background
• Seeds are vertices of interest.
• Seed Set Expansion is the process of
detecting a community around a seed.
• Streaming SSE – allows tracking vertices of
interest over time
• Important events such as community split
and merging can be reported
[ASONAM;2015]
Algorithm details
• Greedy algorithm.
• Vertices are inserted one at a time into the
community.
• An edge update checks for possible changes in the
community.
• Pruning can be applied when an update causes a
big change in the community.
• Pruning makes things slower
• Pruning offers more accurate results in comparison
with static graph algorithm.
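The greedy step can be sketched on a static graph as below. The fitness function (internal edges divided by internal plus boundary edges) is an assumption chosen for illustration; the slides do not state which score is used, and the streaming and pruning logic is not modeled here.

```python
def fitness(adj, community):
    """Assumed score: internal edges / (internal + boundary edges)."""
    internal = boundary = 0
    for u in community:
        for v in adj[u]:
            if v in community:
                internal += 1  # each internal edge is seen from both ends
            else:
                boundary += 1
    internal //= 2
    total = internal + boundary
    return internal / total if total else 0.0


def expand_seed(adj, seed):
    """Greedily add one neighbouring vertex at a time while the score improves."""
    community = {seed}
    score = fitness(adj, community)
    while True:
        frontier = {v for u in community for v in adj[u]} - community
        best, best_score = None, score
        for v in sorted(frontier):  # sorted for deterministic tie-breaking
            candidate = fitness(adj, community | {v})
            if candidate > best_score:
                best, best_score = v, candidate
        if best is None:
            return community
        community.add(best)
        score = best_score
```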
Takeaways
• Highly accurate in comparison to
static graph algorithm.
• Precision and recall typically
above 90%.
• Larger batches require more
work ⇒ smaller speedups
Incremental Page-Rank
Background
• PageRank measures the importance of a
vertex by the number and weight of the
links pointing to it.
• Works like a propagation algorithm.
• Algorithm continues until no changes
are detected.
[GABB;2016]
Algorithm
• Uses STINGER to perform linear algebra
operations.
• Supports both insertions and deletions.
• Incremental implies that only a small subset
of the graph is traversed.
• Does only the necessary amount of work
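One simple way to avoid restarting from scratch is to warm-start the iteration from the previous PageRank vector after the graph changes. The sketch below shows that idea only; it is a simplified stand-in and not the incremental algorithm from the cited paper, which touches just a subset of the graph. The `out_links` format (every vertex is a key mapping to a list of targets) and all parameter names are assumptions.

```python
def pagerank(out_links, start=None, damping=0.85, tol=1e-10, max_iter=1000):
    """Power iteration; `start` lets the caller reuse an earlier result."""
    nodes = sorted(out_links)
    n = len(nodes)
    rank = dict(start) if start else {v: 1.0 / n for v in nodes}
    iteration = 0
    for iteration in range(1, max_iter + 1):
        # Rank held by vertices without out-links is spread uniformly.
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        new = {v: (1 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            for w in out_links[v]:
                new[w] += damping * rank[v] / len(out_links[v])
        change = sum(abs(new[v] - rank[v]) for v in nodes)
        rank = new
        if change < tol:  # propagation stops once nothing changes
            break
    return rank, iteration
```

After an edge update, calling `pagerank(new_links, start=old_rank)` reaches the same fixed point as a cold start; when the update is small, the starting vector is already close to the answer.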
Takeaway
• Large batches: reduce latency by > 2×
over restarting on average.
• Small batches: potentially hundreds of times
faster than restart.
• Improved power performance (modeled).
• Can deal with several thousand updates per
second.
Community Centric Analysis (in
progress)
Background
• Focuses on finding key players in communities
• Might be overlooked by network wide analytics.
• Computationally less expensive.
• Highly scalable
• Key players detected due to change to their
community upon extraction.
• We modify several widely used analytics for this
new type of computation.
• Starts off with an initial exploration of static
graphs
Modified metrics of interest
• Change in community modularity
• Change to the community diameter
• Change in the number of connected
components in the community
Community Centric Approach
• Given a community $C$ and a metric $M$, for each vertex $u$ in $C$:
• Calculate the initial metric $M_{initial}$ on the community (left
figure) using a static graph algorithm (done once)
• Remove vertex $u$ and its links using STINGER
• Calculate the changed metric $M_{after}$ using a dynamic graph
algorithm on STINGER
• Look at the change to the community: $\Delta M_u = \frac{M_{after}}{M_{initial}}$
• Insert vertex $u$ and its links using STINGER
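The loop above can be sketched with the number of connected components as the metric. Two simplifications are assumed here: the metric is recomputed from scratch after each removal instead of being updated by a dynamic algorithm, and a vertex is "removed" by excluding it from the vertex set, so no explicit re-insertion is needed. Function names are illustrative.

```python
def count_components(adj, vertices):
    """Connected components of the subgraph induced by `vertices`."""
    seen, count = set(), 0
    for source in vertices:
        if source in seen:
            continue
        count += 1
        seen.add(source)
        stack = [source]
        while stack:
            x = stack.pop()
            for y in adj[x]:
                if y in vertices and y not in seen:
                    seen.add(y)
                    stack.append(y)
    return count


def vertex_deltas(adj, community, metric=count_components):
    """Ratio M_after / M_initial for the removal of each vertex in turn."""
    initial = metric(adj, community)  # computed once
    return {u: metric(adj, community - {u}) / initial for u in community}
```

A vertex whose removal changes the metric sharply (a large ratio for component count) is a candidate key player of its community.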
Initial Findings
• A different way to use streaming analytics:
“vertex delta computations”
• Multiple metrics pinpoint same key vertices.
• Computationally efficient
• Over 20× faster than the network-wide approach.
• Highly scalable
• Applicable to other metrics as well
Current Research
Analysis of Centrality on Graphs
• Numerical Centrality
• Theoretically guaranteeing highly ranked vertices
from calculations of Katz Centrality from an
iterative solver
• Nathan, Sanders, Fairbanks, Henson, Bader. “Graph
Ranking Guarantees for Numerical Approximations
to Katz Centrality," International Conference on
Computational Science (ICCS). June 2017.
• Dynamic Centrality
• Develop algorithms to efficiently update Katz
Centrality in dynamic graphs: faster than static
recomputation while maintaining high-quality
results. Algorithms use both a linear-algebraic
formulation and a personalized agglomerative
approach.
• Nathan and Bader. “A Dynamic Algorithm for
Updating Katz Centrality in Graphs," IEEE/ACM
International Conference on Social Networks
Analysis and Mining (ASONAM). July 2017.
• Nathan and Bader. “Approximating Personalized
Katz Centrality in Dynamic Graphs," 12th
International Conference on Parallel Processing
and Applied Mathematics (PPAM). September
2017.
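As background for the work listed above, Katz centrality can be computed with a simple fixed-point iteration. The sketch assumes one common convention, $x = (I - \alpha A)^{-1} \mathbf{1}$, solved by repeating $x \leftarrow \alpha A x + \mathbf{1}$; other conventions drop the constant term. It converges only when $\alpha$ is below the reciprocal of the largest eigenvalue of the adjacency matrix, and the function name is illustrative.

```python
def katz(adj, alpha, tol=1e-12, max_iter=10_000):
    """Fixed-point iteration x <- alpha * A x + 1 on an undirected graph."""
    x = {v: 1.0 for v in adj}
    for _ in range(max_iter):
        new = {v: 1.0 + alpha * sum(x[w] for w in adj[v]) for v in adj}
        if max(abs(new[v] - x[v]) for v in adj) < tol:
            return new
        x = new
    raise RuntimeError("no convergence; alpha may be too large for this graph")
```

Stopping the iteration early yields an approximate score vector, which is why guarantees on the resulting ranking matter.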
Dynamic Communities
• Goals
• Detect local communities in networks given seed vertices of interest
• Allows a relevant subgraph to be extracted for targeted analysis
• Incrementally update and track communities over time in dynamic graphs
• Publications
• Zakrzewska and Bader. “A dynamic algorithm for local community detection in graphs,” ASONAM 2015.
• Zakrzewska and Bader. “Tracking local communities in streaming graphs with a dynamic algorithm,”
SNAM Journal 6(1) 2016.
• Zakrzewska, Nathan, Fairbanks, Bader. “A local measure of community change in dynamic graphs,”
ASONAM 2016.
• Nathan, Zakrzewska, Riedy, Bader. “Local community detection in dynamic graphs using personalized
centrality,” Journal of Algorithms 2017.
Sampling Streaming Graphs
• Challenges
• Many relational datasets are large, with new data constantly generated
• The volume of data may be too large to store or run graph analytics
• Goals
• Sample a stream of relational data (edges) to create a graph representation
• The sampling method should restrict both the number of vertices and edges to
limit the memory needed to store the sampled graph
• In some applications, newer data is more relevant
• Allow for sampling bias towards newer edges when needed or for
temporally uniform sampling
• Zakrzewska and Bader. “Streaming graph sampling with size restrictions,” ASONAM
2017.
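The temporally uniform case can be sketched with classic reservoir sampling (Algorithm R), shown below. This is a standard technique used here as a stand-in, not the method from the cited paper: it bounds only the number of stored edges (and therefore at most twice that many vertices) and does not implement the bias towards newer edges. The function and parameter names are illustrative.

```python
import random


def sample_edges(stream, max_edges, rng=None):
    """Keep a uniform random sample of at most `max_edges` edges from a stream."""
    rng = rng or random.Random(0)  # fixed seed only to make runs repeatable
    sample = []
    for i, edge in enumerate(stream):
        if i < max_edges:
            sample.append(edge)  # fill the reservoir first
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < max_edges:
                sample[j] = edge  # replace a stored edge
    return sample
```

Memory use stays fixed at `max_edges` no matter how long the stream runs, which is the property the goals above ask for.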
STINGER: Where do you get it?
http://www.stingergraph.com/
• Gateway to
• code,
• development,
• documentation,
• presentations...
• Users / contributors /
questioners: Georgia Tech,
PNNL, CMU, Berkeley, Intel, Cray,
NVIDIA, IBM, Federal
Government, Ionic Security, Citi,
Accenture
STINGER Development
Enterprise
• Tech transfer for GTRI
• Enterprise software integrity
• Nightly builds
• Unit testing required
Academic
• Maintained by Georgia Tech
• Ideal for prototyping.
• Sandbox for developing new
concepts
• When software matures…
http://git.cc.gatech.edu/git/u/eriedy3/stinger.git/ https://github.com/stingergraph
STINGER Summary
• Massive-Scale Streaming Analytics require
• Simple programming model
• Simple API.
• CSR-like in concept.
• STINGER has a lot more under the hood.
• Extremely fast updates
• Millions of updates per second.
• These must not be bottlenecks for updating an analytic.
• STINGER offers these
• STINGER has major performance benefits
• Thousands of times faster than static graph computation.
• Hundreds of thousands of updates per second for numerous
analytics.
• Real-time monitoring of underlying network.
Conclusions
• Massive-Scale Streaming Analytics will require new
• High-performance computing platforms
• Streaming algorithms
• Energy-efficient implementations
and are promising to solve real-world challenges!
• Mapping applications to high performance
architectures may yield 6 or more orders of
magnitude performance improvement
Acknowledgments
• Jason Riedy, Research Scientist, (Georgia Tech)
• Oded Green, Research Scientist, (Georgia Tech)
• Current Graduate Students (Georgia Tech):
• Xiaojing An
• James Fox
• Kasimir Gabert
• Euna Kim
• Recent Bader Alumni:
• Dr. Eisha Nathan (Lawrence Livermore National Lab)
• Dr. Vipin Sachdeva (IBM)
• Dr. Anita Zakrzewska (Lawrence Livermore National Lab)
• Dr. Lluis Miquel Munguia (Google)
• Prof. Kamesh Madduri (Penn State)
• Dr. David Ediger (GTRI)
• Dr. James Fairbanks (GTRI)
• Dr. Seunghwa Kang (Pacific Northwest National Lab)
PhD students
Emily Rogers
James Fox
Xiaojing An
Euna Kim
Anita Zakrzewska Eisha Nathan
Chunxing Yin Kasimir Gabert Lluis Miquel Munguia
Buzz
Bader, Related Recent Publications (2005-2009)
• D.A. Bader, G. Cong, and J. Feo, “On the Architectural Requirements for Efficient Execution of Graph Algorithms,” The 34th International Conference on Parallel Processing (ICPP
2005), pp. 547-556, Georg Sverdrups House, University of Oslo, Norway, June 14-17, 2005.
• D.A. Bader and K. Madduri, “Design and Implementation of the HPCS Graph Analysis Benchmark on Symmetric Multiprocessors,” The 12th International Conference on High
Performance Computing (HiPC 2005), D.A. Bader et al., (eds.), Springer-Verlag LNCS 3769, 465-476, Goa, India, December 2005.
• D.A. Bader and K. Madduri, “Designing Multithreaded Algorithms for Breadth-First Search and st-connectivity on the Cray MTA-2,” The 35th International Conference on Parallel
Processing (ICPP 2006), Columbus, OH, August 14-18, 2006.
• D.A. Bader and K. Madduri, “Parallel Algorithms for Evaluating Centrality Indices in Real-world Networks,” The 35th International Conference on Parallel Processing (ICPP 2006),
Columbus, OH, August 14-18, 2006.
• K. Madduri, D.A. Bader, J.W. Berry, and J.R. Crobak, “Parallel Shortest Path Algorithms for Solving Large-Scale Instances,” 9th DIMACS Implementation Challenge -- The Shortest
Path Problem, DIMACS Center, Rutgers University, Piscataway, NJ, November 13-14, 2006.
• K. Madduri, D.A. Bader, J.W. Berry, and J.R. Crobak, “An Experimental Study of A Parallel Shortest Path Algorithm for Solving Large-Scale Graph Instances,” Workshop on Algorithm
Engineering and Experiments (ALENEX), New Orleans, LA, January 6, 2007.
• J.R. Crobak, J.W. Berry, K. Madduri, and D.A. Bader, “Advanced Shortest Path Algorithms on a Massively-Multithreaded Architecture,” First Workshop on Multithreaded
Architectures and Applications (MTAAP), Long Beach, CA, March 30, 2007.
• D.A. Bader and K. Madduri, “High-Performance Combinatorial Techniques for Analyzing Massive Dynamic Interaction Networks,” DIMACS Workshop on Computational Methods
for Dynamic Interaction Networks, DIMACS Center, Rutgers University, Piscataway, NJ, September 24-25, 2007.
• D.A. Bader, S. Kintali, K. Madduri, and M. Mihail, “Approximating Betweenness Centrality,” The 5th Workshop on Algorithms and Models for the Web-Graph (WAW2007), San Diego,
CA, December 11-12, 2007.
• David A. Bader, Kamesh Madduri, Guojing Cong, and John Feo, “Design of Multithreaded Algorithms for Combinatorial Problems,” in S. Rajasekaran and J. Reif, editors, Handbook
of Parallel Computing: Models, Algorithms, and Applications, CRC Press, Chapter 31, 2007.
• Kamesh Madduri, David A. Bader, Jonathan W. Berry, Joseph R. Crobak, and Bruce A. Hendrickson, “Multithreaded Algorithms for Processing Massive Graphs,” in D.A. Bader,
editor, Petascale Computing: Algorithms and Applications, Chapman & Hall / CRC Press, Chapter 12, 2007.
• D.A. Bader and K. Madduri, “SNAP, Small-world Network Analysis and Partitioning: an open-source parallel graph framework for the exploration of large-scale networks,” 22nd
IEEE International Parallel and Distributed Processing Symposium (IPDPS), Miami, FL, April 14-18, 2008.
• S. Kang, D.A. Bader, “An Efficient Transactional Memory Algorithm for Computing Minimum Spanning Forest of Sparse Graphs,” 14th ACM SIGPLAN Symposium on Principles and
Practice of Parallel Programming (PPoPP), Raleigh, NC, February 2009.
• Karl Jiang, David Ediger, and David A. Bader. “Generalizing k-Betweenness Centrality Using Short Paths and a Parallel Multithreaded Implementation.” The 38th International
Conference on Parallel Processing (ICPP), Vienna, Austria, September 2009.
• Kamesh Madduri, David Ediger, Karl Jiang, David A. Bader, Daniel Chavarría-Miranda. “A Faster Parallel Algorithm and Efficient Multithreaded Implementations for Evaluating
Betweenness Centrality on Massive Datasets.” 3rd Workshop on Multithreaded Architectures and Applications (MTAAP), Rome, Italy, May 2009.
• David A. Bader, et al. “STINGER: Spatio-Temporal Interaction Networks and Graphs (STING) Extensible Representation.” 2009.
Bader, Related Recent Publications (2010-2011)
• David Ediger, Karl Jiang, E. Jason Riedy, and David A. Bader. “Massive Streaming Data Analytics: A Case Study with Clustering
Coefficients,” Fourth Workshop in Multithreaded Architectures and Applications (MTAAP), Atlanta, GA, April 2010.
• Seunghwa Kang, David A. Bader. “Large Scale Complex Network Analysis using the Hybrid Combination of a MapReduce cluster and a
Highly Multithreaded System,” Fourth Workshop in Multithreaded Architectures and Applications (MTAAP), Atlanta, GA, April 2010.
• David Ediger, Karl Jiang, Jason Riedy, David A. Bader, Courtney Corley, Rob Farber and William N. Reynolds. “Massive Social Network
Analysis: Mining Twitter for Social Good,” The 39th International Conference on Parallel Processing (ICPP 2010), San Diego, CA,
September 2010.
• Virat Agarwal, Fabrizio Petrini, Davide Pasetto and David A. Bader. “Scalable Graph Exploration on Multicore Processors,” The 22nd IEEE
and ACM Supercomputing Conference (SC10), New Orleans, LA, November 2010.
• Z. Du, Z. Yin, W. Liu, and D.A. Bader, “On Accelerating Iterative Algorithms with CUDA: A Case Study on Conditional Random Fields
Training Algorithm for Biological Sequence Alignment,” IEEE International Conference on Bioinformatics & Biomedicine, Workshop on
Data-Mining of Next Generation Sequencing Data (NGS2010), Hong Kong, December 20, 2010.
• D. Ediger, J. Riedy, H. Meyerhenke, and D.A. Bader, “Tracking Structure of Streaming Social Networks,” 5th Workshop on
Multithreaded Architectures and Applications (MTAAP), Anchorage, AK, May 20, 2011.
• D. Mizell, D.A. Bader, E.L. Goodman, and D.J. Haglin, “Semantic Databases and Supercomputers,” 2011 Semantic Technology
Conference (SemTech), San Francisco, CA, June 5-9, 2011.
• P. Pande and D.A. Bader, “Computing Betweenness Centrality for Small World Networks on a GPU,” The 15th Annual High
Performance Embedded Computing Workshop (HPEC), Lexington, MA, September 21-22, 2011.
• David A. Bader, Christine Heitsch, and Kamesh Madduri, “Large-Scale Network Analysis,” in J. Kepner and J. Gilbert, editors, Graph
Algorithms in the Language of Linear Algebra, SIAM Press, Chapter 12, pages 253-285, 2011.
• Jeremy Kepner, David A. Bader, Robert Bond, Nadya Bliss, Christos Faloutsos, Bruce Hendrickson, John Gilbert, and Eric Robinson,
“Fundamental Questions in the Analysis of Large Graphs,” in J. Kepner and J. Gilbert, editors, Graph Algorithms in the Language of
Linear Algebra, SIAM Press, Chapter 16, pages 353-357, 2011.
Bader, Related Recent Publications (2012)
• E.J. Riedy, H. Meyerhenke, D. Ediger, and D.A. Bader, “Parallel Community Detection for Massive Graphs,” The 9th International Conference on Parallel Processing and Applied Mathematics (PPAM
2011), Torun, Poland, September 11-14, 2011. Lecture Notes in Computer Science, 7203:286-296, 2012.
• E.J. Riedy, D. Ediger, D.A. Bader, and H. Meyerhenke, “Parallel Community Detection for Massive Graphs,” 10th DIMACS Implementation Challenge -- Graph Partitioning and Graph Clustering, Atlanta,
GA, February 13-14, 2012.
• E.J. Riedy, H. Meyerhenke, D.A. Bader, D. Ediger, and T. Mattson, “Analysis of Streaming Social Networks and Graphs on Multicore Architectures,” The 37th IEEE International Conference on
Acoustics, Speech, and Signal Processing (ICASSP), Kyoto, Japan, March 25-30, 2012.
• J. Riedy, H. Meyerhenke, and D.A. Bader, “Scalable Multi-threaded Community Detection in Social Networks,” 6th Workshop on Multithreaded Architectures and Applications (MTAAP), Shanghai,
China, May 25, 2012.
• H. Meyerhenke, E.J. Riedy, and D.A. Bader, “Parallel Community Detection in Streaming Graphs,” Minisymposium on Parallel Analysis of Massive Social Networks, 15th SIAM Conference on Parallel
Processing for Scientific Computing (PP12), Savannah, GA, February 15-17, 2012.
• D. Ediger, E.J. Riedy, H. Meyerhenke, and D.A. Bader, “Analyzing Massive Networks with GraphCT,” Poster Session, 15th SIAM Conference on Parallel Processing for Scientific Computing (PP12),
Savannah, GA, February 15-17, 2012.
• R.C. McColl, D. Ediger, and D.A. Bader, “Many-Core Memory Hierarchies and Parallel Graph Analysis,” Poster Session, 15th SIAM Conference on Parallel Processing for Scientific Computing (PP12),
Savannah, GA, February 15-17, 2012.
• E.J. Riedy, D. Ediger, H. Meyerhenke, and D.A. Bader, “STING: Software for Analysis of Spatio-Temporal Interaction Networks and Graphs,” Poster Session, 15th SIAM Conference on Parallel
Processing for Scientific Computing (PP12), Savannah, GA, February 15-17, 2012.
• Y. Chai, Z. Du, D.A. Bader, and X. Qin, "Efficient Data Migration to Conserve Energy in Streaming Media Storage Systems," IEEE Transactions on Parallel & Distributed Systems, 2012.
• M. S. Swenson, J. Anderson, A. Ash, P. Gaurav, Z. Sükösd, D.A. Bader, S.C. Harvey and C.E. Heitsch, "GTfold: Enabling parallel RNA secondary structure prediction on multi-core desktops," BMC
Research Notes, 5:341, 2012.
• D. Ediger, K. Jiang, E.J. Riedy, and D.A. Bader, "GraphCT: Multithreaded Algorithms for Massive Graph Analysis," IEEE Transactions on Parallel & Distributed Systems, 2012.
• D.A. Bader and K. Madduri, "Computational Challenges in Emerging Combinatorial Scientific Computing Applications," in O. Schenk, editor, Combinatorial Scientific Computing, Chapman & Hall
/ CRC Press, Chapter 17, pages 471-494, 2012.
• O. Green, R. McColl, and D.A. Bader, "GPU Merge Path -- A GPU Merging Algorithm," 26th ACM International Conference on Supercomputing (ICS), San Servolo Island, Venice, Italy, June 25-29, 2012.
• O. Green, R. McColl, and D.A. Bader, "A Fast Algorithm for Streaming Betweenness Centrality," 4th ASE/IEEE International Conference on Social Computing (SocialCom), Amsterdam, The
Netherlands, September 3-5, 2012.
• D. Ediger, R. McColl, J. Riedy, and D.A. Bader, "STINGER: High Performance Data Structure for Streaming Graphs," The IEEE High Performance Extreme Computing Conference (HPEC), Waltham, MA,
September 20-22, 2012. Best Paper Award.
• J. Marandola, S. Louise, L. Cudennec, J.-T. Acquaviva and D.A. Bader, "Enhancing Cache Coherent Architecture with Access Patterns for Embedded Manycore Systems," 14th IEEE International
Symposium on System-on-Chip (SoC), Tampere, Finland, October 11-12, 2012.
• L.M. Munguía, E. Ayguade, and D.A. Bader, "Task-based Parallel Breadth-First Search in Heterogeneous Environments," The 19th Annual IEEE International Conference on High Performance
Computing (HiPC), Pune, India, December 18-21, 2012.
Bader, Related Recent Publications (2013)
• X. Liu, P. Pande, H. Meyerhenke, and D.A. Bader, "PASQUAL: Parallel Techniques for Next Generation Genome Sequence Assembly," IEEE
Transactions on Parallel & Distributed Systems, 24(5):977-986, 2013.
• David A. Bader, Henning Meyerhenke, Peter Sanders, and Dorothea Wagner (eds.), Graph Partitioning and Graph Clustering, American
Mathematical Society, 2013.
• E. Jason Riedy, Henning Meyerhenke, David Ediger and David A. Bader, "Parallel Community Detection for Massive Graphs," in David A. Bader,
Henning Meyerhenke, Peter Sanders, and Dorothea Wagner (eds.), Graph Partitioning and Graph Clustering, American Mathematical Society,
Chapter 14, pages 207-222, 2013.
• S. Kang, D.A. Bader, and R. Vuduc, "Energy-Efficient Scheduling for Best-Effort Interactive Services to Achieve High Response Quality," 27th
IEEE International Parallel and Distributed Processing Symposium (IPDPS), Boston, MA, May 20-24, 2013.
• J. Riedy and D.A. Bader, "Multithreaded Community Monitoring for Massive Streaming Graph Data," 7th Workshop on Multithreaded
Architectures and Applications (MTAAP), Boston, MA, May 24, 2013.
• D. Ediger and D.A. Bader, "Investigating Graph Algorithms in the BSP Model on the Cray XMT," 7th Workshop on Multithreaded Architectures
and Applications (MTAAP), Boston, MA, May 24, 2013.
• O. Green and D.A. Bader, "Faster Betweenness Centrality Based on Data Structure Experimentation," International Conference on
Computational Science (ICCS), Barcelona, Spain, June 5-7, 2013.
• Z. Yin, J. Tang, S. Schaeffer, and D.A. Bader, "Streaming Breakpoint Graph Analytics for Accelerating and Parallelizing the Computation of DCJ
Median of Three Genomes," International Conference on Computational Science (ICCS), Barcelona, Spain, June 5-7, 2013.
• T. Senator, D.A. Bader, et al., "Detecting Insider Threats in a Real Corporate Database of Computer Usage Activities," 19th ACM SIGKDD
Conference on Knowledge Discovery and Data Mining (KDD), Chicago, IL, August 11-14, 2013.
• J. Fairbanks, D. Ediger, R. McColl, D.A. Bader and E. Gilbert, "A Statistical Framework for Streaming Graph Analysis," IEEE/ACM International
Conference on Advances in Social Networks Analysis and Modeling (ASONAM), Niagara Falls, Canada, August 25-28, 2013.
• A. Zakrzewska and D.A. Bader, "Measuring the Sensitivity of Graph Metrics to Missing Data," 10th International Conference on Parallel
Processing and Applied Mathematics (PPAM), Warsaw, Poland, September 8-11, 2013.
• O. Green and D.A. Bader, "A Fast Algorithm for Streaming Betweenness Centrality," 5th ASE/IEEE International Conference on Social
Computing (SocialCom), Washington, DC, September 8-14, 2013.
• R. McColl, O. Green, and D.A. Bader, "A New Parallel Algorithm for Connected Components in Dynamic Graphs," The 20th Annual IEEE
International Conference on High Performance Computing (HiPC), Bangalore, India, December 18-21, 2013.
Bader, Related Recent Publications (2014-2015)
• R. McColl, D. Ediger, J. Poovey, D. Campbell, and D.A. Bader, "A Performance Evaluation of Open Source Graph Databases," The 1st Workshop
on Parallel Programming for Analytics Applications (PPAA 2014) held in conjunction with the 19th ACM SIGPLAN Symposium on Principles and
Practice of Parallel Programming (PPoPP 2014), Orlando, Florida, February 16, 2014.
• O. Green, L.M. Munguia, and D.A. Bader, "Load Balanced Clustering Coefficients," The 1st Workshop on Parallel Programming for Analytics
Applications (PPAA 2014) held in conjunction with the 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP
2014), Orlando, Florida, February 16, 2014.
• A. McLaughlin and D.A. Bader, "Revisiting Edge and Node Parallelism for Dynamic GPU Graph Analytics," 8th Workshop on Multithreaded
Architectures and Applications (MTAAP), held in conjunction with The IEEE International Parallel and Distributed Processing Symposium (IPDPS
2014), Phoenix, AZ, May 23, 2014.
• Z. Yin, J. Tang, S. Schaeffer, D.A. Bader, "A Lin-Kernighan Heuristic for the DCJ Median Problem of Genomes with Unequal Contents," 20th
International Computing and Combinatorics Conference (COCOON), Atlanta, GA, August 4-6, 2014.
• Y. You, D.A. Bader and M.M. Dehnavi, "Designing an Adaptive Cross-Architecture Combination for Graph Traversal," The 43rd International
Conference on Parallel Processing (ICPP 2014), Minneapolis, MN, September 9-12, 2014.
• A. McLaughlin, J. Riedy, and D.A. Bader, "Optimizing Energy Consumption and Parallel Performance for Betweenness Centrality using
GPUs," The 18th Annual IEEE High Performance Extreme Computing Conference (HPEC), Waltham, MA, September 9-11, 2014.
• A. McLaughlin and D.A. Bader, "Scalable and High Performance Betweenness Centrality on the GPU," The 26th IEEE and ACM Supercomputing
Conference (SC14), New Orleans, LA, November 16-21, 2014. Best Student Paper Finalist.
• D. Dauwe, E. Jonardi, R. Friese, S. Pasricha, A.A. Maciejewski, D.A. Bader, and H.J. Siegel, “A Methodology for Co-Location Aware Application
Performance Modeling in Multicore Computing,” 17th Workshop on Advances on Parallel and Distributed Processing Symposium (APDCM),
Hyderabad, India, May 25, 2015.
• A. Zakrzewska and D.A. Bader, “Fast Incremental Community Detection on Dynamic Graphs,” 11th International Conference on Parallel
Processing and Applied Mathematics (PPAM), Krakow, Poland, September 6-9, 2015.
• A. McLaughlin, J. Riedy, and D.A. Bader, “An Energy-Efficient Abstraction for Simultaneous Breadth-First Searches,” The 19th Annual IEEE High
Performance Extreme Computing Conference (HPEC), Waltham, MA, September 15-17, 2015.
• A. McLaughlin, D. Merrill, M. Garland and D.A. Bader, “Parallel Methods for Verifying the Consistency of Weakly-Ordered Architectures,” The
24th International Conference on Parallel Architectures and Compilation Techniques (PACT), San Francisco, CA, October 18-21, 2015.
• A. McLaughlin and D.A. Bader, “Fast Execution of Simultaneous Breadth-First Searches on Sparse Graphs,” The 21st IEEE International
Conference on Parallel and Distributed Systems (ICPADS), Melbourne, Australia, December 14-17, 2015.
Bader, Related Recent Publications (2016-2017)
• David Bader, Aleksandra Michalewicz, Oded Green, Jessie Birkett-Rees, Jason Riedy, James Fairbanks, and Anita Zakrzewska, “Semantic database applications at the
Samtavro Cemetery, Georgia,” The 44th Computer Applications and Quantitative Methods in Archaeology Conference (CAA), Oslo, Norway, March 29 – April 2, 2016.
• Vipin Sachdeva, Srinivas Aluru, David A. Bader, “A Memory and Time Scalable Parallelization of the Reptile Error-Correction Code,” 15th IEEE International Workshop on
High Performance Computational Biology (HiCOMB), Chicago, IL, May 23, 2016.
• James Fairbanks, Anita Zakrzewska, and David A. Bader, “New Stopping Criteria For Spectral Partitioning,” IEEE/ACM International Conference on Advances in Social
Networks Analysis and Modeling (ASONAM), San Francisco, CA, August 18-21, 2016.
• Anita Zakrzewska, Eisha Nathan, James Fairbanks, and David A. Bader, “A Local Measure of Community Change in Dynamic Graphs,” IEEE/ACM International Conference on
Advances in Social Networks Analysis and Modeling (ASONAM), San Francisco, CA, August 18-21, 2016.
• Anita Zakrzewska and David A. Bader, “Aging Data in Dynamic Graphs: A Comparative Study,” 2nd International Workshop on Dynamics in Networks (DyNo), held in
conjunction with IEEE/ACM International Conference on Advances in Social Networks Analysis and Modeling (ASONAM), San Francisco, CA, August 18, 2016.
• O. Green and D.A. Bader, “cuSTINGER: Supporting Dynamic Graph Algorithms for GPUs,” The 20th Annual IEEE High Performance Extreme Computing Conference (HPEC),
Waltham, MA, September 13-15, 2016.
• Jeremy Kepner, Peter Aaltonen, David A. Bader, Aydin Buluc, Franz Franchetti, John Gilbert, Dylan Hutchison, Manoj Kumar, Andrew Lumsdaine, Henning Meyerhenke, Scott
McMillan, Jose Moreira, John D. Owens, Carl Yang, Marcin Zalewski, and Timothy Mattson, “Mathematical Foundations of the GraphBLAS,” The 20th Annual IEEE High
Performance Extreme Computing Conference (HPEC), Waltham, MA, September 13-15, 2016.
• X. Hui, Z. Du, J. Liu, H. Sun, Y. He and D.A. Bader, “When Good Enough Is Better: Energy-Aware Scheduling for Multicore Servers,” 13th Workshop on High-Performance,
Power-Aware Computing (HPPAC), Orlando, FL, May 29, 2017.
• E. Nathan, G. Sanders, J. Fairbanks, V. Henson and D.A. Bader, “Graph Ranking Guarantees for Numerical Approximations to Katz Centrality,” International Conference on
Computational Science (ICCS), Zurich, Switzerland, June 12-14, 2017.
• Anita Zakrzewska and David A. Bader, “Streaming Graph Sampling with Size Restrictions,” IEEE/ACM International Conference on Advances in Social Networks Analysis and
Modeling (ASONAM), Sydney, Australia, July 31 - August 3, 2017.
• Eisha Nathan and David A. Bader, “A Dynamic Algorithm for Updating Katz Centrality in Graphs,” IEEE/ACM International Conference on Advances in Social Networks
Analysis and Modeling (ASONAM), Sydney, Australia, July 31 - August 3, 2017.
• E. Nathan and D.A. Bader, “Approximating Personalized Katz Centrality in Dynamic Graphs,” 12th International Conference on Parallel Processing and Applied
Mathematics (PPAM), Lublin, Poland, September 10-13, 2017.
• O. Green, J. Fox, E. Kim, F. Busato, N. Bombieri, K. Lakhotia, S. Zhou, S. Singapura, H. Zeng, R. Kannan, V. Prasanna, D. Bader, “Quickly Finding a Truss in a Haystack”, IEEE High
Performance Extreme Computing Conference (HPEC), Waltham, Massachusetts, 2017 (HPEC Graph Challenge Innovation Award)
• S. Zhou, K. Lakhotia, S. Singapura, H. Zeng, R. Kannan, V. Prasanna, J. Fox, E. Kim, O. Green, D. Bader, “Design and Implementation of Parallel PageRank on Multicore
Platforms”, IEEE High Performance Extreme Computing Conference (HPEC), Waltham, Massachusetts, 2017 (HPEC Graph Challenge Student Innovation Award)
• D. Makkar, D. Bader, O. Green, “Deterministic and Parallel Triangle Counting in Streaming Graphs”, IEEE International Conference on High Performance Computing, Data,
and Analytics, Jaipur, India, 2017
Opportunities
• Application-oriented Opportunities:
• High performance computing for massive graphs
• Streaming analytics
• Information Visualization techniques for massive graphs
• Heterogeneous systems: Methodologies for combining the use of the Cloud and Manycore for high-performance computing
• Energy-efficient high-performance computing
Opportunity 1: High performance computing for massive graphs
• Traditional HPC has focused primarily on solving large problems from
chemistry, physics, and mechanics, using dense linear algebra.
• HPC faces new challenges to deal with:
• time-varying interactions among entities, and
• massive-scale graph abstractions where the vertices represent the nouns or
entities and the edges represent their observed interactions.
• Few parallel computers run well on these problems because
• the problems often lack the locality required to get high performance from distributed-memory, cache-based supercomputers.
• Case study: Massively threaded architectures are shown to run several orders
of magnitude faster than the fastest supercomputers on these types of
problems!
→ A focused research agenda is needed to design algorithms that scale on these new platforms.
Opportunity 2: Streaming analytics
• While our high performance computers have delivered a sustained petaflop, they have done so using the same antiquated batch processing style where a program and a static data set are scheduled to compute in the next available slot.
• Today, data is overwhelming in volume and rate, and we struggle to keep up with these streams.
Fundamental computer science research is needed in:
• the design of streaming architectures, and
• data structures and algorithms that can compute important analytics while sitting in the middle of these torrential flows.
Opportunity 3: Information Visualization techniques for
massive graphs
• Information Visualization today
• addresses traditional scientific computing (fluid flow, molecular dynamics), or
• when handling discrete data, scales to only hundreds of vertices at best.
→ However, there is a strong need for visualization in the data sciences so that analysts can gain understanding from data sets with millions to billions of interacting non-planar discrete entities.
• Applications include: data mining, intelligence, situational awareness
[Figures: NNDB Mapper of George Washington; Twitter social network drawn using Large Graph Layout (source: Akshay Java, ebiquity group)]
Opportunity 4: Heterogeneous Systems, Cloud, Internet of Things:
Methodologies for combining the use of the Cloud, IoT, and
accelerators
• Today, there is a dichotomy between using clouds (e.g. Hadoop, map-reduce) for massive data storage, filtering, summarization, and massively parallel/multithreaded systems for data-intensive computation.
→ We must develop methodologies for employing these complementary systems for solving grand challenges in data analysis.
[Photo: Steve Mills, SVP of IBM Software (left), and Dr. John Kelly, SVP of IBM Research, view Stream Computing technology]
Opportunity 5: Energy-efficient high-performance computing
• The main constraint for our ability to compute has changed
• from availability of compute resources
• to the ability to power and cool our systems within budget.
→ Holistic research is needed that can permeate from the architecture and systems up to the applications AND DATA CENTERS, whereby energy use is a first-class object that can be optimized at all levels.
[Photo: Microsoft’s Chicago Million Server DataCenter]
Acknowledgment of Support
Backup Slides
HPEC Graph Challenge “Innovation Award”
• Static Graph Challenge: Subgraph Isomorphism
• Finding the maximal K-Truss subgraph
• A K-Truss is a relaxation of a clique that stays well connected: every edge must belong to at least K-2 triangles within the subgraph
• In collaboration with USC and the University of Verona
• O. Green, J. Fox, E. Kim, F. Busato, N. Bombieri, K. Lakhotia, S. Zhou, S.
Singapura, H. Zeng, R. Kannan, V. Prasanna, D. Bader, “Quickly Finding a
Truss in a Haystack”, IEEE High Performance Extreme Computing
Conference (HPEC), Waltham, Massachusetts, 2017 (HPEC Graph
Challenge Innovation Award)
• Trusses are found by deleting edges from the graph.
• Uses dynamic graph techniques for solving a static graph problem.
• Greatly reduces the amount of work needed for finding the Truss.
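The edge-deletion idea above can be illustrated with the textbook peeling procedure for a k-truss. This is a minimal sequential sketch, not the award-winning implementation: the function name `k_truss`, the adjacency-set representation, and the assumption that vertices are comparable values such as integers are all illustrative choices.

```python
from collections import defaultdict


def k_truss(edges, k):
    """Return the edges of the maximal k-truss of an undirected graph.

    In a k-truss every remaining edge lies in at least k - 2 triangles
    of the remaining subgraph.
    """
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            adj[u].add(v)
            adj[v].add(u)

    # Support of an edge = number of triangles that contain it.
    support = {}
    for u in adj:
        for v in adj[u]:
            if u < v:
                support[(u, v)] = len(adj[u] & adj[v])

    # Peel: repeatedly delete edges whose support is below k - 2.
    queue = [e for e, s in support.items() if s < k - 2]
    while queue:
        u, v = queue.pop()
        if (u, v) not in support:
            continue  # this edge was already deleted
        del support[(u, v)]
        common = adj[u] & adj[v]  # third vertices of triangles on (u, v)
        adj[u].discard(v)
        adj[v].discard(u)
        # Each destroyed triangle lowers the support of its two other edges.
        for w in common:
            for a, b in ((u, w), (v, w)):
                e = (a, b) if a < b else (b, a)
                support[e] -= 1
                if support[e] < k - 2:
                    queue.append(e)
    return set(support)
```

Each deletion only touches the triangles that contained the deleted edge, which is the sense in which a static problem is handled with dynamic-graph style updates instead of recounting all triangles after every change.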
HPEC Graph Challenge
“Student Innovation Award”
• In collaboration with USC
• S. Zhou, K. Lakhotia, S. Singapura, H. Zeng, R. Kannan, V. Prasanna, J. Fox, E. Kim,
O. Green, D. Bader, “Design and Implementation of Parallel PageRank
on Multicore Platforms”, IEEE High Performance Extreme Computing
Conference (HPEC), Waltham, Massachusetts, 2017 (HPEC Graph Challenge
Student Innovation Award)
• Uses an edge centric approach: edge list is partitioned into shards.
• Shards are sorted by destination.
• Greatly reduces the number of cache misses and data movement.
• 2.5X faster than the multi-threaded version of the PageRank Pipeline benchmark.
• Applicable to accelerators such as FPGAs
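The edge-centric layout described above can be sketched in a few lines. This is a sequential illustration under stated assumptions, not the paper's implementation: the function name `pagerank_edge_centric`, partitioning by destination range, the damping factor 0.85, the fixed iteration count, and the uniform redistribution of rank from vertices without out-edges are all choices made for this example.

```python
def pagerank_edge_centric(num_vertices, edges, num_shards=4,
                          damping=0.85, iterations=50):
    """PageRank that streams over an edge list split into shards.

    `edges` is a list of (src, dst) pairs with vertex ids in
    range(num_vertices); num_vertices must be at least 1.
    """
    out_degree = [0] * num_vertices
    for src, _ in edges:
        out_degree[src] += 1

    # Assumption: shard i holds the edges whose destination falls in the
    # i-th contiguous range of vertex ids.
    shard_size = (num_vertices + num_shards - 1) // num_shards
    shards = [[] for _ in range(num_shards)]
    for src, dst in edges:
        shards[dst // shard_size].append((src, dst))
    # Sorting by destination makes the writes to next_rank sequential
    # within a shard.
    for shard in shards:
        shard.sort(key=lambda edge: edge[1])

    rank = [1.0 / num_vertices] * num_vertices
    for _ in range(iterations):
        # Rank held by vertices with no out-edges is spread uniformly.
        dangling = sum(rank[v] for v in range(num_vertices)
                       if out_degree[v] == 0)
        base = (1.0 - damping + damping * dangling) / num_vertices
        next_rank = [base] * num_vertices
        for shard in shards:  # in a parallel version, one worker per shard
            for src, dst in shard:
                next_rank[dst] += damping * rank[src] / out_degree[src]
        rank = next_rank
    return rank
```

Because every shard writes to a disjoint range of `next_rank`, shards could be processed by separate threads without locking; that independence is the property the sketch is meant to show.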
Graph500 Benchmark, www.graph500.org
Defining a new set of benchmarks to guide the design of hardware architectures and software systems intended to support such applications and to help procurements. Graph algorithms are a core part of many analytics workloads.
Executive Committee: D.A. Bader, R. Murphy, M. Snir, A. Lumsdaine
• Five Business Area Data Sets:
  • Cybersecurity
    • 15 Billion Log Entries/Day (for large enterprises)
    • Full Data Scan with End-to-End Join Required
  • Medical Informatics
    • 50M patient records, 20-200 records/patient, billions of individuals
    • Entity Resolution Important
  • Social Networks
    • Example, Facebook, Twitter
    • Nearly Unbounded Dataset Size
  • Data Enrichment
    • Easily PB of data
    • Example: Maritime Domain Awareness
      • Hundreds of Millions of Transponders
      • Tens of Thousands of Cargo Ships
      • Tens of Millions of Pieces of Bulk Cargo
      • May involve additional data (images, etc.)
  • Symbolic Networks
    • Example, the Human Brain
    • 25B Neurons
    • 7,000+ Connections/Neuron
Heterogeneity in “Big Data” systems:
High Performance Data Analytics
• Analytic platforms will combine:
• Cloud (Hadoop/map-reduce)
• Stream processing
• Large shared-memory systems
• Massive multithreaded architectures
• Multicore and accelerators
The challenge: developing methodologies for employing these complementary systems in an enterprise-class analytics framework for solving grand challenges in massive data analysis for discovery, real-time analytics, and forensics.
[Photo: Steve Mills, SVP of IBM Software (left), and Dr. John Kelly, SVP of IBM Research, view Stream Computing technology]
Future Architectures
• Highly multithreaded
• High bandwidth (network and memory)
• Complex but flexible memory hierarchy
• Heterogeneous design in core capability and ISA
National Strategic Computing Initiative
• (29 July 2015, The White House) The National Strategic Computing Initiative (NSCI) is an effort to create a cohesive, multi-agency strategic vision and Federal investment strategy in high-performance computing (HPC).
• This strategy will be executed in collaboration with industry and academia, maximizing the benefits of HPC for the United States.
• HPC systems, through a combination of processing capability and storage capacity, can solve computational problems that are beyond the capability of small- to medium-scale systems. They are vital to the Nation’s interests in science, medicine, engineering, technology, and industry.
• The NSCI will spur the creation and deployment of computing technology at the leading edge, helping to advance Administration priorities for economic competitiveness, scientific discovery, and national security.
• The National Strategic Computing Initiative has five strategic themes.
1. Create systems that can apply exaflops of computing power to exabytes of data.
2. Keep the United States at the forefront of HPC capabilities.
3. Improve HPC application developer productivity.
4. Make HPC readily available.
5. Establish hardware technology for future HPC systems.
NSCI Anniversary Meeting, 29 July 2016
Dynamic Graphs and Streaming Support
[Figure: graph frameworks placed by dynamic support (static to dynamic) against update/streaming rate. Frameworks shown: Galois, TitanDB, GraphX, GraphLab, Giraph, Gunrock, Boost, GraphCHI, STINGER, Ligra, Hornet.]
Performance and Data Scalability
[Figure: graph frameworks placed by data size (small to large) against performance scalability. Frameworks shown: Galois, TitanDB, GraphX, GraphLab, Giraph, Gunrock, Boost, GraphCHI, Ligra, STINGER, GPU-Hornet, CPU-Hornet.]
Overview of STINGER and Hornet Capabilities
Algorithm                  Static  Dynamic  Implementation  Formulation
Breadth-first search       ✓       ✓        CPU+GPU         VC
Triangle Counting          ✓       ✓        CPU+GPU         VC
Connected components       ✓       ✓        CPU+GPU         VC
Betweenness Centrality     ✓       ✓        CPU+GPU         VC
PageRank                   ✓       ✓        CPU+GPU         VC, BLAS
Katz Centrality            ✓       ✓        CPU+GPU         VC, BLAS
Community Detection        ✓       ✓        CPU             VC
Seed Set Expansion         ✓       ✓        CPU             VC
K-Truss Decomposition      ✓ (one mark only) GPU            VC
Maximal Independent Set    ✓ (one mark only) GPU            VC
Legend: VC - Vertex Centric, BLAS
Graph Analytics (Dynamic vs. Streaming)
• Many libraries support updates to the graph
• They do not have dynamic graph analytics
[Figure: graph frameworks placed by streaming rate (slow/very low to fast) against graph and algorithm properties (static to dynamic). Frameworks shown: STINGER, GraphLab, Giraph, GraphX, Galois, TitanDB, Boost, Gunrock.]
Scalability (Volume vs. Performance)
• Many libraries support updates to the graph
• They do not have dynamic graph analytics
[Figure: graph frameworks placed by performance scalability (low to fast) against data scalability (small to large). Frameworks shown: STINGER, GraphLab, Giraph, GraphX, Galois, TitanDB, GraphCHI.]
PageRank - Performance Analysis
• Lower is better.
• 2 orders of magnitude difference in performance.
• Still outperforms other static-only graph packages.
• Outperforms the distributed systems even for large networks with plenty of computational demand!
• Some platforms did not complete in a reasonable amount of time.
R-MAT Graph
• Vertices: 16M
• Edges: 128M
Other algorithms
R-MAT Graph
• Vertices: 1M
• Edges: 8M
• STINGER is orders of magnitude faster.
• Still outperforms other static-only graph packages.
[Charts: Static Single-Source Shortest Path; Static Connected Components]
High Performance Data Analytics (HPDA)
• With Pacific Northwest National Lab
• Project objectives –
• Develop novel tools and algorithms for dealing with massive dynamic networks.
• Enable analysts to search through vast data at near real-time speeds.
• Improve accuracy of past approaches through the use of community-centric analysis.
• Successes
• Developed the first scalable dynamic graph data structure for the GPU (that also works for CPUs). The data structure supports over 90 million updates per second.
• Designed novel dynamic algorithms for Katz Centrality and triangle counting
• Developed personalized centrality metrics
• Sketched out an asynchronous model, the first of its kind, for analyzing the correctness of
dynamic graph algorithms when the underlying graph is changing.
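To make the notion of a dynamic graph algorithm concrete, here is the textbook incremental update for a global triangle count. It is a simple sketch for illustration, not the novel algorithm the slide refers to; the class name `DynamicTriangleCounter` and its methods are hypothetical.

```python
from collections import defaultdict


class DynamicTriangleCounter:
    """Keeps the number of triangles current while edges come and go."""

    def __init__(self):
        self.adj = defaultdict(set)
        self.triangles = 0

    def insert_edge(self, u, v):
        if u == v or v in self.adj[u]:
            return  # self-loop or duplicate edge: nothing changes
        # Every common neighbor of u and v closes one new triangle.
        self.triangles += len(self.adj[u] & self.adj[v])
        self.adj[u].add(v)
        self.adj[v].add(u)

    def delete_edge(self, u, v):
        if v not in self.adj[u]:
            return  # edge not present
        self.adj[u].discard(v)
        self.adj[v].discard(u)
        # Every common neighbor of u and v loses one triangle.
        self.triangles -= len(self.adj[u] & self.adj[v])
```

Each update costs one set intersection over two neighborhoods, so the count follows the stream of changes without recounting the whole graph.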
Leveraging High Performance Computing
for Mixed-Integer Programming
In a joint collaboration with the ExxonMobil Upstream Research Company, we
focus on developing effective Mixed Integer Programming (MIP) methods for
difficult planning and scheduling problems arising in the petrochemical industry.
• Challenges
• MIPs are NP-Hard problems.
• Available parallel algorithms show poor scalability.
• Goals:
• To study and develop new scalable parallel algorithms for MIP solving.
• Our focus is on large scale industrial optimization problems.
• To offer MIP practitioners a portfolio of parallel algorithms, which emphasize finding high quality solutions quickly as well as proving optimality.
NSF XSCALA
• A Community Repository for Model-driven Design and Tuning of Data-Intensive Applications for Extreme-scale Accelerator-based Systems
• Collaboration with Georgia Tech and University of Southern California
• Launched 2012
• Challenges
• Sparse, data-intensive computations with irregular memory access patterns.
• Extremely hard for compilers to parallelize → requires hand tuning for new architectures and platforms.
• Goals
• Design tools for automatic fine tuning and optimizations
• Develop runtime scheduling techniques for load-balancing and for hardware selection
• Offer programmers an intuitive modeling environment for design-time and run-time optimizations.
Parallel Alternating Criteria Search
• A parallel Large Neighborhood Search (LNS) heuristic aimed at obtaining high quality feasible solutions for general Mixed Integer Programs (MIPs).
• Solution improvements are found by solving in parallel a large number of restricted subproblems, which are derived from the original problem.
• It is the first parallel general purpose heuristic developed for MIPs.
• Significantly more scalable and effective at finding high quality solutions than current
commercial MIP solvers, especially for large-scale MIP instances.
• The framework provides an excellent platform for the rapid prototyping of highly
effective problem-specific heuristics, which exploit problem structure.
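The general idea of a large neighborhood search over restricted subproblems can be shown on a toy 0/1 knapsack instance. This is a generic illustration only, not the Parallel Alternating Criteria Search method itself: the names `solve_restricted` and `parallel_lns`, the random choice of neighborhoods, and the brute-force subproblem solver are assumptions made to keep the example self-contained. Values are assumed non-negative and the capacity is assumed to be at least zero.

```python
import itertools
import random
from concurrent.futures import ThreadPoolExecutor


def solve_restricted(values, weights, capacity, incumbent, free):
    """Solve the subproblem in which only the variables listed in `free`
    may change; all other variables stay fixed at their incumbent values."""
    free_set = set(free)
    fixed_w = sum(weights[i] for i, x in enumerate(incumbent)
                  if x and i not in free_set)
    fixed_v = sum(values[i] for i, x in enumerate(incumbent)
                  if x and i not in free_set)
    best_value, best_bits = None, None
    for bits in itertools.product((0, 1), repeat=len(free)):
        w = fixed_w + sum(weights[i] for i, b in zip(free, bits) if b)
        if w > capacity:
            continue  # infeasible assignment
        v = fixed_v + sum(values[i] for i, b in zip(free, bits) if b)
        if best_value is None or v > best_value:
            best_value, best_bits = v, bits
    # The incumbent itself is feasible, so best_bits is never None here.
    candidate = list(incumbent)
    for i, b in zip(free, best_bits):
        candidate[i] = b
    return best_value, candidate


def parallel_lns(values, weights, capacity, rounds=20, workers=4,
                 neighborhood=4, seed=0):
    rng = random.Random(seed)
    n = len(values)
    incumbent = [0] * n  # the empty knapsack is a feasible starting point
    best_value = 0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(rounds):
            # One random neighborhood (set of free variables) per worker.
            neighborhoods = [rng.sample(range(n), min(neighborhood, n))
                             for _ in range(workers)]
            results = list(pool.map(
                lambda free: solve_restricted(values, weights, capacity,
                                              incumbent, free),
                neighborhoods))
            value, candidate = max(results, key=lambda r: r[0])
            if value > best_value:
                best_value, incumbent = value, candidate
    return best_value, incumbent
```

Threads are used here only to show the structure; in CPython they will not speed up this pure-Python enumeration, and a practical version would hand each restricted subproblem to a real MIP solver running in its own process or node.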

Mais conteúdo relacionado

Mais procurados

Crowdsourcing Approaches to Big Data Curation - Rio Big Data Meetup
Crowdsourcing Approaches to Big Data Curation - Rio Big Data MeetupCrowdsourcing Approaches to Big Data Curation - Rio Big Data Meetup
Crowdsourcing Approaches to Big Data Curation - Rio Big Data MeetupEdward Curry
 
The Commons: Leveraging the Power of the Cloud for Big Data
Massive-Scale Analytics Applied to Real-World Problems

  • 1. Massive-Scale Analytics Applied to Real-World Problems
  • 2. David A. Bader • Full Professor and Chair, Computational Science and Engineering • IEEE Fellow, AAAS Fellow • High-performance computing and real-world applications: massive-scale data analytics. • Over $183M of research awards • Steering Committees of major HPC conferences: IPDPS and HiPC • EIC of IEEE Transactions on Parallel and Distributed Systems • Elected chair of IEEE and SIAM committees on HPC • 230+ publications, ≥ 7,500 citations, h-index ≥ 52 • National Science Foundation CAREER Award recipient • Directed: NVIDIA GPU Center of Excellence • Directed: Sony-Toshiba-IBM Center for the Cell/B.E. Processor • Founder: Graph500 List benchmarking “Big Data” platforms • Recognized as a “RockStar” of High Performance Computing by InsideHPC in 2012 and as HPCwire’s People to Watch in 2012 and 2014. 19 July 2018 David A. Bader 2
  • 3. Innovate. Collaborate. Problem Solved. • Computational Science and Engineering (CSE) is a diverse, interdisciplinary innovation ecosystem composed of award-winning faculty, researchers and students that • Solves real-world problems and creates future leaders • Enables breakthroughs in scientific discovery and engineering practice • Uses the most advanced resources, techniques and ideas • Is highly collaborative with an impressive roster of GT and industry partners
  • 4. Profile of CSE: History • Founded in 2005, officially recognized as a school in 2010. • Focus on high performance computing, big data, analytics & visualization, machine learning, cybersecurity. • $6.6 million in research expenditures; approximately $39 million in active awards (FY 2017) • NSF South Big Data Hub partnership: $1.25 million over 3 years to support new analysis projects in line with CSE mission. 19 July 2018 David A. Bader 4
  • 5. Core Research Areas • High Performance Computing: Devise computing solutions at the absolute limits of scale and speed using efficient, reliable and fast algorithms, software, tools and applications • Machine Learning: Construct and study algorithms that build models, and make efficient data-driven predictions or decisions • Analytics & Visualization: Present data in ways that best yield insight and support decisions as problems scale and complexity increase • Cybersecurity: Design fast theoretic algorithms on large-scale graphs, and detect malicious activity • Big Data: Develop new methods to analyze large and complex data sets, transforming data into value and solving grand challenges
  • 6. Profile of CSE: People • By the numbers (Fall 2017): • 47 faculty and staff • (13 tenure-track faculty) • 115 PhD students • 148 masters students • Award-winning research teams: ACM Gordon Bell Prize awarded to team composed mainly of CSE faculty and students. • Other honors include: 1 Regents’ professor, 6 NSF CAREER awards , 4 IEEE fellows, 2 AAAS fellows, and 1 SIAM fellow 19 July 2018 David A. Bader 6
  • 7. Strategic Partnership Program David A. Bader 719 July 2018
  • 8. • Georgia Tech is building Coda, a multi-story, 750,000-square-foot HPC building in the heart of Atlanta (Midtown) with a targeted opening in January 2019 • Devoted to data science and high-performance computing for centralized collaboration among industry, academia and government • Location of CSE, IDEAS and the HPC Center, and the South BD Hub • Georgia Tech is the anchor tenant, taking approximately one-half of the new development. Remaining space will be for corporate entities and partners. • The Institute plans to locate academic and leading-edge research programs in computing and advanced big data analytics there. A New Home for the Future of CSE 19 July 2018 David A. Bader 8
  • 9. Exascale Streaming Data Analytics: Real-world challenges All involve analyzing massive streaming complex networks: • Health care → disease spread, detection and prevention of epidemics/pandemics (e.g. SARS, Avian flu, H1N1 “swine” flu) • Massive social networks → understanding communities, intentions, population dynamics, pandemic spread, transportation and evacuation • Intelligence → business analytics, anomaly detection, security, knowledge discovery from massive data sets • Systems Biology → understanding complex life systems, drug design, microbial research, unraveling the mysteries of the HIV virus; understanding life and disease • Electric Power Grid → communication, transportation, energy, water, food supply • Modeling and Simulation → performing full-scale economic-social-political simulations [Figure: Facebook active users (millions), Dec 2004 to Dec 2009] Exponential growth: Billions of active users Sample queries: Allegiance switching: identify entities that switch communities. Community structure: identify the genesis and dissipation of communities. Phase change: identify significant change in the network structure. REQUIRES PREDICTING / INFLUENCING CHANGE IN REAL-TIME AT SCALE Ex: discovered minimal changes in O(billions)-size complex network that could hide or reveal top influencers in the community
  • 10. Graphs are pervasive in large-scale data analysis • Sources of massive data: peta- and exa-scale simulations, experimental devices, the Internet, scientific applications. • New challenges for analysis: data sizes, heterogeneity, uncertainty, data quality. Astrophysics Problem: Outlier detection. Challenges: massive datasets, temporal variations. Graph problems: clustering, matching. Bioinformatics Problem: Identifying drug target proteins. Challenges: Data heterogeneity, quality. Graph problems: centrality, clustering. Social Informatics Problem: Discover emergent communities, model spread of information. Challenges: new analytics routines, uncertainty in data. Graph problems: clustering, shortest paths, flows. Image sources: (1) http://physics.nmt.edu/images/astro/hst_starfield.jpg (2,3) www.visualComplexity.com 10DavidA. Bader
  • 11. Network Analysis for Intelligence and Surveillance • [Krebs ’04] Post 9/11 Terrorist Network Analysis from public domain information • Plot masterminds correctly identified from interaction patterns: centrality • A global view of entities is often more insightful • Detect anomalous activities by exact/approximate graph matching Image Source: http://www.orgnet.com/hijackers.html Image Source: T. Coffman, S. Greenblatt, S. Marcus, Graph-based technologies for intelligence analysis, CACM, 47 (3, March 2004): pp 45-47
  • 12. Characterizing Graph-theoretic computations Input: Graph abstraction. Problem: Find *** • paths • clusters • partitions • matchings • patterns • orderings Factors that influence choice of algorithm: • graph sparsity (m/n ratio) • static/dynamic nature • weighted/unweighted, weight distribution • vertex degree distribution • directed/undirected • simple/multi/hyper graph • problem size • granularity of computation at nodes/edges • domain-specific characteristics Graph algorithms: • traversal • shortest path algorithms • flow algorithms • spanning tree algorithms • topological sort ….. Graph problems are often recast as sparse linear algebra (e.g., partitioning) or linear programming (e.g., matching) computations
  • 13. Massive Streaming Graph Analytics David A. Bader 13 (A, B, t1, poke) (A, C, t2, msg) (A, D, t3, view wall) (A, D, t4, post) (B, A, t2, poke) (B, A, t3, view wall) (B, A, t4, msg) Analysts
  • 14. Mining Twitter for Social Good David A. Bader 14 ICPP 2010 Image credit: bioethicsinstitute.org
  • 15. Massive Data Analytics: Protecting our Nation • Public Health: CDC / Nation-scale surveillance of public health • Cancer genomics and drug design: computed Betweenness Centrality of Human Proteome • US High Voltage Transmission Grid (>150,000 miles of line) [Figure: Human Genome core protein interactions, Degree vs. Betweenness Centrality on log-log axes; highlighted point: ENSG00000145332.2, Kelch-like protein implicated in breast cancer]
  • 16. STING Initiative: Focusing on Globally Significant Grand Challenges • Many globally-significant grand challenges can be modeled by Spatio- Temporal Interaction Networks and Graphs (or “STING”). • Emerging real-world graph problems include • detecting community structure in large social networks, • defending the nation against cyber-based attacks, • discovering insider threats (e.g. Ft. Hood shooter, WikiLeaks), • improving the resilience of the electric power grid, and • detecting and preventing disease in human populations. • Unlike traditional applications in computational science and engineering, solving these problems at scale often raises new research challenges because of sparsity and the lack of locality in the massive data, design of parallel algorithms for massive, streaming data analytics, and the need for new exascale supercomputers that are energy- efficient, resilient, and easy-to-program. David A. Bader 1619 July 2018
  • 17. Hierarchical Identify Verify Exploit (HIVE) • SHARP: Software Toolkit for Accelerating GrapH AlgoRithms on HIVE Processors • Georgia Tech with University of Southern California • Launched Spring 2017 • Performers include Intel, Qualcomm, PNNL, Northrop Grumman • Challenges • Programmers are required to exploit low-level hardware and operating system primitives. • Limited portability of frameworks for new architectures and accelerators. • Utilizing commodity hardware designed for a different application domain • Goals • Design a unique library for a first-of-its-kind graph processor. Fully utilize new hardware features. • Portable and scalable framework for massive graphs. • Configurable data layout that will be decided by: input, algorithm, and targeted hardware. [Diagram: TA2 SHARP. Labels: Optimization; Dataflow Modelling; Preprocessing; Primitive Unrolling (PU) Database; Dataflow Model; ILP Optimization; Partially Annotated Code via SHARP APIs; Task to Hardware Mapping and Optimal Data Layout for Input Code; Runtime Scheduler and Resource Manager; STINGER Graph Algorithms; Dynamic Primitives; Hierarchical Primitive Decomposition; Architecture Info & Hardware Primitives from TA1; Abstract Model of HIVE Hardware; Suitable Primitives API Definitions to TA3; Initial Mapping; Primitive Set Evaluation; Data Layout; Graph Primitives Data Store; Communication Cost Model; Access Cost Model; Decomposition Analysis; Graph Primitives from TA3; Primitive Feedback and Dataset Analysis to TA1]
  • 18. Power Efficiency Revolution for Embedded Computing Technologies (PERFECT) • Challenges • Performance per watt is now a metric of concern. • Data movement (caches, networks, storage devices) is becoming a dominant factor in both execution time and power consumption. • Power consumption is limiting application and architecture scalability. • Goals • One approach to reducing power consumption is to reduce execution time. • Find additional ways to utilize shared memory systems. Better shared memory implementations can help reduce network size and limit network data movement. • Help the user select: graph data layout, programming model (vertex centric vs. edge centric), ideal accelerators for an application, load-balancing techniques and much more.
  • 19. Evaluating Memory-Centric Architectures for HPDA • Jason Riedy (PI), David A. Bader, Tom Conte • High-performance data analysis does not fit current CPU-centric architectures well. • Need new approaches to achieve high performance. • Emu Chick: Move threads to data! • Application areas: Streaming graphs and sparse tensors. • Needs new programming paradigms, new algorithm optimizations, new ideas! • Goals: • Evaluate the Emu migratory thread system, and • develop new methods optimized for memory-centric architectures. IARPA
  • 20. STINGER – Time Frame David A. Bader 20 STINGER is officially proposed. May 2009 First prototype, clustering coefficients. Apr 2010 Structure tracking of streaming social networks. Apr 2011 High Performance Data Structure for Streaming Graphs. Sep 2012. HPEC BEST PAPER AWARD Dynamic betweenness centrality algorithm, Sep 2012 Streaming connected component, Dec 2013 Performance evaluation of open-source graph data-bases, Feb 2014 Community detection in dynamic networks Sep 2015 PageRank for Streaming Graphs. May 2016 Streaming graph need arises (over a decade ago) 19 July 2018
  • 21. Streaming graph example • Dynamic/Streaming: • At time t: • v and w become friends • Insert(v, w) • At time t: • u upsets v. u and v are no longer friends • Delete(u, v) [Figure: small subgraph with vertices u, v, w]
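The insert and delete events on this slide can be sketched with a plain adjacency-set graph. This is a minimal illustration only: the class name `DynamicGraph` and its methods are hypothetical and are not the STINGER API.

```python
from collections import defaultdict


class DynamicGraph:
    """Undirected graph that accepts streaming edge insertions and deletions.

    Hypothetical illustration; not the STINGER data structure.
    """

    def __init__(self):
        self.adj = defaultdict(set)

    def insert(self, v, w):
        # An undirected edge is stored in both adjacency sets.
        self.adj[v].add(w)
        self.adj[w].add(v)

    def delete(self, u, v):
        # discard() is a no-op if the edge is already absent.
        self.adj[u].discard(v)
        self.adj[v].discard(u)

    def are_friends(self, a, b):
        return b in self.adj[a]


g = DynamicGraph()
g.insert("u", "v")  # initial friendship between u and v
g.insert("v", "w")  # v and w become friends: Insert(v, w)
g.delete("u", "v")  # u upsets v: Delete(u, v)
```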
  • 22. Streaming Analytics move us from reporting the news to predictive analytics Traditional HPC • Great for “static” data sets. • Massive scalability at the cost of programmability. • Great for dense problems. • Sparse problems typically underutilize the system. Streaming Analytics • Requires specialized analytics and data structures. • Rapidly changing data. • Low data re-usage. • Focused on memory operations and not FLOPS. David A. Bader 2219 July 2018
  • 23. STING Extensible Representation (STINGER) • Design goals: • Enable algorithm designers to implement dynamic graph algorithms with ease. • Portable semantics for various platforms • Good performance for all types of graph problems and algorithms - static and dynamic. • Assumes globally addressable memory access • Support multiple, parallel readers and a single writer • One server manages the graph data structures • Multiple analytics run in background with read-only permissions. David A. Bader 2319 July 2018
  • 24. STING Extensible Representation David A. Bader 24 • Semi-dense edge list blocks with free space • Compactly stores timestamps, types, weights • Maps from application IDs to storage IDs • Deletion by negating IDs, separate compaction 19 July 2018
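The ideas listed on this slide (edge blocks with free space, per-edge weights and timestamps, deletion by negating IDs, separate compaction) can be illustrated with a toy structure. Everything here is an assumption made for illustration: the names `EdgeBlock`, `BlockedGraph` and `BLOCK_SIZE`, the non-negative integer vertex IDs, and the `-id - 1` encoding (used so that vertex 0 can also be marked deleted). It is not STINGER's actual layout or API.

```python
BLOCK_SIZE = 4  # hypothetical block capacity, kept small for illustration


class EdgeBlock:
    """Block holding up to BLOCK_SIZE slots of [neighbor, weight, timestamp]."""

    def __init__(self):
        self.slots = []


class BlockedGraph:
    def __init__(self):
        self.blocks = {}  # source vertex -> list of EdgeBlock

    def insert(self, src, dst, weight, timestamp):
        blocks = self.blocks.setdefault(src, [])
        free = None
        for block in blocks:
            for slot in block.slots:
                if slot[0] == dst:  # edge exists: update in place
                    slot[1], slot[2] = weight, timestamp
                    return
                if slot[0] < 0 and free is None:
                    free = slot  # remember the first deleted slot
        if free is not None:  # reuse a deleted slot
            free[:] = [dst, weight, timestamp]
            return
        if not blocks or len(blocks[-1].slots) == BLOCK_SIZE:
            blocks.append(EdgeBlock())  # chain a new block
        blocks[-1].slots.append([dst, weight, timestamp])

    def delete(self, src, dst):
        for block in self.blocks.get(src, []):
            for slot in block.slots:
                if slot[0] == dst:
                    slot[0] = -dst - 1  # mark deleted by negating the ID
                    return True
        return False

    def neighbors(self, src):
        return [s[0] for b in self.blocks.get(src, []) for s in b.slots if s[0] >= 0]

    def compact(self, src):
        # Separate compaction pass: drop deleted slots and repack the blocks.
        live = [s for b in self.blocks.get(src, []) for s in b.slots if s[0] >= 0]
        packed = []
        for i in range(0, len(live), BLOCK_SIZE):
            block = EdgeBlock()
            block.slots = live[i:i + BLOCK_SIZE]
            packed.append(block)
        self.blocks[src] = packed


bg = BlockedGraph()
for d in [1, 2, 3, 4, 5]:
    bg.insert(0, d, 1.0, d)
```

Deleting only flips a sign, so concurrent readers can skip the slot without any edges being moved; the cost of reclaiming space is deferred to `compact`.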
  • 25. STINGER Graph & Analytic Update Process • Accumulate recent graph updates (insertions / deletions) in main memory and create a batch. • Pre-process, sort, reconcile • “Age off” old vertices • Modify STINGER graph • Update metrics (execute streaming analytics) on affected vertices • Change detection
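A rough sketch of the batch steps named on this slide (pre-process, sort, reconcile, modify the graph, report affected vertices). The reconciliation rule used here, "the last action on an edge within a batch wins", is an assumption chosen for simplicity, and the function names are hypothetical.

```python
from collections import defaultdict


def reconcile_batch(batch):
    """Reduce a batch of ('+' | '-', u, v) updates to net insertions/deletions.

    Assumption: within one batch, the last action on an edge wins.
    """
    last_op = {}
    for op, u, v in batch:
        edge = (min(u, v), max(u, v))  # normalize undirected edges
        last_op[edge] = op
    insertions = sorted(e for e, op in last_op.items() if op == "+")
    deletions = sorted(e for e, op in last_op.items() if op == "-")
    affected = sorted({x for e in last_op for x in e})
    return insertions, deletions, affected


def apply_batch(adj, batch):
    """Apply a reconciled batch to an adjacency-set graph; return affected vertices."""
    insertions, deletions, affected = reconcile_batch(batch)
    for u, v in deletions:
        adj[u].discard(v)
        adj[v].discard(u)
    for u, v in insertions:
        adj[u].add(v)
        adj[v].add(u)
    return affected


adj = defaultdict(set)
adj[1].add(2)
adj[2].add(1)
batch = [("+", 1, 2), ("+", 2, 3), ("-", 1, 2), ("+", 4, 3)]
affected = apply_batch(adj, batch)
```

A streaming analytic would then be re-evaluated only on the returned affected vertices rather than on the whole graph.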
  • 26. STING: High-level architecture David A. Bader 26 ◮ Server: Graph storage, kernel orchestration ◮ OpenMP + sufficiently POSIX-ish ◮ Multiple processes for resilience 19 July 2018
  • 27. STINGER: as an analysis package • Streaming edge insertions and deletions: Performs new edge insertions, updates, and deletions in batches or individually. Optimized to update at rates of over 3 million edges per second on graphs of one billion edges. • Streaming clustering coefficients: Tracks the local and global clustering coefficients of a graph. • Streaming connected components: Real time tracking of the connected components. • Streaming Betweenness Centrality: Find the key points within information flows and structural vulnerabilities. • Streaming community detection: Track and update the community structures within the graph as they change. • Anything that a static graph package can do (and a whole lot more): • Parallel agglomerative clustering: Find clusters that are optimized for a user-defined edge scoring function. • K-core Extraction: Extract additional communities and filter noisy high-degree vertices. • Classic breadth-first search: Performs a parallel breadth-first search of the graph starting at a given source vertex to find shortest paths. • Parallel connected components: Finds the connected components in a static network. David A. Bader 27 http://www.stingergraph.com/ 19 July 2018
  • 28. Streaming Updates Update process • Group updates into batches • Updates can include insertions and deletions • Big batches ⇒ Better performances [HPEC; 2012] Throughput rate David A. Bader 28 Experiment setup • 4x10 Intel E7-8870 processors • RMAT Graph • Vertices: 16M • Edges: 128M • Various batch sizes • ~93% of updates are insertions • ~7% of updates are deletions Takeaway • STINGER supports extremely fast updates. • Updates are not the bottleneck for analytics. • Analytic computations are the bottleneck! • Highly scalable 19 July 2018
  • 29. Streaming Clustering Coefficients & Triangle Counting [MTAAP; 2010] Background • Scores how tightly bound players are in their local community. • Looks for common relationships for two adjacent vertices. • Hence the term triangle counting • Complexity for static graph algorithm (intersection based): $O(v \cdot d_{\max}^2)$ Multiple streaming implementations • Brute-force – straightforward and exact • Bloom-filter – approximate yet extremely fast • Sorted-list – uses intersections. Fast and exact. Larger batches give greater speedups. Experiment setup • Executed on two systems • Cray XMT – 64 nodes • 2x4 Intel E5530 system • 8 cores, 16 threads • Used RMAT synthetic graphs • 2M vertices, 16 edges • Hundreds of thousands of updates per second
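The sorted-list variant can be sketched as follows: when an edge (u, v) arrives, the common neighbors of u and v are exactly the new triangles, and they are found by intersecting two sorted adjacency lists. This is a serial, insert-only sketch with hypothetical names, not the parallel implementation the slide reports on.

```python
from bisect import bisect_left, insort


def intersect_sorted(a, b):
    """Return the common elements of two sorted lists with a linear merge."""
    i = j = 0
    common = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            common.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return common


class StreamingTriangles:
    def __init__(self):
        self.adj = {}        # vertex -> sorted list of neighbors
        self.triangles = {}  # vertex -> number of triangles through it

    def _has_edge(self, u, v):
        nbrs = self.adj.get(u, [])
        k = bisect_left(nbrs, v)
        return k < len(nbrs) and nbrs[k] == v

    def insert_edge(self, u, v):
        if u == v or self._has_edge(u, v):
            return
        for x in (u, v):
            self.adj.setdefault(x, [])
            self.triangles.setdefault(x, 0)
        # Every common neighbor w closes one new triangle (u, v, w).
        common = intersect_sorted(self.adj[u], self.adj[v])
        self.triangles[u] += len(common)
        self.triangles[v] += len(common)
        for w in common:
            self.triangles[w] += 1
        insort(self.adj[u], v)
        insort(self.adj[v], u)

    def clustering(self, v):
        """Local clustering coefficient: triangles / possible neighbor pairs."""
        d = len(self.adj.get(v, []))
        return 0.0 if d < 2 else 2.0 * self.triangles[v] / (d * (d - 1))


st = StreamingTriangles()
for edge in [(0, 1), (1, 2), (0, 2), (2, 3)]:
    st.insert_edge(*edge)
```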
  • 30. Streaming Connected Components [HiPC; 2013] Background • Tracks connected components in high-velocity networks. • Connected components imply that players are connected to each other through some sequence of relationships. Our algorithm • Takes into account the small-world property • Diameter is small. • Most players have numerous relationships within the connected component. • Edge insertions are always easy. • Very few edge deletions are complex. Takeaway • Up to 1.26 million updates per second on 4 × 16 AMD (Opteron 6282) • Hundreds of times faster than static computation. • Great for social networks. • STINGER requires only 10% of execution time. Rest of time - analytic update. • Scalability similar to BFS.
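To illustrate why edge insertions are the easy case, here is a small union-find sketch that merges components as edges arrive. It swaps in a textbook disjoint-set structure rather than the algorithm from the cited paper, and it deliberately does not handle deletions, which are the hard case the slide refers to. The class name is hypothetical.

```python
class StreamingComponents:
    """Insert-only connected components via union-find (illustrative only)."""

    def __init__(self):
        self.parent = {}
        self.count = 0  # current number of components

    def _find(self, x):
        if x not in self.parent:
            self.parent[x] = x  # a new vertex starts as its own component
            self.count += 1
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[x] != root:  # path compression
            self.parent[x], x = root, self.parent[x]
        return root

    def insert_edge(self, u, v):
        ru, rv = self._find(u), self._find(v)
        if ru != rv:  # the edge joins two components
            self.parent[ru] = rv
            self.count -= 1

    def same_component(self, u, v):
        return self._find(u) == self._find(v)


cc = StreamingComponents()
cc.insert_edge(1, 2)
cc.insert_edge(3, 4)
```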
  • 31. Dynamic Betweenness Centrality [Social Computing; 2012] Background • Used for finding key players in a network based on the number of relationships that go through them. • Fastest known algorithm by Brandes (2002) is still computationally expensive for large networks: $O(V \cdot (V + E))$. Our dynamic graph algorithm • Supports optimizations: • Approximation: reduces accuracy, significantly faster. • Parallelization: utilizes many-core systems • Supports: vertex insertions & deletions and edge insertions & deletions. Experimental setup and takeaways • 4x10 Intel E7-8870 processors • Thousands of times faster than static recomputations. • Significantly reduces the amount of necessary computations. • Only a small percentage of the graph is affected by an update
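For reference, the static baseline mentioned on this slide (Brandes' algorithm) can be sketched for unweighted, undirected graphs as below: one breadth-first search per source followed by a reverse accumulation of dependencies. This is the static computation that a dynamic algorithm tries to avoid repeating, not the dynamic algorithm itself; the function name and input format are assumptions.

```python
from collections import deque


def betweenness(adj):
    """Brandes-style betweenness for an unweighted, undirected graph.

    adj maps every vertex to a list of its neighbors.
    """
    bc = {v: 0.0 for v in adj}
    for s in adj:
        order = []                       # vertices in non-decreasing distance
        preds = {v: [] for v in adj}     # shortest-path predecessors
        sigma = {v: 0 for v in adj}      # number of shortest paths from s
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:          # first visit
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        while order:                     # accumulate from farthest to nearest
            w = order.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Each unordered pair was counted from both endpoints.
    return {v: c / 2 for v, c in bc.items()}


path = {0: [1], 1: [0, 2], 2: [1]}
scores = betweenness(path)
```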
  • 32. Streaming Community Detection and Monitoring [MTAAP; 2013] Background • Communities are typically defined by groups of vertices with more intra-relationships than inter-relationships. • More formally: $Q(C) = \frac{\mathrm{Intra}(C)}{|E|} - \left(\frac{\mathrm{Inter}(C)}{2|E|}\right)^2$ • In addition to the graph, an additional community network is maintained. • Significantly smaller than full network! • Updates applied to community network. An agglomerative approach • Certain types of updates do not change the community structure. • We only need to process updates that “might” cause change. • Few updates require a lot of processing time. Experimental setup and takeaways • 4x8 Intel E7-4820 system • 32 cores, 64 threads • Easily supports millions of updates per second. • Bigger batches offer improved performance. • Dynamic algorithm is 1000s of times faster than static graph algorithm. • Real-time tracking of communities within a network.
  • 33. Streaming Seed-Set Expansion Background • Seeds are vertices of interest. • Seed Set Expansion is the process of detecting a community around a seed. • Streaming SSE – allows tracking vertices of interest over time • Important events such as community split and merging can be reported [ASONAM;2015] David A. Bader 33 Algorithm details • Greedy algorithm. • Vertices are inserted one at a time into the community. • An edge update checks for possible changes in the community. • Pruning can be applied when an update causes a big change in the community. • Pruning makes things slower • Pruning offers more accurate results in comparison with static graph algorithm. Takeaways • Highly accurate in comparison to static graph algorithm. • Precision and recall typically above 90%. • Larger batches require more work ⇒ smaller speedups 19 July 2018
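The greedy expansion described on this slide can be sketched for the static case as below. The fitness function used here (internal degree divided by total degree of the community) is an assumption picked for illustration; the slide does not state which score the authors use, and the streaming update and pruning steps are omitted.

```python
def fitness(adj, community):
    """Share of the community's edge endpoints that stay inside it (assumed score)."""
    k_in = k_out = 0
    for v in community:
        for w in adj[v]:
            if w in community:
                k_in += 1   # each internal edge is counted from both ends
            else:
                k_out += 1
    total = k_in + k_out
    return k_in / total if total else 0.0


def expand_seed(adj, seed):
    """Greedily add one neighboring vertex at a time while fitness improves."""
    community = {seed}
    best = fitness(adj, community)
    while True:
        frontier = {w for v in community for w in adj[v]} - community
        if not frontier:
            break
        candidate = max(sorted(frontier), key=lambda w: fitness(adj, community | {w}))
        score = fitness(adj, community | {candidate})
        if score <= best:
            break
        community.add(candidate)
        best = score
    return community


# Two triangles joined by the bridge edge (2, 3).
two_triangles = {
    0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
    3: [2, 4, 5], 4: [3, 5], 5: [3, 4],
}
found = expand_seed(two_triangles, 0)
```

On this example the expansion stops at the first triangle, because adding vertex 3 would lower the score from 6/7 to 8/10.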
  • 34. Incremental PageRank [GABB; 2016] Background • PageRank measures the importance of a vertex by the number and weight of links pointing to it. • Works like a propagation algorithm. • Algorithm continues until no changes are detected. Algorithm • Uses STINGER to perform linear algebra operations. • Supports both insertions and deletions. • Incremental implies that only a small subset of the graph is traversed. • Does only the necessary amount of work Takeaway • Large batches: reduce latency by > 2× over restarting on average. • Small batches: potentially hundreds of times faster than restart. • Improved power performance (modeled). • Can deal with several thousand updates per second.
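A simplified way to see the benefit of not restarting is to warm-start the usual power iteration from the previous PageRank vector after an update. The sketch below does only that; it is a stand-in technique, not the incremental algorithm from the cited work, which touches only the affected part of the graph. The function name, parameters and the uniform treatment of dangling vertices are assumptions.

```python
def pagerank(adj, damping=0.85, tol=1e-10, start=None, max_iter=1000):
    """Power iteration; adj maps every vertex to its list of out-neighbors.

    Returns (ranks, iterations). `start` optionally warm-starts the iteration.
    """
    nodes = list(adj)
    n = len(nodes)
    if start is None:
        rank = {v: 1.0 / n for v in nodes}
    else:
        rank = {v: start.get(v, 1.0 / n) for v in nodes}
        total = sum(rank.values())
        rank = {v: r / total for v, r in rank.items()}  # renormalize
    for iteration in range(1, max_iter + 1):
        # Rank held by vertices without out-links is spread uniformly.
        dangling = sum(rank[v] for v in nodes if not adj[v])
        new = {v: (1 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            if adj[v]:
                share = damping * rank[v] / len(adj[v])
                for w in adj[v]:
                    new[w] += share
        change = sum(abs(new[v] - rank[v]) for v in nodes)
        rank = new
        if change < tol:  # "continues until no changes are detected"
            return rank, iteration
    return rank, max_iter


web = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]}
before, _ = pagerank(web)
web["c"] = ["a", "d"]                            # streaming update: insert edge c -> d
warm, warm_iters = pagerank(web, start=before)   # continue from the old ranks
cold, cold_iters = pagerank(web)                 # restart from scratch
```

Both runs converge to the same vector; comparing `warm_iters` with `cold_iters` on a given graph shows how much work the warm start saves.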
  • 35. Community Centric Analysis (in process) Background • Focuses on finding key players in communities • Might be overlooked by network-wide analytics. • Computationally less expensive. • Highly scalable • Key players detected due to change to their community upon extraction. • We modify several widely used analytics for this new type of computation. • Starts off with an initial exploration of static graphs Modified metrics of interest • Change in community modularity • Change to the community diameter • Change in the number of connected components in the community Community Centric Approach • Given a community C and a metric M, for each vertex u in each community C: • Calculate initial metric $M_{initial}$ on community (left figure) using static graph algorithm (done once) • Remove vertex u and links using STINGER • Calculate changed metric $M_{after}$ using dynamic graph algorithm using STINGER • Look at change to community: $\Delta M(u) = \frac{M_{after}}{M_{initial}}$ • Insert vertex u and links using STINGER Initial Findings • A different way to use streaming analytics: “vertex delta computations” • Multiple metrics pinpoint same key vertices. • Computationally efficient • Over 20X faster than networks approach. • Highly scalable • Applicable to other metrics as well
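The remove, measure, reinsert loop on this slide can be sketched with the "number of connected components" metric. Two assumptions are made here: the change is read as the ratio of the metric after removal to the initial metric, and the metric is simply recomputed from scratch after each removal, whereas the slide uses a dynamic algorithm on STINGER for that step. All names are hypothetical.

```python
def count_components(adj, members):
    """Number of connected components in the subgraph induced by `members`."""
    seen = set()
    count = 0
    for s in members:
        if s in seen:
            continue
        count += 1
        seen.add(s)
        stack = [s]
        while stack:
            v = stack.pop()
            for w in adj[v]:
                if w in members and w not in seen:
                    seen.add(w)
                    stack.append(w)
    return count


def vertex_deltas(adj, community, metric=count_components):
    """For each vertex u, the ratio metric(C without u) / metric(C)."""
    community = set(community)
    m_initial = metric(adj, community)   # computed once
    deltas = {}
    for u in sorted(community):
        community.remove(u)              # remove vertex u and its links
        m_after = metric(adj, community)
        deltas[u] = m_after / m_initial
        community.add(u)                 # reinsert vertex u and its links
    return deltas


star = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
deltas = vertex_deltas(star, {0, 1, 2, 3})
```

In the star example, removing the center splits the community into three pieces, so it stands out as the key vertex.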
  • 36. Current Research 19 July 2018 David A. Bader 36
  • 37. Analysis of Centrality on Graphs • Numerical Centrality • Theoretically guaranteeing highly ranked vertices from calculations of Katz Centrality from an iterative solver • Nathan, Sanders, Fairbanks, Henson, Bader. “Graph Ranking Guarantees for Numerical Approximations to Katz Centrality,” International Conference on Computational Science (ICCS). June 2017. • Dynamic Centrality • Develop algorithms to efficiently update Katz Centrality in dynamic graphs: faster than static recomputation while maintaining high-quality results. Algorithms draw on both a linear algebraic environment and a personalized agglomerative approach. • Nathan and Bader. “A Dynamic Algorithm for Updating Katz Centrality in Graphs,” IEEE/ACM International Conference on Social Networks Analysis and Mining (ASONAM). July 2017. • Nathan and Bader. “Approximating Personalized Katz Centrality in Dynamic Graphs,” 12th International Conference on Parallel Processing and Applied Mathematics (PPAM). September 2017.
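As a small illustration of computing Katz Centrality with an iterative solver, the sketch below uses one common formulation, the fixed point of x = 1 + αAx, solved by simple fixed-point iteration. The exact formulation, solver and stopping rule used in the cited papers may differ; the function name and parameters are assumptions. The iteration converges only when α is below the reciprocal of the largest eigenvalue of the adjacency matrix.

```python
def katz(adj, alpha, tol=1e-12, max_iter=10_000):
    """Katz scores as the fixed point of x = 1 + alpha * A x.

    adj maps every vertex to a list of its neighbors (undirected graph).
    """
    x = {v: 1.0 for v in adj}
    for _ in range(max_iter):
        new = {v: 1.0 + alpha * sum(x[w] for w in adj[v]) for v in adj}
        if max(abs(new[v] - x[v]) for v in adj) < tol:
            return new
        x = new
    raise RuntimeError("did not converge; alpha may be too large")


pair = {0: [1], 1: [0]}
pair_scores = katz(pair, alpha=0.5)      # exact solution is 2 for both vertices

star = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
star_scores = katz(star, alpha=0.2)      # the center should rank highest
```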
  • 38. Dynamic Communities • Goals • Detect local communities in networks given seed vertices of interest • Allows a relevant subgraph to be extracted for targeted analysis • Incrementally update and track communities over time in dynamic graphs • Publications • Zakrzewska and Bader. “A dynamic algorithm for local community detection in graphs,” ASONAM 2015. • Zakrzewska and Bader. “Tracking local communities in streaming graphs with a dynamic algorithm,” SNAM Journal 6(1) 2016. • Zakrzewska, Nathan, Fairbanks, Bader. “A local measure of community change in dynamic graphs,” ASONAM 2016. • Nathan, Zakrzewska, Riedy, Bader. “Local community detection in dynamic graphs using personalized centrality,” Journal of Algorithms 2017. David A. Bader 3819 July 2018
  • 39. Sampling Streaming Graphs • Challenges • Many relational datasets are large, with new data constantly generated • The volume of data may be too large to store or run graph analytics • Goals • Sample a stream of relational data (edges) to create a graph representation • The sampling method should restrict both the number of vertices and edges to limit the memory needed to store the sampled graph • In some applications, newer data is more relevant • Allow for sampling bias towards newer edges when needed or for temporally uniform sampling • Zakrzewska and Bader. “Streaming graph sampling with size restrictions,” ASONAM 2017. David A. Bader 39 10 3 9 1 4 11 5 2 6 8 7 10 8 4 74 1 4 2 6 5 7 5 5 4 8 19 July 2018
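The temporally uniform case can be illustrated with classic reservoir sampling over the edge stream, which keeps at most k edges in memory and gives every edge seen so far the same chance of being retained. This swaps in the textbook technique (Algorithm R); it bounds only the number of edges and does not implement the vertex limit or the bias towards newer edges described on the slide. The function name and the fixed seed are assumptions.

```python
import random


def sample_edges(stream, k, seed=0):
    """Keep a uniform random sample of at most k edges from an edge stream."""
    rng = random.Random(seed)
    reservoir = []
    for i, edge in enumerate(stream):
        if i < k:
            reservoir.append(edge)      # fill the reservoir first
        else:
            j = rng.randint(0, i)       # inclusive bounds: i + 1 choices
            if j < k:                   # happens with probability k / (i + 1)
                reservoir[j] = edge
    return reservoir


stream = [(i, i + 1) for i in range(1000)]
sample = sample_edges(stream, 10)
```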
  • 40. STINGER: Where do you get it? http://www.stingergraph.com/ • Gateway to • code, • development, • documentation, • presentations... • Users / contributors / questioners: Georgia Tech, PNNL, CMU, Berkeley, Intel, Cray, NVIDIA, IBM, Federal Government, Ionic Security, Citi, Accenture David A. Bader 4019 July 2018
  • 41. STINGER Development Enterprise • Tech transfer for GTRI • Enterprise software integrity • Nightly builds • Unit testing required Academic • Maintained by Georgia Tech • Ideal for prototyping. • Sandbox for developing new concepts • When software matures… David A. Bader 41 http://git.cc.gatech.edu/git/u/eriedy3/stinger.git/ https://github.com/stingergraph 19 July 2018
  • 42. STINGER Summary • Massive-Scale Streaming Analytics require • Simple programming model • Simple API. • CSR-like in concept. • STINGER has a lot more under the hood. • Extremely fast updates • Millions of updates per second. • These must not be bottlenecks for updating an analytic. • STINGER offers these • STINGER has major performance benefits • Thousands of times faster than static graph computation. • Hundreds of thousands of updates per second for numerous analytics. • Real-time monitoring of underlying network. David A. Bader 4219 July 2018
  • 43. Conclusions • Massive-Scale Streaming Analytics will require new • High-performance computing platforms • Streaming algorithms • Energy-efficient implementations and are promising to solve real-world challenges! • Mapping applications to high performance architectures may yield 6 or more orders of magnitude performance improvement David A. Bader 4319 July 2018
  • 44. Acknowledgments • Jason Riedy, Research Scientist, (Georgia Tech) • Oded Green, Research Scientist, (Georgia Tech) • Current Graduate Students (Georgia Tech): • Xiaojing An • James Fox • Kasimir Gabert • Euna Kim • Recent Bader Alumni: • Dr. Eisha Nathan (Lawrence Livermore National Lab) • Dr. Vipin Sachdeva (IBM) • Dr. Anita Zakrzewska (Lawrence Livermore National Lab) • Dr. Lluis Miquel Munguia (Google) • Prof. Kamesh Madduri (Penn State) • Dr. David Ediger (GTRI) • Dr. James Fairbanks (GTRI) • Dr. Seunghwa Kang (Pacific Northwest National Lab) David A. Bader 4419 July 2018
  • 45. PhD students Emily Rogers James Fox Xiaojing An Euna Kim Anita Zakrzewska Eisha Nathan Chunxing Yin Kasimir Gabert Lluis Miquel Munguia David A. Bader 45 Buzz 19 July 2018
  • 46. Bader, Related Recent Publications (2005-2009) • D.A. Bader, G. Cong, and J. Feo, “On the Architectural Requirements for Efficient Execution of Graph Algorithms,” The 34th International Conference on Parallel Processing (ICPP 2005), pp. 547-556, Georg Sverdrups House, University of Oslo, Norway, June 14-17, 2005. • D.A. Bader and K. Madduri, “Design and Implementation of the HPCS Graph Analysis Benchmark on Symmetric Multiprocessors,” The 12th International Conference on High Performance Computing (HiPC 2005), D.A. Bader et al., (eds.), Springer-Verlag LNCS 3769, 465-476, Goa, India, December 2005. • D.A. Bader and K. Madduri, “Designing Multithreaded Algorithms for Breadth-First Search and st-connectivity on the Cray MTA-2,” The 35th International Conference on Parallel Processing (ICPP 2006), Columbus, OH, August 14-18, 2006. • D.A. Bader and K. Madduri, “Parallel Algorithms for Evaluating Centrality Indices in Real-world Networks,” The 35th International Conference on Parallel Processing (ICPP 2006), Columbus, OH, August 14-18, 2006. • K. Madduri, D.A. Bader, J.W. Berry, and J.R. Crobak, “Parallel Shortest Path Algorithms for Solving Large-Scale Instances,” 9th DIMACS Implementation Challenge -- The Shortest Path Problem, DIMACS Center, Rutgers University, Piscataway, NJ, November 13-14, 2006. • K. Madduri, D.A. Bader, J.W. Berry, and J.R. Crobak, “An Experimental Study of A Parallel Shortest Path Algorithm for Solving Large-Scale Graph Instances,” Workshop on Algorithm Engineering and Experiments (ALENEX), New Orleans, LA, January 6, 2007. • J.R. Crobak, J.W. Berry, K. Madduri, and D.A. Bader, “Advanced Shortest Path Algorithms on a Massively-Multithreaded Architecture,” First Workshop on Multithreaded Architectures and Applications (MTAAP), Long Beach, CA, March 30, 2007. • D.A. Bader and K. 
Madduri, “High-Performance Combinatorial Techniques for Analyzing Massive Dynamic Interaction Networks,” DIMACS Workshop on Computational Methods for Dynamic Interaction Networks, DIMACS Center, Rutgers University, Piscataway, NJ, September 24-25, 2007. • D.A. Bader, S. Kintali, K. Madduri, and M. Mihail, “Approximating Betweenness Centrality,” The 5th Workshop on Algorithms and Models for the Web-Graph (WAW2007), San Diego, CA, December 11-12, 2007. • David A. Bader, Kamesh Madduri, Guojing Cong, and John Feo, “Design of Multithreaded Algorithms for Combinatorial Problems,” in S. Rajasekaran and J. Reif, editors, Handbook of Parallel Computing: Models, Algorithms, and Applications, CRC Press, Chapter 31, 2007. • Kamesh Madduri, David A. Bader, Jonathan W. Berry, Joseph R. Crobak, and Bruce A. Hendrickson, “Multithreaded Algorithms for Processing Massive Graphs,” in D.A. Bader, editor, Petascale Computing: Algorithms and Applications, Chapman & Hall / CRC Press, Chapter 12, 2007. • D.A. Bader and K. Madduri, “SNAP, Small-world Network Analysis and Partitioning: an open-source parallel graph framework for the exploration of large-scale networks,” 22nd IEEE International Parallel and Distributed Processing Symposium (IPDPS), Miami, FL, April 14-18, 2008. • S. Kang, D.A. Bader, “An Efficient Transactional Memory Algorithm for Computing Minimum Spanning Forest of Sparse Graphs,” 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), Raleigh, NC, February 2009. • Karl Jiang, David Ediger, and David A. Bader. “Generalizing k-Betweenness Centrality Using Short Paths and a Parallel Multithreaded Implementation.” The 38th International Conference on Parallel Processing (ICPP), Vienna, Austria, September 2009. • Kamesh Madduri, David Ediger, Karl Jiang, David A. Bader, Daniel Chavarría-Miranda. 
“A Faster Parallel Algorithm and Efficient Multithreaded Implementations for Evaluating Betweenness Centrality on Massive Datasets.” 3rd Workshop on Multithreaded Architectures and Applications (MTAAP), Rome, Italy, May 2009. • David A. Bader, et al. “STINGER: Spatio-Temporal Interaction Networks and Graphs (STING) Extensible Representation.” 2009. 46David A. Bader19 July 2018
  • 47. Bader, Related Recent Publications (2010-2011) • David Ediger, Karl Jiang, E. Jason Riedy, and David A. Bader. “Massive Streaming Data Analytics: A Case Study with Clustering Coefficients,” Fourth Workshop in Multithreaded Architectures and Applications (MTAAP), Atlanta, GA, April 2010. • Seunghwa Kang, David A. Bader. “Large Scale Complex Network Analysis using the Hybrid Combination of a MapReduce cluster and a Highly Multithreaded System:,” Fourth Workshop in Multithreaded Architectures and Applications (MTAAP), Atlanta, GA, April 2010. • David Ediger, Karl Jiang, Jason Riedy, David A. Bader, Courtney Corley, Rob Farber and William N. Reynolds. “Massive Social Network Analysis: Mining Twitter for Social Good,” The 39th International Conference on Parallel Processing (ICPP 2010), San Diego, CA, September 2010. • Virat Agarwal, Fabrizio Petrini, Davide Pasetto and David A. Bader. “Scalable Graph Exploration on Multicore Processors,” The 22nd IEEE and ACM Supercomputing Conference (SC10), New Orleans, LA, November 2010. • Z. Du, Z. Yin, W. Liu, and D.A. Bader, “On Accelerating Iterative Algorithms with CUDA: A Case Study on Conditional Random Fields Training Algorithm for Biological Sequence Alignment,” IEEE International Conference on Bioinformatics & Biomedicine, Workshop on Data-Mining of Next Generation Sequencing Data (NGS2010), Hong Kong, December 20, 2010. • D. Ediger, J. Riedy, H. Meyerhenke, and D.A. Bader, “Tracking Structure of Streaming Social Networks,” 5th Workshop on Multithreaded Architectures and Applications (MTAAP), Anchorage, AK, May 20, 2011. • D. Mizell, D.A. Bader, E.L. Goodman, and D.J. Haglin, “Semantic Databases and Supercomputers,” 2011 Semantic Technology Conference (SemTech), San Francisco, CA, June 5-9, 2011. • P. Pande and D.A. Bader, “Computing Betweenness Centrality for Small World Networks on a GPU,” The 15th Annual High Performance Embedded Computing Workshop (HPEC), Lexington, MA, September 21-22, 2011. • David A. 
Bader, Christine Heitsch, and Kamesh Madduri, “Large-Scale Network Analysis,” in J. Kepner and J. Gilbert, editors, Graph Algorithms in the Language of Linear Algebra, SIAM Press, Chapter 12, pages 253-285, 2011. • Jeremy Kepner, David A. Bader, Robert Bond, Nadya Bliss, Christos Faloutsos, Bruce Hendrickson, John Gilbert, and Eric Robinson, “Fundamental Questions in the Analysis of Large Graphs,” in J. Kepner and J. Gilbert, editors, Graph Algorithms in the Language of Linear Algebra, SIAM Press, Chapter 16, pages 353-357, 2011.
  • 48. Bader, Related Recent Publications (2012) • E.J. Riedy, H. Meyerhenke, D. Ediger, and D.A. Bader, “Parallel Community Detection for Massive Graphs,” The 9th International Conference on Parallel Processing and Applied Mathematics (PPAM 2011), Torun, Poland, September 11-14, 2011. Lecture Notes in Computer Science, 7203:286-296, 2012. • E.J. Riedy, D. Ediger, D.A. Bader, and H. Meyerhenke, “Parallel Community Detection for Massive Graphs,” 10th DIMACS Implementation Challenge -- Graph Partitioning and Graph Clustering, Atlanta, GA, February 13-14, 2012. • E.J. Riedy, H. Meyerhenke, D.A. Bader, D. Ediger, and T. Mattson, “Analysis of Streaming Social Networks and Graphs on Multicore Architectures,” The 37th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Kyoto, Japan, March 25-30, 2012. • J. Riedy, H. Meyerhenke, and D.A. Bader, “Scalable Multi-threaded Community Detection in Social Networks,” 6th Workshop on Multithreaded Architectures and Applications (MTAAP), Shanghai, China, May 25, 2012. • H. Meyerhenke, E.J. Riedy, and D.A. Bader, “Parallel Community Detection in Streaming Graphs,” Minisymposium on Parallel Analysis of Massive Social Networks, 15th SIAM Conference on Parallel Processing for Scientific Computing (PP12), Savannah, GA, February 15-17, 2012. • D. Ediger, E.J. Riedy, H. Meyerhenke, and D.A. Bader, “Analyzing Massive Networks with GraphCT,” Poster Session, 15th SIAM Conference on Parallel Processing for Scientific Computing (PP12), Savannah, GA, February 15-17, 2012. • R.C. McColl, D. Ediger, and D.A. Bader, “Many-Core Memory Hierarchies and Parallel Graph Analysis,” Poster Session, 15th SIAM Conference on Parallel Processing for Scientific Computing (PP12), Savannah, GA, February 15-17, 2012. • E.J. Riedy, D. Ediger, H. Meyerhenke, and D.A. 
Bader, “STING: Software for Analysis of Spatio-Temporal Interaction Networks and Graphs,” Poster Session, 15th SIAM Conference on Parallel Processing for Scientific Computing (PP12), Savannah, GA, February 15-17, 2012. • Y. Chai, Z. Du, D.A. Bader, and X. Qin, "Efficient Data Migration to Conserve Energy in Streaming Media Storage Systems," IEEE Transactions on Parallel & Distributed Systems, 2012. • M. S. Swenson, J. Anderson, A. Ash, P. Gaurav, Z. Sükösd, D.A. Bader, S.C. Harvey and C.E Heitsch, "GTfold: Enabling parallel RNA secondary structure prediction on multi-core desktops," BMC Research Notes, 5:341, 2012. • D. Ediger, K. Jiang, E.J. Riedy, and D.A. Bader, "GraphCT: Multithreaded Algorithms for Massive Graph Analysis," IEEE Transactions on Parallel & Distributed Systems, 2012. • D.A. Bader and K. Madduri, "Computational Challenges in Emerging Combinatorial Scientific Computing Applications," in O. Schenk, editor, Combinatorial Scientific Computing, Chapman & Hall / CRC Press, Chapter 17, pages 471-494, 2012. • O. Green, R. McColl, and D.A. Bader, "GPU Merge Path -- A GPU Merging Algorithm," 26th ACM International Conference on Supercomputing (ICS), San Servolo Island, Venice, Italy, June 25-29, 2012. • O. Green, R. McColl, and D.A. Bader, "A Fast Algorithm for Streaming Betweenness Centrality," 4th ASE/IEEE International Conference on Social Computing (SocialCom), Amsterdam, The Netherlands, September 3-5, 2012. • D. Ediger, R. McColl, J. Riedy, and D.A. Bader, "STINGER: High Performance Data Structure for Streaming Graphs," The IEEE High Performance Extreme Computing Conference (HPEC), Waltham, MA, September 20-22, 2012. Best Paper Award. • J. Marandola, S. Louise, L. Cudennec, J.-T. Acquaviva and D.A. Bader, "Enhancing Cache Coherent Architecture with Access Patterns for Embedded Manycore Systems," 14th IEEE International Symposium on System-on-Chip (SoC), Tampere, Finland, October 11-12, 2012. • L.M. Munguía, E. Ayguade, and D.A. 
Bader, "Task-based Parallel Breadth-First Search in Heterogeneous Environments," The 19th Annual IEEE International Conference on High Performance Computing (HiPC), Pune, India, December 18-21, 2012.
  • 49. Bader, Related Recent Publications (2013) • X. Liu, P. Pande, H. Meyerhenke, and D.A. Bader, "PASQUAL: Parallel Techniques for Next Generation Genome Sequence Assembly," IEEE Transactions on Parallel & Distributed Systems, 24(5):977-986, 2013. • David A. Bader, Henning Meyerhenke, Peter Sanders, and Dorothea Wagner (eds.), Graph Partitioning and Graph Clustering, American Mathematical Society, 2013. • E. Jason Riedy, Henning Meyerhenke, David Ediger and David A. Bader, "Parallel Community Detection for Massive Graphs," in David A. Bader, Henning Meyerhenke, Peter Sanders, and Dorothea Wagner (eds.), Graph Partitioning and Graph Clustering, American Mathematical Society, Chapter 14, pages 207-222, 2013. • S. Kang, D.A. Bader, and R. Vuduc, "Energy-Efficient Scheduling for Best-Effort Interactive Services to Achieve High Response Quality," 27th IEEE International Parallel and Distributed Processing Symposium (IPDPS), Boston, MA, May 20-24, 2013. • J. Riedy and D.A. Bader, "Multithreaded Community Monitoring for Massive Streaming Graph Data," 7th Workshop on Multithreaded Architectures and Applications (MTAAP), Boston, MA, May 24, 2013. • D. Ediger and D.A. Bader, "Investigating Graph Algorithms in the BSP Model on the Cray XMT," 7th Workshop on Multithreaded Architectures and Applications (MTAAP), Boston, MA, May 24, 2013. • O. Green and D.A. Bader, "Faster Betweenness Centrality Based on Data Structure Experimentation," International Conference on Computational Science (ICCS), Barcelona, Spain, June 5-7, 2013. • Z. Yin, J. Tang, S. Schaeffer, and D.A. Bader, "Streaming Breakpoint Graph Analytics for Accelerating and Parallelizing the Computation of DCJ Median of Three Genomes," International Conference on Computational Science (ICCS), Barcelona, Spain, June 5-7, 2013. • T. Senator, D.A. 
Bader, et al., "Detecting Insider Threats in a Real Corporate Database of Computer Usage Activities," 19th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), Chicago, IL, August 11-14, 2013. • J. Fairbanks, D. Ediger, R. McColl, D.A. Bader and E. Gilbert, "A Statistical Framework for Streaming Graph Analysis," IEEE/ACM International Conference on Advances in Social Networks Analysis and Modeling (ASONAM), Niagara Falls, Canada, August 25-28, 2013. • A. Zakrzewska and D.A. Bader, "Measuring the Sensitivity of Graph Metrics to Missing Data," 10th International Conference on Parallel Processing and Applied Mathematics (PPAM), Warsaw, Poland, September 8-11, 2013. • O. Green and D.A. Bader, "A Fast Algorithm for Streaming Betweenness Centrality," 5th ASE/IEEE International Conference on Social Computing (SocialCom), Washington, DC, September 8-14, 2013. • R. McColl, O. Green, and D.A. Bader, "A New Parallel Algorithm for Connected Components in Dynamic Graphs," The 20th Annual IEEE International Conference on High Performance Computing (HiPC), Bangalore, India, December 18-21, 2013. 49David A. Bader19 July 2018
  • 50. Bader, Related Recent Publications (2014-2015) • R. McColl, D. Ediger, J. Poovey, D. Campbell, and D.A. Bader, "A Performance Evaluation of Open Source Graph Databases," The 1st Workshop on Parallel Programming for Analytics Applications (PPAA 2014) held in conjunction with the 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP 2014), Orlando, Florida, February 16, 2014. • O. Green, L.M. Munguia, and D.A. Bader, "Load Balanced Clustering Coefficients," The 1st Workshop on Parallel Programming for Analytics Applications (PPAA 2014) held in conjunction with the 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP 2014), Orlando, Florida, February 16, 2014. • A. McLaughlin and D.A. Bader, "Revisiting Edge and Node Parallelism for Dynamic GPU Graph Analytics," 8th Workshop on Multithreaded Architectures and Applications (MTAAP), held in conjunction with The IEEE International Parallel and Distributed Processing Symposium (IPDPS 2014), Phoenix, AZ, May 23, 2014. • Z. Yin, J. Tang, S. Schaeffer, D.A. Bader, "A Lin-Kernighan Heuristic for the DCJ Median Problem of Genomes with Unequal Contents," 20th International Computing and Combinatorics Conference (COCOON), Atlanta, GA, August 4-6, 2014. • Y. You, D.A. Bader and M.M. Dehnavi, "Designing an Adaptive Cross-Architecture Combination for Graph Traversal," The 43rd International Conference on Parallel Processing (ICPP 2014), Minneapolis, MN, September 9-12, 2014. • A. McLaughlin, J. Riedy, and D.A. Bader, "Optimizing Energy Consumption and Parallel Performance for Betweenness Centrality using GPUs," The 18th Annual IEEE High Performance Extreme Computing Conference (HPEC), Waltham, MA, September 9-11, 2014. • A. McLaughlin and D.A. Bader, "Scalable and High Performance Betweenness Centrality on the GPU," The 26th IEEE and ACM Supercomputing Conference (SC14), New Orleans, LA, November 16-21, 2014. Best Student Paper Finalist. • D. Dauwe, E. 
Jonardi, R. Friese, S. Pasricha, A.A. Maciejewski, D.A. Bader, and H.J. Siegel, “A Methodology for Co-Location Aware Application Performance Modeling in Multicore Computing,” 17th Workshop on Advances in Parallel and Distributed Computational Models (APDCM), Hyderabad, India, May 25, 2015. • A. Zakrzewska and D.A. Bader, “Fast Incremental Community Detection on Dynamic Graphs,” 11th International Conference on Parallel Processing and Applied Mathematics (PPAM), Krakow, Poland, September 6-9, 2015. • A. McLaughlin, J. Riedy, and D.A. Bader, “An Energy-Efficient Abstraction for Simultaneous Breadth-First Searches,” The 19th Annual IEEE High Performance Extreme Computing Conference (HPEC), Waltham, MA, September 15-17, 2015. • A. McLaughlin, D. Merrill, M. Garland and D.A. Bader, “Parallel Methods for Verifying the Consistency of Weakly-Ordered Architectures,” The 24th International Conference on Parallel Architectures and Compilation Techniques (PACT), San Francisco, CA, October 18-21, 2015. • A. McLaughlin and D.A. Bader, “Fast Execution of Simultaneous Breadth-First Searches on Sparse Graphs,” The 21st IEEE International Conference on Parallel and Distributed Systems (ICPADS), Melbourne, Australia, December 14-17, 2015.
  • 51. Bader, Related Recent Publications (2016-2017) • David Bader, Aleksandra Michalewicz, Oded Green, Jessie Birkett-Rees, Jason Riedy, James Fairbanks, and Anita Zakrzewska, “Semantic database applications at the Samtavro Cemetery, Georgia,” , The 44th Computer Applications and Quantitative Methods in Archaeology Conference (CAA), Oslo, Norway, March 29 – April 2, 2016. • Vipin Sachdeva, Srinivas Aluru, David A. Bader, “A Memory and Time Scalable Parallelization of the Reptile Error-Correction Code,” 15th IEEE International Workshop on High Performance Computational Biology (HiCOMB), Chicago, IL, May 23, 2016. • James Fairbanks, Anita Zakrzewska, and David A. Bader, “New Stopping Criteria For Spectral Partitioning,” IEEE/ACM International Conference on Advances in Social Networks Analysis and Modeling (ASONAM), San Francisco, CA, August 18-21, 2016. • Anita Zakrzewska, Eisha Nathan, James Fairbanks, and David A. Bader, “A Local Measure of Community Change in Dynamic Graphs,” IEEE/ACM International Conference on Advances in Social Networks Analysis and Modeling (ASONAM), San Francisco, CA, August 18-21, 2016. • Anita Zakrzewska and David A. Bader, “Aging Data in Dynamic Graphs: A Comparative Study,” 2nd International Workshop on Dynamics in Networks (DyNo), held in conjunction with IEEE/ACM International Conference on Advances in Social Networks Analysis and Modeling (ASONAM), San Francisco, CA, August 18, 2016. • O. Green and D.A. Bader, “cuSTINGER: Supporting Dynamic Graph Algorithms for GPUs,” The 20th Annual IEEE High Performance Extreme Computing Conference (HPEC), Waltham, MA, September 13-15, 2016. • Jeremy Kepner, Peter Aaltonen, David A. Bader, Aydin Buluc, Franz Franchetti, John Gilbert, Dylan Hutchison, Manoj Kumar, Andrew Lumsdaine, Henning Meyerhenke, Scott McMillan, Jose Moreira, John D. 
Owens, Carl Yang, Marcin Zalewski, and Timothy Mattson, “Mathematical Foundations of the GraphBLAS,” The 20th Annual IEEE High Performance Extreme Computing Conference (HPEC), Waltham, MA, September 13-15, 2016. • X. Hui, Z. Du, J. Liu, H. Sun, Y. He and D.A. Bader, “When Good Enough Is Better: Energy-Aware Scheduling for Multicore Servers,” 13th Workshop on High-Performance, Power-Aware Computing (HPPAC), Orlando, FL, May 29, 2017. • E. Nathan, G. Sanders, J. Fairbanks, V. Henson and D.A. Bader, “Graph Ranking Guarantees for Numerical Approximations to Katz Centrality,” International Conference on Computational Science (ICCS), Zurich, Switzerland, June 12-14, 2017. • Anita Zakrzewska and David A. Bader, “Streaming Graph Sampling with Size Restrictions,” IEEE/ACM International Conference on Advances in Social Networks Analysis and Modeling (ASONAM), Sydney, Australia, July 31 - August 3, 2017. • Eisha Nathan and David A. Bader, “A Dynamic Algorithm for Updating Katz Centrality in Graphs,” IEEE/ACM International Conference on Advances in Social Networks Analysis and Modeling (ASONAM), Sydney, Australia, July 31 - August 3, 2017. • E. Nathan and D.A. Bader, “Approximating Personalized Katz Centrality in Dynamic Graphs,” 12th International Conference on Parallel Processing and Applied Mathematics (PPAM), Lublin, Poland, September 10-13, 2017. • O. Green, J. Fox, E. Kim, F. Busato, N. Bombieri, K. Lakhotia, S. Zhou, S. Singapura, H. Zeng, R. Kannan, V. Prasanna, D. Bader, “Quickly Finding a Truss in a Haystack”, IEEE High Performance Extreme Computing Conference (HPEC), Waltham, Massachusetts, 2017 (HPEC Graph Challenge Innovation Award). • S. Zhou, K. Lakhotia, S. Singapura, H. Zeng, R. Kannan, V. Prasanna, J. Fox, E. Kim, O. Green, D. Bader, “Design and Implementation of Parallel PageRank on Multicore Platforms”, IEEE High Performance Extreme Computing Conference (HPEC), Waltham, Massachusetts, 2017 (HPEC Graph Challenge Student Innovation Award). • D. Makkar, D. 
Bader, O. Green, “Deterministic and Parallel Triangle Counting in Streaming Graphs”, IEEE International Conference on High Performance Computing, Data, and Analytics, Jaipur, India, 2017.
  • 52. Opportunities • Application-oriented opportunities: • High-performance computing for massive graphs • Streaming analytics • Information visualization techniques for massive graphs • Heterogeneous systems: methodologies for combining the Cloud and manycore systems for high-performance computing • Energy-efficient high-performance computing
  • 53. Opportunity 1: High performance computing for massive graphs • Traditional HPC has focused primarily on solving large problems from chemistry, physics, and mechanics, using dense linear algebra. • HPC now faces new challenges: • time-varying interactions among entities, and • massive-scale graph abstractions in which the vertices represent the nouns or entities and the edges represent their observed interactions. • Few parallel computers run well on these problems because the problems often lack the locality required to get high performance from distributed-memory, cache-based supercomputers. • Case study: Massively threaded architectures have been shown to run several orders of magnitude faster than the fastest supercomputers on these types of problems. • A focused research agenda is needed to design algorithms that scale on these new platforms.
  • 54. Opportunity 2: Streaming analytics • While our high performance computers have delivered a sustained petaflop, they have done so using the same antiquated batch processing style, in which a program and a static data set are scheduled to compute in the next available slot. • Today, data is overwhelming in both volume and rate, and we struggle to keep up with these streams. • Fundamental computer science research is needed in the design of streaming architectures, and of data structures and algorithms that can compute important analytics while sitting in the middle of these torrential flows.
  • 55. Opportunity 3: Information Visualization techniques for massive graphs • Information Visualization today • addresses traditional scientific computing (fluid flow, molecular dynamics), or • when handling discrete data, scales to only hundreds of vertices at best. • However, there is a strong need for visualization in the data sciences so that analysts can gain understanding from data sets with millions to billions of interacting non-planar discrete entities. • Applications include: data mining, intelligence, situational awareness [Figures: NNDB Mapper of George Washington; Twitter social network drawn with Large Graph Layout (source: Akshay Java, ebiquity group)]
  • 56. Opportunity 4: Heterogeneous Systems, Cloud, Internet of Things: Methodologies for combining the use of the Cloud, IoT, and accelerators • Today, there is a dichotomy between using clouds (e.g., Hadoop, MapReduce) for massive data storage, filtering, and summarization, and using massively parallel/multithreaded systems for data-intensive computation. We must develop methodologies for employing these complementary systems to solve grand challenges in data analysis. [Figure: Steve Mills, SVP of IBM Software (left), and Dr. John Kelly, SVP of IBM Research, view Stream Computing technology]
  • 57. Opportunity 5: Energy-efficient high-performance computing • The main constraint on our ability to compute has changed • from the availability of compute resources • to the ability to power and cool our systems within budget. • Holistic research is needed that can permeate from the architecture and systems up to the applications and data centers, whereby energy use is a first-class object that can be optimized at all levels. [Figure: Microsoft’s Chicago Million Server DataCenter]
  • 58. Acknowledgment of Support
  • 59. Backup Slides
  • 60. HPEC Graph Challenge “Innovation Award” • Static Graph Challenge: Subgraph Isomorphism • Finding the maximal K-Truss subgraph • A Truss is a relaxation of a clique that still retains substantial connectivity (in a k-truss, every edge belongs to at least k-2 triangles) • In collaboration with USC and the University of Verona • O. Green, J. Fox, E. Kim, F. Busato, N. Bombieri, K. Lakhotia, S. Zhou, S. Singapura, H. Zeng, R. Kannan, V. Prasanna, D. Bader, “Quickly Finding a Truss in a Haystack”, IEEE High Performance Extreme Computing Conference (HPEC), Waltham, Massachusetts, 2017 (HPEC Graph Challenge Innovation Award) • Trusses are found by deleting edges from the graph. • Uses dynamic graph techniques to solve a static graph problem. • Greatly reduces the amount of work needed to find the Truss.
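The edge-deletion idea on this slide can be illustrated with a small sequential sketch. This is a textbook peeling loop, not the award-winning implementation: the function names `k_truss` and `max_truss` are hypothetical, vertices are assumed to be integers, and the graph is assumed to be simple and undirected.

```python
def k_truss(edges, k):
    """Return the edges of the k-truss: every surviving edge lies in >= k-2 triangles."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    # Start by checking every edge once; deleted edges re-queue their neighbours.
    pending = [(u, v) for u in adj for v in adj[u] if u < v]
    while pending:
        u, v = pending.pop()
        if v not in adj[u]:
            continue  # edge was already deleted
        common = adj[u] & adj[v]  # each common neighbour closes one triangle
        if len(common) < k - 2:
            adj[u].discard(v)
            adj[v].discard(u)
            # Deleting (u, v) breaks the triangles through it, so the other
            # two edges of each broken triangle must be re-checked.
            for w in common:
                pending.append((min(u, w), max(u, w)))
                pending.append((min(v, w), max(v, w)))
    return sorted((u, v) for u in adj for v in adj[u] if u < v)


def max_truss(edges):
    """Return (k, edges) for the largest k whose k-truss is non-empty."""
    k, best = 2, k_truss(edges, 2)
    while True:
        nxt = k_truss(best, k + 1)
        if not nxt:
            return k, best
        k, best = k + 1, nxt
```

For a 4-clique with one pendant edge attached, `max_truss` returns k = 4 together with the six clique edges: the pendant edge is in no triangle and is peeled first. Re-checking only the edges around each deletion is the "dynamic" flavour the slide refers to, although the sketch makes no claim about matching the paper's data structures.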
  • 61. HPEC Graph Challenge “Student Innovation Award” • In collaboration with USC • S. Zhou, K. Lakhotia, S. Singapura, H. Zeng, R. Kannan, V. Prasanna, J. Fox, E. Kim, O. Green, D. Bader, “Design and Implementation of Parallel PageRank on Multicore Platforms”, IEEE High Performance Extreme Computing Conference (HPEC), Waltham, Massachusetts, 2017 (HPEC Graph Challenge Student Innovation Award) • Uses an edge-centric approach: the edge list is partitioned into shards. • Shards are sorted by destination. • Greatly reduces the number of cache misses and the amount of data movement. • Faster than the PageRank Pipeline benchmark: 2.5X faster than the multi-threaded version of PageRank Pipeline. • Applicable to accelerators such as FPGAs
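A minimal sequential sketch of an edge-centric PageRank with destination-sorted shards is given below. It only illustrates the data layout; how the paper actually forms its shards is not stated on the slide, so partitioning by destination range is an assumption here, as are the function name `pagerank_edge_centric` and all parameter defaults.

```python
def pagerank_edge_centric(num_vertices, edges, num_shards=2, damping=0.85, iters=50):
    """PageRank that streams over edge shards instead of per-vertex adjacency lists."""
    n = num_vertices
    out_deg = [0] * n
    for src, _ in edges:
        out_deg[src] += 1

    # Assumption: each shard owns a contiguous range of destination vertices,
    # so all writes made while streaming a shard hit one slice of the rank vector.
    span = -(-n // num_shards)  # ceiling division
    shards = [[] for _ in range(num_shards)]
    for src, dst in edges:
        shards[dst // span].append((src, dst))
    for shard in shards:
        shard.sort(key=lambda e: e[1])  # sorted by destination, as on the slide

    rank = [1.0 / n] * n
    for _ in range(iters):
        # Rank held by vertices with no out-edges is spread uniformly.
        dangling = sum(rank[v] for v in range(n) if out_deg[v] == 0)
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = [base] * n
        for shard in shards:  # in a parallel version, shards are independent work units
            for src, dst in shard:
                new_rank[dst] += damping * rank[src] / out_deg[src]
        rank = new_rank
    return rank
```

Because consecutive edges in a shard update neighbouring entries of `new_rank`, the write pattern is close to sequential, which is the intuition behind the reduced cache misses claimed on the slide.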
  • 62. Graph500 Benchmark, www.graph500.org • Defining a new set of benchmarks to guide the design of hardware architectures and software systems intended to support such applications and to help procurements. Graph algorithms are a core part of many analytics workloads. • Executive Committee: D.A. Bader, R. Murphy, M. Snir, A. Lumsdaine • Five Business Area Data Sets: • Cybersecurity: 15 billion log entries/day (for large enterprises); full data scan with end-to-end join required • Medical Informatics: 50M patient records, 20-200 records/patient, billions of individuals; entity resolution important • Social Networks: examples include Facebook and Twitter; nearly unbounded dataset size • Data Enrichment: easily petabytes of data; example: Maritime Domain Awareness, with hundreds of millions of transponders, tens of thousands of cargo ships, and tens of millions of pieces of bulk cargo; may involve additional data (images, etc.) • Symbolic Networks: example, the human brain, with 25B neurons and 7,000+ connections/neuron
  • 63. Heterogeneity in “Big Data” systems: High Performance Data Analytics • Analytic platforms will combine: • Cloud (Hadoop/MapReduce) • Stream processing • Large shared-memory systems • Massive multithreaded architectures • Multicore and accelerators • The challenge: developing methodologies for employing these complementary systems in an enterprise-class analytics framework for solving grand challenges in massive data analysis for discovery, real-time analytics, and forensics. [Figure: Steve Mills, SVP of IBM Software (left), and Dr. John Kelly, SVP of IBM Research, view Stream Computing technology]
  • 64. Future Architectures • Highly multithreaded • High bandwidth (network and memory) • Complex but flexible memory hierarchy • Heterogeneous design in core capability and ISA
  • 65. National Strategic Computing Initiative • (29 July 2015, The White House) The National Strategic Computing Initiative (NSCI) is an effort to create a cohesive, multi-agency strategic vision and Federal investment strategy in high-performance computing (HPC). • This strategy will be executed in collaboration with industry and academia, maximizing the benefits of HPC for the United States. • HPC systems, through a combination of processing capability and storage capacity, can solve computational problems that are beyond the capability of small- to medium-scale systems. They are vital to the Nation’s interests in science, medicine, engineering, technology, and industry. • The NSCI will spur the creation and deployment of computing technology at the leading edge, helping to advance Administration priorities for economic competitiveness, scientific discovery, and national security. • The National Strategic Computing Initiative has five strategic themes. 1. Create systems that can apply exaflops of computing power to exabytes of data. 2. Keep the United States at the forefront of HPC capabilities. 3. Improve HPC application developer productivity. 4. Make HPC readily available. 5. Establish hardware technology for future HPC systems.
  • 66. NSCI Anniversary Meeting, 29 July 2016
  • 67. Dynamic Graphs and Streaming Support [Figure: graph frameworks plotted by dynamic support (static to dynamic) against streaming update rate; labeled systems: Galois, TitanDB, GraphX, GraphLab, Giraph, Gunrock, Boost, GraphCHI, Ligra, STINGER, Hornet]
  • 68. Performance and Data Scalability [Figure: graph frameworks plotted by data size (small to large) against performance scalability; labeled systems: Galois, TitanDB, GraphX, GraphLab, Giraph, Gunrock, Boost, GraphCHI, Ligra, STINGER, GPU-Hornet, CPU-Hornet]
  • 69. Overview of STINGER and Hornet Capabilities • Columns: Algorithm | Static | Dynamic | Implementation | Formulation • Breadth first search | ✓ | ✓ | CPU+GPU | VC • Triangle Counting | ✓ | ✓ | CPU+GPU | VC • Connected components | ✓ | ✓ | CPU+GPU | VC • Betweenness Centrality | ✓ | ✓ | CPU+GPU | VC • Page Rank | ✓ | ✓ | CPU+GPU | VC, BLAS • Katz Centrality | ✓ | ✓ | CPU+GPU | VC, BLAS • Community Detection | ✓ | ✓ | CPU | VC • Seed Set Expansion | ✓ | ✓ | CPU | VC • K-Truss Decomposition | ✓ | | GPU | VC • Maximal Independent Set | ✓ | | GPU | VC • Legend: VC - Vertex Centric, BLAS
  • 70. Graph Analytics (Dynamic vs. Streaming) • Many libraries support updates to the graph • They do not have dynamic graph analytics [Figure: frameworks plotted by streaming rate (slow/very low to fast) against graph and algorithm properties (static to dynamic); labeled systems: STINGER, GraphLab, Giraph, GraphX, Galois, TitanDB]
  • 71. Scalability (Volume vs. Performance) • Many libraries support updates to the graph • They do not have dynamic graph analytics [Figure: frameworks plotted by performance scalability (low to fast) against data scalability (small to large); labeled systems: Boost, Gunrock, STINGER, GraphLab, Giraph, GraphX, Galois, TitanDB, GraphCHI]
  • 72. PageRank - Performance Analysis • R-MAT Graph • Vertices: 16M • Edges: 128M • Lower is better. • 2 orders of magnitude difference in performance. • Still outperforms other static-only graph packages. • Outperforms the distributed systems even for large networks with plenty of computational demand! • Some platforms did not complete in a reasonable amount of time.
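For readers unfamiliar with the R-MAT inputs named on this slide, the generator can be sketched in a few lines. The slide gives only the graph size (16M vertices, 128M edges, i.e. scale 24 with 8 edges per vertex); the quadrant probabilities below are the Graph500 defaults and are an assumption here, as is the function name `rmat_edges`.

```python
import random


def rmat_edges(scale, num_edges, a=0.57, b=0.19, c=0.19, seed=0):
    """Generate directed R-MAT edges over 2**scale vertices.

    Each edge picks one quadrant of the adjacency matrix per bit of the
    vertex id, with probabilities a, b, c and d = 1 - a - b - c.
    """
    rng = random.Random(seed)
    edges = []
    for _ in range(num_edges):
        src = dst = 0
        for _ in range(scale):
            src <<= 1
            dst <<= 1
            r = rng.random()
            if r < a:
                pass              # top-left quadrant: both bits 0
            elif r < a + b:
                dst |= 1          # top-right quadrant
            elif r < a + b + c:
                src |= 1          # bottom-left quadrant
            else:
                src |= 1          # bottom-right quadrant
                dst |= 1
        edges.append((src, dst))
    return edges
```

Calling `rmat_edges(24, 8 * 2**24)` would correspond to the size on the slide, although pure Python is far too slow for that; a small call such as `rmat_edges(10, 8 * 2**10)` is enough to see the skewed degree distribution. The sketch keeps duplicate edges and self-loops, which benchmark pipelines typically filter or permute away.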
  • 73. Other algorithms • R-MAT Graph • Vertices: 1M • Edges: 8M • STINGER is orders of magnitude faster. • Still outperforms other static-only graph packages. [Figure panels: Static Single-Source Shortest Path; Static Connected Components]
  • 74. High Performance Data Analytics (HPDA) • With Pacific Northwest National Lab • Project objectives • Develop novel tools and algorithms for dealing with massive dynamic networks. • Enable analysts to search through vast data at near real-time speeds. • Improve the accuracy of past approaches through the use of community-centric analysis • Successes • Developed the first scalable dynamic graph data structure for the GPU (that also works for CPUs). The data structure supports over 90 million updates per second. • Designed novel dynamic algorithms for Katz Centrality and triangle counting • Developed personalized centrality metrics • Sketched out an asynchronous model, the first of its kind, for analyzing the correctness of dynamic graph algorithms when the underlying graph is changing.
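To make "dynamic algorithm" concrete, the sketch below maintains a global triangle count under edge insertions and deletions without recounting from scratch. It is the standard neighbourhood-intersection update, not the project's GPU algorithm; the class name `DynamicTriangleCounter` is hypothetical and the graph is assumed to be simple and undirected.

```python
class DynamicTriangleCounter:
    """Keep a running triangle count while edges are inserted and deleted."""

    def __init__(self):
        self.adj = {}       # vertex -> set of neighbours
        self.triangles = 0  # current number of triangles in the graph

    def insert(self, u, v):
        if u == v or v in self.adj.get(u, ()):
            return  # ignore self-loops and duplicate edges
        nu = self.adj.setdefault(u, set())
        nv = self.adj.setdefault(v, set())
        # Every common neighbour of u and v closes one new triangle.
        self.triangles += len(nu & nv)
        nu.add(v)
        nv.add(u)

    def delete(self, u, v):
        if v not in self.adj.get(u, ()):
            return  # edge not present
        self.adj[u].discard(v)
        self.adj[v].discard(u)
        # Every common neighbour of u and v loses the triangle through (u, v).
        self.triangles -= len(self.adj[u] & self.adj[v])
```

Each update costs one set intersection over the two endpoint neighbourhoods, independent of the size of the rest of the graph, which is what makes high update rates plausible for this kind of analytic.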
  • 75. Leveraging High Performance Computing for Mixed-Integer Programming • In a joint collaboration with the ExxonMobil Upstream Research Company, we focus on developing effective Mixed Integer Programming (MIP) methods for difficult planning and scheduling problems arising in the petrochemical industry. • Challenges • MIPs are NP-hard problems. • Available parallel algorithms show poor scalability. • Goals • To study and develop new scalable parallel algorithms for MIP solving. • Our focus is on large-scale industrial optimization problems. • To offer MIP practitioners a portfolio of parallel algorithms that emphasize finding high-quality solutions quickly as well as proving optimality.
  • 76. NSF XSCALA • A Community Repository for Model-driven Design and Tuning of Data-Intensive Applications for Extreme-scale Accelerator-based Systems • Collaboration with Georgia Tech and University of Southern California • Launched 2012 • Challenges • Sparse, data-intensive computations with irregular memory access patterns. • Extremely hard for compilers to parallelize, so hand tuning is required for new architectures and platforms. • Goals • Design tools for automatic fine-tuning and optimization • Develop runtime scheduling techniques for load balancing and for hardware selection • Offer programmers an intuitive modeling environment for design-time and run-time optimizations.
  • 77. Parallel Alternating Criteria Search • A parallel Large Neighborhood Search (LNS) heuristic aimed at obtaining high-quality feasible solutions for general Mixed Integer Programs (MIPs). • Solution improvements are found by solving in parallel a large number of restricted subproblems, which are derived from the original problem. • It is the first parallel general-purpose heuristic developed for MIPs. • Significantly more scalable and effective at finding high-quality solutions than current commercial MIP solvers, especially for large-scale MIP instances. • The framework provides an excellent platform for the rapid prototyping of highly effective problem-specific heuristics, which exploit problem structure.
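The "many restricted subproblems solved in parallel" idea can be shown on a toy problem. The sketch below is a generic Large Neighborhood Search for a 0/1 knapsack, not Parallel Alternating Criteria Search itself (it has no alternating criteria and no MIP solver); the function names, the brute-force subproblem solver, and all defaults are assumptions made for illustration.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def solve_restricted(values, weights, capacity, incumbent, free):
    """Exactly solve the subproblem in which only the `free` variables may change."""
    free_set = set(free)
    fixed = [i for i, x in enumerate(incumbent) if x and i not in free_set]
    fixed_weight = sum(weights[i] for i in fixed)
    fixed_value = sum(values[i] for i in fixed)
    best = list(incumbent)
    best_val = sum(v for v, x in zip(values, incumbent) if x)
    for bits in product((0, 1), repeat=len(free)):  # brute force over free variables
        w = fixed_weight + sum(weights[i] for i, bit in zip(free, bits) if bit)
        if w > capacity:
            continue  # infeasible assignment
        val = fixed_value + sum(values[i] for i, bit in zip(free, bits) if bit)
        if val > best_val:
            best = list(incumbent)
            for i, bit in zip(free, bits):
                best[i] = bit
            best_val = val
    return best_val, best


def lns_knapsack(values, weights, capacity, rounds=20, neighborhoods=4,
                 free_size=3, seed=0):
    """Improve a feasible solution by solving several neighborhoods per round."""
    rng = random.Random(seed)
    n = len(values)
    incumbent = [0] * n  # the empty knapsack is always feasible
    k = min(free_size, n)
    with ThreadPoolExecutor() as pool:
        for _ in range(rounds):
            frees = [rng.sample(range(n), k) for _ in range(neighborhoods)]
            # The subproblems are independent, so they can run concurrently.
            results = list(pool.map(
                lambda f, inc=incumbent: solve_restricted(
                    values, weights, capacity, inc, f),
                frees))
            _, incumbent = max(results, key=lambda r: r[0])
    return sum(v for v, x in zip(values, incumbent) if x), incumbent
```

Every subproblem includes the incumbent as a candidate, so the objective never gets worse from round to round. CPython threads do not give real speedup for this CPU-bound loop; the thread pool only shows where the independence between subproblems could be exploited by processes or by a distributed framework.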