Spark Technology Center IBM
1. Spark Technology Center
IBM Apache Spark
The start of something big in data and design.
J. White Bear
Spark Technology Center
2. IBM Spark
IBM Investment in Computing
Linux, 1999
13,000,000 lines of code.
500+ Server Solutions
Ushered in Computer Science
System 360, 1964
10,000,000 lines of code.
54 Peripheral Solutions
Ushered in Information Science
Apache Spark, 2015
400,000 lines of code.
20 Data & Analytics Solutions
Ushered in Data Science
3. IBM Spark
About Me
Education
• University of Michigan: Computer Science
• Databases, Machine Learning/Computational Biology, Cryptography
• University of California, San Francisco
• Multi-objective Optimization/Computational Biology/Bioinformatics
• McGill University
• Machine Learning/Multi-objective Optimization for Path Planning/Cryptography
Industry
• IBM (6 months)
• Amazon
• TeraGrid
• Pfizer
• Research at UC Berkeley, Purdue University, and every university I ever attended. :)
Fun Facts (?)
I love research for its own sake. I like robots, helping to cure diseases, advocating for social change and reform, and breaking encryptions. Also, all activities involving the ocean, and I usually hate taking pictures. :)
4. IBM Spark
Outline
Introduction
• Brief overview of the state and direction of robotics
What is SLAM?
• Definition of SLAM
• Key Challenges
Why SLAM on IoT/Spark?
• Benefits
• Current Approaches
The Framework
• The Approach
• Framework and Architecture
The Results
• Challenges / Recommendations
Next Steps
Demo with Gazebo
Questions and Answers
5. IBM Spark
Introduction: Robotics Today
FIRST Robotics World Championship
NASA Glenn Research Center in Cleveland sponsored Tri-C's team.
Tartan Racing's Boss, the robotic SUV that won the 2007 DARPA Urban Challenge
South Korean team KAIST wins the DARPA Robotics Challenge
Amazon Drones
6. IBM Spark
Introduction: Robotics Tomorrow
Navigate stores, museums, and other indoor locations, with directions overlaid onto your surroundings. Google Tango
Nanorobots wade through blood to deliver drugs
Space/underground/underwater rescue and exploration. Places humans can't go.
SLAM and ML on an automated wheelchair
7. IBM Spark
What is SLAM?
Simultaneous Localization and Mapping (SLAM)
• Formal Definition
• Given a series of sensor observations over discrete time steps, the SLAM problem is to compute an estimate of the agent's location and a map of the environment. All quantities are usually probabilistic, so the objective is to compute, as an example variant, a posterior over the map and the agent's location given the observations.
• The computational problem of constructing or updating a map of an unknown environment while simultaneously keeping track of an agent's location within it.
• SLAM algorithms use heuristics, machine learning, and probabilistic models to make this problem tractable.
• GPS cannot account for unknown barriers, precision navigation, moving objects, or areas with satellite interference, including weather phenomena.
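The probabilistic objective in the formal definition can be written as a posterior. The notation below is an assumption borrowed from the common textbook formulation, not recovered from the slide: $x_t$ is the agent's location at time step $t$, $m_t$ is the map, and $o_{1:t}$ are the sensor observations up to time $t$.

```latex
P(m_t, x_t \mid o_{1:t})
```

Variants of this expression also condition on the control inputs given to the agent; the form above is the simplest one consistent with the definition on this slide.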
8. IBM Spark
What is SLAM?
What are some of the key challenges in SLAM?
• Computer vision: correctly identifying the images observed.
• Moving objects: non-static environments, such as those containing other vehicles or pedestrians, continue to present research challenges (collision detection).
• Data association: the problem of ascertaining which parts of one image correspond to which parts of another image, where differences are due to movement of the camera, the elapse of time, and/or movement of objects in the images.
• Loop closure: the problem of recognizing a previously visited location and updating the states accordingly.
10. IBM Spark
Why SLAM on IoT?
SLAM in IoT
• "[SLAM] is one of the fundamental challenges of robotics . . . [but it] seems that almost all the current approaches can not perform consistent maps for large areas, mainly due to the increase of the computational cost and due to the uncertainties that become prohibitive when the scenario becomes larger."[12] Generally, complete 3D SLAM solutions are highly computationally intensive, as they use complex real-time particle filters, sub-mapping strategies, or hierarchical combinations of metric topological representations, etc. (Wikipedia)
• Computational costs become prohibitive on embedded systems, especially smaller robotic modules. The data becomes large, and the calculations and corrections over time and space become much more important. Specifically, the computational cost of SLAM increases exponentially with the number of landmarks found.
• The state uncertainty increases with time and space, and must be bounded by some form of machine learning that predicts and applies accurate corrections in the algorithm.
• Additional sensors, rapid movements, and processing of visual input add further computational burdens…
11. IBM Spark
Why SLAM on IoT?
The Benefits
• Seamless integration and scaling, allowing users to easily improve the heuristics of the algorithm without losing any of the performance expectations of an embedded system.
• Applications including smart cities, lawn mowing, dog walking, kitchen appliances, or even communication inside the human body, creating a truly unique interaction between humans and robotics.
• Large-scale evaluation of performance metrics for all IoT systems (Big Data).
• Monitoring and control of sensors based on stored data (e.g., reducing sensor usage to conserve power).
12. IBM Spark
Why SLAM on IoT?
Current Approaches
• Robot Operating System (ROS): a collection of software frameworks for robot software development.
• Provides operating-system-like functionality on a heterogeneous computer cluster.
• Hardware abstraction, low-level device control, implementation of commonly used functionality, message passing between processes, and package management.
• No true real-time analytics! Despite the importance of reactivity and low latency in robot control, ROS is not a real-time OS.
• Difficult to scale in IoT! Adding a heterogeneous swarm or integrating interactions requires significant planning.
• There is a need! "Are there any plans to build Kalman filtering and system identification into this framework?" https://github.com/sryza/spark-timeseries/issues/19
• We need a framework that can do this! Enter Apache Kafka and Spark Streaming!
13. IBM Spark
The Framework
The Approach
• Extended Kalman Filter (matrix-based update/estimation)
• Nonlinear version of the Kalman filter, which linearizes about an estimate of the current mean and covariance. It is the de facto standard in the theory of nonlinear state estimation, e.g., navigation systems and GPS. (Wikipedia)
• TurtleBot (standard robotics research bot)
• Gazebo Simulator (3D simulator with sensor input and feedback)
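To make the linearization step concrete, here is a minimal scalar sketch of an extended Kalman filter in Python. It is not the talk's implementation (which is matrix-based and written in Java); the one-dimensional robot, the range-to-landmark measurement model, and all function names and noise values are assumptions chosen only for illustration.

```python
import math

def ekf_predict(x, p, u, q):
    """Prediction step: apply the motion model x' = x + u and grow the variance."""
    x_pred = x + u   # motion model f(x, u); its Jacobian F is 1
    p_pred = p + q   # P' = F P F^T + Q with F = 1
    return x_pred, p_pred

def range_to_landmark(x, landmark_x, landmark_y):
    """Nonlinear measurement model h(x): distance from (x, 0) to the landmark."""
    return math.hypot(landmark_x - x, landmark_y)

def ekf_update(x_pred, p_pred, z, landmark_x, landmark_y, r):
    """Update step: linearize h about x_pred and correct the estimate."""
    z_pred = range_to_landmark(x_pred, landmark_x, landmark_y)
    h_jac = -(landmark_x - x_pred) / z_pred  # dh/dx evaluated at x_pred
    s = h_jac * p_pred * h_jac + r           # innovation variance
    k = p_pred * h_jac / s                   # Kalman gain
    x_new = x_pred + k * (z - z_pred)
    p_new = (1.0 - k * h_jac) * p_pred
    return x_new, p_new

def run_demo():
    """Drive a robot along the x axis and track it from range readings."""
    true_x, x, p = 0.0, 0.5, 1.0   # true position, initial estimate, initial variance
    landmark = (10.0, 2.0)         # hypothetical landmark position
    for _ in range(20):
        true_x += 0.2
        x, p = ekf_predict(x, p, u=0.2, q=0.01)
        z = range_to_landmark(true_x, *landmark)  # noise-free reading, for clarity
        x, p = ekf_update(x, p, z, *landmark, r=0.05)
    return true_x, x, p
```

In a full EKF SLAM system the scalars become a state vector and covariance matrix that also hold the landmark positions, so each of these lines turns into a matrix operation (Jacobian, multiplication, inversion).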
17. IBM Spark
The Framework
IBM SoftLayer cluster with 3 nodes.
Node 1: Management Node
• Apache Kafka (multithreaded producers, each assigned a sensor)
• Simulator/Sensor Data
• Mapping Agent
Node 2: Hadoop/Spark
• Spark Streaming Consumer / Apache Kafka Producer to Simulator
• Spark Streaming
• Spark ML/Analytics
Node 3: Hadoop/Spark
• Spark Streaming Consumer / Apache Kafka Producer to Simulator
• Spark Streaming
• Spark ML/Analytics
18. IBM Spark
The Framework
[Diagram components: Apache Kafka, Spark Streaming, Spark ML/Analytics and Computation, Apache Kafka, Simulated TurtleBot]
• Odometry, pose, and orientation data for every movement.
• Laser scan data every 30 ms, with over 1,200 data points per read!
• And that is one robot, without even using all of its sensors!
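The laser scan figures above translate into a data rate with simple arithmetic. The sketch below works it through in Python; the function names are hypothetical, and because the slide says "over 1,200" points per read, the results are lower bounds.

```python
def laser_points_per_second(scan_period_ms, points_per_scan):
    """Data points produced per second by one laser scanner."""
    return points_per_scan * 1000 / scan_period_ms

def laser_points_per_batch(scan_period_ms, points_per_scan, batch_interval_ms):
    """Data points that arrive during one streaming batch interval."""
    return points_per_scan * batch_interval_ms / scan_period_ms

# One scan every 30 ms with 1,200 points per scan (figures from the slide).
rate = laser_points_per_second(30, 1200)

# A 1,500 ms batch interval is an assumed value, used only as an example.
per_batch = laser_points_per_batch(30, 1200, 1500)
```

With the slide's numbers, `rate` is 40,000 points per second for the laser scanner alone, before odometry, pose, and orientation data are added.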
A high-performing plug-and-play cloud for smart robotics, drones, and intelligent systems that allows easily tunable interactions for scientists and industry in any environment!
19. IBM Spark
The Framework
A high-performing plug-and-play cloud for smart robotics, drones, and intelligent systems that allows easily tunable interactions for scientists and industry in any environment!
• EKF is calculated primarily using matrix operations!
• Distributed raw sensor data using Apache Kafka. The number of sensors is limited only by the Kafka cluster!
• Improved performance using RDDs and Spark ML for computationally intensive tasks!
• Fast/optimized learning and analytics!
• Real-time sensor messaging!
• Easy sensor integration and scaling!
• Retention of data over time for improved optimizations and accuracy!
20. IBM Spark
The Framework: Apache Kafka
Kafka Integration
• Multithreaded producers for easy scaling and hardware timing
• Apache Kafka Java API backed by a thread pool to handle concurrency
• Allows shared instances of the producer
• Large-scale sensor distributions can be partitioned for easier analysis and significantly decreased latency
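The pattern described above (one worker per sensor, a thread pool, a single shared producer, and key-based partitioning) can be sketched as follows. This is a Python illustration rather than the talk's Java code, and `InMemoryProducer` is a hypothetical in-memory stand-in, not a Kafka client; the CRC32 hashing is likewise only for illustration and is not Kafka's own partitioner.

```python
import threading
import zlib
from concurrent.futures import ThreadPoolExecutor

class InMemoryProducer:
    """Hypothetical stand-in for a shared, thread-safe producer (not a Kafka client)."""

    def __init__(self, num_partitions):
        self._lock = threading.Lock()
        self.partitions = [[] for _ in range(num_partitions)]

    def send(self, key, value):
        # Deterministic key hashing keeps each sensor on one partition,
        # which preserves per-sensor ordering.
        index = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        with self._lock:
            self.partitions[index].append((key, value))
        return index

def publish_sensor(producer, sensor_id, readings):
    """One worker per sensor: publish that sensor's readings in order."""
    for reading in readings:
        producer.send(sensor_id, reading)
    return len(readings)

def run_producers(sensor_data, num_partitions=3):
    """Publish every sensor's readings concurrently through one shared producer."""
    producer = InMemoryProducer(num_partitions)  # one instance shared by all threads
    with ThreadPoolExecutor(max_workers=len(sensor_data)) as pool:
        futures = [pool.submit(publish_sensor, producer, sensor_id, readings)
                   for sensor_id, readings in sensor_data.items()]
        sent = sum(future.result() for future in futures)
    return producer, sent
```

Keying messages by sensor means a downstream consumer can analyze one partition at a time and still see each sensor's readings in the order they were produced.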
24. IBM Spark
The Framework: Spark ML, RANSAC
Spark ML with RANSAC
• RANSAC
• One of many iterative methods to estimate the parameters of a mathematical model from a set of observed data that contains outliers.
• Default methodology for determining whether a series of landmarks forms a wall or structure
• Ideal for consumption with high-throughput batches in Spark Streaming!
• Integrated in this framework as an online learning algorithm, running as a back-end iterative process in Spark Streaming/Spark!
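As an illustration of how RANSAC decides whether a set of landmark points forms a wall, here is a minimal single-machine sketch in Python that fits a 2D line while ignoring outliers. It is not the Spark implementation from the talk; the function names, the thresholds, and the fixed random seed are assumptions.

```python
import math
import random

def fit_line(p1, p2):
    """Return (a, b, c) with a*x + b*y + c = 0, normalized so a^2 + b^2 = 1."""
    (x1, y1), (x2, y2) = p1, p2
    a, b = y2 - y1, x1 - x2
    norm = math.hypot(a, b)
    if norm == 0.0:
        return None  # identical points do not define a line
    return a / norm, b / norm, -(a * x1 + b * y1) / norm

def ransac_line(points, iterations=200, threshold=0.05, min_inliers=10, seed=0):
    """Find the line supported by the most points; return (line, inliers)."""
    rng = random.Random(seed)
    best_line, best_inliers = None, []
    for _ in range(iterations):
        line = fit_line(*rng.sample(points, 2))  # hypothesis from a minimal sample
        if line is None:
            continue
        a, b, c = line
        # With a normalized line, |a*x + b*y + c| is the perpendicular distance.
        inliers = [p for p in points if abs(a * p[0] + b * p[1] + c) <= threshold]
        if len(inliers) > len(best_inliers):
            best_line, best_inliers = line, inliers
    if len(best_inliers) < min_inliers:
        return None, []  # too little support to call it a wall
    return best_line, best_inliers
```

Each iteration is independent of the others, which is one reason the method suits batch-oriented processing: hypotheses can be scored against a batch of landmark points in parallel.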
26. IBM Spark
The Results
Key Challenges
• Network latency
• Embedded vs. framework
• Matrix computations and updates to large matrices
• Jacobian (derivatives), inversion, transposition, multiplication, addition/subtraction, Gaussian
• Covariance/estimation computations
• Coordinating movement with computation
• Spark ML to correctly interpret visual landmark data, minimizing errors
27. IBM Spark
The Results
Challenges
• ~4 KLOC (Java != verbose) :)
• Java lambda documentation
• Kafka topics from the Spark Streaming consumer
• Real-life latency depends on the type of connection and creates additional noise
• Matrix computation
• Defining heuristics
• Communicator to the simulator: needs a solid class
28. IBM Spark
The Results
Measuring network latency in artificially throttled IO simulators. Timing was kept static to measure real message delays over the cluster and to and from the simulator, compared against file IO.
[Figures: PERF1 (with simulator) vs. PERF2 (file IO), at 10 iterations and at 200 iterations]
30. IBM Spark
The Results
Measuring landmark acquisition and CPU time, embedded vs. framework, at 500 iterations.
Framework completed 500 iterations with the expected exponential growth.
Embedded failed to complete 500 iterations (reached ~300).
31. IBM Spark
The Results
Measuring landmark acquisition and CPU time, embedded vs. framework, for a complete map.
Both installations were run until the number of landmarks/maps was roughly equivalent, and the iteration counts were recorded.
Iterations: ~100, Time: ~2 min
Iterations: ~100, Time: ~30-40 s
32. IBM Spark
The Results
Forthcoming Benchmarks.
Iterations: ~100 Iterations: ~20
• Apache Kafka latency to brokers
• RANSAC convergence of Spark Streaming batches
• Spark Streaming batch processing throughput in relation to processing time
33. IBM Spark
The Results
Performance Tuning and Optimization
• Sparse and distributed matrices in Spark ML
• Optimize matrix computations (EKF)
• Separate threads for Apache Kafka producers
• Spark Streaming batches timed to sensor input cycles to avoid heavy loads and misaligned updates (this could also be tuned using device profiles).
• Slower movement/reduced data points to synchronize calculations with movement and discovery
• Rapid movements produce larger RDDs; create new RDDs and matrices for updates using existing heuristics, since updates can sometimes create bottlenecks
• Standard Spark performance tuning: maximizing CPU core usage and tuning executors
• *Scheduled feature extraction to minimize accumulated error in long runs
• *New parameters or a large skew from ground truth should trigger updates
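One way to read the advice about timing batches to sensor input cycles is to make the batch interval a whole multiple of the sensor period, so every batch holds the same number of readings. The Python sketch below illustrates that idea under stated assumptions: both function names are hypothetical, and the 30 ms period and 100 ms target are example values, not settings from the talk.

```python
def aligned_batch_interval_ms(sensor_period_ms, target_interval_ms):
    """Round the target interval up to a whole number of sensor cycles."""
    cycles = max(1, -(-target_interval_ms // sensor_period_ms))  # ceiling division
    return cycles * sensor_period_ms

def group_into_batches(readings, batch_interval_ms):
    """Group (timestamp_ms, value) readings into fixed-width batches keyed by start time."""
    batches = {}
    for timestamp_ms, value in readings:
        start = (timestamp_ms // batch_interval_ms) * batch_interval_ms
        batches.setdefault(start, []).append(value)
    return batches

# Twelve readings, one every 30 ms.
readings = [(t, f"scan-{t}") for t in range(0, 360, 30)]

misaligned = group_into_batches(readings, 100)
aligned = group_into_batches(readings, aligned_batch_interval_ms(30, 100))
```

With a 100 ms interval the batches hold uneven numbers of scans, while the aligned 120 ms interval yields exactly four scans per batch, which keeps the per-batch matrix updates uniform in size.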
34. IBM Spark
Next Steps
• Expanded stochastic analysis beyond gradient descent
• Kalman Filter and Extended Kalman Filter
• Improving accuracy and precision with an end-to-end pipeline that allows customization/optimization
• Path planning algorithms to improve search and search times
• Incorporate swarms/particles
• A complete robotics library, or even an extension to handle robotics, computer vision, or any of the AI/machine learning problems specific to robotics, is publishable and opens the door to a whole new group of scientists.
• Further scaling and optimization with robotic swarms and rapid/increased-volume sensor data
35. IBM Spark
Conclusion
IBM IoT Cloud Open Platform for Industries
IBM Bluemix IoT Zone
IBM IoT Ecosystem
More to come…!!!
37. IBM Spark
Q & A
Contact Information:
J. White Bear (jwhiteb@us.ibm.com)
IBM Spark Technology Center
425 Market St San Francisco, CA
Special thanks to IBM and the IBM Spark team at the Spark Technology Center for your input, for taking time to discuss, and for allowing me time to work on this project.
Sampada Basakar
Vijay Bommireddipalli
Fred Reiss
Luciano Resende
Increase in error over time and readjusting this
Identifying landmarks
Not so simple after all. Actually, it's very computationally challenging, which is why we decided to move things to the cloud.
ROS is great, but can you really fine-tune your parameters and ML algorithms? Is it easily portable and integrated with the next generation of robots, which are going to need real-time processing and fast analytics spanning robots and sensors over time?
Unlike its linear counterpart, the extended Kalman filter in general is not an optimal estimator (of course it is optimal if the measurement and the state transition model are both linear, as in that case the extended Kalman filter is identical to the regular one). In addition, if the initial estimate of the state is wrong, or if the process is modeled incorrectly, the filter may quickly diverge, owing to its linearization. Another problem with the extended Kalman filter is that the estimated covariance matrix tends to underestimate the true covariance matrix and therefore risks becoming inconsistent in the statistical sense without the addition of "stabilising noise".
Having stated this, the extended Kalman filter can give reasonable performance, and is arguably the de facto standard in navigation systems and GPS.
The graph is simple, but what this looks like in code is quite different. Updating the main H matrix alone is the largest computation in both size and CPU usage. It holds all the state and landmark data and must be updated based on all the corresponding matrices. The code for this one is large, but it doesn't have to be: building this alone as a library would cut out over 1,000 lines of code.
Bayesian inference and estimating a joint probability distribution over the variables for each time frame.
Standard architecture; add IBM Ambari, etc.
This is clearly a problem that announces itself in the big data space
Kafka streaming
The takeaway is that IBM is already in the IoT space and preparing for the next generation of smart cities. Our continued open source innovation is a part of that.