After a brief introduction to programmatic ads and RTB, we walk through the evolution of Jampp's data platform to handle the enormous amount of data we need to process.
4. Jampp is a leading mobile app marketing and retargeting platform. Founded in 2013, Jampp has offices in San Francisco, London, Berlin, Buenos Aires, São Paulo and Cape Town.
We help companies grow their business by seamlessly acquiring, engaging & retaining mobile app users.
5. Jampp’s platform combines machine learning with big data for programmatic ad buying that optimizes towards in-app activity.
Our platform processes 200,000+ RTB ad bid requests per second (17+ billion per day), which amounts to about 300 MB/s or 25 TB of data per day.
6. How do programmatic ads work?
[Diagram: ad flow between the Source/Exchange, the Jampp Tracking Platform and the AppStore/Google Play, from app download to install, with postbacks reporting back.]
8. Jampp Events
1. RTB:
a. Auction: the exchange asks if we want to bid for the impression.
b. Bid/Non-Bid: bid with a price or non-bid (in less than 80ms).
c. Impression: the ad is displayed to the user.
2. Non-RTB:
a. Click: event that marks when the user clicks on the ad.
b. Install: the install of the app, recorded on first app open.
c. Event: in-app events such as purchase, view or favorite (see the sketch below).
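As an illustration of these event types, here is a minimal Python sketch; the class and field names are assumptions made for the example, not Jampp's actual schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class EventType(Enum):
    # RTB events
    AUCTION = "auction"
    BID = "bid"
    IMPRESSION = "impression"
    # Non-RTB events
    CLICK = "click"
    INSTALL = "install"
    IN_APP_EVENT = "event"


@dataclass
class TrackedEvent:
    """Illustrative event record; field names are assumptions."""
    event_type: EventType
    timestamp: float                    # epoch seconds
    campaign_id: int
    device_id: str
    bid_price: Optional[float] = None   # only set for BID events
    event_name: Optional[str] = None    # in-app event name, e.g. "purchase"
```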
9. Data @ Jampp
● Our platform started with RDBMSs and a traditional Data Warehouse architecture on Amazon Web Services.
● Data grew exponentially and data needs became more complex.
● In the last year alone, in-app events grew 2,500%+ and RTB bids 500%+.
● This pushed us to evolve our architecture to handle Big Data effectively.
12. Jampp Initial Systems: Bidder
● OpenRTB bidding system implementation that runs on 200+ virtual machines with 70GB RAM each.
● Strong latency requirements: less than 80ms to answer a request.
● Written in Cython and uses ZMQ for communication.
● Heavy use of coherent caching to comply with latency requirements.
● Data is continually replicated and enriched from MySQL by the replicator process (see the simplified sketch below).
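The actual bidder is written in Cython and communicates over ZMQ; as a rough sketch of the idea in plain Python, with an assumed in-memory campaign cache kept fresh by the replicator, the bid/no-bid decision under the 80ms budget might look like this:

```python
import json
import time
from typing import Optional

# In the real system a replicator process keeps this cache coherent with MySQL;
# here it is just a dict of campaign settings, which is an assumption for the sketch.
CAMPAIGN_CACHE = {}

LATENCY_BUDGET = 0.080  # the exchange expects an answer in under 80 ms


def handle_bid_request(raw_request: bytes) -> Optional[dict]:
    """Return an OpenRTB-style bid response, or None to no-bid."""
    started = time.monotonic()
    request = json.loads(raw_request)

    campaign = CAMPAIGN_CACHE.get(request.get("app", {}).get("id"))
    if campaign is None:
        return None  # nothing to advertise in this app: no-bid

    # Bail out early if we are already eating into the latency budget.
    if time.monotonic() - started > LATENCY_BUDGET * 0.5:
        return None

    return {
        "id": request["id"],
        "seatbid": [{"bid": [{"price": campaign["max_cpm"],
                              "impid": request["imp"][0]["id"]}]}],
    }
```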
13. Jampp Initial Systems: Cupper
● Event tracking system written in Node.js.
● Tracks clicks, installs and in-app events (200+ million per day).
● Can be scaled horizontally (10 instances) and sits behind a load balancer (ELB).
● Uses a MySQL database to store attributed events and Kinesis to store organics.
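Cupper itself is written in Node.js; the following is a hedged Python sketch of the same routing logic, with the stream name, table schema and connection details as illustrative assumptions: attributed events go to MySQL, organic ones to Kinesis.

```python
import json

import boto3
import pymysql

# Stream name, table schema and credentials are illustrative assumptions.
kinesis = boto3.client("kinesis", region_name="us-east-1")
db = pymysql.connect(host="tracking-db", user="cupper", password="secret", database="tracking")


def track_event(event: dict) -> None:
    """Store attributed events in MySQL, organic ones in Kinesis."""
    if event.get("attributed"):
        with db.cursor() as cur:
            cur.execute(
                "INSERT INTO events (event_type, device_id, created_at) VALUES (%s, %s, NOW())",
                (event["type"], event["device_id"]),
            )
        db.commit()
    else:
        kinesis.put_record(
            StreamName="organic-events",
            Data=json.dumps(event).encode(),
            PartitionKey=event["device_id"],
        )
```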
14. Jampp Initial Systems: API
● PostgreSQL is used as a Data Warehouse database, apart from the use the bidder makes of it.
● An API exposes the data for querying, with a caching layer.
● Fact tables are maintained with hourly, daily and monthly granularity; high-cardinality dimensions are removed from large fact tables for data older than 15 days.
● Data is continually aggregated through an aggregation process written in Python (sketched below).
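A minimal sketch of what one step of that aggregation process could look like in Python with psycopg2; the fact table and column names are assumptions, not the real schema.

```python
import psycopg2

# Table and column names are illustrative assumptions.
HOURLY_ROLLUP = """
    INSERT INTO fact_clicks_hourly (hour, campaign_id, country, clicks)
    SELECT date_trunc('hour', created_at) AS hour,
           campaign_id,
           country,
           count(*) AS clicks
    FROM clicks
    WHERE created_at >= %(since)s AND created_at < %(until)s
    GROUP BY 1, 2, 3
"""


def aggregate_hour(conn, since, until):
    """Roll one hour of raw clicks up into the hourly fact table."""
    with conn.cursor() as cur:
        cur.execute(HOURLY_ROLLUP, {"since": since, "until": until})
    conn.commit()
```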
16. Emerging Needs
● Log forensics capabilities, as our systems and company scale and we integrate with more outside systems.
● More historical and granular data for advanced analytics and model training.
● The need arose to make the data readily available to systems outside the traditional RDBMS. Some of these systems are too demanding for an RDBMS to handle easily.
18. New System Characteristics
● The new system was based on Amazon Elastic MapReduce (EMR).
● Data is imported hourly from the RDBMSs with Sqoop.
● Logs are imported every 10 minutes from different sources into S3 tables.
● Facebook PrestoDB and Apache Spark are used for interactive log exploration and analytics (see the sketch below).
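For example, interactive exploration over the S3 logs with Spark might look like the following PySpark sketch; the bucket, path layout and field names are assumptions.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("log-forensics").getOrCreate()

# Bucket name and partition layout are assumptions; logs land in S3 every ~10 minutes.
bids = spark.read.json("s3://example-jampp-logs/bids/dt=2016-05-01/")
bids.createOrReplaceTempView("bids")

# Interactive exploration over the raw logs.
spark.sql("""
    SELECT app_id, count(*) AS requests
    FROM bids
    GROUP BY app_id
    ORDER BY requests DESC
    LIMIT 20
""").show()
```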
19. New System Characteristics
● Scalable storage and processing capabilities using HDFS, YARN and Hive for ETLs and data storage.
● Connectors from different languages such as Python, Julia and Java/Scala (example below).
● Data archiving in S3 for long-term storage, enabling other data processing technologies.
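As an example of the Python connectors, querying Hive from Python with PyHive could look like this sketch; the host, port and table name are assumptions.

```python
from pyhive import hive  # pip install 'pyhive[hive]'

# Host, port and table name are illustrative assumptions.
conn = hive.Connection(host="emr-master.internal", port=10000, username="hadoop")
cursor = conn.cursor()
cursor.execute("SELECT event_name, count(*) FROM events GROUP BY event_name")
for event_name, total in cursor.fetchall():
    print(event_name, total)
```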
20. Aspects that needed improvement
● Data was still imported in batch mode; the delay was larger for MySQL data than with the Python replicator.
● EMR is not great for long-running clusters.
● The EMR cluster is not designed with strong multi-user capabilities: it is better to have multiple clusters with few users than one big cluster with many.
● Data was still being accumulated in RDBMSs (clicks, installs, events).
21. Final stage of the evolution
● Real-time event processing architecture based on best practices for stream processing in AWS.
● Uses Amazon Kinesis for streaming data storage and AWS Lambda for data processing (see the sketch below).
● DynamoDB and Redis are used as temporary data stores for enrichment and analytics.
● S3 gives us a Source of Truth for batch data applications, and Kinesis for stream processing.
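A minimal sketch of the stream-processing piece: an AWS Lambda handler in Python consuming a Kinesis stream and keeping per-device state in DynamoDB. The table name and payload fields are assumptions for the example.

```python
import base64
import json

import boto3

# Table name is an illustrative assumption.
table = boto3.resource("dynamodb").Table("event-enrichment")


def handler(event, context):
    """Lambda handler: decode Kinesis records and keep per-device state in DynamoDB."""
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        table.put_item(Item={
            "device_id": payload["device_id"],
            "last_event": payload["type"],
            "ts": int(payload["timestamp"]),
        })
    return {"processed": len(event["Records"])}
```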
23. Still, it isn’t perfect...
● There is no easy way to manage windows and out-of-order data with AWS Lambda.
● The consistency guarantees of DynamoDB and S3.
● The price of AWS managed services at large event volumes, compared to custom-maintained solutions.
● The ACID guarantees of RDBMSs are not an easy thing to part with.
● SQL and indexes in RDBMSs make forensics easier.
24. Benefits of the Evolution
● Enables the use of stream processing frameworks to keep data as fresh as economically possible.
● Decouples data from processing, enabling multiple Big Data engines to run on different clusters/infrastructure.
● Easy on-demand scaling provided by AWS managed services such as AWS Lambda, DynamoDB and EMR.
● Monitoring, logs and alerts managed by Amazon CloudWatch.
26. Key Takeaways
● Ad tech is a technologically intensive market that exhibits the three Vs of Big Data (volume, velocity and variety).
● As the business’s data needs grow in complexity, specialized data systems need to be put in place.
● Using technologies that are meant to scale easily and are managed by a third party can bring you peace of mind.
● Stream processing is fundamental in new Big Data projects.
● There is currently no single tool that clearly fulfills all the needs for scalable and correct stream processing.