2. Data is being produced continuously
Mobile Apps, Web Clickstream, Application Logs,
Metering Records, IoT Sensors, Smart Buildings
[Wed Oct 11 14:32:52 2000] [error] [client 127.0.0.1] client denied by server configuration: /export/home/live/ap/htdocs/test
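A continuously produced log stream is only useful once each entry is split into fields. The following is a minimal sketch, assuming the entry on this slide follows the bracketed layout `[timestamp] [level] [client ip] message`; the pattern and the function name `parse_error_log_line` are illustrative, not part of the talk.

```python
import re
from datetime import datetime

# Assumed layout: [timestamp] [level] [client ip] message
LOG_PATTERN = re.compile(
    r"\[(?P<timestamp>[^\]]+)\] "
    r"\[(?P<level>\w+)\] "
    r"\[client (?P<client>[^\]]+)\] "
    r"(?P<message>.*)"
)


def parse_error_log_line(line):
    """Return a dict of fields, or None if the line does not match the layout."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    # Convert the textual timestamp into a datetime object.
    record["timestamp"] = datetime.strptime(
        record["timestamp"], "%a %b %d %H:%M:%S %Y"
    )
    return record


line = (
    "[Wed Oct 11 14:32:52 2000] [error] [client 127.0.0.1] "
    "client denied by server configuration: /export/home/live/ap/htdocs/test"
)
record = parse_error_log_line(line)
print(record["level"], record["client"], record["timestamp"].isoformat())
```

Returning `None` for lines that do not match keeps malformed entries from stopping a long-running ingestion job; a real pipeline would probably count or quarantine them.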
3. Data volume - Gap
[Chart: generated data vs. data available for analysis, 1990 to 2020]
Sources:
Gartner: User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011
IDC: Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares
5. Collect, Store, Process / Analyze
[Diagram: AWS services grouped under Collect, Store, and Process / Analyze]
Amazon Kinesis Firehose
AWS Direct Connect
AWS Snowball
Amazon Kinesis Streams
Amazon S3
Amazon Glacier
Amazon CloudSearch
Amazon RDS, Amazon Aurora
Amazon DynamoDB
Amazon Elasticsearch Service
Amazon EMR
Amazon EC2
Amazon Redshift
AWS Data Pipeline
AWS Database Migration Service
AWS Glue
Amazon Athena
Amazon Kinesis Analytics
AWS IoT
Amazon QuickSight
6. Building a Big Data Application
web clients
mobile clients
DBMS
Amazon Redshift
AWS Cloud
corporate data center
Build a data warehouse with Amazon Redshift
7. Structured Data Processing
• Petabyte-scale relational, MPP, data warehousing
• Fully managed with SSD and HDD platforms
• Built-in end-to-end security, including customer-managed keys
• Fault-tolerant. Automatically recovers from disk and node failures
• Data automatically backed up to Amazon S3 with cross-region
backup capability for global disaster recovery
• Over 140 new features added since launch
• $1,000/TB/Year; start at $0.25/hour. Provision in minutes; scale
from 160 GB to 2 PB of compressed data with just a few clicks
Amazon Redshift
9. Building a Big Data Application
web clients
mobile clients
DBMS
Amazon Redshift
AWS Cloud
corporate data center
Migrate your data to AWS
AWS Database Migration Service
AWS Direct Connect
AWS Snowball
10. Start your first migration in 10 minutes or less
Keep your apps running during the migration
Migrate to databases running on Amazon EC2,
Amazon RDS, or Amazon Redshift
AWS Database Migration Service
11. AWS Snowball: PB-scale Data Transport
• E-ink shipping label
• Ruggedized case, “8.5G Impact”
• All data encrypted end-to-end
• 50 TB & 80 TB
• 10G network
• Rain & dust resistant
• Tamper-resistant case & electronics
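The capacity and network figures on this slide give a rough lower bound on how long it takes to fill a device. The arithmetic below is a sketch under stated assumptions: "10G network" is read as a 10 Gbit/s interface running at full line rate, and terabytes are decimal (10^12 bytes). Real transfers will be slower because of protocol overhead and source disk speed.

```python
def transfer_hours(capacity_tb, link_gbps):
    """Hours needed to move capacity_tb terabytes over a link_gbps link at full line rate."""
    bits = capacity_tb * 10**12 * 8        # decimal terabytes -> bits
    seconds = bits / (link_gbps * 10**9)   # gigabits per second -> bits per second
    return seconds / 3600


for capacity in (50, 80):
    print(f"{capacity} TB over 10 Gbit/s: {transfer_hours(capacity, 10):.1f} hours")
```

Under these assumptions the 80 TB device takes a little under 18 hours to fill, which is the best case rather than an expected figure.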
13. Building a Big Data Application
web clients
mobile clients
DBMS
Amazon Redshift
Amazon QuickSight
AWS Cloud
corporate data center
Visualize your data with Amazon QuickSight
AWS Database Migration Service
AWS Direct Connect
AWS Import/Export & Snowball
14. Business Intelligence
• Fast and cloud-powered
• Easy to use, no infrastructure to manage
• Scales to hundreds of thousands of users
• Quick calculations with SPICE
• 1/10th the cost of legacy BI software
Amazon QuickSight
15. What if your data isn’t structured?
What if you don’t need all the raw data?
What if you need to combine multiple data sets?
16. Serverless Event Processing
• Serverless compute service that runs your code in
response to events
• Extend AWS services with user-defined custom logic
• Write custom code in Node.js, Python, and Java
• Pay only for the requests served and compute time
required, billed in increments of 100 milliseconds
AWS Lambda
17. Building a Big Data Application
web clients
mobile clients
DBMS
Amazon Redshift
Amazon QuickSight
AWS Cloud
Event-driven data transformations with AWS Lambda
corporate data center
AWS Lambda
Structured data in Amazon S3
Raw data in Amazon S3
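The event-driven flow on this slide (a raw object lands in S3, a Lambda function runs, a structured object is written back) can be sketched as follows. This is a minimal illustration, not the presenters' code: the bucket name, the `structured/` prefix, and the `user,action` input layout are hypothetical, and the actual S3 reads and writes (for example through the AWS SDK) are left out so the sketch runs without AWS credentials.

```python
import json
from urllib.parse import unquote_plus


def objects_from_s3_event(event):
    """Extract (bucket, key) pairs from an S3 event notification payload."""
    pairs = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        pairs.append((bucket, key))
    return pairs


def to_structured(raw_text):
    """Hypothetical transformation: 'user,action' lines -> JSON Lines."""
    rows = []
    for line in raw_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        user, action = line.split(",", 1)
        rows.append(json.dumps({"user": user.strip(), "action": action.strip()}))
    return "\n".join(rows)


def handler(event, context):
    """Lambda entry point: plan one structured output object per raw input object."""
    return [
        {"bucket": bucket, "source_key": key, "target_key": "structured/" + key}
        for bucket, key in objects_from_s3_event(event)
    ]


# Hypothetical event, trimmed to the fields the sketch reads.
sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "example-raw-bucket"},
                "object": {"key": "logs/2017+events.csv"}}}
    ]
}
plan = handler(sample_event, None)
structured = to_structured("alice, login\nbob, review_created\n")
print(plan)
print(structured)
```

Keeping the transformation in a pure function such as `to_structured` makes it testable locally and lets the same logic move to a batch job if the data outgrows a single function invocation.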
18. How will this work at scale?
What if the data processing exceeds the timeout?
19. Semi-structured/Unstructured Data Processing
• Hadoop, Hive, Presto, Spark, Tez, Impala, etc.
• Release 5.2: Hadoop 2.7.3, Hive 2.1, Spark 2.0.2, Zeppelin, Presto, HBase 1.2.3
and HBase on S3, Phoenix, Tez, Flink.
• New applications added within 30 days of their open source release
• Fully managed, Auto Scaling clusters with support for on-demand and
spot pricing
• Support for HDFS and S3 file systems enabling separated compute and
storage; multiple clusters can run against the same data in S3
• Support for end-to-end encryption, IAM/VPC, S3 client-side encryption
with customer-managed keys and AWS KMS. HIPAA-eligible.
Amazon EMR
20. Building a Big Data Application
web clients
mobile clients
DBMS
Amazon Redshift
Amazon QuickSight
AWS Cloud
Transform and explore your data at scale with Amazon EMR
corporate data center
Amazon EMR
Structured data in Amazon S3
Raw data in Amazon S3
22. Serverless Query Processing
• Serverless query service for querying data in S3 using standard SQL with
no infrastructure to manage
• No data loading required; query directly from Amazon S3
• Use standard ANSI SQL queries with support for joins, JSON, and window
functions
• Support for multiple data formats, including text, CSV, TSV, JSON, Avro,
ORC, and Parquet
• Pay per query, based on the data scanned, and only when you’re running queries.
If you compress your data, you pay less and your queries run faster
Amazon Athena
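Because billing follows the amount of data scanned, compression lowers cost in direct proportion. The sketch below illustrates the arithmetic only; the price per terabyte, the 2 TB data set, and the 4:1 compression ratio are all assumptions chosen for illustration, not figures from the slides.

```python
def query_cost(bytes_scanned, price_per_tb):
    """Cost of one query when billing is proportional to the data scanned."""
    return bytes_scanned / 10**12 * price_per_tb


PRICE_PER_TB = 5.0                   # assumed price in USD, not from the slides
raw_bytes = 2 * 10**12               # hypothetical 2 TB of uncompressed text
compressed_bytes = raw_bytes / 4     # hypothetical 4:1 compression ratio

print(f"uncompressed: ${query_cost(raw_bytes, PRICE_PER_TB):.2f}")
print(f"compressed:   ${query_cost(compressed_bytes, PRICE_PER_TB):.2f}")
```

Columnar formats such as ORC and Parquet can reduce the scanned volume further when a query touches only a few columns, which this flat model does not capture.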
23. Building a Big Data Application
Extend your data warehouse to S3 with Amazon Athena
web clients
mobile clients
DBMS
Raw data in Amazon S3
Amazon Redshift
Staging data in Amazon S3
Amazon QuickSight
AWS Cloud
corporate data center
Amazon EMR
Amazon Athena
24. A Data Lake on AWS
Catalog & Search: Access and search metadata
- DynamoDB, Elasticsearch
Access & User Interface: Give your users easy and secure access
- API Gateway, Identity & Access Management, Cognito
Processing & Analytics: Use of predictive and prescriptive analytics to gain better understanding
- QuickSight, Amazon AI, EMR, Redshift, Athena, Kinesis Analytics, RDS
Central Storage: Secure, cost-effective storage in Amazon S3
- S3
Data Ingestion: Get your data into S3 quickly and securely
- Snowball, Database Migration Service, Kinesis Firehose, Direct Connect
Protect and Secure: Use entitlements to ensure data is secure and users’ identities are verified
- Security Token Service, CloudWatch, CloudTrail, Key Management Service
26. Trustpilot at a glance
“Trustpilot is an online review platform to help people choose services and products with
confidence and to help companies to harness the power of reviews.”
- 30 million reviews in total
- 1 million new reviews each month
- 1.5 billion page impressions each month
- 15 million emails sent each month
27. Data at Trustpilot
Everything we build must be tracked and measured:
- 100 GB of log files each day
- 3.5 million tracking events each day
We’re extremely data driven: data always wins.
28. Traditional data warehousing didn’t work anymore
Some of the issues we encountered:
- Teams were stepping on each other’s toes
- No clear source of truth
- Difficult discovery of data to gain insights
- Poor (or no) data governance
- Couldn’t “just” store data
- Storage is expensive
29. Data Lake to the rescue
“A Data Lake is a central repository to store massive amounts of data in its natural
format.”
Some of the benefits of a Data Lake:
- Teams can implement compute jobs (ETL/MR) independently
- Clear source of truth and easier discovery of data
- Clear path to implement data governance (e.g. security, privacy)
- Just store it (schema-on-read)
- Storage is cheap (separation of compute and storage)
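Schema-on-read means the raw records are stored as they arrive and a schema is applied only when a job reads them. A minimal sketch, assuming JSON Lines input; the field names and the `read_with_schema` helper are hypothetical.

```python
import json

# Raw events stored exactly as produced; records need not share the same fields.
raw_lines = [
    '{"user": "alice", "rating": 5, "source": "web"}',
    '{"user": "bob", "rating": "4"}',
    '{"user": "carol", "stars": 3, "device": "ios"}',
]


def read_with_schema(lines, schema):
    """Apply a schema at read time: pick the wanted fields and coerce their types."""
    for line in lines:
        event = json.loads(line)
        yield {
            name: (cast(event[name]) if name in event else None)
            for name, cast in schema.items()
        }


# Each consumer defines only the view it needs.
ratings_schema = {"user": str, "rating": int}
rows = list(read_with_schema(raw_lines, ratings_schema))
for row in rows:
    print(row)
```

Two teams can read the same raw lines with different schemas, which is what lets compute jobs evolve independently of storage.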
30. How we built a Data Lake
Components:
- Ingestion
- Central Storage
- Processing & Analytics
- Access & User Interface
- Catalog & Search
31. Ingestion
- Quick ingestion of raw data
- Support for any type of data
  - Unstructured
  - Semi-structured (JSON, XML)
  - Structured (CSV, Columnar)
- No need to force data into a pre-defined schema
- Batch and Stream support
32. Central Storage on S3
- High availability (system uptime)
- High durability (data redundancy)
- Store massive amounts of data
- Cheap (starts at $0.023 per GB per month)
S3 Event Triggers
- Lambda, SQS, or SNS
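To put the storage price in context, it can be combined with the 100 GB of log files per day mentioned on the "Data at Trustpilot" slide. This is a rough sketch that assumes the $0.023 rate is charged per GB per month, applies it flat with no tiering, compression, or lifecycle rules, and ignores request and transfer charges.

```python
PRICE_PER_GB_MONTH = 0.023   # from the slide; assumed to be a flat monthly rate
DAILY_LOG_GB = 100           # from the "Data at Trustpilot" slide


def monthly_storage_cost(days_retained):
    """Monthly bill for keeping days_retained days of logs in storage."""
    stored_gb = DAILY_LOG_GB * days_retained
    return stored_gb * PRICE_PER_GB_MONTH


for days in (30, 365):
    print(f"{days} days of logs: ${monthly_storage_cost(days):,.2f} per month")
```

Under these assumptions a full year of raw logs costs well under a thousand dollars a month to keep, which is the economic argument behind "just store it".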
33. Catalog & Search
- Avoid the “Data Swamp”
- Discovery of data
- Metadata storage
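The catalog's job is to record what each object in the lake is so that it can be found again. The sketch below is an in-memory illustration only; the `CatalogEntry` fields, the keys, and the tag-based search are hypothetical, and a production catalog would sit on a database and a search index rather than a Python dict.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Metadata describing one object in central storage (hypothetical fields)."""
    s3_key: str
    owner: str
    data_format: str
    tags: set = field(default_factory=set)


class Catalog:
    """Tiny in-memory metadata store with tag-based discovery."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        # Re-registering the same key replaces the previous metadata.
        self._entries[entry.s3_key] = entry

    def search(self, tag):
        """Return the keys of all entries carrying the given tag, sorted."""
        return sorted(e.s3_key for e in self._entries.values() if tag in e.tags)


catalog = Catalog()
catalog.register(CatalogEntry("raw/reviews/2017-05-01.json", "reviews-team", "json", {"reviews", "raw"}))
catalog.register(CatalogEntry("raw/tracking/2017-05-01.csv", "web-team", "csv", {"tracking", "raw"}))
catalog.register(CatalogEntry("structured/reviews/2017-05-01.parquet", "reviews-team", "parquet", {"reviews"}))

print(catalog.search("reviews"))
print(catalog.search("raw"))
```

Registering metadata at ingestion time, rather than afterwards, is one way to keep the lake from drifting into the "data swamp" the slide warns about.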
34. Access & User Interface
- Ingestion via Upload
- Access data catalog and metadata
- Data Lake API
AWS Data Lake Solution
- goo.gl/8k1MXq
36. How the Data Lake helped us
- Getting our data sane again
- Data is easier to discover
- Teams can move faster
- Analytics are much faster
- Cost savings
Lessons learned
- S3 Event Triggers + Lambdas rock
- Metadata is fuzzy and hard to get right
37. Thank you ;)
Martin Buberl
Director of Engineering at Trustpilot
mbl@trustpilot.com | @martinbuberl
39. Recommended next session:
13:15 - Getting Started with Amazon QuickSight
14:00 - Big Data Architectural Patterns and Best Practices