2. Use Cases
Customer Segmentation
Marketing Optimization
Financial Modeling and Forecasting
Ad targeting & real-time bidding
Clickstream Analysis
Fraud Detection
Operational Cost Reduction
4. It all starts with data…
VOLUME VELOCITY VARIETY
Challenge
Needs x Possibilities
BATCH
Reports
REAL-TIME
Alerts
Prediction
Forecasting
5. Trend: Intelligent applications
Based on what you know about your “user”:
Will they use your product?
Based on what you know about orders:
Is this order fraudulent?
Based on what you know about news:
Which other articles would be interesting?
6.
7. Fraud Detection
FINRA uses Amazon EMR and Amazon S3 to process up to 75 billion
trading events per day and securely store over 5 petabytes of data,
attaining savings of $10-20 million per year.
8. Architecture Principles – Big Data
Decoupled “data bus”
• Data → Store → Process → Store → Analyze → Answers
Right tool for each job
• Structure, latency, capacity, access patterns
Use Lambda architectures
• Immutable (append-only) log, batch/speed/serving layers
Use managed services
• Scalability/elasticity, availability, reliability, security, no/low admin
Big data ≠ big cost
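The Lambda architecture principle on this slide (an immutable, append-only log feeding batch, speed, and serving layers) can be sketched in a few lines. This is only an in-memory illustration under stated assumptions: the class name `LambdaSketch`, its methods, and the page-view counting workload are hypothetical and are not taken from the slides or from any AWS service.

```python
from collections import Counter


class LambdaSketch:
    """Toy model of the batch, speed, and serving layers (hypothetical names)."""

    def __init__(self):
        self._log = []                # append-only log: entries are never changed
        self._batch_view = Counter()  # batch layer output, rebuilt from the full log
        self._speed_view = Counter()  # speed layer output, updated per event

    def ingest(self, page):
        # Every event is appended to the log and also counted by the speed layer.
        self._log.append(page)
        self._speed_view[page] += 1

    def run_batch(self):
        # The batch layer recomputes its view from the whole log, after which
        # the speed layer only needs to cover events that arrive later.
        self._batch_view = Counter(self._log)
        self._speed_view.clear()

    def query(self, page):
        # The serving layer answers by merging the batch and speed views.
        return self._batch_view[page] + self._speed_view[page]


pipeline = LambdaSketch()
for page in ["home", "cart", "home"]:
    pipeline.ingest(page)
pipeline.run_batch()
pipeline.ingest("home")  # arrives after the batch run, so only the speed view sees it
home_views = pipeline.query("home")
print(home_views)
```

In a real deployment the log, the batch job, and the speed layer would live in separate systems; the sketch only shows how the serving layer combines a complete but stale view with a fresh but partial one.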
9. Simplified Processing Model
Collect → Store → Process/Analyze → Consume
Response Time (Latency)?
Capacity?
Cost?
10. Amazon S3
Data Lake
Amazon Kinesis
Streams & Firehose
AWS Lambda
Apache Storm on EMR
Apache Flink on EMR
Spark Streaming on EMR
Hadoop / Spark
Streaming Analytics Tools
Amazon Redshift
Data Warehouse
Amazon DynamoDB
NoSQL DB & Graph DB
Amazon Elasticsearch Service
Relational Database
Amazon EMR
Amazon Aurora
Amazon Machine Learning
Machine Learning
Open Source Tool of Choice on EC2
Data Sources
Lambda Architecture
AWS
Data Science Sandbox
Visualization / Reporting
Amazon Kinesis Analytics
Athena
11. Speed (Real-time)
Ingest
Serving
Data sources
Scale (Batch)
Big Data Architecture
Insights to enhance business applications, new digital services
Data Warehouse
Amazon Redshift
Legacy Apps
Amazon RDS
Data analysts
Data scientists
Business users
Engagement platforms
Schemaless
Amazon Elasticsearch
Direct Query
Amazon Athena
Near-Zero Latency
Amazon DynamoDB
Automation / events
Amazon S3
Staged Data
(Data Lake)
Semi/Unstructured
Amazon EMR
Transactions
Web logs / cookies
ERP
AWS Database Migration
AWS Direct Connect
Internet
Interfaces
Amazon Kinesis
Connected devices
Social media
Amazon S3
Raw Data
Amazon EMR
ETL
Advanced
Analytics
MLlib
Event Capture
Amazon Kinesis
Stream Analysis
Amazon EMR
AWS CloudTrail
AWS IAM
Amazon CloudWatch
AWS KMS
3 main characteristics of data
Velocity
Moves at very high rates
Valuable in its temporal, high velocity state
Volume
Fast-moving data creates massive historical archives
Valuable for mining patterns, trends and relationships
Variety
Structured (logs, business transactions)
Semi-structured and unstructured
Condense these slides into 3-box
Go faster – only key points
Before we get into the big data architecture, I want to introduce some “tried and tested” architecture principles.
Here at AWS we believe you should use the right tool for the job: instead of using a big Swiss Army knife as a screwdriver, it is best to use a screwdriver. This is especially important for big data architectures. We’ll talk about this more.
Decoupled architecture (http://whatis.techtarget.com/definition/decoupled-architecture): in general, a decoupled architecture is a framework for complex work that allows components to remain completely autonomous and unaware of each other. This approach has been tried and battle-tested.
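As a rough illustration of the decoupling idea in the note above, the sketch below lets a producer and a consumer communicate only through a shared queue, so neither holds a reference to the other. The function names, the `None` sentinel convention, and the sample events are assumptions made for this example; in the architecture discussed here a managed stream or queue service would play the role of the in-process `queue.Queue`.

```python
import queue
import threading


def producer(bus, events):
    # The producer only knows about the bus, not about who reads from it.
    for event in events:
        bus.put(event)
    bus.put(None)  # sentinel: signals that no more events will arrive


def consumer(bus, results):
    # The consumer only knows about the bus, not about who writes to it.
    while True:
        event = bus.get()
        if event is None:
            break
        results.append(event.upper())  # stand-in for real processing


bus = queue.Queue()
results = []
worker = threading.Thread(target=consumer, args=(bus, results))
worker.start()
producer(bus, ["click", "order"])
worker.join()
print(results)
```

Because both sides depend only on the queue, either one can be replaced, scaled, or restarted without changes to the other.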
Managed services: this is relatively new. Should I install Cassandra or MongoDB or CouchDB on AWS? You obviously can. Sometimes there are good reasons for doing this. Many customers still do this. Netflix is a great example. They run multi-region Cassandra and are a poster child for how to do this. But for most customers, delegating this task to AWS makes more sense. You are better off spending your time building features for your customers rather than building highly scalable distributed systems.
Lambda Architecture -
throughput = f (volume, request rate)
latency
Cost
Event to action/answer latency?
http://calculator.s3.amazonaws.com/index.html#r=IAD&key=calc-BE3BA3E4-1AC5-4E7A-B542-015056D8EDAF
Kinesis -> $52.14 per month
SQS -> $133.42 per month for puts or $400/month (put, get, delete)
DynamoDB -> $3809.88 per month (10TB of storage cost itself is $2500/month)
Cost (100 rps x 35 KB)
$52/month
$133/month * 2 = $266/month
?
Amazon DynamoDB Service (US-East)
Provisioned Throughput Capacity: $120
Indexed Data Storage: $2560.90
DynamoDB Streams: $1.3
Amazon SQS Service (US-East)
Pricing Example
Let’s assume that our data producers put 100 records per second in aggregate, and each record is 35KB. In this case, the total data input rate is 3.4MB/sec (100 records/sec*35KB/record). For simplicity, we assume that the throughput and data size of each record are stable and constant throughout the day. Please note that we can dynamically adjust the throughput of our Amazon Kinesis stream at any time.
We first calculate the number of shards needed for our stream to achieve the required throughput. As one shard provides a capacity of 1MB/sec data input and supports 1000 records/sec, four shards provide a capacity of 4MB/sec data input and support 4000 records/sec. So a stream with four shards satisfies our required throughput of 3.4MB/sec at 100 records/sec.
We then calculate our monthly Amazon Kinesis costs using Amazon Kinesis pricing in the US-East Region:
Shard Hour: One shard costs $0.015 per hour, or $0.36 per day ($0.015*24). Our stream has four shards so that it costs $1.44 per day ($0.36*4). For a month with 31 days, our monthly Shard Hour cost is $44.64 ($1.44*31).
PUT Payload Unit (25KB): As our record is 35KB, each record contains two PUT Payload Units. Our data producers put 100 records or 200 PUT Payload Units per second in aggregate. That is 267,840,000 records or 535,680,000 PUT Payload Units per month. As one million PUT Payload Units cost $0.014, our monthly PUT Payload Units cost is $7.499 ($0.014*535.68).
Adding the Shard Hour and PUT Payload Unit costs together, our total Amazon Kinesis costs are $1.68 per day, or $52.14 per month. For $1.68 per day, we have a fully-managed streaming data infrastructure that enables us to continuously ingest 4MB of data per second, or 337GB of data per day in a reliable and elastic manner.
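The arithmetic in the pricing example above can be checked with a short script. This is a sketch under the example's own assumptions: the prices and per-shard limits are the figures quoted in the text (they may not match current pricing), the workload is constant, 1 MB is taken as 1024 KB, and the function name `kinesis_monthly_cost` and the constant names are hypothetical.

```python
import math

# Figures quoted in the pricing example (US-East); treat them as assumptions.
SHARD_HOUR_USD = 0.015             # price of one shard for one hour
PUT_UNITS_PER_MILLION_USD = 0.014  # price of one million PUT Payload Units
PUT_UNIT_KB = 25                   # size of one PUT Payload Unit
SHARD_MB_PER_SEC = 1               # ingest capacity of one shard
SHARD_RECORDS_PER_SEC = 1000       # record-rate capacity of one shard


def kinesis_monthly_cost(records_per_sec, record_kb, days=31):
    """Return (shards, shard_hour_cost, put_cost, total) for a steady workload."""
    mb_per_sec = records_per_sec * record_kb / 1024
    # The stream needs enough shards for both the byte rate and the record rate.
    shards = max(
        math.ceil(mb_per_sec / SHARD_MB_PER_SEC),
        math.ceil(records_per_sec / SHARD_RECORDS_PER_SEC),
    )
    shard_hour_cost = shards * SHARD_HOUR_USD * 24 * days
    # Each record is billed in whole 25 KB units, so 35 KB counts as two units.
    units_per_record = math.ceil(record_kb / PUT_UNIT_KB)
    put_units = records_per_sec * units_per_record * 86400 * days
    put_cost = put_units / 1_000_000 * PUT_UNITS_PER_MILLION_USD
    return shards, shard_hour_cost, put_cost, shard_hour_cost + put_cost


shards, shard_cost, put_cost, total = kinesis_monthly_cost(100, 35)
print(shards, round(shard_cost, 2), round(put_cost, 2), round(total, 2))
```

With 100 records per second at 35 KB each, the script should arrive at the same four shards and roughly $52.14 per month that the example derives by hand.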
Amazon Elasticsearch Service allows you to easily and securely deploy and scale an ELK stack in minutes. Integration with Logstash is tightly coupled and a Kibana instance is automatically configured for you. The service automatically detects and replaces failed Elasticsearch nodes, reducing the overhead associated with self-managed infrastructure and Elasticsearch software.
Ideal Usage Patterns
Log analysis
Analysis of data stream updates from other AWS services
Provide customers a rich search and navigation experience
Usage monitoring for mobile applications
Performance
Depends on multiple factors including instance type, workload, index, number of shards used, read replicas
Storage configurations: instance storage or EBS storage
Cost Model
Pay as you go
Only pay for compute and storage