An Overview of Fast Data: Spark, VoltDB and Elasticsearch

Luiz Henrique Zambom Santana

Agenda
● Introduction
● Processing: Apache Spark
● Storage: VoltDB
● Analytics: Elasticsearch
● Conclusions

Introduction
Part 1

In the beginning there was Apache Hadoop

Fast Data Architecture
Part 2

Storage
"Not only SQL"
Sadalage and Fowler, 2012
(http://martinfowler.com/books/nosql.html)
"Relational databases will be a footnote in history"
Nathan Marz, 2014
(http://www.slideshare.net/nathanmarz/runaway-complexity-in-big-data-and-a-plan-to-stop-it)

Storage
"SQL and NoSQL will merge"
“Not yet SQL”
Michael Stonebraker, 2015
https://www.youtube.com/watch?v=KRcecxdGxvQ

Processing
Matei Zaharia

The problem we are going to tackle...

https://github.com/lhzsantana/fastdata

Processors:
Apache Spark
Part 2

Agenda - Processors: Apache Spark
● Big Data processing frameworks
● Apache Spark architecture
● How the cluster works
● Processing flow
○ Directed Acyclic Graph (DAG)
○ Resilient Distributed Dataset (RDD)
○ Evolution of the RDD
● Exercises

Big Data processing frameworks
● Streaming
○ Apache Spark, Apache Storm
● Queues
○ Apache Kafka, RabbitMQ
● Cluster management
○ Apache Mesos, Apache Zookeeper
● Machine learning
○ Apache Spark, Apache Mahout, IBM Watson, TensorFlow
● Statistics
○ Apache Spark
● Memory management
○ Apache Spark, Apache Ignite

Apache Spark architecture

Directed Acyclic Graph (DAG)

Resilient Distributed Dataset (RDD)

Exercises
1. Classify the tweets from POA
--------------------- Homework :) ---------------------
2. Receive Twitter data in streaming mode and filter it by the latitude and
longitude of POA
3. Create a processing queue that receives the tweets collected in the Spark
exercise and counts the most common words in those tweets
4. Using the Twitter streaming code as a base, stream directly from Cassandra
a. https://github.com/datastax/spark-cassandra-connector/blob/master/doc/8_streaming.md
5. Configure Ignite to store the tweets and use the Ignite SQL API
to query them
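
For exercises 2 and 3, the filtering and counting logic can be sketched in plain Python before it is wired into Spark. This is only a sketch under assumptions: the bounding box is a rough, assumed approximation of Porto Alegre, and the tweet fields `text`, `lat` and `lon` are hypothetical names, not the Twitter API schema. In Spark, a predicate such as `in_poa` could be passed to `filter` on an RDD or DStream.

```python
from collections import Counter

# Assumed, approximate bounding box around Porto Alegre (not authoritative).
POA_BOX = {"min_lat": -30.27, "max_lat": -29.93,
           "min_lon": -51.30, "max_lon": -51.01}


def in_poa(tweet, box=POA_BOX):
    """Return True when the tweet has coordinates inside the bounding box."""
    lat, lon = tweet.get("lat"), tweet.get("lon")
    if lat is None or lon is None:
        return False  # tweets without coordinates are dropped
    return (box["min_lat"] <= lat <= box["max_lat"]
            and box["min_lon"] <= lon <= box["max_lon"])


def top_words(tweets, n=3):
    """Count lower-cased, whitespace-separated words across all tweets."""
    counts = Counter()
    for tweet in tweets:
        counts.update(tweet["text"].lower().split())
    return counts.most_common(n)


# Hypothetical sample data for illustration only.
sample = [
    {"text": "sol em POA", "lat": -30.03, "lon": -51.23},
    {"text": "sol em floripa", "lat": -27.59, "lon": -48.55},
    {"text": "chuva em POA", "lat": -30.05, "lon": -51.20},
    {"text": "sem coordenadas"},
]
poa_tweets = [t for t in sample if in_poa(t)]
```

Keeping the predicate and the counter as small pure functions makes them easy to test locally and to reuse inside a streaming job.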

Databases:
VoltDB
Part 3

Agenda - Databases: VoltDB
● Problems with SQL
● NewSQL
● VoltDB
● How does VoltDB deliver what it promises?
● Exercise

The SQL problem: multiple bottlenecks

NewSQL: definitions
● Definitions
○ SQL as the primary interface
○ Support for ACID transactions
○ Lock-free concurrency control
○ High performance
○ Scalable (shared-nothing) architecture
● In-memory
○ High
○ Low latency
○ No buffer management
○ No locks or latches
● HBase, Clustrix, NuoDB and VoltDB

NewSQL: how?
● Partitioning
○ Sharding
● Concurrency control by scheduling or multi-versioning
● Indexing
● Replication
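
The partitioning (sharding) idea above can be illustrated with a small Python sketch that routes rows to partitions by hashing a partitioning column. It is an illustration only: the CRC32 hash and the `route` helper are assumptions made for this example and do not describe how any particular NewSQL engine assigns rows; the rows with ids 2 and 3 are hypothetical.

```python
import zlib


def partition_for(value, num_partitions):
    """Map a partitioning-column value to a partition id with a stable hash."""
    return zlib.crc32(str(value).encode("utf-8")) % num_partitions


def route(rows, column, num_partitions):
    """Group rows by the partition that owns their partitioning-column value."""
    partitions = {p: [] for p in range(num_partitions)}
    for row in rows:
        partitions[partition_for(row[column], num_partitions)].append(row)
    return partitions


# Hypothetical rows shaped like the users table used later in the deck.
users = [
    {"id": 1, "username": "lhzsantana", "city": "floripa"},
    {"id": 2, "username": "maria", "city": "poa"},
    {"id": 3, "username": "joao", "city": "floripa"},
]
shards = route(users, "city", 4)
```

Because every row with the same `city` lands in the same partition, a query restricted to one city only needs to touch one partition.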

VoltDB
● Evolution of C-Store and H-Store
○ http://hstore.cs.brown.edu/documentation/faq/
● Overview:
○ VoltDB relies on horizontal partitioning
down to the individual hardware thread to
scale, k-safety (synchronous replication) to
provide high availability, and a
combination of continuous snapshots and
command logging for durability (crash
recovery)

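K-safety
● K-safety is a measure of how many
redundant copies of the data exist in the
cluster: with a K-safety value of K, each
partition is kept on K+1 nodes, so the
cluster can survive the loss of K nodes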
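
The sizing consequence of K-safety can be worked out with a short Python sketch. It assumes the usual rule that every partition is stored K+1 times, so the number of unique partitions is the total number of partition sites divided by K+1. The function names and the divisibility check are choices made for this example, not VoltDB behavior.

```python
def copies_per_partition(k):
    """With K-safety K, each partition is stored on K + 1 nodes."""
    return k + 1


def unique_partitions(nodes, sites_per_host, k):
    """Unique partitions = (nodes * sites per host) / (K + 1)."""
    if k < 0 or k >= nodes:
        raise ValueError("K must satisfy 0 <= K < number of nodes")
    total_sites = nodes * sites_per_host
    if total_sites % (k + 1) != 0:
        # Check added by this sketch so the division stays exact.
        raise ValueError("total sites should be a multiple of K + 1")
    return total_sites // (k + 1)


# Example: 6 nodes with 4 sites each and K = 1 give 12 unique partitions.
example = unique_partitions(nodes=6, sites_per_host=4, k=1)
```

Raising K buys availability at the cost of capacity: the same hardware holds fewer unique partitions.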

VoltDB commands
● ./voltdb init
● ./voltdb start
● ./sqlcmd
● CREATE TABLE users (id INTEGER UNIQUE NOT NULL, username VARCHAR(15), city VARCHAR(15));
● CREATE TABLE tweets (id INTEGER UNIQUE NOT NULL, body VARCHAR(150), userId INTEGER);
● insert into users values (1,'lhzsantana','floripa');
● insert into tweets values (1,'sol em floripa', 1);
● select * from users;
● select * from users u inner join tweets t on u.id=t.userId;
● CREATE INDEX name_idx ON users (username);
● PARTITION TABLE users ON COLUMN city;
● show tables;
● drop table users;

VoltDB installation
Reference: https://www.voltdb.com/try-voltdb/download-enterprise/

Exercises
1. Save the streaming data and the DataPOA data to VoltDB
--------------------- Homework :) ---------------------
2. Query the users who post the most on Twitter
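
One possible shape for the query in exercise 2 is a join with `GROUP BY` and `COUNT`. The sketch below runs it against an in-memory SQLite database purely as a stand-in, so it can be tried without a VoltDB cluster. The second user and the extra tweets are hypothetical rows, and the statement may need adjusting for VoltDB.

```python
import sqlite3

# SQLite is used here only as a local stand-in for VoltDB.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER UNIQUE NOT NULL, username VARCHAR(15), city VARCHAR(15));
CREATE TABLE tweets (id INTEGER UNIQUE NOT NULL, body VARCHAR(150), userId INTEGER);
INSERT INTO users VALUES (1, 'lhzsantana', 'floripa');
INSERT INTO users VALUES (2, 'outro', 'poa');
INSERT INTO tweets VALUES (1, 'sol em floripa', 1);
INSERT INTO tweets VALUES (2, 'chuva em poa', 2);
INSERT INTO tweets VALUES (3, 'mais sol em floripa', 1);
""")

# Users ordered by number of tweets, most active first.
TOP_POSTERS = """
SELECT u.username, COUNT(*) AS total
FROM users u INNER JOIN tweets t ON u.id = t.userId
GROUP BY u.username
ORDER BY total DESC, u.username
"""
top_posters = conn.execute(TOP_POSTERS).fetchall()
```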

Visualization and analytics:
Elasticsearch
Part 4

Agenda - Visualization and analytics: Elasticsearch
● Elasticsearch?

Elasticsearch?
• Real time
• Flexible
• Schema-free and highly scalable
• Started by Shay Banon in 2010
• Developed by the community
• Open source at:
• https://github.com/elastic/elasticsearch
• Currently backed by Elastic

Where is it used?
More use cases at: https://www.elastic.co/use-cases

Overview
• Cluster
• Lucene
• Index
• Mapping
• Type
Lucene               Relational database (RDB)
Index                Schema
Type                 Table
Document (JSON)      Row
Field                Column
Mapping              Table structure
Query DSL            SQL

Architecture - Indexing
[Diagram labels: Users, Client, API, Elasticsearch]

Architecture - Search
[Diagram labels: Users]

Architecture - Autocomplete (“search as you type”)
[Diagram labels: Users]

[Diagram labels: Users, Client, API, Elasticsearch]

Architecture - Bulk indexing
[Diagram labels: API, Elasticsearch]
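
Bulk indexing sends many documents in one request to the `_bulk` endpoint as newline-delimited JSON: an action line followed by the document source, one pair per document. The Python sketch below only builds that request body. The `bulk_body` helper is a hypothetical name, and the `_type` metadata follows the type-based URLs used in this deck, which applies to older Elasticsearch versions.

```python
import json


def bulk_body(index, doc_type, docs):
    """Build a newline-delimited _bulk payload from (doc_id, source) pairs."""
    lines = []
    for doc_id, source in docs:
        action = {"index": {"_index": index, "_type": doc_type}}
        if doc_id is not None:
            action["index"]["_id"] = doc_id  # omit to let Elasticsearch pick an id
        lines.append(json.dumps(action, ensure_ascii=False))
        lines.append(json.dumps(source, ensure_ascii=False))
    # The bulk format requires a trailing newline after the last line.
    return "\n".join(lines) + "\n"


payload = bulk_body("erbd", "tweet", [
    ("1", {"author": "Luiz", "text": "sol em floripa"}),
    (None, {"author": "Luiz", "text": "Esse post não tem ID"}),
])
```

The resulting string would be sent as the body of a `POST /_bulk` request.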

Development with Elasticsearch - Key points
● Data management
○ Elasticsearch backup?
■ Elasticsearch is generally used as a volatile repository
■ Backup:
● https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-snapshots.html
■ Security:
● https://www.elastic.co/products/shield
● Mapping

Development with Elasticsearch - Installation
• Just download and unpack
• Elasticsearch and Kibana
• https://www.elastic.co/downloads/elasticsearch
• https://www.elastic.co/downloads/kibana
• Start:
• ./bin/elasticsearch
• ./bin/kibana
• In production it is not that simple:
• http://logz.io/blog/deploy-elk-production/

Check that it is running
(localhost:9200)

Create an index and mapping in Elasticsearch
• Simple example: “Tweet” and “Comment”
• Creating an index is as simple as:
• PUT erbd
• Mappings in Elasticsearch are flat
• Elasticsearch is configured to search for words in English
• Use an analyzer for Portuguese
• GIST:
• https://gist.github.com/lhzsantana/4f940684075ce115d799

Index some documents
POST erbd/tweet/1
{
  "author": "Luiz",
  "text": "Tá muito sol para falar de Elasticsearch",
  "hashtag": "#queriatánapraia"
}
POST erbd/tweet
{
  "author": "Luiz",
  "text": "Esse post não tem ID",
  "hashtag": "#seráqfunciona"
}

POST erbd/tweet/1
{
  "author": "Luiz",
  "text": "Elasticsearch é mais legal que praia",
  "hashtag": "#sqn"
}
POST erbd/tweet
{
  "author": "Luiz",
  "text": "O mapeamento do Elasticsearch é flexível",
  "local": "Florianópolis",
  "hashtag": "#schemaless"
}

POST erbd/comments/1
{
  "author": "Anônimo",
  "text": "Até agora não vi nada de Spark",
  "hashtag": "#taenrolando"
}
POST erbd/comment/1?parent=1
{
  "author": "Anônimo",
  "text": "Jurerê ou PHP?",
  "hashtag": "#queriatánapraia"
}

What does the mapping look like now?
• GET erbd/_mapping
• There is a mapping with the wrong name, “comments”
• The “local” field has no analyzer
https://gist.github.com/lhzsantana/b72dd13f339ff29b4682

Mapping with Geo Point
PUT erbd
{
  "mappings": {
    "crash": {
      "properties": {
        "location": {
          "type": "geo_point"
        }
      }
    }
  }
}

Search
GET /_search
{
  "query": {
    "match_all": {}
  }
}
GET /erbd/_search
{
  "query": {
    "match_all": {}
  }
}
GET /erbd/tweet/_search
{
  "query": {
    "match_all": {}
  }
}
GET /erbd/tweet,comment/_search
{
  "query": {
    "match_all": {}
  }
}
GET /erbd,sbbd/_search
{
  "query": {
    "match_all": {}
  }
}
GET /erbd,sbbd/tweet,comment/_search
{
  "query": {
    "match_all": {}
  }
}

Search
GET /_search
{
  "query": {
    "match": {
      "author": "luiz"
    }
  }
}

Search
GET /_search
{
  "query": {
    "match": {
      "local": "florianopolis"
    }
  }
}

Search - bool, boost and aggregations
GET /erbd/tweet,comment/_search
{
  "sort": [
    { "author": { "order": "desc" } }
  ],
  "size": 100,
  "query": {
    "bool": {
      "should": [
        { "match": { "author": "anônimo" } },
        { "match": { "local": "Florianópolis" } }
      ]
    }
  },
  "aggs": {
    "hashtags": {
      "terms": { "field": "author.raw" }
    }
  }
}
https://gist.github.com/lhzsantana/f552751da153741657
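
Nested request bodies like this one are easy to get wrong by hand, so one option is to build them as a dictionary and serialize them. The Python sketch below does that for the same bool query and terms aggregation. The `search_body` helper is a hypothetical name, and the field names are the ones used in this deck.

```python
import json


def search_body(should_matches, agg_name, agg_field, size=100):
    """Build a bool/should query with a terms aggregation as a plain dict."""
    return {
        "sort": [{"author": {"order": "desc"}}],
        "size": size,
        "query": {
            "bool": {
                # One match clause per (field, text) pair.
                "should": [{"match": {field: text}} for field, text in should_matches]
            }
        },
        # Aggregations sit next to "query", not inside it.
        "aggs": {agg_name: {"terms": {"field": agg_field}}},
    }


body = search_body(
    [("author", "anônimo"), ("local", "Florianópolis")],
    agg_name="hashtags",
    agg_field="author.raw",
)
request_json = json.dumps(body, ensure_ascii=False, indent=2)
```

Serializing from a dictionary guarantees balanced braces and keeps `aggs` at the top level of the request.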

Search - bool, boost and aggregations
{
  "took": 88,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": null,
    "hits": [
      {
        "_index": "phpsc",
        "_type": "post",
        "_id": "AVLauzKDtyulCxogNOoi",
        "_score": null,
        "_source": {
          "author": "Luiz",
          "text": "O mapeamento do Elasticsearch é flexível",
          ...
        }
      },
      ...
    ]
  },
  "aggregations": {
    "hashtags": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        { "key": "anônimo", "doc_count": 1 },
        { "key": "luiz", "doc_count": 1 }
      ]
    }
  }
}

Exercises
1. Send the Twitter and DataPOA data to Elasticsearch
2. Search for words in the data
--------------------- Homework :) ---------------------
3. Build a heat map
4. Build a bar chart
