Ciro Cavani
Personalization
Singularity
An Interactive Computing Environment for Big Data
based on Spark and IPython.
Globo.com
HackDay 2014-12-02
Motivation
The technology needed to change how
Globo.com does business is already in production:
Hadoop 2, Kafka and Spark.
The idea is to steer Globo toward
data-driven decisions.
Proposal
● access all of the company's data
● run machine learning algorithms
● identify relevant information
● formulate hypotheses and explore the data
● design experiments, A/B tests
● an interactive system (sketched below)
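To make the proposal concrete, here is a minimal sketch of the kind of interactive session it describes, assuming a PySpark context (sc) is already available in an IPython notebook; the HDFS path and the choice of k are hypothetical.

from pyspark.mllib.clustering import KMeans

# explore: load raw events and inspect a sample
events = sc.textFile("hdfs:///data/events.csv")
print(events.take(5))

# model: cluster numeric feature vectors with MLlib, directly in the notebook
points = events.map(lambda line: [float(x) for x in line.split(",")])
model = KMeans.train(points, k=10, maxIterations=20)
print(model.clusterCenters)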
Hadoop
Hadoop 2 is two systems:
● HDFS, a distributed
file system;
● YARN, a distributed
execution system.
HBase, Pig, Mahout, Solr
image: http://hortonworks.com/hadoop/yarn/
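A minimal sketch of how the two systems are used from Spark, assuming a Spark 1.x cluster with default HDFS and YARN configuration; the application name and file path are hypothetical.

from pyspark import SparkConf, SparkContext

# YARN schedules the application's executors across the cluster's nodes
conf = SparkConf().setAppName("hdfs-example").setMaster("yarn-client")
sc = SparkContext(conf=conf)

# HDFS stores the file, split into blocks distributed over the same nodes
lines = sc.textFile("hdfs:///user/data/access.log")
print(lines.count())

sc.stop()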
Kafka
A message distribution cluster
(billions of messages per day)
created by LinkedIn.
Performance: high throughput
Scalability: many consumers
Small messages, unstructured /
opaque (bytes)
image: http://hortonworks.com/hadoop/kafka/
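A minimal sketch of producing and consuming these opaque byte messages, assuming the kafka-python client (the slide does not name a library); the broker address and topic are hypothetical.

from kafka import KafkaProducer, KafkaConsumer

# messages are small, opaque byte strings
producer = KafkaProducer(bootstrap_servers="kafka01:9092")
producer.send("pageviews", b'{"url": "/home", "user": "42"}')
producer.flush()

# many independent consumers can read the same topic, which is what lets
# the cluster scale to lots of downstream systems
consumer = KafkaConsumer("pageviews", bootstrap_servers="kafka01:9092")
for message in consumer:
    print(message.value)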
Spark 
A fast and general-purpose cluster computing system. 
High-level APIs in Python 
Spark SQL for SQL and structured data processing 
MLlib for machine learning 
GraphX for graph processing 
Spark Streaming for stream processing 
http://spark.apache.org/
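A minimal sketch of the high-level Python API and Spark SQL, assuming a local Spark 1.3+ installation (DataFrame API); the sample data is hypothetical.

from pyspark import SparkContext
from pyspark.sql import SQLContext, Row

sc = SparkContext("local[*]", "spark-sql-example")
sqlContext = SQLContext(sc)

# core API: transformations over a distributed dataset (RDD)
views = sc.parallelize([Row(url="/home", hits=3), Row(url="/news", hits=7)])

# Spark SQL: query the same data as a structured table
df = sqlContext.createDataFrame(views)
df.registerTempTable("views")
sqlContext.sql("SELECT url FROM views WHERE hits > 5").show()

sc.stop()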
IPython Notebook 
web-based interactive 
computational environment 
where you can combine code 
execution, text, mathematics, 
plots and rich media into a 
single document.
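A minimal sketch of a notebook cell combining code and an inline plot, assuming matplotlib is installed; text and mathematics go in separate Markdown cells (e.g. $\sin(x)$).

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

# the figure is rendered right below the cell, in the same document
x = np.linspace(0, 2 * np.pi, 100)
plt.plot(x, np.sin(x))
plt.title("sin(x)")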
Wolfram Language (inspiration)
http://youtu.be/_P9HqHVPeik 
Stephen Wolfram introduces the 
Wolfram Language in this video that 
shows how the symbolic 
programming language enables 
powerful functional programming, 
querying of large databases, flexible 
interactivity, easy deployment, and 
much, much more.
Databricks Cloud (inspiration)
http://youtu.be/dJQ5lV5Tldw 
The Databricks Cloud provides the full 
power of Spark to you, in the cloud, 
plus a powerful set of features for 
exploring and visualizing your data,
as well as writing and deploying 
production data products. 
* Visualize data right as you explore it 
* Collaborate in real-time 
* Export your analysis to production 
dashboards in seconds
Jupyter and Julia (future)
http://youtu.be/jhlVHoeB05A 
This talk will begin with an introduction 
to the Julia language, explaining
why it is able to attain C-like 
performance in many cases. (...) we 
will explain how connecting to the 
IPython "Jupyter" front-end from an 
IJulia back-end allows Julia to benefit 
from IPython's rich multimedia 
notebook interface, and how Julia can 
even use IPython 2's interactive-widget 
infrastructure to provide truly interactive 
computations. 
https://github.com/stevengj/Julia-EuroSciPy14
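As an aside, the interactive-widget infrastructure mentioned above is the same one IPython exposes to Python code; a minimal sketch, assuming an IPython 2/3 notebook where widgets live under IPython.html.widgets (later split into the ipywidgets package), and a hypothetical function name.

from IPython.html.widgets import interact

def sample_size(n=100):
    # any computation can go here; the slider re-runs it as n changes
    print("sampling", n, "rows")

interact(sample_size, n=(10, 1000))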
Globo.com 
Liked it?
Want to work at Globo.com?
We are hiring
https://github.com/globocom/IWantToWorkAtGloboCom 
ciro.cavani@corp.globo.com 
https://www.linkedin.com/in/cirocavani
