Introduction to Data Engineering
Vivek A. Ganesan
vivganes@gmail.com

Agenda
Copyright 2013, Vivek A. Ganesan, All rights reserved
o Introduction
o What is data engineering?
o Why data engineering?
o Required Skills
o Questions?

Introduction
o What's with the name?
  o All other names were taken
  o Gods = Geeks on Data
  o Well, it is now Geeking out on Data
o Why a Data Geek?
  o Geeks are cool
  o Data Geeks are way cool
Partial Omniscience (Super power of Prediction)

Data, Data, Data!
• Significant increase in data (Volume)
  • Social Networks
  • Transaction Logs
• Fast streams of data (Velocity)
  • Sensor data
  • Machine-to-machine data
• Different kinds of data (Variety)
  • Text
  • Audio
  • Video
• This trend is only going to grow!
Note: EB = exabyte = 1,000 petabytes (1 million terabytes)
[Figure: Big Data Trends]

Before Big Data
• Life was simple … well, mostly
• The ETL engineers managed data pipelines
• The Data Scientists (they weren't called that, btw; they were mostly Statisticians who programmed in SAS, SPSS or S) did the analysis
• Data Warehouses, Data marts and OLAP cubes were the platforms
• Data Analysts mostly generated reports but they were proficient in SQL, Excel, Pivot Tables etc.
• Data Architects … well, they architected
• They managed:
  • Data models
  • Star Schemas
  • Data Governance
  • Master Data Management (MDM)
  • Data Security
• For the most part, they had to coax different groups to share data

Big Data – What Changed?
• Life … got interesting
• Huge data volumes – ETL became a problem
• Traditional Statistical tools couldn't handle the volume
• Data Warehouses, Data marts and OLAP cubes not primary analytical means – "in situ" analysis preferred, i.e. no moving data to an analytics platform
• Data Analysts still on point for reports but now they no longer had SQL interfaces (thanks to NoSQL and Map Reduce)
• Data Architects … well, they still need to architect
• Still need:
  • Data models
  • Data Governance
  • Data Security
• For the most part, they had to coax different groups to share data
• They have to do all of this when the technology is rapidly evolving
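
To make the "no SQL interfaces" point concrete: a report that is one line of SQL (for example, SELECT country, COUNT(*) ... GROUP BY country) has to be spelled out as explicit map, shuffle and reduce steps. The following is a minimal single-process sketch of that idea in Python. The sample records, field names and function names are all assumptions made for illustration; a real Map Reduce framework would run these phases across a cluster.

```python
from collections import defaultdict

# Hypothetical sample records; the field names are assumptions for illustration.
RECORDS = [
    {"user": "a", "country": "US"},
    {"user": "b", "country": "IN"},
    {"user": "c", "country": "US"},
]


def map_phase(record):
    """Emit (key, 1) pairs, as a mapper would for a count-by-key job."""
    yield record["country"], 1


def shuffle(pairs):
    """Group values by key; a framework does this between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)


def count_by_country(records):
    """Run map, shuffle and reduce in sequence over the records."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(count_by_country(RECORDS))
```

The three functions together do what GROUP BY plus COUNT(*) does in a SQL engine, which is why SQL semantics on top of these platforms shows up later as an opportunity.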

Life in the Big Data Universe
• The Good
  • Data recognized as an asset
  • Data Driven Products more common
  • Working with Data is cool
• The Bad
  • Complexity is overwhelming
  • No sophisticated toolset yet
  • Technology is fast changing
• The Ugly
  • No SQL!
  • Security
  • Governance
  • Performance
• The Opportunity
  • Solve for:
    • SQL semantics
    • Data Governance
    • Data Security
    • Benchmarking, Profiling and Performance measurement tools
  • Build:
    • Real-time solutions
    • Data Marts/Data Warehouses on top

Life in the Big Data Universe
Data Scientist
• Building Models
• Validation/Testing
• Algorithms
• Continuous Improvement
• Knowledge of:
  • Statistics
  • Linear Algebra
  • Machine Learning
  • R, Matlab etc.
Data Analyst
• Deep Domain Knowledge
• Report Generation
• Data Exploration
• Hypotheses Testing
• Pattern Discovery
• Correlations
• Serendipitous Discovery
Data Engineer
• Data Pipelines
• Manage Platforms
• Productionalize Algorithms
• Agile Development
• Knowledge of:
  • Platforms
  • Algorithms
  • Java, C++ etc.
  • Scripting languages like Python

Data Engineering
• Strong CS Background
  • Algorithms
  • Database theory
  • Scripting languages
  • Server side languages
• Distributed Systems Background
  • Clusters
  • Networking
  • Monitoring/Performance
• Data Science/Machine Learning
  • Search/IR
  • Text Analytics
  • Classification
  • Clustering
• Infrastructure
  • Hadoop
  • Cassandra
  • Mongo DB
• Platforms
  • Solr
  • Hive
  • HBase
  • Mahout
• Applications
  • Recommendation Engines
  • Fraud Prevention
  • Disease Prevention

Data Engineer's Role
• Data Dialysis – Cleaning up Data
  • Hard to do at Scale
  • Newer tools in this space
  • Great scope for innovation
• ETL -> ELT
  • Distributed Bulk loading
  • Full-fledged data pipelines
  • Supporting both data scientists and data analysts
• Productionalizing algorithms
  • Production support
  • Optimization
  • A/B Testing and Continuous Improvement
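
The shift from ETL to ELT and the data-cleaning role can be sketched together: raw data is loaded untouched first, and the cleanup runs afterwards inside the data store. The example below is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for a real platform. The table names, column names, sample rows and cleaning rules are assumptions made for illustration, not something the slides specify.

```python
import sqlite3

# Hypothetical raw rows: inconsistent casing, stray whitespace, a missing amount.
RAW_ROWS = [
    (" Alice ", "us", "10.5"),
    ("BOB", "US ", ""),
    ("carol", "in", "7"),
]


def load_raw(conn, rows):
    """Load step: store the data exactly as received, with no cleanup."""
    conn.execute(
        "CREATE TABLE raw_orders (customer TEXT, country TEXT, amount TEXT)"
    )
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", rows)


def transform(conn):
    """Transform step: clean the data inside the store, after loading."""
    conn.execute(
        """
        CREATE TABLE orders AS
        SELECT lower(trim(customer)) AS customer,
               upper(trim(country))  AS country,
               CAST(amount AS REAL)  AS amount
        FROM raw_orders
        WHERE trim(amount) <> ''  -- drop rows with no amount
        """
    )


def run_elt(rows):
    """Run load then transform on an in-memory database and return clean rows."""
    conn = sqlite3.connect(":memory:")
    load_raw(conn, rows)
    transform(conn)
    return conn.execute(
        "SELECT customer, country, amount FROM orders ORDER BY customer"
    ).fetchall()


if __name__ == "__main__":
    for row in run_elt(RAW_ROWS):
        print(row)
```

Keeping raw_orders around is the main design choice in ELT: the cleaning rules can change and be rerun without reloading the source data.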

About this Meetup: Structure
• Agile teams
• Monthly Scrum
  • Week 1: Introduction to Problem
  • Week 2: Algorithm + Platform
  • Week 3: Technical help (Algorithm, Platform, Testing and Deployment)
  • Week 4: Panel + Demo
    • Showcase Startups/Experts in the space
    • Teams show demos
    • Panel judges winners
    • We might have prizes (needs to be figured out)
• Weekly Meetup (on Mondays)
• Might move to a bigger venue if there is enough demand

About this Meetup: Schedule
• May 29th: Kickoff
• Scrum 1
  • June 3rd – Collaborative Filtering Introduction
  • June 10th – Mongo DB Introduction
  • June 17th – Analytics on Mongo DB
  • June 24th – Panel + Demo
• Scrum 2 (TBD)
• Come along now, it will be fun!
• Oh, the name

Questions? Comments?
Thank You!
E-mail: email@example.com
Twitter: onevivek