How to Build a Data-Driven Company: From Infrastructure to Insights

Companies like Buffer, SeatGeek, and Asana aren’t just talking about the value of data; they’re building data infrastructure that can actually deliver it. Join this 45-minute webinar to learn why these companies are investing in data and what you need to know to keep up.

Published in: Technology
  #datastack
  What you're going to learn
  1. How top engineering organizations are building their data infrastructure
  2. The 7 core challenges of data integration
  3. Why companies like Asana, Buffer, and SeatGeek choose Redshift for their analytics warehouse
  ...and much more!
  Data Infrastructure: Then and Now
  The traditional approach: ETL
  How companies are doing it today: ELT
  Benefits of this approach
  1. Redshift is performant enough to handle most transformations
  2. Users prefer performing transformations in a language they already use (SQL) or with a UI
  3. Transformations are much simpler and more transparent
  4. Performing transformations alongside raw data is great for auditability
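The ELT pattern above can be sketched in a few lines. This is a minimal illustration, not the setup any of the companies named here use: Python's built-in `sqlite3` stands in for Redshift, and the `raw_orders` and `customer_revenue` tables and their columns are hypothetical.

```python
import sqlite3

# sqlite3 stands in for the warehouse (the slides use Redshift).
# All table and column names below are hypothetical.
conn = sqlite3.connect(":memory:")

# Extract + Load: land the source rows untouched in a raw table.
conn.execute(
    "CREATE TABLE raw_orders "
    "(order_id INTEGER, customer TEXT, amount_cents INTEGER, status TEXT)"
)
raw_rows = [
    (1, "acme", 1250, "paid"),
    (2, "acme", 800, "refunded"),
    (3, "globex", 4000, "paid"),
]
conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?, ?)", raw_rows)

# Transform: plain SQL, run inside the warehouse next to the raw data.
conn.execute("""
    CREATE TABLE customer_revenue AS
    SELECT customer, SUM(amount_cents) / 100.0 AS revenue_usd
    FROM raw_orders
    WHERE status = 'paid'
    GROUP BY customer
""")

revenue = dict(conn.execute("SELECT customer, revenue_usd FROM customer_revenue"))
raw_count = conn.execute("SELECT COUNT(*) FROM raw_orders").fetchone()[0]
print(revenue, raw_count)
```

Because the raw table is never modified, the derived table can be rebuilt or audited against it at any time, which is the auditability benefit listed above.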
  What the stack looks like: Data Integration, Data Warehouse, BI/Analytics
  Data Integration
  Why consolidation matters
  internal analytics Shaun
  Quick poll
  Which five data sources are a top priority for you to integrate or keep integrated?
  ● production databases
  ● events
  ● error logs
  ● billing
  ● email marketing
  ● CRM
  ● advertising
  ● ERP
  ● A/B testing
  ● support
  “A year ago, we were facing a lot of stability problems with our data processing. When there was a major shift in a graph, people immediately questioned the data integrity. It was hard to distinguish interesting insights from bugs. Data science is already an art so you need the infrastructure to give you trustworthy answers to the questions you ask. 99% correctness is not good enough. And on the data infrastructure team, we were spending a lot of time churning on fighting urgent fires, and that prevented us from making much long-term progress. It was painful.”
  - Marco Gallotta, Asana, How to Build Stable, Accessible Data Infrastructure at a Startup
  “Our story would end here if real-time processing were perfect. But it’s not: some events can come in days late, some time ranges need to be re-processed after initial ingestion due to code changes or data revisions, various components of the real-time pipeline can fail, and so on.”
  - Gian Merlino, MetaMarkets, Building a Data Pipeline That Handles Billions of Events in Real-Time
  7 core challenges of data integration
  1. Connections: Every API is a unique and special snowflake
  2. Accuracy: Ordering data on a distributed system
  3. Latency: Large object data stores (Amazon S3, Redshift) are optimized for batches, not streams
  4. Scale: Data will grow exponentially as your company grows
  5. Flexibility: You're interacting with systems you don't control
  6. Monitoring: Notifications for expired credentials, errors, and disruptions
  7. Maintenance: Justifying investment in ongoing maintenance and improvement
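One common way to handle the latency challenge above is micro-batching: buffering streamed events and loading them in groups, since batch-oriented stores prefer fewer, larger writes. The sketch below is only an illustration under that assumption; `MicroBatcher`, its `max_size` parameter, and the in-memory `loaded_batches` sink are hypothetical names, and a production version would also flush on a timer.

```python
from typing import Callable, List


class MicroBatcher:
    """Buffers streamed events and hands them to a loader in batches."""

    def __init__(self, flush: Callable[[List[dict]], None], max_size: int = 3):
        self.flush_fn = flush      # hypothetical loader, e.g. a bulk-load call
        self.max_size = max_size   # flush once this many events are buffered
        self.buffer: List[dict] = []

    def add(self, event: dict) -> None:
        self.buffer.append(event)
        if len(self.buffer) >= self.max_size:
            self.flush()

    def flush(self) -> None:
        # Send whatever is buffered, then start a fresh buffer.
        if self.buffer:
            self.flush_fn(self.buffer)
            self.buffer = []


# In-memory stand-in for the warehouse loader.
loaded_batches: List[List[dict]] = []
batcher = MicroBatcher(flush=loaded_batches.append, max_size=3)

for i in range(7):
    batcher.add({"event_id": i})
batcher.flush()  # flush the final partial batch

batch_sizes = [len(batch) for batch in loaded_batches]
print(batch_sizes)
```

The trade-off is explicit: a larger `max_size` means fewer loads but a longer wait before an event becomes queryable.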
  Or...try Pipeline
  Warehousing Infrastructure
  Analytics warehouse
  Redshift is the most common analytics warehouse.
  Chosen by: Asana, Braintree, Looker, SeatGeek, VigLink, Buffer
  Airbnb experiment
                                             Hive               Redshift
  Test 1: 3 billion rows of data             28 minutes         <6 minutes
  Test 2: two joins with millions of rows    182 seconds        8 seconds
  Cost                                       $1.29/hour/node    $0.85/hour/node
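The ratios implied by the figures above can be worked out directly. The snippet below uses only the numbers from the slide and treats "<6 minutes" as an upper bound of 6, so the Test 1 speedup is a lower bound. The slide does not give node counts, so the cost ratio compares a single node-hour, not the total cost of either test.

```python
# Figures copied from the slide; "<6 minutes" is treated as 6 (an upper bound).
hive_test1_min, redshift_test1_min = 28, 6
hive_test2_sec, redshift_test2_sec = 182, 8
hive_cost, redshift_cost = 1.29, 0.85  # USD per hour per node

test1_speedup = hive_test1_min / redshift_test1_min  # lower bound on speedup
test2_speedup = hive_test2_sec / redshift_test2_sec
cost_ratio = redshift_cost / hive_cost               # per node-hour only

print(f"Test 1: at least {test1_speedup:.1f}x faster on Redshift")
print(f"Test 2: {test2_speedup:.2f}x faster on Redshift")
print(f"A Redshift node-hour costs {cost_ratio:.0%} of a Hive node-hour")
```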
  Periscope research
  DiamondStream's dashboard query performance
  Business Intelligence & Analytics
  A broken model
  ● Feedback loop is broken
  ● Disparate reporting
  ● Non-unified decision making
  ● Versioning
  ● Reusability is lost
  Marketing, Finance, AM
  Constraints of SQL
  ● SQL is versatile, but shares the flavor of write-only languages such as Perl: you can write it but not read it
  ● Promotes one-off, piecemeal analysis
  ● Disparate interpretation
  The critical multiplier: modeling
  Any SQL Data Warehouse, Modeling Layer
  ● What's our most successful marketing campaign?
  ● How does our Q4 pipeline look?
  ● Who are our healthiest / happiest customers?
  analytics
  ● Data access
  ● Uniform definitions
  ● A Shared View
  ● Collaboration
  ● Analytical Speed
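To make the idea of a modeling layer with uniform definitions concrete, here is a minimal sketch. It is not how any particular BI product works: the `METRICS` and `DIMENSIONS` dictionaries, the `build_query` helper, and the `orders` table are all hypothetical, and `sqlite3` stands in for the SQL data warehouse. The point is only that each business term is defined once and every question is compiled from that shared definition.

```python
import sqlite3

# Hypothetical shared definitions: each business term is defined exactly once.
METRICS = {
    "revenue_usd": "SUM(amount_cents) / 100.0",
    "order_count": "COUNT(*)",
}
DIMENSIONS = {"campaign", "region"}


def build_query(metric: str, dimension: str, table: str = "orders") -> str:
    """Compile a metric/dimension request into SQL from the shared definitions."""
    if metric not in METRICS or dimension not in DIMENSIONS:
        raise KeyError(f"unknown metric or dimension: {metric}, {dimension}")
    return (
        f"SELECT {dimension}, {METRICS[metric]} AS {metric} "
        f"FROM {table} GROUP BY {dimension} ORDER BY {metric} DESC"
    )


# sqlite3 stands in for the warehouse; the sample rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (campaign TEXT, region TEXT, amount_cents INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        ("spring_email", "us", 5000),
        ("spring_email", "eu", 2500),
        ("search_ads", "us", 3000),
    ],
)

# "What's our most successful marketing campaign?" answered via the shared model.
top_campaign = conn.execute(build_query("revenue_usd", "campaign")).fetchone()
print(top_campaign)
```

Marketing and Finance asking the same question now get the same SQL, instead of each team writing its own one-off query with its own interpretation of "revenue".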
  What You Can Do
  analytics tools Week 1 Week 2-3 RJMetrics Pipeline BLOCKS
