Building and Deploying Large Scale Real Time News System with MySQL and Distributed Cache
Presented to MySQL Conference, Apr. 13, 2011

Who am I?

•  Tao Cheng <tao.cheng@teamaol.com>, AOL Real Time News (RTN).
•  Worked on mail and browser clients in the '90s, then moved to web backend servers.
•  Not an expert, but happy to share my experience and brainstorm solutions.




Presentation for
[CLIENT]
Agenda

•  AOL Real Time News (RTN): what is it?
•  Requirements
•  Technical solutions, with a focus on MySQL
•  Deployment topology
•  Operational monitoring
•  Metrics collection
•  Tips for query tuning and optimization
•  Heuristic query optimization algorithm
•  Lessons learned
•  Q & A

Real Time News: background

AOL deployed its large-scale Real Time News (RTN) system in 2007.
The system ingests and processes news from 30,000 sources every second, around the clock. Today its data store, MySQL, holds several billion rows and terabytes of data.
Even so, news is delivered to end users in close to real time. This presentation shares how it is done and the lessons learned.


Presentation for
AOLU Un-University
Brief Intro: sample features

•  Data presentation: return the most recent news in
   •  flat view: most recent news about an entity. An entity could be a person, a company, a sports team, etc.
   •  topic clusters: most recent news grouped by topic. A topic is a group of news items about an event, a headline story, etc.
•  News filtering by
   •  source type, such as news, blogs, press releases, regional, etc.
   •  relevancy level (high, medium, low, etc.) to the entities.
•  Data delivery: push (to subscribers) and pull
•  Search by entities, categories (National, Sports, Finance, etc.), topics, document ID, etc.

Requirements for Phase I (2006)

•  Commodity hardware: 4 CPUs, 16 GB memory, 600 GB disk space.
•  Data ingestion rate: 250K docs/day; average document size: 5 KB.
•  Data retention period: 7 days to forever.
•  Estimated data set size: 1.25 GB/day (about 456 GB/year), plus space for indexes, schema changes, and optimization.
•  Response time: < 30 milliseconds/query
•  Throughput: > 400 queries/sec/server
•  Uptime: 99.999%

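The sizing figures above follow from simple arithmetic. The sketch below reproduces them in Python, assuming decimal units (1 GB = 1,000,000 KB) and a 365-day year; the function names are illustrative and not part of the RTN system. It also converts the 99.999% uptime target into a yearly downtime budget.

```python
DOCS_PER_DAY = 250_000   # from the slide
AVG_DOC_KB = 5           # from the slide
UPTIME_TARGET = 0.99999  # from the slide


def daily_gb(docs_per_day: int, avg_doc_kb: float) -> float:
    """Raw data volume per day in GB (decimal units assumed)."""
    return docs_per_day * avg_doc_kb / 1_000_000


def yearly_gb(docs_per_day: int, avg_doc_kb: float, days: int = 365) -> float:
    """Raw data volume per year in GB, before indexes and spare space."""
    return daily_gb(docs_per_day, avg_doc_kb) * days


def downtime_minutes_per_year(uptime: float, days: int = 365) -> float:
    """Downtime budget implied by an uptime target."""
    return (1 - uptime) * days * 24 * 60


print(daily_gb(DOCS_PER_DAY, AVG_DOC_KB))        # 1.25 GB/day
print(yearly_gb(DOCS_PER_DAY, AVG_DOC_KB))       # 456.25 GB/year
print(downtime_minutes_per_year(UPTIME_TARGET))  # a little over 5 minutes/year
```
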
Solutions: MySQL + Bucky

•  MySQL
   •  Serves raw/distinct queries
   •  Back fill
•  Bucky Technology (AOL's distributed cache and computing framework)
   •  Write-ahead cache: pre-compute query results and push them into the cache.
   •  Messaging (optional): push data directly to subscribers
      •  Updates are pushed to data consumers or browsers via AIM Complex.
•  Updates go to both the database and the cache.

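A minimal sketch of the write-ahead idea, not Bucky itself: a plain dict stands in for the distributed cache and an in-memory SQLite database stands in for MySQL. The class, table, and column names are hypothetical, and document IDs are assumed to increase with time. Each update goes to both stores, reads are served from the cache, and the database back-fills on a miss.

```python
import sqlite3


class NewsStore:
    """Write to the database and push the pre-computed result into the cache."""

    def __init__(self, max_per_entity: int = 3):
        self.db = sqlite3.connect(":memory:")  # stand-in for MySQL
        self.db.execute(
            "CREATE TABLE doc (doc_id INTEGER PRIMARY KEY, entity_id INTEGER, title TEXT)"
        )
        self.cache = {}  # stand-in for the distributed cache: entity_id -> newest docs
        self.max_per_entity = max_per_entity

    def ingest(self, doc_id: int, entity_id: int, title: str) -> None:
        # 1. The update goes to the database.
        self.db.execute("INSERT INTO doc VALUES (?, ?, ?)", (doc_id, entity_id, title))
        self.db.commit()
        # 2. The same update is pushed into the cached "latest news" result.
        latest = self.cache.setdefault(entity_id, [])
        latest.insert(0, (doc_id, title))
        del latest[self.max_per_entity:]

    def latest(self, entity_id: int) -> list:
        hit = self.cache.get(entity_id)
        if hit is not None:
            return list(hit)
        # Cache miss: back fill from the database.
        rows = self.db.execute(
            "SELECT doc_id, title FROM doc WHERE entity_id = ? "
            "ORDER BY doc_id DESC LIMIT ?",
            (entity_id, self.max_per_entity),
        ).fetchall()
        self.cache[entity_id] = rows
        return list(rows)
```

Because the database can rebuild any cached entry, a cold or flushed cache costs extra queries rather than lost data.
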
Architecture Diagram (over-simplified)

[Diagram: Relegence, Ingestor, Asset DB, Distributed Cache (two instances), Gateway (two instances), AIM, and WWW, with push and pull delivery paths.]

Data Model: SOR vs. Query DB

•  Separate query from storage to keep tables small and queries fast.
•  System of Record (SOR): holds all raw data
   •  The authoritative data store; designed for data storage.
   •  Normalized schema: simple key look-ups; no table joins.
•  Query DB: de-normalized for query speed
   •  Avoids JOINs, reduces the number of trips to the DB, and increases throughput.
•  Read/write a small chunk of data at a time so the database can get requests out quickly and process more of them.
•  Use replication to achieve linear scalability for reads.

Design Strategies: partitioning (Why)

•  Data set too big to fit on one host
•  Performance consideration: divide and conquer
   •  Write: more masters (Nx) to take writes
   •  Read: smaller tables plus more (NxM) slaves to handle reads
•  Fault tolerance: distribute the risk and reduce the impact of a system failure
•  Easier maintenance: size does matter
   •  Faster nightly backups, disaster recovery, schema changes, etc.
   •  Faster optimization: optimization is needed to reclaim disk space after deletions and to rebuild indexes for better query speed.

Design Strategies: partitioning (How)

•  Partition on the most used keys (look at query patterns)
   •  Document table: on document ID
   •  Entity table: on entity ID
•  Simple hash on IDs: no partition map, and thus no contention for read/write locks on yet another table
•  Managing growth: add another partition set
   •  New documents are written into both the old and new partition sets for a few weeks. Then writes to the old partitions stop.
   •  Queries go to the new partitions first, and then to the old ones if insufficient results are found.
•  Works great in our case but might not for everyone.

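The routing rules above can be sketched as follows. This is an illustration under stated assumptions, not the RTN implementation: the "simple hash" is taken to be ID modulo partition count, each partition is a dict standing in for a MySQL table, and the names PartitionSet and PartitionedStore are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class PartitionSet:
    """A fixed number of partitions addressed by a simple hash on the ID."""
    num_partitions: int
    partitions: list = field(init=False)

    def __post_init__(self):
        self.partitions = [dict() for _ in range(self.num_partitions)]

    def partition_for(self, doc_id: int) -> dict:
        # No partition map: the partition follows from the ID alone.
        return self.partitions[doc_id % self.num_partitions]


class PartitionedStore:
    """Dual-write during the transition; read new first, then old."""

    def __init__(self, old: PartitionSet, new: PartitionSet, dual_write: bool = True):
        self.old = old
        self.new = new
        self.dual_write = dual_write  # switch off after the transition period

    def write(self, doc_id: int, doc: str) -> None:
        self.new.partition_for(doc_id)[doc_id] = doc
        if self.dual_write:
            self.old.partition_for(doc_id)[doc_id] = doc

    def read(self, doc_id: int):
        doc = self.new.partition_for(doc_id).get(doc_id)
        if doc is None:
            # Not in the new set: fall back to the old partitions.
            doc = self.old.partition_for(doc_id).get(doc_id)
        return doc
```

Documents written before the new set existed are never moved; they stay reachable through the fallback read on the old set.
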
Schema design: De-normalization

•  Make query tables small:
   •  put only essential attributes in the de-normalized tables
   •  store long text attributes in separate tables.
•  De-normalization: how to store and match attributes
   •  Single-value attributes (1:1): document ID, short string, date time, etc. One column, one row.
   •  Multi-value attributes (1:many): tricky but feasible
      •  Use multiple rows with a composite index/key: (c1, c2, etc.)
      •  One row, one column: CSV string, e.g., "id1, id2, id3". SQL: val LIKE '%id2%'
      •  One row but multiple columns, e.g., group1, group2, etc. SQL: group1=val1 OR group2=val2 ...

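To make two of the multi-value options concrete, here is a small sketch that uses an in-memory SQLite database as a stand-in for MySQL. The table and column names are hypothetical. It compares the multiple-rows layout (composite key) with the CSV-string layout, and shows a caveat of the LIKE approach: a substring pattern such as '%id2%' also matches longer IDs like "id20".

```python
import sqlite3

db = sqlite3.connect(":memory:")  # stand-in for MySQL
db.executescript(
    """
    -- Option 1: one row per (entity, document) pair with a composite key.
    CREATE TABLE doc_entity_rows (
        entity_id TEXT,
        doc_id    INTEGER,
        PRIMARY KEY (entity_id, doc_id)
    );
    -- Option 2: one row per document, entity IDs packed into a CSV string.
    CREATE TABLE doc_entity_csv (
        doc_id     INTEGER PRIMARY KEY,
        entity_ids TEXT
    );
    """
)
db.executemany(
    "INSERT INTO doc_entity_rows VALUES (?, ?)",
    [("id1", 100), ("id2", 100), ("id3", 100), ("id20", 200)],
)
db.executemany(
    "INSERT INTO doc_entity_csv VALUES (?, ?)",
    [(100, "id1, id2, id3"), (200, "id20")],
)


def docs_by_rows(entity_id: str) -> list:
    """Exact match on the leading column of the composite key."""
    rows = db.execute(
        "SELECT doc_id FROM doc_entity_rows WHERE entity_id = ? ORDER BY doc_id",
        (entity_id,),
    )
    return [r[0] for r in rows]


def docs_by_csv(entity_id: str) -> list:
    """Substring match on the CSV column; cannot use an ordinary index."""
    rows = db.execute(
        "SELECT doc_id FROM doc_entity_csv WHERE entity_ids LIKE ? ORDER BY doc_id",
        (f"%{entity_id}%",),
    )
    return [r[0] for r in rows]


print(docs_by_rows("id2"))  # [100]
print(docs_by_csv("id2"))   # [100, 200]: "id20" matches the pattern too
```

If the CSV layout is used, delimiting every value on both sides (for example ",id2,") is one way to avoid the false match.
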
Tips for indexing

•  Simple key: for metadata retrieval
•  Composite key: find matching documents
   •  Start with low-cardinality and most used columns
   •  Order matters: (c1, c2, c3) != (c2, c3, c1)
•  InnoDB: all secondary indexes contain the primary key
   •  Make the primary key short to keep index size small
   •  Queries using a secondary index reference the primary key too.
•  Integer vs. string: comparing numeric values is faster, so index hash values of long strings instead.
•  Index length: title:varchar(255) => idx_title(32)
•  Enforce referential integrity on the application side.

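The "index hash values of long strings" tip can be sketched like this. The slides do not say which hash RTN used, so CRC32 is an assumption here (MySQL offers a CRC32() function that could play the same role on the database side); the HashedIndex class is hypothetical. Because different strings can share a hash value, the lookup still compares the full string.

```python
import zlib


def string_hash(text: str) -> int:
    """32-bit integer hash of a long string (CRC32 is an assumed choice)."""
    return zlib.crc32(text.encode("utf-8"))


class HashedIndex:
    """Index keyed on the integer hash; the full string resolves collisions."""

    def __init__(self):
        self._buckets = {}  # hash value -> list of (original string, doc_id)

    def add(self, text: str, doc_id: int) -> None:
        self._buckets.setdefault(string_hash(text), []).append((text, doc_id))

    def find(self, text: str) -> list:
        # Cheap integer lookup first, then an exact string comparison,
        # the equivalent of: WHERE url_hash = ? AND url = ?
        candidates = self._buckets.get(string_hash(text), [])
        return [doc_id for stored, doc_id in candidates if stored == text]


index = HashedIndex()
index.add("http://example.com/a/very/long/article/url", 1)
index.add("http://example.com/another/long/article/url", 2)
print(index.find("http://example.com/a/very/long/article/url"))  # [1]
```
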
MySQL configuration

•  Storage engine: InnoDB (row-level locking)
•  Table space: one file per table
   •  Easier to maintain (schema change, optimization, etc.)
•  Character set: 'UTF-8'
   •  Disable persistent connections (5.0.x)
   •  skip-character-set-client-handshake
•  Enable the slow query log to identify bad queries.
•  System variables for memory buffer sizes
   •  innodb_buffer_pool_size: data and indexes
   •  sort_buffer_size, max_heap_table_size, tmp_table_size
   •  Query cache size = 0; tables are updated constantly

Runtime statistics (per server)

•  Average write rate:
   •  daily: < 40 tps
   •  max of 400 tps during recovery
   •  Performs best when the write rate is < 100 tps
•  Query rate: 20~80 qps
•  Query response time: shorter when indexes and data are in memory
   •  75%: ~3 ms when qps < 15; ~2 ms when qps ~= 60
   •  95%: 6~8 ms when qps < 15; 3~4 ms when qps ~= 60
•  CPU idle %: > 99%.

Deployment Topology Consideration

•  Minimum configuration: host/DC redundancy
   •  DC1: host 1 (master), host 3 (slave)
   •  DC2: host 2 (failover master), host 4 (slave)
•  Data locality: significant when network latency is a concern (100 Mbps)
   •  3,000 qps when the DB is on a remote host.
   •  15,000 qps when the DB is on the local host.
•  Linking dependent servers across data centers
   •  Push the cross link up as far as possible (Topology 3): link to dependent servers in the same data center.

Deployment Topology 1: minimum config

[Diagram: Data Center 1 and Data Center 2, each with two DB hosts; a Data Consumer sits between the DB hosts and the WWW.]

Topology 2: link across DCs (bad)

[Diagram: two data centers, each with DB hosts behind a VIP and Data Consumers behind a VIP. One GSLB sits between the Data Consumers and the DB VIPs, and another sits between the Data Consumer VIPs and the WWW.]

Topology 3: link to same DC (better)

[Diagram: two data centers, each with DB hosts behind a VIP and Data Consumers behind a VIP. A single GSLB sits between the Data Consumer VIPs and the WWW.]

Topology 4: use local UNIX socket

[Diagram: two data centers, each with DB hosts paired directly with Data Consumers, with no VIP between them. The Data Consumers sit behind a VIP, and a GSLB sits between the VIPs and the WWW.]

Production Monitoring

•  Operational monitoring: logcheck, Scout/NOC alerts, etc.
•  DB monitoring of replication failure, latency, read/write rate, and performance metrics.

Metrics Collection

•  Graphing collected metrics: visualize and collate operational metrics.
   •  Helps analyze and fine-tune server performance.
   •  Helps trace production issues and identify points of failure.
•  What metrics are important?
   •  Host: CPU, memory, disk I/O, network I/O, number of processes, CPU swap/paging
   •  Server: throughput, response time
•  Comparison: line up charts (throughput, response time, CPU, disk I/O) in the same time window.

Tuning and Optimizing Queries

•  EXPLAIN: mysql> EXPLAIN SELECT ... FROM …
•  Watch out for temporary table usage, table scans, etc.
•  SQL_NO_CACHE
•  MySQL query profiler
   •  mysql> SET profiling=1;
•  Linux OS cache: leave enough memory on the host
•  USE INDEX hint to choose an index explicitly
   •  Use it wisely: most of the time, MySQL chooses the right index for you. But as table size grows, index cardinality might change.

Important MySQL statistics

•  SHOW GLOBAL STATUS…
   •  Qcache_free_blocks
   •  Qcache_free_memory
   •  Qcache_hits
   •  Qcache_inserts
   •  Qcache_lowmem_prunes
   •  Qcache_not_cached
   •  Qcache_queries_in_cache
   •  Select_scan
   •  Sort_scan

Important MySQL statistics (cont.)

   •  Table_locks_waited
   •  Innodb_row_lock_current_waits
   •  Innodb_row_lock_time
   •  Innodb_row_lock_time_avg
   •  Innodb_row_lock_time_max
   •  Innodb_row_lock_waits
   •  Slave_open_temp_tables

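Counters such as Select_scan and Innodb_row_lock_waits accumulate from server start, so they are usually most informative as a rate between two snapshots. A minimal sketch, assuming two SHOW GLOBAL STATUS snapshots have already been loaded into dicts; the sample values are made up and the function names are hypothetical. Innodb_row_lock_time is reported in milliseconds.

```python
def counter_rates(before: dict, after: dict, seconds: float) -> dict:
    """Per-second rate of every counter present in both snapshots."""
    return {
        name: (after[name] - before[name]) / seconds
        for name in before
        if name in after
    }


def avg_row_lock_wait_ms(before: dict, after: dict) -> float:
    """Average row lock wait (ms) during the interval between snapshots."""
    waits = after["Innodb_row_lock_waits"] - before["Innodb_row_lock_waits"]
    if waits == 0:
        return 0.0
    lock_time = after["Innodb_row_lock_time"] - before["Innodb_row_lock_time"]
    return lock_time / waits


# Hypothetical snapshots taken 60 seconds apart.
t0 = {"Select_scan": 1_000, "Innodb_row_lock_waits": 40, "Innodb_row_lock_time": 800}
t1 = {"Select_scan": 1_600, "Innodb_row_lock_waits": 50, "Innodb_row_lock_time": 1_100}

rates = counter_rates(t0, t1, seconds=60)
print(rates["Select_scan"])          # 10.0 table scans per second
print(avg_row_lock_wait_ms(t0, t1))  # 30.0 ms per row lock wait
```
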
Heuristic Query Optimization Algorithm

•  Primarily for complex cluster queries: find the latest N topics and related stories.
•  Strategy: reduce the number of records the database needs to load from disk to perform a query.
   •  Pick a default query range. If insufficient docs are returned, expand the query range proportionally.
   •  If none are returned => sparse data => drop the range and retry.
   •  Save the query range for future reference.
•  Result: reduced the number of rows to process from millions to hundreds => cut query time from minutes to less than 10 ms.

[Flow chart, reconstructed from the slide labels]

•  A cluster query arrives: look up its query range and set NumOfTripToDB = 0.
•  Has query range? If no, use the default range.
•  Bound the query with the range and send it to the DB (query engine); NumOfTripToDB++.
•  Sufficient results from the query engine? If yes, compute the docs-to-range ratio, save it back to the look-up table for future use, and return the query results to clients.
•  If not, numOfResults == 0? If yes, send the original query to the DB and return the query results to clients.
•  Otherwise, NumOfTripToDB >= 2? If yes, send the original query to the DB. If no, compute the docs-to-range ratio, prorate it to a range that would return a sufficient number of docs, and send the bounded query again.

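The flow above can be sketched in Python as follows. This is an interpretation of the slide, not the RTN code: the range is assumed to be a time window in seconds ending at "now", the default range and the headroom factor are hypothetical values, and run_query is a caller-supplied stand-in for the database that returns documents newest first (since=None means the original, unbounded query).

```python
DEFAULT_RANGE = 3600.0   # hypothetical default window, in seconds
MAX_BOUNDED_TRIPS = 2    # NumOfTripToDB >= 2 falls back to the original query
HEADROOM = 1.5           # hypothetical safety factor when prorating a range


class RangedQuery:
    def __init__(self, run_query):
        self.run_query = run_query  # run_query(key, since) -> docs, newest first
        self.ratio = {}             # look-up table: key -> docs per second
        self.trips = 0              # DB trips used by the last fetch

    def fetch(self, key, now, wanted):
        self.trips = 0
        ratio = self.ratio.get(key)
        # Has query range? Derive it from the saved ratio, else use the default.
        query_range = wanted / ratio * HEADROOM if ratio else DEFAULT_RANGE
        while True:
            docs = self.run_query(key, now - query_range)
            self.trips += 1
            if len(docs) >= wanted:
                # Sufficient results: save the docs-to-range ratio for next time.
                self.ratio[key] = len(docs) / query_range
                return docs[:wanted]
            if not docs or self.trips >= MAX_BOUNDED_TRIPS:
                # Sparse data or too many trips: send the original query.
                self.trips += 1
                return self.run_query(key, None)[:wanted]
            # Prorate to a range that should return enough docs, then retry.
            query_range = query_range * wanted / len(docs) * HEADROOM


def make_query(timestamps_by_key):
    """In-memory stand-in for the database; documents are just timestamps."""
    def run_query(key, since):
        newest_first = sorted(timestamps_by_key.get(key, []), reverse=True)
        return [t for t in newest_first if since is None or t >= since]
    return run_query


data = {
    "busy": list(range(0, 10001, 10)),      # one doc every 10 seconds
    "medium": list(range(0, 10001, 1000)),  # one doc every 1000 seconds
    "sparse": [100, 200],
}
rq = RangedQuery(make_query(data))
print(rq.fetch("busy", now=10000, wanted=5), rq.trips)
print(rq.fetch("medium", now=10000, wanted=5), rq.trips)
print(rq.fetch("sparse", now=10000, wanted=5), rq.trips)
```

The saved ratio is what keeps later queries for the same key narrow: a busy key gets a short window, so the database touches far fewer rows than the unbounded query would.
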
Lessons Learned

•  Always load test well ahead of launch (2 weeks) to avoid a fire drill.
•  Don't rely solely on the cache. The database needs to be able to serve a reasonable number of queries on its own.
•  Separate the cache from applications to avoid cold starts.
•  Keep transactions/queries simple and return fast.
•  Avoid table joins; limit them to 2 if really needed.
•  Avoid stored procedures: results are not cached, and a DBA is needed to alter the implementation.

Lessons Learned (cont.)

•  Avoid using 'offset' in the LIMIT clause; use application-based pagination instead.
•  Avoid 'SQL_CALC_FOUND_ROWS' in SELECT.
•  If possible, exclude text/blob columns from query results to avoid disk I/O.
•  Store text/blob data in a separate table to speed up backup, optimization, and schema changes.
•  Separate real-time from archive data for better performance and easier maintenance.
•  Keep table size under control (< 100 GB); optimize periodically.

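One common way to paginate without OFFSET is for the application to remember the last key it saw and ask for rows beyond it (often called keyset pagination). The slides do not spell out which scheme RTN used, so the sketch below is an assumption; it uses an in-memory SQLite database as a stand-in for MySQL, with a hypothetical doc table.

```python
import sqlite3

db = sqlite3.connect(":memory:")  # stand-in for MySQL
db.execute("CREATE TABLE doc (doc_id INTEGER PRIMARY KEY, title TEXT)")
db.executemany(
    "INSERT INTO doc VALUES (?, ?)",
    [(i, f"story {i}") for i in range(1, 11)],
)


def next_page(before_id, page_size):
    """Return the next page of newest-first docs with doc_id < before_id.

    Pass before_id=None for the first page. The database seeks directly
    to the key instead of reading and discarding `offset` rows.
    """
    if before_id is None:
        rows = db.execute(
            "SELECT doc_id, title FROM doc ORDER BY doc_id DESC LIMIT ?",
            (page_size,),
        )
    else:
        rows = db.execute(
            "SELECT doc_id, title FROM doc WHERE doc_id < ? "
            "ORDER BY doc_id DESC LIMIT ?",
            (before_id, page_size),
        )
    return rows.fetchall()


page1 = next_page(None, 4)
page2 = next_page(page1[-1][0], 4)  # continue after the last ID seen
print([doc_id for doc_id, _ in page1])  # [10, 9, 8, 7]
print([doc_id for doc_id, _ in page2])  # [6, 5, 4, 3]
```
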
Lessons Learned (cont.)

•  Put SQL statements (templates) in resource files so you can tune them without a binary change.
•  Set up replication in dev and QA to catch replication issues earlier
   •  Statement-based (MySQL 5.0.x) vs. row-based/mixed (5.1 or above) replication
   •  Auto-increment + (INSERT ... ON DUPLICATE KEY UPDATE ...)
   •  Date time column: default to NOW()
   •  Oversized data: increase max_allowed_packet
   •  Replication lag: transactions that involve index updates/deletions often take longer to complete.
•  Host and data center redundancy is important: don't put all your eggs in one basket.

RTN 3 Redesign

•  Free text search with SOLR
   •  Real-time vs. archive shards.
   •  1 minute latency without Ramdisk.
•  Asset DB partitioned: 5 rows/doc -> 25 rows/doc
•  Avoid (system) virtual machines; instead, stack high-end hosts with processes that use different system resources (CPU, memory, disk space, etc.)
   •  Better network and system resource utilization: cost effective.
   •  Data locality
•  More processors (< 12) help when under load.

Q&A

•  Questions or comments?

THANK YOU !!

Building a high-performance data lake analytics engine at Alibaba Cloud with ...Alluxio, Inc.
 
Azure Days 2019: Grösser und Komplexer ist nicht immer besser (Meinrad Weiss)
Azure Days 2019: Grösser und Komplexer ist nicht immer besser (Meinrad Weiss)Azure Days 2019: Grösser und Komplexer ist nicht immer besser (Meinrad Weiss)
Azure Days 2019: Grösser und Komplexer ist nicht immer besser (Meinrad Weiss)Trivadis
 
Getting Started with Amazon Redshift
Getting Started with Amazon RedshiftGetting Started with Amazon Redshift
Getting Started with Amazon RedshiftAmazon Web Services
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 
NewSQL - Deliverance from BASE and back to SQL and ACID
NewSQL - Deliverance from BASE and back to SQL and ACIDNewSQL - Deliverance from BASE and back to SQL and ACID
NewSQL - Deliverance from BASE and back to SQL and ACIDTony Rogerson
 
Amazon Elastic Map Reduce - Ian Meyers
Amazon Elastic Map Reduce - Ian MeyersAmazon Elastic Map Reduce - Ian Meyers
Amazon Elastic Map Reduce - Ian Meyershuguk
 
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010Bhupesh Bansal
 
Hadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedInHadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedInHadoop User Group
 
Software architecture for data applications
Software architecture for data applicationsSoftware architecture for data applications
Software architecture for data applicationsDing Li
 

Semelhante a Building and deploying large scale real time news system with my sql and distributed cache mysql_conf (20)

The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
 
Artur Fejklowicz - “Data Lake architecture” AI&BigDataDay 2017
Artur Fejklowicz - “Data Lake architecture” AI&BigDataDay 2017Artur Fejklowicz - “Data Lake architecture” AI&BigDataDay 2017
Artur Fejklowicz - “Data Lake architecture” AI&BigDataDay 2017
 
Getting Started with Amazon Redshift
Getting Started with Amazon RedshiftGetting Started with Amazon Redshift
Getting Started with Amazon Redshift
 
MinneBar 2013 - Scaling with Cassandra
MinneBar 2013 - Scaling with CassandraMinneBar 2013 - Scaling with Cassandra
MinneBar 2013 - Scaling with Cassandra
 
Getting Started with Amazon Redshift
Getting Started with Amazon RedshiftGetting Started with Amazon Redshift
Getting Started with Amazon Redshift
 
Building Analytic Apps for SaaS: “Analytics as a Service”
Building Analytic Apps for SaaS: “Analytics as a Service”Building Analytic Apps for SaaS: “Analytics as a Service”
Building Analytic Apps for SaaS: “Analytics as a Service”
 
C* Summit 2013: Large Scale Data Ingestion, Processing and Analysis: Then, No...
C* Summit 2013: Large Scale Data Ingestion, Processing and Analysis: Then, No...C* Summit 2013: Large Scale Data Ingestion, Processing and Analysis: Then, No...
C* Summit 2013: Large Scale Data Ingestion, Processing and Analysis: Then, No...
 
Building a High Performance Analytics Platform
Building a High Performance Analytics PlatformBuilding a High Performance Analytics Platform
Building a High Performance Analytics Platform
 
Using SAS GRID v 9 with Isilon F810
Using SAS GRID v 9 with Isilon F810Using SAS GRID v 9 with Isilon F810
Using SAS GRID v 9 with Isilon F810
 
Compare Clustering Methods for MS SQL Server
Compare Clustering Methods for MS SQL ServerCompare Clustering Methods for MS SQL Server
Compare Clustering Methods for MS SQL Server
 
Cassandra's Odyssey @ Netflix
Cassandra's Odyssey @ NetflixCassandra's Odyssey @ Netflix
Cassandra's Odyssey @ Netflix
 
Building a high-performance data lake analytics engine at Alibaba Cloud with ...
Building a high-performance data lake analytics engine at Alibaba Cloud with ...Building a high-performance data lake analytics engine at Alibaba Cloud with ...
Building a high-performance data lake analytics engine at Alibaba Cloud with ...
 
Azure Days 2019: Grösser und Komplexer ist nicht immer besser (Meinrad Weiss)
Azure Days 2019: Grösser und Komplexer ist nicht immer besser (Meinrad Weiss)Azure Days 2019: Grösser und Komplexer ist nicht immer besser (Meinrad Weiss)
Azure Days 2019: Grösser und Komplexer ist nicht immer besser (Meinrad Weiss)
 
Getting Started with Amazon Redshift
Getting Started with Amazon RedshiftGetting Started with Amazon Redshift
Getting Started with Amazon Redshift
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 
NewSQL - Deliverance from BASE and back to SQL and ACID
NewSQL - Deliverance from BASE and back to SQL and ACIDNewSQL - Deliverance from BASE and back to SQL and ACID
NewSQL - Deliverance from BASE and back to SQL and ACID
 
Amazon Elastic Map Reduce - Ian Meyers
Amazon Elastic Map Reduce - Ian MeyersAmazon Elastic Map Reduce - Ian Meyers
Amazon Elastic Map Reduce - Ian Meyers
 
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
 
Hadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedInHadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedIn
 
Software architecture for data applications
Software architecture for data applicationsSoftware architecture for data applications
Software architecture for data applications
 

Último

GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsRoshan Dwivedi
 

Último (20)

GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 

Building and Deploying Large Scale Real Time News System with MySQL and Distributed Cache

  • 1. Building and Deploying Large Scale Real Time News System with MySQL and Distributed Cache. Presented to MySQL Conference, Apr. 13, 2011
  • 2. Who am I?
      • Tao Cheng <tao.cheng@teamaol.com>, AOL Real Time News (RTN).
      • Worked on Mail and Browser clients in the ‘90s, then moved to web backend servers.
      • Not an expert, but happy to share my experience and brainstorm solutions.
  • 3. Agenda
      • AOL Real Time News (RTN): what is it?
      • Requirements
      • Technical solutions with focus on MySQL
      • Deployment Topology
      • Operational Monitoring
      • Metrics Collection
  • 4. Agenda (cont.)
      • Tips for query tuning and optimization
      • Heuristic Query Optimization Algorithm
      • Lessons learned
      • Q & A
  • 5. Real Time News: background
      AOL deployed its large scale Real Time News (RTN) system in 2007. The system ingests and processes news from 30,000 sources every second, around the clock. Today its data store, MySQL, has accumulated several billion rows and terabytes of data. Even so, news is delivered to end users in close to real time. This presentation shares how it is done and the lessons learned. Presentation for AOLU Un-University
  • 6. Brief Intro: sample features
      • Data presentation: return most recent news in
          • flat view: most recent news about an entity. An entity could be a person, a company, a sports team, etc.
          • topic clusters: most recent news grouped by topic. A topic is a group of news about an event, headline news, etc.
      • News filtering by
          • source type, such as news, blogs, press releases, regional, etc.
          • relevancy level (high, medium, low, etc.) to the entities.
      • Data delivery: push (to subscribers) and pull
      • Search by entities, categories (National, Sports, Finance, etc.), topics, document ID, etc.
  • 7. Requirements for Phase I (2006)
      • Commodity hardware: 4 CPUs, 16 GB memory, 600 GB disk space.
      • Data ingestion rate = 250K docs/day; average document size = 5 KB.
      • Data retention period: 7 days to forever
      • Est. data set size: (1.25 GB/day or 456 GB/year) + space for indexes, schema change, and optimization.
      • Response time: < 30 milliseconds per query
      • Throughput: > 400 queries/sec/server
      • Uptime: 99.999%
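The data set estimate on this slide follows directly from the ingestion rate and the average document size. A minimal sketch of that arithmetic, assuming decimal units (1 GB = 1,000,000 KB); the function and constant names are illustrative only:

```python
# Back-of-the-envelope sizing from the Phase I figures on the slide.
DOCS_PER_DAY = 250_000  # ingestion rate
AVG_DOC_KB = 5          # average document size


def raw_data_gb(days: int) -> float:
    """Raw document volume in GB, before indexes and maintenance headroom."""
    return DOCS_PER_DAY * AVG_DOC_KB * days / 1_000_000


daily_gb = raw_data_gb(1)
yearly_gb = raw_data_gb(365)
print(f"{daily_gb:.2f} GB/day, {yearly_gb:.0f} GB/year")
```

This covers raw documents only; the slide adds space for indexes, schema changes, and optimization on top of it.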
  • 8. Solutions: MySQL + Bucky
      • MySQL
          • Serve raw/distinct queries
          • Back fill
      • Bucky Technology (AOL’s distributed cache & computing framework)
          • Write-ahead cache: pre-compute query results and push them into cache.
          • Messaging (optional): push data directly to subscribers
              • Updates are pushed to data consumers or browsers via AIM Complex.
      • Updates go to both database and cache.
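The write-ahead cache idea (pre-compute query results on every update, and let the database back fill on a miss) can be sketched in a few lines. This is only an illustration under stated assumptions: Bucky's API is not described in the slides, so the `NewsStore` class, its methods, and the in-memory dictionaries standing in for MySQL and the distributed cache are all hypothetical.

```python
from collections import defaultdict


class NewsStore:
    """Toy write-ahead cache: every write refreshes the cached query result."""

    def __init__(self, top_n=3):
        self.top_n = top_n
        self.db = defaultdict(list)  # stand-in for MySQL: entity -> [(ts, doc_id)]
        self.cache = {}              # stand-in for the distributed cache
        self.db_reads = 0            # counts read queries the database had to serve

    def _query_db(self, entity):
        """Most recent doc IDs for an entity, newest first."""
        rows = sorted(self.db[entity], reverse=True)[: self.top_n]
        return [doc_id for _, doc_id in rows]

    def ingest(self, entity, doc_id, ts):
        """An update goes to both the database and the cache."""
        self.db[entity].append((ts, doc_id))
        self.cache[entity] = self._query_db(entity)  # pre-computed result

    def latest(self, entity):
        """Serve from cache; on a miss, back fill from the database."""
        if entity not in self.cache:
            self.db_reads += 1
            self.cache[entity] = self._query_db(entity)
        return self.cache[entity]


store = NewsStore(top_n=2)
for ts, doc_id in enumerate(["d1", "d2", "d3"]):
    store.ingest("aol", doc_id, ts)

print(store.latest("aol"), store.db_reads)  # served from cache
store.cache.clear()                          # simulate a cold cache
print(store.latest("aol"), store.db_reads)  # back filled from the database
```

In this sketch the read path never touches the database while the cache is warm, which is why the database only needs to serve raw queries and back fill.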
  • 9. Architecture Diagram (over-simplified)
      [Diagram labels: Relegence, Ingestor, Distributed Cache, Gateway, Asset DB, AIM, WWW; edge labels: push, pull]
  • 10. Data Model: SOR vs. Query DB
      • Separate query from storage to keep tables small and queries fast.
      • System of Record (SOR): has all raw data
          • The authoritative data store; designed for data storage
          • Normalized schema: for simple key look-up; no table join.
      • Query DB: de-normalized for query speed
          • Avoid JOIN, reduce the number of trips to the DB, increase throughput.
      • Read/write a small chunk of data at a time so the database can get requests out quickly and process more.
      • Use replication to achieve linear scalability for reads.
  • 11. Design Strategies: partitioning (Why)
      • Dataset too big to fit on one host
      • Performance consideration: divide and conquer
          • Write: more masters (Nx) to take writes
          • Read: smaller tables + more (NxM) slaves to handle reads.
      • Fault tolerance: distribute the risk and reduce the impact of system failure
      • Easier maintenance: size does matter
          • Faster nightly backup, disaster recovery, schema change, etc.
          • Faster optimization: optimization is needed to reclaim disk space after deletion and to rebuild indexes to improve query speed.
  • 12. Design Strategies: partitioning (How)
      • Partition on most used keys (look at query patterns)
          • Document table: on document ID
          • Entity table: on entity ID
      • Simple hash on IDs: no partition map, thus no competition for read/write locks on yet another table
      • Managing growth: add another partition set
          • New documents are written into both old and new partition sets for a few weeks. Then, stop writing into the old partitions.
          • Queries go to the new partitions first and then the old ones if insufficient results are found.
      • Works great in our case but might not for everyone.
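The partitioning scheme above (simple hash on the ID, dual writes while a new partition set is phased in, reads that try the new set first) can be sketched as follows. The slides do not say which hash function was used; modulo on an integer ID is an assumption, and `PartitionSet`, `write`, and `read` are hypothetical names with plain dictionaries standing in for the partitioned tables.

```python
class PartitionSet:
    """A fixed group of partitions addressed by a simple hash on the ID."""

    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [{} for _ in range(num_partitions)]

    def _partition(self, doc_id):
        # Simple hash (assumed: modulo): no partition map to look up or lock.
        return self.partitions[doc_id % len(self.partitions)]

    def put(self, doc_id, doc):
        self._partition(doc_id)[doc_id] = doc

    def get(self, doc_id):
        return self._partition(doc_id).get(doc_id)


def write(doc_id, doc, partition_sets):
    """During the overlap window, write into every active partition set."""
    for pset in partition_sets:
        pset.put(doc_id, doc)


def read(doc_id, new_set, old_set):
    """Try the new partition set first, then fall back to the old one."""
    doc = new_set.get(doc_id)
    return doc if doc is not None else old_set.get(doc_id)


old_set = PartitionSet("old", num_partitions=4)
new_set = PartitionSet("new", num_partitions=8)

write(101, "written before the new set existed", [old_set])
write(202, "written during the overlap window", [old_set, new_set])
write(303, "written after the old set was retired", [new_set])

for doc_id in (101, 202, 303):
    print(doc_id, read(doc_id, new_set, old_set))
```

Because each set keeps its own partition count, adding a set never requires rehashing existing rows; the cost is a second lookup for older documents.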
  • 13. Schema design: De-normalization
      • Make query tables small:
          • put only essential attributes in the de-normalized tables
          • store long text attributes in separate tables.
      • De-normalization: how to store and match attributes
          • Single-value attributes (1:1): document ID, short string, date time, etc.; one column, one row.
          • Multi-value attributes (1:many): tricky but feasible
              • Use multiple rows with composite index/key: (c1, c2, etc.)
              • One row, one column: CSV string, e.g., “id1, id2, id3”; SQL: val LIKE ‘%id2%’
              • One row but multiple columns, e.g., group1, group2, etc.; SQL: group1=val1 OR group2=val2 ...
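Two of the multi-value layouts above can be compared side by side. The sketch below uses Python's built-in `sqlite3` as a stand-in for MySQL, and the table and column names are made up for illustration. It also shows one reason the CSV layout is tricky: a substring match on `id2` also matches `id22`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Layout 1: one row per (entity, document) pair with a composite key.
    CREATE TABLE doc_entity (
        entity_id TEXT NOT NULL,
        doc_id    INTEGER NOT NULL,
        PRIMARY KEY (entity_id, doc_id)
    );
    -- Layout 2: one row per document, entity IDs packed into a CSV string.
    CREATE TABLE doc_csv (
        doc_id     INTEGER PRIMARY KEY,
        entity_ids TEXT NOT NULL
    );
    """
)

docs = {1: ["id1", "id2"], 2: ["id22"], 3: ["id2", "id3"]}
for doc_id, entity_ids in docs.items():
    conn.executemany(
        "INSERT INTO doc_entity VALUES (?, ?)",
        [(entity_id, doc_id) for entity_id in entity_ids],
    )
    conn.execute(
        "INSERT INTO doc_csv VALUES (?, ?)", (doc_id, ", ".join(entity_ids))
    )

by_rows = [
    row[0]
    for row in conn.execute(
        "SELECT doc_id FROM doc_entity WHERE entity_id = ? ORDER BY doc_id",
        ("id2",),
    )
]
by_like = [
    row[0]
    for row in conn.execute(
        "SELECT doc_id FROM doc_csv WHERE entity_ids LIKE ? ORDER BY doc_id",
        ("%id2%",),
    )
]
print("composite key:", by_rows)
print("CSV + LIKE:   ", by_like)
```

The composite-key layout returns only exact matches and can use the index; the CSV layout needs delimiters or fixed-width IDs to avoid the false positive, and a leading-wildcard `LIKE` generally cannot use an index.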
  • 14. Tips for indexing
      • Simple key: for metadata retrieval
      • Composite key: find matching documents
          • Start with low cardinality and most used columns
          • Order matters: (c1, c2, c3) != (c2, c3, c1)
      • InnoDB: all secondary indexes contain the primary key
          • Make the primary key short to keep index size small
          • Queries using a secondary index reference the primary key too.
      • Integer vs. string: comparison of numeric values is faster => index hash values of long strings instead.
      • Index length: title:varchar(255) => idx_title(32)
      • Enforce referential integrity on the application side.
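The "index hash values of long strings" tip can be illustrated with a small sketch. It assumes CRC32 as the hash (the slides do not name one) and uses `sqlite3` as a stand-in for MySQL; the `source` table and the URLs are hypothetical. The query filters on the integer hash first and then compares the full string, so a hash collision cannot return the wrong row.

```python
import sqlite3
import zlib


def url_hash(url: str) -> int:
    """Unsigned 32-bit integer hash of a long string (CRC32)."""
    return zlib.crc32(url.encode("utf-8"))


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE source (
        source_id INTEGER PRIMARY KEY,  -- short primary key
        url_hash  INTEGER NOT NULL,     -- integer stand-in for the long string
        url       TEXT NOT NULL
    );
    CREATE INDEX idx_url_hash ON source (url_hash);
    """
)

urls = [
    "http://example.com/news/a-very-long-path/story-1",
    "http://example.com/news/a-very-long-path/story-2",
    "http://example.org/blogs/another-long-path/post-9",
]
conn.executemany(
    "INSERT INTO source (url_hash, url) VALUES (?, ?)",
    [(url_hash(u), u) for u in urls],
)


def find_source(conn, url):
    """Look up by indexed integer hash, then confirm with the full string."""
    row = conn.execute(
        "SELECT source_id FROM source WHERE url_hash = ? AND url = ?",
        (url_hash(url), url),
    ).fetchone()
    return row[0] if row else None


print(find_source(conn, urls[1]))
print(find_source(conn, "http://example.com/missing"))
```

The index stays small because it stores a 4-byte-range integer instead of the string itself; the trade-off is that range and prefix searches on the original string can no longer use it.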
  • 15. MySQL configuration
      • Storage engine: InnoDB (row-level locking)
      • Table space: one file per table
          • Easier to maintain (schema change, optimization, etc.)
      • Character set: ‘UTF-8’
      • Disable persistent connection (5.0.x)
      • skip-character-set-client-handshake
      • Enable slow query log to identify bad queries.
      • System variables for memory buffer size
          • innodb_buffer_pool_size: data and indexes
          • sort_buffer_size, max_heap_table_size, tmp_table_size
          • Query cache size = 0; tables are updated constantly
  • 16. Runtime statistics (per server)
      • Average write rate:
          • daily: < 40 tps
          • max at 400 tps during recovery
          • Performs best when write rate < 100 tps
      • Query rate: 20~80 qps
      • Query response time: shorter when indexes and data are in memory
          • 75%: ~3 ms when qps < 15; ~2 ms when qps ~= 60
          • 95%: 6~8 ms when qps < 15; 3~4 ms when qps ~= 60
      • CPU idle %: > 99%.
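The 75% and 95% response times above read as percentiles of per-query latency. Assuming that interpretation, here is a minimal sketch of how such figures could be computed from logged latencies using the nearest-rank method; the sample values are made up and are not measurements from the RTN system.

```python
def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it.

    `p` is an integer in 1..100.
    """
    if not samples:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in 1..100")
    ordered = sorted(samples)
    rank = (p * len(ordered) + 99) // 100  # ceil(p * n / 100) without floats
    return ordered[rank - 1]


# Hypothetical per-query response times in milliseconds.
latencies_ms = [2, 3, 2, 4, 3, 2, 8, 3, 2, 3, 6, 2, 3, 2, 3, 4, 2, 3, 2, 3]

p75 = percentile(latencies_ms, 75)
p95 = percentile(latencies_ms, 95)
print(f"75%: {p75} ms, 95%: {p95} ms")
```

Percentiles are less sensitive to a handful of slow queries than an average, which makes them a reasonable way to state a response-time target.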
  • 18. Deployment Topology Consideration
      • Minimum configuration: host/DC redundancy
          • DC1: host 1 (master), host 3 (slave)
          • DC2: host 2 (failover master), host 4 (slave)
      • Data locality: significant when network latency is a concern (100 Mbps)
          • 3,000 qps when DB is on remote host.
          • 15,000 qps when DB is on local host.
      • Linking dependent servers across data centers
          • Push cross link up as far as possible (Topology 3): link to dependent servers in the same data center.
  • 19. Deployment Topology 1: minimum config
      [Diagram labels: Data Center 1, Data Center 2, DB (x4), WWW, Data Consumer]
  • 20. Topology 2: link across DCs (bad)
      [Diagram labels: WWW, GSLB, VIP, Data Consumer, DB]
  • 21. Topology 3: link to same DC (better)
      [Diagram labels: WWW, GSLB, VIP, Data Consumer, DB]
  • 22. Topology 4: use local UNIX socket
      [Diagram labels: WWW, GSLB, VIP, Data Consumer, DB]
  • 23. Production Monitoring
      • Operational monitoring: logcheck, Scout/NOC alert, etc.
      • DB monitoring on replication failure, latency, read/write rate, performance metrics.
  • 24. Metrics Collection
      • Graphing collected metrics: visualize and collate operational metrics.
          • Helps analyze and fine-tune server performance.
          • Helps trace production issues and identify points of failure.
      • What metrics are important?
          • Host: CPU, MEM, disk I/O, network I/O, # of processes, CPU swap/paging
          • Server: throughput, response time
      • Comparison: line up charts (throughput, response time, CPU, disk I/O) in the same time window.
  • 28. Tuning and Optimizing Queries
      • Explain: mysql> explain SELECT ... FROM …
          • Watch out for tmp table usage, table scan, etc.
      • SQL_NO_CACHE
      • MySQL Query profiler
          • mysql> set profiling=1;
      • Linux OS cache: leave enough memory on host
      • USE INDEX hint to choose an index explicitly
          • Use wisely: most of the time, MySQL chooses the right index for you. But when table size grows, index cardinality might change.
  • 29. Important MySQL statistics
      • SHOW GLOBAL STATUS…
          • Qcache_free_blocks
          • Qcache_free_memory
          • Qcache_hits
          • Qcache_inserts
          • Qcache_lowmem_prunes
          • Qcache_not_cached
          • Qcache_queries_in_cache
          • Select_scan
          • Sort_scan
  • 30. Important MySQL statistics (cont.)
      • Table_locks_waited
      • Innodb_row_lock_current_waits
      • Innodb_row_lock_time
      • Innodb_row_lock_time_avg
      • Innodb_row_lock_time_max
      • Innodb_row_lock_waits
      • Select_scan
      • Slave_open_temp_tables
  • 31. Heuristic Query Optimization Algorithm
      • Primarily for complex cluster queries: find the latest N topics and related stories.
      • Strategy: reduce the number of records the database needs to load from disk to perform a query.
          • Pick a default query range. If insufficient docs are returned, expand the query range proportionally.
          • If none are returned => sparse data => drop the range and retry.
          • Save the query range for future reference.
      • Result: reduces the number of rows to process from millions to hundreds => cuts query time from minutes to less than 10 ms.
  • 32. [Flowchart: cluster query with query range look up]
      • Start: cluster query; NumOfTripToDB = 0.
      • Query range look up: has query range? If no, use the default range.
      • Bound the query with the range and send it to the DB query engine; NumOfTripToDB++.
      • Sufficient results from the query engine? If yes, compute the docs to range ratio and save it back to the look up table for future use, then return query results to clients.
      • If not sufficient: numOfResults == 0? NumOfTripToDB >= 2? If yes, send the original query to the DB.
      • Otherwise: compute the docs to range ratio and prorate it to a range that would return a sufficient amount of docs, then query again.
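The heuristic described on the two slides above can be sketched as a small retry loop. This is one reading of the flattened flowchart, not the original implementation: the default range, the headroom factor, the two-trip limit being applied to bounded queries, and storing the ratio as "docs per second" are all assumptions, and a sorted list of timestamps stands in for the database.

```python
from bisect import bisect_left

DEFAULT_RANGE = 3600   # seconds; hypothetical default query range
HEADROOM = 1.5         # hypothetical safety factor when prorating the range
MAX_BOUNDED_TRIPS = 2  # after this many bounded queries, drop the range


def query_db(timestamps, now, time_range=None):
    """Stand-in for the database: matching docs, newest first.

    `timestamps` is sorted ascending; `time_range=None` means unbounded.
    """
    start = 0 if time_range is None else bisect_left(timestamps, now - time_range)
    return timestamps[start:][::-1]


def latest_docs(key, timestamps, now, wanted, density_table):
    """Return (docs, trips): up to `wanted` newest docs and the DB trips used."""
    density = density_table.get(key)  # docs per second from an earlier query
    time_range = wanted / density * HEADROOM if density else DEFAULT_RANGE
    trips = 0
    while True:
        docs = query_db(timestamps, now, time_range)
        trips += 1
        if len(docs) >= wanted:
            density_table[key] = len(docs) / time_range  # save for future use
            return docs[:wanted], trips
        if not docs or trips >= MAX_BOUNDED_TRIPS:
            # Sparse data or too many retries: send the original, unbounded query.
            return query_db(timestamps, now)[:wanted], trips + 1
        # Prorate: widen the range to one expected to hold enough docs.
        time_range = wanted / (len(docs) / time_range) * HEADROOM


stories = list(range(0, 10_000, 100))  # one story every 100 seconds
table = {}
first = latest_docs("topic:42", stories, now=10_000, wanted=50, density_table=table)
second = latest_docs("topic:42", stories, now=10_000, wanted=50, density_table=table)
sparse = latest_docs("topic:7", [5], now=10_000, wanted=50, density_table=table)
print(first[1], second[1], sparse[1])  # DB trips used by each call
```

The saved ratio is what makes the second call cheaper: it starts from a range already sized to the data, so a single bounded query is enough.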
  • 33. Lessons Learned
      • Always load test well ahead of launch (2 weeks) to avoid a fire drill.
      • Don’t rely solely on cache. The database needs to be able to serve a reasonable amount of queries on its own.
      • Separate cache from applications to avoid cold start.
      • Keep transactions/queries simple and return fast.
      • Avoid table joins; limit them to 2 if really needed.
      • Avoid stored procedures: results are not cached, and a DBA is needed when altering the implementation.
  • 34. Lessons Learned (cont.)
      • Avoid using ‘offset’ in the LIMIT clause; use application-based pagination instead.
      • Avoid ‘SQL_CALC_FOUND_ROWS’ in SELECT
      • If possible, exclude text/blob columns from query results to avoid disk I/O.
      • Store text/blob in a separate table to speed up backup, optimization, and schema change.
      • Separate real time vs. archive data for better performance and easier maintenance.
      • Keep table size under control (< 100 GB); optimize periodically.
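One common way to page without `OFFSET` is for the application to remember the last key it saw and ask for rows beyond it (often called keyset pagination). The slides do not spell out what "application-based pagination" meant in RTN, so treat this as one possible reading; `sqlite3` stands in for MySQL and the `news` table is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE news (doc_id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO news VALUES (?, ?)",
    [(i, f"story {i}") for i in range(1, 11)],
)


def next_page(conn, page_size, before_id=None):
    """Newest-first page of rows; `before_id` is the last doc_id already shown."""
    if before_id is None:
        cursor = conn.execute(
            "SELECT doc_id, title FROM news ORDER BY doc_id DESC LIMIT ?",
            (page_size,),
        )
    else:
        cursor = conn.execute(
            "SELECT doc_id, title FROM news WHERE doc_id < ? "
            "ORDER BY doc_id DESC LIMIT ?",
            (before_id, page_size),
        )
    return cursor.fetchall()


page1 = next_page(conn, 4)
page2 = next_page(conn, 4, before_id=page1[-1][0])
print([doc_id for doc_id, _ in page1])
print([doc_id for doc_id, _ in page2])
```

With `OFFSET`, the server still has to read and discard every skipped row; with a key bound, each page starts from an index position, so deep pages cost about the same as the first one.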
  • 35. Lessons Learned (cont.)
      • Put SQL statements (templates) in resource files so you can tune them without a binary change.
      • Set up replication in dev & QA to catch replication issues earlier
          • Transactional (MySQL 5.0.x) vs. data/mixed (5.1 or above)
          • Auto-increment + (INSERT ... ON DUPLICATE KEY UPDATE ...)
          • Date time column: default to NOW()
          • Oversized data: increase max_allowed_packet
          • Replication lag: transactions that involve index update/deletion often take longer to complete.
      • Host and data center redundancy is important: don’t put all your eggs in one basket.
  • 36. RTN 3 Redesign
      • Free text search with SOLR
      • Real time vs. archive shards.
      • 1 minute latency w/o Ramdisk.
      • Asset DB partitioned: 5 rows/doc -> 25 rows/doc
      • Avoid (System) Virtual Machine; instead, stack high end hosts with processes that use different system resources (CPU, MEM, disk space, etc.)
      • Better network and system resource utilization; cost effective.
      • Data locality
      • More processors (< 12 ) help when under load.
  • 37. Q&A
      • Questions or comments?
  • 38. THANK YOU !!