Searching Conversations using Hadoop: More than Just Analytics

Jacques Nadeau, CTO
jacques@yapmap.com
@intjesus

June 13, 2012
  
	
  
Agenda
 ✓  What is YapMap?
 •  Fitting Hadoop into your architecture
 •  YapMap Approach
    –  Crawling
    –  Processing
    –  Index Generation
    –  Results
 •  Operations, Getting Started & Questions
  
What is YapMap?
 •  A visual search technology
 •  Focused on threaded conversations
 •  Built to provide better context and ranking
 •  Built on the Hadoop ecosystem for massive scale
 •  Two self-funded guys
 •  Motoyap.com (www.motoyap.com) is the largest implementation, at 650 million automotive docs
  
Why do this?
 •  Discussion forums and mailing lists are the primary home for many hobbies
 •  Threaded search sucks
    –  No context in the middle of the conversation
  
How does it work?
 Post 1
 Post 2
    Post 3
       Post 4
 Post 5
    Post 6
  
Conceptual data model
 Post 1
 Post 2
    Post 3
       Post 4
 Post 5
    Post 6
 [Diagram: the posts above are grouped at three levels: Thread, Sub-thread, Individual post]
 •  Single thread scattered across many web pages
 •  Posts don’t necessarily arrive in order
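The model above can be sketched in a few lines of Python. This is only an illustration: it assumes each post carries its own id and the id of the post it replies to (the `post_id` and `parent_id` pairs below are hypothetical, the slides do not describe the actual schema). Because reply edges are recorded before the tree is built, posts may arrive in any order.

```python
from collections import defaultdict


def build_thread(posts):
    """Assemble a reply tree from posts that may arrive in any order.

    `posts` is an iterable of (post_id, parent_id) pairs; parent_id is None
    for a top-level post. These field names are hypothetical.
    """
    children = defaultdict(list)
    roots = []
    for post_id, parent_id in posts:
        if parent_id is None:
            roots.append(post_id)
        else:
            # The parent may not have been seen yet; record the edge anyway.
            children[parent_id].append(post_id)

    def subtree(post_id):
        return {"post": post_id,
                "replies": [subtree(c) for c in sorted(children[post_id])]}

    return [subtree(r) for r in sorted(roots)]


def render(tree, depth=0):
    """Return one indented line per post, mirroring the slide's layout."""
    lines = []
    for node in tree:
        lines.append("  " * depth + f"Post {node['post']}")
        lines.extend(render(node["replies"], depth + 1))
    return lines


# Posts delivered out of order still produce the layout shown on the slide.
arrivals = [(4, 3), (6, 5), (1, None), (3, 2), (5, None), (2, None)]
print("\n".join(render(build_thread(arrivals))))
```

Sorting by post id is a simplification; a real system would more likely order replies by timestamp.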
  
A YapMap search result page
  
Agenda
 •  What is YapMap?
 ✓  Fitting Hadoop into your architecture
 •  YapMap Approach
    –  Crawling
    –  Processing
    –  Index Generation
    –  Results
 •  Operations, Getting Started & Questions
  
Evolution of Hadoop
 Hadoop Today                                  Hadoop Tomorrow
 •  Batch analysis system                      •  Real-time enterprise application platform
 •  Lacks enterprise features                  •  Strong enterprise features
    (e.g. HA, stability, compat)
 •  Limited applications,                      •  BI, Email/Collaboration, Marketing DW, etc.
    primarily BI & analytics
 •  Clusters focused on point use cases        •  Shared resource supporting a large number of use cases
  
Complementary to existing technologies
 Traditional Tools                   Hadoop Additions
 •  Glassfish 3.1.2 (EJB & CDI)      •  Zookeeper
 •  MySQL                            •  HBase
 •  RabbitMQ                         •  MapReduce
 •  Protobuf                         •  MapRfs/HDFS
 •  Varnish                          •  Mahout
 •  Riak
                               	
  
General architecture
 [Diagram] Crawler -> Processing Pipeline -> Indexing Engine -> Results Presentation
 Shown above the flow: RabbitMQ, MapReduce
 Shown below the flow: HBase, Riak, HDFS/MapRfs, Zookeeper
 Shown at either end: MySQL
  
Hadoop doesn’t solve all problems
                                    MySQL                  HBase                          Riak
 Primary use                        Business management    Storage of crawl data,         Storage of components directly
                                    information            processing pipeline            related to presentation
 Key features that drove selection  Transactions, SQL,     Consistency, redundancy,       Predictable low latency, full
                                    JPA                    memory to persistence ratio    uptime, max one IOP per object
 Average object size                Small                  20k                            2k
 Object count                       <1 million             500 million                    1 billion
 System count                       2                      10                             8
 Memory footprint                   <1gb                   120gb                          240gb
 Dataset size                       10mb                   10tb                           2tb
 We also evaluated Voldemort and Cassandra
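The HBase and Riak figures in the table are consistent with one another: dataset size is roughly object count times average object size, and the memory footprint then gives the memory to persistence ratio. A quick check in Python, assuming decimal units (1k = 1,000 bytes, 1gb = 10^9 bytes, 1tb = 10^12 bytes), which the slide does not specify:

```python
def dataset_size_tb(object_count, avg_object_bytes):
    """Estimate raw dataset size in terabytes (decimal units assumed)."""
    return object_count * avg_object_bytes / 1e12


def memory_ratio(memory_gb, dataset_tb):
    """Fraction of the dataset that would fit in memory."""
    return (memory_gb * 1e9) / (dataset_tb * 1e12)


hbase_tb = dataset_size_tb(500_000_000, 20_000)    # 500 million objects at ~20k each
riak_tb = dataset_size_tb(1_000_000_000, 2_000)    # 1 billion objects at ~2k each

print(f"HBase: {hbase_tb:.0f}tb, memory/data = {memory_ratio(120, hbase_tb):.1%}")
print(f"Riak:  {riak_tb:.0f}tb, memory/data = {memory_ratio(240, riak_tb):.1%}")
```

Under these assumptions the Riak tier holds about ten times more of its data in memory than the HBase tier, which fits the "predictable low latency" requirement listed for it.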
How we use Hadoop
 •  Zookeeper                                  •  Corosync, Accord, JGroups
    –  Distributed locks
    –  Cluster membership coordination
    –  Index distribution coordination
 •  HBase                                      •  Teradata, Exadata, sharded MySQL, Cassandra
    –  Primary data store
    –  Crawl caching
    –  Data merging
    –  Processing pipeline
 •  MapReduce                                  •  MPI, JPPF, Clustered EJB
    –  Index generation
 •  MapRfs/HDFS                                •  Gluster, SAN/NAS, Lustre
    –  Index storage
 •  Mahout                                     •  Carrot2, Lingpipe, Lexalytics
    –  Cluster identification
  
Crawler -> Processing Pipeline -> Indexing Engine -> Results Presentation
YapMap Approach: Crawling
  	
  
YapMap crawling challenges
 •  Depth versus breadth
 •  Crawls must be throttled to avoid overloading
 •  Avoid duplicate crawling
 •  Save progress of long running crawls
 •  Need an elastic and fully distributed approach to crawling
 •  Crawler death managed
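Throttling, one of the challenges listed above, is often handled with a minimum delay between requests to the same domain. The Python sketch below shows that idea only; the class name, the per-domain interval and the injectable clock are all assumptions made for illustration, and the slides do not say how YapMap throttles its crawls.

```python
import time


class DomainThrottle:
    """Enforce a minimum delay between requests to the same domain."""

    def __init__(self, min_interval_s, clock=time.monotonic, sleep=time.sleep):
        self.min_interval_s = min_interval_s
        self._clock = clock          # injectable so the class can be tested
        self._sleep = sleep
        self._last_request = {}      # domain -> time of the last request

    def wait(self, domain):
        """Block until `domain` may be contacted again; return seconds slept."""
        now = self._clock()
        last = self._last_request.get(domain)
        delay = 0.0
        if last is not None:
            delay = max(0.0, self.min_interval_s - (now - last))
        if delay:
            self._sleep(delay)
        self._last_request[domain] = now + delay
        return delay
```

In a distributed crawler the last-request times would have to live in shared state rather than in one process, which is where a domain lock becomes useful.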
  
Crawler overview
 1.  New crawl job arrives (RabbitMQ)
 2.  Crawler checks document cache (HBase)
 3.  Crawler acquires domain lock (Zookeeper)
 4.  Crawler retrieves external assets
 5.  Crawler outputs posts to the DFS (using append as necessary)
 6.  Crawler generates more crawl tasks
 After achieving time and/or quantity thresholds, the crawl pauses, checkpoints in HBase and resubmits to the RabbitMQ queue
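The six steps and the checkpoint-and-resubmit behaviour can be traced in a toy, single-process Python sketch. Everything here is a stand-in: the queue, cache, lock set and output list replace RabbitMQ, HBase, Zookeeper and the DFS, and the `fetch` callable and job layout are hypothetical. It is not YapMap's code, only one way to read the diagram.

```python
from collections import deque


class Crawler:
    """Toy crawler following the numbered steps on the slide."""

    def __init__(self, fetch, max_pages_per_run=2):
        self.queue = deque()        # stand-in for the RabbitMQ crawl queue
        self.doc_cache = set()      # stand-in for the HBase document cache
        self.checkpoints = {}       # stand-in for checkpoints kept in HBase
        self.domain_locks = set()   # stand-in for Zookeeper domain locks
        self.output = []            # stand-in for posts appended to the DFS
        self.fetch = fetch          # hypothetical: url -> (posts, links)
        self.max_pages_per_run = max_pages_per_run

    def run_job(self, job):
        """Process one crawl job: a dict with 'domain' and 'frontier' (URLs)."""
        domain = job["domain"]
        if domain in self.domain_locks:         # 3. another crawler owns the domain
            self.queue.append(job)
            return
        self.domain_locks.add(domain)
        try:
            frontier = deque(job["frontier"])
            fetched = 0
            while frontier and fetched < self.max_pages_per_run:
                url = frontier.popleft()
                if url in self.doc_cache:       # 2. skip documents already crawled
                    continue
                posts, links = self.fetch(url)  # 4. retrieve external assets
                self.output.extend(posts)       # 5. output posts
                self.doc_cache.add(url)
                frontier.extend(links)          # 6. generate more crawl tasks
                fetched += 1
            if frontier:
                # Quantity threshold reached: checkpoint and resubmit the rest.
                self.checkpoints[domain] = list(frontier)
                self.queue.append({"domain": domain, "frontier": list(frontier)})
        finally:
            self.domain_locks.discard(domain)

    def drain(self):
        """Run jobs until the queue is empty."""
        while self.queue:                       # 1. a crawl job arrives
            self.run_job(self.queue.popleft())
```

The sketch takes the lock before checking the cache, which differs slightly from the numbering on the slide; the threshold here is page count only, whereas the slide mentions time and/or quantity.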
  
Crawler -> Processing Pipeline -> Indexing Engine -> Results Presentation
YapMap Approach: Processing Pipeline
  
Processing pipeline challenges
 •  Independent posts => complete threads
 •  Split long threads into multiple sub-threads
 •  Fully parallel processing pipeline
 •  Accommodate out of order data
  
Processing pipeline using HBase
 •  Multiple steps with checkpoints to manage failures
 •  Idempotent operations at each stage of process
 •  Utilize optimistic locking to do coordinated merges
 •  Use regular cleanup scans to pick up lost tasks
 •  Control batch size of messages to control throughput versus latency
 •  Out of order input assumed
 [Diagram] Posts from Crawler -> (message) Build thread parts -> (message) Merge + split threads -> Process & pre-index sub-threads -> Batch Indexing and RT Indexing
 Stores shown beneath the flow: HBase, Riak
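Two of the bullets above, idempotent operations and optimistic locking for coordinated merges, combine naturally. The Python sketch below uses an in-memory store with a compare-and-set on a row version, in the spirit of a check-and-put operation; the class, the versioning scheme and the set-union merge are assumptions for illustration, since the slides do not show how the merge is implemented.

```python
class VersionedStore:
    """In-memory stand-in for a row store with compare-and-set semantics."""

    def __init__(self):
        self._rows = {}   # key -> (version, value)

    def get(self, key):
        return self._rows.get(key, (0, frozenset()))

    def check_and_put(self, key, expected_version, value):
        """Write only if the row is still at expected_version."""
        current_version, _ = self.get(key)
        if current_version != expected_version:
            return False
        self._rows[key] = (expected_version + 1, value)
        return True


def merge_posts(store, thread_id, new_posts, max_retries=10):
    """Merge a batch of post ids into a thread using optimistic locking.

    The merge is a set union, so replaying the same batch (for example after
    a cleanup scan re-queues a lost task) leaves the thread unchanged.
    """
    for _ in range(max_retries):
        version, posts = store.get(thread_id)
        merged = posts | frozenset(new_posts)
        if merged == posts:
            return posts                     # nothing new: idempotent no-op
        if store.check_and_put(thread_id, version, merged):
            return merged
        # Another writer won the race; re-read and try again.
    raise RuntimeError("too much contention on thread %r" % (thread_id,))
```

Because the merge is commutative as well as idempotent, batches can be applied in any order, which matches the "out of order input assumed" bullet.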
  
Crawler -> Processing Pipeline -> Indexing Engine -> Results Presentation
YapMap Approach: Index Generation
  
Index generation challenges
 •  Shard size control
 •  Index ordering
 •  Maintain inverted and un-inverted data in parallel
 •  Minimize merging costs
 •  Support multi-grain indexing and scoring
  
Index Shards loosely based on HBase regions
 •  HBase primary key order is same as index order
 •  Shards sized based on parallelization requirements
    –  Typically ~5gb each
 •  Shards are based on snapshots of splits for data locality
 [Diagram] Pre-index Docs: regions R1, R2, R3 alongside Shard 1, Shard 2, Shard 3
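One way to read the bullets above is that shards are planned by walking regions in key order and cutting a new shard whenever a size target would be exceeded. The Python sketch below illustrates that reading only; the function name, the tuple layout and the greedy grouping rule are assumptions, and the slides describe the real scheme only as "loosely based" on regions.

```python
def plan_shards(regions, target_bytes=5 * 10**9):
    """Group key-ordered regions into shards of roughly target_bytes.

    `regions` is a list of (start_key, size_bytes) tuples sorted by start_key.
    Returns a list of shards, each a list of region start keys. Regions are
    consumed in key order, so shard order matches key order.
    """
    shards, current, current_size = [], [], 0
    for start_key, size in regions:
        if current and current_size + size > target_bytes:
            shards.append(current)
            current, current_size = [], 0
        current.append(start_key)
        current_size += size
    if current:
        shards.append(current)
    return shards


regions = [("a", 2 * 10**9), ("f", 2 * 10**9), ("m", 3 * 10**9),
           ("t", 5 * 10**9), ("x", 1 * 10**9)]
print(plan_shards(regions))
```

A smaller target yields more shards and therefore more parallelism at load and query time, at the cost of more fan-out per request.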
  
MapReduce for Index Generation
 [Diagram] IndexedTableInputFormat -> Map -> Term Distribution Partitioner -> Reduce -> FileAndPutOutputCommitter
 Other labels: Term: Posting Lists; Barrier Map Split; Statistics
 Outputs shown near HBase: un-inverted data characteristics
 Outputs shown near DFS: un-inverted data, inverted data characteristics, inverted indices & dictionaries
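The core of the job in the diagram, turning documents into term-to-posting-list entries and routing terms to reducers by range, can be shown in plain Python without a cluster. This is a generic inverted-index sketch, not the YapMap job: the tokenizer, the `split_points` list and the function names are assumptions, and the statistics and un-inverted outputs named in the diagram are left out.

```python
from bisect import bisect_right
from collections import defaultdict


def map_doc(doc_id, text):
    """Map step: emit (term, (doc_id, position)) for every token."""
    for position, term in enumerate(text.lower().split()):
        yield term, (doc_id, position)


def partition(term, split_points):
    """Range-partition terms so each reducer receives a contiguous term range."""
    return bisect_right(split_points, term)


def reduce_terms(pairs):
    """Reduce step: collect sorted posting lists, keyed by term."""
    postings = defaultdict(list)
    for term, posting in pairs:
        postings[term].append(posting)
    return {term: sorted(plist) for term, plist in sorted(postings.items())}


def build_index(docs, split_points):
    """Run map, partition (the shuffle barrier) and reduce in-process."""
    buckets = defaultdict(list)
    for doc_id, text in docs.items():
        for term, posting in map_doc(doc_id, text):
            buckets[partition(term, split_points)].append((term, posting))
    return [reduce_terms(buckets[r]) for r in range(len(split_points) + 1)]


docs = {1: "oil filter change", 2: "Change oil"}
for reducer, index in enumerate(build_index(docs, split_points=["g"])):
    print(reducer, index)
```

Choosing split points from observed term frequencies, rather than fixed letters, would keep reducers evenly loaded, which is presumably the purpose of a term distribution partitioner.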
  
Crawler -> Processing Pipeline -> Indexing Engine -> Results Presentation
YapMap Approach: Results Presentation
  
Presentation Layer Challenges
 •  Distributed search tree
 •  High performance index loading and serving
 •  No SPOF
 •  Effective memory management & allocation
 •  Automatic cluster management
 •  Smart index distribution
  
Results Presentation Layer
 [Diagram] Components: Results Servers, Index Servers (each hosting Shard Daemons), Zookeeper, Riak, HBase, DFS
 Serving a request:
  1.  Request
  2.  Query Zookeeper for active servers
  3.  Fan-out request, consolidate responses
  4.  Retrieve assets (Riak)
  5.  Response
 Loading a shard:
  1.  Load shard profile & configure memory (HBase)
  2.  Parallel load and integrate shard (DFS)
  3.  Register new shard availability (Zookeeper)
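The fan-out and consolidate step of the request path can be sketched with a thread pool and a top-k merge. In this Python illustration the `active_shards` dictionary stands in for the list of active servers that would be read from Zookeeper, and each shard is just a callable returning scored hits; both are assumptions, and asset retrieval from Riak is not shown.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def search(query, active_shards, top_k=3):
    """Fan a query out to every active shard and consolidate the top hits.

    `active_shards` maps a shard name to a callable query -> [(score, doc_id)].
    Returns the top_k hits across all shards, best score first.
    """
    with ThreadPoolExecutor(max_workers=max(1, len(active_shards))) as pool:
        partials = list(pool.map(lambda shard: shard(query),
                                 active_shards.values()))
    hits = [hit for partial in partials for hit in partial]
    return heapq.nlargest(top_k, hits)


shards = {
    "shard-1": lambda q: [(0.9, "a"), (0.2, "b")],
    "shard-2": lambda q: [(0.7, "c")],
    "shard-3": lambda q: [],
}
print(search("brake pads", shards, top_k=2))
```

Each shard only needs to return its own best `top_k` hits for the consolidated result to be correct, which keeps the amount of data moved per request small.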
  
Agenda
 •  What is YapMap?
 •  Fitting Hadoop into your architecture
 •  YapMap Approach
    –  Crawling
    –  Processing
    –  Index Generation
    –  Results
 ✓  Operations, Getting Started & Questions
  
Operations
 •  Hardware
    –  Supermicro with 8 core low power chips, low power DDR3
    –  WD Black 2TB drives
    –  DDR Infiniband using IPoIB for index loading performance
 •  Software
    –  Started on Cloudera, switched to MapR’s M3 distribution of Hadoop
 •  GC was painful, now manageable
    –  HBase now supports MSLAB for writes and off-heap block cache to support larger memory usage
    –  Shard servers utilize large pages to minimize fragmentation
    –  Shard servers do immediate large allocations to minimize GC problems
  
Getting Started
 •  Amazon Elastic MapReduce
    –  Common Crawl dataset is a great data set to start with
 •  Cheap old-gen cluster if you want to run things like HBase
    –  We built an effective 6 node Hadoop/HBase cluster for $1500 (Craigslist, eBay)
    –  Mailing lists are littered with performance and interconnectivity challenges when using cloud computing resources to do Hadoop stuff
  
Questions
 •  Why not Lucene/Solr/ElasticSearch/Katta/etc?
    –  Not built to work well with Hadoop and HBase (Blur.io is first to tackle this head on)
    –  Data locality between threads and posts to do document-at-once scoring
 •  Why not store indices directly in HBase?
    –  Single cell storage would be the only way to do it efficiently
    –  No such thing as a single cell no-read append (HBASE-5993)
    –  No single cell partial read
 •  Why use Riak for presentation side?
    –  Hadoop SPOF
    –  Even with newer Hadoop versions, HBase does not do sub-second row-level HA on node failure (HBASE-2357)
    –  Riak has more predictable latency
 •  Why did you switch to MapR?
    –  Index load performance was substantially faster
    –  Snapshots in trial copy were nice for those 30 days
    –  Less impact on HBase performance
  

Mais conteúdo relacionado

Mais procurados

An Introduction to Hadoop and Cloudera: Nashville Cloudera User Group, 10/23/14
An Introduction to Hadoop and Cloudera: Nashville Cloudera User Group, 10/23/14An Introduction to Hadoop and Cloudera: Nashville Cloudera User Group, 10/23/14
An Introduction to Hadoop and Cloudera: Nashville Cloudera User Group, 10/23/14iwrigley
 
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.02013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0Adam Muise
 
Processing Big Data
Processing Big DataProcessing Big Data
Processing Big Datacwensel
 
Hadoop - Where did it come from and what's next? (Pasadena Sept 2014)
Hadoop - Where did it come from and what's next? (Pasadena Sept 2014)Hadoop - Where did it come from and what's next? (Pasadena Sept 2014)
Hadoop - Where did it come from and what's next? (Pasadena Sept 2014)Eric Baldeschwieler
 
Architectural considerations for Hadoop Applications
Architectural considerations for Hadoop ApplicationsArchitectural considerations for Hadoop Applications
Architectural considerations for Hadoop Applicationshadooparchbook
 
Hadoop meets Agile! - An Agile Big Data Model
Hadoop meets Agile! - An Agile Big Data ModelHadoop meets Agile! - An Agile Big Data Model
Hadoop meets Agile! - An Agile Big Data ModelUwe Printz
 
Emergent Distributed Data Storage
Emergent Distributed Data StorageEmergent Distributed Data Storage
Emergent Distributed Data Storagehybrid cloud
 
Houston Hadoop Meetup Presentation by Vikram Oberoi of Cloudera
Houston Hadoop Meetup Presentation by Vikram Oberoi of ClouderaHouston Hadoop Meetup Presentation by Vikram Oberoi of Cloudera
Houston Hadoop Meetup Presentation by Vikram Oberoi of ClouderaMark Kerzner
 
Integrating Hadoop Into the Enterprise – Hadoop Summit 2012
Integrating Hadoop Into the Enterprise – Hadoop Summit 2012Integrating Hadoop Into the Enterprise – Hadoop Summit 2012
Integrating Hadoop Into the Enterprise – Hadoop Summit 2012Jonathan Seidman
 
Realtime Analytics with Hadoop and HBase
Realtime Analytics with Hadoop and HBaseRealtime Analytics with Hadoop and HBase
Realtime Analytics with Hadoop and HBaselarsgeorge
 
Data Science Day New York: The Platform for Big Data
Data Science Day New York: The Platform for Big DataData Science Day New York: The Platform for Big Data
Data Science Day New York: The Platform for Big DataCloudera, Inc.
 
Hadoop World 2011: Mike Olson Keynote Presentation
Hadoop World 2011: Mike Olson Keynote PresentationHadoop World 2011: Mike Olson Keynote Presentation
Hadoop World 2011: Mike Olson Keynote PresentationCloudera, Inc.
 
Data warehousing with Hadoop
Data warehousing with HadoopData warehousing with Hadoop
Data warehousing with Hadoophadooparchbook
 
Strata EU tutorial - Architectural considerations for hadoop applications
Strata EU tutorial - Architectural considerations for hadoop applicationsStrata EU tutorial - Architectural considerations for hadoop applications
Strata EU tutorial - Architectural considerations for hadoop applicationshadooparchbook
 
Why Every NoSQL Deployment Should Be Paired with Hadoop Webinar
Why Every NoSQL Deployment Should Be Paired with Hadoop WebinarWhy Every NoSQL Deployment Should Be Paired with Hadoop Webinar
Why Every NoSQL Deployment Should Be Paired with Hadoop WebinarCloudera, Inc.
 
Productionizing Hadoop: 7 Architectural Best Practices
Productionizing Hadoop: 7 Architectural Best PracticesProductionizing Hadoop: 7 Architectural Best Practices
Productionizing Hadoop: 7 Architectural Best PracticesMapR Technologies
 
The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simpli...
The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simpli...The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simpli...
The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simpli...lucenerevolution
 
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...Data Con LA
 

Mais procurados (20)

An Introduction to Hadoop and Cloudera: Nashville Cloudera User Group, 10/23/14
An Introduction to Hadoop and Cloudera: Nashville Cloudera User Group, 10/23/14An Introduction to Hadoop and Cloudera: Nashville Cloudera User Group, 10/23/14
An Introduction to Hadoop and Cloudera: Nashville Cloudera User Group, 10/23/14
 
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.02013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0
 
Processing Big Data
Processing Big DataProcessing Big Data
Processing Big Data
 
Hadoop - Where did it come from and what's next? (Pasadena Sept 2014)
Hadoop - Where did it come from and what's next? (Pasadena Sept 2014)Hadoop - Where did it come from and what's next? (Pasadena Sept 2014)
Hadoop - Where did it come from and what's next? (Pasadena Sept 2014)
 
hadoop @ Ibmbigdata
hadoop @ Ibmbigdatahadoop @ Ibmbigdata
hadoop @ Ibmbigdata
 
Architectural considerations for Hadoop Applications
Architectural considerations for Hadoop ApplicationsArchitectural considerations for Hadoop Applications
Architectural considerations for Hadoop Applications
 
Hadoop meets Agile! - An Agile Big Data Model
Hadoop meets Agile! - An Agile Big Data ModelHadoop meets Agile! - An Agile Big Data Model
Hadoop meets Agile! - An Agile Big Data Model
 
Emergent Distributed Data Storage
Emergent Distributed Data StorageEmergent Distributed Data Storage
Emergent Distributed Data Storage
 
Houston Hadoop Meetup Presentation by Vikram Oberoi of Cloudera
Houston Hadoop Meetup Presentation by Vikram Oberoi of ClouderaHouston Hadoop Meetup Presentation by Vikram Oberoi of Cloudera
Houston Hadoop Meetup Presentation by Vikram Oberoi of Cloudera
 
Integrating Hadoop Into the Enterprise – Hadoop Summit 2012
Integrating Hadoop Into the Enterprise – Hadoop Summit 2012Integrating Hadoop Into the Enterprise – Hadoop Summit 2012
Integrating Hadoop Into the Enterprise – Hadoop Summit 2012
 
Realtime Analytics with Hadoop and HBase
Realtime Analytics with Hadoop and HBaseRealtime Analytics with Hadoop and HBase
Realtime Analytics with Hadoop and HBase
 
Data Science Day New York: The Platform for Big Data
Data Science Day New York: The Platform for Big DataData Science Day New York: The Platform for Big Data
Data Science Day New York: The Platform for Big Data
 
Hadoop World 2011: Mike Olson Keynote Presentation
Hadoop World 2011: Mike Olson Keynote PresentationHadoop World 2011: Mike Olson Keynote Presentation
Hadoop World 2011: Mike Olson Keynote Presentation
 
Data warehousing with Hadoop
Data warehousing with HadoopData warehousing with Hadoop
Data warehousing with Hadoop
 
Strata EU tutorial - Architectural considerations for hadoop applications
Strata EU tutorial - Architectural considerations for hadoop applicationsStrata EU tutorial - Architectural considerations for hadoop applications
Strata EU tutorial - Architectural considerations for hadoop applications
 
Introduction to Hadoop Administration
Introduction to Hadoop AdministrationIntroduction to Hadoop Administration
Introduction to Hadoop Administration
 
Why Every NoSQL Deployment Should Be Paired with Hadoop Webinar
Why Every NoSQL Deployment Should Be Paired with Hadoop WebinarWhy Every NoSQL Deployment Should Be Paired with Hadoop Webinar
Why Every NoSQL Deployment Should Be Paired with Hadoop Webinar
 
Productionizing Hadoop: 7 Architectural Best Practices
Productionizing Hadoop: 7 Architectural Best PracticesProductionizing Hadoop: 7 Architectural Best Practices
Productionizing Hadoop: 7 Architectural Best Practices
 
The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simpli...
The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simpli...The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simpli...
The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simpli...
 
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
 

Semelhante a Searching conversations with hadoop

HBaseCon 2012 | Building a Large Search Platform on a Shoestring Budget
HBaseCon 2012 | Building a Large Search Platform on a Shoestring BudgetHBaseCon 2012 | Building a Large Search Platform on a Shoestring Budget
HBaseCon 2012 | Building a Large Search Platform on a Shoestring BudgetCloudera, Inc.
 
Architecting the Future of Big Data & Search - Eric Baldeschwieler
Architecting the Future of Big Data & Search - Eric BaldeschwielerArchitecting the Future of Big Data & Search - Eric Baldeschwieler
Architecting the Future of Big Data & Search - Eric Baldeschwielerlucenerevolution
 
An introduction to apache drill presentation
An introduction to apache drill presentationAn introduction to apache drill presentation
An introduction to apache drill presentationMapR Technologies
 
Hadoop World 2011: Building Scalable Data Platforms ; Hadoop & Netezza Deploy...
Hadoop World 2011: Building Scalable Data Platforms ; Hadoop & Netezza Deploy...Hadoop World 2011: Building Scalable Data Platforms ; Hadoop & Netezza Deploy...
Hadoop World 2011: Building Scalable Data Platforms ; Hadoop & Netezza Deploy...Krishnan Parasuraman
 
Hadoop on Azure, Blue elephants
Hadoop on Azure,  Blue elephantsHadoop on Azure,  Blue elephants
Hadoop on Azure, Blue elephantsOvidiu Dimulescu
 
Petabyte scale on commodity infrastructure
Petabyte scale on commodity infrastructurePetabyte scale on commodity infrastructure
Petabyte scale on commodity infrastructureelliando dias
 
Deploying Grid Services Using Hadoop
Deploying Grid Services Using HadoopDeploying Grid Services Using Hadoop
Deploying Grid Services Using HadoopGeorge Ang
 
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on Demand
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on DemandApachecon Euro 2012: Elastic, Multi-tenant Hadoop on Demand
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on DemandRichard McDougall
 
Keynote from ApacheCon NA 2011
Keynote from ApacheCon NA 2011Keynote from ApacheCon NA 2011
Keynote from ApacheCon NA 2011Hortonworks
 
4. hadoop גיא לבנברג
4. hadoop  גיא לבנברג4. hadoop  גיא לבנברג
4. hadoop גיא לבנברגTaldor Group
 
Big Data/Hadoop Infrastructure Considerations
Big Data/Hadoop Infrastructure ConsiderationsBig Data/Hadoop Infrastructure Considerations
Big Data/Hadoop Infrastructure ConsiderationsRichard McDougall
 
Hadoop: Components and Key Ideas, -part1
Hadoop: Components and Key Ideas, -part1Hadoop: Components and Key Ideas, -part1
Hadoop: Components and Key Ideas, -part1Sandeep Kunkunuru
 
Commonanduniqueusecases 110831113310-phpapp01
Commonanduniqueusecases 110831113310-phpapp01Commonanduniqueusecases 110831113310-phpapp01
Commonanduniqueusecases 110831113310-phpapp01eimhee
 
Common and unique use cases for Apache Hadoop
Common and unique use cases for Apache HadoopCommon and unique use cases for Apache Hadoop
Common and unique use cases for Apache HadoopBrock Noland
 
Introduction to apache hadoop copy
Introduction to apache hadoop   copyIntroduction to apache hadoop   copy
Introduction to apache hadoop copyMohammad_Tariq
 

Semelhante a Searching conversations with hadoop (20)

HBaseCon 2012 | Building a Large Search Platform on a Shoestring Budget
HBaseCon 2012 | Building a Large Search Platform on a Shoestring BudgetHBaseCon 2012 | Building a Large Search Platform on a Shoestring Budget
Searching conversations with Hadoop

  • 1. Searching Conversations using Hadoop: More than Just Analytics
      Jacques Nadeau, CTO
      jacques@yapmap.com
      @intjesus
      June 13, 2012
  • 2. Agenda
      ✓  What is YapMap?
      •  Fitting Hadoop into your architecture
      •  YapMap Approach
         –  Crawling
         –  Processing
         –  Index Generation
         –  Results
      •  Operations, Getting Started & Questions
  • 3. What is YapMap?
      •  A visual search technology
      •  Focused on threaded conversations
      •  Built to provide better context and ranking
      •  Built on Hadoop ecosystem for massive scale
      •  Two self-funded guys
      •  Motoyap.com largest implementation at 650mm automotive docs (www.motoyap.com)
  • 4. Why do this?
      •  Discussion forums and mailing lists are the primary home for many hobbies
      •  Threaded search sucks
         –  No context in the middle of the conversation
  • 5. How does it work?
      Post 1
      Post 2
         Post 3
            Post 4
      Post 5
         Post 6
  • 6. Conceptual data model
      Diagram labels: Thread (Posts 1 to 6), Sub-thread, Individual post
      •  Single thread scattered across many web pages
      •  Posts don't necessarily arrive in order
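The data model on slides 5 and 6 can be illustrated with a small sketch. It is not YapMap's code: the `Post` class, the `parent_id` field, and the reply relationships (Post 3 replying to Post 2, and so on, read off the indentation on slide 5) are assumptions made for illustration. The point is only that a thread can be rebuilt correctly whatever order the posts arrive in.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Post:
    post_id: int
    parent_id: Optional[int]  # None marks a top-level post in the thread


def build_thread(posts):
    """Group post ids by parent id; arrival order does not matter."""
    children = {}
    for post in posts:
        children.setdefault(post.parent_id, []).append(post.post_id)
    for ids in children.values():
        ids.sort()
    return children


def render(children, parent=None, depth=0):
    """Return the thread as indented lines, depth-first."""
    lines = []
    for post_id in children.get(parent, []):
        lines.append("  " * depth + f"Post {post_id}")
        lines.extend(render(children, post_id, depth + 1))
    return lines


# Posts deliberately supplied out of order, as a crawler might deliver them.
arrived = [Post(4, 3), Post(1, None), Post(6, 5),
           Post(3, 2), Post(5, None), Post(2, None)]
print("\n".join(render(build_thread(arrived))))
```

Each indented run under a top-level post corresponds to what the slide calls a sub-thread.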
  • 7. A YapMap search result page
  • 8. Agenda
      •  What is YapMap?
      ✓  Fitting Hadoop into your architecture
      •  YapMap Approach
         –  Crawling
         –  Processing
         –  Index Generation
         –  Results
      •  Operations, Getting Started & Questions
  • 9. Evolution of Hadoop
      Hadoop Today
      •  Batch analysis system
      •  Lacks enterprise features (e.g. HA, Stability, compat)
      •  Limited applications, primarily BI & analytics
      •  Clusters focused on point use cases
      Hadoop Tomorrow
      •  Real-time enterprise application platform
      •  Strong Enterprise Features
      •  BI, Email/Collaboration, Marketing DW, etc.
      •  Shared resource supporting a large number of use cases
  • 10. Complementary to existing technologies
      Traditional Tools
      •  Glassfish 3.1.2 (EJB&CDI)
      •  MySQL
      •  RabbitMQ
      •  Protobuf
      •  Varnish
      •  Riak
      Hadoop Additions
      •  Zookeeper
      •  HBase
      •  MapReduce
      •  MapRfs/HDFS
      •  Mahout
  • 11. General architecture
      Stages: Crawler, Processing Pipeline, Indexing Engine, Results Presentation
      Supporting components shown: RabbitMQ, MapReduce, HBase, Riak, HDFS/MapRfs, Zookeeper, MySQL
  • 12. Hadoop doesn't solve all problems
      •  Primary use
         –  MySQL: Business management information
         –  HBase: Storage of crawl data, processing pipeline
         –  Riak: Storage of components directly related to presentation
      •  Key features that drove selection
         –  MySQL: Transactions, SQL, JPA
         –  HBase: Consistency, redundancy, memory to persistence ratio
         –  Riak: Predictable low latency, full uptime, max one IOP per object
      •  Average object size: MySQL small; HBase 20k; Riak 2k
      •  Object count: MySQL <1 million; HBase 500 million; Riak 1 billion
      •  System count: MySQL 2; HBase 10; Riak 8
      •  Memory footprint: MySQL <1gb; HBase 120gb; Riak 240gb
      •  Dataset size: MySQL 10mb; HBase 10tb; Riak 2tb
      We also evaluated Voldemort and Cassandra
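The figures on slide 12 are internally consistent, which a short calculation makes visible: object count times average object size reproduces the stated dataset size for both HBase and Riak. The sketch below assumes that "k" means 1,000 bytes and that "gb" and "tb" are decimal units; the slide does not say so.

```python
# Figures copied from slide 12; decimal units are an assumption.
STORES = {
    "HBase": {"objects": 500_000_000, "avg_bytes": 20_000,
              "memory_bytes": 120 * 10**9},
    "Riak": {"objects": 1_000_000_000, "avg_bytes": 2_000,
             "memory_bytes": 240 * 10**9},
}


def dataset_bytes(store):
    """Estimated dataset size: object count times average object size."""
    return store["objects"] * store["avg_bytes"]


def memory_ratio(store):
    """Memory footprint as a fraction of the estimated dataset size."""
    return store["memory_bytes"] / dataset_bytes(store)


for name, store in STORES.items():
    size_tb = dataset_bytes(store) / 10**12
    print(f"{name}: about {size_tb:.0f} TB, "
          f"memory/dataset ratio {memory_ratio(store):.1%}")
```

The ratio is one way to read the "memory to persistence ratio" entry in the table: on these numbers the HBase tier keeps a much smaller share of its data in memory than the Riak tier does.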
  • 13. How we use Hadoop
      •  Zookeeper (listed alongside: Corosync, Accord, JGroups)
         –  Distributed Locks
         –  Cluster membership coordination
         –  Index distribution coordination
      •  HBase (listed alongside: Teradata, Exadata, sharded MySQL, Cassandra)
         –  Primary Data store
         –  Crawl Caching
         –  Data merging
         –  Processing Pipeline
      •  MapReduce (listed alongside: MPI, JPPF, Clustered EJB)
         –  Index generation
      •  MapRfs/HDFS (listed alongside: Gluster, SAN/NAS, Lustre)
         –  Index storage
      •  Mahout (listed alongside: Carrot2, Lingpipe, Lexalytics)
         –  Cluster identification
  • 14. YapMap Approach: Crawling
      Stages: Crawler, Processing Pipeline, Indexing Engine, Results Presentation
  • 15. YapMap crawling challenges
      •  Depth versus breadth
      •  Crawls must be throttled to avoid overloading
      •  Avoid duplicate crawling
      •  Save progress of long running crawls
      •  Need an elastic and fully distributed approach to crawling
      •  Crawler death managed
  • 16. Crawler overview
      1.  New crawl job arrives
      2.  Crawler checks document cache
      3.  Crawler acquires domain lock
      4.  Crawler retrieves external assets
      5.  Crawler outputs posts (using append as necessary)
      6.  Crawler generates more crawl tasks
      After achieving time and/or quantity thresholds, crawl pauses, checkpoints in HBase and resubmits to RabbitMQ queue
      Components shown: RabbitMQ, Crawler, DFS, HBase, Zookeeper
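The crawl loop on slide 16 can be sketched with in-memory stand-ins. Nothing here is YapMap's implementation: a `deque` stands in for the RabbitMQ queue, a `dict` for the HBase document cache, a `set` for Zookeeper domain locks, and `fetch`, `max_pages`, and the job layout are hypothetical. The sketch shows the control flow only: take the domain lock, skip cached documents, and, once a quantity threshold is hit, checkpoint the remaining work by resubmitting it to the queue.

```python
from collections import deque


def crawl(job, queue, cache, locks, fetch, max_pages=2):
    """Run one crawl slice; return the posts found during this slice."""
    domain = job["domain"]
    if domain in locks:            # another crawler holds the domain lock
        queue.append(job)          # hand the job back for a later attempt
        return []
    locks.add(domain)
    posts = []
    try:
        pending = deque(job["urls"])
        fetched = 0
        while pending:
            if fetched >= max_pages:
                # Threshold reached: checkpoint remaining work and resubmit.
                queue.append({"domain": domain, "urls": list(pending)})
                break
            url = pending.popleft()
            if url in cache:       # already crawled, avoid duplicate work
                continue
            page_posts, links = fetch(url)
            cache[url] = True
            fetched += 1
            posts.extend(page_posts)
            pending.extend(link for link in links if link not in cache)
    finally:
        locks.discard(domain)      # always release the lock
    return posts


# A tiny fake site: url -> (posts on the page, outgoing links).
SITE = {"a": (["p1"], ["b", "c"]), "b": (["p2"], ["c"]), "c": (["p3"], [])}
queue, cache, locks = deque(), {}, set()
first = crawl({"domain": "example", "urls": ["a"]}, queue, cache, locks, SITE.get)
second = crawl(queue.popleft(), queue, cache, locks, SITE.get)
print(first, second)
```

Releasing the lock in a `finally` block is one simple answer to the "crawler death managed" challenge within a single process; a real deployment would need the lock service itself to expire locks held by dead crawlers.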
  • 17. YapMap Approach: Processing Pipeline
      Stages: Crawler, Processing Pipeline, Indexing Engine, Results Presentation
  • 18. Processing pipeline challenges
      •  Independent posts => complete threads
      •  Split long threads into multiple sub-threads
      •  Fully parallel processing pipeline
      •  Accommodate out of order data
  • 19. Processing pipeline using HBase
      •  Multiple steps with checkpoints to manage failures
      •  Idempotent operations at each stage of process
      •  Utilize optimistic locking to do coordinated merges
      •  Use regular cleanup scans to pick up lost tasks
      •  Control batch size of messages to control throughput versus latency
      •  Out of order input assumed
      Diagram labels: Posts from Crawler; Message; Message; Build thread parts; Merge + split threads; Process & pre-index sub-threads; Batch Indexing; RT Indexing; HBase; Riak
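Two of the bullets on slide 19, idempotent operations and optimistic locking for coordinated merges, can be shown together in a short sketch. The `VersionedStore` class and `merge_posts` function are hypothetical in-memory stand-ins, not YapMap's code. HBase offers a check-and-put style operation that plays a similar role to `check_and_put` here, but the slides do not say which HBase calls YapMap used.

```python
class VersionedStore:
    """In-memory stand-in for a row store with compare-and-set writes."""

    def __init__(self):
        self._rows = {}

    def get(self, key):
        """Return (version, value); missing rows read as version 0, empty."""
        return self._rows.get(key, (0, frozenset()))

    def check_and_put(self, key, expected_version, value):
        """Write only if nobody else has written since we read."""
        version, _ = self.get(key)
        if version != expected_version:
            return False
        self._rows[key] = (version + 1, value)
        return True


def merge_posts(store, thread_id, post_ids, max_retries=5):
    """Merge post ids into a thread row; replaying the same call is a no-op."""
    for _ in range(max_retries):
        version, current = store.get(thread_id)
        merged = current | frozenset(post_ids)
        if merged == current:
            return True            # nothing new, so the replay changes nothing
        if store.check_and_put(thread_id, version, merged):
            return True
        # Lost the race: loop to re-read the row and try again.
    return False


store = VersionedStore()
merge_posts(store, "thread-1", [3, 1])   # posts may arrive out of order
merge_posts(store, "thread-1", [2])
merge_posts(store, "thread-1", [3, 1])   # a redelivered message
print(store.get("thread-1"))
```

Because the merge is a set union, a message delivered twice after a failure leaves the row unchanged, which is what makes the checkpoint-and-retry approach on the slide safe.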
  • 20. YapMap Approach: Index Generation
      Stages: Crawler, Processing Pipeline, Indexing Engine, Results Presentation
  • 21. Index generation challenges
      •  Shard size control
      •  Index ordering
      •  Maintain inverted and un-inverted data in parallel
      •  Minimize merging costs
      •  Support multi-grain indexing and scoring
  • 22. Index Shards loosely based on HBase regions
      •  HBase primary key order is same as index order
      •  Shards sized based on parallelization requirements
         –  Typically ~5gb each
      •  Shards are based on snapshots of splits for data locality
      Diagram: Pre-index Docs; R1 to Shard 1; R2 to Shard 2; R3 to Shard 3
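One hypothetical way to size shards from regions is sketched below. The slide only says that shards are "loosely based" on HBase regions, follow primary key order, and are typically about 5 GB; the diagram shows one region per shard. The greedy packing rule, the `plan_shards` name, and the region sizes are assumptions added for illustration, showing how consecutive regions could be grouped without breaking key order when regions are smaller than the target.

```python
def plan_shards(regions, target_bytes=5 * 10**9):
    """Pack consecutive (name, size) regions into shards near the target size.

    Regions are taken in key order and never reordered, so each shard
    covers one contiguous key range.
    """
    shards, current, size = [], [], 0
    for name, nbytes in regions:
        if current and size + nbytes > target_bytes:
            shards.append(current)      # close the shard before it overflows
            current, size = [], 0
        current.append(name)
        size += nbytes
    if current:
        shards.append(current)
    return shards


# Hypothetical region sizes, listed in primary key order.
REGIONS = [("R1", 3 * 10**9), ("R2", 2 * 10**9), ("R3", 4 * 10**9),
           ("R4", 1 * 10**9), ("R5", 6 * 10**9)]
print(plan_shards(REGIONS))
```

A region larger than the target, such as R5 here, becomes its own oversized shard in this sketch; splitting it would need a split point inside the region, which the sketch does not attempt.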
  • 23. MapReduce for Index Generation
      Diagram labels: IndexedTableInputFormat; Map; Reduce; Barrier Map Split; Term Distribution Partitioner; Statistics; Term: Posting Lists; FileAndPutOutputCommitter; Inverted data characteristics; Un-inverted data characteristics; Inverted Indices & dictionaries; Un-inverted Data; DFS; HBase
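The slide names a "Term Distribution Partitioner" fed by "Statistics" but does not explain it. The sketch below is a guess at the general idea, not YapMap's implementation: given per-term posting counts, choose boundary terms so that each reducer receives a contiguous, sorted range of terms with roughly equal posting volume. The function names and sample counts are hypothetical.

```python
import bisect


def split_points(term_counts, num_reducers):
    """Pick boundary terms so reducers get roughly equal posting volume."""
    total = sum(term_counts.values())
    target = total / num_reducers
    points, running = [], 0
    for term in sorted(term_counts):
        running += term_counts[term]
        needs_boundary = running >= target * (len(points) + 1)
        if needs_boundary and len(points) < num_reducers - 1:
            points.append(term)    # this term is the last one in its range
    return points


def partition(term, points):
    """Return the reducer index for a term (range partitioning)."""
    return bisect.bisect_left(points, term)


# Hypothetical posting counts per term.
COUNTS = {"a": 50, "b": 10, "c": 10, "d": 10, "e": 20}
points = split_points(COUNTS, 3)
loads = [0, 0, 0]
for term, count in COUNTS.items():
    loads[partition(term, points)] += count
print(points, loads)
```

Range partitioning keeps terms sorted across reducer outputs, which fits the slide's emphasis on index ordering; a plain hash partitioner would balance load but scatter adjacent terms.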
  • 24. YapMap Approach: Results Presentation
      Stages: Crawler, Processing Pipeline, Indexing Engine, Results Presentation
  • 25. Presentation Layer Challenges
      •  Distributed search tree
      •  High performance index loading and serving
      •  No SPOF
      •  Effective memory management & allocation
      •  Automatic cluster management
      •  Smart index distribution
  • 26. Results Presentation Layer
      Request path
      1.  Request
      2.  Query Zookeeper for active servers
      3.  Fan-out request, consolidate responses
      4.  Retrieve assets
      5.  Response
      Shard loading path
      1.  Load shard profile & configure memory
      2.  Parallel load and integrate shard
      3.  Register new shard availability
      Components shown: Results Server (x2), Zookeeper, Riak, Index Server (x2), Shard Daemon (x4), HBase, DFS
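Step 3 of the request path, "fan-out request, consolidate responses", is a scatter-gather pattern that can be sketched in a few lines. The shard functions, the term-count scoring, and the sample documents are hypothetical; in the architecture on the slide the list of active shards would come from Zookeeper and the shards would be remote index servers rather than local callables.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def make_shard(docs):
    """Build a fake shard that scores documents by query term count."""
    def shard(query, k):
        hits = [(text.count(query), doc_id)
                for doc_id, text in docs.items() if query in text]
        return heapq.nlargest(k, hits)   # each shard returns its local top k
    return shard


def search(shards, query, k=3):
    """Send the query to every shard in parallel and merge the top hits."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(lambda shard: shard(query, k), shards))
    return heapq.nlargest(k, (hit for part in partials for hit in part))


SHARDS = [
    make_shard({"d1": "oil oil filter", "d2": "brake pads"}),
    make_shard({"d3": "oil", "d4": "oil oil oil"}),
]
print(search(SHARDS, "oil", k=2))
```

Because every shard returns its own top k, the merged top k is exact for this scoring scheme: no document outside a shard's local top k can appear in the global top k.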
  • 27. Agenda
      •  What is YapMap?
      •  Fitting Hadoop into your architecture
      •  YapMap Approach
         –  Crawling
         –  Processing
         –  Index Generation
         –  Results
      ✓  Operations, Getting Started & Questions
  • 28. Operations
      •  Hardware
         –  Supermicro with 8 core low power chips, low power ddr3
         –  WD Black 2TB drives
         –  DDR Infiniband using IPoIB for index loading performance
      •  Software
         –  Started on Cloudera, switched to MapR's M3 distribution of Hadoop
      •  GC was painful, now manageable
         –  HBase now supports MSLAB for writes and off-heap block cache to support larger memory usage
         –  Shard servers utilize large pages to minimize fragmentation
         –  Shard servers do immediate large allocations to minimize GC problems
  • 29. Getting Started
      •  Amazon Elastic Map Reduce
         –  Common Crawl dataset is a great data set to start with
      •  Cheap old-gen cluster if you want to run things like HBase
         –  We built an effective 6 node Hadoop/HBase cluster for $1500 (Craigslist, eBay)
         –  Mailing lists are littered with performance and interconnectivity challenges when using cloud computing resources to do Hadoop stuff
  • 30. Questions
      •  Why not Lucene/Solr/ElasticSearch/Katta/etc?
         –  Not built to work well with Hadoop and HBase (Blur.io is first to tackle this head on)
         –  Data locality between threads and posts to do document-at-once scoring
      •  Why not store indices directly in HBase?
         –  Single cell storage would be the only way to do it efficiently
         –  No such thing as a single cell no-read append (HBASE-5993)
         –  No single cell partial read
      •  Why use Riak for presentation side?
         –  Hadoop SPOF
         –  Even with newer Hadoop versions, HBase does not do sub-second row-level HA on node failure (HBASE-2357)
         –  Riak has more predictable latency
      •  Why did you switch to MapR?
         –  Index load performance was substantially faster
         –  Snapshots in trial copy were nice for those 30 days
         –  Less impact on HBase performance