This presentation was given by Search Technologies' CEO Kamran Khan at the November 2013 Enterprise Search Summit / KMWorld in Washington DC. He discussed how modern search engines are being combined with powerful independent content processing pipelines and the distributed processing technologies of big data to form a new and exciting enterprise search architecture, delivering results that in the past were available only to the biggest companies with the deepest pockets. For more information visit http://www.searchtechnologies.com/.
Enterprise Search Summit Keynote: A Big Data Architecture for Search
1. A Big Data Architecture for Search
Kamran Khan, CEO
The expert in the search space
2. Search Technologies Overview
Ascot, UK
Karlsruhe, DE
Cincinnati, OH
Herndon, VA
San Diego, CA
San Jose, CR
• The leading IT Services company dedicated to
Enterprise Search & Search-based Applications
• Implementation, Consulting, Managed Services
• 120 employees and growing
• Independent, working with all of the leading
software vendors and open source alternatives
4. What Is Big Data?
5. Where Did Modern Big Data Come From?
[Diagram: rows of web servers and the content they serve]
6. What is Big Data?
[Diagram: a multitude of log files]
7. What is Big Data?
Too big for a single machine
• Physically impossible for a single machine
Data Aggregation & Analysis
• Simply transforming data records is not enough
• Must aggregate / “boil down” the data
Batch Processing
• Very long running jobs (not real-time)
Message: Lots of Data “Big Data”
8. Enabling Technologies
Big Data For Search
• Hadoop
• Elastic / Cloud Computing
• Modern Statistical Analysis
9. What is Big Data?
[Diagram: many content objects being distributed across Hadoop]
10. A Traditional Integrated Architecture
Does a lot of what we need for Enterprise Search
[Diagram: content sources (SharePoint, file system, RDBMS, employee directory, etc.) feeding through Aspire connectors and the index pipeline into the search engine's search index]
Limitations
• Limited support for modern analytics
• Limited support for content processing
• Re-indexing takes too long
• Limits ability to do continuous improvement cycle
11. Why Content Processing is Important
[Diagram: content sources (employee directory, file system, RDBMS, etc.) feeding through Aspire connectors into a content processing stage, then through the index pipeline into the search engine's search index]
• Powerful & Complete Content Processing Service
• Clean and consistent data and metadata
• Ability to supplement metadata
• Support for Continuous Improvement Cycle
• Develop and maintain processing IP
• Ability to easily migrate to new search engines
12. A New Enterprise Search Architecture
[Diagram: content sources (employee directory, file system, RDBMS, etc.) feeding through Aspire connectors into a content processing & tokenization stage, backed by a secure cache and analytics over docs, log files, and supplemental data, then through the index pipeline into the search engine's search index]
• Integrated Platform (Docs, Log Files and External data)
• Reduced Cost
• Better Agility and Scalability
• Fast Reindexing
• Expanded Functionality
13. Advanced Features & Analytics Enabled
Search and Match
Forward and Reverse Citation
Latent Semantic Analysis
More Precise Term Weighting Beyond TF/IDF
Near Duplicate Detection
Document Topic Tagging
Results ranking including popularity
Recommendations based on user behavior
Suggested queries based on user behavior
14. In Summary
Big Data Technology Will Revolutionize Enterprise Search
A new architecture structured for search, providing better:
• Analytics and other functionality
• Content processing
• Agility
• Economics and scalability
Big Data architectures will significantly move search forward
First image) You would think the Wikipedia definition would be enough; however, it says there is "so much data it is awkward." That is not very helpful, and weren't large amounts of data awkward 20 years ago too? So a look through a Google image search for the term "big data" should help; a picture is worth a thousand words.
Second image) A classic Venn diagram with the terms Volume (a lot of data), Velocity (data that comes fast), and Variety (different types of data). We have all heard these three terms with big data; however, what type of data, and what must you do with it, for it to be big data?
Third image) This definition graphic combines the Wikipedia definition with the classic three terms, but we are still not really sure what it is.
Fourth image) Hold on, I thought there were three V's; now there are four?
Fifth image) No, five!
Sixth image) Now it looks like a parking zone for all kinds of big numbers and buzz terms.
Seventh image) And social network icons.
Eighth image) Maybe it is just a black hole of numbers and terms.
Still often leaving us confused.
OK, so if one aspect of this revolution involves "Big Data," I am required to give some explanation. Modern big data came out of companies such as Google and Yahoo needing to process their expanding log files, which contained data on their users' behavior, along with ever-increasing numbers of web pages. We are talking about the ability to deal with massive numbers of data objects, and the agility to analyze them, utilizing multiple servers to rapidly achieve a result.
Slides 4 and 5 should be combined to have a cool animation of the big single log file object breaking up into a multitude of jobs, and then those jobs combining to produce a result. What Google, Yahoo, and others developed, now called Hadoop, is the ability to take one large processing job, break it into many small jobs that each do the same thing on only a part of the data set, and then have these jobs all report their results to a function that reduces them to a desired result, such as a report or an analytic display. I don't know why it took us so long to get here; bees and ants have been doing this for as long as there have been bees and ants!
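The split-process-combine pattern described above can be sketched in miniature. This is not Hadoop itself, just plain Python standing in for the map and reduce phases; the log lines and the two-worker pool are invented for illustration:

```python
from collections import Counter
from multiprocessing import Pool

# Hypothetical log lines; in a real Hadoop job each mapper would
# receive a block of a large file from the distributed file system.
LOG_LINES = [
    "GET /search?q=hadoop 200",
    "GET /search?q=solr 200",
    "GET /search?q=hadoop 404",
    "GET /index.html 200",
]

def map_phase(line):
    """One small job: count HTTP status codes in its slice of the data."""
    status = line.rsplit(" ", 1)[-1]
    return Counter({status: 1})

def reduce_phase(partials):
    """Combine every job's partial counts into one desired result."""
    total = Counter()
    for partial in partials:
        total += partial
    return total

if __name__ == "__main__":
    # Two local worker processes stand in for a cluster of servers.
    with Pool(2) as pool:
        partials = pool.map(map_phase, LOG_LINES)
    print(reduce_phase(partials))  # Counter({'200': 3, '404': 1})
```

Each mapper sees only part of the data and does the same simple thing; the reducer boils the partial results down to one report, which is the whole trick.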
As is typical with the next generation of any industry, a set of enabling technologies comes together at a point in time. We believe the three central enabling technologies are:
- Hadoop, for reliable and scalable job processing
- Elastic or cloud computing, to provide Hadoop with cost-effective computing and storage infrastructure
- Modern statistical analysis, based on working with large, complete data sets rather than the old style of analysis that was designed to work with minimal and incomplete data sets
Here is a traditional search architecture. In a traditional enterprise search system, much of this iterative improvement work is done in the indexing pipeline. This is where metadata is prepared, awkward headers and footers are removed, and content is normalized to help the relevancy algorithm compare dissimilar documents.
With the traditional architecture, any meaningful change to the index pipeline will require a full re-index. As data sets grow, this becomes increasingly onerous. A typical enterprise search indexing rate is 3 documents per second. Getting the data from the repositories is almost always the bottleneck; they are not set up for mass bulk exports. So even if you have a modest amount of content, say 10 million documents, it takes weeks to re-index.
I'm sure most of you already know how long reindexing times impede agility and solution quality. You want to use new metadata to support additional features, you want to use cleaner and more consistent content for better precision, you want to add some entity extraction to increase relevance; however, you delay because of the pain and expense until some disaster forces you to do it. Commonly used workarounds, such as developing systems based on a small sample of data, have their own severe limitations, but there is no time to go into those here.
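The re-indexing arithmetic above is easy to check, using the figures quoted in the talk (3 documents per second, 10 million documents):

```python
DOCS = 10_000_000   # a "modest" enterprise corpus, per the talk
RATE = 3            # typical enterprise indexing rate, documents per second

seconds = DOCS / RATE
days = seconds / (24 * 60 * 60)   # seconds in a day
print(f"Full re-index at {RATE} docs/sec: {days:.1f} days")  # 38.6 days
```

Well over five weeks for a single full pass, which is why every pipeline change becomes a project in itself.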
By including Hadoop, with its file system designed for massive amounts of content and its automated, reliable distributed processing, running either on your existing servers or on elastic computing within a cloud, and by utilizing a strong independent content processing framework that can also take advantage of Hadoop, we now have an architecture that supports modern analytics, advanced content processing, and shorter re-indexing times. We believe this architecture can reduce the reindexing of 10 million documents, which often takes several weeks, to a day or two. All of this combines to provide an environment where you can effectively run a continuous improvement cycle and iterative development.
If you now include Hadoop in the architecture, there are some dramatic potential effects:
1) You now have an integrated platform that combines the traditional documents of search with other data that has customarily sat outside of search solutions but is gaining in importance, such as log files that capture user behaviour and supplemental data such as dictionaries, taxonomies, and wiki data.
2) A high-performance, elastic platform that supports cloud computing and functionality that was not feasible before; I will talk about those on the next slide.
3) A platform that can reduce integration, development, and management cost.
4) A platform that can make your organization more agile and your search architecture more scalable.
5) And the ability to perform reindexing at a much faster rate than ever possible.
Let me use the top two examples on this list as an illustration. Let me be clear: we are not saying these things were not possible before; they just were not practical for most organizations due to past hardware, software, and programming resource constraints.
Search and Match: this is the ability to do a search across the index and then do a precise matching of other documents based on the concepts within those documents. We are currently developing this for one of the largest staffing and recruiting agencies in the world, finding CVs to match job requirements and job requirements to match CVs.
Forward and reverse citations: many business and technical documents have links and other forms of citations to other documents. Your enterprise may not currently be as link-rich as the internet; however, that is changing, and if you are in certain businesses such as medical research, insurance, legal, or financial services, the ability to utilize citations in your search solutions is critical. We have developed a solution for a major patent company, a leader in intellectual property (IP) management, that utilizes forward and reverse patent citations to accomplish its mission.
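A toy sketch of the forward-and-reverse-citation idea: the document IDs and links below are invented for illustration (a real patent corpus would supply them), but the inversion step is the core of it, and it is exactly the kind of job that parallelizes well over a large corpus:

```python
from collections import defaultdict

# Forward citations: each document lists the documents it cites.
# These IDs are hypothetical examples.
forward = {
    "patent-A": ["patent-B", "patent-C"],
    "patent-B": ["patent-C"],
    "patent-D": ["patent-A"],
}

def reverse_citations(fwd):
    """Invert the forward-citation map: for each document, who cites it?"""
    rev = defaultdict(list)
    for doc, cited_docs in fwd.items():
        for cited in cited_docs:
            rev[cited].append(doc)
    return dict(rev)

# patent-C is cited by both patent-A and patent-B.
print(reverse_citations(forward)["patent-C"])  # ['patent-A', 'patent-B']
```

With both maps available at index time, a search result can surface not only what a document cites but everything that cites it, which is the critical capability for patent and research search.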
[This may be too over the top. Kam, you need to say the conclusion you are comfortable with; this is just something to work from.]
We are very excited about the future of enterprise search. There may have been a bit of a stall in our industry in the last few years; however, we see a revolution coming.
The next generation of search:
- Powered by a Hadoop-based architecture
- Supported by elastic cloud computing
- Extended by new and exciting analysis techniques
This will increase functionality and agility while lowering cost. We will be able to do what we only dreamed about in the past. It may still not be Captain Kirk talking to a cognitive computer, but we will effectively utilize the massive amount of content that we have at our disposal and provide the new level of support and experience our users crave.
So, I'm out of time. Thanks for your attention. If you have any questions, please come and find our booth in the KMWorld hall.