This presentation was given by Search Technologies' CEO Kamran Khan at the November 2013 Enterprise Search Summit / KMWorld in Washington DC. He discussed how modern search engines are being combined with powerful independent content processing pipelines and the distributed processing technologies of big data to form a new and exciting enterprise search architecture, delivering results that in the past were available only to the biggest companies with the deepest pockets. For more information visit http://www.searchtechnologies.com/.
Enterprise Search Summit Keynote: A Big Data Architecture for Search
1. A Big Data Architecture for Search
Kamran Khan, CEO
The expert in the search space
2. Search Technologies Overview
Ascot, UK
Karlsruhe, DE
Cincinnati, OH
Herndon, VA
San Diego, CA
San Jose, CR
• The leading IT Services company dedicated to
Enterprise Search & Search-based Applications
• Implementation, Consulting, Managed Services
• 120 employees and growing
• Independent, working with all of the leading
software vendors and open source alternatives
4. What Is Big Data?
5. Where Did Modern Big Data Come From?
[Diagram: rows of web servers and the content they serve]
6. What is Big Data?
[Diagram: a multitude of log files]
7. What is Big Data?
Too big for a single machine
• Physically impossible for a single machine
Data Aggregation & Analysis
• Simply transforming data records is not enough
• Must aggregate / “boil down” the data
Batch Processing
• Very long running jobs (not real-time)
Message: Lots of Data “Big Data”
8. Enabling Technologies
Big Data For Search
• Hadoop
• Elastic / Cloud Computing
• Modern Statistical Analysis
9. What is Big Data?
[Diagram: many content objects being distributed across Hadoop]
10. A Traditional Integrated Architecture
Does a lot of what we need for Enterprise Search
[Diagram: content sources (SharePoint, file system, RDBMS, employee directory, etc.) feeding through Aspire connectors and the index pipeline into the search engine's search index]
Limitations
• Limited support for modern analytics
• Limited support for content processing
• Re-indexing takes too long
• Limits ability to do continuous improvement cycle
11. Why Content Processing is Important
[Diagram: content sources (employee directory, file system, RDBMS, etc.) feeding through Aspire connectors into a content processing stage, then through the index pipeline into the search engine's search index]
• Powerful & Complete Content Processing Service
• Clean and consistent data and metadata
• Ability to supplement metadata
• Support for Continuous Improvement Cycle
• Develop and maintain processing IP
• Ability to easily migrate to new search engines
12. A New Enterprise Search Architecture
[Diagram: content sources (employee directory, file system, RDBMS, etc.) feeding through Aspire connectors into a content processing & tokenization stage, backed by a secure cache and analytics over docs, log files, and supplemental data, then through the index pipeline into the search engine's search index]
• Integrated Platform (Docs, Log Files and External data)
• Reduced Cost
• Better Agility and Scalability
• Fast Reindexing
• Expanded Functionality
13. Advanced Features & Analytics Enabled
Search and Match
Forward and Reverse Citation
Latent Semantic Analysis
More Precise Term Weighting Beyond TF/IDF
Near Duplicate Detection
Document Topic Tagging
Results ranking including popularity
Recommendations based on user behavior
Suggested queries based on user behavior
14. In Summary
Big Data Technology Will Revolutionize Enterprise Search
A new architecture structured for search, providing better:
• Analytics and other functionality
• Content processing
• Agility
• Economics and scalability
Big Data architectures will significantly move search forward
First image) You would think the Wikipedia definition would be enough; however, it says there is "so much data it is awkward." That is not very helpful, and weren't large amounts of data awkward 20 years ago too? So a look through a Google image search for the term "big data" should help; a picture is worth a thousand words.
Second image) A classic Venn diagram with the terms Volume (a lot of data), Velocity (data that comes fast), and Variety (different types of data). We have all heard these three terms with big data; however, what type of data, and what must you do with it, for it to be big data?
Third image) This definition graphic combines the Wikipedia definition with the classic three terms, but we are still not really sure what it is.
Fourth image) Hold on, I thought there were three V's; now there are four?
Fifth image) No, five!
Sixth image) Now it looks like a parking zone for all kinds of big numbers and buzz terms.
Seventh image) And social network icons.
Eighth image) Maybe it is just a black hole of numbers and terms.
Still often leaving us confused.
OK, so if one aspect of this revolution involves "Big Data," I am required to give some explanation. Modern big data came out of companies such as Google and Yahoo needing to process their expanding log files, which contained data on their users' behavior, along with ever-increasing numbers of web pages. We are talking about the ability to deal with massive numbers of data objects, and the agility to analyze them, utilizing multiple servers to rapidly achieve a result.
Slides 4 and 5 should be combined to have a cool animation of the big single log file object breaking up into a multitude of jobs, and then those jobs combining to produce a result. What Google, Yahoo, and others developed, now called Hadoop, is the ability to take one large processing job, break it into many small jobs that each do the same thing on only a part of the data set, and then have these jobs all report their results to a function that reduces them to a desired result, such as a report or an analytic display. I don't know why it took us so long to get here; bees and ants have been doing this for as long as there have been bees and ants!
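The split-process-combine pattern described above can be sketched in miniature. This is not Hadoop itself, just plain Python standing in for the map and reduce phases; the log lines and the two-worker pool are invented for illustration:

```python
from collections import Counter
from multiprocessing import Pool

# Hypothetical log lines; in a real Hadoop job each mapper would
# receive a block of a large file from the distributed file system.
LOG_LINES = [
    "GET /search?q=hadoop 200",
    "GET /search?q=solr 200",
    "GET /search?q=hadoop 404",
    "GET /index.html 200",
]

def map_phase(line):
    """One small job: count HTTP status codes in its slice of the data."""
    status = line.rsplit(" ", 1)[-1]
    return Counter({status: 1})

def reduce_phase(partials):
    """Combine every job's partial counts into one desired result."""
    total = Counter()
    for partial in partials:
        total += partial
    return total

if __name__ == "__main__":
    # Two local worker processes stand in for a cluster of servers.
    with Pool(2) as pool:
        partials = pool.map(map_phase, LOG_LINES)
    print(reduce_phase(partials))  # Counter({'200': 3, '404': 1})
```

Each mapper sees only part of the data and does the same simple thing; the reducer boils the partial results down to one report, which is the whole trick.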
As is typical with the next generation of any industry, a set of enabling technologies comes together at a point in time. We believe the three central enabling technologies are:
- Hadoop, for reliable and scalable job processing
- Elastic or cloud computing, to provide Hadoop with cost-effective computing and storage infrastructure
- Modern statistical analysis, based on working with large, complete data sets rather than the old style of analysis that was designed to work with minimal and incomplete data sets
Here is a traditional search architecture. In a traditional enterprise search system, much of this iterative improvement work is done in the indexing pipeline. This is where metadata is prepared, awkward headers and footers are removed, and content is normalized to help the relevancy algorithm compare dissimilar documents.
With the traditional architecture, any meaningful change to the index pipeline will require a full re-index. As data sets grow, this becomes increasingly onerous. A typical enterprise search indexing rate is 3 documents per second. Getting the data from the repositories is almost always the bottleneck; they are not set up for mass bulk exports. So even if you have a modest amount of content, say 10 million documents, it takes weeks to re-index.
I'm sure most of you already know how long reindexing times impede agility and solution quality. You want to use new metadata to support additional features, you want to use cleaner and more consistent content for better precision, you want to add some entity extraction to increase relevance; however, you delay because of the pain and expense until some disaster forces you to do it. Commonly used workarounds, such as developing systems based on a small sample of data, have their own severe limitations, but there is no time to go into those here.
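The re-indexing arithmetic above is easy to check, using the figures quoted in the talk (3 documents per second, 10 million documents):

```python
DOCS = 10_000_000   # a "modest" enterprise corpus, per the talk
RATE = 3            # typical enterprise indexing rate, documents per second

seconds = DOCS / RATE
days = seconds / (24 * 60 * 60)   # seconds in a day
print(f"Full re-index at {RATE} docs/sec: {days:.1f} days")  # 38.6 days
```

Well over five weeks for a single full pass, which is why every pipeline change becomes a project in itself.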
By including Hadoop, with its file system designed for massive amounts of content and its automated, reliable distributed processing, running either on your existing servers or on elastic computing within a cloud, and by utilizing a strong independent content processing framework that can also take advantage of Hadoop, we now have an architecture that supports modern analytics, advanced content processing, and shorter re-indexing times. We believe this architecture can reduce the reindexing of 10 million documents, which often takes several weeks, to a day or two. All of this combines to provide an environment where you can effectively run a continuous improvement cycle and iterative development.
If you now include Hadoop in the architecture, there are some dramatic potential effects:
1) You now have an integrated platform that combines the traditional documents of search with other data that has customarily sat outside of search solutions but is gaining in importance, such as log files that capture user behaviour and supplemental data such as dictionaries, taxonomies, and wiki data.
2) A high-performance, elastic platform that supports cloud computing and functionality that was not feasible before; I will talk about those on the next slide.
3) A platform that can reduce integration, development, and management cost.
4) A platform that can make your organization more agile and your search architecture more scalable.
5) And the ability to perform reindexing at a much faster rate than ever possible.
Let me use the top two examples on this list as an illustration. Let me be clear: we are not saying these things were not possible before; they just were not practical for most organizations due to past hardware, software, and programming resource constraints.
Search and Match: this is the ability to do a search across the index and then do a precise matching of other documents based on the concepts within those documents. We are currently developing this for one of the largest staffing and recruiting agencies in the world, finding CVs to match job requirements and job requirements to match CVs.
Forward and reverse citations: many business and technical documents have links and other forms of citations to other documents. Your enterprise may not currently be as link-rich as the internet; however, that is changing, and if you are in certain businesses such as medical research, insurance, legal, or financial services, the ability to utilize citations in your search solutions is critical. We have developed a solution for a major patent company, a leader in intellectual property (IP) management, that utilizes forward and reverse patent citations to accomplish its mission.
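A toy sketch of the forward-and-reverse-citation idea: the document IDs and links below are invented for illustration (a real patent corpus would supply them), but the inversion step is the core of it, and it is exactly the kind of job that parallelizes well over a large corpus:

```python
from collections import defaultdict

# Forward citations: each document lists the documents it cites.
# These IDs are hypothetical examples.
forward = {
    "patent-A": ["patent-B", "patent-C"],
    "patent-B": ["patent-C"],
    "patent-D": ["patent-A"],
}

def reverse_citations(fwd):
    """Invert the forward-citation map: for each document, who cites it?"""
    rev = defaultdict(list)
    for doc, cited_docs in fwd.items():
        for cited in cited_docs:
            rev[cited].append(doc)
    return dict(rev)

# patent-C is cited by both patent-A and patent-B.
print(reverse_citations(forward)["patent-C"])  # ['patent-A', 'patent-B']
```

With both maps available at index time, a search result can surface not only what a document cites but everything that cites it, which is the critical capability for patent and research search.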
[This may be too over the top. Kam, you need to say the conclusion you are comfortable with; this is just something to work from.]
We are very excited about the future of enterprise search. There may have been a bit of a stall in our industry in the last few years; however, we see a revolution coming.
The next generation of search:
- Powered by a Hadoop-based architecture
- Supported by elastic cloud computing
- Extended by new and exciting analysis techniques
This will increase functionality and agility while lowering cost. We will be able to do what we only dreamed about in the past. It may still not be Captain Kirk talking to a cognitive computer, but we will effectively utilize the massive amount of content that we have at our disposal and provide the new level of support and experience our users crave.
So, I'm out of time. Thanks for your attention. If you have any questions, please come and find our booth in the KMWorld hall.