See conference video - http://www.lucidimagination.com/devzone/events/conferences/ApacheLuceneEurocon2011
Apache Hadoop has rapidly become the framework of choice for enterprises that need to store, process and manage large data sets. It helps companies derive more value from existing data as well as collect new data, including unstructured data from server logs, social media channels, call center systems and other data sets that present new opportunities for analysis. This keynote will provide insight into how Apache Hadoop is being leveraged today and how it is evolving to become a key component of tomorrow's enterprise data architecture. This presentation will also provide a view into the important intersection between Apache Hadoop and search.
Search + Big Data: It's (still) All About the User - Grant Ingersoll
1. Search + Big Data:
It’s (still) All About the User
Grant Ingersoll, Chief Scientist – Lucid Imagination
grant@lucidimagination.com
October 19, 2011
2. Promise and Reality
“Data is increasingly digital air: the oxygen we
breathe and the carbon dioxide that we exhale. It
can be a source of both sustenance and pollution.”
Six Provocations for Big Data
by Danah Boyd and Kate Crawford
“The truth is, I spend most of my time trying to
reduce the size of my data so it can be analyzed.”
Hilary Mason, Chief Scientist, Bitly @ Strata
6. Benefits
§ End users
• Better relevance/conversion
• Serendipity
• Better/faster insight
§ Business:
• ROI
• Awareness across organization
• Enablement
• Agility
7. Needs
§ Fast, efficient, scalable search
§ Large scale, cost effective storage
§ Processing Power:
• Large scale distributed for whole data consumption
• Streaming/In Memory for real time needs
• Ability to learn
§ Willingness to ask questions
9. Search
§ Good, scalable search is a given
• Talks: Chitouras, Sturlese, Binns, Miller
§ Custom Relevancy via function queries, boosts
§ Explore other relevance models
• Talks: Muir, Pugh
• Lucene/Solr trunk has pluggable scoring (BM25, etc.)
§ NRT for timeliness
• Talks: Busch
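BM25 is named above as one of the pluggable scoring models. As a rough illustration of what such a model computes, here is a minimal standalone sketch in Python. It is not Lucene/Solr code: the function name `bm25_scores`, the toy documents, the IDF variant, and the conventional defaults `k1=1.2` and `b=0.75` are all assumptions made for this example.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the query with a basic BM25.

    docs is a list of token lists. k1 controls term-frequency saturation
    and b controls document-length normalization (assumed defaults).
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs

    # Document frequency: number of documents containing each term.
    doc_freq = Counter()
    for d in docs:
        doc_freq.update(set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            # A common non-negative IDF variant.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Saturating term frequency, normalized by relative document length.
            norm = tf[term] * (k1 + 1) / (tf[term] + k1 * (1 - b + b * len(d) / avg_len))
            score += idf * norm
        scores.append(score)
    return scores


docs = [
    ["solr", "search", "search"],
    ["hadoop", "pig"],
    ["search", "hadoop", "mahout", "pig"],
]
scores = bm25_scores(["search"], docs)
```

In this toy corpus the first document mentions the query term twice in a short text, so it scores highest; the second does not contain the term and scores zero.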
10. Discovery
§ Facets
• Talks: Yonik
• Classification, Taxonomy
§ Clustering
• Talk: Frank S.
§ Suggestions
• Auto-suggest, Spelling, More Like This, Related Searches, search trails
§ Visualization
12. Analytics for End Users
§ Offline
• Popularity/Click
• Link Analysis
• Search Trails
• Recommendations
• Spellchecking weights
• Collocations
§ Online
• Trends/Stats
• Social/Personal
• Location
STORM
13. Analytics for Internal Users
§ Offline
• Top X
• Zero results
• MRR, MAP
• User segmentation
• Location, conversions
• Ad hoc Analysis
§ Online
• Trends
• Operational alerts (QPS, DPS, etc.)
GIRAPH
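MRR (mean reciprocal rank) and MAP (mean average precision) from the offline list are standard relevance metrics computed from ranked results and relevance judgments. The sketch below shows one way to compute them in Python; the function names, the `runs` structure, and the sample judgments are hypothetical and only illustrate the definitions.

```python
def reciprocal_rank(ranked, relevant):
    """Return 1/rank of the first relevant result, or 0.0 if none is found."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def average_precision(ranked, relevant):
    """Average of precision@k taken at each rank k where a relevant doc appears."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_over_queries(metric, runs):
    """Average a per-query metric over (ranked_results, relevant_set) pairs."""
    return sum(metric(ranked, relevant) for ranked, relevant in runs) / len(runs)


# Hypothetical judgments: one (ranked results, relevant set) pair per query.
runs = [
    (["a", "b", "c"], {"b"}),
    (["x", "y", "z"], {"x", "z"}),
]
mrr = mean_over_queries(reciprocal_rank, runs)
map_score = mean_over_queries(average_precision, runs)
```

In practice the ranked lists would come from query logs or a test harness and the relevant sets from click data or editorial judgments, which ties these metrics to the log analysis mentioned on this slide.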
14. What’s Missing?
§ The glue is up to you (us?)
• Lucene Index -> Pig/Others
• Mahout -> Pig/Others
• Mahout -> Lucene/Solr
• Logs -> Pig/Others
§ Nice to have:
• More in-index functionality (that performs)
§ Aggregations
§ Arbitrary stats
§ Complex Joins
15. What’s Next?
“I can have all the data I want to have – but I still
have to communicate it to our players. It has to
get into their minds. And they have to utilize it.”
Brad Stevens, Head Basketball Coach,
Butler University in Oct. ‘11 McKinsey Quarterly