Applied enterprise semantic mining
- 1. Mark Tabladillo Ph.D.
Data Mining Scientist
MarkTab Inc.
Applied Enterprise Semantic Mining
TEXT MINING WITH SQL SERVER 2012
PRESENTED AT ATLANTA MICROSOFT BUSINESS INTELLIGENCE GROUP
JANUARY 28, 2013
©2013 MARK TABLADILLO, ALL RIGHTS RESERVED WORLDWIDE
- 3. Introduction
SQL Server 2012 has new Programmability Enhancements
◦ Statistical Semantic Search
◦ File Tables
◦ Full-Text Search Improvements
These combined technologies make SQL Server 2012 a strong contender in text mining
- 4. Challenges
Building and Maintaining Applications with relational and non-relational data is hard
◦ Complex integration
◦ Duplicated functionality
◦ Compensation for unavailable services
80% of all data is not stored in databases!
Most of it is “unstructured”
(2012, Michael Rys, Microsoft)
- 6. History
July 2008
◦ Microsoft purchases Powerset for US$100 Million
◦ Google Dismisses Semantic Search
◦ http://venturebeat.com/2008/06/26/microsoft-to-buy-semantic-search-engine-powerset-for-100m-plus/
◦ http://www.forbes.com/2008/07/01/powerset-msft-search-tech-intel-cx_ag_0701powerset.html
- 7. History
March 2009
◦ Google announces “snippets” as relevant to search
◦ The media picks this story up as “semantic search”
◦ http://googleblog.blogspot.com/2009/03/two-new-improvements-to-google-results.html#!/2009/03/two-new-improvements-to-google-results.html
- 8. History
February 2012
◦ Google announces Knowledge Graph, an explicit application of semantic search
◦ http://mashable.com/2012/02/13/google-knowledge-graph-change-search/
- 9. History
April 2012
◦ Microsoft purchases 800+ patents from AOL for US$1 Billion
◦ Among the patents are semantic search and metadata querying – older than Google
◦ http://www.theregister.co.uk/2012/04/09/aol_microsoft_patent_deal/
- 10. New in SQL Server 2012
http://msdn.microsoft.com/en-us/library/cc645577.aspx
- 11. Goals of Semantic Search
Reduce the cost of managing all data
Simplify the development of applications over all data
Provide management and programming services for all data
Make SQL Server the preferred choice for managing unstructured data and for building rich application experiences on top of it
(2012, Michael Rys, Microsoft)
- 12. Statistical Semantic Search
Identifies statistically relevant key phrases
Based on these phrases, can identify (by score) similar documents
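To make the two ideas on this slide concrete (scored key phrases, then similar documents ranked by score), here is a minimal Python sketch. It is a toy illustration only: it uses plain TF-IDF weights and cosine similarity as a stand-in, which is an assumption and not SQL Server's actual algorithm (the product relies on its own statistical language models). All function names are hypothetical.

```python
import math
from collections import Counter

PUNCTUATION = ".,;:!?\"'()"


def tokenize(text):
    """Lowercase, split on whitespace, and strip simple punctuation."""
    words = (w.strip(PUNCTUATION).lower() for w in text.split())
    return [w for w in words if w]


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document (plain TF-IDF, an assumption)."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append(
            {t: (c / len(tokens)) * math.log(n / doc_freq[t]) for t, c in counts.items()}
        )
    return vectors


def key_phrases(vector, top=3):
    """Highest-weighted terms of one document, as (term, score) pairs."""
    return sorted(vector.items(), key=lambda kv: (-kv[1], kv[0]))[:top]


def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similar_documents(vectors, index):
    """Rank every other document by similarity to the document at `index`."""
    scores = [(i, cosine(vectors[index], v)) for i, v in enumerate(vectors) if i != index]
    return sorted(scores, key=lambda s: -s[1])


if __name__ == "__main__":
    docs = [
        "semantic search finds key phrases in documents",
        "full-text search indexes documents by keyword",
        "the cafeteria menu lists soup and salad",
    ]
    vectors = tfidf_vectors(docs)
    print(key_phrases(vectors[0]))
    print(similar_documents(vectors, 0))
```

Terms that appear in every document get a weight of zero, so only distinguishing terms drive both the key-phrase list and the similarity ranking.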
- 13. FileTables
Built on existing SQL Server FILESTREAM technology
Files and documents
◦ Stored in special tables in SQL Server
◦ Accessed as if they were stored in the file system
- 14. Full-Text Search Enhancements
Property search: search on tagged properties (such as author or title)
Customizable NEAR: find words or phrases close to one another
New Word Breakers and Stemmers (for many languages)
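The idea behind a customizable NEAR (two terms must occur within a chosen distance of each other, optionally in a fixed order) can be sketched in a few lines of Python. This is a conceptual toy, not SQL Server's implementation: the function `near`, its parameters, and the choice to measure distance as the number of tokens between the two terms are all assumptions made for illustration.

```python
PUNCTUATION = ".,;:!?\"'()"


def near(text, first, second, max_distance, match_order=False):
    """Return True if `first` and `second` occur with at most `max_distance`
    tokens between them. With match_order=True, `first` must come before
    `second`. Hypothetical helper for illustration only."""
    tokens = [w.strip(PUNCTUATION).lower() for w in text.split()]
    first_positions = [i for i, t in enumerate(tokens) if t == first.lower()]
    second_positions = [i for i, t in enumerate(tokens) if t == second.lower()]
    for i in first_positions:
        for j in second_positions:
            if i == j:
                continue  # the same token cannot satisfy both terms
            if match_order and j < i:
                continue  # wrong order when order matters
            if abs(j - i) - 1 <= max_distance:
                return True
    return False


if __name__ == "__main__":
    slide = "SQL Server 2012 adds statistical semantic search to full-text search"
    print(near(slide, "semantic", "search", 0))
    print(near(slide, "sql", "semantic", 2))
```

Raising `max_distance` loosens the match, while `match_order=True` tightens it, which is the kind of tuning a customizable proximity operator exposes.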
- 15. From Documents to Output
[Diagram: document inputs (Office, PDF, Varchar, NVarchar) lead to Rowset Output with Scores]
- 16. “Beyond Relational” vs. “Adoption”
Start with unstructured (meaning non-relational) data
Use Windows technology
◦ Reading and Writing Files (Win32 API)
◦ iFilters for reading proprietary formats
Develop indexed structure from unstructured data
- 17. (iFilter Required)
[Diagram: Documents, iFilters, Full-Text Keyword Index “FTI”, Semantic Database, Semantic Key Phrase Index – Tag Index “TI”, Semantic Document Similarity Index “DSI”]
- 18. “iFilter”?
IFilters are components that allow search services to index content of specific file types, letting
you search for content in those files.
They are intended for use with Microsoft Search Services (SharePoint, SQL, Exchange, Windows
Search).
- 19. Microsoft Office 2010 Filters Pack
Legacy Office Filter (97-2003; .doc, .ppt, .xls)
Metro Office Filter (2007; .docx, .pptx, .xlsx)
Zip Filter
OneNote filter
Visio Filter
Publisher Filter
Open Document Format Filter
- 20. Adobe PDF iFilter 9 for 64-bit platforms
Allows PDF search
Not currently supported for Windows 7 or 8
◦ But I used it anyway
Add the Bin directory to your path
◦ Computer (right click), Properties, Advanced System Settings, Environment Variables
- 21. “Semantic Language Statistics Database”?
This database contains the statistical language models required by semantic search.
A single semantic language statistics database contains the language models for all the
languages that are supported for semantic indexing.
- 22. Languages Currently Supported
Traditional Chinese
German
English
French
Italian
Brazilian
Russian
Swedish
Simplified Chinese
British English
Portuguese
Chinese (Hong Kong SAR, PRC)
Spanish
Chinese (Singapore)
Chinese (Macau SAR)
- 23. Phases of Semantic Indexing
Full-Text Keyword Index “FTI”
Semantic Document Similarity Index “DSI”
Semantic Key Phrase Index – Tag Index “TI”
http://msdn.microsoft.com/en-us/library/gg492085.aspx#SemanticIndexing
- 24. Performance
- 25. Integrated Full Text Search (iFTS)
Improved Performance and Scale:
◦ Scale-up to 350M documents for storage and search
◦ iFTS query performance 7-10 times faster than in SQL Server 2008
◦ Worst-case iFTS query response times less than 3 sec for corpus
◦ Similar or better than main database search competitors
(2012, Michael Rys, Microsoft)
- 26. Linear Scale of FTI/TI/DSI
First known linearly scaling end-to-end Search and Semantic product in the industry
[Chart: Time in Seconds vs. Number of Documents]
(2011 – K. Mukerjee, T. Porter, S. Gherman – Microsoft)
- 27. Conclusion
SQL Server 2012 adds new text processing capabilities
This technology scales linearly
Microsoft positions it for enterprise-level applications with millions of documents
- 28. Network
MarkTab Consulting
◦ http://marktab.com
Blog
◦ http://marktab.net
Twitter
◦ @marktabnet
- 29. Appendix
- 31. Demo: My Semantic Search Sample
http://mysemanticsearch.codeplex.com/
Requires:
◦ iFilters
◦ Semantic Language Statistics Database
◦ IIS7, IIS6, with Windows Authentication
◦ .NET 4.0
◦ Silverlight 4.0
◦ FILESTREAM (complete)
- 32. Demo: T-SQL and Documents
Naveen Garg
Requires Adventure Works (from Codeplex)
http://blogs.msdn.com/b/sqlfts/archive/2011/07/21/introducing-fulltext-statistical-semantic-search-in-sql-server-codename-denali-release.aspx
- 33. Abstract
SQL Server 2012 debuts a new Semantic Platform (commonly known as the Semantic Search
applied task). This text mining technology leverages the already established Full Text Index and
builds semantic indexes in a two-phase process. This session's detailed description and demo
give you important information for the enterprise implementation of Tag Index and Document
Similarity Index. The demo is a web-based Silverlight application showing how to interactively
use semantic search. Currently, the indexes work for 15 languages. We'll also look at strategy
tips for how to best leverage the new semantic technology with existing Microsoft text and data
mining functionality.