Personalized Search: Building a prototype to infer the user's interest
1. @ODSC
Search:
Beyond Lucene with
Open Source
Deep-Learning
Babis Marmanis
@marmanis
babis@marmanis.com
OPEN DATA SCIENCE CONFERENCE
Boston | May 20-22nd, 2016
13. Clickstream samples (user, query, clicked page):
babis, google ads, file:/E:/code/github/yooreeka/data/ch02/biz-03.html
babis, google ads, file:/E:/code/github/yooreeka/data/ch02/biz-03.html
babis, google ads, file:/E:/code/github/yooreeka/data/ch02/biz-02.html
dmitry, google ads, file:/E:/code/github/yooreeka/data/ch02/biz-01.html
dmitry, google ads, file:/E:/code/github/yooreeka/data/ch02/biz-01.html
dmitry, google ads, file:/E:/code/github/yooreeka/data/ch02/biz-01.html
dmitry, google ads, file:/E:/code/github/yooreeka/data/ch02/biz-01.html
16. Deep Learning via Multi-Layer NNs
• Input Nodes
• Hidden Nodes
• Output Nodes
21. Deep Learning Models
• Recurrent Nets and LSTMs
• Convolutional Nets (ConvNets)
• Restricted Boltzmann Machines
• Denoising Autoencoders
22. Some ideas to experiment with …
• Obtain representations of words based on a large corpus that you have indexed
-- Automatic query expansion (see the sketch after this list)
-- Boost related content
• Learn from sequences of clicks and predict other relevant sequences
• Create profile-based summaries
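To make the query-expansion idea concrete, here is a hedged sketch: the word vectors below are made-up stand-ins, and in practice they would be learned from the corpus you have indexed.

```python
import numpy as np

# Hypothetical sketch of embedding-based query expansion: the vectors
# here are stand-ins; real ones would be trained on your own corpus.
word_vectors = {
    "ads":       np.array([0.9, 0.1, 0.0]),
    "adverts":   np.array([0.8, 0.2, 0.1]),
    "analytics": np.array([0.1, 0.9, 0.3]),
}

def expand_query(terms, k=1):
    expanded = list(terms)
    for term in terms:
        v = word_vectors.get(term)
        if v is None:
            continue
        # rank the other vocabulary words by cosine similarity to the term
        sims = sorted(
            ((w, v @ u / (np.linalg.norm(v) * np.linalg.norm(u)))
             for w, u in word_vectors.items() if w != term),
            key=lambda p: -p[1],
        )
        expanded += [w for w, _ in sims[:k]]
    return expanded

print(expand_query(["ads"]))  # ['ads', 'adverts']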
Welcome everyone and good afternoon.
I appreciate the opportunity to speak with you today.
I am Babis Marmanis and this session is about searching and, in particular,
the role that deep learning libraries can play in improving our search results.
Let’s consider a simple web search, a task that each of us performs dozens or even hundreds of times per day.
The slide shows the results for the query term “yooreeka”, which is the name of an open source machine learning library.
I am only showing the results of Yahoo and Google, in the interest of space, but I could have shown a number of other search engines as well.
Which search engine provides us with better results?
Both engines do a decent job, but the ranking is not quite right: neither engine returns what I was looking for as the #1 result.
This is not surprising. In both cases, the engine doesn't know who I am, where I am, what my intention is, and so on.
In other words, these search engines do not have enough information to give me exactly what I want within the context of my present activity.
This is true not only for web search; it is also true for searches within our applications.
In fact, I would argue that it is even more important for our applications to incorporate contextual information into their search engines. Most of what I will talk about here is advice for enterprise software and niche search engines.
So, …
Context is multi-dimensional. In order to explore the various dimensions that are relevant in your application ask questions such as:
What is the purpose of the search?
What is the nature of the content that our objects consist of?
What is the nature of the structure of the objects that we search on?
What characteristic dimensions are available to us about the users who conduct the search?
When the context is determined by the latter, we speak of “personalization” of the search results.
I gave an example of applying contextual information in the case of a real-world Text and Data Mining system, in domain-specific R&D work.
Now, let us quickly go through another example, much simpler this time, and experiment with adding context in the IR workflow, in order to illustrate the main ideas and make these concepts more concrete.
Let us consider a set of news feeds covering a variety of topics, and set as our goal the creation of a system that will enable us to identify articles that are relevant to us. The feeds are typical web pages with links to other web pages (outlinks).
Let us also confine the input of our system to be a traditional textbox.
That textbox captures a query string that expresses the information needs of the user.
To tackle such a fairly common task, one would typically invoke the power of Lucene, either directly or through larger systems such as Solr and Elasticsearch, which encapsulate Lucene's excellent search engine.
Lucene can help you quickly index your documents and execute queries against those indices; Lucene, Solr, and Elasticsearch are easy to use and provide a rich set of search engine features. So, you may wonder: “If Lucene is so sophisticated and efficient, why bother with anything else?”
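To keep the ideas concrete, here is a minimal pure-Python sketch of the TF-IDF-style scoring at the heart of Lucene-like retrieval. It is not the Lucene API (Lucene adds normalizations, boosts, and other refinements), and the toy documents merely echo the example pages.

```python
import math
from collections import Counter

# Illustrative TF-IDF ranking; the document texts are made up.
docs = {
    "biz-01.html": "google ads and marketing analytics",
    "biz-02.html": "google cloud services",
    "biz-03.html": "online ads pricing by google",
}
tokenized = {name: text.split() for name, text in docs.items()}

def idf(term):
    # smoothed inverse document frequency
    n = sum(1 for toks in tokenized.values() if term in toks)
    return math.log(len(docs) / (1 + n)) + 1.0

def score(query, toks):
    tf = Counter(toks)
    return sum(tf[t] * idf(t) for t in query.split())

query = "google ads"
for name, toks in sorted(tokenized.items(),
                         key=lambda kv: -score(query, kv[1])):
    print(f"{score(query, toks):.3f}  {name}")
```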
Let us modify our example a bit to begin our journey towards searching beyond Lucene.
We are going to add new feeds in our corpus and these new feeds will now contain SPAM.
The deliberate creation of deceptive web pages can significantly impair the effectiveness of traditional IR techniques.
Indeed, quickly running a query with the text “google ads” returns the bad website as the #1 result. In this simple example we have only one bad website; in a real-world scenario, however, the user could be inundated with undesirable results.
When search engines relied solely on traditional IR techniques, web surfing (our national online sport) wasn't as rewarding as it is today.
This brings us to one of the first major techniques that have improved search results in the past 15 years or so, namely, Link analysis.
Link analysis, that is, algorithms such as PageRank and HITS, leverages information encoded in the structure of a graph that represents the relationships between the various objects of our search space.
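As a concrete illustration, here is a compact power-iteration sketch of PageRank; the link structure below is made up for the toy pages of our example.

```python
# Power-iteration PageRank over a toy link graph (links are invented).
damping = 0.85
links = {
    "biz-01.html": ["biz-02.html", "biz-03.html"],
    "biz-02.html": ["biz-01.html"],
    "biz-03.html": ["biz-01.html", "biz-02.html"],
}
pages = list(links)
rank = {p: 1.0 / len(pages) for p in pages}

for _ in range(50):  # iterate until the ranks stabilize
    new = {p: (1 - damping) / len(pages) for p in pages}
    for p, outlinks in links.items():
        for q in outlinks:
            new[q] += damping * rank[p] / len(outlinks)
    rank = new

print(rank)
```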
In order to improve the results that we get from Lucene, we will use an open source library called Yooreeka.
Yooreeka offers, amongst other things, an implementation of PageRank.
For production purposes and big data, you should consider the PageRank implementation in Apache Spark's GraphX project.
So, we will calculate the PageRank for the websites of our example, and we will produce the final ranking of the results by combining the relevance score coming from Lucene with the PageRank score of each website included in the results.
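The exact blending formula is a design choice; the sketch below shows one plausible weighted combination (the weight alpha is illustrative, not Yooreeka's actual scheme).

```python
# One plausible way to blend the two signals; the weight is illustrative.
def combined_score(lucene_score, pagerank, alpha=0.7):
    # alpha weighs textual relevance against link-based importance
    return alpha * lucene_score + (1 - alpha) * pagerank

print(combined_score(1.7, 0.4))
```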
Moreover, we will pretend that we have captured the interaction (clickstream) of every user with the system for a given query.
These data will be given to a Naive Bayes classifier that will calculate the probability that a feed is relevant given a user and a query.
That probability will also affect the final ranking of the results.
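Here is a simplified stand-in for that computation: a smoothed click-count estimate of how likely each page is to be relevant for a given (user, query) pair. A full Naive Bayes model would factor in more features, but the idea is the same; the records mirror the slide-13 samples.

```python
from collections import Counter

# Clickstream records: (user, query, clicked page), as on slide 13.
clicks = [
    ("babis",  "google ads", "biz-03.html"),
    ("babis",  "google ads", "biz-03.html"),
    ("babis",  "google ads", "biz-02.html"),
    ("dmitry", "google ads", "biz-01.html"),
    ("dmitry", "google ads", "biz-01.html"),
]
pages = ["biz-01.html", "biz-02.html", "biz-03.html"]

def p_relevant(user, query, page):
    counts = Counter(p for u, q, p in clicks if (u, q) == (user, query))
    total = sum(counts.values())
    return (counts[page] + 1) / (total + len(pages))  # Laplace smoothing

for p in pages:
    print(p, round(p_relevant("babis", "google ads", p), 2))
```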
As you can see, the spam dropped to the bottom of the result set for both users.
Moreover, the preferences of the users have been reflected in the search results.
Now, the purpose of our simple example was to motivate us to consider search systems that apply contextual knowledge about their users and their search objectives, such as histories, persona profiles, constraints of any type, or devices of access. Such systems would be able to provide targeted search experiences and higher-quality results.
This is not something new; within the IR community, there is increased interest in applying knowledge of user interests, intentions, and context in order to improve aspects of search such as relevance ranking and query suggestion. This is especially important for exploratory and/or complex tasks that can span multiple queries or search sessions, such as the case of the RightFind explorer that we saw earlier.
When the context provided by the interactions that occur during complex tasks is taken into account by search systems, we can support the users’ broader information needs. Recommendation systems can also benefit from taking into account contextual information, for the very same reasons that search systems can.
The approach delineated above has helped us improve our search results; however, it has the following disadvantages:
We needed to know ahead of time that the content had structure (the links between the feeds)
We needed to know in advance that PageRank was a good way to assign greater weight to pages that are frequently linked to by other pages.
We needed to explicitly define the features and provide the values of the user’s clicks
Now, imagine a system where context is inferred rather than hard-coded, and where features are detected automatically rather than selected a priori.
It is possible to build such a search system by implementing a deep learning component.
So, let's have a five-minute introduction to deep learning.
Over the past decade, we have seen fantastic progress in machine learning and, in particular, in what is known as “deep learning”, which has led to various Hollywood scenarios about the “rise of the machines” and so on. Rest assured, Skynet is not yet on its way to world domination.
We do not yet have learning algorithms that can discover the visual cues and semantic concepts that are necessary to replicate most of what a human can do.
What the state-of-the-art allows us to do today is to:
Generate hierarchical representations from raw data
Use generic algorithms to do so, even if the same algorithm is used in a different learning architecture from case to case
Although these things are more modest than the ultimate goal of AI research, they can be very valuable in practice.
The term “Deep Learning” was popularized by Geoffrey Hinton, and its key aspect is the provision of fast learning algorithms for deep (many-layered) belief networks.
Since the mid-80s, we have known that we can build general-purpose machine learning systems based on multi-layered Neural Networks, such as the one shown in this slide.
Multi-layer NNs consist of three classes of computational nodes:
Input nodes, which collect the data fed to the network
Hidden nodes, where all the “magic” happens
Output nodes, where we collect the results
This type of network can be trained with the back-propagation algorithm to approximate, in principle, any mapping between the input and the output.
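To make the forward and backward passes concrete, here is a self-contained numpy sketch of a one-hidden-layer network trained with back-propagation on a toy XOR problem; all names and sizes are illustrative.

```python
import numpy as np

# Minimal one-hidden-layer network with back-propagation (toy XOR).
rng = np.random.default_rng(42)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # input nodes
y = np.array([[0], [1], [1], [0]], dtype=float)              # desired output

W1 = rng.normal(size=(2, 4))   # input -> hidden weights
W2 = rng.normal(size=(4, 1))   # hidden -> output weights
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for step in range(10000):
    # Forward pass: input nodes -> hidden nodes -> output nodes
    h = sigmoid(X @ W1)
    out = sigmoid(h @ W2)

    # Backward pass: propagate the output error to both weight layers
    err_out = (out - y) * out * (1 - out)
    err_hid = (err_out @ W2.T) * h * (1 - h)
    W2 -= 0.5 * h.T @ err_out
    W1 -= 0.5 * X.T @ err_hid

# Usually approaches [[0], [1], [1], [0]]; with an unlucky initialization,
# training can stall in a local optimum, one of the very problems noted next.
print(out.round(2))
```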
However, there were a few problems. For example:
Labeled data were required in order to calculate the errors, while most data that we have are unlabeled.
The learning time was onerous for multiple hidden layers.
Training could get stuck in local optima that were far from the global optimum.
Building a production-ready Multi-Layered Neural Network is a major undertaking. However, over the past couple of years, open source libraries that can help you with deep learning implementations have matured, and two major libraries were open-sourced by two software giants: CNTK by Microsoft and TensorFlow by Google.
We will now provide a brief summary of four libraries:
Let’s start our tour with Keras and Theano.
Theano is a Python library that makes writing deep learning models easy and gives you the option of training them on a GPU.
There is plenty of good documentation, along with publicly available tutorials and YouTube videos.
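Here is the canonical flavor of a Theano program: you declare a symbolic expression and compile it into a callable function. This example computes an elementwise logistic.

```python
import theano
import theano.tensor as T

# Declare a symbolic matrix and an expression over it,
# then compile the expression into a callable function.
x = T.dmatrix('x')
s = 1 / (1 + T.exp(-x))          # elementwise logistic
logistic = theano.function([x], s)

print(logistic([[0, 1], [-1, -2]]))
```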
Keras is a minimalist, highly modular neural networks library, written in Python and capable of running on top of either TensorFlow or Theano. It was developed with a focus on enabling fast experimentation. Being able to go from idea to result with the least possible delay is key to doing good research.
Use Keras if you need a deep learning library that:
allows for easy and fast prototyping (through total modularity, minimalism, and extensibility).
supports both convolutional networks and recurrent networks, as well as combinations of the two.
supports arbitrary connectivity schemes (including multi-input and multi-output training).
runs seamlessly on CPU and GPU.
If you are a Python fan, that’s a good choice.
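A minimal Keras example, in the 1.x API that was current at the time of this talk; the random data stand in for a binary relevance signal.

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Dense

# Placeholder data standing in for a binary relevance signal.
X = np.random.random((100, 20))
y = np.random.randint(2, size=(100, 1))

model = Sequential()
model.add(Dense(64, input_dim=20, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop', loss='binary_crossentropy',
              metrics=['accuracy'])
# nb_epoch is the Keras 1.x spelling (renamed `epochs` in Keras 2)
model.fit(X, y, nb_epoch=10, batch_size=32)
```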
Deeplearning4j (DL4J) is a great choice if Java or Scala is your preference.
It supports a wide range of neural networks:
Restricted Boltzmann machines
Convolutional Nets (images)
Recurrent Nets/LSTMs (time series and sensor data)
Recursive autoencoders
Deep-belief networks
Deep Autoencoders (Question-Answer/data compression)
It can also scale on top of Apache Spark, and it has commercial support.
Enter the “big boys” club!
CNTK is one of those toolkits that can help in the widespread adoption of deep learning in the real world.
Open Sourced in November of 2015!
Core written in C++
It offers different front ends for specifying and driving the computation.
Computation is expressed as a dataflow graph whose edges carry tensors (N-dimensional arrays), and it can be fully distributed across different processes, CPUs, and GPUs.
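TensorFlow's Python front end makes the same dataflow-graph idea concrete. A tiny sketch in the 1.x-era API contemporary with this talk: the graph is declared first, and a Session then executes it wherever the runtime places it.

```python
import tensorflow as tf  # TensorFlow 1.x-era API (2016)

# Declare the dataflow graph: edges carry tensors, nothing runs yet.
a = tf.placeholder(tf.float32, shape=[None, 3], name="input")
w = tf.Variable(tf.random_normal([3, 2]), name="weights")
y = tf.matmul(a, w)

# A Session executes the graph; the runtime decides placement
# across CPUs, GPUs, or remote processes.
with tf.Session() as sess:
    sess.run(tf.initialize_all_variables())
    print(sess.run(y, feed_dict={a: [[1.0, 2.0, 3.0]]}))
```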
The libraries that we mentioned allow us to create the most frequently used and widely successful deep-learning models.
CNTK and TensorFlow, in particular, provide production-quality characteristics. For example:
• Ease of expression: if you have a great idea, it shouldn't take you 2 months to implement it.
• Scalability: you should be able to run your experiments quickly and not slow down to a crawl when the size of your data grows.
• Portability: you probably want to experiment on your laptop or workstation as you design the architecture of your model, but you want to test it on scalable and powerful infrastructure.
• Reproducibility: it should not be hard to share your results with colleagues to validate an idea or get feedback.
The essence of what we discussed today is that the state of the art in search involves more than what you would get from a straightforward TF-IDF index. Results can be improved significantly by incorporating various deep-learning models, which take contextual data as input and produce sequential or hierarchical representations that can later be used to boost the scoring of the appropriate documents before our search service returns its results to our application.
The power of deep-learning models is now available to everyone, thanks to the open source libraries that we reviewed, and it can be used to improve search results in a myriad of ways by incorporating signals from contextual data. For nearly every application, capturing and processing contextual data should be straightforward, if it is not already happening through logging or other event monitoring. Lastly, the computational power required for this work is now available, since the deep learning libraries are designed to take advantage of any GPUs and to run in fully distributed mode on on-demand, cloud-based infrastructures.
Thank you.