This tutorial, offered at the 10th International Conference on Web Engineering, presents the peculiarities of advanced Web search applications, describes some tools and techniques that can be exploited, and offers a methodological approach to development. The approach proposed in this tutorial is based on the paradigm of Model Driven Development (MDD), where models are the core artifacts of the application life-cycle and model transformations progressively refine models to achieve an executable version of the system. To cope with the process-intensive nature of the main interactions (e.g., content analysis, query management, etc.), we describe the use of Process Models (e.g., BPMN models). Indeed, search-based applications are considered process- and content-intensive applications, given the trend towards exploratory search and the search-as-a-process vision.
That is, it might not be clear to the system whether the user is “recall-oriented” or “precision-oriented”.
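The distinction between the two orientations can be made concrete with the standard precision and recall measures. The following is a minimal sketch; the document identifiers and relevance judgments are invented for illustration. A precision-oriented user cares mostly about the first number, a recall-oriented user about the second.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: iterable of document ids returned by the system
    relevant:  iterable of document ids judged relevant (ground truth)
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical result lists for the same query.
relevant_docs = {"d1", "d2", "d3", "d4"}
short_list = ["d1", "d2"]  # few results, all of them relevant
long_list = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"]  # all relevant ones, plus noise

print(precision_recall(short_list, relevant_docs))  # (1.0, 0.5)
print(precision_recall(long_list, relevant_docs))   # (0.5, 1.0)
```

The short list would suit a precision-oriented user, the long one a recall-oriented user; the system usually cannot tell which of the two it is serving.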
In information retrieval, users express their information needs as queries submitted to the system. While in data management systems like databases users are often required to express queries in a formal, structured language (e.g., SQL or XQuery, which have exact matching predicates and unambiguous semantics), in information retrieval the semantics of the query corresponds to the semantics associated with its content, which is interpreted in order to retrieve the relevant results. Hence, it is not possible to provide a taxonomy for information retrieval queries based, for instance, on the expressive power of the underlying query language. Nonetheless, we can provide a functional classification of queries as follows.
Complex search is characterized by: multiple searches, possibly over multiple sessions and spanning multiple sources of information; a combination of exploration and more directed information-finding activities; the need for note-taking; and the variation of the search goal during the search process.
From a high-level perspective, “search” is enabled by mechanisms that allow the extraction of contents from data repositories (e.g., text files, audio files, video files, databases, etc.). Contents are then processed in order to build an index of the managed information, optimized to answer users’ queries efficiently. Before being indexed, contents are analyzed and enriched with annotations that build the contents’ representation. Along with the index, search relies on ranking models, i.e., mathematical methods that associate a score with the relevance of a content item with respect to a query. Once contents are indexed, multiple user interfaces (e.g., Web applications) provide users with the means to interact with the search engine by executing queries and displaying the retrieved results.
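The interplay between index and ranking model described above can be sketched in a few lines. This is only an illustrative toy, not the architecture of any specific engine: the tokenizer, the sample documents, and the choice of a plain TF-IDF score are all assumptions made for the example.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Build an inverted index: term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf
    return index


def search(index, n_docs, query):
    """Rank documents by a simple TF-IDF score (one of many ranking models)."""
    scores = defaultdict(float)
    for term in tokenize(query):
        postings = index.get(term, {})
        if not postings:
            continue  # term not in the collection
        idf = math.log(n_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical content items.
docs = {
    "a": "web search engines index web pages",
    "b": "audio and video search",
    "c": "databases answer structured queries",
}
index = build_index(docs)
results = search(index, len(docs), "web search")
print(results)  # document "a" ranks first, "b" second, "c" is not returned
```

The index is built once, ahead of query time, while the ranking function is evaluated per query; this separation is what makes it possible to attach several user interfaces to the same index.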
We define: (i) an Indexing process (represented as a dashed line), which addresses the indexing of contents coming from the application data sources (thus involving data retrieval from external sources, transformation or aggregation of the retrieved data and, finally, their indexing); (ii) a Query and Result Presentation (QRP) process (represented as a solid line), addressing the operations related to query execution, orchestration and result-set composition; and (iii) a User Interaction process (represented as a dotted line), i.e., the way users interact with the application’s functionalities.
One aspect of the proposed development framework is the definition of a methodology for the design and implementation of the application to be produced. A development approach based on a formal methodology and appropriate high-level modeling languages smoothly incorporates change management into the mainstream production life-cycle, and greatly reduces the risk that changes break the software engineering process. The proposed methodology follows the MDD approach by relying on incremental, iterative design steps that foster separation of concerns among the actors involved in the SBA design. The Conceptual Design macro activity represents the core of the development life-cycle, since it involves the main design activities. In the terminology of MDD, the BPMN Process Model can be seen as a Computation Independent Model (CIM), which specifies SBA requirements for the CAI and QRP processes; as we will see, the UI process is instead addressed as an interaction pattern composition activity. The WebML application model is a Platform Independent Model (PIM), which exploits SOA and Web hypertext interfaces as a technical space. Finally, the application code is a Platform Specific Model (PSM) for the Java 2 technical space. Initially, requirements are conceptualized in a Domain Model, which formalizes the essential data objects managed by the application, and a Process Model, which pinpoints the workflow of the CAI, QRP and UI processes. The link between the domain and process models is established by the type of objects that flow between activities. The designed solutions do not take into account domain-specific information like the schema of the adopted search technologies, or the format of the annotations produced by the analysis components. Nonetheless, the focus on a specific class of applications allows one to include, in the business model, high-level concepts relative to the applications’ domain.
For SBAs, for instance, such concepts include query, user, index and so on. The use of a high-level model combined with coarse-grained domain concepts allows one to put the designed application in perspective, possibly by creating designs that apply to whole classes of applications (e.g., audiovisual search engines) rather than to individual solutions. Abstract-level notation, though, cannot be translated into running code, due to the lack of the platform-specific details (e.g., the technologies adopted by actual search engines, analysis components, deployment platform, etc.) needed to enact code generation. The Domain Model and Process Model are then subject to a first (CIM to PIM) transformation, which produces the Application Model and process metadata. Therefore, coarse-grained design is followed by refinements that take into account more domain-specific information, like the structure and format of the contents, the annotations and the indexes. To do so, a finer-grained model is adopted, in order to enable the definition of domain- and application-specific details that can lead to automatic code generation. The proposed approach is generic enough to accommodate alternative modeling languages, both for process and application design. This slide discusses how to derive an application model from a high-level process model. The proposed framework employs the BPMN modeling language for process specification and the WebML modeling language for the design of hypertexts and Web service orchestrations.
Let’s now take a bird’s-eye view of some reference example designs for all three identified SBA processes. The CAI process can be defined as the work to be performed by the actors of an SBA to achieve the indexing of a content item. The goal of the domain model is to formalize content- and index-related data and metadata managed by the search applications. Such models build on five basic domain concepts:
+ Content Item: an individual information unit which is relevant in a search-based Web application for indexing purposes.
+ Annotation: the textual information associated with a content item for indexing and searching purposes. Such information might be of different kinds: manual annotation, provided by the content provider or by the user, and automatically generated annotation, produced by the search application during the Indexing process.
+ Usage Group: Content Items are published by one or more Content Providers, which are responsible for their publication. A Usage Group is an access profile specified by a content provider to define the set of operations that a set of users is allowed to perform on a given content item.
+ Index: the notion of Index, well known in many disciplines of computer science, denotes a data structure designed to optimize speed and performance in finding relevant content items for a search query.
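To make the domain concepts above more tangible, here is a minimal sketch of how they might be rendered as data structures. It is a simplification under stated assumptions, not the tutorial's actual domain model: every field name, the `permits` and `lookup` helpers, and the modeling of a Content Provider as a plain string are hypothetical choices made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class AnnotationSource(Enum):
    MANUAL = "manual"        # provided by the content provider or by a user
    AUTOMATIC = "automatic"  # produced during the Indexing process


@dataclass
class Annotation:
    text: str
    source: AnnotationSource


@dataclass
class ContentItem:
    item_id: str
    provider: str  # the Content Provider publishing the item (hypothetical field)
    annotations: list = field(default_factory=list)


@dataclass
class UsageGroup:
    """Access profile: which users may perform which operations."""
    provider: str
    users: set
    allowed_operations: set  # operation names are illustrative

    def permits(self, user, operation):
        return user in self.users and operation in self.allowed_operations


@dataclass
class Index:
    """Toy index: annotation term -> ids of the content items carrying it."""
    postings: dict = field(default_factory=dict)

    def add(self, item):
        for annotation in item.annotations:
            for term in annotation.text.lower().split():
                self.postings.setdefault(term, set()).add(item.item_id)

    def lookup(self, term):
        return self.postings.get(term.lower(), set())


item = ContentItem(
    "clip-1",
    "provider-x",
    [Annotation("Interview about search engines", AnnotationSource.AUTOMATIC)],
)
index = Index()
index.add(item)
group = UsageGroup("provider-x", {"alice"}, {"search"})
print(index.lookup("search"), group.permits("alice", "search"))
```

Note that the index is populated from annotations rather than from raw content, mirroring the idea that annotations are the representation of a content item used for indexing and searching.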
User interaction design, instead, requires a small paradigm shift in the proposed methodology, since we manage it not as a process but as an assembly of standard interaction schemas expressed as patterns. The reason for this shift lies in the common knowledge that user interaction cannot be expressed as a linear process, given that users act driven by tasks which cannot always be serialized. Traditional information retrieval is inherently based on users searching for information, the so-called “information need”. Recent studies extended the importance of this cognitive process, embedding it into a broader category named information seeking. Such an extension is motivated by the fact that information needs and retrieval stem from social, cultural, biological, and anthropological contexts, which broaden the ways information is gathered. A commonly accepted taxonomy of information seeking identifies four modes. (This taxonomy considers two orthogonal classification dimensions: Directed and Undirected refer to whether an individual explicitly seeks information by specifying her need by means of a query, or is more or less randomly exposing herself to information; Active and Passive, instead, refer to whether the individual does anything actively to acquire information, or is passively available to absorb information but does not seek it out. With the advent of the so-called Web 2.0, the four information seeking interactions listed in the previous section have been enhanced by the availability of new features. The transformation of end-users from passive recipients of content and communication into active contributors gave them new flavors, providing additional means for all four interaction modes.)
We identified more than 30 patterns, which we organized into 3 categories:
+ Query and Result Presentation patterns, containing general-purpose patterns that enable the execution of queries addressed to the search application and the presentation of their results;
+ Information Interaction patterns, for the specification of the four information seeking modalities presented in the previous section;
+ Permission Management patterns, which contain general-purpose patterns that enable usage permission management.
Thanks to the implemented extensions, we inject more information into the higher-level model, thus leading to: + finer-grained application models + fewer errors + more efficiency. Transformations were implemented in ATL, a language for model transformations. Here’s a graphical example of a model transformation between BPMN* activities and the WebML model, and here’s a snippet just to give you a hint of how transformations are coded.
+ Indri/Lemur: language modeling; BM25, Okapi, cosine similarity, InQuery
+ Lucene: TF-IDF, weighted by term occurrences; fielded search
+ Terrier: Okapi BM25, language modeling and TF-IDF; Divergence from Randomness
+ Your own re-ranking code using open search
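Of the ranking models listed above, Okapi BM25 is compact enough to sketch directly. The code below is a minimal illustration of the textbook formula, not the implementation used by Terrier, Lemur or any other toolkit; the parameter values `k1=1.2` and `b=0.75` are common defaults assumed here, the IDF uses the non-negative variant ln(1 + (N - n + 0.5) / (n + 0.5)), and the sample documents are invented.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score tokenized documents against a query with Okapi BM25.

    docs: dict mapping doc_id -> list of tokens.
    """
    n_docs = len(docs)
    avg_len = sum(len(tokens) for tokens in docs.values()) / n_docs
    term_freqs = {doc_id: Counter(tokens) for doc_id, tokens in docs.items()}
    scores = {}
    for doc_id, tokens in docs.items():
        score = 0.0
        for term in query_terms:
            tf = term_freqs[doc_id][term]
            if tf == 0:
                continue  # term absent from this document
            # Document frequency: number of documents containing the term.
            df = sum(1 for freqs in term_freqs.values() if term in freqs)
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            # Length normalization: long documents are penalized via b.
            norm = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores[doc_id] = score
    return scores


# Hypothetical, already tokenized documents.
docs = {
    "d1": "lucene ranks documents with tf idf".split(),
    "d2": "terrier supports bm25 and language modeling".split(),
    "d3": "bm25 bm25 ranking".split(),
}
scores = bm25_scores(["bm25"], docs)
print(sorted(scores, key=scores.get, reverse=True))  # ['d3', 'd2', 'd1']
```

Compared with plain TF-IDF, the term frequency contribution saturates (controlled by `k1`) and is normalized by document length (controlled by `b`), which is why the short document `d3` outranks `d2` here.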
+ Not enough comparative benchmarks out there. Hard to do; we really need standards
+ Optimize each platform, per hardware and data set
+ Lots of platforms, with different APIs, options and numerical settings
+ Need good, diverse data sets, small & large
Lucene was the only solution that produced an index smaller than the input data size. It shaves off an additional 5 megabytes if one runs it in optimize mode, but at the cost of adding another ten seconds to indexing. sphinx and zettair index the fastest. Interestingly, I ran zettair in big-and-fast mode (which uses 300+ megabytes of RAM) but it ran slower by 3 seconds (maybe because of the nature of tweets). Xapian ran 5x slower than sqlite (which stores the raw input data in addition to the index) and produced the largest index files. The default index_text method in Xapian stores positional information, which blew the index size up to 529 megabytes. One must use index_text_without_positions to make the size more reasonable. I checked my Xapian code against the examples and documentation to see if I was doing something wrong, but I couldn’t find any discrepancies. I also included a column about development issues I encountered. zettair was by far the easiest to use (simple command line) but required transforming the input data into a new format. I had some text issues with sqlite (which also needs to be recompiled with FTS3 enabled) and with sphinx, given their strict input constraints. sphinx also requires a conf file, and it took some searching to find full examples of one. Lucene, zettair, and Xapian were the most forgiving when it came to accepting text inputs (zero errors).
With a larger data set (3x larger than the Twitter one), we see zettair’s indexing performance improve (which makes sense, as it is designed more for larger corpora); zettair’s search speed should probably be a bit faster, because its search command line utility prints some unnecessary stats. For multi-searching in sphinx, I developed a Java client (hoping to make it competitive with Lucene, the one to beat) which connects to the sphinx searchd server via a socket (that’s their API model in the examples). sphinx returned searches the fastest, about 3x faster than Lucene. Its indexing time was also on par with zettair. Lucene obtained the highest relevance and the smallest index size. Its index time could probably be improved by fiddling with its merge parameters, but I wanted to avoid numerical adjustments in this evaluation. Xapian has very similar search performance to Lucene but with significant indexing costs (both time and space > 3x). sqlite has the worst relevance because it doesn’t sort by relevance, nor does it seem to provide an ORDER BY function to do so.
<!-- When a message arrives on the portType for the operation &quot;process&quot;, instantiate a variable named &quot;Request&quot; --> <!-- Typically the request will contain a single Record. Multiple records are produced, for example, by annotators that examine zip|rar|tgz archives. The extension activity will be executed if the 'workflow-attribute' attribute on the record contains the value &quot;split&quot;. Conditions are expressed as XPath expressions, and the attributes and annotations used must be explicitly made available to the BPEL workflow through configuration (of org.eclipse.smila.blackboard). -->
RAP: Rich Ajax Platform
G-Eclipse: extensible framework including a GRID model for seamless integration of GRID/Cloud resources. It supports different Grid/Cloud interfaces, including AWS.
Example: the token “saw”. Stemming might return just “s”, whereas lemmatization attempts to return “see” or “saw”, depending on whether the token is used as a verb or as a noun.
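The difference can be illustrated with a toy sketch. Both functions below are deliberately simplistic stand-ins written for this example: the suffix list is not a real stemming algorithm (and, unlike the aggressive stemmer in the example above, it leaves “saw” untouched), while the lemmatizer is just a lookup in a tiny hand-made lexicon keyed by part of speech.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # crude rule list, not a real stemming algorithm

# Tiny hand-made lexicon: (token, part of speech) -> lemma.
LEMMAS = {
    ("saw", "VERB"): "see",
    ("saw", "NOUN"): "saw",
    ("seeing", "VERB"): "see",
    ("saws", "NOUN"): "saw",
}


def crude_stem(token):
    """Strip the first matching suffix, ignoring the word's meaning."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) > len(suffix) + 1:
            return token[: -len(suffix)]
    return token


def lemmatize(token, pos):
    """Look up the dictionary form; fall back to the token itself."""
    return LEMMAS.get((token, pos), token)


print(crude_stem("saws"))         # saw
print(lemmatize("saw", "VERB"))   # see
print(lemmatize("saw", "NOUN"))   # saw
```

The key point is the extra `pos` argument: a stemmer works on the string alone, while a lemmatizer needs to know how the token is used in order to choose between “see” and “saw”.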