Proposal for a

              Thesis in the Field of

            Information Technology

  In Partial Fulfillment of the Requirements

     For a Master of Liberal Arts Degree



               Harvard University

                Extension School

                    10/18/2004


                    Clifford Lyon

               53 West Emerson Street

              Melrose, MA 02176-3109

                   (617) 225-3293

                   (781) 663-7703

                clyon928@comcast.net

            Proposed Start Date: 10/4/2004

        Anticipated Date of Graduation: 6/2005

Thesis Directors: Sergei Makar-Limanov and Bhiksha Raj




1 Tentative Thesis Title:
Visualization of High-Dimensional Clickstream Data Using Java

Keywords: Clustering, Unsupervised Learning, Critic, Search, Data Visualization, Java 2D/3D,
Clickstream, Data Mining, Machine Learning


2 Abstract
Unsupervised learning holds out a promise for the discovery of objectively valid disaggregate
patterns within large clickstream data stores. Using an interactive data visualization interface and
clustering algorithms, the software designed and delivered by this project will allow the
exploration of clickstream data in a subjectively meaningful way.


3 Thesis Project Description

3.1 Background
Clickstream data accumulated by a commercial website offers site managers the potential for
objective insight into their audience unparalleled in other publishing media. Unlike their print,
television, and radio counterparts, web publishers have access to a detailed record of events
generated by their visitors. Each time a visitor requests a URL, a webserver records the request
and some information about the visitor’s browser in a log file. However, the potential for insight
remains in large part unrealized for the commercial Internet despite the availability of this
detailed behavior record, well-established machine learning algorithms, exponential growth in
processing power, and decreased memory and storage cost. This is in contrast to the evident
success of personalization and targeting efforts by sites such as Amazon.com and Netflix.com
that estimate the posterior probability of user decisions from previous behavior to present
contextually relevant recommendations. Content automation is certainly one positive outcome of
modeling behavior using clickstream data. However, it is fundamentally an application of
knowledge at the transaction level, not at the enterprise level. In contrast, this project seeks to
recognize patterns in web data at a high level, and to build an interface capable of presenting
these patterns to a non-technical (business) user in a meaningful way.


3.1.1 Challenge of Clickstream Data
A key reason for the lack of progress in the application of standard machine learning algorithms
to clickstream data lies in the nature of the data itself. In recent years, academics in market
research and applied economics have started building behavioral models using clickstream data.
The initial papers are interesting and encouraging. However, the shape of the data presents a
fundamental challenge. Unlike typical market research surveys and polls, web data is vast, noisy,
and censored. For example, the website contributing data for this project records more than one
billion events each month. The interesting events are those generated by real people interacting
with the website using a web browser. Software robots making requests for content generate
noise in the system. For very different reasons, these robots traverse a website using the same
protocol and transactional processes as real people. For example, a robot might gather
information for use in a search index, cache pages for a proxy server, or artificially increase
popularity for a particular product featured on a site by repeatedly requesting information. The
noise is not easily separable from interesting events. This is partly because it is easy for a robot
to do everything a real person would do, and partly because a person who configures his or her
browser to interact as minimally as possible with the webserver may appear to be a robot.
Moreover, a proxy server will appear as a single user, but in reality may convey requests for
thousands of users. Typically, a time series known as a session stores the sequence of events
generated by a user during a site visit. Analyzing time series data can help separate
robot-generated events from real traffic. However, sessions are censored in the sense that there is no
event signifying completion. In particular, because the start of an event marks the end of the
previous event in the series, the final event has an unknown duration. These factors make it
difficult to model user-website interaction using raw clickstream data.
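
The censoring problem can be made concrete with a minimal Java sketch. It assumes event start times are available as ascending millisecond timestamps for one session; the class and method names are hypothetical and are not part of the proposed software. Each duration is the gap to the next event, and the final event is left without a duration rather than being given a guessed value.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.OptionalLong;

/** Hypothetical helper: derives per-event durations from the timestamps of one session. */
final class SessionDurations {

    /**
     * @param timestampsMillis event start times for one session, in ascending order (assumed)
     * @return one entry per event; the final entry is empty because no later
     *         event marks its end (the session is censored)
     */
    static List<OptionalLong> durations(long[] timestampsMillis) {
        List<OptionalLong> result = new ArrayList<>();
        for (int i = 0; i < timestampsMillis.length; i++) {
            if (i + 1 < timestampsMillis.length) {
                // The start of the next event marks the end of this one.
                result.add(OptionalLong.of(timestampsMillis[i + 1] - timestampsMillis[i]));
            } else {
                // Last event: the duration is unknown, not zero.
                result.add(OptionalLong.empty());
            }
        }
        return result;
    }

    /** Mean of the known durations only; NaN when the session has a single event. */
    static double meanKnownDuration(long[] timestampsMillis) {
        return durations(timestampsMillis).stream()
                .filter(OptionalLong::isPresent)
                .mapToLong(OptionalLong::getAsLong)
                .average()
                .orElse(Double.NaN);
    }
}
```

A single-event session therefore contributes no duration at all, which is one reason one-page sessions need separate treatment in any duration-based metric.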

Recent efforts to apply market research techniques to clickstream data have generally used a
regression model to expose some sort of interesting user behavior pattern. Typically, websites
have extensive reporting at an aggregate level, but little real insight into user segments or the
differences between these segments. A recurring theme in recent research is the need to model
behavior in a disaggregate way to account for and expose behavior away from the mean. In
particular, Bucklin and Sismeiro (2003) suggest that accounting for user heterogeneity is of
critical importance, and that using aggregate metrics can potentially lead to the wrong
conclusions.


3.2 Approach
We propose inverting the general approach to the user behavior problem: rather than build
vectors of user or visit behavior, we will build vectors of pages, with features derived from user
or visit behavior. Metric design will account for user heterogeneity by incorporating aggregate
metrics from user dimensions as features. For example, the duration of time spent on the page
might vary depending on the time of day or the position of the page in the session. Representing
duration at a disaggregate level ensures the preservation of variance that allows users, and so
pages, to be successfully partitioned. The utility of an inverted approach is two-fold: first, we
express results in terms of website entities, which are under a site manager’s control. A low
repeat-visit rate, while important information for a site manager to know, suggests no direct
action. On the other hand, a site manager who learns that certain categories of pages are less
likely to generate repeat visits has a clear area on which to focus efforts to improve the site.
Second, using the data produced by our learning exercise as extra input for an existing user-based
model may improve its predictive power. By modeling pages using unsupervised learning first,
we remove the bias of categorical features established by the top-down human design of the site.
When looking at behavior on a website, it is important to distinguish the hierarchical site
structure from the behavior on it as much as possible. Remodeling the business-driven
categorical entity as a behavioral class structure can help create better user models. A stretch
goal for the project is to use the page classification in a user behavior model to demonstrate the
value of clustering as a way of segmenting data behaviorally and supporting heterogeneity.

The software delivered by this project will cluster pages on the website according to their natural
order in the data. Clustering groups similar pages together. For example, pages visited on the
weekend by a young audience might fall into one cluster, and pages visited at the start and end of
the workday by an older group might fall into a second cluster. The pages within each group or
cluster are more similar to each other than to pages in other groups. Thus, the ordering emerges
from the data itself, rather than from an external agent. In this sense, the order is “natural.” This
approach is termed “unsupervised learning” because there is no known target class for the input
data; the model is fit to the features of the input data. The clustering algorithm assigns each page
to a class based on features derived from user interaction on that page. A user interface (GUI)
will visualize the clusters. We hope that the framework can make the often opaque results of
unsupervised learning subjectively meaningful for the site manager, that is, someone who
understands the problem domain well, but not the specifics of the machine learning process.
Unless the results have subjective meaning to the user of the software, the results will not be
useful. The software framework should be generalizable. Although the data set for our
investigation is specific, the application should perform reasonably well on other data sets.
Application testing includes scenarios using some of the common public domain machine
learning data sets, such as the iris data set (UCI Machine Learning Repository Content Summary).
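
To illustrate what assigning pages to clusters involves, the following is a minimal sketch of k-means (Lloyd's algorithm) in plain Java. The proposal does not commit to a particular algorithm at this point; k-means is used here only because it is one of the clusterers Weka provides, and this sketch is not the Weka implementation. The class name, the seeding rule (the first k rows become the initial centroids), and the parameters are assumptions made for illustration, and the sketch requires at least k rows.

```java
import java.util.Arrays;

/** Illustrative k-means (Lloyd's algorithm); not the Weka implementation. */
final class KMeansSketch {

    /** Squared Euclidean distance between two equal-length vectors. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /**
     * Assigns each row of data to one of k clusters.
     * Assumption: the first k rows seed the centroids, which keeps the sketch deterministic.
     */
    static int[] cluster(double[][] data, int k, int maxIterations) {
        int n = data.length;
        int dims = data[0].length;
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) centroids[c] = data[c].clone();
        int[] assignment = new int[n];
        Arrays.fill(assignment, -1);

        for (int iter = 0; iter < maxIterations; iter++) {
            boolean changed = false;
            // Assignment step: each instance joins its nearest centroid.
            for (int i = 0; i < n; i++) {
                int best = 0;
                for (int c = 1; c < k; c++) {
                    if (squaredDistance(data[i], centroids[c])
                            < squaredDistance(data[i], centroids[best])) {
                        best = c;
                    }
                }
                if (assignment[i] != best) {
                    assignment[i] = best;
                    changed = true;
                }
            }
            if (!changed) break; // converged: no instance changed cluster
            // Update step: each centroid moves to the mean of its members.
            double[][] sums = new double[k][dims];
            int[] counts = new int[k];
            for (int i = 0; i < n; i++) {
                counts[assignment[i]]++;
                for (int j = 0; j < dims; j++) sums[assignment[i]][j] += data[i][j];
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) continue; // an empty cluster keeps its old centroid
                for (int j = 0; j < dims; j++) centroids[c][j] = sums[c][j] / counts[c];
            }
        }
        return assignment;
    }
}
```

In practice the features would need to be scaled to comparable ranges before clustering, since raw counts and durations differ by orders of magnitude.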

In order to present data to the user, the GUI will project high-dimensional clustered page vectors
into two or three dimensions. There are established methods for achieving such a projection, such
as projecting onto the eigenvectors associated with the two or three largest eigenvalues of the data
covariance matrix. Generally, the idea is to eliminate or merge features in
a way that minimizes the introduction of error into the system as information is lost. The user
interface will allow the user to search among the clusters for items of interest. The user will act
as a critic by using the search function to establish subjective validity of a given set of clusters,
and by suggesting (weighting) a direction for more useful results. While this technique has some
precedent in machine learning literature (Duda, R., Hart, P. & Stork, D. 2001, p. 565), we are
unaware of specific applications that use search as a tool for cluster exploration.
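
As an illustration of an eigenvector-based projection, the sketch below computes the leading principal axes of the centered data by power iteration with deflation and projects each instance onto them. This is one common way to obtain such a projection and is shown under stated assumptions: the class name, the fixed iteration count, and the uniform starting vector are illustrative choices, not the method the project will necessarily use.

```java
/** Illustrative PCA projection by power iteration with deflation. */
final class PcaSketch {

    /** Returns a copy of data with the column means subtracted. */
    static double[][] center(double[][] data) {
        int n = data.length;
        int d = data[0].length;
        double[] mean = new double[d];
        for (double[] row : data) {
            for (int j = 0; j < d; j++) mean[j] += row[j] / n;
        }
        double[][] centered = new double[n][d];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < d; j++) centered[i][j] = data[i][j] - mean[j];
        }
        return centered;
    }

    /**
     * Approximates the leading eigenvector of a symmetric matrix.
     * Assumption: the uniform start vector is not orthogonal to that eigenvector.
     */
    static double[] leadingEigenvector(double[][] m, int iterations) {
        int d = m.length;
        double[] v = new double[d];
        java.util.Arrays.fill(v, 1.0 / Math.sqrt(d));
        for (int it = 0; it < iterations; it++) {
            double[] next = new double[d];
            for (int a = 0; a < d; a++) {
                for (int b = 0; b < d; b++) next[a] += m[a][b] * v[b];
            }
            double norm = 0.0;
            for (double x : next) norm += x * x;
            norm = Math.sqrt(norm);
            if (norm == 0.0) break; // zero matrix: keep the current vector
            for (int a = 0; a < d; a++) v[a] = next[a] / norm;
        }
        return v;
    }

    /** Projects the centered rows onto the top `components` principal axes. */
    static double[][] project(double[][] data, int components, int iterations) {
        double[][] x = center(data);
        int n = x.length;
        int d = x[0].length;
        double[][] cov = new double[d][d];
        for (double[] row : x) {
            for (int a = 0; a < d; a++) {
                for (int b = 0; b < d; b++) cov[a][b] += row[a] * row[b] / (n - 1);
            }
        }
        double[][] projected = new double[n][components];
        for (int c = 0; c < components; c++) {
            double[] v = leadingEigenvector(cov, iterations);
            // Rayleigh quotient v'Cv is the eigenvalue for the unit vector v.
            double lambda = 0.0;
            for (int a = 0; a < d; a++) {
                for (int b = 0; b < d; b++) lambda += v[a] * cov[a][b] * v[b];
            }
            for (int i = 0; i < n; i++) {
                for (int a = 0; a < d; a++) projected[i][c] += x[i][a] * v[a];
            }
            // Deflation: remove this component so the next pass finds the next axis.
            for (int a = 0; a < d; a++) {
                for (int b = 0; b < d; b++) cov[a][b] -= lambda * v[a] * v[b];
            }
        }
        return projected;
    }
}
```

Calling `PcaSketch.project(pageVectors, 2, 100)` would yield one (x, y) pair per page vector for a two-dimensional scatter plot. The sign of each axis is arbitrary, and requesting more components than the data has independent directions gives unreliable trailing axes.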

Although one readily finds examples of Java data visualization programs on the internet, we
found none that offered the feedback mechanism proposed here. An application notable for its
approach to dimensionality reduction is the two-dimensional cluster-visualization program
produced by IBM’s Alphaworks program, which can be found at

        http://www.alphaworks.ibm.com/formula/CViz

The Alphaworks program places cluster exemplars at the origin and extents of the x- and y- axes,
and plots instances based on similarity. The program translates similarity into Euclidean distance
on the plane. The x- and y-axes have no units. The exemplars at the origin and extents of the
axes triangulate the placement of clustered items in the two-dimensional space: the software
places items on the plane based on similarity to the three exemplars. This method has a few nice
properties: it is fast, it does not require a lot of extra computation, and it is visually meaningful
and intuitive. Exploring this method in three dimensions would be an interesting exercise. It
might provide a parsimonious way to scale the cluster space to a low dimensional representation.
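
The placement idea can be sketched in a few lines of Java. The description above gives the Alphaworks layout only at a high level, so the following is one possible reading and not IBM's algorithm: it assumes non-negative similarity scores to three exemplars fixed at (0,0), (1,0), and (0,1), and places each item at the similarity-weighted average of those three positions. All names are hypothetical.

```java
/** One possible reading of exemplar-based placement; not the Alphaworks CViz algorithm. */
final class ExemplarLayoutSketch {

    /**
     * @param simOrigin similarity to the exemplar at (0,0), assumed non-negative
     * @param simX      similarity to the exemplar at (1,0), assumed non-negative
     * @param simY      similarity to the exemplar at (0,1), assumed non-negative
     * @return the {x, y} position as a similarity-weighted average of the exemplar positions
     */
    static double[] place(double simOrigin, double simX, double simY) {
        double total = simOrigin + simX + simY;
        if (total <= 0.0) {
            // No similarity information: fall back to the center of the triangle.
            return new double[] {1.0 / 3.0, 1.0 / 3.0};
        }
        // The origin exemplar contributes (0,0), so only simX and simY move the point.
        return new double[] {simX / total, simY / total};
    }
}
```

Under this reading, an item similar to only one exemplar lands on that exemplar, and the same scheme extends to three dimensions by adding a fourth exemplar on the z-axis.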

A search for similar or related material uncovered no other papers on the topic of clustering web
pages for data visualization.


3.3 Data Description
The clustering algorithm will use data from a well-known shopping services, advice, and news
website. The site records approximately 70 million page events each day. Each time a user loads
a page, a tracking image is loaded, and the resulting log line in the server log represents a single
page event. Each page event has clickstream attributes from the webserver logline and attributes
derived from the site delivery application and site meta-data. Additionally, links into and out of
the site are tracked using an HTTP redirect. These redirects share the same attributes as the page
events.

The following table represents a sample of data available at the atomic level from the database:

Field Name                  Description
SESSION_ID                  Unique identifier for the session in which the page event occurred. (A
                            session is continuous activity with gaps of no more than 30 minutes.)
EVENT_SEQ_NUM               The sequence number of the event within the session
REFERRING_HOST              If the data is from an external site, the hostname of the external site
NETWORK_IP                  Foreign key to third party demographic data based on IP address.
                            Provides Country, US State, DMA, Line speed.
IP_ADDRESS                  Client IP Address
USER_AGENT                  The user agent of the browser performing the page request
EDITION                     The “branding” of the page.
PAGE_TYPE                   Identifies the template used to serve the page by the content application
PAGE_DURATION               Amount of time spent on the page
TIME_SINCE_SESS_START       Time elapsed since the first event of the session
IS_REG_USER                 Whether the client was a registered user
IS_NEW_USER                 Whether the client has been to the site before (cookie based)
IS_COOKIED_USER             Whether the client allows cookies
PAGE_SEQ_NUM                The sequence number of the page within the session (in contrast to
                            EVENT_SEQ_NUM, which includes redirects)
IS_LAST_PAGE                Whether the event was the last page
TIMESTAMP                   The date and time of the page request
ANONYMOUS_ID                ID based on website cookie
SITE_ID                     The site number of the event (40 total sites)
ONTOLOGY_NODE_ID            The location of the page in the site navigational hierarchy
IS_IAB_ROBOT                Whether the user agent is a known robot
IS_BEHAVIORAL_ROBOT         Whether the user agent behaves like a robot
SEARCH_PHRASE               The search phrase that the user types, if any. (Includes third party sites
                            like Google.)
REGISTRATION_ID             The id of the registered user, if any.

                                    Table 1 Sample Data Fields



These fields are the raw material that will form the aggregate page vectors. Four entities uniquely
identify a page on the website: site, page type, ontology, and asset. “Site” is a business
dimension that groups content together at a high-level. The dataset contains tens of sites. “Page
type” is an application dimension identifying the template used to render the content. The data
contains thousands of page types. An “Ontology” node is a navigational dimension describing
the area on the site where the page lives – for example, a “door”, or a “story” page. The data
contains thousands of ontology nodes. “Asset” refers to a particular piece of content or a product
featured on a page. There are tens of thousands of assets active each day, and millions
historically. We have intentionally left Asset out of the page vector key, as this would produce
far too many instances to be useful for clustering. We expect between one and ten thousand
vectors for clustering, depending on the choice of sites.

The page vector will have the following structure:

          SITE_ID, PAGE_TYPE, ONTOLOGY_NODE, derived attributes 1..n.

The derived attributes will be behavioral in nature, and computed from activity over some period:
30 or 60 days, for example. Initially, we are considering the following attributes for each page:

    •   Count total page views

    •   Count 1 page sessions (this page was the only page)

    •   Count 2-5 page sessions (sessions in which this page occurs)

    •   Count 5-10 page sessions

    •   Count 10+ page sessions

    •   Count registered user visits

    •   Count anonymous user visits

    •   Repeat visitor rate

    •   Average hits/day

    •   Average hits/weekday

    •   Average hits/weekend

    •   Average hits by hour of day, flattened

    •   Count session starts

    •   Count session stops

    •   Count leads (occurring in sessions with this page in it – leads are redirects to a partner
        site.)

    •   Total time spent on the page

    •   Average duration

    •   Average duration, weekday

    •   Average duration, weekend

    •   Average duration, by hours 1-24, by Time Zone

    •   Average “place in session” – where this event occurs, as a fraction of all events occurring
        in the session

    •   Anonymous and Registered user visits, for each world country (flattened)

    •   Anonymous and Registered user visits, for each US State

This is a starting point. As mentioned, asset is not part of the page vector, but we could preserve
attributes of the asset to qualify the metrics. For example, rather than using the product entity
itself to identify a page, we could use the product category. We expect that finding an identifiable
set of attributes for the page vector will require some exploration.
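
The aggregation from atomic page events to page vectors can be sketched as follows. The sketch assumes events have already been read into memory; the `PageEvent` and `PageVector` classes are hypothetical, their field names mirror Table 1, and only a handful of the candidate attributes are computed. Registered and anonymous activity is counted per event here, which is a simplification of the per-visit counts listed above.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative aggregation of page events into page vectors; all class names are hypothetical. */
final class PageVectorSketch {

    /** One page event, carrying a subset of the fields in Table 1. */
    static final class PageEvent {
        final int siteId;
        final String pageType;
        final String ontologyNodeId;
        final boolean isRegUser;
        final boolean isLastPage;
        final double pageDuration;

        PageEvent(int siteId, String pageType, String ontologyNodeId,
                  boolean isRegUser, boolean isLastPage, double pageDuration) {
            this.siteId = siteId;
            this.pageType = pageType;
            this.ontologyNodeId = ontologyNodeId;
            this.isRegUser = isRegUser;
            this.isLastPage = isLastPage;
            this.pageDuration = pageDuration;
        }
    }

    /** Running totals for one (site, page type, ontology node) key. */
    static final class PageVector {
        long pageViews;
        long registeredEvents;
        long anonymousEvents;
        long sessionStops;
        double totalDuration;
        long timedEvents;

        /** Mean duration over events whose duration is known; NaN if there are none. */
        double averageDuration() {
            return timedEvents == 0 ? Double.NaN : totalDuration / timedEvents;
        }
    }

    /** Groups events by page key; the asset is intentionally not part of the key. */
    static Map<String, PageVector> aggregate(List<PageEvent> events) {
        Map<String, PageVector> vectors = new LinkedHashMap<>();
        for (PageEvent e : events) {
            String key = e.siteId + "|" + e.pageType + "|" + e.ontologyNodeId;
            PageVector v = vectors.computeIfAbsent(key, k -> new PageVector());
            v.pageViews++;
            if (e.isRegUser) v.registeredEvents++; else v.anonymousEvents++;
            if (e.isLastPage) {
                // Last page of its session: the duration is censored, so it is not averaged.
                v.sessionStops++;
            } else {
                v.totalDuration += e.pageDuration;
                v.timedEvents++;
            }
        }
        return vectors;
    }
}
```

At the data volumes described, this aggregation would more plausibly run inside the data warehouse; the in-memory form is shown only to make the key and the derived attributes explicit.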


3.4 Data Flow
At a logical level, there are two major application components and two sources of input into the
system. The inputs are the data and the user feedback, and the components are the machine
learning and the data visualization components. Figure 1 illustrates the flow of external data into
and through the system:



    [Figure 1: data flow diagram. Machine Learning component: Import Data (from External Data),
    Instances, Assign Clusters, Scale for Presentation. Data Visualization component: Present Data,
    User, Process User Input.]

                                      Figure 1 Data Flow Diagram



    •   External Data flows into the system as a Weka dataset, a flat set of vectors containing
        page information.

    •   The machine-learning component applies the clustering algorithm to the vectors and
        thereby classifies each instance.

    •   The machine-learning component projects the clusters in two or three dimensions for
        presentation in a user interface.

    •   The data visualization component processes user feedback after the presentation.


    •   Depending on the feedback, the data visualization component re-presents the data, or
        re-classifies and then re-presents the data.


3.5 Architecture
The diagram in Figure 2 shows three physical architectural components. The shaded elements
represent elements that do not exist today; the unshaded elements represent third-party software
or data sources. The following subsections describe each of the three components.


    [Figure 2: system diagram. Pre-Processing: ClickStream DataStore, Extraction Script, ARFF file.
    Unsupervised Learning: Cluster Engine (Driver), Weka.core.Instances, Weka.classifiers.Evaluation,
    Weka.classifiers.Clusterers, Existing Clusterer, New Clusterer, Multi-Dimensional Scaling Filter.
    Visualization/Interaction: Java GUI, Weka.core.Instances, 2D/3D Projection Window, Control Panel.]

                                              Figure 2 System Diagram




3.5.1 Pre-processing
The preprocessing step prepares data for use. The extraction script reads data from a database
and writes it in Weka's ARFF data format. The Java application reads the formatted data
from disk into memory. We will aggregate the atomic events in the database to the page-level
record described above. We will filter out robot traffic as much as possible, using flags already
available in the database. The aggregation will exclude the last event of each session from
mean duration calculations, because its duration is unknown.


3.5.1.1 Clickstream Data Store
The data store exists today in the form of a large data warehouse for an anonymous web-
publishing company. They have agreed to share data for the project on condition that we
obfuscate user or commercially identifiable information. There will be some data transformation
within the data store: a script will aggregate the individual events with the page as the key,
forming the page vector described in the data description section.


3.5.1.2 Extraction Module
The extraction script pulls data from the data store. This module is a placeholder for the process
that creates a flat file in Weka format from the database. It may be a series of scripts, or it may
be a set of actions undertaken to spool query results to disk manually and add a header. Any code
developed will be handed in for inspection, but it should be understood that evidence of
completion is the Weka data file rather than code that created it. Anyone seeking to recreate this
experiment on his or her own would have to code this module by hand; the rest would flow from
there. As such, the extraction module is formally outside the bounds of the project.


3.5.1.3 Attribute-Relation File Format (ARFF) file
The Attribute-Relation File Format (ARFF) file constitutes the boundary of the application. A
Java program using the Weka class libraries can easily read the file into an in-memory
representation for machine learning by the various Weka modules. A website describing the
Weka data file format in detail is found at

        http://www.cs.waikato.ac.nz/~ml/weka/arff.html
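
For orientation, the sketch below writes a small ARFF document by hand: a `@relation` line, one `@attribute` line per feature, and a `@data` section of comma-separated rows. It is a minimal illustration, assuming numeric attributes only and attribute names without spaces or special characters; the class name, the relation name, and the attribute names in the usage example are hypothetical. The page key fields (site, page type, ontology node) would need nominal or string attributes, which are not shown.

```java
/** Illustrative writer for a numeric-only ARFF file; names are hypothetical. */
final class ArffSketch {

    /**
     * @param relation       relation name (assumed to contain no spaces)
     * @param attributeNames one name per column (assumed to contain no spaces)
     * @param rows           one array of values per instance, in column order
     * @return the ARFF document as a string
     */
    static String toArff(String relation, String[] attributeNames, double[][] rows) {
        StringBuilder sb = new StringBuilder();
        // Header section: relation name followed by attribute declarations.
        sb.append("@relation ").append(relation).append("\n\n");
        for (String name : attributeNames) {
            sb.append("@attribute ").append(name).append(" numeric\n");
        }
        // Data section: one comma-separated line per instance.
        sb.append("\n@data\n");
        for (double[] row : rows) {
            for (int i = 0; i < row.length; i++) {
                if (i > 0) sb.append(',');
                sb.append(row[i]);
            }
            sb.append('\n');
        }
        return sb.toString();
    }
}
```

For example, `ArffSketch.toArff("pages", new String[] {"page_views", "avg_duration"}, new double[][] {{1200, 35.5}})` produces a header with two numeric attributes followed by the single data line `1200.0,35.5`.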


3.5.2 Unsupervised Learning
The unsupervised learning component performs the actual clustering. It clusters instances read
from the pre-processing component and refines or redistributes clusters based on feedback from
the user through the GUI component.


3.5.2.1 Cluster Engine
The cluster engine is a driver that uses the Weka data-mining framework to read the ARFF file
into memory and exercise a clustering algorithm on that data. After performing the unsupervised
learning algorithm, it prepares data for presentation by creating low-dimensional projections of
the instances. The program appends the location in the low-dimensional space to the existing
attributes of the instance. At this point in the data path, the instance includes the original
features, the cluster identifier and any related cluster metrics such as distance from the centroid,
and the newly appended location in low-dimensional space.
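
The shape of the augmented instance can be sketched in plain Java. The example assumes that centroids and a low-dimensional location are already available from the clustering and scaling steps; it does not use the Weka API, and the class name and the attribute ordering are illustrative assumptions.

```java
/** Illustrative layout of an augmented instance; the ordering of appended values is an assumption. */
final class AugmentedInstanceSketch {

    /**
     * @param features  original feature vector of one instance
     * @param centroids one centroid per cluster, same length as features
     * @param location  coordinates of the instance in the low-dimensional space
     * @return features, then cluster id, then distance to that centroid, then location
     */
    static double[] augment(double[] features, double[][] centroids, double[] location) {
        int best = 0;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centroids.length; c++) {
            double sum = 0.0;
            for (int j = 0; j < features.length; j++) {
                double d = features[j] - centroids[c][j];
                sum += d * d;
            }
            double distance = Math.sqrt(sum); // Euclidean distance to centroid c
            if (distance < bestDistance) {
                bestDistance = distance;
                best = c;
            }
        }
        double[] out = new double[features.length + 2 + location.length];
        System.arraycopy(features, 0, out, 0, features.length);
        out[features.length] = best;             // cluster identifier
        out[features.length + 1] = bestDistance; // cluster metric: distance from centroid
        System.arraycopy(location, 0, out, features.length + 2, location.length);
        return out;
    }
}
```

Keeping the original features alongside the appended values lets the GUI search on source attributes while plotting only the projected coordinates.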


3.5.2.2 Multi Dimensional Scaling (MDS) Filter
The task of this component is to take the n-dimensional feature vector from the input data and
scale it to a projection suitable for presentation, i.e. either a two- or three-dimensional vector.
The filter preserves the distance between the points in the original space as closely as possible by
minimizing an error function. The MDS filter is completely independent of the clustering, and
could be applied to the data on its own. This filter will perform an analysis such as Principal
Components Analysis (PCA) on the data (Bishop, C. 1995, Appendix E). It may be that we can make
use of the existing cluster information in a clever way as another alternative – in which case this
filter could stay in the data path as a no-op.
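
One common choice of error function for multidimensional scaling is raw stress: the sum, over all pairs of instances, of the squared difference between the distance in the original space and the distance in the projected space. The sketch below computes that quantity so that candidate projections can be compared. Raw stress and the class name are assumptions made for illustration; the proposal does not fix the error function.

```java
/** Illustrative MDS error measure (raw stress); the choice of error function is an assumption. */
final class StressSketch {

    /** Euclidean distance between two equal-length vectors. */
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /**
     * @param original  instances in the n-dimensional feature space
     * @param projected the same instances, in the same order, in two or three dimensions
     * @return sum over all pairs of (original distance - projected distance) squared
     */
    static double rawStress(double[][] original, double[][] projected) {
        double stress = 0.0;
        for (int i = 0; i < original.length; i++) {
            for (int j = i + 1; j < original.length; j++) {
                double diff = distance(original[i], original[j])
                        - distance(projected[i], projected[j]);
                stress += diff * diff;
            }
        }
        return stress;
    }
}
```

A stress of zero means every pairwise distance is preserved exactly. The pairwise loop is quadratic in the number of instances, which is manageable for the one to ten thousand page vectors anticipated here.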


3.5.2.3 New Cluster Strategies
The Weka framework allows for the easy introduction of new clustering strategies. More general
classification techniques could make use of the generic classifier container as well. The Weka
library offers several clustering choices out-of-the-box: Cobweb, Expectation Maximization,
Farthest-first, and K-means (Witten, I. & Frank, E. 2000, pp. 210-227). This is a good start but
by no means exhaustive. The application does not strictly require additional clustering strategies
to function; as such, the new strategies are candidates for scope reduction.


3.5.3 Visualization/Interaction
The visualization/interaction component is the user interface to the clustered data. This
component allows the user to view the data. It allows the user to transform or refine the clusters
through a limited set of interactions.


3.5.3.1 Java GUI
The Graphical User Interface presents the user with a two- or three-dimensional projection of the
source data, using color to represent the class membership established by clustering. Intuitively,
the intensity of the color can represent the distance from the centroid or “fuzzy” class
memberships. Additionally, the GUI presents the user with a set of controls that allow
non-destructive and destructive data transformations. Non-destructive operations include standard
graphical transformations such as rotation, pan, and zoom. A search capability allows a user to locate
specific instances or groups of instances in the scatter plot. Such a capability may help the user
judge the subjective validity of the results. A proposed destructive operation on the data allows the
user to act as a critic by demonstrating what a more appropriate result might be by “forcing” their
own bias into the model. The system will re-cluster and re-present the modified instances. If
re-clustering runs interactively, it would likely operate on a random sample of the data.
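
The color encoding could be sketched as follows, using the HSB color model from `java.awt.Color`: the hue identifies the cluster and the saturation fades with distance from the centroid. The even spacing of hues, the lower saturation bound of 0.2, and the class and method names are illustrative assumptions rather than design decisions.

```java
import java.awt.Color;

/** Illustrative mapping from cluster membership and centroid distance to a color. */
final class ClusterColorSketch {

    /** Hue in [0,1): clusters are spread evenly around the color wheel. */
    static float hue(int clusterId, int clusterCount) {
        return (float) clusterId / clusterCount;
    }

    /** Saturation in [0.2,1]: full at the centroid, fading as distance grows. */
    static float saturation(double distance, double maxDistance) {
        if (maxDistance <= 0.0) return 1.0f;
        double ratio = Math.min(1.0, Math.max(0.0, distance / maxDistance));
        return (float) (1.0 - 0.8 * ratio);
    }

    /** Combines hue and saturation at full brightness. */
    static Color colorFor(int clusterId, int clusterCount, double distance, double maxDistance) {
        return Color.getHSBColor(hue(clusterId, clusterCount),
                saturation(distance, maxDistance), 1.0f);
    }
}
```

With many clusters, evenly spaced hues become hard to tell apart, so a grayscale toggle or a per-cluster highlight may be needed in addition.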


3.5.3.2 Projection Window (GUI Component)
The projection window is a Java component that displays a colored scatterplot of two-
dimensional or three-dimensional data. It should handle non-destructive transformations like
rotation, pan, zoom, scale, color/grayscale toggling, selection, and indicating “interesting”
instances.


3.5.3.3 Control Panel (GUI Component)
The control panel allows a user to interact with the GUI. It will provide a menu of custom
controls fulfilling all the tasks defined for the interface.


4 Work Plan
At a high level, there are two phases to the project: building the tools, and using the tools to
explore the data.

The approach to the initial phase is to work backwards through the data path, and establish
baseline functionality. The first component completed is the last in the data path, the Java GUI
front-end. Pseudo data will functionally test the GUI. The clustering engine is scheduled next.
Finally, the MDS module, which in effect bridges the two initial components, will be completed.
The exception to the rule is the projection window. The work on the projection window will
happen in two parts. The initial round of work will produce a basic visualization window using
test data; a second round of development will extend the functionality for application-specific
features.

The next phase involves experimentation. Once the first component in the data path brings in the
data, the real project is underway. In addition to trials using various combinations of features,
implementation of custom clustering strategies and incremental GUI changes happen during this
phase.

Ideally, the tools would be complete by Christmas, and experimentation would start in January. The
schedule below presents a slightly less aggressive view of the timeline. Code complete happens
on 2/1/2005. Pulling in this date to 1/1/2005 is a stretch goal for the development cycle.


4.1 Assumptions, Risks and Alternatives
The original code written for the project will be in Java. The cluster engine will use the Weka
open source data-mining framework (Weka 3 - Data Mining with Open Source Machine Learning
Software in Java. 2004). The GUI will use Swing components (Geary, D. 1999). It might use the
standard Java 3D extensions (Java 3D API. 2004). If it does use the AWT-based 3D API, the
GUI will use heavyweight AWT components instead of their Swing counterparts (Geary, D.
1997). Other small scripts will extract and format data for import; these may be in SQL, Perl, or
some other language. CVS versioning software will facilitate milestone releases. The eventual
application will run client-side. It might be packaged as a “Java Web Start” application.

Risks and Alternatives:

    •   Unachievable Schedule – the schedule as indicated below is aggressive.

            o   Alternative: Use third party components in the GUI, especially for prototyping

            o   Alternative: Use existing clustering software only

    •   Personal Schedule Conflict – we’re expecting our 2nd child 4/15/2005

            o   Alternative: Enter into the program later. Re-negotiate graduation date.

            o   Alternative: Build in extension.

    •   No signal in target data – the experiment could fail.

            o   Alternative: Establish signal before undertaking the project using sample data

            o   Alternative: Establish validity of negative outcome; success of tool



4.2 Preliminary Schedule
Figure 3 shows a high-level view of the schedule.

    Milestone                              Date
    Start                                  10/4/2004
    GUI Complete                           11/1/2004
    Clustering Dummy Data                  11/22/2004
    Scaling                                12/12/2004
    Break                                  12/20/2004 - 1/3/2005
    Full integ, Built-in Clustering        1/17/2005
    Code Complete                          1/31/2005
    Exploration Ends                       2/28/2005
    Work ends                              3/31/2005

                                             Figure 3 High-level View of Schedule



Table 2 shows a detailed view of the proposed schedule:

 [Table 2: weekly schedule grid, one row per week from 4-Oct through 28-Mar.
 Columns: Week, Java GUI, Projection, Controller, Clustering Engine, Scaling/Projection,
 Extraction, New Clustering Algorithms.
 Cell legend: Design, Dev, Unit test, Integ Test, Explore.]

                                  Table 2 Detailed Schedule View




5 Glossary

Centroid                      A pseudo exemplar serving as the statistical center of a given class.

Clustering                    Clustering algorithms find groups of items that are similar. For
                              example, clustering could be used by an insurance company to
                              group customers according to income, age, types of policies
                              purchased and prior claims experience. It divides a data set so that
                              records with similar content are in the same group, and groups are
                              as different as possible from each other. Since the categories are
                              unspecified, this is sometimes referred to as unsupervised learning.
                              (Two Crows: Data Mining Glossary. 2001).

Unsupervised Learning         The classification of unlabeled data, as distinct from supervised
                              learning, in which each training example carries a known class label.

Data Mining                   The process of automatically extracting valid, useful, previously
                              unknown, and ultimately comprehensible information from large
                              databases and using it to make crucial business decisions.
                              “Torturing the data until they confess” (Hsu, W. 2001)

Weka                          An open source Java project for machine learning and data mining
                              found at: http://www.cs.waikato.ac.nz/~ml/weka/

Webserver                     A software application for serving content to browsers on the World
                              Wide Web.
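
The Centroid and Clustering entries above can be illustrated with a short Java sketch. It assumes k-means, which is only one of many clustering algorithms and is not necessarily the one this project will use. The class name KMeansSketch, its method names, and the two-feature visitor data are hypothetical and chosen for illustration only. The sketch performs a single iteration: each point is assigned to its nearest centroid, and each centroid is then recomputed as the mean of its assigned points.

```java
import java.util.Arrays;

/** Illustrative single k-means iteration; all names here are hypothetical. */
final class KMeansSketch {

    /** Squared Euclidean distance between two vectors of equal length. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Returns, for each point, the index of its nearest centroid. */
    static int[] assign(double[][] points, double[][] centroids) {
        int[] labels = new int[points.length];
        for (int p = 0; p < points.length; p++) {
            int best = 0;
            for (int c = 1; c < centroids.length; c++) {
                if (squaredDistance(points[p], centroids[c])
                        < squaredDistance(points[p], centroids[best])) {
                    best = c;
                }
            }
            labels[p] = best;
        }
        return labels;
    }

    /** Recomputes each centroid as the mean of the points assigned to it. */
    static double[][] recompute(double[][] points, int[] labels, double[][] previous) {
        int k = previous.length;
        int dims = previous[0].length;
        double[][] result = new double[k][dims];
        int[] counts = new int[k];
        for (int p = 0; p < points.length; p++) {
            counts[labels[p]]++;
            for (int d = 0; d < dims; d++) {
                result[labels[p]][d] += points[p][d];
            }
        }
        for (int c = 0; c < k; c++) {
            for (int d = 0; d < dims; d++) {
                // An empty cluster keeps its previous centroid.
                result[c][d] = counts[c] == 0 ? previous[c][d] : result[c][d] / counts[c];
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical per-visitor features: {pages per visit, minutes per visit}.
        double[][] points = {{1, 1}, {2, 1}, {9, 8}, {10, 10}};
        double[][] centroids = {{0, 0}, {10, 10}};
        int[] labels = assign(points, centroids);
        centroids = recompute(points, labels, centroids);
        System.out.println(Arrays.toString(labels));
        System.out.println(Arrays.deepToString(centroids));
    }
}
```

In practice the two steps would be repeated until the assignments stop changing. A library such as Weka, listed above, provides tested clustering implementations and would likely be preferable to hand-written code.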




6 References

6.1 Works Cited
The following is a list of references cited in the document.

Bishop, C. (1995). Neural Networks for Pattern Recognition. New York: Oxford University Press.

Bucklin, R., & Sismeiro, C. (2003). A Model of Website Browsing Behavior Estimated on
   Clickstream Data. Journal of Marketing Research, XL, 249-267. Retrieved August 15, 2004,
   from http://www.anderson.ucla.edu/faculty/randy.bucklin/papers/bucklinandsismeiro2003.pdf


Duda, R., Hart, P., & Stork, D. (2001). Pattern Classification. New York: John Wiley & Sons.

Geary, D. (1997). Graphic Java 1.1: Mastering the AWT. New York: Prentice Hall.

Geary, D. (1999). Graphic Java Volume II: Swing. New York: Prentice Hall.

Hsu, W. (2001). Knowledge Discovery in Databases and Data Mining. Retrieved October 17,
   2004, from
   http://www.kddresearch.org/Courses/Fall-2003/CIS732/Lectures/Lecture-28-20011204.pdf

Java 3D API. Retrieved October 17, 2004, from http://java.sun.com/products/java-media/3D/

Two Crows: Data Mining Glossary. (2001). Retrieved October 17, 2004, from
   http://www.twocrows.com/glossary.htm#anchor311516

UCI Machine Learning Repository Content Summary. Retrieved October 16, 2004, from
   http://www.ics.uci.edu/~mlearn/MLSummary.html

Weka 3 - Data Mining with Open Source Machine Learning Software in Java. (2004). Retrieved
   October 17, 2004, from http://www.cs.waikato.ac.nz/ml/weka/

Witten, I., & Frank, E. (2000). Data Mining. San Diego: Academic Press.


6.2 Works Consulted
The following is a list of works consulted while researching the topic.

Ansari, A., & Mela, C. (2003). E-Customization. [Electronic Version] Journal of Marketing
   Research, XL, 131-145. Retrieved August 10, 2004, from
   http://faculty.fuqua.duke.edu/~mela/bio/Ansari_Mela_2003.pdf

Moe, W., & Fader, P. (2002). Capturing Evolving Visit Behavior in Clickstream Data. [Electronic
   Version] Retrieved August 10, 2004, from
   http://www-marketing.wharton.upenn.edu/ideas/pdf/00-003.pdf

Moe, W., & Fader, P. (2003). Dynamic Purchase Behavior at e-Commerce Sites. [Electronic
   Version] Retrieved August 10, 2004, from
   http://www-marketing.wharton.upenn.edu/ideas/pdf/Fader/Moe-Fader%20conversion%200303.pdf

Montgomery, A., Li, S., Srinivasan, K., & Liechty, J. (2004). Modeling Online Browsing and Path
   Analysis Using Clickstream Data. [Electronic Version] Retrieved August 10, 2004, from
   http://www.andrew.cmu.edu/user/alm3/papers/purchase%20conversion.pdf


6.3 Works To Be Consulted
The following is a list of works marked for future review.

Jain, A., Murty, M., & Flynn, P. (1999). Data Clustering: A Review. ACM Computing Surveys,
    31(3). Retrieved August 15, 2004, from http://portal.acm.org/citation.cfm?id=331499.331504

Leouski, A., & Swan, R. (1997). Interactive Cluster Visualization for Information Retrieval.
   Retrieved August 10, 2004, from
   http://citeseer.ist.psu.edu/rd/41003322%2C82112%2C1%2C0.25%2CDownload/http%3AqSqqSqciir.cs.umass.eduqSqinfoqSqpsfilesqSqirpubsqSqir-116.ps.gz

Procopiuc, C., Jones, M., Agarwal, P., & Murali, T. (2002). A Monte Carlo Algorithm for Fast
   Projective Clustering. [Electronic Version] Presented at ACM SIGMOD 2002. Retrieved
   August 10, 2004, from http://www.research.att.com/resources/papers/Clustering.pdf




  • 3. information for use in a search index, cache pages for a proxy server, or artificially increase popularity for a particular product featured on a site by repeatedly requesting information. The noise is not easily separable from interesting events. This is partly because it is easy for a robot to do everything a real person would do, and partly because a person who configures his or her browser to interact as minimally as possible with the webserver may appear to be a robot. Moreover, a proxy server will appear as a single user, but in reality may convey requests for thousands of users. Typically, a time series known as a session stores the sequence of events generated by a user during a site visit. Analyzing time series data can help separate robot generated events from real traffic. However, sessions are censored in the sense that there is no event signifying completion. In particular, because the start of an event marks the end of the previous event in the series, the final event has an unknown duration. These factors make it difficult to model user-website interaction using raw clickstream data. Recent efforts to apply market research techniques to clickstream data have generally used a regression model to expose some sort of interesting user behavior pattern. Typically, websites have extensive reporting at an aggregate level, but little real insight into user segments or the differences between these segments. A recurring theme in recent research is the need to model behavior in a disaggregate way to account for and expose behavior away from the mean. In particular, Bucklin and Sismeiro (2003) suggest that accounting for user heterogeneity is of critical importance, and that using aggregate metrics can potentially lead to the wrong conclusions. 
3.2 Approach We propose inverting the general approach to the user behavior problem: rather than build vectors of user or visit behavior, we will build vectors of pages, with features derived from user or visit behavior. Metric design will account for user heterogeneity by incorporating aggregate metrics from user dimensions as features. For example, the duration of time spent on the page might vary depending on the time of day or the position of the page in the session. Representing duration at a disaggregate level ensures the preservation of variance that allows users, and so pages, to be successfully partitioned. The utility of an inverted approach is two-fold: first, we express results in terms of website entities, which are under a site manager’s control. A low repeat-visit rate, while important information for a site manager to know, suggests no direct action. On the other hand, a site manager who learns that certain categories of pages are less likely to generate repeat visits has a clear area on which to focus efforts to improve the site. Second, using the data produced by our learning exercise as extra input for an existing user based model may improve its predictive power. By modeling pages using unsupervised learning first, we remove the bias of categorical features established by the top-down human design of the site. When looking at behavior on a website, it is important to distinguish the hierarchical site structure from the behavior on it as much as possible. Remodeling the business-driven categorical entity as a behavioral class structure can help create better user models. A stretch goal for the project is to use the page classification in a user behavior model to demonstrate the value of clustering as a way of segmenting data behaviorally and supporting heterogeneity. The software delivered by this project will cluster pages on the website according to their natural order in the data. Clustering groups similar pages together. 
For example, pages visited on the weekend by a young audience might fall into one cluster, and pages visited at the start and end of the workday by an older group might fall into a second cluster. The pages within each group or cluster are more similar to each other than to pages in other groups. Thus, the ordering emerges from the data itself, rather than from an external agent. In this sense, the order is “natural.” This 3
  • 4. approach is termed “unsupervised learning” because there is no known target class for the input data; the model is fit to the features of the input data. The clustering algorithm assigns each page to a class based on features derived from user interaction on that page. A user interface (GUI) will visualize the clusters. We hope that the framework can make the often opaque results of unsupervised learning subjectively meaningful for the site manager, that is, someone who understands the problem domain well, but not the specifics of the machine learning process. Unless the results have subjective meaning to the user of the software, the results will not be useful. The software framework should be generalizable. Although the data set for our investigation is specific, the application should perform reasonably well on other data sets. Application testing includes scenarios using some of the common public domain machine learning data sets, such as the iris data set (UCI Machine Learning Repository Content Summary). In order to present data to the user, the GUI will project high-dimensional clustered page vectors in two or three dimensions. There are established methods for achieving such a projection, such as using the first two or three eigenvalues. Generally, the idea is to eliminate or merge features in a way that minimizes the introduction of error into the system as information is lost. The user interface will allow the user to search among the clusters for items of interest. The user will act as a critic by using the search function to establish subjective validity of a given set of clusters, and by suggesting (weighting) a direction for more useful results. While this technique has some precedent in machine learning literature (Duda, R, Hart, P. & Stork, D. 2001, p. 565), we are unaware of specific applications that use search as a tool for cluster exploration. 
Although one readily finds examples of Java data visualization programs on the internet, we found none that offered the feedback mechanism proposed here. An application notable for its approach to dimensionality reduction is the two-dimensional cluster-visualization program produced by IBM’s Alphaworks program, which can be found at http://www.alphaworks.ibm.com/formula/CViz The Alphaworks program places cluster exemplars at the origin and extents of the x- and y- axes, and plots instances based on similarity. The program translates similarity into Euclidean distance on the plane. The x- and y-axes have no units. The exemplars at the origin and extents of the axes triangulate the placement of clustered items in the two-dimensional space: the software places items on the plane based on similarity to the three exemplars. This method has a few nice properties: it is fast, it does not require a lot of extra computation, and it is visually meaningful and intuitive. Exploring this method in three dimensions would be an interesting exercise. It might provide a parsimonious way to scale the cluster space to a low dimensional representation. A search for similar or related material uncovered no other papers on the topic of clustering web pages for data visualization. 3.3 Data Description The clustering algorithm will use data from a well-known shopping services, advice, and news website. The site records approximately 70 million page events each day. Each time a user loads a page, a tracking image is loaded, and the resulting log line in the server log represents a single page event. Each page event has clickstream attributes from the webserver logline and attributes derived from the site delivery application and site meta-data. Additionally, links into and out of the site are tracked using an HTTP redirect. These redirects share the same attributes as the page events. 4
  • 5. The following table represents a sample of data available at the atomic level from the database: Field Name Description Unique identified for the session in which the page event occurred. (A SESSION_ID session is continuous activity with gaps of no more than 30 minutes. EVENT_SEQ_NUM The sequence number of the event w/in the session REFERRING_HOST If the data is from an external site, the hostname of the external site Foreign key to third party demographic data based on IP address. NETWORK_IP Provides Country, US State, DMA, Line speed. IP_ADDRESS Client IP Address USER_AGENT The user agent of the browser performing the page request EDITION The “branding” of the page. PAGE_TYPE Identifies the template used to serve the page by the content application PAGE_DURATION Amount of time spent on the page TIME_SINCE_SESS_START Time elapsed since the first event of the session IS_REG_USER Whether the client was a registered user IS_NEW_USER Whether the client has been to the site before (cookie based) IS_COOKIED_USER Whether the client allows cookies The sequence number of the page w/in the session (in contrast to the PAGE_SEQ_NUM event_seq_num, which included redirects) IS_LAST_PAGE Whether the event was the last page TIMESTAMP The date and time of the page request ANONYMOUS_ID ID based on website cookie SITE_ID The site number of the event (40 total sites) ONTOLOGY_NODE_ID The location of the page in the site navigational hierarchy IS_IAB_ROBOT Whether the user agent is a known robot IS_BEHAVIORAL_ROBOT Whether the user agent behaves like a robot The search phrase that the user types, if any. (Includes third party sites SEARCH_PHRASE like Google.) REGISTRATION_ID The id of the registered user, if any. Table 1 Sample Data Fields These fields are the raw material that will form the aggregate page vectors. Four entities uniquely identify a page on the website: site, page type, ontology, and asset. 
“Site” is a business dimension that groups content together at a high-level. The dataset contains tens of sites. “Page type” is an application dimension identifying the template used to render the content. The data contains thousands of page types. An “Ontology” node is a navigational dimension describing the area on the site where the page lives – for example, a “door”, or a “story” page. The data contains thousands of ontology nodes. “Asset” refers to a particular piece of content or a product featured on a page. There are tens of thousands of assets active each day, and millions historically. We have intentionally left Asset out of the page vector key, as this would produce far too many instances to be useful for clustering. We expect between one and ten thousand vectors for clustering, depending on the choice of sites. The page vector will have the following structure: SITE_ID, PAGE_TYPE, ONTOLOGY_NODE, derived attributes 1..n. The derived attributes will be behavioral in nature, and computed from activity over some period: 30 or 60 days, for example. Initially, we are considering the following attributes for each page: 5
  • 6. Count total page views • Count 1 page sessions (this page was the only page) • Count 2-5 page sessions (this page occurring in) • Count 5-10 page sessions • Count 10+ page sessions • Count registered user visits • Count anonymous user visits • Repeat visitor rate • Average hits/day • Average hits/weekday • Average hits/weekend • Average hits by hour of day, flattened • Count session starts • Count session stops • Count leads (occurring in sessions with this page in it – leads are redirects to a partner site.) • Total time spent on the page • Average duration • Average duration, weekday • Average duration, weekend • Average duration, by hours 1-24, by Time Zone • Average “place in session” – where this event occurs, as a fraction of all events occurring • Anonymous and Registered user visits, for each world country (flattened) • Anonymous and Registered user visits, for each US State This is a starting point. As mentioned, asset is not part of the page vector, but we could preserve attributes of the asset to qualify the metrics. For example, rather than using the product entity 6
  • 7. itself to identify a page, we could use the product category. We expect finding an identifiable set of attributes for the page vector will require some exploration. 3.4 Data Flow At a logical level, there are two major application components and two sources of input into the system. The inputs are the data and the user feedback, and the components are the machine learning and the data visualization components. Figure 1 illustrates the flow of external data into and through the system: Machine Learning External Assign Instances Import Data Data Clusters Scale for Presentation Data Visualization Process User User Present Data Input Figure 1 Data Flow Diagram • External Data flows into the system as a Weka dataset, a flat set of vectors containing page information. • The machine-learning component applies the clustering algorithm to the vectors and thereby classifies each instance. • The machine-learning component projects the clusters in two or three dimensions for presentation in a user interface. • The data visualization component processes user feedback after the presentation. 7
  • 8. Depending on the feedback, the data visualization component re-presents the data, or re- classifies and then re-presents the data. 3.5 Architecture The diagram in Figure 2 shows three physical architectural components. The shaded elements represent elements that do not exist today; the unshaded elements represent third-party software or data sources. The following subsections describe each of the three components. Pre-Processing ClickStream Extraction Script DataStore ARFF file Unsupervised Learning Cluster Engine (Driver) Weka.core.Instances Weka.classifiers.Evaluation Existing Clusterer Weka.classifiers.Clusterers New Clusterer Multi-Dimensional Scaling Filter Java GUI Weka.core.Instances Visualization/Interaction 2D/3D Projection Window Control Panel Figure 2 System Diagram 3.5.1 Pre-processing The preprocessing step prepares data for use. The extraction script reads data from a database and writes it to the proprietary Weka data format. The Java application reads the formatted data from disk into memory. We will aggregate the atomic events in the database to the page level record described above. We will filter out robot traffic as much as possible, using already available flags in the database. The aggregation will discount the last event of the session for mean duration calculations. 8
  • 9. 3.5.1.1 Clickstream Data Store The data store exists today in the form of a large data warehouse for an anonymous web- publishing company. They have agreed to share data for the project on condition that we obfuscate user or commercially identifiable information. There will be some data transformation within the data store: a script will aggregate the individual events with the page as the key, forming the page vector described in the data description section. 3.5.1.2 Extraction Module The extraction script pulls data from the data store. This module is a placeholder for the process that creates a flat file in Weka format from the database. It may be a series of scripts, or it may be a set of actions undertaken to spool query results to disk manually and add a header. Any code developed will be handed in for inspection, but it should be understood that evidence of completion is the Weka data file rather than code that created it. Anyone seeking to recreate this experiment on his or her own would have to code this module by hand; the rest would flow from there. As such, the extraction module is formally outside the bounds of the project. 3.5.1.3 Attribute-Relation File Format (ARFF) file The Attribute-Relation File Format (ARFF) file constitutes the boundary of the application. A Java program using the Weka class libraries can easily read the file into an in-memory representation for machine learning by the various Weka modules. A website describing the Weka data file format in detail is found at http://www.cs.waikato.ac.nz/~ml/weka/arff.html 3.5.2 Unsupervised Learning The unsupervised learning component performs the actual clustering. It clusters instances read from the pre-processing component and refines or redistributes clusters based on feedback from the user through the GUI component. 
3.5.2.1 Cluster Engine The cluster engine is a driver that uses the Weka data-mining framework to read the ARFF file into memory and exercise a clustering algorithm on that data. After performing the unsupervised learning algorithm, it prepares data for presentation by creating low-dimensional projections of the instances. The program appends the location in the low-dimensional space to the existing attributes of the instance. At this point in the data path, the instance includes the original features, the cluster identifier and any related cluster metrics such as distance from the centroid, and the newly appended location in low-dimensional space. 3.5.2.2 Multi Dimensional Scaling (MDS) Filter The task of this component is to take the n-dimensional feature vector from the input data and scale it to a projection suitable for presentation, i.e. either a two- or three-dimensional vector. The filter preserves the distance between the points in the original space as closely as possible by minimizing an error function. The MDS filter is completely independent of the clustering, and could be applied itself to the data. This filter will do some sort of analysis like Principle 9
  • 10. Components Analysis (PCA) on the data (Bishop, C. 1995, Appendix E). It may be we can make use of the existing cluster information in a clever way as another alternative – in which case this could stay in the data path as a no-op. 3.5.2.3 New Cluster Strategies The Weka framework allows for the easy introduction of new clustering strategies. More general classification techniques could make use of the generic classifier container as well. The Weka library offers several clustering choices out-of-the-box: Cobweb, Expectation Maximization, Farthest-first, and K-means (Witten, I. & Frank, E. 2000, pp. 210-227). This is a good start but by no means exhaustive. The application does not strictly require additional clustering strategies to function; as such, the new strategies are candidates for scope reduction. 3.5.3 Visualization/Interaction The visualization/interaction component is the user interface to the clustered data. This component allows the user to view the data. It allows the user to transform or refine the clusters through a limited set of interactions. 3.5.3.1 Java GUI The Graphical User Interface presents the user with a two- or three-dimensional projection of the source data, using color to represent the class membership established by clustering. Intuitively, the intensity of the color can represent the distance from the centroid or “fuzzy” class memberships. Additionally, GUI presents the user with a set of controls that allow non- destructive and destructive data transformations. Non-destructive operations include standard graphical transformations such as rotation, pan, zoom. A search capability allows a user to locate specific instances or groups of instances in the scatter plot. Such a capability may further refine the subjectivity of the results. A proposed destructive operation of the data allows the user to act as a critic by demonstrating what a more appropriate result might be by “forcing” their own bias into the model. 
The system will re-cluster and re-present the modified instances. If happening interactively, this would likely happen on a random sample of the data. 3.5.3.2 Projection Window (GUI Component) The projection window is a Java component that displays a colored scatterplot of two- dimensional or three-dimensional data. It should handle non-destructive transformations like rotation, pan, zoom, scale, color/grayscale toggling, selection, and indicating “interesting” instances. 3.5.3.3 Control Panel (GUI Component) The control panel allows a user to interact with the GUI. It will provide a menu of custom controls fulfilling all the tasks defined for the interface. 4 Work Plan At a high level, there are two phases to the project: building the tools, and using the tools to explore the data. 10
  • 11. The approach to the initial phase is to work backwards through the data path, and establish baseline functionality. The first component completed is the last in the data path, the Java GUI front-end. Pseudo data will functionally test the GUI. The clustering engine is scheduled next. Finally the MDS module, which in effect bridges the two initial components, will be completed. The exception to the rule is the projection window. The work on the projection window will happen in two parts. The initial round of work will produce a basic visualization window using test data; a second round of development will extend the functionality for application-specific features. The next phase involves experimentation. Once the first component in the data path brings in the data, the real project is underway. In addition to trials using various combinations of features, implementation of custom clustering strategies and incremental GUI changes happen during this phase. Ideally, the tools would complete by Christmas, and experimentation would start in January. The schedule below presents a slightly less aggressive view of the timeline. Code complete happens on 2/1/2005. Pulling in this date to 1/1/2005 is a stretch goal for the development cycle. 4.1 Assumptions, Risks and Alternatives The original code written for the project will be in Java. The cluster engine will use the Weka open source data-mining framework (Weka 3 - Data Mining with Open Source Machine Learning Software in Java. 2004). The GUI will use Swing components (Geary, D. 1999). It might use the standard Java 3D extensions (Java 3D API. 2004). If it does use the AWT-based 3D API, the GUI will use heavyweight AWT components instead of their Swing counterparts (Geary, D. 1997). Other small scripts will extract and format data for import; these may be in SQL, Perl, or some other language. CVS versioning software will facilitate milestone releases. The eventual application will run client-side. 
It might be packaged as a “Java Web Start” application. Risks and Alternatives: • Unachievable Schedule – the schedule as indicated below is aggressive. o Alternative: Use third party components in the GUI, especially for prototyping o Alternative: Use existing clustering software only • Personal Schedule Conflict – we’re expecting our 2nd child 4/15/2005 o Alternative: Enter into the program later. Re-negotiate graduation date. o Alternative: Build in extension. • No signal in target data – the experiment could fail. o Alternative: Establish signal before undertaking the project using sample data o Alternative: Establish validity of negative outcome; success of tool 11
  • 12. 4.2 Preliminary Schedule Figure 3 shows a high-level view of the schedule. 1/17/2005 Full integ, Built-in Clustering 12/20/2004 - 1/3/2005 Break 1/31/2005 2/28/2005 11/22/2004 Code Complete Exploration Ends 10/4/2004 11/1/2004 Clustering 12/12/2004 3/31/2005 Start GUI Complete Dummy Data Scaling Work ends 11/1/2004 12/1/2004 1/1/2005 2/1/2005 3/1/2005 10/4/2004 3/31/2005 Figure 3 High-level View of Schedule Table 2 shows a detailed view of the proposed schedule: Design Dev Unit test Integ Test Explore Scaling/ New Projectio Controlle Clustering Projectio Extractio Clustering Week Java GUI n r Engine n n Algorithms 4-Oct 11-Oct 18-Oct 25-Oct 1-Nov 8-Nov 15-Nov 22-Nov 29-Nov 6-Dec 13-Dec 20-Dec 27-Dec 3-Jan 10-Jan 17-Jan 24-Jan 31-Jan 7-Feb 14-Feb 21-Feb 28-Feb 7-Mar 14-Mar 21-Mar 12
  • 13. 28-Mar Table 2 Detailed Schedule View 5 Glossary Centroid A pseudo exemplar serving as the statistical center of a given class. Clustering Clustering algorithms find groups of items that are similar. For example, clustering could be used by an insurance company to group customers according to income, age, types of policies purchased and prior claims experience. It divides a data set so that records with similar content are in the same group, and groups are as different as possible from each other. Since the categories are unspecified, this is sometimes referred to as unsupervised learning. (Two Crows: Data Mining Glossary. 2001). Unsupervised Learning As distinct from supervised learning, the classification of unlabeled data. Data Mining The process of automatically extracting valid, useful, previously unknown, and ultimately comprehensible information from large databases and using it to make crucial business decisions. “Torturing the data until they confess” (Hsu, W. 2001) Weka An open source Java project for machine learning and data mining found at: http://www.cs.waikato.ac.nz/~ml/weka/ Webserver A software application for serving content to browsers on the world wide web. 6 References 6.1 Works Cited The following is a list of references cited in the document. Bishop, C. (1995). Neural Networks for Pattern Recognition. New York: Oxford Press. Bucklin, R., & Sisemiro, C. (2003). A Model of Website Browsing Behavior Estimated on Clickstream Data. Journal of Marketing Research, XL, 249-267. Retrieved August 15, 2004, from http://www.anderson.ucla.edu/faculty/randy.bucklin/papers/bucklinandsismeiro2003.pdf 13
  • 14. Duda, R, Hart, P., & Stork, D. (2001) Pattern Classification. New York: John Wiley & Sons. Geary, D (1997) Graphic Java 1.1: Mastering the AWT. New York: Prentice Hall. Geary, D. (1999) Graphic Java Volume II: Swing. New York: Prentice Hall. Hsu, W. (2001). Knowledge Discovery in Databases and Data Mining. Retrieved October 17, 2004, from http://www.kddresearch.org/Courses/Fall-2003/CIS732/Lectures/Lecture-28-20011204.pdf Java 3D API. Retrieved October 17, 2004 from http://java.sun.com/products/java-media/3D/ Two Crows: Data Mining Glossary. (2001). Retrieved October 17, 2004, from http://www.twocrows.com/glossary.htm#anchor311516 UCI Machine Learning Repository Content Summary. Retrieved October 16, 2004, from http://www.ics.uci.edu/~mlearn/MLSummary.html Weka 3 - Data Mining with Open Source Machine Learning Software in Java. (2004) Retrieved October 17, 2004, from: http://www.cs.waikato.ac.nz/ml/weka/ Witten, I., Frank, E. (2000) Data Mining. San Diego: Academic Press. 6.2 Works Consulted The following is a list of works consulted while researching the topic. Anasari, A., & Mela, C. (2003). E-Customization. [Electronic Version] Journal of Marketing Research, XL, 131-145. Retrieved August 10, 2004, from http://faculty.fuqua.duke.edu/~mela/bio/Ansari_Mela_2003.pdf Moe, W. & Fader, P. (2002) Capturing Evolving Visit Behavior in Clickstream Data [Electronic Version] Retrieved August 10, 2004, from http://www- marketing.wharton.upenn.edu/ideas/pdf/00-003.pdf Moe, W. & Fader, P. 
(2003) Dynamic Purchase Behavior at e-Commerce Sites [Electronic Version] Retrieved August 10, 2004, from: http://www-marketing.wharton.upenn.edu/ideas/pdf/Fader/Moe-Fader%20conversion %200303.pdf Montgomery, A., Li, S., Srinivasan, K., & Liechty, J (2004) Modeling Online Browsing and Path Analysis Using Clickstream Data [Electronic Version] Retrieved August 10, 2004, from http://www.andrew.cmu.edu/user/alm3/papers/purchase%20conversion.pdf 6.3 Works To Be Consulted The following is a list of works marked for future review. Jain, A., Murty, M., & Flynn, P. (1999). Data Clustering: A Review. ACM Computing Surveys, 31(3). Retrieved August 15, 2004, from http://portal.acm.org/citation.cfm?id=331499.331504 14
  • 15. Leouski, A., & Swan, R. (1997). Interactive Cluster Visualization for Information Retrieval. Retrieved August 10, 2004, from http://citeseer.ist.psu.edu/rd/41003322%2C82112%2C1%2C0.25%2CDownload/http %3AqSqqSqciir.cs.umass.eduqSqinfoqSqpsfilesqSqirpubsqSqir-116.ps.gz Procopiuc, C., Jones, M., Agarwal, P., & Murali, T. (2002) A Monte Carlo Algorithm for Fast Projective Clustering. [Electronic Version] Presented at ACM SIGMOD 2002. Retrieved August 10, 2004, from http://www.research.att.com/resources/papers/Clustering.pdf 15