1. Getting Started with Text Mining:
STM™, CART® and TreeNet®
Dan Steinberg
Mykhaylo Golovnya
Ilya Polosukhin
May, 2011
2. Text Mining and Data Mining
Text mining is an important and fascinating area of modern analytics
On the one hand text mining can be thought of as just another application
area for powerful learning machines
On the other hand, text mining is a distinct field with its own dedicated
concepts, vocabulary, tools, and techniques
In this tutorial we aim to illustrate some important analytical methods and
strategies from both of these perspectives:
introducing tools specific to the analysis of text, and
deploying general machine learning technology
The Salford Text Mining utility (STM) is a powerful text processing system
that prepares data for advanced machine learning analytics
Our machine learning tools are the Salford Systems flagship CART® decision
tree and stochastic gradient boosting TreeNet®
Evaluation copies of the proprietary technology in CART and TreeNet as
well as the STM are available from http://www.salford-systems.com
Salford Systems © Copyright 2011 2
3. For Readers of this Tutorial
To follow along with this tutorial we recommend that you have the analytical tools we use
installed on your computer. Everything you need may already be on a CD
containing this tutorial and the analytical software
Create an empty folder named “stmtutor”; this is the root folder where all of the
work files related to this tutorial will reside
You may also use the following link to download Salford Systems Predictive Modeler
(SPM)
http://www.salford-systems.com/dist/SPM/SPM680_Mulitple_Installs_2011_06_07.zip
After downloading the package, unzip its contents into “stmtutor” which will create a
new folder named “SPM680_Mulitple_Installs_2011_06_07”. Follow installation steps
described on the next slide.
For the original DMC2006 competition website visit
http://www.data-mining-cup.de/en/review/dmc-2006/
We recommend that you visit the above site for information only; data and tools for
preparing that data are available at the URL below
For the STM package, prepared data files, and other utilities developed for this tutorial
please visit
http://www.salford-systems.com/dist/STM.zip
After downloading the archive, unzip its contents into “stmtutor”
4. Important! Installing the SPM Software
The Salford Systems software you've just downloaded needs to be both
installed and licensed. No-cost license codes for a 30-day period are
available on request to visitors of this tutorial*
Double click on the “Install_a_Transform_SPM.exe” file located in the
“SPM680_Mulitple_Installs_2011_06_07” folder (see the previous slide) to
install the specific version of SPM used in this tutorial
Following the above procedure will ensure that all of the currently installed
versions of SPM, if any, will remain intact!
Follow simple installation steps on your screen
* Salford Systems reserves the right to decline to offer a no-cost license at its sole discretion
5. Important! Licensing the SPM Software
When you launch the Salford Systems Predictive Modeler (SPM) you will be
greeted with a License dialog containing information needed to secure a
license via email
Please send the necessary information to Salford Systems to secure your license
by entering the “Unlock Code” which will be e-mailed back to you
The software will operate for 3 days without any licensing; however, you can
secure a 30-day license on request
6. Installing the Salford Text Miner (STM)
In addition to the Salford Predictive Modeler (SPM) you will also work with the
Salford Text Miner (STM) software
No installation is needed and you should already have the “stm.exe”
executable in the “stmtutor\STM\bin” folder as the result of unzipping the
“STM.zip” package earlier
STM builds upon the Python 2.6 distribution and the NLTK (Natural Language
Tool Kit) but makes text data processing for analytics very easy to conduct
and manage
You do not need to add any other support software to use STM
Expect to see several folders and a large number of files located under the
“stmtutor\STM” folder. It is important to leave these files in the location to
which you have installed them.
Please do not MOVE or alter any of the installed files other than those explicitly
listed as user-modifiable!
“stm.exe” will expire in the middle of 2012; contact Salford Systems to get an
updated version beyond that
7. The Example Project
The best examples are drawn from real world data sets and we were
fortunate to locate data publicly released by eBay.
Good teaching examples also need to be simple.
Unfortunately, real world text mining could easily involve hundreds of thousands if
not millions of features characterizing billions of records. Professionals need to be
able to tackle such problems but to learn we need to start with simpler situations.
Fortunately, there are many applications in which text is important but the
dimensions of the data set are radically smaller, either because the data available
is limited or because a decision has been made to work with a reduced problem.
We use our simpler example to illustrate many useful ideas for beginning text
miners while pointing the way to working on larger problems.
8. The DMC2006 Text Mining Challenge
In 2006 the DMC data mining competition (restricted to student competitors
only) introduced a predictive modeling problem for which much of the
predictive information was in the form of unstructured text.
The datasets for the DMC 2006 data mining competition can be downloaded
from http://www.data-mining-cup.de/en/review/dmc-2006/
For your convenience we have re-packaged this data and made it somewhat
easier to work with. This re-packaged data is included in the STM package
described near the beginning of this tutorial.
The data summarizes 16,000 iPod auctions held at eBay from May 2005
through May 2006 in Germany
Each auction item is represented by a text description written by the seller (in
German) as well as a number of flags and features available to the seller at
the time of the auction
Auction items were grouped into 15 mutually exclusive categories based on
distinct iPod features: storage size, type (regular, mini, nano), and color
The competition goal was to predict whether the closing price would be above
or below the category average
9. Comments on the Challenge
One might think that a challenge with text in German might not be of general
interest outside of Germany
However, working with a language essentially unfamiliar to any member of
the analysis team helps to illustrate one important point
Text mining via tools that have no “understanding” of the language can be
strikingly effective
We have no doubt that dedicated tools which embed knowledge of the
language being analyzed can yield predictive benefits
We also believe we could have gained further valuable insight into the data if any
of the authors spoke German! But our performance without this knowledge is still
impressive.
In contexts where simple methods can yield more than satisfactory results, or
in contexts where the same methods must be applied uniformly across
multiple languages, the methods described in this tutorial will be an excellent
guide.
10. Configuring Work Location in SPM
The original datasets from the DMC 2006 challenge reside in the
“stmtutor\STM\dmc2006” folder
To facilitate further modeling steps, we will configure SPM to use this location
as the default location:
Start SPM
Go to the Edit – Options menu
Switch to the Directories tab
Enter the “stmtutor\STM\dmc2006” folder location in all text entry boxes
except the last one
Press the [Save as Defaults] button
so that the configuration is restored
the next time you start SPM
11. Configuring TreeNet Engine
Now switch to the TreeNet tab
Configure the Plot Creation section as shown on the screenshot
Press the [Save as Defaults] button
Press the [OK] button to exit
12. Steps in the Analysis: Data Overview
1. Describe the data: (Data Dictionary and Dimensions of Data)
a. What is the unit of observation? Each record of data is describing what?
b. What is the dependent or target variable?
c. What other variables (data base fields) are available?
d. How many records are available?
2. Statistical Summary
a. Basic summary including means, quantiles, frequency tables
b. Dimensions of categorical predictors
c. Number of distinct values of continuous variables
3. Outlier and Anomaly Assessment
a. Detection of gross data errors such as extreme values
b. Assessment of usability of levels of categorical predictors (rare levels)
13. Data Fundamentals
The original dataset is called “dmc2006.csv” and resides in the
“stmtutor\STM\dmc2006” folder
16,000 records divided into two equal sized partitions
Part 1: Complete data including target, available for training during the competition
Part 2: Data to be scored; during the competition the target was not available
25 database fields, two of which were unstructured text written by the seller
Each line of data describes an auction of an iPod including the final winning
bid price
An eBay seller must construct a headline and a description of the product
being sold. Sellers can also pay for selling assistance
E.g. Seller can pay to list the item title in BOLD
14. The Data: Available Fields
The following variables describe general features of each auction event
Variable Description
AUCT_ID ID number of auction
ITEM_LEAF_CATEGORY_NAME products category
LISTING_START_DATE start date of auction
LISTING_END_DATE end date of auction
LISTING_DURTN_DAYS duration of auction
LISTING_TYPE_CODE type of auction (normal auction, multi auction, etc)
QTY_AVAILABLE_PER_LISTING amount of offered items for multi auction
FEEDBACK_SCORE_AT_LISTIN feedback-rating of the seller of this auction listing
START_PRICE start price in EUR
BUY_IT_NOW_PRICE buy it now price in EUR
BUY_IT_NOW_LISTING_FLAG option for buy it now on this auction listing
15. Available Data Fields
In addition, there are binary indicators of various “value added” features that
can be turned on for each auction
Variable Description
BOLD_FEE_FLAG option for bold font on this auction listing
FEATUERD_FEE_FLAG show this auction listing on top of homepage
CATEGORY_FEATURED_FEE_FLAG show this auction listing on top of category
GALLERY_FEE_FLAG auction listing with picture gallery
GALLERY_FEATURED_FEE_FLAG auction listing with gallery (in gallery view)
IPIX_FEATURED_FEE_FLAG auction listing with IPIX (additional xxl, picture
show, pack)
RESERVE_FEE_FLAG auction listing with reserve-price
HIGHLIGHT_FEE_FLAG auction listing with background color
SCHEDULE_FEE_FLAG auction listing, including the definition of the
starting time
BORDER_FEE_FLAG auction listing with frame
16. Target Variable
Finally, the target variable is defined based on the winning bid revenue
relative to the category average
Variable Description
GMS scored sales revenue in EUR
CATEGORY_AVG_GMS Average sales revenue for the product category
GMS_GREATER_AVG zero when the revenue is less than or equal to the
category average sales and one otherwise
The values were only disclosed on a randomly selected set of 8,000 auctions
which we use to train a model
4199 auctions with the revenue below the category average
3801 auctions with the revenue above the category average
During the competition the auction results for the remaining 8,000 auctions
were kept secret and used to score competitive entries
We will only use these records at the very end of this tutorial to validate
the performance of various models that will be built
17. Comments on Methodology
Predictive modeling and general analytics competitions are increasingly being
launched both by private companies and by professional organizations; they
provide public data sets and a wealth of illustrative examples using
different analytic techniques
When reviewing results from a competition, and especially when comparing
results generated by analysts running models after the competition, it is
important to keep in mind that there is an ocean of difference between being
a competitor during the actual competition and an after-the-fact commentator
Regardless of what is reported, the after-the-fact analyst does have access to
“what really happened”, and it is nearly impossible to simulate the competitive
environment once the results have been published
We all learn in both direct and indirect ways from many sources including the
outcomes of public competitions. This can affect anything that comes later in time.
In spite of this, we have tried to mimic the circumstances of the competitors
by presenting analyses based only on the original training data, and using
well-established guidelines we have been promoting for more than a decade to
arrive at a final model
We urge you to never take at face value an analyst's report on what would
have happened if they had hypothetically participated
18. First Round Modeling: Ignoring the TEXT Data
Even before doing any type of data preparation it is always valuable to run a
few preliminary CART models
CART automatically handles missing values and is immune to outliers
CART is flexible enough to adapt to any type of nonlinearity and interaction effects
among predictors. The analyst does not need to do any data preparation to assist
CART in this regard
CART performs well enough out of the box that we are guaranteed to learn
something of value without conducting any of the common data preparation
operations
The only requirement for useful results is that we exclude any possible
perfect or near perfect illegitimate predictors
Common examples of illegitimate predictors include repackaged versions of the
dependent variable, ID variables, and data drawn from the future relative to the
data to be predicted
We start with a quick model using 20 of the 25 available predictors. None of
these involve any of the text data we will focus on later.
19. Quick Modeling Round with CART
We start by building a quick CART model using original raw variables and all
8,000 complete auction records
Assuming that you already have SPM launched
Go to the
File – Open – Data File menu
Note that we have already
configured the default working
folder for SPM
Make sure that the Files of Type
is set to ASCII
Highlight the dmc2006.csv dataset
Press the [Open] button
20. Dataset Summary Window
The resulting window summarizes basic facts about the dataset
Note that even though the dataset has 16,000 records, only the top 8,000 will be
used for modeling, as was already pointed out
21. The View Data Window
Press the [View Data…] button to get a quick impression of the physical
contents of the dataset
Our goal is to eventually use the unstructured information contained in the
text fields right next to the auction ID
22. Requesting Basic Descriptive Stats
We next produce some basic stats for all available variables:
Go to the View – Data Info… menu
Set the Sort mode to File Order
Highlight the Include column
Check the Select box
Press the [OK] button
23. Data Information Window
All basic descriptive statistics for all requested variables are now summarized in one
place
Note that the target variable GMS_GREATER_AVG is not defined for one half of
the dataset (N Missing = 8,000); all those records will be automatically discarded during
model building
Press the [Full] button to see more details
24. Setting Up CART Model
We are now ready to set up a basic CART run:
Make the Classic Output window active
Go to the Model – Construct Model… menu (alternatively, you could press one of
the buttons located on the bar right below the menu bar)
In the resulting Model Setup window make sure that the Analysis Method is set
to CART
In the Model tab make sure that the Sort is set to File Order and the Tree Type is
set to Classification
Check GMS_GREATER_AVG as the Target
Check all of the remaining variables except AUCT_ID, LISTING_TITLE$,
LISTING_SUBTITLE$, GMS, and CATEGORY_AVG_GMS as predictors
You should see something similar to what is shown on the next slide
26. Model Setup Window: Testing Tab
Switch to the Testing tab and confirm that the 10-fold cross-validation is used
as the optimal model selection method
27. Model Setup Window: Advanced Tab
Switch to the Advanced tab and set the minimum required number of records
for the parent nodes and the child nodes at 15 and 5
These limits were chosen to avoid extremely small nodes in the resulting tree
28. Building CART Model
Press the [Start] button; a progress window will appear for a while, and then the Navigator
window containing model results will be displayed
Press the little button right above the [+][-] pair of buttons, along the left border of the Navigator
window; note that all trees within one standard error (SE) of the optimal tree are now marked in green
Use the arrow keys to select the 64-node tree from the tree sequence, which is the smallest 1SE tree
29. CART model observations
The selected CART model contains 64 terminal nodes; it is the smallest
model with relative error still within one standard error of the optimal
model (the model with the smallest relative error), indicated by the green bar
This approach to model selection is usually employed for easy comprehension
We might also want to require terminal nodes to contain more than the 6 record
minimum we observe in this out of the box tree
All 20 predictor variables play a role in the tree construction
but there is more to observe about this when we look at the variable importance
details
Area under the ROC curve is a respectable 0.748
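The 1-SE selection rule used above can be sketched in a few lines. The (terminal nodes, relative error, standard error) triples below are invented for illustration only; they are not the values from the actual run.

```python
# Sketch of the 1-SE rule: among all trees whose cross-validated relative
# error is within one standard error of the minimum, pick the smallest tree.
def smallest_1se_tree(sequence):
    # find the minimum relative error and its standard error
    best_error, best_se = min((err, se) for _, err, se in sequence)
    threshold = best_error + best_se
    # trees whose error is within one SE of the best
    eligible = [size for size, err, _ in sequence if err <= threshold]
    return min(eligible)

sequence = [  # (terminal nodes, relative error, standard error) -- illustrative
    (118, 0.640, 0.012),
    (64, 0.645, 0.012),
    (30, 0.660, 0.012),
]
chosen = smallest_1se_tree(sequence)
```

Here the 64-node tree is chosen: its error (0.645) is within 0.640 + 0.012, while the 30-node tree's is not.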
30. CART Model Performance
Press the [Summary Reports…] button in the Navigator, select the Prediction Success
tab, and press the [Test] button to display the cross-validated test performance of
68.66% classification accuracy
Now select the Variable Importance tab to review which variables entered into the model
Interestingly enough, none of the “added value” paid options are important; they exhibit
practically no direct influence on the sales revenue
A detailed look at the nodes might also be instructive for understanding the model
31. Experimenting with TreeNet
We almost always follow initial CART models with similar TreeNet models
We start with CART because some glaring errors such as perfect predictors
are more quickly found and obviously displayed in CART
A perfect predictor often yields a single split tree (two terminal nodes) for
classification trees
TreeNet models have strengths similar to CART regarding flexibility and
robustness, and have advantages and disadvantages relative to CART
TreeNet is an ensemble of small CART trees that have been linked together in
special ways. Thus TreeNet shares many desirable features of CART
TreeNet is superior to CART in the context of errors in the dependent variable (not
relevant in this data)
TreeNet yields much more complex models but generally offers substantially better
predictive accuracy. TreeNet may easily generate thousands of trees to arrive at
an optimal model
TreeNet yields more reliable variable importance rankings
32. A few words about TreeNet
TreeNet builds predictive models in stages. It starts with a deliberately
very small first-round tree (essentially a CART tree).
Then TreeNet calculates the prediction error made by this simple model and
builds a second tree to try to model that prediction error. The second tree
serves as a tool to update, refine, and improve the first-stage model.
A TreeNet model produces a “score” which is a simple sum of all the
predictions made by each tree in the model
Typically the TreeNet score becomes progressively more accurate as the
number of trees is increased up to an optimal number of trees
Rarely is the optimal number of trees just one! Occasionally, a handful of
trees is optimal. More typically, hundreds or thousands of trees are optimal.
TreeNet models are very useful for the analysis of data with large numbers of
predictors as the models are built up in layers each of which makes use of
just a few predictors
More detail on TreeNet can be found at http://www.salford-systems.com
33. Setting Up TN Model
Switch to the Classic Output window and go to the Model – Construct
Model… menu
Choose TreeNet as the Analysis Method
In the Model tab make sure that the Tree Type is set to Logistic Binary
34. Setting Up TN Parameters
Switch to the TreeNet tab and do the following:
Set the Learnrate to 0.05
Set the Number of trees to use: to 800 trees
Leave all of the remaining options at their default values
35. TN Results Window
Press the [Start] button to initiate the TN modeling run; the TreeNet Results
window will appear when the run completes
36. Checking TN Performance
Press the [Summary] button and switch to the Prediction Success tab
Press the [Test] button to view cross-validation results
Lower the Threshold: to 0.45 to roughly equalize classification accuracy in
both classes (this makes it easier to compare the TN performance with the
earlier reported CART performance)
37. The Performance Has Improved!
The overall classification accuracy goes up to about 71%
Press the [ROC] button to see that the area under ROC is now a solid 0.800
This comes at the cost of added model complexity – 796 trees each with about 6
terminal nodes
Variable importance remains similar to CART
38. Understanding the TreeNet Model
TreeNet produces partial dependency plots for every predictor that
appears in the model; the plots can be viewed by pressing the [Display
Plots…] button
Such plots are generally 2D illustrations of how the predictor in question
affects an outcome
For example, in the graph below the Y axis represents the probability that an iPod
will sell at an above category average price
We see that for a BUY_IT_NOW price between 200 and 300 the probability of
above average winning bid rises sharply with the BUY_IT_NOW_PRICE
For prices above 300 or below 200 the curve is essentially flat meaning that
changes in the predictor do not result in changes in the probable outcome
39. Understanding the Partial Dependency Plot (PD Plot)
The PD Plot is not a simple description of the data. If you plotted the raw data
as, say, the fraction of above-average winning bids against price intervals, you
might see a somewhat different curve
The PD Plot is extracted from the TreeNet model; it is
generated by examining TreeNet predictions (and not input data)
The PD Plot appears to relate two variables, but in fact other variables may
well play a role in the graph construction
Essentially the PD Plot shows the relationship between a predictor and the
target variable taking all other predictors into account
The important points to understand are that
the graph is extracted from the model and not directly from raw data
the graph provides an honest estimate of the typical effect of a predictor
the graph displays not absolute outcomes but typical expected changes from some
baseline as the predictor varies. The graph can be thought of as floating up or
down depending on the values of other predictors
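A common way to compute such a curve from a fitted model can be sketched as follows. The `model` and data rows here are invented stand-ins, not the TreeNet model: for each grid value we fix the predictor of interest at that value for every record, keep all other predictors as observed, and average the model's predictions.

```python
# Sketch of a partial dependence computation (illustrative model and data).
def partial_dependence(model, rows, feature, grid):
    curve = []
    for value in grid:
        preds = []
        for row in rows:
            # fix the feature of interest, keep all other fields as observed
            modified = dict(row, **{feature: value})
            preds.append(model(modified))
        curve.append(sum(preds) / len(preds))  # average over the data
    return curve

# Toy model: prediction depends on "price" plus a per-row offset.
model = lambda row: 0.1 * row["price"] + row["offset"]
rows = [{"price": 10, "offset": 1.0}, {"price": 99, "offset": 3.0}]
curve = partial_dependence(model, rows, "price", grid=[0, 10, 20])
```

Averaging over the other predictors is exactly why the curve "floats up or down" with their values: here the whole curve is shifted by the mean offset while its shape reflects only the price effect.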
41. Introducing the Text Mining Dimension
To this point, we have been working only with the set of traditional structured
data fields: continuous and categorical variables
Further substantial performance improvement can be achieved only if we
utilize the text descriptions supplied by the seller in the following fields
Variable Description
LISTING_TITLE title of auction
LISTING_SUBTITLE subtitle of auction
Unfortunately, these two variables cannot be used “as is”. Sellers were free to
enter free form text including misspellings, acronyms, slang, etc.
So we must address the challenge of converting the unstructured text strings
of the type shown here into a well structured representation
42. The Bag of Words Approach to Text Mining
The most straightforward strategy for dealing with free form text is to
represent each “word” that appears in the complete data set as a dummy
(0/1) indicator variable
For iPods on eBay we could imagine sellers wanting to use words like “new”,
“slightly scratched”, or “pink” to describe their iPod. Of course, the
descriptions may well be complete phrases like “autographed by Angela
Merkel” rather than just single-term adjectives
Nevertheless in the simplest Bag of Words (BOW) approach we just create
dummy indicators for every word
Even though the headlines and descriptions are space limited, the number of
distinct words that can appear in collections of free text can be huge
In text mining applications involving complete documents, e.g. newspaper
articles, the number of distinct words can easily reach several hundred
thousand or even millions
43. The End Goal of the Bag of Words
Record_ID RED USED SCRATCHED CASE
1001 0 1 0 1
1002 0 0 0 0
1003 1 0 0 0
1004 0 0 0 0
1005 1 1 1 0
1006 0 0 0 0
• Above we see an example of a database intended to describe each auction
item by indicating which words appeared in the auction announcement
• Observe that Record_ID 1005 contains the three words “RED”, “USED” and
“SCRATCHED”
• Data in the above format looks just like the kind of numeric data used in
traditional data mining and statistical modeling
• We can use data in this form, as is, feeding it into CART, TreeNet, or
regression tools such as Generalized Path Seeker (GPS) or everyday regression
• Observe that we have transformed the unstructured text into structured
numerical data
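The transformation shown in the table above can be sketched in a few lines. The listings and vocabulary below are invented for illustration; they are not rows from the DMC2006 data.

```python
# Minimal bag-of-words sketch: one 0/1 indicator per vocabulary term.
listings = {  # illustrative auction texts keyed by record ID
    1001: "used ipod with case",
    1003: "red ipod",
    1005: "red used scratched ipod",
}
vocab = ["red", "used", "scratched", "case"]

def bag_of_words(text, vocab):
    """Return a 0/1 vector: does each vocabulary term appear in the text?"""
    words = set(text.lower().split())
    return [1 if term in words else 0 for term in vocab]

rows = {rid: bag_of_words(text, vocab) for rid, text in listings.items()}
```

As in the table, record 1005 gets indicators 1 for “RED”, “USED” and “SCRATCHED” and 0 for “CASE”, and the result is ordinary numeric data ready for CART or TreeNet.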
44. Coding the Term Vector and TF weighting
In the sample data matrix on the previous slide we coded all of our indicators
as 0 or 1 to indicate presence or absence of a term
An alternative coding scheme is based on the FREQUENCY COUNT of the
terms with these variations:
0 or 1 coding for presence/absence
Actual term count (0,1,2,3,…)
Three level indicator for absent, one occurrence, and more than one (0,1,2)
The text mining literature has established some useful weighted coding
schemes. We start with term frequency weighting (tf)
Text mining can involve blocks of text of considerably different lengths
It is thus desirable to normalize counts based on relative frequency. Two text fields
might each contain the term “RED” twice, but one of the fields contains 10 words
while the other contains 40 words. We might want our coding to reflect the fact that
2/10 is more frequent than 2/40.
This is nothing more than making counts relative to the total length of the unit of
text (or document) and such coding yields the term frequency weighting
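The normalization just described is simple to state in code, assuming the common convention that tf is the count divided by the total number of tokens in the document:

```python
# Relative term frequency: counts normalized by document length, so "red"
# twice in 10 words outweighs "red" twice in 40 words.
def term_frequency(tokens, term):
    return tokens.count(term) / len(tokens)

short_doc = ["red"] * 2 + ["x"] * 8    # "red" twice in 10 words
long_doc  = ["red"] * 2 + ["x"] * 38   # "red" twice in 40 words

tf_short = term_frequency(short_doc, "red")  # 2/10
tf_long  = term_frequency(long_doc, "red")   # 2/40
```

The same raw count of 2 thus codes higher (0.2 vs 0.05) in the shorter document, exactly the 2/10 versus 2/40 distinction made above.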
45. Inverse Document Frequency (IDF) Weighting
IDF weighting is drawn from the information retrieval literature and is intended
to reflect the value of a term in narrowing the search for a specific document
within a larger corpus of documents
If a given term occurs very rarely in a collection of documents then that term
is very valuable as a tag to target those documents accurately
By contrast, if a term is very common, then knowing that such a term occurs
within the document you are looking for is not helpful in narrowing the search
While text mining has somewhat different goals than information retrieval the
concept of IDF weighting has caught on. IDF weighting serves to upweight
terms that occur relatively rarely.
IDF(term) = log( (Number of documents) / (Number of documents containing term) )
The IDF increases with the rarity of a term and is maximum for words that
occur in only one document
A common coding of the term vector uses the product: tf * idf
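The formula above and the tf * idf product can be sketched directly; the three-document corpus here is invented for illustration.

```python
import math

# IDF per the formula above: log(N / number of documents containing the term).
docs = [  # illustrative corpus
    ["ipod", "new", "pink"],
    ["ipod", "used", "case"],
    ["ipod", "scratched"],
]

def idf(term, docs):
    n_containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / n_containing)

def tf_idf(term, doc, docs):
    tf = doc.count(term) / len(doc)  # relative term frequency
    return tf * idf(term, docs)
```

Note that “ipod” occurs in every document, so its IDF is log(3/3) = 0: it has no value for narrowing a search. A term like “pink”, occurring in only one document, gets the maximum IDF, illustrating the upweighting of rare terms.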
46. Coding the DMC2006 Text Data
The DMC2006 text data is unusual principally because of the limit on the amount of
text a seller was allowed to upload
This has the effect of making the lengths of all the documents very similar
It also limits sharply the possibility that a term in a document would occur with a high
frequency
These factors contribute to making the TF-IDF weighting irrelevant to this challenge. In
fact, for this prediction task other coding schemes allow more accurate prediction.
STM offers these options for term vector coding
0 – no/yes
1 – no/yes/many – this one will be used in the remainder of this tutorial
2 – 0/1
3 – 0/1/2
4 – term frequency (relative to document)
5 – inverse document frequency (relative to corpus)
6 – TF-IDF (traditional IR coding)
47. Text Mining Data Preparation
The heavy lifting in text mining technology is devoted to moving us from raw
unstructured text to structured numerical data
Once we have structured data we are free to use any of a large number of
traditional data mining and statistical tools to move forward
Typical analytical tools include logistic and multiple regression, predictive
modeling, and clustering tools
But before diving into the analysis stage we need to move through the text
transformation stage in detail
The first step is to extract and identify the words or “terms” which can be
thought of as creating the list of all words recognized in the training data set
This stage is essentially one of defining the “dictionary”, the list of officially
recognized terms. Any new term encountered in the future will be
unrecognizable by the dictionary and will represent an unknown item
It is therefore very important to ensure that the training data set contains
almost all terms of interest that would be relevant for future prediction
48. Automatic Dictionary Building
The following steps will build an active dictionary for a collection of
documents (in our case, auction item description strings)
Read all text values into one character string
Tokenize this string into an array of words (tokens)
Remove words without any letters or digits
Remove “stop words” (words like “the”, “a”, “in”, “und”, “mit”, etc.) for both English
and German languages
Remove words that have fewer than 2 letters and are encountered fewer than 10 times
across the entire collection of documents (rare small words)
At this point the too-common, too-rare, weird, obscure, and useless
combinations of characters should have been eliminated
Lemmatize words using WordNet lexical database
This step combines words present in different grammatical forms (“go”, “went”,
“going”, etc.) into the corresponding stem word (“go”)
Remove all resulting words that appear less than MIN times (5 in the remainder of
this tutorial)
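The steps above can be sketched as a simplified stand-in. STM itself uses NLTK's tokenizer, stop lists, and WordNet lemmatizer; the stop list, toy lemma table, and thresholds here are illustrative assumptions only.

```python
from collections import Counter

STOP_WORDS = {"the", "a", "in", "und", "mit"}            # illustrative stop list
LEMMAS = {"went": "go", "going": "go", "goes": "go"}     # toy lemmatizer
MIN_COUNT = 2   # stand-in for the MIN threshold (the tutorial uses 5)

def build_dictionary(documents):
    # 1-2. read all text and tokenize into words
    tokens = " ".join(documents).lower().split()
    # 3. keep only tokens containing a letter or digit
    tokens = [t for t in tokens if any(c.isalnum() for c in t)]
    # 4. drop stop words
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # 5. lemmatize different grammatical forms into one stem word
    tokens = [LEMMAS.get(t, t) for t in tokens]
    # 6. drop terms occurring fewer than MIN_COUNT times
    counts = Counter(tokens)
    return sorted(t for t, c in counts.items() if c >= MIN_COUNT)

docs = ["the ipod went fast", "a pink ipod going fast", "ipod mit case"]
dictionary = build_dictionary(docs)
```

Note how “went” and “going” collapse to “go”, and the rare terms “pink” and “case” fall below the frequency threshold and drop out of the dictionary.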
49. Build the Dictionary (or Term Vector)
For the purpose of automatic dictionary building and data preprocessing we developed the
Salford Text Mining (STM) software - a stand-alone collection of tools that performs all
the essential steps in preparing text documents for text mining
STM builds on the Python “Natural Language Toolkit” (NLTK)
From NLTK we use the following tools
Tokenizer (extracts items most likely to be "words")
Porter Stemmer (recognizes different simple forms of the same word – e.g. plurals)
WordNet lemmatizer (more complex recognition of same-word variations)
stop word list (words that contribute little to no value, such as "the", "a")
Future versions of STM might use other tools to accomplish these essential tasks
“stm.exe” is a command line utility that must be run from a Command Prompt window
(assuming you are running Windows, go to the Start – All Programs – Accessories –
Command Prompt menu)
The version provided here resides in the stmtutor\STM\bin folder
50. STM Commands and Options
Open a Command Prompt window in Windows, then CD to the
"stmtutor\STM" folder location; for example, on our system you would type in
cd c:\stmtutor\STM
To obtain help type the following at the prompt:
bin\stm --help
This command will return very concise information about STM:
stm [-h] [-data DATAFILE] [-dict DICTFILE] [-source-dict SRCDICTFILE]
[-score SCOREFILE] [-spm SPMAPP] [-t TARGET] [-ex EXCLUDE]
etc.
The details for each command line option are contained in the software
manual appearing in the appendix
You will also notice the “stm.cfg” configuration file – this file controls the default
behavior of the STM module and relieves you of specifying a large number of
configuration options each time “stm.exe” is launched
Note the
TEXT_VARIABLES : 'ITEM_LEAF_CATEGORY_NAME, LISTING_TITLE, LISTING_SUBTITLE'
line, which specifies the names of the text variables to be processed
51. Create Dictionary Options
For the purposes of this tutorial, we have prepackaged all of the text processing
steps into individual command files (extension *.bat). You can either double-
click on the referenced command file or alternatively type its contents into the
Command Prompt window opened in the directory that contains the files
The most important arguments for our purposes in this tutorial now are:
--dataset DATAFILE name and location of your input CSV format data set
--dictionary DICTFILE name and location of the dictionary to be created
These two arguments are all you need to create your dictionary. By default,
STM will process every text field in your input data set to create a single
omnibus dictionary
Simply double-click on the "stm_create_dictionary.bat" to create the dictionary
file for the DMC 2006 dataset, which will be saved in the "dmc2006_ynm.dict"
file in the "stmtutor\STM\dmc2006" folder
In typical text mining practice the process of generating the final dictionary will
be iterative. A review of the first dictionary might reveal further words you wish
to exclude (“stop” words)
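Judging from the two options just described, the contents of "stm_create_dictionary.bat" are likely a single invocation along these lines (the input data file name dmc2006.csv is an assumption – check the actual bat file):

```
bin\stm --dataset dmc2006\dmc2006.csv --dictionary dmc2006\dmc2006_ynm.dict
```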
52. Internal Dictionary Format
The dictionary file is a simple text file with extension
*.dict
The file contents can be viewed and edited in a
standard text editor
The name of the text mining variable that will be
created later on appears to the left of the "=" sign on
each un-indented line
The default value that will be assigned to this
variable appears to the right of the "=" sign on the
un-indented lines and usually means the absence of
the word(s) of interest
Each indented line represents the value (left of the
"=") which will be entered for a single occurrence in a
document of any of the word(s) appearing to the
right of the "="
More than one occurrence will be recorded as
"many" when requested (always the case in this
tutorial)
53. Hand Made Dictionary
To use multi-level coding you need to create a "hand made dictionary", which is already
supplied to you as "hand.dict" in the "stmtutor\STM\dmc2006" folder
Here is an example of an entry in this file
hand_model=standard
    mini
    nano
    standard
The un-indented line of an entry starts with the name we wish to give to the term
(HAND_MODEL) and also indicates that a BLANK or missing value is to be coded with
the default value of “standard”
The remaining indented entries are listed one per line and form an exhaustive list of the
acceptable values which the term HAND_MODEL can receive in the term vector
Another coding option is, for example:
hand_unused=no
    yes=unbenutzt,ungeoffnet
which sets “no” as the default value but substitutes “yes” if one of the two values listed
above is encountered
You may study additional examples in our stmtutor\STM\dmc2006\hand.dict file on your
own; all of them were created manually based on common-sense logic
54. Why Create Hand Made Dictionary Entries
Let's revisit the variable HAND_MODEL, which brings together the terms
"standard", "mini", and "nano"
Without a hand made dictionary entry we would have three terms created,
one for each model type, with “yes” and “no” values, and possibly “many”
By creating the hand made entry we
Ensure that every auction is assigned a model (default=“standard”)
All three models are brought together into one categorical variable with three
possible values “standard”, “mini”, and “nano”
This representation of the information is helpful when using tree-based
learning machines but not helpful for regression-based learning machines
The best choice of representation may vary from project to project
Salford regression-based learning machines automatically repackage categorical
predictors into 0/1 indicators, meaning that you can work with one representation
But if you need to use other tools you may not have this flexibility
55. Further Dictionary Customization
The following table summarizes some of the important fields introduced in the
custom dictionary for this tutorial
Variable          Values                     Combines word variants
CAPACITY          20                         20gb, 20 gb, 20 gigabyte
                  30                         30gb, 30 gb, 30 gigabyte
                  40                         40gb, 40 gb, 40 gigabyte
                  80                         80gb, 80 gb, 80 gigabyte
                  …                          …
STATUS            wieneu                     wie neu, super gepflegt, top gepflegt, top zustand, neuwertig
                  neu                        neu, new, brandneu, brandneues
                  unbenutzt                  unbenu
                  defekt                     defekt., --defekt--, defekt, -defekt-, -defekt, defekter, defektes
MODEL             mini, nano, standard       Captures presence of the corresponding word in the auction description
COLOR             black, white, green, etc.  Captures presence of the corresponding words or variants in the auction description
IPOD_GENERATION   first, second, etc.        Identifies the iPod generation from the information available in the text description
56. Final Stage Dictionary Extraction
To generate a final version of the dictionary in most real world applications
you would also need to prepare an expanded list of stopwords
The NLTK provides a ready-made list of stopwords for English and another
14 major languages spanning Europe, Russia, Turkey, and Scandinavia
These appear in the directory named stmtutor\STM\data\corpora\stopwords
and should be left as they are
Additional stopwords, which might well vary from project to project, can be
entered into the file named "stopwords.dat" in the "stmtutor\STM\data"
folder
In the package distributed with this tutorial the “stopwords.dat” file is empty
You can freely add words to this file, with one stopword per line
Once the custom “stopwords.dat” and “hand.dict” files have been prepared
you just run the dictionary extraction again but with the “--source-dictionary”
argument added (see the command files introduced in the later slides)
The resulting dictionary will now include all the introduced customizations
57. Creating Structured Text Mining Variables
The resulting dictionary file “dmc2006_ynm.dict” contains about 600 individual stems
In the final step of text processing the data dictionary is applied to each document entry
Each stem from the dictionary is represented by a categorical variable (usually binary)
with the corresponding name
The preparation process checks whether any of the known word variants associated
with each stem from the dictionary are present in the current auction description, and if
“yes”, the corresponding value is set to “yes”, otherwise, it is set to “no”
When the “--code YNM” option is set, multiple instances of “yes” will be coded as “many”
You can also request integer codes 0, 1, 2 in place of the character “yes/no/many”
We have experimented with alternative variants of coding (see the "--code" help entry in the
STM manual) and came to the conclusion that the "YNM" approach works best in this tutorial
Feel free to experiment with alternative coding schemas on your own
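The coding step just described can be sketched as a small Python function. This is a minimal illustration of the "--code YNM" behaviour, not the STM implementation; it handles single-word variants only (the real tool also matches multi-word variants such as "wie neu").

```python
def code_term(variants, document, scheme="YNM"):
    """Code the presence of a dictionary term's word variants in one
    document -- a minimal sketch of the "--code YNM" behaviour."""
    words = document.lower().split()
    # Total occurrences of any known variant of this term
    hits = sum(words.count(v) for v in variants)
    if scheme == "YNM":                      # yes / no / many coding
        return "no" if hits == 0 else ("yes" if hits == 1 else "many")
    return "no" if hits == 0 else "yes"      # plain YN fallback
```

For example, a description mentioning "neu" twice would be coded "many" for the corresponding term, while a description with no match stays "no".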
The resulting large collection of variables will be used as additional predictors in our
modeling efforts
Even though other, more computationally intense text processing methods exist, further
investigation failed to demonstrate their utility on the current data, which is most likely
due to the extremely terse nature of the auction descriptions
58. Creating Additional Variables
Finally, we spent additional effort reorganizing the original raw variables
into more useful measures
MONTH_OF_START – based on the recorded start date of auction
MONTH_OF_SALE – based on the recorded closing date of auction
HIGH_BUY_IT_NOW – set to “yes” if BUY_IT_NOW_PRICE exceeds the
CATEGORY_AVG_GMS as suggested by common sense and the nature of the
classification problem
In the original raw data, BUY_IT_NOW_PRICE was set to 0 on all items where that
option was not available – we reset all such 0s to missing
All of these operations are encoded in the "preprocess.py" Python file
located in the "stmtutor\STM\dmc2006" folder
This component of the STM is under active development
The file is automatically called by the main STM utility
You may add/modify the contents of this file to allow alternative transformations of
the original predictors
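The derived-variable logic above can be sketched as follows. This is an illustration, not the contents of preprocess.py; the raw field names START_DATE and CLOSE_DATE are assumptions, while the other names come from the slide.

```python
def derive_fields(row):
    """Sketch of the derived-variable logic described above; the real
    preprocess.py may differ, and START_DATE/CLOSE_DATE are assumed names."""
    out = dict(row)
    # Month of auction start / close from the recorded dates
    out["MONTH_OF_START"] = row["START_DATE"].month
    out["MONTH_OF_SALE"] = row["CLOSE_DATE"].month
    # BUY_IT_NOW_PRICE of 0 meant "option not available": reset to missing
    price = row["BUY_IT_NOW_PRICE"] or None
    out["BUY_IT_NOW_PRICE"] = price
    # Flag auctions whose buy-it-now price exceeds the category average
    out["HIGH_BUY_IT_NOW"] = (
        "yes" if price is not None and price > row["CATEGORY_AVG_GMS"]
        else "no"
    )
    return out
```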
59. Generation of the Analysis Data Set
At this point we are ready to move on to the next step, which is data creation
This is nothing more than appending the relevant columns of data to the
original data set. Remember that the dictionary may contain tens of
thousands if not hundreds of thousands of terms
For the DMC2006 dataset the dictionary is quite small by text mining
standards, containing just a little over 600 words
To generate the processed dataset simply double-click on the stm_ynm.bat
command file or explicitly type in its contents in the Command Prompt
The “--dataset” option specifies the input dataset to be processed
The “--code YNM” option requests “yes/no/many” style of coding
The “--source-dictionary” option specifies the hand dictionary
The “--process” option specifies the output dataset
Of course you may add other options as you prefer
This creates a processed dataset with the name dmc2006_res_ynm.csv
which resides in the stmtutor\STM\dmc2006 folder
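Given the options listed above, the contents of stm_ynm.bat plausibly reduce to a single invocation like the following (the input data file name is an assumption – check the actual bat file):

```
bin\stm --dataset dmc2006\dmc2006.csv --code YNM --source-dictionary dmc2006\hand.dict --process dmc2006\dmc2006_res_ynm.csv
```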
60. Analysis Data Set Observations
At this point we have a new modeling dataset with the text information
represented by the extra variables
Note that the raw input data set is just shy of 3 MB in size in a plain text format
while the prepared analysis data set is about 40 MB in size, 13 times larger
Process only training data or all data?
For prediction purposes all data needs to be processed, both the data that will be
used to train the predictive models and the holdout or future data that will receive
predictions later
In the DMC2006 data we happen to have access to both training and holdout data
and thus have the option of processing all the text data at the same time
Generating the term vector based only on the training data would generally be the
norm because future data flows have not yet arrived
In this project we elected to process all the data together for convenience knowing
that the train and holdout partitions were created by random division of the data
It is worth pointing out, though, that the final dictionary generated from training
data only might be slightly different due to the infrequent word elimination
component of the text processor
61. Quick Modeling Round with CART
We are now ready to proceed with another CART run this time using all of the
newly created text fields as additional predictors
Assuming that you already have SPM launched
Go to the
File – Open – Data File menu
Make sure that the Files of Type
is set to ASCII
Highlight the
dmc2006_res_ynm.csv
dataset
Press the [Open] button
62. Dataset Summary Window
Again, the resulting window summarizes basic facts about the dataset
Note the dramatic increase in the number of available variables
63. The View Data Window
Press the [View Data…] button to have a quick look at the physical contents
of the dataset
Note how the individual dictionary word entries are now coded with the “yes”,
“no”, or “many” values for each document row
64. Setting Up CART Model
Proceed with setting up a CART modeling run as before:
Make the Classic Output window active
Go to the Model – Construct Model… menu (alternatively, you could use one of
the buttons located on the bar right below the menu)
In the resulting Model Setup window make sure that the Analysis Method is set
to CART
In the Model tab make sure that the Sort is set to File Order and the Tree Type is
set to Classification
Check GMS_GREATER_AVG as the Target
Check all of the remaining variables except AUCT_ID, LISTING_TITLE$,
LISTING_SUBTITLE$, GMS, and CATEGORY_AVG_GMS as predictors
You should see something similar to what is shown on the next slide
66. Model Setup Window: Testing Tab
Switch to the Testing tab and confirm that 10-fold cross-validation is used
as the optimal model selection method
67. Model Setup Window: Advanced Tab
Switch to the Advanced tab and set the minimum required number of records
for the parent nodes and the child nodes to 15 and 5, respectively
These limits were chosen to avoid extremely small nodes in the resulting tree
68. Building CART Model
Press the [Start] button; a building progress window will appear for a while and then the Navigator
window containing model results will be displayed (this time the process takes a few minutes!)
Press the little button right above the [+][-] pair of buttons along the left border of the Navigator
window; note that all trees within one standard error (SE) of the optimal tree are now marked in green
Use the arrow keys to select the 102-node tree from the tree sequence, which is the smallest 1SE tree
69. CART Model Performance
The selected CART model contains 102 terminal nodes where nearly all available
predictor variables play a role in the tree construction
Area under the ROC curve (Test) is now an impressive 0.830, especially when
compared to the one reported earlier at 0.748 for the basic CART run or the 0.800 for
the basic TN run
Press on the [Summary Reports] button in the Navigator window, select the
Prediction Success tab, and finally press the [Test] button to see cross-validated test
performance at 76.58% classification accuracy – a significant improvement!
Also note the presence of the original and derived variables on the list shown in the
Variable Importance tab
70. Setting Up TN Model
Now switch to the Classic Output window and go to the Model – Construct
Model… menu
Choose TreeNet as the Analysis Method
In the Model tab make sure that the Tree Type is set to Logistic Binary
71. Setting Up TN Parameters
Switch to the TreeNet tab and do the following:
Set the Learnrate: to 0.05
Set the Number of trees to use: to 800
Leave all of the remaining options at their default values
72. TN Results Window
Press the [Start] button to initiate the TN modeling run; the TreeNet Results
window will appear at the end, though you might want to take a coffee
break while the modeling run completes
73. Checking TN Performance
Press on the [Summary] button and switch to the Prediction Success tab
Press the [Test] button to view cross-validation results
Lower the Threshold: to 0.47 to roughly equalize classification accuracy in both classes
(this makes it easier to compare the TN performance with the earlier reported CART
and TN model performance)
You can clearly see the improvement!
74. Requesting TN Graphs
Here we present a sample collection of all 2-D contribution plots produced by
TN for the resulting model
The plots are available by pressing on the [Display Plots…] button in the
TreeNet Results window
The list is arranged according to the variable importance table
76. Insights Suggested by the Model
Here is a list of insights we arrived at by looking into the selection of plots
There is a distinct effect of the iPod category once all the other factors have been
accounted for
Larger start price means above the average sale (most likely relates to the quality
of an item)
A "new" and "unpacked" item should fetch a better price, while any "defect" brings
the price down
End of the year means better sales
Having a good feedback score is important
It is best to wait 10 days or more before closing the deal
Interestingly, 1st and 3rd generations of iPod show poorer sales than the 2nd and 4th
2G started to fall out of favor in 2005-2006
Black is much more popular in Germany than other colors
Mentioning “photo”, “video”, “color display”, etc. helps get a better price
The paid advertising features are of little or marginal importance
77. Final Validation of Models
At this point we are ready to check the performance of all our models using
the remaining 8,000 auctions originally not available for training
This way each model can be positioned with respect to the 173 official
entries originally submitted to the DMC 2006 competition
However, in order to proceed with the evaluation, we must first score the
input data using all of the models we have generated up until now
The following slides explain how to score the most recently constructed
CART and TN models, the earlier models can be scored using similar steps
You may choose to skip the scoring steps as we have already included the
results of scoring in the "stmtutor\STM\scored" folder:
Score_cart_raw.csv – simple CART model predictions
Score_tn_raw.csv – simple TN model predictions
Score_cart_txt.csv – text mining enhanced CART model predictions
Score_tn_txt.csv – text mining enhanced TN model predictions
78. Scoring a CART Model
Select the Navigator window for the model you wish to score
Select the tree from the tree sequence (in our runs we pick the 1SE trees as
more robust)
Press the [Score] button to open the “Score Data” window
Make sure that the “Data file” is set to “dmc2006_res_ynm.csv”, if not press
the [Select…] button on the right and select the dataset to be scored
Place a checkmark in the “Save results to a file” box, then press the [Select]
button right next to it, this will open the “Save As” window
Navigate to the "stmtutor\STM\scored" folder under the "Save in:" selection box,
enter "Scored_cart_txt.csv" in the "File name:" text entry box, and press the
[Save] button
You should now see something similar to what's shown on the next slide
Press the [OK] button to initiate the scoring process
You should now have the Scored_cart_txt.csv file in the stmtutor\STM\scored
folder
80. Scoring a TN Model
Select the “TreeNet Results” window for the model you wish to score
Go to the “Model – Score Data…“ menu to open the “Score Data” window
Make sure that the “Data file” is set to “dmc2006_res_ynm.csv”, if not press
the [Select…] button on the right and select the dataset to be scored
Place a checkmark in the “Save results to a file” box, then press the
[Select] button right next to it, this will open the “Save As” window
Navigate to the "stmtutor\STM\scored" folder under the "Save in:" selection
box, enter "Scored_tn_txt.csv" in the "File name:" text entry box, and press
the [Save] button
You should now see something similar to what's shown on the next slide
Press the [OK] button to initiate the scoring process
You should now have the Scored_tn_txt.csv file in the
stmtutor\STM\scored folder
82. Using STM to Validate Performance
We can now use the STM machinery to do final model validation
Simply double-click the “stm_validate.bat” command file to proceed
Note the use of the following options inside of the command file:
“-score” – specifies the output dataset where the model predictions will be written
“--score-column” – specifies the name of the variable containing the actual model
predictions (these variables are produced by CART or TN during the scoring
process)
“--check” – specifies the name of the dataset that contains the originally withheld
values of the target
this dataset was used by the organizers of the DMC 2006 competition to
select the actual winners
STM is currently configured to validate only the bottom 8,000 of the 16,000
predictions generated by the model; the top 8,000 records (used for learning) are
simply ignored
The results will be saved into text files with extensions "*.result" appended to
the original score file names in the "stmtutor\STM\scored" folder
83. Validation Results Format
The following window shows the validation results of the final TN model we
built
8000 validation records were scored, of which:
719 ones were misclassified as zeroes
807 zeroes were misclassified as ones
Thus 1,526 documents were misclassified
This gives the final score of 8,000 – (1,526 * 2) = 4,948
84. Final Validation of Models
Based on the predicted class assignments, the final performance score is
calculated as 8,000 minus twice the total number of auction items
misclassified
The following table summarizes how these virtually out-of-the-box elementary
models perform on the holdout data (the values are extracted from the four
*.result files produced by the STM validator)
Model            ROC Area   Missed 0s   Missed 1s   Score
CART raw data    75%        1123        1387        2980
TN raw data      80%        1308        926         3532
CART text data   83%        981         848         4342
TN text data     89%        807         719         4948
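The Score column follows directly from the error counts via the competition formula; a quick check in Python:

```python
def dmc_score(missed_zeros, missed_ones, n_validation=8000):
    """DMC 2006 score: validation records minus twice the misclassifications."""
    return n_validation - 2 * (missed_zeros + missed_ones)

# Error counts (missed 0s, missed 1s) taken from the table above
results = {
    "CART raw data": (1123, 1387),
    "TN raw data": (1308, 926),
    "CART text data": (981, 848),
    "TN text data": (807, 719),
}
scores = {m: dmc_score(z, o) for m, (z, o) in results.items()}
```

Evaluating `scores` reproduces the table's Score column exactly.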
85. Visual Validation of the Results
The following graph summarizes the positioning of the four basic models with
respect to the 173 official competition entries
The TN model with text mining processing is among the top 10 winners!
[Chart: the four models – TN text, CART text, TN raw, CART raw – positioned among the 173 official competition entries]
86. Observations on the Results
We used the most basic form of text mining, the Bag of Words, with minor
emendations
None of the authors speaks German, although we did look up some of the words in
an online dictionary. If there are any subtleties to be picked up from seller wording
choices, we would have missed them.
We chose the coding scheme that performed best on the training data. We
have six coding options and one stands out as clearly best
We used common settings for the controls for CART and TreeNet
We did not use any of the modeling refinement techniques we teach in our
CART and TreeNet tutorials
We thus invite you to see if you can push the performance of these models
even higher
87. Command Line Automation in SPM
SPM has a powerful command line processing component which allows you to completely
reproduce any modeling activity by creating and later submitting a command file
We have packaged the command files for the four modeling and scoring runs you have conducted
in the course of this tutorial
SPM command files must have the extension *.cmd
The four command files are stored in the "stmtutor\STM\dmc2006" folder
You can create, open, or edit a command file using a simple text editor, like Notepad, etc.
SPM has a built-in editor, just go to the File – New Notepad… menu
You may also access the command line directly from inside of the SPM GUI, just make sure that the
File – Command Prompt menu item is checked
Just type in “help” in the Command Prompt part (starts with the “>” mark) of the Classic Output
window to get the listing of all available commands
Then you can request a more detailed help for any specific command of interest, for example “help
battery” will produce a long list of various batteries of automated runs available in SPM
Furthermore, you may view all of the commands issued during the current session by going to the
View – Open Command Log… menu, this way you can quickly learn which commands correspond
to the recent GUI activity you were involved with
88. Basic CART Model Command File
You may now restart SPM to emulate a new fresh run
Go to the File – Open – Command File… menu
Select the “cart_raw.cmd” command file and press the [Open] button
The file is now opened in the built-in Notepad window
89. CART Command File Contents
OUT – saves the classic output into a
text file
USE – points to the modeling dataset
GROVE – saves the model as a binary
grove file
MODEL – specifies the target variable
CATEGORY – indicates which variables
are categorical, including the target
KEEP – specifies the list of predictors
LIMIT – sets the node limits
ERROR – requests cross-validation
BUILD – builds a CART model
SAVE – names the file where the CART
model predictions will be saved
HARVEST – specifies which tree is to be
used in scoring
IDVAR – requests saving of the additional variables into the output dataset
SCORE – scores the CART model
OUTPUT * – closes the current text output file
Note the use of the relative paths in the GROVE and SAVE commands
Also note the use of the forward slash "/" to separate folder names
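Putting the commands just described together, the contents of "cart_raw.cmd" plausibly look like the sketch below. The variable lists are elided with "…" and the exact syntax may differ from the real file, so treat this as illustration only; the node limits (15/5) and 10-fold cross-validation come from the model setup described earlier.

```
OUT "dmc2006/cart_raw.dat"
USE "dmc2006/dmc2006.csv"
GROVE "../models/cart_raw.grv"
MODEL GMS_GREATER_AVG
CATEGORY GMS_GREATER_AVG, …
KEEP …
LIMIT ATOM=15, MINCHILD=5
ERROR CROSS=10
BUILD
SAVE "../scored/Score_cart_raw.csv"
HARVEST …
IDVAR AUCT_ID
SCORE
OUTPUT *
```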
90. Submitting Command File
With the Notepad window active, go to the File – Submit Window menu to
submit the command file into SPM
In the end you will see the Navigator and Score windows opened, which
should be identical to the ones you saw at the beginning of this
tutorial
Furthermore, you should now have
"cart_raw.dat" text file created in the "stmtutor\STM\dmc2006" folder; the file
contains the classic output you normally see in the "Classic Output" window
"cart_raw.grv" binary grove file created in the "stmtutor\STM\models" folder; the
file contains the CART model itself and can be opened in the GUI using the File –
Open – Open Grove… menu, which reopens the Navigator window; this file will
also be needed for future scoring or translation
"Score_cart_raw.csv" data file created in the "stmtutor\STM\scored" folder; the
file contains the selected CART model predictions on your data
You may proceed now with opening up the “tn_raw.cmd” file using the File –
Open – Command File… menu
91. TN Command File Contents
OUT, USE, GROVE, MODEL,
CATEGORY, KEEP, ERROR, SAVE,
IDVAR, SCORE, OUTPUT – same as the
CART command file introduced earlier
MART TREES – sets the TN model size
in trees
MART NODES – sets the tree size in
terminal nodes
MART MINCHILD – sets the minimum
individual node size in records
MART OPTIMAL – sets the evaluation
criterion that will be used for optimal
model selection
MART BINARY – requests logistic
regression processing in our case
MART LEARNRATE – sets the learnrate
parameter
MART SUBSAMPLE – sets the sampling
rate
MART INFLUENCE – sets the influence
trimming value
The rest of the MART commands
request automatic saving of the 2-D and
3-D plots into the grove; type in "help
mart" to get full descriptions
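Combining the MART commands above with the shared commands from the CART file, the TreeNet-specific portion of the command file plausibly looks like the sketch below. Values not stated in the tutorial are elided with "…", and the exact syntax may differ from the real file; TREES=800 and LEARNRATE=0.05 are the settings used in this tutorial's TN run, and NODES=6 is the default from the STM command reference.

```
MART TREES=800
MART NODES=6
MART MINCHILD=…
MART OPTIMAL …
MART BINARY
MART LEARNRATE=0.05
MART SUBSAMPLE=…
MART INFLUENCE=…
```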
92. Submitting the Rest of the Command Files
Again, with the current Notepad window active, use the File – Submit Window menu
to launch the basic TN modeling run automatically followed by scoring
This will create the output, grove, and scored data files in the corresponding locations
for the chosen TN model; also note the use of the EXCLUDE command in place of the
KEEP command inside of the command file – this saves a lot of typing
Now go back to the Classic Output window and notice that the File menu has
changed
Go to the File – Submit Command File… menu, select the "cart_txt.cmd" command
file, and press the [Open] button
Notice the modeling activity in the Classic Output window, but no Results window is
produced – this is how the Submit Command File… menu item is different from the
Submit Window menu item used previously; nonetheless, the output, grove, and score
files are still created in the specified locations
Use the File – Open – Open Grove… menu to open the "tn_raw.grv" file located in
the "stmtutor\STM\models" folder; you will need to navigate into this folder using the
Look in: selection box in the Open Grove File window
You may now proceed with the final TN run by submitting the "tn_txt.cmd" command
file using either the File – Open – Command File… / File – Submit Window or File –
Submit Command File… menu routes – don't forget that it does take a long time to run!
93. Final Remarks
This completes the Salford Systems Data Mining and Text Mining tutorial
In the process of going through the tutorial you have learned how to use both
the GUI and command line facilities of SPM as well as the command line text
mining facility STM
You managed to build two CART models and two TN models, as well as enrich
the original dataset with a variety of text mining fields
The final model puts you among the top winners in a major text mining
competition – a proud achievement
Even though we have barely scratched the surface, you are now ready to
proceed with exploring the remainder of the vast data mining activities offered
within SPM and STM on your own
We wish you the best of luck on the exciting and never-ending road of modern
data analysis and exploration
And don't forget that you can always reach us at www.salford-systems.com
should you have further modeling questions and needs
94. References
Breiman, L., J. Friedman, R. Olshen and C. Stone (1984), Classification and
Regression Trees, Pacific Grove: Wadsworth
Breiman, L. (1996). Bagging predictors. Machine Learning, 24, 123-140.
Hastie, T., Tibshirani, R., and Friedman, J.H (2000). The Elements of
Statistical Learning. Springer.
Freund, Y. & Schapire, R. E. (1996). Experiments with a new boosting
algorithm. In L. Saitta, ed., Machine Learning: Proceedings of the Thirteenth
National Conference, Morgan Kaufmann, pp. 148-156.
Friedman, J.H. (1999). Stochastic gradient boosting. Stanford: Statistics
Department, Stanford University.
Friedman, J.H. (1999). Greedy function approximation: a gradient boosting
machine. Stanford: Statistics Department, Stanford University.
Sholom M. Weiss, Nitin Indurkhya, Tong Zhang, Fred J. Damerau (2004).
Text Mining. Predictive Methods for Analyzing Unstructured Information.
Springer.
95. STM Command Reference
Salford Text Miner is a simple utility that makes the text mining process
much easier. The application described in this manual accepts a number of
parameters and can execute the Salford Predictive Miner as the data
mining backend
STM Workflow:
Automatically generate a dictionary based on the dataset
Process the dataset and generate a new one with additional columns based on the dictionary
Generate a model folder with the dataset, command file, and dictionary
Run the Salford Predictive Miner with the generated command file
Run the checking process, comparing the scoring results with the real classes
All of these steps can be done in separate STM calls or in one call
96. STM Command Reference
Short Option          Long Option                  Description
-data DATAFILE        --dataset DATAFILE           Specify the dataset to work with
-dict DICTFILE        --dictionary DICTFILE        Specify the dictionary to work with
-source-dict SDFILE   --source-dictionary SDFILE   Dictionary used as the source for the automatic dictionary retrieval process
-score SFILE          --scoreresult SFILE          File with the score result for the checking process; default – 'score.csv'
-spm SPMAPP           --spmapplication SPMAPP      Path to the SPM application; default – 'spm.exe'
-t TARGET             --target TARGET              Target variable for command file generation; default – 'GMS_GREATER_AVG'
-ex EXCLUDE           --exclude EXCLUDE            List of variables to exclude from the keep list when generating the command file
-cat CATEGORY         --category CATEGORY          List of variables to select as categorical when generating the command file
97. STM Command Reference
-templ CMDTEMPL, --cmdtemplate CMDTEMPL
    Template command file used for generation; default 'data/template.cmd'
-md MODEL_DIR, --modeldir MODEL_DIR
    Directory in which model folders will be created; default 'models'
-trees TREES, --trees TREES
    For TreeNet command files: number of trees to build; default 500
-maxnodes MAXNODES, --maxnodes MAXNODES
    For TreeNet command files: number of nodes per tree; default 6
-fixwords, --fixwords
    Enable heuristics that try to fix words (nearest match by various metrics, spell checking, etc.)
-textvars VARLIST, --text-variables VARLIST
    Comma-separated list of variables used in the dictionary retrieval process
98. STM Command Reference
-outrmwords, --output-removed-words
    Write removed stop words to the file 'data/removed.dat'
-code CODE, --column-coding CODE
    How to code the absence/presence of a word in a row; default YN:
        YN or 0: no/yes
        YNM or 1: no/yes/many
        01 or 2: 0/1
        012 or 3: 0/1/2
        TF or 4: term frequency
        IDF or 5: inverse document frequency
        TF-IDF or 6: TF-IDF
        TC or 7: term count (0, 1, 2, ...)
-mp MODELPATH, --model-path MODELPATH
    Path where model files will be created
-cmd-path CMDPATH, --command-file-path CMDPATH
    Path to the command file to be executed by Salford Predictive Miner
-ppfile PPFILE, --preprocess-file PPFILE
    Path to Python code executed during the process step to manipulate the data
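The -code/--column-coding schemes correspond to standard text-representation choices. As a minimal sketch of the TC, TF, IDF, and TF-IDF variants using their textbook definitions (an illustration only; STM's exact implementation may tokenize or normalize differently):

```python
import math

def code_documents(docs, vocab, scheme="TC"):
    """Code each document's vocabulary words under the chosen scheme.

    TC     - raw count of the term in the document (0, 1, 2, ...)
    TF     - term count divided by document length
    IDF    - log(N / number of documents containing the term)
    TF-IDF - TF * IDF
    """
    n_docs = len(docs)
    # document frequency: in how many documents each vocabulary word appears
    df = {w: sum(1 for d in docs if w in d) for w in vocab}
    rows = []
    for doc in docs:
        row = {}
        for w in vocab:
            tc = doc.count(w)
            tf = tc / len(doc) if doc else 0.0
            idf = math.log(n_docs / df[w]) if df[w] else 0.0
            row[w] = {"TC": tc, "TF": tf, "IDF": idf, "TF-IDF": tf * idf}[scheme]
        rows.append(row)
    return rows

docs = [["ipod", "nano", "new"], ["ipod", "case"], ["usb", "cable", "new"]]
vocab = ["ipod", "new", "usb"]
print(code_documents(docs, vocab, "TC")[0])  # {'ipod': 1, 'new': 1, 'usb': 0}
```

The YN, YNM, 01, and 012 codings are simple thresholdings of the TC column.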
99. STM Command Reference
Short Option Long Option Description
-rc NAME, --realclass-column-name NAME
    Column name in the real-class dataset used by the check step; default GMS_GREATER_AVG
-e, --extract
    Run the first step: automatic extraction of the dictionary from the dataset. Requires --dataset
-p OUTFILE, --process OUTFILE
    Run the second step: process the dataset and create a new dataset named OUTFILE with additional columns based on the dictionary. Requires --dataset and --dictionary
-g, --generate
    Run the third step: generate the model folder with the command file. Requires --dataset and --dictionary
-m, --model
    Run the fourth step: run Salford Predictive Miner with the generated command file. Works only with --generate
-c DATASET, --check DATASET
    Run the fifth step: compare the score file with the real classes (from DATASET) and output a misclassification table. Requires --scoreresult
-h, --help
    Show help
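The check step (-c/--check) amounts to cross-tabulating predicted against actual classes. A sketch of that comparison in plain Python (my own illustration of what a misclassification table contains, not STM's code; in STM the two label columns come from the score file and the real-class file):

```python
from collections import Counter

def misclassification_table(actual, predicted):
    """Cross-tabulate actual vs. predicted class labels as a text table."""
    table = Counter(zip(actual, predicted))
    classes = sorted(set(actual) | set(predicted))
    lines = ["actual\\predicted  " + "  ".join(f"{c:>6}" for c in classes)]
    for a in classes:
        counts = "  ".join(f"{table[(a, p)]:>6}" for p in classes)
        lines.append(f"{a:>16}  {counts}")
    return "\n".join(lines)

actual    = [1, 0, 1, 1, 0, 0]
predicted = [1, 0, 0, 1, 0, 1]
print(misclassification_table(actual, predicted))
```

For the six cases above the table shows two correct predictions per class, one false positive, and one false negative.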
100. STM Configuration File
SPM_APPLICATION
    Path to Salford Predictive Miner; default spm.exe
CMD_TREES
    Number of trees to build in TN models; default 500
CMD_NODES
    Tree size for TN models; default 6
CMD_TEMPLATE
    Command file template; default data/template.cmd
MODELS_DIR
    Directory in which model folders will be created; default models
LANGUAGES
    Languages whose stop words will be used; default English, German
SPELLCHECKER_DICT
    Additional spell-checker dictionary with allowed words (such as "ipod"); default data/spellchecker_dict.dat
SPELLCHECKER_LANGUAGE
    Language for the spell checker; default de_DE
ADDITIONAL_STOPWORDS
    File with additional stop words, which the user can edit; default data/stopwords.dat
REMOVED_WORDS_FILE
    File to which removed words are written during the "extract" step; default data/removed.dat
WORD_FREQUENCY_THRESHOLD
    Words with frequency below this threshold are deleted during the "extract" step; default 5
PREPROCESS_FILE
    Script included to do additional processing; default dmc2006/preprocess.py
101. STM Configuration File
CHECK_RESULTS_FILE
    File for check results; default data/score_results.csv
LOGFILE
    Path to the log file; may contain a mask (%s for the date); default log/stm%s.log
TARGET
    Default value of the target argument, used to fill the command file template; default GMS_GREATER_AVG
EXCLUDE
    Default value of the exclude argument, used to fill the command file template; default AUCT_ID, LISTING_TITLE$, LISTING_SUBTITLE$, GMS, GMS_GREATER_AVG
CATEGORY
    Default value of the category argument, used to fill the command file template; default GMS_GREATER_AVG
SCORE_FILE
    Name of the score file to be checked; default Score.csv
TEXT_VARIABLES
    Comma-separated list of text variables in the dataset; default ITEM_LEAF_CATEGORY_NAME, LISTING_TITLE, LISTING_SUBTITLE
DEFAULT_CODING
    Default coding for the extract and preprocess steps; default YN
REALCLASS_COLUMN_NAME
    Name of the column in the real-class file used in the check step; default GMS_GREATER_AVG
SCORE_COLUMN_NAME
    Name of the column in the score file used in the check step; default PREDICTION
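Putting the settings together, a configuration file that overrides a few of the defaults above might look like this. The NAME=value layout and the Windows-style SPM path are assumptions for illustration; consult the STM distribution for the exact file syntax and location.

```ini
; Hypothetical STM configuration overriding selected defaults
SPM_APPLICATION=C:\SPM\spm.exe
CMD_TREES=800
CMD_NODES=6
LANGUAGES=English,German
TARGET=GMS_GREATER_AVG
TEXT_VARIABLES=ITEM_LEAF_CATEGORY_NAME,LISTING_TITLE,LISTING_SUBTITLE
DEFAULT_CODING=TF-IDF
SCORE_COLUMN_NAME=PREDICTION
```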