Data Science Meetup

 Finalize Data Science Teams
    Technology Overview
      Data Science Tools
   Data Science Resources

       October 3, 2012
Presentation by:

Michael Walker
Rose Business Technologies
720.373.2200
m@rosebt.com
http://www.rosebt.com
Agenda

6:00 - 6:30 Overview - Finalize Data Science Teams: Michael Walker
6:30 - 7:00 Hadoop/MapReduce Presentation: John Dougherty
7:00 - 7:15 Qubole Presentation: Sadiq Shaik
7:15 - 7:45 Kognitio Presentation: Reggie Arizmendi
7:45 - 8:00 Network
Hype Cycle for Emerging Tech 2012
Hype Cycle for Big Data 2012
Top 5 Big Data Challenges



1. Deciding what data is relevant
2. Cost of technology infrastructure
3. Lack of skills to analyze the data
4. Lack of skills to manage big data projects
5. Lack of business support
Most Difficult Big Data Skills to Find


1. Advanced analytics, predictive analytics
2. Complex event processing
3. Rules management
4. Business intelligence tools
5. Data integration
Big Data Drivers
Analysis of:

1. Operational data
2. Online customer data
3. Sales transactions data
4. Machine or device data
5. Service innovation
Definitions
Big data analytics is the application of advanced analytic
  techniques to very big data sets.

Big data is a new generation of technologies and
  architectures designed to extract value economically
  from very large volumes of a wide variety of data
by enabling high-velocity capture, discovery and/or
analysis.
Horizontal & Vertical Applications
Big Data technology can be deployed for business
  processes such as the following:

• Customer relationship management (sales, marketing,
   customer service)
• Supply chain and operations
• Administration (finance and accounting, human
   resources, legal)
• Research and development
• Information technology management
• Risk management
Horizontal & Vertical Applications

In addition, big data technology can be used for industry-
   specific applications such as the following:

• Logistics optimization in the transportation industry
• Price optimization in the retail industry
• Intellectual property management in the media and entertainment industry
• Natural resource exploration in the oil and gas industry
• Warranty management in the manufacturing industry
• Crime prevention and investigation in local law enforcement
• Predictive damage assessments in the insurance industry
• Fraud detection in the banking industry
• Patient treatment and fraud detection in the healthcare industry
Data Science Teams
Four (4) person teams

Optimal skill mix:

1. Business Leader (consumer)
2. Statistics
3. Data Modeler
4. IT
Data Science Use Case / Scenario
Each team selects a use case / scenario

Thesis
Data sources
Analytical tools / platforms
Use Case

Example: I suggest there is a correlation
 between size of government and economic
 growth.

Thesis: Bigger government = slower economic
  growth

Data source: Open data sources from
 government stats; Yahoo Finance, Bloomberg

Tool: Qubole on Amazon PaaS
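
A minimal sketch of how a team might test this thesis, assuming a hypothetical file gov_econ.csv (compiled from open government statistics) with columns gov_spending_pct_gdp and gdp_growth_pct; the file and column names are illustrative, not part of the use case as presented:

# Minimal sketch: test the "bigger government = slower growth" thesis.
# Assumes a hypothetical gov_econ.csv with one row per country-year and
# columns: country, year, gov_spending_pct_gdp, gdp_growth_pct.
import numpy as np
import pandas as pd

df = pd.read_csv("gov_econ.csv")

# Pearson correlation between government size and economic growth.
corr = df["gov_spending_pct_gdp"].corr(df["gdp_growth_pct"])
print(f"correlation: {corr:.3f}")

# A simple least-squares fit gives the slope of the relationship.
slope, intercept = np.polyfit(df["gov_spending_pct_gdp"], df["gdp_growth_pct"], 1)
print(f"growth change per 1-point rise in spending share: {slope:.3f}")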
Data Modeling

A data model is a plan for building a database.

To use a common analogy, the data model is
  equivalent to an architect's building plans.
Data Modeling
Three different types of data models:

1) Conceptual data models.

These models, sometimes called domain models, are
  typically used to explore domain concepts with project
  stakeholders. On Agile teams high-level conceptual
  models are often created as part of your initial
  requirements envisioning efforts as they are used to
  explore the high-level static business structures and
  concepts. On traditional teams conceptual data models
  are often created as the precursor to LDMs or as
  alternatives to LDMs.
Data Modeling
2) Logical data models (LDMs).

LDMs are used to explore the domain concepts, and their
  relationships, of your problem domain. This could be
  done for the scope of a single project or for your entire
  enterprise. LDMs depict the logical entity types, typically
  referred to simply as entity types, the data attributes
  describing those entities, and the relationships between
  the entities. LDMs are rarely used on Agile projects
  although often are on traditional projects (where they
  rarely seem to add much value in practice).
Data Modeling
3) Physical data models (PDMs).

PDMs are used to design the internal schema of a
  database, depicting the data tables, the data columns of
  those tables, and the relationships between the tables.
  PDMs often prove to be useful on both Agile and
  traditional projects, and as a result the focus here
  is on physical modeling.
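
As an illustration, a minimal physical-data-model sketch for a hypothetical customer/order schema, expressed directly as tables, columns, and a relationship (the table and column names are assumptions, not from the deck):

# Minimal physical data model sketch: two related tables in SQLite.
# The schema (customer, sales_order) is hypothetical, for illustration only.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    segment     TEXT
);

CREATE TABLE sales_order (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
    order_date  TEXT NOT NULL,
    amount_usd  REAL NOT NULL
);
""")
conn.commit()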
Data Modeling
Data Modeling
Models of Data
A framework to organize and analyze data.

Predictive, Descriptive, Prescriptive Analytics


There are three types of data analysis:

Predictive (forecasting)
Descriptive (business intelligence and data mining)
Prescriptive (optimization and simulation)
Models of Data



Predictive Analytics

Predictive analytics turns data into valuable, actionable
  information. Predictive analytics uses data to determine the
  probable future outcome of an event or a likelihood of a
  situation occurring.

Predictive analytics encompasses a variety of statistical
  techniques from modeling, machine learning, data mining
  and game theory that analyze current and historical facts to
  make predictions about future events.
Models of Data



Predictive Analytics

Three basic cornerstones of predictive analytics are:

Predictive modeling
Decision Analysis and Optimization
Transaction Profiling

An example of using predictive analytics is optimizing customer
  relationship management systems. They can help enable an
  organization to analyze all customer data, thereby exposing
  patterns that predict customer behavior.
Models of Data



Predictive Analytics

Another example: for an organization that offers multiple
  products, predictive analytics can help analyze customers'
  spending, usage and other behavior, leading to efficient
  cross-selling, or selling additional products to current
  customers.

This directly leads to higher profitability per customer and
  stronger customer relationships.
Models of Data



Descriptive Analytics

Descriptive analytics looks at data and analyzes past events for
  insight as to how to approach the future. Descriptive analytics
  looks at past performance and understands that performance
  by mining historical data to look for the reasons behind past
  success or failure.

Almost all management reporting such as sales, marketing,
  operations, and finance, uses this type of post-mortem
  analysis.
Models of Data



Descriptive Analytics

Descriptive models quantify relationships in data in a way that is
  often used to classify customers or prospects into groups.
  Unlike predictive models that focus on predicting a single
  customer behavior (such as credit risk), descriptive models
  identify many different relationships between customers or
  products.

Descriptive models do not rank-order customers by their
  likelihood of taking a particular action the way predictive
  models do.
Models of Data



Descriptive Analytics

Descriptive models can be used, for example, to categorize
  customers by their product preferences and life stage.
  Descriptive modeling tools can be utilized to develop further
  models that can simulate large numbers of individualized
  agents and make predictions.

For example, descriptive analytics examines historical electricity
  usage data to help plan power needs and allow electric
  companies to set optimal prices.
Models of Data



Prescriptive Analytics

Prescriptive analytics automatically synthesizes big data,
  mathematical sciences, business rules, and machine learning
  to make predictions and then suggests decision options to
  take advantage of the predictions.

Prescriptive analytics goes beyond predicting future outcomes
  by also suggesting actions to benefit from the predictions and
  showing the decision maker the implications of each decision
  option. Prescriptive analytics not only anticipates what will
  happen and when it will happen, but also why it will happen.
Models of Data



Prescriptive Analytics

Further, prescriptive analytics can suggest decision options on
  how to take advantage of a future opportunity or mitigate a
  future risk and illustrate the implication of each decision
  option.

In practice, prescriptive analytics can continually and
   automatically process new data to improve prediction
   accuracy and provide better decision options.
Models of Data

Prescriptive Analytics

Prescriptive analytics synergistically combines data, business
  rules, and mathematical models. The data inputs to
  prescriptive analytics may come from multiple sources,
  internal (inside the organization) and external (social media).
  The data may also be structured, which includes numerical
  and categorical data, as well as unstructured data, such as
  text, images, audio, and video data, including big data.
  Business rules define the business process and include
  constraints, preferences, policies, best practices, and
  boundaries. Mathematical models are techniques derived
  from mathematical sciences and related disciplines including
  applied statistics, machine learning, operations research, and
  natural language processing.
Models of Data



Prescriptive Analytics

For example, prescriptive analytics can benefit healthcare
  strategic planning by using analytics to leverage operational
  and usage data combined with data on external factors such
  as economic data, population demographic trends and
  population health trends, to more accurately plan for future
  capital investments such as new facilities and equipment
  utilization, and to understand the trade-offs between adding
  beds to or expanding an existing facility versus building a
  new one.
Models of Data



Prescriptive Analytics

Another example is energy and utilities. Natural gas prices
  fluctuate dramatically depending upon supply, demand,
  econometrics, geo-politics, and weather conditions. Gas
  producers, transmission (pipeline) companies and utility firms
  have a keen interest in more accurately predicting gas prices
  so that they can lock in favorable terms while hedging
  downside risk.

Prescriptive analytics can accurately predict prices by modeling
  internal and external variables simultaneously and also
  provide decision options and show the impact of each
  decision option.
Analytical Technologies

Platforms

Amazon PaaS
Cloud Foundry PaaS
MS Azure
Google App Engine
IBM SmartCloud
Heroku
Analytical Technologies

Tools

Hadoop / MapReduce
R Language - Revolution Analytics
Qubole
Alteryx
Vertica
BigML
Kognitio
MS SQL Server: SSIS; SSAS
Geokettle
Analytical Technologies

Tools

ERwin Data Modeler
StrategyCompanion
Talend
Pentaho
Hortonworks
Metalab
SAS
SPSS
PSPP
Open Data Sources
Freebase


Data Hub


Numbrary


Peter Skomoroch's Delicious Data


InfoChimps


Open Data Sites


DBpedia


theinfo.org


Lending Club Statistics


MAF/TIGER (US Census Geo) Database


Reuters Corpora (RCV1, RCV2, TRC2)


Open Street Map


MusicBrainz


Jigsaw


Opentick
Open Data Sources


Historical Data, Yahoo Finance
Historical Foreign Exchange Data, Federal Reserve Bank of New York
Graduate School of Business, Stanford University
Proprietary Trading Articles & Resources
Wilmott.com
DefaultRisk.com, Credit Risk Modeling Resource: Papers, Books, Conferences, Jobs
Forex Factory, Forums
NBER Papers in Asset Pricing: Stocks, Bonds and Foreign Currency
Financial Engineering Books, International Association of Financial Engineers
Open Data Sources
•   Literacy, Gross Domestic Product, Income and Military Expenditures for 154 Countries
•   Continent Codes for Countries
•   Source: Various Wikipedia Articles
•   Daily Precipitation, Min and Max Temperatures for Berkeley for the first 10 months of 2005
•   Source: http://hurricane.ncdc.noaa.gov/dly/DLY
•   Release Dates and Box Office Earnings for Top Movies
•   Source: http://www.movieweb.com/movies/boxoffice/alltime.php
•   See Also: http://imdb.com/Top/
•   Bush-Kerry Election Results 2004
•   US State Population, 2003 and 2004
•   Source: http://www.factmonster.com/ipka/A0004986.html
•   Information about Cars (1978-1979)
•   Diabetes in Pima Indians
•   Information about Diabetes data source: http://www.ics.uci.edu/~mlearn/MLRepository.html
•   Updated world data with new variables
•   Wine Recognition Data
•   Information about Wine data source: http://www.ics.uci.edu/~mlearn/MLRepository.html
•   Nutritional Information about Crackers source: http://www.math.csi.cuny.edu/st/Projects
•   XML Plant Catalog source: http://www.w3schools.com/xml/
•   US Wheat Production 1910-2004 source: http://usda.mannlib.cornell.edu/data-sets/crops/88008/
•   Birthdays and Terms of US Senators source: Wikipedia
•   Weight and Sleep Information of Various Animals
•   Information about Sleep Data Set
•   SQLite Album database
•   Iron dataset
Eight Levels of Analytics
Statistical Analysis
Statistical Analysis answers the questions: Why is this
  happening? What opportunities am I missing?

Example: Banks can discover why an increasing number of
  customers are refinancing their homes.
Here we can begin to run some complex analytics, like
  frequency models and regression analysis. We can
  begin to look at why things are happening using the
  stored data and then begin to answer questions based
  on the data.
Forecasting
Forecasting answers the questions: What if these trends
  continue? How much is needed? When will it be
  needed?

Example: Retailers can predict how demand for individual
  products will vary from store to store.
Forecasting is one of the hottest markets – and hottest
  analytical applications – right now. It applies everywhere.
  In particular, forecasting demand helps supply just
  enough inventory, so you don’t run out or have too
  much.
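
As a hedged illustration, a minimal demand-forecasting sketch using a trailing moving average per store (the sales.csv file and its columns are assumptions for the example, not a specific forecasting product):

# Minimal forecasting sketch: project next month's demand per store
# as the trailing three-month average. Assumes a hypothetical sales.csv
# with columns: store, month, units_sold.
import pandas as pd

sales = pd.read_csv("sales.csv")

forecast = (
    sales.sort_values("month")
         .groupby("store")["units_sold"]
         .apply(lambda s: s.tail(3).mean())   # trailing 3-month average
         .rename("forecast_units")
)
print(forecast.head())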
Predictive Modeling


Predictive Modeling answers the questions: What will
  happen next? How will it affect my business?

Example: Hotels and casinos can predict which VIP
  customers will be more interested in particular vacation
  packages. If you have 10 million customers and want to
  do a marketing campaign, who's most likely to respond?
  How do you segment that group? And how do you
  determine who's most likely to leave your organization?
  Predictive modeling provides the answers.
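
A minimal sketch of that kind of response model, assuming a hypothetical customers.csv with outcomes from a previous campaign (the file and column names are illustrative):

# Minimal sketch: score customers by likelihood of responding to a campaign.
# Assumes a hypothetical customers.csv with numeric feature columns and a
# 0/1 "responded" column from a previous campaign.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

data = pd.read_csv("customers.csv")
X = data.drop(columns=["responded"])
y = data["responded"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# Probability of response for each held-out customer; sort these scores
# to build the campaign target list or to segment the base.
scores = model.predict_proba(X_test)[:, 1]
print("top-decile score cutoff:", pd.Series(scores).quantile(0.9))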
Optimization


Optimization answers the question: How do we do things
  better? What is the best decision for a complex problem?

Example: Given business priorities, resource constraints
  and available technology, determine the best way to
  optimize your IT platform to satisfy the needs of every
  user.

Optimization supports innovation. It takes your resources
  and needs into consideration and helps you find the best
  possible way to accomplish your goals.
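
A hedged sketch of what an optimization step can look like in code, here a toy resource-allocation linear program (all costs, limits, and requirements are made-up numbers for illustration):

# Minimal optimization sketch: allocate compute hours across two platforms
# to meet a total-capacity requirement at minimum cost. Numbers are illustrative.
from scipy.optimize import linprog

# Minimize cost = 3*x1 + 5*x2  (cost per hour on platform 1 and platform 2).
c = [3, 5]

# Constraint: x1 + x2 >= 100 hours in total  ->  -x1 - x2 <= -100.
A_ub = [[-1, -1]]
b_ub = [-100]

# Each platform can supply at most 80 hours.
bounds = [(0, 80), (0, 80)]

result = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
print(result.x, result.fun)   # optimal allocation and its total cost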
Conceptual Modeling


Conceptual Modeling brings together the business and
  technology views to define the solution scope.

It is more than technical architecture or data context
    diagrams. Technical architecture and data context
    diagrams have their place, but the critical skill is the
    business view (vs. technical view) of the solution scope.

This is critical to engaging stakeholders and setting the
  stage for innovation.
Statistical Models
Nonparametric Tests
T-test
ANOVA & MANOVA
ANCOVA & MANCOVA
Linear Regression
Generalized Least Squares
Ridge Regression
Lasso
Generalized Linear Models
Mixed Effects Models
Statistical Models
Logistic Regression
Nonlinear Regression
Discriminant Analysis
Nearest Neighbor
Factor & Principal Components Analysis
Copula Models
Cross-Validation
Bayesian Statistics
Monte Carlo, Classic Methods
Markov Chain Monte Carlo
Statistical Models
Bootstrap & Jackknife
EM Algorithm
Missing Data Imputation
Outlier Diagnostics
Robust Estimation
Longitudinal (Panel) Data
Survival Analysis
Path Analysis
Propensity Score Matching
Stratified Samples (Survey Data)
Statistical Models
Experimental Design
Quality Control
Reliability Theory
Univariate Time Series
Multivariate Time Series
Markov Chains
Hidden Markov Models
Stochastic Volatility Models
Diffusions
Counting Processes
Statistical Models
Filtering
Instrumental Variables
Simultaneous Equations
Splines
Nonparametric Smoothing Methods
Extreme Value Theory
Variance Stabilization
Cluster Analysis
Neural Networks
Classification & Regression Trees
Statistical Models
Boosting Classification & Regression Trees
Random Forests
Support Vector Machines
Signal Processing
Wavelet Analysis
ROC Curves
Optimization
Statistical Models
Two simple yet powerful models:

Generalized Linear Regression Model

Random Forests

Suggestion: Keep it simple for the first use case.
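
A minimal sketch of fitting both on the same data, assuming a hypothetical usecase.csv with numeric feature columns and a 0/1 "outcome" column (names are illustrative):

# Minimal sketch: fit a generalized linear model (logistic regression) and a
# random forest on the same data and compare held-out accuracy.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

data = pd.read_csv("usecase.csv")
X, y = data.drop(columns=["outcome"]), data["outcome"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

glm = LogisticRegression(max_iter=1000).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_train, y_train)

print("GLM accuracy:   ", glm.score(X_test, y_test))
print("Forest accuracy:", forest.score(X_test, y_test))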
Predictive Modeling Techniques
Predictive Modeling Techniques

Problems with some predictive modeling techniques. Note that most of these techniques have evolved
    over time (in the last 10 years) to the point where most drawbacks have been eliminated, making
    the updated tools far different from and better than their original versions. Even so, the flawed
    original versions are still widely used.

   1. Linear regression. Relies on normality, homoscedasticity and other assumptions; does not
      capture highly non-linear, chaotic patterns. Prone to over-fitting. Parameters difficult to
      interpret. Very unstable when independent variables are highly correlated. Fixes: variable
      reduction, applying a transformation to your variables, or using constrained regression (e.g.
      ridge or Lasso regression; see the sketch after this list).
   2. Traditional decision trees. Very large decision trees are very unstable and impossible to
      interpret, and prone to over-fitting. Fix: combine multiple small decision trees together instead
      of using a large decision tree.
   3. Linear discriminant analysis. Used for supervised clustering. Bad technique because it
      assumes that clusters do not overlap, and are well separated by hyper-planes. In practice, they
      never do. Use density estimation techniques instead.
   4. K-means clustering. Used for clustering, tends to produce circular clusters. Does not work well
      with data points that are not a mixture of Gaussian distributions.
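
A minimal sketch of the constrained-regression fix mentioned in item 1, using ridge regression on highly correlated predictors (the synthetic data is purely illustrative):

# Minimal sketch: ridge regression as a fix for highly correlated predictors.
# Synthetic data for illustration: x2 is almost a copy of x1.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)      # nearly collinear with x1
y = 2.0 * x1 + rng.normal(scale=0.5, size=n)
X = np.column_stack([x1, x2])

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("OLS coefficients:  ", ols.coef_)    # typically large and unstable
print("Ridge coefficients:", ridge.coef_)  # shrunk toward a stable split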
Predictive Modeling Techniques


5. Neural networks. Difficult to interpret, unstable, subject to over-fitting.
6. Maximum Likelihood estimation. Requires your data to fit with a prespecified probabilistic
    distribution. Not data-driven. In many cases the pre-specified Gaussian distribution is a terrible fit
    for your data.
7. Density estimation in high dimensions. Subject to what is referred to as the curse of dimensionality.
    Fix: use (non parametric) kernel density estimators with adaptive bandwidths.
8. Naive Bayes. Used e.g. in fraud and spam detection, and for scoring. Assumes that variables are
    independent; if they are not, it fails miserably. In the context of fraud or spam detection, variables
    (sometimes called rules) are highly correlated. Fix: group variables into independent clusters of
    variables (in each cluster, variables are highly correlated) and apply naive Bayes to the clusters,
    or use data reduction techniques (see the sketch below). Bad text mining techniques (e.g. basic
    "word" rules in spam detection) combined with naive Bayes produce absolutely terrible results
    with many false positives and false negatives.
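
A minimal sketch of the data-reduction variant of that fix: decorrelate the features with PCA before applying naive Bayes (the spam.csv file and its columns are assumptions for the example):

# Minimal sketch: reduce correlated features with PCA, then apply naive Bayes.
# Assumes a hypothetical spam.csv with at least 10 numeric feature columns
# and a 0/1 "is_spam" label column.
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline

data = pd.read_csv("spam.csv")
X, y = data.drop(columns=["is_spam"]), data["is_spam"]

# PCA yields uncorrelated components, which suits the independence assumption better.
model = make_pipeline(PCA(n_components=10), GaussianNB())
print(cross_val_score(model, X, y, cv=5).mean())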


And remember to use sound cross-validation techniques when testing models!
Predictive Modeling Techniques

Poor cross-validation allows bad models to make the cut by over-estimating
   the true lift to be expected on future data, the true accuracy or the true ROI
   outside the training set. Good cross-validation consists of:

  •   splitting your training set into multiple subsets (test and control subsets),
  •   including different types of clients and more recent data in the control sets
      (than in your test sets),
  •   checking the quality of forecasted values on the control sets,
  •   computing confidence intervals for individual errors (error defined e.g. as
      |true value minus forecasted value|) to make sure that the error is small
      enough AND not too volatile (it has small variance across all control sets).
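
A minimal sketch of that procedure using k-fold splits and a rough confidence interval on the absolute errors (the training.csv file and its columns are assumptions for the example):

# Minimal sketch: k-fold cross-validation with a confidence interval on |error|.
# Assumes a hypothetical training.csv with numeric features and a "target" column.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

data = pd.read_csv("training.csv")
X, y = data.drop(columns=["target"]).to_numpy(), data["target"].to_numpy()

fold_errors = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    abs_err = np.abs(y[test_idx] - model.predict(X[test_idx]))
    fold_errors.append(abs_err.mean())

errors = np.array(fold_errors)
# Check that the error is both small and stable across control sets.
print("MAE per fold:", errors)
print("mean +/- 2*sd:", errors.mean(), "+/-", 2 * errors.std())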
Statistical Software
Almost all serious statistical analysis is done in one of the
  following packages: R (SPlus), Matlab, SAS, SPSS and
  Stata.

That does not mean each of those packages is good for every
    specific type of analysis. In fact, for most advanced areas,
    only 2-3 packages will be suitable, providing enough
    functionality or enough tools to implement this functionality
    easily.

For example, Markov chain Monte Carlo, a very important area,
  is doable in R, Matlab and SAS only, unless you want
  to rely on convoluted macros written by random users on
  the web.
Statistical Software
R & MATLAB

R and Matlab are the richest systems by far. They contain an
  impressive amount of libraries, which is growing each day.
  Even if a desired very specific model is not part of the
  standard functionality, you can implement it yourself,
  because R and Matlab are really programming languages
  with relatively simple syntaxes. As "languages" they allow
  you to express any idea. The question is whether you are a
  good writer or not. In terms of modern applied statistics
  tools, R libraries are somewhat richer than those of Matlab.
  Also R is free. On the flip side, Matlab has much better
  graphics, which you will not be ashamed to put in a paper or
  a presentation.
Statistical Software
SPSS

On the other end of the spectrum is a package like SPSS. SPSS is quite narrow in
   its capabilities and allows you to do only about half of the mainstream statistics.
   It is quite useless for ambitious modeling and estimation procedures which are
   part of kernel smoothing, pattern recognition or signal processing. Nonetheless,
   SPSS is very popular among practitioners because it requires almost no
   programming training. All you have to do is hit several buttons and
   SPSS does all the calculations for you. In those cases when you need
   something standard, SPSS may have it implemented fully. The SPSS output will
   be quite detailed and visually pleasing. It will contain all the major tests and
   diagnostic tools associated with the method and will allow you to write an
   informative statistics section of your empirical analysis. In short, when the
   method is there, it is faster to run than a similar functionality in R or Matlab. So I
   use SPSS often for standard requests from my clients, like running linear
   regression, ANOVA or principal components analysis. SPSS gives you the
   ability to program macros, but that feature is quite inflexible.
Statistical Software
SAS & STATA


Somewhere in-between R, Matlab and SPSS lie SAS and Stata. SAS offers more
  extensive analytics than Stata. It is composed of dozens of procedures with
  massive output, often covering more than ten pages. The idea of SAS
  is not to listen to you that much. It is like an old grandfather whom you approach
  with a simple question, but who instead tells you the story of his life. Many
  procedures contain three times more than what you need to know about that
  segment, so some time has to be spent filtering the output for what is relevant. SAS
  procedures are invoked using simple scripts. Stata procedures can be invoked
  by clicking buttons in the menu or by running simple scripts. In the menu part,
  Stata resembles SPSS. Both SAS and Stata are programming languages, so
  they allow you to build analytics around standard procedures. Stata is somewhat
  more flexible than SAS. Still, in terms of programming flexibility, Stata and SAS
  do not come even close to R or Matlab. Selected strengths of SAS compared to
  all other packages: large data sets, speed, beautiful graphics, flexibility in
  formatting the output, time series procedures, counting processes. Selected
  strengths of Stata compared to all other packages: manipulation of survey data
  (stratified samples, clustering), robust estimation and tests, longitudinal data
  methods, multivariate time series.
Statistical Software
Useful Resources:



American Statistical Association

Department of Statistics, Stanford University

Elementary Statistics Books Available to Download for Free
Statistical Software
•   Downloading R
•   R Manuals (at CRAN)
•   Accessing the SCF Remotely (includes how to get the necessary software)
•   Class Bulletin Board (bspace)
•   Driver to convert Windows Documents to PDF
•   Introduction to R (pdf)
•   Slides for a Course in R (pdf)
•   R Graph Gallery
•   statsnetbase Search for R Graphics to read Paul Murrell's book about plotting in R
•   Some Notes on Saving Plots in R
•   Free Graphical MySQL Client
•   SQLite Graphical Client for Windows
•   Instructions on running the Firefox SQLiteManager extension as an application on Mac OSX
•   Accessing the Class MySQL Server through an SSH Tunnel
•   Connecting to the MySQL server under Windows
•   Introduction to Cluster Analysis (statsoft.nl)
•   Fruit pictures for the "slot machine" (zipped)
•   R TclTk examples
•   More R TclTk examples
•   Additional GUI examples: Deal or No Deal Piano
•   HTML Form Tutorial
•   Setting up your account for CGI scripting
•   Running your own Webserver to test CGI programs (Mac & Linux)
•   Notes on Document Preparation with Latex
•   vi reference card
•   emacs reference card
•   R reference card
•   More information on Dates and Times in R
•   More information on Factors in R
Statistical Software
Books
  •   Competing on Analytics
  •   Analytics at Work
  •   Super Crunchers
  •   The Numerati
  •   Data Driven
  •   Data Source Handbook
  •   Programming Collective Intelligence
  •   Mining the Social Web
  •   Data Analysis with Open Source Tools
  •   Visualizing Data
  •   The Visual Display of Quantitative Information
  •   Envisioning Information
  •   Visual Explanations: Images and Quantities, Evidence and Narrative
  •   Beautiful Evidence
  •   Think Stats
  •   Data Analysis Using Regression and Multilevel/Hierarchical Models
  •   Applied Longitudinal Data Analysis
  •   Design of Observational Studies
  •   Statistical Rules of Thumb
  •   All of Statistics
  •   A Handbook of Statistical Analyses Using R
  •   Mathematical Statistics and Data Analysis
  •   The Elements of Statistical Learning
  •   Counterfactuals and Causal Inference
Statistical Software

    •   Mining of Massive Data Sets
    •   Data Analysis: What Can Be Learned From the Past 50 Years
    •   Bias and Causation
    •   Regression Modeling Strategies
    •   Probably Not
    •   Statistics as Principled Argument
    •   The Practice of Data Analysis


Great class notes on Data Science: http://statistics.berkeley.edu/classes/s133/all2011.pdf


Related Workshops

    •   Data Bootcamp, Strata 2011
    •   Machine Learning Summer School, Purdue 2011
    •   Looking at Data
Statistical Software
Courses
  •   Concepts in Computing with Data, Berkeley
  •   Practical Machine Learning, Berkeley
  •   Artificial Intelligence, Berkeley
  •   Visualization, Berkeley
  •   Data Mining and Analytics in Intelligent Business Services, Berkeley
  •   Data Science and Analytics: Thought Leaders, Berkeley
  •   Machine Learning, Stanford
  •   Paradigms for Computing with Data, Stanford
  •   Mining Massive Data Sets, Stanford
  •   Data Visualization, Stanford
  •   Algorithms for Massive Data Set Analysis, Stanford
  •   Research Topics in Interactive Data Analysis, Stanford
  •   Data Mining, Stanford
  •   Machine Learning, CMU
  •   Statistical Computing, CMU
  •   Machine Learning with Large Datasets, CMU
  •   Machine Learning, MIT
  •   Data Mining, MIT
  •   Statistical Learning Theory and Applications, MIT
  •   Data Literacy, MIT
  •   Introduction to Data Mining, UIUC
  •   Learning from Data, Caltech
  •   Introduction to Statistics, Harvard
  •   Data-Intensive Information Processing Applications, University of Maryland
Statistical Software
  •   Dealing with Massive Data, Columbia
  •   Data-Driven Modeling, Columbia
  •   Introduction to Data Mining and Analysis, Georgia Tech
  •   Computational Data Analysis: Foundations of Machine Learning and Da..., Georgia Tech
  •   Applied Statistical Computing, Iowa State
  •   Data Visualization, Rice
  •   Data Warehousing and Data Mining, NYU
  •   Data Mining in Engineering, Toronto
  •   Machine Learning and Data Mining, UC Irvine
  •   Knowledge Discovery from Data, Cal Poly
  •   Large Scale Learning, University of Chicago
  •   Data Science: Large-scale Advanced Data Analysis, University of Florida
  •   Strategies for Statistical Data Analysis, Universität Leipzig




Videos

  •   Lies, damned lies and statistics (about TEDTalks)
  •   The Joy of Stats
  •   Journalism in the Age of Data
Data Science Team Ideas
Keep it simple!

Work on a real problem from work.

Suggestions for more challenging problems:

Census Return Rate
Develop a statistical model to predict census mail return rates at the Census block
  group level of geography. The Census Bureau will use this model for planning
  purposes for the decennial census and for demographic sample surveys.

Develop and evaluate different statistical approaches to propose the best
  predictive model for geographic units. The intent is to improve current predictive
  analytics.
Data Science Team Ideas
Hierarchical load forecasting problem: backcasting and forecasting hourly
   loads (in kW) for a US utility with 20 zones.

Backcast and forecast at both the zonal level (20 series) and the system level (sum of the 20
   zonal series), for a total of 21 series. The data history (loads of 20 zones and
   temperature of 11 stations) ranges from the 1st hour of 2004/1/1 to the
   6th hour of 2008/6/30. Given the actual temperature history, the 8 weeks below in
   the load history are set to be missing and are required to be backcast. It's OK
   to use the entire history to backcast these 8 weeks.
2005/3/6 - 2005/3/12;
2005/6/20 - 2005/6/26;
2005/9/10 - 2005/9/16;
2005/12/25 - 2005/12/31;
2006/2/13 - 2006/2/19;
2006/5/25 - 2006/5/31;
2006/8/2 - 2006/8/8;
2006/11/22 - 2006/11/28;
Need to forecast hourly loads from 2008/7/1 to 2008/7/7. No actual temperatures are provided for this forecast week.
Data Science Team Ideas
Wind power forecasting problem: predicting hourly power generation up to 48 hours ahead at 7 wind
   farms


Based on historical measurements and additional wind forecast information (48-hour ahead predictions of
     wind speed and direction at the sites). The data is available for a period ranging from the 1st hour of
     2009/7/1 to the 12th hour of 2012/6/28.
The period between 2009/7/1 and 2010/12/31 is a model identification and training period, while the
     remainder of the dataset, from 2011/1/1 to 2012/6/28, is there for evaluation. The training
     period is to be used for designing and estimating models that permit predicting wind power
     generation at lead times from 1 to 48 hours ahead, based on past power observations and/or available
     meteorological wind forecasts for that period. The evaluation part aims at mimicking real
     operational conditions. For that, a number of 48-hour periods with missing power observations were
     defined. All these power observations are to be predicted. These periods are defined as follows. The
     first period with missing observations is from 2011/1/1 at 01:00 until 2011/1/3 at 00:00. The second
     period with missing observations is from 2011/1/4 at 13:00 until 2011/1/6 at 12:00. Note that, to be
     consistent, only the meteorological forecasts for each period that would actually be available in practice
     are given. These two periods then repeat every 7 days until the end of the dataset. In between periods
     with missing data, power observations are available for updating the models.
Data Science Team Ideas
Predict the online sales of a consumer product based on a data set of product features.


Build as good a model as possible to predict monthly online sales of a product.
   Imagine the products are online self-help programs following an initial
   advertising campaign.
Obtain data in the comma separated values (CSV) format. Each row in this data set
   represents a different consumer product.
The first 12 columns (Outcome_M1 through Outcome_M12) contain the monthly
   online sales for the first 12 months after the product launches.
Date_1 is the day number the major advertising campaign began and the product
   launched.
Date_2 is the day number the product was announced and a pre-release
   advertising campaign began.
Other columns in the data set are features of the product and the advertising
   campaign. Quan_x are quantitative variables and Cat_x are categorical
   variables. Binary categorical variables are measured as (1) if the product had
   the feature and (0) if it did not.
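
A minimal sketch of a first model for this problem, predicting the first month of sales from the product features (the train.csv file name is an assumption; the column names follow the description above, and the features are assumed to be numerically coded):

# Minimal sketch: predict first-month online sales (Outcome_M1) from the
# product and campaign features described above.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

data = pd.read_csv("train.csv")

outcome_cols = [f"Outcome_M{i}" for i in range(1, 13)]
y = data["Outcome_M1"]
X = data.drop(columns=outcome_cols).fillna(0)   # crude handling of missing values

model = RandomForestRegressor(n_estimators=300, random_state=0)
print(cross_val_score(model, X, y, cv=5, scoring="r2").mean())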
Data Science Team Ideas
Improve on the state of the art in credit scoring by predicting the probability that somebody will
   experience financial distress in the next two years.


Banks play a crucial role in market economies. They decide who can get finance
   and on what terms and can make or break investment decisions. For markets
   and society to function, individuals and companies need access to credit.

Credit scoring algorithms, which make a guess at the probability of default, are the
   method banks use to determine whether or not a loan should be granted.
   Improve on the state of the art in credit scoring, by predicting the probability that
   somebody will experience financial distress in the next two years.

The goal is to build a model that borrowers can use to help make the best financial
   decisions. Obtain historical data on 250,000 borrowers.
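
A minimal sketch of a baseline credit-scoring model for that kind of data (the credit.csv file name and the 0/1 distress_2yrs target column are assumptions for the example):

# Minimal sketch: baseline credit-scoring model that outputs a probability of
# financial distress within two years.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

data = pd.read_csv("credit.csv").fillna(0)
X, y = data.drop(columns=["distress_2yrs"]), data["distress_2yrs"]

model = LogisticRegression(max_iter=1000)
# AUC is the usual yardstick for how well the model rank-orders borrowers by risk.
print(cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean())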
Data Tools

Included is a list of tools, such as programming languages and web-based utilities,
    data mining resources, some prominent organizations in the field, repositories
    where you can play with data, events you may want to attend and important
    articles you should take a look at.


The second segment of the list includes a number of art and design resources that
   infographic designers might like, including color palette generators and image
   searches. There are also some invisible web resources (if you’re looking for
   something on Google and not finding it) and metadata resources so you can
   appropriately curate your data.
Data Tools
Google Refine – A power tool for working with messy data (formerly Freebase Gridworks)
The Overview Project – Overview is an open-source tool to help journalists find stories in large amounts of data by cleaning, visualizing and
     interactively exploring large document and data sets. Whether from government transparency initiatives, leaks or Freedom of Information
     requests, journalists are drowning in more documents than they can ever hope to read.
Refine, reuse and request data | ScraperWiki – ScraperWiki is an online tool to make acquiring useful data simpler and more collaborative. Anyone
     can write a screen scraper using the online editor. In the free version, the code and data are shared with the world. Because it’s a wiki, other
     programmers can contribute to and improve the code.
Data Curation Profiles – This website is an environment where academic librarians of all kinds, special librarians at research facilities, archivists
     involved in the preservation of digital data, and those who support digital repositories can find help, support and camaraderie in exploring avenues
     to learn more about working with research data and the use of the Data Curation Profiles Tool.
Google Chart Tools – Google Chart Tools provide a perfect way to visualize data on your website. From simple line charts to complex hierarchical
     tree maps, the chart galley provides a large number of well-designed chart types. Populating your data is easy using the provided client- and
     server-side tools.
22 free tools for data visualization and analysis
The R Journal – The R Journal is the refereed journal of the R project for statistical computing. It features short to medium length articles covering
      topics that might be of interest to users or developers of R.
CS 229: Machine Learning – A widely referenced course by Professor Andrew Ng, CS 229: Machine Learning provides a broad introduction to
     machine learning and statistical pattern recognition. Topics include supervised learning, unsupervised learning, learning theory, reinforcement
     learning and adaptive control. Recent applications of machine learning, such as to robotic control, data mining, autonomous navigation,
     bioinformatics, speech recognition, and text and web data processing are also discussed.
Google Research Publication: BigTable – Bigtable is a distributed storage system for managing structured data that is designed to scale to a very
     large size: petabytes of data across thousands of commodity servers. Many projects at Google store data in Bigtable, including web indexing,
     Google Earth, and Google Finance. These applications place very different demands on Bigtable, both in terms of data size (from URLs to web
     pages to satellite imagery) and latency requirements (from backend bulk processing to real-time data serving). Despite these varied demands,
     Bigtable has successfully provided a flexible, high-performance solution for all of these Google products. In this paper we describe the simple data
     model provided by Bigtable, which gives clients dynamic control over data layout and format, and we describe the design and implementation of
     Bigtable.
Scientific Data Management – An introduction.
Natural Language Toolkit – Open source Python modules, linguistic data and documentation for research and development in natural language
    processing and text analytics, with distributions for Windows, Mac OSX and Linux.
Beautiful Soup – Beautiful Soup is a Python HTML/XML parser designed for quick turnaround projects like screen-scraping.
Mondrian: Pentaho Analysis – Pentaho Open source analysis OLAP server written in Java. Enabling interactive analysis of very large datasets stored
Data Tools
The Comprehensive R Archive Network - R is `GNU S’, a freely available language and environment for statistical computing and graphics which
      provides a wide variety of statistical and graphical techniques: linear and nonlinear modelling, statistical tests, time series analysis, classification,
      clustering, etc. Please consult the R project homepage for further information. CRAN is a network of ftp and web servers around the world that
      store identical, up-to-date, versions of code and documentation for R. Please use the CRAN mirror nearest to you to minimize network load.
DataStax – Software, support, and training for Apache Cassandra.
Machine Learning Demos
Visual.ly – Infographics & Visualizations. Create, Share, Explore
Google Fusion Tables - Google Fusion Tables is a modern data management and publishing web application that makes it easy
to host, manage, collaborate on, visualize, and publish data tables online.
Tableau Software - Fast Analytics and Rapid-fire Business Intelligence from Tableau Software.
WaveMaker - WaveMaker is a rapid application development environment for building, maintaining and modernizing business-critical Web 2.0
      applications.
Visualization: Annotated Time Line – Google Chart Tools – Google Code - An interactive time series line chart with optional annotations. The
      chart is rendered within the browser using Flash.
Visualization: Motion Chart – Google Chart Tools – Google Code - A dynamic chart to explore several indicators over time. The chart is rendered
     within the browser using Flash.
PhotoStats - Create gorgeous infographics about your iPhone photos.
Ionz - Ionz will help you craft an infographic about yourself.
chart builder - Powerful tools for creating a variety of charts for online display.
Creately - Online diagramming and design.
Pixlr Editor - A powerful online photo editor.
Google Public Data Explorer - The Google Public Data Explorer makes large datasets easy to explore, visualize and communicate. As the charts and
      maps animate over time, the changes in the world become easier to understand. You don’t have to be a data expert to navigate between different
      views, make your own comparisons, and share your findings.
Fathom -Fathom Information Design helps clients understand and express complex data through information graphics, interactive tools, and software
      for installations, the web, and mobile devices. Led by Ben Fry. Enough said!
healthymagination | GE Data Visualization - Visualizations that advance the conversation about issues that shape our lives, and so we encourage
      visitors to download, post and share these visualizations.
ggplot2 - ggplot2 is a plotting system for R, based on the grammar of graphics, which tries to take the good parts of base and lattice graphics and none
      of the bad parts. It takes care of many of the fiddly details that make plotting a hassle (like drawing legends) as well as providing a powerful model
      of graphics that makes it easy to produce complex multi-layered graphics.
Data Tools
Protovis - Protovis composes custom views of data with simple marks such as bars and dots. Unlike low-level graphics libraries that quickly become
      tedious for visualization, Protovis defines marks through dynamic properties that encode data, allowing inheritance, scales and layouts to
      simplify construction. Protovis is free and open-source, provided under the BSD License. It uses JavaScript and SVG for web-native
      visualizations; no plugin required (though you will need a modern web browser)! Protovis is mostly declarative and designed to be learned by
      example.
d3.js - D3.js is a small, free JavaScript library for manipulating documents based on data.
MATLAB – The Language Of Technical Computing - MATLAB® is a high-level language and interactive environment that enables you to perform
    computationally intensive tasks faster than with traditional programming languages such as C, C++, and Fortran.
OpenGL – The Industry Standard for High Performance Graphics - OpenGL.org is a vendor-independent and organization-independent web site
    that acts as one-stop hub for developers and consumers for all OpenGL news and development resources. It has a very large and continually
    expanding developer and end-user community that is very active and vested in the continued growth of OpenGL.
Google Correlate - Google Correlate finds search patterns which correspond with real-world trends.
Revolution Analytics – Commercial Software & Support for the R Statistics Language - Revolution Analytics delivers advanced analytics
      software at half the cost of existing solutions. By building on open source R—the world’s most powerful statistics software—with innovations in big
      data analysis, integration and user experience, Revolution Analytics meets the demands and requirements of modern data-driven businesses.
22 Useful Online Chart & Graph Generators
The Best Tools for Visualization - Visualization is a technique to graphically represent sets of data. When data is large or abstract, visualization can
     help make the data easier to read or understand. There are visualization tools for search, music, networks, online communities, and almost
     anything else you can think of. Whether you want a desktop application or a web-based tool, there are many specific tools are available on the
     web that let you visualize all kinds of data.
Visual Understanding Environment - The Visual Understanding Environment (VUE) is an Open Source project based at Tufts University. The
     VUE project is focused on creating flexible tools for managing and integrating digital resources in support of teaching, learning and research. VUE
     provides a flexible visual environment for structuring, presenting, and sharing digital information.
Bime – Cloud Business Intelligence | Analytics & Dashboards - Bime is a revolutionary approach to data analysis and dashboarding. It allows you
     to analyze your data through interactive data visualizations and create stunning dashboards from the Web.
Data Science Toolkit - A collection of data tools and open APIs curated by our own Pete Warden. You can use it to extract text from a document, learn
     the political leanings of a particular neighborhood, find all the names of people mentioned in a text and more.
BuzzData - BuzzData lets you share your data in a smarter, easier way. Instead of juggling versions and overwriting files, use BuzzData and enjoy a
     social network designed for data.
SAP – SAP Crystal Solutions: Simple, Affordable, and Open BI Tools for Everyday Use
Project Voldemort
Data Tools
Data Mining


 1. Weka – Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can
    either be applied directly to a dataset or called from your own Java code. Weka contains tools for data
    pre-processing, classification, regression, clustering, association rules, and visualization. It is also
    well-suited for developing new machine learning schemes. Weka is open source software issued
    under the GNU General Public License.
 2. PSPP – PSPP is a program for statistical analysis of sampled data. It is a Free replacement for the
    proprietary program SPSS, and appears very similar to it with a few exceptions. The most important of
    these exceptions is that there are no “time bombs”; your copy of PSPP will not “expire” or
    deliberately stop working in the future. Neither are there any artificial limits on the number of cases or
    variables which you can use. There are no additional packages to purchase in order to get “advanced”
    functions; all functionality that PSPP currently supports is in the core package. PSPP can perform
    descriptive statistics, T-tests, linear regression and non-parametric tests. Its backend is designed to
    perform its analyses as fast as possible, regardless of the size of the input data. You can use PSPP
    with its graphical interface or the more traditional syntax commands.
Data Tools
3. Rapid-I – Rapid-I provides software, solutions, and services in the fields of predictive analytics, data
    mining, and text mining. The company concentrates on automatic intelligent analyses on a large-scale
    base, i.e. for large amounts of structured data like database systems and unstructured data like texts.
    The open-source data mining specialist Rapid-I enables other companies to use leading-edge
    technologies for data mining and business intelligence. The discovery and leverage of unused business
    intelligence from existing data enables better informed decisions and allows for process
    optimization. The main product of Rapid-I, the data analysis solution RapidMiner, is the world-leading
    open-source system for knowledge discovery and data mining. It is available as a stand-alone
    application for data analysis and as a data mining engine which can be integrated into own products. By
    now, thousands of applications of RapidMiner in more than 30 countries give their users a competitive
    edge. Among the users are well-known companies as Ford, Honda, Nokia, Miele, Philips, IBM, HP,
    Cisco, Merrill Lynch, BNP Paribas, Bank of America, mobilkom austria, Akzo Nobel, Aureus Pharma,
    PharmaDM, Cyprotex, Celera, Revere, LexisNexis, Mitre and many medium-sized businesses
    benefitting from the open-source business model of Rapid-I.
Data Tools
4. R Project – R is a language and environment for statistical computing and graphics. It is a GNU
    project which is similar to the S language and environment which was developed at Bell Laboratories
    (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. R can be considered as
    a different implementation of S. There are some important differences, but much code written for S runs
    unaltered under R. R provides a wide variety of statistical (linear and nonlinear modelling, classical
    statistical tests, time-series analysis, classification, clustering, …) and graphical techniques, and is
    highly extensible. The S language is often the vehicle of choice for research in statistical methodology,
    and R provides an Open Source route to participation in that activity. One of R’s strengths is the ease
    with which well-designed publication-quality plots can be produced, including mathematical symbols
    and formulae where needed. Great care has been taken over the defaults for the minor design choices
    in graphics, but the user retains full control. R is available as Free Software under the terms of the Free
    Software Foundation’s GNU General Public License in source code form. It compiles and runs on a
    wide variety of UNIX platforms and similar systems (including FreeBSD and Linux), Windows and
    MacOS.
Data Tools
Organizations


 1.  Data.gov
 2.  SDM group at LBNL
 3.  Open Archives Initiative
 4.  Code for America | A New Kind of Public Service
 5.  The # DataViz Daily
 6.  Institute for Advanced Analytics | North Carolina State University | Professor
     Michael Rappa · MSA Curriculum
 7. BuzzData | Blog, 25 great links for data-lovin’ journalists
 8. MetaOptimize – Home – Machine learning, natural language processing,
     predictive analytics, business intelligence, artificial intelligence, text analysis,
     information retrieval, search, data mining, statistical modeling, and data
     visualization
 9. had.co.nz
 10. Measuring Measures – Measuring Measures
Data Tools
Repositories


 1. Repositories | DataCite
 2. Data | The World Bank
 3. Infochimps Data Marketplace + Commons: Download Sell or Share Databases,
     statistics, datasets for free | Infochimps
 4. Factual Home – Factual
 5. Flowing Media: Your Data Has Something To Say
 6. Chartsbin
 7. Public Data Explorer
 8. StatPlanet
 9. ManyEyes
 10. 25+ more ways to bring data into R
Data Tools
Articles


 1. Data Science: a literature review | (R news & tutorials)
 2. What is “Data Science” Anyway?
 3. Hal Varian on how the Web challenges managers – McKinsey Quarterly –
    Strategy – Innovation
 4. The Three Sexy Skills of Data Geeks « Dataspora
 5. Rise of the Data Scientist
 6. dataists » A Taxonomy of Data Science
 7. The Data Science Venn Diagram « Zero Intelligence Agents
 8. Revolutions: Growth in data-related jobs
 9. Building data startups: Fast, big, and focused – O’Reilly Radar
Data Tools
Art Design


 1.   Periodic Table of Typefaces
 2.   Color Scheme Designer 3
 3.   Color Palette Generator Generate A Color Palette For Any Image
 4.   COLOURlovers
 5.   Colorbrewer: Color Advice for Maps
Data Tools
Image Searches


 1.    American Memory from the Library of Congress -The home page for the American Memory Historical Collections from
       the Library of Congress. American Memory provides free access to historical images, maps, sound recordings, and
       motion pictures that document the American experience. American Memory offers primary source materials that
       chronicle historical events, people, places, and ideas that continue to shape America.
 2.    Galaxy of Images | Smithsonian Institution Libraries
 3.    Flickr Search
 4.    50 Websites For Free Vector Images Download -Design weblog for designers, bloggers and tech users. Covering
       useful tools, tutorials, tips and inspirational photos.
 5.    Images - Google Images. The most comprehensive image search on the web.
 6.    Trade Literature – a set on Flickr
 7.    Compfight / A Flickr Search Tool
 8.    morgueFile free photos for creatives by creatives
 9.    stock.xchng – the leading free stock photography site
 10.   The Ultimate Collection Of Free Vector Packs – Smashing Magazine
 11.   How to Create Animated GIFs Using Photoshop CS3 – wikiHow
 12.   IAN Symbol Libraries (Free Vector Symbols and Icons) – Integration and Application Network
 13.   Usability.gov
 14.   best icons
 15.   Iconspedia
 16.   IconFinder
 17.   IconSeeker
Data Tools
Invisible Web
   1.   10 Search Engines to Explore the Invisible Web
   2.   Scirus – for scientific information - The most comprehensive scientific research tool on the web. With over 410 million scientific items indexed at
        last count, it allows researchers to search for not only journal content but also scientists’ homepages, courseware, pre-print server material,
        patents and institutional repository and website information.
   3.   TechXtra: Engineering, Mathematics, and Computing - TechXtra is a free service which can help you find articles, books, the best websites, the
        latest industry news, job announcements, technical reports, technical data, full text eprints, the latest research, thesis & dissertations, teaching
        and learning resources and more, in engineering, mathematics and computing.
   4.   Welcome to INFOMINE: Scholarly Internet Resource Collections - INFOMINE is a virtual library of Internet resources relevant to faculty,
        students, and research staff at the university level. It contains useful Internet resources such as databases, electronic journals, electronic
        books, bulletin boards, mailing lists, online library card catalogs, articles, directories of researchers, and many other types of information.
   5.   The WWW Virtual Library - The WWW Virtual Library (VL) is the oldest catalogue of the Web, started by Tim Berners-Lee, the creator of HTML
        and of the Web itself, in 1991 at CERN in Geneva. Unlike commercial catalogues, it is run by a loose confederation of volunteers, who compile
        pages of key links for particular areas in which they are expert; even though it isn’t the biggest index of the Web, the VL pages are widely
        recognised as being amongst the highest-quality guides to particular sections of the Web.
   6.   Intute - Intute is a free online service that helps you to find web resources for your studies and research. With millions of resources available on
         the Internet, it can be difficult to find useful material. We have reviewed and evaluated thousands of resources to help you choose key websites
         in your subject.
    7.   CompletePlanet – Discover over 70,000+ databases and specialty search engines - There are hundreds of thousands of databases that contain
         Deep Web content. CompletePlanet is the front door to these Deep Web databases on the Web and to the thousands of regular search
         engines — it is the first step in trying to find highly topical information.
    8.   Infoplease: Encyclopedia, Almanac, Atlas, Biographies, Dictionary, Thesaurus - Information Please has been providing authoritative answers to
         all kinds of factual questions since 1938—first as a popular radio quiz show, then starting in 1947 as an annual almanac, and since 1998 on the
         Internet at www.infoplease.com. Many things have changed since 1938, but not our dedication to providing reliable information, in a way that
         engages and entertains.
    9.   DeepPeep: discover the hidden web - DeepPeep is a search engine specialized in Web forms. The current beta version currently tracks 45,000
         forms across 7 domains. DeepPeep helps you discover the entry points to content in Deep Web (aka Hidden Web) sites, including online
         databases and Web services. Advanced search allows you to perform more specific queries. Besides specifying keywords, you can also search
         for specific form element labels, i.e., the description of the form attributes.
    10.  IncyWincy: The Invisible Web Search Engine - IncyWincy is a showcase of Net Research Server (NRS) 5.0, a software product that provides a
         complete search portal solution, developed by LoopIP LLC. LoopIP licenses the NRS engine and provides consulting expertise in building
         search solutions.
Data Tools
Metadata Object Description Schema: MODS (Library of Congress) and Outline of elements and
  attributes in MODS version 3.4 - This document contains a
  listing of elements and their related attributes in MODS Version 3.4 with values
  or value sources where applicable. It is an "outline" of the schema. Items
  highlighted in red indicate changes made to MODS in Version 3.4. All top-level
  elements and all attributes are optional, but you must have at least one element.
  Subelements are optional, although in some cases you may not have empty
  containers. Attributes are not in a mandated sequence and are not repeatable (per
  XML rules). "Ordered" below means the subelements must occur in the order
  given. Elements are repeatable unless otherwise noted. "Authority" attributes are
  either followed by codes for authority lists (e.g., iso639-2b) or "see" references
  that link to documents that contain codes for identifying authority lists. For
  additional information about any MODS elements (version 3.4 elements will be
  added soon), please see the MODS User Guidelines.
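
The rules quoted above (namespaced XML, optional elements but at least one required) are easy to see in a tiny record. Below is a minimal, illustrative sketch that builds a MODS 3.4 record with Python's standard library; the title text is a made-up example, not part of the MODS documentation.

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"
ET.register_namespace("mods", MODS_NS)

# A minimal MODS record: the version attribute plus a single top-level element
# (titleInfo/title), which satisfies the "at least one element" rule quoted above.
mods = ET.Element(f"{{{MODS_NS}}}mods", {"version": "3.4"})
title_info = ET.SubElement(mods, f"{{{MODS_NS}}}titleInfo")
ET.SubElement(title_info, f"{{{MODS_NS}}}title").text = "Data Science Resources"  # example title

print(ET.tostring(mods, encoding="unicode"))
```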
Data Tools

wiki.dbpedia.org : About - DBpedia is a community effort to extract structured
   information from Wikipedia and to make this information available on the Web.
   DBpedia allows you to ask sophisticated queries against Wikipedia, and to link
   other data sets on the Web to Wikipedia data. We hope this will make it easier
   for the amazing amount of information in Wikipedia to be used in new and
   interesting ways, and that it might inspire new mechanisms for navigating,
   linking and improving the encyclopaedia itself.
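
As a concrete illustration of the "sophisticated queries against Wikipedia" mentioned above, here is a minimal Python sketch that sends a SPARQL query to the public DBpedia endpoint. It assumes the endpoint at https://dbpedia.org/sparql is reachable and that the requests library is installed; the query itself is only an example.

```python
import requests

query = """
PREFIX dbo:  <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?lang ?name WHERE {
  ?lang a dbo:ProgrammingLanguage ;
        rdfs:label ?name .
  FILTER (lang(?name) = "en")
} LIMIT 5
"""

# Ask DBpedia for a handful of programming languages and their English labels.
resp = requests.get(
    "https://dbpedia.org/sparql",
    params={"query": query, "format": "application/sparql-results+json"},
    timeout=30,
)
for row in resp.json()["results"]["bindings"]:
    print(row["name"]["value"], "->", row["lang"]["value"])
```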
Data Tools
Semantic Web – W3C - In addition to the classic “Web of documents” W3C is
  helping to build a technology stack to support a “Web of data,” the sort of data
  you find in databases. The ultimate goal of the Web of data is to enable
  computers to do more useful work and to develop systems that can support
  trusted interactions over the network. The term “Semantic Web” refers to W3C’s
  vision of the Web of linked data. Semantic Web technologies enable people to
  create data stores on the Web, build vocabularies, and write rules for handling
  data. Linked data are empowered by technologies such as RDF, SPARQL,
  OWL, and SKOS.
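
To make the RDF/SKOS part of that technology stack concrete, here is a small sketch using the rdflib Python library (my choice; any RDF toolkit would do). It builds three triples describing a made-up concept in a hypothetical example.org vocabulary and serializes them as Turtle.

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, SKOS

EX = Namespace("http://example.org/vocab/")  # hypothetical vocabulary namespace

g = Graph()
concept = EX["BigData"]
g.add((concept, RDF.type, SKOS.Concept))                        # declare a SKOS concept
g.add((concept, SKOS.prefLabel, Literal("Big data", lang="en")))
g.add((concept, SKOS.broader, EX["DataManagement"]))            # link to a broader concept

print(g.serialize(format="turtle"))
```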

RDA: Resource Description & Access | www.rdatoolkit.org - Designed for the digital
  world and an expanding universe of metadata users, RDA: Resource
  Description and Access is the new, unified cataloging standard. The online RDA
  Toolkit subscription is the most effective way to interact with the new standard.
  More on RDA.
Data Tools
Cataloging Cultural Objects - A Guide to Describing Cultural Works and Their
   Images (CCO) is a manual for describing, documenting, and cataloging cultural
   works and their visual surrogates. The primary focus of CCO is art and
   architecture, including but not limited to paintings, sculpture, prints, manuscripts,
   photographs, built works, installations, and other visual media. CCO also covers
   many other types of cultural works, including archaeological sites, artifacts, and
   functional objects from the realm of material culture.

Library of Congress Authorities (Search for Name, Subject, Title and Name/Title) -
    Using Library of Congress Authorities, you can browse and view authority
    headings for Subject, Name, Title and Name/Title combinations; and download
    authority records in MARC format for use in a local library system. This service
    is offered free of charge.
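
Once authority records have been downloaded in MARC format, a few lines of Python can inspect them locally. The sketch below assumes the pymarc library and a local file named authority_records.mrc; both are assumptions for illustration, not part of the Library of Congress service itself.

```python
from pymarc import MARCReader

# Print the main heading of each downloaded authority record:
# field 100 holds personal-name headings, field 150 topical subject headings.
with open("authority_records.mrc", "rb") as fh:
    for record in MARCReader(fh):
        heading = record["100"] or record["150"]
        if heading is not None:
            print(heading.value())
```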

Search Tools and Databases (Getty Research Institute) - Use these search tools to
   access library materials, specialized databases, and other digital resources.
Data Tools
Art & Architecture Thesaurus (Getty Research Institute) - Learn about the purpose,
    scope and structure of the AAT. The AAT is an evolving vocabulary, growing
    and changing thanks to contributions from Getty projects and other institutions.
    Find out more about the AAT’s contributors.

Getty Thesaurus of Geographic Names (Getty Research Institute) - Learn about the
   purpose, scope and structure of the TGN. The TGN is an evolving vocabulary,
   growing and changing thanks to contributions from Getty projects and other
   institutions. Find out more about the TGN’s contributors.

DCMI Metadata Terms

The Digital Object Identifier System

The Federal Geographic Data Committee
9 mistakes that will kill the best data analyses


1. Sampling or design of experiment not properly done
2. Non-robust cross-validation
3. Poor communication of results to management or clients
4. Poor data visualization
5. Does not solve our business problems
6. Database misses important data or fields
7. Failure to leverage external data
8. Can't make business data silos "talk to each other"
9. Developers (production people) and designers speak "different languages"
Thank You

Presentation by:

Michael Walker
Rose Business Technologies
720.373.2200
m@rosebt.com
http://www.rosebt.com

Data Science Overview (Oct. 3rd, 2012)

  • 1. Data Science Meetup Finalize Data Science Teams Technology Overview Data Science Tools Data Science Resources October 3, 2012
  • 2. Presentation by: Michael Walker Rose Business Technologies 720.373.2200 m@rosebt.com http://www.rosebt.com
  • 3. Agenda 6:00 - 6:30 Overview - Finalize Data Science Teams: Michael Walker 6:30 - 7:00 Hadoop/Mapreduce Presentation: John Dougherty 7:00 - 7:15 Qubole Presentation: Sadiq Shaik 7:15 - 7:45 Kognitio Presentation: Reggie Arizmendi 7:45 - 8:00 Network
  • 4. Hype Cycle for Emerging Tech 2012
  • 5. Hype Cycle for Big Data 2012
  • 6. Top 5 Big Data Challenges 1. Deciding what data is relevant 2. Cost of technology infrastructure 3. Lack of skills to analyze the data 4. Lack of skills to manage big data projects 5. Lack of business support
  • 7. Most Difficult Big Data Skills to Find 1. Advanced analytics, predictive analytics 2. Complex event processing 3. Rules management 4. Business intelligence tools 5. Data integration
  • 8. Big Data Drivers Analysis of…: 1. Operational data 2. Online customer data 3. Sales transactions data 4. Machine or device data 5. Service innovation
  • 9. Definitions Big data analytics is the application of advanced analytic techniques to very big data sets. Big data is a new generation of technologies and architectures designed to extract value economically from very large volumes of a wide variety of data by enabling high-velocity capture, discovery and/or analysis.
  • 10. Horizontal & Vertical Applications Big Data technology can be deployed for business processes such as the following: • Customer relationship management (sales, marketing, customer service) • Supply chain and operations • Administration (finance and accounting, human resources, legal) • Research and development • Information technology management • Risk management
  • 11. Horizontal & Vertical Applications In addition, big data technology can be used for industry- specific applications such as the following: • Logistics optimization in the transportation industry • Price optimization in the retail industry • Intellectual property management in the media and entertainment industry • Natural resource exploration in the oil and gas industry • Warranty management in the manufacturing industry • Crime prevention and investigation in local law enforcement • Predictive damage assessments in the insurance industry • Fraud detection in the banking industry • Patient treatment and fraud detection in the healthcare industry
  • 12. Data Science Teams Four (4) person teams Optimal skill mix: 1. Business Leader (consumer) 2. Statistics 3. Data Modeler 4. IT
  • 13. Data Science Use Case / Scenario Each team selects a use case / scenario Thesis Data sources Analytical tools / platforms
  • 14. Use Case Example: I suggest there is a correlation between size of government and economic growth. Thesis: Bigger government = slower economic growth Data source: Open data source from government stats; yahoo finance- bloomberg Tool: Qubole on Amazon PaaS
  • 15. Data Modeling A data model is a plan for building a database. To use a common analogy, the data model is equivalent to an architect's building plans.
  • 16. Data Modeling Three different types of data models: 1) Conceptual data models. These models, sometimes called domain models, are typically used to explore domain concepts with project stakeholders. On Agile teams high-level conceptual models are often created as part of your initial requirements envisioning efforts as they are used to explore the high-level static business structures and concepts. On traditional teams conceptual data models are often created as the precursor to LDMs or as alternatives to LDMs.
  • 17. Data Modeling 2) Logical data models (LDMs). LDMs are used to explore the domain concepts, and their relationships, of your problem domain. This could be done for the scope of a single project or for your entire enterprise. LDMs depict the logical entity types, typically referred to simply as entity types, the data attributes describing those entities, and the relationships between the entities. LDMs are rarely used on Agile projects although often are on traditional projects (where they rarely seem to add much value in practice).
  • 18. Data Modeling 3) Physical data models (PDMs). PDMs are used to design the internal schema of a database, depicting the data tables, the data columns of those tables, and the relationships between the tables. PDMs often prove to be useful on both Agile and traditional projects and as a result the focus of this article is on physical modeling.
  • 21. Models of Data A framework to organize and analyze data. Predictive, Descriptive, Prescriptive Analytics There are three types of data analysis: Predictive (forecasting) Descriptive (business intelligence and data mining) Prescriptive (optimization and simulation)
  • 22. Models of Data Predictive Analytics Predictive analytics turns data into valuable, actionable information. Predictive analytics uses data to determine the probable future outcome of an event or a likelihood of a situation occurring. Predictive analytics encompasses a variety of statistical techniques from modeling, machine learning, data mining and game theory that analyze current and historical facts to make predictions about future events.
  • 23. Models of Data Predictive Analytics Three basic cornerstones of predictive analytics are: Predictive modeling Decision Analysis and Optimization Transaction Profiling An example of using predictive analytics is optimizing customer relationship management systems. They can help enable an organization to analyze all customer data therefore exposing patterns that predict customer behavior.
  • 24. Models of Data Predictive Analytics Another example is for an organization that offers multiple products, predictive analytics can help analyze customers’ spending, usage and other behavior, leading to efficient cross sales, or selling additional products to current customers. This directly leads to higher profitability per customer and stronger customer relationships.
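
A minimal sketch of the cross-sell idea described above, using pandas and scikit-learn. The file name and column names (monthly_spend, tenure_months, support_calls, bought_addon) are hypothetical placeholders for whatever customer behavior data is actually available.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical customer table: spend and usage features plus a 0/1 flag for
# whether the customer later bought the additional product.
df = pd.read_csv("customers.csv")
X = df[["monthly_spend", "tenure_months", "support_calls"]]
y = df["bought_addon"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Rank current customers by their cross-sell propensity.
df["propensity"] = model.predict_proba(X)[:, 1]
print(df.sort_values("propensity", ascending=False).head())
```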
  • 25. Models of Data Descriptive Analytics Descriptive analytics looks at data and analyzes past events for insight as to how to approach the future. Descriptive analytics looks at past performance and understands that performance by mining historical data to look for the reasons behind past success or failure. Almost all management reporting such as sales, marketing, operations, and finance, uses this type of post-mortem analysis.
  • 26. Models of Data Descriptive Analytics Descriptive models quantify relationships in data in a way that is often used to classify customers or prospects into groups. Unlike predictive models that focus on predicting a single customer behavior (such as credit risk), descriptive models identify many different relationships between customers or products. Descriptive models do not rank-order customers by their likelihood of taking a particular action the way predictive models do.
  • 27. Models of Data Descriptive Analytics Descriptive models can be used, for example, to categorize customers by their product preferences and life stage. Descriptive modeling tools can be utilized to develop further models that can simulate large number of individualized agents and make predictions. For example, descriptive analytics examines historical electricity usage data to help plan power needs and allow electric companies to set optimal prices.
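
A descriptive-model sketch to go with the slide above: grouping customers by behavior rather than predicting a single outcome. The column names and the choice of four clusters are assumptions for illustration; k-means is only one of many ways to build such segments.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical columns describing product preferences and life stage.
df = pd.read_csv("customers.csv")
features = df[["age", "recency_days", "frequency", "monetary", "pct_online_purchases"]]

X = StandardScaler().fit_transform(features)           # put features on a comparable scale
df["segment"] = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# Profile each segment by its average behavior.
print(df.groupby("segment")[list(features.columns)].mean())
```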
  • 28. Models of Data Prescriptive Analytics Prescriptive analytics automatically synthesizes big data, mathematical sciences, business rules, and machine learning to make predictions and then suggests decision options to take advantage of the predictions. Prescriptive analytics goes beyond predicting future outcomes by also suggesting actions to benefit from the predictions and showing the decision maker the implications of each decision option. Prescriptive analytics not only anticipates what will happen and when it will happen, but also why it will happen.
  • 29. Models of Data Prescriptive Analytics Further, prescriptive analytics can suggest decision options on how to take advantage of a future opportunity or mitigate a future risk and illustrate the implication of each decision option. In practice, prescriptive analytics can continually and automatically process new data to improve prediction accuracy and provide better decision options.
  • 30. Models of Data Prescriptive Analytics Prescriptive analytics synergistically combines data, business rules, and mathematical models. The data inputs to prescriptive analytics may come from multiple sources, internal (inside the organization) and external (social media). The data may also be structured, which includes numerical and categorical data, as well as unstructured data, such as text, images, audio, and video data, including big data. Business rules define the business process and include constraints, preferences, policies, best practices, and boundaries. Mathematical models are techniques derived from mathematical sciences and related disciplines including applied statistics, machine learning, operations research, and natural language processing.
  • 31. Models of Data Prescriptive Analytics For example, prescriptive analytics can benefit healthcare strategic planning by using analytics to leverage operational and usage data combined with data of external factors such as economic data, population demographic trends and population health trends, to more accurately plan for future capital investments such as new facilities and equipment utilization as well as understand the trade-offs between adding additional beds and expanding an existing facility versus building a new one.
  • 32. Models of Data Prescriptive Analytics Another example is energy and utilities. Natural gas prices fluctuate dramatically depending upon supply, demand, econometrics, geo-politics, and weather conditions. Gas producers, transmission (pipeline) companies and utility firms have a keen interest in more accurately predicting gas prices so that they can lock in favorable terms while hedging downside risk. Prescriptive analytics can accurately predict prices by modeling internal and external variables simultaneously and also provide decision options and show the impact of each decision option.
  • 33. Analytical Technologies Platforms Amazon PaaS Cloud Foundry PaaS MS Azure Google App Engine IBM SmartCloud Heroku
  • 34. Analytical Technologies Tools Hadoop / MapReduce R Language - Revolution Analytics Qubole Alteryx Vertica BigML Kognitio MS SQL Server: SSIS; SSAS Geokettle
  • 35. Analytical Technologies Tools ERwin Data Modeler StrategyCompanion Talend Pentaho Hortonworks Metalab SAS SPSS PSPP
  • 36. Open Data Sources Freebase Data Hub Numbrary Peter Skomoroch's Delicious Data InfoChimps Open Data Sites DBpedia theinfo.org Lending Club Statistics MAF/TIGER (US Census Geo) Database Reuters Corpora (RCV1, RCV2, TRC2) Open Street Map MusicBrainz Jigsaw Opentick
  • 37. Open Data Sources Historical Data, Yahoo Finance Historical Foreign Exchange Data, Federal Reserve Bank of New York Graduate School of Business, Stanford University Proprietary Trading Articles & Resources Wilmott.com DefaultRisk.com, Credit Risk Modeling Resource: Papers, Books, Conferences, Jobs Forex Factory, Forums NBER Papers in Asset Pricing: Stocks, Bonds and Foreign Currency Financial Engineering Books, International Association of Financial Engineers
  • 38. Open Data Sources • • Literacy, Gross Domestic Product, Income and Military Expenditures for 154 Countries • Continent Codes for Countries • Source: Various Wikipedia Articles • Daily Precipitation, Min and Max Temperatures for Berkeley for the first 10 months of 2005 • Source: http://hurricane.ncdc.noaa.gov/dly/DLY • Release Dates and Box Office Earnings for Top Movies • Source: http://www.movieweb.com/movies/boxoffice/alltime.php • See Also: http://imdb.com/Top/ • Bush-Kerry Election Results 2004 • US State Population, 2003 and 2004 • Source: http://www.factmonster.com/ipka/A0004986.html • Information about Cars (1978-1979) • Diabetes in Pima Indians • Information about Diabetes data source: http://www.ics.uci.edu/~mlearn/MLRepository.html • Updated world data with new variables • Wine Recognition Data • Information about Wine data source: http://www.ics.uci.edu/~mlearn/MLRepository.html • Nutritional Information about Crackers source: http://www.math.csi.cuny.edu/st/Projects • XML Plant Catalog source: http://www.w3schools.com/xml/ • US Wheat Production 1910-2004 source: http://usda.mannlib.cornell.edu/data-sets/crops/88008/ • Birthdays and Terms of US Senators source: Wikipedia • Weight and Sleep Information of Various Animals • Information about Sleep Data Set • SQLite Album database • Iron dataset
  • 39. Eight Levels of Analytics
  • 40. Statistical Analysis Statistical Analysis answers the questions: Why is this happening? What opportunities am I missing? Example: Banks can discover why an increasing number of customers are refinancing their homes. Here we can begin to run some complex analytics, like frequency models and regression analysis. We can begin to look at why things are happening using the stored data and then begin to answer questions based on the data.
  • 41. Forecasting Forecasting answers the questions: What if these trends continue? How much is needed? When will it be needed? Example: Retailers can predict how demand for individual products will vary from store to store. Forecasting is one of the hottest markets – and hottest analytical applications – right now. It applies everywhere. In particular, forecasting demand helps supply just enough inventory, so you don’t run out or have too much.
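
A hedged forecasting sketch for the retail demand example above, using Holt-Winters exponential smoothing from statsmodels. The file name, column names, and weekly seasonality are assumptions, and the series is assumed to cover at least a few full years; the point is simply turning a demand history into a short-horizon forecast.

```python
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Hypothetical weekly demand series for one product at one store.
demand = pd.read_csv("weekly_demand.csv", index_col="week", parse_dates=True)["units"]

# Additive trend and 52-week seasonality; both choices would be validated in practice.
model = ExponentialSmoothing(demand, trend="add", seasonal="add", seasonal_periods=52).fit()
print(model.forecast(8))   # expected demand for the next 8 weeks
```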
  • 42. Predictive Modeling Predictive Modeling answers the questions: What will happen next? How will it affect my business? Example: Hotels and casinos can predict which VIP customers will be more interested in particular vacation packages. If you have 10 million customers and want to do a marketing campaign, who's most likely to respond? How do you segment that group? And how do you determine who's most likely to leave your organization? Predictive modeling provides the answers.
  • 43. Optimization Optimization answers the question: How do we do things better? What is the best decision for a complex problem? Example: Given business priorities, resource constraints and available technology, determine the best way to optimize your IT platform to satisfy the needs of every user. Optimization supports innovation. It takes your resources and needs into consideration and helps you find the best possible way to accomplish your goals.
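
Optimization problems like the IT-platform example above are often posed as small mathematical programs. The toy linear program below, solved with SciPy's linprog, chooses how many on-premise and cloud server units to provision at minimum cost subject to a throughput requirement and an operations cap; every number in it is invented for illustration.

```python
from scipy.optimize import linprog

# Decision variables: x1 = on-premise units, x2 = cloud units.
cost = [400, 250]                 # cost per unit; linprog minimizes cost @ x
A_ub = [[-80, -60],               # throughput: 80*x1 + 60*x2 >= 2400  ->  -80*x1 - 60*x2 <= -2400
        [1, 1]]                   # operations cap: at most 40 units in total
b_ub = [-2400, 40]

result = linprog(c=cost, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
print("units:", result.x, "total cost:", result.fun)
```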
  • 44. Conceptual Modeling Conceptual Modeling brings together the business and technology views to define the solution scope. It is more than technical architecture or data context diagrams. Technical architecture and data context diagrams have their place, but the critical skill is the business view (vs. technical view) of the solution scope. This is critical to engaging stakeholders and setting the stage for innovation.
  • 45. Statistical Models Nonparametric Tests T-test ANOVA & MANOVA ANCOVA & MANCOVA Linear Regression Generalized Least Squares Ridge Regression Lasso Generalized Linear Models Mixed Effects Models
  • 46. Statistical Models Logistic Regression Nonlinear Regression Discriminant Analysis Nearest Neighbor Factor & Principal Components Analysis Copula Models Cross-Validation Bayesian Statistics Monte Carlo, Classic Methods Markov Chain Monte Carlo
  • 47. Statistical Models Bootstrap & Jackknife EM Algorithm Missing Data Imputation Outlier Diagnostics Robust Estimation Longitudinal (Panel) Data Survival Analysis Path Analysis Propensity Score Matching Stratified Samples (Survey Data)
  • 48. Statistical Models Experimental Design Quality Control Reliability Theory Univariate Time Series Multivariate Time Series Markov Chains Hidden Markov Models Stochastic Volatility Models Diffusions Counting Processes
  • 49. Statistical Models Filtering Instrumental Variables Simultaneous Equations Splines Nonparametric Smoothing Methods Extreme Value Theory Variance Stabilization Cluster Analysis Neural Networks Classification & Regression Trees
  • 50. Statistical Models Boosting Classification & Regression Trees Random Forests Support Vector Machines Signal Processing Wavelet Analysis ROC Curves Optimization
  • 51. Statistical Models Two simple yet powerful models: Generalized Linear Regression Model Random Forests Suggestion: Keep it simple for the first use case.
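
Following the "keep it simple" suggestion above, a team could start by comparing exactly those two models on its use-case data. A scikit-learn sketch, assuming a CSV with numeric features and a binary column named outcome (both placeholders):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical use-case table; features are assumed numeric and complete for simplicity.
df = pd.read_csv("use_case.csv")
X, y = df.drop(columns="outcome"), df["outcome"]

for name, model in [
    ("generalized linear model (logistic)", LogisticRegression(max_iter=1000)),
    ("random forest", RandomForestClassifier(n_estimators=200, random_state=0)),
]:
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(name, scores.mean().round(3))
```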
  • 53. Predictive Modeling Techniques Problems with some predictive modeling techniques. Note that most of these techniques have evolved over time (in the last 10 years) to the point where most drawbacks have been eliminated - making the updated tool far different and better than its original version. Even so, the original versions of these techniques are still widely used. 1. Linear regression. Relies on normality, homoscedasticity and other assumptions; does not capture highly non-linear, chaotic patterns. Prone to over-fitting. Parameters difficult to interpret. Very unstable when independent variables are highly correlated. Fixes: variable reduction, apply a transformation to your variables, use constrained regression (e.g. ridge or Lasso regression). 2. Traditional decision trees. Very large decision trees are very unstable and impossible to interpret, and prone to over-fitting. Fix: combine multiple small decision trees together instead of using a large decision tree. 3. Linear discriminant analysis. Used for supervised clustering. Bad technique because it assumes that clusters do not overlap and are well separated by hyper-planes. In practice, they almost never are. Use density estimation techniques instead. 4. K-means clustering. Used for clustering, tends to produce circular clusters. Does not work well with data points that are not a mixture of Gaussian distributions.
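
The constrained-regression fix mentioned for point 1 is straightforward to try. A small scikit-learn sketch: synthetic data with two nearly collinear columns, fitted with cross-validated ridge and lasso so the unstable coefficients are shrunk or zeroed. The data here is generated, not taken from any real source.

```python
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic example with a nearly collinear pair of predictors (columns 0 and 1).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=200)
y = 3 * X[:, 0] - 2 * X[:, 4] + rng.normal(size=200)

ridge = make_pipeline(StandardScaler(), RidgeCV(alphas=np.logspace(-3, 3, 13))).fit(X, y)
lasso = make_pipeline(StandardScaler(), LassoCV(cv=5, random_state=0)).fit(X, y)
print("ridge coefficients:", ridge[-1].coef_.round(2))
print("lasso coefficients:", lasso[-1].coef_.round(2))
```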
  • 54. Predictive Modeling Techniques 5. Neural networks. Difficult to interpret, unstable, subject to over-fitting. 6. Maximum Likelihood estimation. Requires your data to fit a pre-specified probability distribution. Not data-driven. In many cases the pre-specified Gaussian distribution is a terrible fit for your data. 7. Density estimation in high dimensions. Subject to what is referred to as the curse of dimensionality. Fix: use (non-parametric) kernel density estimators with adaptive bandwidths. 8. Naive Bayes. Used e.g. in fraud and spam detection, and for scoring. Assumes that variables are independent; if they are not, it will fail miserably. In the context of fraud or spam detection, variables (sometimes called rules) are highly correlated. Fix: group variables into independent clusters of variables (in each cluster, variables are highly correlated). Apply naive Bayes to the clusters. Or use data reduction techniques. Bad text mining techniques (e.g. basic "word" rules in spam detection) combined with naive Bayes produce absolutely terrible results with many false positives and false negatives. And remember to use sound cross-validation techniques when testing models!
  • 55. Predictive Modeling Techniques Poor cross-validation allows bad models to make the cut, by over-estimating the true lift to be expected in future data, the true accuracy or the true ROI outside the training set. Good cross-validation consists of: • splitting your training set into multiple subsets (test and control subsets), • including different types of clients and more recent data in the control sets than in your test sets, • checking the quality of forecasted values on the control sets, • computing confidence intervals for individual errors (error defined e.g. as |true value minus forecasted value|) to make sure that the error is small enough AND not too volatile (it has small variance across all control sets)
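
One way to follow that advice in Python: hold out the most recent slice of the training data as a control set, score a model on it, and look at the distribution of individual errors rather than a single average. The file name, the date and target column names, and the model choice below are all placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Hypothetical data frame sorted by date, with a numeric target column "y"
# and numeric, complete features in the remaining columns.
df = pd.read_csv("training_set.csv", parse_dates=["date"]).sort_values("date")
X, y = df.drop(columns=["y", "date"]), df["y"]

# Hold out the most recent 20% as the control set, so the model is judged on
# newer data than it was trained on, as the slide above suggests.
cut = int(len(df) * 0.8)
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X.iloc[:cut], y.iloc[:cut])

errors = np.abs(y.iloc[cut:] - model.predict(X.iloc[cut:]))
low, high = np.percentile(errors, [2.5, 97.5])
print(f"median |error| = {errors.median():.3f}, 95% interval = [{low:.3f}, {high:.3f}]")
```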
  • 56. Statistical Software Almost all serious statistical analysis is done in one of the following packages: R (S-Plus), Matlab, SAS, SPSS and Stata. This does not mean that every one of those packages is suitable for every type of analysis. In fact, for most advanced areas, only 2-3 packages will be suitable, providing enough functionality or enough tools to implement this functionality easily. For example, the very important area of Markov Chain Monte Carlo is doable in R, Matlab and SAS only, unless you want to rely on convoluted macros written by random users on the web.
  • 57. Statistical Software R & MATLAB R and Matlab are the richest systems by far. They contain an impressive amount of libraries, which is growing each day. Even if a desired very specific model is not part of the standard functionality, you can implement it yourself, because R and Matlab are really programming languages with relatively simple syntaxes. As "languages" they allow you to express any idea. The question is whether you are a good writer or not. In terms of modern applied statistics tools, R libraries are somewhat richer than those of Matlab. Also R is free. On the flip side, Matlab has much better graphics, which you will not be ashamed to put in a paper or a presentation.
  • 58. Statistical Software SPSS On the other end of the spectrum is a package like SPSS. SPSS is quite narrow in its capabilities and allows you to do only about half of the mainstream statistics. It is of little use for ambitious modeling and estimation procedures such as those in kernel smoothing, pattern recognition or signal processing. Nonetheless, SPSS is very popular among practitioners because it requires almost no programming training. All you have to do is hit several buttons and SPSS does all the calculations for you. When you need something standard, SPSS may have it implemented fully. The SPSS output will be quite detailed and visually pleasing. It will contain all the major tests and diagnostic tools associated with the method and will allow you to write an informative statistics section of your empirical analysis. In short, when the method is there, it is faster to run than a similar functionality in R or Matlab. So I use SPSS often for standard requests from my clients, like running linear regression, ANOVA or principal components analysis. SPSS gives you the ability to program macros, but that feature is quite inflexible.
  • 59. Statistical Software SAS & STATA Somewhere in between R, Matlab and SPSS lie SAS and Stata. SAS offers more extensive analytics than Stata. It is composed of dozens of procedures with massive output, often covering more than ten pages. The idea of SAS is not to listen to you that much. It is like an old grandfather whom you approach with a simple question, but who instead tells you the story of his life. Many procedures contain three times more output than what you need to know about that topic. So some time has to be spent filtering the output for the relevant pieces. SAS procedures are invoked using simple scripts. Stata procedures can be invoked by clicking buttons in the menu or by running simple scripts. In the menu part, Stata resembles SPSS. Both SAS and Stata are programming languages, so they allow you to build analytics around standard procedures. Stata is somewhat more flexible than SAS. Still, in terms of programming flexibility, Stata and SAS do not come close to R or Matlab. Selected strengths of SAS compared to all other packages: large data sets, speed, beautiful graphics, flexibility in formatting the output, time series procedures, counting processes. Selected strengths of Stata compared to all other packages: manipulation of survey data (stratified samples, clustering), robust estimation and tests, longitudinal data methods, multivariate time series.
  • 60. Statistical Software Useful Resources: American Statistical Association Department of Statistics, Stanford University Elementary Statistics Books Available to Download for Free
  • 61. Statistical Software • Downloading R • R Manuals (at CRAN) • Accessing the SCF Remotely (includes how to get the necessary software) • Class Bulletin Board (bspace) • Driver to convert Windows Documents to PDF • Introduction to R (pdf) • Slides for a Course in R (pdf) • R Graph Gallery • statsnetbase Search for R Graphics to read Paul Murrell's book about plotting in R • Some Notes on Saving Plots in R • Free Graphical MySQL Client • SQLite Graphical Client for Windows • Instructions on running the Firefox SQLiteManager extension as an application on Mac OSX • Accessing the Class MySQL Server through an SSH Tunnel • Connecting to the MySQL server under Windows • Introduction to Cluster Analysis (statsoft.nl) • Fruit pictures for the "slot machine" (zipped) • R TclTk examples • More R TclTk examples • Additional GUI examples: Deal or No Deal Piano • HTML Form Tutorial • Setting up your account for CGI scripting • Running your own Webserver to test CGI programs (Mac & Linux) • Notes on Document Preparation with Latex • vi reference card • emacs reference card • R reference card • More information on Dates and Times in R • More information on Factors in R
  • 62. Statistical Software Books • Competing on Analytics • Analytics at Work • Super Crunchers • The Numerati • Data Driven • Data Source Handbook • Programming Collective Intelligence • Mining the Social Web • Data Analysis with Open Source Tools • Visualizing Data • The Visual Display of Quantitative Information • Envisioning Information • Visual Explanations: Images and Quantities, Evidence and Narrative • Beautiful Evidence • Think Stats • Data Analysis Using Regression and Multilevel/Hierarchical Models • Applied Longitudinal Data Analysis • Design of Observational Studies • Statistical Rules of Thumb • All of Statistics • A Handbook of Statistical Analyses Using R • Mathematical Statistics and Data Analysis • The Elements of Statistical Learning • Counterfactuals and Causal Inference
  • 63. Statistical Software • • Mining of Massive Data Sets • Data Analysis: What Can Be Learned From the Past 50 Years • Bias and Causation • Regression Modeling Strategies • Probably Not • Statistics as Principled Argument • The Practice of Data Analysis Great class notes on Data Science: http://statistics.berkeley.edu/classes/s133/all2011.pdf Related Workshops • Data Bootcamp, Strata 2011 • Machine Learning Summer School, Purdue 2011 • Looking at Data
  • 64. Statistical Software Courses • Concepts in Computing with Data, Berkeley • Practical Machine Learning, Berkeley • Artificial Intelligence, Berkeley • Visualization, Berkeley • Data Mining and Analytics in Intelligent Business Services, Berkeley • Data Science and Analytics: Thought Leaders, Berkeley • Machine Learning, Stanford • Paradigms for Computing with Data, Stanford • Mining Massive Data Sets, Stanford • Data Visualization, Stanford • Algorithms for Massive Data Set Analysis, Stanford • Research Topics in Interactive Data Analysis, Stanford • Data Mining, Stanford • Machine Learning, CMU • Statistical Computing, CMU • Machine Learning with Large Datasets, CMU • Machine Learning, MIT • Data Mining, MIT • Statistical Learning Theory and Applications, MIT • Data Literacy, MIT • Introduction to Data Mining, UIUC • Learning from Data, Caltech • Introduction to Statistics, Harvard • Data-Intensive Information Processing Applications, University of Maryland
  • 65. Statistical Software • Dealing with Massive Data, Columbia • Data-Driven Modeling, Columbia • Introduction to Data Mining and Analysis, Georgia Tech • Computational Data Analysis: Foundations of Machine Learning and Da..., Georgia Tech • Applied Statistical Computing, Iowa State • Data Visualization, Rice • Data Warehousing and Data Mining, NYU • Data Mining in Engineering, Toronto • Machine Learning and Data Mining, UC Irvine • Knowledge Discovery from Data, Cal Poly • Large Scale Learning, University of Chicago • Data Science: Large-scale Advanced Data Analysis, University of Florida • Strategies for Statistical Data Analysis, Universität Leipzig Videos • Lies, damned lies and statistics (about TEDTalks) • The Joy of Stats • Journalism in the Age of Data
  • 66. Data Science Team Ideas Keep it simple! Work on a real problem from work. Suggestions for more challenging problems: Census Return Rate Develop a statistical model to predict census mail return rates at the Census block group level of geography. The Census Bureau will use this model for planning purposes for the decennial census and for demographic sample surveys. Develop and evaluate different statistical approaches to proposing the best predictive model for geographic units. The intent is to improve current predictive analytics.
  • 67. Data Science Team Ideas Hierarchical load forecasting problem: backcasting and forecasting hourly loads (in kW) for a US utility with 20 zones. Backcast and forecast at both the zonal level (20 series) and the system level (sum of the 20 zonal series), 21 series in total. Data (loads of 20 zones and temperature of 11 stations) history ranges from the 1st hour of 2004/1/1 to the 6th hour of 2008/6/30. Given the actual temperature history, the 8 weeks below in the load history are set to missing and must be backcast. It's OK to use the entire history to backcast these 8 weeks. 2005/3/6 - 2005/3/12; 2005/6/20 - 2005/6/26; 2005/9/10 - 2005/9/16; 2005/12/25 - 2005/12/31; 2006/2/13 - 2006/2/19; 2006/5/25 - 2006/5/31; 2006/8/2 - 2006/8/8; 2006/11/22 - 2006/11/28. Hourly loads from 2008/7/1 to 2008/7/7 also need to be forecast; no actual temperatures are provided for that week.
  • 68. Data Science Team Ideas Wind power forecasting problem: predicting hourly power generation up to 48 hours ahead at 7 wind farms, based on historical measurements and additional wind forecast information (48-hour ahead predictions of wind speed and direction at the sites). The data is available for a period ranging from the 1st hour of 2009/7/1 to the 12th hour of 2012/6/28. The period between 2009/7/1 and 2010/12/31 is a model identification and training period, while the remainder of the dataset, that is, from 2011/1/1 to 2012/6/28, is used for evaluation. The training period is used for designing and estimating models that permit predicting wind power generation at lead times from 1 to 48 hours ahead, based on past power observations and/or available meteorological wind forecasts for that period. The evaluation part aims at mimicking real operational conditions. For that, a number of 48-hour periods with missing power observations were defined. All these power observations are to be predicted. These periods are defined as follows. The first period with missing observations is that from 2011/1/1 at 01:00 until 2011/1/3 at 00:00. The second period with missing observations is that from 2011/1/4 at 13:00 until 2011/1/6 at 12:00. Note that to be consistent, only the meteorological forecasts for that period that would actually be available in practice are given. These two periods then repeat every 7 days until the end of the dataset. In between periods with missing data, power observations are available for updating the models.
  • 69. Data Science Team Ideas Predict the online sales of a consumer product based on a data set of product features. Build as good a model as possible to predict monthly online sales of a product. Imagine the products are online self-help programs following an initial advertising campaign. Obtain data in the comma-separated values (CSV) format. Each row in this data set represents a different consumer product. The first 12 columns (Outcome_M1 through Outcome_M12) contain the monthly online sales for the first 12 months after the product launches. Date_1 is the day number the major advertising campaign began and the product launched. Date_2 is the day number the product was announced and a pre-release advertising campaign began. Other columns in the data set are features of the product and the advertising campaign. Quan_x are quantitative variables and Cat_x are categorical variables. Binary categorical variables are measured as (1) if the product had the feature and (0) if it did not.
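
A starting-point sketch for this exercise with pandas and scikit-learn, using the column naming described above (Outcome_M1..M12, Date_1, Date_2, Quan_x, Cat_x); the file name and the model choice are assumptions.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

# Column names follow the description above; the file name is hypothetical.
df = pd.read_csv("online_sales.csv")
target = "Outcome_M1"                                        # first month of sales
features = [c for c in df.columns if not c.startswith("Outcome_")]

# One-hot encode the categorical Cat_x columns; keep Quan_x and Date_x as numbers.
X = pd.get_dummies(df[features], columns=[c for c in features if c.startswith("Cat_")])
y = df[target]

model = GradientBoostingRegressor(random_state=0)
print(cross_val_score(model, X.fillna(0), y, cv=5, scoring="neg_root_mean_squared_error").mean())
```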
  • 70. Data Science Team Ideas Improve on the state of the art in credit scoring by predicting the probability that somebody will experience financial distress in the next two years. Banks play a crucial role in market economies. They decide who can get finance and on what terms and can make or break investment decisions. For markets and society to function, individuals and companies need access to credit. Credit scoring algorithms, which make a guess at the probability of default, are the method banks use to determine whether or not a loan should be granted. Improve on the state of the art in credit scoring, by predicting the probability that somebody will experience financial distress in the next two years. The goal is to build a model that borrowers can use to help make the best financial decisions. Obtain historical data on 250,000 borrowers.
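
Because the deliverable here is a probability of distress rather than a yes/no label, a scoring sketch should evaluate ranking quality (for example AUC) on held-out borrowers. The file name, the distress_2yr column, and the crude missing-value handling below are placeholders for illustration.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Hypothetical layout: one row per borrower, a 0/1 column "distress_2yr", and
# numeric features such as utilization, age, number of late payments and income.
df = pd.read_csv("borrowers.csv")
X = df.drop(columns="distress_2yr").fillna(0)     # crude fill; real work needs better imputation
y = df["distress_2yr"]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, stratify=y, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

prob = model.predict_proba(X_te)[:, 1]            # probability of distress, not a hard label
print("AUC:", roc_auc_score(y_te, prob).round(3))
```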
  • 71. Data Tools Included is a list of tools, such as programming languages and web-based utilities, data mining resources, some prominent organizations in the field, repositories where you can play with data, events you may want to attend and important articles you should take a look at. The second segment of the list includes a number of art and design resources the infographic designers might like including color palette generators and image searches. There are also some invisible web resources (if you’re looking for something on Google and not finding it) and metadata resources so you can appropriately curate your data.
  • 72. Data Tools Google Refine – A power tool for working with messy data (formerly Freebase Gridworks) The Overview Project – Overview is an open-source tool to help journalists find stories in large amounts of data by cleaning, visualizing and interactively exploring large document and data sets. Whether from government transparency initiatives, leaks or Freedom of Information requests, journalists are drowning in more documents than they can ever hope to read. Refine, reuse and request data | ScraperWiki – ScraperWiki is an online tool to make acquiring useful data simpler and more collaborative. Anyone can write a screen scraper using the online editor. In the free version, the code and data are shared with the world. Because it’s a wiki, other programmers can contribute to and improve the code. Data Curation Profiles – This website is an environment where academic librarians of all kinds, special librarians at research facilities, archivists involved in the preservation of digital data, and those who support digital repositories can find help, support and camaraderie in exploring avenues to learn more about working with research data and the use of the Data Curation Profiles Tool. Google Chart Tools – Google Chart Tools provide a perfect way to visualize data on your website. From simple line charts to complex hierarchical tree maps, the chart galley provides a large number of well-designed chart types. Populating your data is easy using the provided client- and server-side tools. 22 free tools for data visualization and analysis The R Journal – The R Journal is the refereed journal of the R project for statistical computing. It features short to medium length articles covering topics that might be of interest to users or developers of R. CS 229: Machine Learning – A widely referenced course by Professor Andrew Ng, CS 229: Machine Learning provides a broad introduction to machine learning and statistical pattern recognition. Topics include supervised learning, unsupervised learning, learning theory, reinforcement learning and adaptive control. Recent applications of machine learning, such as to robotic control, data mining, autonomous navigation, bioinformatics, speech recognition, and text and web data processing are also discussed. Google Research Publication: BigTable – Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers. Many projects at Google store data in Bigtable, including web indexing, Google Earth, and Google Finance. These applications place very different demands on Bigtable, both in terms of data size (from URLs to web pages to satellite imagery) and latency requirements (from backend bulk processing to real-time data serving). Despite these varied demands, Bigtable has successfully provided a flexible, high-performance solution for all of these Google products. In this paper we describe the simple data model provided by Bigtable, which gives clients dynamic control over data layout and format, and we describe the design and implementation of Bigtable. Scientific Data Management – An introduction. Natural Language Toolkit – Open source Python modules, linguistic data and documentation for research and development in natural language processing and text analytics, with distributions for Windows, Mac OSX and Linux. Beautiful Soup – Beautiful Soup is a Python HTML/XML parser designed for quick turnaround projects like screen-scraping. 
Mondrian: Pentaho Analysis – Pentaho open source OLAP analysis server written in Java, enabling interactive analysis of very large datasets.
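
Of the tools listed above, Beautiful Soup is the quickest to demonstrate. A minimal scraping sketch follows; the URL is a placeholder, and real scraping should respect a site's robots.txt and terms of use.

```python
import requests
from bs4 import BeautifulSoup

# Fetch a page and pull out the link text and targets.
html = requests.get("https://example.org/", timeout=30).text
soup = BeautifulSoup(html, "html.parser")

for link in soup.find_all("a"):
    print(link.get_text(strip=True), "->", link.get("href"))
```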
  • 73. Data Tools The Comprehensive R Archive Network - R is `GNU S’, a freely available language and environment for statistical computing and graphics which provides a wide variety of statistical and graphical techniques: linear and nonlinear modelling, statistical tests, time series analysis, classification, clustering, etc. Please consult the R project homepage for further information. CRAN is a network of ftp and web servers around the world that store identical, up-to-date, versions of code and documentation for R. Please use the CRAN mirror nearest to you to minimize network load. DataStax – Software, support, and training for Apache Cassandra. Machine Learning Demos Visual.ly – Infographics & Visualizations. Create, Share, Explore Google Fusion Tables - Google Fusion Tables is a modern data management and publishing web application that makes it easy to host, manage, collaborate on, visualize, and publish data tables online. Tableau Software - Fast Analytics and Rapid-fire Business Intelligence from Tableau Software. WaveMaker - WaveMaker is a rapid application development environment for building, maintaining and modernizing business-critical Web 2.0 applications. Visualization: Annotated Time Line – Google Chart Tools – Google Code - An interactive time series line chart with optional annotations. The chart is rendered within the browser using Flash. Visualization: Motion Chart – Google Chart Tools – Google Code - A dynamic chart to explore several indicators over time. The chart is rendered within the browser using Flash. PhotoStats - Create gorgeous infographics about your iPhone photos. Ionz Ionz will help you craft an infographic about yourself. chart builder - Powerful tools for creating a variety of charts for online display. Creately - Online diagramming and design. Pixlr Editor - A powerful online photo editor. Google Public Data Explorer - The Google Public Data Explorer makes large datasets easy to explore, visualize and communicate. As the charts and maps animate over time, the changes in the world become easier to understand. You don’t have to be a data expert to navigate between different views, make your own comparisons, and share your findings. Fathom -Fathom Information Design helps clients understand and express complex data through information graphics, interactive tools, and software for installations, the web, and mobile devices. Led by Ben Fry. Enough said! healthymagination | GE Data Visualization - Visualizations that advance the conversation about issues that shape our lives, and so we encourage visitors to download, post and share these visualizations. ggplot2 - ggplot2 is a plotting system for R, based on the grammar of graphics, which tries to take the good parts of base and lattice graphics and none of the bad parts. It takes care of many of the fiddly details that make plotting a hassle (like drawing legends) as well as providing a powerful model of graphics that makes it easy to produce complex multi-layered graphics.
  • 74. Data Tools Protovis - Protovis composes custom views of data with simple marks such as bars and dots. Unlike low-level graphics libraries that quickly become tedious for visualization, Protovis defines marks through dynamic properties that encode data, allowing inheritance, scales and layouts to simplify construction.Protovis is free and open-source, provided under the BSD License. It uses JavaScript and SVG for web-native visualizations; no plugin required (though you will need a modern web browser)! Protovis is mostly declarative and designed to be learned by example. d3.js - D3.js is a small, free JavaScript library for manipulating documents based on data. MATLAB – The Language Of Technical Computing - MATLAB® is a high-level language and interactive environment that enables you to perform computationally intensive tasks faster than with traditional programming languages such as C, C++, and Fortran. OpenGL – The Industry Standard for High Performance Graphics - OpenGL.org is a vendor-independent and organization-independent web site that acts as one-stop hub for developers and consumers for all OpenGL news and development resources. It has a very large and continually expanding developer and end-user community that is very active and vested in the continued growth of OpenGL. Google Correlate - Google Correlate finds search patterns which correspond with real-world trends. Revolution Analytics – Commercial Software & Support for the R Statistics Language - Revolution Analytics delivers advanced analytics software at half the cost of existing solutions. By building on open source R—the world’s most powerful statistics software—with innovations in big data analysis, integration and user experience, Revolution Analytics meets the demands and requirements of modern data-driven businesses. 22 Useful Online Chart & Graph Generators The Best Tools for Visualization - Visualization is a technique to graphically represent sets of data. When data is large or abstract, visualization can help make the data easier to read or understand. There are visualization tools for search, music, networks, online communities, and almost anything else you can think of. Whether you want a desktop application or a web-based tool, there are many specific tools are available on the web that let you visualize all kinds of data. Visual Understanding Environment - The Visual Understanding Environment (VUE) is an Open Source project based at Tufts University. The VUE project is focused on creating flexible tools for managing and integrating digital resources in support of teaching, learning and research. VUE provides a flexible visual environment for structuring, presenting, and sharing digital information. Bime – Cloud Business Intelligence | Analytics & Dashboards - Bime is a revolutionary approach to data analysis and dashboarding. It allows you to analyze your data through interactive data visualizations and create stunning dashboards from the Web. Data Science Toolkit - A collection of data tools and open APIs curated by our own Pete Warden. You can use it to extract text from a document, learn the political leanings of a particular neighborhood, find all the names of people mentioned in a text and more. BuzzData - BuzzData lets you share your data in a smarter, easier way. Instead of juggling versions and overwriting files, use BuzzData and enjoy a social network designed for data. SAP – SAP Crystal Solutions: Simple, Affordable, and Open BI Tools for Everyday Use Project Voldemort
  • 75. Data Tools Data Mining 1. Weka - Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes. Weka is open source software issued under the GNU General Public License. 2. PSPP - PSPP is a program for statistical analysis of sampled data. It is a Free replacement for the proprietary program SPSS, and appears very similar to it with a few exceptions. The most important of these exceptions is that there are no "time bombs"; your copy of PSPP will not "expire" or deliberately stop working in the future. Neither are there any artificial limits on the number of cases or variables which you can use. There are no additional packages to purchase in order to get "advanced" functions; all functionality that PSPP currently supports is in the core package. PSPP can perform descriptive statistics, T-tests, linear regression and non-parametric tests. Its backend is designed to perform its analyses as fast as possible, regardless of the size of the input data. You can use PSPP with its graphical interface or the more traditional syntax commands.
  • 76. Data Tools 3. Rapid-I - Rapid-I provides software, solutions, and services in the fields of predictive analytics, data mining, and text mining. The company concentrates on automatic intelligent analyses on a large scale, i.e. for large amounts of structured data like database systems and unstructured data like texts. The open-source data mining specialist Rapid-I enables other companies to use leading-edge technologies for data mining and business intelligence. The discovery and leverage of unused business intelligence from existing data enables better informed decisions and allows for process optimization. The main product of Rapid-I, the data analysis solution RapidMiner, is the world-leading open-source system for knowledge discovery and data mining. It is available as a stand-alone application for data analysis and as a data mining engine which can be integrated into your own products. By now, thousands of applications of RapidMiner in more than 30 countries give their users a competitive edge. Among the users are well-known companies such as Ford, Honda, Nokia, Miele, Philips, IBM, HP, Cisco, Merrill Lynch, BNP Paribas, Bank of America, mobilkom austria, Akzo Nobel, Aureus Pharma, PharmaDM, Cyprotex, Celera, Revere, LexisNexis, Mitre and many medium-sized businesses benefitting from the open-source business model of Rapid-I.
• 77. Data Tools 4. R Project – R is a language and environment for statistical computing and graphics. It is a GNU project which is similar to the S language and environment developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. R can be considered a different implementation of S. There are some important differences, but much code written for S runs unaltered under R. R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, …) and graphical techniques, and is highly extensible. The S language is often the vehicle of choice for research in statistical methodology, and R provides an Open Source route to participation in that activity. One of R's strengths is the ease with which well-designed publication-quality plots can be produced, including mathematical symbols and formulae where needed. Great care has been taken over the defaults for the minor design choices in graphics, but the user retains full control. R is available as Free Software under the terms of the Free Software Foundation's GNU General Public License in source code form. It compiles and runs on a wide variety of UNIX platforms and similar systems (including FreeBSD and Linux), Windows and MacOS.
  • 78. Data Tools Organizations 1. Data.gov 2. SDM group at LBNL 3. Open Archives Initiative 4. Code for America | A New Kind of Public Service 5. The # DataViz Daily 6. Institute for Advanced Analytics | North Carolina State University | Professor Michael Rappa · MSA Curriculum 7. BuzzData | Blog, 25 great links for data-lovin’ journalists 8. MetaOptimize – Home – Machine learning, natural language processing, predictive analytics, business intelligence, artificial intelligence, text analysis, information retrieval, search, data mining, statistical modeling, and data visualization 9. had.co.nz 10. Measuring Measures – Measuring Measures
  • 79. Data Tools Repositories 1. Repositories | DataCite 2. Data | The World Bank 3. Infochimps Data Marketplace + Commons: Download Sell or Share Databases, statistics, datasets for free | Infochimps 4. Factual Home – Factual 5. Flowing Media: Your Data Has Something To Say 6. Chartsbin 7. Public Data Explorer 8. StatPlanet 9. ManyEyes 10. 25+ more ways to bring data into R
  • 80. Data Tools Articles 1. Data Science: a literature review | (R news & tutorials) 2. What is “Data Science” Anyway? 3. Hal Varian on how the Web challenges managers – McKinsey Quarterly – Strategy – Innovation 4. The Three Sexy Skills of Data Geeks « Dataspora 5. Rise of the Data Scientist 6. dataists » A Taxonomy of Data Science 7. The Data Science Venn Diagram « Zero Intelligence Agents 8. Revolutions: Growth in data-related jobs 9. Building data startups: Fast, big, and focused – O’Reilly Radar
• 81. Data Tools Art Design 1. Periodic Table of Typefaces 2. Color Scheme Designer 3 3. Color Palette Generator - Generate a Color Palette for Any Image 4. COLOURlovers 5. Colorbrewer: Color Advice for Maps
• 82. Data Tools Image Searches 1. American Memory from the Library of Congress - The home page for the American Memory Historical Collections from the Library of Congress. American Memory provides free access to historical images, maps, sound recordings, and motion pictures that document the American experience. American Memory offers primary source materials that chronicle historical events, people, places, and ideas that continue to shape America. 2. Galaxy of Images | Smithsonian Institution Libraries 3. Flickr Search 4. 50 Websites For Free Vector Images Download - Design weblog for designers, bloggers and tech users. Covering useful tools, tutorials, tips and inspirational photos. 5. Images - Google Images. The most comprehensive image search on the web. 6. Trade Literature – a set on Flickr 7. Compfight / A Flickr Search Tool 8. morgueFile free photos for creatives by creatives 9. stock.xchng – the leading free stock photography site 10. The Ultimate Collection Of Free Vector Packs – Smashing Magazine 11. How to Create Animated GIFs Using Photoshop CS3 – wikiHow 12. IAN Symbol Libraries (Free Vector Symbols and Icons) – Integration and Application Network 13. Usability.gov 14. best icons 15. Iconspedia 16. IconFinder 17. IconSeeker
• 83. Data Tools Invisible Web
1. 10 Search Engines to Explore the Invisible Web
2. Scirus – for scientific information - The most comprehensive scientific research tool on the web. With over 410 million scientific items indexed at last count, it allows researchers to search for not only journal content but also scientists' homepages, courseware, pre-print server material, patents and institutional repository and website information.
3. TechXtra: Engineering, Mathematics, and Computing - TechXtra is a free service which can help you find articles, books, the best websites, the latest industry news, job announcements, technical reports, technical data, full-text eprints, the latest research, theses & dissertations, teaching and learning resources and more, in engineering, mathematics and computing.
4. Welcome to INFOMINE: Scholarly Internet Resource Collections - INFOMINE is a virtual library of Internet resources relevant to faculty, students, and research staff at the university level. It contains useful Internet resources such as databases, electronic journals, electronic books, bulletin boards, mailing lists, online library card catalogs, articles, directories of researchers, and many other types of information.
5. The WWW Virtual Library - The WWW Virtual Library (VL) is the oldest catalogue of the Web, started by Tim Berners-Lee, the creator of HTML and of the Web itself, in 1991 at CERN in Geneva. Unlike commercial catalogues, it is run by a loose confederation of volunteers, who compile pages of key links for particular areas in which they are expert; even though it isn't the biggest index of the Web, the VL pages are widely recognised as being amongst the highest-quality guides to particular sections of the Web.
6. Intute - Intute is a free online service that helps you to find web resources for your studies and research. With millions of resources available on the Internet, it can be difficult to find useful material. We have reviewed and evaluated thousands of resources to help you choose key websites in your subject.
7. CompletePlanet – Discover over 70,000+ databases and specialty search engines - There are hundreds of thousands of databases that contain Deep Web content. CompletePlanet is the front door to these Deep Web databases on the Web and to the thousands of regular search engines; it is the first step in trying to find highly topical information.
8. Infoplease: Encyclopedia, Almanac, Atlas, Biographies, Dictionary, Thesaurus - Information Please has been providing authoritative answers to all kinds of factual questions since 1938: first as a popular radio quiz show, then starting in 1947 as an annual almanac, and since 1998 on the Internet at www.infoplease.com. Many things have changed since 1938, but not our dedication to providing reliable information, in a way that engages and entertains.
9. DeepPeep: discover the hidden web - DeepPeep is a search engine specialized in Web forms. The current beta version tracks 45,000 forms across 7 domains. DeepPeep helps you discover the entry points to content in Deep Web (aka Hidden Web) sites, including online databases and Web services. Advanced search allows you to perform more specific queries. Besides specifying keywords, you can also search for specific form element labels, i.e., the description of the form attributes.
10. IncyWincy: The Invisible Web Search Engine - IncyWincy is a showcase of Net Research Server (NRS) 5.0, a software product that provides a complete search portal solution, developed by LoopIP LLC. LoopIP licenses the NRS engine and provides consulting expertise in building search solutions.
• 84. Data Tools Metadata Object Description Schema: MODS (Library of Congress) and Outline of Elements and Attributes in MODS Version 3.4 - This document contains a listing of elements and their related attributes in MODS Version 3.4 with values or value sources where applicable. It is an "outline" of the schema. Items highlighted in red indicate changes made to MODS in Version 3.4. All top-level elements and all attributes are optional, but you must have at least one element. Subelements are optional, although in some cases you may not have empty containers. Attributes are not in a mandated sequence and not repeatable (per XML rules). "Ordered" below means the subelements must occur in the order given. Elements are repeatable unless otherwise noted. "Authority" attributes are either followed by codes for authority lists (e.g., iso639-2b) or "see" references that link to documents that contain codes for identifying authority lists. For additional information about any MODS elements (version 3.4 elements will be added soon), please see the MODS User Guidelines.
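To make the schema less abstract, here is a minimal sketch in Python that reads a tiny MODS-style record with the standard library. The titleInfo/title and name/namePart elements and the namespace URI are taken from MODS; the record content itself is invented for illustration.

```python
# Minimal sketch: reading a tiny MODS-style record with the standard library.
# Element names (titleInfo, title, name, namePart) follow MODS; the data is invented.
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"  # MODS namespace URI
record = f"""
<mods xmlns="{MODS_NS}" version="3.4">
  <titleInfo>
    <title>Data Science Meetup Slides</title>
  </titleInfo>
  <name type="personal">
    <namePart>Walker, Michael</namePart>
  </name>
</mods>
"""

root = ET.fromstring(record)
ns = {"mods": MODS_NS}
title = root.findtext("mods:titleInfo/mods:title", namespaces=ns)
creator = root.findtext("mods:name/mods:namePart", namespaces=ns)
print(title, "/", creator)
```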
  • 85. Data Tools wiki.dbpedia.org : About - DBpedia is a community effort to extract structured information from Wikipedia and to make this information available on the Web. DBpedia allows you to ask sophisticated queries against Wikipedia, and to link other data sets on the Web to Wikipedia data. We hope this will make it easier for the amazing amount of information in Wikipedia to be used in new and interesting ways, and that it might inspire new mechanisms for navigating, linking and improving the encyclopaedia itself.
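As a small illustration of "asking sophisticated queries against Wikipedia," the sketch below sends a SPARQL query to DBpedia's public endpoint from Python. The particular query (countries and their capitals) and the result handling are illustrative assumptions, not part of DBpedia's documentation.

```python
# Minimal sketch: querying the public DBpedia SPARQL endpoint with requests.
# The query itself (countries and capitals) is just an illustration.
import requests

ENDPOINT = "https://dbpedia.org/sparql"
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?country ?capital WHERE {
  ?country a dbo:Country ;
           dbo:capital ?capital .
} LIMIT 10
"""

response = requests.get(
    ENDPOINT,
    params={"query": QUERY},
    headers={"Accept": "application/sparql-results+json"},
    timeout=30,
)
response.raise_for_status()

# Standard SPARQL JSON results: head + results.bindings
for row in response.json()["results"]["bindings"]:
    print(row["country"]["value"], "->", row["capital"]["value"])
```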
• 86. Data Tools Semantic Web – W3C - In addition to the classic "Web of documents," W3C is helping to build a technology stack to support a "Web of data," the sort of data you find in databases. The ultimate goal of the Web of data is to enable computers to do more useful work and to develop systems that can support trusted interactions over the network. The term "Semantic Web" refers to W3C's vision of the Web of linked data. Semantic Web technologies enable people to create data stores on the Web, build vocabularies, and write rules for handling data. Linked data are empowered by technologies such as RDF, SPARQL, OWL, and SKOS.
RDA: Resource Description & Access | www.rdatoolkit.org - Designed for the digital world and an expanding universe of metadata users, RDA: Resource Description and Access is the new, unified cataloging standard. The online RDA Toolkit subscription is the most effective way to interact with the new standard. More on RDA.
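As a small illustration of the RDF and SPARQL technologies mentioned above, the sketch below uses the Python rdflib package (an assumption; the W3C documents do not prescribe a library) to build a three-triple graph, serialize it as Turtle, and query it with SPARQL. The people and FOAF statements are invented.

```python
# Minimal sketch: a tiny RDF graph built and queried with rdflib (pip install rdflib).
# The resources and FOAF statements below are invented for illustration.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import FOAF, RDF

EX = Namespace("http://example.org/people/")

g = Graph()
g.add((EX.alice, RDF.type, FOAF.Person))
g.add((EX.alice, FOAF.name, Literal("Alice")))
g.add((EX.alice, FOAF.knows, EX.bob))

# Turtle serialization (rdflib 6+ returns a str)
print(g.serialize(format="turtle"))

# SPARQL over the in-memory graph
query = "SELECT ?name WHERE { ?p a foaf:Person ; foaf:name ?name }"
for (name,) in g.query(query, initNs={"foaf": FOAF}):
    print(name)
```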
  • 87. Data Tools Cataloging Cultural Objects - A Guide to Describing Cultural Works and Their Images (CCO) is a manual for describing, documenting, and cataloging cultural works and their visual surrogates. The primary focus of CCO is art and architecture, including but not limited to paintings, sculpture, prints, manuscripts, photographs, built works, installations, and other visual media. CCO also covers many other types of cultural works, including archaeological sites, artifacts, and functional objects from the realm of material culture. Library of Congress Authorities (Search for Name, Subject, Title and Name/Title) - Using Library of Congress Authorities, you can browse and view authority headings for Subject, Name, Title and Name/Title combinations; and download authority records in MARC format for use in a local library system. This service is offered free of charge. Search Tools and Databases (Getty Research Institute) - Use these search tools to access library materials, specialized databases, and other digital resources.
• 88. Data Tools
Art & Architecture Thesaurus (Getty Research Institute) - Learn about the purpose, scope and structure of the AAT. The AAT is an evolving vocabulary, growing and changing thanks to contributions from Getty projects and other institutions. Find out more about the AAT's contributors.
Getty Thesaurus of Geographic Names (Getty Research Institute) - Learn about the purpose, scope and structure of the TGN. The TGN is an evolving vocabulary, growing and changing thanks to contributions from Getty projects and other institutions. Find out more about the TGN's contributors.
DCMI Metadata Terms
The Digital Object Identifier System
The Federal Geographic Data Committee
• 89. 9 mistakes that will kill the best data analyses 1. Sampling or design of experiment not properly done 2. Non-robust cross-validation (see the sketch below) 3. Poor communication of results to management or clients 4. Poor data visualization 5. Does not solve the business problem 6. Database misses important data or fields 7. Failure to leverage external data 8. Can't make business data silos "talk to each other" 9. Developers (production people) and designers speak "different languages"
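On point 2, "non-robust cross-validation" often means preprocessing (scaling, feature selection) fitted on the full dataset before splitting, which leaks information into the held-out folds. A minimal sketch of the safer pattern is shown below, using scikit-learn as an assumed example library and synthetic data.

```python
# Minimal sketch: keep preprocessing inside each cross-validation fold
# so nothing is ever fit on held-out data. Uses scikit-learn and synthetic data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# The scaler is part of the pipeline, so it is re-fit on the training
# portion of every fold rather than on the whole dataset up front.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

scores = cross_val_score(model, X, y, cv=cv)
print("Fold accuracies:", scores.round(3), "mean:", scores.mean().round(3))
```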
  • 90. Thank You Presentation by: Michael Walker Rose Business Technologies 720.373.2200 m@rosebt.com http://www.rosebt.com