6. Top 5 Big Data Challenges
1. Deciding what data is relevant
2. Cost of technology infrastructure
3. Lack of skills to analyze the data
4. Lack of skills to manage big data projects
5. Lack of business support
7. Most Difficult Big Data Skills to Find
1. Advanced analytics, predictive analytics
2. Complex event processing
3. Rules management
4. Business intelligence tools
5. Data integration
8. Big Data Drivers
Analysis of…:
1. Operational data
2. Online customer data
3. Sales transactions data
4. Machine or device data
5. Service innovation
9. Definitions
Big data analytics is the application of advanced analytic
techniques to very big data sets.
Big data is a new generation of technologies and
architectures designed to extract value economically
from very large volumes of a wide variety of data
by enabling high-velocity capture, discovery and/or
analysis.
10. Horizontal & Vertical Applications
Big Data technology can be deployed for business
processes such as the following:
• Customer relationship management (sales, marketing,
customer service)
• Supply chain and operations
• Administration (finance and accounting, human
resources, legal)
• Research and development
• Information technology management
• Risk management
11. Horizontal & Vertical Applications
In addition, big data technology can be used for industry-
specific applications such as the following:
• Logistics optimization in the transportation industry
• Price optimization in the retail industry
• Intellectual property management in the media and entertainment industry
• Natural resource exploration in the oil and gas industry
• Warranty management in the manufacturing industry
• Crime prevention and investigation in local law enforcement
• Predictive damage assessments in the insurance industry
• Fraud detection in the banking industry
• Patient treatment and fraud detection in the healthcare industry
12. Data Science Teams
Four (4) person teams
Optimal skill mix:
1. Business Leader (consumer)
2. Statistics
3. Data Modeler
4. IT
13. Data Science Use Case / Scenario
Each team selects a use case / scenario
Thesis
Data sources
Analytical tools / platforms
14. Use Case
Example: I suggest there is a correlation
between size of government and economic
growth.
Thesis: Bigger government = slower economic
growth
Data sources: Open government statistics;
Yahoo Finance; Bloomberg
Tool: Qubole on Amazon PaaS
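A first check of a thesis like this is often a simple correlation coefficient. The sketch below is only an illustration: the function name `pearson_r` and the two lists are assumptions, and the numbers are made-up placeholders rather than real government or market statistics.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Placeholder values for illustration only (NOT real statistics).
gov_spending_pct_gdp = [20.0, 30.0, 40.0, 50.0]
gdp_growth_pct = [4.0, 3.0, 2.5, 1.0]

r = pearson_r(gov_spending_pct_gdp, gdp_growth_pct)
print(f"r = {r:.3f}")
```

A negative r would be consistent with the thesis, but a correlation alone does not show that one variable causes the other.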
15. Data Modeling
A data model is a plan for building a database.
To use a common analogy, the data model is
equivalent to an architect's building plans.
16. Data Modeling
Three different types of data models:
1) Conceptual data models.
These models, sometimes called domain models, are
typically used to explore domain concepts with project
stakeholders. On Agile teams high-level conceptual
models are often created as part of your initial
requirements envisioning efforts as they are used to
explore the high-level static business structures and
concepts. On traditional teams conceptual data models
are often created as the precursor to LDMs or as
alternatives to LDMs.
17. Data Modeling
2) Logical data models (LDMs).
LDMs are used to explore the domain concepts, and their
relationships, of your problem domain. This could be
done for the scope of a single project or for your entire
enterprise. LDMs depict the logical entity types, typically
referred to simply as entity types, the data attributes
describing those entities, and the relationships between
the entities. LDMs are rarely used on Agile projects,
although they often are on traditional projects (where they
rarely seem to add much value in practice).
18. Data Modeling
3) Physical data models (PDMs).
PDMs are used to design the internal schema of a
database, depicting the data tables, the data columns of
those tables, and the relationships between the tables.
PDMs often prove to be useful on both Agile and
traditional projects and as a result the focus of this article
is on physical modeling.
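As a rough illustration of what a physical data model specifies, the sketch below creates two tables, their columns and one relationship in an in-memory SQLite database. The table and column names (`customer`, `sales_order`, and so on) are hypothetical and chosen only for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

# Tables, columns and the relationship between them (hypothetical schema).
conn.executescript("""
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE sales_order (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
    order_date  TEXT NOT NULL,
    total_cents INTEGER NOT NULL
);
""")

conn.execute("INSERT INTO customer (customer_id, name) VALUES (1, 'Example Customer')")
conn.execute("INSERT INTO sales_order VALUES (10, 1, '2013-01-15', 2500)")

tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
print(tables)
```

The foreign key is the physical form of the "customer places order" relationship: an order row that points at a non-existent customer is rejected by the database.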
21. Models of Data
A framework to organize and analyze data.
Predictive, Descriptive, Prescriptive Analytics
There are three types of data analysis:
Predictive (forecasting)
Descriptive (business intelligence and data mining)
Prescriptive (optimization and simulation)
22. Models of Data
Predictive Analytics
Predictive analytics turns data into valuable, actionable
information. Predictive analytics uses data to determine the
probable future outcome of an event or a likelihood of a
situation occurring.
Predictive analytics encompasses a variety of statistical
techniques from modeling, machine learning, data mining
and game theory that analyze current and historical facts to
make predictions about future events.
23. Models of Data
Predictive Analytics
Three basic cornerstones of predictive analytics are:
Predictive modeling
Decision Analysis and Optimization
Transaction Profiling
An example of using predictive analytics is optimizing customer
relationship management systems. Predictive analytics can help an
organization analyze all of its customer data, exposing
patterns that predict customer behavior.
24. Models of Data
Predictive Analytics
Another example: for an organization that offers multiple
products, predictive analytics can help analyze customers’
spending, usage and other behavior, leading to efficient
cross-selling, that is, selling additional products to current
customers.
This directly leads to higher profitability per customer and
stronger customer relationships.
25. Models of Data
Descriptive Analytics
Descriptive analytics looks at data and analyzes past events for
insight as to how to approach the future. Descriptive analytics
looks at past performance and understands that performance
by mining historical data to look for the reasons behind past
success or failure.
Almost all management reporting such as sales, marketing,
operations, and finance, uses this type of post-mortem
analysis.
26. Models of Data
Descriptive Analytics
Descriptive models quantify relationships in data in a way that is
often used to classify customers or prospects into groups.
Unlike predictive models that focus on predicting a single
customer behavior (such as credit risk), descriptive models
identify many different relationships between customers or
products.
Descriptive models do not rank-order customers by their
likelihood of taking a particular action the way predictive
models do.
27. Models of Data
Descriptive Analytics
Descriptive models can be used, for example, to categorize
customers by their product preferences and life stage.
Descriptive modeling tools can be utilized to develop further
models that can simulate a large number of individualized
agents and make predictions.
For example, descriptive analytics examines historical electricity
usage data to help plan power needs and allow electric
companies to set optimal prices.
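The customer categorization described on this slide can be sketched as a simple count of customers per (life stage, product preference) group. Everything in the example is an assumption made for illustration: the records are placeholders, the field names are hypothetical, and the age cut-offs in `life_stage` are arbitrary.

```python
from collections import Counter

# Placeholder customer records for illustration only (hypothetical fields).
customers = [
    {"id": 1, "age": 24, "preferred_product": "mobile plan"},
    {"id": 2, "age": 27, "preferred_product": "mobile plan"},
    {"id": 3, "age": 41, "preferred_product": "home broadband"},
    {"id": 4, "age": 45, "preferred_product": "mobile plan"},
    {"id": 5, "age": 67, "preferred_product": "home broadband"},
]

def life_stage(age):
    """Map an age to a coarse life stage; the cut-offs are assumptions."""
    if age < 30:
        return "young adult"
    if age < 60:
        return "mid-life"
    return "senior"

# Descriptive step: count customers in each (life stage, preference) segment.
segments = Counter(
    (life_stage(c["age"]), c["preferred_product"]) for c in customers
)
for (stage, product), count in sorted(segments.items()):
    print(f"{stage:12s} {product:15s} {count}")
```

Note that nothing here ranks customers or predicts an outcome; the output only describes how the existing customers are distributed.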
28. Models of Data
Prescriptive Analytics
Prescriptive analytics automatically synthesizes big data,
mathematical sciences, business rules, and machine learning
to make predictions and then suggests decision options to
take advantage of the predictions.
Prescriptive analytics goes beyond predicting future outcomes
by also suggesting actions to benefit from the predictions and
showing the decision maker the implications of each decision
option. Prescriptive analytics not only anticipates what will
happen and when it will happen, but also why it will happen.
29. Models of Data
Prescriptive Analytics
Further, prescriptive analytics can suggest decision options on
how to take advantage of a future opportunity or mitigate a
future risk and illustrate the implication of each decision
option.
In practice, prescriptive analytics can continually and
automatically process new data to improve prediction
accuracy and provide better decision options.
30. Models of Data
Prescriptive Analytics
Prescriptive analytics synergistically combines data, business
rules, and mathematical models. The data inputs to
prescriptive analytics may come from multiple sources,
internal (inside the organization) and external (such as social media).
The data may be structured, which includes numerical
and categorical data, or unstructured, such as
text, images, audio, and video data, including big data.
Business rules define the business process and include
constraints, preferences, policies, best practices, and
boundaries. Mathematical models are techniques derived
from mathematical sciences and related disciplines including
applied statistics, machine learning, operations research, and
natural language processing.
31. Models of Data
Prescriptive Analytics
For example, prescriptive analytics can benefit healthcare
strategic planning by using analytics to leverage operational
and usage data combined with data of external factors such
as economic data, population demographic trends and
population health trends, to more accurately plan for future
capital investments such as new facilities and equipment
utilization as well as understand the trade-offs between
adding additional beds and expanding an existing facility
versus building a new one.
32. Models of Data
Prescriptive Analytics
Another example is energy and utilities. Natural gas prices
fluctuate dramatically depending upon supply, demand,
econometrics, geo-politics, and weather conditions. Gas
producers, transmission (pipeline) companies and utility firms
have a keen interest in more accurately predicting gas prices
so that they can lock in favorable terms while hedging
downside risk.
Prescriptive analytics can accurately predict prices by modeling
internal and external variables simultaneously and also
provide decision options and show the impact of each
decision option.
36. Open Data Sources
Freebase
Data Hub
Numbrary
Peter Skomoroch's Delicious Data
InfoChimps
Open Data Sites
DBpedia
theinfo.org
Lending Club Statistics
MAF/TIGER (US Census Geo) Database
Reuters Corpora (RCV1, RCV2, TRC2)
Open Street Map
MusicBrainz
Jigsaw
Opentick
37. Open Data Sources
Historical Data, Yahoo Finance
Historical Foreign Exchange Data, Federal Reserve Bank of New York
Graduate School of Business, Stanford University
Proprietary Trading Articles & Resources
Wilmott.com
DefaultRisk.com, Credit Risk Modeling Resource: Papers, Books, Conferences, Jobs
Forex Factory, Forums
NBER Papers in Asset Pricing: Stocks, Bonds and Foreign Currency
Financial Engineering Books, International Association of Financial Engineers
38. Open Data Sources
• Literacy, Gross Domestic Product, Income and Military Expenditures for 154 Countries
• Continent Codes for Countries
• Source: Various Wikipedia Articles
• Daily Precipitation, Min and Max Temperatures for Berkeley for the first 10 months of 2005
• Source: http://hurricane.ncdc.noaa.gov/dly/DLY
• Release Dates and Box Office Earnings for Top Movies
• Source: http://www.movieweb.com/movies/boxoffice/alltime.php
• See Also: http://imdb.com/Top/
• Bush-Kerry Election Results 2004
• US State Population, 2003 and 2004
• Source: http://www.factmonster.com/ipka/A0004986.html
• Information about Cars (1978-1979)
• Diabetes in Pima Indians
• Information about Diabetes data source: http://www.ics.uci.edu/~mlearn/MLRepository.html
• Updated world data with new variables
• Wine Recognition Data
• Information about Wine data source: http://www.ics.uci.edu/~mlearn/MLRepository.html
• Nutritional Information about Crackers source: http://www.math.csi.cuny.edu/st/Projects
• XML Plant Catalog source: http://www.w3schools.com/xml/
• US Wheat Production 1910-2004 source: http://usda.mannlib.cornell.edu/data-sets/crops/88008/
• Birthdays and Terms of US Senators source: Wikipedia
• Weight and Sleep Information of Various Animals
• Information about Sleep Data Set
• SQLite Album database
• Iron dataset
40. Statistical Analysis
Statistical Analysis answers the questions: Why is this
happening? What opportunities am I missing?
Example: Banks can discover why an increasing number of
customers are refinancing their homes.
Here we can begin to run some complex analytics, like
frequency models and regression analysis. We can
begin to look at why things are happening using the
stored data and then begin to answer questions based
on the data.
41. Forecasting
Forecasting answers the questions: What if these trends
continue? How much is needed? When will it be
needed?
Example: Retailers can predict how demand for individual
products will vary from store to store.
Forecasting is one of the hottest markets – and hottest
analytical applications – right now. It applies everywhere.
In particular, forecasting demand helps supply just
enough inventory, so you don’t run out or have too
much.
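One of the simplest ways to forecast demand is exponential smoothing, sketched below. This is an illustrative choice of technique, not one named on the slide; the function name, the smoothing weight `alpha` and the weekly unit counts are all assumptions.

```python
def exponential_smoothing_forecast(history, alpha=0.5):
    """One-step-ahead forecast by simple exponential smoothing.

    Each new observation pulls the running level toward itself by a
    fraction alpha; the final level is the forecast for the next period.
    """
    if not history:
        raise ValueError("history must not be empty")
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    level = history[0]
    for observed in history[1:]:
        level = alpha * observed + (1 - alpha) * level
    return level

# Placeholder weekly demand for one product at one store (not real data).
weekly_units = [100, 120, 110, 130]
next_week = exponential_smoothing_forecast(weekly_units, alpha=0.5)
print(next_week)
```

Running the same function per product and per store gives the store-by-store view of demand mentioned in the retail example.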
42. Predictive Modeling
Predictive Modeling answers the questions: What will
happen next? How will it affect my business?
Example: Hotels and casinos can predict which VIP
customers will be more interested in particular vacation
packages. If you have 10 million customers and want to
do a marketing campaign, who's most likely to respond?
How do you segment that group? And how do you
determine who's most likely to leave your organization?
Predictive modeling provides the answers.
43. Optimization
Optimization answers the questions: How do we do things
better? What is the best decision for a complex problem?
Example: Given business priorities, resource constraints
and available technology, determine the best way to
optimize your IT platform to satisfy the needs of every
user.
Optimization supports innovation. It takes your resources
and needs into consideration and helps you find the best
possible way to accomplish your goals.
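A toy version of such a problem is choosing which upgrades to fund under a fixed budget. The sketch below tries every combination, which is only practical for a handful of options; the upgrade names, costs and benefit scores are hypothetical.

```python
from itertools import combinations

def best_subset(options, budget):
    """Return (names, total_benefit) of the best affordable combination.

    options is a list of (name, cost, benefit) tuples. Every subset is
    checked, so this only suits small problems.
    """
    best_combo, best_benefit = (), 0
    for size in range(len(options) + 1):
        for combo in combinations(options, size):
            cost = sum(c for _, c, _ in combo)
            benefit = sum(b for _, _, b in combo)
            if cost <= budget and benefit > best_benefit:
                best_combo, best_benefit = combo, benefit
    return [name for name, _, _ in best_combo], best_benefit

# Hypothetical IT upgrades: (name, cost, benefit score).
upgrades = [
    ("more storage", 4, 5),
    ("faster network", 3, 4),
    ("extra servers", 5, 7),
]
chosen, total = best_subset(upgrades, budget=8)
print(chosen, total)
```

Real optimization problems of this kind are usually handed to a linear or integer programming solver rather than enumerated by hand.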
44. Conceptual Modeling
Conceptual Modeling brings together the business and
technology views to define the solution scope.
It is more than technical architecture or data context
diagrams. Technical architecture and data context
diagrams have their place, but the critical skill is the
business view (vs. technical view) of the solution scope.
This is critical to engaging stakeholders and setting the
stage for innovation.
46. Statistical Models
Logistic Regression
Nonlinear Regression
Discriminant Analysis
Nearest Neighbor
Factor & Principal Components Analysis
Copula Models
Cross-Validation
Bayesian Statistics
Monte Carlo, Classic Methods
Markov Chain Monte Carlo
47. Statistical Models
Bootstrap & Jackknife
EM Algorithm
Missing Data Imputation
Outlier Diagnostics
Robust Estimation
Longitudinal (Panel) Data
Survival Analysis
Path Analysis
Propensity Score Matching
Stratified Samples (Survey Data)
48. Statistical Models
Experimental Design
Quality Control
Reliability Theory
Univariate Time Series
Multivariate Time Series
Markov Chains
Hidden Markov Models
Stochastic Volatility Models
Diffusions
Counting Processes
51. Statistical Models
Two simple yet powerful models:
Generalized Linear Regression Model
Random Forests
Suggestion: Keep it simple for the first use case.
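Logistic regression is one member of the generalized linear model family (binomial outcome, logit link) and is small enough to write out by hand. The sketch below fits it with plain gradient descent on placeholder data; the function names, learning rate and epoch count are assumptions, and a statistics package would normally be used instead.

```python
from math import exp

def sigmoid(z):
    """Logistic function: maps the linear predictor to a probability."""
    return 1.0 / (1.0 + exp(-z))

def fit_logistic(xs, ys, learning_rate=0.1, epochs=2000):
    """Fit p(y=1 | x) = sigmoid(w * x + b) by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            error = sigmoid(w * x + b) - y  # gradient of the log-loss
            grad_w += error * x
            grad_b += error
        w -= learning_rate * grad_w / n
        b -= learning_rate * grad_b / n
    return w, b

# Placeholder data: one predictor and a 0/1 outcome (not real observations).
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

w, b = fit_logistic(xs, ys)
p_low = sigmoid(w * 0.5 + b)
p_high = sigmoid(w * 4.0 + b)
print(f"w={w:.2f} b={b:.2f} p(x=0.5)={p_low:.2f} p(x=4.0)={p_high:.2f}")
```

The fitted probabilities can be used to rank-order cases, which is the behavior the earlier slides attribute to predictive models.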
53. Predictive Modeling Techniques
Problems with some predictive modeling techniques. Note that most of these techniques have evolved
over time (in the last 10 years) to the point where most drawbacks have been eliminated, making
the updated tools far different from, and better than, their original versions. Typically, the older,
flawed versions are still widely used.
1. Linear regression. Relies on normality, homoscedasticity and other assumptions, does not
capture highly non-linear, chaotic patterns. Prone to over-fitting. Parameters difficult to
interpret. Very unstable when independent variables are highly correlated. Fixes: variable
reduction, apply a transformation to your variables, use constrained regression (e.g. ridge or
Lasso regression).
2. Traditional decision trees. Very large decision trees are very unstable and impossible to
interpret, and prone to over-fitting. Fix: combine multiple small decision trees together instead
of using a large decision tree.
3. Linear discriminant analysis. Used for supervised clustering. Bad technique because it
assumes that clusters do not overlap and are well separated by hyperplanes. In practice, they
never are. Use density estimation techniques instead.
4. K-means clustering. Used for clustering, tends to produce circular clusters. Does not work well
with data points that are not a mixture of Gaussian distributions.
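The ridge fix listed under item 1 can be illustrated on two nearly identical predictors. The sketch below solves the two-variable normal equations directly, with `lam` as the ridge penalty (`lam = 0` is ordinary least squares). The function name and the mean-centered data are placeholders invented for this example.

```python
def ridge_two_features(x1, x2, y, lam):
    """Solve (X^T X + lam * I) beta = X^T y for two centered features.

    lam = 0 gives ordinary least squares; lam > 0 gives ridge regression.
    """
    a = sum(u * u for u in x1) + lam
    b = sum(u * v for u, v in zip(x1, x2))
    d = sum(v * v for v in x2) + lam
    p = sum(u * t for u, t in zip(x1, y))
    q = sum(v * t for v, t in zip(x2, y))
    det = a * d - b * b
    if det == 0:
        raise ValueError("singular system; increase lam")
    # Cramer's rule for the 2x2 linear system.
    return (d * p - b * q) / det, (a * q - b * p) / det

# Placeholder, mean-centered data: x2 is almost a copy of x1.
x1 = [-2.0, -1.0, 0.0, 1.0, 2.0]
x2 = [-2.0, -1.1, 0.1, 0.9, 2.1]
y = [-4.1, -1.9, 0.1, 2.0, 3.9]

ols = ridge_two_features(x1, x2, y, lam=0.0)
ridge = ridge_two_features(x1, x2, y, lam=1.0)
print("least squares:", ols)
print("ridge (lam=1):", ridge)
```

On this toy data, least squares splits the effect very unevenly between the two near-duplicate variables, while the ridge penalty yields two similar weights. That instability is what item 1 refers to.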
54. Predictive Modeling Techniques
5. Neural networks. Difficult to interpret, unstable, subject to over-fitting.
6. Maximum Likelihood estimation. Requires your data to fit with a prespecified probabilistic
distribution. Not data-driven. In many cases the pre-specified Gaussian distribution is a terrible fit
for your data.
7. Density estimation in high dimensions. Subject to what is referred to as the curse of dimensionality.
Fix: use (non parametric) kernel density estimators with adaptive bandwidths.
8. Naive Bayes. Used e.g. in fraud and spam detection, and for scoring. Assumes that variables are
independent, if not it will fail miserably. In the context of fraud or spam detection, variables
(sometimes called rules) are highly correlated. Fix: group variables into independent clusters of
variables (in each cluster, variables are highly correlated). Apply naive Bayes to the clusters. Or
use data reduction techniques. Bad text mining techniques (e.g. basic "word" rules in spam
detection) combined with naive Bayes produces absolutely terrible results with many false
positives and false negatives.
And remember to use sound cross-validation techniques when testing models!
55. Predictive Modeling Techniques
Poor cross-validation allows bad models to make the cut, by over-estimating
the true lift to be expected in future data, the true accuracy or the true ROI
outside the training set. Good cross-validation consists of:
• splitting your training set into multiple subsets (test and control subsets),
• including different types of clients and more recent data in the control sets
(than in your test sets),
• checking the quality of forecasted values on control sets,
• computing confidence intervals for individual errors (error defined e.g. as
|true value minus forecasted value|) to make sure that the error is small
enough AND not too volatile (it has small variance across all control sets).
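The checklist above can be sketched for time-ordered data as follows. This is a simplified illustration under stated assumptions: the splitting helper and the stand-in "model" (forecast the mean of the fitting data) are hypothetical, the series is a placeholder, and the code reports the mean absolute error per control set and its variance rather than full confidence intervals.

```python
from statistics import mean, pvariance

def chronological_splits(records, n_control_sets, control_size):
    """Yield (fit_set, control_set) pairs; each control set is more recent
    than the data used for fitting."""
    for k in range(n_control_sets, 0, -1):
        end = len(records) - (k - 1) * control_size
        start = end - control_size
        yield records[:start], records[start:end]

def mean_forecaster(fit_set):
    """Hypothetical stand-in model: always forecast the mean of the fit set."""
    return mean(fit_set)

# Placeholder time-ordered observations (oldest first), for illustration only.
series = [10, 12, 11, 13, 12, 14, 13, 15, 14, 16, 15, 17]

errors_per_control_set = []
for fit_set, control_set in chronological_splits(series, 3, 2):
    forecast = mean_forecaster(fit_set)
    abs_errors = [abs(actual - forecast) for actual in control_set]
    errors_per_control_set.append(mean(abs_errors))

print("mean absolute error per control set:", errors_per_control_set)
print("variance across control sets:", pvariance(errors_per_control_set))
```

A model passes this kind of check only if the errors are both small and similar from one control set to the next.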
56. Statistical Software
Almost all serious statistical analysis is done in one of the
following packages: R (SPlus), Matlab, SAS, SPSS and
Stata.
It does not mean that each of those packages is good for a
specific type of analysis. In fact, for most advanced areas,
only 2-3 packages will be suitable, providing enough
functionality or enough tools to implement this functionality
easily.
For example, a very important area of Markov Chain Monte
Carlo is doable in R, Matlab and SAS only, unless you want
to rely on convoluted macros written by random users on
the web.
57. Statistical Software
R & MATLAB
R and Matlab are the richest systems by far. They contain an
impressive number of libraries, which grows each day.
Even if a desired very specific model is not part of the
standard functionality, you can implement it yourself,
because R and Matlab are really programming languages
with relatively simple syntaxes. As "languages" they allow
you to express any idea. The question is whether you are a
good writer or not. In terms of modern applied statistics
tools, R libraries are somewhat richer than those of Matlab.
Also R is free. On the flip side, Matlab has much better
graphics, which you will not be ashamed to put in a paper or
a presentation.
58. Statistical Software
SPSS
On the other end of the spectrum is a package like SPSS. SPSS is quite narrow in
its capabilities and allows you to do only about half of the mainstream statistics.
It is quite useless for ambitious modeling and estimation procedures which are
part of kernel smoothing, pattern recognition or signal processing. Nonetheless,
SPSS is very popular among practitioners because it requires almost no
programming training. All you have to do is hit several buttons and
SPSS does all the calculations for you. In those cases when you need
something standard, SPSS may have it implemented fully. The SPSS output will
be quite detailed and visually pleasing. It will contain all the major tests and
diagnostic tools associated with the method and will allow you to write an
informative statistics section of your empirical analysis. In short, when the
method is there, it is faster to run than similar functionality in R or Matlab. So I
use SPSS often for standard requests from my clients, like running linear
regression, ANOVA or principal components analysis. SPSS gives you the
ability to program macros, but that feature is quite inflexible.
59. Statistical Software
SAS & STATA
Somewhere in between R, Matlab and SPSS lie SAS and Stata. SAS offers more
extensive analytics than Stata. It is composed of dozens of procedures with
massive, massive output, often covering more than ten pages. The idea of SAS
is not to listen to you that much. It is like an old grandfather, whom you approach
with a simple question but who instead tells you the story of his life. Many
procedures contain three times more than what you need to know about that
segment. So some time has to be spent on picking out the relevant output. SAS
procedures are invoked using simple scripts. Stata procedures can be invoked
by clicking buttons in the menu or by running simple scripts. In the menu part,
Stata resembles SPSS. Both SAS and Stata are programming languages, so
they allow you to build analytics around standard procedures. Stata is somewhat
more flexible than SAS. Still, in terms of programming flexibility, Stata and SAS
do not come even close to R or Matlab. Selected strengths of SAS compared to
all other packages: large data sets, speed, beautiful graphics, flexibility in
formatting the output, time series procedures, counting processes. Selected
strengths of Stata compared to all other packages: manipulation of survey data
(stratified samples, clustering), robust estimation and tests, longitudinal data
methods, multivariate time series.
61. Statistical Software
• Downloading R
• R Manuals (at CRAN)
• Accessing the SCF Remotely (includes how to get the necessary software)
• Class Bulletin Board (bspace)
• Driver to convert Windows Documents to PDF
• Introduction to R (pdf)
• Slides for a Course in R (pdf)
• R Graph Gallery
• statsnetbase Search for R Graphics to read Paul Murrell's book about plotting in R
• Some Notes on Saving Plots in R
• Free Graphical MySQL Client
• SQLite Graphical Client for Windows
• Instructions on running the Firefox SQLiteManager extension as an application on Mac OSX
• Accessing the Class MySQL Server through an SSH Tunnel
• Connecting to the MySQL server under Windows
• Introduction to Cluster Analysis (statsoft.nl)
• Fruit pictures for the "slot machine" (zipped)
• R TclTk examples
• More R TclTk examples
• Additional GUI examples: Deal or No Deal Piano
• HTML Form Tutorial
• Setting up your account for CGI scripting
• Running your own Webserver to test CGI programs (Mac & Linux)
• Notes on Document Preparation with Latex
• vi reference card
• emacs reference card
• R reference card
• More information on Dates and Times in R
• More information on Factors in R
62. Statistical Software
Books
• Competing on Analytics
• Analytics at Work
• Super Crunchers
• The Numerati
• Data Driven
• Data Source Handbook
• Programming Collective Intelligence
• Mining the Social Web
• Data Analysis with Open Source Tools
• Visualizing Data
• The Visual Display of Quantitative Information
• Envisioning Information
• Visual Explanations: Images and Quantities, Evidence and Narrative
• Beautiful Evidence
• Think Stats
• Data Analysis Using Regression and Multilevel/Hierarchical Models
• Applied Longitudinal Data Analysis
• Design of Observational Studies
• Statistical Rules of Thumb
• All of Statistics
• A Handbook of Statistical Analyses Using R
• Mathematical Statistics and Data Analysis
• The Elements of Statistical Learning
• Counterfactuals and Causal Inference
63. Statistical Software
• Mining of Massive Data Sets
• Data Analysis: What Can Be Learned From the Past 50 Years
• Bias and Causation
• Regression Modeling Strategies
• Probably Not
• Statistics as Principled Argument
• The Practice of Data Analysis
Great class notes on Data Science: http://statistics.berkeley.edu/classes/s133/all2011.pdf
Related Workshops
• Data Bootcamp, Strata 2011
• Machine Learning Summer School, Purdue 2011
• Looking at Data
64. Statistical Software
Courses
• Concepts in Computing with Data, Berkeley
• Practical Machine Learning, Berkeley
• Artificial Intelligence, Berkeley
• Visualization, Berkeley
• Data Mining and Analytics in Intelligent Business Services, Berkeley
• Data Science and Analytics: Thought Leaders, Berkeley
• Machine Learning, Stanford
• Paradigms for Computing with Data, Stanford
• Mining Massive Data Sets, Stanford
• Data Visualization, Stanford
• Algorithms for Massive Data Set Analysis, Stanford
• Research Topics in Interactive Data Analysis, Stanford
• Data Mining, Stanford
• Machine Learning, CMU
• Statistical Computing, CMU
• Machine Learning with Large Datasets, CMU
• Machine Learning, MIT
• Data Mining, MIT
• Statistical Learning Theory and Applications, MIT
• Data Literacy, MIT
• Introduction to Data Mining, UIUC
• Learning from Data, Caltech
• Introduction to Statistics, Harvard
• Data-Intensive Information Processing Applications, University of Maryland
65. Statistical Software
• Dealing with Massive Data, Columbia
• Data-Driven Modeling, Columbia
• Introduction to Data Mining and Analysis, Georgia Tech
• Computational Data Analysis: Foundations of Machine Learning and Da..., Georgia Tech
• Applied Statistical Computing, Iowa State
• Data Visualization, Rice
• Data Warehousing and Data Mining, NYU
• Data Mining in Engineering, Toronto
• Machine Learning and Data Mining, UC Irvine
• Knowledge Discovery from Data, Cal Poly
• Large Scale Learning, University of Chicago
• Data Science: Large-scale Advanced Data Analysis, University of Florida
• Strategies for Statistical Data Analysis, Universität Leipzig
Videos
• Lies, damned lies and statistics (about TEDTalks)
• The Joy of Stats
• Journalism in the Age of Data
66. Data Science Team Ideas
Keep it simple!
Work on a real problem from work.
Suggestions for more challenging problems:
Census Return Rate
Develop a statistical model to predict census mail return rates at the Census block
group level of geography. The Census Bureau will use this model for planning
purposes for the decennial census and for demographic sample surveys.
Develop and evaluate different statistical approaches to proposing the best
predictive model for geographic units. The intent is to improve current predictive
analytics.
67. Data Science Team Ideas
Hierarchical load forecasting problem: backcasting and forecasting hourly
loads (in kW) for a US utility with 20 zones.
Backcast and forecast at both the zonal level (20 series) and the system level (the sum of the 20
zonal series), 21 series in total. The data history (loads of 20 zones and
temperatures of 11 stations) ranges from the 1st hour of 2004/1/1 to the
6th hour of 2008/6/30. Given the actual temperature history, the 8 weeks below in
the load history are set to be missing and must be backcast. It's OK
to use the entire history to backcast these 8 weeks.
2005/3/6 - 2005/3/12;
2005/6/20 - 2005/6/26;
2005/9/10 - 2005/9/16;
2005/12/25 - 2005/12/31;
2006/2/13 - 2006/2/19;
2006/5/25 - 2006/5/31;
2006/8/2 - 2006/8/8;
2006/11/22 - 2006/11/28;
Need to forecast hourly loads from 2008/7/1 to 2008/7/7. No actual temperatures
68. Data Science Team Ideas
Wind power forecasting problem: predicting hourly power generation up to 48 hours ahead at 7 wind
farms
Based on historical measurements and additional wind forecast information (48-hour-ahead predictions of
wind speed and direction at the sites). The data is available for the period ranging from the 1st hour of
2009/7/1 to the 12th hour of 2012/6/28.
The period between 2009/7/1 and 2010/12/31 is a model identification and training period, while the
remainder of the dataset, that is, from 2011/1/1 to 2012/6/28, is there for evaluation. The training
period is to be used for designing and estimating models that predict wind power
generation at lead times from 1 to 48 hours ahead, based on past power observations and/or available
meteorological wind forecasts for that period. The evaluation part aims at mimicking real
operational conditions. For that, a number of 48-hour periods with missing power observations were
defined. All these power observations are to be predicted. These periods are defined as follows. The
first period with missing observations runs from 2011/1/1 at 01:00 until 2011/1/3 at 00:00. The second
period with missing observations runs from 2011/1/4 at 13:00 until 2011/1/6 at 12:00. Note that, to be
consistent, only the meteorological forecasts for that period that would actually be available in practice
are given. These two periods then repeat every 7 days until the end of the dataset. In between periods
with missing data, power observations are available for updating the models.
69. Data Science Team Ideas
Predict the online sales of a consumer product based on a data set of product features.
Build as good a model as possible to predict monthly online sales of a product.
Imagine the products are online self-help programs following an initial
advertising campaign.
Obtain data in the comma separated values (CSV) format. Each row in this data set
represents a different consumer product.
The first 12 columns (Outcome_M1 through Outcome_M12) contain the monthly
online sales for the first 12 months after the product launches.
Date_1 is the day number the major advertising campaign began and the product
launched.
Date_2 is the day number the product was announced and a pre-release
advertising campaign began.
Other columns in the data set are features of the product and the advertising
campaign. Quan_x are quantitative variables and Cat_x are categorical
variables. Binary categorical variables are measured as (1) if the product had
the feature and (0) if it did not.
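The column layout described above can be separated into targets and features before any modeling. The sketch below is a minimal loader built on the column prefixes from this slide (`Outcome_M`, `Date_`, `Quan_`, `Cat_`); the helper names are hypothetical, the sample row holds placeholder values, and it assumes no missing values, which a real file may well contain.

```python
import csv
import io

OUTCOME_COLS = [f"Outcome_M{m}" for m in range(1, 13)]

def split_columns(fieldnames):
    """Group column names by the prefixes described on the slide."""
    return {
        "outcomes": [c for c in fieldnames if c.startswith("Outcome_M")],
        "dates": [c for c in fieldnames if c.startswith("Date_")],
        "quantitative": [c for c in fieldnames if c.startswith("Quan_")],
        "categorical": [c for c in fieldnames if c.startswith("Cat_")],
    }

def load_products(csv_file):
    """Read one product per row; assumes every cell is filled in."""
    reader = csv.DictReader(csv_file)
    groups = split_columns(reader.fieldnames)
    products = []
    for row in reader:
        products.append({
            "monthly_sales": [float(row[c]) for c in groups["outcomes"]],
            "dates": {c: int(row[c]) for c in groups["dates"]},
            "quantitative": {c: float(row[c]) for c in groups["quantitative"]},
            "categorical": {c: row[c] for c in groups["categorical"]},
        })
    return groups, products

# Build a tiny in-memory CSV with placeholder values (NOT real data).
header = OUTCOME_COLS + ["Date_1", "Date_2", "Quan_1", "Cat_1"]
sample_row = [str(1000 - 50 * m) for m in range(1, 13)] + ["120", "90", "3.5", "1"]
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(header)
writer.writerow(sample_row)
buffer.seek(0)

groups, products = load_products(buffer)
print(groups["outcomes"][:3], products[0]["monthly_sales"][:3])
```

Categorical values are kept as strings here so that binary flags and multi-level categories can be encoded later in whatever way the chosen model needs.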
70. Data Science Team Ideas
Improve on the state of the art in credit scoring by predicting the probability that somebody will
experience financial distress in the next two years.
Banks play a crucial role in market economies. They decide who can get finance
and on what terms and can make or break investment decisions. For markets
and society to function, individuals and companies need access to credit.
Credit scoring algorithms, which make a guess at the probability of default, are the
method banks use to determine whether or not a loan should be granted.
Improve on the state of the art in credit scoring, by predicting the probability that
somebody will experience financial distress in the next two years.
The goal is to build a model that borrowers can use to help make the best financial
decisions. Obtain historical data on 250,000 borrowers.
71. Data Tools
Included is a list of tools, such as programming languages and web-based utilities,
data mining resources, some prominent organizations in the field, repositories
where you can play with data, events you may want to attend and important
articles you should take a look at.
The second segment of the list includes a number of art and design resources that
infographic designers might like, including color palette generators and image
searches. There are also some invisible web resources (if you’re looking for
something on Google and not finding it) and metadata resources so you can
appropriately curate your data.
72. Data Tools
Google Refine – A power tool for working with messy data (formerly Freebase Gridworks)
The Overview Project – Overview is an open-source tool to help journalists find stories in large amounts of data by cleaning, visualizing and
interactively exploring large document and data sets. Whether from government transparency initiatives, leaks or Freedom of Information
requests, journalists are drowning in more documents than they can ever hope to read.
Refine, reuse and request data | ScraperWiki – ScraperWiki is an online tool to make acquiring useful data simpler and more collaborative. Anyone
can write a screen scraper using the online editor. In the free version, the code and data are shared with the world. Because it’s a wiki, other
programmers can contribute to and improve the code.
Data Curation Profiles – This website is an environment where academic librarians of all kinds, special librarians at research facilities, archivists
involved in the preservation of digital data, and those who support digital repositories can find help, support and camaraderie in exploring avenues
to learn more about working with research data and the use of the Data Curation Profiles Tool.
Google Chart Tools – Google Chart Tools provide a perfect way to visualize data on your website. From simple line charts to complex hierarchical
tree maps, the chart gallery provides a large number of well-designed chart types. Populating your data is easy using the provided client- and
server-side tools.
22 free tools for data visualization and analysis
The R Journal – The R Journal is the refereed journal of the R project for statistical computing. It features short to medium length articles covering
topics that might be of interest to users or developers of R.
CS 229: Machine Learning – A widely referenced course by Professor Andrew Ng, CS 229: Machine Learning provides a broad introduction to
machine learning and statistical pattern recognition. Topics include supervised learning, unsupervised learning, learning theory, reinforcement
learning and adaptive control. Recent applications of machine learning, such as to robotic control, data mining, autonomous navigation,
bioinformatics, speech recognition, and text and web data processing are also discussed.
Google Research Publication: BigTable – Bigtable is a distributed storage system for managing structured data that is designed to scale to a very
large size: petabytes of data across thousands of commodity servers. Many projects at Google store data in Bigtable, including web indexing,
Google Earth, and Google Finance. These applications place very different demands on Bigtable, both in terms of data size (from URLs to web
pages to satellite imagery) and latency requirements (from backend bulk processing to real-time data serving). Despite these varied demands,
Bigtable has successfully provided a flexible, high-performance solution for all of these Google products. In this paper we describe the simple data
model provided by Bigtable, which gives clients dynamic control over data layout and format, and we describe the design and implementation of
Bigtable.
Scientific Data Management – An introduction.
Natural Language Toolkit – Open source Python modules, linguistic data and documentation for research and development in natural language
processing and text analytics, with distributions for Windows, Mac OSX and Linux.
Beautiful Soup – Beautiful Soup is a Python HTML/XML parser designed for quick turnaround projects like screen-scraping.
Mondrian: Pentaho Analysis – Pentaho Open source analysis OLAP server written in Java. Enabling interactive analysis of very large datasets stored
73. Data Tools
The Comprehensive R Archive Network - R is ‘GNU S’, a freely available language and environment for statistical computing and graphics which
provides a wide variety of statistical and graphical techniques: linear and nonlinear modelling, statistical tests, time series analysis, classification,
clustering, etc. Please consult the R project homepage for further information. CRAN is a network of ftp and web servers around the world that
store identical, up-to-date, versions of code and documentation for R. Please use the CRAN mirror nearest to you to minimize network load.
DataStax – Software, support, and training for Apache Cassandra.
Machine Learning Demos
Visual.ly – Infographics & Visualizations. Create, Share, Explore
Google Fusion Tables - Google Fusion Tables is a modern data management and publishing web application that makes it easy
to host, manage, collaborate on, visualize, and publish data tables online.
Tableau Software - Fast Analytics and Rapid-fire Business Intelligence from Tableau Software.
WaveMaker - WaveMaker is a rapid application development environment for building, maintaining and modernizing business-critical Web 2.0
applications.
Visualization: Annotated Time Line – Google Chart Tools – Google Code - An interactive time series line chart with optional annotations. The
chart is rendered within the browser using Flash.
Visualization: Motion Chart – Google Chart Tools – Google Code - A dynamic chart to explore several indicators over time. The chart is rendered
within the browser using Flash.
PhotoStats - Create gorgeous infographics about your iPhone photos.
Ionz - Ionz will help you craft an infographic about yourself.
chart builder - Powerful tools for creating a variety of charts for online display.
Creately - Online diagramming and design.
Pixlr Editor - A powerful online photo editor.
Google Public Data Explorer - The Google Public Data Explorer makes large datasets easy to explore, visualize and communicate. As the charts and
maps animate over time, the changes in the world become easier to understand. You don’t have to be a data expert to navigate between different
views, make your own comparisons, and share your findings.
Fathom - Fathom Information Design helps clients understand and express complex data through information graphics, interactive tools, and software
for installations, the web, and mobile devices. Led by Ben Fry. Enough said!
healthymagination | GE Data Visualization - Visualizations that advance the conversation about issues that shape our lives, and so we encourage
visitors to download, post and share these visualizations.
ggplot2 - ggplot2 is a plotting system for R, based on the grammar of graphics, which tries to take the good parts of base and lattice graphics and none
of the bad parts. It takes care of many of the fiddly details that make plotting a hassle (like drawing legends) as well as providing a powerful model
of graphics that makes it easy to produce complex multi-layered graphics.
74. Data Tools
Protovis - Protovis composes custom views of data with simple marks such as bars and dots. Unlike low-level graphics libraries that quickly become
tedious for visualization, Protovis defines marks through dynamic properties that encode data, allowing inheritance, scales and layouts to
simplify construction. Protovis is free and open-source, provided under the BSD License. It uses JavaScript and SVG for web-native
visualizations; no plugin required (though you will need a modern web browser)! Protovis is mostly declarative and designed to be learned by
example.
d3.js - D3.js is a small, free JavaScript library for manipulating documents based on data.
MATLAB – The Language Of Technical Computing - MATLAB® is a high-level language and interactive environment that enables you to perform
computationally intensive tasks faster than with traditional programming languages such as C, C++, and Fortran.
OpenGL – The Industry Standard for High Performance Graphics - OpenGL.org is a vendor-independent and organization-independent web site
that acts as one-stop hub for developers and consumers for all OpenGL news and development resources. It has a very large and continually
expanding developer and end-user community that is very active and vested in the continued growth of OpenGL.
Google Correlate - Google Correlate finds search patterns which correspond with real-world trends.
Revolution Analytics – Commercial Software & Support for the R Statistics Language - Revolution Analytics delivers advanced analytics
software at half the cost of existing solutions. By building on open source R—the world’s most powerful statistics software—with innovations in big
data analysis, integration and user experience, Revolution Analytics meets the demands and requirements of modern data-driven businesses.
22 Useful Online Chart & Graph Generators
The Best Tools for Visualization - Visualization is a technique to graphically represent sets of data. When data is large or abstract, visualization can
help make the data easier to read or understand. There are visualization tools for search, music, networks, online communities, and almost
anything else you can think of. Whether you want a desktop application or a web-based tool, many specific tools are available on the
web that let you visualize all kinds of data.
Visual Understanding Environment - The Visual Understanding Environment (VUE) is an Open Source project based at Tufts University. The
VUE project is focused on creating flexible tools for managing and integrating digital resources in support of teaching, learning and research. VUE
provides a flexible visual environment for structuring, presenting, and sharing digital information.
Bime – Cloud Business Intelligence | Analytics & Dashboards - Bime is a revolutionary approach to data analysis and dashboarding. It allows you
to analyze your data through interactive data visualizations and create stunning dashboards from the Web.
Data Science Toolkit - A collection of data tools and open APIs curated by our own Pete Warden. You can use it to extract text from a document, learn
the political leanings of a particular neighborhood, find all the names of people mentioned in a text and more.
BuzzData - BuzzData lets you share your data in a smarter, easier way. Instead of juggling versions and overwriting files, use BuzzData and enjoy a
social network designed for data.
SAP – SAP Crystal Solutions: Simple, Affordable, and Open BI Tools for Everyday Use
Project Voldemort
75. Data Tools
Data Mining
1. Weka - Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can
either be applied directly to a dataset or called from your own Java code. Weka contains tools for data
pre-processing, classification, regression, clustering, association rules, and visualization. It is also
well-suited for developing new machine learning schemes. Weka is open source software issued
under the GNU General Public License.
2. PSPP - PSPP is a program for statistical analysis of sampled data. It is a Free replacement for the
proprietary program SPSS, and appears very similar to it with a few exceptions. The most important of
these exceptions is that there are no “time bombs”; your copy of PSPP will not “expire” or
deliberately stop working in the future. Neither are there any artificial limits on the number of cases or
variables which you can use. There are no additional packages to purchase in order to get “advanced”
functions; all functionality that PSPP currently supports is in the core package. PSPP can perform
descriptive statistics, T-tests, linear regression and non-parametric tests. Its backend is designed to
perform its analyses as fast as possible, regardless of the size of the input data. You can use PSPP
with its graphical interface or the more traditional syntax commands.
76. Data Tools
3. Rapid-I - Rapid-I provides software, solutions, and services in the fields of predictive analytics, data
mining, and text mining. The company concentrates on automatic intelligent analyses on a large-scale
base, i.e. for large amounts of structured data like database systems and unstructured data like texts.
The open-source data mining specialist Rapid-I enables other companies to use leading-edge
technologies for data mining and business intelligence. The discovery and leverage of unused business
intelligence from existing data enables better informed decisions and allows for process
optimization. The main product of Rapid-I, the data analysis solution RapidMiner, is the world-leading
open-source system for knowledge discovery and data mining. It is available as a stand-alone
application for data analysis and as a data mining engine which can be integrated into other products. By
now, thousands of applications of RapidMiner in more than 30 countries give their users a competitive
edge. Among the users are well-known companies such as Ford, Honda, Nokia, Miele, Philips, IBM, HP,
Cisco, Merrill Lynch, BNP Paribas, Bank of America, mobilkom austria, Akzo Nobel, Aureus Pharma,
PharmaDM, Cyprotex, Celera, Revere, LexisNexis, Mitre and many medium-sized businesses
benefitting from the open-source business model of Rapid-I.
77. Data Tools
4. R Project – R is a language and environment for statistical computing and graphics. It is a GNU
project which is similar to the S language and environment which was developed at Bell Laboratories
(formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. R can be considered as
a different implementation of S. There are some important differences, but much code written for S runs
unaltered under R. R provides a wide variety of statistical (linear and nonlinear modelling, classical
statistical tests, time-series analysis, classification, clustering, …) and graphical techniques, and is
highly extensible. The S language is often the vehicle of choice for research in statistical methodology,
and R provides an Open Source route to participation in that activity. One of R’s strengths is the ease
with which well-designed publication-quality plots can be produced, including mathematical symbols
and formulae where needed. Great care has been taken over the defaults for the minor design choices
in graphics, but the user retains full control. R is available as Free Software under the terms of the Free
Software Foundation’s GNU General Public License in source code form. It compiles and runs on a
wide variety of UNIX platforms and similar systems (including FreeBSD and Linux), Windows and
MacOS.
78. Data Tools
Organizations
1. Data.gov
2. SDM group at LBNL
3. Open Archives Initiative
4. Code for America | A New Kind of Public Service
5. The #DataViz Daily
6. Institute for Advanced Analytics | North Carolina State University | Professor
Michael Rappa · MSA Curriculum
7. BuzzData | Blog, 25 great links for data-lovin’ journalists
8. MetaOptimize – Home – Machine learning, natural language processing,
predictive analytics, business intelligence, artificial intelligence, text analysis,
information retrieval, search, data mining, statistical modeling, and data
visualization
9. had.co.nz
10. Measuring Measures – Measuring Measures
79. Data Tools
Repositories
1. Repositories | DataCite
2. Data | The World Bank
3. Infochimps Data Marketplace + Commons: Download Sell or Share Databases,
statistics, datasets for free | Infochimps
4. Factual Home – Factual
5. Flowing Media: Your Data Has Something To Say
6. Chartsbin
7. Public Data Explorer
8. StatPlanet
9. ManyEyes
10. 25+ more ways to bring data into R
80. Data Tools
Articles
1. Data Science: a literature review | (R news & tutorials)
2. What is “Data Science” Anyway?
3. Hal Varian on how the Web challenges managers – McKinsey Quarterly –
Strategy – Innovation
4. The Three Sexy Skills of Data Geeks « Dataspora
5. Rise of the Data Scientist
6. dataists » A Taxonomy of Data Science
7. The Data Science Venn Diagram « Zero Intelligence Agents
8. Revolutions: Growth in data-related jobs
9. Building data startups: Fast, big, and focused – O’Reilly Radar
81. Data Tools
Art Design
1. Periodic Table of Typefaces
2. Color Scheme Designer 3
3. Color Palette Generator Generate A Color Palette For Any Image
4. COLOURlovers
5. Colorbrewer: Color Advice for Maps
82. Data Tools
Image Searches
1. American Memory from the Library of Congress - The home page for the American Memory Historical Collections from
the Library of Congress. American Memory provides free access to historical images, maps, sound recordings, and
motion pictures that document the American experience. American Memory offers primary source materials that
chronicle historical events, people, places, and ideas that continue to shape America.
2. Galaxy of Images | Smithsonian Institution Libraries
3. Flickr Search
4. 50 Websites For Free Vector Images Download - Design weblog for designers, bloggers and tech users. Covering
useful tools, tutorials, tips and inspirational photos.
5. Images - Google Images. The most comprehensive image search on the web.
6. Trade Literature – a set on Flickr
7. Compfight / A Flickr Search Tool
8. morgueFile free photos for creatives by creatives
9. stock.xchng – the leading free stock photography site
10. The Ultimate Collection Of Free Vector Packs – Smashing Magazine
11. How to Create Animated GIFs Using Photoshop CS3 – wikiHow
12. IAN Symbol Libraries (Free Vector Symbols and Icons) – Integration and Application Network
13. Usability.gov
14. best icons
15. Iconspedia
16. IconFinder
17. IconSeeker
83. Data Tools
Invisible Web
1. 10 Search Engines to Explore the Invisible Web
2. Scirus – for scientific information - The most comprehensive scientific research tool on the web. With over 410 million scientific items indexed at
last count, it allows researchers to search for not only journal content but also scientists’ homepages, courseware, pre-print server material,
patents and institutional repository and website information.
3. TechXtra: Engineering, Mathematics, and Computing - TechXtra is a free service which can help you find articles, books, the best websites, the
latest industry news, job announcements, technical reports, technical data, full text eprints, the latest research, thesis & dissertations, teaching
and learning resources and more, in engineering, mathematics and computing.
4. Welcome to INFOMINE: Scholarly Internet Resource Collections - INFOMINE is a virtual library of Internet resources relevant to faculty,
students, and research staff at the university level. It contains useful Internet resources such as databases, electronic journals, electronic
books, bulletin boards, mailing lists, online library card catalogs, articles, directories of researchers, and many other types of information.
5. The WWW Virtual Library - The WWW Virtual Library (VL) is the oldest catalogue of the Web, started by Tim Berners-Lee, the creator of HTML
and of the Web itself, in 1991 at CERN in Geneva. Unlike commercial catalogues, it is run by a loose confederation of volunteers, who compile
pages of key links for particular areas in which they are expert; even though it isn’t the biggest index of the Web, the VL pages are widely
recognised as being amongst the highest-quality guides to particular sections of the Web.
6. Intute - Intute is a free online service that helps you to find web resources for your studies and research. With millions of resources available on
the Internet, it can be difficult to find useful material. We have reviewed and evaluated thousands of resources to help you choose key websites
in your subject.
7. CompletePlanet – Discover over 70,000+ databases and specialty search engines - There are hundreds of thousands of
databases that contain Deep Web content. CompletePlanet is the front door to these Deep Web databases on the Web and to the thousands of
regular search engines — it is the first step in trying to find highly topical information. By tracing through...
8. Infoplease: Encyclopedia, Almanac, Atlas, Biographies, Dictionary, Thesaurus. - Information Please has been providing authoritative
answers to all kinds of factual questions since 1938—first as a popular radio quiz show, then starting in 1947 as an annual almanac, and
since 1998 on the Internet at www.infoplease.com. Many things have changed since 1938, but not our dedication to providing reliable
information, in a way that engages and entertains.
9. DeepPeep: discover the hidden web - DeepPeep is a search engine specialized in Web forms. The current beta version tracks 45,000
forms across 7 domains. DeepPeep helps you discover the entry points to content in Deep Web (aka Hidden Web) sites, including online
databases and Web services. Advanced search allows you to perform more specific queries. Besides specifying keywords, you can also search
for specific form element labels, i.e., the description of the form attributes.
10. IncyWincy: The Invisible Web Search Engine - IncyWincy is a showcase of Net Research Server (NRS) 5.0, a software product that provides a
complete search portal solution, developed by LoopIP LLC. LoopIP licenses the NRS engine and provides consulting expertise in building
search solutions.
84. Data Tools
Metadata
Description Schema: MODS (Library of Congress) and Outline of elements and
attributes in MODS version 3.4: MetadataObject - This document contains a
listing of elements and their related attributes in MODS Version 3.4 with values
or value sources where applicable. It is an “outline” of the schema. Items
highlighted in red indicate changes made to MODS in Version 3.4. All top-level
elements and all attributes are optional, but you must have at least one element.
Subelements are optional, although in some cases you may not have empty
containers. Attributes are not in a mandated sequence and not repeatable (per
XML rules). “Ordered” below means the subelements must occur in the order
given. Elements are repeatable unless otherwise noted. “Authority” attributes are
either followed by codes for authority lists (e.g., iso639-2b) or “see” references
that link to documents that contain codes for identifying authority lists. For
additional information about any MODS elements (version 3.4 elements will be
added soon), please see the MODS User Guidelines.
85. Data Tools
wiki.dbpedia.org : About - DBpedia is a community effort to extract structured
information from Wikipedia and to make this information available on the Web.
DBpedia allows you to ask sophisticated queries against Wikipedia, and to link
other data sets on the Web to Wikipedia data. We hope this will make it easier
for the amazing amount of information in Wikipedia to be used in new and
interesting ways, and that it might inspire new mechanisms for navigating,
linking and improving the encyclopaedia itself.
86. Data Tools
Semantic Web – W3C - In addition to the classic “Web of documents” W3C is
helping to build a technology stack to support a “Web of data,” the sort of data
you find in databases. The ultimate goal of the Web of data is to enable
computers to do more useful work and to develop systems that can support
trusted interactions over the network. The term “Semantic Web” refers to W3C’s
vision of the Web of linked data. Semantic Web technologies enable people to
create data stores on the Web, build vocabularies, and write rules for handling
data. Linked data are empowered by technologies such as RDF, SPARQL,
OWL, and SKOS.
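To make the "Web of data" idea concrete, here is a minimal sketch of data stored as subject-predicate-object statements, the shape that RDF uses, with a wildcard lookup that loosely resembles a single SPARQL triple pattern. It is plain Python rather than a real RDF or SPARQL library, and the `ex:` identifiers are made up for illustration.

```python
# Each statement is a (subject, predicate, object) triple.
# All identifiers below are hypothetical examples.
TRIPLES = [
    ("ex:book1", "ex:author", "ex:alice"),
    ("ex:book2", "ex:author", "ex:bob"),
    ("ex:book1", "ex:topic", "ex:metadata"),
]

def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]

# "Who wrote what?" leaves subject and object open.
authorship = match(TRIPLES, predicate="ex:author")
# "What do we know about ex:book1?" fixes only the subject.
about_book1 = match(TRIPLES, subject="ex:book1")
print(authorship)
print(about_book1)
```

Because every statement has the same three-part shape, data sets from different sources can be combined by simply concatenating their triples, which is the linking property the Semantic Web stack builds on.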
RDA: Resource Description & Access | www.rdatoolkit.org - Designed for the digital
world and an expanding universe of metadata users, RDA: Resource
Description and Access is the new, unified cataloging standard. The online RDA
Toolkit subscription is the most effective way to interact with the new standard.
More on RDA.
87. Data Tools
Cataloging Cultural Objects - A Guide to Describing Cultural Works and Their
Images (CCO) is a manual for describing, documenting, and cataloging cultural
works and their visual surrogates. The primary focus of CCO is art and
architecture, including but not limited to paintings, sculpture, prints, manuscripts,
photographs, built works, installations, and other visual media. CCO also covers
many other types of cultural works, including archaeological sites, artifacts, and
functional objects from the realm of material culture.
Library of Congress Authorities (Search for Name, Subject, Title and Name/Title) -
Using Library of Congress Authorities, you can browse and view authority
headings for Subject, Name, Title and Name/Title combinations; and download
authority records in MARC format for use in a local library system. This service
is offered free of charge.
Search Tools and Databases (Getty Research Institute) - Use these search tools to
access library materials, specialized databases, and other digital resources.
88. Data Tools
Art & Architecture Thesaurus (Getty Research Institute) - Learn about the purpose,
scope and structure of the AAT. The AAT is an evolving vocabulary, growing
and changing thanks to contributions from Getty projects and other institutions.
Find out more about the AAT’s contributors.
Getty Thesaurus of Geographic Names (Getty Research Institute) - Learn about the
purpose, scope and structure of the TGN. The TGN is an evolving vocabulary,
growing and changing thanks to contributions from Getty projects and other
institutions. Find out more about the TGN’s contributors.
DCMI Metadata Terms
The Digital Object Identifier System
The Federal Geographic Data Committee — Federal Geographic Data Committee
89. 9 mistakes that will kill the best data analyses
1. Sampling or design of experiment not properly done
2. Non-robust cross-validation
3. Poor communication of results to management or clients
4. Poor data visualization
5. Does not solve our business problems
6. Database misses important data or fields
7. Failure to leverage external data
8. Can't make business data silos "talk to each other"
9. Developers (production people) and designers speak "different languages"
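For mistake 2, one common way to make validation more robust is k-fold cross-validation: every record is held out exactly once, and the model is scored only on data it was not trained on. The sketch below is an illustration under stated assumptions. The helper names (`k_fold_splits`, `cross_validate`), the toy mean-predictor model and the toy data are all hypothetical.

```python
import random

def k_fold_splits(n, k, seed=0):
    """Yield (train_indices, test_indices) for each of k shuffled folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test

def cross_validate(xs, ys, fit, score, k=5, seed=0):
    """Fit on k-1 folds, score on the held-out fold, and repeat k times."""
    scores = []
    for train, test in k_fold_splits(len(xs), k, seed):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        scores.append(score(model, [xs[i] for i in test], [ys[i] for i in test]))
    return scores

# Toy model for illustration: always predict the mean of the training targets.
def fit_mean(xs, ys):
    return sum(ys) / len(ys)

def mean_squared_error(model, xs, ys):
    return sum((y - model) ** 2 for y in ys) / len(ys)

xs = list(range(20))
ys = [2.0 * x for x in xs]
scores = cross_validate(xs, ys, fit_mean, mean_squared_error, k=5)
print(len(scores), sum(scores) / len(scores))
```

Reporting the spread of the per-fold scores, not just their average, gives an indication of how stable the result is.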