Business Intelligence Project on Startup Success Factors

MIS 6324 : BUSINESS INTELLIGENCE
TERM PROJECT
ON
VIBHORE AGARWAL

University of Texas at Dallas
Agenda
Introduction – Business Understanding
Data Understanding
Data Cleaning and Preparation
Feature Selection
Model Building
Testing and Evaluation
Future Enhancement

BUSINESS UNDERSTANDING
SCOPE : Investment strategies for start-up companies are largely based on intuition and past experience. As a result, investors rely primarily on the need being addressed, the background of the founders, the size of the market and the ability of the company to scale after tasting success.
AIM : To perform a rigorous analysis that identifies the relevant factors and scores prospective start-ups on their potential to be successful.
RESULT : The model / analysis will then allow investors to make more informed decisions and rely less on intuition.
Process flow: Data Understanding / Exploration → Data Cleaning → Data Preparation → Feature Selection → Model Building → Testing and Evaluation
Client: Venture Capitalist

Data Understanding
• 116 columns; 472 (unique) rows / records; 1 target variable
• { } – 15% of the entire data
• 2 date columns and 1 year column
• 2 scorecard columns
• 4 company profile columns (tab and comma delimited)
• 4 investor portfolio columns and 1 funding received column
• ~10 columns for top management
• > 40 columns for team members
• Unstructured data
Exploration checks:
• # of unique values across columns
• Spread of attributes across columns
• Min / max frequency per attribute
Findings:
• No class imbalance
• Blanks “ ”: ~18 per column
• “No Info”: ~53 per column
• ~401 useful values per column
• ~362 out of 472 rows have at least 1 blank

Data Understanding
Dash-Board for Initial Analysis of Data

Data Cleaning and Data Preparation
In general, Data Cleaning and Data Preparation are data pre-processing steps that involve data filtering, aggregation and imputation of missing values. Three groups of techniques were used in conjunction to pre-process the data:
DATA IMPUTATION
• Class Mean Imputation (Clustering-Based Missing Value Imputation)
• Simulation of Neural Network Based Imputation
• Multivariate Imputation by Chained Equations (Excel Add-in)
DATA CLUSTERING
• Using business logic to create nominal bins, and using Fuzzy Lookup to remove the redundant groups and to map the raw data to the bins / buckets
• Clustering nominal data using a Cross Table, aka Bertin Matrix Visualization
• K-Means Clustering along with Ranking
DIMENSIONALITY REDUCTION
• Principal Component Analysis
• Logical operations based on Business Understanding

Situation Analysis – Types of Missing Data
Tab and comma delimited data, i.e. multiple variables inside a single cell: “Market Research|Marketing|Crowdfunding” vs. “Marketing, sales”
• Missing at random (MAR): given the observed data, data are missing independently of the unobserved data
• Missing not at random (MNAR): missing observations are related to the values of the unobserved data

Data Clustering
Creation of Bins / Buckets based on Business Logic
Binning lets you quickly categorize records. When we create buckets, we define multiple categories (buckets) used to group similar values.
Example: GREEN, RED and BLUE fall into the bucket COLORS; RECTANGLE, TRIANGLE and CIRCLE fall into the bucket SHAPES. The buckets are defined according to the need / requirement.
Source :
- https://www.exploreanalytics.com/wiki/index.php?title=Binning

Data Clustering
Fuzzy Logic for mapping and reducing redundant groups
A challenging problem in data management is that the same entity can be represented in multiple ways throughout the dataset, for example:
• Andy Hill
• Mr. Andrew Hill
• Hill, Mr. Andrew
Essentially, they all refer to the same person, but during analysis they are treated as different people.
Applying fuzzy logic allows us to identify records which are textually similar.
These variations result mainly from:
1) Merging of independent data sources
2) Spelling mistakes
3) Inconsistent naming conventions and abbreviations
Methodologies Used :
• Jaccard Similarity
• Weighted Jaccard Similarity and Tokenization of Records
• Token Weighting
• Transformations
• Jaccard Similarity under Transformation
• Edit Distance
Source :
- https://atidan.files.wordpress.com/2013/08/fuzzy-lookup-add-in-for-excel.pdf
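The first methodology in the list, Jaccard similarity over tokenized records, can be sketched as follows. This is a minimal illustration only: the function names `tokenize`, `jaccard` and `best_match` and the 0.5 threshold are assumptions made for this example, and the sketch does not reproduce the token weighting, transformations or edit distance that the Fuzzy Lookup add-in also applies.

```python
def tokenize(text):
    """Lower-case a record and split it into a set of word tokens, ignoring punctuation."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in text.lower())
    return set(cleaned.split())


def jaccard(a, b):
    """Jaccard similarity: size of the token intersection divided by size of the token union."""
    ta, tb = tokenize(a), tokenize(b)
    if not ta and not tb:
        return 1.0  # two empty records are treated as identical
    return len(ta & tb) / len(ta | tb)


def best_match(value, candidates, threshold=0.5):
    """Return (candidate, score) for the most similar candidate, or (None, score) below the threshold."""
    score, winner = max((jaccard(value, c), c) for c in candidates)
    return (winner, score) if score >= threshold else (None, score)
```

With plain token sets, "Mr. Andrew Hill" and "Hill, Mr. Andrew" are identical, whereas "Andy Hill" only shares one token with them; handling nicknames such as Andy / Andrew is what the transformation rules are for.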

Data Clustering Implementation
Consider each column one at a time, for example the column “Industry of a Company”:
1) Remove duplicates (i.e. analyze the unique occurrences), and remove inconsistencies in the data using the fuzzy technique: 348 records reduce to 40 unique values
2) Decide upon the number of bins: 11 bins in total
3) Map each row to the bins using the fuzzy mapping technique
Mapping table (columns separated by " ; " because the raw values themselves contain "|"):
Industry (default value) ; Final Industry Val. (searched value) ; Similarity ; Final 40 values ; Bin Allocation (matched with the bin attributes)
(blank) ; (blank) ; 0.0000 ; Others ; Others
Market Research|Marketing|Crowdfunding ; Market Research ; 0.4554 ; Market Research ; Marketing
Analytics|Cloud Computing|Software Development ; Software Development ; 0.5042 ; Software Development ; IT
Mobile|Analytics ; Analytics ; 0.0000 ; Analytics ; Analytics
Analytics|Marketing|Enterprise Software ; Enterprise Software ; 0.3929 ; Enterprise Software ; IT
Food & Beverages|Hospitality ; Food & Beverages ; 0.4615 ; Food & Beverages ; Hospitality / Entertainment
Analytics ; Analytics ; 1.0000 ; Analytics ; Analytics
Cloud Computing|Network / Hosting / Infrastructure ; Network / Hosting / Infrastructure ; 0.6095 ; Network / Hosting / Infrastructure ; IT
Analytics|Mobile|Marketing ; Analytics ; 0.0000 ; Analytics ; Analytics
Healthcare|Pharmaceuticals|Analytics ; Analytics ; 0.0000 ; Analytics ; Analytics
Final Bins (duplicates removal + removal of textually similar values = 40 unique values):
1 Marketing: Advertising, Market Research, Marketing, Retail
2 Operations and Strategy: Space Travel, Transportation, Travel
3 Analytics: Analytics, Deals
4 IT: CleanTech, Cloud Computing, E-Commerce, Email, Enterprise Software, Gaming, Network / Hosting / Infrastructure, Real Estate, Search, Software Development, Telecommunications, Pharmaceuticals
5 Mobile: Mobile
6 Social: Media, Social Networking
7 Finance and Risk: Finance, Crowdfunding, Insurance
8 Others: Career / Job Search, Classifieds, Education, Healthcare, Publishing
9 Govt: Energy, Security, Government
10 HR: Human Resources (HR)
11 Hospitality / Entertainment: Entertainment, Food & Beverages, Music, Hospitality

Data Clustering
Clustering Nominal Data using Cross Table aka. Bertin Matrix Visualization
A Bertin Matrix, or Cross Table (Pivot Chart), allows rearrangements that transform an initial matrix into a more homogeneous structure. The rearrangements are row and column permutations and groupings.
Cross table of frequency of occurrence:
         RECTANGLE   TRIANGLE   CIRCLE
GREEN        1           0         1
RED          1           1         1
BLUE         1           3         2
Each cell corresponds to one combined category:
         RECTANGLE              TRIANGLE              CIRCLE
GREEN    “Green – Rectangle”    “Green – Triangle”    “Green – Circle”
RED      “Red – Rectangle”      “Red – Triangle”      “Red – Circle”
BLUE     “Blue – Rectangle”     “Blue – Triangle”     “Blue – Circle”
Source :
- http://www.aviz.fr/wiki/uploads/Bertifier/bertifier-authorversion.pdf
- http://bertin.r-forge.r-project.org/bertinR.pdf
- https://books.google.com/books?id=2Q1qCQAAQBAJ&pg=PA398&lpg=PA398&dq=clustering+nominal+data+using+cross+table&source=bl&ots=mzNGnnPu6H&sig=axEgHoiUmntfXwlwMfqIbMls05A&hl=en&sa=X&ved=0ahUKEwjji-uZkczJAhUQ2WMKHbYXB4gQ6AEIMzAE#v=onepage&q=clustering%20nominal%20data%20using%20cross%20table&f=false

1) Make a cross-table for 2 columns
2) Take the intersection values of cross-table 1 as rows and the remaining column as columns to form another cross-table
3) Map the categorical values resulting from the final cross-table to the dataset, giving a single column (reduction from 3 columns to 1)
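The cross-table steps can be sketched in a few lines of standard-library Python. The helper names `cross_table` and `combine_columns`, and the colour / shape records, are illustrative assumptions that mirror the toy example on the previous slide; the deck itself built its cross tables as pivot charts.

```python
from collections import Counter


def cross_table(rows, col_a, col_b):
    """Count how often each (col_a value, col_b value) pair occurs."""
    return Counter((r[col_a], r[col_b]) for r in rows)


def combine_columns(rows, col_a, col_b, new_col):
    """Replace two nominal columns with a single combined category per row."""
    combined_rows = []
    for r in rows:
        merged = {k: v for k, v in r.items() if k not in (col_a, col_b)}
        merged[new_col] = f"{r[col_a]} - {r[col_b]}"
        combined_rows.append(merged)
    return combined_rows


# Hypothetical records, for illustration only.
records = [
    {"color": "Green", "shape": "Rectangle"},
    {"color": "Blue", "shape": "Triangle"},
    {"color": "Blue", "shape": "Triangle"},
    {"color": "Red", "shape": "Circle"},
]
table = cross_table(records, "color", "shape")
combined = combine_columns(records, "color", "shape", "color_shape")
```

Applying `combine_columns` a second time, with the combined column and the remaining third column, gives the reduction from three columns to one.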
“Industry of Company”
1 2 3 4 5 6 7 8
Marketing Operations and Strategy Analytics IT Mobile Social Finance and Risk Others
targeted marketing Solution providing Web Analytics Research mobile app social advertising Risk service
Sales Strategy analytic Computing application social news Inventory management security
consumer behaviour Social Media optimization intellectual property analysis Technology PERSONAL APPS social media marketing PAYMENT Recommendation
retail Optimization data visualization Bug fix IPHONE APPS social commerce finance Energy saving
consumer web Travel Planning Social media analytics Data Integration mobile app development social branding revenue maximization entertainment
APP REVENUE reporting PHONE INTELLIGENCE malware protection Location based service SOCIAL MEDIA CAMPAIGN enterprise Merchandising
Customer Retention DASHBOARDS Music intelligece Database Management app Social Media billing News
customer engagement MAIL REPORTS SOCIAL TV ANALYTICS Data Collection Data driven applications social network localized behaviour
CRM NETWORK OPTIMIZATION big data analytics e-learning PUBLISHING
advertising TARGETING OPTIMIZE customer analytics software service global
PRICING management analytics crowdsourcing software development PRIVACY
Targeting Information management web METRICS
writing blog Music
curated web Production
Tool Development customer service
games Community Betterment
Search Engine
VIDEO STREAMING
networking
wireless
online music
cloud computing
Server Design
Search Engine
ecommerce
“Focus Function of Company”
1 2 3 4 5 6 7 8
Marketing Intelligence Platform Human insight at machine scale Event Data Analytics API Video distribution The Location-Based Marketing Platform Social Media Analytics and Reporting The financial terminal of the web. Healthcare Data
Marketing intelligence solutions Business Logic Abuse Fraud Protection The most advanced analytics for mobile Engagement Engine In-Store Mobile Commerce Social Media ROI Measurment Simple Inventory Management for Square Health Care Analytics
Smart Suggestions for Sales Reps Business Dashboards big data for foodservice Internet Company Mobile shopping lead generator The Social Media Customer Care Tool Social Payments Shaking Up Publishing
Connected data for marketers & ecommerce Customer Experience Platform Big data for clinical insight Real-time error tracking Consumer Data Made Easy by App Advanced Twitter Management Content Valuation Platform Healthcare transformation.
Customer Data-Powered Marketing Intelligent social media dashboard Business Analytics Secure NoSQL Database Mobile Advertising Technology The Twitter of food. Delivering Return on Social Competitions for startups
Enterprise Marketing Intelligence SaaS Business Status Dashboard Analytics for the Music Industry SaaS Job Marketing Platform Mobile App Analytics & Marketing Enterprise social network In-game Payment Solutions A place for people to talk about the tv
Local Advertising promotions optimization for ecommerce Advanced predictive analytics energy efficiency data platform Social mobile Social Media Marketing & Technology Peer-to-Peer Student Loans CRE research made simple
Know and Grow Your Audience Big Data Analytics Software-as-a-Service (SaaS) platform SMS /Online Reminders Social media performance measurement Billing for web hosts Changing how people save energy
Mobile Audience Targeting Social Business Analytics Software Company Mobile BI Platform Social Data Platform Peer-to-Peer Lending
Easy and Powerful Marketing Solutions Smart Data for Better Places Social Influence/Authority platform Twitter Monitoring App Social Media Opinion's Movement
info exchange for physician interactions White Label Web Analytics technology research iOS and Android Crash Reporting Social media gamification Making Cities Easier to Love
Marketing Decisions Platform Investing tools made simple Specialists in Internet TV mobile development tools Enterprise Ready Social Media Monitoring Save Time
Verified B2B Contacts Semantic Automation and Storytelling Real-time discounts platform Consumer Location Analytics music social networking Call 3.0 Company
Marketplace for quality tutoring. Real-time in-store analytics A market research technology firm. 360° mobile analytics Entertainment Based Social Networking Services on your terms.
Social media marketing web analytics api New Generation of NoSQL mobile app data provider Social Network Applications Helping people share
Acquire and retain valuable customers Real-time Media Analytics Cloud computing Order food and drink from your mobile. Social transparent
The Marketing Suite for the Visual Web Analytics online real estate broker Mobile Payment Services Leveraging Emotions
Customer Experience Management Platform Predictive Analytics Video Optimization Email powered applications science. semantic. simple. sisu.
Doubleclick for Market Research Data visualization system API for Social data context Media Relations
“Short Description of Company Profile”
Industry: 40 unique values mapped to 11 categories.
Therefore, a total of 11 * 8 * 8 = 704 combined categories (11 industry bins, 8 function bins and 8 profile bins).

Cross table 1: Industry (rows) x Functions of Company (columns)
Function bins: 1 Marketing, 2 Operations and Strategy, 3 Analytics, 4 IT, 5 Mobile, 6 Social, 7 Finance and Risk, 8 Others
Industry                          1    2    3    4    5    6    7    8
1  Marketing                     11   12   13   14   15   16   17   18
2  Operations and Strategy       21   22   23   24   25   26   27   28
3  Analytics                     31   32   33   34   35   36   37   38
4  IT                            41   42   43   44   45   46   47   48
5  Mobile                        51   52   53   54   55   56   57   58
6  Social                        61   62   63   64   65   66   67   68
7  Finance and Risk              71   72   73   74   75   76   77   78
8  Others                        81   82   83   84   85   86   87   88
9  Govt                          91   92   93   94   95   96   97   98
10 HR                           101  102  103  104  105  106  107  108
11 Hospitality / Entertainment  111  112  113  114  115  116  117  118
Cross table 2: Industry-Functions combination (rows) x Profile (columns 1 to 8)
Code, Industry - Functions, then the combined values for Profile bins 1 2 3 4 5 6 7 8:
11 Marketing - Marketing 111 112 113 114 115 116 117 118
21 Operations and Strategy - Marketing 211 212 213 214 215 216 217 218
31 Analytics - Marketing 311 312 313 314 315 316 317 318
41 IT - Marketing 411 412 413 414 415 416 417 418
51 Mobile - Marketing 511 512 513 514 515 516 517 518
61 Social - Marketing 611 612 613 614 615 616 617 618
71 Finance and Risk - Marketing 711 712 713 714 715 716 717 718
81 Others - Marketing 811 812 813 814 815 816 817 818
91 Govt - Marketing 911 912 913 914 915 916 917 918
101 HR - Marketing 1011 1012 1013 1014 1015 1016 1017 1018
111 Hospitality / Entertainment - Marketing 1111 1112 1113 1114 1115 1116 1117 1118
12 Marketing - Operations and Strategy 121 122 123 124 125 126 127 128
22 Operations and Strategy - Operations and Strategy 221 222 223 224 225 226 227 228
32 Analytics - Operations and Strategy 321 322 323 324 325 326 327 328
42 IT - Operations and Strategy 421 422 423 424 425 426 427 428
52 Mobile - Operations and Strategy 521 522 523 524 525 526 527 528
62 Social - Operations and Strategy 621 622 623 624 625 626 627 628
72 Finance and Risk - Operations and Strategy 721 722 723 724 725 726 727 728
82 Others - Operations and Strategy 821 822 823 824 825 826 827 828
92 Govt - Operations and Strategy 921 922 923 924 925 926 927 928
102 HR - Operations and Strategy 1021 1022 1023 1024 1025 1026 1027 1028
112 Hospitality / Entertainment - Operations and Strategy 1121 1122 1123 1124 1125 1126 1127 1128
13 Marketing - Analytics 131 132 133 134 135 136 137 138
23 Operations and Strategy - Analytics 231 232 233 234 235 236 237 238
33 Analytics - Analytics 331 332 333 334 335 336 337 338
43 IT - Analytics 431 432 433 434 435 436 437 438
53 Mobile - Analytics 531 532 533 534 535 536 537 538
63 Social - Analytics 631 632 633 634 635 636 637 638
73 Finance and Risk - Analytics 731 732 733 734 735 736 737 738
83 Others - Analytics 831 832 833 834 835 836 837 838
93 Govt - Analytics 931 932 933 934 935 936 937 938
103 HR - Analytics 1031 1032 1033 1034 1035 1036 1037 1038
113 Hospitality / Entertainment - Analytics 1131 1132 1133 1134 1135 1136 1137 1138
14 Marketing - IT 141 142 143 144 145 146 147 148
24 Operations and Strategy - IT 241 242 243 244 245 246 247 248
34 Analytics - IT 341 342 343 344 345 346 347 348
44 IT - IT 441 442 443 444 445 446 447 448
54 Mobile - IT 541 542 543 544 545 546 547 548
64 Social - IT 641 642 643 644 645 646 647 648
74 Finance and Risk - IT 741 742 743 744 745 746 747 748
84 Others - IT 841 842 843 844 845 846 847 848
94 Govt - IT 941 942 943 944 945 946 947 948
104 HR - IT 1041 1042 1043 1044 1045 1046 1047 1048
114 Hospitality / Entertainment - IT 1141 1142 1143 1144 1145 1146 1147 1148
15 Marketing - Mobile 151 152 153 154 155 156 157 158
25 Operations and Strategy - Mobile 251 252 253 254 255 256 257 258
35 Analytics - Mobile 351 352 353 354 355 356 357 358
45 IT - Mobile 451 452 453 454 455 456 457 458
55 Mobile - Mobile 551 552 553 554 555 556 557 558
65 Social - Mobile 651 652 653 654 655 656 657 658
75 Finance and Risk - Mobile 751 752 753 754 755 756 757 758
85 Others - Mobile 851 852 853 854 855 856 857 858
95 Govt - Mobile 951 952 953 954 955 956 957 958
105 HR - Mobile 1051 1052 1053 1054 1055 1056 1057 1058
115 Hospitality / Entertainment - Mobile 1151 1152 1153 1154 1155 1156 1157 1158
16 Marketing - Social 161 162 163 164 165 166 167 168
26 Operations and Strategy - Social 261 262 263 264 265 266 267 268
36 Analytics - Social 361 362 363 364 365 366 367 368
46 IT - Social 461 462 463 464 465 466 467 468
56 Mobile - Social 561 562 563 564 565 566 567 568
66 Social - Social 661 662 663 664 665 666 667 668
76 Finance and Risk - Social 761 762 763 764 765 766 767 768
86 Others - Social 861 862 863 864 865 866 867 868
96 Govt - Social 961 962 963 964 965 966 967 968
106 HR - Social 1061 1062 1063 1064 1065 1066 1067 1068
116 Hospitality / Entertainment - Social 1161 1162 1163 1164 1165 1166 1167 1168
17 Marketing - Finance and Risk 171 172 173 174 175 176 177 178
27 Operations and Strategy - Finance and Risk 271 272 273 274 275 276 277 278
37 Analytics - Finance and Risk 371 372 373 374 375 376 377 378
47 IT - Finance and Risk 471 472 473 474 475 476 477 478
57 Mobile - Finance and Risk 571 572 573 574 575 576 577 578
67 Social - Finance and Risk 671 672 673 674 675 676 677 678
77 Finance and Risk - Finance and Risk 771 772 773 774 775 776 777 778
87 Others - Finance and Risk 871 872 873 874 875 876 877 878
97 Govt - Finance and Risk 971 972 973 974 975 976 977 978
107 HR - Finance and Risk 1071 1072 1073 1074 1075 1076 1077 1078
117 Hospitality / Entertainment - Finance and Risk 1171 1172 1173 1174 1175 1176 1177 1178
18 Marketing - Others 181 182 183 184 185 186 187 188
28 Operations and Strategy - Others 281 282 283 284 285 286 287 288
38 Analytics - Others 381 382 383 384 385 386 387 388
48 IT - Others 481 482 483 484 485 486 487 488
58 Mobile - Others 581 582 583 584 585 586 587 588
68 Social - Others 681 682 683 684 685 686 687 688
78 Finance and Risk - Others 781 782 783 784 785 786 787 788
88 Others - Others 881 882 883 884 885 886 887 888
98 Govt - Others 981 982 983 984 985 986 987 988
108 HR - Others 1081 1082 1083 1084 1085 1086 1087 1088
118 Hospitality / Entertainment - Others 1181 1182 1183 1184 1185 1186 1187 1188
88 Unique Categorical Values for combination of two Columns
704 Unique Categorical Values for combination of three Columns
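The numbering scheme in the two cross tables amounts to concatenating the three bin numbers: industry (1 to 11), then function (1 to 8), then profile (1 to 8). A minimal sketch follows; the function name `combined_code` is hypothetical, and it assumes that the function and profile columns share the first eight bin names, which is what the deck's worked examples suggest.

```python
INDUSTRY_BINS = [
    "Marketing", "Operations and Strategy", "Analytics", "IT", "Mobile", "Social",
    "Finance and Risk", "Others", "Govt", "HR", "Hospitality / Entertainment",
]
# Assumption: function and profile bins reuse the first 8 industry bin names.
SHARED_BINS = INDUSTRY_BINS[:8]


def combined_code(industry, function, profile):
    """Build the combined categorical value: industry digit(s), function digit, profile digit."""
    i = INDUSTRY_BINS.index(industry) + 1   # 1..11
    f = SHARED_BINS.index(function) + 1     # 1..8
    p = SHARED_BINS.index(profile) + 1      # 1..8
    return (i * 10 + f) * 10 + p


all_codes = {
    combined_code(i, f, p)
    for i in INDUSTRY_BINS for f in SHARED_BINS for p in SHARED_BINS
}
```

Because the function and profile numbers are always single digits, the concatenation is unambiguous, and the set of all possible codes has exactly 11 * 8 * 8 = 704 members.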
Worked examples (columns separated by " ; " because the raw values themselves contain "|"):
Short Description of company profile ; Bucket for Profile ; Industry of company ; Bucket for Industry ; Focus functions of company ; Bucket for Functions ; Categorical Value for Industry-Functions
Video distribution ; IT ; (blank) ; Others ; operation ; Operations and Strategy ; 824
(blank) ; Others ; Market Research|Marketing|Crowdfunding ; Marketing ; Marketing, sales ; Marketing ; 118
Event Data Analytics API ; Analytics ; Analytics|Cloud Computing|Software Development ; IT ; operations ; Operations and Strategy ; 423
The most advanced analytics for mobile ; Analytics ; Mobile|Analytics ; Analytics ; Marketing & Sales ; Marketing ; 313
The Location-Based Marketing Platform ; Mobile ; Analytics|Marketing|Enterprise Software ; IT ; Marketing & Sales ; Marketing ; 415
big data for foodservice ; Analytics ; Food & Beverages|Hospitality ; Hospitality / Entertainment ; analytics ; Analytics ; 1133
(blank) ; Others ; Analytics ; Analytics ; Research ; IT ; 348
A total of 143 out of the 704 categorical values were actually mapped in the original dataset.

Data Clustering
K Means Clustering along with Ranking
Starting from the data set:
1) The user chooses the number of clusters, K, into which the data is to be clustered
2) K random centroids are chosen from the dataset
3) Each record is assigned to its closest cluster (the one giving the lowest SSE)
4) The centroid of each cluster is re-computed
5) Steps 3 and 4 are repeated until the centroids no longer change
Source :
- https://www.youtube.com/watch?v=u1NtKPuXQKo
- http://sci2s.ugr.es/keel/pdf/specific/congreso/brazdil00comparison.pdf
The result is a data set in which each record's cluster is specified by its centroid.
A ranking algorithm then assigns a rank to each centroid, in either ascending or descending order:
- Deciles, i.e. 10 clusters
- Quintiles, i.e. 5 clusters
The outcome is segmented data with proper deciles, quintiles etc.
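The clustering-then-ranking procedure can be sketched for a single numeric column as follows. This is an illustrative standard-library implementation under stated assumptions, not the tool used in the project: the names `assign`, `kmeans_1d` and `rank_clusters` are hypothetical, ranks are assigned in ascending order of centroid, and K must not exceed the number of distinct values.

```python
import random


def assign(values, centroids):
    """Label each value with the index of its nearest centroid (smallest squared error)."""
    return [min(range(len(centroids)), key=lambda c: (v - centroids[c]) ** 2) for v in values]


def kmeans_1d(values, k, seed=0, max_iter=100):
    """K-means on one column: random initial centroids, then assign / re-compute until stable."""
    rng = random.Random(seed)
    centroids = rng.sample(sorted(set(values)), k)  # k random distinct data points
    for _ in range(max_iter):
        labels = assign(values, centroids)
        updated = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            # An empty cluster keeps its previous centroid.
            updated.append(sum(members) / len(members) if members else centroids[c])
        if updated == centroids:
            break
        centroids = updated
    return assign(values, centroids), centroids


def rank_clusters(labels, centroids):
    """Rank clusters by ascending centroid (1 = lowest) and return one rank per record."""
    order = sorted(range(len(centroids)), key=lambda c: centroids[c])
    rank_of = {cluster: rank + 1 for rank, cluster in enumerate(order)}
    return [rank_of[lab] for lab in labels]
```

Calling `kmeans_1d(column, k=10)` followed by `rank_clusters` would yield decile-style segments and `k=5` quintile-style segments; because the start is random, different seeds can give different segment boundaries.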

Columns (in order): Percent_skill_Entrepreneurship, Percent_skill_Operations, Percent_skill_Engineering, Percent_skill_Marketing, Percent_skill_Leadership, Percent_skill_Data Science, Percent_skill_Business Strategy, Percent_skill_Product Management, Percent_skill_Sales, Percent_skill_Domain, Percent_skill_Law, Percent_skill_Consulting, Percent_skill_Finance, Percent_skill_Investment, Renown score
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
15.88235294 11.76470588 15 12.94117647 0 8.823529412 21.76470588 10.88235294 2.941176471 0 0 0 0 0 8
9.401709402 0 57.47863248 0 0 3.846153846 17.09401709 9.401709402 0 2.777777778 0 0 0 0 9
0 0 0 0 0 0 0 0 0 0 0 0 0 0 5
0 0 0 0 0 0 0 0 0 0 0 0 0 0 6
6.25 0 3.125 15.625 9.375 3.125 6.25 3.125 3.125 0 0 0 0 0 6
0 0 66.66666667 5.555555556 0 22.22222222 0 0 0 5.555555556 0 0 0 0 0
0 0 100 0 0 0 0 0 0 0 0 0 0 0 2
8.333333333 0 46.73202614 5.718954248 8.333333333 0 19.77124183 2.777777778 2.777777778 0 0 0 0 5.555555556 5
8.333333333 0 27.08333333 19.79166667 0 23.95833333 0 0 0 20.83333333 0 0 0 0 4
3.846153846 0 26.92307692 0 3.846153846 3.846153846 7.692307692 0 3.846153846 0 0 0 0 0 6
27.27272727 0 18.18181818 0 9.090909091 0 36.36363636 9.090909091 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
8.333333333 0 50 16.66666667 0 12.5 4.166666667 4.166666667 4.166666667 0 0 0 0 0 0
13.33333333 0 6.666666667 60 0 13.33333333 6.666666667 0 0 0 0 0 0 0 1
11.11111111 5.555555556 5.555555556 0 11.11111111 11.11111111 27.77777778 0 11.11111111 5.555555556 0 0 5.555555556 5.555555556 5
8.333333333 0 58.33333333 0 0 0 25 0 0 8.333333333 0 0 0 0 2
20 0 20 0 40 0 20 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 5
0 0 0 0 0 0 0 0 0 0 0 0 0 0 8
5.555555556 0 27.77777778 11.11111111 0 16.66666667 27.77777778 5.555555556 0 5.555555556 0 0 0 0 7
On average, 61 records per column, most of them in the “Failure” category, had Blank / “No Info” values.
Therefore, case-wise deletion was not a good option; we had to impute the data.

Data Imputation
Source :
- Missing Value Imputation using Refined Mean Substitution - http://ijcsi.org/papers/IJCSI-9-4-3-306-313.pdf
- http://scs.math.yorku.ca/images/6/6d/Enders_jofschoolpsyc.pdf
- http://www4.ncsu.edu/~pollock/pdfs/Lecture%20ST%20432%20Weighting,%20Imputation%20and%20Variances.pdf
- http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.4863&rep=rep1&type=pdf (Page No. 6)
- http://www.csos.jhu.edu/contact/staff/jwayman_pub/wayman_multimp_aera2003.pdf
- http://www.stefvanbuuren.nl/mi/MI.html
- http://www.stefvanbuuren.nl/publications/MICE%20V1.0%20Manual%20TNO00038%202000.pdf
- http://arxiv.org/ftp/arxiv/papers/0704/0704.3474.pdf
- http://sci2s.ugr.es/keel/pdf/specific/articulo/2011-silva-mlp-imputation-NN.pdf
Missing values can occur in datasets in several forms:
• Missing values occur in several attributes (columns) [MAR]
• Missing values occur in a number of instances (rows) [MAR]
• Missing values occur randomly in attributes and instances [MNAR]
Class Mean Imputation
• Respondents (identifiers) are divided into classes
• The cell mean for a particular class is used for all missing values in that class
• This method can be biased: it overestimates the correlation and underestimates the variability of the data
• Modified version: the Stochastic Regression Method, in which a random error term is added to each predicted score
Multivariate Imputation by Chained Equations
• Missing values are predicted using existing values
• The predicted values, called “imputes”, are substituted for the missing values, resulting in a full data set, the “imputed data set”
• Performed multiple times using Bayesian Linear Regression, Predictive Mean Matching, Unconditional Mean Imputation, Logistic Regression (polytomous, >= 2 categories) and LDA
• Flow: Incomplete Data → Imputed Data → Analysis Results → Pooled Results
Simulation of Neural Network Based Imputation
• The neural network method uses an auto-associative neural network to approximate missing data
• We tried to simulate the basic structure of a neural network, essentially its ability to learn certain linear and non-linear inter-relationships in the input space
• We also tried to simulate the functionality of an auto-encoder, which projects the input onto a smaller set by squashing it into a more compact representation

Data Imputation Implementation
1) Consider the last 15 columns one at a time and impute the values using Class Mean Imputation, treating the combination of “Company Category” and “Target Value” as the class
2) After imputation, use K-Means to compute the clusters and rank them accordingly
3) Repeat this for all 15 columns and sum the ranks for each row. The highest possible overall score is 150. Evaluate whether this produces randomness in the data
Column key: (1) Company_Name; (2) Dependent-Company Status; (3) Categorical Value for Industry-Functions; (4) Percent_skill_Entrepreneurship, original column; (5) Percent_skill_Entrepreneurship, imputation-inclusive column; (6) Percent_skill_Entrepreneurship, ranked / segmented into 10 bins; (7) Overall Score, the sum of all 15 ranks (maximum 150)
Company1 Success 824 0 0 0 0
Company2 Success 118 15.88235294 15.88235294 9 63
Company3 Success 423 9.401709402 9.401709402 6 46
Company6 Success 1133 6.25 6.25 4 37
Company9 Success 314 8.333333333 8.333333333 5 51
Company10 Success 343 8.333333333 8.333333333 5 42
Company11 Success 413 3.846153846 3.846153846 2 30
Company12 Success 828 27.27272727 27.27272727 10 42
Company376 Success 848 No Info 9.322638145 6 63
Company413 Failed 888 No Info 5.664488017 3 49
Example for the last row: the imputed value (Col. 5) is the average of “%_Skill_Entr.” (Col. 4) over all rows whose:
1) Dependant_Comp._Status (Col. 2) = “Failed”; AND
2) Categorical_Val. (Col. 3) = “888”; AND
3) %_Skill_Entr. (Col. 4) is not equal to “No Info”
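The class mean rule above can be sketched as a small function. The row layout (a list of dictionaries), the column names and the name `class_mean_impute` are assumptions made for illustration; the class is the combination of company status and categorical value, as described.

```python
from collections import defaultdict

MISSING = "No Info"


def class_mean_impute(rows, value_col, class_cols):
    """Fill missing values with the mean of the observed values in the same class."""
    totals = defaultdict(lambda: [0.0, 0])  # class key -> [sum, count]
    for r in rows:
        if r[value_col] != MISSING:
            key = tuple(r[c] for c in class_cols)
            totals[key][0] += r[value_col]
            totals[key][1] += 1
    imputed = []
    for r in rows:
        filled = dict(r)
        key = tuple(r[c] for c in class_cols)
        # A class with no observed value at all is left as "No Info".
        if r[value_col] == MISSING and key in totals:
            total, count = totals[key]
            filled[value_col] = total / count
        imputed.append(filled)
    return imputed
```

Rows whose class has no observed value stay missing, which is consistent with the later slides where a few records remain blank after imputation and are deleted.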
Similarly, the column “Internet Activity Score” was first imputed on similar grounds and then segmented.
Segment = Ceiling( (sum of all the numbers in Column 5 which are <= the Column 5 number) / (total sum of Column 5) * K ), where K is the number of bins
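One possible reading of the segmentation formula is a cumulative-share rank: the sum of all values up to and including the current one, as a fraction of the column total, scaled by K and rounded up. The sketch below implements that reading only; the name `cumulative_share_rank` is hypothetical and the interpretation itself is an assumption, since the slide's formula is ambiguous.

```python
import math


def cumulative_share_rank(values, k=10):
    """Rank = ceil(K * (sum of values <= v) / (sum of all values)); 0 when the share is 0."""
    total = sum(values)
    if total == 0:
        return [0 for _ in values]
    ranks = []
    for v in values:
        cumulative = sum(x for x in values if x <= v)
        ranks.append(math.ceil(cumulative * k / total))
    return ranks
```

Under this reading a value of 0 gets rank 0 and the largest value gets rank K, which matches the pattern of ranks 0 and 10 in the table above.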

Data Imputation Implementation
To calculate missing values in “Age of the Company”, first confirm the reference end date:
1) Take the smallest age value and look at the corresponding established dates. These varied from Jan '13 to Jul '13; hence the reference date is somewhere between Jan and Jul 2014
2) Take all the values of the columns “Age of the Company” and “Estb. Date” that are not NULL
3) Take the latest date in the “Last Funding Date” column and assume it to be the reference date
4) Verify that this assumption makes sense
Columns: Age ; Est. Founding Date ; Last Funding Date ; Date_Assumption ; Diff. (Assumption - Est.) in years
5 6/20/2009 5/10/2012 4/8/2014 4.80274
4 4/1/2010 12/11/2013 4/8/2014 4.021918
4 5/1/2010 9/17/2013 4/8/2014 3.939726
3 1/1/2011 9/3/2013 4/8/2014 3.268493
4 1/1/2010 11/8/2012 4/8/2014 4.268493
3 1/1/2011 2/26/2014 4/8/2014 3.268493
1 5/16/2013 10/24/2013 4/8/2014 0.89589
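The “Diff.” column in the table is consistent with the number of days between the established date and the assumed reference date, divided by 365. A short sketch of that check follows; the function name `age_in_years` is hypothetical, and the divisor of 365 is inferred from the table values rather than stated in the deck.

```python
from datetime import date

REFERENCE_DATE = date(2014, 4, 8)  # the Date_Assumption column in the table


def age_in_years(established, reference=REFERENCE_DATE):
    """Age as elapsed days divided by 365 (divisor inferred from the table's Diff. column)."""
    return (reference - established).days / 365


first_row = age_in_years(date(2009, 6, 20))
last_row = age_in_years(date(2013, 5, 16))
```

Rounding these differences to the nearest whole year reproduces the recorded “Age” values for the rows shown, which is the sense in which the assumed reference date is verified.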
Columns: Company_Name ; Dependent-Company Status ; Age of company in years ; Age of company in years #1 ; Age of company in years #2 ; Age of company in years #3
Company1 Success No Info 1 1 1
Company2 Success 3 3 3 3
Company153 Failed No Info 6 6
Company453 Failed 7
Just as a neural network has different hidden layers that perform the same mathematical function differently, perform the Class Mean Imputation method for the different logical associations on which “Age of the Company” might depend. Repeat this until 90% of the data has been imputed.
Logical associations used as classes:
• # of Advisors, Success / Failure Target Value, Industry Category
• Internet Activity Score Segment, Success / Failure Target Value, Industry Category
• Established Date
Blanks remaining: 59 → 34 (25 imputed) → 9 (25 imputed) → 8 (1 imputed)
8 records were deleted: # “Success” = 6, # “Failure” = 2. Therefore, there is still no class imbalance.
Similarly, the column “Last Funding Amount” was imputed, and a total of 27 records were deleted, of which 20 were “Success” and 7 were “Failure”.
Therefore, total records deleted = 27 + 8 = 35, leaving 437 records: # Success = 279 and # Failure = 158

Dimensionality Reduction
Dimension reduction is the mapping of data to a lower-dimensional space such that uninformative variance in the data is discarded, or such that a subspace in which the data lives is detected.
How do we take a picture that captures the most information about the rectangle?


WHY THIS POSITION? BECAUSE IT PROVIDES THE MOST VISUAL INFORMATION!
[Figure labels: first longest axis; second longest axis, found while fixing the first longest axis]
PCA Understanding
• Rotate the object around its center to find the best orientation
• First find the axis along which the object has, on average, the largest extent
• Rotate the object around the first axis to find the axis that is perpendicular to the first axis and along which the object has, on average, the largest extent
• The two axes found are the first and second principal components
• The PCA algorithm helps us find those components
• We decompose the data set into eigenvectors and their corresponding eigenvalues; they come in pairs
• An eigenvector is the direction of an axis / line (vertical, horizontal, 45 degrees etc.), and the eigenvalue is a number telling us how spread out the data is along that line
• The eigenvector with the highest eigenvalue is therefore the first principal component
Source :
- https://www.youtube.com/watch?v=BfTMmoDFXyE
- https://georgemdallas.wordpress.com/2013/10/30/principal-component-analysis-4-dummies-eigenvectors-eigenvalues-and-dimension-reduction/
- http://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues
- http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf
- https://www.youtube.com/watch?v=7BUHYpNVT5E&list=LLeNKG4d3dB8SEg1gw91dmig&index=6
- https://nutsandboltsspeedtraining.com/spicypresentations/rotating-3d-shapes-with-powerpoint-animations/
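The eigenvector / eigenvalue idea can be made concrete for two-dimensional data, where the covariance matrix is 2 x 2 and its eigen-decomposition has a closed form. The sketch below is illustrative only: the name `pca_2d` is hypothetical, it uses the sample covariance, and it is not the Statgraphics procedure used in the project.

```python
import math


def pca_2d(points):
    """Return [(eigenvalue, unit eigenvector), ...] of the 2x2 sample covariance, largest first."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    # Sample covariance matrix [[a, b], [b, c]].
    a = sum((p[0] - mean_x) ** 2 for p in points) / (n - 1)
    c = sum((p[1] - mean_y) ** 2 for p in points) / (n - 1)
    b = sum((p[0] - mean_x) * (p[1] - mean_y) for p in points) / (n - 1)
    if abs(b) < 1e-12:
        # Already axis-aligned: the axes themselves are the components.
        pairs = [(a, (1.0, 0.0)), (c, (0.0, 1.0))]
        return sorted(pairs, key=lambda pair: pair[0], reverse=True)
    mid = (a + c) / 2
    half_gap = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    components = []
    for eigenvalue in (mid + half_gap, mid - half_gap):
        vx, vy = b, eigenvalue - a  # solves (covariance - eigenvalue * I) v = 0
        norm = math.hypot(vx, vy)
        components.append((eigenvalue, (vx / norm, vy / norm)))
    return components
```

For points lying exactly on the line y = x, the first component points along the 45-degree direction and the second eigenvalue is zero, meaning the data has no spread perpendicular to that line.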

Dimensionality Reduction Implementation
1) Choose columns which seem to be correlated but whose relation is not identifiable mathematically
2) Look at the number of missing values
3) Run the Multiple Imputation method on them
4) After running Multiple Imputation, take the 10 different imputed data sets
5) Take the average of each cell across the 10 tables
6) Plug the resulting data into Statgraphics Centurion and run PCA
7) Look at the eigenvalue graph (scree plot) to get an idea of how many components explain 85% – 90% of the data set
8) For each component, multiply the covariance matrix by the component weights to get the data values according to their spread in that component's space
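Deciding how many components explain 85% to 90% of the data set comes down to a cumulative share of the eigenvalues. A minimal sketch follows; the function name `components_needed` is hypothetical and any eigenvalues passed to it in the usage below are illustrative, not read off the project's scree plot.

```python
def components_needed(eigenvalues, threshold=0.85):
    """Smallest number of components whose eigenvalues reach the given share of the total."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for count, eigenvalue in enumerate(ordered, start=1):
        running += eigenvalue
        if running / total >= threshold:
            return count
    return len(ordered)
```

For example, with hypothetical eigenvalues 1.7, 1.6, 0.5 and 0.2 the first two components cover 82.5% of the total, so `components_needed` returns 3 at the 85% threshold and 2 at an 80% threshold.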
Columns: ID ; Employee Count ; Employees count MoM change ; Has the team size grown ; Team size all employees (rows with only four values have a blank MoM change)
1 3 0 -1 15
2 17 -1 20
3 14 0 -1 10
4 45 10 -1 50
5 39 3 -1 40
6 14 8 -1 14
7 7 0 -1 15
8 29 -12 -1 40
9 16 45 -1 50
10 3 -1 3
11 34 0 -1 50
[Figure: Scree Plot of Eigenvalue (y-axis, 0 to 1.8) against Component (x-axis, 0 to 4)]

Feature Selection
Feature Selection refers to the selection of the attributes in the data set that are most relevant to the predictive modeling.
REJECTED LIST
• 2 date columns and 1 year column: the variables associated with them had already been included in the final feature list, so adding them would be redundant
• ~10 columns for top management and > 40 columns for team members: a few have been included, whereas others with binary attributes have been omitted. Most of the columns were binary with many missing values, which were difficult to impute
• 4 investor portfolio columns and 1 funding received column: the funding information has been included; however, the seed funder and investor details are not included because of their 319 unique values, which would not lead to any information gain
ACCEPTED LIST
• Identifier
• Target Variable
• 2 PCA components
• Age of the Company
• Internet Activity Detail
• Funding Received Information
• # Co-Founders and Investors

Model Building
Ignored, since Rattle's Random Forest can handle categorical variables with at most 32 levels
Confusion Matrix

Testing And Evaluation
[Figures: results for the validation set and for the testing set]
RANDOM FOREST
Class ; True Positive ; True Negative ; False Positive ; False Negative
Failed ; 68.42% ; 91.67% ; 8.33% ; 31.58%
Success ; 91.67% ; 68.42% ; 31.58% ; 8.33%
Misclassification Rate
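The rates in the table are standard ratios over a 2 x 2 confusion matrix, and the two rows mirror each other because swapping the positive class swaps the roles of the cells. The sketch below shows the arithmetic; the function name `class_rates` is hypothetical, and the counts used in the usage lines are invented solely so that the ratios reproduce the reported percentages. The deck does not state the actual counts.

```python
def class_rates(tp, fn, fp, tn):
    """Per-class rates from confusion-matrix counts, with the given class treated as positive."""
    return {
        "true_positive": tp / (tp + fn),
        "false_negative": fn / (tp + fn),
        "true_negative": tn / (tn + fp),
        "false_positive": fp / (tn + fp),
        "misclassification": (fp + fn) / (tp + fn + fp + tn),
    }


# Hypothetical counts, chosen only to match the reported percentages.
failed = class_rates(tp=13, fn=6, fp=1, tn=11)    # "Failed" as the positive class
success = class_rates(tp=11, fn=1, fp=6, tn=13)   # "Success" as the positive class
```

The misclassification rate is the same whichever class is treated as positive, since it counts every wrong prediction once.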

Future Enhancement
• Currently, the project depends on multiple tools, e.g. Statgraphics Centurion for multivariate analysis. Statgraphics is a paid tool. Dependence on such tools can be removed by building an in-house plug-in or a library function for the requirement.
• The project relies heavily on ad-hoc analysis, so there is a high chance of omitting steps when a new dataset arrives. Each step could be automated once the overall step-wise procedure has been drafted. For automation, VBA or R programming could be a good option.
• Visualizing the data set could help in making quicker, better-informed decisions.

Thank You !!
