MIS 6324 : BUSINESS INTELLIGENCE
TERM PROJECT
ON
VIBHORE AGARWAL
University of Texas at Dallas
Agenda
Introduction – Business Understanding
Data Understanding
Data Cleaning and Preparation
Feature Selection
Model Building
Testing and Evaluation
Future Enhancement
Introduction – Business Understanding
BUSINESS UNDERSTANDING
SCOPE : Investment strategies for start-up companies are largely based on intuition and past experience. As a result, investors rely primarily on the need being addressed, the background of the founders, the size of the market, and the ability of the company to scale after early success.
AIM : To perform a rigorous analysis that identifies the relevant factors and scores prospective start-ups on their potential to be successful.
RESULT : The model / analysis will allow investors to make better-informed decisions and rely less on intuition.
Process flow: Business Understanding -> Data Understanding / Exploration -> Data Cleaning -> Data Preparation -> Feature Selection -> Model Building -> Testing and Evaluation
Client: Venture Capitalist
Data Understanding
116 columns; 472 (unique) rows / records; 1 Target Variable
• { } – 15% of the entire data
• 2 Date and 1 Year columns
• 2 Scorecard columns
• 4 Company Profile columns (tab and comma delimited)
• 4 Investors Portfolio columns and 1 Funding Received ($) column
• ~10 columns for Top Management
• > 40 columns for Team Members
• Unstructured data
• # Unique values across columns
• Spread of attributes across columns
• Min / max frequency per attribute
• No class imbalance
• Blanks “ ”: ~18 per column
• “No Info”: ~53 per column
• ~401 useful values per column
• ~362 out of 472 rows have at least 1 blank
Data Understanding
Dash-Board for Initial Analysis of Data
Data Cleaning and Data Preparation
In general, Data Cleaning and Data Preparation are data pre-processing steps that involve data filtering, aggregation, and imputation of missing values.
DATA IMPUTATION
• Class Mean Imputation (Clustering Based Missing Value Imputation)
• Simulation of Neural Network Based Imputation
• Multivariate Imputation by Chained Equation (Excel Add-in)
DATA CLUSTERING
• Using business logic to create nominal bins, and using Fuzzy Lookup to remove the redundant groups and to map the raw data to the bins / buckets
• Clustering Nominal Data using Cross Table, aka Bertin Matrix Visualization
• K-Means Clustering along with Ranking
DIMENSIONALITY REDUCTION
• Principal Component Analysis
• Logical Operations based on Business Understanding
USED IN CONJUNCTION TO PRE-PROCESS THE DATA
Situation Analysis – Types of Missing Data
Example: “Market Research|Marketing|Crowdfunding” vs. “Marketing, sales”
Tab and comma delimited data, i.e. multiple values inside a single cell
• Missing at random (MAR): given the observed data, data are missing independently of unobserved data
• Missing not at random (MNAR): missing observations are related to values of unobserved data
Data Clustering
Creation of Bins / Buckets based on Business Logic
Binning lets you quickly categorize records. When we create a bucket, we define multiple categories (buckets) that are used to group similar values.
[Figure: example buckets, chosen according to the NEED / REQUIREMENT. COLORS: GREEN, RED, BLUE. SHAPES: RECTANGLE, TRIANGLE, CIRCLE.]
Source :
- https://www.exploreanalytics.com/wiki/index.php?title=Binning
Data Clustering
Fuzzy Logic for mapping and reducing redundant groups
A challenging problem in data management is that the same entity can be represented in multiple ways throughout the dataset.
Example: “Andy Hill”, “Mr. Andrew Hill”, “Hill, Mr. Andrew”
Essentially, they all refer to the same person, but during analysis they are treated as different persons.
Applying fuzzy logic allows us to identify records which are textually similar.
These variations result mainly from:
1) Merging of independent data sources
2) Spelling mistakes
3) Inconsistent naming conventions and abbreviations
Methodologies Used :
• Jaccard Similarity
• Weighted Jaccard Similarity and Tokenization of Records
• Token Weighting
• Transformations
• Jaccard Similarity under Transformation
• Edit Distance
Source :
- https://atidan.files.wordpress.com/2013/08/fuzzy-lookup-add-in-for-excel.pdf
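As a rough illustration of the first methodology listed above, plain (unweighted) Jaccard similarity on word tokens can be sketched as follows. This is a minimal sketch, not the Fuzzy Lookup add-in itself: the function names `tokens`, `jaccard` and `best_match` are hypothetical, and because the add-in also applies token weighting and transformations, this sketch is not expected to reproduce its similarity scores.

```python
import re

def tokens(text):
    """Lower-case a record and split it into a set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B| of the two token sets."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0  # two empty records are treated as identical
    return len(ta & tb) / len(ta | tb)

def best_match(value, candidates):
    """Return the (candidate, score) pair with the highest similarity."""
    return max(((c, jaccard(value, c)) for c in candidates),
               key=lambda pair: pair[1])

names = ["Mr. Andrew Hill", "Hill, Mr. Andrew", "Andy Hill"]
print(jaccard(names[0], names[1]))  # same tokens, different order
print(jaccard(names[0], names[2]))  # only the token "hill" is shared
print(best_match("Market Research|Marketing|Crowdfunding",
                 ["Market Research", "Analytics"]))
```

Token-level Jaccard handles reordered names well but treats “Andy” and “Andrew” as unrelated, which is where transformations and edit distance come in.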
[Figure: Snap Shot]
Data Clustering Implementation
Considering each column at a time; consider the column “Industry of a Company”:
1) Remove duplicates (i.e. analyze the unique occurrences) and remove the inconsistencies in the data using the fuzzy technique: 348 records -> 40 unique records
2) Decide upon the # of bins based on business logic: 11 bins in total
3) Map each row to the bins using the fuzzy mapping technique
Industry | Final Industry Val. | Similarity | Final 40 values | Bin Allocation
(blank) | (blank) | 0.0000 | Others | Others
Market Research|Marketing|Crowdfunding | Market Research | 0.4554 | Market Research | Marketing
Analytics|Cloud Computing|Software Development | Software Development | 0.5042 | Software Development | IT
Mobile|Analytics | Analytics | 0.0000 | Analytics | Analytics
Analytics|Marketing|Enterprise Software | Enterprise Software | 0.3929 | Enterprise Software | IT
Food & Beverages|Hospitality | Food & Beverages | 0.4615 | Food & Beverages | Hospitality / Entertainment
Analytics | Analytics | 1.0000 | Analytics | Analytics
Cloud Computing|Network / Hosting / Infrastructure | Network / Hosting / Infrastructure | 0.6095 | Network / Hosting / Infrastructure | IT
Analytics|Mobile|Marketing | Analytics | 0.0000 | Analytics | Analytics
Healthcare|Pharmaceuticals|Analytics | Analytics | 0.0000 | Analytics | Analytics
1 Marketing | 2 Operations and Strategy | 3 Analytics | 4 IT | 5 Mobile | 6 Social | 7 Finance and Risk | 8 Others | 9 Govt | 10 HR | 11 Hospitality / Entertainment
Advertising | Space Travel | Analytics | CleanTech | Mobile | Media | Finance | Career / Job Search | Energy | Human Resources (HR) | Entertainment
Market Research Transportation Deals Cloud Computing Social Networking Crowdfunding Classifieds Security Food & Beverages
Marketing Travel E-Commerce Insurance Education Government Music
Retail Email Healthcare Hospitality
Enterprise Software Publishing
Gaming
Network / Hosting / Infrastructure
Real Estate
Search
Software Development
Telecommunications
Pharmaceuticals
Final Bins
Legend: Default Value; Searched Values; Matched with the Bin attributes
Duplicates Removal + Textually Similar values removal =
Data Clustering
Clustering Nominal Data using Cross Table aka. Bertin Matrix Visualization
Bertin Matrix or a Cross Table (Pivot Chart) allows rearrangements to transform an initial matrix to a more homogeneous structure. The
rearrangements are row and column permutations and groupings.
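Before any rows or columns can be rearranged, the cross table itself has to be counted. A minimal sketch of that counting step, assuming two nominal columns of equal length; the records and the helper name `cross_table` are hypothetical, and the permutation / grouping step of the Bertin approach is not shown:

```python
from collections import Counter

def cross_table(rows, cols):
    """Count how often each (row value, column value) pair occurs."""
    return Counter(zip(rows, cols))

# Hypothetical records: one colour and one shape per record.
colors = ["GREEN", "RED", "BLUE", "BLUE", "RED", "BLUE"]
shapes = ["RECTANGLE", "TRIANGLE", "TRIANGLE", "CIRCLE", "CIRCLE", "TRIANGLE"]

table = cross_table(colors, shapes)
row_labels = sorted(set(colors))
col_labels = sorted(set(shapes))

print(col_labels)
for r in row_labels:
    # A pair that never occurs is simply counted as 0.
    print(r, [table[(r, c)] for c in col_labels])
```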
Cross table (frequency of occurrence):
 | RECTANGLE | TRIANGLE | CIRCLE
GREEN | 1 | 0 | 1
RED | 1 | 1 | 1
BLUE | 1 | 3 | 2
Combined category for each cell:
 | RECTANGLE | TRIANGLE | CIRCLE
GREEN | “Green – Rectangle” | “Green – Triangle” | “Green – Circle”
RED | “Red – Rectangle” | “Red – Triangle” | “Red – Circle”
BLUE | “Blue – Rectangle” | “Blue – Triangle” | “Blue – Circle”
Source :
- http://www.aviz.fr/wiki/uploads/Bertifier/bertifier-authorversion.pdf
- http://bertin.r-forge.r-project.org/bertinR.pdf
- https://books.google.com/books?id=2Q1qCQAAQBAJ&pg=PA398&lpg=PA398&dq=clustering+nominal+data+using+cross+table&source=bl&ots=mzNGnnPu6H&sig=axEgHoiUmntfXwlwMfqIbMls05A&hl=en&sa=X&ved=0ahUKEwjji-uZkczJAhUQ2WMKHbYXB4gQ6AEIMzAE#v=onepage&q=clustering%20nominal%20data%20using%20cross%20table&f=false
1) Make a cross-table for 2 columns
2) Take the intersection values of cross-table 1 as rows and the remaining column as columns to form another cross-table
3) Map the categorical values resulting from the final cross-table to the dataset, giving way to a single column (a reduction from 3 columns to 1)
“Industry of Company”
1 2 3 4 5 6 7 8
Marketing Operations and Strategy Analytics IT Mobile Social Finance and Risk Others
targeted marketing Solution providing Web Analytics Research mobile app social advertising Risk service
Sales Strategy analytic Computing application social news Inventory management security
consumer behaviour Social Media optimization intellectual property analysis Technology PERSONAL APPS social media marketing PAYMENT Recommendation
retail Optimization data visualization Bug fix IPHONE APPS social commerce finance Energy saving
consumer web Travel Planning Social media analytics Data Integration mobile app development social branding revenue maximization entertainment
APP REVENUE reporting PHONE INTELLIGENCE malware protection Location based service SOCIAL MEDIA CAMPAIGN enterprise Merchandising
Customer Retention DASHBOARDS Music intelligece Database Management app Social Media billing News
customer engagement MAIL REPORTS SOCIAL TV ANALYTICS Data Collection Data driven applications social network localized behaviour
CRM NETWORK OPTIMIZATION big data analytics e-learning PUBLISHING
advertising TARGETING OPTIMIZE customer analytics software service global
PRICING management analytics crowdsourcing software development PRIVACY
Targeting Information management web METRICS
writing blog Music
curated web Production
Tool Development customer service
games Community Betterment
Search Engine
VIDEO STREAMING
networking
wireless
online music
cloud computing
Server Design
Search Engine
ecommerce
“Focus Function of Company”
1 2 3 4 5 6 7 8
Marketing Operations and Strategy Analytics IT Mobile Social Finance and Risk Others
Marketing Intelligence Platform Human insight at machine scale Event Data Analytics API Video distribution The Location-Based Marketing Platform Social Media Analytics and Reporting The financial terminal of the web. Healthcare Data
Marketing intelligence solutions Business Logic Abuse Fraud Protection The most advanced analytics for mobile Engagement Engine In-Store Mobile Commerce Social Media ROI Measurment Simple Inventory Management for Square Health Care Analytics
Smart Suggestions for Sales Reps Business Dashboards big data for foodservice Internet Company Mobile shopping lead generator The Social Media Customer Care Tool Social Payments Shaking Up Publishing
Connected data for marketers & ecommerce Customer Experience Platform Big data for clinical insight Real-time error tracking Consumer Data Made Easy by App Advanced Twitter Management Content Valuation Platform Healthcare transformation.
Customer Data-Powered Marketing Intelligent social media dashboard Business Analytics Secure NoSQL Database Mobile Advertising Technology The Twitter of food. Delivering Return on Social Competitions for startups
Enterprise Marketing Intelligence SaaS Business Status Dashboard Analytics for the Music Industry SaaS Job Marketing Platform Mobile App Analytics & Marketing Enterprise social network In-game Payment Solutions A place for people to talk about the tv
Local Advertising promotions optimization for ecommerce Advanced predictive analytics energy efficiency data platform Social mobile Social Media Marketing & Technology Peer-to-Peer Student Loans CRE research made simple
Know and Grow Your Audience Big Data Analytics Software-as-a-Service (SaaS) platform SMS /Online Reminders Social media performance measurement Billing for web hosts Changing how people save energy
Mobile Audience Targeting Social Business Analytics Software Company Mobile BI Platform Social Data Platform Peer-to-Peer Lending
Easy and Powerful Marketing Solutions Smart Data for Better Places Social Influence/Authority platform Twitter Monitoring App Social Media Opinion's Movement
info exchange for physician interactions White Label Web Analytics technology research iOS and Android Crash Reporting Social media gamification Making Cities Easier to Love
Marketing Decisions Platform Investing tools made simple Specialists in Internet TV mobile development tools Enterprise Ready Social Media Monitoring Save Time
Verified B2B Contacts Semantic Automation and Storytelling Real-time discounts platform Consumer Location Analytics music social networking Call 3.0 Company
Marketplace for quality tutoring. Real-time in-store analytics A market research technology firm. 360° mobile analytics Entertainment Based Social Networking Services on your terms.
Social media marketing web analytics api New Generation of NoSQL mobile app data provider Social Network Applications Helping people share
Acquire and retain valuable customers Real-time Media Analytics Cloud computing Order food and drink from your mobile. Social transparent
The Marketing Suite for the Visual Web Analytics online real estate broker Mobile Payment Services Leveraging Emotions
Customer Experience Management Platform Predictive Analytics Video Optimization Email powered applications science. semantic. simple. sisu.
Doubleclick for Market Research Data visualization system API for Social data context Media Relations
“Short Description of Company Profile”
“Industry of Company”: 40 unique values mapped to 11 categories
“Focus Function of Company”: 99 unique values mapped to 8 categories
“Short Description of Company Profile”: 310 unique values mapped to 8 categories
Therefore, a total of 11 * 8 * 8 = 704 categories
Data Clustering Implementation
Cross table 1: Industry (rows) x Functions of Company (columns)
# | Industry | 1 Marketing | 2 Operations and Strategy | 3 Analytics | 4 IT | 5 Mobile | 6 Social | 7 Finance and Risk | 8 Others
1 | Marketing | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18
2 | Operations and Strategy | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28
3 | Analytics | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38
4 | IT | 41 | 42 | 43 | 44 | 45 | 46 | 47 | 48
5 | Mobile | 51 | 52 | 53 | 54 | 55 | 56 | 57 | 58
6 | Social | 61 | 62 | 63 | 64 | 65 | 66 | 67 | 68
7 | Finance and Risk | 71 | 72 | 73 | 74 | 75 | 76 | 77 | 78
8 | Others | 81 | 82 | 83 | 84 | 85 | 86 | 87 | 88
9 | Govt | 91 | 92 | 93 | 94 | 95 | 96 | 97 | 98
10 | HR | 101 | 102 | 103 | 104 | 105 | 106 | 107 | 108
11 | Hospitality / Entertainment | 111 | 112 | 113 | 114 | 115 | 116 | 117 | 118
Cross table 2: Industry-Functions code (rows) x Profile bucket (columns)
1 2 3 4 5 6 7 8
Marketing Operations and Strategy Analytics IT Mobile Social Finance and Risk Others
Industry-Functions
11 Marketing - Marketing 111 112 113 114 115 116 117 118
21 Operations and Strategy - Marketing 211 212 213 214 215 216 217 218
31 Analytics - Marketing 311 312 313 314 315 316 317 318
41 IT - Marketing 411 412 413 414 415 416 417 418
51 Mobile - Marketing 511 512 513 514 515 516 517 518
61 Social - Marketing 611 612 613 614 615 616 617 618
71 Finance and Risk - Marketing 711 712 713 714 715 716 717 718
81 Others - Marketing 811 812 813 814 815 816 817 818
91 Govt - Marketing 911 912 913 914 915 916 917 918
101 HR - Marketing 1011 1012 1013 1014 1015 1016 1017 1018
111 Hospitality / Entertainment - Marketing 1111 1112 1113 1114 1115 1116 1117 1118
12 Marketing - Operations and Strategy 121 122 123 124 125 126 127 128
22 Operations and Strategy - Operations and Strategy 221 222 223 224 225 226 227 228
32 Analytics - Operations and Strategy 321 322 323 324 325 326 327 328
42 IT - Operations and Strategy 421 422 423 424 425 426 427 428
52 Mobile - Operations and Strategy 521 522 523 524 525 526 527 528
62 Social - Operations and Strategy 621 622 623 624 625 626 627 628
72 Finance and Risk - Operations and Strategy 721 722 723 724 725 726 727 728
82 Others - Operations and Strategy 821 822 823 824 825 826 827 828
92 Govt - Operations and Strategy 921 922 923 924 925 926 927 928
102 HR - Operations and Strategy 1021 1022 1023 1024 1025 1026 1027 1028
112 Hospitality / Entertainment - Operations and Strategy 1121 1122 1123 1124 1125 1126 1127 1128
13 Marketing - Analytics 131 132 133 134 135 136 137 138
23 Operations and Strategy - Analytics 231 232 233 234 235 236 237 238
33 Analytics - Analytics 331 332 333 334 335 336 337 338
43 IT - Analytics 431 432 433 434 435 436 437 438
53 Mobile - Analytics 531 532 533 534 535 536 537 538
63 Social - Analytics 631 632 633 634 635 636 637 638
73 Finance and Risk - Analytics 731 732 733 734 735 736 737 738
83 Others - Analytics 831 832 833 834 835 836 837 838
93 Govt - Analytics 931 932 933 934 935 936 937 938
103 HR - Analytics 1031 1032 1033 1034 1035 1036 1037 1038
113 Hospitality / Entertainment - Analytics 1131 1132 1133 1134 1135 1136 1137 1138
14 Marketing - IT 141 142 143 144 145 146 147 148
24 Operations and Strategy - IT 241 242 243 244 245 246 247 248
34 Analytics - IT 341 342 343 344 345 346 347 348
44 IT - IT 441 442 443 444 445 446 447 448
54 Mobile - IT 541 542 543 544 545 546 547 548
64 Social - IT 641 642 643 644 645 646 647 648
74 Finance and Risk - IT 741 742 743 744 745 746 747 748
84 Others - IT 841 842 843 844 845 846 847 848
94 Govt - IT 941 942 943 944 945 946 947 948
104 HR - IT 1041 1042 1043 1044 1045 1046 1047 1048
114 Hospitality / Entertainment - IT 1141 1142 1143 1144 1145 1146 1147 1148
15 Marketing - Mobile 151 152 153 154 155 156 157 158
25 Operations and Strategy - Mobile 251 252 253 254 255 256 257 258
35 Analytics - Mobile 351 352 353 354 355 356 357 358
45 IT - Mobile 451 452 453 454 455 456 457 458
55 Mobile - Mobile 551 552 553 554 555 556 557 558
65 Social - Mobile 651 652 653 654 655 656 657 658
75 Finance and Risk - Mobile 751 752 753 754 755 756 757 758
85 Others - Mobile 851 852 853 854 855 856 857 858
95 Govt - Mobile 951 952 953 954 955 956 957 958
105 HR - Mobile 1051 1052 1053 1054 1055 1056 1057 1058
115 Hospitality / Entertainment - Mobile 1151 1152 1153 1154 1155 1156 1157 1158
16 Marketing - Social 161 162 163 164 165 166 167 168
26 Operations and Strategy - Social 261 262 263 264 265 266 267 268
36 Analytics - Social 361 362 363 364 365 366 367 368
46 IT - Social 461 462 463 464 465 466 467 468
56 Mobile - Social 561 562 563 564 565 566 567 568
66 Social - Social 661 662 663 664 665 666 667 668
76 Finance and Risk - Social 761 762 763 764 765 766 767 768
86 Others - Social 861 862 863 864 865 866 867 868
96 Govt - Social 961 962 963 964 965 966 967 968
106 HR - Social 1061 1062 1063 1064 1065 1066 1067 1068
116 Hospitality / Entertainment - Social 1161 1162 1163 1164 1165 1166 1167 1168
17 Marketing - Finance and Risk 171 172 173 174 175 176 177 178
27 Operations and Strategy - Finance and Risk 271 272 273 274 275 276 277 278
37 Analytics - Finance and Risk 371 372 373 374 375 376 377 378
47 IT - Finance and Risk 471 472 473 474 475 476 477 478
57 Mobile - Finance and Risk 571 572 573 574 575 576 577 578
67 Social - Finance and Risk 671 672 673 674 675 676 677 678
77 Finance and Risk - Finance and Risk 771 772 773 774 775 776 777 778
87 Others - Finance and Risk 871 872 873 874 875 876 877 878
97 Govt - Finance and Risk 971 972 973 974 975 976 977 978
107 HR - Finance and Risk 1071 1072 1073 1074 1075 1076 1077 1078
117 Hospitality / Entertainment - Finance and Risk 1171 1172 1173 1174 1175 1176 1177 1178
18 Marketing - Others 181 182 183 184 185 186 187 188
28 Operations and Strategy - Others 281 282 283 284 285 286 287 288
38 Analytics - Others 381 382 383 384 385 386 387 388
48 IT - Others 481 482 483 484 485 486 487 488
58 Mobile - Others 581 582 583 584 585 586 587 588
68 Social - Others 681 682 683 684 685 686 687 688
78 Finance and Risk - Others 781 782 783 784 785 786 787 788
88 Others - Others 881 882 883 884 885 886 887 888
98 Govt - Others 981 982 983 984 985 986 987 988
108 HR - Others 1081 1082 1083 1084 1085 1086 1087 1088
118 Hospitality / Entertainment - Others 1181 1182 1183 1184 1185 1186 1187 1188
88 Unique Categorical Values for combination of two Columns
704 Unique Categorical Values for combination of three Columns
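The codes in the two cross tables above appear to be built by writing the bin numbers next to each other: industry bin (1 to 11), then function bin (1 to 8), then profile bin (1 to 8). A minimal sketch under that assumption; the function name `combined_code` is hypothetical:

```python
def combined_code(industry_bin, function_bin, profile_bin):
    """Concatenate the three bin numbers into one categorical code."""
    if not 1 <= industry_bin <= 11:
        raise ValueError("industry bin must be in 1..11")
    if not (1 <= function_bin <= 8 and 1 <= profile_bin <= 8):
        raise ValueError("function and profile bins must be in 1..8")
    return int(f"{industry_bin}{function_bin}{profile_bin}")

# Enumerate every combination to confirm the codes never collide.
all_codes = {combined_code(i, f, p)
             for i in range(1, 12)
             for f in range(1, 9)
             for p in range(1, 9)}

print(len(all_codes))            # number of distinct codes
print(combined_code(8, 2, 4))    # Others / Operations and Strategy / IT
print(combined_code(11, 3, 3))   # Hospitality / Analytics / Analytics
```

Because the function and profile bins are always single digits, the last two digits of a code identify them unambiguously, and whatever precedes them is the industry bin.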
Short Description of company profile | Bucket for Profile | Industry of company | Buckets for Industry | Focus functions of company | Buckets for Functions | Categorical Value for Industry-Functions
Video distribution | IT | (blank) | Others | operation | Operations and Strategy | 824
(blank) | Others | Market Research|Marketing|Crowdfunding | Marketing | Marketing, sales | Marketing | 118
Event Data Analytics API | Analytics | Analytics|Cloud Computing|Software Development | IT | operations | Operations and Strategy | 423
The most advanced analytics for mobile | Analytics | Mobile|Analytics | Analytics | Marketing & Sales | Marketing | 313
The Location-Based Marketing Platform | Mobile | Analytics|Marketing|Enterprise Software | IT | Marketing & Sales | Marketing | 415
big data for foodservice | Analytics | Food & Beverages|Hospitality | Hospitality / Entertainment | analytics | Analytics | 1133
(blank) | Others | Analytics | Analytics | Research | IT | 348
A total of 143 out of the 704 categorical values were mapped in the original dataset.
Data Clustering
K Means Clustering along with Ranking
Data Set -> the user chooses the number of clusters, K, into which the data should be clustered
1) K random centroids are chosen from the dataset
2) Each record is assigned to its closest cluster (the one giving the lowest SSE)
3) Re-compute the centroid of each cluster
4) Repeat the process until the centroids no longer change
Source :
- https://www.youtube.com/watch?v=u1NtKPuXQKo
- http://sci2s.ugr.es/keel/pdf/specific/congreso/brazdil00comparison.pdf
Resulting data set, with clusters specified as centroids -> ranking algorithm, which assigns a rank to each centroid in either ascending or descending order
- Deciles, i.e. 10 clusters
- Quintiles, i.e. 5 clusters
Result: segmented data with proper deciles, quintiles, etc.
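The clustering-then-ranking flow above can be sketched for a single numeric column as follows. This is a minimal sketch, not the tool used in the project: `kmeans_1d` and `rank_clusters` are hypothetical names, the input values are made up, and the result of k-means depends on the random starting centroids (here fixed by `seed`).

```python
import random

def kmeans_1d(values, k, seed=0, max_iter=100):
    """Plain k-means on a list of numbers; returns (centroids, labels)."""
    rng = random.Random(seed)
    # k distinct starting centroids drawn from the data itself.
    centroids = rng.sample(sorted(set(values)), k)
    labels = []
    for _ in range(max_iter):
        # Assignment step: nearest centroid by squared distance.
        labels = [min(range(k), key=lambda j: (v - centroids[j]) ** 2)
                  for v in values]
        # Update step: mean of each cluster (an empty cluster keeps its centroid).
        new_centroids = []
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            new_centroids.append(sum(members) / len(members) if members
                                 else centroids[j])
        if new_centroids == centroids:
            break  # centroids no longer change
        centroids = new_centroids
    return centroids, labels

def rank_clusters(centroids, labels):
    """Ascending ranking: rank 1 goes to the cluster with the smallest centroid."""
    order = sorted(range(len(centroids)), key=lambda j: centroids[j])
    rank_of = {cluster: rank for rank, cluster in enumerate(order, start=1)}
    return [rank_of[lab] for lab in labels]

values = [0.0, 0.5, 1.0, 10.0, 10.5, 11.0, 50.0, 51.0]
centroids, labels = kmeans_1d(values, k=3)
ranks = rank_clusters(centroids, labels)
print(list(zip(values, ranks)))
```

With K = 10 the ranks play the role of deciles and with K = 5 the role of quintiles, although k-means clusters are generally not equally sized.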
Columns (in order): Percent_skill_Entrepreneurship, Percent_skill_Operations, Percent_skill_Engineering, Percent_skill_Marketing, Percent_skill_Leadership, Percent_skill_Data Science, Percent_skill_Business Strategy, Percent_skill_Product Management, Percent_skill_Sales, Percent_skill_Domain, Percent_skill_Law, Percent_skill_Consulting, Percent_skill_Finance, Percent_skill_Investment, Renown score
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
15.88235294 11.76470588 15 12.94117647 0 8.823529412 21.76470588 10.88235294 2.941176471 0 0 0 0 0 8
9.401709402 0 57.47863248 0 0 3.846153846 17.09401709 9.401709402 0 2.777777778 0 0 0 0 9
0 0 0 0 0 0 0 0 0 0 0 0 0 0 5
0 0 0 0 0 0 0 0 0 0 0 0 0 0 6
6.25 0 3.125 15.625 9.375 3.125 6.25 3.125 3.125 0 0 0 0 0 6
0 0 66.66666667 5.555555556 0 22.22222222 0 0 0 5.555555556 0 0 0 0 0
0 0 100 0 0 0 0 0 0 0 0 0 0 0 2
8.333333333 0 46.73202614 5.718954248 8.333333333 0 19.77124183 2.777777778 2.777777778 0 0 0 0 5.555555556 5
8.333333333 0 27.08333333 19.79166667 0 23.95833333 0 0 0 20.83333333 0 0 0 0 4
3.846153846 0 26.92307692 0 3.846153846 3.846153846 7.692307692 0 3.846153846 0 0 0 0 0 6
27.27272727 0 18.18181818 0 9.090909091 0 36.36363636 9.090909091 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
8.333333333 0 50 16.66666667 0 12.5 4.166666667 4.166666667 4.166666667 0 0 0 0 0 0
13.33333333 0 6.666666667 60 0 13.33333333 6.666666667 0 0 0 0 0 0 0 1
11.11111111 5.555555556 5.555555556 0 11.11111111 11.11111111 27.77777778 0 11.11111111 5.555555556 0 0 5.555555556 5.555555556 5
8.333333333 0 58.33333333 0 0 0 25 0 0 8.333333333 0 0 0 0 2
20 0 20 0 40 0 20 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 5
0 0 0 0 0 0 0 0 0 0 0 0 0 0 8
5.555555556 0 27.77777778 11.11111111 0 16.66666667 27.77777778 5.555555556 0 5.555555556 0 0 0 0 7
On average, 61 records per column had Blank / No Info values, the majority of them in the “Failure” category.
Therefore, case-wise deletion was not a good option; we had to impute the data.
Data Clustering Implementation
Data Imputation
Source :
- Missing Value Imputation using Refined Mean Substitution - http://ijcsi.org/papers/IJCSI-9-4-3-306-313.pdf
- http://scs.math.yorku.ca/images/6/6d/Enders_jofschoolpsyc.pdf
- http://www4.ncsu.edu/~pollock/pdfs/Lecture%20ST%20432%20Weighting,%20Imputation%20and%20Variances.pdf
- http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.4863&rep=rep1&type=pdf (Page No. 6)
- http://www.csos.jhu.edu/contact/staff/jwayman_pub/wayman_multimp_aera2003.pdf
- http://www.stefvanbuuren.nl/mi/MI.html
- http://www.stefvanbuuren.nl/publications/MICE%20V1.0%20Manual%20TNO00038%202000.pdf
- http://arxiv.org/ftp/arxiv/papers/0704/0704.3474.pdf
- http://sci2s.ugr.es/keel/pdf/specific/articulo/2011-silva-mlp-imputation-NN.pdf
Missing values can occur in datasets in several forms:
• Missing values occur in several attributes (columns) [MAR]
• Missing values occur in a number of instances (rows) [MAR]
• Missing values occur randomly in attributes and instances [MNAR]
Class Mean Imputation
• Respondents (identifiers) are divided into classes.
• The cell mean for a particular class is used for all missing values in that class.
• This method can be biased: it overestimates the correlation and underestimates the variability of the data.
• Modified version: the Stochastic Regression Method, in which a random error term is added to each predicted score.
Multivariate Imputation by Chained Equation
• Missing values are predicted using existing values.
• The predicted values, called “imputes”, are substituted for the missing values, resulting in a full data set, the “imputed data set”.
• Performed multiple times using: Bayesian Linear Regression, Predictive Mean Matching, Unconditional Mean Imputation, Logistic Regression (Polytomous, >= 2 categories), LDA
Flow: Incomplete Data -> Imputed Data -> Analysis Results -> Pooled Results
Simulation of Neural Network Based Imputation
• The neural network method uses an auto-associative neural network to approximate missing data.
• We tried to simulate the basic structure of a neural network, in particular its ability to learn certain linear and non-linear inter-relationships in the input space.
• We also tried to simulate the functionality of an auto-encoder, which projects the input onto a smaller set by squashing it into fewer details.
Data Imputation Implementation
1) Consider the last 15 columns one at a time and impute the values using Class Mean Imputation, treating the “Company Category and Target Value” combination as the class.
2) After imputation, use K-Means to compute the clusters and rank them accordingly.
3) Repeat this for all 15 columns and sum up the ranks for each row. The highest possible total is 150 (15 columns x a maximum rank of 10). Evaluate whether this produces randomness in the data.
Company_Name | Dependent-Company Status | Categorical Value for Industry-Functions | Percent_skill_Entrepreneurship (original col.) | Percent_skill_Entrepreneurship (imputation inclusive col.) | Percent_skill_Entrepreneurship (ranked / segmented col., into 10 bins) | Overall Score (sum of all 15 ranks, out of a possible 150)
Company1 | Success | 824 | 0 | 0 | 0 | 0
Company2 | Success | 118 | 15.88235294 | 15.88235294 | 9 | 63
Company3 | Success | 423 | 9.401709402 | 9.401709402 | 6 | 46
Company4 | Success | 313 | 0 | 0 | 0 | 6
Company5 | Success | 415 | 0 | 0 | 0 | 7
Company6 | Success | 1133 | 6.25 | 6.25 | 4 | 37
Company7 | Success | 348 | 0 | 0 | 0 | 28
Company8 | Success | 448 | 0 | 0 | 0 | 12
Company9 | Success | 314 | 8.333333333 | 8.333333333 | 5 | 51
Company10 | Success | 343 | 8.333333333 | 8.333333333 | 5 | 42
Company11 | Success | 413 | 3.846153846 | 3.846153846 | 2 | 30
Company12 | Success | 828 | 27.27272727 | 27.27272727 | 10 | 42
Company376 | Success | 848 | No Info | 9.322638145 | 6 | 63
Company413 | Failed | 888 | No Info | 5.664488017 | 3 | 49
Imputed value for Company413 = average of “%_Skill_Entr.” (Col. 4) over all rows where:
1) Dependant_Comp._Status (Col. 2) = “Failed”; AND
2) Categorical_Val. (Col. 3) = “888”; AND
3) %_Skill_Entr. (Col. 4) is not equal to “No Info”
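The three conditions above amount to a class mean, with the class defined by the status and the categorical value. A minimal sketch of that rule; the dictionary keys (`status`, `category`, `skill`), the sample rows and the function name `class_mean_impute` are hypothetical and not taken from the dataset:

```python
from collections import defaultdict

MISSING = {"No Info", "", None}  # markers treated as missing

def class_mean_impute(rows, class_keys, column):
    """Fill missing values of `column` with the mean of the same class."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        if row[column] not in MISSING:
            key = tuple(row[k] for k in class_keys)
            sums[key] += float(row[column])
            counts[key] += 1
    imputed = []
    for row in rows:
        new_row = dict(row)  # leave the input rows untouched
        key = tuple(row[k] for k in class_keys)
        # Only impute when the class has at least one observed value.
        if row[column] in MISSING and counts[key]:
            new_row[column] = sums[key] / counts[key]
        imputed.append(new_row)
    return imputed

rows = [
    {"status": "Failed", "category": 888, "skill": 4.0},
    {"status": "Failed", "category": 888, "skill": 8.0},
    {"status": "Failed", "category": 888, "skill": "No Info"},
    {"status": "Success", "category": 848, "skill": "No Info"},
]
result = class_mean_impute(rows, ["status", "category"], "skill")
print([r["skill"] for r in result])
```

A row whose class has no observed values at all stays missing in this sketch; it would need a coarser class definition or another method.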
Similarly, the column “Internet Activity Score” was first imputed on similar grounds and then segmented.
Rank = Ceiling( (sum of all values in Column 5 that are <= this row's Column 5 value) / (total sum of Column 5) * K ), where K is the number of bins
Data Imputation Implementation
To calculate missing values in “Age of the Company”, first confirm the reference end date.
1) Take the smallest age value and look at the corresponding Established Date. These dates varied from Jan ’13 to Jul ’13; hence the reference date is somewhere between Jan and Jul 2014.
1) Take all the values of the columns “Age of the Company” and “Estb. Date” that are not NULL.
2) Take the latest date in the “Last Funding Date” column and assume it to be the reference date.
3) Verify that this assumption makes sense.
Age | Est. Founding Date | Last Funding Date | Date_Assumption | Diff. Assum. - Est.
5 | 6/20/2009 | 5/10/2012 | 4/8/2014 | 4.80274
4 | 4/1/2010 | 12/11/2013 | 4/8/2014 | 4.021918
4 | 5/1/2010 | 9/17/2013 | 4/8/2014 | 3.939726
3 | 1/1/2011 | 9/3/2013 | 4/8/2014 | 3.268493
4 | 1/1/2010 | 11/8/2012 | 4/8/2014 | 4.268493
3 | 1/1/2011 | 2/26/2014 | 4/8/2014 | 3.268493
1 | 5/16/2013 | 10/24/2013 | 4/8/2014 | 0.89589
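The “Diff. Assum. - Est.” column is consistent with the number of days between the two dates divided by 365. A minimal sketch of that check, assuming 4/8/2014 as the reference date; the function name `age_in_years` is hypothetical:

```python
from datetime import date

REFERENCE_DATE = date(2014, 4, 8)  # the assumed reference date from the table

def age_in_years(established, reference=REFERENCE_DATE):
    """Elapsed days between the two dates, expressed in 365-day years."""
    return (reference - established).days / 365

print(round(age_in_years(date(2009, 6, 20)), 5))  # 4.80274
print(round(age_in_years(date(2013, 5, 16)), 5))  # 0.89589
```

Both results match the table, and rounding them gives the recorded ages of 5 and 1, which supports the assumed reference date.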
Columns: Company_Name, Dependent-Company Status, Age of company in years, Age of company in years #1, Age of company in years #2, Age of company in years #3
Company1 Success No Info 1 1 1
Company2 Success 3 3 3 3
Company153 Failed No Info 6 6
Company453 Failed 7
Just as a neural network has several hidden layers that apply the same mathematical function in different ways, perform Class Mean Imputation repeatedly, each time using a different logical association on which “Age of the Company” might depend. Repeat this until 90% of the data is imputed.
Class variables used across the imputation passes: # of Advisors; Internet Activity Score Segment; Established Date; Success / Failure Target Value; Industry Category
Start: 59 blanks
Pass 1: 25 imputed, 34 blanks remaining
Pass 2: 25 imputed, 9 blanks remaining
Pass 3: 1 imputed, 8 blanks remaining
8 records were deleted (#“Success” = 6, #“Failure” = 2). Therefore, there is still no class imbalance.
Similarly, the column “Last Funding Amount” was imputed, and a total of 27 records were deleted, of which 20 were “Success” and 7 were “Failure”.
Therefore, total records deleted = 27 + 8 = 35, leaving 437 records: #Success = 279 and #Failure = 158.
Dimensionality Reduction
Dimension reduction is the mapping of data to a lower dimensional space such that uninformative variance of the data is discarded, or such that a subspace in which the data lives is detected.
How to take a picture to capture the most information about the rectangle?
Candidate views: A B C D E
Dimensionality Reduction
WHY THIS POSITION? BECAUSE IT PROVIDES THE MOST VISUAL INFORMATION!
[Figure: the object with its first longest axis and its second longest axis (found while fixing the first longest axis)]
PCA Understanding
• Rotate the object around its center to find the best orientation
• First, find the axis along which the object has the largest extent on average
• Rotate the object around the first axis to find the axis that is perpendicular to the first axis and along which the object has the largest extent on average
• The two axes found are the first and second principal components
• The PCA algorithm helps us find those components
• We decompose the covariance matrix of the data set into eigenvectors and their corresponding eigenvalues; they come in pairs
• An eigenvector is the direction of an axis / line (vertical, horizontal, 45 degrees, etc.), and its eigenvalue is a number telling us how spread out the data is along that line
• The eigenvector with the highest eigenvalue is therefore the first principal component
Source :
- https://www.youtube.com/watch?v=BfTMmoDFXyE
- https://georgemdallas.wordpress.com/2013/10/30/principal-component-analysis-4-dummies-eigenvectors-eigenvalues-and-dimension-reduction/
- http://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues
- http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf
- https://www.youtube.com/watch?v=7BUHYpNVT5E&list=LLeNKG4d3dB8SEg1gw91dmig&index=6
- https://nutsandboltsspeedtraining.com/spicypresentations/rotating-3d-shapes-with-powerpoint-animations/
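The eigenvector / eigenvalue pairing described above can be illustrated with a small sketch. It is not the method used in the project (which relied on Statgraphics Centurion); it uses power iteration with deflation on the covariance matrix, and the toy points, function names and iteration count are all assumptions made for illustration.

```python
import math
import random

def covariance_matrix(rows):
    """Sample covariance matrix of a list of equal-length numeric rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def top_eigenpair(matrix, iterations=500, seed=0):
    """Power iteration: largest eigenvalue and unit eigenvector of a symmetric matrix."""
    d = len(matrix)
    rng = random.Random(seed)
    vec = [rng.gauss(0, 1) for _ in range(d)]
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            break
        vec = [x / norm for x in nxt]
    # Rayleigh quotient v^T M v gives the eigenvalue for the unit vector v
    value = sum(vec[i] * sum(matrix[i][j] * vec[j] for j in range(d))
                for i in range(d))
    return value, vec

def principal_components(rows, k):
    """First k (eigenvalue, eigenvector) pairs, largest eigenvalue first."""
    cov = covariance_matrix(rows)
    d = len(cov)
    pairs = []
    for _ in range(k):
        value, vec = top_eigenpair(cov)
        pairs.append((value, vec))
        # Deflation: subtract the component just found before searching again
        cov = [[cov[i][j] - value * vec[i] * vec[j] for j in range(d)]
               for i in range(d)]
    return pairs

# Hypothetical toy data: points spread along the 45-degree line,
# with a small wiggle perpendicular to it.
rows = []
for i, t in enumerate(range(-5, 6)):
    w = 0.5 if i % 2 == 0 else -0.5
    rows.append([t + w, t - w])

pairs = principal_components(rows, 2)
for value, vec in pairs:
    print("eigenvalue", round(value, 4), "direction", [round(x, 4) for x in vec])
```

In this toy set the first direction lies along the 45-degree line and carries almost all of the spread, which is the sense in which the eigenvector with the highest eigenvalue is the first principal component. The sign of each direction vector is arbitrary.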
Dimensionality Reduction Implementation
1) Choose columns that seem to be correlated but whose relationship isn't identifiable mathematically
2) Check the number of missing values
3) Run the Multiple Imputation method on them
4) After running Multiple Imputation, collect the 10 different sets of imputed values
5) Take the average of each cell across the 10 tables
6) Plug the resulting data into Statgraphics Centurion and run PCA
7) Look at the eigenvalue graph (scree plot) to get an idea of how many components describe 85% – 90% of the data set
8) Take the value of Covariance Matrix * Component Weights for each component to get the data values according to their spread in that component's space
ID   Employee Count   Employees count MoM change   Has the team size grown   Team size all employees
1    3                0                            -1                        15
2    17               (blank)                      -1                        20
3    14               0                            -1                        10
4    45               10                           -1                        50
5    39               3                            -1                        40
6    14               8                            -1                        14
7    7                0                            -1                        15
8    29               -12                          -1                        40
9    16               45                           -1                        50
10   3                (blank)                      -1                        3
11   34               0                            -1                        50
[Figure: Scree Plot of Eigenvalue (axis from 0 to 1.8) against Component (axis from 0 to 4)]
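Two of the steps above, averaging the imputed tables cell by cell and reading off how many components cover 85% – 90% of the variance, can be sketched as follows. This is only an illustration under stated assumptions: the function names, the two tiny imputed tables and the eigenvalues are hypothetical and are not taken from the project's Statgraphics output.

```python
def average_imputations(tables):
    """Cell-wise mean of several imputed copies of the same table."""
    n_tables = len(tables)
    n_rows, n_cols = len(tables[0]), len(tables[0][0])
    return [[sum(t[r][c] for t in tables) / n_tables for c in range(n_cols)]
            for r in range(n_rows)]

def components_needed(eigenvalues, target=0.85):
    """Smallest number of leading components whose eigenvalues reach `target` of the total."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for count, value in enumerate(ordered, start=1):
        running += value
        if running / total >= target:
            return count
    return len(ordered)

# Hypothetical example: two imputed copies that differ only in the one
# cell that was missing in the raw data (second row, second column).
imputed_a = [[3.0, 0.0], [17.0, 4.0]]
imputed_b = [[3.0, 0.0], [17.0, 6.0]]
averaged = average_imputations([imputed_a, imputed_b])

# Hypothetical eigenvalues for four standardized columns (they sum to 4).
eigenvalues = [1.8, 1.7, 0.4, 0.1]
needed_85 = components_needed(eigenvalues, target=0.85)
needed_90 = components_needed(eigenvalues, target=0.90)
print(averaged, needed_85, needed_90)
```

With 10 imputed tables the same function applies unchanged; only the list passed to `average_imputations` grows.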
Feature Selection
Feature selection refers to selecting the attributes in the data set that are most relevant to predictive modeling
2 Date and 1 Year Column
~10 Columns for Top Management
> 40 Columns for Team Members
Target Variable
The variables associated with it have already been included in the final feature list, so adding this would be redundant.
A few data sets have been included, whereas others with binary attributes have been omitted.
4 Investors Portfolio and 1 Funding Received Column
The funding information has been included; however, the seed funder and investor details are not, because their 319 unique values would not lead to any information gain.
REJECTED LIST ACCEPTED LIST
Identifier
2 PCA Components
Age of the Company
Most of the columns were in binary form with many missing values, which were difficult to impute
Internet Activity Detail
Funding Received Information
# Co-Founders and Investors
11
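The reasoning about columns with too many unique values can be turned into a simple screening check. The sketch below is an assumption-laden illustration, not the procedure used in the project: the 0.5 threshold, the function names and the sample columns are hypothetical, while the blank and "No Info" markers follow the missing-value markers described in the data set.

```python
MISSING_MARKERS = ("", None, "No Info")

def unique_ratio(values):
    """Share of distinct values among the non-missing entries of a column."""
    present = [v for v in values if v not in MISSING_MARKERS]
    if not present:
        return 0.0
    return len(set(present)) / len(present)

def split_by_cardinality(columns, max_ratio=0.5):
    """Split column names into (accepted, rejected) by their unique-value ratio."""
    accepted, rejected = [], []
    for name, values in columns.items():
        if unique_ratio(values) > max_ratio:
            rejected.append(name)  # almost every row has its own value
        else:
            accepted.append(name)
    return accepted, rejected

# Hypothetical columns: investor names are nearly unique per row,
# while a funding band repeats across rows.
columns = {
    "investors": ["Fund A", "Fund B", "Fund C", "No Info", "Fund D"],
    "funding_band": ["low", "low", "high", "low", "high"],
}
accepted, rejected = split_by_cardinality(columns)
print("accepted:", accepted, "rejected:", rejected)
```

For comparison, 319 unique values in 472 rows is a ratio of roughly 0.68, so a column like that would be rejected under this hypothetical threshold.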
Model Building
Ignored, since Rattle's Random Forest can only handle categorical variables with up to 32 levels
Confusion Matrix
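One possible alternative to ignoring such variables, assuming the limit concerns the number of distinct levels in a categorical variable, is to keep the most frequent levels and fold the rest into a catch-all level. This was not done in the project; the function name, the "Others" label and the sample values below are hypothetical.

```python
from collections import Counter

def cap_levels(values, max_levels=32, other_label="Others"):
    """Keep the most frequent levels and fold the remaining ones into `other_label`."""
    counts = Counter(values)
    if len(counts) <= max_levels:
        return list(values)
    # Reserve one slot for the catch-all level itself
    keep = {level for level, _ in counts.most_common(max_levels - 1)}
    return [v if v in keep else other_label for v in values]

# Hypothetical column with 40 distinct levels
values = ["level_%d" % (i % 40) for i in range(400)] + ["level_0"] * 5
capped = cap_levels(values, max_levels=32)
print(len(set(values)), "levels before,", len(set(capped)), "levels after")
```

The trade-off is that rare levels lose their identity, so this only makes sense when the rare levels are not individually informative.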
Testing And Evaluation
[Figures: results for the Validation Set and for the Testing Set]
RANDOM FOREST
Class      True Positive   True Negative   False Positive   False Negative
Failed     68.42%          91.67%          8.33%            31.58%
Success    91.67%          68.42%          31.58%           8.33%
[Figure: Misclassification Rate]
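The per-class rates in the table can be derived from confusion-matrix counts as sketched below. The slides do not give the underlying counts, so the counts used here are hypothetical, chosen only because they reproduce the reported percentages; the function names are likewise assumptions.

```python
def class_rates(tp, fn, fp, tn):
    """Rates for one class treated as 'positive', from confusion-matrix counts."""
    return {
        "true_positive_rate": tp / (tp + fn),
        "false_negative_rate": fn / (tp + fn),
        "true_negative_rate": tn / (tn + fp),
        "false_positive_rate": fp / (tn + fp),
    }

def misclassification_rate(tp, fn, fp, tn):
    """Share of all records that were assigned to the wrong class."""
    return (fp + fn) / (tp + fn + fp + tn)

# HYPOTHETICAL counts (not from the slides): 19 failed and 24 successful
# companies, with 13 and 22 of them classified correctly.
failed = class_rates(tp=13, fn=6, fp=2, tn=22)
# Treating "Success" as the positive class swaps the roles of the counts.
success = class_rates(tp=22, fn=2, fp=6, tn=13)

for name, rates in (("Failed", failed), ("Success", success)):
    print(name, {key: round(100 * value, 2) for key, value in rates.items()})
print("misclassification:", round(misclassification_rate(13, 6, 2, 22), 4))
```

This also shows why the two rows of the table mirror each other: with two classes, a true positive for one class is a true negative for the other.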
Future Enhancement
• Currently, the project depends heavily on multiple tools, e.g. Statgraphics Centurion for multivariate analysis. Statgraphics is a paid tool. This dependence could be removed by building an in-house plug-in or a library function for the requirement.
• The project relies heavily on ad-hoc analysis, so there is a high chance of omitting steps when a new data set arrives. Once the overall step-wise procedure has been drafted, each step could be automated. VBA or R programming could be a good option for this automation.
• Visualizing the data set could help in making quicker, better-informed decisions.
Thank You !!

Business Intelligence Project on Startup Success Factors

  • 1. MIS 6324 : BUSINESS INTELLIGENCE TERM PROJECT ON VIBHORE AGARWAL
  • 2. University of Texas at Dallas Agenda Introduction – Business Understanding Data Understanding Data Cleaning and Preparation Feature Selection Model Building Testing and Evaluation Future Enhancement
  • 3. University of Texas at Dallas Introduction – Business Understanding BUSINESS UNDERSTANDING SCOPE : Investment strategies for investing in start-up companies are widely based on intuition and past experience. As a result, investors rely primarily on the need being addressed, background of the founders, size of the market and the ability of the company to scale after tasting success. AIM : To perform some rigorous analysis that can be used to identify relevant factors and score prospective start-ups on their potential to be successful. RESULT : The model / analysis will then allow investors to make some more informed decisions and rely less on intuitions. DATA UNDERSTANDING / EXPLORATION DATA CLEANING DATA PREPARATION FEATURE SELECTION MODEL BUILDING TESTING AND EVALUATION Venture Capitalist (CLIENT)
  • 4. University of Texas at Dallas Introduction – Business Understanding
  • 5. University of Texas at Dallas Agenda Introduction – Business Understanding Data Understanding Data Cleaning and Preparation Feature Selection Model Building Testing and Evaluation Future Enhancement
  • 6. University of Texas at Dallas Data Understanding { } – 15% of the entire data 2 Date and 1 Year Column 2 Scorecard Column 4 Company Profile Column – Tab and Comma Delimited 4 Investors Portfolio and 1 Funding Received Column $ ~10 Columns for Top Management > 40 Column for Team Members Unstructured Data 116 – columns ; 472 (unique) – rows / records ; 1 Target Variable • # Unique across Columns • Spread of Attributes a/c Columns • Min / Max Frequency - Attribute No Class Imbalance Blanks “ “ ~18 / Col. “No Info” ~ 53 / Col. ~ 401 useful value / Col. ~ 362 out of 472 rows has at least 1 blank
  • 7. University of Texas at Dallas Data Understanding Dash-Board for Initial Analysis of Data
  • 8. University of Texas at Dallas Agenda Introduction – Business Understanding Data Understanding Data Cleaning and Preparation Feature Selection Model Building Testing and Evaluation Future Enhancement
  • 9. University of Texas at Dallas Data Cleaning and Data Preparation In general, Data Cleaning and Data Preparation are Data Pre-processing steps which involves data filtering, aggregation and imputation of missing values DATA IMPUTATION • Class Mean Imputation (Clustering Based Missing Value Imputation) • Simulation of Neural Network Based Imputation • Multivariate Imputation by chained Equation (Excel Add-in) DATA CLUSTERING • Using business logic to create nominal bins and using Fuzzy Lookup to remove the redundant groups and to map the raw data to the bins / buckets • Clustering Nominal Data using Cross Table aka. Bertin Matrix Visualization • K-Means Clustering along with Ranking DIMENSIONALITY REDUCTION • Principal Component Analysis • Logical Operations based on Business Understanding USED IN CONJUNCTION TO PRE-PROCESS THE DATA
  • 10. University of Texas at Dallas Situation Analysis – Types of Missing Data Market Research|Marketing|Crowdfunding Marketing, sales VS Tab and Comma Delimited Data i.e. Multiple variables inside a single cell Given the observed data, data are missing independently of unobserved data Missing observations related to values of unobserved data
  • 11. University of Texas at Dallas Data Clustering Creation of Bins / Buckets based on Business Logic It lets you quickly category records. When we create a bucket, we basically define multiple categories (buckets) used to group similar variables. GREEN RED BLUE RECTANGLE TRIANGLE CIRCLE COLORS SHAPES NEED / REQUIREMNT Source : - https://www.exploreanalytics.com/wiki/index.php?title=Binning
  • 12. University of Texas at Dallas Data Clustering Fuzzy Logic for mapping and reducing redundant groups A challenging problem in Data Management is that same entity can be represented in multiple ways, throughout the dataset. Andy Hill Mr. Andrew HillHill, Mr. AndrewAndy Hill Mr. Andrew HillHill, Mr. Andrew Essentially, they all refer to the same person But, during analysis they are treated as different person Application of Fuzzy Logic allows us to identify records which are textually similar These variations results basically because of : 1) Merging of independent data source 2) Spelling Mistakes 3) Inconsistent naming conventions and abbreviations Methodologies Used : • Jaccard Similarity • Weighted Jaccard Similarity and Tokenization of Records • Token Weighting • Transformations • Jaccard Similarity under Transformation • Edit Distance Source : - https://atidan.files.wordpress.com/2013/08/fuzzy-lookup-add-in-for-excel.pdf Snap Shot
  • 13. University of Texas at Dallas Data Clustering Implementation Considering Each Column at a Time Removing Duplicates (i.e. see analyze the unique occurrences); also remove the inconsistency in the data using Fuzzy Technique Decide upon the # of bins based Map each row to the bins based on Fuzzy Mapping technique 348 records Consider Column “Industry of a Company”
  • 14. Data Clustering Implementation
    Considering each column one at a time:
    1) Remove duplicates (i.e. analyze the unique occurrences) and remove inconsistencies in the data using a fuzzy technique.
    2) Decide on the number of bins.
    3) Map each row to a bin using the fuzzy mapping technique.
    Example column: "Industry of a Company". 348 records; duplicates removal + removal of textually similar values = 40 unique values; 11 bins in total.
    Sample mapping (Industry -> Final Industry Val. (Similarity) -> Final 40 values -> Bin Allocation):
    (blank) -> (blank) (0.0000) -> Others -> Others (default value)
    "Market Research|Marketing|Crowdfunding" -> Market Research (0.4554) -> Market Research -> Marketing
    "Analytics|Cloud Computing|Software Development" -> Software Development (0.5042) -> Software Development -> IT
    "Mobile|Analytics" -> Analytics (0.0000) -> Analytics -> Analytics
    "Analytics|Marketing|Enterprise Software" -> Enterprise Software (0.3929) -> Enterprise Software -> IT
    "Food & Beverages|Hospitality" -> Food & Beverages (0.4615) -> Food & Beverages -> Hospitality / Entertainment
    "Analytics" -> Analytics (1.0000) -> Analytics -> Analytics
    "Cloud Computing|Network / Hosting / Infrastructure" -> Network / Hosting / Infrastructure (0.6095) -> Network / Hosting / Infrastructure -> IT
    "Analytics|Mobile|Marketing" -> Analytics (0.0000) -> Analytics -> Analytics
    "Healthcare|Pharmaceuticals|Analytics" -> Analytics (0.0000) -> Analytics -> Analytics
    The searched values are matched with the bin attributes.
    Final bins: 1 Marketing; 2 Operations and Strategy; 3 Analytics; 4 IT; 5 Mobile; 6 Social; 7 Finance and Risk; 8 Others; 9 Govt; 10 HR; 11 Hospitality / Entertainment
    The 40 values distributed over the bins, read row by row: Advertising; Space Travel; Analytics; CleanTech; Mobile; Media; Finance; Career / Job Search; Energy; Human Resources (HR); Entertainment; Market Research; Transportation; Deals; Cloud Computing; Social Networking; Crowdfunding; Classifieds; Security; Food & Beverages; Marketing; Travel; E-Commerce; Insurance; Education; Government; Music; Retail; Email; Healthcare; Hospitality; Enterprise Software; Publishing; Gaming; Network / Hosting / Infrastructure; Real Estate; Search; Software Development; Telecommunications; Pharmaceuticals
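The duplicate-removal and fuzzy-mapping steps of slide 14 can be sketched in Python. This is only a minimal sketch, not the Excel add-in used in the project: it relies on `difflib.SequenceMatcher` from the standard library, so its similarity scores will differ from the ones shown on the slide. The `VALUE_TO_BIN` excerpt, the thresholds, and the function names are illustrative assumptions.

```python
from difflib import SequenceMatcher

# Excerpt of cleaned values and their bins, copied from the sample rows on
# slide 14. The full 40-value dictionary is not reproduced here.
VALUE_TO_BIN = {
    "Market Research": "Marketing",
    "Software Development": "IT",
    "Enterprise Software": "IT",
    "Network / Hosting / Infrastructure": "IT",
    "Food & Beverages": "Hospitality / Entertainment",
    "Analytics": "Analytics",
}


def similarity(a, b):
    """Case-insensitive similarity in [0, 1] (difflib ratio, an assumption)."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def deduplicate(values, threshold=0.9):
    """Keep the first spelling of every group of textually similar values."""
    kept = []
    for value in values:
        if not any(similarity(value, k) >= threshold for k in kept):
            kept.append(value)
    return kept


def assign_bin(raw_value, threshold=0.8, default="Others"):
    """Map a raw value to the bin of its closest known value, else the default."""
    best_value, best_score = None, 0.0
    for known in VALUE_TO_BIN:
        score = similarity(raw_value, known)
        if score > best_score:
            best_value, best_score = known, score
    if best_value is None or best_score < threshold:
        return default
    return VALUE_TO_BIN[best_value]
```

A misspelled entry such as "Analytcs" would land in the Analytics bin, while a blank or unrelated entry falls back to the default bin "Others", mirroring the default value on the slide.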
  • 15. Data Clustering: Clustering Nominal Data using a Cross Table, aka Bertin Matrix Visualization
    A Bertin matrix, or cross table (pivot chart), allows rearrangements that transform an initial matrix into a more homogeneous structure. The rearrangements are row and column permutations and groupings.
    Example: crossing GREEN, RED, BLUE with RECTANGLE, TRIANGLE, CIRCLE.
    Cells as frequency of occurrence (values on the slide): 1 0 1 1 1 1 1 3 2
    Cells as concatenated labels: "Green – Rectangle", "Green – Triangle", "Green – Circle", "Red – Rectangle", "Red – Triangle", "Red – Circle", "Blue – Rectangle", "Blue – Triangle", "Blue – Circle"
    Source:
    - http://www.aviz.fr/wiki/uploads/Bertifier/bertifier-authorversion.pdf
    - http://bertin.r-forge.r-project.org/bertinR.pdf
    - https://books.google.com/books?id=2Q1qCQAAQBAJ&pg=PA398&lpg=PA398&dq=clustering+nominal+data+using+cross+table&source=bl&ots=mzNGnnPu6H&sig=axEgHoiUmntfXwlwMfqIbMls05A&hl=en&sa=X&ved=0ahUKEwjji-uZkczJAhUQ2WMKHbYXB4gQ6AEIMzAE#v=onepage&q=clustering%20nominal%20data%20using%20cross%20table&f=false
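A cross table of two nominal columns, as described on slide 15, can be sketched with the standard library. The records in the usage note are made up for illustration, and the choice of colours as rows and shapes as columns is an assumption; the slide does not fix the orientation.

```python
from collections import Counter


def cross_table(records, row_levels, col_levels):
    """Frequency of each (row level, column level) pair, as a list of rows."""
    counts = Counter(records)
    return [[counts[(row, col)] for col in col_levels] for row in row_levels]


def concatenated_labels(row_levels, col_levels):
    """Label each cell by joining its row and column headings."""
    return [[f"{row} - {col}" for col in col_levels] for row in row_levels]
```

For example, `cross_table([("Green", "Circle"), ("Green", "Circle"), ("Red", "Triangle")], ["Green", "Red", "Blue"], ["Rectangle", "Triangle", "Circle"])` counts two green circles and one red triangle, and `concatenated_labels` yields cells such as "Green - Circle" for the case where no meaningful frequency exists.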
  • 16. Data Clustering Implementation
    1) Make a cross-table for 2 columns.
    2) Take the intersection values of cross-table 1 as rows and the left-out column as columns to form another cross-table.
    3) Map the categorical values resulting from the final cross-table to the dataset, giving a single column (a reduction from 3 columns to 1).
    "Industry of Company": 40 unique values mapped to 11 categories
    "Focus Function of Company": 99 unique values mapped to 8 categories
    "Short Description of Company Profile": 310 unique values mapped to 8 categories
    Therefore, a total of 11 * 8 * 8 = 704 categories
    Buckets of both tables below: 1 Marketing; 2 Operations and Strategy; 3 Analytics; 4 IT; 5 Mobile; 6 Social; 7 Finance and Risk; 8 Others
    Bucket table of keywords, read row by row across the eight buckets:
    targeted marketing Solution providing Web Analytics Research mobile app social advertising Risk service Sales Strategy analytic Computing application social news Inventory management security consumer behaviour Social Media optimization intellectual property analysis Technology PERSONAL APPS social media marketing PAYMENT Recommendation retail Optimization data visualization Bug fix IPHONE APPS social commerce finance Energy saving consumer web Travel Planning Social media analytics Data Integration mobile app development social branding revenue maximization entertainment APP REVENUE reporting PHONE INTELLIGENCE malware protection Location based service SOCIAL MEDIA CAMPAIGN enterprise Merchandising Customer Retention DASHBOARDS Music intelligece Database Management app Social Media billing News customer engagement MAIL REPORTS SOCIAL TV ANALYTICS Data Collection Data driven applications social network localized behaviour CRM NETWORK OPTIMIZATION big data analytics e-learning PUBLISHING advertising TARGETING OPTIMIZE customer analytics software service global PRICING management analytics crowdsourcing software development PRIVACY Targeting Information management web METRICS writing blog Music curated web Production Tool Development customer service games Community Betterment Search Engine VIDEO STREAMING networking wireless online music cloud computing Server Design Search Engine ecommerce
    Bucket table of short descriptions, read row by row across the eight buckets:
    Marketing Intelligence Platform Human insight at machine scale Event Data Analytics API Video distribution The Location-Based Marketing Platform Social Media Analytics and Reporting The financial terminal of the web. Healthcare Data Marketing intelligence solutions Business Logic Abuse Fraud Protection The most advanced analytics for mobile Engagement Engine In-Store Mobile Commerce Social Media ROI Measurment Simple Inventory Management for Square Health Care Analytics Smart Suggestions for Sales Reps Business Dashboards big data for foodservice Internet Company Mobile shopping lead generator The Social Media Customer Care Tool Social Payments Shaking Up Publishing Connected data for marketers & ecommerce Customer Experience Platform Big data for clinical insight Real-time error tracking Consumer Data Made Easy by App Advanced Twitter Management Content Valuation Platform Healthcare transformation. Customer Data-Powered Marketing Intelligent social media dashboard Business Analytics Secure NoSQL Database Mobile Advertising Technology The Twitter of food. Delivering Return on Social Competitions for startups Enterprise Marketing Intelligence SaaS Business Status Dashboard Analytics for the Music Industry SaaS Job Marketing Platform Mobile App Analytics & Marketing Enterprise social network In-game Payment Solutions A place for people to talk about the tv Local Advertising promotions optimization for ecommerce Advanced predictive analytics energy efficiency data platform Social mobile Social Media Marketing & Technology Peer-to-Peer Student Loans CRE research made simple Know and Grow Your Audience Big Data Analytics Software-as-a-Service (SaaS) platform SMS /Online Reminders Social media performance measurement Billing for web hosts Changing how people save energy Mobile Audience Targeting Social Business Analytics Software Company Mobile BI Platform Social Data Platform Peer-to-Peer Lending Easy and Powerful Marketing Solutions Smart Data for Better Places Social Influence/Authority platform Twitter Monitoring App Social Media Opinion's Movement info exchange for physician interactions White Label Web Analytics technology research iOS and Android Crash Reporting Social media gamification Making Cities Easier to Love Marketing Decisions Platform Investing tools made simple Specialists in Internet TV mobile development tools Enterprise Ready Social Media Monitoring Save Time Verified B2B Contacts Semantic Automation and Storytelling Real-time discounts platform Consumer Location Analytics music social networking Call 3.0 Company Marketplace for quality tutoring. Real-time in-store analytics A market research technology firm. 360° mobile analytics Entertainment Based Social Networking Services on your terms. Social media marketing web analytics api New Generation of NoSQL mobile app data provider Social Network Applications Helping people share Acquire and retain valuable customers Real-time Media Analytics Cloud computing Order food and drink from your mobile. Social transparent The Marketing Suite for the Visual Web Analytics online real estate broker Mobile Payment Services Leveraging Emotions Customer Experience Management Platform Predictive Analytics Video Optimization Email powered applications science. semantic. simple. sisu. Doubleclick for Market Research Data visualization system API for Social data context Media Relations
  • 17. Data Clustering Implementation
    Cross-table 1: Industry (rows, bins 1 to 11) x Functions of Company (columns, bins 1 to 8: Marketing, Operations and Strategy, Analytics, IT, Mobile, Social, Finance and Risk, Others). Each cell is the row number followed by the column number:
    1 Marketing: 11 12 13 14 15 16 17 18
    2 Operations and Strategy: 21 22 23 24 25 26 27 28
    3 Analytics: 31 32 33 34 35 36 37 38
    4 IT: 41 42 43 44 45 46 47 48
    5 Mobile: 51 52 53 54 55 56 57 58
    6 Social: 61 62 63 64 65 66 67 68
    7 Finance and Risk: 71 72 73 74 75 76 77 78
    8 Others: 81 82 83 84 85 86 87 88
    9 Govt: 91 92 93 94 95 96 97 98
    10 HR: 101 102 103 104 105 106 107 108
    11 Hospitality / Entertainment: 111 112 113 114 115 116 117 118
    88 unique categorical values for the combination of two columns.
    Cross-table 2: Functions-Industry value (rows, the 88 values above, from 11 "Marketing - Marketing" to 118 "Hospitality / Entertainment - Others") x Profile (columns, bins 1 to 8). Each cell is the row value followed by the column number, e.g. row 11 gives 111 to 118, row 21 "Operations and Strategy - Marketing" gives 211 to 218, row 101 "HR - Marketing" gives 1011 to 1018, and row 118 gives 1181 to 1188.
    704 unique categorical values for the combination of three columns.
    Sample mapping (Short Description of company profile -> Bucket for Profile; Industry of company -> Bucket for Industry; Focus functions of company -> Bucket for Functions; Categorical Value for Industry-Functions):
    "Video distribution" -> IT; (blank) -> Others; "operation" -> Operations and Strategy; 824
    (blank) -> Others; "Market Research|Marketing|Crowdfunding" -> Marketing; "Marketing, sales" -> Marketing; 118
    "Event Data Analytics API" -> Analytics; "Analytics|Cloud Computing|Software Development" -> IT; "operations" -> Operations and Strategy; 423
    "The most advanced analytics for mobile" -> Analytics; "Mobile|Analytics" -> Analytics; "Marketing & Sales" -> Marketing; 313
    "The Location-Based Marketing Platform" -> Mobile; "Analytics|Marketing|Enterprise Software" -> IT; "Marketing & Sales" -> Marketing; 415
    "big data for foodservice" -> Analytics; "Food & Beverages|Hospitality" -> Hospitality / Entertainment; "analytics" -> Analytics; 1133
    (blank) -> Others; "Analytics" -> Analytics; "Research" -> IT; 348
    A total of 143 out of the 704 categorical values were mapped in the original dataset.
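The two cross tables of slide 17 amount to concatenating three bin numbers: industry bin, then function bin, then profile bin. A minimal sketch of that encoding follows; the bin names and numbers come from the slide, while the function name `combined_code` is hypothetical.

```python
INDUSTRY_BINS = [
    "Marketing", "Operations and Strategy", "Analytics", "IT", "Mobile",
    "Social", "Finance and Risk", "Others", "Govt", "HR",
    "Hospitality / Entertainment",
]
# The function and profile columns use only the first eight bins.
FUNCTION_BINS = INDUSTRY_BINS[:8]
PROFILE_BINS = INDUSTRY_BINS[:8]


def combined_code(industry_bin, function_bin, profile_bin):
    """Concatenate the three 1-based bin numbers into one categorical value."""
    i = INDUSTRY_BINS.index(industry_bin) + 1
    f = FUNCTION_BINS.index(function_bin) + 1
    p = PROFILE_BINS.index(profile_bin) + 1
    return int(f"{i}{f}{p}")


# All possible codes: 11 industries x 8 functions x 8 profiles.
ALL_CODES = {
    combined_code(i, f, p)
    for i in INDUSTRY_BINS for f in FUNCTION_BINS for p in PROFILE_BINS
}
```

Because the function and profile bins are single digits, the concatenation is unambiguous, which is why the three columns can be collapsed into one without collisions.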
  • 18. Data Clustering: K-Means Clustering along with Ranking
    Data set -> the user chooses the number of clusters, K, into which the data should be clustered:
    1) K random centroids are chosen from the dataset.
    2) Each record is assigned to its closest cluster (the assignment that gives the lowest SSE).
    3) The centroid of each cluster is re-computed.
    4) The process is repeated until the centroids no longer change.
    -> Resulting data set, with the clusters specified by their centroids
    -> Ranking algorithm, which assigns a rank to each centroid in either ascending or descending order (deciles, i.e. 10 clusters; quintiles, i.e. 5 clusters)
    -> Segmented data with proper deciles, quintiles, etc.
    Source:
    - https://www.youtube.com/watch?v=u1NtKPuXQKo
    - http://sci2s.ugr.es/keel/pdf/specific/congreso/brazdil00comparison.pdf
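The four k-means steps and the centroid ranking of slide 18 can be sketched for a single numeric column in plain Python. This is an illustrative sketch, not the tool used in the project; the function names, the `initial` and `seed` parameters, and the handling of empty clusters (the old centroid is kept) are assumptions.

```python
import random


def kmeans_1d(values, k, initial=None, seed=0, max_iter=100):
    """Lloyd's k-means on one numeric column; returns (labels, centroids)."""
    if initial is not None:
        centroids = list(initial)
    else:
        # Step 1: pick K random distinct values as starting centroids.
        centroids = random.Random(seed).sample(sorted(set(values)), k)
    labels = []
    for _ in range(max_iter):
        # Step 2: assign every value to its nearest centroid.
        labels = [min(range(k), key=lambda j: (v - centroids[j]) ** 2)
                  for v in values]
        # Step 3: re-compute each centroid as the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            new_centroids.append(sum(members) / len(members) if members
                                 else centroids[j])
        # Step 4: stop once the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids


def rank_clusters(labels, centroids, ascending=True):
    """Replace each cluster label by the rank of its centroid (1 = first)."""
    order = sorted(range(len(centroids)), key=lambda j: centroids[j],
                   reverse=not ascending)
    rank_of = {cluster: rank for rank, cluster in enumerate(order, start=1)}
    return [rank_of[lab] for lab in labels]
```

With `k=10` the ranks play the role of deciles and with `k=5` of quintiles. Random starting centroids can converge to different clusterings, so passing `initial` (or fixing `seed`) keeps a run reproducible.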
  • 19. Data Clustering Implementation
    Columns: Percent_skill_Entrepreneurship, Percent_skill_Operations, Percent_skill_Engineering, Percent_skill_Marketing, Percent_skill_Leadership, Percent_skill_Data Science, Percent_skill_Business Strategy, Percent_skill_Product Management, Percent_skill_Sales, Percent_skill_Domain, Percent_skill_Law, Percent_skill_Consulting, Percent_skill_Finance, Percent_skill_Investment, Renown score
    Sample rows (values in column order):
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
    15.88235294, 11.76470588, 15, 12.94117647, 0, 8.823529412, 21.76470588, 10.88235294, 2.941176471, 0, 0, 0, 0, 0, 8
    9.401709402, 0, 57.47863248, 0, 0, 3.846153846, 17.09401709, 9.401709402, 0, 2.777777778, 0, 0, 0, 0, 9
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 6
    6.25, 0, 3.125, 15.625, 9.375, 3.125, 6.25, 3.125, 3.125, 0, 0, 0, 0, 0, 6
    0, 0, 66.66666667, 5.555555556, 0, 22.22222222, 0, 0, 0, 5.555555556, 0, 0, 0, 0, 0
    0, 0, 100, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2
    8.333333333, 0, 46.73202614, 5.718954248, 8.333333333, 0, 19.77124183, 2.777777778, 2.777777778, 0, 0, 0, 0, 5.555555556, 5
    8.333333333, 0, 27.08333333, 19.79166667, 0, 23.95833333, 0, 0, 0, 20.83333333, 0, 0, 0, 0, 4
    3.846153846, 0, 26.92307692, 0, 3.846153846, 3.846153846, 7.692307692, 0, 3.846153846, 0, 0, 0, 0, 0, 6
    27.27272727, 0, 18.18181818, 0, 9.090909091, 0, 36.36363636, 9.090909091, 0, 0, 0, 0, 0, 0, 0
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
    8.333333333, 0, 50, 16.66666667, 0, 12.5, 4.166666667, 4.166666667, 4.166666667, 0, 0, 0, 0, 0, 0
    13.33333333, 0, 6.666666667, 60, 0, 13.33333333, 6.666666667, 0, 0, 0, 0, 0, 0, 0, 1
    11.11111111, 5.555555556, 5.555555556, 0, 11.11111111, 11.11111111, 27.77777778, 0, 11.11111111, 5.555555556, 0, 0, 5.555555556, 5.555555556, 5
    8.333333333, 0, 58.33333333, 0, 0, 0, 25, 0, 0, 8.333333333, 0, 0, 0, 0, 2
    20, 0, 20, 0, 40, 0, 20, 0, 0, 0, 0, 0, 0, 0, 0
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 8
    5.555555556, 0, 27.77777778, 11.11111111, 0, 16.66666667, 27.77777778, 5.555555556, 0, 5.555555556, 0, 0, 0, 0, 7
    An average of 61 records, most of them in the "Failure" category, had Blank / No Info values. Therefore, case-wise deletion was not a good option; we had to impute the data.
  • 20. Data Imputation
    Missing values can occur in datasets in several forms:
    - Missing values occur in several attributes (columns) [MAR]
    - Missing values occur in a number of instances (rows) [MAR]
    - Missing values occur randomly in attributes and instances [MNAR]
    Class Mean Imputation
    • Respondents (identifiers) are divided into classes.
    • The cell mean of a particular class is used for all missing values in that class.
    • This method can be biased: it overestimates the correlation and underestimates the variability of the data.
    • Modified version: the Stochastic Regression Method, in which a random error term is added to each predicted score.
    Multivariate Imputation by Chained Equations
    • Missing values are predicted using the existing values.
    • The predicted values, the "imputes", are substituted for the missing values, resulting in a full data set, the "imputed data set".
    • Performed multiple times using Bayesian Linear Regression, Predictive Mean Matching, Unconditional Mean Imputation, Logistic Regression (polytomous, >= 2 categories), LDA.
    • Incomplete Data -> Imputed Data -> Analysis Results -> Pooled Results
    Simulation of Neural Network Based Imputation
    • The neural network method uses an auto-associative neural network to approximate missing data.
    • We tried to simulate the basic structure of a neural network, basically its ability to learn certain linear and non-linear inter-relationships in the input space.
    • We also tried to simulate the functionality of an auto-encoder, which projects the input onto a smaller set by squashing it into fewer details.
    Source:
    - Missing Value Imputation using Refined Mean Substitution - http://ijcsi.org/papers/IJCSI-9-4-3-306-313.pdf
    - http://scs.math.yorku.ca/images/6/6d/Enders_jofschoolpsyc.pdf
    - http://www4.ncsu.edu/~pollock/pdfs/Lecture%20ST%20432%20Weighting,%20Imputation%20and%20Variances.pdf
    - http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.4863&rep=rep1&type=pdf (Page No. 6)
    - http://www.csos.jhu.edu/contact/staff/jwayman_pub/wayman_multimp_aera2003.pdf
    - http://www.stefvanbuuren.nl/mi/MI.html
    - http://www.stefvanbuuren.nl/publications/MICE%20V1.0%20Manual%20TNO00038%202000.pdf
    - http://arxiv.org/ftp/arxiv/papers/0704/0704.3474.pdf
    - http://sci2s.ugr.es/keel/pdf/specific/articulo/2011-silva-mlp-imputation-NN.pdf
  • 21. Data Imputation Implementation
    1) Consider the last 15 columns one at a time and impute the values using Class Mean Imputation, treating the combination of "Company Category and Target Value" as the class.
    2) After imputation, use K-Means to compute the clusters and rank them accordingly.
    3) Repeat this for all 15 columns and sum up the ranks for each row. The highest possible score is 150.
    4) Evaluate whether this produces randomness in the data.
    Company_Name | Dependent-Company Status | Categorical Value for Industry-Functions | Percent_skill_Entrepreneurship (original col.) | Percent_skill_Entrepreneurship (imputation inclusive col.) | Percent_skill_Entrepreneurship (ranked / segmented col., 10 bins) | Overall Score (sum of all 15 ranks, out of 150)
    Company1 | Success | 824 | 0 | 0 | 0 | 0
    Company2 | Success | 118 | 15.88235294 | 15.88235294 | 9 | 63
    Company3 | Success | 423 | 9.401709402 | 9.401709402 | 6 | 46
    Company4 | Success | 313 | 0 | 0 | 0 | 6
    Company5 | Success | 415 | 0 | 0 | 0 | 7
    Company6 | Success | 1133 | 6.25 | 6.25 | 4 | 37
    Company7 | Success | 348 | 0 | 0 | 0 | 28
    Company8 | Success | 448 | 0 | 0 | 0 | 12
    Company9 | Success | 314 | 8.333333333 | 8.333333333 | 5 | 51
    Company10 | Success | 343 | 8.333333333 | 8.333333333 | 5 | 42
    Company11 | Success | 413 | 3.846153846 | 3.846153846 | 2 | 30
    Company12 | Success | 828 | 27.27272727 | 27.27272727 | 10 | 42
    Company376 | Success | 848 | No Info | 9.322638145 | 6 | 63
    Company413 | Failed | 888 | No Info | 5.664488017 | 3 | 49
    Imputed value for Company413: the average of "%_Skill_Entr." (Col. 4) over all rows whose 1) Dependent_Comp._Status (Col. 2) = "Failed"; AND 2) Categorical_Val. (Col. 3) = "888"; AND 3) %_Skill_Entr. (Col. 4) is not equal to "No Info".
    Rank: Ceiling( (sum of all the numbers in Column 5 that are <= the Column 5 number) / (total sum) * K )
    Similarly, the column "Internet Activity Score" was first imputed on similar grounds and then segmented.
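The class mean imputation applied to Company413 on slide 21 can be sketched as follows. This is a minimal sketch under stated assumptions: rows are plain dictionaries, the column names in the usage note are hypothetical, and a missing value whose class has no observed values is left untouched.

```python
from collections import defaultdict

# Markers treated as missing, following the "Blank" and "No Info" values
# described in the deck.
MISSING = ("", "No Info", None)


def class_mean_impute(rows, class_keys, column):
    """Fill missing `column` values with the mean of the same class."""
    totals = defaultdict(lambda: [0.0, 0])  # class -> [sum, count]
    for row in rows:
        if row[column] not in MISSING:
            entry = totals[tuple(row[k] for k in class_keys)]
            entry[0] += float(row[column])
            entry[1] += 1
    imputed = []
    for row in rows:
        new_row = dict(row)  # leave the input rows unchanged
        key = tuple(row[k] for k in class_keys)
        if row[column] in MISSING and key in totals:
            total, count = totals[key]
            new_row[column] = total / count
        imputed.append(new_row)
    return imputed
```

A call such as `class_mean_impute(rows, ["status", "category"], "skill_entrepreneurship")` would use the status and category code together as the class, which is the combination the slide describes.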
  • 22. Data Imputation Implementation
    To calculate the missing values in "Age of the Company", first confirm the reference end date:
    1) Take the smallest age and look at the corresponding Established Dates: they varied from Jan '13 to Jul '13, hence the reference date lies somewhere in Jan to Jul 2014.
    2) Take all the values of the columns "Age of the Company" and "Estb. Date" that are not NULL.
    3) Take the latest date in the "Last Funding Date" column and assume it to be the reference date.
    4) Verify that this assumption makes sense.
    Age | Est. Founding Date | Last Funding Date | Date_Assumption | Diff. (Assum. - Est.)
    5 | 6/20/2009 | 5/10/2012 | 4/8/2014 | 4.80274
    4 | 4/1/2010 | 12/11/2013 | 4/8/2014 | 4.021918
    4 | 5/1/2010 | 9/17/2013 | 4/8/2014 | 3.939726
    3 | 1/1/2011 | 9/3/2013 | 4/8/2014 | 3.268493
    4 | 1/1/2010 | 11/8/2012 | 4/8/2014 | 4.268493
    3 | 1/1/2011 | 2/26/2014 | 4/8/2014 | 3.268493
    1 | 5/16/2013 | 10/24/2013 | 4/8/2014 | 0.89589
    Just as a neural network has different hidden layers performing the same mathematical function differently, perform the Class Mean Imputation method for the different logical associations on which "Age of the Company" might depend. Repeat this until 90% of the data is imputed.
    Columns: Company_Name; Dependent-Company Status; Age of company in years; Age of company in years #1; Age of company in years #2; Age of company in years #3
    Company1; Success; No Info; 1; 1; 1
    Company2; Success; 3; 3; 3; 3
    Company153; Failed; No Info; 6; 6
    Company453; Failed; 7
    Associations shown on the slide: # of Advisors; Internet Activity Score Segment; Established Date; Success / Failure Target Value; Success / Failure Target Value; Industry Category; Industry Category
    Progress: 59 blanks -> 25 imputed, 34 blanks -> 25 imputed, 9 blanks -> 1 imputed, 8 blanks. The remaining 8 records were deleted ("Success": 6, "Failure": 2). Therefore, there is still no class imbalance.
    Similarly, the column "Last Funding Amount" was imputed, and a total of 27 records were deleted, of which 20 were "Success" and 7 were "Failure".
    Therefore, total records deleted = 27 + 8 = 35, leaving 437 records: #Success = 279 and #Failure = 158.
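The check of the assumed reference date on slide 22 can be reproduced with the standard library. The sketch below assumes the difference is expressed in years of 365 days, which matches the "Diff." values on the slide; the function name is hypothetical.

```python
from datetime import date

# "Date_Assumption" from slide 22: the latest "Last Funding Date".
REFERENCE_DATE = date(2014, 4, 8)


def age_in_years(established, reference=REFERENCE_DATE):
    """Difference between reference and established date, in 365-day years."""
    return (reference - established).days / 365


# (recorded Age, Est. Founding Date) pairs copied from the slide.
SLIDE_ROWS = [
    (5, date(2009, 6, 20)),
    (4, date(2010, 4, 1)),
    (4, date(2010, 5, 1)),
    (3, date(2011, 1, 1)),
    (4, date(2010, 1, 1)),
    (1, date(2013, 5, 16)),
]

# The assumption "makes sense" if rounding the difference gives the recorded age.
ASSUMPTION_HOLDS = all(round(age_in_years(est)) == age for age, est in SLIDE_ROWS)
```

Rounding each difference to the nearest whole year reproduces the recorded "Age" column for these rows, which is the sense in which the assumed date can be verified.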
  • 23. Dimensionality Reduction
    Dimensionality reduction is the mapping of data to a lower-dimensional space such that uninformative variance in the data is discarded, or such that a subspace in which the data lives is detected.
    How to take a picture to capture the most information about the rectangle?
  • 24. Dimensionality Reduction
    How to take a picture to capture the most information about the rectangle? Positions: A, B, C, D, E
  • 26. Dimensionality Reduction
    WHY THIS POSITION? BECAUSE IT PROVIDES THE MOST VISUAL INFORMATION!!
    First longest axis; second longest axis (found while fixing the first longest axis).
    PCA Understanding
    • Rotate the object around its center to find the best orientation.
    • First find the axis along which the object has the largest extent on average.
    • Rotate the object around the first axis to find the axis that is perpendicular to the first axis and along which the object has the largest extent on average.
    • The two axes found are the first and second principal components.
    • The PCA algorithm helps us find those components.
    • We decompose the data set into eigenvectors and their corresponding eigenvalues. They come in pairs.
    • An eigenvector is the direction of an axis / line (vertical, horizontal, 45 degrees, etc.), and its eigenvalue is a number telling us how spread out the data is along that line.
    • The eigenvector with the highest eigenvalue is therefore the first principal component.
    Source:
    - https://www.youtube.com/watch?v=BfTMmoDFXyE
    - https://georgemdallas.wordpress.com/2013/10/30/principal-component-analysis-4-dummies-eigenvectors-eigenvalues-and-dimension-reduction/
    - http://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues
    - http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf
    - https://www.youtube.com/watch?v=7BUHYpNVT5E&list=LLeNKG4d3dB8SEg1gw91dmig&index=6
    - https://nutsandboltsspeedtraining.com/spicypresentations/rotating-3d-shapes-with-powerpoint-animations/
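The eigenvector / eigenvalue pairing described on slide 26 can be illustrated for two numeric columns, where the 2 x 2 covariance matrix has a closed-form decomposition. This is a teaching sketch in plain Python with hypothetical function names, not the Statgraphics Centurion procedure used in the project.

```python
import math


def covariance_2d(points):
    """Sample variances and covariance (sxx, sxy, syy) of (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)
    return sxx, sxy, syy


def principal_components_2d(points):
    """Return [(eigenvalue, unit eigenvector), ...], largest eigenvalue first."""
    sxx, sxy, syy = covariance_2d(points)
    half_trace = (sxx + syy) / 2
    det = sxx * syy - sxy * sxy
    root = math.sqrt(max(half_trace ** 2 - det, 0.0))
    first, second = half_trace + root, half_trace - root
    # Eigenvector of the larger eigenvalue: the first principal axis.
    if abs(sxy) > 1e-12:
        vx, vy = first - syy, sxy
    elif sxx >= syy:
        vx, vy = 1.0, 0.0
    else:
        vx, vy = 0.0, 1.0
    norm = math.hypot(vx, vy)
    v1 = (vx / norm, vy / norm)
    v2 = (-v1[1], v1[0])  # perpendicular axis: the second component
    return [(first, v1), (second, v2)]
```

For points lying exactly on the line y = x, the first eigenvector points along the 45 degree diagonal and the second eigenvalue is zero, i.e. all of the spread is captured by one component.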
  • 27. Dimensionality Reduction Implementation
    Preparation:
    1) Choose columns which seem to be correlated but whose relation isn't identifiable mathematically.
    2) Look at the number of missing values.
    3) Run the Multiple Imputation method on them.
    Averaging:
    1) After running Multiple Imputation, look at the 10 different sets of imputed values.
    2) Take the average of each cell across the 10 tables.
    3) Plug the resulting data into Statgraphics Centurion and run the PCA analysis.
    Components:
    1) Look at the eigenvalue graph to get an idea of how many components define 85% to 90% of the dataset.
    2) Take the value of Covariance Matrix * Component Weights for each component to get the data values according to their spread in that particular component space.
    ID | Employee Count | Employees count MoM change | Has the team size grown | Team size all employees
    1 | 3 | 0 | -1 | 15
    2 | 17 | | -1 | 20
    3 | 14 | 0 | -1 | 10
    4 | 45 | 10 | -1 | 50
    5 | 39 | 3 | -1 | 40
    6 | 14 | 8 | -1 | 14
    7 | 7 | 0 | -1 | 15
    8 | 29 | -12 | -1 | 40
    9 | 16 | 45 | -1 | 50
    10 | 3 | | -1 | 3
    11 | 34 | 0 | -1 | 50
    Scree Plot: Eigenvalue (axis 0 to 1.8) against Component (axis 0 to 4)
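Two of the steps on slide 27, averaging the imputed tables cell by cell and reading off how many components cover 85% to 90% of the variance, can be sketched as below. The function names are hypothetical, and the eigenvalues in the usage note are made-up numbers, since the scree plot values are not given in the deck.

```python
def average_tables(tables):
    """Cell-wise mean of several equally shaped tables (lists of rows)."""
    n = len(tables)
    n_rows, n_cols = len(tables[0]), len(tables[0][0])
    return [[sum(t[r][c] for t in tables) / n for c in range(n_cols)]
            for r in range(n_rows)]


def components_needed(eigenvalues, target=0.85):
    """Smallest number of components whose eigenvalues reach the target share."""
    total = sum(eigenvalues)
    cumulative = 0.0
    for count, value in enumerate(sorted(eigenvalues, reverse=True), start=1):
        cumulative += value
        if cumulative / total >= target:
            return count
    return len(eigenvalues)
```

For instance, with hypothetical eigenvalues 1.7, 1.2, 0.8 and 0.3, the first three components account for 92.5% of the total, so `components_needed` would return 3 for a target of 85%.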
  • 28. Agenda: Introduction – Business Understanding; Data Understanding; Data Cleaning and Preparation; Feature Selection; Model Building; Testing and Evaluation; Future Enhancement
  • 29. Feature Selection
    Feature selection refers to the selection of the attributes in the data set that are most relevant to the predictive modeling.
    Column groups: 2 Date and 1 Year columns; ~10 columns for Top Management; > 40 columns for Team Members; Target Variable; 4 Investors Portfolio and 1 Funding Received column ($)
    Notes on the slide:
    - The variables associated with it had already been included in the final feature list; adding it would be redundant.
    - A few data sets have been included, whereas others with binary attributes have been omitted.
    - The funding information has been included; however, the seed funder and investor details are not, because of their 319 unique values, which would not lead to any information gain.
    - Most of the columns were in binary form with many missing values, which were difficult to impute.
    REJECTED LIST / ACCEPTED LIST: Identifier; 2 PCA Components; Age of the Company; Internet Activity Detail; Funding Received Information; # Co-Founders and Investors; 11
  • 30. Agenda: Introduction – Business Understanding; Data Understanding; Data Cleaning and Preparation; Feature Selection; Model Building; Testing and Evaluation; Future Enhancement
  • 31. Model Building
    Ignoring, since the Random Forest in Rattle can handle only 32 categorical variables.
    Confusion Matrix
  • 32. Agenda: Introduction – Business Understanding; Data Understanding; Data Cleaning and Preparation; Feature Selection; Model Building; Testing and Evaluation; Future Enhancement
  • 33. Testing And Evaluation
  • 34. Testing And Evaluation
    For Validation Set; For Testing Set; Misclassification Rate
    RANDOM FOREST | True Positive | True Negative | False Positive | False Negative
    Failed | 68.42% | 91.67% | 8.33% | 31.58%
    Success | 91.67% | 68.42% | 31.58% | 8.33%
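The per-class rates on slide 34 follow from the four cells of a confusion matrix. The sketch below shows the arithmetic; the counts in the usage note are hypothetical, chosen only so that the rates match the percentages on the slide, because the deck does not give the underlying counts.

```python
def class_rates(tp, fn, tn, fp):
    """Rates for one class treated as "positive", from confusion matrix counts."""
    positives = tp + fn   # records that truly belong to the class
    negatives = tn + fp   # records that truly belong to the other class
    return {
        "true_positive": tp / positives,
        "false_negative": fn / positives,
        "true_negative": tn / negatives,
        "false_positive": fp / negatives,
        "misclassification": (fp + fn) / (positives + negatives),
    }
```

With hypothetical counts of 13 correctly and 6 wrongly classified "Failed" records and 33 correctly and 3 wrongly classified "Success" records, `class_rates(13, 6, 33, 3)` gives 68.42%, 31.58%, 91.67% and 8.33% for the "Failed" row. Swapping the roles of the two classes gives the "Success" row, which is why the two rows mirror each other.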
  • 35. Agenda: Introduction – Business Understanding; Data Understanding; Data Cleaning and Preparation; Feature Selection; Model Building; Testing and Evaluation; Future Enhancement
  • 36. Future Enhancement
    • Currently, the project depends on multiple tools, e.g. Statgraphics Centurion for multivariate analysis. Statgraphics is a paid tool. This dependence could be removed by building an in-house plug-in or a library function for the requirement.
    • The project relies heavily on ad-hoc analysis, so the chances of omitting steps when a new dataset arrives are high. Each step could be automated once the overall step-wise procedure has been drafted. For automation, VBA or R programming could be a good option.
    • Visualizing the data set could help in making quicker, better-informed decisions.
  • 37. Thank You !!

Editor's Notes

  1. Hello all, my name is Vibhore Agarwal, and I will walk you through my BI project (Fall 2015) under Prof. Gregory McDonald (UT Dallas), to give you a glimpse of the challenges I faced with my dataset (particularly in data imputation) and how I overcame them. My dataset was collected from crowdanalytix.com and contained data about various start-up companies across the globe.
  2. These days, investment firms provide financial support to many start-ups in the form of investment. Before investing, investors thoroughly examine the company and then decide whether or not to invest in it. Mostly, these decisions rest on the investors' intuition, so the probability of a decision turning out to be fruitful varies. Hence, the investors needed me to design a model that analyses the various data available to them about a company and identifies the factors that are really important in predicting the future performance of start-ups.
  3. After tasting initial success, companies pitch to venture capitalists for funding for operations, logistics, resources, etc., in return for profit sharing or a certain percentage of ownership in the firm.
  4. This phase gave me a bird's-eye view of how the overall data looked.
  5. A dashboard for descriptive analysis.
  6. This is the Data Cleaning and Data Preparation phase. This phase refers to data preprocessing, where we format, formulate and standardize the given data so that it is simplified, fits easily into any framework, and can be used for our analysis. I used three techniques in the preprocessing stage: data imputation, i.e. the process of replacing missing data with substituted data; data clustering, a method by which we make clusters of objects that are somehow similar in characteristics; and dimensionality reduction, i.e. the process of reducing the number of random variables under consideration via a set of "uncorrelated" principal variables. There is no set sequence in which they should be used; they can be used in conjunction.
  7. Now, the primary question in front of me was where to take the first stab at the data. It was clear that I first needed to impute the data, but on what grounds? Upon researching, I found that there were 2 types of missing data in our data set. As the terms are used in this project, MAR are the variables whose missing values don't have any correlation with other missing values, and MNAR are the variables whose missing values are correlated with other missing values. For imputing the MNAR values, I would require some clustering method so that I could group them statistically. http://www.biostat.umn.edu/~will/6470stuff/Class20-13/Handout20.pdf
  8. What are buckets or bins? Buckets and bins are the categories that define the common characteristics used to group values. We define the buckets based on the business requirement. For example, here we can define the buckets on two grounds: color and shape. It is then up to us to decide which type of bucket would serve our purpose most efficiently.
  9. The data in the dataset were merged from different sources and were generally entered manually, so spelling errors and different ways of referring to the same thing are likely. We therefore need fuzzy string matching. Fuzzy Lookup technology allows us to quickly identify data records which are textually similar. We can identify fuzzy duplicates within a single table or perform a fuzzy join between two different tables. There are add-ins and library functions that perform fuzzy mapping; I used an Excel add-in to perform fuzzy mapping on my data set.
  10. Fuzzy Lookup technology allows us to quickly identify data records which are textually similar. We can identify fuzzy duplicates within a single table or perform a fuzzy join between two different tables.
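The fuzzy de-duplication idea can be sketched in a few lines of Python. This is only an illustration using the standard library's `difflib`, not the Excel add-in used in the project; the sample values, the `fuzzy_dedupe` helper and the 0.85 threshold are assumptions made for the example.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity ratio in [0, 1] between two strings, ignoring case and outer spaces."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def fuzzy_dedupe(values, threshold=0.85):
    """Map each value to the first earlier value that is textually similar.

    threshold is an assumed cut-off; it would need tuning on real data.
    """
    canonical = []  # representative spellings, in order of first appearance
    mapping = {}    # raw value -> representative spelling
    for value in values:
        match = next((c for c in canonical if similarity(value, c) >= threshold), None)
        if match is None:
            canonical.append(value)
            match = value
        mapping[value] = match
    return canonical, mapping


# Hypothetical manually entered values with typos and spelling variants
raw = ["E-Commerce", "e-commerce ", "Ecommerce", "Analytics", "Analytcs", "Healthcare"]
canonical, mapping = fuzzy_dedupe(raw)
print(canonical)
print(mapping["Analytcs"])
```

A fuzzy join between two tables follows the same pattern: for each key in one table, pick the most similar key in the other and accept it only above the threshold.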
  11. I first broke down the multiple values within a single cell into separate columns containing a single value.
  12. After breaking the values down, I removed the duplicates in each column and the textually similar values across the entire list, finally arriving at 40 unique attribute values. Based on my intuition, I then created 11 buckets to which these 40 unique values could each be assigned. Next, I mapped the raw data to the bucketized attributes to associate each row with the final 11 buckets. I repeated this process for the other two columns: Description of Company and Focused Functional Area.
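As a rough sketch of the split-and-bucket step: the code below splits a tab- or comma-delimited cell into single attributes and looks each one up in a bucket table. The attribute names, the bucket names and the "Other" fallback are made up for illustration; the project's real 40 attributes and 11 buckets are not listed in these notes.

```python
import re

# Hypothetical attribute -> bucket lookup (the real one had 40 attributes and 11 buckets)
BUCKET_OF = {
    "analytics": "Data",
    "big data": "Data",
    "e-commerce": "Retail",
    "marketplace": "Retail",
    "mobile": "Consumer Tech",
}


def split_cell(cell):
    """Break a tab- or comma-delimited cell into clean, lower-cased single values."""
    parts = re.split(r"[,\t]", cell)
    return [p.strip().lower() for p in parts if p.strip()]


def buckets_for(cell):
    """Return the distinct buckets for one raw cell, in order of first appearance."""
    seen = []
    for attribute in split_cell(cell):
        bucket = BUCKET_OF.get(attribute, "Other")  # assumed fallback for unmapped values
        if bucket not in seen:
            seen.append(bucket)
    return seen


print(buckets_for("Analytics, Big Data\tE-Commerce"))
```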
  13. Having obtained all three sets of buckets, one per column, my goal was to combine them into a single column, so an intra-columnar mapping was required. I achieved this with a cross tabulation, or Bertin matrix visualization. When we have an ordinal value or a statistically significant number, we generally plot the frequency of the join; otherwise we concatenate the row and column headings. This efficiently gives us the intra-columnar mapping.
  14. Here I consider only the bucket values for the cross tabulation (intra-columnar mapping), as the attributes have already been used for the inter-columnar mapping. I took the buckets of the first column and cross-tabulated them with the buckets of the second column, which gave 11 * 8 = 88 unique keys. I then mapped those 88 keys to the last column, which had another 8 buckets. This finally gave 88 * 8 = 704 unique keys, which represented the final clusters.
  15. After the mapping, we had a column that formed the basis for further clustering.
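The key arithmetic can be checked with a small sketch that concatenates bucket labels across the three columns. The labels `A1`..`A11`, `B1`..`B8` and `C1`..`C8` are placeholders, and the `cluster_key` helper and `|` separator are assumptions for illustration.

```python
from itertools import product

# Placeholder bucket labels: 11 for the first column, 8 for each of the other two
first_buckets = [f"A{i}" for i in range(1, 12)]
second_buckets = [f"B{i}" for i in range(1, 9)]
third_buckets = [f"C{i}" for i in range(1, 9)]

# Step 1: cross the first two columns (11 * 8 = 88 keys)
pair_keys = ["|".join(pair) for pair in product(first_buckets, second_buckets)]

# Step 2: cross those 88 keys with the third column (88 * 8 = 704 keys)
cluster_keys = ["|".join(pair) for pair in product(pair_keys, third_buckets)]


def cluster_key(first, second, third):
    """Concatenate one row's three bucket labels into its cluster key."""
    return "|".join((first, second, third))


print(len(pair_keys), len(cluster_keys))
print(cluster_key("A3", "B1", "C7") in cluster_keys)
```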
  16. The last 15 columns were numerical percentages ranging from 0 to 100, so I thought of segmenting their sum into 10 buckets. For this I went with the K-means clustering method, where I fixed the number of clusters (ranks) in advance and then clustered the values according to the distribution of the means. Suppose one row has a value of 95% in the first column: it would almost certainly fall in the 10th (highest) decile, although that depends on the distribution. Similarly, a value of 5% would almost certainly fall in the 0th or 1st bucket. With each column ranked from 0 to 10, the ranks of the 15 columns sum to at most 150, and this sum is again deciled and ranked from 0 to 10. I therefore get one ordinal column, ranging from 0 to 10, that represents all 15 columns.
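The rank-then-sum-then-rank idea can be sketched as follows. Note that this sketch uses plain rank-based deciling, not K-means, and the scoring rule (share of values less than or equal to the current one, scaled to 0 to 10) is an assumption chosen to match the 0 to 10 ranking described above. The function names and sample rows are hypothetical.

```python
from bisect import bisect_right


def decile_scores(values):
    """Score each value from 0 to 10 by its position in the column's distribution."""
    n = len(values)
    ordered = sorted(values)
    # bisect_right counts how many values are <= v; integer math avoids float rounding
    return [(10 * bisect_right(ordered, v)) // n for v in values]


def combined_score(rows):
    """Collapse several percentage columns into one ordinal 0-10 column."""
    columns = list(zip(*rows))                               # rows -> columns
    per_column = [decile_scores(col) for col in columns]     # score each column
    totals = [sum(scores) for scores in zip(*per_column)]    # sum scores per row
    return decile_scores(totals)                             # decile the sums again


# Hypothetical data: 4 records with 2 percentage columns (the real data had 15)
rows = [[5, 10], [50, 40], [95, 90], [20, 30]]
print(combined_score(rows))
```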
  17. However, I had another problem: there were many missing values in each column. I initially thought of deleting those records, but that would have led to a class imbalance problem. Hence data imputation was a necessity.
  18. This was by far the most important slide of my presentation. As explained in the earlier slides, there are three types of missing data: MAR, MNAR and MCAR. In my dataset the majority of the missing data were MNAR, so I took three approaches to impute the missing values. Class Mean Imputation: I imputed each missing value with the mean of the class to which the record belonged. MICE (Multivariate Imputation by Chained Equations): I consider this the best of the three methods. It uses the existing values to predict the missing ones, running statistical techniques such as Bayesian linear regression and predictive mean matching. More precisely, it uses these techniques to create several imputed datasets and then takes the mean of all of them to arrive at a single imputed dataset. Simulation of a neural network: this is an approximation technique designed to mimic the model of an ideal neural network. To recap, values can be missing at random, missing not at random, or missing completely at random (the missingness is unrelated to any value in the data, observed or not). Think of a bunch of students A, B, C, D and E. MAR: 1) only A misses class on alternate days, or by whatever logic you can think of (it has a pattern), that is, missing at regular intervals within a column; 2) A, B, C, D and E bunk classes in some pattern (A misses the first class and the rest attend, and week after week a different student in the group misses class). Here the patterns are recognizable with some deliberation. But think of a section where these students bunk classes without any co-ordination between them. There could be days when just A misses class and days when all of them do. Here recognizing the pattern of bunking becomes more of a challenge. However, there are ways in BI to impute the values for these cases.
  19. Here we first impute the missing values in each column and row based on the class to which that particular row belongs. We have full liberty to decide what the class should be. For example, I based the class hierarchy on the categorical values of the target variable and the industry. Then I computed the K-means. I used a macro to perform the task; however, there are add-ins in Excel and packages in R and Python that would do the same job.
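A minimal sketch of class mean imputation, assuming the class is the pair (target value, industry) as described above. The field names, the sample records and the fallback to the overall mean when a class has no observed values are assumptions for the example; this is not the macro used in the project.

```python
from collections import defaultdict
from statistics import mean


def class_mean_impute(rows, class_keys, field):
    """Fill missing (None) values of `field` with the mean of the row's class."""
    groups = defaultdict(list)
    for row in rows:
        if row[field] is not None:
            groups[tuple(row[k] for k in class_keys)].append(row[field])

    # Assumed fallback: overall mean when a class has no observed value at all
    overall = mean(v for values in groups.values() for v in values)

    filled = []
    for row in rows:
        new_row = dict(row)  # leave the input untouched
        if new_row[field] is None:
            observed = groups.get(tuple(row[k] for k in class_keys))
            new_row[field] = mean(observed) if observed else overall
        filled.append(new_row)
    return filled


# Hypothetical records: class = (status, industry)
records = [
    {"status": "Success", "industry": "Analytics", "team_size": 4},
    {"status": "Success", "industry": "Analytics", "team_size": 6},
    {"status": "Success", "industry": "Analytics", "team_size": None},
    {"status": "Failed", "industry": "Retail", "team_size": 2},
    {"status": "Failed", "industry": "Gaming", "team_size": None},
]
imputed = class_mean_impute(records, ["status", "industry"], "team_size")
print([r["team_size"] for r in imputed])
```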
  20. This is the classic trick I performed, and I still do not know whether it is really feasible; it did, however, give me results. This is a neural network simulation in Excel. If we look at the basic neural model, the sigma function keeps recomputing the values until the error term is minimized, but it is not certain which mathematical function it settles on. We implemented something similar here. I needed to impute the age of the company, and we had the “established date” and the last funding date. I needed to assess what the end date of data collection could have been. I took the records for which all the values were present. I looked at the latest values of the last funding date and the established date, which ranged from Jan '13 to Jul '13, and the age of those companies was approximately 1 year. Hence it was clear that the data was collected sometime in 2014, in a month between January and July. I tried a few dates and searched for the one that best reproduced the known values. It clicked for 8th April 2014, for which nearly all the calculations came out correct. I then started imputing the values with the class mean methodology, but with each iteration the classes under consideration changed. This happened because there were missing records in both the established date and the last funding date.
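The search for the data collection date can be sketched as a small trial-and-error loop: try candidate cut-off dates and keep the one that best reproduces the ages already present in the data. The sample records, the candidate dates and the use of a 365.25-day year are assumptions for illustration; only the 8th April 2014 result comes from the notes above.

```python
from datetime import date


def age_in_years(established, reference):
    """Company age in years at the reference date, rounded to 2 decimals."""
    return round((reference - established).days / 365.25, 2)


def best_reference(known, candidates):
    """Pick the candidate date whose computed ages are closest to the recorded ages."""
    def total_error(reference):
        return sum(abs(age_in_years(est, reference) - age) for est, age in known)
    return min(candidates, key=total_error)


# Hypothetical complete records: (established date, recorded age in years)
known = [(date(2013, 4, 8), 1.0), (date(2012, 4, 8), 2.0)]
candidates = [date(2014, 1, 1), date(2014, 4, 8), date(2014, 7, 1)]

reference = best_reference(known, candidates)
print(reference)
print(age_in_years(date(2013, 1, 15), reference))  # age for a record that lacked one
```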
  21. Dimensionality reduction is about converting data of very high dimensionality into data of much lower dimensionality such that each of the lower dimensions conveys much more information.
  22. Here we first capture the width of the box, which is its longest axis and contains the highest variation, along the horizontal axis. Then we take the axis perpendicular to it, which is uncorrelated with the first and captures the variation in the vertical direction. In this way we try to capture the maximum information. http://setosa.io/ev/principal-component-analysis/
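For two variables, the longest axis and its perpendicular can be computed directly from the covariance matrix. The sketch below uses the closed-form solution for a 2x2 symmetric matrix, so it needs only the standard library; the sample points and the `pca_2d` name are made up for the example, and real work with many columns would use a numerical library instead.

```python
import math


def pca_2d(points):
    """Return (first axis, second axis, variances along each) for 2-D points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n

    # Sample covariance matrix [[sxx, sxy], [sxy, syy]]
    sxx = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)

    # Angle of the direction of greatest variation (the "longest axis")
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    first = (math.cos(theta), math.sin(theta))
    second = (-math.sin(theta), math.cos(theta))  # perpendicular to the first

    # Eigenvalues of the covariance matrix = variance captured by each axis
    half_trace = (sxx + syy) / 2
    spread = math.hypot((sxx - syy) / 2, sxy)
    return first, second, (half_trace + spread, half_trace - spread)


# Hypothetical, strongly correlated 2-D data
points = [(2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.1), (6.0, 6.0)]
first, second, variances = pca_2d(points)
print(first, second)
print(variances[0] / sum(variances))  # share of variation kept by the first axis
```

Keeping only the first axis reduces two columns to one while retaining most of the variation, which is the sense in which each remaining dimension conveys more information.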