Stevens Institute of Technology
School of Business
Business Intelligence & Analytics Program
A Snapshot of Data Science
Student Poster Presentations
Corporate Networking Event – November 28, 2017
This document reproduces the posters presented by students of the Business Intelligence
& Analytics (BI&A) program at a Corporate Networking event held at Stevens Institute on
November 28, 2017. The event was attended by over 80 company representatives and
approximately 150 students and faculty members.
The posters were presented by students at all stages in their academic programs from
their first semester through their final semester. The research described in each poster
was conducted under the guidance of a faculty member. The broad range of research
topics and methodologies exhibited by the posters in this document reflects the diversity
of faculty research interests and the practical nature of our program.
For background, the first poster describes the BI&A program. Founded in spring 2012
with just 4 students, the program now has over 220 full-time and part-time master of
science students and 80 graduate certificate students, and is ranked 7th in the nation by The
Financial Engineer. As illustrated in the first poster, a distinctive feature of the program is
its three-layer structure. In the professional skills layer, business and communication
skills are developed through workshops, talks by industry leaders and an active student
club. In the second layer, the 12-course curriculum covers the concepts and tools
associated with database management, data warehousing, data and text mining, web
mining, social network analytics, optimization and risk analytics. The curriculum
culminates in a capstone course in which students work on a research project – often in
conjunction with industry associates. Finally, in the technical skills layer, students attend
a series of free weekend boot camps that provide training in industry-standard
software packages, such as SQL, R, SAS, Python and Hadoop.
The 76 student posters in this document represent a broad array of research projects. We
are proud of the quality and innovativeness of our students’ research and of their hard
work and enthusiasm without which this event would have been impossible.
Chris Asakiewicz, Ted Stohr and Alkis Vazacopoulos
Business Intelligence & Analytics Program
Stevens Institute of Technology
www.stevens.edu/business/bia
Foreword
INDEX TO POSTERS
* Indicates the poster was accompanied by a live demo
No. Title Student Authors
0 BI&A Curriculum The faculty
1*
Google Online Marketing Challenge 2017 –
True Mentors AdWords Campaign
Philippe Donaus, Rush Kirubi, Salvi Srivastava,
Thushara Elizabeth Tom, Archana Vasanthan
2
Multivariate Testing to Improve a Non-Profit’s
Home Page Rush Kirubi, Thushara Elizabeth Tom
3 Analyzing the Impact of Earthquakes Fulin Jiang, Ruhui Luan, Zhen Ma, Gordon Oxley
4* Real Time Health Monitoring System
Ankit Kumar, Khushali Dave, Shruti Agarwal,
Nirmala Seshadri
5 Zillow Home Value Prediction
Yilin Wu, Zhihao Chen, Jiaxing Li, Zhenzhen Liu,
Ziyun Song.
6
Analysis of Opioid Prescriptions and Drug-
related Overdoses
Nishant Bhushan, Sunoj Karunanithi, Pranjal
Gandhi, Raunaq Thind
7 UK Traffic Flow & Accidents Analysis
Xiaoming Guo, Weiyi Chen, Dingmeng Duan, Jian
Xu, Jiahui Tang
8 US Permanent Visa Application Process
Jing Li, Qidi Ying, Runtian Song, Jianjie Gao,Chang
Lu
9 Climate Change Since 1770
Yilin Wu, Ziyun Song, Zhihao Chen, Jiaxing Li,
Zhenzhen Liu
10
Predicting Customer Conversion Rate for an
Insurance Company
Yalan Wang, Cong Shen, Junyuan Zheng, Yang
Yang
11
Determine Attractive Laptop Features Among
College Students
Liwei Cao, Gordon Oxley, Salman Sigari, Haoyue
Yu
12
Clustering Large Cap Stocks During Different
Phases Of Economic Cycle Nikhil Lohiya, Raj Mehta
13
Predicting interest level in apartment listings
on Renthop Cristina Eng, Ying Liu, Haoyue Yu, Salman Sigari,
14
Deep Learning Vs Traditional Machine
Learning Performance on NLP Problem Abhinav S Panwar
15 Wine Recognition Biying Feng, Ting Lei, Jin Xing
16
Projects Assignment Optimization Based on
Human Resource Analysis Jiarui Li, Siyan Zhang, Siyuan Dang, Xin Lin, Yuejie Li
17*
Short Term Load Forecasting Using Artificial
Neural Networks Bhargav Kulkarni, Ephraim Schoenbrun
18 Optimizing Travel Routes Ruoqi Wang, Yicong Ma, Jinjin Wang, Shuo Jin
19 Exploring and Predicting Airbnb Price in NYC Ruoqi Wang, Yicong Ma, JIahui Bi, Xin Chen
20*
The Public’s Opinion of Obamacare: Twitter
Data Analyses Saeed Vasebi
21 S&A Processed Seafood Company
Redho Putra, Neha Bakhre, Harsh Bhotika, Ankit
Dargad, Dinesh Muchandi
22 Portfolio Optimization of Cryptocurrencies
Yue Jiang, Ruhui Luan, Jian Xu, Shaoxiu Zhang,
Tianwei Zhang
23
Fantasy Premier League Soccer Team
Optimization Haoran Du, Xiang Yang, Ruiwen Shi
24
Predicting Movie Rating and Box Office
Gross by PCA and LR Model Yunfeng Liu, Erdong Xia, Yash Naik
25 Student Performance Prediction Abineshkumar, Sai Tallapally, Vikit Shah
26 Classifying Iris Flowers Xi Chen, Shan Gao, Lan Zhang
27
Using Data Analytics to Retain Human
Resources Aditya Pendyala, Rhutvij Savant
28
Using Financial Models to Construct Efficient
Investment Portfolios Cheng Yu, Tsen-Hung Wu, Ping-Lun Yeh
29 Pima Indians Diabetes Analysis Junjun Zhu, Jiale Qin, Yi Zhang
30
Google Online Marketing Challenge: Aether
Game Café
Jaya Jayakumar, Sunoj Karunanithi, Saketh
Patibandla, Ephraim Schoenbrun
31
Experiment for Apartment Rental
Advertisements
Shavari Joshi, Nirmala Seshadri, Nishant
Bhushan
32
A Tool For Discovering High Quality Yelp
Reviews
Zijing Huang, Po-Hsun Chen, Hao-Wei Chen,
Chao Shu
33*
Worlds Best Fitness Assistant Cognitive
Computing
Anand Rai, Jaya Prasad Jayakumar, Saketh
Patibandla
34 Zillow’s Home Value Prediction Wenzhuo Lei, Chang Xu, Juncheng Lu
35 Web Traffic Time Series Forecasting Jujun Huang, Peimin Liu, Luyao Lin
36*
Portfolio Optimization with Machine
Learning
Chao Cui, Huili Si, Luotian Yin, Qidi Ying, Yinchen
Jiang
37
Developing a Supply Chain Methodology for
an Innovative Product Akshay Sanjay Mulay
38 Bike Sharing Optimization
Jiahui Bai, Yuankun Nai, Yuyan Wang, Yanru Zhou
39 Crime in the U.S Minyan Shao, Yuyan Wang, Yuankun Lin
39A Porto Seguro Safe Driver Prediction
Xiaoming Guo, Weiyi Chen, Dingmeng Duan,
Jian Xu, Jiahui Tang
40
Identifying Mushroom: Safe to eat or deadly
poison
Xiang Yang, Haoran Du, Shikuang Gao, Ruiwen
Shi
41* Student Grade Prediction Gaurav Sawant, Vipul Gajbhiye, Vikram Singh
42 NBA 2018 Play-offs & Champion Prediction Amit Kumar, Jayesh Mehta , Xiaohai Su
43
AI integrated interactive search interface for
Biomedical literature search
Akshay Kumar Vikram, Vishnu Pillai, Satya
Peravali
44*
Text Mining of intellectual contribution
information from a corpus of CVs
Nishant Bhushan, Neha Mansinghka , Nirmala
Seshadri, Arpit Sharma
45 Analysis on Chase Bank Deposits Lulu Zhu, Xinlian Huang, Junxin Xia, Rauhl Nair
46 Predicting Songs Hit on Billboard Chart Siyan Zhang, Yifeng Liu, Yuejie Li
47
Predicting Results of Premier League
Contest Hantao Ren, Lanyu Yu, Siyuan Dang, Jiarui Li
48
NLP Meets Yelp Recommendation System
for Restaurant Rui Song
49
WSDM — KKBox’s Churn Prediction
Challenge Caitlyn Garger, Yina Dong, Shuo Jin
50 Data Mining of Video Game Sale Xin Lin, Fanshu Li, Jingmiao Shen
51 DengAI: Predicting Disease Spread Vicky Rana, Pradeepkumar Prabakaran
52 Predict Tesla Model 3 Production Volume
Wangming Situ, Liwei Cao, Bohong Chen, Tianyu
Hao
53 Vehicle Routing Problem using NYC TLC Data
Adrash Alok, Garvita Malhan, Ephraim
Schoenbrun, Abhir Yadava
54 Credit Rating for a Lending Club Rui Song, Huili Si, Xiao Wan, Lulu Hu
55 Routify: Personalized Trip Planning
Minzhe Huang, Bowan Lu, Jingmiao Shen,
Xiaohai Su, Abhitej Kodali
56 Uncover World Happiness Patterns Rui Song, Xiao Wan, Xiaoyu Zhang
57* Data Centers – Where to Locate?
Smriti Vimal, Sanjay Pattanayak, Kumar
Bipulesh, Nitin Gullah, Souravi Sudamme
58 Drone Optimization in Delivery Industry Ni Man, Xinlian Huang, Xuanyan Li
59
Performance Evaluation of Machine
Learning Algorithms on Big Data using Spark
Neha Mansinghka, Madhuri Koti, Prathamesh
Parchure
60*
Duck Wisdom: A Personal Portfolio
Optimization Tool
Taranpreet Singh, Shivakumar Barathi, Ramona
Lasrado, Nikhil Lohiya
61 Porto Seguro’s Safe Driver Prediction
Boren Lu, Lanshi Li, Xiaoming Guo, Dingmeng
Duan
62
Hospital Recommendation System for
Patients Abdullah Khanfor, Danilo Brandao and Pedro Sa
63
Customer Segmentation for B2B Sale of
Fitness Gear Juhi Gurbani, Arpit Sharma, Neha Mansinghka
64
Predicting Vehicle Collisions & Dynamic
Assignment of Ambulances in NYC
Divya Rathore, Dhaval Sawlani, Nitasha Sharma,
Shruti Tripathi
65 Iceberg Classifier Challenge Chang Lu, Jing Li, Luotian Yin, Runtian Song,
66 Predicting Movie Success
Jialiang Liu, Huaqing Xie, Xiaohai Su, Liang Ma,
Lanjun Hui
67 Stock Prediction Based on News Titles
Jianuo Xu, Minghao Guo, Simin Liang, Yudong
Cao, Yunzhe Xu
68
How Consumer Reviews Affect a Star’s
Ratings
Jinjin Li, Prabhjot Singh, Yutian Zhou, Xuetong
Wei, Xiaoyu Zhang
69
Student Alcohol Consumption: Predicting
Final Grades Ping-Lun Yeh, Zhuohui Jiang, Gaurang Pati
70 Mobile Banking Fraud Detection
Junyuan Zheng, Ke Cao, Miaochao Wang, Tuo
Han
71 Subway Delay Dilemma
Smit Mehta, Nishita Gupta, Matthew Miller,
Jianfeng Shi
72
Integrated Digital Marketing Studies on
Hoboken Local Restaurant
Shuting Zhang, Yalan Wang, Haoyue Yu, Liyu Ma,
Christina Eng
73* AI Academic Advisor Vaibhav Desai, Piyush Bhattad
74 JFK Airport – Flight Delay Analysis
Praveen Thinagarajan, Arun Krishnamurthy,
Thushara Elizabeth Tom, Sunoj Karunanithi
75
Machine Learning on Highly Imbalanced
Manufacturing Data Set Liyu Ma
76* Duck Finder Salman Sigari, Shankar Raju and Team
Master of Science
Business Intelligence & Analytics
Business Intelligence & Analytics
http://www.stevens.edu/bia
CURRICULUM
Organizational Background
• Financial Decision Making
Data Management
• Strategic Data Management
• Data Warehousing & Business Intelligence
Data and Information Quality *
Optimization and Risk Analysis
• Optimization & Process Analytics
Risk Management Methods & Apps.*
Data Mining
• Knowledge Discovery in Databases
Statistical Learning & Analytics*
Statistics
• Multivariate Data Analytics
• Experimental Design
Social Network Analytics
• Network Analytics
• Web Mining
Management Applications
• Marketing Analytics*
• Supply Chain Analytics*
Big Data Technologies
• Data Stream Analytics*
• Big Data Seminar*
• Cognitive Computing*
Practicum
Projects with industry
* Electives - Choose 2 out of 8
Social Skills
Disciplinary Knowledge
Technical Skills
• Written & Oral Skills Workshops
• Team Skills
• Job Skills Workshops
• Industry speakers
• Industry-mentored projects
• SQL, SAS, R, Python, Hadoop
• Software “Boot” Camps
• Course Projects
• Industry Projects
Curriculum Practicum
MOOCs
Infrastructure
Laboratory Facilities
• Hadoop, SAS, DB2, Cloudera
• Trading Platforms: Bloomberg
• Data Sets: Thomson-Reuters, Custom
PROGRAM ARCHITECTURE
Demographics
2013F 2014F 2015F 2016F 2017F
Applications 101 157 351 591 725
Accepted 48 84 124 287 364
Rejected 34 34 186 257 307
In system/other 19 39 41 46 53
Admissions
Full-time/Part-time
Full-time 201
Part-time 21
Gender
Female 41%
Male 59%
Placement
Starting Salaries (without signing bonus):
$65 - 140K Range
$84K Average
$90K (finance and consulting)
Data Scientists: 23%, Data Analysts: 30%, Business Analysts: 47%
Our students have accepted jobs at, for example:
Apple, Bank of America, BlackRock, Cablevision, Dun &
Bradstreet, Ernst & Young, Genesis Research, Jefferies,
Leapset, Morgan Stanley, New York Times, Nomura,
PricewaterhouseCoopers, RunAds, TIAA-CREF, Verizon Wireless
Hanlon Lab -- Hadoop for Professionals
The Master of Science in Business Intelligence and Analytics (BI&A) is a
36-credit STEM program designed for individuals who are interested in
applying analytical techniques to derive insights and predictive intelligence
from vast quantities of data.
The first of its kind in the tri-state area, the program has grown rapidly. We
now have approximately 222 master of science students and another 79
students taking 4-course graduate certificates. The program has increased
rapidly in quality as well as size. The average test scores of our student
body are in the top 75th percentile. We are ranked #7 among business analytics
programs in the U.S. by The Financial Engineer.
STATISTICS
PROGRAM PHILOSOPHY/OBJECTIVES
• Develop a nurturing culture
• Race with the MOOCs
• Develop innovative pedagogy
• Migrate learning upstream in the learning value chain
• Continuously improve the curriculum
• Use analytics competitions
• Improve placement
• Partner with industry
Google Online Marketing Challenge 2017
True Mentors AdWords Campaign
Team: Philippe Donaus, Rush Kirubi, Salvi Srivastava, Thushara Elizabeth Tom, Archana Vasanthan
Instructor: Theano Lianidou
Business Intelligence & Analytics
November 28, 2017
1
Motivation
• The Google Online Marketing Challenge is a unique opportunity for students to build online marketing campaigns on Google
AdWords for a business or a non-profit. Google provides a $250 budget to run these campaigns live for 3 weeks.
• We worked with TRUE Mentors, a non-profit based in Hoboken, NJ, and built a marketing strategy on Google AdWords to
achieve goals such as creating brand awareness and promoting fundraising events, volunteer opportunities and donations.
• Technologies used: Google AdWords, Google Analytics, Google Search Console, Facebook Insights
Design of Campaigns
• Conducted market analysis: competitors, current
market position, platforms used, USP
• Analyzed existing data available in Google
Analytics, Google Search Console and Facebook
Insights, and established marketing goals
• Designed campaigns on Search and Display
Ads
Performance
Results
• 23 ad groups with 206 ads and 700 keywords were used in total
• Text ads appear against people's search terms across Google
Search and Search partner sites
• Display ads appear on relevant pages across the Display
Network
• The team finished as a “Finalist” in the Social Impact Award category.
• The team finished as a “Semi-Finalist” in the Business Award category.
• Ranked among the Top-10 teams in the Social Impact Award category.
• Ranked among the Top-15 teams in the Business Award Category.
• Ranked among the Top-5 teams in the Americas region.
• The results can be found at: https://www.google.com/onlinechallenge/past/winners-2017.html
• Team ID: 234-571-4266
Targeting and Bidding
Target Goals (set before running the campaigns)
End Results
• Campaigns ran from 24th April 2017 to 14th May 2017
• KPIs were monitored and optimized continuously over the 3
weeks using insights drawn from various AdWords reports,
search term reports, Google Analytics and keyword reviews.
CAMPAIGN LEVEL
Targeting/Bidding: TM_Brand | TM_Events | TM_Donations | TM_Volunteers | TM_DisplayCampaign
Location: Hudson County-NJ and New York County-NY | Hudson County-NJ | Hudson County-NJ | Hudson County-NJ | Hudson County-NJ and New York County-NY
Bidding Strategy: Manual CPC for all five campaigns
Daily Budget: Yes for all five campaigns
ADGROUP LEVEL
Max CPC: set for all ad groups in every campaign
Demographic targeting: none, except TM_Volunteers (male and female were targeted separately)
KEYWORD LEVEL
Max CPC: Yes | Yes | Yes | Yes | No
Topics: none, except TM_DisplayCampaign (Charity & Philanthropy, and Fast Food)
Improving a Non-Profit’s Home Page
Team: Rush Kirubi, Thushara Elizabeth Tom
Instructor: Chihoon Lee Business Intelligence & Analytics
November 23, 2017
2
Experiment Design
Methodology:
Full factorial design with blocking.
Factors & Levels:
Responses: Drop Off rate
Conclusion
• The best setting, relative to the other settings, is a purple Donate
button and no slider.
• However, this effect is not significant at the 5% level.
• Since none of the factors are significant, we opted to select
the settings that minimize page loading time: no
slider, testimonial with text, purple Donate button.
Data
Time (the blocking factor) was confounded with the 3-factor interaction
ABC; we therefore assumed that this interaction is negligible.
Result
• It was found that no factor significantly stood out.
• The results of the experiment are shown below:
● Effect Test
● Normal Plot
Motivation
• Goal: Optimize True Mentor’s homepage to reduce the
drop-off rate.
• In turn, improving the quality score of the AdWords bids,
leading to more ad exposures at the same or lesser
expense.
Limitations
• We did not have enough time for replication due to the
competition deadlines.
• The blocking variable was difficult to accommodate: we
manually recorded the values at set times of the day
and night.
Participating in Google's Online Marketing Challenge, we selected a nonprofit to run a digital marketing campaign. Part of our
effort involved optimizing the organization's home page to boost user engagement as measured by drop-off rates. We set up
a full-factorial experiment (2^3) with time of day as the blocking variable. Put simply, we tested the donate button color, the
presence of a slider and the type of testimonial (one that was predominantly text versus a photo with a caption). All three
factors were blocked on time of day (daytime or nighttime). Empirically, the best setting was no slider with a purple donate
button; however, the effects were not strong enough to pass a statistical inference test. We kept the no-slider option anyway,
as it reduces page load time and its absence does not hurt user drop-off rates.
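A minimal sketch of how such a 2^3 factorial with the block confounded with the ABC interaction could be coded and fit in Python with statsmodels. The drop-off values below are purely illustrative, not the experiment's data; with a single unreplicated run the model is saturated, which is why the poster relies on a normal plot of effects rather than p-values.

```python
# Hypothetical 2^3 design with time-of-day block aliased to A*B*C
import pandas as pd
import statsmodels.formula.api as smf

runs = pd.DataFrame({
    "slider":      [-1, 1, -1, 1, -1, 1, -1, 1],   # coded -1 / +1 levels
    "testimonial": [-1, -1, 1, 1, -1, -1, 1, 1],
    "button":      [-1, -1, -1, -1, 1, 1, 1, 1],
})
runs["block"] = runs["slider"] * runs["testimonial"] * runs["button"]   # day vs night
runs["drop_off"] = [0.62, 0.55, 0.60, 0.57, 0.58, 0.49, 0.59, 0.52]     # illustrative responses

# Main effects, two-factor interactions, and the block; ABC is sacrificed to the block
model = smf.ols(
    "drop_off ~ slider + testimonial + button"
    " + slider:testimonial + slider:button + testimonial:button + block",
    data=runs,
).fit()
print(model.params)   # effect estimates to place on a normal plot
```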
Analyzing the Impact of Earthquakes
Team: Fulin Jiang, Ruhui Luan, Zhen Ma, Gordon Oxley
Instructor: Prof. Alkis Vazacopoulos Business Intelligence & Analytics
Motivation
• Earthquakes are among the most destructive natural forces in the world, able to ravage entire cities with little notice
• We wanted to analyze and visualize patterns in how earthquakes have historically struck and damaged specific locations, in order
to highlight high-risk and under-prepared areas of the world
• Earthquake features used include magnitude, source, focal depth, date, and type (nuclear or tectonic activity)
• We also measured each earthquake's damaging effects using damage in US dollars, deaths, the number of houses damaged or destroyed, and
injuries
Financial Cost of Earthquakes
Casualties due to Earthquakes
Technology Utilized
•Tableau was used for our analysis of earthquakes
•We found Tableau especially useful when visualizing
the latitude and longitude data, clearly identifying trends
in the way earthquakes affect certain parts of the world
•With the creation of 8 dashboards, we were able to
analyze and visualize many different features of
earthquakes, including depth, source, and more
Earthquake Map of the World
• Based on the analysis of the number of
deaths due to earthquakes, it is clear that a
majority of high-casualty events happen in
coastal regions, many of them on the Indian
subcontinent
• We see a peak in deaths in 2010 due to
the tragic number of casualties in
Haiti, underlining the fact that certain
underdeveloped regions suffer
increased casualties
• We observe the same trend in high-cost
areas, although the largest-cost event
occurred in Japan, where the tsunami
caused massive damage in 2011
Source: National Earthquake Information Center
3
Real Time Health Monitoring System
Team: Ankit Kumar, Khushali Dave, Shruti Agarwal, Nirmala Seshadri
Instructor: Prof. David Belanger
Architectural Approach
The architecture flow consists of the following
steps:
1. Data is generated and stored in a
file
2. Data is streamed using Apache
Kafka
3. Real-time data visualization is
set up using a visualization tool
Tools used
1. JSON file parsing for initial data analysis
2. Apache Kafka for real time streaming
3. Arduino programming for pulling
temperature data in real time
4. Python for data cleaning
5. Tableau for visualization
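A hedged sketch of step 2, streaming readings through Apache Kafka with the kafka-python client. The topic name, broker address and reading fields here are assumptions, not the team's actual configuration.

```python
# Publish one health reading to a Kafka topic (assumed local broker and topic name)
import json
import time
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def publish_reading(heart_rate, steps, body_temp):
    """Send one reading to the hypothetical 'health-metrics' topic."""
    reading = {
        "ts": time.time(),
        "heart_rate": heart_rate,
        "steps": steps,
        "body_temp": body_temp,
    }
    producer.send("health-metrics", reading)

publish_reading(heart_rate=72, steps=10, body_temp=98.6)
producer.flush()  # make sure the message actually leaves the client
```

A downstream consumer (or a tool fed by one) can then read the same topic and drive the real-time visualization and trigger checks.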
Variables
1. Heart rate - Through Smart watch
sensor
2. Steps Count – Through Smart watch
3. Body Temperature – Through Arduino
Trigger Cases
1. Fever – High temp, low heart rate and
steps
2. Long term unconsciousness – low heart
rate , body temperature and steps count
3. Heart attack - high heart rate in a very
short time interval
Problem Statement
To create a user-specific, real-time health monitoring system using sensors from a smart
watch and/or an Arduino device. The application should be able to monitor health features such as
heart rate, step count and body temperature in real time, and should be able to warn the user or
emergency services of any undesired or serious condition.
Results
The last part of the project is to visualize
results in real time, plot the streams of data,
and show a trigger alert if any abnormal use
case occurs.
Business Intelligence & Analytics
4
Business Impact
Our health data analysis application can pull
data in real time from device sensors. Using
this system, authorities, friends and relatives
can easily monitor the health of a loved one and
respond immediately in case of an
emergency. From a business perspective, this
can attract customers suffering from a medical
condition.
Zillow’s Home Value Prediction
Team: Yilin Wu, Zhihao Chen, Jiaxing Li, Zhenzhen Liu, Ziyun Song
Instructor: Prof. Alkiviadis Vazacopoulos Business Intelligence & Analytics
5
Motivation
•Zillow Prize challenges the data science community to
help push the accuracy of the Zestimate even
further (improving the median margin of error).
•Our task in this competition is to develop an
algorithm that predicts the log error for the
months in Fall 2017.
Technology
•Python, Watson and Tableau for exploratory data analysis (EDA).
•Python for data preprocessing and feature engineering.
•Python to build the model.
Competition Process
Feature engineering
Feature engineering is the most important part of this competition. It is
crucial to measure feature importance so that we keep valuable features and
drop useless ones. We also created new features that might help the machine
learning algorithms work better.
EDA → Data preprocessing → Feature engineering → Use 2016 data to train the model →
Test the predicted logerror on 2016 data → Improve the feature engineering →
Figure out the best model and adjust the parameters → Use both 2016 and 2017 data to
train the model → Several improvements → Final submission
Modeling
We found that CatBoost, a gradient boosting machine, handled this problem well
compared with XGBoost and LightGBM. CatBoost builds oblivious decision trees and
includes an ordered boosting scheme that reduces the bias of the residuals to prevent
overfitting; it also uses a different scheme for calculating the values in leaves and
supports several options for converting categorical features based on counting statistics.
In general, CatBoost is presented as an algorithm that can work with categorical
features without preprocessing, is resistant to overfitting, and can be used without
spending much time and effort on hyperparameter selection, while often being more
accurate. However, training is still very slow: on average, 7-8 times longer than
LightGBM and 2-3 times longer than XGBoost.
Before training the model, it is necessary to define the categorical features. In this
case, we have 26 categorical features, which are passed to CatBoost through its data
pool ("catpool").
We then adjusted the parameters to get a better (though not the best) result.
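A rough sketch of that CatBoost setup. The file name, column selection and parameter values below are placeholders under the assumption of a merged Zillow-style training file, not the team's actual configuration.

```python
# Hypothetical CatBoost training on a Zillow-style dataset with categorical features
import pandas as pd
from catboost import CatBoostRegressor, Pool

train = pd.read_csv("train_2016.csv")                 # assumed merged training file
cat_cols = [c for c in train.columns if train[c].dtype == "object"]   # the categorical features

X = train.drop(columns=["logerror"])
y = train["logerror"]
train_pool = Pool(X, y, cat_features=cat_cols)        # the "catpool" mentioned above

model = CatBoostRegressor(
    iterations=1000,
    learning_rate=0.03,
    depth=6,
    loss_function="MAE",     # Zillow scores on mean absolute error of the log error
    verbose=200,
)
model.fit(train_pool)
```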
Conclusion
The final submission ranked in the top 11%. It is a pity that we came so close to the
bronze medal cutoff, which is the top 10%.
In the future, we plan to experiment further with the categorical features and keep
tuning the model.
Analysis of Opioid Prescriptions and Deaths
Team: Pranjal Gandhi, Nishant Bhushan, Sunoj Karunanithi, Raunaq Thind
Instructor: Alkis Vazacopoulos Business Intelligence & Analytics
6
The objective is to find the correlation between the prescription of
drugs containing opioids and drug related deaths in the USA.
What are opioids?
Opioids are a class of drugs that include the illicit drug heroin as well as
the licit prescription pain relievers oxycodone, hydrocodone, codeine,
morphine, fentanyl and others.
• Tableau for Visualizations.
• R & Excel for cleaning the data and exploratory analysis.
• The dataset is a subset of data sourced from cms.gov and contains
prescription summaries of 250 common opioid and non-opioid drugs
written by the medical professionals in 2014.
• The number of deaths due to drug overdoses exceeds the number of deaths from car accidents by a staggering 11,102, according to a DEA report.
• In 2014, there were 4.3 million people aged 12 years or older using opioid-based painkillers without prescriptions.
• This led to substance abuse among almost 50% of those consumers.
• 94% of respondents in a 2014 survey of people in treatment for opioid addiction said they chose to use heroin because prescription opioids were
"far more expensive and harder to obtain."
Analysis & Visualizations
Results and Conclusion
Facts & Figures
Objective & Motivation Tools Used
• We found that opioid prescriptions were especially high for prescribers
in the following specialties:
• Female Nurse Practitioners
• Female Physician Assistants
• Female and Male Family Practice
• Female and Male Internal Medicine
• Male Dentists
The top 5 states with the highest percentage of deaths due to overdoses are
California, Ohio, Philadelphia, Florida and Texas.
All of them had significantly high prescriptions of Hydrocodone-
Acetaminophen, followed by Oxycodone-Acetaminophen.
Results and Conclusion
UK Traffic Flow & Accident Analysis
Team: Xiaoming Guo, Weiyi Chen, Dingmeng Duan, Jian Xu, Jiahui Tang
Instructor: Alkis Vazacopoulos Business Intelligence & Analytics
Technology
• Python for integrating data for analysis.
•Tableau for data visualization and extracting data insights.
Current & Future Work
•Generating different plots from data and discovering relationships between variables.
•Plan to find relationships between traffic flow and accidents.
Motivation
• Visualization of a Dataset of 1.6 million accidents and 16
years of traffic flow.
7
US Permanent Visa Application Visualization
Team: Jing Li, Qidi Ying, Runtian Song, Jianjie Gao,Chang Lu
Instructor: Alkiviadis Vazacopoulos
Introduction
Develop a descriptive analysis based on US
permanent application data from 2012 to 2017
in Tableau and provide insights into visa
decisions.
Data explanation:
374,363 applicants from 203 countries in 22 occupations.
Descriptive Analysis
Employer & Economic Sector
Conclusion
Business Intelligence & Analytics
April 30th, 2015
8
Applications by State Applications by Country
Top 10 companies that submit
permanent visa applications.
Education & Occupation
Certified rate and denied rate
among all education degrees We can observe that a
high-school degree has
significant denial rate
in the graph.
Doctorate’s degree has
the lowest denial rate.
Applicants with
master's and bachelor’s
are mostly working in
Computer and
Mathematical fields.
Occupations which
have highest
certification rates.
While High School
applicants most
working in
Production
Occupations that
have certification
rates lower than
5%.
Our team used different attributes to analyze the relationships
between visa applications and certification rates, both directly and
indirectly. Application decisions correlate with many factors, such as
education level, income and occupation.
In conclusion, applicants with higher education (Bachelor's,
Master's, Doctorate) mostly work in Computer and Mathematical
areas, which have higher incomes and are more likely to be certified.
Applicants from countries that dominate an occupation have a higher
certification rate when applying for related jobs. We also found that the
certification rate increased from 2012 to 2017.
Applicants in different occupations have different nationalities.
Taking the computer science and construction occupation maps as
examples, certified applicants in the computer domain mainly come from
India and China, while construction applicants mainly come from
Mexico.
Nationality & Occupation
Business Intelligence & Analytics
9
• Some say climate change is the biggest threat of our age,
while others say it is a myth based on dodgy science.
• We feel the climate change problem has become much more
severe in recent years; global warming is, to some extent,
responsible for the recent disastrous hurricanes.
• So we set out to build descriptive data visualizations showing
how the climate has changed since 1770 and to analyze the
results.
• Excel for data cleaning and data filtering.
• Tableau to do the data visualizations and create interactive
graph.
• Tableau and Watson to build some regression models and
conduct the analysis.
•Illustrate the world’s climate change trend starting from 18th century in the line chart.
•Specify the trend of climate change for each country and its average temperature in the whole period.
•Extract data from original data file to show how much each country’s temperature has increased and compare this percentage
change with one another.
•Customize the period to show trends of climate change in a certain period of time.
•Try to get more data sources to dig in deeply in order to find the factors leading to the climate change.
Climate Change Since 1770
Current & Future Work
Team: Yilin Wu, Ziyun Song, Zhihao Chen, Jiaxing Li, Zhenzhen Liu
Instructor: Alkis Vazacopoulos
Motivation Technology
Is worldwide mercury really rising? Insights into customized periods
Explore average temperature by country Recent 100 years Climate Change
Predicting Customer Conversion Rate for
an Insurance Company
Team: Yalan Wang, Cong Shen, Junyuan Zheng, Yang Yang
Instructors: Alkis Vazacopoulos and Feng Mai
Business Intelligence & Analytics
10
Motivation
•Use the dataset which contained contact information to
predict the customer who would like to purchase the
insurance
•Help insurance company understand the characteristics of
their customers in making the purchasing decisions of their
insurance
Technology
•Used Python to analyze imbalanced data on customers from
insurance company
•Applied Synthetic Minority Over-sampling Technique
(SMOTE) algorithm to balance the data
•Built predictive models (Logistic Regression, Random Forest
and XGBoost) to predict conversion rate
Data Summary
Learning Model
•Logistic Regression: chosen because it is known to serve as a benchmark
against which other algorithms are compared.
•Random Forest Classifier: an ensemble of decision trees.
•XGBoost: short for "Extreme Gradient Boosting", a tree ensemble model
that sums the predictions of multiple classification and regression trees
(CART).
 Raw Data
• Dataset shape: 1,892,888 records and 50 variables in the dataset
• Features Type: 5 columns are int64, 12 columns are float64, and 33 are object
• Missing Value: 42 columns contains NA values
 Clean Data
• Convert the data format to train the model
• Use the correlation matrix to eliminate features that are highly correlated but irrelevant to the target label
• Apply the SMOTE algorithm to balance the dataset
Imbalance data: A dataset is imbalanced if the classes are not approximately equally represented
Correlation Matrix
0: Contacting without purchase
1: Contacting with Purchase
Raw dataset: (1,892,888, 50). Processed dataset: (1,885,774, 135).
SMOTE example: consider a sample (6, 4) and let (4, 3) be its nearest neighbor.
(6, 4) is the sample for which the k-nearest neighbors are being identified;
(4, 3) is one of those k-nearest neighbors. Let:
f1_1 = 6, f2_1 = 4, f2_1 - f1_1 = -2
f1_2 = 4, f2_2 = 3, f2_2 - f1_2 = -1
The new sample is generated as (f1', f2') = (6, 4) + rand(0-1) * (-2, -1),
where rand(0-1) generates a random number between 0 and 1.
Pipeline: processed data → split into training set (75%) and testing set (25%) → fit model → predictions.
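A hedged sketch of that balancing and training pipeline using imbalanced-learn's SMOTE, a 75/25 split, and a Random Forest. The synthetic data below is a stand-in for the cleaned, dummy-encoded contact records, and the hyperparameters are assumptions.

```python
# SMOTE on the training portion only, then fit a Random Forest (illustrative data)
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# stand-in for the processed customer features and 0/1 purchase label
X, y = make_classification(n_samples=20000, n_features=30, weights=[0.97, 0.03], random_state=42)

# 75/25 split; stratify keeps the real class ratio in the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, stratify=y, random_state=42)

# oversample only the training data so evaluation reflects the true imbalance
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)

clf = RandomForestClassifier(n_estimators=200, random_state=42)
clf.fit(X_res, y_res)
print("test accuracy:", clf.score(X_test, y_test))
```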
Results
Conclusion & Future Work
Model Accuracy
LGR 83.6%
RFR 94.6%
XGB 79.8%
Feature Importance
From the Random Forest, we get the top
50 features which play a significant role in
our model.
'RQ_Flag',
'Original_Channel_Broker',
'First_Contact_Date_month',
'First_Contact_Time_Hour',
'PDL_Special_Coverage',
'RQ_Date_month', 'Inception-
First_Contact',
'Original_Channel_Internet',
'PPA_Coverage',
'Inception_Date_month', 'Mileage',
'Region_(03)関東',
'Original_Channel_Phone',
'License_Color_(02) Blue',
'Previous_Insurer_Category_(02)
• Comparing the accuracy of the three models, we chose Random Forest as our final
model.
• Based on the feature importance, we can dig into the business insights behind these
features and suggest which customer characteristics drive insurance purchasing
decisions.
After training, we obtained results for the three models: Logistic Regression, Random
Forest and XGBoost.
Determining Attractive Laptop Features for College Students
Team: Liwei Cao, Gordon Oxley, Haoyue Yu, Salman Sigari
Instructor: Chihoon Lee
Business Intelligence & Analytics
November 28, 2017
11
Experiment Design
Stage 1: Plackett-Burman Design
Objective: Identify the most important factors early in the
experimentation
Factors & Levels:
Stage 2: Fractional Factorial Design
Objective: study the effects and interactions that several factors
have on the response.
Factors & Levels:
Blocks:
Conclusion
• Price and operating system play very important roles
when it comes to laptop purchases
• To maximize the probability of purchasing, price at the plus level
(<$750) and operating system at the minus level (Windows)
would be chosen. The maximum predicted probability of
purchasing is Probability of Purchasing = 51.77 +
(11.36/2)*(1) - (7.23/2)*(1)*(-1) = 61.065
Data Collection
We handed out slips to Stevens students randomly and recorded their
responses
Stage 1 (32 observations): Stage 2 (64 observations):
● Effect Test
● Pareto Plot
● Normal Plot
Motivation
• Laptops have become a staple in our lives: we use them
for work, entertainment, and other daily activities
• From a marketing perspective, it is critical to find the
factors that interest consumers in order to produce and sell
a successful laptop
• A survey conducted by Pearson found that 66% of
undergraduates use their laptop every day in college
• We wanted to find out what drives laptop demand among
college students
Result
Stage 1 Stage 2
Probability of Purchasing = 51.77 + (11.36/2)*Price -
(7.23/2)*Price*Operating system
=51.77 + (5.68)*Price - (3.615)*Price*Operating system
Response: Probability of Purchasing (0 – 100%)
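A quick check of the fitted Stage-2 model over the four combinations of coded levels; the equation is taken directly from the poster, and the small loop below is just arithmetic.

```python
# Evaluate the fitted purchase-probability model at the coded factor levels
def purchase_prob(price, os):
    # price, os coded as +1 / -1 (price +1 = "<$750", os -1 = "Windows")
    return 51.77 + (11.36 / 2) * price - (7.23 / 2) * price * os

for price in (+1, -1):
    for os in (+1, -1):
        print(f"price={price:+d}, os={os:+d} -> {purchase_prob(price, os):.3f}")
# price=+1, os=-1 gives the maximum predicted probability of purchasing: 61.065
```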
Clustering Large Cap Stocks During Different Phases of the
Economic Cycle
Students: Nikhil Lohiya, Raj Mehta
Instructor: Amir H. Gandomi
Results
Clustering of Stocks during Recovery phase
Clustering of Stocks during Recession phase
The K-means plot shows that the stocks are clustered by
similarities in their Sharpe ratio, volatility, and average
return. There are 9 graphs in total; 2 of them, for the
expansion and recession phases, are displayed above.
The x-axis shows the S&P 500 ticker/symbol and the y-
axis shows the cluster number. Hovering over a dot
shows the ticker along with its cluster number and the
variables used for clustering. We used the silhouette
score and visually inspected the data points to find
the optimal value of k, which turned out to be 22.
Introduction
OBJECTIVE
We aim to provide sets of securities that behave
similarly during a particular phase of the economic
cycle. For this project, the creation of sub-asset classes is
done only for large-cap stocks.
BACKGROUND
Over time, developed economies such as the US have
become more volatile, and hence the underlying risk of
securities has risen. This project aims to identify the risks
and potential returns associated with different securities
and to cluster stocks with similar Sharpe ratio, volatility
and average return for a better analysis of the portfolio.
Business Intelligence & Analytics
12
Data Acquisition
• Data on large-cap stocks and US Treasury bonds is gathered
directly using APIs.
• The data covers 2 time frames, i.e. a recessionary and an
expansionary economy.
Data Preprocessing
• This segment applies the formulae to calculate the required
parameters (Eq. 1, 2, 3, 4).
Analysis
• This segment consists of K-means clustering analysis of the
large-cap stocks (k = 22, 500 stocks).
• The clustered securities are then further tested for
correlation among the sub-asset classes.
Results
• The results of the K-means clustering vary in the range 9 to 45.
• There were some outliers in our analysis as well.
Flow - Project
Conclusion & Future Scope
• With the above methodology, we have been able to
develop a set of classes which behave in a similar
fashion during each phase of the economic cycle.
• The same methodology can be extended to
different asset classes available online.
• Application of Neural Networks can significantly
reduce the error in cluster formation.
• Also, application of different parameters such as
Valuation, Solvency or Growth potential factors can
be included for clustering purposes.
• Next, we plan to add leading economic indicator
data to identify the economic trend and to perform
the relevant analysis.
Mathematical Modelling
• Daily returns are computed for all 500 securities:
  R = ((P_c - P_o) / P_o) × 100                          (Eq. 1)
• Average return and volatility:
  μ_j = (1/n) Σ_{i=1..n} R_i                             (Eq. 2)
  σ_j = sqrt( Σ_{i=1..n} (R_i - μ_j)² / (n - 1) )        (Eq. 3)
• Sharpe ratio for each security:
  SR_j = (R_j - R_f) / σ_j                               (Eq. 4)
• A correlation matrix between the clustered securities is computed following
the cluster formation.
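An illustrative sketch of Eq. 1-4 and the k = 22 clustering step in Python. The synthetic open/close series, the zero risk-free rate and the KMeans settings are assumptions standing in for the project's actual large-cap data.

```python
# Compute return, volatility and Sharpe ratio per ticker, then cluster with k = 22
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
tickers = [f"TICK{i}" for i in range(500)]
# synthetic open/close series standing in for the real large-cap price data
prices = {
    t: pd.DataFrame({
        "open": 100 + rng.normal(0, 1, 252).cumsum(),
        "close": 100 + rng.normal(0, 1, 252).cumsum(),
    })
    for t in tickers
}

def stock_features(close, open_, risk_free=0.0):
    r = (close - open_) / open_ * 100            # Eq. 1: daily return in %
    mu = r.mean()                                # Eq. 2: average return
    sigma = r.std(ddof=1)                        # Eq. 3: volatility
    sharpe = (mu - risk_free) / sigma            # Eq. 4: Sharpe ratio
    return pd.Series({"avg_return": mu, "volatility": sigma, "sharpe": sharpe})

features = pd.DataFrame({t: stock_features(p["close"], p["open"]) for t, p in prices.items()}).T
features["cluster"] = KMeans(n_clusters=22, random_state=0, n_init=10).fit_predict(
    features[["avg_return", "volatility", "sharpe"]]
)
print(features["cluster"].value_counts().head())
```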
Introduction
• Predict how popular an apartment rental listing is based on the listing content
• Help Renthop better identify listing quality and renters’ preference
• Test several machine learning algorithms and then evaluate them to choose the best one
Statistical Learning & Analytics
Spring 2017
Exploratory Data Analysis
• Data source: Kaggle.com
• 15 attributes with 49352 training samples.
• Target variable: Interest level (high, medium, low)
Data Preprocessing
• Transformed categorical data into numerical data
• Obtained zip code from latitude & longitude
• “One-hot encoding”: created dummy variables
• Extracted the top apartment features from the list
of apartment features
Feature importance
- Built ExtraTrees Classifier
- Computed feature importance
Top features:
- Price
- Building_index
- Manager_index
- Zip_id
- Number of photos
Modeling
• Ensemble Methods: Random Forest, Bagging, AdaBoost, Xgboost
• Logistic regression, KNN, Naïve Bayes, Decision Tree
• Model Evaluation
- Multi-class logarithmic loss: logloss = -(1/N) Σ_{i=1..N} Σ_{j=1..M} y_ij log(p_ij)
- For imbalanced data, use Precision-recall curve and average precision score
• Best Classifier: Random
Forest
- Lowest logloss: 0.6072
- Highest average precision
score: 0.8374
• Precision-recall curve for each
class
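A small sketch of the two evaluation metrics named above, computed with scikit-learn. The labels and predicted probabilities below are toy values, not the Renthop predictions.

```python
# Multi-class log loss plus one-vs-rest average precision per interest level
import numpy as np
from sklearn.metrics import log_loss, average_precision_score

# classes encoded as 0 = high, 1 = medium, 2 = low interest
y_true = np.array([2, 1, 2, 0, 2])
probas = np.array([          # predicted P(high), P(medium), P(low) per listing
    [0.10, 0.20, 0.70],
    [0.20, 0.50, 0.30],
    [0.05, 0.15, 0.80],
    [0.60, 0.30, 0.10],
    [0.30, 0.30, 0.40],
])

print("multi-class logloss:", log_loss(y_true, probas))

for c, name in enumerate(["high", "medium", "low"]):
    ap = average_precision_score((y_true == c).astype(int), probas[:, c])
    print(f"average precision ({name}): {ap:.3f}")
```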
Conclusion
• Built 8 models to predict the probabilities. The
best performer is Random Forest, with the lowest
logloss (0.6972) and the highest average precision
score (0.8374)
• For imbalanced data, the precision-recall
curve is an appropriate evaluation metric,
while the accuracy score is not
• Hyperparameter tuning for better performance
• SMOTE technique to balance the dataset
13
14
Deep Learning vs Traditional Machine Learning Performance
on an NLP Problem
Abhinav S Panwar
Instructor: Christopher Asakiewicz
Modeling
• Traditional Machine Learning:
• Bag of Words (up to 2 grams) is used for feature generation
• L1 regularization is used for feature selection
• Hyperparameter values are searched through Cross Validation
method
• FastText:
• Data is directly fed into the FastText pipeline without worrying
about feature generation
• ‘Number of Epoch’ and ‘Learning Rate’ are tuned for best
performance
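A hedged sketch of that FastText pipeline. The file names, label prefix and hyperparameter values are assumptions, not the author's settings; fastText expects one labeled example per line, e.g. "__label__1 <concatenated headlines>".

```python
# Train and evaluate a supervised fastText classifier on pre-formatted text files
import fasttext

# train.txt: 2008-2014 rows; test.txt: the two following years (assumed paths)
model = fasttext.train_supervised(
    input="train.txt",
    epoch=25,          # 'Number of Epoch' tuned for best performance
    lr=0.5,            # 'Learning Rate' tuned for best performance
    wordNgrams=2,      # match the up-to-2-gram bag-of-words baseline
)

n, precision, recall = model.test("test.txt")
print(f"samples={n}  precision={precision:.3f}  recall={recall:.3f}")
```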
Introduction
• When you need to tackle an NLP task, the sheer number of
algorithms available can be overwhelming. Task-specific
packages and generic libraries all claim to offer the best
solution to your problem, and it can be hard to decide which
one to use.
• Through this project, we compare the performance of
traditional machine learning algorithms such as Support
Vector Machines, Logistic Regression and Naïve Bayes
with FastText, released by Facebook's AI lab, which is based
on a neural network.
• The challenge is to predict ‘Dow Jones Industrial Average’
movement based on Top 25 news headlines in the media.
This is a binary classification task with ‘1’ indicating DJIA
rose or stayed as the same, and ‘0’ indicating DJIA value
decreased.
Conclusion
• If you are working with large datasets, and speed and low memory
usage are crucial, FastText looks like the best option.
• Based on published research, the performance of FastText is
comparable to other deep neural network architectures, and
sometimes even better.
• Running complex neural network architectures requires GPUs for
good performance, but one can get comparable
performance with FastText running on ordinary CPUs.
Business Intelligence & Analytics
Data Pre-Processing
• Total Number of Samples: 1989
• Some Text Cleaning was done:
•Converting headline to lowercase letters
•Splitting the sentence into a list of words
•Removing punctuation and meaningless words
• Dataset was divided into Training and Testing (80:20 split):
•Training: Data from 08-08-2008 to 12-31-2014 was used
•Testing: Data from following two years was used
Data Description
•News Data: Historical news headlines were obtained from the Reddit WorldNews
channel. They are ranked by Reddit users' votes, and only the top 25 headlines
are considered for a single date.
•Stock Data: Dow Jones Industrial Average (DJIA) adjusted close value for the
period 8th August 2008 to 1st July 2016.
Results
•Precision, Accuracy, and Time taken are compared for all the methods
of Learning.
Word Cloud for +ve class Word Cloud for -ve class
•After analyzing the Word Cloud for both classes of data, we can see
Political sensitive words are dominant for both of them.
•Since the dataset contains only 2000 samples, the final model features
are not distinct enough resulting in average model fitting.
•Among all the algorithms, FastText gives nearly best performance and
the time consumed is far less than other algorithms.
Learning Algorithms
• Traditional Machine Learning:
•Support Vector Machine, Logistic Regression, Naïve Bayes, Random
Forest
• Deep Learning:
•FastText: Neural Network based library released by
Facebook for efficient learning of word representations and sentence
classification.
Method | Precision % (Unigram / Bigram) | Accuracy % (Unigram / Bigram) | Time in s (Unigram / Bigram)
Logistic Regression | 68.7 / 69.3 | 71.2 / 71.8 | 53 / 57
LSVC | 70.9 / 71.2 | 72.1 / 72.3 | 61 / 63
Naïve Bayes | 67.4 / 68.1 | 67.6 / 68.4 | 52 / 53
Random Forest | 53.2 / 54.9 | 55.8 / 57.1 | 69 / 72
FastText | 68.9 / 70.2 | 70.9 / 71.7 | 10 / 11
Wine Recognition
Team: Biying Feng, Ting Lei, Jin Xing
Instructor: Amir Gandomi
Results and Discussions
Model 1: KNN (Kth Nearest Neighbor Classification)
Result: Evaluation: cross validation
The error rate of cross-validation is 0.3124, and the error
rate of KNN model is 0.3483.
Model 2: LDA (Linear Discriminant Analysis)
Result: Evaluation: cross validation
According to the result of the cross validation, we can
know that the error rate is 0.0168.
Model 3: Recursive Partitioning
Result:
Evaluation:
We don’t have to use cross validation to evaluate this
model, the reason is Recursive Partitioning uses cross
validation internally. So, we can trust the error rates
implied by the table. Here, the error rate is 0.1685.
Goals we want to accomplish
• Select the most influential variables for wine
classification.
• Develop classifiers to classify new observations.
• Find the best model for wine classification.
• Find the correlation between each pair of variables.
Classification Models
1. K Nearest Neighbor
2. Linear Discriminant Analysis
3. Recursive partitioning
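The poster's three classifiers were most likely built in R; the sketch below is an equivalent Python version with scikit-learn on the same UCI Wine data, reporting cross-validated error rates for comparison (the neighbor count, tree settings and 10-fold CV are assumptions).

```python
# Compare KNN, LDA and a decision tree (recursive partitioning) on the UCI Wine data
from sklearn.datasets import load_wine
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)

models = {
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "LDA": LinearDiscriminantAnalysis(),
    "Recursive partitioning": DecisionTreeClassifier(random_state=0),
}

for name, model in models.items():
    acc = cross_val_score(model, X, y, cv=10).mean()
    print(f"{name}: cross-validated error rate = {1 - acc:.4f}")
```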
Conclusion
• The error rate of KNN is 0.3146, the worst
among the three models.
• The error rates of the LDA and Recursive
Partitioning models are the same, equal
to 0.01685.
• However, the LDA model overestimates the
result and, therefore, is not good enough.
• The Recursive Partitioning model uses cross-
validation internally to build its decision rules,
which makes it more reliable.
• Finally, the Recursive Partitioning model is
selected as the best model for wine classification.
Multivariate Data Analysis
Fall, 2017
Data Information
Problem Type: Classification
Variables: (1) Alcohol, (2) Malic acid, (3) Ash, (4) Alcalinity of
ash, (5) Magnesium, (6) Total phenols, (7) Flavanoids,
(8) Nonflavanoid phenols, (9) Proanthocyanins, (10)
Color intensity, (11) Hue, (12) OD280/OD315 of diluted
wines, (13) Proline
Outcome: Wine Type
Data source:
http://archive.ics.uci.edu/ml/datasets/Wine
Data preparation
1. Type of variables
2. Summary of variables
3.Correlations 4. Variable importance
5. Scatter Plots
6. Variable dependencies
7. Standardizing variables
15
The Public’s Opinion of Obamacare: Tweet Analyses
Saeed Vasebi
Instructor: Chris Asakiewicz
Results
Monthly trend of tweets based on their
sentiment:
Geographical distribution of tweets:
Main #Hashtags of authors:
Trumpcare sentiment analyses by Language:
Introduction
 The Patient Protection and Affordable Care
Act (Obamacare) provides healthcare
insurance services for US citizens.
 The act has been highly debated by the
Democratic and Republican parties.
 The main beneficiaries of the act are the
people who use and pay for it.
 This study tries to find out what people
think about Obamacare and about Trumpcare,
which is a potential substitute for the act.
Modeling and Data
 The Watson Analytics social media tool was
used to gather data from Twitter based
on #Obamacare and #Trumpcare in this
study. IBM Bluemix sentiment
analysis was used for detailed evaluation of
the tweets.
 Obamacare tweets were gathered
for July-September of 2016 and 2017.
 Trumpcare tweets were extracted
from November 2016 to October 2017.
 The geographical area of study is limited to
the United States.
 The tweets' languages are limited to
English and Spanish, the two most widely
spoken languages in the US.
Conclusion
 Most of tweets have negative sentiments about Obamacare and Trumpcare; however,
Trumpcare has relatively higher opposing.
 CA, TX, NY, and FL have high tweeting rate about the acts. They have high negative tweets
for the acts and partly support Obamacare but Trumpcare is not supported at any state.
 Hashtags with Obamacare talk about its negative impacts on government’s budget.
Hashtags with Trumpcare talk about stopping the program and President Trump’s
violations.
 Spanish tweets have higher rate of positive tweets than English ones.
20
Obamacare tweets’ trend on July-September 2016 and 2017
Trumpcare tweets’ trend from November 2016 to October 2017
Geological distribution of Obamacare’s all, negative, and positive tweets
Geological distribution of Trumpcare’s all, negative, and positive tweets
Obamacare and Trumpcare #hashtags
Project Assignment Optimization Based on
Human Resource Analysis
Team: Jiarui Li, Siyan Zhang, Siyuan Dang, Xin Lin, Yuejie Li
Instructor : Alkis Vazacopoulos
Optimization:
Constraints:
• P_ij indicates whether project i is assigned to employee j; it
takes the value 0 or 1.
• a_i is the number of employees needed for
project i.
• b_j is the number of projects employee j
can complete.
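A minimal sketch of this assignment model with PuLP. The cost term (a predicted turnover risk per project-employee pair) and the a_i / b_j values below are hypothetical stand-ins for the team's data, not their actual inputs.

```python
# Binary assignment model: minimize total predicted turnover risk of assignments
from pulp import LpBinary, LpMinimize, LpProblem, LpVariable, lpSum

projects = ["P1", "P2", "P3"]
employees = ["E1", "E2"]
risk = {("P1", "E1"): 0.2, ("P1", "E2"): 0.4,    # predicted turnover risk if assigned
        ("P2", "E1"): 0.5, ("P2", "E2"): 0.1,
        ("P3", "E1"): 0.3, ("P3", "E2"): 0.3}
a = {"P1": 1, "P2": 1, "P3": 1}    # employees needed per project (a_i)
b = {"E1": 2, "E2": 2}             # projects an employee can complete (b_j)

prob = LpProblem("project_assignment", LpMinimize)
P = {(i, j): LpVariable(f"x_{i}_{j}", cat=LpBinary) for i in projects for j in employees}

prob += lpSum(risk[i, j] * P[i, j] for i in projects for j in employees)   # objective
for i in projects:
    prob += lpSum(P[i, j] for j in employees) == a[i]    # staff each project
for j in employees:
    prob += lpSum(P[i, j] for i in projects) <= b[j]     # workload cap per employee

prob.solve()
print({k: v.value() for k, v in P.items()})
```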
Introduction
The primary goal of our project is to provide
insight to companies that need to assign
different projects to each employee appropriately. In
order to maintain a low turnover rate as well as
increase employee productivity and growth, we
use a combination of machine learning and the
powerful tool of optimization to help companies
build and preserve a successful business.
Business Intelligence & Analytics
Model
Objective: Minimize the turnover rate.
Data Exploration:
1. Project Count: Turnover vs. No Turnover:
Moving Forward
For future work, we look forward to improving
our model by introducing additional
constraints. In addition, we will combine machine
learning, using Random Forest to predict the
turnover rate that feeds into the optimization problem.
Thus, we can generate a more accurate model
that gives dynamic results based on the enterprise's
data.
2.Decision Tree Model:
Get insights of data and
predict the turnover rate.
3.Heat Map:
We build the assignment model to optimize the
project assigned to each employee while
minimizing the employee turnover rate.
There is a positive (+) correlation
between projectCount,
averageMonthlyHours, and
evaluation.
16
S&A Processed Seafood Company
Team: Redho Putra, Neha Bakhre, Harsh Bhotika, Ankit Dargad, Dinesh Muchandi
Instructor: Alkis Vazacopoulos
Methodology
• We found out how many units the S&A Company sold (the
case gives the value: $330M).
• Then we used forecasting techniques to estimate how many
units S&A will sell over the next 3 years to hit the sales target
of $550M, which is the objective of this project.
• Finally, we suggested a new supply chain strategy to reach the desired target.
Introduction
• S&A Company is a distributor of processed seafood products.
• The broader offering of products has improved sales in the
area of Western Europe.
• Sales have improved with the new product acquisitions, but
S&A now wishes to expand the entire product line into more
of Eastern Europe.
• However, the company realizes that their current model of
putting a distribution center (DC) in each country that they
service may no longer be the best option as they look to
expand.
Current Strategy
Conclusion
• The optimum number of DCs for S&A Company is 3
(UK, Germany, and Poland)
• UK DC covered demand from UK & Ireland; Germany
DC covered demand from Germany, France & Belgium;
Poland DC covered demand from Poland & Hungary.
• By meeting the industry benchmark of inventory
turnover (15), UK DC expected average inventory is
257,000 cases; Germany DC expected average
inventory is 261,000 cases, and Poland DC expected
average inventory is 175,000 cases.
• The warehouse utilization rate of these 3 DCs would be
80%.
Business Intelligence & Analytics
21
Suggested Strategy
Results
• Only the Ireland DC has a utilization rate around 80%; the other DCs'
utilization rates are between 25% and 42%.
• The average inventory turnover across all DCs is 11, while the
industry benchmark is 15.
Objective
• S&A Company is looking to drive significant organic
revenue growth in the European market growing from
$340M (as of the end of the fiscal year ending June 30,
2017) to $550M over the next 3 years.
• We need to provide a supply chain strategy that helps
them answer fundamental questions regarding the
expected inventory and flow of goods that will be
required three years from now to support the desired
sales growth.
• The company needs advice regarding the supply chain
network, infrastructure and processes needed to serve
its customers within the expected delivery window while
optimizing costs.
Optimizing Travel Routes
Team: Ruoqi Wang, Yicong Ma, Jinjin Wang, Shuo Jin
Instructor: Alkiviadis Vazacopoulos
Result
Business Intelligence & Analytics
18
Introduction
More than a hundred million international tourists arrive in the United States every year and
spend billions of dollars. To maximize profit and make more competitive offers, it is crucial to lower
costs. One important factor is arranging an efficient tour route so that tour agents and travelers can lower
the cost of time and transportation.
Our project goal is to optimize a tour route, using certain constraints, to minimize costs and maximize the tour
experience.
Future Work
Experiment
•To achieve our objective, we constructed a two-step
experiment.
•Our first step is defining attributes and setting up constraints so
that we can pick 6 target cities from the 10 most popular tourist
destinations in the U.S. We collected data from TripAdvisor.
•The cost of living index is used to estimate the expenses of the
tour, which include dining and lodging.
•Our second step is optimizing the tour route based on the cities
that we picked in step one. We compared the costs of 3
transportation methods (train, bus, flight). We first optimized the
cost for each city and then we optimized the travel route using
Excel solver.
• Step one: we added several other constraints, such as
visiting a minimum of two national parks and a
minimum of one Michelin-starred restaurant, to achieve the
objective of maximizing the tour experience. We used Excel
Solver to obtain our final selection of destinations.
• Step two: after we finished data cleaning,
we applied our data to a Travelling Salesman Problem
model to optimize our route.
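A toy sketch of that step-two routing model: a brute-force Travelling Salesman solution over 6 selected cities. The city names and cost matrix below are placeholders, not the project's actual TripAdvisor or transportation data (the real work used Excel Solver).

```python
# Enumerate all routes over 6 cities and pick the cheapest closed tour
from itertools import permutations

cities = ["NYC", "Boston", "Washington", "Chicago", "Orlando", "Las Vegas"]
cost = [  # symmetric travel-cost matrix (illustrative values)
    [0, 40, 55, 120, 150, 300],
    [40, 0, 90, 140, 180, 320],
    [55, 90, 0, 110, 130, 290],
    [120, 140, 110, 0, 160, 210],
    [150, 180, 130, 160, 0, 250],
    [300, 320, 290, 210, 250, 0],
]

def route_cost(order):
    """Total cost of visiting the cities in `order` and returning to the start."""
    legs = zip(order, order[1:] + order[:1])
    return sum(cost[a][b] for a, b in legs)

# Fix the first city and enumerate the rest (5! = 120 candidate routes)
best = min(((0,) + p for p in permutations(range(1, len(cities)))), key=route_cost)
print([cities[i] for i in best], route_cost(best))
```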
• In the near future, we want to make our model more realistic by adding timetables into our model.
• In addition, we want to add more constraints to decide the optimal group size and ways to deal with
customers’ luggage.
Predicting Airbnb Prices in NYC
Team: Jiahui Bi, Ruoqi Wang, Yicong Ma, Xin Chen
Instructor: Yifan Hu Business Intelligence & Analytics
November 13th 30th, 2017
19
Introduction
• Background: Since emerging around 2013, the sharing economy
has become a vital part of the industry and affects almost
everyone's life. Airbnb, as one of the representatives of the sharing
economy, attracts constant attention from researchers and users.
• Purpose: explore which factors influence the price of
Airbnb houses/apartments/rooms and use those factors
to predict prices in New York City.
• Key Questions: What factors influence prices in
New York, and how?
• Technology: Python (sklearn, xgboost)
Data Preparation
• Data Source:http://insideairbnb.com/get-the-data.html
New York City, New York, United States
October 2017
• Data cleaning: we start with 44,317 rows of data covering 96
features
 Drop entries that are missing (NaN) values for columns
like "bedroom" and "bed"
 Substitute missing values with the column mean for columns
like "review_scores_location"
 Set the value of 'reviews_per_month' to 0 where there is
currently a NaN
 Transfer string values into integer or float types
 Delete entries whose "price" is an outlier (more than $2,000,
or 0)
• Finally, we get 43,148 rows of data and 28 features
Exploratory Data Analysis
• Highly relevant features
• Price Distribution
Conclusions
• Built 5 models and ensembled them
• Tested the xgboost model against a baseline model and Ridge
Regression
• The most important features that influence prices in NYC
are location, room type and the number of bedrooms
Future Work
• Extract new features, such as amenities.
• The price is highly positively related to location; in our case,
listings in Manhattan are highly correlated with higher pricing.
• Accommodates, room type and the numbers of bathrooms,
bedrooms and beds also have strong correlations with the price.
Feature Importance
Method 1
• Built vanilla Linear Regression, Ridge Regression, Lasso
Regression and Bayesian Ridge on cross-validation splits of the data
• Figure 1 shows the results when using 18 features;
Figure 2 shows the results after one-hot encoding (creating dummy
variables)
Figure 1 Figure 2
• So we can see Lasso comes out on top
• Ensemble model: Gradient Boosting
• We do better with the Gradient Boosting Regressor, with
MAE = 25.52, almost 20% less than the previous method
Method 2
• Ridge, baseline & xgboost
• To test our different models in depth, we repeated the train-
validation split and looked at how the errors are distributed.
We also added a baseline model for comparison.
• Both Ridge and xgboost beat the baseline model
• Ridge Regression performs slightly better than the xgboost
model
Model | Mean RMSE
baseline | 129.69
ridge | 97.24
xgboost | 98.31
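A sketch of that Method-2 comparison: repeated train/validation splits scoring a mean-price baseline, Ridge and XGBoost by RMSE. The synthetic features, prices and hyperparameters below are assumptions standing in for the cleaned listing data.

```python
# Repeated train/validation splits comparing baseline, Ridge and XGBoost RMSE
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 18))                                            # stand-in for 18 listing features
y = 100 + X @ rng.normal(size=18) * 20 + rng.normal(scale=30, size=1000)   # stand-in nightly prices

def rmse(y_true, y_pred):
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))

scores = {"baseline": [], "ridge": [], "xgboost": []}
for seed in range(10):                                                     # repeated splits
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=seed)
    scores["baseline"].append(rmse(y_val, np.full_like(y_val, y_tr.mean())))
    scores["ridge"].append(rmse(y_val, Ridge(alpha=1.0).fit(X_tr, y_tr).predict(X_val)))
    scores["xgboost"].append(
        rmse(y_val, XGBRegressor(n_estimators=300, max_depth=5).fit(X_tr, y_tr).predict(X_val))
    )

for name, vals in scores.items():
    print(f"{name}: mean RMSE = {np.mean(vals):.2f}")
```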
Portfolio Optimization of Cryptocurrencies
Team: Yue Jiang, Ruhui Luan, Jian Xu, Shaoxiu Zhang, Tianwei Zhang
Instructor: Alkis Vazacopoulos Business Intelligence & Analytics
22
Motivation
•Cryptocurrencies are a very important topic for investors and
the economic world.
•Everyone wants to understand whether it is worthy and safe
to invest in this type of assets.
•We want to create a portfolio optimization model so that we
can test the performance of Cryptocurrencies .
•We’d like to know and understand the returns and the risk we
take for different portfolios of cryptocurrencies.
Technology
•Used an API to crawl cryptocurrency price data from the
internet.
•Python to construct the portfolio with Monte Carlo
simulation and to optimize the Sharpe ratio and the portfolio variance.
•Python's matplotlib package to visualize the
scatter plot of expected return and expected volatility.
Current & Future Work
•We have price data of 9 different cryptocurrencies for the period of time from May 24th, 2017 to Oct 3rd, 2017, which is a very small
subset of the data but it is a good start for us to test the investment concepts and to optimize the portfolio of digital currencies.
•With Monte Carlo Simulation method, we have constructed out portfolio model. Then, we have calculated the covariance of this 9
cryptocurrencies and also the correlations of every pair of combination.
•We have used Python to calculate and visualize the Markowitz efficient frontier of our data.
•We have optimized the Sharpe Ratio so that we can get the portfolio of maximized Sharpe Ratio. Also, we have minimized the
variance of the portfolio so we can get lower risks.
• We can observe from our findings that the volatility of digital currencies are very high due to the changes of the markets, the
expectation of investors and also the regulations from government.
•In the future, we can include more cryptocurrencies and longer time range into our data so that we can get more exact results.
Moreover, if the regulations are loose and investors can make transactions between digital currencies and real currencies, we then
can get the data of such transactions and find the opportunity of arbitrage.
Data Snapshot
Our data is crawled from online cryptocurrency markets. There are nearly 1,500 types of digital currencies in the market; however, for testing the concepts and due to the limits of our computers, we only collected market prices for 9 cryptocurrencies, for the time frame from May 24th, 2017 to Oct 3rd, 2017.
Log Returns
As most statistical analysis approaches rely on log returns rather than the absolute time series, we use Python's NumPy package to compute the log returns of the 9 cryptocurrencies. The returns indicate that the risk of investing in virtual currencies was very high during this period, because only two of them had positive returns.
Covariance & Correlation
To see the dependence between these 9 digital currencies, we computed the covariance and the correlation between each pair. The results show positive correlations, and most of the currencies are tightly related to Bitcoin.
Random Weights
Subject to the constraint that the sum of the weights of the portfolio holdings
must be 1, we have randomly assigned weights to our 9 cryptocurrencies.
Monte Carlo Simulation & Sharpe Ratio
Optimization
We then use Monte Carlo simulation to run 4,000 iterations of randomly generated weights for the individual virtual currencies, and calculate the expected return, expected volatility and Sharpe ratio for each of the randomly generated portfolios.
It is then very helpful to plot these combinations of expected return and volatility on a scatter plot, coloring the data points by the Sharpe ratio of that particular portfolio.
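A minimal NumPy sketch of the simulation, assuming prices is a DataFrame with one column of daily closing prices per cryptocurrency (the annualization factor of 252 trading days is an assumption):

import numpy as np

log_ret = np.log(prices / prices.shift(1)).dropna()   # daily log returns
mean_ret = log_ret.mean() * 252                       # annualized mean returns
cov = log_ret.cov() * 252                             # annualized covariance matrix

n_assets, n_portfolios = log_ret.shape[1], 4000
results = []
for _ in range(n_portfolios):
    w = np.random.random(n_assets)
    w /= w.sum()                                      # weights sum to 1
    exp_ret = float(w @ mean_ret)
    exp_vol = float(np.sqrt(w @ cov.values @ w))
    results.append((exp_ret, exp_vol, exp_ret / exp_vol))  # Sharpe ratio, risk-free rate ~ 0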
Optimization on Variance of Portfolio
We minimized the portfolio variance so that we can invest in the portfolio with the lowest volatility. The result suggests investing in the first, second, third and last digital currencies.
Maximization of Sharpe Ratio
By searching for the maximum Sharpe ratio of the portfolio, the result is to choose only the last digital currency.
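A sketch of both optimization steps with SciPy, reusing mean_ret and cov from the simulation sketch above; long-only weights and a zero risk-free rate are assumptions:

import numpy as np
from scipy.optimize import minimize

def port_vol(w):
    return float(np.sqrt(w @ cov.values @ w))

def neg_sharpe(w):
    return -float(w @ mean_ret) / port_vol(w)

n = len(mean_ret)
cons = ({"type": "eq", "fun": lambda w: w.sum() - 1},)   # fully invested
bnds = [(0, 1)] * n                                      # long-only weights
w0 = np.full(n, 1.0 / n)

min_var = minimize(port_vol, w0, method="SLSQP", bounds=bnds, constraints=cons)
max_sharpe = minimize(neg_sharpe, w0, method="SLSQP", bounds=bnds, constraints=cons)
print(min_var.x.round(3), max_sharpe.x.round(3))         # optimal weight vectors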
Regression Model for Box Office Gross
• As with the rating model, 3 models with different variables are compared by MSE on 4 testing datasets; model 2 is the best
• In the final regression model for gross, budget, factors 1, 2, 3 and 5, duration, director and actors are the explanatory variables, with factors 3 and 5 entering negatively.
Regression Model for Rating
• 3 models with different variables entered are fitted on the training datasets and verified on the testing datasets; we examine the MSE of each model in each test, and model 1 is considered the best
• In the final regression model for rating, duration, director, year and actors are the explanatory variables, with year entering negatively.
Predicting Movie Rating and Box Office Gross
Team: Yunfeng Liu, Erdong Xia, Yash Naik
Instructor: Amir H. Gandomi Statistical Learning & Analytics
Spring, 2017
MSE of rating model
         test1    test2    test3    test4
model1   0.3168   0.3487   0.3872   0.4128
model2   0.3171   0.3627   0.3872   0.4128
model3   0.3195   0.3492   0.3890   0.4148

MSE of gross model
         test1          test2          test3          test4
model1   2.278 × 10^15  4.017 × 10^15  1.926 × 10^15  1.635 × 10^15
model2   2.269 × 10^15  3.960 × 10^15  1.928 × 10^15  1.650 × 10^15
model3   2.277 × 10^15  4.011 × 10^15  1.925 × 10^15  1.636 × 10^15
rating model of all years
Variable    Parameter Estimate   Standardized Estimate
Intercept   41.94717             0
duration    0.01275              0.25958
director    0.13974              0.17509
actors      0.05001              0.06428
year        -0.01848             -0.16053

gross model of all years
Variable    Parameter Estimate   Standardized Estimate
Intercept   5.1 × 10^6           0
Factor1     2 × 10^7             0.25932
Factor2     1.9 × 10^7           0.24701
Factor3     -5.0 × 10^6          -0.069
Factor5     -5.0 × 10^6          -0.0678
duration    2.5 × 10^5           0.06465
director    4.3 × 10^6           0.06724
actors      7.6 × 10^6           0.12248
budget      2.2 × 10^-1          0.24059
Introduction
• The primary purpose of this project was to create models using existing movie data for prediction of box office gross and movie ratings.
• The first model was created to predict the Gross box office revenue that a movie will generate given inputs like the genre, actors involved,
the director and many more influencing factors.
• The second model was created to predict the movie rating considering the genre, the budget of the production and many more influencing
factors.
Models
• Principal Components Analysis
• Multivariate Regression Model
Data Source
• Database is from www.kaggle.com
• Over 5,000 movies from the past 20 years, with 18 variables
Data Processing
• Delete null and missing values from the raw data (5,500 rows → 3,116 rows)
• Select movies with a sufficient number of reviewers (3,116 rows → 1,558 rows)
• Transform director and actor Facebook likes into a score level from 1 to 5
• Transform the genre string variable into Boolean variables for 17 specific categories
• Divide the population evenly into four testing datasets and set the training dataset according to each testing dataset for regression model testing (a pandas sketch of these steps follows)
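A hedged pandas sketch of these steps, using column names from the Kaggle movie metadata file ("genres", "director_facebook_likes") as assumptions:

import pandas as pd

movies = pd.read_csv("movie_metadata.csv").dropna()

# Bin director Facebook likes into a 1-5 score level (rank first so quantile bin edges stay unique)
movies["director_score"] = pd.qcut(movies["director_facebook_likes"].rank(method="first"),
                                   5, labels=[1, 2, 3, 4, 5]).astype(int)

# Expand the pipe-separated genre string into Boolean indicator columns
genre_dummies = movies["genres"].str.get_dummies(sep="|")
movies = pd.concat([movies, genre_dummies], axis=1)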
Principal Component Analysis
• To investigate the movie score and gross regression models, principal component analysis is conducted on the 17 genre categories.
• From the scree plot, the first 6 factors are selected as principal components of movie genres; their eigenvalues are greater than 1 and the cumulative variance explained is greater than 0.6.
• The scoring coefficients of the 6 principal components are displayed, where green cells are positive indicators of a factor and red cells are negative indicators.
• From the PCA we conclude that Factor 1 resembles family animation without thriller and crime, Factor 2 resembles action and sci-fi with thriller, Factor 3 is action without horror, Factor 4 is biography, Factor 5 is crime, and Factor 6 is documentary.
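A minimal scikit-learn sketch of the component selection, assuming genre_cols is the list of 17 genre indicator columns built above:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(movies[genre_cols])
pca = PCA().fit(X)

eigenvalues = pca.explained_variance_
n_keep = int(np.sum(eigenvalues > 1))                         # Kaiser criterion: eigenvalue > 1
print(n_keep, pca.explained_variance_ratio_[:n_keep].sum())   # components kept, cumulative variance

scores = PCA(n_components=n_keep).fit_transform(X)            # factor scores used in the regressions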
Conclusion
• From the movie rating model, it can be seen that a long movie shot by a famous director is more likely to earn a high rating score on reviewer websites, and the genres of the movie are largely irrelevant.
• From the movie gross model, it can be seen that a big-budget movie in the family-animation or action-sci-fi genres is more likely to be a box office success. Also, in terms of earning money, actors seem more important than the director based on this model.
[Figure: four-fold train/test layout across Test1–Test4 (white = testing, red = training), with the candidate variables entered in models 1–3: the rating models draw on duration, director, year, actor and budget; the gross models draw on duration, director, Factors 1, 2, 3, 4 and 5, actor, budget and year]
24
Student Performance Prediction
Team: Abineshkumar, Sai Tallapally, Vikit Shah
Instructor: Amir H. Gandomi
Multivariate Data Analysis
Fall, 2017
Introduction & Motivation
• Regression and Classification models are developed to help
schools to predict the performance of students in final exam
using historical data including demographic and previous test
scores
• Prediction tools can be used in improving the quality of
education and enhancing school resource management
Data Understanding
• Final grade (G3) of students ranges from 0 – 20 which is
displayed in the histogram below.
Modeling and Technology
• Two datasets with the same number of variables are used: the first covers the Mathematics subject and the second the Portuguese subject
• G3 (final grade), ranging from 0 – 20, is the target.
• A linear regression model is built to predict the final grade of students from their given information
• The linear equation is formed by selecting the top 5 variables, those that provide the lowest BIC and Mallow's Cp while providing an optimal adjusted R2
• The most important variables are G1 and G2, the first and second period grades, which matter most in predicting the final grade
• A classification is also done on the grade: when a student's score is more than 10 it is classified as Pass, and otherwise as Fail
• The classification technique used here is logistic regression
• Classifying the grades (0 – 20) into two classes (Pass / Fail) produces a more accurate model, but in this experiment the focus was to build a linear regression model that predicts the final grade as a numeric value, as shown below.
Result and Conclusion
• The final grades of the students depend mostly on the first and second period grades; demographic information, such as whether a student's father has a job or whether attendance is high, was also useful in the model building process.
• The final grade has a decent correlation with the first and second period grades, and these fields were very important in building the model. In the decision tree model, we can see the impact of the mother having a job title of "teacher" or "other". In the linear model, family relationship and age appear among the top 5 independent variables, which shows their importance as well.
• From these results educators can understand the impact of the earlier grades (first and second period), so they can plan to support students around those exams. The predicted student performance also helps in implementing this model as a policy in schools.
Reference: http://www3.dsi.uminho.pt/pcortez/student.pdf
Linear Regression Model
• From the summary of the model, we took the top 5 variables to build the linear regression model:
Model = lm(G3 ~ G1 + G2 + absences + famrel + age)
• G1, G2, absences, quality of family relationship (famrel), and age of the student are the top 5 independent variables for predicting the final grade (G3).
• R-squared: 0.8599, Adjusted R-squared: 0.846
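The poster fits this model with R's lm(); an equivalent Python sketch using statsmodels, assuming the UCI student file (semicolon-separated) is available locally:

import pandas as pd
import statsmodels.formula.api as smf

students = pd.read_csv("student-mat.csv", sep=";")
model = smf.ols("G3 ~ G1 + G2 + absences + famrel + age", data=students).fit()
print(model.summary())   # coefficients, R-squared and adjusted R-squared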
Decision Tree in R
• The decision tree algorithm uses the first period grade, second period grade, and mother's job to predict whether a student will pass or fail the final exam.
• The fitted model predicts that a student passes the final exam if their second period grade is >8, or their first period grade is >10, or the student's mother has a job title of "teacher" or "other". The accuracy of the model is 87%, which is very close to the models built on Azure.
25
Classifying Iris Flowers
Team: Xi Chen, Shan Gao, Lan Zhang
Instructor: Amir H. Gandomi
Business Intelligence & Analytics
26
Purpose
• The sepal and petal of iris flowers are very different from those of other flowers, and there are three different types of iris in the world. To classify iris flowers, we create a k-nearest neighbors (KNN) model to assign a label (class) to a new instance. A regression model is also developed to assign a value to the new instance.
Data Description
• This famous (Fisher's or Anderson's) iris data set gives the
measurements in centimeters of the variables sepal length
and width and petal length and width, respectively, for 50
flowers from each of 3 species of iris. The species are Iris
setosa, versicolor, and virginica.
Methodology
Linear Regression
• We plotted the scatter matrix of the Iris data before diving into this topic; the data set consists of four measurements (length and width for petals and sepals).
• We ran regression analyses for petals and sepals to see their significance.
KNN method with R
• A similarity measure is typically expressed by a distance measure, such as the Euclidean distance we use on the Iris dataset.
• We use this similarity value to perform predictive modeling, i.e. classification, by assigning a label (class) to the new instance.
• We divide the dataset into training and test sets to assess the accuracy of our classification (a Python equivalent is sketched below).
Clustering
• Hierarchical cluster analysis is used to visualize how clusters are formed and the relationships among Iris members.
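The poster builds the classifier in R; a minimal Python equivalent with scikit-learn, using the same Euclidean-distance KNN and a train/test split:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))   # classification accuracy on the test set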
Data Analysis Results:
Business Intelligence & Analytics
31
Experiments for Apartment Rental Advertisements
Team: Nishant Bhushan, Sharvari Joshi, Nirmala Seshadri
Instructor: Chihoon Lee
Resulting Regression Equation
Likelihood of responding to the Ad = 0.6425 -0.0875 * Amenities[No]
- 0.0775 * Public Transport[No]
Expected Response under Best Factor-Level Choice
Likelihood of Responding to the Ad = 0.6425 – 0.0875 (-1) – 0.0775 (-1)
= 0.81
Objective
• To determine what factors in apartment rental advertisements
contribute to responses from ad viewers.
• Apartment hunt has always been a crucial experience for graduate
students. Therefore, we would like to perform an experiment to see
how response is affected by different factors in a posting on the
website.
• We are performing an experiment to maximize the student’s
response to a post on a housing website of an apartment rental by
determining which factors affect the student’s response.
Approach
A customer survey was conducted and 16 sample ads were made.
Each ad was sent to 5 customers, who were asked how likely they were to respond to that particular ad.
The customers were asked to choose one of the following options:
• Extremely Likely (100%)
• Very Likely (80%)
• Moderately Likely (60%)
• Slightly Likely (40%)
• Not Very Likely (20%)
CHOICE OF EXPERIMENTAL DESIGN
• 2^(8−4) fractional factorial design of resolution IV
• NUMBER OF LEVELS FOR EACH FACTOR = 2
• NUMBER OF FACTORS = 8
• NUMBER OF OBSERVATIONS = 16
• RESOLUTION = 4
• No. of replications: 5
• We chose this design because the number of factors is large (8). A full factorial design would result in a large number of interactions and 2^8 = 256 runs, which would be impossible to carry out given the time and budget constraints.
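A hedged NumPy sketch of how main effects like those in the regression equation above are estimated from a ±1-coded design matrix; the two-factor design slice and the responses below are placeholders, not the actual survey data:

import numpy as np

# 16 runs, factors coded -1/+1 (illustrative two-factor slice of the 2^(8-4) design)
X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]] * 4, dtype=float)
y = np.random.uniform(0.2, 1.0, size=16)      # placeholder mean likelihood responses

# Least-squares fit of an intercept plus the main effects
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
print(coef)   # intercept and one coefficient per factor (cf. 0.6425, -0.0875, -0.0775 above)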
Example of a survey
- 16 advertisement surveys were created.
- Each survey had 5 responses
- Target Audience: Students at Stevens Institute of Technology
Business Intelligence & Analytics
THE DESIGN OF EXPERIMENT
Conclusion
• Not including the amenities decreases the chance of a customer responding to the advertisement by 8.75%.
• Not including access to public transport decreases the chance of a customer responding to the advertisement by 7.75%.
• Mentioning the amenities and access to public transport are the most significant factors and should be included in housing ads.
• The posting time of the ad is not a significant factor, so the advertisement can be posted at any time during the day.
• Street maps and pictures are not significant factors either and can be left out.
Factors and Levels
Human Resource Retention
Team: Aditya Pendyala, Rhutvij Savant
Instructor: Amir H Gandomi
Data Analyses
• The heatmap below depicts a correlation between
the different features of the data set:
• The count of each of the nine incidences is depicted
below:
• The above deviance table indicates the estimates,
standard errors, and z-values of all variables involved
in the dataset
Introduction
Problem: An organization aims to retain its important resources and reduce the number of employees leaving the company. The HR department has collated employee-related data which can be used to predict which employees may leave. If a prediction model with high accuracy can be designed, the HR department can take action to prevent critical employees from leaving by addressing the variables that are causing an employee to resign.
Data: The data provided is compiled by Human Resources and consists of ten variables:
Technology:
• R has been used to develop the logistic regression prediction model. A confusion matrix and ROC curve are used to improve the accuracy of the model and to set the threshold
• Python has been used to generate data visualizations of various analyses, including Principal Component Analysis
Conclusion
Based on the above analyses, the following observations can be made:
• The model can be used to identify the majority of employees who may leave.
• For the identified employees, the analyzed data, such as promotion, monthly hours and salary, can be used by HR to stop the employee from leaving, as shown in the correlation and data analysis charts.
Multivariate Data Analytics
Fall 2017
Logistic Regression Prediction Model
• Number of available instances: 15,000 employees.
• The database is divided into training and testing subsets in a 4:1 ratio
• The prediction model was created on the training dataset using logistic regression (12,000 employees).
• The model is validated on the employees in the testing set (3,000 employees)
• A confusion matrix is used to show the accuracy of the model
• The ROC curve is used to set the threshold to 0.3, the optimal trade-off between the true positive rate and false positive rate, which gave the model an accuracy of 78% (a scikit-learn sketch of this workflow follows)
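The poster builds the model in R; a minimal scikit-learn sketch of the same workflow, assuming X holds the encoded HR features and y the "employee left" flag:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_curve

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)              # 12,000 / 3,000 split

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
probs = clf.predict_proba(X_test)[:, 1]

fpr, tpr, thresholds = roc_curve(y_test, probs)       # inspected to pick the cut-off
pred = (probs >= 0.3).astype(int)                     # threshold of 0.3 chosen from the ROC curve
print(confusion_matrix(y_test, pred))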
Future Potential
• Address all possible contributing factors to prevent employee departures proactively
• Use the framework to increase the accuracy of the model
• Analyze more variables that might affect an employee leaving, e.g. commute time or training opportunities
Principal Component Analysis
With the large number of
variables, correlation between
different variables is inevitable.
By Principal Component Analysis,
all variables can be made to be
linearly uncorrelated and
redundant variables can be
dropped. Based on this graph, it is
visible that the 7th component
can be removed.
 Satisfaction Level
 Last evaluation
 Average monthly hours
 Time Spent at company
 Promotion
 Salary
 Employee Left
 Work Accident
 Department
 Number of projects
T = 0.5       Predicted 0   Predicted 1
Actual 0      TN 2102       FP 183
Actual 1      FN 434        TP 281

T = 0.3       Predicted 0   Predicted 1
Actual 0      TN 1846       FP 439
Actual 1      FN 219        TP 496
ROC Curve
27
Constructing Efficient Investment Portfolios
Team: Cheng Yu, Tsen-Hung Wu, Ping-Lun Yeh
Instructor: Alkis Vazacopoulos
Modeling
Capital Asset Pricing Model(CAPM): Optimization Model:
Analysis
Business Intelligence & Analytics
Fall, 2017
Introduction
Individual investors often do not know how to build an investment portfolio. For this reason, they usually rely completely on advice from financial companies. We offer a customized recommendation planner that advises what quantities of each stock to buy to achieve a target return rate with the minimum portfolio variance.
Technology
• The Capital Asset Pricing Model (CAPM) is used to generate expected returns while accounting for systematic risk.
• IBM DOcplex Modeling for Python executes the quadratic programming that optimizes the objective of the mathematical model.
• Visualizations are produced in Python (pygal) and Microsoft Power BI.
E(R_i) = R_f + β_i [E(R_M) − R_f]
E(R_i) = expected return of stock i
R_f = risk-free rate
β_i = beta of stock i; a measure of systematic risk
E(R_M) = expected return of the market portfolio
E(R_M) − R_f = market risk premium; a measure of the excess return of the market portfolio over the risk-free rate
P_i = current price of stock i
Q_i = quantity of stock i
I_i = investment amount of stock i, with I_i = Q_i × P_i
W_i = weight assigned to each stock i in the portfolio, W_i = I_i / Σ_{i=1..n} I_i
R_i = yearly return for stock i
VC = covariance matrix, with entries Σ (R_i − R̄_i)(R_j − R̄_j) / (N − 1)
R_p = portfolio return
σ_p² = portfolio variance, σ_p² = Wᵀ × VC × W
B_p = actual portfolio budget
B_i = estimated disposable income
Objective: minimize σ_p²
Subject to:
R_p = Σ_{i=1..n} R_i × W_i ≥ target return rate
B_p ≤ B_i
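The poster solves this quadratic program with IBM DOcplex; a minimal SciPy sketch of the same formulation, assuming R (yearly returns) and VC (covariance matrix) are precomputed NumPy arrays and the target return rate is 30%:

import numpy as np
from scipy.optimize import minimize

target = 0.30                                          # target return rate

def variance(w):                                       # sigma_p^2 = w' VC w
    return float(w @ VC @ w)

n = len(R)
cons = ({"type": "eq", "fun": lambda w: w.sum() - 1},             # weights sum to 1
        {"type": "ineq", "fun": lambda w: float(w @ R) - target}) # R_p >= target return
res = minimize(variance, np.full(n, 1.0 / n), method="SLSQP",
               bounds=[(0, 1)] * n, constraints=cons)
weights = res.x               # W_i; quantities follow from Q_i = W_i * budget / P_i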
Simulation
Conclusion
We incorporated CAPM into our model, as it is broadly used among financial experts as an evaluation method for future stock prices.
According to the simulation results, the higher the target return rate, the higher the risk of the portfolio. Furthermore, the model recommends a particular combination of stocks with the minimum portfolio risk for the chosen target return rate.
Investors can use the results of this model to find their target stocks and customize their stock portfolios.
[Simulation workflow: CAPM model → optimization model → actual budget spent and number of shares per stock]
Example profile:
Age: Under 25
Disposable Income: 4,361 USD
Actual expense: 4,342 USD
Target Return: 30%
Portfolio Variance: 0.66
Number of Stocks Selected: 20
28
A Tool for Discovering High Quality Yelp Reviews
Team: Zijing Huang, Po-Hsun Chen, Hao-Wei Chen, Chao Shu
Instructor: Rong Liu Business Intelligence & Analytics
32
Motivation
 For Customers:
Customers are more likely to read reviews with details instead
of reviews that were written to just vent emotions.
 For Companies:
o Objective reviews are helpful in improving their products.
o High-quality reviews can increase customer engagement
and attract more users.
Introduction
 Objective: a tool for discovering high-quality reviews
 High-quality review: objective and supplies insightful details about the user experience.
 Dataset: Yelp hotel and restaurant reviews
o Raw data: 4,736,897 reviews
o Data with labeled aspects and sentiment: 1,132 sentences
 Methodology: deep learning with Convolutional Neural Networks (CNN) and word embeddings
Approach
1. Use the Yelp raw data to train word vectors that capture word semantics
2. Create a CNN to identify aspects in every sentence of a review
3. Train another CNN to detect the sentiment of each sentence
4. Build a neural network model to predict the quality of a review based on the aspects and sentiment of its sentences
5. Compare this approach with other models (SVM, Naïve Bayes, …) and analyze the pros and cons of each model
6. Visualize the result through a user interface and publish a Python package (or a RESTful API) for third-party use
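A hedged Keras sketch of the sentence-level CNN used in steps 2–3, assuming sentences are already padded integer sequences and the trained word vectors are loaded into embedding_matrix (sizes are illustrative, and the zero matrix below is only a placeholder):

import numpy as np
from tensorflow.keras import layers, models, initializers

vocab_size, embed_dim, n_aspects = 20000, 300, 5        # illustrative sizes
embedding_matrix = np.zeros((vocab_size, embed_dim))    # placeholder for the trained word vectors

model = models.Sequential([
    layers.Embedding(vocab_size, embed_dim,
                     embeddings_initializer=initializers.Constant(embedding_matrix),
                     trainable=False),                  # frozen word embeddings
    layers.Conv1D(128, 5, activation="relu"),           # n-gram feature detectors
    layers.GlobalMaxPooling1D(),
    layers.Dense(64, activation="relu"),
    layers.Dense(n_aspects, activation="sigmoid"),      # multi-label aspect output
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])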
Methodology
High Quality vs. Low Quality
[Figure: example reviews annotated with the useful and non-useful information extracted]
Result
Analysis of Diabetes among Pima Indians
What factors affect the occurrence of diabetes?
Team: Junjun Zhu, Jiale Qin, Yi Zhang
Instructor: Amir H. Gandomi
Results & Evaluation
Model 1: Baseline Model
Model 2: Explanatory Model – Logistic Regression
Objectives:
• What factors affect the occurrence of
diabetes?
• How best to classify a new observation?
• What is the best model for this data?
Modeling
1. Simple Logistic Regression
2. Logistic Regression
3. K Nearest Neighbor
Conclusion
Multivariable Data Analysis
Fall, 2017
Data Information
Type: Classification
Data Source: https://archive.ics.uci.edu/ml/datasets/pima+indians+diabetes
Data Preparation
The AUC value is 0.8505833, the F1 score is 0.6394558, and the recall (sensitivity) is quite low at 0.5875.
The AUC value is 0.8336667, the F1 score is 0.528, and the recall (sensitivity) is quite low at 0.4125.
Model 3: Predictive Model -- KNN
The AUC value is 0.7295417, the F1 score is 0.5384615, and the recall (sensitivity) is quite low at 0.525.
We have used three different models to study
the data. From the results, we found that the
simple logistic regression model is the best
among these three models, because the
sensitivity, accuracy, and F1 values are higher
than for the other methods.
Figure 1 shows the distribution of the data, including people who are diabetic and those who are not. Figure 2 shows the people who have diabetes, and Figure 3 shows the people who do not. We can see that the people without diabetes are more concentrated in the center: these people are younger, and their index values are closer to typical ranges.
29
Google Online Marketing Challenge: Aether Game Cafe
Team: Ephraim Schoenbrun, Jaya Prasad Jayakumar, Sunoj Karunanithi, Saketh Patibandla
Instructor: Chihoon Lee
Factors & Sample Ads
Business Intelligence & Analytics
30
Client and Market Analysis
• Aether Game Cafe (AGC) is a blend of traditional coffee shop and
board games play centre located in Hoboken, New Jersey.
• AGC relies only on ‘Word of Mouth’ for its marketing.
User Flow
GOMC
• The Google Online Marketing Challenge is a unique opportunity for
students to experience and create online marketing campaigns
using Google AdWords.
• With a $250 AdWords advertising budget provided by Google, we
developed and ran an online advertising campaign for a business
over a three week period.
Customer Analysis – Google Analytics
Experiments and Results
Hudson County New York
Significant Factors
Main Effects and
Interaction Effects
Successful Ads
Conclusion
• The experimental design helped us achieve our target, which eventually placed our campaign in the top 3% worldwide.
The World’s Best Fitness Assistant
Team: Anand Rai, Jaya Prasad Jayakumar, Saketh Patibandla
Instructor: Christopher Asakiewicz
Fitness Assistant
• Welcome
Intelligence of our BOT
Same Questions but Different answers based on context
Business Intelligence & Analytics
33
Chatbot Framework
Text and Speech Enabled Assistant
Introduction
• We are fitness enthusiasts but never found a good application that guides us while working out.
• A few exist, but they are static chatbots that answer only predefined questions.
• Our Fitness Assistant is AI-driven and uses Natural Language Processing to understand questions and give the best answer.
• This is a prototype, and we are going to extend the application to all exercises and food habits.
Natural Language Processing
• Natural Language Processing is the key to an efficient chatbot.
• Every question is stored for future analysis and the knowledge
base gives the answer to the question even if the question is
not hardcoded.
• The NLP engine searches for the entity in the question, similar to the noun in a sentence, and narrows down the answer.
Future Scope
• The application will be improved by adding all varieties of exercises and multiple sources for the knowledge base.
• The app aims to be a one-stop solution for fitness enthusiasts covering exercises and nutrition. We plan to build an in-app module to track user behavior so the bot can suggest what to consume to improve their wellbeing.
Greetings 1
Greetings 2
Questions
Zillow’s Home Value Prediction
Team: Wenzhuo Lei, Chang Xu, Juncheng Lu
Instructor: Chris Asakiewicz
Background & Objectives
It is absolutely important for homeowners to have a trusted way of monitoring the assets. The “Zestimate”
are created for estimating home values based on a large amount of data. Zillow published a competition of
improving the accuracy of prediction for house.
Business Intelligence & Analytics
Data Imputation
To start, we checked the log error (target value); it follows a normal distribution. We also checked the frequency of trades by day and month: the exact date is not a significant factor for trading houses, and people tend to trade houses in summer.
To impute the data, we separated the features into several types:
• Features with more than 90% missing values → deleted.
• Features with no missing values → kept.
• Binary features → missing values filled with 0.
• Irrational features → missing values filled with -1.
• Features with few missing values → filled with the mean of the feature.
• Special feature (total living area of the home) → imputed with KNN based on the numbers of bedrooms and bathrooms, as more bedrooms and bathrooms usually mean a bigger area (a sketch follows).
• Special features that depend on longitude and latitude → missing values filled in with KNN based on longitude and latitude.
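A hedged scikit-learn sketch of the KNN fill for the living-area feature, assuming props is the properties DataFrame; the Zillow-style column names ("calculatedfinishedsquarefeet", "bedroomcnt", "bathroomcnt") are assumptions, and the bed/bath counts are assumed to be already filled:

from sklearn.neighbors import KNeighborsRegressor

target = "calculatedfinishedsquarefeet"
feats = ["bedroomcnt", "bathroomcnt"]

known = props[props[target].notna()]
missing = props[props[target].isna()]

knn = KNeighborsRegressor(n_neighbors=5).fit(known[feats], known[target])
props.loc[missing.index, target] = knn.predict(missing[feats])   # impute from similar homes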
To avoid overfitting, we chose to reduce the number of features, and we checked feature importance to select them better. We applied a regression model; the R² is low, probably because of multicollinearity. To reduce its effect, we used Variance Inflation Factors to drop features, leaving 10 features.
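A minimal statsmodels sketch of the VIF-based pruning, assuming X is the numeric feature DataFrame after imputation and that a VIF threshold of 10 is an acceptable cut-off:

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def drop_high_vif(X, threshold=10.0):
    X = X.copy()
    while True:
        vif = pd.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
                        index=X.columns)
        if vif.max() <= threshold:
            return X                               # no remaining multicollinearity above threshold
        X = X.drop(columns=[vif.idxmax()])         # drop the most collinear feature and repeat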
Challenge
The data offered by Zillow includes voluminous missing values across 57 features. The accuracy of the prediction is primarily affected by how the data is chosen and prepared.
Modeling
As mentioned above, we tried OLS and the R² was quite low. We tried Random Forest as the second method and it gave a reasonable score. After that we used Gradient Boosting, as it had the lowest Mean Squared Error and Mean Absolute Error. Finally, to be more accurate, we applied XGBoost as the last modeling method.
34
Web Traffic Time Series Forecasting
Team: Jujun Huang, Peimin Liu, Luyao Lin
Instructor: Alkis Vazacopoulos
Business Intelligence & Analytics
http://www.stevens.edu/howe/academics/graduate/business-intelligence-analytics
Introduction
• Our group focuses on the problem of forecasting the future values of multiple time series.
• The training dataset contains the views of 145,063 Wikipedia articles.
• Each time series represents the daily views of a different Wikipedia article, starting from July 1st, 2015 up until December 31st, 2016.
• Our objective is to predict the daily views between January 1st, 2017 and March 1st, 2017.
Current & Future Work
• We used a heat map and the Fast Fourier Transform in the data visualization.
• We used an ARIMA model in the data modeling part to predict the data from January 1st, 2017 to March 1st, 2017.
• In the future, we will combine the ARIMA model with other models to predict the time series and improve our accuracy.
• We are going to predict the views of all 145,063 Wikipedia articles from January 1st, 2017 to November 1st, 2017.
Methodology
• ARIMA model: We use the autoregressive integrated moving average (ARIMA) model to forecast the data.
• If the polynomial has a unit root (a factor (1 − L)) of multiplicity d, an ARIMA(p, d, q) process expresses this polynomial factorization property with p = p′ − d and is given by:
(1 − Σ_{i=1..p} φ_i L^i)(1 − L)^d X_t = (1 + Σ_{i=1..q} θ_i L^i) ε_t
• The ARIMA(p, d, q) process with drift δ can be generalized as:
(1 − Σ_{i=1..p} φ_i L^i)(1 − L)^d X_t = δ + (1 + Σ_{i=1..q} θ_i L^i) ε_t
• Here t is an integer index and the X_t are the real time series values. L is the lag operator, the φ_i are the parameters of the autoregressive part of the model, the θ_i are the parameters of the moving average part, and the ε_t are error terms. The error terms are generally assumed to be independent, identically distributed variables sampled from a normal distribution.
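A minimal statsmodels sketch of this workflow for one article, assuming series is a date-indexed pandas Series of daily views; the (p, d, q) order shown is illustrative:

from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.arima.model import ARIMA

decomp = seasonal_decompose(series, period=7)      # trend + weekly seasonality + residual
fit = ARIMA(series, order=(2, 1, 2)).fit()         # illustrative (p, d, q)
forecast = fit.forecast(steps=60)                  # daily views, Jan 1 - Mar 1, 2017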
Data Modeling
• We can use the ARIMA model without transformations if our time series is stationary.
• In our case, we decompose the series into three parts: trend, seasonality and residual. The sum of these three parts equals our observation.
• Below is one of the results of the forecasted views for one of the articles.
Analysis
• Here are the results of the data visualization:
• From the heat map, we can see a huge amount of web traffic at the end of July and at the beginning and middle of August. However, we cannot find any periodicity in the heat map.
• English articles have the highest number of visits, and the trend for Russian articles is similar to that of English.
Portfolio Optimization with Machine Learning
Team: Chao Cui, Huili Si, Luotian Yin, Qidi Ying, Yinchen Jiang
Instructor: Alkis Vazacopoulos Business Intelligence & Analytics
36
Motivation
It is currently difficult for an investor to make any inference about future prices, especially when volatility makes them uncomfortable.
With classical methods, we can only make estimates using historical data. To explore this problem, we decided to use machine learning techniques to create portfolios and attempt to make the capital distribution more realistic.
Technology
•Python is used to collect S&P 500 index stocks from Yahoo Finance for a one-year period. A threshold is set to select the 50 stocks with the highest return and lowest risk in the stock pool.
•Python is used to build machine learning models and to perform training and testing.
•The IBM CPLEX module API is applied for linear/non-linear programming and optimization.
•All calculations are based on monthly intervals.
Future Work
•We could not constrain the size of the optimized portfolio, which might lead to an impractically large portfolio. Further study of IBM CPLEX is needed.
•The machine learning model needs a lot of improvement. Future work should focus on more accurate models with a wider range of information.
•We could add more attributes of the stocks, such as industry, location and reputation, to cater to customer preferences.
•We could build a more dynamic system that takes price changes and trading fees into consideration.
[Figures: portfolio optimization without prediction (portfolio risk increases slowly until the portfolio return reaches about 0.05, with the top N stocks set as a constraint); machine learning for prediction (added features, Ordinary Least Squares prediction, ensemble modeling for the final prediction); training and testing results]
Developing a Supply Chain Methodology for an
Innovative Product
Team: Akshay Sanjay Mulay
Instructors: Alkis Vazacopoulos and Chris Asakiewicz
Business Intelligence & Analytics
37
Background
• The health club economy is a multibillion-dollar endeavor that has remained static for many years
•Nova Fit looks to revolutionize current fitness equipment by giving fitness enthusiasts safer and more ergonomic options while they work out
•The aim is to innovate every component of health club fitness
Technology
• Google Analytics for Data Storage and Excel Solver for
Calculations.
• Tableau for Visualizations and Graphical Analysis
• Monte Carlo simulation to calculate the Worst Case, Best Case
and Average Case scenarios during the actual Network Analysis
Market Share in terms of
Revenue and Number of
Bars to be produced
Number of Bars to be produced
Customer Survey
Current Scenario
• The typical round barbell is unnatural to hold and creates inefficiencies that hinder workouts
•According to an article on fitness in the NY Times, more than 90% of injuries are caused by lifting free weights
Nova Fit – Revolutionary Barbell
• NovaFit's grip solves the problem by conforming more naturally to the hands, improving confidence and performance. The spindle design also eliminates the need for the clip function that is required with current equipment
Target Customers
• Initially, the target customers will be gym and fitness club owners in the Northeast region of the United States.
•We also look to sell the product to home fitness enthusiasts.
Sample New York Fitness Center Club
Forecast and Finance Model
• The statistics used for calculating NovaFit’s
market share are taken from IBIS.
The primary competitor is Invanko
which sells the Barbell Product
at a price of $1250
Future Scope
• Build a strong supplier, distributor and manufacturer network when the actual product starts selling in the market
•Determine the suppliers based on cost, lead times and availability of the components
•Use the actual data to realign the production and sales strategy in the market
•Determine the sales channels and develop a comprehensive supply chain strategy using various simulations and scenarios
Bike Sharing Optimization
Team: Jiahui Bai,Yuankun Nai, Yuyan Wang, Yanru Zhou
Instructor: Alkis Vazacopoulos
Model
• Linear program
We selected the trip data for the selected stations over 12 months of peak hours to build the model, optimize the inventory level of bicycles and minimize the total cost of deployment with Excel Solver (a PuLP sketch with placeholder data follows).
Total cost of deployment =
∑ ( cost of configuring one bike in an area × number of bikes configured in that area )
• Linear regression
To predict demand under different weather conditions
• Logistic regression
To figure out how weather conditions affect demand
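A hedged PuLP sketch of the deployment model referenced above; the station list, costs, demands and capacities are placeholders, not the Ford bike data:

from pulp import LpProblem, LpMinimize, LpVariable, lpSum, LpStatus

stations = ["S50", "S55", "S60"]                     # placeholder station ids
cost = {"S50": 2.0, "S55": 3.0, "S60": 2.5}          # cost of configuring one bike
demand = {"S50": 12, "S55": 8, "S60": 15}            # peak-hour demand (placeholders)
capacity = {"S50": 27, "S55": 19, "S60": 23}         # dock capacity (placeholders)

prob = LpProblem("bike_deployment", LpMinimize)
bikes = {s: LpVariable(f"bikes_{s}", lowBound=0, cat="Integer") for s in stations}

prob += lpSum(cost[s] * bikes[s] for s in stations)  # total cost of deployment
for s in stations:
    prob += bikes[s] >= demand[s]                    # cover peak-hour demand
    prob += bikes[s] <= capacity[s]                  # respect dock capacity

prob.solve()
print(LpStatus[prob.status], {s: bikes[s].value() for s in stations})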
Introduction
• A bike-sharing system allows people to rent a bicycle at one of the automatic rental stations scattered across the area, use it for a short journey and return it at any other station in the area.
• Ford Bike: located in the Bay Area, it consists of 700 bikes and 70 stations across San Francisco and San Jose.
• Unbalance: the difference between the number of incoming bikes and the number of outgoing bikes within a specific time period.
Data visualization
Business Intelligence & Analytics
Problem
• Visualize trip routings on a map and the peak hours on workdays and weekends.
• Optimize the inventory level of bikes at stations during peak hours
• Minimize the cost of deploying bikes between the stations
• Improve the utilization of each bike by reducing the number of bikes that are not used in peak hours
• Predict the demand for bikes under different weather conditions
[Bar chart: trip counts (start vs. end station) at the top stations, station IDs 70, 69, 50, 55, 74, 61, 67, 60, 65, 77 and 64]
Peak hour selection
Time period: Aug 2014 – Aug 2015
Weekday discrimination: using the WEEKDAY function combined with IF in Excel,
IF(WEEKDAY(start_date,2)>5,"weekend","workday")
Time grouping: using group and outline in a PivotTable, starting at 0:00 and ending at 24:00 with a one-hour step, giving 24 time periods in total
[Line chart: hourly trips on weekends (start vs. end station), peaking at around 3,999 starts and 3,836 ends]
[Line chart: hourly trips on workdays (start vs. end station), with morning and evening peak values of 51,973, 48,410, 47,161 and 46,729 trips]
Future work
• Improve the utilization of bicycles with the linear
program model
• Optimize the bike deployment strategy
• Predict the demand in different weather conditions with
machine learning algorithms
Weather effect on the number of trips (start station / end station):
FINE 300,801 / 301,245; FOG 33,303 / 34,324; FOG-RAIN 6,578 / 7,183; RAIN 43,056 / 42,352; RAIN-THUNDERSTORM 1,629 / 1,247
[Figure captions: Top 10 popular stations; Weather effect on the number of trips; GoFord bike trip pattern (San Francisco)]
38
  • 5. 47 Predicting Results of Premier League Contest Hantao Ren, Lanyu Yu, Siyuan Dang, Jiarui Li 48 NLP Meets Yelp Recommendation System for Restaurant Rui Song 49 WSDM — KKBox’s Churn Prediction Challenge Caitlyn Garger, Yina Dong, Shuo Jin 50 Data Mining of Video Game Sales Xin Lin, Fanshu Li, Jingmiao Shen 51 DengAI: Predicting Disease Spread Vicky Rana, Pradeepkumar Prabakaran 52 Predict Tesla Model 3 Production Volume Wangming Situ, Liwei Cao, Bohong Chen, Tianyu Hao 53 Vehicle Routing Problem using NYC TLC Data Adrash Alok, Garvita Malhan, Ephraim Schoenbrun, Abhir Yadava 54 Credit Rating for a Lending Club Rui Song, Huili Si, Xiao Wan, Lulu Hu 55 Routify: Personalized Trip Planning Minzhe Huang, Bowan Lu, Jingmiao Shen, Xiaohai Su, Abhitej Kodali 56 Uncover World Happiness Patterns Rui Song, Xiao Wan, Xiaoyu Zhang 57* Data Centers – Where to Locate? Smriti Vimal, Sanjay Pattanayak, Kumar Bipulesh, Nitin Gullah, Souravi Sudamme 58 Drone Optimization in Delivery Industry Ni Man, Xinlian Huang, Xuanyan Li 59 Performance Evaluation of Machine Learning Algorithms on Big Data using Spark Neha Mansinghka, Madhuri Koti, Prathamesh Parchure 60* Duck Wisdom: A Personal Portfolio Optimization Tool Taranpreet Singh, Shivakumar Barathi, Ramona Lasrado, Nikhil Lohiya 61 Porto Seguro’s Safe Driver Prediction Boren Lu, Lanshi Li, Xiaoming Guo, Dingmeng Duan 62 Hospital Recommendation System for Patients Abdullah Khanfor, Danilo Brandao and Pedro Sa 63 Customer Segmentation for B2B Sale of Fitness Gear Juhi Gurbani, Arpit Sharma, Neha Mansinghka 64 Predicting Vehicle Collisions & Dynamic Assignment of Ambulances in NYC Divya Rathore, Dhaval Sawlani, Nitasha Sharma, Shruti Tripathi 65 Iceberg Classifier Challenge Chang Lu, Jing Li, Luotian Yin, Runtian Song 66 Predicting Movie Success Jialiang Liu, Huaqing Xie, Xiaohai Su, Liang Ma, Lanjun Hui 67 Stock Prediction Based on News Titles Jianuo Xu, Minghao Guo, Simin Liang, Yudong Cao, Yunzhe Xu 68 How Consumer Reviews Affect a Star’s Ratings Jinjin Li, Prabhjot Singh, Yutian Zhou, Xuetong Wei, Xiaoyu Zhang
  • 6. 69 Student Alcohol Consumption: Predicting Final Grades Ping-Lun Yeh, Zhuohui Jiang, Gaurang Pati 70 Mobile Banking Fraud Detection Junyuan Zheng, Ke Cao, Miaochao Wang, Tuo Han 71 Subway Delay Dilemma Smit Mehta, Nishita Gupta, Matthew Miller, Jianfeng Shi 72 Integrated Digital Marketing Studies on Hoboken Local Restaurant Shuting Zhang, Yalan Wang, Haoyue Yu, Liyu Ma, Christina Eng 73* AI Academic Advisor Vaibhav Desai, Piyush Bhattad 74 JFK Airport – Flight Delay Analysis Praveen Thinagarajan, Arun Krishnamurthy, Thushara Elizabeth Tom, Sunoj Karunanithi 75 Machine Learning on Highly Imbalanced Manufacturing Data Set Liyu Ma 76* Duck Finder Salman Sigari, Shankar Raju and Team
  • 7. Master of Science Business Intelligence & Analytics Business Intelligence & Analytics http://www.stevens.edu/bia CURRICULUM Organizational Background • Financial Decision Making Data Management • Strategic Data Management • Data Warehousing & Business Intelligence Data and Information Quality * Optimization and Risk Analysis • Optimization & Process Analytics Risk Management Methods & Apps.* Data Mining • Knowledge Discovery in Databases Statistical Learning & Analytics* Statistics • Multivariate Data Analytics • Experimental Design Social Network Analytics • Network Analytics • Web Mining Management Applications • Marketing Analytics* • Supply Chain Analytics* Big Data Technologies • Data Stream Analytics* • Big Data Seminar* • Cognitive Computing* Practicum Projects with industry * Electives - Choose 2 out of 8 Social Skills Disciplinary Knowledge Technical Skills • Written & Oral Skills Workshops • Team Skills • Job Skills Workshops • Industry speakers • Industry-mentored projects • SQL, SAS, R, Python Hadoop • Software “Boot” Camps • Course Projects • Industry Projects Curriculum Practicum MOOCs Infrastructure Laboratory Facilities • Hadoop, SAS, DB2, Cloudera • Trading Platforms: Bloomberg • Data Sets: Thomson-Reuters, Custom PROGRAM ARCHITECTURE Demographics 2013F 2014F 2015F 2016F 2017F Applications 101 157 351 591 725 Accepted 48 84 124 287 364 Rejected 34 34 186 257 307 In system/other 19 39 41 46 53 Admissions Full-time/Part-time Full-time 201 Part-time 21 Gender Female 41% Male 59% Placement Starting Salaries (without signing bonus): $65 - 140K Range $84K Average $90K (finance and consulting) Data Scientists 23%: Data Analysts: 30% Business Analysts: 47% Our students have accepted jobs at for example: Apple, Bank of America, Blackrock, Cable Vision, Dun & Bradstreet, Ernst & Young, Genesis Research, Jeffreys, Leapset, PriceWaterhouse, Morgan Stanley, New York Times, Nomura, PriceWaterhouse Coopers, RunAds, TIAA- CREF, Verizon Wireless Hanlon Lab -- Hadoop for Professionals The Masters of Science in Business Intelligence and Analytics (BI&A) is a 36-credit STEM program designed for individuals who are interested in applying analytical techniques to derive insights and predictive intelligence from vast quantities of data. The first of its kind in the tri-state area, the program has grown rapidly. We now have approximately 222 master of science students and another 79 students taking 4-course graduate certificates. The program has increased rapidly in quality as well as size. The average test scores of our student body is top 75 percentile. We are ranked #7 among business analytics programs in the U.S. by The Financial Engineer. STATISTICS PROGRAM PHILOSOPHY/OBJECTIVES • Develop a nurturing culture • Race with the MOOCs • Develop innovative pedagogy • Migrate learning upstream in the learning value chain • Continuously improve the curriculum • Use analytics competitions • Improve placement • Partner with industry
  • 8. Google Online Marketing Challenge 2017 True Mentors AdWords Campaign Team: Philippe Donaus, Rush Kirubi, Salvi Srivastava, Thushara Elizabeth Tom, Archana Vasanthan Instructor: Theano Lianidou Business Intelligence & Analytics November 28, 2017 1 Motivation • Google Online Marketing Challenge is an unique opportunity for students to build online Marketing Campaigns on Google AdWords for a business or a non-profit. $250 budget is provided by Google to run these campaigns live for 3 weeks • We worked with TRUE Mentors, a non-profit based in Hoboken,NJ. Built a marketing strategy on Google AdWords to achieve goals like Creating Brand Awareness, promoting Fundraising Events and Volunteer opportunities and Donations. • Technologies Used: Google AdWords, Google Analytics, Google Search Console, Facebook Insights Design of Campaigns • Conducted Market Analysis-Competitors, current market position and platforms used, USP • Analyzed existing data available on Google Analytics, Google Search console, Facebook Insights and established marketing goals • Designed campaigns on Search and Display Ads Performance Results • 23 Ad groups with 206 Ads with 700 keywords were used in total • Text Ads appear for people’s search terms across Google Search and Search partner sites. • Display Ads appear on relevant pages across the display network • The team finished as a “Finalist” in the Social Impact Award category. • The team finished as a “Semi-Finalist” in the Business Award category. • Ranked among the Top-10 teams in the Social Impact Award category. • Ranked among the Top-15 teams in the Business Award Category. • Ranked among the Top-5 teams in the Americas region. • The results can be found at: https://www.google.com/onlinechallenge/past/winners-2017.html • Team ID: 234-571-4266 Targeting and Bidding Target Goals (set before running the campaigns) End Results • Campaigns ran from 24th April 2017-14th May 2017 • KPI’s were monitored and optimized continuously over the 3 weeks using insights drawn from various AdWords Reports, Search Term reports, Google Analytics and Keyword reviews. CAMPAIGN LEVEL Targeting/Bidding TM_Brand TM_Events TM_Donations TM_Volunteers TM_DisplayCampaign Location Hudson County-NJ, New York County-NY Hudson County-NJ Hudson County-NJ Hudson County-NJ Hudson County-NJ, New York County-NY Bidding Strategy Manual CPC Manual CPC Manual CPC Manual CPC Manual CPC Daily Budget? Yes Yes Yes Yes Yes ADGROUP LEVEL Max CPC? Set for all adgroups Set for all adgroups Set for all adgroups Set for all adgroups Set for all adgroups Demographic No No No Yes - Male and Female were targetted seperately No KEYWORD LEVEL Max CPC? Yes Yes Yes Yes No Topics No No No No Yes - Charity & Philanthropy and Fast Foods
  • 9. Improving a Non-Profit’s Home Page
Team: Rush Kirubi, Thushara Elizabeth Tom. Instructor: Chihoon Lee. Business Intelligence & Analytics, November 23, 2017 (Poster 2)
Motivation
• Goal: optimize TRUE Mentors’ homepage to reduce the drop-off rate.
• In turn, this improves the quality score of the AdWords bids, leading to more ad exposures at the same or lower expense.
Experiment Design
• Methodology: full factorial design with blocking.
• Factors & levels: donate button color, slider present/absent, testimonial type. Response: drop-off rate.
Data
• Time (the blocking factor) was confounded with the three-factor interaction ABC, which was therefore assumed to be negligible.
Result
• No factor significantly stood out. The results of the experiment are shown in the Effect Test and Normal Plot figures.
Conclusion
• The best setting is the purple donate button with no slider, relative to the other settings.
• However, this effect is not statistically significant at the 5% level.
• Since none of the factors are significant, we selected the settings that minimize page loading time: no slider, testimonial with text, purple donate button.
Limitations
• We did not have enough time for replication because of the competition deadlines.
• The blocking variable was difficult to accommodate: we manually recorded the values at certain times of the day and night.
Summary: Participating in Google’s Online Marketing Challenge, we selected a nonprofit to run a digital marketing campaign. Part of our effort involved optimizing the organization’s home page to boost user engagement as measured by drop-off rates. We set up a full factorial experiment (2^3) with time of day as the block variable. Put simply, we tested the donate button color, the presence of a slider, and the type of testimonial (one predominantly text, the other simply a photo with a caption). All three factors were blocked on the time of day (daytime or nighttime). Empirically, the best setting was no slider with a purple donate button; however, the effects were not strong enough to pass a statistical inference test. We kept the no-slider option anyway, since it reduces page loading time and the slider’s presence does not improve drop-off rates.
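The poster’s effect estimates were produced with design-of-experiments software; the short Python sketch below shows the same kind of analysis under stated assumptions. The drop-off responses and the run order are placeholders, not the campaign’s real measurements.

```python
# Sketch: estimating effects for a 2^3 full factorial with a day/night block.
# The drop-off responses below are illustrative placeholders, not the real data.
import pandas as pd
import statsmodels.formula.api as smf

runs = pd.DataFrame({
    "button":      [-1,  1, -1,  1, -1,  1, -1,  1],   # -1 = original, +1 = purple
    "slider":      [-1, -1,  1,  1, -1, -1,  1,  1],   # -1 = no slider, +1 = slider
    "testimonial": [-1, -1, -1, -1,  1,  1,  1,  1],   # -1 = text, +1 = photo
    "block":       ["day", "night", "night", "day",
                    "night", "day", "day", "night"],   # confounded with ABC
    "drop_off":    [52.1, 47.3, 55.0, 49.8, 51.2, 46.9, 54.1, 50.5],  # placeholder %
})

# Main effects and two-factor interactions; the ABC interaction is dropped
# because it is confounded with the block, as described on the poster.
model = smf.ols(
    "drop_off ~ button * slider * testimonial - button:slider:testimonial + C(block)",
    data=runs,
).fit()

# With a single replicate the model is saturated, so rather than t-tests the
# coefficients (effect = 2 * coefficient with +/-1 coding) would be judged
# with a normal probability plot, as on the poster.
print(model.params)
```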
  • 10. Analyzing the Impact of Earthquakes
Team: Fulin Jiang, Ruhui Luan, Zhen Ma, Gordon Oxley. Instructor: Prof. Alkis Vazacopoulos. Business Intelligence & Analytics (Poster 3)
Motivation
• Earthquakes are among the most destructive natural forces in the world, ravaging entire cities with little notice.
• We wanted to analyze and visualize patterns in how earthquakes have historically struck and damaged specific locations, in order to highlight high-risk and under-prepared areas of the world.
• Earthquake features used include magnitude, source, focal depth, date, and type (nuclear or tectonic activity).
• We also measured each earthquake’s damaging effects using damage in US dollars, deaths, numbers of houses damaged and destroyed, and injuries.
Technology Utilized
• Tableau was used for our analysis of earthquakes.
• We found Tableau especially useful when visualizing the latitude and longitude data, clearly identifying trends in the way earthquakes affect certain parts of the world.
• With the creation of 8 dashboards, we were able to analyze and visualize many different features of earthquakes, including depth, source, and more.
Findings (dashboards: Casualties due to Earthquakes; Financial Cost of Earthquakes; Earthquake Map of the World)
• Based on the analysis of the number of deaths due to earthquakes, it is clear that a majority of high-casualty events happen in coastal regions, many on the Indian subcontinent.
• We see a peak in deaths in 2010 due to the unfortunate number of casualties in Haiti, underlining the fact that certain underdeveloped regions suffer increased casualties.
• We observe the same trend in high-cost areas, although the largest-cost event occurred in Japan, where the 2011 tsunami caused massive damage.
Source: National Earthquake Information Center
  • 11. Real Time Health Monitoring System
Team: Ankit Kumar, Khushali Dave, Shruti Agarwal, Nirmala Seshadri. Instructor: Prof. David Belanger. Business Intelligence & Analytics (Poster 4)
Problem Statement
To create a user-specific, real-time health monitoring system using sensors from a smart watch and/or an Arduino device. The application should monitor health features such as heart rate, step count and body temperature in real time, and should warn the user or emergency services of any undesired or serious condition.
Architectural Approach
The architecture consists of the following steps:
1. Data is generated and stored in a file.
2. Data is streamed using Apache Kafka.
3. A real-time data visualization is set up using a visualization tool.
Tools Used
1. JSON file parsing for initial data analysis
2. Apache Kafka for real-time streaming
3. Arduino programming for pulling temperature data in real time
4. Python for data cleaning
5. Tableau for visualization
Variables
1. Heart rate: through the smart watch sensor
2. Step count: through the smart watch
3. Body temperature: through the Arduino
Trigger Cases
1. Fever: high temperature, low heart rate and step count
2. Long-term unconsciousness: low heart rate, body temperature and step count
3. Heart attack: sharply elevated heart rate in a very short time interval
Results
The last part of the project is to visualize results in real time, plotting the data streams and raising a trigger alert if any abnormal use case occurs.
Business Impact
Our health data analysis application can pull data in real time from device sensors. Using this system, authorities, friends and relatives can easily monitor the health of a loved one and respond immediately in case of an emergency. From a business perspective, this can attract customers suffering from a medical condition.
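As a rough illustration of the streaming step, the sketch below pushes simulated sensor readings into a Kafka topic with the kafka-python client. The broker address, topic name, trigger thresholds and the random readings are all assumptions for the example, not the project’s actual configuration, and a Kafka broker must already be running for it to execute.

```python
# Sketch: streaming simulated health readings into Kafka (kafka-python client).
# Broker address, topic name and trigger thresholds are illustrative assumptions.
import json
import random
import time

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def check_triggers(reading):
    """Very simplified versions of the poster's trigger cases."""
    if reading["temperature"] > 38.5 and reading["heart_rate"] < 60:
        return "possible fever"
    if reading["heart_rate"] > 140:
        return "possible cardiac event"
    return None

for _ in range(10):                     # in the real system this loop never ends
    reading = {
        "heart_rate": random.randint(50, 150),                 # smart watch sensor
        "steps": random.randint(0, 30),                        # steps in the last minute
        "temperature": round(random.uniform(36.0, 39.5), 1),   # Arduino sensor
        "timestamp": time.time(),
    }
    reading["alert"] = check_triggers(reading)
    producer.send("health-monitor", reading)                   # consumed by the dashboard
    time.sleep(1)

producer.flush()
```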
  • 12. Zillow’s Home Value Prediction
Team: Yilin Wu, Zhihao Chen, Jiaxing Li, Zhenzhen Liu, Ziyun Song. Instructor: Prof. Alkiviadis Vazacopoulos. Business Intelligence & Analytics (Poster 5)
Motivation
• The Zillow Prize challenges the data science community to push the accuracy of the Zestimate even further (improving the median margin of error).
• Our task in this competition is to develop an algorithm that predicts the log error for the months in Fall 2017.
Technology
• Python, Watson and Tableau for exploratory data analysis (EDA).
• Python for data preprocessing and feature engineering.
• Python to build the model.
Competition Process
EDA, then data preprocessing and feature engineering; train a model on 2016 data and test the predicted log error; improve the feature engineering; choose the best model and tune its parameters; train on both 2016 and 2017 data; several further improvements; final submission.
Feature Engineering
Feature engineering was the most important part of this competition. It is essential to measure feature importance so that valuable features are kept and useless ones dropped. We also created new features that might help the machine learning algorithms work better.
Modeling
We compared several gradient boosting implementations (XGBoost, LightGBM and CatBoost) and obtained our best results with CatBoost. CatBoost builds oblivious decision trees and uses a boosting scheme designed to reduce the bias of the residuals, which helps prevent overfitting. It also uses its own scheme for calculating leaf values and supports several statistics-based options for encoding categorical features. In general, CatBoost is presented as an algorithm that can handle categorical features without preprocessing, is resistant to overfitting, and can be used without spending much effort on hyperparameter selection, while often being more accurate. Training, however, is still slow: on average 7-8 times longer than LightGBM and 2-3 times longer than XGBoost. Before training, the categorical features must be declared; in our case there are 26 categorical features, which are passed to CatBoost through its Pool object. We then adjusted the parameters to get a better (though not the best possible) result.
Conclusion
The final submission placed in the top 11%, just short of the bronze medal cutoff at the top 10%. Future work is to experiment further with the categorical features and keep tuning the model.
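A minimal sketch of the CatBoost step described above follows. The synthetic feature table, the categorical column index and the hyperparameter values are placeholders, since the team’s exact settings and data are not shown on the poster; only the general pattern (declare categorical features via a Pool, train a regressor) reflects the text.

```python
# Sketch: training a CatBoost regressor on a stand-in for the Zillow log-error target.
# The synthetic data and hyperparameters below are placeholders, not the team's values.
import numpy as np
import pandas as pd
from catboost import CatBoostRegressor, Pool

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "region": rng.choice(["LA", "OC", "Ventura"], size=500),   # categorical feature
    "build_year": rng.integers(1950, 2016, size=500),
    "finished_sqft": rng.normal(1800, 400, size=500),
    "tax_value": rng.normal(400_000, 90_000, size=500),
})
y = rng.normal(0.0, 0.16, size=500)                            # fake log-error target

cat_feature_idx = [0]                                          # "region" is categorical
train_pool = Pool(df.iloc[:400], y[:400], cat_features=cat_feature_idx)
valid_pool = Pool(df.iloc[400:], y[400:], cat_features=cat_feature_idx)

model = CatBoostRegressor(
    iterations=300, learning_rate=0.03, depth=6,
    loss_function="MAE",        # the competition scores mean absolute error of log error
    random_seed=42, verbose=0,
)
model.fit(train_pool, eval_set=valid_pool, use_best_model=True)
print(model.predict(df.iloc[400:])[:5])
```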
  • 13. Analysis of Opioid Prescriptions and Deaths
Team: Pranjal Gandhi, Nishant Bhushan, Sunoj Karunanithi, Raunaq Thind. Instructor: Alkis Vazacopoulos. Business Intelligence & Analytics (Poster 6)
Objective & Motivation
The objective is to find the correlation between prescriptions of drugs containing opioids and drug-related deaths in the USA. What are opioids? Opioids are a class of drugs that include the illicit drug heroin as well as licit prescription pain relievers such as oxycodone, hydrocodone, codeine, morphine, fentanyl and others.
Tools Used
• Tableau for visualizations.
• R and Excel for cleaning the data and exploratory analysis.
• The dataset is a subset of data sourced from cms.gov and contains prescription summaries of 250 common opioid and non-opioid drugs written by medical professionals in 2014.
Facts & Figures
• The number of deaths due to drug overdoses exceeds deaths from car accidents by a staggering 11,102, according to a DEA report.
• In 2014, 4.3 million people aged 12 or older were using opioid-based painkillers without prescriptions.
• This led to substance abuse among almost 50% of those consumers.
• 94% of respondents in a 2014 survey of people in treatment for opioid addiction said they chose to use heroin because prescription opioids were “far more expensive and harder to obtain.”
Analysis, Results and Conclusion
• We found that opioid prescriptions were especially high for prescribers in the following specialties: female nurse practitioners, female physician assistants, female and male family practices, female and male internal medicine, and male dentists.
• The top 5 states with the highest percentage of deaths due to overdoses are California, Ohio, Philadelphia, Florida and Texas. All of them had significantly high prescriptions of hydrocodone-acetaminophen, followed by oxycodone-acetaminophen.
  • 14. UK Traffic Flow & Accident Analysis Team: Xiaoming Guo, Weiyi Chen, Dingmeng Duan, Jian Xu, Jiahui Tang Instructor: Alkis Vazacopoulos Business Intelligence & Analytics Technology • Python for integrating data for analysis. •Tableau for data visualization and extracting data insights. Current & Future Work •Generating different plots from data and discovering relationships between variables. •Plan to find relationships between traffic flow and accidents. Motivation • Visualization of a Dataset of 1.6 million accidents and 16 years of traffic flow. 7
  • 15. US Permanent Visa Application Visualization
Team: Jing Li, Qidi Ying, Runtian Song, Jianjie Gao, Chang Lu. Instructor: Alkiviadis Vazacopoulos. Business Intelligence & Analytics (Poster 8)
Introduction
Develop a descriptive analysis in Tableau based on US permanent visa application data from 2012 to 2017 and provide insights into visa decisions. Data: 374,363 applicants from 203 countries across 22 occupations.
Descriptive Analysis
• Applications by state and applications by country.
• Employer & economic sector: top 10 companies that submit permanent visa applications.
Education & Occupation
• Certified and denied rates across all education levels: high-school degrees show a significantly higher denial rate, while doctorate degrees have the lowest denial rate.
• Applicants with master’s and bachelor’s degrees mostly work in computer and mathematical fields, the occupations with the highest certification rates, while high-school applicants mostly work in production occupations, which have certification rates lower than 5%.
Nationality & Occupation
• Applicants in different occupations have different nationalities. Taking the computer and construction occupation maps as examples, certified applicants in the computer domain mainly come from India and China, while construction applicants mainly come from Mexico.
Conclusion
Our team used different attributes to analyze, directly and indirectly, the relationships between visa applications and certification rates. Application decisions are correlated with many factors, such as education level, income and occupation. In conclusion, applicants with higher education (bachelor’s, master’s, doctorate) mostly work in computer and mathematical areas, which pay more and are more likely to be certified. Applicants from countries that dominate an occupation have higher certification rates when applying for related jobs. We also found that the certification rate increased over time from 2012 to 2017.
  • 16. Climate Change Since 1770
Team: Yilin Wu, Ziyun Song, Zhihao Chen, Jiaxing Li, Zhenzhen Liu. Instructor: Alkis Vazacopoulos. Business Intelligence & Analytics (Poster 9)
Motivation
• Some say climate change is the biggest threat of our age, while others say it is a myth based on dodgy science.
• Personally, we feel the climate change problem has become much more severe in recent years; global warming is, to some extent, responsible for the recent disastrous hurricanes.
• So we set out to build descriptive data visualizations to see how the climate has changed since 1770 and to analyze the trends.
Technology
• Excel for data cleaning and filtering.
• Tableau for the data visualizations and interactive graphs.
• Tableau and Watson to build some regression models and conduct the analysis.
Current & Future Work
• Illustrate the world’s climate change trend starting from the 18th century in a line chart (“Is worldwide mercury really rising?”).
• Specify the trend of climate change for each country and its average temperature over the whole period (“Explore average temperature by country”).
• Extract data from the original file to show how much each country’s temperature has increased and compare these percentage changes with one another.
• Customize the period to show climate change trends over a chosen window of time (“Insights into customized periods”; “Recent 100 years climate change”).
• Try to obtain more data sources and dig deeper to find the factors driving climate change.
  • 17. Predicting Customer Conversion Rate for an Insurance Company
Team: Yalan Wang, Cong Shen, Juanyuan Zheng, Yang Yang. Instructors: Alkis Vazacopoulos and Feng Mai. Business Intelligence & Analytics (Poster 10)
Motivation
• Use a dataset of customer contact records to predict which customers are likely to purchase insurance.
• Help the insurance company understand the characteristics of the customers who decide to purchase its products.
Technology
• Used Python to analyze imbalanced customer data from an insurance company.
• Applied the Synthetic Minority Over-sampling Technique (SMOTE) to balance the data.
• Built predictive models (Logistic Regression, Random Forest and XGBoost) to predict the conversion rate.
Data Summary
• Raw data: 1,892,888 records and 50 variables (raw dataset shape (1892888, 50)). Feature types: 5 int64 columns, 12 float64 columns and 33 object columns. Missing values: 42 columns contain NAs.
• Target label: 0 = contacted without purchase, 1 = contacted with purchase.
• Data cleaning: converted data formats for training; used the correlation matrix to eliminate features that were highly correlated with each other but irrelevant to the target label; applied the SMOTE algorithm to balance the dataset. Processed dataset shape: (1885774, 135).
• Imbalanced data: a dataset is imbalanced if the classes are not approximately equally represented.
• SMOTE example: consider a sample (6, 4) and let (4, 3) be one of its k-nearest neighbors. The component differences are (4 - 6, 3 - 4) = (-2, -1), so a new synthetic sample is generated as (f1', f2') = (6, 4) + rand(0-1) * (-2, -1), where rand(0-1) is a random number between 0 and 1.
Learning Models
• Logistic Regression: chosen because it is a standard benchmark against which other algorithms are compared.
• Random Forest Classifier: an ensemble of decision trees.
• XGBoost: short for “Extreme Gradient Boosting”, a tree ensemble model that sums the predictions of a set of classification and regression trees (CART).
• Pipeline: the processed data is split into a 75% training set and a 25% testing set; the model is fitted on the training set and predictions are evaluated on the test set.
Results
After training, we obtained the following accuracies: Logistic Regression (LGR) 83.6%, Random Forest (RFR) 94.6%, XGBoost (XGB) 79.8%.
Feature Importance
From the Random Forest we extracted the top 50 features that play a significant role in the model, including 'RQ_Flag', 'Original_Channel_Broker', 'First_Contact_Date_month', 'First_Contact_Time_Hour', 'PDL_Special_Coverage', 'RQ_Date_month', 'Inception-First_Contact', 'Original_Channel_Internet', 'PPA_Coverage', 'Inception_Date_month', 'Mileage', 'Region_(03)関東', 'Original_Channel_Phone', 'License_Color_(02) Blue', 'Previous_Insurer_Category_(02)
Conclusion & Future Work
• Comparing the accuracy of the three models, we chose the Random Forest as our final model.
• Based on the feature importances, we can dig into these features for business insights and suggest which customer characteristics drive insurance purchase decisions.
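The balancing-plus-modeling pipeline can be sketched as follows with scikit-learn and imbalanced-learn. The synthetic data and class ratio stand in for the proprietary insurance dataset, and the hyperparameters are illustrative rather than the team’s actual settings.

```python
# Sketch: SMOTE oversampling followed by a Random Forest, as in the pipeline above.
# The synthetic data below stands in for the proprietary insurance dataset.
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Imbalanced toy data: roughly 5% of contacts end in a purchase (placeholder ratio).
X, y = make_classification(n_samples=20_000, n_features=20, weights=[0.95, 0.05],
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    stratify=y, random_state=42)

# Oversample only the training split so the test set keeps the real class ratio.
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)

clf = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=42)
clf.fit(X_res, y_res)
print(classification_report(y_test, clf.predict(X_test)))

# Feature importances (used on the poster to pick the top features).
print(sorted(clf.feature_importances_, reverse=True)[:5])
```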
  • 18. Determining Attractive Laptop Features for College Students Team: Liwei Cao, Gordon Oxley, Haoyue Yu, Salman Sigari Instructor: Chihoon Lee Business Intelligence & Analytics November 28, 2017 11 Experiment Design Stage 1: Plackett-Burman Design Objective: Identify the most important factors early in the experimentation Factors & Levels: Stage 2: Fractional Factorial Design Objective: study the effects and interactions that several factors have on the response. Factors & Levels: Blocks: Conclusion • Price and operating system play really important roles when it comes to laptop purchase • To maximize Probability of Purchasing, price at plus level (<750) and operating system at minus level (Windows) would be chosen. Our maximum predicted probability of purchasing would be Probability of Purchasing = 51.77 + (11.36/2)*(1) - (7.23/2)*(1)*(-1) = 61.065 Data Collection We handed out slips to Stevens students randomly and recorded their responses Stage 1 (32 observations): Stage 2 (64 observations): ● Effect Test ● Pareto Plot ● Normal Plot Motivation • Laptops have become a staple in our lives as we use them for work, entertainment, and other daily activities • From a marketing perspective, it is critical to find the factors that interest consumers in order to produce and sell a laptop that will be successful . • Survey conducted by Pearson says 66% of undergraduates use their laptop every day in college • We wanted to find what drives laptop demand among college students Result Stage 1 Stage 2 Probability of Purchasing = 51.77 + (11.36/2)*Price - (7.23/2)*Price*Operating system =51.77 + (5.68)*Price - (3.615)*Price*Operating system Response: Probability of Purchasing (0 – 100%)
  • 19. Clustering Large Cap Stocks During Different Phases of the Economic Cycle
Students: Nikhil Lohiya, Raj Mehta. Instructor: Amir H. Gandomi. Business Intelligence & Analytics (Poster 12)
Introduction
• Objective: provide sets of securities that behave similarly during a particular phase of the economic cycle. For this project, the creation of sub-asset classes is done only for large-cap stocks.
• Background: over time, developed economies such as the US have become more volatile, and hence the underlying risk of securities has risen. This project aims to identify the risks and potential returns associated with different securities and to cluster similar stocks according to their Sharpe ratio, volatility and average return, for a better analysis of the portfolio.
Project Flow
• Data acquisition: data on large-cap stocks and US Treasury bonds is gathered directly using APIs. The data covers two time frames, i.e., a recessionary and an expansionary economy.
• Data preprocessing: application of the formulae below to calculate the required parameters (Eq. 1-4).
• Analysis: k-means clustering of the 500 large-cap stocks (k = 22). The clustered securities are then further tested for correlation among the sub-asset classes.
• Results: the results of the k-means clustering vary in the range 9 to 45, and there were some outliers in our analysis as well.
Mathematical Modelling
• Daily return for each of the 500 securities: $R = \frac{P_c - P_o}{P_o} \times 100$ (Eq. 1)
• Average return: $\mu_j = \frac{1}{n}\sum_{i=1}^{n} R_i$ (Eq. 2)
• Volatility: $\sigma_j = \sqrt{\frac{\sum_{i=1}^{n}(R_i - \mu_j)^2}{n-1}}$ (Eq. 3)
• Sharpe ratio: $SR_j = \frac{R_j - R_f}{\sigma_j}$ (Eq. 4)
• A correlation matrix between the clustered securities is computed after the clusters are formed.
Results (figures: Clustering of Stocks during Recovery phase; Clustering of Stocks during Recession phase)
The k-means plots show the stocks clustered by similarity in their Sharpe ratio, volatility and average return. There are 9 graphs in total, of which 2 are displayed above for the expansion and recession phases. The x-axis shows the S&P 500 ticker/symbol and the y-axis shows the cluster number; hovering on a dot shows the ticker along with its cluster number and the variables used for clustering. We used the silhouette score and visually inspected the data points to find the optimal value of k, which turns out to be 22.
Conclusion & Future Scope
• With the above methodology, we have been able to develop a set of classes that behave in a similar fashion during each phase of the economic cycle.
• The same methodology can be extended to different asset classes available online.
• Applying neural networks could significantly reduce the error in cluster formation.
• Different parameters such as valuation, solvency or growth-potential factors could also be included for clustering purposes.
• Next, we plan to add leading economic indicator data to identify the economic trend and perform the relevant analysis.
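A compact Python sketch of the clustering step follows. The price series are simulated stand-ins for the S&P 500 data pulled via APIs, and the risk-free rate is an assumption; only k = 22 and the three clustering features (Eq. 1-4) come from the poster.

```python
# Sketch: cluster stocks on (average return, volatility, Sharpe ratio) with k-means.
# Prices here are simulated; in the project they come from an API for S&P 500 tickers.
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
tickers = [f"STK{i:03d}" for i in range(500)]
prices = pd.DataFrame(
    100 * np.exp(np.cumsum(rng.normal(0.0003, 0.02, size=(252, 500)), axis=0)),
    columns=tickers,
)

returns = prices.pct_change().dropna() * 100        # daily % returns (Eq. 1)
risk_free = 0.01                                    # daily risk-free rate, assumed

features = pd.DataFrame({
    "avg_return": returns.mean(),                   # Eq. 2
    "volatility": returns.std(ddof=1),              # Eq. 3
})
features["sharpe"] = (features["avg_return"] - risk_free) / features["volatility"]  # Eq. 4

X = StandardScaler().fit_transform(features)
kmeans = KMeans(n_clusters=22, n_init=10, random_state=0).fit(X)
features["cluster"] = kmeans.labels_
print("silhouette:", round(silhouette_score(X, kmeans.labels_), 3))
```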
  • 20. Predicting Interest Level in Apartment Listings on RentHop
Statistical Learning & Analytics, Spring 2017 (Poster 13)
Introduction
• Predict how popular an apartment rental listing is based on the listing content.
• Help RentHop better identify listing quality and renters’ preferences.
• Test several machine learning algorithms and evaluate them to choose the best one.
Exploratory Data Analysis
• Data source: Kaggle.com; 15 attributes with 49,352 training samples.
• Target variable: interest level (high, medium, low).
Data Preprocessing
• Transformed categorical data into numerical data.
• Obtained zip code from latitude & longitude.
• “One-hot encoding”: created dummy variables.
• Extracted the top apartment features from the list of apartment features.
Feature Importance
• Built an ExtraTrees classifier and computed feature importances. Top features: price, building_index, manager_index, zip_id, number of photos.
Modeling
• Ensemble methods: Random Forest, Bagging, AdaBoost, XGBoost.
• Logistic Regression, KNN, Naïve Bayes, Decision Tree.
Model Evaluation
• Multi-class logarithmic loss: $\text{logloss} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M} y_{ij}\log(p_{ij})$
• For imbalanced data, use the precision-recall curve and average precision score.
• Best classifier: Random Forest, with the lowest log loss (0.6072) and the highest average precision score (0.8374). Precision-recall curves were plotted for each class.
Conclusion
• Built 8 models to predict the interest-level probabilities; the best performer is the Random Forest, with the lowest log loss (0.6072) and the highest average precision score (0.8374).
• For imbalanced data, the precision-recall curve is an appropriate evaluation metric, whereas the plain accuracy score is not.
• Future work: hyperparameter tuning for better performance and the SMOTE technique to balance the dataset.
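As a small illustration of the evaluation metric and the winning model type, the sketch below fits a Random Forest and scores it with scikit-learn’s multi-class log loss; the synthetic three-class data and its class proportions are stand-ins for the RentHop listings.

```python
# Sketch: multi-class log loss for a three-class interest-level problem.
# Synthetic data stands in for the RentHop listing features.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score, log_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10_000, n_features=15, n_informative=8,
                           n_classes=3, weights=[0.70, 0.22, 0.08], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, stratify=y,
                                          random_state=0)

clf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)          # one probability column per class

print("log loss:", round(log_loss(y_te, proba), 4))
# Average precision for the rare "high interest" class (one-vs-rest view).
print("AP (class 2):", round(average_precision_score(y_te == 2, proba[:, 2]), 4))
```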
  • 21. Deep Learning vs Traditional Machine Learning Performance on an NLP Problem
Abhinav S Panwar. Instructor: Christopher Asakiewicz. Business Intelligence & Analytics (Poster 14)
Introduction
• When you need to tackle an NLP task, the sheer number of algorithms available can be overwhelming. Task-specific packages and generic libraries all claim to offer the best solution, and it can be hard to decide which one to use.
• This project compares the performance of traditional machine learning algorithms such as Support Vector Machines, Logistic Regression and Naïve Bayes with FastText, the neural-network-based text classification library released by Facebook’s AI lab.
• The challenge is to predict the movement of the Dow Jones Industrial Average (DJIA) based on the top 25 news headlines in the media. This is a binary classification task, with ‘1’ indicating the DJIA rose or stayed the same and ‘0’ indicating it decreased.
Data Description
• News data: historical news headlines obtained from the Reddit WorldNews channel, ranked by Reddit users’ votes; only the top 25 headlines are considered for a single date.
• Stock data: DJIA adjusted close value for the period 8 August 2008 to 1 July 2016.
Data Pre-Processing
• Total number of samples: 1,989.
• Text cleaning: converting headlines to lowercase, splitting each sentence into a list of words, removing punctuation and meaningless words.
• The dataset was divided into training and testing sets (80:20 split): training used data from 08-08-2008 to 12-31-2014, testing used data from the following two years.
Learning Algorithms
• Traditional machine learning: Support Vector Machine, Logistic Regression, Naïve Bayes, Random Forest.
• Deep learning: FastText, a neural-network-based library released by Facebook for efficient learning of word representations and sentence classification.
Modeling
• Traditional machine learning: bag of words (up to 2-grams) for feature generation, L1 regularization for feature selection, and hyperparameter values searched through cross-validation.
• FastText: data is fed directly into the FastText pipeline without explicit feature generation; the number of epochs and the learning rate are tuned for best performance.
Results
• Precision, accuracy and time taken are compared for all learning methods:

                       Precision (%)       Accuracy (%)       Time (s)
                       Unigram  Bigram     Unigram  Bigram    Unigram  Bigram
  Logistic Regression  68.7     69.3       71.2     71.8      53       57
  LSVC                 70.9     71.2       72.1     72.3      61       63
  Naïve Bayes          67.4     68.1       67.6     68.4      52       53
  Random Forest        53.2     54.9       55.8     57.1      69       72
  FastText             68.9     70.2       70.9     71.7      10       11

• Analyzing the word clouds for both classes (+ve and -ve), we can see that politically sensitive words dominate in both.
• Since the dataset contains only about 2,000 samples, the final model features are not distinct enough, resulting in only average model fit.
• Among all the algorithms, FastText gives nearly the best performance while taking far less time than the others.
Conclusion
• If you are working with large datasets and speed and low memory usage are crucial, FastText looks like the best option.
• Based on published research, FastText’s performance is comparable to other deep neural network architectures, and sometimes even better.
• Running complex neural network architectures requires GPUs for good performance, but comparable performance can be obtained with FastText running on general-purpose CPUs.
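The FastText side of the comparison can be reproduced in a few lines, as sketched below. The file names and hyperparameter values are illustrative assumptions; the input files must use fastText’s `__label__` format, one labeled headline set per line.

```python
# Sketch: supervised fastText classifier for the DJIA up/down labels.
# File names and hyperparameters are illustrative; input files use the
# "__label__1 <text>" / "__label__0 <text>" format, one example per line.
import fasttext

model = fasttext.train_supervised(
    input="djia_train.txt",   # assumed path to the training file
    epoch=25,
    lr=0.5,
    wordNgrams=2,             # include bigrams, as in the results table above
)

n, precision, recall = model.test("djia_test.txt")   # assumed path to the test file
print(f"samples={n}  precision={precision:.3f}  recall={recall:.3f}")

label, prob = model.predict("stocks rally as oil prices stabilize")
print(label, prob)
```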
  • 22. Wine Recognition
Team: Biying Feng, Ting Lei, Jin Xing. Instructor: Amir Gandomi. Multivariate Data Analysis, Fall 2017 (Poster 15)
Goals
• Select the most influential variables for wine classification.
• Develop classifiers to classify a new observation.
• Find the best model for wine classification.
• Find the correlation between each pair of variables.
Data Information
• Problem type: classification. Outcome: wine type.
• Variables: (1) Alcohol, (2) Malic acid, (3) Ash, (4) Alcalinity of ash, (5) Magnesium, (6) Total phenols, (7) Flavanoids, (8) Nonflavanoid phenols, (9) Proanthocyanins, (10) Color intensity, (11) Hue, (12) OD280/OD315 of diluted wines, (13) Proline.
• Data source: http://archive.ics.uci.edu/ml/datasets/Wine
Data Preparation
1. Type of variables 2. Summary of variables 3. Correlations 4. Variable importance 5. Scatter plots 6. Variable dependencies 7. Standardizing variables
Classification Models
1. K Nearest Neighbor (KNN) 2. Linear Discriminant Analysis (LDA) 3. Recursive Partitioning
Results and Discussion
• Model 1, KNN: evaluated with cross-validation; the cross-validation error rate is 0.3124 and the error rate of the KNN model is 0.3483.
• Model 2, LDA: evaluated with cross-validation; the cross-validation error rate is 0.0168.
• Model 3, Recursive Partitioning: separate cross-validation is not needed because Recursive Partitioning uses cross-validation internally, so we can trust the error rates implied by the table. Here, the error rate is 0.1685.
Conclusion
• KNN has the worst accuracy of the three models, with an error rate of about 0.31.
• The LDA and Recursive Partitioning models achieve much lower error rates (0.0168 and 0.1685 respectively, as reported above).
• However, the LDA model appears to overestimate its performance and is therefore not good enough.
• The Recursive Partitioning model uses cross-validation internally to build its decision rules, which makes it more reliable.
• The Recursive Partitioning model is therefore selected as the best model for wine classification.
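The poster’s analysis was done in R; an equivalent Python/scikit-learn sketch on the same UCI wine data is shown below. The number of CV folds and the model settings are illustrative choices, not the poster’s exact configuration.

```python
# Sketch: comparing the three classifiers on the UCI wine data with scikit-learn.
# (The poster used R; this is an equivalent Python version with illustrative settings.)
from sklearn.datasets import load_wine
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)

models = {
    # Standardizing matters for KNN, which is distance based.
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "LDA": LinearDiscriminantAnalysis(),
    "Decision tree": DecisionTreeClassifier(random_state=0),
}

for name, model in models.items():
    acc = cross_val_score(model, X, y, cv=10, scoring="accuracy")
    print(f"{name:14s} CV error rate = {1 - acc.mean():.4f}")
```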
  • 23. The Public’s Opinion of Obamacare: Tweet Analyses
Saeed Vasebi. Instructor: Chris Asakiewicz (Poster 20)
Introduction
• The Patient Protection and Affordable Care Act (Obamacare) provides health insurance services for US citizens.
• The act has been highly debated by the Democratic and Republican parties.
• The main beneficiaries of the act are the people who use and pay for it.
• This study tries to find out what people think about Obamacare and about Trumpcare, its potential substitute.
Modeling and Data
• The Watson Analytics social media tool is used to gather data from Twitter based on #Obamacare and #Trumpcare; IBM Bluemix sentiment analysis is used for detailed evaluation of the tweets.
• Obamacare tweets were gathered for July-September 2016 and 2017; Trumpcare tweets were extracted from November 2016 to October 2017.
• The geographical area of study is limited to the United States.
• The tweet language is limited to English and Spanish, which represent two main population groups in the US.
Results (figures)
• Monthly trend of tweets by sentiment: Obamacare tweets for July-September 2016 and 2017; Trumpcare tweets from November 2016 to October 2017.
• Geographical distribution of all, negative and positive tweets, for Obamacare and for Trumpcare.
• Main #hashtags of authors; Trumpcare sentiment analysis by language.
Conclusion
• Most tweets have negative sentiment about both Obamacare and Trumpcare; however, Trumpcare draws relatively more opposition.
• CA, TX, NY and FL have high tweeting rates about the acts, with many negative tweets; they partly support Obamacare, but Trumpcare is not supported in any state.
• Hashtags accompanying Obamacare talk about its negative impact on the government’s budget; hashtags accompanying Trumpcare talk about stopping the program and President Trump’s violations.
• Spanish tweets have a higher rate of positive sentiment than English ones.
  • 24. Project Assignment Optimization Based on Human Resource Analysis
Team: Jiarui Li, Siyan Zhang, Siyuan Dang, Xin Lin, Yuejie Li. Instructor: Alkis Vazacopoulos. Business Intelligence & Analytics (Poster 16)
Introduction
The primary goal of our project is to provide insight to companies that need to properly assign projects to each employee. In order to maintain a low turnover rate while increasing employee productivity and growth, we use a combination of machine learning and optimization to help companies build and preserve a successful business.
Model
• Objective: minimize the turnover rate. We build an assignment model that optimizes the projects assigned to each employee while minimizing the employee turnover rate (a minimal solver sketch follows this poster’s text).
• Decision variables and parameters: P_ij takes the value 1 if project i is assigned to employee j and 0 otherwise; a_i is the number of employees needed for project i; b_j is the number of projects employee j can complete.
Data Exploration
1. Project count for turnover vs. no turnover.
2. Decision tree model: gain insights from the data and predict the turnover rate.
3. Heat map: there is a positive (+) correlation between projectCount, averageMonthlyHours and evaluation.
Moving Forward
For future work, we look forward to improving the model by introducing additional constraints. In addition, we will combine machine-learning (Random Forest) predictions of the turnover rate with the optimization model, so that we can generate a more accurate, dynamic result based on the enterprise’s data.
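A minimal sketch of the assignment model in PuLP follows. The turnover-risk scores, the staffing requirements a_i and the capacities b_j are made-up numbers, and the constraint forms (each project fully staffed, each employee within capacity) are our reading of the variable definitions above rather than the team’s exact formulation.

```python
# Sketch: project-to-employee assignment that minimizes predicted turnover risk.
# risk[i, j], a_i and b_j are made-up numbers; in the project the risk scores
# would come from the machine learning model (e.g. the Random Forest).
from pulp import LpBinary, LpMinimize, LpProblem, LpVariable, lpSum, value

projects = ["P1", "P2", "P3"]
employees = ["E1", "E2", "E3", "E4"]

risk = {("P1", "E1"): 0.20, ("P1", "E2"): 0.35, ("P1", "E3"): 0.15, ("P1", "E4"): 0.40,
        ("P2", "E1"): 0.30, ("P2", "E2"): 0.10, ("P2", "E3"): 0.25, ("P2", "E4"): 0.20,
        ("P3", "E1"): 0.25, ("P3", "E2"): 0.30, ("P3", "E3"): 0.35, ("P3", "E4"): 0.10}
a = {"P1": 2, "P2": 1, "P3": 1}           # employees needed per project (a_i)
b = {"E1": 1, "E2": 1, "E3": 1, "E4": 2}  # projects each employee can take (b_j)

prob = LpProblem("project_assignment", LpMinimize)
x = LpVariable.dicts("assign", (projects, employees), cat=LpBinary)   # P_ij

prob += lpSum(risk[i, j] * x[i][j] for i in projects for j in employees)
for i in projects:                        # each project gets the staff it needs
    prob += lpSum(x[i][j] for j in employees) == a[i]
for j in employees:                       # no employee is overloaded
    prob += lpSum(x[i][j] for i in projects) <= b[j]

prob.solve()
print("total risk:", value(prob.objective))
print([(i, j) for i in projects for j in employees if x[i][j].value() == 1])
```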
  • 25. S&A Processed Seafood Company Team: Redho Putra, Neha Bakhre, Harsh Bhotika, Ankit Dargad, Dinesh Muchandi Instructor: Alkis Vazacopoulos Methodology • We found out how many units S&A Company sold. (The article give the ‘value’-> 330M) • Then we used forecasting techniques to estimate how many units S&A will sell in the next 3 years to hit sales target 550M, which is our objective of this project. • Suggested new supply chain strategy to reach desired target. Introduction • S&A Company is a distributor of processed seafood products. • The broader offering of products has improved sales in the area of Western Europe. • Sales have improved with the new product acquisitions, but S&A now wishes to expand the entire product line into more of Eastern Europe. • However, the company realizes that their current model of putting a distribution center (DC) in each country that they service may no longer be the best option as they look to expand. Current Strategy Conclusion • The optimum number of DCs for S&A Company is 3 (UK, Germany, and Poland) • UK DC covered demand from UK & Ireland; Germany DC covered demand from Germany, France & Belgium; Poland DC covered demand from Poland & Hungary. • By meeting the industry benchmark of inventory turnover (15), UK DC expected average inventory is 257,000 cases; Germany DC expected average inventory is 261,000 cases, and Poland DC expected average inventory is 175,000 cases. • The warehouse utilization rate of those 3 DC would be 80%. Business Intelligence & Analytics 21 Suggested Strategy Results • Only Ireland DC has utilization rate around 80%. The other DC utilization rate is between 25-42%. • The average inventory turnover in all DC is 11, while the industry benchmark is 15. Objective • S&A Company is looking to drive significant organic revenue growth in the European market growing from $340M (as of the end of the fiscal year ending June 30, 2017) to $550M over the next 3 years. • We need to provide a supply chain strategy that helps them answer fundamental questions regarding the expected inventory and flow of goods that will be required three years from now to support the desired sales growth. • The company needs advice regarding the supply chain network, infrastructure and processes needed to serve its customers within the expected delivery window while optimizing costs.
  • 26. Optimizing Travel Routes
Team: Ruoqi Wang, Yicong Ma, Jinjin Wang, Shuo Jin. Instructor: Alkiviadis Vazacopoulos. Business Intelligence & Analytics (Poster 18)
Introduction
More than a hundred million international tourists arrive in the United States every year and spend billions of dollars. To maximize profit and make more competitive offers, it is crucial to lower costs. One important factor is arranging an efficient tour route so that tour agents and travelers can lower the cost of time and transportation. Our project goal is to optimize a tour route, using certain constraints, to minimize costs and maximize the tour experience.
Experiment
• To achieve our objective, we constructed a two-step experiment.
• Step one: define attributes and set up constraints so that we can pick 6 target cities from the 10 most popular tourist destinations in the U.S. We collected data from TripAdvisor, and the cost-of-living index is used to estimate the expenses of the tour, including dining and lodging. We added several other constraints, such as visiting at least two national parks and at least one Michelin-starred restaurant, to maximize the tour experience, and used Excel Solver to obtain our final selection of destinations.
• Step two: optimize the tour route over the cities picked in step one. We compared the costs of 3 transportation methods (train, bus, flight), first optimized the cost for each city, and then, after finishing the data cleaning, applied our data to a Traveling Salesman Problem model and optimized the travel route using Excel Solver.
Result
(Route maps and cost comparisons shown on the poster.)
Future Work
• In the near future, we want to make the model more realistic by adding timetables.
• In addition, we want to add more constraints to decide the optimal group size and ways to deal with customers’ luggage.
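For a route over only six cities, the Traveling Salesman Problem can even be solved by brute force; the sketch below does this in Python with a made-up, symmetric cost matrix and a hypothetical city list (the project itself used Excel Solver and real transportation costs).

```python
# Sketch: brute-force Traveling Salesman over 6 cities with a made-up cost matrix.
# The project solved this in Excel Solver with real transportation costs.
from itertools import permutations

cities = ["New York", "Chicago", "Las Vegas", "San Francisco", "Orlando", "New Orleans"]
cost = [  # illustrative pairwise travel costs in dollars (symmetric)
    [  0, 120, 260, 300, 140, 180],
    [120,   0, 210, 250, 190, 160],
    [260, 210,   0,  90, 280, 230],
    [300, 250,  90,   0, 320, 270],
    [140, 190, 280, 320,   0, 110],
    [180, 160, 230, 270, 110,   0],
]

start = 0                                   # tour starts and ends at the first city
best_cost, best_route = float("inf"), None
for perm in permutations(range(1, len(cities))):
    route = (start, *perm, start)
    total = sum(cost[a][b] for a, b in zip(route, route[1:]))
    if total < best_cost:
        best_cost, best_route = total, route

print("cheapest tour cost: $", best_cost)
print(" -> ".join(cities[i] for i in best_route))
```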
  • 27. Predicting Airbnb Prices in NYC
Team: Jiahui Bi, Ruoqi Wang, Yicong Ma, Xin Chen. Instructor: Yifan Hu. Business Intelligence & Analytics, November 2017 (Poster 19)
Introduction
• Background: since taking off around 2013, the sharing economy has become a vital part of the industry and affects almost everyone’s life. Airbnb, as one of its flagship representatives, draws constant attention from explorers and users.
• Purpose: explore which factors influence the price of Airbnb houses, apartments and rooms, and use those factors to predict prices in New York City.
• Key question: which factors influence prices in New York, and how?
• Technology: Python (sklearn, xgboost).
Data Preparation
• Data source: http://insideairbnb.com/get-the-data.html (New York City, New York, United States, October 2017).
• Data cleaning, starting from 44,317 rows and 96 features:
  - Drop entries with missing (NaN) values in columns like “bedroom” and “bed”.
  - Substitute missing values with the column mean for columns like “review_scores_location”.
  - Set 'reviews_per_month' to 0 where it is currently NaN.
  - Convert string values to integer or float types.
  - Delete price outliers (more than $2,000 or equal to 0).
• Finally, we end up with 43,148 rows and 28 features.
Exploratory Data Analysis
• Highly relevant features and price distribution.
• The price is strongly positively related to location; in our case, a Manhattan location is highly correlated with price.
• Accommodates, room type and the numbers of bathrooms, bedrooms and beds also correlate strongly with price.
Method 1
• Built vanilla Linear Regression, Ridge Regression, Lasso Regression and Bayesian Ridge on cross-validation splits.
• Figure 1 shows the results using 18 features; Figure 2 shows the results after one-hot encoding (creating dummy variables). Lasso comes out on top.
• Ensemble model: the Gradient Boosting Regressor does better still, with MAE = 25.52, almost 20% lower than the previous method.
Method 2
• Ridge, baseline & XGBoost: to test the models in depth, we repeated the train-validation split and looked at how the errors are distributed, adding a baseline model for comparison.
• Both Ridge and XGBoost beat the baseline model, and Ridge Regression performs slightly better than the XGBoost model:

  Model     Mean RMSE
  baseline  129.69
  ridge     97.24
  xgboost   98.31

Conclusions
• Built 5 models and an ensemble; tested the XGBoost model against a baseline model and Ridge Regression.
• The most important features influencing prices in NYC are location, room type and the number of bedrooms.
Future Work
• Extract new features, such as amenities.
  • 28. Portfolio Optimization of Cryptocurrencies
Team: Yue Jiang, Ruhui Luan, Jian Xu, Shaoxiu Zhang, Tianwei Zhang. Instructor: Alkis Vazacopoulos. Business Intelligence & Analytics (Poster 22)
Motivation
• Cryptocurrencies are a very important topic for investors and the economic world, and everyone wants to understand whether it is worthwhile and safe to invest in this type of asset.
• We want to create a portfolio optimization model so that we can test the performance of cryptocurrencies and understand the returns and risks we take on with different cryptocurrency portfolios.
Technology
• An API script to crawl cryptocurrency data from the internet.
• Python to construct the portfolio with Monte Carlo simulation and to optimize the Sharpe ratio and the portfolio variance.
• Python’s matplotlib package to visualize the scatter plot of expected return versus expected volatility.
Current & Future Work
• We have price data for 9 different cryptocurrencies from May 24th, 2017 to Oct 3rd, 2017, a very small subset of the data but a good start for testing the investment concepts and optimizing a portfolio of digital currencies.
• Using the Monte Carlo simulation method, we constructed our portfolio model, then calculated the covariance of the 9 cryptocurrencies and the correlation of every pair.
• We used Python to calculate and visualize the Markowitz efficient frontier of our data.
• We maximized the Sharpe ratio to obtain the portfolio with the best risk-adjusted return, and also minimized the portfolio variance to obtain lower risk.
• Our findings show that the volatility of digital currencies is very high, driven by market changes, investor expectations and government regulations.
• In the future, we can include more cryptocurrencies and a longer time range to get more precise results. Moreover, if regulations loosen and investors can transact between digital and real currencies, we can use that transaction data to look for arbitrage opportunities.
Data Snapshot
Our data is crawled from cryptocurrency market websites. There are nearly 1,500 digital currencies in the market; however, for testing the concepts, and due to the limits of our computers, we only obtained market prices for 9 cryptocurrencies, for the period May 24th, 2017 to Oct 3rd, 2017.
Log Returns
As the majority of statistical analysis approaches rely on log returns rather than the absolute time series, we use Python’s NumPy package to compute the log returns of the 9 cryptocurrencies. The returns indicate that the risk of investing in these virtual currencies was very high during the period, because only two of them had positive returns.
Covariance & Correlation
To see the dependence between the 9 digital currencies, we computed the covariance and the correlation between each pair. They are positively correlated, and most of them are tightly related to Bitcoin.
Random Weights
Subject to the constraint that the weights of the portfolio holdings must sum to 1, we randomly assigned weights to the 9 cryptocurrencies.
Monte Carlo Simulation & Sharpe Ratio Optimization
We then use Monte Carlo simulation to generate 4,000 sets of randomly generated weights for the individual virtual currencies and calculate the expected return, expected volatility and Sharpe ratio of each randomly generated portfolio. It is then very helpful to plot these combinations of expected return and volatility on a scatter plot, coloring the data points by the Sharpe ratio of each particular portfolio.
Optimization of the Portfolio Variance
We minimized the portfolio variance so that we can invest in the portfolio with the lowest volatility. The result suggests investing in the first, second, third and last digital currencies.
Maximization of the Sharpe Ratio
By maximizing the Sharpe ratio of the portfolio, the result is to choose only the last digital currency.
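The simulation and the Sharpe-ratio optimization can be sketched as follows. The simulated log returns stand in for the nine coins’ crawled data, and the zero risk-free rate and 365-day annualization are assumptions made only for this illustration.

```python
# Sketch: Monte Carlo portfolio weights plus a max-Sharpe optimization.
# The simulated log returns stand in for the nine cryptocurrencies' data;
# a zero risk-free rate and 365-day annualization are assumptions.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(7)
log_ret = rng.normal(0.001, 0.05, size=(130, 9))     # ~130 days, 9 coins (simulated)
mean_ret, cov = log_ret.mean(axis=0) * 365, np.cov(log_ret.T) * 365

def port_stats(w):
    ret = w @ mean_ret
    vol = np.sqrt(w @ cov @ w)
    return ret, vol, ret / vol                        # return, volatility, Sharpe

# Monte Carlo: 4,000 random weight vectors that sum to 1.
weights = rng.dirichlet(np.ones(9), size=4000)
stats = np.array([port_stats(w) for w in weights])
print("best simulated Sharpe:", stats[:, 2].max().round(3))

# Direct maximization of the Sharpe ratio (minimize its negative).
cons = ({"type": "eq", "fun": lambda w: w.sum() - 1},)
bounds = [(0.0, 1.0)] * 9
opt = minimize(lambda w: -port_stats(w)[2], np.full(9, 1 / 9),
               method="SLSQP", bounds=bounds, constraints=cons)
print("optimal weights:", opt.x.round(3))
```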
  • 29. Predicting Movie Rating and Box Office Gross
Team: Yunfeng Liu, Erdong Xia, Yash Naik. Instructor: Amir H. Gandomi. Statistical Learning & Analytics, Spring 2017 (Poster 24)
Introduction
• The primary purpose of this project was to create models from existing movie data to predict box office gross and movie ratings.
• The first model predicts the gross box office revenue that a movie will generate given inputs such as the genre, the actors involved, the director and many more influencing factors.
• The second model predicts the movie rating considering the genre, the production budget and many more influencing factors.
Models
• Principal Component Analysis
• Multivariate Regression Model
Data Source
• Database from www.kaggle.com: over 5,000 movies from the past 20 years, with 18 variables.
Data Processing
• Delete null and missing values from the raw data (5,500 rows to 3,116 rows).
• Select movies with sufficient reviewer numbers (3,116 rows to 1,558 rows).
• Transform directors’ and actors’ Facebook likes into score levels from 1 to 5.
• Transform the string variable of genres into Boolean variables for 17 specific categories.
• Divide the population evenly into four testing datasets and set the training dataset according to each testing dataset for regression model testing.
Principal Component Analysis
• To investigate the movie rating and gross regression models, principal component analysis is conducted on the 17 genre categories.
• From the scree plot, the first 6 factors are selected as principal components of movie genres; their eigenvalues are greater than 1 and the cumulative variance explained is greater than 0.6.
Regression Model for Rating
• 3 models with different variables entered are fitted on the training datasets and verified on the testing datasets; examining the MSE of each model in each test, model 1 is considered the best.
• In the final regression model for rating, duration, director, year and actors are the explanatory variables, of which year is a negative factor.

  MSE of rating model   test1    test2    test3    test4
  model1                0.3168   0.3487   0.3872   0.4128
  model2                0.3171   0.3627   0.3872   0.4128
  model3                0.3195   0.3492   0.3890   0.4148

  Rating model (all years)   Parameter estimate   Standardized estimate
  Intercept                  41.94717             0
  duration                   0.01275              0.25958
  director                   0.13974              0.17509
  actors                     0.05001              0.06428
  year                       -0.01848             -0.16053

Regression Model for Box Office Gross
• As with the rating model, 3 models with different variables are compared by MSE on the 4 testing datasets, from which model 2 is the best.
• In the final regression model for gross, budget, factors 1, 2, 3 and 5, duration, director and actors are the explanatory variables, of which factors 3 and 5 are negative factors.

  MSE of gross model (x 10^15)   test1   test2   test3   test4
  model1                         2.278   4.017   1.926   1.635
  model2                         2.269   3.960   1.928   1.650
  model3                         2.277   4.011   1.925   1.636

  Gross model (all years)   Parameter estimate   Standardized estimate
  Intercept                 5.1 x 10^6           0
  Factor1                   2 x 10^7             0.25932
  Factor2                   1.9 x 10^7           0.24701
  Factor3                   -5.0 x 10^6          -0.069
  Factor5                   -5.0 x 10^6          -0.0678
  duration                  2.5 x 10^5           0.06465
  director                  4.3 x 10^6           0.06724
  actors                    7.6 x 10^6           0.12248
  budget                    2.2 x 10^-1          0.24059
• The scoring coefficients of the 6 principal components are displayed with the positive indicators of each factor in green cells and the negative indicators in red cells.
• From the PCA we conclude that Factor 1 resembles family animation without thriller and crime, Factor 2 resembles action and sci-fi with thriller, Factor 3 is action without horror, Factor 4 is biography, Factor 5 is crime, and Factor 6 is documentary.
Conclusion
• From the movie rating model, a long movie shot by a famous director is more likely to earn a high rating on reviewer websites, and the genres of the movie are largely irrelevant.
• From the movie gross model, a big-budget movie in the family-animation or action-sci-fi genres is more likely to earn a successful box office. Also, in terms of earning money, actors appear more important than the director in this model.
[Poster figures: a 4-fold train/test split diagram (white = testing, red = training for Test1–Test4) and tables listing the variables entered in each candidate rating model (duration, director, year, actor, budget) and gross model (duration, director, Factors 1–5, actor, budget, year).] 24
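As a sketch of the genre PCA step, assuming the 17 Boolean genre indicators sit in a pandas DataFrame named `genres` (a placeholder name; the poster does not state which tool was used):

```python
import pandas as pd
from sklearn.decomposition import PCA

def genre_factors(genres: pd.DataFrame, n_components: int = 6) -> pd.DataFrame:
    """Fit PCA on the 0/1 genre indicators and return the loadings (scoring coefficients)."""
    pca = PCA(n_components=n_components)
    pca.fit(genres.values)
    loadings = pd.DataFrame(
        pca.components_.T,
        index=genres.columns,
        columns=[f"Factor{i + 1}" for i in range(n_components)],
    )
    # Cumulative variance explained should exceed 0.6 with 6 factors, as reported on the poster.
    print(pca.explained_variance_ratio_.cumsum())
    return loadings
```

Positive loadings correspond to the green cells in the poster's coefficient table and negative loadings to the red cells.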
  • 30. Student Performance Prediction Team: Abineshkumar, Sai Tallapally, Vikit Shah Instructor: Amir H. Gandomi Multivariate Data Analysis Fall, 2017
Introduction & Motivation
• Regression and classification models are developed to help schools predict student performance in the final exam using historical data, including demographics and previous test scores.
• Prediction tools can be used to improve the quality of education and to enhance school resource management.
Data Understanding
• The final grade (G3) of students ranges from 0 to 20, as displayed in the histogram below.
Modeling and Technology
• Two datasets with the same variables are used: the first covers the Mathematics subject and the second the Portuguese subject.
• G3 (final grade), ranging from 0 to 20, is the target.
• A linear regression model is built to predict the final grade of students from the available student information.
• The linear equation is formed by selecting the top 5 variables – those that provide the lowest BIC and Mallow's Cp while providing an optimal adjusted R².
• The most important variables are G1 and G2, the first and second period grades, which are important in predicting the final grade.
• A classification is also done for the grade: when a student's score is more than 10 it is classified as Pass, and otherwise as Fail.
• The classification technique used here is logistic regression.
• Classifying the grades (0–20) into two classes (Pass/Fail) produces a more accurate model, but in this experiment the focus was on building a linear regression model that predicts the final grade as a numeric value, as shown below.
Linear Regression Model
• From the summary of the model we took the top 5 variables to build the linear regression model: Model = lm(G3 ~ G1 + G2 + absences + famrel + age)
• G1, G2, absences, quality of family relationship (famrel) and age of the student are the top 5 independent variables for predicting the final grade (G3).
• R-squared: 0.8599, Adjusted R-squared: 0.846
Decision Tree in R
• The decision tree algorithm used the first period grade, second period grade and mother's job to predict whether a student will pass or fail the final exam.
• The fitted model predicts that a student passes the final exam if their second period grade is >8, their first period grade is >10, or their mother has a job title of "Teacher" or "Other". The accuracy of the model is 87%, which is very close to the models built on Azure.
Result and Conclusion
• The final grades of the students depended mostly on the first and second period grades; demographic information, such as whether a student's father had a job or whether the attendance percentage was high, was also useful in the model-building process.
• The final grade had a decent correlation with the first and second period grades, and these fields were very important in building the model. In the decision tree model above we can see the impact of the mother having a job title of "teacher" or "other". In the linear model, family relationship quality and age are among the top 5 independent variables, which shows the importance of those variables as well.
• From the above results educators can understand the impact of the previous grades (first and second period) and can therefore start planning to support students during those exams. The predicted student performance also helps in implementing this model as a policy in schools.
Reference: http://www3.dsi.uminho.pt/pcortez/student.pdf 25
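The poster's regression was fitted in R with lm(); a Python counterpart using the statsmodels formula API might look like the sketch below, assuming the UCI student data are loaded into a DataFrame `df` with the columns named as in the poster:

```python
import pandas as pd
import statsmodels.formula.api as smf

def fit_final_grade_model(df: pd.DataFrame):
    """OLS counterpart of the poster's R call: lm(G3 ~ G1 + G2 + absences + famrel + age)."""
    model = smf.ols("G3 ~ G1 + G2 + absences + famrel + age", data=df).fit()
    print(model.summary())      # R-squared should land near the ~0.86 reported on the poster
    return model

def add_pass_fail_label(df: pd.DataFrame) -> pd.DataFrame:
    """Pass/Fail label used for the classification step: a score above 10 counts as a pass."""
    df = df.copy()
    df["passed"] = (df["G3"] > 10).astype(int)
    return df
```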
  • 31. Classifying Iris Flowers Team: Xi Chen, Shan Gao, Lan Zhang Instructor: Amir H. Gandomi Business Intelligence & Analytics 26
Purpose
• The sepals and petals of iris flowers are quite distinctive, and there are three different species of iris. To classify iris flowers, we create a k-nearest neighbors (KNN) model that assigns a label or class to a new instance. A regression model is also developed to assign a value to the new instance.
Data Description
• This famous (Fisher's or Anderson's) iris data set gives the measurements, in centimeters, of sepal length, sepal width, petal length and petal width for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor and virginica.
Methodology
Linear Regression
• We plotted the scatter matrix of the iris data before diving into the topic; the data set consists of four measurements (length and width for petals and sepals).
• We performed regression analysis on petals and sepals to assess their significance.
KNN method with R
• The similarity measure is typically expressed by a distance measure such as the Euclidean distance, which we use on the iris dataset.
• We use this similarity value to perform predictive modeling, i.e. classification, assigning a label or class to the new instance.
• We divide the dataset into training and test sets so as to test the accuracy of our classification.
Clustering
• Hierarchical cluster analysis is used to visualize how the clusters are formed and what the relationships among the iris samples are.
Data Analysis Results:
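The KNN analysis on the poster was done in R; a Python equivalent with scikit-learn is sketched below (Euclidean distance is the classifier's default metric, and k = 5 is an assumption since the poster does not state k):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Split the 150 iris measurements into training and test sets, as described above.
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=1, stratify=iris.target
)

knn = KNeighborsClassifier(n_neighbors=5)   # Euclidean distance by default
knn.fit(X_train, y_train)
print("Test accuracy:", accuracy_score(y_test, knn.predict(X_test)))
```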
  • 32. Business Intelligence & Analytics 31 Experiments for Apartment Rental Advertisements Team: Nishant Bhushan, Sharvari Joshi, Nirmala Seshadri Instructor: Chihoon Lee
Resulting Regression Equation
Likelihood of responding to the Ad = 0.6425 − 0.0875 × Amenities[No] − 0.0775 × Public Transport[No]
Expected Response under Best Factor-Level Choice
Likelihood of Responding to the Ad = 0.6425 − 0.0875 (−1) − 0.0775 (−1) = 0.81
Objective
• To determine which factors in apartment rental advertisements contribute to responses from ad viewers.
• Apartment hunting has always been a crucial experience for graduate students. We therefore performed an experiment to see how the response is affected by different factors in a posting on the website.
• The experiment aims to maximize students' response to an apartment rental post on a housing website by determining which factors affect the student's response.
Approach
A customer survey was conducted and 16 sample ads were made. Each ad was sent to 5 customers, and they were asked how likely they would be to respond to that particular ad. The customers were asked to choose one of the following options:
• Extremely Likely (100%)
• Very Likely (80%)
• Moderately Likely (60%)
• Slightly Likely (40%)
• Not Very Likely (20%)
CHOICE OF EXPERIMENTAL DESIGN
• 2_IV^(8−4) fractional factorial design
• NUMBER OF LEVELS FOR EACH FACTOR = 2
• NUMBER OF FACTORS = 8
• NUMBER OF OBSERVATIONS = 16
• RESOLUTION = 4
• No. of replications: 5
• We chose this design because the number of factors is large (8); a full factorial design would result in a large number of interactions and 2^8 = 256 runs, which would be impossible to carry out given the time and budget constraints.
Example of a survey
- 16 advertisement surveys were created.
- Each survey had 5 responses.
- Target audience: students at Stevens Institute of Technology.
THE DESIGN OF EXPERIMENT
Conclusion
• Not including the amenities decreases the chances of a customer responding to the advertisement by 8.75%.
• Not including access to public transport decreases the chances of a customer responding to the advertisement by 7.75%.
• Mentioning the amenities and the access to public transport are the most significant factors, and they should be included in housing ads.
• The posting time of the ad is not a significant factor, so the advertisement can be posted at any time during the day.
• Street maps and pictures are not significant factors either and can be omitted.
Factors and Levels
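The fitted equation above can be turned into a one-line predictor; this small sketch simply re-evaluates the poster's regression with the ±1 factor coding (−1 = information included in the ad, +1 = not included):

```python
def predicted_response(amenities_no: int, public_transport_no: int) -> float:
    """Likelihood of responding to the ad, from the poster's fitted regression equation."""
    return 0.6425 - 0.0875 * amenities_no - 0.0775 * public_transport_no

# Best factor-level choice: include both amenities and public-transport information.
print(predicted_response(-1, -1))   # 0.8075, i.e. roughly the 0.81 reported on the poster
```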
  • 33. Human Resource Retention Team: Aditya Pendyala, Rhutvij Savant Instructor: Amir H Gandomi Multivariate Data Analytics Fall 2017
Introduction
Problem: An organization aims to retain its important resources and reduce the number of employees leaving the company. The HR department has collated employee-related data which can be used to predict which employees may leave. If a prediction model can be designed with high accuracy, the HR department can take the necessary action to prevent critical employees from leaving the company by addressing the variables that are causing the employee to resign.
Data: The data provided is compiled by Human Resources and consists of ten variables: satisfaction level, last evaluation, average monthly hours, time spent at the company, promotion, salary, employee left, work accident, department, and number of projects.
Technology:
• R has been used to develop the logistic regression prediction model. A confusion matrix and ROC curve are used to improve the accuracy of the model and to set the threshold.
• Python has been used to generate data visualizations of the various analyses, including Principal Component Analysis.
Data Analyses
• The heatmap below depicts the correlations between the different features of the data set.
• The count of each of the nine incidences is depicted below.
• The deviance table above indicates the estimates, standard errors and z-values of all variables involved in the dataset.
Logistic Regression Prediction Model
• Number of available instances: 15,000 employees.
• The database is divided into training and testing subsets in a ratio of 4:1.
• The prediction model was created on the training dataset using logistic regression (12,000 employees).
• The model is validated on the employees in the testing set (3,000 employees).
• The confusion matrix is used to show the accuracy of the model:
Confusion matrix at threshold T = 0.5 (actual vs. predicted): TN = 2102, FP = 183, FN = 434, TP = 281.
Confusion matrix at threshold T = 0.3 (actual vs. predicted): TN = 1846, FP = 439, FN = 219, TP = 496.
• The ROC curve is used to set the threshold to 0.3, the optimal trade-off between the True Positive Rate and False Positive Rate, which gave the model an accuracy of 78%.
Principal Component Analysis
With the large number of variables, correlation between different variables is inevitable. With Principal Component Analysis, all variables can be made linearly uncorrelated and redundant variables can be dropped. Based on this graph, it is visible that the 7th component can be removed.
Conclusion
Based on the above analyses, the following observations can be made:
• The model can be used to identify the majority of employees who may leave.
• For the identified employees, the analyzed data – such as promotion, monthly hours and salary – can be used by HR to stop the employee from leaving, as shown in the correlation and data analysis charts.
Future Potential
• Address all possible contributing factors to prevent an employee's departure proactively.
• Use the framework to increase the accuracy of the model.
• Analyze more variables that might affect an employee's leaving, e.g. commute time and training opportunities.
[Poster figure: ROC Curve.] 27
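A minimal sketch of the logistic-regression step with the 0.3 decision threshold, assuming the HR data are in a DataFrame `df` with a binary target column named `left` and with categorical columns already encoded (both assumptions, not details from the poster):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, roc_auc_score

def fit_attrition_model(df: pd.DataFrame, threshold: float = 0.3):
    """Logistic regression with an 80/20 train/test split and a custom decision threshold."""
    X = df.drop(columns=["left"])
    y = df["left"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]
    preds = (proba >= threshold).astype(int)     # threshold of 0.3 chosen from the ROC curve
    print(confusion_matrix(y_test, preds))
    print("AUC:", roc_auc_score(y_test, proba))
    return model
```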
  • 34. Constructing Efficient Investment Portfolios Team: Cheng Yu, Tsen-Hung Wu, Ping-Lun Yeh Instructor: Alkis Vazacopoulos Business Intelligence & Analytics Fall, 2017
Introduction
Individual investors often do not know how to build their investment portfolio, and for this reason they usually rely completely on advice from financial companies. We offer a customized investment planner that advises what quantities of each stock to buy in order to achieve a given target return rate with the minimum portfolio variance.
Technology
• Using the Capital Asset Pricing Model (CAPM) to generate expected returns by considering systematic risk.
• IBM DOcplex Modeling for Python to execute the quadratic programming that optimizes the objective of the mathematical model.
• Python for visualization using pygal, plus Microsoft Power BI.
Modeling
Capital Asset Pricing Model (CAPM):
E(R_i) = R_f + β_i [E(R_M) − R_f]
where E(R_i) is the expected return of stock i, R_f is the risk-free rate, β_i is the beta of stock i (a measure of its systematic risk), E(R_M) is the expected return of the market portfolio, and E(R_M) − R_f is the market risk premium, a measure of the excess return of the market portfolio over the risk-free rate.
Optimization Model:
P_i = current price of stock i; Q_i = quantity of stock i; I_i = investment amount of stock i, with I_i = Q_i × P_i; W_i = weight allocated to stock i in the portfolio, with W_i = I_i / Σ_{i=1..n} I_i; R_i = yearly return of stock i; VC = covariance matrix of the returns, with entries Cov(R_i, R_j) = Σ (R_i − R̄_i)(R_j − R̄_j) / (N − 1); σ_p² = portfolio variance, with σ_p² = Wᵀ × VC × W; R_p = portfolio return; B_p = actual portfolio budget; B_I = expected disposable income.
Objective: minimize σ_p²
Subject to: R_p = Σ_{i=1..n} R_i × W_i ≥ target return rate, and B_p ≤ B_I.
Analysis / Simulation
[Poster charts: CAPM model and optimization model outputs – actual budget spent and number of shares held in specific stocks. Example profile – Age: Under 25; Disposable Income: 4,361 USD; Actual expense: 4,342 USD; Target Return: 30%; Portfolio Variance: 0.66; Number of Stocks Selected: 20.]
Conclusion
We incorporated CAPM in our model, as it is broadly used among financial experts as an evaluation method for future stock prices. According to the simulation results, the higher the target return rate, the higher the risk of the portfolio. Furthermore, the model recommends a particular combination of stocks with the minimum portfolio risk for the given target return rate. Investors can use the results of this model to find their target stocks and customize their stock portfolios. 28
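The team solved the quadratic program with IBM DOcplex; purely as an illustration of the same minimum-variance idea, here is a simplified, weight-based sketch using scipy (the expected returns, covariance matrix and target return would come from the CAPM step and are placeholders here):

```python
import numpy as np
from scipy.optimize import minimize

def min_variance_weights(exp_returns: np.ndarray, cov: np.ndarray, target_return: float) -> np.ndarray:
    """Minimize portfolio variance W'·VC·W subject to meeting the target return and full investment."""
    n = len(exp_returns)
    constraints = [
        {"type": "ineq", "fun": lambda w: w @ exp_returns - target_return},  # R_p >= target return
        {"type": "eq", "fun": lambda w: w.sum() - 1.0},                      # weights sum to 1
    ]
    result = minimize(
        lambda w: w @ cov @ w,                 # portfolio variance
        x0=np.full(n, 1.0 / n),
        bounds=[(0.0, 1.0)] * n,               # long-only positions
        constraints=constraints,
    )
    return result.x
```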
  • 35. A Tool for Discovering High Quality Yelp Reviews Team: Zijing Huang, Po-Hsun Chen, Hao-Wei Chen, Chao Shu Instructor: Rong Liu Business Intelligence & Analytics 32
Motivation
• For customers: customers are more likely to read reviews with details instead of reviews that were written just to vent emotions.
• For companies:
o Objective reviews are helpful in improving their products.
o High-quality reviews can increase customer engagement and attract more users.
Introduction
• Objective: a tool for discovering high-quality reviews.
• High-quality review: objective and supplies insightful details about the user experience.
• Dataset: Yelp hotel and restaurant reviews
o Raw data: 4,736,897 reviews
o Data with labeled aspects and sentiment: 1,132 sentences
• Methodology: deep learning with Convolutional Neural Networks (CNN) and word embeddings
Approach
1. Use the Yelp raw data to train word vectors that capture word semantics
2. Create a CNN to identify aspects in every sentence of a review
3. Train another CNN to detect the sentiment of each sentence
4. Build a neural network model to predict the quality of a review based on the aspects and sentiment of its sentences
5. Compare this approach with other models (SVM, Naïve Bayes, …) and analyze the pros and cons of each model
6. Visualize the result through a user interface and publish a Python package (or a RESTful API) for third-party use
Methodology
[Poster figures: example of a high-quality vs. a low-quality review, highlighting the useful information we extracted from the review versus the non-useful information.]
Result
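A minimal Keras sketch of the kind of sentence-level CNN described in the approach; the layer sizes, kernel width and optimizer here are assumptions, not the architecture reported on the poster:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_sentence_cnn(vocab_size: int, embedding_dim: int, n_classes: int) -> tf.keras.Model:
    """1-D CNN over word embeddings for sentence-level aspect (or sentiment) classification."""
    model = models.Sequential([
        layers.Embedding(vocab_size, embedding_dim),   # pretrained Yelp word vectors could be loaded here
        layers.Conv1D(128, kernel_size=3, activation="relu"),
        layers.GlobalMaxPooling1D(),
        layers.Dense(64, activation="relu"),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    return model
```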
  • 36. Analysis of Diabetes among Pima Indians – What factors affect the occurrence of diabetes? Team: Junjun Zhu, Jiale Qin, Yi Zhang Instructor: Amir H. Gandomi Multivariable Data Analysis Fall, 2017
Objectives:
• What factors affect the occurrence of diabetes?
• How best to classify a new observation?
• What is the best model for this data?
Data Information
Type: Classification
Data Source: https://archive.ics.uci.edu/ml/datasets/pima+indians+diabetes
Data Preparation
Figure 1 shows the distribution of the data, including the people who are diabetic and those who are not. Figure 2 shows the people who have diabetes, and Figure 3 the people who are not diabetic. We can see that the people without diabetes are more concentrated in the center – these people are younger and their index values are more typical.
Modeling
1. Simple Logistic Regression
2. Logistic Regression
3. K Nearest Neighbor
Results & Evaluation
Model 1: Baseline Model – the AUC value is 0.8505833, the F1 score is 0.6394558, and the recall (sensitivity) is quite low at 0.5875.
Model 2: Explanatory Model – Logistic Regression – the AUC value is 0.8336667, the F1 score is 0.528, and the recall (sensitivity) is quite low at 0.4125.
Model 3: Predictive Model – KNN – the AUC value is 0.7295417, the F1 score is 0.5384615, and the recall (sensitivity) is quite low at 0.525.
Conclusion
We have used three different models to study the data. From the results, we found that the simple logistic regression model is the best of the three, because its sensitivity, accuracy and F1 values are higher than those of the other methods. 29
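A small helper for the evaluation step, reporting the same three metrics used on the poster (AUC, F1 and recall/sensitivity); the variable names are placeholders:

```python
from sklearn.metrics import f1_score, recall_score, roc_auc_score

def report_metrics(name: str, y_true, y_pred, y_proba) -> None:
    """Print AUC, F1 and recall (sensitivity) for one fitted classifier."""
    print(
        f"{name}: AUC={roc_auc_score(y_true, y_proba):.4f}, "
        f"F1={f1_score(y_true, y_pred):.4f}, "
        f"recall={recall_score(y_true, y_pred):.4f}"
    )

# e.g. report_metrics("Simple logistic regression", y_test, predictions, probabilities)
```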
  • 37. Google Online Marketing Challenge: Aether Game Cafe Team: Ephraim Schoenbrun, Jaya Prasad Jayakumar, Sunoj Karunanithi, Saketh Patibandla Instructor: Chihoon Lee Factors & Sample Ads Business Intelligence & Analytics 30
Client and Market Analysis
• Aether Game Cafe (AGC) is a blend of a traditional coffee shop and a board game play centre located in Hoboken, New Jersey.
• AGC relies only on 'word of mouth' for its marketing.
User Flow
GOMC
• The Google Online Marketing Challenge is a unique opportunity for students to experience and create online marketing campaigns using Google AdWords.
• With a $250 AdWords advertising budget provided by Google, we developed and ran an online advertising campaign for a business over a three-week period.
Customer Analysis – Google Analytics
[Poster charts: visitor breakdown for Hudson County and New York.]
Experiments and Results
[Poster figures: significant factors, main effects and interaction effects, and the most successful ads.]
Conclusion
• Experimental design helped us achieve our target, which eventually ranked our campaign in the top 3% worldwide.
  • 38. The World's Best Fitness Assistant Team: Anand Rai, Jaya Prasad Jayakumar, Saketh Patibandla Instructor: Christopher Asakiewicz Business Intelligence & Analytics 33
[Poster screenshots: the Fitness Assistant welcome screen, the intelligence of our bot – the same questions receiving different answers based on context – the chatbot framework, and the text- and speech-enabled assistant; caption labels: Greetings 1, Greetings 2, Questions.]
Introduction
• We are fitness enthusiasts but never found a good application that guides us while working out.
• A few such applications exist, but they have the drawback of being static chatbots that answer only predefined questions.
• Our Fitness Assistant is AI-driven and uses Natural Language Processing to understand questions and give the best answer.
• This is a prototype, and we are going to build the application out to cover all exercises and food habits.
Natural Language Processing
• Natural Language Processing is the key to an efficient chatbot.
• Every question is stored for future analysis, and the knowledge base provides the answer even if the question is not hard-coded.
• The NLP searches for the entity in the question, which is similar to the noun in a sentence, and narrows down to the answer.
Future Scope
• This application will be improved by adding all varieties of exercises and multiple knowledge-base sources.
• This app will be a one-stop solution for fitness enthusiasts covering exercises and nutrition; we plan to build an in-app module to track user behavior so that our bot can suggest what to consume for the betterment of their lives.
  • 39. Zillow's Home Value Prediction Team: Wenzhuo Lei, Chang Xu, Juncheng Lu Instructor: Chris Asakiewicz Business Intelligence & Analytics
Background & Objectives
It is important for homeowners to have a trusted way of monitoring their assets. The "Zestimate" was created to estimate home values based on a large amount of data, and Zillow published a competition to improve the accuracy of its house price predictions.
Challenge
The data offered by Zillow include a large number of missing values across 57 features. The accuracy of the prediction is primarily affected by the choice of data.
Data Imputation
To start, we checked the log error (the target value), which follows a normal distribution. We also checked the frequency of trades by day and month; the date is not a significant factor for trading houses, and people tend to trade houses in summer. To impute the data, we separated the features into several types:
• Features with more than 90% missing values → deleted.
• Features with no missing values → kept.
• Binary features → missing values filled with 0.
• Irrational features → missing values filled with -1.
• Features with few missing values → filled with the mean of the feature.
• Special feature (total living area of the home) → imputed with KNN based on the number of bedrooms and bathrooms (more bedrooms and bathrooms usually mean a bigger area).
• Special features that depend on longitude and latitude → missing values filled with KNN based on longitude and latitude.
To avoid overfitting, we chose to reduce the number of features. To better select the features, we checked their importance and applied a regression model. The R² is low, which may be because of multicollinearity; to reduce its effect, we used Variance Inflation Factors to drop features, leaving 10 features.
Modeling
As mentioned above, we tried OLS and the R² was quite low. We tried Random Forest as the second method and it gave a reasonable score. After that we used Gradient Boosting since it had the lowest Mean Squared Error and Mean Absolute Error, and to be more accurate we applied XGBoost as the last modeling method. 34
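A sketch of the Variance Inflation Factor step using statsmodels; the iterative drop-the-worst scheme and the threshold of 10 are assumptions, since the poster only states that VIF was used to reduce the feature set to 10 columns:

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def drop_high_vif(X: pd.DataFrame, threshold: float = 10.0) -> pd.DataFrame:
    """Iteratively drop the feature with the highest VIF until all VIFs fall below the threshold."""
    X = X.copy()
    while X.shape[1] > 1:
        vifs = pd.Series(
            [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
            index=X.columns,
        )
        worst = vifs.idxmax()
        if vifs[worst] < threshold:
            break
        X = X.drop(columns=[worst])
    return X
```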
  • 40. Web Traffic Time Series Forecasting Team: Jujun Huang, Peimin Liu, Luyao Lin Instructor: Alkis Vazacopoulos Business Intelligence & Analytics http://www.stevens.edu/howe/academics/graduate/business-intelligence-analytics
Introduction
• Our group focuses on the problem of forecasting the future values of multiple time series.
• The training dataset contains the views of 145,063 Wikipedia articles.
• Each of these time series represents the daily views of a different Wikipedia article, starting from July 1st, 2015 up until December 31st, 2016.
• Our objective is to predict the daily views between January 1st, 2017 and March 1st, 2017.
Current & Future Work
• We used a heat-map and the Fast Fourier Transform in the data visualization.
• We used the ARIMA model in the data modeling part to predict the data from January 1st, 2017 to March 1st, 2017.
• In the future, we will combine the ARIMA model with other models to predict the time series and improve our accuracy.
• We are going to predict the views of all 145,063 Wikipedia articles from January 1st, 2017 to November 1st, 2017.
Methodology
• ARIMA model: we use the autoregressive integrated moving average (ARIMA) model to forecast the data.
• If the autoregressive polynomial of an ARMA(p′, q) process has a unit root (a factor (1 − L)) of multiplicity d, then the ARIMA(p, d, q) process expresses this polynomial factorization property with p = p′ − d and is given by:
(1 − Σ_{i=1}^{p} φ_i L^i)(1 − L)^d X_t = (1 + Σ_{i=1}^{q} θ_i L^i) ε_t
• The ARIMA(p, d, q) process with drift δ can be generalized as:
(1 − Σ_{i=1}^{p} φ_i L^i)(1 − L)^d X_t = δ + (1 + Σ_{i=1}^{q} θ_i L^i) ε_t
• Here t is an integer index and the X_t are the real-valued time series observations. L is the lag operator, the φ_i are the parameters of the autoregressive part of the model, the θ_i are the parameters of the moving average part, and the ε_t are error terms, generally assumed to be independent, identically distributed variables sampled from a normal distribution.
Data Modeling
• We can use the ARIMA model without transformations if our time series is stationary.
• In our case, we take a decomposition into three parts: trend, seasonality and residual. The sum of these three parts is equal to our observation.
• Below is one of the results of forecasting the views of one of the articles.
Analysis
Here are the results of the data visualization:
• From the heat-map, we can see a huge spike in web traffic at the end of July and at the beginning and middle of August. However, we cannot find any periodicity in the heat-map.
• English articles have the highest visits, and the trend of the Russian articles is similar to that of the English ones.
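A minimal statsmodels sketch of fitting an ARIMA model to one article's daily views and forecasting the next two months; the (p, d, q) order below is an assumption, not the order used by the team:

```python
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

def forecast_article_views(series: pd.Series, order=(7, 1, 1), horizon: int = 60) -> pd.Series:
    """Fit ARIMA(p, d, q) to a daily-views series and forecast the next `horizon` days."""
    fitted = ARIMA(series, order=order).fit()
    return fitted.forecast(steps=horizon)    # ~60 days covers January 1 - March 1, 2017
```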
  • 41. Portfolio Optimization with Machine Learning Team: Chao Cui, Huili Si, Luotian Yin, Qidi Ying, Yinchen Jiang Instructor: Alkis Vazacopoulos Business Intelligence & Analytics 36
Motivation
It is currently difficult for an investor to make any inferences about future prices, especially when volatility makes them uncomfortable. With classical methods, we can only make estimates using historical data. To explore this problem, we decided to use machine learning techniques to create portfolios and attempt to make the capital distribution more realistic.
Technology
• Using Python to collect one year of S&P 500 index stock data from Yahoo Finance, and setting a threshold to select the 50 stocks with the highest return and lowest risk into the stock pool.
• Using Python to build the machine learning models and to perform training and testing.
• Applying the IBM CPLEX module API for linear/non-linear programming and optimization.
• All calculations are based on monthly intervals.
Future Work
• We could not constrain the size of the optimized portfolio, which might lead to an impractically large portfolio; further study of IBM CPLEX is needed.
• The machine learning model needs substantial improvement; future work should build more accurate models with a wider range of information.
• We could generate more attributes of the stocks, such as industry, location and reputation, to cater to customer preferences.
• We could make a more dynamic system that takes price changes and trading fees into consideration.
[Poster figures: workflow – portfolio optimization without prediction; machine learning for prediction (add features → Ordinary Least Squares prediction → ensemble modeling → final prediction); training and testing results; and the efficient frontier for portfolio optimization without prediction, where portfolio risk increases slowly until the portfolio return reaches about 0.05, with the number of top N stocks set as a constraint.]
  • 42. Developing a Supply Chain Methodology for an Innovative Product Team: Akshay Sanjay Mulay Instructors: Alkis Vazacopoulos and Chris Asakiewicz Business Intelligence & Analytics 37
Background
• The health club economy is a multibillion-dollar endeavor that has remained static for many years.
• Nova Fit looks to revolutionize current fitness equipment by providing fitness enthusiasts with safer and more ergonomic options while they work out.
• The aim is to innovate every component of health club fitness.
Technology
• Google Analytics for data storage and Excel Solver for calculations.
• Tableau for visualizations and graphical analysis.
• Monte Carlo simulation to calculate the worst-case, best-case and average-case scenarios during the actual network analysis.
[Poster charts: market share in terms of revenue, number of bars to be produced, and customer survey results.]
Current Scenario
• The typical round barbell is unnatural to hold and creates inefficiencies that hinder workouts.
• According to an article on fitness in the NY Times, more than 90% of injuries are caused by lifting free weights.
Nova Fit – Revolutionary Barbell
• Nova Fit's grip solves the problem by conforming more naturally to the hands, improving confidence and performance. The spindle design also eliminates the need for the clip function that is required with current equipment.
Target Customers
• Initially, the target customers will be gym and fitness club owners in the Northeast region of the United States.
• We also look to sell the product to home fitness enthusiasts.
Sample New York Fitness Center Club
Forecast and Finance Model
• The statistics used for calculating Nova Fit's market share are taken from IBIS. The primary competitor is Invanko, which sells its barbell product at a price of $1,250.
Future Scope
• Build a strong supplier, distributor and manufacturer network when the actual product starts selling in the market.
• Determine the suppliers based on cost, lead times and availability of the components.
• Use the actual data to realign the production and sales strategy in the market.
• Determine the modes of selling the product and develop a comprehensive supply chain strategy using various simulations and scenarios.
  • 43. Bike Sharing Optimization Team: Jiahui Bai, Yuankun Nai, Yuyan Wang, Yanru Zhou Instructor: Alkis Vazacopoulos Business Intelligence & Analytics
Introduction
• A bike-sharing system allows people to rent a bicycle at one of the automatic rental stations scattered across the area, use it for a short journey and return it at any other station in the area.
• Ford Bike: located in the Bay Area, the system consists of 700 bikes and 70 stations across San Francisco and San Jose.
• Imbalance: the difference between the number of incoming bikes and the number of outgoing bikes at a station within a specific time period.
Problem
• Visualize trip routings on a map and the peak hours of workdays and weekends.
• Optimize the inventory level of bikes at stations during peak hours.
• Minimize the cost of deploying bikes between the stations.
• Improve the utilization of each bike by reducing the number of bikes that are not used in peak hours.
• Predict the demand for bikes under different weather conditions.
Model
• Linear program: we selected 12 months of peak-hour trip data for the selected stations to build the model, optimizing the inventory level of bicycles and minimizing the total cost of deployment with Excel Solver.
Total cost of deployment = ∑ ( cost of configuring one bike in an area × number of bikes configured in that area )
• Linear regression: to predict the demand under different weather conditions.
• Logistic regression: to determine how weather conditions affect the demand.
Data visualization
Peak hour selection – Time period: Aug 2014 – Aug 2015. Weekday discrimination: using the WEEKDAY function combined with IF in Excel, IF(WEEKDAY(start_date,2)>5,"weekend","workday"). Time grouping: using group and outline in a PivotTable, starting at 0:00 and ending at 24:00 with a one-hour step, giving 24 time periods in total.
[Poster charts: top 10 popular stations by start/end trip counts (stations 70, 69, 50, 55, 74, 61, 67, 60, 65, 77 and 64); hourly trip counts on weekends (peaking around 3,999 starts and 3,836 ends) and on workdays (peaks of roughly 51,973 / 48,410 / 47,161 / 46,729 trips at the rush hours); weather effect on the number of trips (FINE 300,801 starts / 301,245 ends; FOG 33,303 / 34,324; FOG-RAIN 6,578 / 7,183; RAIN 43,056 / 42,352; RAIN-THUNDERSTORM 1,629 / 1,247); GoFord bike trip pattern (San Francisco).]
Future work
• Improve the utilization of bicycles with the linear programming model.
• Optimize the bike deployment strategy.
• Predict the demand under different weather conditions with machine learning algorithms. 38
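As a sketch of the linear program (the poster solved it with Excel Solver), here is a scipy version of the same cost function; the per-station configuration costs and peak-hour demands below are made-up placeholders, while the fleet size of 700 bikes comes from the poster:

```python
import numpy as np
from scipy.optimize import linprog

cost = np.array([2.0, 3.5, 1.5, 4.0])      # cost of configuring one bike at each station (assumed)
demand = np.array([120, 80, 150, 60])      # bikes needed at each station in the peak hour (assumed)
fleet_size = 700                           # total bikes in the system (from the poster)

# Minimize total deployment cost = sum(cost_i * bikes_i)
# subject to bikes_i >= demand_i at every station and sum(bikes_i) <= fleet_size.
n = len(cost)
A_ub = np.vstack([-np.eye(n), np.ones((1, n))])
b_ub = np.concatenate([-demand, [fleet_size]])

res = linprog(c=cost, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * n, method="highs")
print("Bikes per station:", res.x, "Total cost:", res.fun)
```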