Stevens Institute of Technology
School of Business
Business Intelligence & Analytics Program
A Snapshot of Data Science
Student Poster Presentations
Corporate Networking Event – November 28, 2017
This document reproduces the posters presented by students of the Business Intelligence
& Analytics (BI&A) program at a Corporate Networking event held at Stevens Institute on
November 28, 2017. The event was attended by over 80 company representatives and
approximately 150 students and faculty members.
The posters were presented by students at all stages in their academic programs from
their first semester through their final semester. The research described in each poster
was conducted under the guidance of a faculty member. The broad range of research
topics and methodologies exhibited by the posters in this document reflects the diversity
of faculty research interests and the practical nature of our program.
For background, the first poster describes the BI&A program. Founded in spring 2012
with just 4 students, the program now has over 220 full-time and part-time master of
science students and 80 graduate certificate students, and is ranked 7th in the nation by The
Financial Engineer. As illustrated in the first poster, a distinctive feature of the program is
its three-layer structure. In the professional skills layer, business and communication
skills are developed through workshops, talks by industry leaders and an active student
club. In the second layer, the 12-course curriculum covers the concepts and tools
associated with database management, data warehousing, data and text mining, web
mining, social network analytics, optimization and risk analytics. The curriculum
culminates in a capstone course in which students work on a research project – often in
conjunction with industry associates. Finally, in the technical skills layer, students attend
a series of free weekend boot camps that provide training in industry-standard
software packages, such as SQL, R, SAS, Python and Hadoop.
The 76 student posters in this document represent a broad array of research projects. We
are proud of the quality and innovativeness of our students’ research and of their hard
work and enthusiasm without which this event would have been impossible.
Chris Asakiewicz, Ted Stohr and Alkis Vazacopoulos
Business Intelligence & Analytics Program
Stevens Institute of Technology
www.stevens.edu/business/bia
Foreword
INDEX TO POSTERS
* Indicates the poster was accompanied by a live demo
No. Title Student Authors
0 BI&A Curriculum The faculty
1*
Google Online Marketing Challenge 2017 –
True Mentors AdWords Campaign
Philippe Donaus, Rush Kirubi, Salvi Srivastava,
Thushara Elizabeth Tom, Archana Vasanthan
2
Multivariate Testing to Improve a Non-Profit’s
Home Page Rush Kirubi, Thushara Elizabeth Tom
3 Analyzing the Impact of Earthquakes Fulin Jiang, Ruhui Luan, Zhen Ma, Gordon Oxley
4* Real Time Health Monitoring System
Ankit Kumar, Khushali Dave, Shruti Agarwal,
Nirmala Seshadri
5 Zillow Home Value Prediction
Yilin Wu, Zhihao Chen, Jiaxing Li, Zhenzhen Liu,
Ziyun Song.
6
Analysis of Opioid Prescriptions and Drug-
related Overdoses
Nishant Bhushan, Sunoj Karunanithi, Pranjal
Gandhi, Raunaq Thind
7 UK Traffic Flow & Accidents Analysis
Xiaoming Guo, Weiyi Chen, Dingmeng Duan, Jian
Xu, Jiahui Tang
8 US Permanent Visa Application Process
Jing Li, Qidi Ying, Runtian Song, Jianjie Gao,Chang
Lu
9 Climate Change Since 1770
Yilin Wu, Ziyun Song, Zhihao Chen, Jiaxing Li,
Zhenzhen Liu
10
Predicting Customer Conversion Rate for an
Insurance Company
Yalan Wang, Cong Shen, Junyuan Zheng, Yang
Yang
11
Determine Attractive Laptop Features Among
College Students
Liwei Cao, Gordon Oxley, Salman Sigari, Haoyue
Yu
12
Clustering Large Cap Stocks During Different
Phases Of Economic Cycle Nikhil Lohiya, Raj Mehta
13
Predicting interest level in apartment listings
on Renthop Cristina Eng, Ying Liu, Haoyue Yu, Salman Sigari,
14
Deep Learning Vs Traditional Machine
Learning Performance on NLP Problem Abhinav S Panwar
15 Wine Recognition Biying Feng, Ting Lei, Jin Xing
16
Projects Assignment Optimization Based on
Human Resource Analysis Jiarui Li, Siyan Zhang, Siyuan Dang, Xin Lin, Yuejie Li
17*
Short Term Load Forecasting Using Artificial
Neural Networks Bhargav Kulkarni, Ephraim Schoenbrun
18 Optimizing Travel Routes Ruoqi Wang, Yicong Ma, Jinjin Wang, Shuo Jin
19 Exploring and Predicting Airbnb Price in NYC Ruoqi Wang, Yicong Ma, JIahui Bi, Xin Chen
20*
The Public’s Opinion of Obamacare: Twitter
Data Analyses Saeed Vasebi
21 S&A Processed Seafood Company
Redho Putra, Neha Bakhre, Harsh Bhotika, Ankit
Dargad, Dinesh Muchandi
22 Portfolio Optimization of Cryptocurrencies
Yue Jiang, Ruhui Luan, Jian Xu, Shaoxiu Zhang,
Tianwei Zhang
23
Fantasy Premier League Soccer Team
Optimization Haoran Du, Xiang Yang, Ruiwen Shi
24
Predicting Movie Rating and Box Office
Gross by PCA and LR Model Yunfeng Liu, Erdong Xia, Yash Naik
25 Student Performance Prediction Abineshkumar, Sai Tallapally, Vikit Shah
26 Classifying Iris Flowers Xi Chen, Shan Gao, Lan Zhang
27
Using Data Analytics to Retain Human
Resources Aditya Pendyala, Rhutvij Savant
28
Using Financial Models to Construct Efficient
Investment Portfolios Cheng Yu, Tsen-Hung Wu, Ping-Lun Yeh
29 Pima Indians Diabetes Analysis Junjun Zhu, Jiale Qin, Yi Zhang
30
Google Online Marketing Challenge: Aether
Game Café
Jaya Jayakumar, Sunoj Karunanithi, Saketh
Patibandla, Ephraim Schoenbrun
31
Experiment for Apartment Rental
Advertisements
Shavari Joshi, Nirmala Seshadri, Nishant
Bhushan
32
A Tool For Discovering High Quality Yelp
Reviews
Zijing Huang, Po-Hsun Chen, Hao-Wei Chen,
Chao Shu
33*
Worlds Best Fitness Assistant Cognitive
Computing
Anand Rai, Jaya Prasad Jayakumar, Saketh
Patibandla
34 Zillow’s Home Value Prediction Wenzhuo Lei, Chang Xu, Juncheng Lu
35 Web Traffic Time Series Forecasting Jujun Huang, Peimin Liu, Luyao Lin
36*
Portfolio Optimization with Machine
Learning
Chao Cui, Huili Si, Luotian Yin, Qidi Ying, Yinchen
Jiang
37
Developing a Supply Chain Methodology for
an Innovative Product Akshay Sanjay Mulay
38 Bike Sharing Optimization
Jiahui Bai, Yuankun Nai, Yuyan Wang, Yanru Zhou
39 Crime in the U.S Minyan Shao, Yuyan Wang, Yuankun Lin
39A Porto Seguro Safe Driver Prediction
Xiaoming Guo, Weiyi Chen, Dingmeng Duan,
Jian Xu, Jiahui Tang
40
Identifying Mushroom: Safe to eat or deadly
poison
Xiang Yang, Haoran Du, Shikuang Gao, Ruiwen
Shi
41* Student Grade Prediction Gaurav Sawant, Vipul Gajbhiye, Vikram Singh
42 NBA 2018 Play-offs & Champion Prediction Amit Kumar, Jayesh Mehta , Xiaohai Su
43
AI integrated interactive search interface for
Biomedical literature search
Akshay Kumar Vikram, Vishnu Pillai, Satya
Peravali
44*
Text Mining of intellectual contribution
information from a corpus of CVs
Nishant Bhushan, Neha Mansinghka , Nirmala
Seshadri, Arpit Sharma
45 Analysis on Chase Bank Deposits Lulu Zhu, Xinlian Huang, Junxin Xia, Rauhl Nair
46 Predicting Songs Hit on Billboard Chart Siyan Zhang, Yifeng Liu, Yuejie Li
47
Predicting Results of Premier League
Contest Hantao Ren, Lanyu Yu, Siyuan Dang, Jiarui Li
48
NLP Meets Yelp Recommendation System
for Restaurant Rui Song
49
WSDM — KKBox’s Churn Prediction
Challenge Caitlyn Garger, Yina Dong, Shuo Jin
50 Data Mining of Video Game Sale Xin Lin, Fanshu Li, Jingmiao Shen
51 DengAI: Predicting Disease Spread Vicky Rana, Pradeepkumar Prabakaran
52 Predict Tesla Model 3 Production Volume
Wangming Situ, Liwei Cao, Bohong Chen, Tianyu
Hao
53 Vehicle Routing Problem using NYC TLC Data
Adrash Alok, Garvita Malhan, Ephraim
Schoenbrun, Abhir Yadava
54 Credit Rating for a Lending Club Rui Song, Huili Si, Xiao Wan, Lulu Hu
55 Routify: Personalized Trip Planning
Minzhe Huang, Bowan Lu, Jingmiao Shen,
Xiaohai Su, Abhitej Kodali
56 Uncover World Happiness Patterns Rui Song, Xiao Wan, Xiaoyu Zhang
57* Data Centers – Where to Locate?
Smriti Vimal, Sanjay Pattanayak, Kumar
Bipulesh, Nitin Gullah, Souravi Sudamme
58 Drone Optimization in Delivery Industry Ni Man, Xinlian Huang, Xuanyan Li
59
Performance Evaluation of Machine
Learning Algorithms on Big Data using Spark
Neha Mansinghka, Madhuri Koti, Prathamesh
Parchure
60*
Duck Wisdom: A Personal Portfolio
Optimization Tool
Taranpreet Singh, Shivakumar Barathi, Ramona
Lasrado, Nikhil Lohiya
61 Porto Seguro’s Safe Driver Prediction
Boren Lu, Lanshi Li, Xiaoming Guo, Dingmeng
Duan
62
Hospital Recommendation System for
Patients Abdullah Khanfor, Danilo Brandao and Pedro Sa
63
Customer Segmentation for B2B Sale of
Fitness Gear Juhi Gurbani, Arpit Sharma, Neha Mansinghka
64
Predicting Vehicle Collisions & Dynamic
Assignment of Ambulances in NYC
Divya Rathore, Dhaval Sawlani, Nitasha Sharma,
Shruti Tripathi
65 Iceberg Classifier Challenge Chang Lu, Jing Li, Luotian Yin, Runtian Song,
66 Predicting Movie Success
Jialiang Liu, Huaqing Xie, Xiaohai Su, Liang Ma,
Lanjun Hui
67 Stock Prediction Based on News Titles
Jianuo Xu, Minghao Guo, Simin Liang, Yudong
Cao, Yunzhe Xu
68
How Consumer Reviews Affect a Star’s
Ratings
Jinjin Li, Prabhjot Singh, Yutian Zhou, Xuetong
Wei, Xiaoyu Zhang
69
Student Alcohol Consumption: Predicting
Final Grades Ping-Lun Yeh, Zhuohui Jiang, Gaurang Pati
70 Mobile Banking Fraud Detection
Junyuan Zheng, Ke Cao, Miaochao Wang, Tuo
Han
71 Subway Delay Dilemma
Smit Mehta, Nishita Gupta, Matthew Miller,
Jianfeng Shi
72
Integrated Digital Marketing Studies on
Hoboken Local Restaurant
Shuting Zhang, Yalan Wang, Haoyue Yu, Liyu Ma,
Christina Eng
73* AI Academic Advisor Vaibhav Desai, Piyush Bhattad
74 JFK Airport – Flight Delay Analysis
Praveen Thinagarajan, Arun Krishnamurthy,
Thushara Elizabeth Tom, Sunoj Karunanithi
75
Machine Learning on Highly Imbalanced
Manufacturing Data Set Liyu Ma
76* Duck Finder Salman Sigari, Shankar Raju and Team
Master of Science
Business Intelligence & Analytics
Business Intelligence & Analytics
http://www.stevens.edu/bia
CURRICULUM
Organizational Background
• Financial Decision Making
Data Management
• Strategic Data Management
• Data Warehousing & Business Intelligence
Data and Information Quality *
Optimization and Risk Analysis
• Optimization & Process Analytics
Risk Management Methods & Apps.*
Data Mining
• Knowledge Discovery in Databases
Statistical Learning & Analytics*
Statistics
• Multivariate Data Analytics
• Experimental Design
Social Network Analytics
• Network Analytics
• Web Mining
Management Applications
• Marketing Analytics*
• Supply Chain Analytics*
Big Data Technologies
• Data Stream Analytics*
• Big Data Seminar*
• Cognitive Computing*
Practicum
Projects with industry
* Electives - Choose 2 out of 8
Social Skills
Disciplinary Knowledge
Technical Skills
• Written & Oral Skills Workshops
• Team Skills
• Job Skills Workshops
• Industry speakers
• Industry-mentored projects
• SQL, SAS, R, Python, Hadoop
• Software “Boot” Camps
• Course Projects
• Industry Projects
Curriculum Practicum
MOOCs
Infrastructure
Laboratory Facilities
• Hadoop, SAS, DB2, Cloudera
• Trading Platforms: Bloomberg
• Data Sets: Thomson-Reuters, Custom
PROGRAM ARCHITECTURE
Demographics
2013F 2014F 2015F 2016F 2017F
Applications 101 157 351 591 725
Accepted 48 84 124 287 364
Rejected 34 34 186 257 307
In system/other 19 39 41 46 53
Admissions
Full-time/Part-time
Full-time 201
Part-time 21
Gender
Female 41%
Male 59%
Placement
Starting Salaries (without signing bonus):
$65 - 140K Range
$84K Average
$90K (finance and consulting)
Data Scientists: 23%, Data Analysts: 30%, Business Analysts: 47%
Our students have accepted jobs at, for example:
Apple, Bank of America, BlackRock, Cablevision, Dun &
Bradstreet, Ernst & Young, Genesis Research, Jefferies,
Leapset, Morgan Stanley, New York Times, Nomura,
PricewaterhouseCoopers, RunAds, TIAA-CREF, Verizon Wireless
Hanlon Lab -- Hadoop for Professionals
The Master of Science in Business Intelligence and Analytics (BI&A) is a
36-credit STEM program designed for individuals who are interested in
applying analytical techniques to derive insights and predictive intelligence
from vast quantities of data.
The first of its kind in the tri-state area, the program has grown rapidly. We
now have approximately 222 master of science students and another 79
students taking 4-course graduate certificates. The program has increased
rapidly in quality as well as size. The average test scores of our student
body are in the top 75th percentile. We are ranked #7 among business analytics
programs in the U.S. by The Financial Engineer.
STATISTICS
PROGRAM PHILOSOPHY/OBJECTIVES
• Develop a nurturing culture
• Race with the MOOCs
• Develop innovative pedagogy
• Migrate learning upstream in the learning value chain
• Continuously improve the curriculum
• Use analytics competitions
• Improve placement
• Partner with industry
Google Online Marketing Challenge 2017
True Mentors AdWords Campaign
Team: Philippe Donaus, Rush Kirubi, Salvi Srivastava, Thushara Elizabeth Tom, Archana Vasanthan
Instructor: Theano Lianidou
Business Intelligence & Analytics
November 28, 2017
1
Motivation
• The Google Online Marketing Challenge is a unique opportunity for students to build online marketing campaigns on Google
AdWords for a business or a non-profit. Google provides a $250 budget to run these campaigns live for 3 weeks.
• We worked with TRUE Mentors, a non-profit based in Hoboken, NJ, and built a marketing strategy on Google AdWords to
achieve goals such as creating brand awareness and promoting fundraising events, volunteer opportunities and donations.
• Technologies used: Google AdWords, Google Analytics, Google Search Console, Facebook Insights
Design of Campaigns
• Conducted market analysis: competitors, current
market position, platforms used, USP
• Analyzed existing data available in Google
Analytics, Google Search Console and Facebook
Insights, and established marketing goals
• Designed campaigns on Search and Display
Ads
Performance
Results
• 23 ad groups with 206 ads and 700 keywords were used in total
• Text ads appear against people's search terms across Google
Search and Search partner sites
• Display ads appear on relevant pages across the Display
Network
• The team finished as a “Finalist” in the Social Impact Award category.
• The team finished as a “Semi-Finalist” in the Business Award category.
• Ranked among the Top-10 teams in the Social Impact Award category.
• Ranked among the Top-15 teams in the Business Award Category.
• Ranked among the Top-5 teams in the Americas region.
• The results can be found at: https://www.google.com/onlinechallenge/past/winners-2017.html
• Team ID: 234-571-4266
Targeting and Bidding
Target Goals (set before running the campaigns)
End Results
• Campaigns ran from 24th April 2017 to 14th May 2017
• KPIs were monitored and optimized continuously over the 3
weeks using insights drawn from various AdWords reports,
search term reports, Google Analytics and keyword reviews.
CAMPAIGN LEVEL
Targeting/Bidding: TM_Brand | TM_Events | TM_Donations | TM_Volunteers | TM_DisplayCampaign
Location: Hudson County-NJ and New York County-NY | Hudson County-NJ | Hudson County-NJ | Hudson County-NJ | Hudson County-NJ and New York County-NY
Bidding Strategy: Manual CPC for all five campaigns
Daily Budget: Yes for all five campaigns
ADGROUP LEVEL
Max CPC: set for all ad groups in every campaign
Demographic targeting: none, except TM_Volunteers (male and female were targeted separately)
KEYWORD LEVEL
Max CPC: Yes | Yes | Yes | Yes | No
Topics: none, except TM_DisplayCampaign (Charity & Philanthropy, and Fast Food)
Improving a Non-Profit’s Home Page
Team: Rush Kirubi, Thushara Elizabeth Tom
Instructor: Chihoon Lee Business Intelligence & Analytics
November 23, 2017
2
Experiment Design
Methodology:
Full factorial design with blocking.
Factors & Levels:
Responses: Drop Off rate
Conclusion
• The best setting, relative to the other settings, is a purple Donate
button and no slider.
• However, this effect is not significant at the 5% level.
• Since none of the factors are significant, we opted to select
the settings that minimize page loading time: no
slider, testimonial with text, purple Donate button.
Data
Time (the blocking factor) was confounded with the 3-factor interaction
ABC; we therefore assumed that this interaction is negligible.
Result
• It was found that no factor significantly stood out.
• The results of the experiment are shown below:
● Effect Test
● Normal Plot
Motivation
• Goal: Optimize True Mentor’s homepage to reduce the
drop-off rate.
• In turn, improving the quality score of the AdWords bids,
leading to more ad exposures at the same or lesser
expense.
Limitations
• We did not have enough time for replication due to the
competition deadlines.
• The blocking variable was difficult to accommodate: we
manually recorded the values at set times of the day
and night.
Participating in Google's Online Marketing Challenge, we selected a nonprofit to run a digital marketing campaign. Part of our
effort involved optimizing the organization's home page to boost user engagement as measured by drop-off rates. We set up
a full-factorial experiment (2^3) with time of day as the blocking variable. Put simply, we tested the donate button color, the
presence of a slider and the type of testimonial (one that was predominantly text versus a photo with a caption). All three
factors were blocked on time of day (daytime or nighttime). Empirically, the best setting was no slider with a purple donate
button; however, the effects were not strong enough to pass a statistical inference test. We kept the no-slider option anyway,
as it reduces page load time and its absence does not hurt user drop-off rates.
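A minimal sketch of how such a 2^3 factorial with the block confounded with the ABC interaction could be coded and fit in Python with statsmodels. The drop-off values below are purely illustrative, not the experiment's data; with a single unreplicated run the model is saturated, which is why the poster relies on a normal plot of effects rather than p-values.

```python
# Hypothetical 2^3 design with time-of-day block aliased to A*B*C
import pandas as pd
import statsmodels.formula.api as smf

runs = pd.DataFrame({
    "slider":      [-1, 1, -1, 1, -1, 1, -1, 1],   # coded -1 / +1 levels
    "testimonial": [-1, -1, 1, 1, -1, -1, 1, 1],
    "button":      [-1, -1, -1, -1, 1, 1, 1, 1],
})
runs["block"] = runs["slider"] * runs["testimonial"] * runs["button"]   # day vs night
runs["drop_off"] = [0.62, 0.55, 0.60, 0.57, 0.58, 0.49, 0.59, 0.52]     # illustrative responses

# Main effects, two-factor interactions, and the block; ABC is sacrificed to the block
model = smf.ols(
    "drop_off ~ slider + testimonial + button"
    " + slider:testimonial + slider:button + testimonial:button + block",
    data=runs,
).fit()
print(model.params)   # effect estimates to place on a normal plot
```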
Analyzing the Impact of Earthquakes
Team: Fulin Jiang, Ruhui Luan, Zhen Ma, Gordon Oxley
Instructor: Prof. Alkis Vazacopoulos Business Intelligence & Analytics
Motivation
• Earthquakes are among the most destructive natural forces in the world, able to ravage entire cities with little notice
• We wanted to analyze and visualize patterns in how earthquakes have historically struck and damaged specific locations, in order
to highlight high-risk and under-prepared areas of the world
• Earthquake features used include magnitude, source, focal depth, date, and type (nuclear or tectonic activity)
• We also measured each earthquake's damaging effects using damage in US dollars, deaths, the number of houses damaged or destroyed, and
injuries
Financial Cost of Earthquakes
Casualties due to Earthquakes
Technology Utilized
•Tableau was used for our analysis of earthquakes
•We found Tableau especially useful when visualizing
the latitude and longitude data, clearly identifying trends
in the way earthquakes affect certain parts of the world
•With the creation of 8 dashboards, we were able to
analyze and visualize many different features of
earthquakes, including depth, source, and more
Earthquake Map of the World
• Based on the analysis of the number of
deaths due to earthquakes, it is clear that a
majority of high-casualty events happen in
coastal regions, many of them on the Indian
subcontinent
• We see a peak in deaths in 2010 due to
the tragic number of casualties in
Haiti, underlining the fact that certain
underdeveloped regions suffer
increased casualties
• We observe the same trend in high-cost
areas, although the largest-cost event
occurred in Japan, where the tsunami
caused massive damage in 2011
Source: National Earthquake Information Center
3
Real Time Health Monitoring System
Team: Ankit Kumar, Khushali Dave, Shruti Agarwal, Nirmala Seshadri
Instructor: Prof. David Belanger
Architectural Approach
The architecture flow consists of the following
steps:
1. Data is generated and stored in a
file
2. Data is streamed using Apache
Kafka
3. Real-time data visualization is
set up using a visualization tool
Tools used
1. JSON file parsing for initial data analysis
2. Apache Kafka for real time streaming
3. Arduino programming for pulling
temperature data in real time
4. Python for data cleaning
5. Tableau for visualization
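A hedged sketch of step 2, streaming readings through Apache Kafka with the kafka-python client. The topic name, broker address and reading fields here are assumptions, not the team's actual configuration.

```python
# Publish one health reading to a Kafka topic (assumed local broker and topic name)
import json
import time
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def publish_reading(heart_rate, steps, body_temp):
    """Send one reading to the hypothetical 'health-metrics' topic."""
    reading = {
        "ts": time.time(),
        "heart_rate": heart_rate,
        "steps": steps,
        "body_temp": body_temp,
    }
    producer.send("health-metrics", reading)

publish_reading(heart_rate=72, steps=10, body_temp=98.6)
producer.flush()  # make sure the message actually leaves the client
```

A downstream consumer (or a tool fed by one) can then read the same topic and drive the real-time visualization and trigger checks.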
Variables
1. Heart rate - Through Smart watch
sensor
2. Steps Count – Through Smart watch
3. Body Temperature – Through Arduino
Trigger Cases
1. Fever – High temp, low heart rate and
steps
2. Long term unconsciousness – low heart
rate , body temperature and steps count
3. Heart attack - high heart rate in a very
short time interval
Problem Statement
To create a user-specific, real-time health monitoring system using sensors from a smart
watch and/or an Arduino device. The application should be able to monitor health features such as
heart rate, step count and body temperature in real time, and should be able to warn the user or
emergency services of any undesired or serious condition.
Results
The last part of the project is to visualize
results in real time, plot the streams of data,
and show a trigger alert if any abnormal use
case occurs.
Business Intelligence & Analytics
4
Business Impact
Our health data analysis application can pull
data in real time from device sensors. Using
this system, authorities, friends and relatives
can easily monitor the health of a loved one and
respond immediately in case of an
emergency. From a business perspective, this
can attract customers suffering from a medical
condition.
Zillow’s Home Value Prediction
Team: Yilin Wu, Zhihao Chen, Jiaxing Li, Zhenzhen Liu, Ziyun Song
Instructor: Prof. Alkiviadis Vazacopoulos Business Intelligence & Analytics
5
Motivation
•Zillow Prize challenges the data science community to
help push the accuracy of the Zestimate even
further (improving the median margin of error).
•Our task in this competition is to develop an
algorithm that predicts the log error for the
months in Fall 2017.
Technology
•Python, Watson and Tableau for exploratory data analysis (EDA).
•Python for data preprocessing and feature engineering.
•Python to build the model.
Competition Process
Feature engineering
Feature engineering is the most important part of this competition. It is
crucial to measure feature importance so that we keep valuable features and
drop useless ones. We also created new features that might help the machine
learning algorithms work better.
EDA → Data preprocessing → Feature engineering → Use 2016 data to train the model →
Test the predicted logerror on 2016 data → Improve the feature engineering →
Figure out the best model and adjust the parameters → Use both 2016 and 2017 data to
train the model → Several improvements → Final submission
Modeling
We found that CatBoost, a gradient boosting machine, handled this problem well
compared with XGBoost and LightGBM. CatBoost builds oblivious decision trees and
includes an ordered boosting scheme that reduces the bias of the residuals to prevent
overfitting; it also uses a different scheme for calculating the values in leaves and
supports several options for converting categorical features based on counting statistics.
In general, CatBoost is presented as an algorithm that can work with categorical
features without preprocessing, is resistant to overfitting, and can be used without
spending much time and effort on hyperparameter selection, while often being more
accurate. However, training is still very slow: on average, 7-8 times longer than
LightGBM and 2-3 times longer than XGBoost.
Before training the model, it is necessary to define the categorical features. In this
case, we have 26 categorical features, which are passed to CatBoost through its data
pool ("catpool").
We then adjusted the parameters to get a better (though not the best) result.
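A rough sketch of that CatBoost setup. The file name, column selection and parameter values below are placeholders under the assumption of a merged Zillow-style training file, not the team's actual configuration.

```python
# Hypothetical CatBoost training on a Zillow-style dataset with categorical features
import pandas as pd
from catboost import CatBoostRegressor, Pool

train = pd.read_csv("train_2016.csv")                 # assumed merged training file
cat_cols = [c for c in train.columns if train[c].dtype == "object"]   # the categorical features

X = train.drop(columns=["logerror"])
y = train["logerror"]
train_pool = Pool(X, y, cat_features=cat_cols)        # the "catpool" mentioned above

model = CatBoostRegressor(
    iterations=1000,
    learning_rate=0.03,
    depth=6,
    loss_function="MAE",     # Zillow scores on mean absolute error of the log error
    verbose=200,
)
model.fit(train_pool)
```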
Conclusion
The final submission ranked in the top 11%. It is a pity that we came so close to the
bronze medal cutoff, which is the top 10%.
In the future, we plan to experiment further with the categorical features and keep
tuning the model.
Analysis of Opioid Prescriptions and Deaths
Team: Pranjal Gandhi, Nishant Bhushan, Sunoj Karunanithi, Raunaq Thind
Instructor: Alkis Vazacopoulos Business Intelligence & Analytics
6
The objective is to find the correlation between the prescription of
drugs containing opioids and drug related deaths in the USA.
What are opioids?
Opioids are a class of drugs that include the illicit drug heroin as well as
the licit prescription pain relievers oxycodone, hydrocodone, codeine,
morphine, fentanyl and others.
• Tableau for Visualizations.
• R & Excel for cleaning the data and exploratory analysis.
• The dataset is a subset of data sourced from cms.gov and contains
prescription summaries of 250 common opioid and non-opioid drugs
written by the medical professionals in 2014.
• The number of deaths due to drug overdoses exceeds the number of deaths from car accidents by a staggering 11,102, according to a DEA report.
• In 2014, there were 4.3 million people aged 12 years or older using opioid-based painkillers without prescriptions.
• This led to substance abuse among almost 50% of those consumers.
• 94% of respondents in a 2014 survey of people in treatment for opioid addiction said they chose to use heroin because prescription opioids were
"far more expensive and harder to obtain."
Analysis & Visualizations
Results and Conclusion
Facts & Figures
Objective & Motivation Tools Used
• We found that opioid prescriptions were especially high for prescribers
in the following specialties:
• Female Nurse Practitioners
• Female Physician Assistants
• Female and Male Family Practice
• Female and Male Internal Medicine
• Male Dentists
The top 5 states with the highest percentage of deaths due to overdoses are
California, Ohio, Philadelphia, Florida and Texas.
All of them had significantly high prescriptions of Hydrocodone-
Acetaminophen, followed by Oxycodone-Acetaminophen.
Results and Conclusion
UK Traffic Flow & Accident Analysis
Team: Xiaoming Guo, Weiyi Chen, Dingmeng Duan, Jian Xu, Jiahui Tang
Instructor: Alkis Vazacopoulos Business Intelligence & Analytics
Technology
• Python for integrating data for analysis.
•Tableau for data visualization and extracting data insights.
Current & Future Work
•Generating different plots from data and discovering relationships between variables.
•Plan to find relationships between traffic flow and accidents.
Motivation
• Visualization of a Dataset of 1.6 million accidents and 16
years of traffic flow.
7
US Permanent Visa Application Visualization
Team: Jing Li, Qidi Ying, Runtian Song, Jianjie Gao,Chang Lu
Instructor: Alkiviadis Vazacopoulos
Introduction
Develop a descriptive analysis based on US
permanent application data from 2012 to 2017
in Tableau and provide insights into visa
decisions.
Data explanation:
374,363 applicants from 203 countries in 22 occupations.
Descriptive Analysis
Employer & Economic Sector
Conclusion
Business Intelligence & Analytics
April 30th, 2015
8
Applications by State Applications by Country
Top 10 companies that submit
permanent visa applications.
Education & Occupation
Certified rate and denied rate
among all education degrees We can observe that a
high-school degree has
significant denial rate
in the graph.
Doctorate’s degree has
the lowest denial rate.
Applicants with
master's and bachelor’s
are mostly working in
Computer and
Mathematical fields.
Occupations which
have highest
certification rates.
While High School
applicants most
working in
Production
Occupations that
have certification
rates lower than
5%.
Our team used different attributes to analyze the relationships
between visa applications and certification rates, both directly and
indirectly. Application decisions correlate with many factors, such as
education level, income and occupation.
In conclusion, applicants with higher education (Bachelor's,
Master's, Doctorate) mostly work in Computer and Mathematical
areas, which have higher incomes and are more likely to be certified.
Applicants from countries that dominate an occupation have a higher
certification rate when applying for related jobs. We also found that the
certification rate increased from 2012 to 2017.
Applicants in different occupations have different nationalities.
Taking the computer science and construction occupation maps as
examples, certified applicants in the computer domain mainly come from
India and China, while construction applicants mainly come from
Mexico.
Nationality & Occupation
Business Intelligence & Analytics
9
• Some say climate change is the biggest threat of our age,
while others say it is a myth based on dodgy science.
• We feel the climate change problem has become much more
severe in recent years; global warming is, to some extent,
responsible for the recent disastrous hurricanes.
• So we set out to build descriptive data visualizations showing
how the climate has changed since 1770 and to analyze the
results.
• Excel for data cleaning and data filtering.
• Tableau to do the data visualizations and create interactive
graph.
• Tableau and Watson to build some regression models and
conduct the analysis.
•Illustrate the world’s climate change trend starting from 18th century in the line chart.
•Specify the trend of climate change for each country and its average temperature in the whole period.
•Extract data from original data file to show how much each country’s temperature has increased and compare this percentage
change with one another.
•Customize the period to show trends of climate change in a certain period of time.
•Try to get more data sources to dig in deeply in order to find the factors leading to the climate change.
Climate Change Since 1770
Current & Future Work
Team: Yilin Wu, Ziyun Song, Zhihao Chen, Jiaxing Li, Zhenzhen Liu
Instructor: Alkis Vazacopoulos
Motivation Technology
Is worldwide mercury really rising? Insights into customized periods
Explore average temperature by country Recent 100 years Climate Change
Predicting Customer Conversion Rate for
an Insurance Company
Team: Yalan Wang, Cong Shen, Junyuan Zheng, Yang Yang
Instructors: Alkis Vazacopoulos and Feng Mai
Business Intelligence & Analytics
10
Motivation
•Use the dataset which contained contact information to
predict the customer who would like to purchase the
insurance
•Help insurance company understand the characteristics of
their customers in making the purchasing decisions of their
insurance
Technology
•Used Python to analyze imbalanced data on customers from
insurance company
•Applied Synthetic Minority Over-sampling Technique
(SMOTE) algorithm to balance the data
•Built predictive models (Logistic Regression, Random Forest
and XGBoost) to predict conversion rate
Data Summary
Learning Model
•Logistic Regression: chosen because it is known to serve as a benchmark
against which other algorithms are compared.
•Random Forest Classifier: an ensemble of decision trees.
•XGBoost: short for "Extreme Gradient Boosting", a tree ensemble model
that sums the predictions of multiple classification and regression trees
(CART).
 Raw Data
• Dataset shape: 1,892,888 records and 50 variables in the dataset
• Features Type: 5 columns are int64, 12 columns are float64, and 33 are object
• Missing Value: 42 columns contains NA values
 Clean Data
• Convert the data format to train the model
• Use the correlation matrix to eliminate features that are highly correlated but irrelevant to the target label
• Apply the SMOTE algorithm to balance the dataset
Imbalance data: A dataset is imbalanced if the classes are not approximately equally represented
Correlation Matrix
0: Contacting without purchase
1: Contacting with Purchase
Raw dataset: (1,892,888, 50). Processed dataset: (1,885,774, 135).
SMOTE example: consider a sample (6, 4) and let (4, 3) be its nearest neighbor.
(6, 4) is the sample for which the k-nearest neighbors are being identified;
(4, 3) is one of those k-nearest neighbors. Let:
f1_1 = 6, f2_1 = 4, f2_1 - f1_1 = -2
f1_2 = 4, f2_2 = 3, f2_2 - f1_2 = -1
The new sample is generated as (f1', f2') = (6, 4) + rand(0-1) * (-2, -1),
where rand(0-1) generates a random number between 0 and 1.
Pipeline: processed data → split into training set (75%) and testing set (25%) → fit model → predictions.
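A hedged sketch of that balancing and training pipeline using imbalanced-learn's SMOTE, a 75/25 split, and a Random Forest. The synthetic data below is a stand-in for the cleaned, dummy-encoded contact records, and the hyperparameters are assumptions.

```python
# SMOTE on the training portion only, then fit a Random Forest (illustrative data)
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# stand-in for the processed customer features and 0/1 purchase label
X, y = make_classification(n_samples=20000, n_features=30, weights=[0.97, 0.03], random_state=42)

# 75/25 split; stratify keeps the real class ratio in the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, stratify=y, random_state=42)

# oversample only the training data so evaluation reflects the true imbalance
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)

clf = RandomForestClassifier(n_estimators=200, random_state=42)
clf.fit(X_res, y_res)
print("test accuracy:", clf.score(X_test, y_test))
```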
Results
Conclusion & Future Work
Model Accuracy
LGR 83.6%
RFR 94.6%
XGB 79.8%
Feature Importance
From the Random Forest, we get the top
50 features which play a significant role in
our model.
'RQ_Flag',
'Original_Channel_Broker',
'First_Contact_Date_month',
'First_Contact_Time_Hour',
'PDL_Special_Coverage',
'RQ_Date_month', 'Inception-
First_Contact',
'Original_Channel_Internet',
'PPA_Coverage',
'Inception_Date_month', 'Mileage',
'Region_(03)関東',
'Original_Channel_Phone',
'License_Color_(02) Blue',
'Previous_Insurer_Category_(02)
• Comparing the accuracy of the three models, we chose Random Forest as our final
model.
• Based on the feature importance, we can dig into the business insights behind these
features and suggest which customer characteristics drive insurance purchasing
decisions.
After training, we obtained results for the three models: Logistic Regression, Random
Forest and XGBoost.
Determining Attractive Laptop Features for College Students
Team: Liwei Cao, Gordon Oxley, Haoyue Yu, Salman Sigari
Instructor: Chihoon Lee
Business Intelligence & Analytics
November 28, 2017
11
Experiment Design
Stage 1: Plackett-Burman Design
Objective: Identify the most important factors early in the
experimentation
Factors & Levels:
Stage 2: Fractional Factorial Design
Objective: study the effects and interactions that several factors
have on the response.
Factors & Levels:
Blocks:
Conclusion
• Price and operating system play very important roles
when it comes to laptop purchases
• To maximize the probability of purchasing, price at the plus level
(<$750) and operating system at the minus level (Windows)
would be chosen. The maximum predicted probability of
purchasing is Probability of Purchasing = 51.77 +
(11.36/2)*(1) - (7.23/2)*(1)*(-1) = 61.065
Data Collection
We handed out slips to Stevens students randomly and recorded their
responses
Stage 1 (32 observations): Stage 2 (64 observations):
● Effect Test
● Pareto Plot
● Normal Plot
Motivation
• Laptops have become a staple in our lives: we use them
for work, entertainment, and other daily activities
• From a marketing perspective, it is critical to find the
factors that interest consumers in order to produce and sell
a successful laptop
• A survey conducted by Pearson found that 66% of
undergraduates use their laptop every day in college
• We wanted to find out what drives laptop demand among
college students
Result
Stage 1 Stage 2
Probability of Purchasing = 51.77 + (11.36/2)*Price -
(7.23/2)*Price*Operating system
=51.77 + (5.68)*Price - (3.615)*Price*Operating system
Response: Probability of Purchasing (0 – 100%)
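A quick check of the fitted Stage-2 model over the four combinations of coded levels; the equation is taken directly from the poster, and the small loop below is just arithmetic.

```python
# Evaluate the fitted purchase-probability model at the coded factor levels
def purchase_prob(price, os):
    # price, os coded as +1 / -1 (price +1 = "<$750", os -1 = "Windows")
    return 51.77 + (11.36 / 2) * price - (7.23 / 2) * price * os

for price in (+1, -1):
    for os in (+1, -1):
        print(f"price={price:+d}, os={os:+d} -> {purchase_prob(price, os):.3f}")
# price=+1, os=-1 gives the maximum predicted probability of purchasing: 61.065
```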
Clustering Large Cap Stocks During Different Phases of the
Economic Cycle
Students: Nikhil Lohiya, Raj Mehta
Instructor: Amir H. Gandomi
Results
Clustering of Stocks during Recovery phase
Clustering of Stocks during Recession phase
The K-means plot shows that the stocks are clustered by
similarities in their Sharpe ratio, volatility, and average
return. There are 9 graphs in total; 2 of them, for the
expansion and recession phases, are displayed above.
The x-axis shows the S&P 500 ticker/symbol and the y-
axis shows the cluster number. Hovering over a dot
shows the ticker along with its cluster number and the
variables used for clustering. We used the silhouette
score and visually inspected the data points to find
the optimal value of k, which turned out to be 22.
Introduction
OBJECTIVE
We aim to provide sets of securities that behave
similarly during a particular phase of the economic
cycle. For this project, the creation of sub-asset classes is
done only for large-cap stocks.
BACKGROUND
Over time, developed economies such as the US have
become more volatile, and hence the underlying risk of
securities has risen. This project aims to identify the risks
and potential returns associated with different securities
and to cluster stocks with similar Sharpe ratio, volatility
and average return for a better analysis of the portfolio.
Business Intelligence & Analytics
12
Data Acquisition
• Data on large-cap stocks and US Treasury bonds is gathered
directly using APIs.
• The data covers 2 time frames, i.e. a recessionary and an
expansionary economy.
Data Preprocessing
• This segment applies the formulae to calculate the required
parameters (Eq. 1, 2, 3, 4).
Analysis
• This segment consists of K-means clustering analysis of the
large-cap stocks (k = 22, 500 stocks).
• The clustered securities are then further tested for
correlation among the sub-asset classes.
Results
• The results of the K-means clustering vary in the range 9 to 45.
• There were some outliers in our analysis as well.
Flow - Project
Conclusion & Future Scope
• With the above methodology, we have been able to
develop a set of classes which behave in a similar
fashion during each phase of the economic cycle.
• The same methodology can be extended to
different asset classes available online.
• Application of Neural Networks can significantly
reduce the error in cluster formation.
• Also, application of different parameters such as
Valuation, Solvency or Growth potential factors can
be included for clustering purposes.
• Next, we plan to add leading economic indicator
data to identify the economic trend and to perform
the relevant analysis.
Mathematical Modelling
• Daily returns are computed for all 500 securities:
  R = ((P_c - P_o) / P_o) × 100                          (Eq. 1)
• Average return and volatility:
  μ_j = (1/n) Σ_{i=1..n} R_i                             (Eq. 2)
  σ_j = sqrt( Σ_{i=1..n} (R_i - μ_j)² / (n - 1) )        (Eq. 3)
• Sharpe ratio for each security:
  SR_j = (R_j - R_f) / σ_j                               (Eq. 4)
• A correlation matrix between the clustered securities is computed following
the cluster formation.
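An illustrative sketch of Eq. 1-4 and the k = 22 clustering step in Python. The synthetic open/close series, the zero risk-free rate and the KMeans settings are assumptions standing in for the project's actual large-cap data.

```python
# Compute return, volatility and Sharpe ratio per ticker, then cluster with k = 22
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
tickers = [f"TICK{i}" for i in range(500)]
# synthetic open/close series standing in for the real large-cap price data
prices = {
    t: pd.DataFrame({
        "open": 100 + rng.normal(0, 1, 252).cumsum(),
        "close": 100 + rng.normal(0, 1, 252).cumsum(),
    })
    for t in tickers
}

def stock_features(close, open_, risk_free=0.0):
    r = (close - open_) / open_ * 100            # Eq. 1: daily return in %
    mu = r.mean()                                # Eq. 2: average return
    sigma = r.std(ddof=1)                        # Eq. 3: volatility
    sharpe = (mu - risk_free) / sigma            # Eq. 4: Sharpe ratio
    return pd.Series({"avg_return": mu, "volatility": sigma, "sharpe": sharpe})

features = pd.DataFrame({t: stock_features(p["close"], p["open"]) for t, p in prices.items()}).T
features["cluster"] = KMeans(n_clusters=22, random_state=0, n_init=10).fit_predict(
    features[["avg_return", "volatility", "sharpe"]]
)
print(features["cluster"].value_counts().head())
```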
Introduction
• Predict how popular an apartment rental listing is based on the listing content
• Help Renthop better identify listing quality and renters’ preference
• Test several machine learning algorithms and then evaluate them to choose the best one
Statistical Learning & Analytics
Spring 2017
Exploratory Data Analysis
• Data source: Kaggle.com
• 15 attributes with 49352 training samples.
• Target variable: Interest level (high, medium, low)
Data Preprocessing
• Transformed categorical data into numerical data
• Obtained zip code from latitude & longitude
• “One-hot encoding”: created dummy variables
• Extracted the top apartment features from the list
of apartment features
Feature importance
- Built ExtraTrees Classifier
- Computed feature importance
Top features:
- Price
- Building_index
- Manager_index
- Zip_id
- Number of photos
Modeling
• Ensemble Methods: Random Forest, Bagging, AdaBoost, Xgboost
• Logistic regression, KNN, Naïve Bayes, Decision Tree
• Model Evaluation
- Multi-class logarithmic loss: logloss = -(1/N) Σ_{i=1..N} Σ_{j=1..M} y_ij log(p_ij)
- For imbalanced data, use Precision-recall curve and average precision score
• Best Classifier: Random
Forest
- Lowest logloss: 0.6072
- Highest average precision
score: 0.8374
• Precision-recall curve for each
class
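A small sketch of the two evaluation metrics named above, computed with scikit-learn. The labels and predicted probabilities below are toy values, not the Renthop predictions.

```python
# Multi-class log loss plus one-vs-rest average precision per interest level
import numpy as np
from sklearn.metrics import log_loss, average_precision_score

# classes encoded as 0 = high, 1 = medium, 2 = low interest
y_true = np.array([2, 1, 2, 0, 2])
probas = np.array([          # predicted P(high), P(medium), P(low) per listing
    [0.10, 0.20, 0.70],
    [0.20, 0.50, 0.30],
    [0.05, 0.15, 0.80],
    [0.60, 0.30, 0.10],
    [0.30, 0.30, 0.40],
])

print("multi-class logloss:", log_loss(y_true, probas))

for c, name in enumerate(["high", "medium", "low"]):
    ap = average_precision_score((y_true == c).astype(int), probas[:, c])
    print(f"average precision ({name}): {ap:.3f}")
```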
Conclusion
• Built 8 models to predict the probabilities. The
best performer is Random Forest, with the lowest
logloss (0.6972) and the highest average precision
score (0.8374)
• For imbalanced data, the precision-recall
curve is an appropriate evaluation metric,
while the accuracy score is not
• Hyperparameter tuning for better performance
• SMOTE technique to balance the dataset
13
14
Deep Learning vs Traditional Machine Learning Performance
on an NLP Problem
Abhinav S Panwar
Instructor: Christopher Asakiewicz
Modeling
• Traditional Machine Learning:
• Bag of Words (up to 2 grams) is used for feature generation
• L1 regularization is used for feature selection
• Hyperparameter values are searched through Cross Validation
method
• FastText:
• Data is directly fed into the FastText pipeline without worrying
about feature generation
• ‘Number of Epoch’ and ‘Learning Rate’ are tuned for best
performance
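A hedged sketch of that FastText pipeline. The file names, label prefix and hyperparameter values are assumptions, not the author's settings; fastText expects one labeled example per line, e.g. "__label__1 <concatenated headlines>".

```python
# Train and evaluate a supervised fastText classifier on pre-formatted text files
import fasttext

# train.txt: 2008-2014 rows; test.txt: the two following years (assumed paths)
model = fasttext.train_supervised(
    input="train.txt",
    epoch=25,          # 'Number of Epoch' tuned for best performance
    lr=0.5,            # 'Learning Rate' tuned for best performance
    wordNgrams=2,      # match the up-to-2-gram bag-of-words baseline
)

n, precision, recall = model.test("test.txt")
print(f"samples={n}  precision={precision:.3f}  recall={recall:.3f}")
```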
Introduction
• When you need to tackle an NLP task, the sheer number of
algorithms available can be overwhelming. Task-specific
packages and generic libraries all claim to offer the best
solution to your problem, and it can be hard to decide which
one to use.
• Through this project, we compare the performance of
traditional machine learning algorithms such as Support
Vector Machines, Logistic Regression and Naïve Bayes
with FastText, released by Facebook's AI lab, which is based
on a neural network.
• The challenge is to predict ‘Dow Jones Industrial Average’
movement based on Top 25 news headlines in the media.
This is a binary classification task with ‘1’ indicating DJIA
rose or stayed as the same, and ‘0’ indicating DJIA value
decreased.
Conclusion
• If you are working with large datasets, and speed and low memory
usage are crucial, FastText looks like the best option.
• Based on published research, the performance of FastText is
comparable to other deep neural network architectures, and
sometimes even better.
• Running complex neural network architectures requires GPUs for
good performance, but one can get comparable
performance with FastText running on ordinary CPUs.
Business Intelligence & Analytics
Data Pre-Processing
• Total Number of Samples: 1989
• Some Text Cleaning was done:
•Converting headline to lowercase letters
•Splitting the sentence into a list of words
•Removing punctuation and meaningless words
• Dataset was divided into Training and Testing (80:20 split):
•Training: Data from 08-08-2008 to 12-31-2014 was used
•Testing: Data from following two years was used
Data Description
•News Data: Historical news headlines were obtained from the Reddit WorldNews
channel. They are ranked by Reddit users' votes, and only the top 25 headlines
are considered for a single date.
•Stock Data: Dow Jones Industrial Average (DJIA) adjusted close value for the
period 8th August 2008 to 1st July 2016.
Results
•Precision, Accuracy, and Time taken are compared for all the methods
of Learning.
Word Cloud for +ve class Word Cloud for -ve class
•After analyzing the Word Cloud for both classes of data, we can see
Political sensitive words are dominant for both of them.
•Since the dataset contains only 2000 samples, the final model features
are not distinct enough resulting in average model fitting.
•Among all the algorithms, FastText gives nearly best performance and
the time consumed is far less than other algorithms.
Learning Algorithms
• Traditional Machine Learning:
•Support Vector Machine, Logistic Regression, Naïve Bayes, Random
Forest
• Deep Learning:
•FastText: Neural Network based library released by
Facebook for efficient learning of word representations and sentence
classification.
Method | Precision % (Unigram / Bigram) | Accuracy % (Unigram / Bigram) | Time in s (Unigram / Bigram)
Logistic Regression | 68.7 / 69.3 | 71.2 / 71.8 | 53 / 57
LSVC | 70.9 / 71.2 | 72.1 / 72.3 | 61 / 63
Naïve Bayes | 67.4 / 68.1 | 67.6 / 68.4 | 52 / 53
Random Forest | 53.2 / 54.9 | 55.8 / 57.1 | 69 / 72
FastText | 68.9 / 70.2 | 70.9 / 71.7 | 10 / 11
Wine Recognition
Team: Biying Feng, Ting Lei, Jin Xing
Instructor: Amir Gandomi
Results and Discussions
Model 1: KNN (Kth Nearest Neighbor Classification)
Result: Evaluation: cross validation
The error rate of cross-validation is 0.3124, and the error
rate of KNN model is 0.3483.
Model 2: LDA (Linear Discriminant Analysis)
Result: Evaluation: cross validation
According to the result of the cross validation, we can
know that the error rate is 0.0168.
Model 3: Recursive Partitioning
Result:
Evaluation:
We don’t have to use cross validation to evaluate this
model, the reason is Recursive Partitioning uses cross
validation internally. So, we can trust the error rates
implied by the table. Here, the error rate is 0.1685.
Goals we want to accomplish
• Select the most influential variables for wine
classification.
• Develop classifiers to classify new observations.
• Find the best model for wine classification.
• Find the correlation between each pair of variables.
Classification Models
1. K Nearest Neighbor
2. Linear Discriminant Analysis
3. Recursive partitioning
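The poster's three classifiers were most likely built in R; the sketch below is an equivalent Python version with scikit-learn on the same UCI Wine data, reporting cross-validated error rates for comparison (the neighbor count, tree settings and 10-fold CV are assumptions).

```python
# Compare KNN, LDA and a decision tree (recursive partitioning) on the UCI Wine data
from sklearn.datasets import load_wine
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)

models = {
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "LDA": LinearDiscriminantAnalysis(),
    "Recursive partitioning": DecisionTreeClassifier(random_state=0),
}

for name, model in models.items():
    acc = cross_val_score(model, X, y, cv=10).mean()
    print(f"{name}: cross-validated error rate = {1 - acc:.4f}")
```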
Conclusion
• The error rate of KNN is 0.3146, the worst
among the three models.
• The error rates of the LDA and Recursive
Partitioning models are the same, equal
to 0.01685.
• However, the LDA model overestimates the
result and, therefore, is not good enough.
• The Recursive Partitioning model uses cross-
validation internally to build its decision rules,
which makes it more reliable.
• Finally, the Recursive Partitioning model is
selected as the best model for wine classification.
Multivariate Data Analysis
Fall, 2017
Data Information
Problem Type: Classification
Variables: (1) Alcohol, (2) Malic acid, (3) Ash, (4) Alcalinity of
ash, (5) Magnesium, (6) Total phenols, (7) Flavanoids,
(8) Nonflavanoid phenols, (9) Proanthocyanins, (10)
Color intensity, (11) Hue, (12) OD280/OD315 of diluted
wines, (13) Proline
Outcome: Wine Type
Data source:
http://archive.ics.uci.edu/ml/datasets/Wine
Data preparation
1. Type of variables
2. Summary of variables
3.Correlations 4. Variable importance
5. Scatter Plots
6. Variable dependencies
7. Standardizing variables
15
The Public’s Opinion of Obamacare: Tweet Analyses
Saeed Vasebi
Instructor: Chris Asakiewicz
Results
Monthly trend of tweets based on their
sentiment:
Geographical distribution of tweets:
Main #Hashtags of authors:
Trumpcare sentiment analyses by Language:
Introduction
 The Patient Protection and Affordable Care
Act (Obamacare) provides healthcare
insurance services for US citizens.
 The act has been highly debated by the
Democratic and Republican parties.
 The main beneficiaries of the act are the
people who use and pay for it.
 This study tries to find out what people
think about Obamacare and about Trumpcare,
which is a potential substitute for the act.
Modeling and Data
 The Watson Analytics social media tool was
used to gather data from Twitter based
on #Obamacare and #Trumpcare in this
study. IBM Bluemix sentiment
analysis was used for detailed evaluation of
the tweets.
 Obamacare tweets were gathered
for July-September of 2016 and 2017.
 Trumpcare tweets were extracted
from November 2016 to October 2017.
 The geographical area of study is limited to
the United States.
 The tweets' languages are limited to
English and Spanish, the two most widely
spoken languages in the US.
Conclusion
 Most of tweets have negative sentiments about Obamacare and Trumpcare; however,
Trumpcare has relatively higher opposing.
 CA, TX, NY, and FL have high tweeting rate about the acts. They have high negative tweets
for the acts and partly support Obamacare but Trumpcare is not supported at any state.
 Hashtags with Obamacare talk about its negative impacts on government’s budget.
Hashtags with Trumpcare talk about stopping the program and President Trump’s
violations.
 Spanish tweets have higher rate of positive tweets than English ones.
20
Obamacare tweets’ trend on July-September 2016 and 2017
Trumpcare tweets’ trend from November 2016 to October 2017
Geological distribution of Obamacare’s all, negative, and positive tweets
Geological distribution of Trumpcare’s all, negative, and positive tweets
Obamacare and Trumpcare #hashtags
Project Assignment Optimization Based on
Human Resource Analysis
Team: Jiarui Li, Siyan Zhang, Siyuan Dang, Xin Lin, Yuejie Li
Instructor : Alkis Vazacopoulos
Optimization:
Constraints:
• P_ij indicates whether project i is assigned to employee j; it
takes the value 0 or 1.
• a_i is the number of employees needed for
project i.
• b_j is the number of projects employee j
can complete.
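A minimal sketch of this assignment model with PuLP. The cost term (a predicted turnover risk per project-employee pair) and the a_i / b_j values below are hypothetical stand-ins for the team's data, not their actual inputs.

```python
# Binary assignment model: minimize total predicted turnover risk of assignments
from pulp import LpBinary, LpMinimize, LpProblem, LpVariable, lpSum

projects = ["P1", "P2", "P3"]
employees = ["E1", "E2"]
risk = {("P1", "E1"): 0.2, ("P1", "E2"): 0.4,    # predicted turnover risk if assigned
        ("P2", "E1"): 0.5, ("P2", "E2"): 0.1,
        ("P3", "E1"): 0.3, ("P3", "E2"): 0.3}
a = {"P1": 1, "P2": 1, "P3": 1}    # employees needed per project (a_i)
b = {"E1": 2, "E2": 2}             # projects an employee can complete (b_j)

prob = LpProblem("project_assignment", LpMinimize)
P = {(i, j): LpVariable(f"x_{i}_{j}", cat=LpBinary) for i in projects for j in employees}

prob += lpSum(risk[i, j] * P[i, j] for i in projects for j in employees)   # objective
for i in projects:
    prob += lpSum(P[i, j] for j in employees) == a[i]    # staff each project
for j in employees:
    prob += lpSum(P[i, j] for i in projects) <= b[j]     # workload cap per employee

prob.solve()
print({k: v.value() for k, v in P.items()})
```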
Introduction
The primary goal of our project is to provide
insight to companies that need to assign
different projects to each employee appropriately. In
order to maintain a low turnover rate as well as
increase employee productivity and growth, we
use a combination of machine learning and the
powerful tool of optimization to help companies
build and preserve a successful business.
Business Intelligence & Analytics
Model
Objective: Minimize the turnover rate.
Data Exploration:
1. Project Count: Turnover vs. No Turnover:
Moving Forward
For future work, we look forward to improving
our model by introducing additional
constraints. In addition, we will combine machine
learning, using Random Forest to predict the
turnover rate that feeds into the optimization problem.
Thus, we can generate a more accurate model
that gives dynamic results based on the enterprise's
data.
2.Decision Tree Model:
Get insights of data and
predict the turnover rate.
3.Heat Map:
We build the assignment model to optimize the
project assigned to each employee while
minimizing the employee turnover rate.
There is a positive (+) correlation
between projectCount,
averageMonthlyHours, and
evaluation.
16
S&A Processed Seafood Company
Team: Redho Putra, Neha Bakhre, Harsh Bhotika, Ankit Dargad, Dinesh Muchandi
Instructor: Alkis Vazacopoulos
Methodology
• We found out how many units the S&A Company sold (the
case gives the value: $330M).
• Then we used forecasting techniques to estimate how many
units S&A will sell over the next 3 years to hit the sales target
of $550M, which is the objective of this project.
• Finally, we suggested a new supply chain strategy to reach the desired target.
Introduction
• S&A Company is a distributor of processed seafood products.
• The broader offering of products has improved sales in the
area of Western Europe.
• Sales have improved with the new product acquisitions, but
S&A now wishes to expand the entire product line into more
of Eastern Europe.
• However, the company realizes that their current model of
putting a distribution center (DC) in each country that they
service may no longer be the best option as they look to
expand.
Current Strategy
Conclusion
• The optimum number of DCs for S&A Company is 3
(UK, Germany, and Poland)
• UK DC covered demand from UK & Ireland; Germany
DC covered demand from Germany, France & Belgium;
Poland DC covered demand from Poland & Hungary.
• By meeting the industry benchmark of inventory
turnover (15), UK DC expected average inventory is
257,000 cases; Germany DC expected average
inventory is 261,000 cases, and Poland DC expected
average inventory is 175,000 cases.
• The warehouse utilization rate of these 3 DCs would be
80%.
Business Intelligence & Analytics
21
Suggested Strategy
Results
• Only the Ireland DC has a utilization rate around 80%; the other DCs'
utilization rates are between 25% and 42%.
• The average inventory turnover across all DCs is 11, while the
industry benchmark is 15.
Objective
• S&A Company is looking to drive significant organic
revenue growth in the European market growing from
$340M (as of the end of the fiscal year ending June 30,
2017) to $550M over the next 3 years.
• We need to provide a supply chain strategy that helps
them answer fundamental questions regarding the
expected inventory and flow of goods that will be
required three years from now to support the desired
sales growth.
• The company needs advice regarding the supply chain
network, infrastructure and processes needed to serve
its customers within the expected delivery window while
optimizing costs.
Optimizing Travel Routes
Team: Ruoqi Wang, Yicong Ma, Jinjin Wang, Shuo Jin
Instructor: Alkiviadis Vazacopoulos
Result
Business Intelligence & Analytics
18
Introduction
More than a hundred million international tourists arrive in the United States every year and
spend billions of dollars. To maximize profit and make more competitive offers, it is crucial to lower
costs. One important factor is arranging an efficient tour route so that tour agents and travelers can lower
the cost of time and transportation.
Our project goal is to optimize a tour route, using certain constraints, to minimize costs and maximize the tour
experience.
Future Work
Experiment
•To achieve our objective, we constructed a two-step
experiment.
•Our first step is defining attributes and setting up constraints so
that we can pick 6 target cities from the 10 most popular tourist
destinations in the U.S. We collected data from TripAdvisor.
•The cost of living index is used to estimate the expenses of the
tour, which include dining and lodging.
•Our second step is optimizing the tour route based on the cities
that we picked in step one. We compared the costs of 3
transportation methods (train, bus, flight). We first optimized the
cost for each city and then we optimized the travel route using
Excel solver.
• Step one: we added several other constraints, such as
visiting a minimum of two national parks and a
minimum of one Michelin-starred restaurant, to achieve the
objective of maximizing the tour experience. We used Excel
Solver to obtain our final selection of destinations.
• Step two: after we finished data cleaning,
we applied our data to a Travelling Salesman Problem
model to optimize our route.
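A toy sketch of that step-two routing model: a brute-force Travelling Salesman solution over 6 selected cities. The city names and cost matrix below are placeholders, not the project's actual TripAdvisor or transportation data (the real work used Excel Solver).

```python
# Enumerate all routes over 6 cities and pick the cheapest closed tour
from itertools import permutations

cities = ["NYC", "Boston", "Washington", "Chicago", "Orlando", "Las Vegas"]
cost = [  # symmetric travel-cost matrix (illustrative values)
    [0, 40, 55, 120, 150, 300],
    [40, 0, 90, 140, 180, 320],
    [55, 90, 0, 110, 130, 290],
    [120, 140, 110, 0, 160, 210],
    [150, 180, 130, 160, 0, 250],
    [300, 320, 290, 210, 250, 0],
]

def route_cost(order):
    """Total cost of visiting the cities in `order` and returning to the start."""
    legs = zip(order, order[1:] + order[:1])
    return sum(cost[a][b] for a, b in legs)

# Fix the first city and enumerate the rest (5! = 120 candidate routes)
best = min(((0,) + p for p in permutations(range(1, len(cities)))), key=route_cost)
print([cities[i] for i in best], route_cost(best))
```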
• In the near future, we want to make our model more realistic by adding timetables into our model.
• In addition, we want to add more constraints to decide the optimal group size and ways to deal with
customers’ luggage.
Predicting Airbnb Prices in NYC
Team: Jiahui Bi, Ruoqi Wang, Yicong Ma, Xin Chen
Instructor: Yifan Hu Business Intelligence & Analytics
November 13th 30th, 2017
19
Introduction
• Background: Since emerging around 2013, the sharing economy
has become a vital part of the industry and affects almost
everyone's life. Airbnb, as one of the representatives of the sharing
economy, attracts constant attention from researchers and users.
• Purpose: explore which factors influence the price of
Airbnb houses/apartments/rooms and use those factors
to predict prices in New York City.
• Key Questions: What factors influence prices in
New York, and how?
• Technology: Python (sklearn, xgboost)
Data Preparation
• Data Source:http://insideairbnb.com/get-the-data.html
New York City, New York, United States
October 2017
• Data cleaning: we start with 44,317 rows of data covering 96
features
 Drop entries that are missing (NaN) values for columns
like "bedroom" and "bed"
 Substitute missing values with the column mean for columns
like "review_scores_location"
 Set the value of 'reviews_per_month' to 0 where there is
currently a NaN
 Transfer string values into integer or float types
 Delete entries whose "price" is an outlier (more than $2,000,
or 0)
• Finally, we get 43,148 rows of data and 28 features
Exploratory Data Analysis
• Highly relevant features
• Price Distribution
Conclusions
• Built 5 models and ensembled them
• Tested the xgboost model against a baseline model and Ridge
Regression
• The most important features that influence prices in NYC
are location, room type and the number of bedrooms
Future Work
• Extract new features, such as amenities.
• The price is highly positively related to location; in our case,
listings in Manhattan are highly correlated with higher pricing.
• Accommodates, room type and the numbers of bathrooms,
bedrooms and beds also have strong correlations with the price.
Feature Importance
Method 1
• Built vanilla Linear Regression, Ridge Regression, Lasso
Regression and Bayesian Ridge on cross-validation splits of the data
• Figure 1 shows the results when using 18 features;
Figure 2 shows the results after one-hot encoding (creating dummy
variables)
Figure 1 Figure 2
• So we can see Lasso comes out on top
• Ensemble model: Gradient Boosting
• We do better with the Gradient Boosting Regressor, with
MAE = 25.52, almost 20% less than the previous method
Method 2
• Ridge, baseline & xgboost
• To test our different models in depth, we repeated the train-
validation split and looked at how the errors are distributed.
We also added a baseline model for comparison.
• Both Ridge and xgboost beat the baseline model
• Ridge Regression performs slightly better than the xgboost
model
Model | Mean RMSE
baseline | 129.69
ridge | 97.24
xgboost | 98.31
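A sketch of that Method-2 comparison: repeated train/validation splits scoring a mean-price baseline, Ridge and XGBoost by RMSE. The synthetic features, prices and hyperparameters below are assumptions standing in for the cleaned listing data.

```python
# Repeated train/validation splits comparing baseline, Ridge and XGBoost RMSE
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 18))                                            # stand-in for 18 listing features
y = 100 + X @ rng.normal(size=18) * 20 + rng.normal(scale=30, size=1000)   # stand-in nightly prices

def rmse(y_true, y_pred):
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))

scores = {"baseline": [], "ridge": [], "xgboost": []}
for seed in range(10):                                                     # repeated splits
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=seed)
    scores["baseline"].append(rmse(y_val, np.full_like(y_val, y_tr.mean())))
    scores["ridge"].append(rmse(y_val, Ridge(alpha=1.0).fit(X_tr, y_tr).predict(X_val)))
    scores["xgboost"].append(
        rmse(y_val, XGBRegressor(n_estimators=300, max_depth=5).fit(X_tr, y_tr).predict(X_val))
    )

for name, vals in scores.items():
    print(f"{name}: mean RMSE = {np.mean(vals):.2f}")
```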
Portfolio Optimization of Cryptocurrencies
Team: Yue Jiang, Ruhui Luan, Jian Xu, Shaoxiu Zhang, Tianwei Zhang
Instructor: Alkis Vazacopoulos Business Intelligence & Analytics
22
Motivation
•Cryptocurrencies are a very important topic for investors and
the economic world.
•Everyone wants to understand whether it is worthy and safe
to invest in this type of assets.
•We want to create a portfolio optimization model so that we
can test the performance of Cryptocurrencies .
•We’d like to know and understand the returns and the risk we
take for different portfolios of cryptocurrencies.
Technology
•Used an API to crawl cryptocurrency price data from the
internet.
•Python to construct the portfolio with Monte Carlo
simulation and to optimize the Sharpe ratio and the portfolio variance.
•Python's matplotlib package to visualize the
scatter plot of expected return and expected volatility.
Current & Future Work
•We have price data of 9 different cryptocurrencies for the period of time from May 24th, 2017 to Oct 3rd, 2017, which is a very small
subset of the data but it is a good start for us to test the investment concepts and to optimize the portfolio of digital currencies.
•With Monte Carlo Simulation method, we have constructed out portfolio model. Then, we have calculated the covariance of this 9
cryptocurrencies and also the correlations of every pair of combination.
•We have used Python to calculate and visualize the Markowitz efficient frontier of our data.
•We have optimized the Sharpe Ratio so that we can get the portfolio of maximized Sharpe Ratio. Also, we have minimized the
variance of the portfolio so we can get lower risks.
• We can observe from our findings that the volatility of digital currencies are very high due to the changes of the markets, the
expectation of investors and also the regulations from government.
•In the future, we can include more cryptocurrencies and longer time range into our data so that we can get more exact results.
Moreover, if the regulations are loose and investors can make transactions between digital currencies and real currencies, we then
can get the data of such transactions and find the opportunity of arbitrage.
Data Snapshot
Our data is crawled from online cryptocurrency markets. There are nearly 1,500 types of digital currencies in the market; however, for testing the concepts and due to the limits of our computers, we only collected market prices for 9 cryptocurrencies, for the time frame from May 24th, 2017 to Oct 3rd, 2017.
Log Returns
As most statistical analysis approaches rely on log returns rather than the absolute time series, we use Python's NumPy package to compute the log returns of the 9 cryptocurrencies. The returns indicate that the risk of investing in virtual currencies was very high during this period, because only two of them had positive returns.
Covariance & Correlation
To see the dependence between these 9 digital currencies, we computed the covariance and the correlation between each pair. The results show positive correlations, and most of the currencies are tightly related to Bitcoin.
Random Weights
Subject to the constraint that the sum of the weights of the portfolio holdings
must be 1, we have randomly assigned weights to our 9 cryptocurrencies.
Monte Carlo Simulation & Sharpe Ratio
Optimization
We then use Monte Carlo simulation to run 4,000 iterations of randomly generated weights for the individual virtual currencies, and calculate the expected return, expected volatility and Sharpe ratio for each of the randomly generated portfolios.
It is then very helpful to plot these combinations of expected return and volatility on a scatter plot, coloring the data points by the Sharpe ratio of that particular portfolio.
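A minimal NumPy sketch of the simulation, assuming prices is a DataFrame with one column of daily closing prices per cryptocurrency (the annualization factor of 252 trading days is an assumption):

import numpy as np

log_ret = np.log(prices / prices.shift(1)).dropna()   # daily log returns
mean_ret = log_ret.mean() * 252                       # annualized mean returns
cov = log_ret.cov() * 252                             # annualized covariance matrix

n_assets, n_portfolios = log_ret.shape[1], 4000
results = []
for _ in range(n_portfolios):
    w = np.random.random(n_assets)
    w /= w.sum()                                      # weights sum to 1
    exp_ret = float(w @ mean_ret)
    exp_vol = float(np.sqrt(w @ cov.values @ w))
    results.append((exp_ret, exp_vol, exp_ret / exp_vol))  # Sharpe ratio, risk-free rate ~ 0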
Optimization on Variance of Portfolio
We minimized the portfolio variance so that we can invest in the portfolio with the lowest volatility. The result suggests investing in the first, second, third and last digital currencies.
Maximization of Sharpe Ratio
By searching for the maximum Sharpe ratio of the portfolio, the result is to choose only the last digital currency.
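A sketch of both optimization steps with SciPy, reusing mean_ret and cov from the simulation sketch above; long-only weights and a zero risk-free rate are assumptions:

import numpy as np
from scipy.optimize import minimize

def port_vol(w):
    return float(np.sqrt(w @ cov.values @ w))

def neg_sharpe(w):
    return -float(w @ mean_ret) / port_vol(w)

n = len(mean_ret)
cons = ({"type": "eq", "fun": lambda w: w.sum() - 1},)   # fully invested
bnds = [(0, 1)] * n                                      # long-only weights
w0 = np.full(n, 1.0 / n)

min_var = minimize(port_vol, w0, method="SLSQP", bounds=bnds, constraints=cons)
max_sharpe = minimize(neg_sharpe, w0, method="SLSQP", bounds=bnds, constraints=cons)
print(min_var.x.round(3), max_sharpe.x.round(3))         # optimal weight vectors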
Regression Model for Box Office Gross
• As with the rating model, 3 models with different variables are compared by MSE on 4 testing datasets; model 2 is the best
• In the final regression model for gross, budget, factors 1, 2, 3 and 5, duration, director and actors are the explanatory variables, with factors 3 and 5 entering negatively.
Regression Model for Rating
• 3 models with different variables entered are fitted on the training datasets and verified on the testing datasets; we examine the MSE of each model in each test, and model 1 is considered the best
• In the final regression model for rating, duration, director, year and actors are the explanatory variables, with year entering negatively.
Predicting Movie Rating and Box Office Gross
Team: Yunfeng Liu, Erdong Xia, Yash Naik
Instructor: Amir H. Gandomi Statistical Learning & Analytics
Spring, 2017
MSE of rating model
         test1    test2    test3    test4
model1   0.3168   0.3487   0.3872   0.4128
model2   0.3171   0.3627   0.3872   0.4128
model3   0.3195   0.3492   0.3890   0.4148

MSE of gross model
         test1          test2          test3          test4
model1   2.278 × 10^15  4.017 × 10^15  1.926 × 10^15  1.635 × 10^15
model2   2.269 × 10^15  3.960 × 10^15  1.928 × 10^15  1.650 × 10^15
model3   2.277 × 10^15  4.011 × 10^15  1.925 × 10^15  1.636 × 10^15
rating model of all years
Variable    Parameter Estimate   Standardized Estimate
Intercept   41.94717             0
duration    0.01275              0.25958
director    0.13974              0.17509
actors      0.05001              0.06428
year        -0.01848             -0.16053

gross model of all years
Variable    Parameter Estimate   Standardized Estimate
Intercept   5.1 × 10^6           0
Factor1     2 × 10^7             0.25932
Factor2     1.9 × 10^7           0.24701
Factor3     -5.0 × 10^6          -0.069
Factor5     -5.0 × 10^6          -0.0678
duration    2.5 × 10^5           0.06465
director    4.3 × 10^6           0.06724
actors      7.6 × 10^6           0.12248
budget      2.2 × 10^-1          0.24059
Introduction
• The primary purpose of this project was to create models using existing movie data for prediction of box office gross and movie ratings.
• The first model was created to predict the Gross box office revenue that a movie will generate given inputs like the genre, actors involved,
the director and many more influencing factors.
• The second model was created to predict the movie rating considering the genre, the budget of the production and many more influencing
factors.
Models
• Principal Components Analysis
• Multivariate Regression Model
Data Source
• Database is from www.kaggle.com
• Over 5,000 movies from the past 20 years, with 18 variables
Data Processing
• Delete null and missing values from the raw data (5,500 rows → 3,116 rows)
• Select movies with a sufficient number of reviewers (3,116 rows → 1,558 rows)
• Transform director and actor Facebook likes into a score level from 1 to 5
• Transform the genre string variable into Boolean variables for 17 specific categories
• Divide the population evenly into four testing datasets and set the training dataset according to each testing dataset for regression model testing (a pandas sketch of these steps follows)
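A hedged pandas sketch of these steps, using column names from the Kaggle movie metadata file ("genres", "director_facebook_likes") as assumptions:

import pandas as pd

movies = pd.read_csv("movie_metadata.csv").dropna()

# Bin director Facebook likes into a 1-5 score level (rank first so quantile bin edges stay unique)
movies["director_score"] = pd.qcut(movies["director_facebook_likes"].rank(method="first"),
                                   5, labels=[1, 2, 3, 4, 5]).astype(int)

# Expand the pipe-separated genre string into Boolean indicator columns
genre_dummies = movies["genres"].str.get_dummies(sep="|")
movies = pd.concat([movies, genre_dummies], axis=1)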
Principal Component Analysis
• To investigate the movie score and gross regression models, principal component analysis is conducted on the 17 genre categories.
• From the scree plot, the first 6 factors are selected as principal components of movie genres; their eigenvalues are greater than 1 and the cumulative variance explained is greater than 0.6.
• The scoring coefficients of the 6 principal components are displayed, where green cells are positive indicators of a factor and red cells are negative indicators.
• From the PCA we conclude that Factor 1 resembles family animation without thriller and crime, Factor 2 resembles action and sci-fi with thriller, Factor 3 is action without horror, Factor 4 is biography, Factor 5 is crime, and Factor 6 is documentary.
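A minimal scikit-learn sketch of the component selection, assuming genre_cols is the list of 17 genre indicator columns built above:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(movies[genre_cols])
pca = PCA().fit(X)

eigenvalues = pca.explained_variance_
n_keep = int(np.sum(eigenvalues > 1))                         # Kaiser criterion: eigenvalue > 1
print(n_keep, pca.explained_variance_ratio_[:n_keep].sum())   # components kept, cumulative variance

scores = PCA(n_components=n_keep).fit_transform(X)            # factor scores used in the regressions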
Conclusion
• From the movie rating model, it can be seen that a long movie shot by a famous director is more likely to earn a high rating score on reviewer websites, and the genres of the movie are largely irrelevant.
• From the movie gross model, it can be seen that a big-budget movie in the family-animation or action-sci-fi genres is more likely to be a box office success. Also, in terms of earning money, actors seem more important than the director based on this model.
[Figure: four-fold train/test layout across Test1–Test4 (white = testing, red = training), with the candidate variables entered in models 1–3: the rating models draw on duration, director, year, actor and budget; the gross models draw on duration, director, Factors 1, 2, 3, 4 and 5, actor, budget and year]
24
Student Performance Prediction
Team: Abineshkumar, Sai Tallapally, Vikit Shah
Instructor: Amir H. Gandomi
Multivariate Data Analysis
Fall, 2017
Introduction & Motivation
• Regression and Classification models are developed to help
schools to predict the performance of students in final exam
using historical data including demographic and previous test
scores
• Prediction tools can be used in improving the quality of
education and enhancing school resource management
Data Understanding
• Final grade (G3) of students ranges from 0 – 20 which is
displayed in the histogram below.
Modeling and Technology
• Two datasets with the same number of variables are used: the first covers the Mathematics subject and the second the Portuguese subject
• G3 (final grade), ranging from 0 – 20, is the target.
• A linear regression model is built to predict the final grade of students from their given information
• The linear equation is formed by selecting the top 5 variables, those that provide the lowest BIC and Mallow's Cp while providing an optimal adjusted R2
• The most important variables are G1 and G2, the first and second period grades, which matter most in predicting the final grade
• A classification is also done on the grade: when a student's score is more than 10 it is classified as Pass, and otherwise as Fail
• The classification technique used here is logistic regression
• Classifying the grades (0 – 20) into two classes (Pass / Fail) produces a more accurate model, but in this experiment the focus was to build a linear regression model that predicts the final grade as a numeric value, as shown below.
Result and Conclusion
• The final grades of the students depend mostly on the first and second period grades; demographic information, such as whether a student's father has a job or whether attendance is high, was also useful in the model building process.
• The final grade has a decent correlation with the first and second period grades, and these fields were very important in building the model. In the decision tree model, we can see the impact of the mother having a job title of "teacher" or "other". In the linear model, family relationship and age appear among the top 5 independent variables, which shows their importance as well.
• From these results educators can understand the impact of the earlier grades (first and second period), so they can plan to support students around those exams. The predicted student performance also helps in implementing this model as a policy in schools.
Reference: http://www3.dsi.uminho.pt/pcortez/student.pdf
Linear Regression Model
• From the summary of the model, we took the top 5 variables to build the linear regression model:
Model = lm(G3 ~ G1 + G2 + absences + famrel + age)
• G1, G2, absences, quality of family relationship (famrel), and age of the student are the top 5 independent variables for predicting the final grade (G3).
• R-squared: 0.8599, Adjusted R-squared: 0.846
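The poster fits this model with R's lm(); an equivalent Python sketch using statsmodels, assuming the UCI student file (semicolon-separated) is available locally:

import pandas as pd
import statsmodels.formula.api as smf

students = pd.read_csv("student-mat.csv", sep=";")
model = smf.ols("G3 ~ G1 + G2 + absences + famrel + age", data=students).fit()
print(model.summary())   # coefficients, R-squared and adjusted R-squared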
Decision Tree in R
• The decision tree algorithm uses the first period grade, second period grade, and mother's job to predict whether a student will pass or fail the final exam.
• The fitted model predicts that a student passes the final exam if their second period grade is >8, or their first period grade is >10, or the student's mother has a job title of "teacher" or "other". The accuracy of the model is 87%, which is very close to the models built on Azure.
25
Classifying Iris Flowers
Team: Xi Chen, Shan Gao, Lan Zhang
Instructor: Amir H. Gandomi
Business Intelligence & Analytics
26
Purpose
• The sepal and petal of iris flowers are very different from those of other flowers, and there are three different types of iris in the world. To classify iris flowers, we create a k-nearest neighbors (KNN) model to assign a label (class) to a new instance. A regression model is also developed to assign a value to the new instance.
Data Description
• This famous (Fisher's or Anderson's) iris data set gives the
measurements in centimeters of the variables sepal length
and width and petal length and width, respectively, for 50
flowers from each of 3 species of iris. The species are Iris
setosa, versicolor, and virginica.
Methodology
Linear Regression
• We plotted the scatter matrix of the Iris data before diving into this topic; the data set consists of four measurements (length and width for petals and sepals).
• We ran regression analyses for petals and sepals to see their significance.
KNN method with R
• A similarity measure is typically expressed by a distance measure, such as the Euclidean distance we use on the Iris dataset.
• We use this similarity value to perform predictive modeling, i.e. classification, by assigning a label (class) to the new instance.
• We divide the dataset into training and test sets to assess the accuracy of our classification (a Python equivalent is sketched below).
Clustering
• Hierarchical cluster analysis is used to visualize how clusters are formed and the relationships among Iris members.
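The poster builds the classifier in R; a minimal Python equivalent with scikit-learn, using the same Euclidean-distance KNN and a train/test split:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))   # classification accuracy on the test set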
Data Analysis Results:
Business Intelligence & Analytics
31
Experiments for Apartment Rental Advertisements
Team: Nishant Bhushan, Sharvari Joshi, Nirmala Seshadri
Instructor: Chihoon Lee
Resulting Regression Equation
Likelihood of responding to the Ad = 0.6425 -0.0875 * Amenities[No]
- 0.0775 * Public Transport[No]
Expected Response under Best Factor-Level Choice
Likelihood of Responding to the Ad = 0.6425 – 0.0875 (-1) – 0.0775 (-1)
= 0.81
Objective
• To determine what factors in apartment rental advertisements
contribute to responses from ad viewers.
• Apartment hunt has always been a crucial experience for graduate
students. Therefore, we would like to perform an experiment to see
how response is affected by different factors in a posting on the
website.
• We are performing an experiment to maximize the student’s
response to a post on a housing website of an apartment rental by
determining which factors affect the student’s response.
Approach
A customer survey was conducted and 16 sample ads were made.
Each ad was sent to 5 customers, who were asked how likely they were to respond to that particular ad.
The customers were asked to choose one of the following options:
• Extremely Likely (100%)
• Very Likely (80%)
• Moderately Likely (60%)
• Slightly Likely (40%)
• Not Very Likely (20%)
CHOICE OF EXPERIMENTAL DESIGN
• 2^(8−4) fractional factorial design of resolution IV
• NUMBER OF LEVELS FOR EACH FACTOR = 2
• NUMBER OF FACTORS = 8
• NUMBER OF OBSERVATIONS = 16
• RESOLUTION = 4
• No. of replications: 5
• We chose this design because the number of factors is large (8). A full factorial design would result in a large number of interactions and 2^8 = 256 runs, which would be impossible to carry out given the time and budget constraints.
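A hedged NumPy sketch of how main effects like those in the regression equation above are estimated from a ±1-coded design matrix; the two-factor design slice and the responses below are placeholders, not the actual survey data:

import numpy as np

# 16 runs, factors coded -1/+1 (illustrative two-factor slice of the 2^(8-4) design)
X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]] * 4, dtype=float)
y = np.random.uniform(0.2, 1.0, size=16)      # placeholder mean likelihood responses

# Least-squares fit of an intercept plus the main effects
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
print(coef)   # intercept and one coefficient per factor (cf. 0.6425, -0.0875, -0.0775 above)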
Example of a survey
- 16 advertisement surveys were created.
- Each survey had 5 responses
- Target Audience: Students at Stevens Institute of Technology
Business Intelligence & Analytics
THE DESIGN OF EXPERIMENT
Conclusion
• Not including the amenities decreases the chance of a customer responding to the advertisement by 8.75%.
• Not including access to public transport decreases the chance of a customer responding to the advertisement by 7.75%.
• Mentioning the amenities and access to public transport are the most significant factors and should be included in housing ads.
• The posting time of the ad is not a significant factor, so the advertisement can be posted at any time during the day.
• Street maps and pictures are not significant factors either and can be left out.
Factors and Levels
Human Resource Retention
Team: Aditya Pendyala, Rhutvij Savant
Instructor: Amir H Gandomi
Data Analyses
• The heatmap below depicts a correlation between
the different features of the data set:
• The count of each of the nine incidences is depicted
below:
• The above deviance table indicates the estimates,
standard errors, and z-values of all variables involved
in the dataset
Introduction
Problem: An organization aims to retain its important resources and reduce the number of employees leaving the company. The HR department has collated employee-related data which can be used to predict which employees may leave. If a prediction model with high accuracy can be designed, the HR department can take action to prevent critical employees from leaving by addressing the variables that are causing an employee to resign.
Data: The data provided is compiled by Human Resources and consists of ten variables:
Technology:
• R has been used to develop the logistic regression prediction model. A confusion matrix and ROC curve are used to improve the accuracy of the model and to set the threshold
• Python has been used to generate data visualizations of various analyses, including Principal Component Analysis
Conclusion
Based on the above analyses, the following observations can be made:
• The model can be used to identify the majority of employees who may leave.
• For the identified employees, the analyzed data, such as promotion, monthly hours and salary, can be used by HR to stop the employee from leaving, as shown in the correlation and data analysis charts.
Multivariate Data Analytics
Fall 2017
Logistic Regression Prediction Model
• Number of available instances: 15,000 employees.
• The database is divided into training and testing subsets in a 4:1 ratio
• The prediction model was created on the training dataset using logistic regression (12,000 employees).
• The model is validated on the employees in the testing set (3,000 employees)
• A confusion matrix is used to show the accuracy of the model
• The ROC curve is used to set the threshold to 0.3, the optimal trade-off between the true positive rate and false positive rate, which gave the model an accuracy of 78% (a scikit-learn sketch of this workflow follows)
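The poster builds the model in R; a minimal scikit-learn sketch of the same workflow, assuming X holds the encoded HR features and y the "employee left" flag:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_curve

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)              # 12,000 / 3,000 split

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
probs = clf.predict_proba(X_test)[:, 1]

fpr, tpr, thresholds = roc_curve(y_test, probs)       # inspected to pick the cut-off
pred = (probs >= 0.3).astype(int)                     # threshold of 0.3 chosen from the ROC curve
print(confusion_matrix(y_test, pred))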
Future Potential
• Address all possible contributing factors to prevent employee departures proactively
• Use the framework to increase the accuracy of the model
• Analyze more variables that might affect an employee leaving, e.g. commute time or training opportunities
Principal Component Analysis
With the large number of
variables, correlation between
different variables is inevitable.
By Principal Component Analysis,
all variables can be made to be
linearly uncorrelated and
redundant variables can be
dropped. Based on this graph, it is
visible that the 7th component
can be removed.
 Satisfaction Level
 Last evaluation
 Average monthly hours
 Time Spent at company
 Promotion
 Salary
 Employee Left
 Work Accident
 Department
 Number of projects
T = 0.5       Predicted 0   Predicted 1
Actual 0      TN 2102       FP 183
Actual 1      FN 434        TP 281

T = 0.3       Predicted 0   Predicted 1
Actual 0      TN 1846       FP 439
Actual 1      FN 219        TP 496
ROC Curve
27
Constructing Efficient Investment Portfolios
Team: Cheng Yu, Tsen-Hung Wu, Ping-Lun Yeh
Instructor: Alkis Vazacopoulos
Modeling
Capital Asset Pricing Model(CAPM): Optimization Model:
Analysis
Business Intelligence & Analytics
Fall, 2017
Introduction
Individual investors often do not know how to build an investment portfolio. For this reason, they usually rely completely on advice from financial companies. We offer a customized recommendation planner that advises what quantities of each stock to buy to achieve a target return rate with the minimum portfolio variance.
Technology
• The Capital Asset Pricing Model (CAPM) is used to generate expected returns while accounting for systematic risk.
• IBM DOcplex Modeling for Python executes the quadratic programming that optimizes the objective of the mathematical model.
• Visualizations are produced in Python (pygal) and Microsoft Power BI.
E(R_i) = R_f + β_i [E(R_M) − R_f]
E(R_i) = expected return of stock i
R_f = risk-free rate
β_i = beta of stock i; a measure of systematic risk
E(R_M) = expected return of the market portfolio
E(R_M) − R_f = market risk premium; a measure of the excess return of the market portfolio over the risk-free rate
P_i = current price of stock i
Q_i = quantity of stock i
I_i = investment amount of stock i, with I_i = Q_i × P_i
W_i = weight assigned to each stock i in the portfolio, W_i = I_i / Σ_{i=1..n} I_i
R_i = yearly return for stock i
VC = covariance matrix, with entries Σ (R_i − R̄_i)(R_j − R̄_j) / (N − 1)
R_p = portfolio return
σ_p² = portfolio variance, σ_p² = Wᵀ × VC × W
B_p = actual portfolio budget
B_i = estimated disposable income
Objective: minimize σ_p²
Subject to:
R_p = Σ_{i=1..n} R_i × W_i ≥ target return rate
B_p ≤ B_i
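The poster solves this quadratic program with IBM DOcplex; a minimal SciPy sketch of the same formulation, assuming R (yearly returns) and VC (covariance matrix) are precomputed NumPy arrays and the target return rate is 30%:

import numpy as np
from scipy.optimize import minimize

target = 0.30                                          # target return rate

def variance(w):                                       # sigma_p^2 = w' VC w
    return float(w @ VC @ w)

n = len(R)
cons = ({"type": "eq", "fun": lambda w: w.sum() - 1},             # weights sum to 1
        {"type": "ineq", "fun": lambda w: float(w @ R) - target}) # R_p >= target return
res = minimize(variance, np.full(n, 1.0 / n), method="SLSQP",
               bounds=[(0, 1)] * n, constraints=cons)
weights = res.x               # W_i; quantities follow from Q_i = W_i * budget / P_i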
Simulation
Conclusion
We incorporated CAPM into our model, as it is broadly used among financial experts as an evaluation method for future stock prices.
According to the simulation results, the higher the target return rate, the higher the risk of the portfolio. Furthermore, the model recommends a particular combination of stocks with the minimum portfolio risk for the chosen target return rate.
Investors can use the results of this model to find their target stocks and customize their stock portfolios.
[Simulation workflow: CAPM model → optimization model → actual budget spent and number of shares per stock]
Example profile:
Age: Under 25
Disposable Income: 4,361 USD
Actual expense: 4,342 USD
Target Return: 30%
Portfolio Variance: 0.66
Number of Stocks Selected: 20
28
A Tool for Discovering High Quality Yelp Reviews
Team: Zijing Huang, Po-Hsun Chen, Hao-Wei Chen, Chao Shu
Instructor: Rong Liu Business Intelligence & Analytics
32
Motivation
 For Customers:
Customers are more likely to read reviews with details instead
of reviews that were written to just vent emotions.
 For Companies:
o Objective reviews are helpful in improving their products.
o High-quality reviews can increase customer engagement
and attract more users.
Introduction
 Objective: a tool for discovering high-quality reviews
 High-quality review: objective and supplies insightful details about the user experience.
 Dataset: Yelp hotel and restaurant reviews
o Raw data: 4,736,897 reviews
o Data with labeled aspects and sentiment: 1,132 sentences
 Methodology: deep learning with Convolutional Neural Networks (CNN) and word embeddings
Approach
1. Use the Yelp raw data to train word vectors that capture word semantics
2. Create a CNN to identify aspects in every sentence of a review
3. Train another CNN to detect the sentiment of each sentence
4. Build a neural network model to predict the quality of a review based on the aspects and sentiment of its sentences
5. Compare this approach with other models (SVM, Naïve Bayes, …) and analyze the pros and cons of each model
6. Visualize the result through a user interface and publish a Python package (or a RESTful API) for third-party use
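A hedged Keras sketch of the sentence-level CNN used in steps 2–3, assuming sentences are already padded integer sequences and the trained word vectors are loaded into embedding_matrix (sizes are illustrative, and the zero matrix below is only a placeholder):

import numpy as np
from tensorflow.keras import layers, models, initializers

vocab_size, embed_dim, n_aspects = 20000, 300, 5        # illustrative sizes
embedding_matrix = np.zeros((vocab_size, embed_dim))    # placeholder for the trained word vectors

model = models.Sequential([
    layers.Embedding(vocab_size, embed_dim,
                     embeddings_initializer=initializers.Constant(embedding_matrix),
                     trainable=False),                  # frozen word embeddings
    layers.Conv1D(128, 5, activation="relu"),           # n-gram feature detectors
    layers.GlobalMaxPooling1D(),
    layers.Dense(64, activation="relu"),
    layers.Dense(n_aspects, activation="sigmoid"),      # multi-label aspect output
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])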
Methodology
High Quality vs. Low Quality
[Figure: example reviews annotated with the useful and non-useful information extracted]
Result
Analysis of Diabetes among Pima Indians
What factors affect the occurrence of diabetes?
Team: Junjun Zhu, Jiale Qin, Yi Zhang
Instructor: Amir H. Gandomi
Results & Evaluation
Model 1: Baseline Model
Model 2: Explanatory Model – Logistic Regression
Objectives:
• What factors affect the occurrence of
diabetes?
• How best to classify a new observation?
• What is the best model for this data?
Modeling
1. Simple Logistic Regression
2. Logistic Regression
3. K Nearest Neighbor
Conclusion
Multivariable Data Analysis
Fall, 2017
Data Information
Type: Classification
Data Source: https://archive.ics.uci.edu/ml/datasets/pima+indians+diabetes
Data Preparation
The AUC value is 0.8505833, the F1 score is 0.6394558, and the recall (sensitivity) is quite low at 0.5875.
The AUC value is 0.8336667, the F1 score is 0.528, and the recall (sensitivity) is quite low at 0.4125.
Model 3: Predictive Model -- KNN
The AUC value is 0.7295417, the F1 score is 0.5384615, and the recall (sensitivity) is quite low at 0.525.
We have used three different models to study
the data. From the results, we found that the
simple logistic regression model is the best
among these three models, because the
sensitivity, accuracy, and F1 values are higher
than for the other methods.
Figure 1 shows the distribution of the data, including people who are diabetic and those who are not. Figure 2 shows the people who have diabetes, and Figure 3 shows the people who do not. We can see that the people without diabetes are more concentrated in the center: these people are younger, and their index values are closer to typical ranges.
29
Google Online Marketing Challenge: Aether Game Cafe
Team: Ephraim Schoenbrun, Jaya Prasad Jayakumar, Sunoj Karunanithi, Saketh Patibandla
Instructor: Chihoon Lee
Factors & Sample Ads
Business Intelligence & Analytics
30
Client and Market Analysis
• Aether Game Cafe (AGC) is a blend of traditional coffee shop and
board games play centre located in Hoboken, New Jersey.
• AGC relies only on ‘Word of Mouth’ for its marketing.
User Flow
GOMC
• The Google Online Marketing Challenge is a unique opportunity for
students to experience and create online marketing campaigns
using Google AdWords.
• With a $250 AdWords advertising budget provided by Google, we
developed and ran an online advertising campaign for a business
over a three week period.
Customer Analysis – Google Analytics
Experiments and Results
Hudson County New York
Significant Factors
Main Effects and
Interaction Effects
Successful Ads
Conclusion
• The experimental design helped us achieve our target, which eventually placed our campaign in the top 3% worldwide.
The World’s Best Fitness Assistant
Team: Anand Rai, Jaya Prasad Jayakumar, Saketh Patibandla
Instructor: Christopher Asakiewicz
Fitness Assistant
• Welcome
Intelligence of our BOT
Same Questions but Different answers based on context
Business Intelligence & Analytics
33
Chatbot Framework
Text and Speech Enabled Assistant
Introduction
• We are fitness enthusiasts but never found a good application that guides us while working out.
• A few exist, but they are static chatbots that answer only predefined questions.
• Our Fitness Assistant is AI-driven and uses Natural Language Processing to understand questions and give the best answer.
• This is a prototype, and we are going to extend the application to all exercises and food habits.
Natural Language Processing
• Natural Language Processing is the key to an efficient chatbot.
• Every question is stored for future analysis and the knowledge
base gives the answer to the question even if the question is
not hardcoded.
• The NLP engine searches for the entity in the question, similar to the noun in a sentence, and narrows down the answer.
Future Scope
• The application will be improved by adding all varieties of exercises and multiple sources for the knowledge base.
• The app aims to be a one-stop solution for fitness enthusiasts covering exercises and nutrition. We plan to build an in-app module to track user behavior so the bot can suggest what to consume to improve their wellbeing.
Greetings 1
Greetings 2
Questions
Zillow’s Home Value Prediction
Team: Wenzhuo Lei, Chang Xu, Juncheng Lu
Instructor: Chris Asakiewicz
Background & Objectives
It is absolutely important for homeowners to have a trusted way of monitoring the assets. The “Zestimate”
are created for estimating home values based on a large amount of data. Zillow published a competition of
improving the accuracy of prediction for house.
Business Intelligence & Analytics
Data Imputation
To start, we checked the log error (target value); it follows a normal distribution. We also checked the frequency of trades by day and month: the exact date is not a significant factor for trading houses, and people tend to trade houses in summer.
To impute the data, we separated the features into several types:
• Features with more than 90% missing values → deleted.
• Features with no missing values → kept.
• Binary features → missing values filled with 0.
• Irrational features → missing values filled with -1.
• Features with few missing values → filled with the mean of the feature.
• Special feature (total living area of the home) → imputed with KNN based on the numbers of bedrooms and bathrooms, as more bedrooms and bathrooms usually mean a bigger area (a sketch follows).
• Special features that depend on longitude and latitude → missing values filled in with KNN based on longitude and latitude.
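A hedged scikit-learn sketch of the KNN fill for the living-area feature, assuming props is the properties DataFrame; the Zillow-style column names ("calculatedfinishedsquarefeet", "bedroomcnt", "bathroomcnt") are assumptions, and the bed/bath counts are assumed to be already filled:

from sklearn.neighbors import KNeighborsRegressor

target = "calculatedfinishedsquarefeet"
feats = ["bedroomcnt", "bathroomcnt"]

known = props[props[target].notna()]
missing = props[props[target].isna()]

knn = KNeighborsRegressor(n_neighbors=5).fit(known[feats], known[target])
props.loc[missing.index, target] = knn.predict(missing[feats])   # impute from similar homes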
To avoid overfitting, we chose to reduce the number of features, and we checked feature importance to select them better. We applied a regression model; the R² is low, probably because of multicollinearity. To reduce its effect, we used Variance Inflation Factors to drop features, leaving 10 features.
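A minimal statsmodels sketch of the VIF-based pruning, assuming X is the numeric feature DataFrame after imputation and that a VIF threshold of 10 is an acceptable cut-off:

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def drop_high_vif(X, threshold=10.0):
    X = X.copy()
    while True:
        vif = pd.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
                        index=X.columns)
        if vif.max() <= threshold:
            return X                               # no remaining multicollinearity above threshold
        X = X.drop(columns=[vif.idxmax()])         # drop the most collinear feature and repeat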
Challenge
The data offered by Zillow includes voluminous missing values across 57 features. The accuracy of the prediction is primarily affected by how the data is chosen and prepared.
Modeling
As mentioned above, we tried OLS and the R² was quite low. We tried Random Forest as the second method and it gave a reasonable score. After that we used Gradient Boosting, as it had the lowest Mean Squared Error and Mean Absolute Error. Finally, to be more accurate, we applied XGBoost as the last modeling method.
34
Web Traffic Time Series Forecasting
Team: Jujun Huang, Peimin Liu, Luyao Lin
Instructor: Alkis Vazacopoulos
Business Intelligence & Analytics
http://www.stevens.edu/howe/academics/graduate/business-intelligence-analytics
Introduction
• Our group focuses on the problem of forecasting the future values of multiple time series.
• The training dataset contains the views of 145,063 Wikipedia articles.
• Each time series represents the daily views of a different Wikipedia article, starting from July 1st, 2015 up until December 31st, 2016.
• Our objective is to predict the daily views between January 1st, 2017 and March 1st, 2017.
Current & Future Work
• We used a heat map and the Fast Fourier Transform in the data visualization.
• We used an ARIMA model in the data modeling part to predict the data from January 1st, 2017 to March 1st, 2017.
• In the future, we will combine the ARIMA model with other models to predict the time series and improve our accuracy.
• We are going to predict the views of all 145,063 Wikipedia articles from January 1st, 2017 to November 1st, 2017.
Methodology
• ARIMA model: We use the autoregressive integrated moving average (ARIMA) model to forecast the data.
• If the polynomial has a unit root (a factor (1 − L)) of multiplicity d, an ARIMA(p, d, q) process expresses this polynomial factorization property with p = p′ − d and is given by:
(1 − Σ_{i=1..p} φ_i L^i)(1 − L)^d X_t = (1 + Σ_{i=1..q} θ_i L^i) ε_t
• The ARIMA(p, d, q) process with drift δ can be generalized as:
(1 − Σ_{i=1..p} φ_i L^i)(1 − L)^d X_t = δ + (1 + Σ_{i=1..q} θ_i L^i) ε_t
• Here t is an integer index and the X_t are the real time series values. L is the lag operator, the φ_i are the parameters of the autoregressive part of the model, the θ_i are the parameters of the moving average part, and the ε_t are error terms. The error terms are generally assumed to be independent, identically distributed variables sampled from a normal distribution.
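A minimal statsmodels sketch of this workflow for one article, assuming series is a date-indexed pandas Series of daily views; the (p, d, q) order shown is illustrative:

from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.arima.model import ARIMA

decomp = seasonal_decompose(series, period=7)      # trend + weekly seasonality + residual
fit = ARIMA(series, order=(2, 1, 2)).fit()         # illustrative (p, d, q)
forecast = fit.forecast(steps=60)                  # daily views, Jan 1 - Mar 1, 2017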
Data Modeling
• We can use the ARIMA model without transformations if our time series is stationary.
• In our case, we decompose the series into three parts: trend, seasonality and residual. The sum of these three parts equals our observation.
• Below is one of the results of the forecasted views for one of the articles.
Analysis
• Here are the results of the data visualization:
• From the heat map, we can see a huge amount of web traffic at the end of July and at the beginning and middle of August. However, we cannot find any periodicity in the heat map.
• English articles have the highest number of visits, and the trend for Russian articles is similar to that of English.
Portfolio Optimization with Machine Learning
Team: Chao Cui, Huili Si, Luotian Yin, Qidi Ying, Yinchen Jiang
Instructor: Alkis Vazacopoulos Business Intelligence & Analytics
36
Motivation
It is currently difficult for an investor to make any inference about future prices, especially when volatility makes them uncomfortable.
With classical methods, we can only make estimates using historical data. To explore this problem, we decided to use machine learning techniques to create portfolios and attempt to make the capital distribution more realistic.
Technology
•Python is used to collect S&P 500 index stocks from Yahoo Finance for a one-year period. A threshold is set to select the 50 stocks with the highest return and lowest risk in the stock pool.
•Python is used to build machine learning models and to perform training and testing.
•The IBM CPLEX module API is applied for linear/non-linear programming and optimization.
•All calculations are based on monthly intervals.
Future Work
•We could not constrain the size of the optimized portfolio, which might lead to an impractically large portfolio. Further study of IBM CPLEX is needed.
•The machine learning model needs a lot of improvement. Future work should focus on more accurate models with a wider range of information.
•We could add more attributes of the stocks, such as industry, location and reputation, to cater to customer preferences.
•We could build a more dynamic system that takes price changes and trading fees into consideration.
[Figures: portfolio optimization without prediction (portfolio risk increases slowly until the portfolio return reaches about 0.05, with the top N stocks set as a constraint); machine learning for prediction (added features, Ordinary Least Squares prediction, ensemble modeling for the final prediction); training and testing results]
Developing a Supply Chain Methodology for an
Innovative Product
Team: Akshay Sanjay Mulay
Instructors: Alkis Vazacopoulos and Chris Asakiewicz
Business Intelligence & Analytics
37
Background
• The health club economy is a multibillion-dollar endeavor that has remained static for many years
•Nova Fit looks to revolutionize current fitness equipment by giving fitness enthusiasts safer and more ergonomic options while they work out
•The aim is to innovate every component of health club fitness
Technology
• Google Analytics for Data Storage and Excel Solver for
Calculations.
• Tableau for Visualizations and Graphical Analysis
• Monte Carlo simulation to calculate the Worst Case, Best Case
and Average Case scenarios during the actual Network Analysis
Market Share in terms of
Revenue and Number of
Bars to be produced
Number of Bars to be produced
Customer Survey
Current Scenario
• The typical round barbell is unnatural to hold and creates inefficiencies that hinder workouts
•According to an article on fitness in the NY Times, more than 90% of injuries are caused by lifting free weights
Nova Fit – Revolutionary Barbell
• NovaFit's grip solves the problem by conforming more naturally to the hands, improving confidence and performance. The spindle design also eliminates the need for the clip function that is required with current equipment
Target Customers
• Initially, the target customers will be gym and fitness club owners in the Northeast region of the United States.
•We also look to sell the product to home fitness enthusiasts.
Sample New York Fitness Center Club
Forecast and Finance Model
• The statistics used for calculating NovaFit’s
market share are taken from IBIS.
The primary competitor is Invanko
which sells the Barbell Product
at a price of $1250
Future Scope
• Build a strong supplier, distributor and manufacturer network when the actual product starts selling in the market
•Determine the suppliers based on cost, lead times and availability of the components
•Use the actual data to realign the production and sales strategy in the market
•Determine the sales channels and develop a comprehensive supply chain strategy using various simulations and scenarios
Bike Sharing Optimization
Team: Jiahui Bai,Yuankun Nai, Yuyan Wang, Yanru Zhou
Instructor: Alkis Vazacopoulos
Model
• Linear program
We selected the trip data for the selected stations over 12 months of peak hours to build the model, optimize the inventory level of bicycles and minimize the total cost of deployment with Excel Solver (a PuLP sketch with placeholder data follows).
Total cost of deployment =
∑ ( cost of configuring one bike in an area × number of bikes configured in that area )
• Linear regression
To predict demand under different weather conditions
• Logistic regression
To figure out how weather conditions affect demand
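A hedged PuLP sketch of the deployment model referenced above; the station list, costs, demands and capacities are placeholders, not the Ford bike data:

from pulp import LpProblem, LpMinimize, LpVariable, lpSum, LpStatus

stations = ["S50", "S55", "S60"]                     # placeholder station ids
cost = {"S50": 2.0, "S55": 3.0, "S60": 2.5}          # cost of configuring one bike
demand = {"S50": 12, "S55": 8, "S60": 15}            # peak-hour demand (placeholders)
capacity = {"S50": 27, "S55": 19, "S60": 23}         # dock capacity (placeholders)

prob = LpProblem("bike_deployment", LpMinimize)
bikes = {s: LpVariable(f"bikes_{s}", lowBound=0, cat="Integer") for s in stations}

prob += lpSum(cost[s] * bikes[s] for s in stations)  # total cost of deployment
for s in stations:
    prob += bikes[s] >= demand[s]                    # cover peak-hour demand
    prob += bikes[s] <= capacity[s]                  # respect dock capacity

prob.solve()
print(LpStatus[prob.status], {s: bikes[s].value() for s in stations})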
Introduction
• A bike-sharing system allows people to rent a bicycle at one of the automatic rental stations scattered across the area, use it for a short journey and return it at any other station in the area.
• Ford Bike: located in the Bay Area, it consists of 700 bikes and 70 stations across San Francisco and San Jose.
• Unbalance: the difference between the number of incoming bikes and the number of outgoing bikes within a specific time period.
Data visualization
Business Intelligence & Analytics
Problem
• Visualize trip routings on a map and the peak hours on workdays and weekends.
• Optimize the inventory level of bikes at stations during peak hours
• Minimize the cost of deploying bikes between the stations
• Improve the utilization of each bike by reducing the number of bikes that are not used in peak hours
• Predict the demand for bikes under different weather conditions
[Bar chart: trip counts (start vs. end station) at the top stations, station IDs 70, 69, 50, 55, 74, 61, 67, 60, 65, 77 and 64]
Peak hour selection
Time period: Aug 2014 – Aug 2015
Weekday discrimination: using the WEEKDAY function combined with IF in Excel,
IF(WEEKDAY(start_date,2)>5,"weekend","workday")
Time grouping: using group and outline in a PivotTable, starting at 0:00 and ending at 24:00 with a one-hour step, giving 24 time periods in total
[Line chart: hourly trips on weekends (start vs. end station), peaking at around 3,999 starts and 3,836 ends]
[Line chart: hourly trips on workdays (start vs. end station), with morning and evening peak values of 51,973, 48,410, 47,161 and 46,729 trips]
Future work
• Improve the utilization of bicycles with the linear
program model
• Optimize the bike deployment strategy
• Predict the demand in different weather conditions with
machine learning algorithms
Weather effect on the number of trips (start station / end station):
FINE 300,801 / 301,245; FOG 33,303 / 34,324; FOG-RAIN 6,578 / 7,183; RAIN 43,056 / 42,352; RAIN-THUNDERSTORM 1,629 / 1,247
[Figure captions: Top 10 popular stations; Weather effect on the number of trips; GoFord bike trip pattern (San Francisco)]
38
  • 5. 47 Predicting Results of Premier League Contest Hantao Ren, Lanyu Yu, Siyuan Dang, Jiarui Li 48 NLP Meets Yelp Recommendation System for Restaurant Rui Song 49 WSDM — KKBox’s Churn Prediction Challenge Caitlyn Garger, Yina Dong, Shuo Jin 50 Data Mining of Video Game Sales Xin Lin, Fanshu Li, Jingmiao Shen 51 DengAI: Predicting Disease Spread Vicky Rana, Pradeepkumar Prabakaran 52 Predict Tesla Model 3 Production Volume Wangming Situ, Liwei Cao, Bohong Chen, Tianyu Hao 53 Vehicle Routing Problem using NYC TLC Data Adrash Alok, Garvita Malhan, Ephraim Schoenbrun, Abhir Yadava 54 Credit Rating for a Lending Club Rui Song, Huili Si, Xiao Wan, Lulu Hu 55 Routify: Personalized Trip Planning Minzhe Huang, Bowan Lu, Jingmiao Shen, Xiaohai Su, Abhitej Kodali 56 Uncover World Happiness Patterns Rui Song, Xiao Wan, Xiaoyu Zhang 57* Data Centers – Where to Locate? Smriti Vimal, Sanjay Pattanayak, Kumar Bipulesh, Nitin Gullah, Souravi Sudamme 58 Drone Optimization in Delivery Industry Ni Man, Xinlian Huang, Xuanyan Li 59 Performance Evaluation of Machine Learning Algorithms on Big Data using Spark Neha Mansinghka, Madhuri Koti, Prathamesh Parchure 60* Duck Wisdom: A Personal Portfolio Optimization Tool Taranpreet Singh, Shivakumar Barathi, Ramona Lasrado, Nikhil Lohiya 61 Porto Seguro’s Safe Driver Prediction Boren Lu, Lanshi Li, Xiaoming Guo, Dingmeng Duan 62 Hospital Recommendation System for Patients Abdullah Khanfor, Danilo Brandao and Pedro Sa 63 Customer Segmentation for B2B Sale of Fitness Gear Juhi Gurbani, Arpit Sharma, Neha Mansinghka 64 Predicting Vehicle Collisions & Dynamic Assignment of Ambulances in NYC Divya Rathore, Dhaval Sawlani, Nitasha Sharma, Shruti Tripathi 65 Iceberg Classifier Challenge Chang Lu, Jing Li, Luotian Yin, Runtian Song 66 Predicting Movie Success Jialiang Liu, Huaqing Xie, Xiaohai Su, Liang Ma, Lanjun Hui 67 Stock Prediction Based on News Titles Jianuo Xu, Minghao Guo, Simin Liang, Yudong Cao, Yunzhe Xu 68 How Consumer Reviews Affect a Star’s Ratings Jinjin Li, Prabhjot Singh, Yutian Zhou, Xuetong Wei, Xiaoyu Zhang
  • 6. 69 Student Alcohol Consumption: Predicting Final Grades Ping-Lun Yeh, Zhuohui Jiang, Gaurang Pati 70 Mobile Banking Fraud Detection Junyuan Zheng, Ke Cao, Miaochao Wang, Tuo Han 71 Subway Delay Dilemma Smit Mehta, Nishita Gupta, Matthew Miller, Jianfeng Shi 72 Integrated Digital Marketing Studies on Hoboken Local Restaurant Shuting Zhang, Yalan Wang, Haoyue Yu, Liyu Ma, Christina Eng 73* AI Academic Advisor Vaibhav Desai, Piyush Bhattad 74 JFK Airport – Flight Delay Analysis Praveen Thinagarajan, Arun Krishnamurthy, Thushara Elizabeth Tom, Sunoj Karunanithi 75 Machine Learning on Highly Imbalanced Manufacturing Data Set Liyu Ma 76* Duck Finder Salman Sigari, Shankar Raju and Team
  • 7. Master of Science Business Intelligence & Analytics Business Intelligence & Analytics http://www.stevens.edu/bia CURRICULUM Organizational Background • Financial Decision Making Data Management • Strategic Data Management • Data Warehousing & Business Intelligence Data and Information Quality * Optimization and Risk Analysis • Optimization & Process Analytics Risk Management Methods & Apps.* Data Mining • Knowledge Discovery in Databases Statistical Learning & Analytics* Statistics • Multivariate Data Analytics • Experimental Design Social Network Analytics • Network Analytics • Web Mining Management Applications • Marketing Analytics* • Supply Chain Analytics* Big Data Technologies • Data Stream Analytics* • Big Data Seminar* • Cognitive Computing* Practicum Projects with industry * Electives - Choose 2 out of 8 Social Skills Disciplinary Knowledge Technical Skills • Written & Oral Skills Workshops • Team Skills • Job Skills Workshops • Industry speakers • Industry-mentored projects • SQL, SAS, R, Python Hadoop • Software “Boot” Camps • Course Projects • Industry Projects Curriculum Practicum MOOCs Infrastructure Laboratory Facilities • Hadoop, SAS, DB2, Cloudera • Trading Platforms: Bloomberg • Data Sets: Thomson-Reuters, Custom PROGRAM ARCHITECTURE Demographics 2013F 2014F 2015F 2016F 2017F Applications 101 157 351 591 725 Accepted 48 84 124 287 364 Rejected 34 34 186 257 307 In system/other 19 39 41 46 53 Admissions Full-time/Part-time Full-time 201 Part-time 21 Gender Female 41% Male 59% Placement Starting Salaries (without signing bonus): $65 - 140K Range $84K Average $90K (finance and consulting) Data Scientists 23%: Data Analysts: 30% Business Analysts: 47% Our students have accepted jobs at for example: Apple, Bank of America, Blackrock, Cable Vision, Dun & Bradstreet, Ernst & Young, Genesis Research, Jeffreys, Leapset, PriceWaterhouse, Morgan Stanley, New York Times, Nomura, PriceWaterhouse Coopers, RunAds, TIAA- CREF, Verizon Wireless Hanlon Lab -- Hadoop for Professionals The Masters of Science in Business Intelligence and Analytics (BI&A) is a 36-credit STEM program designed for individuals who are interested in applying analytical techniques to derive insights and predictive intelligence from vast quantities of data. The first of its kind in the tri-state area, the program has grown rapidly. We now have approximately 222 master of science students and another 79 students taking 4-course graduate certificates. The program has increased rapidly in quality as well as size. The average test scores of our student body is top 75 percentile. We are ranked #7 among business analytics programs in the U.S. by The Financial Engineer. STATISTICS PROGRAM PHILOSOPHY/OBJECTIVES • Develop a nurturing culture • Race with the MOOCs • Develop innovative pedagogy • Migrate learning upstream in the learning value chain • Continuously improve the curriculum • Use analytics competitions • Improve placement • Partner with industry
  • 8. Google Online Marketing Challenge 2017 True Mentors AdWords Campaign Team: Philippe Donaus, Rush Kirubi, Salvi Srivastava, Thushara Elizabeth Tom, Archana Vasanthan Instructor: Theano Lianidou Business Intelligence & Analytics November 28, 2017 1 Motivation • Google Online Marketing Challenge is an unique opportunity for students to build online Marketing Campaigns on Google AdWords for a business or a non-profit. $250 budget is provided by Google to run these campaigns live for 3 weeks • We worked with TRUE Mentors, a non-profit based in Hoboken,NJ. Built a marketing strategy on Google AdWords to achieve goals like Creating Brand Awareness, promoting Fundraising Events and Volunteer opportunities and Donations. • Technologies Used: Google AdWords, Google Analytics, Google Search Console, Facebook Insights Design of Campaigns • Conducted Market Analysis-Competitors, current market position and platforms used, USP • Analyzed existing data available on Google Analytics, Google Search console, Facebook Insights and established marketing goals • Designed campaigns on Search and Display Ads Performance Results • 23 Ad groups with 206 Ads with 700 keywords were used in total • Text Ads appear for people’s search terms across Google Search and Search partner sites. • Display Ads appear on relevant pages across the display network • The team finished as a “Finalist” in the Social Impact Award category. • The team finished as a “Semi-Finalist” in the Business Award category. • Ranked among the Top-10 teams in the Social Impact Award category. • Ranked among the Top-15 teams in the Business Award Category. • Ranked among the Top-5 teams in the Americas region. • The results can be found at: https://www.google.com/onlinechallenge/past/winners-2017.html • Team ID: 234-571-4266 Targeting and Bidding Target Goals (set before running the campaigns) End Results • Campaigns ran from 24th April 2017-14th May 2017 • KPI’s were monitored and optimized continuously over the 3 weeks using insights drawn from various AdWords Reports, Search Term reports, Google Analytics and Keyword reviews. CAMPAIGN LEVEL Targeting/Bidding TM_Brand TM_Events TM_Donations TM_Volunteers TM_DisplayCampaign Location Hudson County-NJ, New York County-NY Hudson County-NJ Hudson County-NJ Hudson County-NJ Hudson County-NJ, New York County-NY Bidding Strategy Manual CPC Manual CPC Manual CPC Manual CPC Manual CPC Daily Budget? Yes Yes Yes Yes Yes ADGROUP LEVEL Max CPC? Set for all adgroups Set for all adgroups Set for all adgroups Set for all adgroups Set for all adgroups Demographic No No No Yes - Male and Female were targetted seperately No KEYWORD LEVEL Max CPC? Yes Yes Yes Yes No Topics No No No No Yes - Charity & Philanthropy and Fast Foods
  • 9. Improving a Non-Profit’s Home Page
Team: Rush Kirubi, Thushara Elizabeth Tom. Instructor: Chihoon Lee. Business Intelligence & Analytics, November 23, 2017 (Poster 2)
Motivation
• Goal: optimize TRUE Mentors’ homepage to reduce the drop-off rate.
• In turn, this improves the quality score of the AdWords bids, leading to more ad exposures at the same or lower expense.
Experiment Design
• Methodology: full factorial design with blocking.
• Factors & levels: donate button color, slider present/absent, testimonial type. Response: drop-off rate.
Data
• Time (the blocking factor) was confounded with the three-factor interaction ABC, which was therefore assumed to be negligible.
Result
• No factor significantly stood out. The results of the experiment are shown in the Effect Test and Normal Plot figures.
Conclusion
• The best setting is the purple donate button with no slider, relative to the other settings.
• However, this effect is not statistically significant at the 5% level.
• Since none of the factors are significant, we selected the settings that minimize page loading time: no slider, testimonial with text, purple donate button.
Limitations
• We did not have enough time for replication because of the competition deadlines.
• The blocking variable was difficult to accommodate: we manually recorded the values at certain times of the day and night.
Summary: Participating in Google’s Online Marketing Challenge, we selected a nonprofit to run a digital marketing campaign. Part of our effort involved optimizing the organization’s home page to boost user engagement as measured by drop-off rates. We set up a full factorial experiment (2^3) with time of day as the block variable. Put simply, we tested the donate button color, the presence of a slider, and the type of testimonial (one predominantly text, the other simply a photo with a caption). All three factors were blocked on the time of day (daytime or nighttime). Empirically, the best setting was no slider with a purple donate button; however, the effects were not strong enough to pass a statistical inference test. We kept the no-slider option anyway, since it reduces page loading time and the slider’s presence does not improve drop-off rates.
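The poster’s effect estimates were produced with design-of-experiments software; the short Python sketch below shows the same kind of analysis under stated assumptions. The drop-off responses and the run order are placeholders, not the campaign’s real measurements.

```python
# Sketch: estimating effects for a 2^3 full factorial with a day/night block.
# The drop-off responses below are illustrative placeholders, not the real data.
import pandas as pd
import statsmodels.formula.api as smf

runs = pd.DataFrame({
    "button":      [-1,  1, -1,  1, -1,  1, -1,  1],   # -1 = original, +1 = purple
    "slider":      [-1, -1,  1,  1, -1, -1,  1,  1],   # -1 = no slider, +1 = slider
    "testimonial": [-1, -1, -1, -1,  1,  1,  1,  1],   # -1 = text, +1 = photo
    "block":       ["day", "night", "night", "day",
                    "night", "day", "day", "night"],   # confounded with ABC
    "drop_off":    [52.1, 47.3, 55.0, 49.8, 51.2, 46.9, 54.1, 50.5],  # placeholder %
})

# Main effects and two-factor interactions; the ABC interaction is dropped
# because it is confounded with the block, as described on the poster.
model = smf.ols(
    "drop_off ~ button * slider * testimonial - button:slider:testimonial + C(block)",
    data=runs,
).fit()

# With a single replicate the model is saturated, so rather than t-tests the
# coefficients (effect = 2 * coefficient with +/-1 coding) would be judged
# with a normal probability plot, as on the poster.
print(model.params)
```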
  • 10. Analyzing the Impact of Earthquakes
Team: Fulin Jiang, Ruhui Luan, Zhen Ma, Gordon Oxley. Instructor: Prof. Alkis Vazacopoulos. Business Intelligence & Analytics (Poster 3)
Motivation
• Earthquakes are among the most destructive natural forces in the world, ravaging entire cities with little notice.
• We wanted to analyze and visualize patterns in how earthquakes have historically struck and damaged specific locations, in order to highlight high-risk and under-prepared areas of the world.
• Earthquake features used include magnitude, source, focal depth, date, and type (nuclear or tectonic activity).
• We also measured each earthquake’s damaging effects using damage in US dollars, deaths, numbers of houses damaged and destroyed, and injuries.
Technology Utilized
• Tableau was used for our analysis of earthquakes.
• We found Tableau especially useful when visualizing the latitude and longitude data, clearly identifying trends in the way earthquakes affect certain parts of the world.
• With the creation of 8 dashboards, we were able to analyze and visualize many different features of earthquakes, including depth, source, and more.
Findings (dashboards: Casualties due to Earthquakes; Financial Cost of Earthquakes; Earthquake Map of the World)
• Based on the analysis of the number of deaths due to earthquakes, it is clear that a majority of high-casualty events happen in coastal regions, many on the Indian subcontinent.
• We see a peak in deaths in 2010 due to the unfortunate number of casualties in Haiti, underlining the fact that certain underdeveloped regions suffer increased casualties.
• We observe the same trend in high-cost areas, although the largest-cost event occurred in Japan, where the 2011 tsunami caused massive damage.
Source: National Earthquake Information Center
  • 11. Real Time Health Monitoring System
Team: Ankit Kumar, Khushali Dave, Shruti Agarwal, Nirmala Seshadri. Instructor: Prof. David Belanger. Business Intelligence & Analytics (Poster 4)
Problem Statement
To create a user-specific, real-time health monitoring system using sensors from a smart watch and/or an Arduino device. The application should monitor health features such as heart rate, step count and body temperature in real time, and should warn the user or emergency services of any undesired or serious condition.
Architectural Approach
The architecture consists of the following steps:
1. Data is generated and stored in a file.
2. Data is streamed using Apache Kafka.
3. A real-time data visualization is set up using a visualization tool.
Tools Used
1. JSON file parsing for initial data analysis
2. Apache Kafka for real-time streaming
3. Arduino programming for pulling temperature data in real time
4. Python for data cleaning
5. Tableau for visualization
Variables
1. Heart rate: through the smart watch sensor
2. Step count: through the smart watch
3. Body temperature: through the Arduino
Trigger Cases
1. Fever: high temperature, low heart rate and step count
2. Long-term unconsciousness: low heart rate, body temperature and step count
3. Heart attack: sharply elevated heart rate in a very short time interval
Results
The last part of the project is to visualize results in real time, plotting the data streams and raising a trigger alert if any abnormal use case occurs.
Business Impact
Our health data analysis application can pull data in real time from device sensors. Using this system, authorities, friends and relatives can easily monitor the health of a loved one and respond immediately in case of an emergency. From a business perspective, this can attract customers suffering from a medical condition.
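As a rough illustration of the streaming step, the sketch below pushes simulated sensor readings into a Kafka topic with the kafka-python client. The broker address, topic name, trigger thresholds and the random readings are all assumptions for the example, not the project’s actual configuration, and a Kafka broker must already be running for it to execute.

```python
# Sketch: streaming simulated health readings into Kafka (kafka-python client).
# Broker address, topic name and trigger thresholds are illustrative assumptions.
import json
import random
import time

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def check_triggers(reading):
    """Very simplified versions of the poster's trigger cases."""
    if reading["temperature"] > 38.5 and reading["heart_rate"] < 60:
        return "possible fever"
    if reading["heart_rate"] > 140:
        return "possible cardiac event"
    return None

for _ in range(10):                     # in the real system this loop never ends
    reading = {
        "heart_rate": random.randint(50, 150),                 # smart watch sensor
        "steps": random.randint(0, 30),                        # steps in the last minute
        "temperature": round(random.uniform(36.0, 39.5), 1),   # Arduino sensor
        "timestamp": time.time(),
    }
    reading["alert"] = check_triggers(reading)
    producer.send("health-monitor", reading)                   # consumed by the dashboard
    time.sleep(1)

producer.flush()
```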
  • 12. Zillow’s Home Value Prediction
Team: Yilin Wu, Zhihao Chen, Jiaxing Li, Zhenzhen Liu, Ziyun Song. Instructor: Prof. Alkiviadis Vazacopoulos. Business Intelligence & Analytics (Poster 5)
Motivation
• The Zillow Prize challenges the data science community to push the accuracy of the Zestimate even further (improving the median margin of error).
• Our task in this competition is to develop an algorithm that predicts the log error for the months in Fall 2017.
Technology
• Python, Watson and Tableau for exploratory data analysis (EDA).
• Python for data preprocessing and feature engineering.
• Python to build the model.
Competition Process
EDA, then data preprocessing and feature engineering; train a model on 2016 data and test the predicted log error; improve the feature engineering; choose the best model and tune its parameters; train on both 2016 and 2017 data; several further improvements; final submission.
Feature Engineering
Feature engineering was the most important part of this competition. It is essential to measure feature importance so that valuable features are kept and useless ones dropped. We also created new features that might help the machine learning algorithms work better.
Modeling
We compared several gradient boosting implementations (XGBoost, LightGBM and CatBoost) and obtained our best results with CatBoost. CatBoost builds oblivious decision trees and uses a boosting scheme designed to reduce the bias of the residuals, which helps prevent overfitting. It also uses its own scheme for calculating leaf values and supports several statistics-based options for encoding categorical features. In general, CatBoost is presented as an algorithm that can handle categorical features without preprocessing, is resistant to overfitting, and can be used without spending much effort on hyperparameter selection, while often being more accurate. Training, however, is still slow: on average 7-8 times longer than LightGBM and 2-3 times longer than XGBoost. Before training, the categorical features must be declared; in our case there are 26 categorical features, which are passed to CatBoost through its Pool object. We then adjusted the parameters to get a better (though not the best possible) result.
Conclusion
The final submission placed in the top 11%, just short of the bronze medal cutoff at the top 10%. Future work is to experiment further with the categorical features and keep tuning the model.
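A minimal sketch of the CatBoost step described above follows. The synthetic feature table, the categorical column index and the hyperparameter values are placeholders, since the team’s exact settings and data are not shown on the poster; only the general pattern (declare categorical features via a Pool, train a regressor) reflects the text.

```python
# Sketch: training a CatBoost regressor on a stand-in for the Zillow log-error target.
# The synthetic data and hyperparameters below are placeholders, not the team's values.
import numpy as np
import pandas as pd
from catboost import CatBoostRegressor, Pool

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "region": rng.choice(["LA", "OC", "Ventura"], size=500),   # categorical feature
    "build_year": rng.integers(1950, 2016, size=500),
    "finished_sqft": rng.normal(1800, 400, size=500),
    "tax_value": rng.normal(400_000, 90_000, size=500),
})
y = rng.normal(0.0, 0.16, size=500)                            # fake log-error target

cat_feature_idx = [0]                                          # "region" is categorical
train_pool = Pool(df.iloc[:400], y[:400], cat_features=cat_feature_idx)
valid_pool = Pool(df.iloc[400:], y[400:], cat_features=cat_feature_idx)

model = CatBoostRegressor(
    iterations=300, learning_rate=0.03, depth=6,
    loss_function="MAE",        # the competition scores mean absolute error of log error
    random_seed=42, verbose=0,
)
model.fit(train_pool, eval_set=valid_pool, use_best_model=True)
print(model.predict(df.iloc[400:])[:5])
```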
  • 13. Analysis of Opioid Prescriptions and Deaths
Team: Pranjal Gandhi, Nishant Bhushan, Sunoj Karunanithi, Raunaq Thind. Instructor: Alkis Vazacopoulos. Business Intelligence & Analytics (Poster 6)
Objective & Motivation
The objective is to find the correlation between prescriptions of drugs containing opioids and drug-related deaths in the USA. What are opioids? Opioids are a class of drugs that include the illicit drug heroin as well as licit prescription pain relievers such as oxycodone, hydrocodone, codeine, morphine, fentanyl and others.
Tools Used
• Tableau for visualizations.
• R and Excel for cleaning the data and exploratory analysis.
• The dataset is a subset of data sourced from cms.gov and contains prescription summaries of 250 common opioid and non-opioid drugs written by medical professionals in 2014.
Facts & Figures
• The number of deaths due to drug overdoses exceeds deaths from car accidents by a staggering 11,102, according to a DEA report.
• In 2014, 4.3 million people aged 12 or older were using opioid-based painkillers without prescriptions.
• This led to substance abuse among almost 50% of those consumers.
• 94% of respondents in a 2014 survey of people in treatment for opioid addiction said they chose to use heroin because prescription opioids were “far more expensive and harder to obtain.”
Analysis, Results and Conclusion
• We found that opioid prescriptions were especially high for prescribers in the following specialties: female nurse practitioners, female physician assistants, female and male family practices, female and male internal medicine, and male dentists.
• The top 5 states with the highest percentage of deaths due to overdoses are California, Ohio, Philadelphia, Florida and Texas. All of them had significantly high prescriptions of hydrocodone-acetaminophen, followed by oxycodone-acetaminophen.
  • 14. UK Traffic Flow & Accident Analysis Team: Xiaoming Guo, Weiyi Chen, Dingmeng Duan, Jian Xu, Jiahui Tang Instructor: Alkis Vazacopoulos Business Intelligence & Analytics Technology • Python for integrating data for analysis. •Tableau for data visualization and extracting data insights. Current & Future Work •Generating different plots from data and discovering relationships between variables. •Plan to find relationships between traffic flow and accidents. Motivation • Visualization of a Dataset of 1.6 million accidents and 16 years of traffic flow. 7
  • 15. US Permanent Visa Application Visualization
Team: Jing Li, Qidi Ying, Runtian Song, Jianjie Gao, Chang Lu. Instructor: Alkiviadis Vazacopoulos. Business Intelligence & Analytics (Poster 8)
Introduction
Develop a descriptive analysis in Tableau based on US permanent visa application data from 2012 to 2017 and provide insights into visa decisions. Data: 374,363 applicants from 203 countries across 22 occupations.
Descriptive Analysis
• Applications by state and applications by country.
• Employer & economic sector: top 10 companies that submit permanent visa applications.
Education & Occupation
• Certified and denied rates across all education levels: high-school degrees show a significantly higher denial rate, while doctorate degrees have the lowest denial rate.
• Applicants with master’s and bachelor’s degrees mostly work in computer and mathematical fields, the occupations with the highest certification rates, while high-school applicants mostly work in production occupations, which have certification rates lower than 5%.
Nationality & Occupation
• Applicants in different occupations have different nationalities. Taking the computer and construction occupation maps as examples, certified applicants in the computer domain mainly come from India and China, while construction applicants mainly come from Mexico.
Conclusion
Our team used different attributes to analyze, directly and indirectly, the relationships between visa applications and certification rates. Application decisions are correlated with many factors, such as education level, income and occupation. In conclusion, applicants with higher education (bachelor’s, master’s, doctorate) mostly work in computer and mathematical areas, which pay more and are more likely to be certified. Applicants from countries that dominate an occupation have higher certification rates when applying for related jobs. We also found that the certification rate increased over time from 2012 to 2017.
  • 16. Climate Change Since 1770
Team: Yilin Wu, Ziyun Song, Zhihao Chen, Jiaxing Li, Zhenzhen Liu. Instructor: Alkis Vazacopoulos. Business Intelligence & Analytics (Poster 9)
Motivation
• Some say climate change is the biggest threat of our age, while others say it is a myth based on dodgy science.
• Personally, we feel the climate change problem has become much more severe in recent years; global warming is, to some extent, responsible for the recent disastrous hurricanes.
• So we set out to build descriptive data visualizations to see how the climate has changed since 1770 and to analyze the trends.
Technology
• Excel for data cleaning and filtering.
• Tableau for the data visualizations and interactive graphs.
• Tableau and Watson to build some regression models and conduct the analysis.
Current & Future Work
• Illustrate the world’s climate change trend starting from the 18th century in a line chart (“Is worldwide mercury really rising?”).
• Specify the trend of climate change for each country and its average temperature over the whole period (“Explore average temperature by country”).
• Extract data from the original file to show how much each country’s temperature has increased and compare these percentage changes with one another.
• Customize the period to show climate change trends over a chosen window of time (“Insights into customized periods”; “Recent 100 years climate change”).
• Try to obtain more data sources and dig deeper to find the factors driving climate change.
  • 17. Predicting Customer Conversion Rate for an Insurance Company
Team: Yalan Wang, Cong Shen, Juanyuan Zheng, Yang Yang. Instructors: Alkis Vazacopoulos and Feng Mai. Business Intelligence & Analytics (Poster 10)
Motivation
• Use a dataset of customer contact records to predict which customers are likely to purchase insurance.
• Help the insurance company understand the characteristics of the customers who decide to purchase its products.
Technology
• Used Python to analyze imbalanced customer data from an insurance company.
• Applied the Synthetic Minority Over-sampling Technique (SMOTE) to balance the data.
• Built predictive models (Logistic Regression, Random Forest and XGBoost) to predict the conversion rate.
Data Summary
• Raw data: 1,892,888 records and 50 variables (raw dataset shape (1892888, 50)). Feature types: 5 int64 columns, 12 float64 columns and 33 object columns. Missing values: 42 columns contain NAs.
• Target label: 0 = contacted without purchase, 1 = contacted with purchase.
• Data cleaning: converted data formats for training; used the correlation matrix to eliminate features that were highly correlated with each other but irrelevant to the target label; applied the SMOTE algorithm to balance the dataset. Processed dataset shape: (1885774, 135).
• Imbalanced data: a dataset is imbalanced if the classes are not approximately equally represented.
• SMOTE example: consider a sample (6, 4) and let (4, 3) be one of its k-nearest neighbors. The component differences are (4 - 6, 3 - 4) = (-2, -1), so a new synthetic sample is generated as (f1', f2') = (6, 4) + rand(0-1) * (-2, -1), where rand(0-1) is a random number between 0 and 1.
Learning Models
• Logistic Regression: chosen because it is a standard benchmark against which other algorithms are compared.
• Random Forest Classifier: an ensemble of decision trees.
• XGBoost: short for “Extreme Gradient Boosting”, a tree ensemble model that sums the predictions of a set of classification and regression trees (CART).
• Pipeline: the processed data is split into a 75% training set and a 25% testing set; the model is fitted on the training set and predictions are evaluated on the test set.
Results
After training, we obtained the following accuracies: Logistic Regression (LGR) 83.6%, Random Forest (RFR) 94.6%, XGBoost (XGB) 79.8%.
Feature Importance
From the Random Forest we extracted the top 50 features that play a significant role in the model, including 'RQ_Flag', 'Original_Channel_Broker', 'First_Contact_Date_month', 'First_Contact_Time_Hour', 'PDL_Special_Coverage', 'RQ_Date_month', 'Inception-First_Contact', 'Original_Channel_Internet', 'PPA_Coverage', 'Inception_Date_month', 'Mileage', 'Region_(03)関東', 'Original_Channel_Phone', 'License_Color_(02) Blue', 'Previous_Insurer_Category_(02)
Conclusion & Future Work
• Comparing the accuracy of the three models, we chose the Random Forest as our final model.
• Based on the feature importances, we can dig into these features for business insights and suggest which customer characteristics drive insurance purchase decisions.
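The balancing-plus-modeling pipeline can be sketched as follows with scikit-learn and imbalanced-learn. The synthetic data and class ratio stand in for the proprietary insurance dataset, and the hyperparameters are illustrative rather than the team’s actual settings.

```python
# Sketch: SMOTE oversampling followed by a Random Forest, as in the pipeline above.
# The synthetic data below stands in for the proprietary insurance dataset.
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Imbalanced toy data: roughly 5% of contacts end in a purchase (placeholder ratio).
X, y = make_classification(n_samples=20_000, n_features=20, weights=[0.95, 0.05],
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    stratify=y, random_state=42)

# Oversample only the training split so the test set keeps the real class ratio.
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)

clf = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=42)
clf.fit(X_res, y_res)
print(classification_report(y_test, clf.predict(X_test)))

# Feature importances (used on the poster to pick the top features).
print(sorted(clf.feature_importances_, reverse=True)[:5])
```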
  • 18. Determining Attractive Laptop Features for College Students Team: Liwei Cao, Gordon Oxley, Haoyue Yu, Salman Sigari Instructor: Chihoon Lee Business Intelligence & Analytics November 28, 2017 11 Experiment Design Stage 1: Plackett-Burman Design Objective: Identify the most important factors early in the experimentation Factors & Levels: Stage 2: Fractional Factorial Design Objective: study the effects and interactions that several factors have on the response. Factors & Levels: Blocks: Conclusion • Price and operating system play really important roles when it comes to laptop purchase • To maximize Probability of Purchasing, price at plus level (<750) and operating system at minus level (Windows) would be chosen. Our maximum predicted probability of purchasing would be Probability of Purchasing = 51.77 + (11.36/2)*(1) - (7.23/2)*(1)*(-1) = 61.065 Data Collection We handed out slips to Stevens students randomly and recorded their responses Stage 1 (32 observations): Stage 2 (64 observations): ● Effect Test ● Pareto Plot ● Normal Plot Motivation • Laptops have become a staple in our lives as we use them for work, entertainment, and other daily activities • From a marketing perspective, it is critical to find the factors that interest consumers in order to produce and sell a laptop that will be successful . • Survey conducted by Pearson says 66% of undergraduates use their laptop every day in college • We wanted to find what drives laptop demand among college students Result Stage 1 Stage 2 Probability of Purchasing = 51.77 + (11.36/2)*Price - (7.23/2)*Price*Operating system =51.77 + (5.68)*Price - (3.615)*Price*Operating system Response: Probability of Purchasing (0 – 100%)
  • 19. Clustering Large Cap Stocks During Different Phases of the Economic Cycle
Students: Nikhil Lohiya, Raj Mehta. Instructor: Amir H. Gandomi. Business Intelligence & Analytics (Poster 12)
Introduction
• Objective: provide sets of securities that behave similarly during a particular phase of the economic cycle. For this project, the creation of sub-asset classes is done only for large-cap stocks.
• Background: over time, developed economies such as the US have become more volatile, and hence the underlying risk of securities has risen. This project aims to identify the risks and potential returns associated with different securities and to cluster similar stocks according to their Sharpe ratio, volatility and average return, for a better analysis of the portfolio.
Project Flow
• Data acquisition: data on large-cap stocks and US Treasury bonds is gathered directly using APIs. The data covers two time frames, i.e., a recessionary and an expansionary economy.
• Data preprocessing: application of the formulae below to calculate the required parameters (Eq. 1-4).
• Analysis: k-means clustering of the 500 large-cap stocks (k = 22). The clustered securities are then further tested for correlation among the sub-asset classes.
• Results: the results of the k-means clustering vary in the range 9 to 45, and there were some outliers in our analysis as well.
Mathematical Modelling
• Daily return for each of the 500 securities: $R = \frac{P_c - P_o}{P_o} \times 100$ (Eq. 1)
• Average return: $\mu_j = \frac{1}{n}\sum_{i=1}^{n} R_i$ (Eq. 2)
• Volatility: $\sigma_j = \sqrt{\frac{\sum_{i=1}^{n}(R_i - \mu_j)^2}{n-1}}$ (Eq. 3)
• Sharpe ratio: $SR_j = \frac{R_j - R_f}{\sigma_j}$ (Eq. 4)
• A correlation matrix between the clustered securities is computed after the clusters are formed.
Results (figures: Clustering of Stocks during Recovery phase; Clustering of Stocks during Recession phase)
The k-means plots show the stocks clustered by similarity in their Sharpe ratio, volatility and average return. There are 9 graphs in total, of which 2 are displayed above for the expansion and recession phases. The x-axis shows the S&P 500 ticker/symbol and the y-axis shows the cluster number; hovering on a dot shows the ticker along with its cluster number and the variables used for clustering. We used the silhouette score and visually inspected the data points to find the optimal value of k, which turns out to be 22.
Conclusion & Future Scope
• With the above methodology, we have been able to develop a set of classes that behave in a similar fashion during each phase of the economic cycle.
• The same methodology can be extended to different asset classes available online.
• Applying neural networks could significantly reduce the error in cluster formation.
• Different parameters such as valuation, solvency or growth-potential factors could also be included for clustering purposes.
• Next, we plan to add leading economic indicator data to identify the economic trend and perform the relevant analysis.
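A compact Python sketch of the clustering step follows. The price series are simulated stand-ins for the S&P 500 data pulled via APIs, and the risk-free rate is an assumption; only k = 22 and the three clustering features (Eq. 1-4) come from the poster.

```python
# Sketch: cluster stocks on (average return, volatility, Sharpe ratio) with k-means.
# Prices here are simulated; in the project they come from an API for S&P 500 tickers.
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
tickers = [f"STK{i:03d}" for i in range(500)]
prices = pd.DataFrame(
    100 * np.exp(np.cumsum(rng.normal(0.0003, 0.02, size=(252, 500)), axis=0)),
    columns=tickers,
)

returns = prices.pct_change().dropna() * 100        # daily % returns (Eq. 1)
risk_free = 0.01                                    # daily risk-free rate, assumed

features = pd.DataFrame({
    "avg_return": returns.mean(),                   # Eq. 2
    "volatility": returns.std(ddof=1),              # Eq. 3
})
features["sharpe"] = (features["avg_return"] - risk_free) / features["volatility"]  # Eq. 4

X = StandardScaler().fit_transform(features)
kmeans = KMeans(n_clusters=22, n_init=10, random_state=0).fit(X)
features["cluster"] = kmeans.labels_
print("silhouette:", round(silhouette_score(X, kmeans.labels_), 3))
```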
  • 20. Predicting Interest Level in Apartment Listings on RentHop
Statistical Learning & Analytics, Spring 2017 (Poster 13)
Introduction
• Predict how popular an apartment rental listing is based on the listing content.
• Help RentHop better identify listing quality and renters’ preferences.
• Test several machine learning algorithms and evaluate them to choose the best one.
Exploratory Data Analysis
• Data source: Kaggle.com; 15 attributes with 49,352 training samples.
• Target variable: interest level (high, medium, low).
Data Preprocessing
• Transformed categorical data into numerical data.
• Obtained zip code from latitude & longitude.
• “One-hot encoding”: created dummy variables.
• Extracted the top apartment features from the list of apartment features.
Feature Importance
• Built an ExtraTrees classifier and computed feature importances. Top features: price, building_index, manager_index, zip_id, number of photos.
Modeling
• Ensemble methods: Random Forest, Bagging, AdaBoost, XGBoost.
• Logistic Regression, KNN, Naïve Bayes, Decision Tree.
Model Evaluation
• Multi-class logarithmic loss: $\text{logloss} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M} y_{ij}\log(p_{ij})$
• For imbalanced data, use the precision-recall curve and average precision score.
• Best classifier: Random Forest, with the lowest log loss (0.6072) and the highest average precision score (0.8374). Precision-recall curves were plotted for each class.
Conclusion
• Built 8 models to predict the interest-level probabilities; the best performer is the Random Forest, with the lowest log loss (0.6072) and the highest average precision score (0.8374).
• For imbalanced data, the precision-recall curve is an appropriate evaluation metric, whereas the plain accuracy score is not.
• Future work: hyperparameter tuning for better performance and the SMOTE technique to balance the dataset.
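As a small illustration of the evaluation metric and the winning model type, the sketch below fits a Random Forest and scores it with scikit-learn’s multi-class log loss; the synthetic three-class data and its class proportions are stand-ins for the RentHop listings.

```python
# Sketch: multi-class log loss for a three-class interest-level problem.
# Synthetic data stands in for the RentHop listing features.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score, log_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10_000, n_features=15, n_informative=8,
                           n_classes=3, weights=[0.70, 0.22, 0.08], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, stratify=y,
                                          random_state=0)

clf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)          # one probability column per class

print("log loss:", round(log_loss(y_te, proba), 4))
# Average precision for the rare "high interest" class (one-vs-rest view).
print("AP (class 2):", round(average_precision_score(y_te == 2, proba[:, 2]), 4))
```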
  • 21. Deep Learning vs Traditional Machine Learning Performance on an NLP Problem
Abhinav S Panwar. Instructor: Christopher Asakiewicz. Business Intelligence & Analytics (Poster 14)
Introduction
• When you need to tackle an NLP task, the sheer number of algorithms available can be overwhelming. Task-specific packages and generic libraries all claim to offer the best solution, and it can be hard to decide which one to use.
• This project compares the performance of traditional machine learning algorithms such as Support Vector Machines, Logistic Regression and Naïve Bayes with FastText, the neural-network-based text classification library released by Facebook’s AI lab.
• The challenge is to predict the movement of the Dow Jones Industrial Average (DJIA) based on the top 25 news headlines in the media. This is a binary classification task, with ‘1’ indicating the DJIA rose or stayed the same and ‘0’ indicating it decreased.
Data Description
• News data: historical news headlines obtained from the Reddit WorldNews channel, ranked by Reddit users’ votes; only the top 25 headlines are considered for a single date.
• Stock data: DJIA adjusted close value for the period 8 August 2008 to 1 July 2016.
Data Pre-Processing
• Total number of samples: 1,989.
• Text cleaning: converting headlines to lowercase, splitting each sentence into a list of words, removing punctuation and meaningless words.
• The dataset was divided into training and testing sets (80:20 split): training used data from 08-08-2008 to 12-31-2014, testing used data from the following two years.
Learning Algorithms
• Traditional machine learning: Support Vector Machine, Logistic Regression, Naïve Bayes, Random Forest.
• Deep learning: FastText, a neural-network-based library released by Facebook for efficient learning of word representations and sentence classification.
Modeling
• Traditional machine learning: bag of words (up to 2-grams) for feature generation, L1 regularization for feature selection, and hyperparameter values searched through cross-validation.
• FastText: data is fed directly into the FastText pipeline without explicit feature generation; the number of epochs and the learning rate are tuned for best performance.
Results
• Precision, accuracy and time taken are compared for all learning methods:

                       Precision (%)       Accuracy (%)       Time (s)
                       Unigram  Bigram     Unigram  Bigram    Unigram  Bigram
  Logistic Regression  68.7     69.3       71.2     71.8      53       57
  LSVC                 70.9     71.2       72.1     72.3      61       63
  Naïve Bayes          67.4     68.1       67.6     68.4      52       53
  Random Forest        53.2     54.9       55.8     57.1      69       72
  FastText             68.9     70.2       70.9     71.7      10       11

• Analyzing the word clouds for both classes (+ve and -ve), we can see that politically sensitive words dominate in both.
• Since the dataset contains only about 2,000 samples, the final model features are not distinct enough, resulting in only average model fit.
• Among all the algorithms, FastText gives nearly the best performance while taking far less time than the others.
Conclusion
• If you are working with large datasets and speed and low memory usage are crucial, FastText looks like the best option.
• Based on published research, FastText’s performance is comparable to other deep neural network architectures, and sometimes even better.
• Running complex neural network architectures requires GPUs for good performance, but comparable performance can be obtained with FastText running on general-purpose CPUs.
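The FastText side of the comparison can be reproduced in a few lines, as sketched below. The file names and hyperparameter values are illustrative assumptions; the input files must use fastText’s `__label__` format, one labeled headline set per line.

```python
# Sketch: supervised fastText classifier for the DJIA up/down labels.
# File names and hyperparameters are illustrative; input files use the
# "__label__1 <text>" / "__label__0 <text>" format, one example per line.
import fasttext

model = fasttext.train_supervised(
    input="djia_train.txt",   # assumed path to the training file
    epoch=25,
    lr=0.5,
    wordNgrams=2,             # include bigrams, as in the results table above
)

n, precision, recall = model.test("djia_test.txt")   # assumed path to the test file
print(f"samples={n}  precision={precision:.3f}  recall={recall:.3f}")

label, prob = model.predict("stocks rally as oil prices stabilize")
print(label, prob)
```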
  • 22. Wine Recognition
Team: Biying Feng, Ting Lei, Jin Xing. Instructor: Amir Gandomi. Multivariate Data Analysis, Fall 2017 (Poster 15)
Goals
• Select the most influential variables for wine classification.
• Develop classifiers to classify a new observation.
• Find the best model for wine classification.
• Find the correlation between each pair of variables.
Data Information
• Problem type: classification. Outcome: wine type.
• Variables: (1) Alcohol, (2) Malic acid, (3) Ash, (4) Alcalinity of ash, (5) Magnesium, (6) Total phenols, (7) Flavanoids, (8) Nonflavanoid phenols, (9) Proanthocyanins, (10) Color intensity, (11) Hue, (12) OD280/OD315 of diluted wines, (13) Proline.
• Data source: http://archive.ics.uci.edu/ml/datasets/Wine
Data Preparation
1. Type of variables 2. Summary of variables 3. Correlations 4. Variable importance 5. Scatter plots 6. Variable dependencies 7. Standardizing variables
Classification Models
1. K Nearest Neighbor (KNN) 2. Linear Discriminant Analysis (LDA) 3. Recursive Partitioning
Results and Discussion
• Model 1, KNN: evaluated with cross-validation; the cross-validation error rate is 0.3124 and the error rate of the KNN model is 0.3483.
• Model 2, LDA: evaluated with cross-validation; the cross-validation error rate is 0.0168.
• Model 3, Recursive Partitioning: separate cross-validation is not needed because Recursive Partitioning uses cross-validation internally, so we can trust the error rates implied by the table. Here, the error rate is 0.1685.
Conclusion
• KNN has the worst accuracy of the three models, with an error rate of about 0.31.
• The LDA and Recursive Partitioning models achieve much lower error rates (0.0168 and 0.1685 respectively, as reported above).
• However, the LDA model appears to overestimate its performance and is therefore not good enough.
• The Recursive Partitioning model uses cross-validation internally to build its decision rules, which makes it more reliable.
• The Recursive Partitioning model is therefore selected as the best model for wine classification.
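The poster’s analysis was done in R; an equivalent Python/scikit-learn sketch on the same UCI wine data is shown below. The number of CV folds and the model settings are illustrative choices, not the poster’s exact configuration.

```python
# Sketch: comparing the three classifiers on the UCI wine data with scikit-learn.
# (The poster used R; this is an equivalent Python version with illustrative settings.)
from sklearn.datasets import load_wine
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)

models = {
    # Standardizing matters for KNN, which is distance based.
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "LDA": LinearDiscriminantAnalysis(),
    "Decision tree": DecisionTreeClassifier(random_state=0),
}

for name, model in models.items():
    acc = cross_val_score(model, X, y, cv=10, scoring="accuracy")
    print(f"{name:14s} CV error rate = {1 - acc.mean():.4f}")
```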
  • 23. The Public’s Opinion of Obamacare: Tweet Analyses
Saeed Vasebi. Instructor: Chris Asakiewicz (Poster 20)
Introduction
• The Patient Protection and Affordable Care Act (Obamacare) provides health insurance services for US citizens.
• The act has been highly debated by the Democratic and Republican parties.
• The main beneficiaries of the act are the people who use and pay for it.
• This study tries to find out what people think about Obamacare and about Trumpcare, its potential substitute.
Modeling and Data
• The Watson Analytics social media tool is used to gather data from Twitter based on #Obamacare and #Trumpcare; IBM Bluemix sentiment analysis is used for detailed evaluation of the tweets.
• Obamacare tweets were gathered for July-September 2016 and 2017; Trumpcare tweets were extracted from November 2016 to October 2017.
• The geographical area of study is limited to the United States.
• The tweet language is limited to English and Spanish, which represent two main population groups in the US.
Results (figures)
• Monthly trend of tweets by sentiment: Obamacare tweets for July-September 2016 and 2017; Trumpcare tweets from November 2016 to October 2017.
• Geographical distribution of all, negative and positive tweets, for Obamacare and for Trumpcare.
• Main #hashtags of authors; Trumpcare sentiment analysis by language.
Conclusion
• Most tweets have negative sentiment about both Obamacare and Trumpcare; however, Trumpcare draws relatively more opposition.
• CA, TX, NY and FL have high tweeting rates about the acts, with many negative tweets; they partly support Obamacare, but Trumpcare is not supported in any state.
• Hashtags accompanying Obamacare talk about its negative impact on the government’s budget; hashtags accompanying Trumpcare talk about stopping the program and President Trump’s violations.
• Spanish tweets have a higher rate of positive sentiment than English ones.
  • 24. Project Assignment Optimization Based on Human Resource Analysis
Team: Jiarui Li, Siyan Zhang, Siyuan Dang, Xin Lin, Yuejie Li. Instructor: Alkis Vazacopoulos. Business Intelligence & Analytics (Poster 16)
Introduction
The primary goal of our project is to provide insight to companies that need to properly assign projects to each employee. In order to maintain a low turnover rate while increasing employee productivity and growth, we use a combination of machine learning and optimization to help companies build and preserve a successful business.
Model
• Objective: minimize the turnover rate. We build an assignment model that optimizes the projects assigned to each employee while minimizing the employee turnover rate (a minimal solver sketch follows this poster’s text).
• Decision variables and parameters: P_ij takes the value 1 if project i is assigned to employee j and 0 otherwise; a_i is the number of employees needed for project i; b_j is the number of projects employee j can complete.
Data Exploration
1. Project count for turnover vs. no turnover.
2. Decision tree model: gain insights from the data and predict the turnover rate.
3. Heat map: there is a positive (+) correlation between projectCount, averageMonthlyHours and evaluation.
Moving Forward
For future work, we look forward to improving the model by introducing additional constraints. In addition, we will combine machine-learning (Random Forest) predictions of the turnover rate with the optimization model, so that we can generate a more accurate, dynamic result based on the enterprise’s data.
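A minimal sketch of the assignment model in PuLP follows. The turnover-risk scores, the staffing requirements a_i and the capacities b_j are made-up numbers, and the constraint forms (each project fully staffed, each employee within capacity) are our reading of the variable definitions above rather than the team’s exact formulation.

```python
# Sketch: project-to-employee assignment that minimizes predicted turnover risk.
# risk[i, j], a_i and b_j are made-up numbers; in the project the risk scores
# would come from the machine learning model (e.g. the Random Forest).
from pulp import LpBinary, LpMinimize, LpProblem, LpVariable, lpSum, value

projects = ["P1", "P2", "P3"]
employees = ["E1", "E2", "E3", "E4"]

risk = {("P1", "E1"): 0.20, ("P1", "E2"): 0.35, ("P1", "E3"): 0.15, ("P1", "E4"): 0.40,
        ("P2", "E1"): 0.30, ("P2", "E2"): 0.10, ("P2", "E3"): 0.25, ("P2", "E4"): 0.20,
        ("P3", "E1"): 0.25, ("P3", "E2"): 0.30, ("P3", "E3"): 0.35, ("P3", "E4"): 0.10}
a = {"P1": 2, "P2": 1, "P3": 1}           # employees needed per project (a_i)
b = {"E1": 1, "E2": 1, "E3": 1, "E4": 2}  # projects each employee can take (b_j)

prob = LpProblem("project_assignment", LpMinimize)
x = LpVariable.dicts("assign", (projects, employees), cat=LpBinary)   # P_ij

prob += lpSum(risk[i, j] * x[i][j] for i in projects for j in employees)
for i in projects:                        # each project gets the staff it needs
    prob += lpSum(x[i][j] for j in employees) == a[i]
for j in employees:                       # no employee is overloaded
    prob += lpSum(x[i][j] for i in projects) <= b[j]

prob.solve()
print("total risk:", value(prob.objective))
print([(i, j) for i in projects for j in employees if x[i][j].value() == 1])
```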
  • 25. S&A Processed Seafood Company Team: Redho Putra, Neha Bakhre, Harsh Bhotika, Ankit Dargad, Dinesh Muchandi Instructor: Alkis Vazacopoulos Methodology • We found out how many units S&A Company sold. (The article give the ‘value’-> 330M) • Then we used forecasting techniques to estimate how many units S&A will sell in the next 3 years to hit sales target 550M, which is our objective of this project. • Suggested new supply chain strategy to reach desired target. Introduction • S&A Company is a distributor of processed seafood products. • The broader offering of products has improved sales in the area of Western Europe. • Sales have improved with the new product acquisitions, but S&A now wishes to expand the entire product line into more of Eastern Europe. • However, the company realizes that their current model of putting a distribution center (DC) in each country that they service may no longer be the best option as they look to expand. Current Strategy Conclusion • The optimum number of DCs for S&A Company is 3 (UK, Germany, and Poland) • UK DC covered demand from UK & Ireland; Germany DC covered demand from Germany, France & Belgium; Poland DC covered demand from Poland & Hungary. • By meeting the industry benchmark of inventory turnover (15), UK DC expected average inventory is 257,000 cases; Germany DC expected average inventory is 261,000 cases, and Poland DC expected average inventory is 175,000 cases. • The warehouse utilization rate of those 3 DC would be 80%. Business Intelligence & Analytics 21 Suggested Strategy Results • Only Ireland DC has utilization rate around 80%. The other DC utilization rate is between 25-42%. • The average inventory turnover in all DC is 11, while the industry benchmark is 15. Objective • S&A Company is looking to drive significant organic revenue growth in the European market growing from $340M (as of the end of the fiscal year ending June 30, 2017) to $550M over the next 3 years. • We need to provide a supply chain strategy that helps them answer fundamental questions regarding the expected inventory and flow of goods that will be required three years from now to support the desired sales growth. • The company needs advice regarding the supply chain network, infrastructure and processes needed to serve its customers within the expected delivery window while optimizing costs.
  • 26. Optimizing Travel Routes
Team: Ruoqi Wang, Yicong Ma, Jinjin Wang, Shuo Jin. Instructor: Alkiviadis Vazacopoulos. Business Intelligence & Analytics (Poster 18)
Introduction
More than a hundred million international tourists arrive in the United States every year and spend billions of dollars. To maximize profit and make more competitive offers, it is crucial to lower costs. One important factor is arranging an efficient tour route so that tour agents and travelers can lower the cost of time and transportation. Our project goal is to optimize a tour route, using certain constraints, to minimize costs and maximize the tour experience.
Experiment
• To achieve our objective, we constructed a two-step experiment.
• Step one: define attributes and set up constraints so that we can pick 6 target cities from the 10 most popular tourist destinations in the U.S. We collected data from TripAdvisor, and the cost-of-living index is used to estimate the expenses of the tour, including dining and lodging. We added several other constraints, such as visiting at least two national parks and at least one Michelin-starred restaurant, to maximize the tour experience, and used Excel Solver to obtain our final selection of destinations.
• Step two: optimize the tour route over the cities picked in step one. We compared the costs of 3 transportation methods (train, bus, flight), first optimized the cost for each city, and then, after finishing the data cleaning, applied our data to a Traveling Salesman Problem model and optimized the travel route using Excel Solver.
Result
(Route maps and cost comparisons shown on the poster.)
Future Work
• In the near future, we want to make the model more realistic by adding timetables.
• In addition, we want to add more constraints to decide the optimal group size and ways to deal with customers’ luggage.
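For a route over only six cities, the Traveling Salesman Problem can even be solved by brute force; the sketch below does this in Python with a made-up, symmetric cost matrix and a hypothetical city list (the project itself used Excel Solver and real transportation costs).

```python
# Sketch: brute-force Traveling Salesman over 6 cities with a made-up cost matrix.
# The project solved this in Excel Solver with real transportation costs.
from itertools import permutations

cities = ["New York", "Chicago", "Las Vegas", "San Francisco", "Orlando", "New Orleans"]
cost = [  # illustrative pairwise travel costs in dollars (symmetric)
    [  0, 120, 260, 300, 140, 180],
    [120,   0, 210, 250, 190, 160],
    [260, 210,   0,  90, 280, 230],
    [300, 250,  90,   0, 320, 270],
    [140, 190, 280, 320,   0, 110],
    [180, 160, 230, 270, 110,   0],
]

start = 0                                   # tour starts and ends at the first city
best_cost, best_route = float("inf"), None
for perm in permutations(range(1, len(cities))):
    route = (start, *perm, start)
    total = sum(cost[a][b] for a, b in zip(route, route[1:]))
    if total < best_cost:
        best_cost, best_route = total, route

print("cheapest tour cost: $", best_cost)
print(" -> ".join(cities[i] for i in best_route))
```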
  • 27. Predicting Airbnb Prices in NYC
Team: Jiahui Bi, Ruoqi Wang, Yicong Ma, Xin Chen. Instructor: Yifan Hu. Business Intelligence & Analytics, November 2017 (Poster 19)
Introduction
• Background: since taking off around 2013, the sharing economy has become a vital part of the industry and affects almost everyone’s life. Airbnb, as one of its flagship representatives, draws constant attention from explorers and users.
• Purpose: explore which factors influence the price of Airbnb houses, apartments and rooms, and use those factors to predict prices in New York City.
• Key question: which factors influence prices in New York, and how?
• Technology: Python (sklearn, xgboost).
Data Preparation
• Data source: http://insideairbnb.com/get-the-data.html (New York City, New York, United States, October 2017).
• Data cleaning, starting from 44,317 rows and 96 features:
  - Drop entries with missing (NaN) values in columns like “bedroom” and “bed”.
  - Substitute missing values with the column mean for columns like “review_scores_location”.
  - Set 'reviews_per_month' to 0 where it is currently NaN.
  - Convert string values to integer or float types.
  - Delete price outliers (more than $2,000 or equal to 0).
• Finally, we end up with 43,148 rows and 28 features.
Exploratory Data Analysis
• Highly relevant features and price distribution.
• The price is strongly positively related to location; in our case, a Manhattan location is highly correlated with price.
• Accommodates, room type and the numbers of bathrooms, bedrooms and beds also correlate strongly with price.
Method 1
• Built vanilla Linear Regression, Ridge Regression, Lasso Regression and Bayesian Ridge on cross-validation splits.
• Figure 1 shows the results using 18 features; Figure 2 shows the results after one-hot encoding (creating dummy variables). Lasso comes out on top.
• Ensemble model: the Gradient Boosting Regressor does better still, with MAE = 25.52, almost 20% lower than the previous method.
Method 2
• Ridge, baseline & XGBoost: to test the models in depth, we repeated the train-validation split and looked at how the errors are distributed, adding a baseline model for comparison.
• Both Ridge and XGBoost beat the baseline model, and Ridge Regression performs slightly better than the XGBoost model:

  Model     Mean RMSE
  baseline  129.69
  ridge     97.24
  xgboost   98.31

Conclusions
• Built 5 models and an ensemble; tested the XGBoost model against a baseline model and Ridge Regression.
• The most important features influencing prices in NYC are location, room type and the number of bedrooms.
Future Work
• Extract new features, such as amenities.
  • 28. Portfolio Optimization of Cryptocurrencies
Team: Yue Jiang, Ruhui Luan, Jian Xu, Shaoxiu Zhang, Tianwei Zhang. Instructor: Alkis Vazacopoulos. Business Intelligence & Analytics (Poster 22)
Motivation
• Cryptocurrencies are a very important topic for investors and the economic world, and everyone wants to understand whether it is worthwhile and safe to invest in this type of asset.
• We want to create a portfolio optimization model so that we can test the performance of cryptocurrencies and understand the returns and risks we take on with different cryptocurrency portfolios.
Technology
• An API script to crawl cryptocurrency data from the internet.
• Python to construct the portfolio with Monte Carlo simulation and to optimize the Sharpe ratio and the portfolio variance.
• Python’s matplotlib package to visualize the scatter plot of expected return versus expected volatility.
Current & Future Work
• We have price data for 9 different cryptocurrencies from May 24th, 2017 to Oct 3rd, 2017, a very small subset of the data but a good start for testing the investment concepts and optimizing a portfolio of digital currencies.
• Using the Monte Carlo simulation method, we constructed our portfolio model, then calculated the covariance of the 9 cryptocurrencies and the correlation of every pair.
• We used Python to calculate and visualize the Markowitz efficient frontier of our data.
• We maximized the Sharpe ratio to obtain the portfolio with the best risk-adjusted return, and also minimized the portfolio variance to obtain lower risk.
• Our findings show that the volatility of digital currencies is very high, driven by market changes, investor expectations and government regulations.
• In the future, we can include more cryptocurrencies and a longer time range to get more precise results. Moreover, if regulations loosen and investors can transact between digital and real currencies, we can use that transaction data to look for arbitrage opportunities.
Data Snapshot
Our data is crawled from cryptocurrency market websites. There are nearly 1,500 digital currencies in the market; however, for testing the concepts, and due to the limits of our computers, we only obtained market prices for 9 cryptocurrencies, for the period May 24th, 2017 to Oct 3rd, 2017.
Log Returns
As the majority of statistical analysis approaches rely on log returns rather than the absolute time series, we use Python’s NumPy package to compute the log returns of the 9 cryptocurrencies. The returns indicate that the risk of investing in these virtual currencies was very high during the period, because only two of them had positive returns.
Covariance & Correlation
To see the dependence between the 9 digital currencies, we computed the covariance and the correlation between each pair. They are positively correlated, and most of them are tightly related to Bitcoin.
Random Weights
Subject to the constraint that the weights of the portfolio holdings must sum to 1, we randomly assigned weights to the 9 cryptocurrencies.
Monte Carlo Simulation & Sharpe Ratio Optimization
We then use Monte Carlo simulation to generate 4,000 sets of randomly generated weights for the individual virtual currencies and calculate the expected return, expected volatility and Sharpe ratio of each randomly generated portfolio. It is then very helpful to plot these combinations of expected return and volatility on a scatter plot, coloring the data points by the Sharpe ratio of each particular portfolio.
Optimization of the Portfolio Variance
We minimized the portfolio variance so that we can invest in the portfolio with the lowest volatility. The result suggests investing in the first, second, third and last digital currencies.
Maximization of the Sharpe Ratio
By maximizing the Sharpe ratio of the portfolio, the result is to choose only the last digital currency.
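The simulation and the Sharpe-ratio optimization can be sketched as follows. The simulated log returns stand in for the nine coins’ crawled data, and the zero risk-free rate and 365-day annualization are assumptions made only for this illustration.

```python
# Sketch: Monte Carlo portfolio weights plus a max-Sharpe optimization.
# The simulated log returns stand in for the nine cryptocurrencies' data;
# a zero risk-free rate and 365-day annualization are assumptions.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(7)
log_ret = rng.normal(0.001, 0.05, size=(130, 9))     # ~130 days, 9 coins (simulated)
mean_ret, cov = log_ret.mean(axis=0) * 365, np.cov(log_ret.T) * 365

def port_stats(w):
    ret = w @ mean_ret
    vol = np.sqrt(w @ cov @ w)
    return ret, vol, ret / vol                        # return, volatility, Sharpe

# Monte Carlo: 4,000 random weight vectors that sum to 1.
weights = rng.dirichlet(np.ones(9), size=4000)
stats = np.array([port_stats(w) for w in weights])
print("best simulated Sharpe:", stats[:, 2].max().round(3))

# Direct maximization of the Sharpe ratio (minimize its negative).
cons = ({"type": "eq", "fun": lambda w: w.sum() - 1},)
bounds = [(0.0, 1.0)] * 9
opt = minimize(lambda w: -port_stats(w)[2], np.full(9, 1 / 9),
               method="SLSQP", bounds=bounds, constraints=cons)
print("optimal weights:", opt.x.round(3))
```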
  • 29. Predicting Movie Rating and Box Office Gross
Team: Yunfeng Liu, Erdong Xia, Yash Naik. Instructor: Amir H. Gandomi. Statistical Learning & Analytics, Spring 2017 (Poster 24)
Introduction
• The primary purpose of this project was to create models from existing movie data to predict box office gross and movie ratings.
• The first model predicts the gross box office revenue that a movie will generate given inputs such as the genre, the actors involved, the director and many more influencing factors.
• The second model predicts the movie rating considering the genre, the production budget and many more influencing factors.
Models
• Principal Component Analysis
• Multivariate Regression Model
Data Source
• Database from www.kaggle.com: over 5,000 movies from the past 20 years, with 18 variables.
Data Processing
• Delete null and missing values from the raw data (5,500 rows to 3,116 rows).
• Select movies with sufficient reviewer numbers (3,116 rows to 1,558 rows).
• Transform directors’ and actors’ Facebook likes into score levels from 1 to 5.
• Transform the string variable of genres into Boolean variables for 17 specific categories.
• Divide the population evenly into four testing datasets and set the training dataset according to each testing dataset for regression model testing.
Principal Component Analysis
• To investigate the movie rating and gross regression models, principal component analysis is conducted on the 17 genre categories.
• From the scree plot, the first 6 factors are selected as principal components of movie genres; their eigenvalues are greater than 1 and the cumulative variance explained is greater than 0.6.
Regression Model for Rating
• 3 models with different variables entered are fitted on the training datasets and verified on the testing datasets; examining the MSE of each model in each test, model 1 is considered the best.
• In the final regression model for rating, duration, director, year and actors are the explanatory variables, of which year is a negative factor.

  MSE of rating model   test1    test2    test3    test4
  model1                0.3168   0.3487   0.3872   0.4128
  model2                0.3171   0.3627   0.3872   0.4128
  model3                0.3195   0.3492   0.3890   0.4148

  Rating model (all years)   Parameter estimate   Standardized estimate
  Intercept                  41.94717             0
  duration                   0.01275              0.25958
  director                   0.13974              0.17509
  actors                     0.05001              0.06428
  year                       -0.01848             -0.16053

Regression Model for Box Office Gross
• As with the rating model, 3 models with different variables are compared by MSE on the 4 testing datasets, from which model 2 is the best.
• In the final regression model for gross, budget, factors 1, 2, 3 and 5, duration, director and actors are the explanatory variables, of which factors 3 and 5 are negative factors.

  MSE of gross model (x 10^15)   test1   test2   test3   test4
  model1                         2.278   4.017   1.926   1.635
  model2                         2.269   3.960   1.928   1.650
  model3                         2.277   4.011   1.925   1.636

  Gross model (all years)   Parameter estimate   Standardized estimate
  Intercept                 5.1 x 10^6           0
  Factor1                   2 x 10^7             0.25932
  Factor2                   1.9 x 10^7           0.24701
  Factor3                   -5.0 x 10^6          -0.069
  Factor5                   -5.0 x 10^6          -0.0678
  duration                  2.5 x 10^5           0.06465
  director                  4.3 x 10^6           0.06724
  actors                    7.6 x 10^6           0.12248
  budget                    2.2 x 10^-1          0.24059
• The scoring coefficients of the 6 principal components are displayed with the positive indicators of each factor in green cells and the negative indicators in red cells.
• From the PCA we conclude that Factor 1 resembles family animation without thriller and crime, Factor 2 resembles action and sci-fi with thriller, Factor 3 is action without horror, Factor 4 is biography, Factor 5 is crime, and Factor 6 is documentary.
Conclusion
• From the movie rating model, a long movie shot by a famous director is more likely to earn a high rating on reviewer websites, and the genres of the movie are largely irrelevant.
• From the movie gross model, a big-budget movie in the family-animation or action-sci-fi genres is more likely to earn a successful box office. Also, in terms of earning money, actors appear more important than the director in this model.
[Poster figures: a 4-fold train/test split diagram (white = testing, red = training for Test1–Test4) and tables listing the variables entered in each candidate rating model (duration, director, year, actor, budget) and gross model (duration, director, Factors 1–5, actor, budget, year).] 24
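As a sketch of the genre PCA step, assuming the 17 Boolean genre indicators sit in a pandas DataFrame named `genres` (a placeholder name; the poster does not state which tool was used):

```python
import pandas as pd
from sklearn.decomposition import PCA

def genre_factors(genres: pd.DataFrame, n_components: int = 6) -> pd.DataFrame:
    """Fit PCA on the 0/1 genre indicators and return the loadings (scoring coefficients)."""
    pca = PCA(n_components=n_components)
    pca.fit(genres.values)
    loadings = pd.DataFrame(
        pca.components_.T,
        index=genres.columns,
        columns=[f"Factor{i + 1}" for i in range(n_components)],
    )
    # Cumulative variance explained should exceed 0.6 with 6 factors, as reported on the poster.
    print(pca.explained_variance_ratio_.cumsum())
    return loadings
```

Positive loadings correspond to the green cells in the poster's coefficient table and negative loadings to the red cells.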
  • 30. Student Performance Prediction Team: Abineshkumar, Sai Tallapally, Vikit Shah Instructor: Amir H. Gandomi Multivariate Data Analysis Fall, 2017
Introduction & Motivation
• Regression and classification models are developed to help schools predict student performance in the final exam using historical data, including demographics and previous test scores.
• Prediction tools can be used to improve the quality of education and to enhance school resource management.
Data Understanding
• The final grade (G3) of students ranges from 0 to 20, as displayed in the histogram below.
Modeling and Technology
• Two datasets with the same variables are used: the first covers the Mathematics subject and the second the Portuguese subject.
• G3 (final grade), ranging from 0 to 20, is the target.
• A linear regression model is built to predict the final grade of students from the available student information.
• The linear equation is formed by selecting the top 5 variables – those that provide the lowest BIC and Mallow's Cp while providing an optimal adjusted R².
• The most important variables are G1 and G2, the first and second period grades, which are important in predicting the final grade.
• A classification is also done for the grade: when a student's score is more than 10 it is classified as Pass, and otherwise as Fail.
• The classification technique used here is logistic regression.
• Classifying the grades (0–20) into two classes (Pass/Fail) produces a more accurate model, but in this experiment the focus was on building a linear regression model that predicts the final grade as a numeric value, as shown below.
Linear Regression Model
• From the summary of the model we took the top 5 variables to build the linear regression model: Model = lm(G3 ~ G1 + G2 + absences + famrel + age)
• G1, G2, absences, quality of family relationship (famrel) and age of the student are the top 5 independent variables for predicting the final grade (G3).
• R-squared: 0.8599, Adjusted R-squared: 0.846
Decision Tree in R
• The decision tree algorithm used the first period grade, second period grade and mother's job to predict whether a student will pass or fail the final exam.
• The fitted model predicts that a student passes the final exam if their second period grade is >8, their first period grade is >10, or their mother has a job title of "Teacher" or "Other". The accuracy of the model is 87%, which is very close to the models built on Azure.
Result and Conclusion
• The final grades of the students depended mostly on the first and second period grades; demographic information, such as whether a student's father had a job or whether the attendance percentage was high, was also useful in the model-building process.
• The final grade had a decent correlation with the first and second period grades, and these fields were very important in building the model. In the decision tree model above we can see the impact of the mother having a job title of "teacher" or "other". In the linear model, family relationship quality and age are among the top 5 independent variables, which shows the importance of those variables as well.
• From the above results educators can understand the impact of the previous grades (first and second period) and can therefore start planning to support students during those exams. The predicted student performance also helps in implementing this model as a policy in schools.
Reference: http://www3.dsi.uminho.pt/pcortez/student.pdf 25
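The poster's regression was fitted in R with lm(); a Python counterpart using the statsmodels formula API might look like the sketch below, assuming the UCI student data are loaded into a DataFrame `df` with the columns named as in the poster:

```python
import pandas as pd
import statsmodels.formula.api as smf

def fit_final_grade_model(df: pd.DataFrame):
    """OLS counterpart of the poster's R call: lm(G3 ~ G1 + G2 + absences + famrel + age)."""
    model = smf.ols("G3 ~ G1 + G2 + absences + famrel + age", data=df).fit()
    print(model.summary())      # R-squared should land near the ~0.86 reported on the poster
    return model

def add_pass_fail_label(df: pd.DataFrame) -> pd.DataFrame:
    """Pass/Fail label used for the classification step: a score above 10 counts as a pass."""
    df = df.copy()
    df["passed"] = (df["G3"] > 10).astype(int)
    return df
```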
  • 31. Classifying Iris Flowers Team: Xi Chen, Shan Gao, Lan Zhang Instructor: Amir H. Gandomi Business Intelligence & Analytics 26
Purpose
• The sepals and petals of iris flowers are quite distinctive, and there are three different species of iris. To classify iris flowers, we create a k-nearest neighbors (KNN) model that assigns a label or class to a new instance. A regression model is also developed to assign a value to the new instance.
Data Description
• This famous (Fisher's or Anderson's) iris data set gives the measurements, in centimeters, of sepal length, sepal width, petal length and petal width for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor and virginica.
Methodology
Linear Regression
• We plotted the scatter matrix of the iris data before diving into the topic; the data set consists of four measurements (length and width for petals and sepals).
• We performed regression analysis on petals and sepals to assess their significance.
KNN method with R
• The similarity measure is typically expressed by a distance measure such as the Euclidean distance, which we use on the iris dataset.
• We use this similarity value to perform predictive modeling, i.e. classification, assigning a label or class to the new instance.
• We divide the dataset into training and test sets so as to test the accuracy of our classification.
Clustering
• Hierarchical cluster analysis is used to visualize how the clusters are formed and what the relationships among the iris samples are.
Data Analysis Results:
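The KNN analysis on the poster was done in R; a Python equivalent with scikit-learn is sketched below (Euclidean distance is the classifier's default metric, and k = 5 is an assumption since the poster does not state k):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Split the 150 iris measurements into training and test sets, as described above.
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=1, stratify=iris.target
)

knn = KNeighborsClassifier(n_neighbors=5)   # Euclidean distance by default
knn.fit(X_train, y_train)
print("Test accuracy:", accuracy_score(y_test, knn.predict(X_test)))
```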
  • 32. Business Intelligence & Analytics 31 Experiments for Apartment Rental Advertisements Team: Nishant Bhushan, Sharvari Joshi, Nirmala Seshadri Instructor: Chihoon Lee
Resulting Regression Equation
Likelihood of responding to the Ad = 0.6425 − 0.0875 × Amenities[No] − 0.0775 × Public Transport[No]
Expected Response under Best Factor-Level Choice
Likelihood of Responding to the Ad = 0.6425 − 0.0875 (−1) − 0.0775 (−1) = 0.81
Objective
• To determine which factors in apartment rental advertisements contribute to responses from ad viewers.
• Apartment hunting has always been a crucial experience for graduate students. We therefore performed an experiment to see how the response is affected by different factors in a posting on the website.
• The experiment aims to maximize students' response to an apartment rental post on a housing website by determining which factors affect the student's response.
Approach
A customer survey was conducted and 16 sample ads were made. Each ad was sent to 5 customers, and they were asked how likely they would be to respond to that particular ad. The customers were asked to choose one of the following options:
• Extremely Likely (100%)
• Very Likely (80%)
• Moderately Likely (60%)
• Slightly Likely (40%)
• Not Very Likely (20%)
CHOICE OF EXPERIMENTAL DESIGN
• 2_IV^(8−4) fractional factorial design
• NUMBER OF LEVELS FOR EACH FACTOR = 2
• NUMBER OF FACTORS = 8
• NUMBER OF OBSERVATIONS = 16
• RESOLUTION = 4
• No. of replications: 5
• We chose this design because the number of factors is large (8); a full factorial design would result in a large number of interactions and 2^8 = 256 runs, which would be impossible to carry out given the time and budget constraints.
Example of a survey
- 16 advertisement surveys were created.
- Each survey had 5 responses.
- Target audience: students at Stevens Institute of Technology.
THE DESIGN OF EXPERIMENT
Conclusion
• Not including the amenities decreases the chances of a customer responding to the advertisement by 8.75%.
• Not including access to public transport decreases the chances of a customer responding to the advertisement by 7.75%.
• Mentioning the amenities and the access to public transport are the most significant factors, and they should be included in housing ads.
• The posting time of the ad is not a significant factor, so the advertisement can be posted at any time during the day.
• Street maps and pictures are not significant factors either and can be omitted.
Factors and Levels
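The fitted equation above can be turned into a one-line predictor; this small sketch simply re-evaluates the poster's regression with the ±1 factor coding (−1 = information included in the ad, +1 = not included):

```python
def predicted_response(amenities_no: int, public_transport_no: int) -> float:
    """Likelihood of responding to the ad, from the poster's fitted regression equation."""
    return 0.6425 - 0.0875 * amenities_no - 0.0775 * public_transport_no

# Best factor-level choice: include both amenities and public-transport information.
print(predicted_response(-1, -1))   # 0.8075, i.e. roughly the 0.81 reported on the poster
```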
  • 33. Human Resource Retention Team: Aditya Pendyala, Rhutvij Savant Instructor: Amir H Gandomi Multivariate Data Analytics Fall 2017
Introduction
Problem: An organization aims to retain its important resources and reduce the number of employees leaving the company. The HR department has collated employee-related data which can be used to predict which employees may leave. If a prediction model can be designed with high accuracy, the HR department can take the necessary action to prevent critical employees from leaving the company by addressing the variables that are causing the employee to resign.
Data: The data provided is compiled by Human Resources and consists of ten variables: satisfaction level, last evaluation, average monthly hours, time spent at the company, promotion, salary, employee left, work accident, department, and number of projects.
Technology:
• R has been used to develop the logistic regression prediction model. A confusion matrix and ROC curve are used to improve the accuracy of the model and to set the threshold.
• Python has been used to generate data visualizations of the various analyses, including Principal Component Analysis.
Data Analyses
• The heatmap below depicts the correlations between the different features of the data set.
• The count of each of the nine incidences is depicted below.
• The deviance table above indicates the estimates, standard errors and z-values of all variables involved in the dataset.
Logistic Regression Prediction Model
• Number of available instances: 15,000 employees.
• The database is divided into training and testing subsets in a ratio of 4:1.
• The prediction model was created on the training dataset using logistic regression (12,000 employees).
• The model is validated on the employees in the testing set (3,000 employees).
• The confusion matrix is used to show the accuracy of the model:
Confusion matrix at threshold T = 0.5 (actual vs. predicted): TN = 2102, FP = 183, FN = 434, TP = 281.
Confusion matrix at threshold T = 0.3 (actual vs. predicted): TN = 1846, FP = 439, FN = 219, TP = 496.
• The ROC curve is used to set the threshold to 0.3, the optimal trade-off between the True Positive Rate and False Positive Rate, which gave the model an accuracy of 78%.
Principal Component Analysis
With the large number of variables, correlation between different variables is inevitable. With Principal Component Analysis, all variables can be made linearly uncorrelated and redundant variables can be dropped. Based on this graph, it is visible that the 7th component can be removed.
Conclusion
Based on the above analyses, the following observations can be made:
• The model can be used to identify the majority of employees who may leave.
• For the identified employees, the analyzed data – such as promotion, monthly hours and salary – can be used by HR to stop the employee from leaving, as shown in the correlation and data analysis charts.
Future Potential
• Address all possible contributing factors to prevent an employee's departure proactively.
• Use the framework to increase the accuracy of the model.
• Analyze more variables that might affect an employee's leaving, e.g. commute time and training opportunities.
[Poster figure: ROC Curve.] 27
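A minimal sketch of the logistic-regression step with the 0.3 decision threshold, assuming the HR data are in a DataFrame `df` with a binary target column named `left` and with categorical columns already encoded (both assumptions, not details from the poster):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, roc_auc_score

def fit_attrition_model(df: pd.DataFrame, threshold: float = 0.3):
    """Logistic regression with an 80/20 train/test split and a custom decision threshold."""
    X = df.drop(columns=["left"])
    y = df["left"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]
    preds = (proba >= threshold).astype(int)     # threshold of 0.3 chosen from the ROC curve
    print(confusion_matrix(y_test, preds))
    print("AUC:", roc_auc_score(y_test, proba))
    return model
```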
  • 34. Constructing Efficient Investment Portfolios Team: Cheng Yu, Tsen-Hung Wu, Ping-Lun Yeh Instructor: Alkis Vazacopoulos Business Intelligence & Analytics Fall, 2017
Introduction
Individual investors often do not know how to build their investment portfolio, and for this reason they usually rely completely on advice from financial companies. We offer a customized investment planner that advises what quantities of each stock to buy in order to achieve a given target return rate with the minimum portfolio variance.
Technology
• Using the Capital Asset Pricing Model (CAPM) to generate expected returns by considering systematic risk.
• IBM DOcplex Modeling for Python to execute the quadratic programming that optimizes the objective of the mathematical model.
• Python for visualization using pygal, plus Microsoft Power BI.
Modeling
Capital Asset Pricing Model (CAPM):
E(R_i) = R_f + β_i [E(R_M) − R_f]
where E(R_i) is the expected return of stock i, R_f is the risk-free rate, β_i is the beta of stock i (a measure of its systematic risk), E(R_M) is the expected return of the market portfolio, and E(R_M) − R_f is the market risk premium, a measure of the excess return of the market portfolio over the risk-free rate.
Optimization Model:
P_i = current price of stock i; Q_i = quantity of stock i; I_i = investment amount of stock i, with I_i = Q_i × P_i; W_i = weight allocated to stock i in the portfolio, with W_i = I_i / Σ_{i=1..n} I_i; R_i = yearly return of stock i; VC = covariance matrix of the returns, with entries Cov(R_i, R_j) = Σ (R_i − R̄_i)(R_j − R̄_j) / (N − 1); σ_p² = portfolio variance, with σ_p² = Wᵀ × VC × W; R_p = portfolio return; B_p = actual portfolio budget; B_I = expected disposable income.
Objective: minimize σ_p²
Subject to: R_p = Σ_{i=1..n} R_i × W_i ≥ target return rate, and B_p ≤ B_I.
Analysis / Simulation
[Poster charts: CAPM model and optimization model outputs – actual budget spent and number of shares held in specific stocks. Example profile – Age: Under 25; Disposable Income: 4,361 USD; Actual expense: 4,342 USD; Target Return: 30%; Portfolio Variance: 0.66; Number of Stocks Selected: 20.]
Conclusion
We incorporated CAPM in our model, as it is broadly used among financial experts as an evaluation method for future stock prices. According to the simulation results, the higher the target return rate, the higher the risk of the portfolio. Furthermore, the model recommends a particular combination of stocks with the minimum portfolio risk for the given target return rate. Investors can use the results of this model to find their target stocks and customize their stock portfolios. 28
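The team solved the quadratic program with IBM DOcplex; purely as an illustration of the same minimum-variance idea, here is a simplified, weight-based sketch using scipy (the expected returns, covariance matrix and target return would come from the CAPM step and are placeholders here):

```python
import numpy as np
from scipy.optimize import minimize

def min_variance_weights(exp_returns: np.ndarray, cov: np.ndarray, target_return: float) -> np.ndarray:
    """Minimize portfolio variance W'·VC·W subject to meeting the target return and full investment."""
    n = len(exp_returns)
    constraints = [
        {"type": "ineq", "fun": lambda w: w @ exp_returns - target_return},  # R_p >= target return
        {"type": "eq", "fun": lambda w: w.sum() - 1.0},                      # weights sum to 1
    ]
    result = minimize(
        lambda w: w @ cov @ w,                 # portfolio variance
        x0=np.full(n, 1.0 / n),
        bounds=[(0.0, 1.0)] * n,               # long-only positions
        constraints=constraints,
    )
    return result.x
```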
  • 35. A Tool for Discovering High Quality Yelp Reviews Team: Zijing Huang, Po-Hsun Chen, Hao-Wei Chen, Chao Shu Instructor: Rong Liu Business Intelligence & Analytics 32
Motivation
• For customers: customers are more likely to read reviews with details instead of reviews that were written just to vent emotions.
• For companies:
o Objective reviews are helpful in improving their products.
o High-quality reviews can increase customer engagement and attract more users.
Introduction
• Objective: a tool for discovering high-quality reviews.
• High-quality review: objective and supplies insightful details about the user experience.
• Dataset: Yelp hotel and restaurant reviews
o Raw data: 4,736,897 reviews
o Data with labeled aspects and sentiment: 1,132 sentences
• Methodology: deep learning with Convolutional Neural Networks (CNN) and word embeddings
Approach
1. Use the Yelp raw data to train word vectors that capture word semantics
2. Create a CNN to identify aspects in every sentence of a review
3. Train another CNN to detect the sentiment of each sentence
4. Build a neural network model to predict the quality of a review based on the aspects and sentiment of its sentences
5. Compare this approach with other models (SVM, Naïve Bayes, …) and analyze the pros and cons of each model
6. Visualize the result through a user interface and publish a Python package (or a RESTful API) for third-party use
Methodology
[Poster figures: example of a high-quality vs. a low-quality review, highlighting the useful information we extracted from the review versus the non-useful information.]
Result
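A minimal Keras sketch of the kind of sentence-level CNN described in the approach; the layer sizes, kernel width and optimizer here are assumptions, not the architecture reported on the poster:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_sentence_cnn(vocab_size: int, embedding_dim: int, n_classes: int) -> tf.keras.Model:
    """1-D CNN over word embeddings for sentence-level aspect (or sentiment) classification."""
    model = models.Sequential([
        layers.Embedding(vocab_size, embedding_dim),   # pretrained Yelp word vectors could be loaded here
        layers.Conv1D(128, kernel_size=3, activation="relu"),
        layers.GlobalMaxPooling1D(),
        layers.Dense(64, activation="relu"),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    return model
```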
  • 36. Analysis of Diabetes among Pima Indians – What factors affect the occurrence of diabetes? Team: Junjun Zhu, Jiale Qin, Yi Zhang Instructor: Amir H. Gandomi Multivariable Data Analysis Fall, 2017
Objectives:
• What factors affect the occurrence of diabetes?
• How best to classify a new observation?
• What is the best model for this data?
Data Information
Type: Classification
Data Source: https://archive.ics.uci.edu/ml/datasets/pima+indians+diabetes
Data Preparation
Figure 1 shows the distribution of the data, including the people who are diabetic and those who are not. Figure 2 shows the people who have diabetes, and Figure 3 the people who are not diabetic. We can see that the people without diabetes are more concentrated in the center – these people are younger and their index values are more typical.
Modeling
1. Simple Logistic Regression
2. Logistic Regression
3. K Nearest Neighbor
Results & Evaluation
Model 1: Baseline Model – the AUC value is 0.8505833, the F1 score is 0.6394558, and the recall (sensitivity) is quite low at 0.5875.
Model 2: Explanatory Model – Logistic Regression – the AUC value is 0.8336667, the F1 score is 0.528, and the recall (sensitivity) is quite low at 0.4125.
Model 3: Predictive Model – KNN – the AUC value is 0.7295417, the F1 score is 0.5384615, and the recall (sensitivity) is quite low at 0.525.
Conclusion
We have used three different models to study the data. From the results, we found that the simple logistic regression model is the best of the three, because its sensitivity, accuracy and F1 values are higher than those of the other methods. 29
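A small helper for the evaluation step, reporting the same three metrics used on the poster (AUC, F1 and recall/sensitivity); the variable names are placeholders:

```python
from sklearn.metrics import f1_score, recall_score, roc_auc_score

def report_metrics(name: str, y_true, y_pred, y_proba) -> None:
    """Print AUC, F1 and recall (sensitivity) for one fitted classifier."""
    print(
        f"{name}: AUC={roc_auc_score(y_true, y_proba):.4f}, "
        f"F1={f1_score(y_true, y_pred):.4f}, "
        f"recall={recall_score(y_true, y_pred):.4f}"
    )

# e.g. report_metrics("Simple logistic regression", y_test, predictions, probabilities)
```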
  • 37. Google Online Marketing Challenge: Aether Game Cafe Team: Ephraim Schoenbrun, Jaya Prasad Jayakumar, Sunoj Karunanithi, Saketh Patibandla Instructor: Chihoon Lee Factors & Sample Ads Business Intelligence & Analytics 30
Client and Market Analysis
• Aether Game Cafe (AGC) is a blend of a traditional coffee shop and a board game play centre located in Hoboken, New Jersey.
• AGC relies only on 'word of mouth' for its marketing.
User Flow
GOMC
• The Google Online Marketing Challenge is a unique opportunity for students to experience and create online marketing campaigns using Google AdWords.
• With a $250 AdWords advertising budget provided by Google, we developed and ran an online advertising campaign for a business over a three-week period.
Customer Analysis – Google Analytics
[Poster charts: visitor breakdown for Hudson County and New York.]
Experiments and Results
[Poster figures: significant factors, main effects and interaction effects, and the most successful ads.]
Conclusion
• Experimental design helped us achieve our target, which eventually ranked our campaign in the top 3% worldwide.
  • 38. The World's Best Fitness Assistant Team: Anand Rai, Jaya Prasad Jayakumar, Saketh Patibandla Instructor: Christopher Asakiewicz Business Intelligence & Analytics 33
[Poster screenshots: the Fitness Assistant welcome screen, the intelligence of our bot – the same questions receiving different answers based on context – the chatbot framework, and the text- and speech-enabled assistant; caption labels: Greetings 1, Greetings 2, Questions.]
Introduction
• We are fitness enthusiasts but never found a good application that guides us while working out.
• A few such applications exist, but they have the drawback of being static chatbots that answer only predefined questions.
• Our Fitness Assistant is AI-driven and uses Natural Language Processing to understand questions and give the best answer.
• This is a prototype, and we are going to build the application out to cover all exercises and food habits.
Natural Language Processing
• Natural Language Processing is the key to an efficient chatbot.
• Every question is stored for future analysis, and the knowledge base provides the answer even if the question is not hard-coded.
• The NLP searches for the entity in the question, which is similar to the noun in a sentence, and narrows down to the answer.
Future Scope
• This application will be improved by adding all varieties of exercises and multiple knowledge-base sources.
• This app will be a one-stop solution for fitness enthusiasts covering exercises and nutrition; we plan to build an in-app module to track user behavior so that our bot can suggest what to consume for the betterment of their lives.
  • 39. Zillow's Home Value Prediction Team: Wenzhuo Lei, Chang Xu, Juncheng Lu Instructor: Chris Asakiewicz Business Intelligence & Analytics
Background & Objectives
It is important for homeowners to have a trusted way of monitoring their assets. The "Zestimate" was created to estimate home values based on a large amount of data, and Zillow published a competition to improve the accuracy of its house price predictions.
Challenge
The data offered by Zillow include a large number of missing values across 57 features. The accuracy of the prediction is primarily affected by the choice of data.
Data Imputation
To start, we checked the log error (the target value), which follows a normal distribution. We also checked the frequency of trades by day and month; the date is not a significant factor for trading houses, and people tend to trade houses in summer. To impute the data, we separated the features into several types:
• Features with more than 90% missing values → deleted.
• Features with no missing values → kept.
• Binary features → missing values filled with 0.
• Irrational features → missing values filled with -1.
• Features with few missing values → filled with the mean of the feature.
• Special feature (total living area of the home) → imputed with KNN based on the number of bedrooms and bathrooms (more bedrooms and bathrooms usually mean a bigger area).
• Special features that depend on longitude and latitude → missing values filled with KNN based on longitude and latitude.
To avoid overfitting, we chose to reduce the number of features. To better select the features, we checked their importance and applied a regression model. The R² is low, which may be because of multicollinearity; to reduce its effect, we used Variance Inflation Factors to drop features, leaving 10 features.
Modeling
As mentioned above, we tried OLS and the R² was quite low. We tried Random Forest as the second method and it gave a reasonable score. After that we used Gradient Boosting since it had the lowest Mean Squared Error and Mean Absolute Error, and to be more accurate we applied XGBoost as the last modeling method. 34
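A sketch of the Variance Inflation Factor step using statsmodels; the iterative drop-the-worst scheme and the threshold of 10 are assumptions, since the poster only states that VIF was used to reduce the feature set to 10 columns:

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def drop_high_vif(X: pd.DataFrame, threshold: float = 10.0) -> pd.DataFrame:
    """Iteratively drop the feature with the highest VIF until all VIFs fall below the threshold."""
    X = X.copy()
    while X.shape[1] > 1:
        vifs = pd.Series(
            [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
            index=X.columns,
        )
        worst = vifs.idxmax()
        if vifs[worst] < threshold:
            break
        X = X.drop(columns=[worst])
    return X
```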
  • 40. Web Traffic Time Series Forecasting Team: Jujun Huang, Peimin Liu, Luyao Lin Instructor: Alkis Vazacopoulos Business Intelligence & Analytics http://www.stevens.edu/howe/academics/graduate/business-intelligence-analytics
Introduction
• Our group focuses on the problem of forecasting the future values of multiple time series.
• The training dataset contains the views of 145,063 Wikipedia articles.
• Each of these time series represents the daily views of a different Wikipedia article, starting from July 1st, 2015 up until December 31st, 2016.
• Our objective is to predict the daily views between January 1st, 2017 and March 1st, 2017.
Current & Future Work
• We used a heat-map and the Fast Fourier Transform in the data visualization.
• We used the ARIMA model in the data modeling part to predict the data from January 1st, 2017 to March 1st, 2017.
• In the future, we will combine the ARIMA model with other models to predict the time series and improve our accuracy.
• We are going to predict the views of all 145,063 Wikipedia articles from January 1st, 2017 to November 1st, 2017.
Methodology
• ARIMA model: we use the autoregressive integrated moving average (ARIMA) model to forecast the data.
• If the autoregressive polynomial of an ARMA(p′, q) process has a unit root (a factor (1 − L)) of multiplicity d, then the ARIMA(p, d, q) process expresses this polynomial factorization property with p = p′ − d and is given by:
(1 − Σ_{i=1}^{p} φ_i L^i)(1 − L)^d X_t = (1 + Σ_{i=1}^{q} θ_i L^i) ε_t
• The ARIMA(p, d, q) process with drift δ can be generalized as:
(1 − Σ_{i=1}^{p} φ_i L^i)(1 − L)^d X_t = δ + (1 + Σ_{i=1}^{q} θ_i L^i) ε_t
• Here t is an integer index and the X_t are the real-valued time series observations. L is the lag operator, the φ_i are the parameters of the autoregressive part of the model, the θ_i are the parameters of the moving average part, and the ε_t are error terms, generally assumed to be independent, identically distributed variables sampled from a normal distribution.
Data Modeling
• We can use the ARIMA model without transformations if our time series is stationary.
• In our case, we take a decomposition into three parts: trend, seasonality and residual. The sum of these three parts is equal to our observation.
• Below is one of the results of forecasting the views of one of the articles.
Analysis
Here are the results of the data visualization:
• From the heat-map, we can see a huge spike in web traffic at the end of July and at the beginning and middle of August. However, we cannot find any periodicity in the heat-map.
• English articles have the highest visits, and the trend of the Russian articles is similar to that of the English ones.
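A minimal statsmodels sketch of fitting an ARIMA model to one article's daily views and forecasting the next two months; the (p, d, q) order below is an assumption, not the order used by the team:

```python
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

def forecast_article_views(series: pd.Series, order=(7, 1, 1), horizon: int = 60) -> pd.Series:
    """Fit ARIMA(p, d, q) to a daily-views series and forecast the next `horizon` days."""
    fitted = ARIMA(series, order=order).fit()
    return fitted.forecast(steps=horizon)    # ~60 days covers January 1 - March 1, 2017
```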
  • 41. Portfolio Optimization with Machine Learning Team: Chao Cui, Huili Si, Luotian Yin, Qidi Ying, Yinchen Jiang Instructor: Alkis Vazacopoulos Business Intelligence & Analytics 36
Motivation
It is currently difficult for an investor to make any inferences about future prices, especially when volatility makes them uncomfortable. With classical methods, we can only make estimates using historical data. To explore this problem, we decided to use machine learning techniques to create portfolios and attempt to make the capital distribution more realistic.
Technology
• Using Python to collect one year of S&P 500 index stock data from Yahoo Finance, and setting a threshold to select the 50 stocks with the highest return and lowest risk into the stock pool.
• Using Python to build the machine learning models and to perform training and testing.
• Applying the IBM CPLEX module API for linear/non-linear programming and optimization.
• All calculations are based on monthly intervals.
Future Work
• We could not constrain the size of the optimized portfolio, which might lead to an impractically large portfolio; further study of IBM CPLEX is needed.
• The machine learning model needs substantial improvement; future work should build more accurate models with a wider range of information.
• We could generate more attributes of the stocks, such as industry, location and reputation, to cater to customer preferences.
• We could make a more dynamic system that takes price changes and trading fees into consideration.
[Poster figures: workflow – portfolio optimization without prediction; machine learning for prediction (add features → Ordinary Least Squares prediction → ensemble modeling → final prediction); training and testing results; and the efficient frontier for portfolio optimization without prediction, where portfolio risk increases slowly until the portfolio return reaches about 0.05, with the number of top N stocks set as a constraint.]
  • 42. Developing a Supply Chain Methodology for an Innovative Product Team: Akshay Sanjay Mulay Instructors: Alkis Vazacopoulos and Chris Asakiewicz Business Intelligence & Analytics 37
Background
• The health club economy is a multibillion-dollar endeavor that has remained static for many years.
• Nova Fit looks to revolutionize current fitness equipment by providing fitness enthusiasts with safer and more ergonomic options while they work out.
• The aim is to innovate every component of health club fitness.
Technology
• Google Analytics for data storage and Excel Solver for calculations.
• Tableau for visualizations and graphical analysis.
• Monte Carlo simulation to calculate the worst-case, best-case and average-case scenarios during the actual network analysis.
[Poster charts: market share in terms of revenue, number of bars to be produced, and customer survey results.]
Current Scenario
• The typical round barbell is unnatural to hold and creates inefficiencies that hinder workouts.
• According to an article on fitness in the NY Times, more than 90% of injuries are caused by lifting free weights.
Nova Fit – Revolutionary Barbell
• Nova Fit's grip solves the problem by conforming more naturally to the hands, improving confidence and performance. The spindle design also eliminates the need for the clip function that is required with current equipment.
Target Customers
• Initially, the target customers will be gym and fitness club owners in the Northeast region of the United States.
• We also look to sell the product to home fitness enthusiasts.
Sample New York Fitness Center Club
Forecast and Finance Model
• The statistics used for calculating Nova Fit's market share are taken from IBIS. The primary competitor is Invanko, which sells its barbell product at a price of $1,250.
Future Scope
• Build a strong supplier, distributor and manufacturer network when the actual product starts selling in the market.
• Determine the suppliers based on cost, lead times and availability of the components.
• Use the actual data to realign the production and sales strategy in the market.
• Determine the modes of selling the product and develop a comprehensive supply chain strategy using various simulations and scenarios.
  • 43. Bike Sharing Optimization Team: Jiahui Bai, Yuankun Nai, Yuyan Wang, Yanru Zhou Instructor: Alkis Vazacopoulos Business Intelligence & Analytics
Introduction
• A bike-sharing system allows people to rent a bicycle at one of the automatic rental stations scattered across the area, use it for a short journey and return it at any other station in the area.
• Ford Bike: located in the Bay Area, the system consists of 700 bikes and 70 stations across San Francisco and San Jose.
• Imbalance: the difference between the number of incoming bikes and the number of outgoing bikes at a station within a specific time period.
Problem
• Visualize trip routings on a map and the peak hours of workdays and weekends.
• Optimize the inventory level of bikes at stations during peak hours.
• Minimize the cost of deploying bikes between the stations.
• Improve the utilization of each bike by reducing the number of bikes that are not used in peak hours.
• Predict the demand for bikes under different weather conditions.
Model
• Linear program: we selected 12 months of peak-hour trip data for the selected stations to build the model, optimizing the inventory level of bicycles and minimizing the total cost of deployment with Excel Solver.
Total cost of deployment = ∑ ( cost of configuring one bike in an area × number of bikes configured in that area )
• Linear regression: to predict the demand under different weather conditions.
• Logistic regression: to determine how weather conditions affect the demand.
Data visualization
Peak hour selection – Time period: Aug 2014 – Aug 2015. Weekday discrimination: using the WEEKDAY function combined with IF in Excel, IF(WEEKDAY(start_date,2)>5,"weekend","workday"). Time grouping: using group and outline in a PivotTable, starting at 0:00 and ending at 24:00 with a one-hour step, giving 24 time periods in total.
[Poster charts: top 10 popular stations by start/end trip counts (stations 70, 69, 50, 55, 74, 61, 67, 60, 65, 77 and 64); hourly trip counts on weekends (peaking around 3,999 starts and 3,836 ends) and on workdays (peaks of roughly 51,973 / 48,410 / 47,161 / 46,729 trips at the rush hours); weather effect on the number of trips (FINE 300,801 starts / 301,245 ends; FOG 33,303 / 34,324; FOG-RAIN 6,578 / 7,183; RAIN 43,056 / 42,352; RAIN-THUNDERSTORM 1,629 / 1,247); GoFord bike trip pattern (San Francisco).]
Future work
• Improve the utilization of bicycles with the linear programming model.
• Optimize the bike deployment strategy.
• Predict the demand under different weather conditions with machine learning algorithms. 38
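As a sketch of the linear program (the poster solved it with Excel Solver), here is a scipy version of the same cost function; the per-station configuration costs and peak-hour demands below are made-up placeholders, while the fleet size of 700 bikes comes from the poster:

```python
import numpy as np
from scipy.optimize import linprog

cost = np.array([2.0, 3.5, 1.5, 4.0])      # cost of configuring one bike at each station (assumed)
demand = np.array([120, 80, 150, 60])      # bikes needed at each station in the peak hour (assumed)
fleet_size = 700                           # total bikes in the system (from the poster)

# Minimize total deployment cost = sum(cost_i * bikes_i)
# subject to bikes_i >= demand_i at every station and sum(bikes_i) <= fleet_size.
n = len(cost)
A_ub = np.vstack([-np.eye(n), np.ones((1, n))])
b_ub = np.concatenate([-demand, [fleet_size]])

res = linprog(c=cost, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * n, method="highs")
print("Bikes per station:", res.x, "Total cost:", res.fun)
```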