Min Xu
1 Bayard Rd, Apt 61, Pittsburgh, PA 15213
Mobile: 412-230-7574 E-mail: xumin9096@gmail.com
Objective
To obtain an engineering position in software development or data science
Education
Ph.D. Candidate, Electrical & Computer Engineering, Carnegie Mellon University (CMU), GPA: 3.89/4.0, AUG 2012 – PRESENT
Project: Design, Modeling, Implementation and Analysis of Emerging Reconfigurable RF/Memory Devices; Advisor: Prof. James A. Bain
B.S., Electrical & Computer Engineering, Huazhong University of Sci. and Tech. (HUST), GPA: 89/100, SEP 2008 – JUN 2012
Professional Experience
Data Scientist Intern, Entropy Technology (Startup), AUG 2016 – PRESENT
Integrate data from web sources to develop information retrieval algorithms for search engine production
Skills
Programming languages: Java, Python, C, Matlab, SystemVerilog, LabView, Assembly, Scala, HTML
Frameworks: Flask, VertX, SciPy, scikit-learn, Lucene, Samza, TensorFlow
Platforms and Software: Unix/Linux, Hadoop MapReduce, Spark, AWS, Elasticsearch, MySQL, HBase, MongoDB, Docker
Relevant Courses
Machine Learning, Cloud Computing, Search Engines, Machine Learning for Text Mining, Machine Learning with Large Datasets, Big Data Analytics, Natural Language Processing, Computer Systems, Data Structures
Projects
Elasticsearch Based Search Engine at Entropy Technology (Python, Java, Elasticsearch)
Developed a crawler to collect and clean data from Chinese job-listing websites and feed it into Elasticsearch
Applied supervised/unsupervised re-scoring algorithms (query expansion, learning to rank, etc.) combined with existing Elasticsearch features to improve search relevance and accommodate both structured and unstructured data
Zhihu (Chinese Quora) Mining Web Service (Python, MongoDB, Flask, D3.js, HTML/CSS)
Developed a full-stack web service for crawling, mining, and visualizing data from 50k+ users and 500k+ questions and answers
Backend: a multi-threaded crawler with dynamic proxies, supporting both MongoDB and on-disk file storage
Mining: keyword analysis, topic clustering, topic/user recommendation, sentiment analysis, popularity analysis
Frontend: a Flask-based web service that visualizes the mined data using D3.js
Lucene Based Search Engine (Java, Lucene)
Developed a large-scale text search engine, indexed with the Lucene API over 500k documents from the ClueWeb09 dataset, with a prefix query language parser that retrieves relevant documents in a Document-at-a-Time manner
Supported multiple retrieval models (unranked/ranked Boolean, VSM, BM25, Indri), ten commonly used operators (#AND, #OR, #WSUM, #NEAR/n, etc.), query expansion, and learning to rank (pair-wise RankSVM)
Twitter Analytics Web Service (Java, HBase, MySQL, VertX, EMR, AWS)
Performed Extract, Transform (data cleaning, sentiment score analysis, term censorship, popularity analysis), and Load using EMR on a 1 TB raw Twitter dataset, based on schemas designed for a variety of analytics queries
Developed a RESTful web service API using VertX and deployed it on AWS to serve different analytics queries
Deployed backend databases using MySQL and HBase on AWS. Sharding was used for the MySQL backend instances, and an EMR Hadoop cluster was used for HBase. Fine-grained performance tuning was performed on both databases and the frontend
Recommendation System for Netflix Movies (Python, scikit-learn)
Implemented movie rating prediction using memory-based/model-based collaborative filtering and probabilistic matrix
factorization (PMF) based on a subset of the Netflix Prize dataset
Implemented collaborative ranking using pair-wise learning to rank based on RankSVM/LR-LETOR and PMF features, used to recommend movies for a given user query
Image Classification on CIFAR Image Dataset (Python, TensorFlow, MATLAB)
Used HOG and PCA for feature extraction and dimensionality reduction, implemented various classifiers from scratch (SVM with linear/RBF kernels and GNB with AdaBoost), and performed cross-validation to achieve high classification accuracy
Further improved accuracy with a CNN implemented in TensorFlow; test accuracy ranked in the top 3 among all teams
Restaurant Rating Based on Yelp Comments (Python, scikit-learn)
Performed data cleaning, term dictionary construction, and feature engineering for sparse-matrix samples from the raw Yelp JSON dataset
Implemented supervised learning using multi-class logistic regression and SVM for both hard and soft score prediction
Link Analysis and Personalized Search on CiteEval Dataset (Python)
Performed K-Means with K-Means++ initialization on documents for topic clustering
Performed link analysis using general PageRank, personalized PageRank (based on user-topic preference), and query-sensitive PageRank (based on query-document relevance scores from an Indri-based search engine) for retrieval ranking
Input Text Predictor Based on Wikipedia Dataset (Java, Hadoop MapReduce, HBase, AWS)
Generated a phrase list from a Wikipedia plain-text dataset using an N-gram language model implemented with MapReduce
Stored the calculated probabilities of words following each phrase in an HBase backend
Built a RESTful API for word prediction and text autocomplete in response to input phrases
Uber-like Rider-Driver Matching Service (Java, Kafka, Samza, AWS)
Developed a driver-matching service using Apache Samza to process streams of GPS data (driver/rider locations and updates, etc.) delivered through Apache Kafka and generate a matching stream, with RocksDB as a fast local key-value store for streaming state
Dynamically calculated and updated surge price based on block-wise driver availability
Social Network Timeline with Heterogeneous Backends (Java, MySQL, HBase and DynamoDB, AWS)
Deployed a RESTful master instance that coordinates three different databases for various features
Login authorization data was stored in MySQL, the social graph of followers and followees was stored in HBase, and users' own posts and followees' post timelines were stored in Amazon DynamoDB
Distributed Storage API Development (Java, AWS)
Implemented a distributed datastore coordinator API that supports sharding (horizontal partitioning) and replication with strong consistency
Implemented distributed datastore API on different EC2 instances based on strong, causal and eventual consistency models
Developed a load balancer that evenly distributes requests over instances based on CPU utilization, performs health monitoring to terminate/launch EC2 instances for better system reliability, and horizontally scales EC2 instances to handle requests dynamically, targeting the best performance-cost trade-off (minimizing cost while maximizing RPS)
Computer System Projects (C and Unix POSIX API)
Implemented a concurrent caching web proxy on the Unix POSIX API, with robust error handling, that handles client requests and forwards server responses using multiple threads. An LRU cache stores previously visited pages and supports concurrent reads
Implemented a general-purpose dynamic storage allocator providing malloc, calloc, realloc, and free, built on the Unix sbrk system call. Used segregated free lists (a combination of linked lists and binary search trees) to manage free memory blocks
Phase Change (PC) RF Switch for Reconfigurable RF Systems (Ph.D. Project)
Designed, fabricated, and tested a 20 THz PC switch with low insertion loss and high isolation for reconfigurable RF systems
Developed a complete automated test software system for high-throughput, large-scale device testing and analysis
Performed unsupervised clustering and semi-supervised learning for fault analysis and defect detection among devices
Integrated the in-house fabricated device with a dual-band low noise amplifier (0.13 μm CMOS process) that can be reliably
cycled between 2.4 GHz and 5 GHz (results published in IEDM 2015)
Honors and Awards
Qualcomm Innovation Fellowship Finalist, CMU, 2016
Carnegie Institute of Technology Dean’s Fellowship, CMU, 2012
HUST Excellent Graduate Award, HUST, 2012
HUST Excellent Academic Performance Scholarship, HUST, 2010 & 2011
HUST Top College Student Leader Scholarship, HUST, 2009
HUST Most Impressive Freshman Scholarship, HUST, 2009