Over the decades, biometric technologies based on an individual's physical attributes, such as face, fingerprints, voice and iris, have seen tremendous success across all types of applications. Inspired by this, we introduce a new concept, Mobile Behaviometrics, which uses algorithms and models to measure and quantify unique human behavioral patterns in place of human bio-attributes. Behaviometrics algorithms take data from multiple sensors as input and fuse them to build behavioral models capable of producing application-specific quantitative analysis of the individuals who generated the data.
Behaviometrics: Behavior Modeling from Heterogeneous Sensory Time-Series
1. 1
Thesis Defense for the degree of
Doctor of Philosophy
Electrical and Computer Engineering
Carnegie Mellon University
Jiang Zhu
jiang.zhu@sv.cmu.edu
Thesis Committee
Prof. Joy Zhang, Chair
Prof. Jason Hong
Prof. Patrick Tague
Dr. Fabio Maino, Cisco Research
2. 2
• Data collected from physical and soft sensors
• Identify various behavioral factors from sensor streams
• Build behavioral models to generate quantifiable metrics
• Apply these models to a mobile security application: SenSec
Study the fundamental scientific problem of
modeling a mobile user’s behavior from
heterogeneous sensory time-series
5. 5
[Figure: bar chart, “Mobile Device Loss or Theft”; vertical axis from 0% to 60%]
Strategy One survey conducted among a U.S. sample of 3,017 adults aged 18 and older,
September 21-28, 2010, with an oversample in the top 20 cities (based on population).
• “The 329 organizations polled had collectively lost more than 86,000 devices
… with average cost of lost data at $49,246 per device, worth $2.1 billion or
$6.4 million per organization.”
"The Billion Dollar Lost-Laptop Study," conducted by Intel Corporation and the
Ponemon Institute, analyzed the scope and circumstances of missing laptop PCs.
7. 7
Passwords
Normal passwords are not strong enough: they are usually meaningful words that
can be remembered
Stringent strong-password requirements can be annoying
Most users do not use password-aid tools (Hong et al. 2009)
Fingerprint? Iris recognition? Face recognition? Voice recognition?
Password for the DHS E-file:
Contain from 8 to 16 characters
Contain at least 2 of the following 3 characters: uppercase alphabetic,
lowercase alphabetic, numeric
Contain at least 1 special character (e.g., @, #, $, %, &, *, +, =)
Begin and end with an alphabetic character
Not contain spaces
Not contain all or part of your UserID
Not use 2 identical characters consecutively
Not be a recently used password
8. 8
• Provides a way to passively authenticate while using common,
sensitive applications and services.
• Allows for rapid detection of unauthorized users
Block their access as quickly as possible.
• Uses a variety of sensors available on common smartphones
9. 9
• Derived from
• Behavioral: the way a human subject behaves
• Biometrics: technologies and methods that measure and analyze
biological characteristics of the human body
• Fingerprints, eye retinas, voice patterns
• Behaviometrics: Measurable behavior to Recognize or to Verify
• Identity of a human subject, or
• Subject’s certain behaviors
Behavioral + Biometrics → Behaviometrics
10. 10
• Mobile devices come with embedded sensors
• Accelerometers, gyroscope, magnetometer
• GPS receiver
• WiFi, Bluetooth, NFC
• Microphone, camera
• Temperature, light sensor
• “Clock” and “Calendar”
• Connect with other sensors
• EEG, EMG, GSR
• Mobile devices are connected with the Internet
• Upload sensor data to the cloud
• View information and compute on the server side
• Users carry the device almost all the time.
• My phone “knows” where I am, what I am doing and my future
activities.
11. 11
• Motion Metrics
• Location Metrics
• Interaction Metrics
• System Metrics
• Accelerometer
• activity, motion, hand trembling, driving
style
• sleeping pattern
• inferred activity level, steps made per
day, estimated calorie burned
• Motion sensors, WiFi, Bluetooth
• accurate indoor position and trace.
• GPS
• outdoor location, geo-trace,
commuting pattern
• Microphone, camera:
• From background noise: activity, type
of location.
• From voice: stress level, emotion
• Video/audio: additional contexts
• Keyboard, touches, slides
• Specific tasks, user interactions, …
12. 12
• Monitor and track user behavior on smartphones using various
on-device sensors
• Convert sensory traces and other context information to
Behaviometrics
• Build statistical models with these features and use them to
calculate Certainty Scores as a security measure
• Trigger various secondary Authentication Schemes when a certain
application is launched or a certain system function is invoked.
16. 16
• Human behavior/activities share some common properties
with natural languages
• Meanings are composed from meanings of building blocks
• An underlying structure (grammar) exists
• Expressed as a sequence (time-series)
• Apply a rich set of statistical NLP techniques to mobile sensory data
[Figure: Zipf’s Law; log(freq) versus rank of words by frequency]
17. 17
• Is this play Shakespeare’s work?
• Comparing the play to Shakespeare’s known
library of works
• Track word and phrase patterns in the data
• Calculate the probability of the unknown play U
given all of Shakespeare’s known works {S}
• Compare with a threshold θ
• Authentic work (a=1)
• Fake, Forgery or Plagiarism (a=0)
$\hat{a} = \mathrm{sign}[P(U \mid \{S\}) > \theta]$
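The threshold test on this slide can be sketched with a small n-gram language model. This is a minimal illustration, not the thesis implementation: the bigram order, the add-one smoothing, the length normalization of the log-probability, all function names, and the toy label sequences are assumptions made for the example.

```python
import math
from collections import Counter


def train_bigram(sequences):
    """Count context symbols and bigrams over the known sequences {S}."""
    unigrams, bigrams, vocab = Counter(), Counter(), set()
    for seq in sequences:
        padded = ["<s>"] + list(seq)
        vocab.update(padded)
        for prev, cur in zip(padded, padded[1:]):
            unigrams[prev] += 1
            bigrams[(prev, cur)] += 1
    return unigrams, bigrams, vocab


def avg_log_prob(seq, model):
    """Length-normalized log P(U | {S}) under add-one smoothing (assumed)."""
    unigrams, bigrams, vocab = model
    v = len(vocab) + 1  # one extra slot so unseen symbols get non-zero mass
    padded = ["<s>"] + list(seq)
    total = 0.0
    for prev, cur in zip(padded, padded[1:]):
        total += math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + v))
    return total / max(len(seq), 1)


def is_authentic(seq, model, theta):
    """Return a = 1 if the score exceeds the threshold theta, else a = 0."""
    return 1 if avg_log_prob(seq, model) > theta else 0


# Toy "known works": sequences of hypothetical behavior labels.
known = [["walk", "sit", "type", "sit"], ["walk", "sit", "type", "walk"]]
model = train_bigram(known)
familiar = ["walk", "sit", "type"]
unfamiliar = ["run", "jump", "run"]
print(is_authentic(familiar, model, theta=-1.2))
print(is_authentic(unfamiliar, model, theta=-1.2))
```

In practice the threshold θ would be tuned on held-out data to trade false positives against false negatives; the value used here is only chosen to separate the two toy sequences.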
23. 23
• Accelerometer
• Used to summarize
acceleration stream
• Calculated separately for each
dimension [x,y,z,m]
• Meta features:
Total Time, Window Size
• GPS: location string from the Google Maps API and mobility path
• WiFi: SSIDs, RSSIs and path
• Applications: Bitmap of well-known applications
• Application Traffic Pattern: TCP/UDP traffic pattern vectors:
[ remote host, port, rate ]
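As a rough illustration of the accelerometer bullet above, the sketch below summarizes an acceleration stream per window, separately for each dimension. The specific statistics (mean and standard deviation), the reading of m as the vector magnitude, the non-overlapping windows, and the function name are assumptions; the slide does not list the actual summary features.

```python
import math
from statistics import mean, pstdev


def window_features(samples, window_size):
    """Summarize (x, y, z) samples over non-overlapping windows.

    Assumption: m is the magnitude sqrt(x^2 + y^2 + z^2), and each
    dimension is summarized by its mean and standard deviation.
    """
    features = []
    for start in range(0, len(samples) - window_size + 1, window_size):
        window = samples[start:start + window_size]
        dims = {
            "x": [s[0] for s in window],
            "y": [s[1] for s in window],
            "z": [s[2] for s in window],
            "m": [math.sqrt(s[0] ** 2 + s[1] ** 2 + s[2] ** 2) for s in window],
        }
        row = {"window_size": window_size}  # meta feature from the slide
        for name, values in dims.items():
            row[f"{name}_mean"] = mean(values)
            row[f"{name}_std"] = pstdev(values)
        features.append(row)
    return features


samples = [(0.0, 0.0, 9.8), (0.0, 0.0, 9.8), (3.0, 4.0, 0.0), (3.0, 4.0, 0.0)]
rows = window_features(samples, window_size=2)
print(rows[1]["m_mean"])
```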
24. 24
• Offline data collection (for training and testing)
Pick up the device from a desk
Unlock the device using the right slide pattern
Invoke Email app from the "Home Screen"
Some typing on the soft keyboard
Lock the device by pressing the "Power" button
Put the device back on the desk
27. 27
• Various Behaviometric applications can be framed as a classification
problem
• Prerequisite: a learned Behaviometric model
• Input: a given observation of user’s behavior,
• Output: some decisions based on the observation and the model
• Behavioral text representation may lead to a huge feature space
• Algorithms to handle large feature space
• Smart feature set construction to limit size of the feature space
• Identification and anomaly detection work better if the users are
performing the same activity
• Activity recognition before identification/detection
• High FPR
• Incorporate other factors and metrics
• UX improvements
29. 29
• The language model in general builds a single model for all types
of activities.
• Often the way people perform a certain activity is enough to
distinguish them (as in the PoC).
• Build models that capture the differences in how people perform the
same activity, or the same class of activities.
[Figure: Accelerometer X-axis; acceleration versus time for standing, walking, and running]
[Diagram: Activity Class Extraction; AC-1 Model, AC-2 Model, …, AC-n Model]
31. 31
• Extract the activity level of motion time series data by examining
magnitude of the data and frequency composition via DFT
• Activity Level =
• Activity Class can be arbitrary segments of this function which has
range [0, 1)
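The sketch below shows one way an activity level in [0, 1) could be computed from signal magnitude and DFT frequency content, and then cut into activity classes. It is purely illustrative: the scoring rule (RMS amplitude times dominant frequency, squashed with s / (1 + s)), the class boundaries, and all names are assumptions, not the function used in the thesis.

```python
import bisect
import cmath
import math


def dft_magnitudes(signal):
    """Naive O(n^2) DFT; returns magnitudes for bins 0 .. n // 2."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal)))
        for k in range(n // 2 + 1)
    ]


def activity_level(signal, sample_rate_hz):
    """Hypothetical activity level in [0, 1) from magnitude and frequency."""
    n = len(signal)
    avg = sum(signal) / n
    centered = [x - avg for x in signal]          # remove the constant offset
    rms = math.sqrt(sum(x * x for x in centered) / n)
    mags = dft_magnitudes(centered)
    peak_bin = max(range(1, len(mags)), key=lambda k: mags[k])
    dominant_hz = peak_bin * sample_rate_hz / n
    score = rms * dominant_hz
    return score / (1.0 + score)                   # maps [0, inf) into [0, 1)


def activity_class(level, boundaries=(0.2, 0.8)):
    """Index of the segment of [0, 1) that contains the level."""
    return bisect.bisect_right(boundaries, level)


rate, n = 32, 64
standing = [9.8] * n
walking = [9.8 + 2.0 * math.sin(2 * math.pi * 2 * t / rate) for t in range(n)]
running = [9.8 + 6.0 * math.sin(2 * math.pi * 3 * t / rate) for t in range(n)]
levels = [activity_level(s, rate) for s in (standing, walking, running)]
print([activity_class(level) for level in levels])
```

Because the class boundaries are arbitrary segments of [0, 1), they can be chosen to match however many activity classes the downstream models need.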
[Figure: Accelerometer Y-axis; acceleration versus time for standing, walking, and running]
[Figure: Accelerometer Y-axis; amplitude versus frequency for standing, walking, and running]
32. 32
• t-SNE to map feature space into 2-D plane
• First tested activity recognition method on ambulatory
behavior with 381 samples of standing, walking, and
running. Correctly classifies 359/381 samples,
giving an accuracy of 94.23%
• For user identification we ranked the features and used
the top features to create models using classical ML
algorithms
33. 33
• We approach the authentication problem by building a
Multivariate Gaussian distribution for each activity class
• We fit the parameters, mean and standard deviation, for each
feature on a training set that we define to contain only
non-anomalous data
• Then calculate the probability of some unknown data being
generated by the model with
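One common way to realize a per-feature Gaussian model like the one described on this slide is sketched below. It assumes the features are treated as independent (a diagonal covariance), so the density is a product of one-dimensional normal densities; the log-domain threshold, the function names, and the toy feature rows are also assumptions.

```python
import math
from statistics import mean, pstdev


def fit_gaussian(training_rows):
    """Per-feature (mean, standard deviation) from non-anomalous rows."""
    columns = list(zip(*training_rows))
    return [(mean(col), pstdev(col)) for col in columns]


def log_density(row, params, min_std=1e-6):
    """Log of the product of per-feature normal densities."""
    total = 0.0
    for x, (mu, sigma) in zip(row, params):
        sigma = max(sigma, min_std)  # guard against zero variance
        total += -0.5 * math.log(2 * math.pi * sigma ** 2)
        total -= (x - mu) ** 2 / (2 * sigma ** 2)
    return total


def is_anomalous(row, params, log_epsilon):
    """Flag data whose log-density falls below the chosen threshold."""
    return log_density(row, params) < log_epsilon


# Toy feature rows for one activity class of the device owner.
owner_rows = [(1.0, 10.0), (1.2, 11.0), (0.8, 9.0), (1.0, 10.0)]
params = fit_gaussian(owner_rows)
print(is_anomalous((1.0, 10.0), params, log_epsilon=-10.0))
print(is_anomalous((3.0, 20.0), params, log_epsilon=-10.0))
```

Working with log-densities avoids numerical underflow when many features are multiplied together; one such model would be fitted per activity class.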
34. 34
• Each range is an activity class with lower ranges representing
activities with low magnitude or low frequency and higher ranges
representing activities with high magnitude and high frequency.
• 110GB dataset of accelerometer data from 25 users.
35. 35
• Using only motion data can lead to a very high false positive rate
• Combine motion with location factor to mitigate
• Verified that most users spend the majority of their time in a small number of
places. Incorporating location in the experiments led to reduced FPRs
37. 37
• Hypothesis: a user’s micro-behavior when interacting with the soft keyboard
reflects his/her cognitive and physical characteristics.
Cognitive fingerprints: typing rhythms, correction rate, delay between keys,
duration at each key, …
Physical characteristics: area of pressure, amount of pressure, position of
contact, shift, …
38. 38
• Keystroke Dynamics are a popular subject
Many papers—focusing primarily on desktops
Great success for passwords, good success for arbitrary text
Typing rate, key-to-key latencies are the primary features
Once people are skilled at typing, they develop natural rhythms (on
desktops)
• Detecting keystroke patterns on mobile phones is challenging
Focus on Desktop-like attributes
Typing rate, timing, di-graphs, tri-graphs, etc.
• Use background applications to “sniff” keystrokes
Without direct access to keyboard
Successful demonstrations using accelerometers
40. 40
• Location pressed on keys
• Length of press
key down to key up
• Force of press
Varies across device types
• Change of force over key press
• Size of finger
Varies across device types
• Drift of finger during press
• Recent accelerometer history
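Most of the per-keystroke features listed above can be derived from the touch samples recorded between key down and key up. The sketch below is a minimal illustration under assumptions: the `TouchSample` fields, the feature names, and the use of the first contact point for the on-key location are hypothetical, and the accelerometer history is left out for brevity.

```python
import math
from dataclasses import dataclass


@dataclass
class TouchSample:
    """One touch reading during a key press (hypothetical record layout)."""
    t_ms: float
    x: float
    y: float
    pressure: float
    size: float


def keystroke_features(samples, key_center):
    """Derive per-keystroke features from samples ordered by time."""
    first, last = samples[0], samples[-1]
    n = len(samples)
    return {
        "offset_x": first.x - key_center[0],            # location pressed on key
        "offset_y": first.y - key_center[1],
        "duration_ms": last.t_ms - first.t_ms,          # key down to key up
        "mean_pressure": sum(s.pressure for s in samples) / n,
        "pressure_change": last.pressure - first.pressure,
        "mean_size": sum(s.size for s in samples) / n,  # proxy for finger size
        "drift": math.hypot(last.x - first.x, last.y - first.y),
    }


press = [
    TouchSample(0.0, 103.0, 204.0, 0.40, 0.20),
    TouchSample(40.0, 104.0, 206.0, 0.60, 0.25),
    TouchSample(80.0, 106.0, 208.0, 0.50, 0.25),
]
features = keystroke_features(press, key_center=(100.0, 200.0))
print(features["duration_ms"], features["drift"])
```

Since force and finger size vary across device types, as the slide notes, such features would likely need per-device normalization before models trained on one device are applied to another.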
46. 46
• 13 initial users after short recruiting drive
• 2-week collection period
• 86,000 keystrokes
• 430,000 data points @ ~5/keystroke
• Data split into training and testing:
• 99% accuracy with just 1 keystroke
[Figure: data split chart. Training Data for Model 50%; CV 15%; Training for Keys 15%; Final Testing 15%; CV for Keys 10%]
51. 51
• Alpha test started in 6/2012, 1st Google Play Store release in 10/2012
• False positives: a 13% FPR still annoys users sometimes
• 75% of alpha users mentioned feeling annoyed at least once when SenSec
prompted for the passcode. Some of them were confused by the UI.
“I couldn’t get the passcode right multiple times when trying to answer an
important phone call because I was single-handed”
“I just entered my passcode a min ago and it asked me again”
“I was so lost about this UI. What’s this training and testing knob?”
52. 52
• Use an adaptive model
• Add the trace data collected shortly before a false positive to the training data and
update the model
• Add a sliding pattern in addition to passcode validation
• A confirmed false positive grants a “free ride” for a configurable
duration
• Assumption: a just-authenticated user should control the device for a given
period of time
• The “free ride” period ends immediately if an abrupt context change is
detected.
• Interaction Metrics added as part of the sensory fusion
• Motion, location and interaction metrics work together
• Different user flow: Reactive triggering vs. Proactive triggering
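The “free ride” rule on this slide can be sketched as a small piece of state. This is an illustration under assumptions: time is passed in explicitly as seconds, the class and function names are hypothetical, and how an abrupt context change is detected is outside the sketch.

```python
from dataclasses import dataclass


@dataclass
class FreeRide:
    """Suppresses re-prompting for a configurable period after a confirmed false positive."""
    duration_s: float
    expires_at: float = float("-inf")

    def grant(self, now_s):
        """Call when the user passes the challenge, confirming a false positive."""
        self.expires_at = now_s + self.duration_s

    def on_context_change(self, abrupt):
        """End the free ride immediately on an abrupt context change."""
        if abrupt:
            self.expires_at = float("-inf")

    def active(self, now_s):
        return now_s < self.expires_at


def should_prompt(anomaly_detected, free_ride, now_s):
    """Prompt only if an anomaly is detected outside a free-ride period."""
    return anomaly_detected and not free_ride.active(now_s)


ride = FreeRide(duration_s=300.0)
before = should_prompt(True, ride, now_s=0.0)      # no free ride yet
ride.grant(now_s=0.0)
during = should_prompt(True, ride, now_s=100.0)    # inside the free ride
ride.on_context_change(abrupt=True)
after_change = should_prompt(True, ride, now_s=101.0)
print(before, during, after_change)
```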
53. 53
Figure 6.3: SenSec Protection Options: Secured, Protected via either Passcode or Sliding Pattern
54. 54
Figure 6.2: SenSec Tutorial
Figure 6.3: SenSec Protection Options: Secured, Protected via either Passcode or Sliding Pattern
Figure 6.4: SenSec Home Screen
55. 55
• Data collected from physical and soft sensors
• Identify various behavioral factors from sensor streams
• Build behavioral models to generate quantifiable metrics
• Apply these models to a mobile security application: SenSec
Study the fundamental scientific problem of
modeling a mobile user’s behavior from
heterogeneous sensory time-series
56. 56
• Propose a concept Behaviometric to study the fundamental scientific
problem of modeling a mobile user’s behaviors from heterogeneous
sensor time series
• Adopt a Language approach to solve Behaviometric problems via
various NLP techniques
• Unsupervised algorithm Helix to discover the hierarchical structure, i.e.
grammar, in activity recognition
• Investigate how existing statistical learning algorithms can be adapted and
applied to the Behaviometric context when the NLP approach is insufficient
• Derive an effective yet simple “activity level” metric from time series to
be used in identification and detection. Use location metrics to augment
motion metrics to reduce FPR.
• Develop and deploy versions of the SenSec app and adapt them based on user
feedback.
57. 57
• Extended data set for System Metrics
TCP, UDP traffic; sound; ambient lighting; battery status, etc.
• Data and Modeling
Gain more insights into the data, features and factorized relationships among
various sensors
• Enhanced security of SenSec components prepared for
commercial release
Integration with Android security framework and other applications
• Privacy as expectation (Liu et al., 2012)
Users need to know where the data resides, how the data is going to be used
and shared. Whom to trust the data with?
• Energy efficiency
58. 58
• CyLab at Carnegie Mellon
• Northrop Grumman Cybersecurity Research Consortium
• Cisco
Research award for “A Language Approach in Behavioral Modeling”
Research award for “Privacy Preserved Personal Big Data Analytics through
Fog Computing”
• Special thanks: Dr. Hao Hu, Jiatong Zhou, Dr. Flavio Bonomi, Cisco Systems;
Sky Hu, Twitter Inc.; Yuan Tian, Google Inc.; Pang Wu, Samsung Research
North America.
59. 59
“Mobile behaviometrics: Models and applications” In Proceedings of the Second IEEE/CIC International Conference on
Communications in China (ICCC), Xi’An, China, August 12-14, 2013 [with H.Hu, S.Hu, P.Wu, J.Zhang]
“MobiSens: A Versatile Mobile Sensing Platform for Real-world Applications”, MONE, 2013, [with P.Wu, J.Zhang]
"SenSec: Mobile Application Security through Passive Sensing," to appear in the Proceedings of International Conference
on Computing, Networking and Communications. (ICNC 2013). San Diego, USA. January 28-31, 2013 [with P.Wu, X.Wang,
J.Zhang]
“Towards Accountable Mobility Model: A Language Approach on User Behavior Modeling in Office WiFi Networks”, accepted
to ICCCN 2011, Maui, HI, Aug 1-5, 2011 [with Y.Zhang]
"Retweet Modeling Using Conditional Random Fields," in the Proceedings of DMCCI 2011: ICDM 2011 Workshop on Data
Mining Technologies for Computational Collective Intelligence, December 11, 2011.[ with H.Peng, D.Piao, R.Yan and
Y.Zhang]
“Mobile Lifelogger - recording, indexing, and understanding a mobile user's life", in the Proceedings of The Second
International Conference on Mobile Computing, Applications, and Services, Santa Clara, CA, Oct 25-28, 2010 [With
S.Chennuru, P.Cheng, Y.Zhang]
"SensCare: Semi-Automatic Activity Summarization System for Elderly Care", MobiCase 2011, Los Angeles, CA, October
24-27, 2011. [with Pang Wu, Huan-kai Peng,Joy Ying Zhang]
"Helix: Unsupervised Grammar Induction for Structured Human Activity Recognition," to appear in the Proceedings of The
IEEE International Conference on Data Mining series (ICDM), Vancouver, Canada, Dec 11-14, 2011.[with Huan-Kai Peng,
Pang Wu, and Ying Zhang]
"Statistically Modeling the Effectiveness of Disaster Information in Social Media," to appear in the Proceedings of IEEE
Global Humanitarian Technology Conference (GHTC), Seattle, Washington, Oct. 30 - Nov. 1st, 2011.[with Fei Xiong,
Dongzhen Piao, Yun Liu, and Ying Zhang]
"A dissipative network model with neighboring activation," to appear in THE EUROPEAN PHYSICAL JOURNAL B.[with F.
Xiong, Y. Liu, J. Zhu, Z. J. Zhang, Y. C. Zhang, and J. Zhang]
"Opinion Formation with the Evolution of Network," to appear in the Proceedings of 2011 Cross-Strait Conference on
Information Science and Technology and iCube, TaiBei, China, Dec 8-9, 2011.[with F.Xiong, Y.Liu, Y.Zhang]
61. 61
[Diagram: decision tree over Location (At home / At work / Between home and work), Time of the Day (8am-6pm / Other), Day of the week (Mon.-Fri. / Sat. & Sun.), and Activity (idle / Walking / Driving), together with Traveling Speed and Gait. Leaves: R = f(handholding, indoor loc, app); R = f(WiFi trace, app, time); R = f(geo trace, app, time); Alert!]
Editor's Notes
Smartphones
Globally, the number grew 60% in 2011, reaching 698 million, and will grow 2.9-fold between 2011 and 2016, reaching 2,027 million.
In North America, the number grew 58% during 2011, reaching 138 million, and will grow 1.8-fold between 2011 and 2016, reaching 253 million.
Tablets
Globally, the number grew 3.4-fold in 2011, reaching 33.6 million.
Globally, the number of mobile-connected tablets will grow 7.6-fold between 2011 and 2016, reaching 256.7 million.
In North America, the number grew 2.7-fold during 2011, reaching 15.5 million, and will grow 3.8-fold between 2011 and 2016, reaching 58.3 million.
As mobile applications and devices are becoming ubiquitous, it is crucial for mobile users to privately and securely interact with their environment and data and for mobile services to trust the identity of the user. While mobile devices such as smartphones make our lives convenient in ways that were unimaginable before, applications such as email, web browsing, social network, shopping and online banking know too much about our private lives. Mobility introduces additional security and privacy challenges in being able to provide services in a way that neither compromises the environment of users nor their data. Protecting a user's privacy and ensuring the accountability of mobile applications in a seamless and non-intrusive way poses great challenges to next generation mobile computing platforms.
Recently, a new survey* has revealed that 36 percent of consumers in the United States have either lost their mobile phone or had it stolen. Another survey† has also revealed that 329 organizations polled had collectively lost more than 86,000 devices with average cost of lost data at $49,246 per device, worth $2.1 billion or $6.4 million per organization. Given the high loss rate and high cost associated with these losses, accountable schemes are needed to protect the data on the mobile devices.
Reliable authentication is an essential requirement for a mobile device and its applications. Today, passwords are the most common form of authentication. This results in two potential problems. First, passwords are a major source of security vulnerabilities, as they are often easy to guess, re-used, forgotten, shared with others, and susceptible to social engineering attacks. Second, to secure the data and applications on a mobile device, the mobile system would have to prompt the user for authentication quite often, which results in serious usability issues. We also observe that different applications on a mobile device may have different sensitivities towards the aforementioned threats and data loss. For example, the Angry Birds game on an Android device is less sensitive than the Contact List or Photo Album should the device be operated by an unauthorized user. A one-size-fits-all approach to authentication may be either too loose for some applications, which exposes them to risks, or too tight for others, which causes usability problems.
As a quick overview of our work: in order to enforce application security, we monitor and track user behavior through the traces collected from the on-device sensors. We then convert these traces and other context information into behavior features.
We adopt an n-gram model to model the user’s behavior and use it to monitor and calculate a certainty score.
This score is fed to the smartphone’s authentication module to enforce the security of the various applications and their data on the device.
We convert the raw sensory data into a behavior text representation, as sequences of behavior labels. Each behavior label is considered a “word” in the language.
Train a continuous n-gram language model.
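The conversion from raw sensor readings to behavior text can be sketched as a quantization step followed by n-gram extraction. The label vocabulary, the thresholds on the deviation from gravity, and the function names below are hypothetical; the notes do not say how the behavior labels are actually derived, and clustering would be an equally plausible choice.

```python
import bisect
import math

LABELS = ["still", "low", "medium", "high"]  # hypothetical behavior "words"
BOUNDARIES = [0.5, 2.0, 6.0]                 # hypothetical thresholds in m/s^2


def to_behavior_text(samples, gravity=9.8):
    """Map each (x, y, z) accelerometer sample to a behavior label."""
    words = []
    for x, y, z in samples:
        deviation = abs(math.sqrt(x * x + y * y + z * z) - gravity)
        words.append(LABELS[bisect.bisect_right(BOUNDARIES, deviation)])
    return words


def ngrams(words, n):
    """All contiguous n-grams of the label sequence."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


samples = [(0.0, 0.0, 9.8), (0.0, 0.0, 11.0), (3.0, 4.0, 12.0), (0.0, 0.0, 20.0)]
words = to_behavior_text(samples)
print(words)
print(ngrams(words, 2))
```

The resulting n-gram counts are what a language model would be trained on, with each user's label sequences playing the role of a text corpus.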
SenSec is constantly collecting sensory data from accelerometer, gyroscope, GPS, WiFi, microphone or even camera. Through analyzing the sensory data, it constructs the context under which the mobile device is used. This includes locations, movements and usage patterns, etc. From the context, the system can calculate the certainty that the system is at risk.
Different applications on a mobile device are assigned a sensitivity value, either manually or automatically.
When a user invokes an application, SenSec compares the certainty with this application’s sensitivity level. If the sensitivity passes the certainty threshold, an authentication mechanism is employed to enforce the security policy for that application.
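The comparison between certainty and sensitivity can be sketched as follows. This is only one possible reading of the notes: it assumes the certainty is the estimated probability in [0, 1] that the device is at risk, that sensitivity is also in [0, 1], and that an application tolerates risk up to 1 minus its sensitivity. The application names and values are hypothetical.

```python
# Hypothetical sensitivity values; a game is less sensitive than contacts.
APP_SENSITIVITY = {"game": 0.2, "contacts": 0.8, "banking": 0.95}


def requires_authentication(risk_certainty, sensitivity):
    """True when the risk certainty exceeds the app's tolerance.

    Assumption: tolerance = 1 - sensitivity, so more sensitive apps
    trigger authentication at lower risk levels.
    """
    if not (0.0 <= risk_certainty <= 1.0 and 0.0 <= sensitivity <= 1.0):
        raise ValueError("risk_certainty and sensitivity must be in [0, 1]")
    return risk_certainty > 1.0 - sensitivity


def apps_to_lock(risk_certainty, sensitivities=APP_SENSITIVITY):
    """Names of apps that would ask for authentication at this risk level."""
    return sorted(
        app for app, s in sensitivities.items()
        if requires_authentication(risk_certainty, s)
    )


print(apps_to_lock(0.01))
print(apps_to_lock(0.3))
print(apps_to_lock(0.9))
```

With these toy values, a low risk estimate leaves every application open, a moderate one challenges only the sensitive applications, and a high one challenges all of them.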
t-SNE plots show clear clusters for feature vectors of different users
Models were trained with 3000 keystrokes from primary user and 2000 from each of 3 other users.
These models were tested against [on average] 539 ‘primary user’ keystrokes and 489 keystrokes from a wide variety of other users (not used to train the model)
That brings me to the end of my presentation.
Thank you very much for your attention.