This study analyzed the vulnerability levels of about 1,000 mobile apps from Google Play across 23 categories. The key findings were:
1) Medical apps had significantly fewer vulnerabilities than categories such as Communication, Entertainment, and Social, while Finance and Shopping apps were no less vulnerable than the other categories.
2) An app's vulnerability level did not affect its rating, but apps with more downloads tended to have higher vulnerability levels.
3) Contextual information like app description, metadata, and static code features could predict an app's vulnerability level with about 75% accuracy, with market data providing complementary insights to code analysis. Addressing app security is important as users may not be aware of risks when installing apps.
1. Exposed! On the Vulnerability-proneness of Google Play apps.
Andrea Di Sorbo, Sebastiano Panichella
https://spanichella.github.io/
https://www.unisannio.it/en/user/9355
ESEC/FSE - Journal First Presentation
14-18 November 2022, Singapore
2. OUTLINE
CONTEXT: user perception of risks when installing mobile apps
RESEARCH DESIGN: data collection, information extraction, and tools used
FINDINGS: answers to the three research questions
CONCLUSIONS and future research directions
5. PAST WORK
Factors affecting app success:
“Fault- and change-prone APIs can hinder the success of mobile apps”.
“High-rated apps have larger sizes, more complex code, more requirements on users, more marketing efforts, more dependence on libraries, and adopt higher quality Android APIs”.
“User reviews reporting bugs are negatively correlated with the rating, while reviews reporting feature requests are not”.
Vulnerabilities in mobile apps:
“Roughly 70% of free apps and roughly 50% of paid apps with vulnerabilities were vulnerable due to libraries”.
“For most vulnerability types, third-party code (including common libraries) represents the main carrier of app vulnerabilities”.
6. RESEARCH GOALS
‘‘Vulnerability-proneness: the number of different types of known security issues exhibited by the app’’
GOALS:
• Assess the vulnerability-proneness levels of mobile apps
• Evaluate the extent to which users can perceive vulnerability-proneness
RQs:
• RQ1: Which are the different vulnerabilities exhibited by Google market apps (belonging to different app categories)?
• RQ2: Does the vulnerability-proneness of Google market apps affect app success?
• RQ3: Is it possible to predict the level of vulnerability-proneness of an app by using the app’s contextual information?
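Under this definition, an app's vulnerability-proneness is the count of distinct known issue types it exhibits. A minimal sketch, assuming hypothetical AndroBugs-style warning labels (not actual study data):

```python
# Hypothetical AndroBugs-style warnings reported for a single app
# (illustrative labels only, not taken from the study's dataset)
warnings = [
    "SSL_Security: SSL Connection Checking",
    "SSL_Security: SSL Connection Checking",   # same issue type, reported twice
    "WebView: RCE Vulnerability Checking",
    "Implicit_Intent: Implicit Service Checking",
]

# Vulnerability-proneness = number of *distinct* known security issue types
vulnerability_proneness = len(set(warnings))
print(vulnerability_proneness)  # 3
```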
7. DATASET
• About 1,000 apps spanning
23 different Play Store’s
categories
• For each category we have
both low- and high-rated
apps
• We only considered apps
having a reliable number of
user votes
15. RQ1: Which are the different vulnerabilities exhibited by Google market apps (belonging to different app categories)?
Approach: compare the vulnerability-proneness of apps belonging to different app categories (statistical analysis + qualitative analysis)
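The statistical comparison described above (a non-parametric test plus an effect size, as reported in the next slide's table) could be sketched as follows. The two samples are made-up vulnerability-proneness counts, not study data, and in the full study the p-values are additionally adjusted for multiple comparisons:

```python
from itertools import product
from scipy.stats import mannwhitneyu

def cliffs_delta(a, b):
    """Cliff's d: P(x > y) - P(x < y) over all pairs (x, y) in a x b."""
    gt = sum(x > y for x, y in product(a, b))
    lt = sum(x < y for x, y in product(a, b))
    return (gt - lt) / (len(a) * len(b))

# Made-up per-app counts of distinct vulnerability types (illustrative only)
medical = [2, 3, 1, 4, 2, 3]
social = [6, 5, 7, 4, 6, 8]

stat, p = mannwhitneyu(medical, social, alternative="two-sided")
d = cliffs_delta(medical, social)
print(f"p-value: {p:.4f}, Cliff's d: {d:.4f}")
```

A negative d, as in the table, means the first group (Medical) tends to exhibit fewer vulnerability types.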
16. RQ1: Which are the different vulnerabilities exhibited by Google market apps (belonging to different app categories)?
Categories | adjusted p-value | Cliff's d
Medical - Communication | 0.0074 | -0.5143
Medical - Entertainment | 0.0011 | -0.5283
Medical - Food & Drink | 0.0023 | -0.6402
Medical - News & Magazines | 0.0026 | -0.5859
Medical - Social | 0.0081 | -0.5201
17. RQ1: Which are the different vulnerabilities exhibited by Google market apps (belonging to different app categories)?
• Medical apps exhibit fewer security flaws than other categories
• Finance and Shopping apps exhibit vulnerability-proneness levels similar to other categories
18. RQ1: Which are the different vulnerabilities exhibited by Google market apps (belonging to different app categories)?
Vulnerability | Communication | Entertainment | Food | Medical | News | Social
SSL Connection Checking | 100.00% | 97.17% | 100.00% | 84.85% | 100.00% | 96.72%
WebView RCE Vulnerability Checking (CVE-2013-4710) | 73.13% | 86.79% | 81.25% | 36.36% | 86.67% | 73.77%
Implicit Service Checking | 52.24% | 49.06% | 46.88% | 15.15% | 35.56% | 54.10%
App Sandbox Permission Checking | 17.91% | 27.36% | 37.50% | 24.24% | 31.11% | 27.87%
SSL Certificate Verification Checking | 16.42% | 18.87% | 21.88% | 6.06% | 40.00% | 14.75%
KeyStore Protection Checking | 13.43% | 9.43% | 31.25% | 9.09% | 35.56% | 19.67%
Runtime Command Checking | 34.33% | 13.21% | 21.88% | 12.12% | 31.11% | 9.84%
Fragment Vulnerability Checking (BID 64208, CVE-2013-6271) | 22.39% | 11.32% | 12.50% | 3.03% | 13.33% | 8.20%
AndroidManifest ContentProvider Exported Checking | 19.40% | 16.98% | 18.75% | 9.09% | 24.44% | 14.75%
SSL Implementation Checking (Verifying Host Name in Custom Classes) | 11.94% | 17.92% | 9.38% | 12.12% | 28.89% | 21.31%
SSL Implementation Checking (Verifying Host Name in Fields) | 5.97% | 8.49% | 12.50% | 9.09% | 26.67% | 11.48%
19. RQ2: Does the vulnerability-proneness of Google market apps affect app success?
App success proxies: average rating and number of downloads
Approach: compare apps with different levels of success
20. RQ2: Does the vulnerability-proneness of Google market apps affect app success?
No relation between vulnerability-proneness and app rating is observed
21. RQ2: Does the vulnerability-proneness of Google market apps affect app success?
Apps with a higher number of downloads tend to exhibit higher levels of vulnerability-proneness (statistically significant)
22. RQ2: Does the vulnerability-proneness of Google market apps affect app success?
Apps having a lower average rating tend to have a higher vulnerability-proneness density
Vulnerability-proneness density: “we divided the number of vulnerability warnings signaled by AndroBugs by the number of classes”
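Following the definition quoted above, the density metric is a simple normalization; a minimal sketch (the function name and example values are ours, not the paper's):

```python
def vulnerability_proneness_density(num_warnings: int, num_classes: int) -> float:
    """Number of AndroBugs vulnerability warnings divided by the number of classes."""
    if num_classes <= 0:
        raise ValueError("the app must contain at least one class")
    return num_warnings / num_classes

# Example: an app with 12 warnings across 300 classes
print(vulnerability_proneness_density(12, 300))  # 0.04
```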
23. RQ2: Does the vulnerability-proneness of Google market apps affect app success?
Perception vs. reality: users may not be aware of the risks they take when installing an app.
24. RQ3: Is it possible to predict the level of vulnerability-proneness of an app by using the app’s contextual information?
• App market metrics
• Textual features (app description)
• Static analysis features (number of libs, classes, interfaces, etc.)
25. RQ3: Is it possible to predict the level of vulnerability-proneness of an app by using the app’s contextual information? (low vs. high)
Considering only app market info:
Experiment 1 (app market features + text features)
Algorithm | Precision | Recall | F-Measure
J48 | 0.619 | 0.620 | 0.619
Random Forest | 0.660 | 0.660 | 0.658
Naive Bayes | 0.581 | 0.577 | 0.576
Experiment 2 (app market features)
Algorithm | Precision | Recall | F-Measure
J48 | 0.671 | 0.667 | 0.666
Random Forest | 0.730 | 0.728 | 0.728
Naive Bayes | 0.647 | 0.647 | 0.645
Considering also static analysis info:
Experiment 4 (app market features + text features + static analysis)
Algorithm | Precision | Recall | F-Measure
J48 | 0.664 | 0.665 | 0.664
Random Forest | 0.723 | 0.720 | 0.719
Naive Bayes | 0.592 | 0.590 | 0.590
Experiment 5 (app market features + static analysis)
Algorithm | Precision | Recall | F-Measure
J48 | 0.691 | 0.687 | 0.686
Random Forest | 0.760 | 0.751 | 0.751
Naive Bayes | 0.660 | 0.657 | 0.652
Random Forest outperforms the other ML algorithms
26. RQ3: Is it possible to predict the level of vulnerability-proneness of an app by using the app’s contextual information? (low vs. high)
Textual features introduce noise affecting the classification performance
27. RQ3: Is it possible to predict the level of vulnerability-proneness of an app by using the app’s contextual information? (low vs. high)
Considering only static analysis info:
Experiment 3 (static analysis features)
Algorithm | Precision | Recall | F-Measure
J48 | 0.726 | 0.712 | 0.709
Random Forest | 0.716 | 0.714 | 0.714
Naive Bayes | 0.660 | 0.652 | 0.643
Considering also app market info:
Experiment 5 (app market features + static analysis)
Algorithm | Precision | Recall | F-Measure
J48 | 0.691 | 0.687 | 0.686
Random Forest | 0.760 | 0.751 | 0.751
Naive Bayes | 0.660 | 0.657 | 0.652
App market metrics provide complementary information to that extracted from code.
28. CONCLUSIONS
• RQ1: Which are the different vulnerabilities exhibited by Google market apps (belonging to different app categories)?
• Almost all apps have known security defects
• Apps belonging to the Medical category exhibit fewer security flaws than apps in the other categories
29. CONCLUSIONS
• RQ2: Does the vulnerability-proneness of Google market apps affect app success?
• Vulnerability-proneness levels are not reflected in app ratings
• Vulnerability-proneness density levels are reflected in app ratings
• Popular apps tend to exhibit higher levels of vulnerability-proneness
30. CONCLUSIONS
• RQ3: Is it possible to predict the level of vulnerability-proneness of an app by using the app’s contextual information?
• App market information is useful to predict the vulnerability-proneness level of an app in about 3 out of 4 cases
• App market information can be used in addition to static analysis features to improve the prediction results
31. FUTURE WORK
• Survey app users
• to better understand how they deal with privacy and security concerns
• and to further validate our results
• Improve the prediction results
• Extract additional features from the app store (e.g., interactive elements, developer’s information, last update, compatible devices, etc.)
• Extract additional static analysis features (e.g., quality metrics, intents, etc.)
• Investigate feature weighting
• Investigate the possibility of predicting specific types of security defects in other domains
32. Exposed! On the Vulnerability-proneness of Google Play apps.
Andrea Di Sorbo, Sebastiano Panichella
https://spanichella.github.io/
https://www.unisannio.it/en/user/9355
ESEC/FSE - Journal First Presentation
14-18 November 2022, Singapore
https://link.springer.com/article/10.1007/s10664-021-09978-0
Thanks for your attention!
Editor's Notes
In particular, in this presentation, I will first introduce the CONTEXT of this study and the related literature.
Afterwards, I will state the goal, the research questions, and the analyses performed.
The achieved results and the answers to our research questions are then used to draw conclusions and outline future research directions.
Mobile applications are used for several everyday life activities, such as shopping, banking, social communications, and so on.
However, users share a lot of sensitive data to use these apps and recent research demonstrated that the majority of mobile applications present critical security defects.
In this study, we try to better understand whether, and to what extent, app users can perceive these security risks, and whether these risks can undermine the success of mobile apps.
Previous studies have explored some of the factors influencing app success, and mobile app vulnerabilities have been investigated from many research perspectives.
In particular, previous work observed that app success is related to the adoption of higher numbers of libraries,
but libraries also represent the main carriers of app vulnerabilities.
In our work, we define the vulnerability-proneness of an app as the number of different types of security issues that the app exhibits.
The underlying hypothesis is that a higher vulnerability-proneness may increase the probability of being attacked, as a wider attack surface is offered.
Thus, we (i) investigate the vulnerability-proneness of mobile apps belonging to different categories.
(ii) evaluate if users can perceive the risks of installing vulnerable apps
(iii) explore the extent to which the app-related information provided by the store can be used to predict the vulnerability-proneness levels of apps.
To carry out the study we extracted data from about one thousand apps spanning 23 different Play Store categories.
Note that for each category we have both low and high-rated apps.
For extracting information related to vulnerabilities, all the collected APKs were inspected with AndroBugs, a state-of-the-art vulnerability scanner.
In the next slides, I will also discuss the vulnerabilities that this tool is able to detect.
In addition, we extracted app metadata from the Google Play store using ad-hoc scripts and browser automation.
Once all the data were collected, we analyzed them with statistical tools and used them to train machine learning algorithms that predict the vulnerability-proneness levels of apps.
In the slide you can see an example of the report provided by AndroBugs, which marks each identified vulnerability with a type and a severity level.
AndroBugs is fast and accurate and can statically analyze APK files (without executing them).
It has been successfully used to find vulnerabilities in many popular Android apps, such as Facebook and Twitter.
As highlighted in the slide, AndroBugs can statically detect many different types of vulnerabilities:
such as (i) vulnerabilities that could be exploited for performing man-in-the-middle attacks,
(ii) vulnerabilities that could be exploited for code injection,
or (iii) vulnerabilities that may allow access to sensitive data.
We also extracted contextual information related to the apps and provided by the app store.
Some of these data, such as those related to the Permission, Monetization, and Richness-of-functionalities aspects, can be easily associated with security and privacy issues,
while we argue that those related to the Behavior and Success aspects need further investigation.
And now I will present the preliminary results we obtained to answer our research questions.
For each research question I will briefly discuss the analysis done and the findings.
For answering RQ1, we compared the vulnerability-proneness of apps belonging to different app categories through non-parametric statistical tests (as we deal with distributions that are not normal).
To corroborate the quantitative results, we also investigated in more depth the specific types of vulnerabilities detected in apps of different categories.
As evidenced in the slide, in terms of vulnerability-proneness, apps in the Medical category differ from apps in other categories, with statistical evidence and large effect sizes.
The good news is that Medical apps (which usually handle very sensitive information) tend to exhibit fewer security flaws than all the other considered categories.
The bad news is that we cannot say the same about, for example, Finance and Shopping apps, with which we usually share bank account details.
A confirmation of these results is given by the table in the slide, where we can see that apps belonging to the Medical category are more rarely affected by most of the recurrent vulnerability types.
To answer RQ2, as in previous work, we use two different proxy metrics for estimating app success:
average rating and number of downloads.
Thus, we compared the vulnerability-proneness of apps belonging to different rating and download groups.
As expected, no relation could be observed between app rating and app vulnerability-proneness.
A counterintuitive result is instead observed for the number of downloads: apps with a higher number of downloads tend to exhibit higher levels of vulnerability-proneness.
Normalized results can lead to different outcomes: as shown on the left side of the slide, apps having a lower average rating tend to have a higher vulnerability-proneness density.
To answer RQ3 and better understand whether the information provided by the app store can be used to predict the level of an app’s vulnerability-proneness,
we trained 3 different machine learning algorithms using different combinations of features, namely:
app market metrics (downloads, rating, and so on),
textual features extracted from app descriptions using text analysis techniques,
and simple static analysis features (such as the number of third-party libraries, the number of classes, and so on).
We then evaluated the classification performance of these algorithms in identifying apps with both low and high vulnerability-proneness levels.
The first result is that Random Forest is the best performing algorithm for this task.
The Random Forest algorithm trained with only app market information is effective in identifying low- or high-risk apps in about 3 out of 4 cases.
As expected, the best performance is achieved by using both app market and static analysis features.
Surprisingly, while textual features have been successfully used in bug prediction/classification tasks, in this context they seem to introduce noise that affects the classification results.
Thus, we can summarize the findings of this presentation as follows:
Almost all apps present known security defects,
but apps belonging to the medical category are less vulnerability-prone than apps in the other categories.
Vulnerability-proneness does not affect app ratings.
Moreover, more popular apps tend to exhibit higher levels of vulnerability-proneness.
App market data could provide useful information to predict, at early stages, the vulnerability-proneness level of an app.
Such information can be complementary to metrics related to app code.
Textual descriptions, instead, do not provide useful information for this task.
In the future we plan to explore several research directions.
In particular, we want to survey app users to better understand how they deal with privacy and security concerns.
We want to improve the prediction results by considering additional features and tuning,
and we also want to investigate the possibility of predicting specific types of security defects.