Machine Learning for COVID-19 targeted testing
A risk assessment tool that optimises the use of COVID19 tests in order to implement fact-based strategies for deconfinement.
A project supported by Compellio S.A.
15, côte d'Eich
L-1450 Luxembourg
https://compell.io
Contributors
All project contributors worked voluntarily in this project.
Christos Avrilionis, Credit Risk and Governance Manager, PayPal
Theo Papasternos, Business Manager, Compellio
Denis Avrilionis, CEO, Compellio
Vivi Tzekou, Machine Learning / Full-Stack Developer
Yuri Visovsiouk, Full-Stack Developer
Christina Dimopoulou, Business Operations, Compellio
Disclaimer: the opinions expressed in this publication are those of the authors. They do not express the opinions of any entity with which the authors are affiliated.
Contacts
We would be delighted to discuss this project further with you. You can reach us directly at hello@compell.io.
The way forward
We are currently looking to join forces with governments, health organisations, laboratories, and pharmaceutical companies. Interested parties can fill in the partnership form found on the project’s website: https://covid19smartscreeningtool.launchaco.com
Stay healthy,
The COVID19 Smart Screening Tool Team
Hacking for #EUvsVirus
Overview
The project aims at the development and the deployment of a software platform that would:
Allow an individual to fill in an electronic questionnaire with health and demographic information, securing and protecting sensitive personal data using Compellio’s blockchain-enabled registry technology
Predict the likelihood of positive Covid-19 diagnosis of an individual at a given point in time using machine learning (ML) and artificial intelligence (AI)
Enable policy makers to build an optimal Covid-19 exit strategy based on the targeted use of Covid-19 tests of high-risk individuals
This solution will be implemented in 2 phases.
1. Phase 1: Data collection and modelling
a. Design the questionnaire
b. Collect medical and demographic data of a person when that person takes a test for Covid-19 using the questionnaire from step 1.a
c. Link the test’s outcome (Covid-19 positive or negative) to the data collected in step 1.b
d. Build a machine learning model on data from step 1.c
2. Phase 2: Deployment and general availability
a. Use the model from step 1.d to generate a prediction of Covid-19 positiveness of any person
b. Target Covid-19 tests for persons having a high likelihood of positive Covid-19 diagnosis
c. Link the test result (Covid-19 positive or negative) to the prediction calculated in step 2.a
d. Monitor model performance and fine-tune the machine learning model built in step 1.d
This document illustrates steps 1.b, 1.c, 1.d, 2.a and 2.b using simulated data.
Phase 1.b.
Collect medical and demographic data of a person when that person takes a test for Covid-19 using the questionnaire
As a result of phase 1.a, let's assume that we have a questionnaire of 23 questions about medical and demographic information.
For the purpose of this illustration, let's assume that:
Each question is referred to as q1, q2, ..., q23
Each patient has a unique identifier from 1 to 1000
The questionnaire was proposed to 1000 patients as part of the Covid-19 testing procedure
All patients answered all the questions
Each answer to each question is a continuous variable (this can be extended to categorical variables as well)
The output of phase 1.b. is a table similar to this (first 20 patients shown):
In [7]: df_X.head(20)
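The notebook does not show how the simulated answers were generated. As a minimal sketch under the assumptions listed above (1000 patients, 23 continuous answers, no missing values), a table with the same shape could be built as follows. The standard normal distribution, the seed, and the name `df_X_sim` are illustrative assumptions, not the authors' simulation.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

n_patients, n_questions = 1000, 23
columns = [f"q{i}" for i in range(1, n_questions + 1)]            # q1 ... q23
index = pd.RangeIndex(1, n_patients + 1, name="Patient ID")       # IDs 1 ... 1000

# Assumption: every answer is a continuous value drawn from a standard normal.
df_X_sim = pd.DataFrame(
    rng.standard_normal((n_patients, n_questions)),
    index=index,
    columns=columns,
)

print(df_X_sim.shape)  # (1000, 23)
```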
Phase 1.c.
Link the test’s outcome (Covid-19 positive or negative) to the data collected in step 1.b.
Let's assume that:
All 1000 patients from phase 1.b. have been tested for Covid-19 positiveness using a lab test
The tests were done on respiratory samples obtained by a nasopharyngeal swab using real-time reverse transcription polymerase chain reaction (rRT-PCR)
The data of the test outcome are captured as follows:
If a patient is Covid-19 positive, the Covid-19 test outcome is equal to 1
If a patient is Covid-19 negative, the Covid-19 test outcome is equal to 0
The data of the test outcome for the first 20 patients are the following:
From the table above, we see that:
Patient with ID 3 is Covid-19 positive
Patient with ID 4 is Covid-19 negative
Let's assume that the proportion of Covid-19 positive patients is approximately 700 / 1000 (70%)
In [9]: df_y['Covid-19 test outcome'].value_counts()
Out[9]: 1    701
        0    299
        Name: Covid-19 test outcome, dtype: int64
In [10]: sns.countplot(x='Covid-19 test outcome', data=df_y, color='grey')
         plt.ylabel('Count of patients')
         plt.show()
Then, we link the patient's answers to the questionnaire with the test results.
The output of phase 1.c. is a table similar to this:
In [12]: df.head(20)
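The linking step itself is not shown in the notebook. One way to sketch it, assuming both tables are indexed by `Patient ID`, is an index-on-index merge that also checks each patient appears exactly once on each side. The three-patient tables and the `_demo` names below are made-up toy values for illustration only.

```python
import pandas as pd

# Toy questionnaire answers and test outcomes, both indexed by Patient ID.
ids = pd.Index([1, 2, 3], name="Patient ID")
df_X_demo = pd.DataFrame({"q1": [0.5, -1.2, 0.3], "q2": [1.1, 0.4, -0.7]}, index=ids)
df_y_demo = pd.DataFrame({"Covid-19 test outcome": [1, 0, 1]}, index=ids)

# Inner merge on the index; validate raises if a patient ID is duplicated.
df_demo = df_X_demo.merge(
    df_y_demo,
    left_index=True,
    right_index=True,
    how="inner",
    validate="one_to_one",
)

print(df_demo)
```

An inner merge silently drops patients who are missing from either table, so comparing `len(df_demo)` with the input lengths is a cheap sanity check.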
The following figure illustrates the pairwise scatterplots of each combination of questions, as well as the distribution of values for each question, for a random 10% sample of the patients.
Covid-19 positive patients are shown in orange.
Covid-19 negative patients are shown in blue.
In [13]: sns.set(style="ticks", color_codes=True)
df_sample = df.sample(frac=0.1, replace=False, random_state=0)
g = sns.pairplot(df_sample, hue='Covid-19 test outcome')
Phase 1.d.
Build a machine learning model on data from step 1.c
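As a best practice, we set aside 20% of the data (200 patients) in order to measure model performance on a subset of data that was not used to fit the model. The split is stratified on the test outcome, so both partitions keep roughly the same proportion of positive patients.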
In [15]: X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)
In [16]: print('The training partition has', X_train.shape[0], 'rows (patients) and', X_train.shape[1], 'inputs (answered questions from the questionnaire)')
         print('The training partition has', y_train.shape[0], 'class labels (results of the Covid-19 test for each patient)')
         print('\n')
         print('The validation partition has', X_test.shape[0], 'rows (patients) and', X_test.shape[1], 'inputs (answered questions from the questionnaire)')
         print('The validation partition has', y_test.shape[0], 'class labels (results of the Covid-19 test for each patient)')
The training partition has 800 rows (patients) and 23 inputs (answered questions from the questionnaire)
The training partition has 800 class labels (results of the Covid-19 test for each patient)

The validation partition has 200 rows (patients) and 23 inputs (answered questions from the questionnaire)
The validation partition has 200 class labels (results of the Covid-19 test for each patient)
The proportion of the target variable (Covid-19 test outcome) in the training partition is the following:
In [17]: pd.Series(y_train).value_counts(normalize=True)
Out[17]: 1    0.70125
         0    0.29875
         dtype: float64
The proportion of the target variable (Covid-19 test outcome) in the validation partition is the following:
In [18]: pd.Series(y_test).value_counts(normalize=True)
Out[18]: 1    0.7
         0    0.3
         dtype: float64
The details about how the Covid-19 risk model is fit are not shown.
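Since the fitting step is omitted, the following is only a hedged sketch of how a risk score could be produced on data of the same shape. It is not the authors' model: the logistic regression, the `make_classification` simulation, and all names ending in `_demo` are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Assumption: simulate 1000 patients, 23 answers, about 70% positives.
X_demo, y_demo = make_classification(
    n_samples=1000,
    n_features=23,
    n_informative=10,
    weights=[0.3, 0.7],
    random_state=0,
)

# Same 80/20 stratified split as in the notebook.
X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

# Assumption: a plain logistic regression stands in for the undisclosed model.
model_demo = LogisticRegression(max_iter=1000).fit(X_tr_demo, y_tr_demo)

# The risk score is the predicted probability of the positive class (label 1).
proba_train_demo = model_demo.predict_proba(X_tr_demo)[:, 1]
proba_test_demo = model_demo.predict_proba(X_te_demo)[:, 1]

print(proba_train_demo.shape, proba_test_demo.shape)
```

Any classifier exposing `predict_proba` could be swapped in; the column index `1` selects the probability of the class labelled 1.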
The graphs below show the distribution of the risk score for the training and the validation partition. We can see that the distribution has two distinct
spikes. Risk scores close to zero show low-risk people and the risk scores close to 1 show high-risk people.
In [31]: sns.distplot(pd.Series(pred_proba_train), kde=False)
plt.xlabel('Covid-19 risk score')
plt.ylabel('Count of patients')
plt.title('Covid-19 risk score for the training partition')
plt.show()
In [32]: sns.distplot(pd.Series(pred_proba_test), kde=False)
plt.xlabel('Covid-19 risk score')
plt.ylabel('Count of patients')
plt.title('Covid-19 risk score for the validation partition')
plt.show()
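The cells that evaluate the model call `adjusted_classes(pred_proba, prior_proba)`, whose definition is not included in this document. A plausible reading is that it converts risk scores into 0/1 predictions by comparing them with a cut-off. The sketch below illustrates that idea only; the function name `adjusted_classes_sketch`, the `>=` rule, and the 0.7 cut-off are assumptions, not the authors' implementation.

```python
import numpy as np

def adjusted_classes_sketch(pred_proba, threshold):
    """Return 1 where the risk score is at or above `threshold`, else 0.

    Hypothetical stand-in for the notebook's undefined `adjusted_classes`.
    """
    return (np.asarray(pred_proba) >= threshold).astype(int)

scores = np.array([0.05, 0.40, 0.70, 0.95])
print(adjusted_classes_sketch(scores, threshold=0.7))  # [0 0 1 1]
```

Lowering the cut-off flags more people for testing (fewer missed positives, more tests spent); raising it does the opposite.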
The following output shows the model performance on the training partition:
In [36]: pred_train = adjusted_classes(pred_proba_train, prior_proba)
         print('Training partition')
         print('\n')
         print(pd.DataFrame(confusion_matrix(y_train, pred_train),
                            columns=['Predicted Covid-19 = 0', 'Predicted Covid-19 = 1'],
                            index=['Actual Covid-19 = 0', 'Actual Covid-19 = 1']))
         print('\n')
         print(classification_report(y_train, pred_train))
Training partition
                     Predicted Covid-19 = 0  Predicted Covid-19 = 1
Actual Covid-19 = 0                     239                       0
Actual Covid-19 = 1                       9                     552
              precision    recall  f1-score   support
           0       0.96      1.00      0.98       239
           1       1.00      0.98      0.99       561
    accuracy                           0.99       800
   macro avg       0.98      0.99      0.99       800
weighted avg       0.99      0.99      0.99       800
The following output shows the model performance on the validation partition:
In [37]: pred_test = adjusted_classes(pred_proba_test, prior_proba)
         print('Validation partition')
         print('\n')
         print(pd.DataFrame(confusion_matrix(y_test, pred_test),
                            columns=['Predicted Covid-19 = 0', 'Predicted Covid-19 = 1'],
                            index=['Actual Covid-19 = 0', 'Actual Covid-19 = 1']))
         print('\n')
         print(classification_report(y_test, pred_test))
Validation partition
                     Predicted Covid-19 = 0  Predicted Covid-19 = 1
Actual Covid-19 = 0                      58                       2
Actual Covid-19 = 1                      12                     128
              precision    recall  f1-score   support
           0       0.83      0.97      0.89        60
           1       0.98      0.91      0.95       140
    accuracy                           0.93       200
   macro avg       0.91      0.94      0.92       200
weighted avg       0.94      0.93      0.93       200
The output of phase 1.d is a machine learning model that can be deployed at scale in order to calculate the risk score of any person, on the basis of his/her answers to the questionnaire.
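To make the validation figures easier to check, the short calculation below recomputes the positive-class metrics directly from the four counts of the validation confusion matrix reported above (58, 2, 12, 128). Only the variable names are introduced here.

```python
# Counts from the validation confusion matrix.
tn, fp = 58, 2     # actual negatives: predicted 0, predicted 1
fn, tp = 12, 128   # actual positives: predicted 0, predicted 1

precision_pos = tp / (tp + fp)                 # 128 / 130
recall_pos = tp / (tp + fn)                    # 128 / 140
f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)
accuracy = (tp + tn) / (tp + tn + fp + fn)     # 186 / 200

print(round(precision_pos, 2), round(recall_pos, 2), round(f1_pos, 2), round(accuracy, 2))
# 0.98 0.91 0.95 0.93
```

For targeted testing, the 12 false negatives are the costly errors: these are Covid-19 positive patients the model would not have prioritised for a test.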
Phase 2.a.
Use the model from step 1.d to generate a prediction of Covid-19 positiveness of any person
After the model is deployed in production, let's assume that 500 previously unseen patients have answered the questionnaire. The data for the first 10 of them are shown below.
In [45]: pd.set_option('display.precision', 5)  # full option name; the bare 'precision' key is ambiguous in recent pandas
         df_X_score.head(10)
Out[45]:
Patient ID q1 q2 q3 q4 q5 q6 q7 q8 q9 q10 q11 q12 q13 q14
1001 1.53175 -0.04577 0.24018 -3.01630 3.78771 0.14376 0.10031 1.51729 -1.76818 -1.63343 -0.17644 -0.93879 0.15535 1.46685
1002 -0.10546 0.27277 0.19986 -2.45632 2.71857 -0.67213 0.65718 1.81098 -1.08473 -2.55134 -0.01416 0.69048 -0.05910 0.62776
1003 -0.53627 1.91788 -1.27851 0.08680 0.73835 0.83755 -1.04482 -0.79318 0.05228 -0.39619 -0.05698 0.78990 0.75103 -1.12659
1004 0.22495 1.32366 2.31517 2.10098 0.12442 0.06402 0.13983 -2.27711 -0.34322 -0.45823 -0.86908 1.73886 -1.13832 -1.09103
1005 1.12422 2.21340 1.03132 2.05760 -3.09501 -3.00650 0.06349 -0.10149 1.79921 1.97837 0.02817 -0.22139 -0.08733 -0.17335
1006 0.13391 -0.76154 0.83318 1.40468 -1.82245 -0.19431 -0.20189 0.14828 -2.96337 -0.15200 -0.70969 0.09331 -0.62289 -0.52828
1007 0.55378 1.14685 1.57146 -1.62518 2.78392 -0.22966 1.07731 2.42854 -0.50233 1.09827 -0.25322 0.81109 -1.83957 1.25795
1008 -1.68730 1.69995 -0.99108 1.42300 -2.63067 -1.44764 0.89630 2.26976 -2.73905 0.17660 0.64604 1.48959 -1.64372 -1.62723
1009 -1.14398 1.17859 0.54627 0.11912 0.45548 -0.25665 -1.09810 -1.01112 0.41393 -0.73649 -0.62525 0.51227 0.55505 -1.23055
1010 -1.38965 1.11901 -1.67252 0.45198 -3.02987 0.72636 0.21500 -0.64388 -1.34596 -0.22011 0.13737 -0.67231 -2.50142 0.97951
Distribution of Covid-19 risk score for 500 previously unknown patients
In [48]: sns.distplot(pd.Series(pred_proba_score), kde=False)
plt.xlabel('Covid-19 risk score')
plt.ylabel('Count of patients')
plt.title('Covid-19 risk score for 500 previously unknown patients')
plt.show()
The table below shows the prediction for the first 10 previously unknown patients based on the Covid-19 risk score:
In [50]: pred_score_df = pd.DataFrame(pred_score, index=idx_score, columns=['Predicted Covid-19 test outcome'])
         pred_score_df.head(10)
Out[50]:
Patient ID  Predicted Covid-19 test outcome
1001        1
1002        1
1003        1
1004        0
1005        0
1006        1
1007        1
1008        1
1009        0
1010        1
For example, the patient with ID 1003 has a positive predicted Covid-19 test outcome, while the patient with ID 1004 has a negative predicted outcome.
Phase 2.b.
Target Covid-19 tests for persons having a high likelihood of positive Covid-19 diagnosis
Get the top 20 persons with the highest Covid-19 risk score
In [51]: pred_proba_score_df = pd.DataFrame(pred_proba_score, index=idx_score, columns=['Predicted Covid-19 risk score'])
         pd.concat([pred_proba_score_df, pred_score_df], axis=1).sort_values(by='Predicted Covid-19 risk score', ascending=False).head(20).drop(columns='Predicted Covid-19 risk score')
The proposed solution is capable of registering and reporting the following information:
Daily count of participants taking the questionnaire
Average daily Covid-19 risk score
Average Covid-19 risk score by age group
Average Covid-19 risk score by geographical region
etc.
In [53]: plt.figure(figsize=(15, 10))
plt.title("Daily count of participants taking the questionnaire", fontsize=16)
plt.plot(daily_volume.index, daily_volume['Volume'], color="b", linestyle="-")
plt.ylabel("Volume", fontsize=14)
plt.xlabel("Date", fontsize=14)
plt.ylim(0, 1000)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.grid(True)
plt.show()
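The reporting figures listed above can all be derived from a log of questionnaire responses with a simple group-by. The sketch below assumes a hypothetical `responses` table with `date`, `age_group` and `risk_score` columns; the column names and the five toy rows are illustrative, not data from the project.

```python
import pandas as pd

# Hypothetical response log: one row per completed questionnaire.
responses = pd.DataFrame({
    "date": pd.to_datetime(
        ["2020-04-20", "2020-04-20", "2020-04-21", "2020-04-21", "2020-04-21"]
    ),
    "age_group": ["18-39", "60+", "18-39", "40-59", "60+"],
    "risk_score": [0.10, 0.90, 0.20, 0.50, 0.80],
})

# Daily count of participants and average daily risk score.
daily = responses.groupby("date")["risk_score"].agg(Volume="count", avg_risk="mean")

# Average risk score by age group.
by_age = responses.groupby("age_group")["risk_score"].mean()

print(daily)
print(by_age)
```

Grouping by a region column instead of `age_group` would give the average risk score by geographical region in the same way.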