Covid19 Smart Assessment Tool

Machine Learning for COVID-19 targeted testing
A risk assessment tool that optimises the use of COVID19 tests in order to implement fact-based strategies for deconfinement.
A project supported by Compellio S.A.
15, côte d'Eich
L-1450 Luxembourg
https://compell.io
Contributors
All project contributors worked on this project voluntarily.
Christos Avrilionis, Credit Risk and Governance Manager, PayPal
Theo Papasternos, Business Manager, Compellio
Denis Avrilionis, CEO, Compellio
Vivi Tzekou, Machine Learning / Full-Stack Developer
Yuri Visovsiouk, Full-Stack Developer
Christina Dimopoulou, Business Operations, Compellio
Disclaimer: the opinions expressed in this publication are those of the authors. They do not express the opinions of any entity whatsoever with
which they are affiliated.
Contacts
We would be delighted to further discuss this project with you. You can reach us directly at hello@compell.io.
The way forward

We are currently looking to join forces with governments, health organisations, laboratories, and pharmaceutical companies. Interested parties can fill in the
partnership form found on the project’s website: https://covid19smartscreeningtool.launchaco.com
Stay healthy,
The COVID19 Smart Screening Tool Team
Hacking for #EUvsVirus
Overview
The project aims at the development and the deployment of a software platform that would:
Allow an individual to fill in an electronic questionnaire with health and demographic information, securing and protecting sensitive personal data
using Compellio’s blockchain-enabled registry technology
Predict the likelihood of positive Covid-19 diagnosis of an individual at a given point in time using machine learning (ML) and artificial intelligence (AI)
Enable policy makers to build an optimal Covid-19 exit strategy based on the targeted use of Covid-19 tests of high-risk individuals
This solution will be implemented in 2 phases.
1. Phase 1: Data collection and modelling
a. Design the questionnaire
b. Collect medical and demographic data of a person when that person takes a test for Covid-19 using the questionnaire from step 1.a
c. Link the test’s outcome (Covid-19 positive or negative) to the data collected in step 1.b
d. Build a machine learning model on data from step 1.c
2. Phase 2: Deployment and general availability
a. Use the model from step 1.d to generate a prediction of Covid-19 positiveness of any person
b. Target Covid-19 tests for persons having a high likelihood of positive Covid-19 diagnosis
c. Link the test result (Covid-19 positive or negative) to the prediction calculated in step 2.a
d. Monitor model performance and fine-tune the machine learning model built in step 1.d
This document illustrates steps 1.b, 1.c, 1.d, 2.a and 2.b using simulated data.

Phase 1.b.
Collect medical and demographic data of a person when that person takes a test for Covid-19 using the questionnaire
As a result of phase 1.a, let's assume that we have a questionnaire of 23 questions about medical and demographic information.
For the purpose of this illustration, let's assume that:
Each question is referred to as q1, q2, ..., q23
Each patient has a unique identifier from 1 to 1000
The questionnaire was proposed to 1000 patients as part of the Covid-19 testing procedure
All patients answered all the questions
Each answer to each question is a continuous variable (this can be extended to categorical variables as well)
The output of phase 1.b. is a table similar to this (first 20 patients shown):
In [7]: df_X.head(20)

Out[7]:
q1 q2 q3 q4 q5 q6 q7 q8 q9 q10 q11 q12 q13 q14
Patient
ID
1 0.86360 -1.00490 -0.32880 0.63232 -0.76382 2.48937 0.30792 0.69260 0.87849 0.39463 1.00259 0.78609 -0.17739 0.77910
2 0.16313 -0.24986 0.13275 -1.50085 -2.92733 0.20983 0.66074 -0.65538 0.11255 0.59372 0.88890 -0.50316 0.25568 0.76523
3 -0.73682 -0.55274 -1.18046 -0.42813 -1.34706 -0.50470 0.23098 2.24479 -0.68447 -0.19708 -1.26557 -3.01053 0.82140 -0.71519
4 -1.04279 -0.65561 -0.76028 0.25715 -0.35681 -2.86298 -2.25225 -5.07991 0.63268 1.13516 0.00391 1.42961 0.43997 -1.27914
5 -0.51371 0.50819 -0.38651 -1.34063 1.36827 -0.89227 1.27378 -0.07773 0.77796 -0.47807 -2.04526 1.64837 -0.67013 1.54823
6 -2.55599 2.15763 -0.41735 -0.45903 -0.99303 0.68273 1.86696 0.92724 -0.51942 0.97351 1.01371 0.12639 -0.88108 0.90742
7 0.25446 -0.66239 0.14898 -1.14714 0.26427 -0.12783 -0.13116 0.68001 -0.22781 0.89755 -0.50767 -0.22261 -0.43984 -1.13151
8 -1.01046 -0.75284 0.22125 -0.25886 -1.23720 -1.26904 -0.06989 -0.54133 0.54484 0.41281 -0.17778 -2.23708 0.43346 -0.75199
9 0.21368 -1.49480 0.80215 -0.55766 -2.11991 0.22287 -2.60513 0.86176 0.86479 0.20923 -0.66948 -0.15163 0.98832 -1.26166
10 0.47976 0.04171 -2.12566 0.07869 0.83110 -0.12056 -1.66437 0.79751 -0.97663 1.29526 -0.57091 -1.01142 -0.88971 -2.32314
11 1.00421 0.99791 0.76168 0.40136 -0.52947 -1.03565 -0.96048 -0.63995 2.44512 1.08679 1.41085 2.73563 1.36788 -0.69813
12 0.52490 -0.47379 -0.65672 -1.35932 -2.25998 -2.31555 -0.89348 -4.19650 -0.84165 -0.69299 0.69274 2.07068 0.22034 0.43498
13 0.63667 1.19087 0.05326 1.21838 -0.08718 2.12081 0.13317 1.77220 -0.62710 -1.29448 -0.33494 -0.95674 0.53233 0.48086
14 -0.37914 -0.14957 0.76062 -0.83470 -0.77427 0.27242 1.21496 1.95481 -0.16722 0.31711 0.31330 -1.84803 0.40436 2.54233
15 -0.89963 -0.38678 0.95247 1.40977 2.22997 -2.98061 -0.18962 -5.50990 0.93988 0.17021 0.26472 2.38602 -0.99076 0.25673
16 -0.17194 0.40791 -0.80500 0.17541 0.35286 1.02670 -0.26927 0.56005 -0.07986 0.04604 -0.95871 -0.59690 1.51679 -1.01261
17 -2.05931 -1.57735 0.46265 -0.18091 2.63300 -2.58241 -0.87385 -4.06331 -0.57685 0.57172 1.38340 1.93061 -1.13699 -1.81634
18 0.17660 -1.64372 1.26173 -0.01106 -1.44764 1.86687 -0.28044 1.77416 -1.68730 -0.23558 -0.41543 -0.63237 -0.79932 0.64604
19 -0.58746 -0.40418 1.12721 -1.26260 -0.27359 -1.22690 -1.83024 -0.53333 0.93635 -0.91030 1.40048 2.55845 -0.12744 -0.34864
20 -0.00095 0.10829 -0.64400 0.06277 -0.76381 -0.67486 0.11844 2.36569 -0.63694 -0.69341 -1.27323 -0.42776 0.45477 0.90311
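The document does not show how the simulated answers were generated. As a minimal sketch under the assumptions listed above (1000 patients, 23 continuous answers, IDs from 1 to 1000), a table of this shape could be produced as follows. Drawing the answers from a standard normal distribution and the fixed seed are illustrative assumptions, not the authors' method.

```python
import numpy as np
import pandas as pd

N_PATIENTS, N_QUESTIONS = 1000, 23  # sizes stated in the assumptions above

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Assumption: every answer is an independent standard normal draw.
answers = rng.normal(size=(N_PATIENTS, N_QUESTIONS))

df_X = pd.DataFrame(
    answers,
    index=pd.RangeIndex(1, N_PATIENTS + 1, name='Patient ID'),
    columns=[f'q{i}' for i in range(1, N_QUESTIONS + 1)],
)

print(df_X.shape)
```

Naming the index `Patient ID` keeps the identifier out of the feature columns, so it cannot be picked up by the model by accident.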

Phase 1.c.
Link the test’s outcome (Covid-19 positive or negative) to the data collected in step 1.b.
Let's assume that:
All 1000 patients from phase 1.b. have been tested for Covid-19 positiveness using a lab test
The tests were done on respiratory samples obtained by a nasopharyngeal swab using real-time reverse transcription polymerase chain reaction (rRT-PCR)
The data of the test outcome are captured as follows:
If a patient is Covid-19 positive, the Covid-19 test outcome is equal to 1
If a patient is Covid-19 negative, the Covid-19 test outcome is equal to 0
The data of the test outcome for the first 20 patients are the following:

In [8]: df_y.head(20)
Out[8]:
Covid-19 test outcome
Patient ID
1 1
2 1
3 1
4 0
5 1
6 1
7 1
8 1
9 1
10 1
11 1
12 0
13 1
14 1
15 0
16 1
17 0
18 1
19 0
20 0

From the table above, we see that:
Patient with ID 3 is Covid-19 positive
Patient with ID 4 is Covid-19 negative
Let's assume that the proportion of Covid-19 positive patients is approximately 700 / 1000 (70%)
In [9]: df_y['Covid-19 test outcome'].value_counts()
Out[9]: 1 701
0 299
Name: Covid-19 test outcome, dtype: int64
In [10]: sns.countplot(x='Covid-19 test outcome', data=df_y, color='grey')
plt.ylabel('Count of patients')
plt.show()

Then, we link the patient's answers to the questionnaire with the test results.
The output of phase 1.c. is a table similar to this:
In [12]: df.head(20)

Out[12]:
Covid-
19 test
outcome
q1 q2 q3 q4 q5 q6 q7 q8 q9 q10 q11 q12 q13
Patient
ID
1 1 0.86360 -1.00490 -0.32880 0.63232 -0.76382 2.48937 0.30792 0.69260 0.87849 0.39463 1.00259 0.78609 -0.17739
2 1 0.16313 -0.24986 0.13275 -1.50085 -2.92733 0.20983 0.66074 -0.65538 0.11255 0.59372 0.88890 -0.50316 0.25568
3 1 -0.73682 -0.55274 -1.18046 -0.42813 -1.34706 -0.50470 0.23098 2.24479 -0.68447 -0.19708 -1.26557 -3.01053 0.82140
4 0 -1.04279 -0.65561 -0.76028 0.25715 -0.35681 -2.86298 -2.25225 -5.07991 0.63268 1.13516 0.00391 1.42961 0.43997
5 1 -0.51371 0.50819 -0.38651 -1.34063 1.36827 -0.89227 1.27378 -0.07773 0.77796 -0.47807 -2.04526 1.64837 -0.67013
6 1 -2.55599 2.15763 -0.41735 -0.45903 -0.99303 0.68273 1.86696 0.92724 -0.51942 0.97351 1.01371 0.12639 -0.88108
7 1 0.25446 -0.66239 0.14898 -1.14714 0.26427 -0.12783 -0.13116 0.68001 -0.22781 0.89755 -0.50767 -0.22261 -0.43984
8 1 -1.01046 -0.75284 0.22125 -0.25886 -1.23720 -1.26904 -0.06989 -0.54133 0.54484 0.41281 -0.17778 -2.23708 0.43346
9 1 0.21368 -1.49480 0.80215 -0.55766 -2.11991 0.22287 -2.60513 0.86176 0.86479 0.20923 -0.66948 -0.15163 0.98832
10 1 0.47976 0.04171 -2.12566 0.07869 0.83110 -0.12056 -1.66437 0.79751 -0.97663 1.29526 -0.57091 -1.01142 -0.88971
11 1 1.00421 0.99791 0.76168 0.40136 -0.52947 -1.03565 -0.96048 -0.63995 2.44512 1.08679 1.41085 2.73563 1.36788
12 0 0.52490 -0.47379 -0.65672 -1.35932 -2.25998 -2.31555 -0.89348 -4.19650 -0.84165 -0.69299 0.69274 2.07068 0.22034
13 1 0.63667 1.19087 0.05326 1.21838 -0.08718 2.12081 0.13317 1.77220 -0.62710 -1.29448 -0.33494 -0.95674 0.53233
14 1 -0.37914 -0.14957 0.76062 -0.83470 -0.77427 0.27242 1.21496 1.95481 -0.16722 0.31711 0.31330 -1.84803 0.40436
15 0 -0.89963 -0.38678 0.95247 1.40977 2.22997 -2.98061 -0.18962 -5.50990 0.93988 0.17021 0.26472 2.38602 -0.99076
16 1 -0.17194 0.40791 -0.80500 0.17541 0.35286 1.02670 -0.26927 0.56005 -0.07986 0.04604 -0.95871 -0.59690 1.51679
17 0 -2.05931 -1.57735 0.46265 -0.18091 2.63300 -2.58241 -0.87385 -4.06331 -0.57685 0.57172 1.38340 1.93061 -1.13699
18 1 0.17660 -1.64372 1.26173 -0.01106 -1.44764 1.86687 -0.28044 1.77416 -1.68730 -0.23558 -0.41543 -0.63237 -0.79932
19 0 -0.58746 -0.40418 1.12721 -1.26260 -0.27359 -1.22690 -1.83024 -0.53333 0.93635 -0.91030 1.40048 2.55845 -0.12744
20 0 -0.00095 0.10829 -0.64400 0.06277 -0.76381 -0.67486 0.11844 2.36569 -0.63694 -0.69341 -1.27323 -0.42776 0.45477
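The linking step itself is not shown in the document. One way to do it, sketched here on the first four patients and two questions from the tables above, is an index-on-index merge keyed on the patient ID. The `_demo` variable names are hypothetical, and the merge is an assumed approach rather than the authors' code.

```python
import pandas as pd

ids = pd.Index([1, 2, 3, 4], name='Patient ID')

# Answers for q1 and q2, copied from the first four rows of the table above.
df_X_demo = pd.DataFrame(
    {'q1': [0.86360, 0.16313, -0.73682, -1.04279],
     'q2': [-1.00490, -0.24986, -0.55274, -0.65561]},
    index=ids,
)

# Test outcomes for the same four patients (1 = positive, 0 = negative).
df_y_demo = pd.DataFrame({'Covid-19 test outcome': [1, 1, 1, 0]}, index=ids)

# Inner merge on the shared index; validate='one_to_one' raises an error
# if a patient ID appears more than once on either side.
df_demo = pd.merge(
    df_y_demo, df_X_demo,
    left_index=True, right_index=True,
    how='inner', validate='one_to_one',
)

print(df_demo)
```

An inner merge silently drops patients missing from either table, so comparing `len(df_demo)` with the input lengths is a cheap check that no record was lost.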

The following figure illustrates the pairwise scatterplots of each combination of questions, as well as the distribution of values for each question.
Covid-19 positive patients are shown in orange.
Covid-19 negative patients are shown in blue.
In [13]: sns.set(style="ticks", color_codes=True)
df_sample = df.sample(frac=0.1, replace=False, random_state=0)
g = sns.pairplot(df_sample, hue='Covid-19 test outcome')

Phase 1.d.
Build a machine learning model on data from step 1.c
As a best practice, we set aside 20% of the data (200 patients) in order to measure model performance on a subset of data that was not used to fit the
model.
In [15]: X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

In [16]: print('The training partition has', X_train.shape[0], 'rows (patients) and', X_train.shape[1], 'inputs (answered questions from the questionnaire)')
print('The training partition has', y_train.shape[0], 'class labels (results of the Covid-19 test for each patient)')
print('\n')
print('The validation partition has', X_test.shape[0], 'rows (patients) and', X_test.shape[1], 'inputs (answered questions from the questionnaire)')
print('The validation partition has', y_test.shape[0], 'class labels (results of the Covid-19 test for each patient)')
The training partition has 800 rows (patients) and 23 inputs (answered questions from the questionnaire)
The training partition has 800 class labels (results of the Covid-19 test for each patient)

The validation partition has 200 rows (patients) and 23 inputs (answered questions from the questionnaire)
The validation partition has 200 class labels (results of the Covid-19 test for each patient)
The proportion of the target variable (Covid-19 test outcome) in the training partition is the following:
In [17]: pd.Series(y_train).value_counts(normalize=True)
Out[17]: 1 0.70125
0 0.29875
dtype: float64
The proportion of the target variable (Covid-19 test outcome) in the validation partition is the following:
In [18]: pd.Series(y_test).value_counts(normalize=True)
Out[18]: 1 0.7
0 0.3
dtype: float64

The details of how the Covid-19 risk model is fit are not shown.
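Since the fitting step is omitted, the following is only a stand-in to make the workflow concrete: a logistic regression fitted on synthetic data of the same shape, producing risk scores with the same variable names used in the cells that follow. The choice of `LogisticRegression`, the synthetic labels and the seed are all assumptions; the actual model family used by the authors is not stated.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the questionnaire data: 1000 patients, 23 answers.
X = rng.normal(size=(1000, 23))

# Assumption: the true outcome follows a noisy linear rule on the answers.
true_coef = rng.normal(size=23)
logits = X @ true_coef + 1.0
y = (rng.random(1000) < 1.0 / (1.0 + np.exp(-logits))).astype(int)

# Same stratified 80/20 split as in cell In [15].
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Stand-in classifier; any estimator exposing predict_proba would fit here.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Risk score = predicted probability of the positive class (column 1).
pred_proba_train = model.predict_proba(X_train)[:, 1]
pred_proba_test = model.predict_proba(X_test)[:, 1]
```

Taking column 1 of `predict_proba` relies on `model.classes_` being `[0, 1]`, which holds when the outcome is coded 0/1 as in phase 1.c.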
The graphs below show the distribution of the risk score for the training and the validation partitions. We can see that the distribution has two distinct
spikes. Risk scores close to 0 indicate low-risk people and risk scores close to 1 indicate high-risk people.
In [31]: sns.distplot(pd.Series(pred_proba_train), kde=False)
plt.xlabel('Covid-19 risk score')
plt.title('Covid-19 risk score for the training partition')
plt.show()

In [32]: sns.distplot(pd.Series(pred_proba_test), kde=False)
plt.title('Covid-19 risk score for the validation partition')
plt.show()
The following output shows the model performance on the training partition

In [36]: pred_train = adjusted_classes(pred_proba_train, prior_proba)
print('Training partition')
print('\n')
print(pd.DataFrame(confusion_matrix(y_train, pred_train),
                   columns=['Predicted Covid-19 = 0', 'Predicted Covid-19 = 1'],
                   index=['Actual Covid-19 = 0', 'Actual Covid-19 = 1']))
print('\n')
print(classification_report(y_train, pred_train))
Training partition
                     Predicted Covid-19 = 0  Predicted Covid-19 = 1
Actual Covid-19 = 0                     239                       0
              precision    recall  f1-score   support
           0       0.96      1.00      0.98       239
           1       1.00      0.98      0.99       561
    accuracy                           0.99       800
   macro avg       0.98      0.99      0.99       800
weighted avg       0.99      0.99      0.99       800
The following output shows the model performance on the validation partition

In [37]: pred_test = adjusted_classes(pred_proba_test, prior_proba)
print('Validation partition')
print('\n')
print(pd.DataFrame(confusion_matrix(y_test, pred_test),
                   columns=['Predicted Covid-19 = 0', 'Predicted Covid-19 = 1'],
                   index=['Actual Covid-19 = 0', 'Actual Covid-19 = 1']))
print('\n')
print(classification_report(y_test, pred_test))
Validation partition
Predicted Covid-19 = 0 Predicted Covid-19 = 1
              precision    recall  f1-score   support
           0       0.83      0.97      0.89        60
           1       0.98      0.91      0.95       140
    accuracy                           0.93       200
   macro avg       0.91      0.94      0.92       200
weighted avg       0.94      0.93      0.93       200
The output of phase 1.d is a machine learning model that can be deployed at scale in order to calculate the risk score of any person, on the basis of
their answers to the questionnaire.
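The `adjusted_classes` helper called in cells In [36] and In [37] is not defined in this document. A plausible reading, given its arguments, is that it turns risk scores into 0/1 predictions by comparing them with a cut-off. The sketch below implements that reading. Both the function body and the interpretation of `prior_proba` as the share of positives in the training data are assumptions.

```python
import numpy as np


def adjusted_classes(pred_proba, threshold):
    """Return 1 where the risk score is at or above the threshold, else 0.

    Hypothetical reconstruction: the original helper is not shown.
    """
    return (np.asarray(pred_proba) >= threshold).astype(int)


# Assumption: the cut-off equals the share of positive patients (about 70%).
prior_proba = 0.7

scores = np.array([0.05, 0.69, 0.70, 0.98])
print(adjusted_classes(scores, prior_proba))  # [0 0 1 1]
```

Raising the cut-off trades recall on positive patients for fewer tests spent on people who turn out negative, which is the lever a policy maker would adjust when tests are scarce.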

Phase 2.a.
Use the model from step 1.d to generate a prediction of Covid-19 positiveness of any person
After the model is deployed in production, let's assume that 500 previously unknown patients have answered the questionnaire. The data for the first
10 of them are shown below.
In [45]: pd.set_option('display.precision', 5)  # full option name; the bare 'precision' key is ambiguous in recent pandas
df_X_score.head(10)
Out[45]:
q1 q2 q3 q4 q5 q6 q7 q8 q9 q10 q11 q12 q13 q14
Patient
ID
1001 1.53175 -0.04577 0.24018 -3.01630 3.78771 0.14376 0.10031 1.51729 -1.76818 -1.63343 -0.17644 -0.93879 0.15535 1.46685
1002 -0.10546 0.27277 0.19986 -2.45632 2.71857 -0.67213 0.65718 1.81098 -1.08473 -2.55134 -0.01416 0.69048 -0.05910 0.62776
1003 -0.53627 1.91788 -1.27851 0.08680 0.73835 0.83755 -1.04482 -0.79318 0.05228 -0.39619 -0.05698 0.78990 0.75103 -1.12659
1004 0.22495 1.32366 2.31517 2.10098 0.12442 0.06402 0.13983 -2.27711 -0.34322 -0.45823 -0.86908 1.73886 -1.13832 -1.09103
1005 1.12422 2.21340 1.03132 2.05760 -3.09501 -3.00650 0.06349 -0.10149 1.79921 1.97837 0.02817 -0.22139 -0.08733 -0.17335
1006 0.13391 -0.76154 0.83318 1.40468 -1.82245 -0.19431 -0.20189 0.14828 -2.96337 -0.15200 -0.70969 0.09331 -0.62289 -0.52828
1007 0.55378 1.14685 1.57146 -1.62518 2.78392 -0.22966 1.07731 2.42854 -0.50233 1.09827 -0.25322 0.81109 -1.83957 1.25795
1008 -1.68730 1.69995 -0.99108 1.42300 -2.63067 -1.44764 0.89630 2.26976 -2.73905 0.17660 0.64604 1.48959 -1.64372 -1.62723
1009 -1.14398 1.17859 0.54627 0.11912 0.45548 -0.25665 -1.09810 -1.01112 0.41393 -0.73649 -0.62525 0.51227 0.55505 -1.23055
1010 -1.38965 1.11901 -1.67252 0.45198 -3.02987 0.72636 0.21500 -0.64388 -1.34596 -0.22011 0.13737 -0.67231 -2.50142 0.97951
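The scoring call is not shown in the document. As a sketch, scoring new patients amounts to passing their answers, in the same column order used for training, to the fitted model and keeping the probability of the positive class. The training data, the labelling rule and the `LogisticRegression` model below are stand-ins invented for illustration; only the variable names `df_X_score`, `idx_score`, `pred_proba_score` and `pred_proba_score_df` follow the document.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

questions = [f'q{i}' for i in range(1, 24)]
rng = np.random.default_rng(1)

# Stand-in training data and model (assumption: the real model is not shown).
X_train = pd.DataFrame(rng.normal(size=(800, 23)), columns=questions)
y_train = (X_train['q6'] + X_train['q8'] > -1).astype(int)  # made-up rule
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# 500 previously unknown patients with IDs 1001 to 1500.
idx_score = pd.RangeIndex(1001, 1501, name='Patient ID')
df_X_score = pd.DataFrame(
    rng.normal(size=(500, 23)), index=idx_score, columns=questions
)

# Select columns in training order before scoring, then keep P(outcome = 1).
pred_proba_score = model.predict_proba(df_X_score[questions])[:, 1]

pred_proba_score_df = pd.DataFrame(
    pred_proba_score, index=idx_score,
    columns=['Predicted Covid-19 risk score'],
)
print(pred_proba_score_df.head(10))
```

Selecting `df_X_score[questions]` explicitly guards against a questionnaire export whose columns arrive in a different order than the one the model was trained on.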

Distribution of Covid-19 risk score for 500 previously unknown patients
In [48]: sns.distplot(pd.Series(pred_proba_score), kde=False)
plt.title('Covid-19 risk score for 500 previously unknown patients')
plt.show()
The table below shows the prediction for the first 10 previously unknown patients based on the Covid-19 risk score

In [50]: pred_score_df = pd.DataFrame(pred_score, index=idx_score, columns=['Predicted Covid-19 test outcome'])
pred_score_df.head(10)
Out[50]:
Predicted Covid-19 test outcome
Patient ID
1001 1
1002 1
1003 1
1004 0
1005 0
1006 1
1007 1
1008 1
1009 0
1010 1
Patient with ID 1003 has a positive predicted Covid-19 test outcome, while patient 1004 has a negative predicted Covid-19 test outcome.
Phase 2.b.
Target Covid-19 tests for persons having a high likelihood of positive Covid-19 diagnosis

Get the top 20 persons with the highest Covid-19 risk score
In [51]: pred_proba_score_df = pd.DataFrame(pred_proba_score, index=idx_score, columns=['Predicted Covid-19 risk score'])
pd.concat([pred_proba_score_df, pred_score_df], axis=1).sort_values(by='Predicted Covid-19 risk score', ascending=False).head(20).drop(columns='Predicted Covid-19 risk score')

Out[51]:
Predicted Covid-19 test outcome
Patient ID
1204 1
1057 1
1223 1
1212 1
1374 1
1375 1
1026 1
1386 1
1293 1
1270 1
1407 1
1267 1
1417 1
1272 1
1195 1
1038 1
1322 1
1481 1
1163 1
1225 1

The proposed solution is capable of registering and reporting the following information:
Daily count of participants taking the questionnaire
Average daily Covid-19 risk score
Average Covid-19 risk score by age group
Average Covid-19 risk score by geographical region
etc.
In [53]: plt.figure(figsize=(15, 10))
plt.title("Daily count of participants taking the questionnaire", fontsize=16)
plt.plot(daily_volume.index, daily_volume['Volume'], color="b", linestyle="-")
plt.ylabel("Volume", fontsize=14)
plt.xlabel("Date", fontsize=14)
plt.ylim(0, 1000)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.grid(True)
plt.show()

plt.title("Daily average Covid-19 risk score for predicted positive patients", fontsize=16)
plt.plot(daily_scores.index, daily_scores['Average Covid-19 risk score'], color="b", linestyle="-")
plt.ylabel("Average Covid-19 risk score", fontsize=14)
plt.xlabel("Date", fontsize=14)
plt.ylim(0, 1)
plt.grid(True)
plt.show()

plt.title("Weekly average Covid-19 risk score by age group", fontsize=16)
plt.plot(weekly_avg_score_age.index, weekly_avg_score_age['18-39'], color="b", linestyle="-", label='18-39')
plt.plot(weekly_avg_score_age.index, weekly_avg_score_age['40-59'], color="r", linestyle="-", label='40-59')
plt.plot(weekly_avg_score_age.index, weekly_avg_score_age['60+'], color="g", linestyle="-", label='60+')
plt.ylabel("Average Covid-19 risk score", fontsize=14)
plt.xlabel("Week", fontsize=14)
plt.grid(True)
plt.legend(loc='best', fontsize=14)
plt.show()
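The plotting cells above read from `daily_volume`, `daily_scores` and `weekly_avg_score_age`, whose construction is not shown. Assuming each completed questionnaire is logged with a date, an age group and a risk score, tables with those column names could be derived roughly as follows. The `log` table, its column names and the random data are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 5000

# Hypothetical log of completed questionnaires (one row per participant).
dates = pd.date_range('2020-04-01', periods=28, freq='D')
log = pd.DataFrame({
    'Date': dates[rng.integers(0, len(dates), size=n)],
    'Age group': rng.choice(['18-39', '40-59', '60+'], size=n),
    'Covid-19 risk score': rng.random(n),
})

# Daily count of participants taking the questionnaire.
daily_volume = log.groupby('Date').size().to_frame('Volume')

# Daily average risk score (here over all participants, for simplicity).
daily_scores = (
    log.groupby('Date')['Covid-19 risk score']
    .mean()
    .to_frame('Average Covid-19 risk score')
)

# Weekly average risk score, one column per age group.
weekly_avg_score_age = (
    log.groupby([pd.Grouper(key='Date', freq='W'), 'Age group'])['Covid-19 risk score']
    .mean()
    .unstack('Age group')
)

print(weekly_avg_score_age.round(3))
```

The same pattern extends to the regional report listed above by grouping on a region column instead of the age group.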
