This document contains an agenda for an AI-Bio convergence training course that will take place from August to October 2022. The course will cover topics like Python and R for data analysis, statistical analysis techniques like ANOVA and multivariate analysis, genomics analysis including genome, transcriptome, epigenome and proteome, machine learning algorithms like linear models, clustering and association analysis, deep learning models like CNNs and RNNs, applying these techniques to medical data for tasks like predictive modeling and image analysis. It also includes sessions on computational chemistry, drug discovery, ontology and its applications in biology.
AI 바이오 (2_3일차).pdf
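1. AI-Bio Convergence Professional Course
2022-8~10
윤형기 (hky@openwith.net)
Day 2
2. Topic Details
Day 1: Greetings and course introduction
Greetings
Survey of participants' backgrounds and course goals
Medical/bio overview (technology/industry): medical/bio technology and industry trends
Foundation technology (1-1) Python and analysis packages: analysis tools (1) (Python, SciPy, NumPy/pandas)
Day 2: Foundation technology (1-2) R and statistical analysis: analysis tools (2) (R and statistics)
Biostatistics applications (1): bioinformatics data and ANOVA, multivariate analysis, etc.
Genome analysis
Day 3: Biostatistics applications (2): meta-analysis
Genome analysis (Omics) (1)
Genome analysis
Transcriptome analysis
Day 4: Genome analysis (Omics) (2)
Epigenome analysis
Proteome analysis
Next-generation sequencing
GenBank and NCBI data
VCF data analysis, NGS data processing, etc.
Day 5: Foundation technology (3) Machine learning (1)
Modeling methodology (model concepts and cross-validation)
Supervised learning algorithms (linear models, classification)
Foundation technology (3) Machine learning (2): unsupervised learning algorithms (clustering, association analysis, etc.)
Day 6: Supervised learning and bioinformatics applications
Predictive models for medical data
Linear models and classification of healthcare data
Unsupervised learning and bioinformatics applications
Association analysis of clinical data
Comorbidity analysis
Understanding the medical/bio domain
Healthcare datasets and biostatistics
Bio data and machine learning
Schedule
3. Topic Details
Day 7: Foundation technology (4) Deep learning (1): neural network training and deep learning models
Foundation technology (3) Deep learning (2)
TensorFlow
PyTorch
Day 8: Deep learning and bioinformatics applications
Healthcare simulation using Bi-LSTM
Skin disease identification using deep learning
Ontology and bioinformatics applications
Semantic Web and ontologies
Bioinformatics applications of ontology
Day 9: Foundation technology (3) Image processing: overview of image processing and computer vision
Medical image analysis (1)
Segmentation
Image registration
Day 10: Medical image analysis (2)
Electrocardiogram (ECG)
Rendering and Surface Models
MRI
Day 11: Foundation technology (4) Bioinformatics and computational chemistry: overview of computational chemistry
Drug discovery (1)
Target identification
Reagent and assay development
ADME (absorption, distribution, metabolism, excretion)
Toxicology and machine learning applications
Day 12: Foundation technology (5) GAN: GANs (Generative Adversarial Networks) and VAE
Drug discovery and GAN: recommending drug candidates with generative models
Wrap-up: overall summary
Medical image analysis
Drug analysis and drug design
Bio data and deep learning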
8. 1. Basic Concepts and Descriptive Statistics
• 1.1 Statistical concepts
– Everything Varies - Heterogeneity is universal
- “Is the degree of variation statistically meaningful (significant)?”
– Statistical vs. Practical Significance
• Statistical Significance: differences in group means are not likely due to sampling error.
• Practical (or clinical) Significance
9. • Statistical concepts (2)
– Everything Varies
• Heterogeneity is universal
- “Is the degree of variation statistically meaningful (significant)?”
– Statistical vs. Practical Significance
• Statistical Significance
– Differences in group means are not likely due to sampling error.
• Practical (or clinical) Significance
– Practical significance asks the larger question about differences:
– “Are the differences between samples big enough to have real meaning?”
– Generally assessed with a measure of effect size, in 2 categories:
» Difference measures
» Variance-accounted-for measures
10. • 1.2 Descriptive Statistics
– (1) Central tendency: Ungrouped Data
• Mode, Mean, Median
• Percentile, Quantile/Quartile
– (2) Variability: Ungrouped Data
• Range & IQR (Interquartile Range)
• MAD (Mean Absolute Deviation)
• Variance, Standard Deviation
• Population variance vs. sample variance, and standard deviation
• Unbiased estimator
• Z-score
• Coefficient of Variation (CV)
11. – (3) Measures of Shape
• Skewness
– Coefficient of Skewness
• Kurtosis
• Box-and-Whisker Plots
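The measures of central tendency and variability listed above can be sketched with Python's standard `statistics` module. This is a minimal illustration, not course material: the function name `describe` and the sample values are assumptions made for the example.

```python
import statistics

def describe(values):
    """Return common descriptive statistics for an ungrouped sample."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample s.d. (divides by n - 1)
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartiles
    return {
        "mean": mean,
        "median": statistics.median(values),
        "range": max(values) - min(values),
        "iqr": q3 - q1,
        # MAD: mean of absolute deviations from the mean
        "mad": statistics.fmean([abs(v - mean) for v in values]),
        "pop_variance": statistics.pvariance(values),    # divides by n
        "sample_variance": statistics.variance(values),  # divides by n - 1 (unbiased)
        "cv_percent": 100 * sd / mean,                   # coefficient of variation
        "z_scores": [(v - mean) / sd for v in values],
    }

# Hypothetical sample, chosen only for illustration.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
stats = describe(sample)
print(stats["mean"], stats["pop_variance"], stats["sample_variance"])
```

The gap between `pop_variance` and `sample_variance` is the n versus n - 1 divisor, which is what makes the sample variance an unbiased estimator.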
14. – Counting Possibilities
• mn counting rule: m × n
• Sampling from a population with replacement: $N^n$ possibilities
• Combinations (sampling from a population without replacement): $_NC_n = \frac{N!}{n!\,(N-n)!}$
• Expected value and variance
– (a) Expected value
– (b) Variance
• Geometric Mean
17. • 4.1 Overview
• Random variable
• = a variable that contains the outcomes of a chance experiment
• 4.2 Shape of discrete distributions
– Mean or Expected Value
• = long-run average of occurrences
– Variance and Standard Deviation
• 4.3 Binomial distribution
– Binomial formula
– Mean and standard deviation of the binomial distribution
• 4.4 Poisson distribution
– Law of improbable events
λ = long-run average
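The slide names the binomial formula and the Poisson distribution without writing them out. The sketch below uses the standard textbook forms, P(X = x) = C(n, x) p^x q^(n-x) and P(X = x) = λ^x e^(-λ) / x!; the function names and the example parameters are assumptions made for illustration.

```python
import math

def binomial_pmf(x, n, p):
    """P(X = x) for X ~ Binomial(n, p)."""
    return math.comb(n, x) * p**x * (1 - p) ** (n - x)

def poisson_pmf(x, lam):
    """P(X = x) for X ~ Poisson(lam), where lam is the long-run average."""
    return lam**x * math.exp(-lam) / math.factorial(x)

# Hypothetical parameters, for illustration only.
n, p = 20, 0.1
binom_mean = n * p                       # binomial mean: n * p
binom_sd = math.sqrt(n * p * (1 - p))    # binomial s.d.: sqrt(n * p * q)
total = sum(binomial_pmf(x, n, p) for x in range(n + 1))  # should be 1
print(binom_mean, binom_sd, total)
```

With a large n and small p, `poisson_pmf(x, n * p)` gives values close to `binomial_pmf(x, n, p)`, which is one reading of the "law of improbable events".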
18. • 4.5 Hypergeometric distribution
– Overview
• = the probability distribution that arises when sampling without replacement from a finite population
– Use instead of the binomial distribution when:
• (i) Sampling is done without replacement.
• (ii) n ≥ 5% of N
19. (Continuous distributions)
• 4.6 Uniform distribution
• 4.7 Normal distribution
– Overview
• Gaussian distribution
• Probability density function of the normal distribution
– Standardized Normal Distribution
• z score = number of s.d. that a value x is above or below the mean
• z distribution
• 4.8 Normal approximation to the binomial distribution
– Rules of thumb:
• About 99.7% of normal curve values lie within 3 s.d. of the mean
• n · p > 5 and n · q > 5
– Correcting for Continuity
• Converting a discrete distribution into a continuous distribution.
20. • 4.9 Exponential distribution
– Inter-arrival times of random arrivals
• = distribution of the time between random occurrences
• cf. Poisson distribution = random occurrences over some interval
– Probabilities of the exponential distribution
21. • 4.10 χ² distribution
• 4.11 Lognormal distribution
– A distribution whose logarithm follows a normal distribution
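25. 6. Estimation
• Confidence interval estimation (single population)
– Confidence interval estimation using the z statistic (single population) (σ known)
• Point estimation
• 100(1-α)% confidence interval to estimate μ (σ known): $\bar{x} \pm z_{\alpha/2}\,\sigma/\sqrt{n}$
• Finite population correction factor
• When the sample size is small
– So far, mostly n ≥ 30
– Even when n < 30, the z formula applies if the population is normally distributed and σ is known.
– In short: use z when the sample is large (central limit theorem), or when it is small but the population is normal (σ known).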
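A z-based confidence interval for μ with σ known can be sketched as follows. The helper name `z_confidence_interval`, the optional finite population correction argument `N`, and the numbers in the example are assumptions made for illustration.

```python
import math
from statistics import NormalDist

def z_confidence_interval(xbar, sigma, n, confidence=0.95, N=None):
    """100(1 - alpha)% CI for the population mean when sigma is known."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)   # critical value z_{alpha/2}
    se = sigma / math.sqrt(n)                 # standard error of the mean
    if N is not None:
        # Finite population correction factor for sampling from N units.
        se *= math.sqrt((N - n) / (N - 1))
    return xbar - z * se, xbar + z * se

# Hypothetical numbers, for illustration only.
low, high = z_confidence_interval(xbar=85.0, sigma=9.0, n=36)
low_fpc, high_fpc = z_confidence_interval(xbar=85.0, sigma=9.0, n=36, N=500)
print(round(low, 2), round(high, 2))
```

The corrected interval is narrower because the correction factor is below 1 whenever the sample is a non-trivial share of a finite population.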
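26. – Confidence interval estimation using the t statistic (single population) (σ unknown)
• When the population is normally distributed but the population s.d. is unknown, use the t distribution.
– The distribution differs by sample size.
– Assumption of the t statistic: the population is normally distributed
» If the population is not normal, use nonparametric methods
– Property of the t distribution: robust
• Confidence interval for the population mean using the t statistic
– Estimating the population proportion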
27. 7. Hypothesis Testing (Single Population)
• 7.1 Overview
– Good and Bad Hypotheses
• “a good hypothesis is a falsifiable hypothesis.”
- Karl Popper
• Absence of evidence is not evidence of absence.
– Null Hypotheses
• ‘nothing is happening’, e.g. the slope of the relationship is zero.
– Alternative Hypotheses
28. Statistical Hypothesis Testing
• Step 1: State the Null Hypothesis (H0)
• Step 2: State the Alternative Hypothesis
• Step 3: Set 𝛼
• Step 4: Collect Data
Decision  | In reality: H0 is TRUE                         | In reality: H0 is FALSE
Accept H0 | correct                                        | Type II Error (β = probability of Type II Error)
Reject H0 | Type I Error (α = probability of Type I Error) | correct
29. • Step 5: Calculate a test statistic
• F_calculated
• Step 6: Construct Acceptance / Rejection regions
• Step 7: Based on steps 5 & 6, draw a conclusion about H0
• If F_calculated from the data > F_α, then you are in the rejection region and you can reject H0 with a (1-α) level of confidence.
30. – Rejection and Nonrejection Regions
– Type I and Type II Errors
31. • 7.2 Hypothesis test of the population mean using the z statistic (σ known)
– z test for a single mean
– Test for the mean of a finite population
– Hypothesis testing using the p-value
• p-value = observed level of significance
– defines the smallest value of 𝛼 for which H0 can be rejected.
• “H0 can be rejected only if α is greater than p”
– Hypothesis testing using the critical value method
• Rejecting H0 using p-values
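The p-value method and the critical value method lead to the same decision, which a short two-sided z test can show. This is a sketch under assumptions: the function name `z_test` and the sample figures are hypothetical.

```python
import math
from statistics import NormalDist

def z_test(xbar, mu0, sigma, n, alpha=0.05):
    """Two-sided z test of H0: mu = mu0 when sigma is known."""
    std_normal = NormalDist()
    z = (xbar - mu0) / (sigma / math.sqrt(n))      # test statistic
    p_value = 2 * (1 - std_normal.cdf(abs(z)))     # observed significance level
    z_crit = std_normal.inv_cdf(1 - alpha / 2)     # critical value
    reject_by_p = p_value < alpha                  # p-value method
    reject_by_crit = abs(z) > z_crit               # critical value method
    return z, p_value, reject_by_p, reject_by_crit

# Hypothetical data, for illustration only.
z, p_value, reject_by_p, reject_by_crit = z_test(xbar=52.0, mu0=50.0, sigma=6.0, n=36)
print(round(z, 3), round(p_value, 4), reject_by_p, reject_by_crit)
```

Here p is about 0.0455, so H0 is rejected at α = 0.05 but would not be rejected at α = 0.01: the p-value is the smallest α at which rejection is possible.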
32. • p values vs. Effect sizes
– p values
• are calculated on the assumption that the H0 is true.
• p values are about the size of the test statistic.
• = an estimate of the probability that a value of the test statistic,
or a value more extreme than this, could have occurred by
chance when the null hypothesis is true.
– Effect sizes
• = measure of strength of a phenomenon
– r², regression coefficients, … → magic criteria
33. • 7.3 Hypothesis test of the population mean using the t statistic (σ unknown)
– (…)
• z Test of a Population Proportion
– Hypothesis testing using the critical value method
• Rejecting H0 using p-values
• 7.4 Hypothesis tests about a proportion
– […]
• Using p-value
• Using the critical value method
34. • 7.5 Hypothesis tests about a variance
• Table χ² vs. observed χ²
• H0 can also be tested by the critical value method.
• Compute s² using the critical χ² value for alpha instead of the observed χ² value → critical sample variance ($s_c^2$)
• 7.6 Type II Errors
36. Regression Analysis
• Overview
– An equation relating a single numeric D.V. (the value to be predicted) to one or more numeric I.V.s (predictors).
– "regression" = process of fitting lines to data (Galton)
– Uses:
• Numeric prediction
• Also hypothesis testing, checking whether assumptions hold, etc.
• Applies to a variety of models
– SLR
– MLR
– GLM
• Link functions
• Logistic regression, Poisson regression, …
37. • Simple regression analysis
– Correlation and simple regression analysis
– OLS (ordinary least squares)
– Determining the equation of the regression line
• deterministic model: y = β0 + β1x
• probabilistic model: y = β0 + β1x + ε
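39. – Standard error of the estimate
• For error analysis, use the standard error of the estimate instead of computing residuals (= the estimation error for each data point) one by one.
– SSE (Error Sum of Squares)
– A better measure: standard error of the estimate (se) = standard deviation of the residuals in the regression model
– Normal distribution empirical rule: 68% of values lie within μ ± 1σ and 95% within μ ± 2σ.
» Regression also assumes that, for a given x, the error terms are normally distributed.
» Since the error terms are normal, se is the s.d. of the errors, and the average error is 0:
• 68% of the error values (residuals) should be within 0 ±1se
• 95% of the error values (residuals) should be within 0 ±2se.
– se provides a single measure of the magnitude of errors in the model.
– Also used to identify outliers (e.g., outside ±2se or ±3se)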
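A simple OLS fit with SSE, the standard error of the estimate, and a ±2se outlier check can be sketched in plain Python. The function name `simple_ols`, the n - 2 divisor (the usual choice for simple regression), and the toy data are assumptions made for illustration.

```python
import math

def simple_ols(x, y):
    """Fit y = b0 + b1*x by ordinary least squares and summarize the errors."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    ss_xy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    ss_xx = sum((xi - xbar) ** 2 for xi in x)
    ss_yy = sum((yi - ybar) ** 2 for yi in y)
    b1 = ss_xy / ss_xx                 # slope
    b0 = ybar - b1 * xbar              # intercept
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    sse = sum(e * e for e in residuals)        # error sum of squares
    se = math.sqrt(sse / (n - 2))              # standard error of the estimate
    r2 = 1 - sse / ss_yy                       # coefficient of determination
    outliers = [i for i, e in enumerate(residuals) if abs(e) > 2 * se]
    return {"b0": b0, "b1": b1, "sse": sse, "se": se, "r2": r2, "outliers": outliers}

# Toy data, for illustration only.
fit = simple_ols([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(fit["b0"], fit["b1"], fit["se"], fit["r2"])
```

For this toy data no residual exceeds 2se, so `outliers` is empty; with real data the same check flags points worth inspecting rather than points to delete.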
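40. – Coefficient of determination
• R² = how much of the variability of the D.V. (y) is explained by the I.V. (x)
» ranges from r² = 0 to r² = 1
– Variability of the D.V. (y) measured as a sum of squares: SSyy
» Dividing each term of SSyy = SSR + SSE by SSyy gives 1 = SSR/SSyy + SSE/SSyy
– r² is the proportion of y variability explained by the regression model: r² = SSR/SSyy = 1 − SSE/SSyy
• Relationship between r and R²
– R² = (r)²
– Hypothesis test of the regression slope & testing the overall model
• Slope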
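41. Multiple Regression Analysis
• Multiple regression model with independent variables (First Order)
– y = β0 + β1x1 + β2x2 + ε
– The constant and coefficients are estimated from the sample:
ŷ = b0 + b1x1 + b2x2 → response surface / response plane
• Significance tests for the regression model and coefficients
– <Analyzing the adequacy of the regression model>
– Testing the overall model
• Simple regression: t test of the slope of the regression line to see if it is ≠ 0 (i.e., whether the I.V. contributes significantly to predicting the D.V.)
• Multiple regression: an analogous test makes use of the F statistic.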
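42. – Significance tests for regression coefficients
• t test for each regression coefficient
– H0: β1 = 0, H0: β2 = 0, …, H0: βk = 0
– Ha: β1 ≠ 0, Ha: β2 ≠ 0, …, Ha: βk ≠ 0
– Degrees of freedom for each individual coefficient test = n - k - 1.
– Residuals, standard error of the estimate, and R²
• Residual (= error of the regression model)
– Uses: outlier detection, checking the assumptions of regression analysis
• SSE and the standard error of the estimate
– = also called the standard error of estimation, or the standard error of the differences
– = how widely the points in the scatter plot are dispersed around the best-fit line
– = indicates how far the actual y values are spread around ŷ (the regression line)
– SSE = Σ(y - ŷ)²
• Regression assumption (error terms normally distributed with mean 0) + empirical rule (about 68% of residuals within ±1se, 95% within ±2se)
→ the standard error of the estimate is useful for measuring how well the model fits the data.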
43. Key Issues
• (1) Is there a relationship between the response and the predictors?
– Hypothesis test
• H0: β1 = β2 = ··· = βp = 0
• Ha: at least one βj is non-zero.
– Compute the F-statistic: $F = \frac{(TSS - RSS)/p}{RSS/(n - p - 1)}$
• where $TSS = \sum (y_i - \bar{y})^2$ and $RSS = \sum (y_i - \hat{y}_i)^2$.
– If H0 is true (= no relationship between response and predictors), then F is close to 1.
– If Ha is true, then E{(TSS - RSS)/p} > σ², so we expect F > 1.
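The overall F-statistic can be computed directly from TSS and RSS. The sketch assumes the usual form F = ((TSS - RSS)/p) / (RSS/(n - p - 1)), with n observations and p predictors; the function name and the input numbers are hypothetical.

```python
def overall_f_statistic(tss, rss, n, p):
    """F-statistic for H0: beta_1 = ... = beta_p = 0."""
    explained_per_predictor = (tss - rss) / p    # numerator mean square
    residual_variance = rss / (n - p - 1)        # estimate of sigma^2
    return explained_per_predictor / residual_variance

# Hypothetical sums of squares: n = 5 observations, p = 1 predictor.
f_value = overall_f_statistic(tss=6.0, rss=2.4, n=5, p=1)
print(f_value)
```

A value near 1 is what H0 predicts; how far above 1 counts as significant depends on n and p, so the value is compared with an F distribution on (p, n - p - 1) degrees of freedom.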
44. • (2) Deciding the importance of each variable
– Variable Selection
• Mallows's Cp,
• Akaike information criterion (AIC),
• Bayesian information criterion (BIC),
• adjusted R²
– However, there are $2^p$ possible models
• Forward selection
• Backward selection
• Mixed selection
45. • (3) Model Fit
– In SLR, R² = the square of the correlation between the response and the predictor
– In MLR, it equals Cor(Y, Ŷ)²
– A property of the fitted linear model: it maximizes this correlation among all possible linear models.
– The p-value is used to quantify the degree of improvement in R²
– Definition of RSE: $RSE = \sqrt{RSS/(n - p - 1)}$
• Hence models with more variables can have a higher RSE if the decrease in RSS is small relative to the increase in p.
46. • (4) Predictions
• Even if the true values of β0, β1, ..., βp were known, perfect prediction would be impossible because of random error (i.e., irreducible error).
– confidence interval
– prediction interval
48. R
• Aspects of the R language
– R as a mathematical/statistical analysis tool
– R as a programming language
– R as a visualization tool
• R and AI/deep learning
– Machine learning and predictive analysis
– Keras with R
• Cheatsheet
– https://iqss.github.io/dss-workshops/R/Rintro/base-r-cheat-sheet.pdf
– https://www.rstudio.com/resources/cheatsheets/
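52. Basic Concepts of Analysis of Variance
• Basic concepts
• Analysis of variance (ANOVA): tests whether means differ
• Experimental design: how the data are to be collected
• Factor: the variable directly manipulated in the experiment
– Depending on the number of factors: one-way, two-way, or multi-way layout
• Level = a condition of the factor under which the experiment is run
• Treatment = a level of a factor
• Response value = the value obtained as data after the experiment is run
– [Assumptions required for ANOVA]
• Independence: observations in the sample at each level are independent of one another
• Normality: observations are normally distributed
• Homogeneity of variance: the population variances are equal
– Diagnostic tests
• Residual Analysis
53. • Example: lifetime data for LED lamps produced by three lamp manufacturers
– Examples of factor, level, and response
By number of factors:
▪ One factor: one-way layout
– Completely randomized design (CRD)
– With k levels and r replications per level, divide the whole experiment into k × r runs and assign them at random, e.g. using a random number table or by drawing lots.
▪ Two factors: two-way layout
▪ Three or more factors: multi-way layout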
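54. Principle of Analysis of Variance
• Basic principle
– $H_0: \mu_1 = \mu_2 = \cdots = \mu_k$ (all group means are equal)
– $H_a: \mu_i \neq \mu_j$ for some i ≠ j (at least two means differ)
– Example: one-way layout data with k levels and $n_i$ replications at level i
– Table on the next page:
• $Y_{ij}$ is the j-th observation at the i-th level of factor A.
• The structural model for one-way layout data $Y_{ij}$ is:
• $Y_{ij} = \mu_i + e_{ij}$, $i = 1, 2, \ldots, k$, $j = 1, 2, \ldots, n_i$
• $\mu_i$ = mean of the i-th level of factor A
• The $e_{ij}$ are independent and normally distributed with mean 0 and variance $\sigma^2$.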
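56. • The deviation $Y_{ij} - \bar{Y}_{..}$ of each observation $Y_{ij}$ from the grand mean $\bar{Y}_{..}$ can be split into two parts.
• $Y_{ij} - \bar{Y}_{..} = (Y_{ij} - \bar{Y}_{i.}) + (\bar{Y}_{i.} - \bar{Y}_{..})$
• Squaring both sides and summing over $\sum_{i=1}^{k}\sum_{j=1}^{n_i}$, the cross-product term is 0, so:
– $\sum_{i=1}^{k}\sum_{j=1}^{n_i}(Y_{ij} - \bar{Y}_{..})^2 = \sum_{i=1}^{k}\sum_{j=1}^{n_i}(Y_{ij} - \bar{Y}_{i.})^2 + \sum_{i=1}^{k}\sum_{j=1}^{n_i}(\bar{Y}_{i.} - \bar{Y}_{..})^2$
• $\sum_{i=1}^{k}\sum_{j=1}^{n_i}(Y_{ij} - \bar{Y}_{..})^2$ measures how far each observation is spread from the grand mean; it is called the total sum of squares and written SST.
• $\sum_{i=1}^{k}\sum_{j=1}^{n_i}(Y_{ij} - \bar{Y}_{i.})^2$ is the within-group (within-level) variation: it measures how far the observations are spread around the mean of the i-th level; it is called the error sum of squares and written SSE.
• $\sum_{i=1}^{k}\sum_{j=1}^{n_i}(\bar{Y}_{i.} - \bar{Y}_{..})^2$ is the between-group (between-level) variation: it measures how far the mean of the i-th level lies from the grand mean; it is called the treatment sum of squares and written SSA.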
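57. – The sum of squared deviations over all observations can be expressed as follows.
SST = SSE + SSA
(total sum of squares = error sum of squares + treatment sum of squares; total variation = within-group variation + between-group variation)
– Degrees of freedom: SST: $N - 1$; SSE: $(n_1 - 1) + (n_2 - 1) + \cdots + (n_k - 1) = \sum_{i=1}^{k}(n_i - 1) = N - k$; SSA: $k - 1$
– MS (Mean Square) = each sum of squares divided by its degrees of freedom.
• $MSA = \frac{SSA}{k-1}$, $MSE = \frac{SSE}{N-k}$
– The higher the share of SST that comes from the treatment effect SSA, the larger the differences among the population means.
– As SSA grows, $SSA/SSE$ grows. The test uses the F statistic, which follows an F distribution with k-1 numerator and N-k denominator degrees of freedom when the null hypothesis is true.
– $F = \frac{SSA/(k-1)}{SSE/(N-k)} = \frac{MSA}{MSE}$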
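58. – ANOVA table
– MS (Mean Square) = each sum of squares divided by its degrees of freedom: $MSA = \frac{SSA}{k-1}$, $MSE = \frac{SSE}{N-k}$
– The higher the share of SST that comes from SSA, which represents the treatment effect, the more likely it is that the population means differ.
– The test uses the F statistic; when H0 is true it follows an F distribution with k-1 numerator and N-k denominator degrees of freedom: $F = \frac{SSA/(k-1)}{SSE/(N-k)} = \frac{MSA}{MSE}$
Source         | Sum of squares | d.f. | Mean square       | F
Between groups | SSA            | k-1  | MSA = SSA/(k-1)   | F = MSA/MSE
Within groups  | SSE            | N-k  | MSE = SSE/(N-k)   |
Total          | SST            | N-1  |                   |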
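59. – The following shortcut formulas are convenient for computing the ANOVA table.
◼ $N = n_1 + n_2 + \cdots + n_k$
◼ Sum of all observations: $T = \sum_{i=1}^{k}\sum_{j=1}^{n_i} Y_{ij}$
◼ Sum of all observations at level i: $T_i = \sum_{j=1}^{n_i} Y_{ij}$
◼ CT (correction term) = $\frac{T^2}{N}$
– With these, SST, SSA, and SSE are easy to obtain:
– $SST = \sum_{i=1}^{k}\sum_{j=1}^{n_i} Y_{ij}^2 - CT$
– $SSA = \sum_{i=1}^{k} \frac{T_i^2}{n_i} - CT$
– $SSE = \sum_{i=1}^{k}\sum_{j=1}^{n_i} Y_{ij}^2 - \sum_{i=1}^{k} \frac{T_i^2}{n_i} = SST - SSA$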
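The shortcut formulas translate almost line for line into code. The sketch below builds a one-way ANOVA table from raw groups, dividing each squared level total by that level's own sample size; the function name `one_way_anova` and the three small groups are assumptions made for illustration.

```python
def one_way_anova(groups):
    """One-way ANOVA via the correction-term shortcut formulas."""
    k = len(groups)                          # number of levels
    N = sum(len(g) for g in groups)          # total number of observations
    T = sum(sum(g) for g in groups)          # grand total
    ct = T**2 / N                            # correction term CT = T^2 / N
    sst = sum(y * y for g in groups for y in g) - ct
    ssa = sum(sum(g) ** 2 / len(g) for g in groups) - ct   # uses T_i^2 / n_i
    sse = sst - ssa
    msa = ssa / (k - 1)
    mse = sse / (N - k)
    return {"SST": sst, "SSA": ssa, "SSE": sse,
            "MSA": msa, "MSE": mse, "F": msa / mse, "df": (k - 1, N - k)}

# Hypothetical data: three levels with three observations each.
table = one_way_anova([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(table)
```

The resulting F is then compared with the critical value of an F distribution on the reported degrees of freedom (here 2 and 6).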
62. Multiple Comparisons
• Overview
– When the F test in ANOVA rejects the null hypothesis, the factor-level means are paired off and compared two at a time to test which treatment means differ significantly.
• Methods
– Bonferroni method
– Duncan method
– Tukey method
– Least significant difference (LSD) method
63. • Tukey Test for Pairwise Mean Comparisons
– Step 1: Compute Tukey’s w value
– Step 2: Rank the means, calculate differences
64. • Other pairwise mean comparison methods
– Compare all possible pairs of means, two at a time, as t-tests.
• Unlike an ordinary two-sample t-test, however, the method does rely on the experiment-wide error (the MSE).
• Standard error for the difference between two treatment means ($s_{\bar{d}}$ or SE)
• Fisher’s Protected Least Significant Difference (LSD).
68. • Dunnett
Comparisons significant at the 0.05 level are indicated by ***.
Fertilizer comparison | Difference between means | Simultaneous 95% confidence limits | Significant
F3 - Control          | 8.200                    | 5.638 to 10.762                    | ***
F1 - Control          | 7.600                    | 5.038 to 10.162                    | ***
F2 - Control          | 4.867                    | 2.305 to 7.429                     | ***
69. Contrast Analysis
• Purpose: analysis over a broader scope
– e.g., comparing groups of treatment levels, or testing for trends that prompt regression modeling to express the response vs. treatment relationship with treatment as a numerical predictor
• 1-factor ANOVA: a linear contrast is a linear combination of the treatment means whose numerical coefficients add to 0
73. Univariate Random Variables
• Discrete Random Variable
– Random variable X(ξ)
• = a single-valued real function that assigns a real number (the value of X(ξ)) to each sample point ξ of S.
77. • Discrete Random Variable and PMF
– PMF = the probability that a discrete random variable X takes a particular value x, i.e. P(X = x); also called the probability function or frequency function
• Properties
– (1) $P(X = x) = f(x) > 0$ if x is in the support S
– (2) $\sum_{x \in S} f(x) = 1$
– (3) $P(X \in A) = \sum_{x \in A} f(x)$
• Types
– Finite, Countably infinite
• PMF in the multivariate case
– Joint Probability Distribution
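78. • Continuous Random Variable and PDF
– When X is a continuous random variable, the PDF is the function f(x) that describes its continuous probability distribution
– fX(x) is the PDF of the continuous random variable X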
79. • Joint PDF of Multiple RV
– Marginal PDF
– Joint PDF
• Conditioning
• Bayes' Rule
80. • Univariate Moment
– = a specific quantitative measure of the shape of a function
– Mean, variance
Moment ordinal | Moment: Raw | Moment: Central | Moment: Normalised                  | Cumulant: Raw | Cumulant: Standardised
1              | Mean        | 0               | 0                                   | Mean          | N/A
2              | -           | Variance        | 1                                   | Variance      | 1
3              | -           | -               | Skewness                            | -             | Skewness
4              | -           | -               | (Non-excess or historical) kurtosis | -             | Excess kurtosis
5              | -           | -               | Hyperskewness                       | -             | -
6              | -           | -               | Hypertailedness                     | -             | -
7              | -           | -               | -                                   | -             | -
82. • Correlation
– Population correlation of two r.v. x and y
– Sample correlation
• rxy is related to the cosine of the angle between two vectors.
83. Multivariate Random Variables
• Bivariate Random Variables
– Two r.v.'s, X and Y, are defined on the sample space S of a random experiment.
– → (X, Y) = bivariate r.v. (or 2-D random vector).
– → Range space of bivariate r.v. (X, Y) is denoted by RXY & defined by
– If r.v.'s X & Y are discrete r.v.'s, then (X, Y) is a discrete bivariate r.v.
– if X & Y are continuous r.v.'s, then (X, Y) is a continuous bivariate r.v.
– If one of X and Y is discrete while the other is continuous, then (X, Y)
is called a mixed bivariate r.v.
84. Mean Vectors for Multivariate Random Variables
• Sample
– Let y represent a random vector of p variables measured on a
sampling unit (subject or object).
– If there are n individuals in the sample, the n observation
vectors are denoted by y1, y2, . . . , yn, where
• Population
85. Covariance Matrices for Multivariate Random Variables
• Sample covariance matrix S
• Population covariance matrix
86. Covariance & Correlation Coefficient
– (k, n)th moment of a bivariate r.v. (X, Y) is defined by
– If n = 0, we obtain kth moment of X, and
if k = 0, we obtain the nth moment of Y.
– If (X, Y) is a discrete bivariate r.v., then
87. – Similarly,
– If (X, Y) is a continuous bivariate r.v., then
88. – (1, 1)th joint moment of (X, Y) is the correlation of X and Y.
• If E(XY) = 0, then we say that X and Y are orthogonal.
• The covariance of X and Y, Cov(X, Y) or σXY, is defined by
– If Cov(X, Y) = 0, then we say that X and Y are uncorrelated.
89. – If X and Y are independent, then they are uncorrelated, but the
converse is not true in general;
– the fact that X and Y are uncorrelated does not, in general, imply
that they are independent.
– The correlation coefficient, denoted by ρ(X, Y) or ρXY, is defined by
– Correlation coefficient of X and Y is a measure of linear dependence
between X and Y.
91. Matrices for Subsets of Variables
• Mean vector & covariance for subset matrices
– 2 Subsets
– 3+ subsets
92. N-Variate Random Variables
• Concept
– n-tuple of r.v.'s (X1, X2, . . . , Xn) is an n-variate r.v. (n-D r.v.) if
each Xi, i = 1, 2, ... , n, associates a real number with every
sample point ξ ∈ S. Thus, an n-variate r.v. is simply a rule
associating an n-tuple of real numbers with every ξ ∈ S.
– Let (X1, . . . , Xn) be an n-variate r.v. on S. Then its joint cdf is
96. ANOVA model
– Yij = observation for subject j in group i
– ni = Number of subjects in group i
– N = n1 + n2 + ... + ng
• Assumptions
– E(Yij) = μi
– var(Yij) = σ²
– Independence
– Normality
• Under H0: $F \sim F_{g-1,\,N-g}$
97. Multi-Factor ANOVA
• Factorial or Crossed Treatment Design
– In Multi-factor experiments combinations of treatments are
applied to experimental units.
– In a factorial design, each level of every treatment is
combined with each level of all other treatments.
• With the addition of crossed factors the number of experimental
units increases very quickly and so tough decisions have to be
made regarding the number of treatments and the number of
levels of each treatment.
98. Figure panels:
– No effect of Factor A, a small effect of Factor B (if there were no effect of Factor B the two lines would coincide), and no interaction between Factor A and Factor B.
– Large effect of Factor A, small effect of Factor B, and no interaction.
– No effect of Factor A, larger effect of Factor B, and no interaction.
– Large effect of Factor A, large effect of Factor B, and no interaction.
99. Figure panels:
– No effect of Factor A, no effect of Factor B, but an interaction between A and B.
– Large effect of Factor A, no effect of Factor B, with a slight interaction.
– No effect of Factor A, a large effect of Factor B, with a very large interaction.
– An effect of Factor A, a large effect of Factor B, with a large interaction.
100. Additive Model (No Interaction)
• In a factorial design we first test the interactions for significance.
– If the interaction is not significant, we can drop the interaction term from our model, and we end up with an additive model.
• For a two-factor factorial, the model we initially consider is: $Y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk}$
• Note that the interaction term (αβ)ij is a multiplicative term.
• If the interaction is found to be non-significant, then the model reduces to: $Y_{ijk} = \mu + \alpha_i + \beta_j + \varepsilon_{ijk}$
– Here we can see that the response variable is simply a function of adding the effects of the factors.
101. Crossed and Nested Factors
• Single-factor studies
• Multifactor studies
Crossed Factors and Nested Factors - Chemical Yield
102. • Crossed - Nested Designs
– Multi-factor studies can involve treatment combinations
• → some are crossed with other factors, and some are nested
within other factors.
– Statistical model
• contains both crossed and nested effects:
– ANOVA table
Source df
Factor A a - 1
Factor B(A) a(b - 1)
Factor C c-1
AC (a-1)(c-1)
BC a(b-1)(c-1)
Error abc(n-1)
Total (nabc)-1
103. ANCOVA
• Concept
– → evaluates whether the means of a dependent variable are
equal across different groups while statistically controlling
effects of other variables that are not of primary interest
(covariates)
• Used to control variable (covariate):
– used when we suspect that the variance of the dependent
variable is not solely explained by the group variable
– (1) Systematic bias → When the members in each group were
not selected randomly, which leads to bias of test results
• (2) Within-Group error SS → Variance due to individual differences
among subjects in a group
105. • How Covariance Analysis Reduces Error
(a) Error Variability with Single-factor Analysis of Variance Model
(b) Error Variability with Covariance Analysis Model
108. MANOVA
– Yijk = Observation for variable k from subject j in group i.
– Assumptions
• Data from group i have a common mean vector μi
• Data from all groups have common covariance matrix Σ.
• Independence: The subjects are independently sampled.
• Normality: The data are multivariate normally distributed.
Ha: μi ≠ μj for at least one i ≠ j
109. • mean vector for treatment i:
• mean vector for block j:
• grand mean vector:
• Total Sum of Squares and Cross Products Matrix.
H = Treatment SSCP matrix;
B = Block SSCP matrix;
E = Error SSCP matrix.
110. • (k,l)th element of Treatment SSCP matrix H
– If k = l, is treatment SS for k, and measures variation among treatments.
– If k ≠ l, this measures how k and l vary together across treatments.
• (k,l)th element of Block SSCP matrix B is
– For k = l, is block SS for k, and measures variation among blocks.
– For k ≠ l, this measures how variables k and l vary together.
• (k,l)th element of the Error SSCP matrix E is
– For k = l, is the error SS for k, and measures variability within treatment
and block combinations of variable k.
– For k ≠ l, this measures association or dependence between k and l
111. • Notations
– Sample Mean Vector
– Grand Mean Vector
• is comprised of grand means for each p variables
– Total SSCP
112. • Two Types of MANOVAs
– (1) One-Way MANOVA (One group variable)
– (2) Factorial MANOVA (more than one group variable)
113. • One-Way MANOVA (One group variable)
– (EX) Student grades of 4 countries: H0: μCan = μUS = μMex = μPan
– Calculate F approximations of 4 statistics and look up F-table
114. • Factorial MANOVA (more than one group variable)
– 3 Types of Sum of Squares
• (1) Type I SS → for Balanced data
• (2) Type II SS → Most powerful when no significant interaction terms
• (3) Type III SS → when there is a significant interaction term
– H0:
• (1) Group variable A/B do not significantly influence the means of the
outcome variables
• (2) The interaction of group variables A and B do not significantly
influence mean of outcome variables
116. • MANOVA Assumptions
– (1) Independent Observations
– (2) Normality
• Test using the Shapiro-Wilk test
– (3) Equal Variance-Covariance Matrices Between Groups
• Test using Box's M test
117. • Test statistic
– Wilks Lambda: To test H0:treatment mean vectors are equal,
• reject H0 if Wilks lambda is small (close to zero).
– Hotelling-Lawley Trace
• reject H0 if this test statistic is large.
– Pillai Trace
• reject H0 if this test statistic is large.
– Roy's Maximum Root: Largest eigenvalue of $HE^{-1}$
• reject the null hypothesis if this test statistic is large.
118. Effect Size
• Partial η² values
– % of variance explained by the group variable (i.v.)
• Partial η² = $1 - \Lambda^{1/S}$
• S = min(P, df_effect)
– P = number of dependent variables
– df_effect = d.f. for the effect tested (independent variable)
• One-way MANOVA
– Example: Baumann Education Data - Group variable: Education
• Λ = 0.63202
• S = min(P = 3, df_effect = 2) = 2
• Partial η² = $1 - 0.63202^{1/2} \approx 0.205$
– 20.5% of the variance in the grades on the 3 tests taken by the students is due to the difference in education style
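The 20.5% figure can be reproduced from Wilks' lambda with a few lines of code. The function name `partial_eta_squared` is an assumption made for this sketch; the inputs are the values quoted on the slide.

```python
def partial_eta_squared(wilks_lambda, n_dependent, df_effect):
    """Partial eta squared from Wilks' lambda: 1 - lambda ** (1 / S)."""
    s = min(n_dependent, df_effect)      # S = min(P, df_effect)
    return 1 - wilks_lambda ** (1 / s)

# Values from the slide: lambda = 0.63202, P = 3, df_effect = 2.
eta2 = partial_eta_squared(0.63202, n_dependent=3, df_effect=2)
print(round(eta2, 3))
```

Because S = 2 here, the exponent is a square root; a single dependent variable or a two-group comparison would give S = 1 and reduce the formula to 1 - Λ.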
119. RBD: 2-way MANOVA
• Within randomized block designs, we have two factors:
– Blocks, and
– Treatments
• RBD with a treatments + b blocks is constructed in 2
steps:
– The experimental units (the units to which our treatments are
going to be applied) are partitioned into b blocks, each
comprised of a units.
– Treatments are randomly assigned to the experimental units in
such a way that each treatment appears once in each block.
• In general, partitioning the units into blocks has the following effects:
– Units within blocks are as uniform as possible.
– Differences between blocks are as large as possible.
120. 2-way MANOVA Additive Model
• Assumptions
– Error vectors εij have zero population mean;
– Error vectors εij have common variance-covariance matrix Σ — (the usual
assumption of a homogeneous variance-covariance matrix)
– Error vectors εij are independently sampled;
– Error vectors εij are sampled from a multivariate normal distribution;
– No block by treatment interaction. This means that the effect of the treatment
is not affected by, or does not depend on the block.
• Treatment mean vector for treatment i:
H = treatment SSCP
B = Block SSCP
E = Error SSCP
121. MANCOVA
• Concept
– $\bar{Y}_{j(adj)} = \bar{Y}_j - b_w(\bar{X}_j - \bar{X})$
– $\bar{Y}_{j(adj)}$ = adjusted d.v. mean in group j (j = 1, 2, ...; total no. of groups)
– $\bar{Y}_j$ = d.v. mean in group j before adjustment
– $b_w$ = common regression coef. in entire sample
– $\bar{X}_j$ = mean of covariate variable for group j
– $\bar{X}$ = covariate mean for entire sample
– $H_0: \bar{Y}_{1(adj)} = \bar{Y}_{2(adj)} = \cdots = \bar{Y}_{j(adj)}$
122. Assumptions
• For ANOVA
– (1) Observations independent from each other
– (2) Population variances of groups are equal
– (3) Dependent variable normal
• ANCOVA assumptions include assumptions of ANOVA plus:
– (4) Continuous dependent variable and a fixed independent group variable with mutually exclusive membership
– (5) Linear relationship between the covariate and the dependent variable
– (6) Covariate is related to the dependent variable, not to the group variable
– (7) Regression lines for the groups are parallel (check by introducing an interaction term of group variable and covariate)
– (8) Homoscedasticity of regression slopes (check using the MSE from separate group regressions)
125. AI-Bio Convergence Professional Course
2022-8~10
윤형기 (hky@openwith.net)
Day 3
129. Overview
• What is meta-analysis?
– an “analysis of analyses” (Glass 1976) - to combine, summarize and interpret all available evidence pertaining to a clearly defined research field or research question (Lipsey and Wilson 2001).
– Purpose = the statistical synthesis of the data
• Background
– (1) Traditional/Narrative Reviews.
• narrative reviews by experts → biases
– (2) Systematic Reviews
• try to summarize evidence using clearly defined and transparent rules,
assessing the validity of evidence using predefined standards and
present a synthesis of outcomes in a systematic way.
– (3) Meta-Analyses.
• aim to combine results from previous studies in a quantitative way.
• quantify the effect of a medication, the prevalence of a disease, or
the correlation between two properties, across all studies
130. • Major Bibliographic Databases
– Representative databases
PubMed: Openly accessible database of the US National Library of Medicine. Primarily contains biomedical research.
PsycInfo: Database of the American Psychological Association. Primarily covers research in the social and behavioral sciences.
Cochrane Central Register of Controlled Trials (CENTRAL): Openly accessible database of the Cochrane Collaboration. Primarily covers health-related topics.
Embase: Database of biomedical research maintained by the large scientific publisher Elsevier. Requires a license.
ProQuest International Bibliography of the Social Sciences: Database of social science research. Requires a license.
Education Resources Information Center (ERIC): Openly accessible database on education research.
131. – Citation Databases
Web of Science: Interdisciplinary citation database maintained by Clarivate Analytics. Requires a license.
Scopus: Interdisciplinary citation database maintained by Elsevier. Requires a license.
Google Scholar: Openly accessible citation database maintained by Google. Has only limited search and reference retrieval functionality.
– Dissertations
ProQuest Dissertations: Database of dissertations. Requires a license.
– Study Registries
WHO International Clinical Trials Registry Platform (ICTRP): Openly accessible database of clinical trial registrations worldwide. Can be used to identify studies that have not (yet) been published.
OSF Registries: Openly accessible interdisciplinary database of study registrations. Can be used to identify studies that have not (yet) been published.
132. • Fields of use
– Medicine, psychology, criminology, business, …
• Main method
– meta-analyses of effect sizes
• Pitfalls
– “Apples and Oranges” Problem
– “Garbage In, Garbage Out” Problem
– “File Drawer” Problem
– “Researcher Agenda” Problem
134. • Effect size
• Compute an effect size for each study, assess the consistency of the effect across studies, and compute a summary effect.
• The effect size represents any relationship between two variables, e.g. the impact of an intervention such as a medical treatment.
• Here, a risk ratio < 1.0 means the risk was lower in the high-dose group
• Precision
• the effect size for each study is bounded by a confidence interval,
reflecting the precision with which the effect size has been estimated
in that study.
• Study weights
• the size of each square reflecting the weight that is assigned to the
corresponding study when we compute the summary effect.
• relationship between a study’s precision and that study’s weight -
Since precision is driven primarily by sample size, we can think of the
studies as being weighted by sample size
• p - values
135. • Fixed effects vs. Random effects
• Under fixed-effect model,
– assume that all studies in the analysis share the same true effect size,
and the summary effect is our estimate of this common effect size.
• Under random-effects model,
– assume that the true effect size varies from study to study, and the
summary effect is our estimate of the mean of the distribution of
effect sizes.
• Precision
• The location of the diamond represents the summary effect size while its
width reflects the precision of the estimate.
• The precision addresses the accuracy of the summary effect as an
estimate of the true effect.
• p-values
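The two models can be compared with a short Python sketch. The fixed-effect summary is an inverse-variance weighted mean. For the random-effects summary, the between-study variance is estimated here with the DerSimonian-Laird method, which is one common choice and an assumption of this sketch, since the slide does not name an estimator. Function names and input values are illustrative.

```python
def fixed_effect(effects, variances):
    """Inverse-variance weighted mean and its variance (one common true effect)."""
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    mean = sum(w * y for w, y in zip(weights, effects)) / total
    return mean, 1.0 / total

def random_effects(effects, variances):
    """Random-effects summary using the DerSimonian-Laird tau^2 (assumed estimator)."""
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    fixed_mean, _ = fixed_effect(effects, variances)
    # Q: weighted squared deviations of study effects from the fixed-effect mean
    q = sum(w * (y - fixed_mean) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    c = total - sum(w * w for w in weights) / total
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
    # Each study's variance now includes the between-study component tau^2
    star = [1.0 / (v + tau2) for v in variances]
    star_total = sum(star)
    mean = sum(w * y for w, y in zip(star, effects)) / star_total
    return mean, 1.0 / star_total, tau2

# Illustrative (made-up) effect sizes and within-study variances
effects = [0.1, 0.3, 0.5]
variances = [0.01, 0.01, 0.01]
fe_mean, fe_var = fixed_effect(effects, variances)
re_mean, re_var, tau2 = random_effects(effects, variances)
print(f"fixed:  mean={fe_mean:.3f} var={fe_var:.4f}")
print(f"random: mean={re_mean:.3f} var={re_var:.4f} tau2={tau2:.3f}")
```

With these inputs both models give the same point estimate, but the random-effects variance is larger, so the diamond on a forest plot would be wider.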
136. • Heterogeneity of effect sizes
• the treatment effect is usually NOT consistent across all studies
– Task: assess the dispersion of effect sizes from study to study
• If the effect size is consistent,
– focus on the summary effect, and note that this effect is robust
across the domain of studies included in the analysis.
• If the effect size varies modestly,
– report the summary effect but note that the true effect in any given
study could be somewhat lower or higher than this value.
• If the effect varies substantially from one study to the next,
– shift our attention from the summary effect to the dispersion itself.
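One way to quantify the dispersion described above is Cochran's Q together with the I² statistic (the share of observed variation attributed to real differences between studies rather than sampling error). The slide does not name these statistics, so their use here is an assumption, and the data are made up.

```python
def heterogeneity(effects, variances):
    """Return (Q, degrees of freedom, I^2 in percent) for a set of study effects."""
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    mean = sum(w * y for w, y in zip(weights, effects)) / total
    q = sum(w * (y - mean) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    # I^2 is truncated at zero when Q is no larger than its expected value (df)
    i2 = 100.0 * (q - df) / q if q > df else 0.0
    return q, df, i2

# Illustrative (made-up) effect sizes and variances
q, df, i2 = heterogeneity([0.1, 0.3, 0.5], [0.01, 0.01, 0.01])
print(f"Q={q:.2f} on {df} df, I^2={i2:.0f}%")
```

A low I² supports focusing on the summary effect, while a high I² suggests shifting attention to why the effects differ.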
137. – Raw (unstandardized) mean difference D
• Computing D from studies that use independent groups
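A minimal sketch of this computation for two independent groups: D is the difference in sample means, and its variance depends on whether the two populations are assumed to share a common standard deviation. The formulas are the standard ones for independent groups; the function name and input values are illustrative assumptions.

```python
import math

def raw_mean_difference(mean1, sd1, n1, mean2, sd2, n2, pooled=True):
    """Return (D, variance of D, standard error of D) for independent groups."""
    d_raw = mean1 - mean2
    if pooled:
        # Assume a common population SD and pool the two sample variances
        s2_pooled = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
        var = (n1 + n2) / (n1 * n2) * s2_pooled
    else:
        # No common-SD assumption
        var = sd1 ** 2 / n1 + sd2 ** 2 / n2
    return d_raw, var, math.sqrt(var)

# Illustrative values: treated vs. control group
D, var_D, se_D = raw_mean_difference(103, 5.5, 50, 100, 4.5, 50)
print(f"D={D:.2f} V_D={var_D:.4f} SE_D={se_D:.4f}")
```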
138. – Standardized mean difference, d and g
• If studies use different instruments to assess the outcome, then
the scale of measurement will differ from study to study and it
would not be meaningful to combine raw mean differences. In
such cases, use the standardized mean difference.
• Computing d and g from studies that use independent groups
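The sketch below computes Cohen's d and the small-sample corrected Hedges' g for two independent groups, using the usual pooled within-group standard deviation and the approximate correction factor J = 1 - 3/(4·df - 1). The function name and the input values are illustrative assumptions.

```python
import math

def standardized_mean_difference(mean1, sd1, n1, mean2, sd2, n2):
    """Return Cohen's d, Hedges' g and their variances for independent groups."""
    df = n1 + n2 - 2
    # Pooled within-group standard deviation
    s_within = math.sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df)
    d = (mean1 - mean2) / s_within
    var_d = (n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2))
    # Approximate correction factor removing the small-sample bias of d
    j = 1 - 3 / (4 * df - 1)
    return {"d": d, "var_d": var_d, "g": j * d, "var_g": j ** 2 * var_d}

# Illustrative values: treated vs. control group
smd = standardized_mean_difference(103, 5.5, 50, 100, 4.5, 50)
print({name: round(value, 4) for name, value in smd.items()})
```

Because d and g are unit-free, studies that measured the outcome on different scales can be combined.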
140. • Examples of effect sizes
• Effect sizes based on means
– Raw (unstandardized) mean difference (D): studies with independent groups; studies with matched groups or pre-post designs
– Standardized mean difference (d or g): studies with independent groups; studies with matched groups or pre-post designs
– Response ratios (R): studies with independent groups
• Effect sizes based on binary data
– Risk ratio (RR): studies with independent groups
– Odds ratio (OR): studies with independent groups
– Risk difference (RD): studies with independent groups
• Effect sizes based on correlational data
– Correlation (r): studies with one group
Source: Michael Borenstein et al., Introduction to Meta-Analysis, 2009
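For the binary-data effect sizes listed above, a minimal sketch computes the risk ratio, odds ratio, and risk difference from a 2×2 table of two independent groups. Ratios are analyzed on the log scale, so the variances returned for RR and OR are those of the log values. The cell counts are illustrative, the function name is hypothetical, and the sketch assumes no cell is zero.

```python
import math

def binary_effect_sizes(a, b, c, d):
    """a, b = events / non-events in group 1; c, d = events / non-events in group 2."""
    n1, n2 = a + b, c + d
    p1, p2 = a / n1, c / n2
    return {
        "rr": p1 / p2,
        "log_rr": math.log(p1 / p2),
        "var_log_rr": 1 / a - 1 / n1 + 1 / c - 1 / n2,
        "or": (a * d) / (b * c),
        "log_or": math.log((a * d) / (b * c)),
        "var_log_or": 1 / a + 1 / b + 1 / c + 1 / d,
        "rd": p1 - p2,
        "var_rd": a * b / n1 ** 3 + c * d / n2 ** 3,
    }

# Illustrative table: 5/100 events in the treated group, 10/100 in the control group
es = binary_effect_sizes(5, 95, 10, 90)
print({name: round(value, 4) for name, value in es.items()})
```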
142. Life Sciences and Omics
• Cell
• Central Dogma
• Sequencing?
• the process of determining the precise order of nucleotides in a given
DNA molecule; used to determine the sequence of individual genes, full
chromosomes, or entire genomes of an organism.
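To make the Central Dogma concrete, here is a small sketch that takes a sequenced DNA coding strand, transcribes it to mRNA, and translates it to a protein using the standard genetic code. The input sequence is a made-up example, the function names are illustrative, and real pipelines would typically use a library such as Biopython instead.

```python
# Standard genetic code, built from the conventional TCAG ordering of codons
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def transcribe(coding_dna):
    """DNA coding strand -> mRNA (T is replaced by U)."""
    return coding_dna.upper().replace("T", "U")

def translate(mrna):
    """mRNA -> protein (one-letter codes), stopping at the first stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        codon = mrna[i:i + 3].replace("U", "T")
        amino_acid = CODON_TABLE[codon]
        if amino_acid == "*":   # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)

dna = "ATGGCCATTGTAATGGGCCGCTGA"   # made-up example sequence
mrna = transcribe(dna)
print(mrna)
print(translate(mrna))
```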
143. • Omics
– Source: https://en.wikipedia.org/wiki/Omics
• aims at the collective characterization and quantification of pools
of biological molecules that translate into the structure, function,
and dynamics of an organism or organisms.
• Main research areas of computational biology and bioinformatics
– 1. Genomics (& Genetics)
– the study of the structure, function, and mapping of genomes
– 2. Transcriptomics
• the study of the transcriptome
– transcriptome = the sum of an organism’s RNA transcripts.
144. – 3. Proteomics
• the study of proteins. Transcription produces messenger RNA (mRNA),
which serves as a template for protein synthesis through
translation. Hence the proteins produced depend on which genes
are transcribed into mRNA.
– 1. Applications of proteomics in drug discovery
– 2. Protein folding
– 3. Protein structure prediction
– 4. Protein-protein interaction networks
– 4. Metabolomics
• the study of metabolites
– = molecules produced by metabolism within tissues and cells
– Researchers try to identify and quantify metabolites using different
analytical methods and interpret the data. There are different subfields of
metabolomics such as metabonomics and exometabolomics.
– 1. Metabolic reprogramming
– 2. Mass spectrometry strategies
– 3. Identification of biomarkers
145. – 4. Metabolomics
• the study of metabolites
– = molecules produced by metabolism within tissues and cells
– the goal is to identify and quantify metabolites using different
analytical methods and to interpret the data.
– 1. Metabolic reprogramming
– 2. Mass spectrometry strategies
– 3. Identification of biomarkers
– 5. Phylogenetics
• study of how species evolved and what relationships exist within
groups of organisms. Relationships are determined using
phylogenetic inference methods with DNA sequencing data or
morphology.
– 1. Inferring phylogenetic trees
– 2. Phylogenetic networks
– 3. Bayesian phylogenetics
– 4. Phylogenetic model selection
– 5. Evolutionary models
146. – 5. Phylogenetics
• Relationships are determined using phylogenetic inference
methods with DNA sequencing data or morphology. →
phylogenetic tree
– 1. Inferring phylogenetic trees
– 2. Phylogenetic networks
– 3. Bayesian phylogenetics
– 4. Phylogenetic model selection
– 5. Evolutionary models
– 6. Systems biology
• uses mathematical models and simulation
– 1. Gene regulatory networks
– 2. Modelling metabolic interactions
– 3. Model protective mechanisms induced by antibiotics
– 4. Studying cell signalling pathways