This document presents ChainRank, a method to identify context-specific subnetworks from genome-wide interaction networks. ChainRank models information flow using chains of interactions between start and end nodes, and prioritizes chains based on node scores such as expression variability and connectivity. Applied to a COPD interaction network, the top-ranked chains showed a 50% improvement in precision over random ranking and were enriched for known COPD pathways. Combining multiple node scores yielded even better results, demonstrating ChainRank's ability to identify meaningful subnetworks.
2. Background & Aim
• More and more genome-wide data is available but is still not used optimally
• Genome-wide networks are too big and complex to be interpreted in a meaningful way
• Knowledge-based networks (e.g. canonical pathways, PPI networks…) are in general non-specific
Aim: develop a flexible method to identify context-specific subnetworks
3. Approach
• Model the flow of information using chains of interactions
• Chains = simple paths: sequences of interactions (e.g. protein modifications) that connect one start point and one end point
• Multiple chains can exist between a pair of start and end proteins: which ones form the most meaningful subnetwork?
• Prioritization of the chains based on many possible scores: gene expression,
functional module identification, …
• Here they present ChainRank, a general tool for combining multiple types of biological information as chain scores
4. Methods
1. Search for all chains between user-defined start and end nodes in the network
2. Annotate the nodes with scores in order to calculate chain scores and p-values
5. Subnetwork
Restrict the network by heuristic breadth-first search from the fixed initial proteins
to the final one with 2 criteria:
1. Maximal length allowed = length of the shortest path between initial and final
node
2. Prefer the integration of highly connected proteins (canonical signaling
interactors)
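The two criteria above can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes an undirected network stored as a dictionary of neighbour sets, and the function name `shortest_chains` and the `max_chains` cap are hypothetical. A breadth-first search from the final node gives distances, only steps that move one hop closer are followed (so every chain has the shortest-path length), and highly connected neighbours are explored first.

```python
from collections import deque

def shortest_chains(adj, start, end, max_chains=None):
    """List simple paths from start to end whose length equals the shortest-path length."""
    # Breadth-first search from the end node: distance of every node to the end.
    dist = {end: 0}
    queue = deque([end])
    while queue:
        node = queue.popleft()
        for nb in adj.get(node, ()):
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    if start not in dist:
        return []  # start and end are not connected

    chains = []
    stack = [[start]]
    while stack:
        path = stack.pop()
        node = path[-1]
        if node == end:
            chains.append(path)
            if max_chains is not None and len(chains) >= max_chains:
                break
            continue
        # Criterion 1: only follow steps that move one hop closer to the end.
        nexts = [nb for nb in adj.get(node, ()) if dist.get(nb) == dist[node] - 1]
        # Criterion 2 (assumed reading): explore high-degree neighbours first.
        # Sorting ascending means the highest-degree neighbour is popped first.
        nexts.sort(key=lambda nb: (len(adj.get(nb, ())), nb))
        for nb in nexts:
            stack.append(path + [nb])
    return chains

# Toy undirected network (illustrative only).
toy_adj = {
    "A": {"B", "C"},
    "B": {"A", "D", "E"},
    "C": {"A", "D"},
    "D": {"B", "C"},
    "E": {"B"},
}
all_chains = shortest_chains(toy_adj, "A", "D")
first_chain = shortest_chains(toy_adj, "A", "D", max_chains=1)
```

In the toy network, node B has the highest degree, so the chain through B is found before the chain through C.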
6. Scoring scheme
• Chain score = $\dfrac{\sum \text{node scores}}{\text{number of nodes}}$
• Node scores used
1. Localisation: mean expression variability across studied tissue vs. mean
expression variability across all others -> gene expression
2. Relevance: occurrence of each protein among the significant ones across
studies -> gene expression, protein modifications, metabolism…
3. Connectivity: degree centrality -> topology
• Combination of scores
1. Weighted product of normalized scores
2. Filtering: pre-filter chains by score S1 and rank them by score S2
3. Intersection: keep only chains that pass filter on all scores
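The scoring scheme can be sketched in Python as follows. This is a sketch under stated assumptions, not the ChainRank code: the chain score is taken as the mean node score, normalization is assumed to be min-max scaling, thresholds are assumed to be "higher is better", and all function names and toy values are hypothetical.

```python
from math import prod

def chain_score(chain, node_scores):
    """Chain score as the mean of the node scores along the chain."""
    return sum(node_scores[n] for n in chain) / len(chain)

def min_max_normalize(values):
    """Scale values to [0, 1] (assumed normalization; the slides do not specify one)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def weighted_product(score_columns, weights):
    """Combination 1: product of normalized scores, each raised to its weight."""
    normalized = [min_max_normalize(col) for col in score_columns]
    n_chains = len(score_columns[0])
    return [
        prod(col[i] ** w for col, w in zip(normalized, weights))
        for i in range(n_chains)
    ]

def filter_then_rank(chains, s1, s2, keep):
    """Combination 2: pre-filter chains by score S1, then rank them by score S2."""
    kept = [c for c in chains if keep(s1[c])]
    return sorted(kept, key=lambda c: s2[c], reverse=True)

def intersection(chains, score_maps, thresholds):
    """Combination 3: keep only chains that pass the filter on every score."""
    return [
        c for c in chains
        if all(s[c] >= t for s, t in zip(score_maps, thresholds))
    ]

# Toy node scores (illustrative values only).
localisation = {"A": 0.2, "B": 0.8, "C": 0.4, "D": 0.6}
connectivity = {"A": 1.0, "B": 3.0, "C": 2.0, "D": 2.0}
chains = [("A", "B", "D"), ("A", "C", "D")]

loc = {c: chain_score(c, localisation) for c in chains}
con = {c: chain_score(c, connectivity) for c in chains}

combined = weighted_product(
    [[loc[c] for c in chains], [con[c] for c in chains]], weights=[1.0, 1.0]
)
filtered = filter_then_rank(chains, con, loc, keep=lambda s: s >= 1.9)
common = intersection(chains, [loc, con], thresholds=[0.5, 1.5])
```

Passing the filter as a predicate (`keep`) leaves the direction and cut-off of each filter to the user, which matters because some scores may be thresholded as p-values and others as quantiles.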
7. Results
• Application to chronic obstructive pulmonary disease (COPD)
• Network used: experimental interactions from different public databases + COPD
knowledge base (10k nodes, 62k interactions)
• Significance: comparison to chains in random networks
• Evaluation: enrichment of the top-ranked chains in proteins from gold standard pathways
• Improvement metric: $\dfrac{\text{precision of ranking}}{\text{precision of random ranking}}$
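The improvement metric can be illustrated with a small Python sketch. Two simplifications are assumed here and are not taken from the slides: precision is computed over the distinct proteins in the top k chains, and the random baseline is the mean precision over shuffled orderings of the same chains (the original work compares against random networks instead). The function names are hypothetical.

```python
import random

def precision_at_k(ranked_chains, gold, k):
    """Fraction of distinct proteins in the top-k chains that are gold standard proteins."""
    proteins = set()
    for chain in ranked_chains[:k]:
        proteins.update(chain)
    return len(proteins & gold) / len(proteins) if proteins else 0.0

def improvement(ranked_chains, gold, k, n_random=1000, seed=0):
    """Precision of the ranking divided by the mean precision of random rankings."""
    rng = random.Random(seed)  # seeded for reproducibility
    observed = precision_at_k(ranked_chains, gold, k)
    shuffled = list(ranked_chains)
    total = 0.0
    for _ in range(n_random):
        rng.shuffle(shuffled)
        total += precision_at_k(shuffled, gold, k)
    baseline = total / n_random
    if baseline == 0:
        return float("inf")  # no gold standard protein ever recovered at random
    return observed / baseline

# Toy ranking and gold standard (illustrative only).
ranked = [("A", "B", "D"), ("A", "C", "D"), ("A", "E", "D")]
gold = {"A", "B", "D"}

p1 = precision_at_k(ranked, gold, 1)
p2 = precision_at_k(ranked, gold, 2)
gain = improvement(ranked, gold, k=1)
```

A value above 1 means the ranking recovers gold standard proteins better than chance; the "50% improvement" quoted in the conclusions would correspond to a ratio of about 1.5.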
8. Figure panels: IGF-Akt proximity subnetwork, MAPK proximity subnetwork
• Localisation: expression variability across studied tissue vs. across all others
• Relevance: occurrence of each protein among the significant ones across COPD-related studies
• Connectivity: degree centrality
• Combination by weighted product: no improvement
• Filtering: connectivity < 0.05, ranked by localisation
• Intersection: connectivity and localisation
• Filtering: top quartile localisation, ranked by relevance
• Intersection: localisation and relevance
9. Results for the best 50 chains
• Other methods: recall 50-85%, precision 18-42%
• Here (max): recall 67%, precision 30%
10. Conclusions and claims
• 50% improvement in finding gold standard proteins (compared to random); combining scores does even better (x2.5)
• 11% improvement of the AUC (compared to random)
• Generic tool applicable to different network types (GRN, metabolic networks)
• Importance of selected scores based on scientific question
• Applications
• Causal, mechanistic connection?
• Common mechanisms driving different diseases
• Reduce the computational models
• Synthetic lethality
Editor's Notes
Note that going from pathway to PPI notation heavily changes the structure of the pathway: they cannot aim at recovering the canonical pathway, so they report improvement instead
Connectivity score = |dc – max(dc)| + 1
The COPD case doesn't have p-values because Relevance is far from normally distributed
They don't justify the use of different scores in different scenarios. Why connectivity in the first one?
Random score is different from diagonal because it takes into account the topology
Red lines: number of gold standard proteins; B2 is the same but with the input network (chains scored but not selected)
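For reference, the connectivity score formula in the notes above, |dc – max(dc)| + 1, can be written as a short Python sketch. It assumes that dc is the raw node degree in an undirected network stored as a dictionary of neighbour sets (the notes do not say whether degree centrality is normalized), and the function name is hypothetical.

```python
def connectivity_scores(adj):
    """Connectivity score per node: |dc - max(dc)| + 1, with dc taken as the node degree."""
    degree = {node: len(neighbours) for node, neighbours in adj.items()}
    max_dc = max(degree.values())
    return {node: abs(dc - max_dc) + 1 for node, dc in degree.items()}

# Toy undirected network (illustrative only).
toy_network = {
    "A": {"B", "C"},
    "B": {"A", "D", "E"},
    "C": {"A", "D"},
    "D": {"B", "C"},
    "E": {"B"},
}
conn = connectivity_scores(toy_network)
```

Under this formula the most connected node receives the score 1 and less connected nodes receive larger values.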