This document introduces a clinical data warehouse system developed for Traditional Chinese Medicine (TCM) knowledge discovery and clinical research. The system collects clinical data from structured electronic medical records produced in daily practice. It consists of components for data integration, transformation and storage, together with online analytical processing and data mining capabilities. The system currently holds data on about 20,000 inpatients and more than 20,000 outpatient instances. Several subject analyses and data mining applications have been developed on the system to bridge TCM clinical practice and research.
developed a clinical data warehouse system based on the structured TCM electronic medical record system (SEMR) [12], which stores the content of the medical record (e.g. chief complaint and histories) as structured data. Furthermore, since most TCM clinical data, such as symptoms/signs, diagnoses and formula prescriptions, are represented by terminologies, we have made a systematic study of TCM clinical terminology and nomenclature [13] to facilitate data entry and standard representation.

We have collected data on about 20,000 inpatients with diabetes, coronary heart disease (CHD) or stroke from TCM hospitals (ten top-grade hospitals in Beijing, China) and TCM wards. Furthermore, there are more than 20,000 outpatient data instances, which record the outpatient clinical process of more than twenty famous TCM physicians in Beijing, China. By comprehensively analyzing the characteristics of the TCM clinical data structure and the analysis subjects of TCM clinical research, we have designed the information model, physical data model and multidimensional data model for the clinical data warehouse. Meanwhile, we have developed an extraction-transformation-loading (ETL) tool, Medical Integrator (MI), to carry out clinical data integration, cleaning and preprocessing. Furthermore, we have integrated the data mining systems Weka and Oracle Data Miner and a business intelligence tool (Business Objects) to implement a TCM clinical intelligence platform with data mining and online analytical processing (OLAP) abilities.

2. The infrastructure of TCM clinical data warehouse

As a comprehensive platform for TCM clinical and theoretical research, the TCM clinical data warehouse system is built on Java and the J2EE platform. The technology infrastructure of the TCM clinical data warehouse is depicted in Fig.1. The infrastructure integrates different operational data sources (e.g. SQL Server, Oracle, DB2) using a purpose-built ETL tool. More data sources can be added by extending the database interface configuration. Because the operational data sources are heterogeneous, we use a series of metadata information tables to record the metadata (e.g. database type, hospital information, physician information, data content description and transformation information) of the different data sources.

Data storage management is supported by Oracle (currently, we use Oracle 10g as the database server), and the analysis and query service is mainly supported by the business intelligence system Business Objects (BO). BO provides design and analysis clients such as Crystal Reports, Web/Desktop Intelligence, Dashboard and Performance Manager to implement the OLAP functionality. Meanwhile, we integrate the Oracle Data Mining option, with its client Oracle Data Miner, and the machine learning platform Weka to perform online data mining tasks. The infrastructure therefore provides a technological framework for large-scale TCM clinical data integration, preprocessing, management and online analysis.

Fig.1. The technology infrastructure of TCM clinical data warehouse.

As a platform aimed at TCM clinical research, the TCM clinical data warehouse system can also directly provide preprocessed data to statistical software (e.g. SPSS, SAS and STATISTICA) for statistical analysis and testing. Hence, from the application perspective, the TCM clinical data warehouse offers an integrated functional platform supporting raw clinical data integration and cleaning, OLAP, data mining and statistical analysis tasks.

3. Traditional Chinese medicine clinical data model design

Information model analysis and design is the vital step of TCM clinical data warehouse development. A medical information model such as the HL7 Reference Information Model (RIM) [14] is a very complicated system with various concepts and relationships. The objectives of the HL7 RIM are to support the medical operational process and, in particular, information exchange between different medical information systems. The semantic network of the Unified Medical Language System (UMLS) [15] is considered a distinguished medical ontology in modern biomedical science. Its semantic types and structures offer a global conceptual view of medical terminologies. The focus of UMLS is to bridge the gap between the different terminological systems used in the medical literature. Hence, the core framework of the semantic network is designed according to the principle of conceptual unification.
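
Looking back at the source integration described in Section 2, the metadata information tables can be pictured with a small sketch. It models one metadata row per operational source and uses the row's column mapping to bring records from differently named sources onto common warehouse columns. The class `SourceMetadata`, the function `to_warehouse_row` and every field, column and hospital name here are hypothetical; the paper does not give the actual table layout used by MI.

```python
from dataclasses import dataclass, field


@dataclass
class SourceMetadata:
    """One hypothetical metadata row describing an operational data source."""
    source_id: str
    db_type: str        # database type, e.g. "sqlserver", "oracle", "db2"
    hospital: str       # hospital information
    description: str    # data content description
    # Transformation information: source column -> warehouse column.
    column_map: dict = field(default_factory=dict)


def to_warehouse_row(record: dict, meta: SourceMetadata) -> dict:
    """Rename mapped columns and tag the row with its provenance.

    Columns without a mapping are dropped in this sketch; a real ETL tool
    would more likely log or reject them.
    """
    row = {meta.column_map[col]: value
           for col, value in record.items()
           if col in meta.column_map}
    row["source_id"] = meta.source_id
    row["hospital"] = meta.hospital
    return row


# Two made-up sources that name the same fields differently.
SOURCES = {
    "h1": SourceMetadata("h1", "sqlserver", "Hospital A", "inpatient records",
                         {"pat_no": "patient_id", "zhusu": "chief_complaint"}),
    "h2": SourceMetadata("h2", "oracle", "Hospital B", "outpatient records",
                         {"PID": "patient_id", "COMPLAINT": "chief_complaint"}),
}

row_a = to_warehouse_row({"pat_no": 17, "zhusu": "thirst", "tmp": 1}, SOURCES["h1"])
row_b = to_warehouse_row({"PID": 42, "COMPLAINT": "dizziness"}, SOURCES["h2"])
print(row_a)
print(row_b)
```

Keeping the mapping as data rather than code is one plausible way to add new sources through configuration alone, which is the property Section 2 attributes to the database interface configuration.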
However, the information model of the TCM clinical data warehouse focuses on the information content that will be analyzed and used in TCM clinical and theoretical research. Hence, the classification and definition of the information generated by TCM clinical processes is the emphasis of our work. We consider the TCM clinical process as a dynamic system with two core entities, namely physician and patient, and three core information elements, namely symptom, disease/TCM syndrome and treatment. The symptom is regarded as a relatively objective disease phenomenon, whereas the disease/TCM syndrome is a type of human morbid status that is the diagnostic result of a specific physician. Meanwhile, the TCM treatment is a clinical event that aims to restore the patient's health. Therefore, by abstracting these five core information elements and constructing the global conceptual framework of the TCM information model, we design an information model for the clinical data warehouse. We consider that the main content of TCM clinical research is the study of the relationships between different entities within one event and of the relationships between different events. Therefore, we regard clinical information as various kinds of events (phenomena and activities); in every event, several conceptual and physical entities may participate at a specific time. Because TCM and modern medical concepts and methods are mixed in the current TCM clinical process in hospitals, the sub-classes of the entity class are also a mixture of TCM and modern medical classes. For example, we have defined two distinct disease classes in the model: one represents the disease concept in TCM, while the other represents the modern medical concept. It should be noted that the entity classes are materialized as dictionary tables in the physical data model of the data warehouse. A more detailed description of the information model is given in [16].

Adhering to the information model defined, we have designed the physical data model to store and manage the TCM clinical data. Furthermore, to support multidimensional analysis such as OLAP, we have designed several core multidimensional data models as the structural basis of the data marts. We have developed several significant subject analysis applications for TCM clinical research, each with a corresponding relational multidimensional data model. The practical results show that the information model and the multidimensional data models support the clinical analysis applications well.

4. Medical Integrator

ETL is the core component of a successful data warehouse system. Because of the requirements of a complex clinical data structure, flexible data checking, integration of multiple heterogeneous data sources and extensive terminological standardization, even commercial ETL systems do not fit the tasks well. Hence we developed MI, a dedicated ETL tool built with Java and the Eclipse Standard Widget Toolkit (SWT), to implement the required functions. Fig.2 is a snapshot of the main form of MI. Its key functions include data connection configuration, data checking, source database consolidation, data transformation and loading, data cleaning, data standardization and the data analysis interface.

Besides the traditional ETL functions, MI focuses on particular functions such as data standardization and the data analysis interface. Data standardization mainly concerns terminological data such as symptoms, diagnoses and treatments (herb names, descriptive phrases for therapeutic methods, etc.). Because clinical data contains various terms and phrases with flexible expressions, as well as errors, data standardization is vital for effective analysis. We use a rule-based batch processing approach for these tasks. About 8 rule tables are designed to store the different kinds of standardization rules. TCM clinical experts edit the rules and import them into the corresponding tables using MI. To keep the original data available for different analysis applications, MI builds the necessary intermediate tables to store the processed data and provides a standardized data set for different potential analyses. Take symptom standardization as an example. Expressions of symptoms vary widely in clinical practice because of the personal preferences of different physicians, and erroneous expressions or spellings are likely in such a large data store. Domain experts edit four kinds of transformation rules to standardize the symptom data, covering noise data cleaning, unified term description, unification of terminological granularity and unification of synonyms. The result of symptom standardization is a set of terminological phrases with unified concepts.

The entity-attribute-value (EAV) structure [10][17] is the preferred choice in clinical data models. However, most statistical and data mining systems require conventional flat-style data. Moreover, some analysis systems need encoded data. Hence, to seamlessly integrate the statistical and data mining systems, we have developed several key functions (e.g. automatic encoding,
EAV to flat schema conversion and data exporting) for the data analysis interface. Using the functions of MI, we can prepare high-quality data sets for various data analysis tasks.

Fig. 2. The main interface of Medical Integrator with functional items.

5. Data analysis components

Based on the multidimensional data model and ETL preprocessing, the clinical data has been prepared for the analysis and data mining tasks of clinical research. We use BO to provide OLAP analysis. The data mining systems Oracle Data Mining and Weka are also integrated into the clinical data warehouse system.

BO has multidimensional analysis report design tools such as Crystal Reports and Web/Desktop Intelligence. The BO platform is also a middleware server that supports the management, design and browsing of reports in a browser/server (B/S) framework. The semantic layer is a patented product of the BO company. It maps the data structure to domain knowledge categories. Compared with the complicated physical data structure in the data warehouse, the semantic layer (categories and attributes) is rather simple and medically meaningful.

Oracle Data Mining is an option of Oracle 10g Enterprise Edition. We have integrated its data mining client, Oracle Data Miner, into the TCM clinical data warehouse. Furthermore, we have integrated the well-known open-source machine learning platform Weka (version 3.4) [18], configured with JDBC to use the data in the data warehouse directly. Both integrated data mining systems can access the clinical data warehouse online, which makes data mining tasks more convenient.

6. Clinical data analysis and knowledge discovery case studies

Clinical practice has a vital role in TCM research and development. Inductive analysis of the empirical data from clinical practice is a key step in TCM clinical research. Moreover, the study of the relationships between primary conceptual elements such as disease, syndrome, symptom/sign, herb and formula is the central issue of TCM clinical research.

6.1. Online analytical processing and description analysis

Based on the multidimensional data schema and BO semantic layers, we have developed 10 OLAP subject analysis applications with more than 400 analysis reports. The subjects mainly focus on two types of clinical knowledge: the empirical diagnosis and treatment knowledge of famous TCM physicians, and the clinical features of major chronic diseases such as diabetes, stroke and CHD. The subjects cover the data profiles of physicians or diseases, clinical herb and formula usage, and the relationships among clinical findings, TCM syndromes, diseases and complications, etc. The analysis reports can be accessed by authorized web users. Besides browsing the reports interactively, users can also export the results in Excel or PDF format.

Fig. 3 is a screenshot of the global data profile (the graphic area) of a famous TCM physician. It presents the total number of patient instances, consultation times, the disease distribution, herb and formula usage, symptom distribution and therapeutic methods, etc. The global data profile provides baseline information on the clinical data related to a specific physician. Fig. 3 shows that this physician's clinical data mainly concerns diseases such as Xiong Bi (thoracic obstruction of Qi), gastric pain, Xin Ji (palpitation) and vertigo.

Fig. 3. The global profile analysis of outpatient clinical data of a famous TCM physician.

We can also obtain a famous TCM physician's herb usage knowledge for a TCM syndrome (Fig. 4) or a symptom (Fig. 5). Other empirical knowledge, such as the clinical use of classical formulae and regular herb dosages, is analyzed by the corresponding OLAP reports. All the developed reports have appropriate parameters, such as physician name and disease name, that can
be selected by users on demand to show the analysis results for different physicians or diseases. The exploratory analysis of the inpatient data focuses on the relationships among diseases, TCM syndromes and clinical findings.

Fig. 4. The herb usage information on a specific TCM syndrome of a famous TCM physician.

Fig. 5. The relationships between herb and symptom show which herbs would be prescribed for a specific symptom.

6.2. Data mining

With the integrated data mining abilities and preprocessing functions of the clinical data warehouse, we have successfully conducted several preliminary TCM clinical data analysis studies, such as acupuncture prescription knowledge discovery [19], the relationship between formulae (herbs) and syndromes in type 2 diabetes mellitus (T2DM) with metabolic syndrome [20], herb treatment for T2DM [21], and cluster analysis of TCM syndrome types in patients with acute myocardial infarction [22].

The acupuncture prescription knowledge discovery study [19] focuses on the empirical clinical acupuncture prescriptions of Prof. Conghuo Tian of the acupuncture department of Guanganmen Hospital, Beijing, China. Using the association rule mining method in Weka, we obtained 18 acupuncture prescriptions from more than 1,200 medical records. Prof. Tian indicated that one of the 18 is not a fixed prescription in his clinical practice. We therefore finally obtained 17 useful acupuncture prescriptions (with prescription name, acupuncture point composition, modifications, main efficacy, etc.), which reflect the empirical knowledge of Prof. Tian. More data mining case studies on the outpatient clinical data can be found in [16].

The data mining case studies on the inpatient clinical data focus on T2DM and CHD. T2DM is still a relatively new disease for TCM treatment, and its TCM syndrome classification is an open research issue. We study the TCM syndrome classification of T2DM with metabolic syndrome by herb composition network analysis [20]. We find that, as the disease course extends, the therapeutic methods for T2DM with metabolic syndrome mainly include nourishing Yin and clearing away heat, replenishing Qi and nourishing Yin, and replenishing Qi and nourishing blood, etc. This indicates that the TCM syndrome categories of T2DM with metabolic syndrome are Yin Deficiency Heat Excess (early stage), Qi-Yin Deficiency (middle stage) and Qi-Deficiency Blood Stasis (terminal stage). The result offers preliminary guidance for the clinical treatment of patients with T2DM and metabolic syndrome. We have also studied herb prescription knowledge for T2DM with different complications [21], which likewise offers useful information for the TCM treatment of T2DM.

7. Conclusion and Future Work

In conclusion, clinical research built on real TCM clinical practice, which keeps to STSD, is the essential requirement of TCM research. This paper proposes a data warehouse solution for clinical data organization, management, processing and analysis. We have completed the whole framework and developed the core components, such as the clinical information model, the ETL tool, and the OLAP and data mining functions. Moreover, based on the collected structured EMR data, we have developed and performed several research-oriented subject analyses and data mining tasks. The data analysis case studies show that the clinical data warehouse provides a handy platform for TCM clinical knowledge discovery. The clinical data warehouse is therefore a promising infrastructure for TCM clinical and theoretical research. However, the project is still in progress. We will focus on the following three tasks in the future.

Privacy and security are the main problems in using and sharing clinical data. We will address the protection of information about both physicians and patients. This has been considered in the current ETL tool and data analysis applications.

Currently, the clinical data only contains TCM research-oriented information, while hospital management information is not yet covered. Because hospital management requires decision support, we will consider integrating the data from the hospital
information system and developing the corresponding subject analyses.

Compared with collecting free-text EMR data, collecting high-quality structured clinical data is still a laborious job. The limited data volume therefore means that the whole clinical data warehouse framework has not yet been put to full use. We have been working hard on upgrading the SEMR system to facilitate data entry. Furthermore, as more TCM hospitals adopt the SEMR system as their regular EMR collection tool and more research projects are permitted to provide their data, the data volume will increase rapidly in the near future.

Acknowledgements

This work is partially supported by the Scientific Breakthrough Program of Beijing Municipal Science & Technology Commission, China (H020920010130), China Postdoctoral Science Foundation (2005037106), China Key Technologies R & D Programme (2007BA110B06), China 973 project (2006CB504601) and the Science and Technology Foundation of Beijing Jiaotong University (2007RC072).

References

[1]. Liu B., Hu J., Xie Y., et al, Conception and Study in Establishment of Modern Individualized Diagnosis and Treatment System in TCM (in Chinese). World Science and Technology-Modernization of TCM. 2003; 5(1):1-5.
[2]. Liu B., Zhou X., Design and Practice of Wet-Dry Approach in Clinical Research of TCM (in Chinese). World Science and Technology-Modernization of TCM. 2007; 9(1):85-9.
[3]. Inmon W.H., Building the Data Warehouse (Third Edition), John Wiley & Sons, Inc. 2002.
[4]. Silver M., Sakata T., Su H., et al, Case study: how to apply data mining techniques in a healthcare data warehouse. J Healthc Inf Manag. 2001; 15:155-64.
[5]. Wisniewski M.F., Kieszkowski P., et al, Development of a Clinical Data Warehouse for Hospital Infection Control. JAMIA. 2003; 10(5):455-62.
[6]. Banek M., Tjoa A. M., Stolba N., Integrating Different Grain Levels in a Medical Data Warehouse Federation. In Proceedings of Data Warehousing and Knowledge Discovery, A. Min Tjoa, Juan Trujillo (Eds.), 2006, Krakow, Poland, LNCS, 4081, 185-94.
[7]. Einbinder J.S., Scully K., Using a Clinical Data Repository to Estimate the Frequency and Costs of Adverse Drug Events. JAMIA. 2002 Nov–Dec; 9(6 Suppl 1):s34-s38.
[8]. Allard R.D., The clinical laboratory data warehouse. An overlooked diamond mine, Am J Clin Pathol 2003, 817-9.
[9]. Granta A., Moshyka A., Diaba H., et al, Integrating feedback from a clinical data warehouse into practice organisation. Int J Med Inform. 2006; 75:232-9.
[10]. Pedersen T.B., Jensen C.S., Research Issues in Clinical Data Warehousing. In Proceedings of SSDBM-98, Italy, July 1-3, 1998.
[11]. Sahama T.R., Croll P.R., A data warehouse architecture for clinical data warehousing. In Proceedings of the fifth Australasian symposium on ACSW frontiers, Australian Computer Society, Inc., Darlinghurst, Australia, 2007; 68:227-32.
[12]. Li P., Liu B., Wen T., et al, Traditional Chinese medicine electronic medical record system and the reorganization of TCM theoretical knowledge (in Chinese). Chinese Journal of Information on TCM. 2005; 12(4):7, 39.
[13]. Guo Y., Liu B., Li P., et al, Ontology and Standardization of the TCM Terms (in Chinese). Chinese Archives of TCM. 2007; 25(7):1368-70.
[14]. HL7 Reference Information Model, http://www.hl7.org/library/data-model/RIM.
[15]. Lindberg D.A.B., Humphreys B.L., McCray A.T., The Unified Medical Language System. Meth Inform Med. 1993; 32:281-91.
[16]. Zhou X., The Research on TCM Clinical Data Warehousing and Clinical Data Mining Methods (in Chinese). Postdoctoral Report, China Academy of Chinese Medical Sciences, 2007.3.
[17]. Deshpande A.M., Brandt C., Nadkarni P.M., Metadata-driven Ad Hoc Query of Patient Data Meeting the Needs of Clinical Studies. JAMIA. 2002; 9(4):369-82.
[18]. Witten I.H. and Frank E., Data Mining: Practical machine learning tools and techniques (2nd Edition), Morgan Kaufmann, San Francisco, 2005.
[19]. Zhang H., Tian C., Liu B., et al, Study on the idea of clinical accupuncture point combination of TCM physician Tian (in Chinese). Journal of Clinical Acupuncture and Moxibustion. 2007.2, 23(2):36-8.
[20]. Ni Q., Liu B., Chen S., et al, Study of Relationship between Formula (herbs) and Syndrome about Type 2 Diabetes Mellitus Affiliated Metabolic Syndrome Based on the Scale-free Network (in Chinese). Chinese Journal of Information on TCM. 2006; 13(11):19-22.
[21]. Jian Z., Ni Q., Zhou X., et al, Study on treatment law of type 2 diabetes based on structural clinical information collect system (in Chinese). Journal of Shangdong University of TCM. 2007; 31(3):195-7.
[22]. Zhuye Gao, Hao Xu, Dazuo Shi, et al, The Cluster Analysis on Syndrome Type of TCM in Patients with Acute Myocardial Infarction (in Chinese). Journal of Emergency in TCM. 2007; 16(4):432-4.