This document summarizes a presentation on potential applications using the class frequency distribution of maximal repeats from tagged sequential data. It discusses using maximal repeat patterns and their frequency distributions over time to analyze trends in topic histories from literature, detect anomalies in manufacturing processes for quality control, and identify distinguishing patterns in genomic sequences. Potential applications discussed include text mining historical archives, individualized learning based on topic histories, detecting changes in language for elderly assessment, monitoring new word adoption, and integrating IoT sensor data with product traceability systems for industrial quality assurance.
Potential Applications of Class Frequency Distributions of Maximal Repeats
1. Potential Applications using the Class Frequency
Distribution of Maximal Repeats
from Tagged Sequential Data
Jing-Doo Wang (王經篤)
Associate Professor
Asia University, Taiwan.
The 8th Taiwan Hadoop Community Annual Conference, HadoopCon 2016
Humanities and Social Sciences Building, Academia Sinica (2016.9.10)
8. Outline
• Introduction
• Pattern History For Trend Analysis
• Product Traceability for Quality Monitoring
• Mining for Distinctive Pattern (Biomarker)
from Genomic Sequences
• Future Works
10. Why use “Maximal Repeats” as features?
• Dictionary
– How to identify new words or phrases?
– e.g. “just do it”, “洪荒之力” (a Chinese buzzword, roughly “primordial power”).
• N-gram
– 2-gram, 3-gram,…,5-grams. (Google Ngram viewer)
– The value of “N” is limited.
• Maximal Repeat
– The length of maximal repeat is variable.
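To make the contrast with fixed-length N-grams concrete, here is a minimal brute-force sketch of maximal repeat extraction over a word sequence. It assumes the common definition (a sequence that occurs at least twice and cannot be extended to the left or to the right without losing occurrences); the function name `maximal_repeats` and the sample sentence are illustrative only and are not taken from the talk, whose actual extraction method is not shown on these slides.

```python
from collections import defaultdict

def maximal_repeats(tokens, min_count=2, max_len=None):
    """Brute-force maximal repeats of a token list (illustrative sketch only).

    Returns {tuple_of_tokens: occurrence_count}.
    """
    n = len(tokens)
    limit = n if max_len is None else max_len

    # Record the start positions of every candidate of length <= limit.
    starts_of = defaultdict(list)
    for i in range(n):
        for j in range(i + 1, min(n, i + limit) + 1):
            starts_of[tuple(tokens[i:j])].append(i)

    result = {}
    for gram, starts in starts_of.items():
        if len(starts) < min_count:
            continue
        k = len(gram)
        # A sequence boundary counts as a context of its own.
        left = {tokens[s - 1] if s > 0 else ("^", s) for s in starts}
        right = {tokens[s + k] if s + k < n else ("$", s) for s in starts}
        # Maximal: neighbours differ on both sides, so no one-token
        # extension keeps all occurrences.
        if len(left) > 1 and len(right) > 1:
            result[gram] = len(starts)
    return result

# Made-up example sentence, not data from the talk.
tokens = "just do it now just do it later we just do".split()
repeats = maximal_repeats(tokens)
for gram, count in sorted(repeats.items()):
    print(" ".join(gram), count)
```

In this toy input, "just do" and "just do it" are both reported, with different lengths and counts, while "do it" is not, because every occurrence is preceded by "just". The quadratic enumeration is only suitable for small inputs.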
13. Patent Application Serial Number
(US 15/208,994) (pending)
• Wang, Ching-Tu. Method for Extracting Maximal
Repeat Patterns and Computing Frequency
Distribution Tables. Patent Application Serial
Number 15/208,994. 13 July 2016.
• U.S. invention patent application (PA)
– Owner: Jing-Doo Wang (王經篤)
– Inventor: Jing-Doo Wang (王經篤)
19. Pattern History for Trend Analysis
Jing-Doo Wang (王經篤)
Associate Professor
Asia University, Taiwan.
2016/9/12, FSKD 20'11
Sequential Data + Timestamp
24. The Abstracts and Titles of PubMed Articles (1990~2014) (12 GB)
6 PCs => 5 hours
25. The History of a Significant Pattern (顯要樣式歷史)
The history of a significant pattern is the
frequency distribution of that pattern over
equally spaced time intervals.
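As a rough illustration of this definition, the sketch below counts a word pattern per year over timestamped texts and returns one count for every year in the range, including years with zero occurrences. The function names and the four sample records are hypothetical; they are not the PubMed data or the Hadoop pipeline from the talk.

```python
from collections import Counter

def count_pattern(words, pattern_words):
    """Count occurrences of a consecutive word pattern in a word list."""
    k = len(pattern_words)
    return sum(words[i:i + k] == pattern_words
               for i in range(len(words) - k + 1))

def pattern_history(docs, pattern, start_year, end_year):
    """Frequency of `pattern` per year, for equally spaced one-year bins.

    docs: iterable of (year, text) pairs. Returns [(year, count), ...].
    """
    pattern_words = pattern.split()
    counts = Counter()
    for year, text in docs:
        if start_year <= year <= end_year:
            counts[year] += count_pattern(text.split(), pattern_words)
    # Emit every bin so that gaps show up as explicit zeros.
    return [(y, counts[y]) for y in range(start_year, end_year + 1)]

# Made-up records for illustration only.
docs = [
    (2003, "SARS outbreak and SARS transmission"),
    (2003, "clinical features of SARS"),
    (2004, "lessons from SARS"),
    (2009, "pandemic H1N1 influenza"),
]
history = pattern_history(docs, "SARS", 2002, 2004)
print(history)
```

Yearly bins are an assumption here; any equally spaced interval (month, decade) would work the same way by changing how each record is mapped to a bin.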
26. Significant Pattern (顯要樣式)
• A significant pattern is a maximal repeat of
consecutive words within the texts.
(Length=1) TDP-43
(Length=1) SARS
(Length=1) H1N1
(Length=5) non-small cell lung cancer (NSCLC)
(Length=6) 75 g oral glucose tolerance test
(Length=6) 4 x 4 Latin square design
(Length=7) 2 x 2 factorial arrangement of treatments
(Length=9) the National Institute of Child Health and Human Development
(Length=10) patients with squamous cell carcinoma of the head and neck
(Length=11) anomalous origin of the left coronary artery from the pulmonary artery
(Length=12) Pregnancy and Childbirth Group trials register and the Cochrane Controlled Trials Regist
(Length=13) the European Organization for Research and Treatment of Cancer Quality of Life Questi
92. It will be hard work!
http://previews.123rf.com/images/dirkercken/dirkercken1208/dirkercken120800053/14852048-hard-work-ahead-tough-job-be-ambitious-even-if-you-have-a-difficult-challenging-task-with-impact-to--Stock-Photo.jpg
93. New Direction & Thinking!
http://switchandshift.com/11-trademarks-of-rebellious-leadership
128. Maximal repeats appearing in
all 24 human chromosomes
• Maximal repeat length <= 500 bp
– OK!
• Maximal repeat length <= 1000 bp
– Disk space full!
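The idea of a class frequency distribution with a length cap can be sketched on toy data as follows. This is a brute-force illustration under assumed definitions, not the method used in the talk: each sequence is tagged with a class label (here a made-up chromosome name), every maximal repeat up to `max_len` gets a per-class count, and the repeats present in every class are then selected. All names and sequences are hypothetical.

```python
from collections import defaultdict

def class_frequency_of_maximal_repeats(seqs_by_class, max_len):
    """Map each maximal repeat (length <= max_len) to {class: count}."""
    # Collect (class, start) for every substring up to the length cap.
    hits_of = defaultdict(list)
    for label, seq in seqs_by_class.items():
        for i in range(len(seq)):
            for j in range(i + 1, min(len(seq), i + max_len) + 1):
                hits_of[seq[i:j]].append((label, i))

    table = {}
    for sub, hits in hits_of.items():
        if len(hits) < 2:
            continue
        k = len(sub)
        left, right = set(), set()
        for label, i in hits:
            seq = seqs_by_class[label]
            # Sequence ends count as contexts of their own.
            left.add(seq[i - 1] if i > 0 else ("^", label))
            right.add(seq[i + k] if i + k < len(seq) else ("$", label))
        if len(left) > 1 and len(right) > 1:  # maximal on both sides
            freq = defaultdict(int)
            for label, _ in hits:
                freq[label] += 1
            table[sub] = dict(freq)
    return table

def shared_by_all(table, labels):
    """Keep only repeats that occur in every class."""
    return {s: f for s, f in table.items() if all(l in f for l in labels)}

# Made-up toy sequences standing in for chromosomes.
toy = {"chr1": "ACGTACGA", "chr2": "TTACGC", "chr3": "GACGT"}
table = class_frequency_of_maximal_repeats(toy, max_len=5)
shared = shared_by_all(table, toy.keys())
longest = max(shared, key=len)
print(longest, shared[longest])
```

In this sketch the number of candidate substrings held in `hits_of` grows roughly in proportion to the cap, which gives some intuition for why a larger cap is more expensive to process; the slide itself does not state the cause of the disk space failure.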