The East Asian Studies Macroscope (EASM) is a joint effort by faculty and staff from the UCLA Department of Asian Languages and Cultures, the UCLA Library, and the UCLA Center for Digital Humanities. Its aim is to build partnerships with institutions in East Asia that hold significant digitized text archives, in order to develop software tools and practices for advanced collaborative research using digital corpora. These efforts build on the field’s notable successes in creating single-corpus digital collections and interfaces, and seek to develop technological infrastructure and methods that can work with multiple corpora held at different institutions.
This talk will briefly review the results of EASM pilot projects conducted with large digitized collections of poetry from the Tang Dynasty and Heian-period Japan. These examples highlight the key infrastructural elements of the proposed platform and their contributions to scholarship: 1) remote, authorized computational access to multiple large-scale corpora, especially those that cannot be shared in full due to their size and/or access restrictions; 2) support for analytical tools that operate across collections, such as multi-corpus topic modeling and network analysis; and 3) features for scholarly collaboration at all stages of the research process, enabling sharing and critiquing of experimental workflows, results, and visualizations.
The East Asian Studies Macroscope: Infrastructure for Collaborative Scholarship across Corpora and Institutions
1. The East Asian Studies Macroscope
@PeterBroadwell, UCLA Digital Library
Digital Research in East Asian Studies: July 12, 2016
The East Asian Studies Macroscope: Infrastructure for Collaborative Scholarship across Corpora and Institutions
Peter Broadwell
Academic Projects Developer
UCLA Digital Library
broadwell@library.ucla.edu
@PeterBroadwell
EASM | 東亞研究 宏觀鏡 | ヒュー:マ | 인문학 매크로스콥
Prof. Tim Tangherlini, Prof. Jack Chen
2. The East Asian Studies Macroscope
Timothy R. Tangherlini, “The Folklore Macroscope: Challenges for a Computational Folkloristics,”
The 34th Archer Taylor Memorial Lecture, Western Folklore 72, no. 1 (2013): 7-27.
Vision for a humanities macroscope: an integrated suite of digital tools and interfaces that allows researchers to model the complexity of cultural phenomena, moving between close reading, distant reading, and all levels in between.
Scales: micro-scale, meso-scale, macro-scale
3. The East Asian Studies Macroscope
Key features of a distributed humanities macroscope infrastructure
Facilitates secure use of restricted-access collections
• Sensitive data can remain on its home server
• Access to data is via secure protocols
• Support for server-side processing: if necessary, only summaries and/or results are exported from corpus servers
Researchers can run analyses across multiple collections hosted at different participating institutions
• This enables novel types of research that cannot be done on locally downloaded data, or even on the host’s servers
• Multi-corpus stylometry, topic modeling
• Cross-corpus network analysis, geo-coding, etc.
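As a rough illustration of the server-side processing idea above, the Python sketch below keeps each collection on its own simulated server and lets only summaries leave it. Every name here (`CorpusServer`, `summary_only`, `char_frequencies`, `cross_corpus`) is hypothetical and not part of EASM. The sample lines are the widely circulated version of Li Bai's 靜夜思, used purely as toy data.

```python
from collections import Counter
from typing import Callable, List


class CorpusServer:
    """Simulated home server for one collection (hypothetical, for illustration)."""

    def __init__(self, name: str, texts: List[str], summary_only: bool = True):
        self.name = name
        self._texts = texts               # raw texts stay inside this object
        self.summary_only = summary_only  # assumed access-policy flag

    def export_texts(self) -> List[str]:
        """Refuse full-text export when the policy is summary-only."""
        if self.summary_only:
            raise PermissionError(f"{self.name}: full-text export not permitted")
        return list(self._texts)

    def run(self, tool: Callable[[List[str]], Counter]) -> Counter:
        """Run an analysis next to the data and return only its result."""
        return tool(self._texts)


def char_frequencies(texts: List[str]) -> Counter:
    """Toy summary tool: character counts across all texts."""
    counts: Counter = Counter()
    for text in texts:
        counts.update(text)
    return counts


def cross_corpus(servers: List[CorpusServer],
                 tool: Callable[[List[str]], Counter]) -> Counter:
    """Combine the per-server summaries at the portal."""
    total: Counter = Counter()
    for server in servers:
        total += server.run(tool)
    return total


server_a = CorpusServer("Institution A", ["床前明月光", "疑是地上霜"])
server_b = CorpusServer("Institution B", ["舉頭望明月", "低頭思故鄉"])
combined = cross_corpus([server_a, server_b], char_frequencies)
print(combined["月"], combined["頭"])
```

The portal only ever sees `Counter` objects. Calling `export_texts()` on either server raises `PermissionError`, which stands in for whatever access policy a host institution would actually enforce.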
4. The East Asian Studies Macroscope
The macroscope research environment
[Diagram labels: Institution A, Institution B, collections X, Y, Z, access policies, macroscope portal, Tool A, Tool B, Findings]
5. The East Asian Studies Macroscope
Features of a macroscope research portal
1. Users and group accounts
• Users may choose with whom to share materials
2. Corpus access management
• Authenticates access to external or local data sets
3. Analytical tools and workflow development
• Researchers may run existing tools, or create their own
4. Visualization and sharing of research results
• Scholars can present, view, and comment on findings
• Analytical results can be made available for download
6. The East Asian Studies Macroscope
Features of a macroscope research portal
1. Users and group accounts
• Users may choose with whom to share materials
Example: Liferay user portal (www.liferay.com)
7. The East Asian Studies Macroscope
User and group accounts
[Diagram labels: Institution A, Institution B, collections X, Y, Z, access policies, macroscope portal, Tool A, Tool B, Findings]
8. The East Asian Studies Macroscope
Features of a macroscope research portal
2. Corpus access management
• Authenticates access to external or local data sets
Example: Alveo corpus selection interface (http://alveo.edu.au/)
9. The East Asian Studies Macroscope
Corpus access policies
[Diagram labels: Institution A, Institution B, collections X, Y, Z, access policies, macroscope portal, Tool A, Tool B, Findings]
10. The East Asian Studies Macroscope
Features of a macroscope research portal
3. Analytical tools and workflow development
• Researchers may run existing tools, or create their own
Example: Network creation and analysis workflow in Knime (https://www.knime.org/)
11. The East Asian Studies Macroscope
Tools and workflow development
[Diagram labels: Institution A, Institution B, collections X, Y, Z, access policies, macroscope portal, Tool A, Tool B, custom workflows, Findings]
12. The East Asian Studies Macroscope
Features of a macroscope research portal
4. Visualization and sharing of research results
• Scholars can present, view, and comment on findings
• Analytical results can be made available for download
Example: Network analysis results visualized in multiple offline tools: Knime, Visone, Cytoscape, Gephi
13. The East Asian Studies Macroscope
Sharing of research results
[Diagram labels: Institution A, Institution B, collections X, Y, Z, access policies, macroscope portal, Tool A, Tool B, Findings]
14. The East Asian Studies Macroscope
An example: Communication & empire(s) – 全唐詩 (Quan Tang shi) and Heian 漢詩 (kanshi)
Special thanks: Tomoko Bialock, Japanese Studies Librarian, UCLA Library
Image sources: Wikimedia Commons
15. The East Asian Studies Macroscope
The Hentaigana mobile app
Funded by the Tadashi Yanai Initiative for Globalizing Japanese Humanities
Supported by the UCLA Library
16. The East Asian Studies Macroscope
Hentaigana, classical Japanese, and digital scholarship
17. The East Asian Studies Macroscope
Image analysis of thumbnails from IIIF manifest
18. The East Asian Studies Macroscope
Topics in Genji monogatari (ca. 1020)
20. The East Asian Studies Macroscope
Advanced n-gram viewers
21. The East Asian Studies Macroscope
t-SNE dimensionality reduction
22. The East Asian Studies Macroscope
“Confusion matrix” of original vs. naïve Bayes poem genre classifications
Mimno, Broadwell, Tangherlini. 2014. “The Telltale Hat: LDA and Classification Problems in a Large Folklore Corpus.” DH 2014, Lausanne, Switzerland.
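A confusion matrix of this kind tabulates, for each original genre label, how often the classifier assigned each possible label. The following is a minimal Python sketch of that tabulation only. The function name `confusion_matrix` and the labels `A`, `B`, `C` are placeholders for illustration, not results or code from the talk, and no classifier is trained here.

```python
from collections import Counter
from typing import Dict, List, Sequence


def confusion_matrix(actual: Sequence[str],
                     predicted: Sequence[str]) -> Dict[str, Dict[str, int]]:
    """Rows are the original labels; columns are the classifier's labels."""
    if len(actual) != len(predicted):
        raise ValueError("label sequences must have equal length")
    labels: List[str] = sorted(set(actual) | set(predicted))
    pairs = Counter(zip(actual, predicted))
    return {a: {p: pairs[(a, p)] for p in labels} for a in labels}


# Placeholder labels for illustration only.
original = ["A", "A", "B", "B", "C"]
assigned = ["A", "B", "B", "B", "A"]
matrix = confusion_matrix(original, assigned)
for row_label, row in matrix.items():
    print(row_label, row)
```

Cells on the diagonal count agreements between the original and assigned genres. Off-diagonal cells show which genres the classifier confuses with one another.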
24. The East Asian Studies Macroscope
Heian/Kamakura kanshi collections
• Kaifūsō 懐風藻 – 751 (116 poems)
• Ryōunshū 凌雲集 – 814 (91 poems)
• Bunka shūreishū 文華秀麗集 – 818 (140 poems)
• Keikokushū 経国集 – 827 (213 poems)
• Toshi bunshū 都氏文集 – 879, poems probably by 都良香 (Miyako no Yoshika) (71 poems)
• Den-shikashū 田氏家集 – 891, written or collected by 島田忠臣 (Shimada no Tadaomi) (217 poems)
• Kanke bunzō 菅家文草 – ca. 900; 468 poems by 菅原道眞 (Sugawara no Michizane), rest Buddhist texts
• Zenshūsai-taku shi-awase 善秀才宅詩合 – 963, from a poetry contest (12 poems)
• Fusōshū 扶桑集 – 995-999 (100 poems)
• Honchō reisō 本朝麗藻 – 1010 (153 poems)
• Gōrihōshū 江吏部集 – 1011 (135 poems)
• Wakan rōeishū 和漢朗詠集 – 1013 (225 poems)
• Jishin shi-awase 侍臣詩合 – 1051, from a courtiers’ poetry contest (8 poems)
• Hosshōji Kanpaku goshū 法性寺關白御集 – by 藤原忠通 (Fujiwara no Tadamichi, 1097-1164) (102 poems)
• Honchō mudaishi 本朝無題詩 – 1162-64 (658 poems)
• Tenjō shi-awase 殿上詩合 – from a palace competition, year unknown (40 poems)
• Sukezane Nagakane ryōkyō hyakuban shi-awase 資實長兼百番詩合 – Sukezane and Nagakane Lords’ 100 poem contests, year unknown (200 poems)
• Poems about Genji monogatari 賦光源氏物語詩 – 1291 (55 poems)
25. The East Asian Studies Macroscope
Heian/Kamakura 漢詩 collections
3,004 total poems
Historical source: Gunsho Ruijū (群書類従), published 1894-1912
Partially digitized in Waseda University’s Kanshi Database, the Internet Archive (?)
Major source: an enthusiast’s site, http://miko.org/~uraki/kuon/furu/furu_index1.htm
Internet Archive
26. The East Asian Studies Macroscope
The organization of the 全唐詩
• poems are organized according to categories (e.g., imperial authorship, “Music Bureau” poetry, insult poetry)
• the bulk of the poems belong to individual authors, organized historically (別集)
• authors may be excluded from the historical organization based on certain traits (women, Buddhists, Daoists, ghosts)
• later fascicles are largely miscellaneous in character
27. The East Asian Studies Macroscope
全唐詩 table of contents
1-9 Emperors, empresses, and imperial members
10-16 Ritual and ceremonial poems
17-29 “Music Bureau” poetry (樂府詩)
30-731 Individual Tang poets
732-733 Dynastic villains and rebels
734-766 Individual Five Dynasties poets
767-784 Poets with partial biographical information
785-787 Poems without authorial attribution
788-794 Linked verse poems (聯句)
795 Incomplete poems and lines by poets not listed above
796 Incomplete poems and lines without authorial attribution
797-805 Poems by women authors
806-851 Poems by Buddhist figures
852-859 Poems by Daoist figures
860-862 Poems by male immortals (仙)
863 Poems by female immortals (女仙)
864 Poems by divinities (神)
865-866 Poems by ghosts (鬼)
28. The East Asian Studies Macroscope
全唐詩 table of contents, continued
867 Poems by weirds (怪)
868 Dream poems (夢)
869-872 Jest and insult poems (諧謔)
873 Poems inscribed on walls (提語) and judgments (判)
874 Songs (歌) sung by local communities or groups
875 Prophetic verse (讖記)
876 Sayings in verse form (語)
877 Orally transmitted enigmatic verse (諺謎)
878 Orally transmitted ditties (謠)
879 Drinking songs (酒令)
880 Divination songs (占辭)
881 The Mengqiu 蒙求 by Li Han 李瀚
882-888 Poems left out of previous sections
889-900 Song-lyrics (詞)
29. The East Asian Studies Macroscope
http://etkspace.scandinavian.ucla.edu/~broadwell/poem_clusters.html
Macro-scale: clustering by shared n-grams
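The slide does not say how the clusters were computed. As one simple possibility, the Python sketch below compares texts by the Jaccard overlap of their character bigrams and then groups any texts linked by a similarity at or above a threshold (single linkage via union-find). The helper names, the threshold, and the three toy strings are all assumptions made for illustration, not corpus data or EASM code.

```python
from itertools import combinations
from typing import Dict, List, Set


def bigram_set(text: str) -> Set[str]:
    """Set of overlapping character bigrams in a text."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Shared bigrams divided by all distinct bigrams in either text."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(texts: Dict[str, str], threshold: float) -> List[Set[str]]:
    """Group texts connected by pairwise similarity >= threshold."""
    parent = {name: name for name in texts}

    def find(x: str) -> str:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    grams = {name: bigram_set(t) for name, t in texts.items()}
    for p, q in combinations(texts, 2):
        if jaccard(grams[p], grams[q]) >= threshold:
            parent[find(p)] = find(q)

    groups: Dict[str, Set[str]] = {}
    for name in texts:
        groups.setdefault(find(name), set()).add(name)
    return list(groups.values())


# Toy strings, not corpus data.
toy = {
    "p1": "秋風吹明月",
    "p2": "明月照秋風",
    "p3": "春水生白雲",
}
print(cluster(toy, threshold=0.25))
```

Here `p1` and `p2` share two of six distinct bigrams (秋風 and 明月), so they fall into one group, while `p3` shares none and stays alone. A real corpus would likely need longer n-grams, weighting, or a different clustering method.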
30. The East Asian Studies Macroscope
The 900 卷 of the 全唐詩 + 18 volumes of 平安時代の漢詩 (Heian-period kanshi)
31. The East Asian Studies Macroscope
The 900 卷 of the 全唐詩 + 18 volumes of 平安時代の漢詩 (Heian-period kanshi)
Figure annotations:
• 18 kanshi collections
• Individual poets given fairly contiguous 卷 ranges in the 全唐詩, in roughly chronological order (別集)
• 卷 424-462: 白居易 Bai Juyi (772-846)
• 卷 216-234: 杜甫 Du Fu (712-770)
• 卷 161-185: 李白 Li Bai (701-762)
32. The East Asian Studies Macroscope
Meso-scale: LDA topic modeling (全唐詩)
33. The East Asian Studies Macroscope
EASM: topic modeling the Quan Tang shi
Meso-scale: LDA topic modeling (全唐詩)
34. The East Asian Studies Macroscope
Meso-scale: LDA topic modeling (漢詩)
35. The East Asian Studies Macroscope
EASM: topic modeling the Quan Tang shi
Meso-scale: LDA topic modeling (漢詩)
36. The East Asian Studies Macroscope
Subcorpus Topic Modeling (STM)
37. The East Asian Studies Macroscope
Subcorpus Topic Modeling (STM)
[Diagram labels: Institution A, Institution B, corpora X and Y, access policies, two summary tools, macroscope portal, STM tool]
X = well-known corpus, e.g., the 13 Classics
Y = large, unknown corpus, e.g., Tang prose
38. The East Asian Studies Macroscope
Subcorpus Topic Modeling (STM)
[Diagram labels: Institution A, Institution B, corpora X and Y, access policies, two summary tools, macroscope portal, STM tool]
X = well-known corpus, e.g., the 13 Classics
Y = large, less well known corpus, e.g., Tang prose
40. The East Asian Studies Macroscope
Subcorpus Topic Modeling (STM)
[Diagram labels: Institution A, Institution B, corpora X and Y, access policies, two summary tools, macroscope portal, STM tool, (topics)]
X = well-known corpus, e.g., the 13 Classics
Y = large, less well known corpus, e.g., Tang prose
44. The East Asian Studies Macroscope
Subcorpus Topic Modeling (STM)
47. The East Asian Studies Macroscope
Micro-scale: word embedding
48. The East Asian Studies Macroscope
Distributed macroscope infrastructure
Option 1: secure data access
[Diagram labels: Institution A, Institution B, collections X, Y, Z, access policies, macroscope portal, Tool A, Tool B, Findings]
49. The East Asian Studies Macroscope
Distributed macroscope infrastructure
Option 2: summary data only, e.g., bibliographic metadata
[Diagram labels: Institution A, Institution B, collections X, Y, Z, access policies, summary tool, macroscope portal, Tool A, Tool B, Findings]
50. The East Asian Studies Macroscope
Distributed macroscope infrastructure
Option 2: summary data (n-gram counts)
[Diagram labels: Institution A, Institution B, collections X, Y, Z, access policies, summary tool, macroscope portal, Tool A, Tool B, Findings]
Sample n-gram counts (three lists):
• 時時: 14, 磷緇: 7, 日遲: 6, 相隨: 5, 移時: 4
• 蟬鳴: 25, 秋色: 19, 唧唧: 9, 秋風: 3, 秋雨: 2
• 笙歌: 33, 多少: 12, 歌吹: 8, 曲歌: 4, 爭唱: 2
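Counts like those above are the kind of summary a corpus server could export in place of full texts. The following is a minimal Python sketch, assuming character bigrams counted within each line. The helpers `bigrams` and `top_bigrams` are hypothetical names, and the sample lines are the widely circulated version of Li Bai's 靜夜思 rather than the data behind the slide.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def bigrams(line: str) -> List[str]:
    """Return the overlapping two-character sequences within one line."""
    return [line[i:i + 2] for i in range(len(line) - 1)]


def top_bigrams(lines: Iterable[str], k: int = 5) -> List[Tuple[str, int]]:
    """Count bigrams line by line (never across line breaks); keep the top k."""
    counts: Counter = Counter()
    for line in lines:
        counts.update(bigrams(line))
    return counts.most_common(k)


poem = ["床前明月光", "疑是地上霜", "舉頭望明月", "低頭思故鄉"]
for gram, n in top_bigrams(poem, k=3):
    print(f"{gram}: {n}")
```

Only the `(bigram, count)` pairs would need to leave the host server. Whether such counts are coarse enough to satisfy a given access policy is a question for each institution.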
51. The East Asian Studies Macroscope
Distributed macroscope infrastructure
Option 3 (not as desirable)
[Diagram labels: Institution A, Institution B, collections X, Y, Z, access policies, macroscope portal, Tool A, Tool B (shown twice), Findings]
52. The East Asian Studies Macroscope
Distributed macroscope infrastructure
Option 3 (not as desirable): results only
[Diagram labels: Institution A, Institution B, collections X, Y, Z, access policies, macroscope portal, Tool A, Tool B, Findings]
53. The East Asian Studies Macroscope
Joël de Rosnay, The Macroscope (New York: Harper & Row, 1979).
A macroscope lets us “observe what is at once too great, slow, or complex for the human eye and mind to notice and comprehend”
Katy Börner, “Plug-and-Play Macroscopes,” Communications of the ACM 54, no. 3 (2011): 60-69.
54. The East Asian Studies Macroscope
Sample macroscope sites at UCLA
East Asian Studies (EASM): http://macroscope.cdh.ucla.edu
Funding source:
• The Andrew W. Mellon Foundation
The Danish Folklore Macroscope: http://etkspace.scandinavian.ucla.edu/macroscope.html
Funding sources:
• American Council of Learned Societies
• The National Endowment for the Humanities
• UCLA Council on Research
• Nordic Council of Ministers
• UCLA Institute for Pure and Applied Mathematics (NSF)
55. The East Asian Studies Macroscope
The East Asian Studies Macroscope (EASM)
Exploratory phase (Phase 0): 2014-2015
• Supported by the Andrew W. Mellon Foundation; Co-PIs: Profs. Jack Chen and Timothy Tangherlini, UCLA Department of Asian Languages and Cultures
• Developed sample macroscope tools and analyses based on a classical Chinese text corpus (collected poetry of the Tang Dynasty): http://macroscope.cdh.ucla.edu
• Meetings with faculty and archivists at Academia Sinica, National Taiwan University, National Tsing Hua University, Dharma Drum Buddhist College, and National Chengchi University, Jan. & Dec. 2015
Implementation Phase 1: 2016-2019 (pending)
• Prospectus submitted to the Andrew W. Mellon Foundation, January 2016
• Plan to develop software infrastructure and tools, establish partnerships with archival institutions in Taiwan, Korea, others (?)
56. The East Asian Studies Macroscope
Other macroscope development projects
Sub-corpus topic modeling for large literary corpora
• Supported by a Google Books research fellowship at UCLA, 2013-2014
• Resulting publication: Tangherlini, T and P Leonard. 2014. “Trawling in the Sea of the Great Unread: Sub-corpus topic modeling and Humanities research.” Poetics 41 (6): 725-749.
Collaborations with Scandinavian partners
• Project title: “New Digital Resources and Computational Methods for the Study of Literature in a Global Context,” 2015-present
• Funded by the Transatlantic program for collaborative work in the field of digital humanities, Fondation Maison des Sciences de l'Homme (France) and the Andrew W. Mellon Foundation (USA)
• Core participants: UCLA, Aarhus University (Denmark). Exploring collaborations with archives in Denmark, Norway, Sweden
57. The East Asian Studies Macroscope
EASM: mapping places mentioned in poems
Special thanks: David Shepard, Lead Academic Developer, UCLA Center for Digital Humanities
58. The East Asian Studies Macroscope
EASM: network graph of poem communities
Special thanks: David Shepard, Lead Academic Developer, UCLA Center for Digital Humanities