International Collaboration Networks in the Emerging (Big) Data Science

HanWoo Park
Dept. of Media & Communication
YeungNam University
214-1 Dae-dong, Gyeongsan-si,
Gyeongsangbuk-do 712-749
Republic of Korea
www.hanpark.net
Loet Leydesdorff
Amsterdam School of Communication
Research (ASCoR)
University of Amsterdam
Kloveniersburgwal 48, 1012 CX
Amsterdam, The Netherlands
loet@leydesdorff.net
This presentation is based on Park, H.W., & Leydesdorff, L. (2013 forthcoming). Decomposing Social and Semantic Networks
in Emerging “Big Data” Research. Journal of Informetrics*.

[Contents]
1. Concept and characteristics of big data
2. Background of data science
3. (Big) data R&D trends
4. Social issues and implications

Big data
• The term “big data” refers to “analytical technologies that
have existed for years but can now be applied faster, on
a greater scale and are accessible to more users” (Miller,
2013).
• Big data sizes may vary per discipline.
• Characteristics: Gartner’s 3Vs plus SAS’s Variability and
Complexity and IBM’s Veracity
- Volume (amount of data), Velocity (speed of data in and
out), Variety (range of data types and sources)
- Variability: Data flows can be highly inconsistent with
daily, seasonal, and event-triggered peak data loads
- Complexity: Multiple data sources requiring cleaning,
linking, and matching the data across systems
- Veracity: 1 in 3 business leaders don’t trust the
information they use to make decisions.
http://en.wikipedia.org/wiki/Big_data
http://www-01.ibm.com/software/data/bigdata/

Data-driven Research that focuses
on extracting meaningful data from
techno-socio-economic systems to
discover some hidden patterns.

“Data Science” refers to “a discipline that incorporates
varying elements and builds on techniques and theories
from many fields, including data visualization with the goal of
extracting meaning from data and creating data products.”
http://en.wikipedia.org/wiki/Data_science

Today’s “big” is probably tomorrow’s “medium” and
next week’s “small” and thus the most effective definition
of “big data” may be derived when the size of data
itself becomes part of the research problem.
Loukides (2012)

Origin of Data Science
• One origin is Peter Naur’s 1974 book “Concise Survey of Computer Methods”,
a survey of contemporary data processing methods in a wide range of
applications (Gilpress, 2012).
• The other is the first appearance of the term “big data” in 1970 in the
Scopus database (Halevi and Moed, 2012). There has been no particular key
milestone since the 1970s.
• During the 1990s, the term was usually related to computer
modeling and software development for large datasets. Knowledge
Discovery and Data Mining in 1997. Rousseau (2012) also regards the
1993 publication as the first document indexed in the Web version of
Web of Science.

A more recent development was made with
the establishment of journals that included the
term “Data Science” in their titles:
• Data Science Journal in 2002
• Journal of Data Science in 2003
• EPJ Data Science in 2012
• Journal of Big Data in 2013
• GigaScience (gigasciencejournal.com) in 2012

Science published a special
issue (February 11, 2011) looking
broadly at increasingly data-driven
research efforts as a scientific
domain (Science staff, 2011).
Data Science is composed of interrelated
clusters of research tasks. For example, the
technologies on data collection, curation, and
access, and the unique skill sets have
increasingly been central to Data Science
(Science staff, 2011).

An international conference called “Data Science
Summit” (http://www.greenplum.com/datasciencesummit).

Requoted from http://novaspivack.typepad.com/nova_spivacks_weblog/2007/02/steps_towards_a.html

All models are wrong but some are useful
Emergence of data author on dataverse

Anderson's claims
• Data is everything we need.
• We don't have to settle for models.
• Agnostic statistics.
• Out with every theory of human behavior.
• This approach to science (hypothesize, model,
test) is becoming obsolete.
• Petabytes allow us to say: "Correlation is enough."
We can stop looking for models.
• What can science learn from Google? E-Science.

Computational (Social) Science
Park, H.W., & Leydesdorff, L. (2013 Work-In-Progress). Decomposing a Data-Driven Science Using a Scientometric Method.
• Focus on the methodological perspective based on
the use of new digital tools to manage the data deluge.
• Development of e-science tools to automate
the research process.
• Experimentation with new types of data
visualization.

http://participatorysociety.org/wiki/index.php?title=Online_Research

Why Data Science?
Savage and Burrows (2007, p.
886) lament, “Fifty years ago,
academic social scientists might
be seen as occupying the apex
of the – generally limited – social
science research ‘apparatus’.
Now they occupy an increasingly
marginal position in the huge
research infrastructure”.
Bonacich, P. (2004).
The Invasion of the Physicists. Social Networks 26(3): 285-288

This approach to science is attributed to the late Jim Gray,
one of the most influential computer scientists, at Microsoft.

“The fourth paradigm”
Research purpose lies in handling huge
amounts of data from technological,
sociological, and economic systems to
discover some hidden patterns.
Jim Gray

Global Communication 2team
Challenges of (Big) Data Science
The End of Theory: Evidence-Based Management
Jeffrey Pfeffer, Robert I. Sutton (2006)
How companies can bolster performance and trump the
competition through evidence-based management, an
approach to decision-making and action that is driven by
hard facts rather than half-truths or hype.
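· With the emergence of big data, traditional
scientific research methodology is fading
· Data that exceed the limits of human
perception (patterns, not facts)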

The Signal and the Noise:
Why Most Predictions Fail but Some Don't. Nate Silver
I do not go as far as a Popper in asserting that such
theories are therefore unscientific or that they lack any
value. However, the fact that the few theories we can
test have produced quite poor results suggests that
many of the ideas we haven’t tested are very wrong as
well. We are undoubtedly living with many delusions
that we do not even realize.
page 15

OECD (2012). OECD Technology Foresight Forum
2012 - Harnessing data as a new source of growth:
Big data analytics and policies. OECD Headquarters,
Paris, France 22 October 2012

Big data and the end of theory?
• Does big data have the answers? Maybe some, but not all, says
Mark Graham
• In 2008, Chris Anderson, then editor of Wired, wrote a
provocative piece titled The End of Theory. Anderson was
referring to the ways that computers, algorithms, and big data can
potentially generate more insightful, useful, accurate, or true
results than specialists or domain experts who traditionally craft
carefully targeted hypotheses and research strategies.
• We may one day get to the point where sufficient quantities of big
data can be harvested to answer all of the social questions that
most concern us. I doubt it though. There will always be digital
divides; always be uneven data shadows; and always be biases in
how information and technology are used and produced.
• And so we shouldn't forget the important role of specialists to
contextualize and offer insights into what our data do, and maybe
more importantly, don't tell us.
http://www.guardian.co.uk/news/datablog/2012/mar/09/big-data-theory

Number of “Big data” papers per year
Halevi, G., & Moed, H. F. (2012).

Rousseau (2012)
We performed a similar search in the WoS (TS=“Big data”) on October 2,
2012, leading to 142 articles. We removed the oldest one (1974), and
kept 141 published during the period 1993-2012. Halevi and Moed
observed an over-exponential growth over the period 1970-2011, while
we found a growth curve that could best be described by a cubic
polynomial (R² = 0.963, with year 1992 = 0), which is illustrated in Fig. 1.
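
A cubic fit of this kind can be sketched as follows. This is only an illustration of the technique under stated assumptions: the yearly counts are synthetic (they are not the WoS counts behind Rousseau's figure), and the helper names `solve_linear`, `fit_polynomial`, and `r_squared` are hypothetical. The only detail taken from the text is the time axis, with 1992 coded as 0.

```python
# Least-squares cubic fit with R^2, standard library only.
# ASSUMPTION: the counts below are synthetic, not the WoS data.

def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def fit_polynomial(xs, ys, degree=3):
    """Return coefficients [c0, c1, ..., c_degree] from the normal equations."""
    k = degree + 1
    ata = [[sum(x ** (i + j) for x in xs) for j in range(k)] for i in range(k)]
    aty = [sum((x ** i) * y for x, y in zip(xs, ys)) for i in range(k)]
    return solve_linear(ata, aty)

def r_squared(xs, ys, coeffs):
    """Coefficient of determination of the fitted polynomial."""
    mean_y = sum(ys) / len(ys)
    predicted = [sum(c * x ** i for i, c in enumerate(coeffs)) for x in xs]
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predicted))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

years = list(range(1993, 2013))
t = [year - 1992 for year in years]  # year 1992 = 0, as in the text
counts = [round(0.02 * v ** 3 + (-1) ** v) for v in t]  # synthetic counts

coeffs = fit_polynomial(t, counts, degree=3)
r2 = r_squared(t, counts, coeffs)
print("coefficients:", coeffs)
print("R^2:", r2)
```

With real yearly counts in place of `counts`, the same two calls would give the cubic coefficients and the R² to compare against the reported 0.963.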

Subject areas researching Big Data
Halevi, G., & Moed, H. F. (2012).

Geographical Distribution of Big Data papers
Halevi, G., & Moed, H. F. (2012).

Phrase map of highly occurring keywords 1999-2005
Halevi, G., & Moed, H. F. (2012).

Phrase map of highly occurring keywords 2006-2012
Halevi, G., & Moed, H. F. (2012).

Park, H. W., & Leydesdorff, L. (2013 Work-In-Progress). Decomposing a Data-Driven Science Using a Scientometric Method.
• However, Halevi and Moed (2012) and Rousseau (2012) rely on
descriptive statistics. We therefore add the network perspective,
in terms of both social networks (co-authorship) and semantic
networks.
• Furthermore, we extend the search queries to various
terminologies related to Data Science because the term
“big data” is regarded only as one among a list of policy
priority issues.
• We show where the research system in Data Science is
“hot” in terms of international collaborations and
prevailing semantics.

Problem Statement
Previous studies have not systematically
examined whether research efforts driven by
various sources of big data are really becoming
increasingly widespread across the world.
Further, the status of the literature based on big
data has not been extensively discussed or
sufficiently examined with respect to its
semantic variations, disciplinary scope,
institutional adoption, and international
collaboration.

• We employed a method rooted in social network analysis
(SNA) (Hanneman & Riddle, 2005).
• Here the unit of analysis is the node, a point in the network
that ties connect to other nodes.
• A tie is a connection between two nodes in a network.
• We treated countries as nodes and measured the tie between two
countries as the number of papers co-authored by researchers whose
addresses are in those two different countries.
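
The node and tie definitions above can be sketched in a few lines. This is a minimal illustration, not the authors' procedure: the paper records are hypothetical, and counting each paper once per country pair, regardless of how many authors come from each country, is an assumption.

```python
from collections import Counter
from itertools import combinations

# Hypothetical records: each paper is the list of countries found in its
# author addresses (illustrative only, not taken from the SCI data).
papers = [
    ["USA", "SOUTH_KOREA", "USA"],
    ["USA", "NETHERLANDS"],
    ["SOUTH_KOREA", "NETHERLANDS", "USA"],
    ["FRANCE"],  # single-country paper: adds no international tie
]

def country_ties(papers):
    """Count, for every pair of countries, the papers they co-authored."""
    ties = Counter()
    for addresses in papers:
        countries = sorted(set(addresses))  # each country once per paper
        for a, b in combinations(countries, 2):
            ties[(a, b)] += 1
    return ties

ties = country_ties(papers)
for (a, b), weight in sorted(ties.items()):
    print(a, b, weight)
```

The resulting counts form a symmetric, valued country-by-country matrix, which is the kind of input that density, centrality, and structural hole measures take.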

• We considered papers published in SCI journals in 2011.
• We selected three types of documents: journal articles, letters,
and reviews.
• We obtained the data from the DVD version of the SCI database
by using several search terms based on titles, author keywords,
and KeyWords Plus.

As expected, the global co-authorship network was far
denser than the subnetwork, that is, co-authorship in
big data research. Note that these were not really co-
authorship relationships between countries but
relationships between them measured in terms of co-
authorship relationships. The sum of ties in the global
network and that of the subnetwork were 1,073,764
and 10,798, respectively. In addition, the global network
was more centralized around hub countries than the
network of big data science in terms of all three
measures of centrality. However, the QAP correlation
between the whole 2011 co-authorship network and
big data research demonstrates their significant
relationship: this (Pearson) correlation was .740 (p
< .001).
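
For readers unfamiliar with QAP, the generic procedure can be sketched as follows: correlate the off-diagonal cells of two matrices, then repeat after randomly relabelling the nodes of one matrix (rows and columns together) to see how often chance alone gives a correlation at least as large. The two small matrices are hypothetical, the function names are illustrative, and the source does not state which software or settings produced the reported .740.

```python
import random

def off_diagonal(matrix, order=None):
    """Flatten the off-diagonal cells, optionally after relabelling nodes."""
    n = len(matrix)
    order = order if order is not None else list(range(n))
    return [matrix[order[i]][order[j]]
            for i in range(n) for j in range(n) if i != j]

def pearson(xs, ys):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def qap_correlation(a, b, permutations=1000, seed=0):
    """Observed correlation and one-tailed permutation p-value."""
    rng = random.Random(seed)
    observed = pearson(off_diagonal(a), off_diagonal(b))
    nodes = list(range(len(a)))
    hits = 0
    for _ in range(permutations):
        rng.shuffle(nodes)  # same relabelling applied to rows and columns
        if pearson(off_diagonal(a), off_diagonal(b, nodes)) >= observed:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)

# Hypothetical 5-country matrices: "whole" network and a "subfield" network.
whole = [[0, 9, 4, 1, 0],
         [9, 0, 3, 2, 1],
         [4, 3, 0, 1, 0],
         [1, 2, 1, 0, 0],
         [0, 1, 0, 0, 0]]
subfield = [[0, 5, 2, 0, 0],
            [5, 0, 2, 1, 0],
            [2, 2, 0, 1, 0],
            [0, 1, 1, 0, 0],
            [0, 0, 0, 0, 0]]

observed, p_value = qap_correlation(whole, subfield)
print("QAP correlation:", observed, "p:", p_value)
```

Permuting whole nodes rather than individual cells keeps the dependence among a country's ties intact, which is why a QAP p-value is preferred over an ordinary significance test for network data.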

Network Type   Density (S.D.)    Centralization (%)
                                 Degree   Node    Flow
Global         26.71 (245.70)    5.11     10.08   9.83
Big Data       0.01 (0.18)       4.37     2.70    2.28
N=201.
Comparison of Density and Centralization Values
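
As a worked check on how a valued density can be read, the sketch below assumes that density is the mean tie value over all ordered pairs of distinct nodes. That reading is an assumption, not something the slides state; it is used here because it reproduces the reported global value from the reported sum of ties (1,073,764) and N = 201. The function name is hypothetical.

```python
def valued_density(tie_sum, n_nodes):
    """Mean tie value over all ordered pairs of distinct nodes (assumed definition)."""
    return tie_sum / (n_nodes * (n_nodes - 1))

# Figures reported in the text for the global 2011 co-authorship network.
global_density = valued_density(1_073_764, 201)
print(round(global_density, 2))
```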

Rank  Country       Degree   Country       Betweenness   Country       FlowBet
1     U.S.          4.450    U.S.          2.734         U.S.          2.309
2     GERMANY       1.650    FRANCE        1.253         FRANCE        0.929
3     U.K.          1.600    U.K.          0.680         CANADA        0.537
4     FRANCE        1.400    CANADA        0.643         ITALY         0.510
5     AUSTRALIA     1.150    ITALY         0.620         U.K.          0.377
6     NETHERLANDS   1.150    AUSTRALIA     0.602         SOUTH_KOREA   0.359
7     CHINA         1.100    SOUTH_KOREA   0.346         BELGIUM       0.331
8     DENMARK       0.950    GERMANY       0.291         AUSTRALIA     0.328
9     CANADA        0.900    BELGIUM       0.290         JAPAN         0.262
10    TAIWAN        0.850    PORTUGAL      0.266         SLOVENIA      0.200
11    ISRAEL        0.750    JAPAN         0.256         PORTUGAL      0.185
12    SOUTH_KOREA   0.750    CHINA         0.137         CHINA         0.132
13    SWEDEN        0.750    NETHERLANDS   0.104         SPAIN         0.129
14    ITALY         0.700    DENMARK       0.099         GERMANY       0.108
15    PORTUGAL      0.700    SAUDI_ARABIA  0.088         MALAYSIA      0.103
16    IRELAND       0.650    SLOVENIA      0.068         TANZANIA      0.095
17    NORWAY        0.650    TAIWAN        0.057         VENEZUELA     0.095
18    SPAIN         0.650    SPAIN         0.055         NETHERLANDS   0.089
19    SINGAPORE     0.500    ISRAEL        0.037         SAUDI_ARABIA  0.071
20    SWITZERLAND   0.450    AUSTRIA       0.036         AUSTRIA       0.063
Table 4. Centrality Values for Countries

Rank  Country         Effectiveness  Country         Efficiency  Country         Constraint
1     U.K.            13.071         EGYPT           1.000       DENMARK         0.312
2     AUSTRALIA       12.879         INDIA           1.000       NETHERLANDS     0.331
3     FRANCE          12.562         POLAND          1.000       PORTUGAL        0.338
4     U.S.            11.563         UZBEKISTAN      1.000       ISRAEL          0.343
5     GERMANY         10.746         GREECE          0.805       NORWAY          0.345
6     NETHERLANDS     8.873          JAPAN           0.789       IRELAND         0.352
7     DENMARK         8.530          AUSTRIA         0.725       U.K.            0.364
8     PORTUGAL        8.229          BRAZIL          0.722       SWEDEN          0.365
9     ISRAEL          8.208          NEW_ZEALAND     0.722       AUSTRALIA       0.381
10    CANADA          7.672          MALAYSIA        0.698       GERMANY         0.397
11    ITALY           7.554          AUSTRALIA       0.678       FRANCE          0.411
12    IRELAND         7.252          SAUDI_ARABIA    0.667       CANADA          0.532
13    NORWAY          7.214          IRAN            0.667       ITALY           0.535
14    SOUTH_KOREA     6.365          THAILAND        0.667       SAUDI_ARABIA    0.548
15    CHINA           6.057          SINGAPORE       0.659       SWITZERLAND     0.556
16    SWEDEN          5.978          CZECH_REPUBLIC  0.644       U.S.            0.573
17    JAPAN           5.520          CANADA          0.639       SOUTH_KOREA     0.578
18    TAIWAN          5.490          SLOVENIA        0.638       BELGIUM         0.583
19    SPAIN           5.312          SOUTH_KOREA     0.636       SPAIN           0.625
20    SWITZERLAND     4.224          PORTUGAL        0.633       TAIWAN          0.627
Table 5. Structural Hole Values by Country
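
To make the three columns of Table 5 concrete, here is a minimal sketch of Burt's structural hole measures on a small binary, undirected network. It assumes that "Effectiveness" corresponds to effective size (alters minus redundancy), that efficiency is effective size divided by the number of alters, and that constraint uses whole-network tie proportions. The toy network and function name are hypothetical, and the slides do not say which software or variant produced the reported values.

```python
def structural_holes(adjacency):
    """Return {ego: (effective_size, efficiency, constraint)}.

    adjacency maps each node to the set of its neighbours
    (binary, undirected ties; isolates are skipped).
    """
    results = {}
    for ego, alters in adjacency.items():
        n = len(alters)
        if n == 0:
            continue
        # Ties among ego's alters, each counted once.
        t = sum(1 for a in alters for b in alters
                if a < b and b in adjacency[a])
        effective = n - 2 * t / n
        constraint = 0.0
        for j in alters:
            # Indirect investment in j through other alters q of ego.
            indirect = sum((1 / n) * (1 / len(adjacency[q]))
                           for q in alters
                           if q != j and j in adjacency[q])
            constraint += (1 / n + indirect) ** 2
        results[ego] = (effective, effective / n, constraint)
    return results

# Hypothetical network: A bridges a closed pair (B, C) and a pendant D.
adjacency = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
}

results = structural_holes(adjacency)
for node, (size, efficiency, constraint) in sorted(results.items()):
    print(node, round(size, 3), round(efficiency, 3), round(constraint, 3))
```

In this toy case the pendant node D has a perfect efficiency of 1.0 with an effective size of only 1, while A has the larger effective size but lower efficiency. The same pattern appears in Table 5, where countries with few partners can top the efficiency column without ranking high on effectiveness.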

International Co-Authorship Network of Big Data Research

Semantic Network of Paper Titles in Big Data
(50 Most Frequently Occurring Terms with the Cosine ≥ 0.1)

Semantic Network of Paper Titles and Countries in Big Data
(50 Most Frequently Occurring Terms and the Top 20 Countries with the Cosine ≥ 0.2)
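
The cosine threshold used in these maps can be illustrated with a small sketch: each term is represented by a vector marking the titles in which it occurs, and two terms are linked when the cosine between their vectors reaches the threshold. The titles are hypothetical, no stop words are removed, and the selection of the 50 most frequent terms is omitted, so this is a simplification rather than the procedure behind the figures.

```python
from itertools import combinations
from math import sqrt

# Hypothetical paper titles (illustrative only).
titles = [
    "mapreduce for large data analysis",
    "large scale data mining with mapreduce",
    "cloud storage for genome data",
    "genome sequence analysis",
]

def term_vectors(titles):
    """Map each term to a 0/1 vector over titles (1 = term occurs in title)."""
    vectors = {}
    for index, title in enumerate(titles):
        for term in set(title.split()):
            vectors.setdefault(term, [0] * len(titles))[index] = 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two occurrence vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def semantic_edges(vectors, threshold):
    """Keep term pairs whose cosine is at or above the threshold."""
    edges = {}
    for s, t in combinations(sorted(vectors), 2):
        value = cosine(vectors[s], vectors[t])
        if value >= threshold:
            edges[(s, t)] = value
    return edges

edges = semantic_edges(term_vectors(titles), threshold=0.1)
for (s, t), value in sorted(edges.items()):
    print(s, t, round(value, 3))
```

Adding one row per country to the same term-by-title occurrence table, marking the titles with an address in that country, would give a two-mode map of words and countries of the kind described in the second caption.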

• Internationally co-authored papers in the field of data science
have generally focused on primary technologies.
• SCI papers do not necessarily focus on conceptually new
methodologies for analyzing and synthesizing massive data sets.
The results suggest the emergence of some new subjects such
as MapReduce.

• The U.S. was central in various aspects because of its
connections with E.U. member countries as well as individual Asian
countries.
• Various European countries occupy the second most central
positions based on centrality measures.
• In terms of structural hole indicators, some smaller and less
advanced countries were more efficient than effective in terms
of controlling central positions.
• The results suggest that a combination of words and locations
in a two-mode network can provide a richer representation of
the emerging field of big data science than the sum of the two
separate representations.

Yet, there still are serious problems to overcome. A trenchant
critique concerning the big data field as it is nowadays came in
the form of six statements intending to temper unbridled
enthusiasm. [42] These six provocative statements are:
• Big data change the definition of knowledge;
• Claims to accuracy and objectivity are misleading;
• More data are not always better data;
• Taken out of context, big data loses its meaning;
• Just because it is accessible, it does not make it ethical; and
• (Limited) access to big data creates a new digital divide.
Rousseau (2012)

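Emergence of negative views on big data
- The value of big data
- Limits exist in storage, analysis, and interpretation technologies
- The current boom is partly overhyped
Big data gap: Promise vs. Capabilities
Challenges of big data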

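Challenges of big data: a case analysis of the big data ‘Gap’
· A big data gap survey was conducted with 151 federal government
CIOs and IT managers.
· In practice, few agencies currently make proper use of their data,
and data ownership issues were also found to be unsettled
[MeriTalk, a U.S. government IT network, concluded that a gap exists
between the promise and the reality of big data]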

http://www.forbes.com/sites/kashmirhill/2012/02/16/how-target-figured-out-a-teen-girl-was-pregnant-before-her-father-did/

Do we know what experiments are being conducted?
http://www.nature.com/news/facebook-experiment-boosts-us-voter-turnout-1.11401

We consented without fully realizing what we were agreeing to

User Content VS Site Content
Most SNS services have “Site Content” provisions that
effectively nullify “User Content” (p. 60).

Issues in “Big Data” Internet Research
Cugelman, B., Thelwall, M. & Dawes, P. (in press). The psychology of online behavioural influence interventions: a meta-analysis.
Journal of Medical Internet Research.
• The Health Insurance Portability and Accountability Act (HIPAA) in the
U.S. puts strict limits on the sharing of an individual’s health information.
• How should live broadcasts of surgeries and similar procedures from hospitals be handled:
besides Henry Ford Hospital, which uses Twitter most actively,
a growing number of hospitals in the U.S. are actively using
social network services such as Twitter, Facebook, and YouTube
• Development of health-related smartphone applications

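3. Conclusions and Implications
Close examination of technological and sociocultural factors
- Dysfunctions such as personal information leaks and invasion of privacy
are never absent from discussions of big data and AI
- Alongside technological progress, a clear understanding of the future we want
and a fundamental search for the political and social foundations needed to achieve it are important.
Professor Han Woo Park cited an incident that occurred in the
United States in February 2012 as an example. Two British
university students, on entering the U.S., had written on Twitter
that they would blow up Los Angeles airport, and this was
detected by the U.S. government. Professor Park explained,
“In this case the government regulated not Twitter as a whole
but the person who posted on Twitter and what was posted,
yet it grew into the issue that the U.S. government
routinely looks into Twitter.”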

Prof. Han Woo PARK
World Class University Webometrics Institute
CyberEmotions Research Center
Department of Media and Communication,
YeungNam University, Korea
hanpark@ynu.ac.kr www.hanpark.net
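Thanks to the researchers of the CyberEmotions Research Center and the students in
undergraduate and graduate courses who helped prepare these slides.
These slides are private material made for personal purposes.
Distribution and copying are prohibited.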
