1) Big data is becoming broader as more varied data becomes available on the web from sources like open government and e-commerce.
2) Broad data presents challenges different from those of traditional databases, because it draws on many sources and is only partially structured.
3) Semantics and metadata are crucial for integrating and making sense of broad data from multiple sources that may use different terms or collect data in different ways.
Data Big and Broad (Oxford, 2012)
1. Tetherless World Constellation
Data: Big and Broad
Jim Hendler
Tetherless World Constellation
Tetherless World Professor of Computer and Cognitive Science
Head, Computer Science Department
Rensselaer Polytechnic Institute
http://www.cs.rpi.edu/~hendler
@jahendler (twitter)
2. Outline (if I stick to it)
• What is big data?
• How big is big?
• What is big data on the Web?
• What is Broad data?
• Got an example?
• What’s the problem?
• What’s going on?
3. Useful Terms
• Machine-readable Data
– Information available in a form that is accessible and
manipulable by computer
– Accessible ≠ Manipulable
• e.g., PDF documents can be read in and displayed, but the
information in the document is not readily available without special
tooling
• Metadata
– Information associated with (machine-readable) data that
provides information about the data set
• Workflow, Provenance, and lots of other terms
– Useful sorts of metadata with respect to who created the data,
when, how was it processed, etc.
• Metadata and the other stuff are most useful when they are
machine-readable and openly available in commonly agreed-upon
formats
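As a rough illustration of the terms above, here is a minimal Python sketch of machine-readable metadata that carries provenance (who created the data, when, and how it was processed). The class names, field names, and sample values are all hypothetical; they are not taken from any standard vocabulary or from the talk.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class ProvenanceStep:
    actor: str   # who performed the step
    action: str  # how the data was processed
    date: str    # when, as an ISO 8601 date string


@dataclass
class DatasetMetadata:
    title: str
    creator: str
    created: str
    data_format: str
    provenance: list = field(default_factory=list)

    def to_json(self) -> str:
        # Serializing to JSON is one way to make the metadata itself
        # machine-readable and easy to publish alongside the data.
        return json.dumps(asdict(self), indent=2, sort_keys=True)


# Hypothetical dataset description; every value below is made up.
meta = DatasetMetadata(
    title="Example burglary counts",
    creator="Example Agency",
    created="2012-01-15",
    data_format="text/csv",
)
meta.provenance.append(
    ProvenanceStep("Example Agency", "collected from police reports", "2012-01-15")
)
meta.provenance.append(
    ProvenanceStep("example-converter", "converted CSV to RDF", "2012-02-01")
)

# Round-trip through JSON to show that another program can read it back.
record = json.loads(meta.to_json())
```

In practice a shared, commonly agreed vocabulary would replace these ad hoc field names; the sketch only shows the shape of the idea.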
4. BIG Data is NOT the Web of Data
• The term “Big Data” is widely used
nowadays to refer to a whole bunch of
machine-readable data in one accessible
(to the researcher) place
– 3 main contexts
• The large data collections of “big science” projects
– in traditional data warehouse or database formats
• The enterprise data of large, non-Web-based
companies (IBM, TATA, etc.)
– Generally in multiple
• The data holdings of a Google, Facebook or other
large Web company
– Include large “unstructured” holdings
– Include “graph” data
5. Tera, Peta, Zetta
yotta, yotta, yotta…
• World Wide Web data is extremely large
• Extremely well “funded”
– e.g., Facebook
• 25 terabytes of logged data per day; valuation $33B (US
NIH budget ~ $31B)
– e.g., Google
• In 2008 it was estimated at 20 petabytes per day (not
including YouTube); current valuation $190B (about 1/3
the entire US DoD budget)
• And really, really fascinating stuff
– Data about people and their relationships
• To each other
• To products
• To activities and actions
• …
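To put the daily volumes quoted on this slide side by side, here is a small Python sketch of the arithmetic. The per-day figures come from the slide; the use of decimal (SI) units and a 365-day year are assumptions made for illustration only.

```python
# Decimal (SI) byte units; assumed here, since the slide does not say
# whether decimal or binary units were meant.
TB = 10**12
PB = 10**15
EB = 10**18

facebook_per_day = 25 * TB  # "25 Terabytes of logged data per day"
google_per_day = 20 * PB    # "20 petabytes per day" (2008 estimate)


def per_year(bytes_per_day, days=365):
    """Scale a constant daily volume to a year (assumes no growth)."""
    return bytes_per_day * days


facebook_year_pb = per_year(facebook_per_day) / PB
google_year_eb = per_year(google_per_day) / EB
ratio = google_per_day / facebook_per_day

print(f"Facebook: about {facebook_year_pb} PB per year")
print(f"Google: about {google_year_eb} EB per year")
print(f"Google's daily figure is {ratio:.0f}x Facebook's")
```

The two figures come from different years and measure different things, so the ratio is only a rough sense of scale.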
7. BIG Data
Google uses their data in many ways
Search => ads => user
8. Big Data is becoming different on the Web
• New Work
– is moving away from traditional relational
models
• cf. NoSQL
– Moving towards third party application and
extension
• cf. Mobile apps for local governments
– Includes a focus on interoperability and
exchange with “lightweight” semantics
• Using ideas from the Semantic Web
– Search: Schema.org
– Social Networking: OGP
9. Which in part gives rise to BROAD data
• 4th context: Broad Data
– The huge amount of freely available, but widely varied,
Open Data on the World Wide Web (Structured and
Semi-structured)
• Example: The extended Facebook OGP graph (the
part outside Facebook’s datasets)
• Example: The growing linked open data cloud of
freely available RDF linked data
• Example: Hundreds of thousands of datasets that are
available on the Web free from governments around
the world
11. Facebook’s Open Graph Protocol
• Facebook now allows other sites to extend the graph
• Open Graph Protocol uses RDFa to let web sites contain
information about the things people “like”
og:title - The title of your object as it should appear within the graph, e.g., "The Rock".
og:type - The type of your object, e.g., "movie". Depending on the type you specify, other
properties may also be required.
og:image - An image URL which should represent your object within the graph.
og:url - The canonical URL of your object that will be used as its permanent ID in the graph
og:description - A one to two sentence description of your object.
og:site_name - If your object is part of a larger web site, the name which should be
displayed for the overall site. e.g., "IMDb".
– Not a traditional “ontology”
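A minimal sketch of how a program might read the og: properties listed above from a web page, using only Python's standard-library HTMLParser. The sample page, its URLs, and the helper names are hypothetical; real pages vary, and this is not Facebook's own implementation.

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collects <meta property="og:..." content="..."> pairs."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        prop = attr_map.get("property") or ""
        content = attr_map.get("content")
        if prop.startswith("og:") and content is not None:
            self.properties[prop] = content


# Hypothetical page markup; the example.com URLs are placeholders.
SAMPLE_PAGE = """
<html prefix="og: http://ogp.me/ns#">
<head>
<title>The Rock (1996)</title>
<meta property="og:title" content="The Rock" />
<meta property="og:type" content="movie" />
<meta property="og:url" content="http://www.example.com/title/the-rock/" />
<meta property="og:image" content="http://www.example.com/images/rock.jpg" />
<meta property="og:site_name" content="IMDb" />
</head>
</html>
"""


def extract_open_graph(html_text):
    """Return a dict of og: property names to their content values."""
    parser = OpenGraphParser()
    parser.feed(html_text)
    return parser.properties


og = extract_open_graph(SAMPLE_PAGE)
for name, value in sorted(og.items()):
    print(name, "=", value)
```

The point of the sketch is how little modeling is involved: a handful of agreed property names is enough for any site to describe an object in the graph.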
12. Big Data
Facebook generates terabytes of data per day
What could be learned from this?
14. BROAD data challenges
• For broad data the new challenges
that emerge include
– (Web-scale) data search
– “Crowd-sourced” modeling
– rapid (and potentially ad hoc)
integration of datasets
– visualization and analysis of only-
partially modeled datasets
– policies for data use, reuse and
combination.
15. Huh?
“The more I work with data, the more I
realize I need Semantics”
Huh?
The traditional database community has,
umm, not always been the first to embrace
semantics
What is different here?
17. The Web of Open Government Data is Growing
• Analytics based on over 1,000,000 datasets
from around the world can be seen at
– http://logd.tw.rpi.edu/iogds_data_analytics
• The examples that follow are from that page
Datasets 1,028,054
Countries 43
Catalogs 192
Categories 2460
Languages 24
2012 International Open Government Data Conference—Open Gov Data Tutorial
9 July 2012 17
18. International
20. Many others…
Important note:
quantity is not really the most
important issue
21. Topics (Across All Catalogs)
22. Topics (Across All Catalogs)
23. Combining data from different data sharing sites
24. Data Integration Problems
Head-to-head comparisons show that
burglaries in Avon and Somerset (UK) far
exceed those in Los Angeles, California
(one of the highest crime areas in the US)
25. The problem is (likely) semantics
Same or
different?
Do the terms mean the same? Are they collected in the same way? Are
they processed differently? …
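One way to make these questions checkable is to attach a small amount of semantic metadata to each figure and refuse a head-to-head comparison when it disagrees. The Python sketch below is only an illustration: the field names, the regions, and every value are made-up placeholders, not real statistics or real legal definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measure:
    region: str
    value: float
    term: str        # label the publisher uses
    definition: str  # what the publisher actually counts under that term
    unit: str        # e.g. "count" or "per 100,000 residents"
    period: str      # reporting period


def comparability_issues(a, b):
    """List the reasons two measures should not be compared directly."""
    issues = []
    if a.definition != b.definition:
        issues.append("definitions differ")
    if a.unit != b.unit:
        issues.append("units differ")
    if a.period != b.period:
        issues.append("periods differ")
    return issues


# Placeholder records: same term, but collected and reported differently.
region_a = Measure(
    region="Region A",
    value=1000.0,
    term="burglary",
    definition="unlawful entry into any building",
    unit="count",
    period="2011",
)
region_b = Measure(
    region="Region B",
    value=400.0,
    term="burglary",
    definition="unlawful entry into a dwelling",
    unit="per 100,000 residents",
    period="2011",
)

issues = comparability_issues(region_a, region_b)
if issues:
    print("Do not compare directly:", ", ".join(issues))
```

Matching on the term alone would have treated the two numbers as the same kind of thing, which is exactly the trap in the comparison above.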
34. USDA data turns out to be crucial
35. Metadata is crucial for Broad Data
• Metadata design is crucial to govt data
sharing
– Needed for search and federation in large data
sharing efforts
• International data sharing
– W3C Govt Linked Data Working Group
– Need for vocabularies within govt sectors
• Esp. for cross-language use
– How can we compare health (or legal, or social, or ….) data
between countries like US, UK, India, Kenya (English) with
Norway, China, France, etc.
– How can we link local govts (in traditional languages, local
dialects, etc) w/national data
38. Government Data in the linked open data cloud
Government Data is
currently over ½ the cloud in
size (~17B triples), 10s of
thousands of links to other
data (within and without)
http://linkeddata.org/
39. Research in Govt Data => Broad Data challenges
• Trust
– Government data is controversial, and potentially biased
• How do we confirm or dispute?
• Combination
– When we combine data we need to keep the provenance of
information (see trust)
• How do we make policies explicit and sharable
• Scaling
– Our project has already converted 9.9B triples from only
>2,000 of the 710,000 government databases we can identify
(116 catalogs, 32 countries, 16 languages)
• Cross-catalog
• Cross-language
• Versioning and updating
• Archiving
• Visualization
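The trust and combination points can be sketched in a few lines of Python: when merging datasets, keep every value together with the source it came from, so that disagreements can be surfaced rather than silently overwritten. The catalog names, keys, and numbers are hypothetical placeholders, and this is not the project's actual conversion pipeline.

```python
def combine_with_provenance(sources):
    """Merge {source_name: {key: value}} while keeping the source of each value."""
    combined = {}
    for source_name, rows in sources.items():
        for key, value in rows.items():
            combined.setdefault(key, []).append(
                {"value": value, "source": source_name}
            )
    return combined


def disputed_keys(combined):
    """Return the keys for which the sources report different values."""
    return sorted(
        key
        for key, claims in combined.items()
        if len({claim["value"] for claim in claims}) > 1
    )


# Placeholder inputs: two catalogs reporting the same kind of figure.
sources = {
    "catalog-a": {"region-1": 120, "region-2": 75},
    "catalog-b": {"region-1": 120, "region-2": 80, "region-3": 40},
}

combined = combine_with_provenance(sources)
disputed = disputed_keys(combined)

for key in disputed:
    print(key, "is disputed:", combined[key])
```

Because each claim keeps its source, a reader can go back to the originating catalog to confirm or dispute a value.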
40. Big Data needs bigger ideas for visualization
(Fox & Hendler, Science, 2/11/10)
41. A new idea we’re playing with at RPI
• Data as “exhibition”
– Museums/Performing Arts have explored
accessibility for real world artifacts, can
we extend these to the data web?
• Data via physical
interaction
– Using theatre techniques
we can literally move a
person through a data landscape, what
new metaphors does this open up?
42. Conclusions
• Big data is going Broad
– World Wide Web trend towards more and more
varied data
• In many domains
– E-commerce, Open Govt, many more (cf.
Health/Medical care)
• Broad data requires thinking outside the
“Database” box
– Including considering access
• Broad data opens exciting possibilities for
research and innovation
– And I hope will help provide tools for making
data more accessible