SlideShare uma empresa Scribd logo
1 de 42
Baixar para ler offline
Looking back Looking forward The INCA project Final steps
Big Data and Automated Content Analysis
Week 8 – Monday
»Looking back and froward«
Damian Trilling
d.c.trilling@uva.nl
@damian0604
www.damiantrilling.net
Afdeling Communicatiewetenschap
Universiteit van Amsterdam
26 March 2018
Big Data and Automated Content Analysis Damian Trilling
Looking back Looking forward The INCA project Final steps
Today
1 Looking back
Putting the pieces together
A good workflow
2 Looking forward
Techniqes we did not cover
3 The INCA project
Scaling up Content Analyis
The INCA architecture
4 Final steps
Big Data and Automated Content Analysis Damian Trilling
Looking back
Putting the pieces together
Looking back Looking forward The INCA project Final steps
Putting the pieces together
First: Our epistomological underpinnings
Computational Social Science
Big Data and Automated Content Analysis Damian Trilling
Looking back Looking forward The INCA project Final steps
Putting the pieces together
Computational Social Science
“It is an approach to social inquiry defined by (1) the use of large,
complex datasets, often—though not always— measured in
terabytes or petabytes; (2) the frequent involvement of “naturally
occurring” social and digital media sources and other electronic
databases; (3) the use of computational or algorithmic solutions to
generate patterns and inferences from these data; and (4) the
applicability to social theory in a variety of domains from the study
of mass opinion to public health, from examinations of political
events to social movements”
Shah, D. V., Cappella, J. N., & Neuman, W. R. (2015). Big Data, digital media, and computational social science:
Possibilities and perils. The ANNALS of the American Academy of Political and Social Science, 659(1), 6–13.
doi:10.1177/0002716215572084
Big Data and Automated Content Analysis Damian Trilling
Looking back Looking forward The INCA project Final steps
Putting the pieces together
Computational Social Science
“[. . . ] the computational social sciences employ the scientific
method, complementing descriptive statistics with inferential
statistics that seek to identify associations and causality. In other
words, they are underpinned by an epistemology wherein the aim is
to produce sophisticated statistical models that explain, simulate
and predict human life.”
Kitchin, R. (2014). Big Data, new epistemologies and paradigm shifts. Big Data & Society, 1(1), 1–12.
doi:10.1177/2053951714528481
Big Data and Automated Content Analysis Damian Trilling
Looking back Looking forward The INCA project Final steps
Putting the pieces together
Steps of a CSS project
We learned techniques for:
• retrieving data
• processing data
• analyzing data
• visualising data
Big Data and Automated Content Analysis Damian Trilling
retrieve
process
and/or
enrich
analyze
explain
predict
communi-
cate
files
APIs
scraping
NLP
sentiment
LDA
SML
group com-
parisons;
statistical
tests and
models
visualizations
and
summary
tables
glob
json & csv
requests
twitter
tweepy
lxml
. . .
nltk
gensim
scikit-learn
. . .
numpy/scipy
pandas
statsmodels
. . .
pandas
matplotlib
seaborn
pyldavis
. . .
Looking back Looking forward The INCA project Final steps
A good workflow
A good workflow
Big Data and Automated Content Analysis Damian Trilling
Looking back Looking forward The INCA project Final steps
A good workflow
The big picture
Start with pen and paper
1 Draw the Big Picture
Big Data and Automated Content Analysis Damian Trilling
Looking back Looking forward The INCA project Final steps
A good workflow
The big picture
Start with pen and paper
1 Draw the Big Picture
2 Then work out what components you need
Big Data and Automated Content Analysis Damian Trilling
Looking back Looking forward The INCA project Final steps
A good workflow
Develop components separately
One script for downloading the data, one script for analyzing
• Avoids waste of resources (e.g., unnecessary downloading
multiple times)
Big Data and Automated Content Analysis Damian Trilling
Looking back Looking forward The INCA project Final steps
A good workflow
Develop components separately
One script for downloading the data, one script for analyzing
• Avoids waste of resources (e.g., unnecessary downloading
multiple times)
• Makes it easier to re-use your code or apply it to other data
Big Data and Automated Content Analysis Damian Trilling
Looking back Looking forward The INCA project Final steps
A good workflow
Develop components separately
One script for downloading the data, one script for analyzing
• Avoids waste of resources (e.g., unnecessary downloading
multiple times)
• Makes it easier to re-use your code or apply it to other data
Big Data and Automated Content Analysis Damian Trilling
Looking back Looking forward The INCA project Final steps
A good workflow
Develop components separately
One script for downloading the data, one script for analyzing
• Avoids waste of resources (e.g., unnecessary downloading
multiple times)
• Makes it easier to re-use your code or apply it to other data
Start small, then scale up
• Take your plan (see above) and solve one problem at a time
(e.g., parsing a review page; or getting the URLs of all review
pages)
Big Data and Automated Content Analysis Damian Trilling
Looking back Looking forward The INCA project Final steps
A good workflow
Develop components separately
One script for downloading the data, one script for analyzing
• Avoids waste of resources (e.g., unnecessary downloading
multiple times)
• Makes it easier to re-use your code or apply it to other data
Start small, then scale up
• Take your plan (see above) and solve one problem at a time
(e.g., parsing a review page; or getting the URLs of all review
pages)
• (for instance, by using functions [next slides])
Big Data and Automated Content Analysis Damian Trilling
Looking back Looking forward The INCA project Final steps
A good workflow
Develop components separately
If you copy-paste code, you are doing something wrong
• Write loops!
Big Data and Automated Content Analysis Damian Trilling
Looking back Looking forward The INCA project Final steps
A good workflow
Develop components separately
If you copy-paste code, you are doing something wrong
• Write loops!
• If something takes more than a couple of lines, write a
function!
Big Data and Automated Content Analysis Damian Trilling
Copy-paste approach
(ugly, error-prone, hard to scale up)
1 allreviews = []
2
3 response = requests.get(’http://xxxxx’)
4 tree = fromstring(response.text)
5 reviewelements = tree.xpath(’//div[@class="review"]’)
6 reviews = [e.text for e in reviewelements]
7 allreviews.extend(reviews)
8
9 response = requests.get(’http://yyyyy’)
10 tree = fromstring(response.text)
11 reviewelements = tree.xpath(’//div[@class="review"]’)
12 reviews = [e.text for e in reviewelements]
13 allreviews.extend(reviews)
Better: for-loop
(easier to read, less error-prone, easier to scale up (e.g., more
URLs, read URLs from a file or existing list)))
1 allreviews = []
2
3 urls = [’http://xxxxx’, ’http://yyyyy’]
4
5 for url in urls:
6 response = requests.get(url)
7 tree = fromstring(response.text)
8 reviewelements = tree.xpath(’//div[@class="review"]’)
9 reviews = [e.text for e in reviewelements]
10 allreviews.extend(reviews)
Even better: for-loop with functions
(main loop is easier to read, function can be re-used in multiple
contexts)
1 def getreviews(url):
2 response = requests.get(url)
3 tree = fromstring(response.text)
4 reviewelements = tree.xpath(’//div[@class="review"]’)
5 return [e.text for e in reviewelements]
6
7
8 urls = [’http://xxxxx’, ’http://yyyyy’]
9
10 allreviews = []
11
12 for url in urls:
13 allreviews.extend(getreviews(url))
Looking back Looking forward The INCA project Final steps
A good workflow
Scaling up
Until now, we did not look too much into aspects like code style,
re-usability, scalability
• Use functions and classes (Appendix D.3) to make code more
readable and re-usable
• Avoid re-calculating values
• Think about how to minimize memory usage (e.g.,
Generators, Appendix D.2)
• Do not hard-code values, file names, etc., but take them as
arguments
Big Data and Automated Content Analysis Damian Trilling
Looking back Looking forward The INCA project Final steps
A good workflow
Make it robust
You cannot foresee every possible problem.
Most important: Make sure your program does not fail and loose
all data just because something goes wrong at case 997/1000.
• Use try/except to explicitly tell the program what
• Write data to files (or database) in between
• Use assert len(x) == len(y) for sanity checks
Big Data and Automated Content Analysis Damian Trilling
Looking forward
What other possibilities do exist for each step?
retrieve
process
and/or
enrich
analyze
explain
predict
communi-
cate
? ? ? ?
? ? ? ?
Looking back Looking forward The INCA project Final steps
Techniqes we did not cover
Retrieve
Webscraping with Selenium
• If content is dynamically loaded (e.g., with JavaScript), our
approach doesn’t work (because we don’t have a browser).
• Solution: Have Python literally open a browser and literally
click on things
• ⇒ Appendix E
Big Data and Automated Content Analysis Damian Trilling
Looking back Looking forward The INCA project Final steps
Techniqes we did not cover
Retrieve
Use of databases
We did not discuss how to actually store the data
• We basically stored our data in files (often, one CSV or JSON
file)
• But that’s not very efficient if we have large datasets;
especially if we want to select subsets later on
• SQL-databases to store tables (e.g., MySQL)
• NoSQL-databases to store less structured data (e.g., JSON
with unknown keys) (e.g., MongoDB, ElasticSearch)
• ⇒ Günther, E., Trilling, D., & Van de Velde, R.N. (2018). But how do we store it? (Big) data
architecture in the social-scientific research process. In: Stuetzer, C.M., Welker, M., & Egger, M. (eds.):
Computational Social Science in the Age of Big Data. Concepts, Methodologies, Tools, and Applications.
Cologne, Germany: Herbert von Halem.
Big Data and Automated Content Analysis Damian Trilling
Looking back Looking forward The INCA project Final steps
Techniqes we did not cover
Storing data
Data Architecture
store
Aspects to consider:
• Data characteristics
• Research design
• Expertise
• Infrastructure
collect preprocess analyze
Figure 1. Proposed model of the social-scientific process of data collection and analysis in
the age of Big Data. As the dotted arrows indicate, the results of preprocessing and analyis
are preferably fed back into the storage system for further re-use.Big Data and Automated Content Analysis Damian Trilling
Looking back Looking forward The INCA project Final steps
Techniqes we did not cover
From retrieved data to enriched data
1. Retrieved 2. Structured 3. Cleaned 4. Enriched
<div>
<span id=‘author’
class=‘big’>
Author: John
Doe </span>
<span id=‘viewed’>
seen 42 times
</span>
</div>
{
“author”:
“ Author: John Doe ”,
“viewed”:
“ seen 42 times ”
}
{
“author”:
“ john doe ”,
“viewed”:
42 ,
}
{
“author”:
“ john doe ”,
“author_gender”:
“ M ”,
“viewed”:
42 ,
}
gure 3. Data processing flow. Data types are indicated by a frame for strings of t
nd a shade for integer numbers.
en structured by mapping keys to values from the HTML (2.). Data cleaning ent
Big Data and Automated Content Analysis Damian Trilling
Looking back Looking forward The INCA project Final steps
Techniqes we did not cover
Process and/or enrich
Word embeddings
We did not really consider the meaning of words
• Word embeddings can be trained on large corpora (e.g., whole
wikipedia or a couple of years of newspaper coverage)
• The trained model allows you to calculate with words (hence,
word vectors): king − man + woman =?
• You can find out whether documents are similar even if they
do not use the same words (Word Mover Distance)
• ⇒ word2vec (in gensim!), glove
Big Data and Automated Content Analysis Damian Trilling
Looking back Looking forward The INCA project Final steps
Techniqes we did not cover
Process and/or enrich
Advanced NLP
We did a lot of BOW (and some POS-tagging), but we can get
more
• Named Entity Recognition (NER) to get names of people,
organizations, . . .
• Dependency Parsing to find out exact relationships ⇒ nltk,
Stanford, FROG
Big Data and Automated Content Analysis Damian Trilling
Looking back Looking forward The INCA project Final steps
Techniqes we did not cover
Process and/or enrich
Use images
• Supervised Machine learning does not care about what the
features mean, so instead of texts we can also classify pictures
• Example: Teach the computer to decide whether an avatar on
a social medium is an actual photograph of a person or a
drawn image of something else (research Marthe)
• This principle can be applied to many fields and disciplines –
for example, it is possible to teach a computer to indicate if a
tumor is present or not on X-rays of people’s brains
Big Data and Automated Content Analysis Damian Trilling
Looking back Looking forward The INCA project Final steps
Techniqes we did not cover
Process and/or enrich
Use images
• To learn more about this, the following websites useful
information: http://blog.yhat.com/posts/
image-classification-in-Python.html and
http://cs231n.github.io/python-numpy-tutorial/
#numpy-arrays
• Possibible workflow: Pixel color values as features ⇒ PCA to
reduce features ⇒ train classifier
• Advanced stuff: Neural Networks
Big Data and Automated Content Analysis Damian Trilling
Looking back Looking forward The INCA project Final steps
Techniqes we did not cover
Analyze/explain/predict
More advanced modelling
We only did some basic statistical tests
• Especially with social media data, we often have time series
(VAR models etc.)
• There are more advanced regression techniques and
dimension-reduction techniques tailored to data that are, e.g.,
large-scale, sparse, have a lot of features, . . .
• ⇒ scikit-learn, statsmodels
Big Data and Automated Content Analysis Damian Trilling
An example for scaling up:
The INCA project
see also Trilling, D., & Jonkman, J.G.F. (2018). Scaling up content
analysis. Communication Methods and Measures,
doi:10.1080/19312458.2018.1447655
project
management
methodsand
infrastructure
scaling up the data set
one-off
study (P1)
re-use for
multiple studies (P2)
flexible infrastrucutre
for continous multi-
purpose data collection
and analysis (P3)
manual (M1)
String opertions in
Excel/Stata/... (M2)
ACA with R/Python (M4)
additional database server (M5)
manualannotation
dictionaries/frequencies
scalingupthemethod
scaling up the project scope
dedicated ACA
programs (M3)
cloud computing (M6)
machinelearning
Looking back Looking forward The INCA project Final steps
Scaling up Content Analyis
INCA
How do we move beyond one-off projects?
• Collect data in such a way that it can be used for multiple
projects
• Database backend
• Re-usability of preprocessing and analysis code
Big Data and Automated Content Analysis Damian Trilling
Looking back Looking forward The INCA project Final steps
Scaling up Content Analyis
INCA
The idea
• Usable with minimal Python knowledge
• The “hard stuff” is already solved: writing a scraper often
only involves replacing the XPATHs
Big Data and Automated Content Analysis Damian Trilling
Collection of import
filters/parsers
static corpora, e.g.
newspaper database
On-the-fly
scraping from
websites/RSS-feeds
NoSQL database
Cleaning and pre-
processing filter
Cleaned NoSQL
database
word frequencies
and co-occurences
log likelihood visualizations
define details for
analysis (e.g., im-
portant actors)
dictionary fil-
ter/named en-
tity recognition
Latent dirich-
let allocation
Principal com-
ponent analysis
Cluster analysis
Data management phase
Analysis phase
INCA
Scrape Process Analyze
and if you want to:
- export at each stage
- complex search queries
ElasticSearch database
analytics dashboard
(KIBANA)
g = inca.analysis.timeline_analysis.timeline_generator()
g.analyse('moslim',"publication_date",granularity='year',from_time=‘2014’)
timestamp moslim
0 2011-01-01T00:00:00.000Z 1631
1 2012-01-01T00:00:00.000Z 1351
2 2013-01-01T00:00:00.000Z 1221
3 2014-01-01T00:00:00.000Z 2333
4 2015-01-01T00:00:00.000Z 2892
5 2016-01-01T00:00:00.000Z 2253
6 2017-01-01T00:00:00.000Z 2680
Direct access to all
functionality via
Python
Looking back Looking forward The INCA project Final steps
Next meeting
Wednesday
Final chance for questions regarding final project (if you don’t have
any, you don’t need to come.)
Deadline final exam
Hand in via filesender until Sunday evening, 23.59.
One .zip or .tar.gz file with
• .py and/or .ipynb for code
• .pdf for text and figures
• .csv, .json, or .txt for data
• any additional file we need to understand or reproduce your
work
Big Data and Automated Content Analysis Damian Trilling

Mais conteúdo relacionado

Mais procurados

A New Year in Data Science: ML Unpaused
A New Year in Data Science: ML UnpausedA New Year in Data Science: ML Unpaused
A New Year in Data Science: ML UnpausedPaco Nathan
 
From SQL to Python - A Beginner's Guide to Making the Switch
From SQL to Python - A Beginner's Guide to Making the SwitchFrom SQL to Python - A Beginner's Guide to Making the Switch
From SQL to Python - A Beginner's Guide to Making the SwitchRachel Berryman
 
Mauritius Big Data and Machine Learning JEDI workshop
Mauritius Big Data and Machine Learning JEDI workshopMauritius Big Data and Machine Learning JEDI workshop
Mauritius Big Data and Machine Learning JEDI workshopCosmoAIMS Bassett
 
MIT Sloan: Intro to Machine Learning
MIT Sloan: Intro to Machine LearningMIT Sloan: Intro to Machine Learning
MIT Sloan: Intro to Machine LearningLex Fridman
 
Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014
Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014
Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014The Hive
 
Tutorial Data Management and workflows
Tutorial Data Management and workflowsTutorial Data Management and workflows
Tutorial Data Management and workflowsSSSW
 
Data Science Reinvents Learning?
Data Science Reinvents Learning?Data Science Reinvents Learning?
Data Science Reinvents Learning?Paco Nathan
 
Three Laws of Trusted Data Sharing: (Building a Better Business Case for Dat...
Three Laws of Trusted Data Sharing:(Building a Better Business Case for Dat...Three Laws of Trusted Data Sharing:(Building a Better Business Case for Dat...
Three Laws of Trusted Data Sharing: (Building a Better Business Case for Dat...CS, NcState
 
Machine Learning using Big data
Machine Learning using Big data Machine Learning using Big data
Machine Learning using Big data Vaibhav Kurkute
 
Big Data: the weakest link
Big Data: the weakest linkBig Data: the weakest link
Big Data: the weakest linkCS, NcState
 

Mais procurados (20)

BDACA1516s2 - Lecture7
BDACA1516s2 - Lecture7BDACA1516s2 - Lecture7
BDACA1516s2 - Lecture7
 
BDACA1516s2 - Lecture5
BDACA1516s2 - Lecture5BDACA1516s2 - Lecture5
BDACA1516s2 - Lecture5
 
BDACA1617s2 - Lecture7
BDACA1617s2 - Lecture7BDACA1617s2 - Lecture7
BDACA1617s2 - Lecture7
 
BDACA1516s2 - Lecture2
BDACA1516s2 - Lecture2BDACA1516s2 - Lecture2
BDACA1516s2 - Lecture2
 
BDACA1617s2 - Lecture3
BDACA1617s2 - Lecture3BDACA1617s2 - Lecture3
BDACA1617s2 - Lecture3
 
BDACA1617s2 - Lecture6
BDACA1617s2 - Lecture6BDACA1617s2 - Lecture6
BDACA1617s2 - Lecture6
 
BDACA1617s2 - Lecture5
BDACA1617s2 - Lecture5BDACA1617s2 - Lecture5
BDACA1617s2 - Lecture5
 
Analyzing social media with Python and other tools (1/4)
Analyzing social media with Python and other tools (1/4)Analyzing social media with Python and other tools (1/4)
Analyzing social media with Python and other tools (1/4)
 
BDACA1617s2 - Tutorial 1
BDACA1617s2 - Tutorial 1BDACA1617s2 - Tutorial 1
BDACA1617s2 - Tutorial 1
 
BDACA1617s2 - Lecture 2
BDACA1617s2 - Lecture 2BDACA1617s2 - Lecture 2
BDACA1617s2 - Lecture 2
 
A New Year in Data Science: ML Unpaused
A New Year in Data Science: ML UnpausedA New Year in Data Science: ML Unpaused
A New Year in Data Science: ML Unpaused
 
From SQL to Python - A Beginner's Guide to Making the Switch
From SQL to Python - A Beginner's Guide to Making the SwitchFrom SQL to Python - A Beginner's Guide to Making the Switch
From SQL to Python - A Beginner's Guide to Making the Switch
 
Mauritius Big Data and Machine Learning JEDI workshop
Mauritius Big Data and Machine Learning JEDI workshopMauritius Big Data and Machine Learning JEDI workshop
Mauritius Big Data and Machine Learning JEDI workshop
 
MIT Sloan: Intro to Machine Learning
MIT Sloan: Intro to Machine LearningMIT Sloan: Intro to Machine Learning
MIT Sloan: Intro to Machine Learning
 
Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014
Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014
Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014
 
Tutorial Data Management and workflows
Tutorial Data Management and workflowsTutorial Data Management and workflows
Tutorial Data Management and workflows
 
Data Science Reinvents Learning?
Data Science Reinvents Learning?Data Science Reinvents Learning?
Data Science Reinvents Learning?
 
Three Laws of Trusted Data Sharing: (Building a Better Business Case for Dat...
Three Laws of Trusted Data Sharing:(Building a Better Business Case for Dat...Three Laws of Trusted Data Sharing:(Building a Better Business Case for Dat...
Three Laws of Trusted Data Sharing: (Building a Better Business Case for Dat...
 
Machine Learning using Big data
Machine Learning using Big data Machine Learning using Big data
Machine Learning using Big data
 
Big Data: the weakest link
Big Data: the weakest linkBig Data: the weakest link
Big Data: the weakest link
 

Semelhante a BDACA - Lecture8

H2O World - Intro to Data Science with Erin Ledell
H2O World - Intro to Data Science with Erin LedellH2O World - Intro to Data Science with Erin Ledell
H2O World - Intro to Data Science with Erin LedellSri Ambati
 
Concepts, use cases and principles to build big data systems (1)
Concepts, use cases and principles to build big data systems (1)Concepts, use cases and principles to build big data systems (1)
Concepts, use cases and principles to build big data systems (1)Trieu Nguyen
 
Easy Analytics on AWS with Amazon Redshift, Amazon QuickSight, and Amazon Mac...
Easy Analytics on AWS with Amazon Redshift, Amazon QuickSight, and Amazon Mac...Easy Analytics on AWS with Amazon Redshift, Amazon QuickSight, and Amazon Mac...
Easy Analytics on AWS with Amazon Redshift, Amazon QuickSight, and Amazon Mac...Amazon Web Services
 
Intelligently Automating Machine Learning, Artificial Intelligence, and Data ...
Intelligently Automating Machine Learning, Artificial Intelligence, and Data ...Intelligently Automating Machine Learning, Artificial Intelligence, and Data ...
Intelligently Automating Machine Learning, Artificial Intelligence, and Data ...Ali Alkan
 
Eecs6893 big dataanalytics-lecture1
Eecs6893 big dataanalytics-lecture1Eecs6893 big dataanalytics-lecture1
Eecs6893 big dataanalytics-lecture1Aravindharamanan S
 
Building successful data science teams
Building successful data science teamsBuilding successful data science teams
Building successful data science teamsVenkatesh Umaashankar
 
Lambda Architecture and open source technology stack for real time big data
Lambda Architecture and open source technology stack for real time big dataLambda Architecture and open source technology stack for real time big data
Lambda Architecture and open source technology stack for real time big dataTrieu Nguyen
 
London atlassian meetup 31 jan 2016 jira metrics-extract slides
London atlassian meetup 31 jan 2016 jira metrics-extract slidesLondon atlassian meetup 31 jan 2016 jira metrics-extract slides
London atlassian meetup 31 jan 2016 jira metrics-extract slidesRudiger Wolf
 
Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data AnalyticsOsman Ali
 
Toward a System Building Agenda for Data Integration(and Dat.docx
Toward a System Building Agenda for Data Integration(and Dat.docxToward a System Building Agenda for Data Integration(and Dat.docx
Toward a System Building Agenda for Data Integration(and Dat.docxjuliennehar
 
BTech Final Project (1).pptx
BTech Final Project (1).pptxBTech Final Project (1).pptx
BTech Final Project (1).pptxSwarajPatel19
 
Neo4j GraphTour Santa Monica 2019 - Amundsen Presentation
Neo4j GraphTour Santa Monica 2019 - Amundsen PresentationNeo4j GraphTour Santa Monica 2019 - Amundsen Presentation
Neo4j GraphTour Santa Monica 2019 - Amundsen PresentationTamikaTannis
 
Building Data Products with Python (Georgetown)
Building Data Products with Python (Georgetown)Building Data Products with Python (Georgetown)
Building Data Products with Python (Georgetown)Benjamin Bengfort
 
Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ...
Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ...Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ...
Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ...Geoffrey Fox
 
Data Preparation vs. Inline Data Wrangling in Data Science and Machine Learning
Data Preparation vs. Inline Data Wrangling in Data Science and Machine LearningData Preparation vs. Inline Data Wrangling in Data Science and Machine Learning
Data Preparation vs. Inline Data Wrangling in Data Science and Machine LearningKai Wähner
 
Self Study Business Approach to DS_01022022.docx
Self Study Business Approach to DS_01022022.docxSelf Study Business Approach to DS_01022022.docx
Self Study Business Approach to DS_01022022.docxShanmugasundaram M
 

Semelhante a BDACA - Lecture8 (20)

H2O World - Intro to Data Science with Erin Ledell
H2O World - Intro to Data Science with Erin LedellH2O World - Intro to Data Science with Erin Ledell
H2O World - Intro to Data Science with Erin Ledell
 
Concepts, use cases and principles to build big data systems (1)
Concepts, use cases and principles to build big data systems (1)Concepts, use cases and principles to build big data systems (1)
Concepts, use cases and principles to build big data systems (1)
 
Future se oct15
Future se oct15Future se oct15
Future se oct15
 
Easy Analytics on AWS with Amazon Redshift, Amazon QuickSight, and Amazon Mac...
Easy Analytics on AWS with Amazon Redshift, Amazon QuickSight, and Amazon Mac...Easy Analytics on AWS with Amazon Redshift, Amazon QuickSight, and Amazon Mac...
Easy Analytics on AWS with Amazon Redshift, Amazon QuickSight, and Amazon Mac...
 
Shikha fdp 62_14july2017
Shikha fdp 62_14july2017Shikha fdp 62_14july2017
Shikha fdp 62_14july2017
 
Intelligently Automating Machine Learning, Artificial Intelligence, and Data ...
Intelligently Automating Machine Learning, Artificial Intelligence, and Data ...Intelligently Automating Machine Learning, Artificial Intelligence, and Data ...
Intelligently Automating Machine Learning, Artificial Intelligence, and Data ...
 
Eecs6893 big dataanalytics-lecture1
Eecs6893 big dataanalytics-lecture1Eecs6893 big dataanalytics-lecture1
Eecs6893 big dataanalytics-lecture1
 
BD-ACA Week8a
BD-ACA Week8aBD-ACA Week8a
BD-ACA Week8a
 
Building successful data science teams
Building successful data science teamsBuilding successful data science teams
Building successful data science teams
 
Lambda Architecture and open source technology stack for real time big data
Lambda Architecture and open source technology stack for real time big dataLambda Architecture and open source technology stack for real time big data
Lambda Architecture and open source technology stack for real time big data
 
BD-ACA week3a
BD-ACA week3aBD-ACA week3a
BD-ACA week3a
 
London atlassian meetup 31 jan 2016 jira metrics-extract slides
London atlassian meetup 31 jan 2016 jira metrics-extract slidesLondon atlassian meetup 31 jan 2016 jira metrics-extract slides
London atlassian meetup 31 jan 2016 jira metrics-extract slides
 
Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data Analytics
 
Toward a System Building Agenda for Data Integration(and Dat.docx
Toward a System Building Agenda for Data Integration(and Dat.docxToward a System Building Agenda for Data Integration(and Dat.docx
Toward a System Building Agenda for Data Integration(and Dat.docx
 
BTech Final Project (1).pptx
BTech Final Project (1).pptxBTech Final Project (1).pptx
BTech Final Project (1).pptx
 
Neo4j GraphTour Santa Monica 2019 - Amundsen Presentation
Neo4j GraphTour Santa Monica 2019 - Amundsen PresentationNeo4j GraphTour Santa Monica 2019 - Amundsen Presentation
Neo4j GraphTour Santa Monica 2019 - Amundsen Presentation
 
Building Data Products with Python (Georgetown)
Building Data Products with Python (Georgetown)Building Data Products with Python (Georgetown)
Building Data Products with Python (Georgetown)
 
Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ...
Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ...Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ...
Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ...
 
Data Preparation vs. Inline Data Wrangling in Data Science and Machine Learning
Data Preparation vs. Inline Data Wrangling in Data Science and Machine LearningData Preparation vs. Inline Data Wrangling in Data Science and Machine Learning
Data Preparation vs. Inline Data Wrangling in Data Science and Machine Learning
 
Self Study Business Approach to DS_01022022.docx
Self Study Business Approach to DS_01022022.docxSelf Study Business Approach to DS_01022022.docx
Self Study Business Approach to DS_01022022.docx
 

Mais de Department of Communication Science, University of Amsterdam

Mais de Department of Communication Science, University of Amsterdam (13)

BDACA - Tutorial5
BDACA - Tutorial5BDACA - Tutorial5
BDACA - Tutorial5
 
BDACA - Lecture5
BDACA - Lecture5BDACA - Lecture5
BDACA - Lecture5
 
BDACA - Lecture3
BDACA - Lecture3BDACA - Lecture3
BDACA - Lecture3
 
BDACA - Lecture2
BDACA - Lecture2BDACA - Lecture2
BDACA - Lecture2
 
BDACA - Tutorial1
BDACA - Tutorial1BDACA - Tutorial1
BDACA - Tutorial1
 
BDACA - Lecture1
BDACA - Lecture1BDACA - Lecture1
BDACA - Lecture1
 
BDACA1617s2 - Lecture 1
BDACA1617s2 - Lecture 1BDACA1617s2 - Lecture 1
BDACA1617s2 - Lecture 1
 
Media diets in an age of apps and social media: Dealing with a third layer of...
Media diets in an age of apps and social media: Dealing with a third layer of...Media diets in an age of apps and social media: Dealing with a third layer of...
Media diets in an age of apps and social media: Dealing with a third layer of...
 
Conceptualizing and measuring news exposure as network of users and news items
Conceptualizing and measuring news exposure as network of users and news itemsConceptualizing and measuring news exposure as network of users and news items
Conceptualizing and measuring news exposure as network of users and news items
 
Data Science: Case "Political Communication 2/2"
Data Science: Case "Political Communication 2/2"Data Science: Case "Political Communication 2/2"
Data Science: Case "Political Communication 2/2"
 
Data Science: Case "Political Communication 1/2"
Data Science: Case "Political Communication 1/2"Data Science: Case "Political Communication 1/2"
Data Science: Case "Political Communication 1/2"
 
BDACA1516s2 - Lecture4
 BDACA1516s2 - Lecture4 BDACA1516s2 - Lecture4
BDACA1516s2 - Lecture4
 
Should we worry about filter bubbles?
Should we worry about filter bubbles?Should we worry about filter bubbles?
Should we worry about filter bubbles?
 

Último

Hybridoma Technology ( Production , Purification , and Application )
Hybridoma Technology  ( Production , Purification , and Application  ) Hybridoma Technology  ( Production , Purification , and Application  )
Hybridoma Technology ( Production , Purification , and Application ) Sakshi Ghasle
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdfSoniaTolstoy
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityGeoBlogs
 
_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting Data_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting DataJhengPantaleon
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...EduSkills OECD
 
mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docxPoojaSen20
 
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...Marc Dusseiller Dusjagr
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Sapana Sha
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxpboyjonauth
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingTechSoup
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactdawncurless
 
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdfssuser54595a
 
Solving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxSolving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxOH TEIK BIN
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)eniolaolutunde
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxGaneshChakor2
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfsanyamsingh5019
 
Presiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsPresiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsanshu789521
 

Último (20)

Hybridoma Technology ( Production , Purification , and Application )
Hybridoma Technology  ( Production , Purification , and Application  ) Hybridoma Technology  ( Production , Purification , and Application  )
Hybridoma Technology ( Production , Purification , and Application )
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activity
 
_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting Data_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting Data
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 
mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docx
 
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptx
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy Consulting
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impact
 
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
 
Staff of Color (SOC) Retention Efforts DDSD
Staff of Color (SOC) Retention Efforts DDSDStaff of Color (SOC) Retention Efforts DDSD
Staff of Color (SOC) Retention Efforts DDSD
 
Solving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxSolving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptx
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)
 
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdfTataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
 
Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptx
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdf
 
Presiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsPresiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha elections
 

BDACA - Lecture8

  • 1. Looking back Looking forward The INCA project Final steps Big Data and Automated Content Analysis Week 8 – Monday »Looking back and froward« Damian Trilling d.c.trilling@uva.nl @damian0604 www.damiantrilling.net Afdeling Communicatiewetenschap Universiteit van Amsterdam 26 March 2018 Big Data and Automated Content Analysis Damian Trilling
  • 2. Looking back Looking forward The INCA project Final steps Today 1 Looking back Putting the pieces together A good workflow 2 Looking forward Techniqes we did not cover 3 The INCA project Scaling up Content Analyis The INCA architecture 4 Final steps Big Data and Automated Content Analysis Damian Trilling
  • 3. Looking back Putting the pieces together
  • 4. Looking back Looking forward The INCA project Final steps Putting the pieces together First: Our epistomological underpinnings Computational Social Science Big Data and Automated Content Analysis Damian Trilling
  • 5. Looking back Looking forward The INCA project Final steps Putting the pieces together Computational Social Science “It is an approach to social inquiry defined by (1) the use of large, complex datasets, often—though not always— measured in terabytes or petabytes; (2) the frequent involvement of “naturally occurring” social and digital media sources and other electronic databases; (3) the use of computational or algorithmic solutions to generate patterns and inferences from these data; and (4) the applicability to social theory in a variety of domains from the study of mass opinion to public health, from examinations of political events to social movements” Shah, D. V., Cappella, J. N., & Neuman, W. R. (2015). Big Data, digital media, and computational social science: Possibilities and perils. The ANNALS of the American Academy of Political and Social Science, 659(1), 6–13. doi:10.1177/0002716215572084 Big Data and Automated Content Analysis Damian Trilling
  • 6. Looking back Looking forward The INCA project Final steps Putting the pieces together Computational Social Science “[. . . ] the computational social sciences employ the scientific method, complementing descriptive statistics with inferential statistics that seek to identify associations and causality. In other words, they are underpinned by an epistemology wherein the aim is to produce sophisticated statistical models that explain, simulate and predict human life.” Kitchin, R. (2014). Big Data, new epistemologies and paradigm shifts. Big Data & Society, 1(1), 1–12. doi:10.1177/2053951714528481 Big Data and Automated Content Analysis Damian Trilling
  • 7. Looking back Looking forward The INCA project Final steps Putting the pieces together Steps of a CSS project We learned techniques for: • retrieving data • processing data • analyzing data • visualising data Big Data and Automated Content Analysis Damian Trilling
  • 8. retrieve process and/or enrich analyze explain predict communi- cate files APIs scraping NLP sentiment LDA SML group com- parisons; statistical tests and models visualizations and summary tables glob json & csv requests twitter tweepy lxml . . . nltk gensim scikit-learn . . . numpy/scipy pandas statsmodels . . . pandas matplotlib seaborn pyldavis . . .
  • 9. Looking back Looking forward The INCA project Final steps A good workflow A good workflow Big Data and Automated Content Analysis Damian Trilling
  • 10. Looking back Looking forward The INCA project Final steps A good workflow The big picture Start with pen and paper 1 Draw the Big Picture Big Data and Automated Content Analysis Damian Trilling
  • 11. Looking back Looking forward The INCA project Final steps A good workflow The big picture Start with pen and paper 1 Draw the Big Picture 2 Then work out what components you need Big Data and Automated Content Analysis Damian Trilling
  • 12. Looking back Looking forward The INCA project Final steps A good workflow Develop components separately One script for downloading the data, one script for analyzing • Avoids waste of resources (e.g., unnecessary downloading multiple times) Big Data and Automated Content Analysis Damian Trilling
  • 13. Looking back Looking forward The INCA project Final steps A good workflow Develop components separately One script for downloading the data, one script for analyzing • Avoids waste of resources (e.g., unnecessary downloading multiple times) • Makes it easier to re-use your code or apply it to other data Big Data and Automated Content Analysis Damian Trilling
  • 14. Looking back Looking forward The INCA project Final steps A good workflow Develop components separately One script for downloading the data, one script for analyzing • Avoids waste of resources (e.g., unnecessary downloading multiple times) • Makes it easier to re-use your code or apply it to other data Big Data and Automated Content Analysis Damian Trilling
  • 15. Looking back Looking forward The INCA project Final steps A good workflow Develop components separately One script for downloading the data, one script for analyzing • Avoids waste of resources (e.g., unnecessary downloading multiple times) • Makes it easier to re-use your code or apply it to other data Start small, then scale up • Take your plan (see above) and solve one problem at a time (e.g., parsing a review page; or getting the URLs of all review pages) Big Data and Automated Content Analysis Damian Trilling
  • 16. Looking back Looking forward The INCA project Final steps A good workflow Develop components separately One script for downloading the data, one script for analyzing • Avoids waste of resources (e.g., unnecessary downloading multiple times) • Makes it easier to re-use your code or apply it to other data Start small, then scale up • Take your plan (see above) and solve one problem at a time (e.g., parsing a review page; or getting the URLs of all review pages) • (for instance, by using functions [next slides]) Big Data and Automated Content Analysis Damian Trilling
  • 17. Looking back Looking forward The INCA project Final steps A good workflow Develop components separately If you copy-paste code, you are doing something wrong • Write loops! Big Data and Automated Content Analysis Damian Trilling
  • 18. Looking back Looking forward The INCA project Final steps A good workflow Develop components separately If you copy-paste code, you are doing something wrong • Write loops! • If something takes more than a couple of lines, write a function! Big Data and Automated Content Analysis Damian Trilling
  • 19. Copy-paste approach (ugly, error-prone, hard to scale up) 1 allreviews = [] 2 3 response = requests.get(’http://xxxxx’) 4 tree = fromstring(response.text) 5 reviewelements = tree.xpath(’//div[@class="review"]’) 6 reviews = [e.text for e in reviewelements] 7 allreviews.extend(reviews) 8 9 response = requests.get(’http://yyyyy’) 10 tree = fromstring(response.text) 11 reviewelements = tree.xpath(’//div[@class="review"]’) 12 reviews = [e.text for e in reviewelements] 13 allreviews.extend(reviews)
  • 20. Better: for-loop (easier to read, less error-prone, easier to scale up (e.g., more URLs, read URLs from a file or existing list))) 1 allreviews = [] 2 3 urls = [’http://xxxxx’, ’http://yyyyy’] 4 5 for url in urls: 6 response = requests.get(url) 7 tree = fromstring(response.text) 8 reviewelements = tree.xpath(’//div[@class="review"]’) 9 reviews = [e.text for e in reviewelements] 10 allreviews.extend(reviews)
  • 21. Even better: for-loop with functions (main loop is easier to read, function can be re-used in multiple contexts) 1 def getreviews(url): 2 response = requests.get(url) 3 tree = fromstring(response.text) 4 reviewelements = tree.xpath(’//div[@class="review"]’) 5 return [e.text for e in reviewelements] 6 7 8 urls = [’http://xxxxx’, ’http://yyyyy’] 9 10 allreviews = [] 11 12 for url in urls: 13 allreviews.extend(getreviews(url))
  • 22. Looking back Looking forward The INCA project Final steps A good workflow Scaling up Until now, we did not look too much into aspects like code style, re-usability, scalability • Use functions and classes (Appendix D.3) to make code more readable and re-usable • Avoid re-calculating values • Think about how to minimize memory usage (e.g., Generators, Appendix D.2) • Do not hard-code values, file names, etc., but take them as arguments Big Data and Automated Content Analysis Damian Trilling
  • 23. Looking back Looking forward The INCA project Final steps A good workflow Make it robust You cannot foresee every possible problem. Most important: Make sure your program does not fail and loose all data just because something goes wrong at case 997/1000. • Use try/except to explicitly tell the program what • Write data to files (or database) in between • Use assert len(x) == len(y) for sanity checks Big Data and Automated Content Analysis Damian Trilling
  • 24. Looking forward What other possibilities do exist for each step?
  • 26. Looking back Looking forward The INCA project Final steps Techniqes we did not cover Retrieve Webscraping with Selenium • If content is dynamically loaded (e.g., with JavaScript), our approach doesn’t work (because we don’t have a browser). • Solution: Have Python literally open a browser and literally click on things • ⇒ Appendix E Big Data and Automated Content Analysis Damian Trilling
  • 27. Looking back Looking forward The INCA project Final steps Techniqes we did not cover Retrieve Use of databases We did not discuss how to actually store the data • We basically stored our data in files (often, one CSV or JSON file) • But that’s not very efficient if we have large datasets; especially if we want to select subsets later on • SQL-databases to store tables (e.g., MySQL) • NoSQL-databases to store less structured data (e.g., JSON with unknown keys) (e.g., MongoDB, ElasticSearch) • ⇒ Günther, E., Trilling, D., & Van de Velde, R.N. (2018). But how do we store it? (Big) data architecture in the social-scientific research process. In: Stuetzer, C.M., Welker, M., & Egger, M. (eds.): Computational Social Science in the Age of Big Data. Concepts, Methodologies, Tools, and Applications. Cologne, Germany: Herbert von Halem. Big Data and Automated Content Analysis Damian Trilling
  • 28. Looking back Looking forward The INCA project Final steps Techniqes we did not cover Storing data Data Architecture store Aspects to consider: • Data characteristics • Research design • Expertise • Infrastructure collect preprocess analyze Figure 1. Proposed model of the social-scientific process of data collection and analysis in the age of Big Data. As the dotted arrows indicate, the results of preprocessing and analyis are preferably fed back into the storage system for further re-use.Big Data and Automated Content Analysis Damian Trilling
  • 29. Looking back Looking forward The INCA project Final steps Techniqes we did not cover From retrieved data to enriched data 1. Retrieved 2. Structured 3. Cleaned 4. Enriched <div> <span id=‘author’ class=‘big’> Author: John Doe </span> <span id=‘viewed’> seen 42 times </span> </div> { “author”: “ Author: John Doe ”, “viewed”: “ seen 42 times ” } { “author”: “ john doe ”, “viewed”: 42 , } { “author”: “ john doe ”, “author_gender”: “ M ”, “viewed”: 42 , } gure 3. Data processing flow. Data types are indicated by a frame for strings of t nd a shade for integer numbers. en structured by mapping keys to values from the HTML (2.). Data cleaning ent Big Data and Automated Content Analysis Damian Trilling
  • 30. Looking back Looking forward The INCA project Final steps Techniqes we did not cover Process and/or enrich Word embeddings We did not really consider the meaning of words • Word embeddings can be trained on large corpora (e.g., whole wikipedia or a couple of years of newspaper coverage) • The trained model allows you to calculate with words (hence, word vectors): king − man + woman =? • You can find out whether documents are similar even if they do not use the same words (Word Mover Distance) • ⇒ word2vec (in gensim!), glove Big Data and Automated Content Analysis Damian Trilling
  • 31. Looking back Looking forward The INCA project Final steps Techniqes we did not cover Process and/or enrich Advanced NLP We did a lot of BOW (and some POS-tagging), but we can get more • Named Entity Recognition (NER) to get names of people, organizations, . . . • Dependency Parsing to find out exact relationships ⇒ nltk, Stanford, FROG Big Data and Automated Content Analysis Damian Trilling
  • 32. Looking back Looking forward The INCA project Final steps Techniqes we did not cover Process and/or enrich Use images • Supervised Machine learning does not care about what the features mean, so instead of texts we can also classify pictures • Example: Teach the computer to decide whether an avatar on a social medium is an actual photograph of a person or a drawn image of something else (research Marthe) • This principle can be applied to many fields and disciplines – for example, it is possible to teach a computer to indicate if a tumor is present or not on X-rays of people’s brains Big Data and Automated Content Analysis Damian Trilling
  • 33. Looking back Looking forward The INCA project Final steps Techniqes we did not cover Process and/or enrich Use images • To learn more about this, the following websites useful information: http://blog.yhat.com/posts/ image-classification-in-Python.html and http://cs231n.github.io/python-numpy-tutorial/ #numpy-arrays • Possibible workflow: Pixel color values as features ⇒ PCA to reduce features ⇒ train classifier • Advanced stuff: Neural Networks Big Data and Automated Content Analysis Damian Trilling
  • 34. Looking back Looking forward The INCA project Final steps Techniqes we did not cover Analyze/explain/predict More advanced modelling We only did some basic statistical tests • Especially with social media data, we often have time series (VAR models etc.) • There are more advanced regression techniques and dimension-reduction techniques tailored to data that are, e.g., large-scale, sparse, have a lot of features, . . . • ⇒ scikit-learn, statsmodels Big Data and Automated Content Analysis Damian Trilling
  • 35. An example for scaling up: The INCA project see also Trilling, D., & Jonkman, J.G.F. (2018). Scaling up content analysis. Communication Methods and Measures, doi:10.1080/19312458.2018.1447655
  • 36. project management methodsand infrastructure scaling up the data set one-off study (P1) re-use for multiple studies (P2) flexible infrastrucutre for continous multi- purpose data collection and analysis (P3) manual (M1) String opertions in Excel/Stata/... (M2) ACA with R/Python (M4) additional database server (M5) manualannotation dictionaries/frequencies scalingupthemethod scaling up the project scope dedicated ACA programs (M3) cloud computing (M6) machinelearning
  • 37. Looking back Looking forward The INCA project Final steps Scaling up Content Analyis INCA How do we move beyond one-off projects? • Collect data in such a way that it can be used for multiple projects • Database backend • Re-usability of preprocessing and analysis code Big Data and Automated Content Analysis Damian Trilling
  • 38. Looking back Looking forward The INCA project Final steps Scaling up Content Analyis INCA The idea • Usable with minimal Python knowledge • The “hard stuff” is already solved: writing a scraper often only involves replacing the XPATHs Big Data and Automated Content Analysis Damian Trilling
  • 39. Collection of import filters/parsers static corpora, e.g. newspaper database On-the-fly scraping from websites/RSS-feeds NoSQL database Cleaning and pre- processing filter Cleaned NoSQL database word frequencies and co-occurences log likelihood visualizations define details for analysis (e.g., im- portant actors) dictionary fil- ter/named en- tity recognition Latent dirich- let allocation Principal com- ponent analysis Cluster analysis Data management phase Analysis phase
  • 40. INCA Scrape Process Analyze and if you want to: - export at each stage - complex search queries ElasticSearch database
  • 41. analytics dashboard (KIBANA) g = inca.analysis.timeline_analysis.timeline_generator() g.analyse('moslim',"publication_date",granularity='year',from_time=‘2014’) timestamp moslim 0 2011-01-01T00:00:00.000Z 1631 1 2012-01-01T00:00:00.000Z 1351 2 2013-01-01T00:00:00.000Z 1221 3 2014-01-01T00:00:00.000Z 2333 4 2015-01-01T00:00:00.000Z 2892 5 2016-01-01T00:00:00.000Z 2253 6 2017-01-01T00:00:00.000Z 2680 Direct access to all functionality via Python
  • 42. Looking back Looking forward The INCA project Final steps Next meeting Wednesday Final chance for questions regarding final project (if you don’t have any, you don’t need to come.) Deadline final exam Hand in via filesender until Sunday evening, 23.59. One .zip or .tar.gz file with • .py and/or .ipynb for code • .pdf for text and figures • .csv, .json, or .txt for data • any additional file we need to understand or reproduce your work Big Data and Automated Content Analysis Damian Trilling