SlideShare a Scribd company logo
1 of 21
Download to read offline
Statistics in Python Pandas Exercise
Big Data and Automated Content Analysis
Week 5 – Wednesday
»Statistics with Python«
Damian Trilling
d.c.trilling@uva.nl
@damian0604
www.damiantrilling.net
Afdeling Communicatiewetenschap
Universiteit van Amsterdam
7 March 2018
Big Data and Automated Content Analysis Damian Trilling
Statistics in Python Pandas Exercise
Today
1 Statistics in Python
General considerations
Useful packages
2 Pandas
Working with dataframes
Plotting and calculating with Pandas
3 Exercise
Big Data and Automated Content Analysis Damian Trilling
Statistics in Python
General considerations
Statistics in Python Pandas Exercise
General considerations
General considerations
After having done all your nice text processing (and got numbers
instead of text!), you probably want to analyse this further.
You can always export to .csv and use R or Stata or SPSS or
whatever. . .
Big Data and Automated Content Analysis Damian Trilling
Statistics in Python Pandas Exercise
General considerations
General considerations
After having done all your nice text processing (and got numbers
instead of text!), you probably want to analyse this further.
You can always export to .csv and use R or Stata or SPSS or
whatever. . .
BUT:
Big Data and Automated Content Analysis Damian Trilling
Statistics in Python Pandas Exercise
General considerations
Reasons for not exporting and analyzing somewhere else
• the dataset might be too big
• it’s cumbersome and wastes your time
• it may introduce errors and makes it harder to reproduce
Big Data and Automated Content Analysis Damian Trilling
Statistics in Python Pandas Exercise
General considerations
What statistics capabilities does Python have?
• Basically all standard stuff (bivariate and multivariate
statistics) you know from SPSS
• Some advanced stuff (e.g., time series analysis)
• However, for some fancy statistical modelling (e.g., structural
equation modelling), you can better look somewhere else (R)
Big Data and Automated Content Analysis Damian Trilling
Statistics in Python
Useful packages
Statistics in Python Pandas Exercise
Useful packages
Useful packages
numpy (numerical python) Provides a lot of frequently used
functions, like mean, standard deviation, correlation,
. . .
scipy (scientic python) More of that ;-)
statsmodels Statistical models (e.g., regression or time series)
matplotlib Plotting
seaborn Even nicer plotting
Big Data and Automated Content Analysis Damian Trilling
Statistics in Python Pandas Exercise
Useful packages
Example 1: basic numpy
1 import numpy as np
2 x = [1,2,3,4,3,2]
3 y = [2,2,4,3,4,2]
4 z = [9.7, 10.2, 1.2, 3.3, 2.2, 55.6]
5 np.mean(x)
1 2.5
1 np.std(x)
1 0.9574271077563381
1 np.corrcoef([x,y,z])
1 array([[ 1. , 0.67883359, -0.37256219],
2 [ 0.67883359, 1. , -0.56886529],
3 [-0.37256219, -0.56886529, 1. ]])
Big Data and Automated Content Analysis Damian Trilling
Statistics in Python Pandas Exercise
Useful packages
Characteristics
• Operates (also) on simple lists
• Returns output in standard datatypes (you can print it, store
it, calculate with it, . . . )
• it’s fast! np.mean(x) is faster than sum(x)/len(x)
• it is more accurate (less rounding errors)
Big Data and Automated Content Analysis Damian Trilling
Statistics in Python Pandas Exercise
Useful packages
Example 2: basic plotting
1 import matplotlib.pyplot as plt
2 x = [1,2,3,4,3,2]
3 y = [2,2,4,3,4,2]
4 plt.hist(x)
5 plt.plot(x,y)
6 plt.scatter(x,y)
Figure: Examples of plots generated with matplotlib
Big Data and Automated Content Analysis Damian Trilling
Pandas
Working with dataframes
Statistics in Python Pandas Exercise
Working with dataframes
When to use dataframes
Native Python data structures
(lists, dicts, generators)
pro:
• flexible (especially dicts!)
• fast
• straightforward and easy to
understand
con:
• if your data is a table, modeling
this as, e.g., lists of lists feels
unintuitive
• very low-level: you need to do
much stuff ‘by hand’
Big Data and Automated Content Analysis Damian Trilling
Statistics in Python Pandas Exercise
Working with dataframes
When to use dataframes
Native Python data structures
(lists, dicts, generators)
pro:
• flexible (especially dicts!)
• fast
• straightforward and easy to
understand
con:
• if your data is a table, modeling
this as, e.g., lists of lists feels
unintuitive
• very low-level: you need to do
much stuff ‘by hand’
Pandas dataframes
pro:
• like an R dataframe or a STATA
or SPSS dataset
• many convenience functions
(descriptive statistics, plotting
over time, grouping and
subsetting, . . . )
con:
• not always necessary (‘overkill’)
• if you deal with really large
datasets, you don’t want to load
them fully into memory (which
pandas does)
Big Data and Automated Content Analysis Damian Trilling
Pandas
Plotting and calculating with Pandas
Statistics in Python Pandas Exercise
Plotting and calculating with Pandas
More examples here: https://github.com/damian0604/bdaca/
blob/master/ipynb/basic_statistics.ipynb
Big Data and Automated Content Analysis Damian Trilling
Statistics in Python Pandas Exercise
Plotting and calculating with Pandas
OLS regression in pandas
1 import pandas as pd
2 import statsmodels.formula.api as smf
3
4 df = pd.DataFrame({’income’: [10,20,30,40,50], ’age’: [20, 30, 10, 40,
50], ’facebooklikes’: [32, 234, 23, 23, 42523]})
5
6 # alternative: read from CSV file (or stata...):
7 # df = pd.read_csv(’mydata.csv’)
8
9 myfittedregression = smf.ols(formula=’income ~ age + facebooklikes’,
data=df).fit()
10 print(myfittedregression.summary())
Big Data and Automated Content Analysis Damian Trilling
1 OLS Regression Results
2 ==============================================================================
3 Dep. Variable: income R-squared: 0.579
4 Model: OLS Adj. R-squared: 0.158
5 Method: Least Squares F-statistic: 1.375
6 Date: Mon, 05 Mar 2018 Prob (F-statistic): 0.421
7 Time: 18:07:29 Log-Likelihood: -18.178
8 No. Observations: 5 AIC: 42.36
9 Df Residuals: 2 BIC: 41.19
10 Df Model: 2
11 Covariance Type: nonrobust
12 =================================================================================
13 coef std err t P>|t| [95.0% Conf. Int.]
14 ---------------------------------------------------------------------------------
15 Intercept 14.9525 17.764 0.842 0.489 -61.481 91.386
16 age 0.4012 0.650 0.617 0.600 -2.394 3.197
17 facebooklikes 0.0004 0.001 0.650 0.583 -0.002 0.003
18 ==============================================================================
19 Omnibus: nan Durbin-Watson: 1.061
20 Prob(Omnibus): nan Jarque-Bera (JB): 0.498
21 Skew: -0.123 Prob(JB): 0.780
22 Kurtosis: 1.474 Cond. No. 5.21e+04
23 ==============================================================================
Statistics in Python Pandas Exercise
Plotting and calculating with Pandas
Other cool df operations
df[’age’].plot() to plot a column
df[’age’].describe() to get descriptive statistics
df[’age’].value_counts() to get a frequency table
and MUCH more. . .
Big Data and Automated Content Analysis Damian Trilling
Joanna will introduce you to the exercise
... and of course you can also ask questions about the last weeks if
you still have some!

More Related Content

What's hot

Python for Big Data Analytics
Python for Big Data AnalyticsPython for Big Data Analytics
Python for Big Data AnalyticsEdureka!
 
Ground Gurus - Python Code Camp - Day 3 - Classes
Ground Gurus - Python Code Camp - Day 3 - ClassesGround Gurus - Python Code Camp - Day 3 - Classes
Ground Gurus - Python Code Camp - Day 3 - ClassesChariza Pladin
 
Programming for Everybody in Python
Programming for Everybody in PythonProgramming for Everybody in Python
Programming for Everybody in PythonCharles Severance
 
Introduction to Python for Data Science and Machine Learning
Introduction to Python for Data Science and Machine Learning Introduction to Python for Data Science and Machine Learning
Introduction to Python for Data Science and Machine Learning ParrotAI
 
Python interview questions
Python interview questionsPython interview questions
Python interview questionsPragati Singh
 
Python Interview Questions And Answers 2019 | Edureka
Python Interview Questions And Answers 2019 | EdurekaPython Interview Questions And Answers 2019 | Edureka
Python Interview Questions And Answers 2019 | EdurekaEdureka!
 
pycon-2015-liza-daly
pycon-2015-liza-dalypycon-2015-liza-daly
pycon-2015-liza-dalyLiza Daly
 
Most Asked Python Interview Questions
Most Asked Python Interview QuestionsMost Asked Python Interview Questions
Most Asked Python Interview QuestionsShubham Shrimant
 

What's hot (20)

BDACA1516s2 - Lecture7
BDACA1516s2 - Lecture7BDACA1516s2 - Lecture7
BDACA1516s2 - Lecture7
 
BDACA1516s2 - Lecture8
BDACA1516s2 - Lecture8BDACA1516s2 - Lecture8
BDACA1516s2 - Lecture8
 
BDACA1617s2 - Lecture 2
BDACA1617s2 - Lecture 2BDACA1617s2 - Lecture 2
BDACA1617s2 - Lecture 2
 
BDACA1617s2 - Lecture3
BDACA1617s2 - Lecture3BDACA1617s2 - Lecture3
BDACA1617s2 - Lecture3
 
BDACA1617s2 - Tutorial 1
BDACA1617s2 - Tutorial 1BDACA1617s2 - Tutorial 1
BDACA1617s2 - Tutorial 1
 
Analyzing social media with Python and other tools (1/4)
Analyzing social media with Python and other tools (1/4)Analyzing social media with Python and other tools (1/4)
Analyzing social media with Python and other tools (1/4)
 
BD-ACA week2
BD-ACA week2BD-ACA week2
BD-ACA week2
 
Python cheat-sheet
Python cheat-sheetPython cheat-sheet
Python cheat-sheet
 
BD-ACA week1a
BD-ACA week1aBD-ACA week1a
BD-ACA week1a
 
Python for Big Data Analytics
Python for Big Data AnalyticsPython for Big Data Analytics
Python for Big Data Analytics
 
Ground Gurus - Python Code Camp - Day 3 - Classes
Ground Gurus - Python Code Camp - Day 3 - ClassesGround Gurus - Python Code Camp - Day 3 - Classes
Ground Gurus - Python Code Camp - Day 3 - Classes
 
Searching algorithm
Searching algorithmSearching algorithm
Searching algorithm
 
Programming for Everybody in Python
Programming for Everybody in PythonProgramming for Everybody in Python
Programming for Everybody in Python
 
Introduction to Python for Data Science and Machine Learning
Introduction to Python for Data Science and Machine Learning Introduction to Python for Data Science and Machine Learning
Introduction to Python for Data Science and Machine Learning
 
Python interview questions
Python interview questionsPython interview questions
Python interview questions
 
Python Interview Questions And Answers 2019 | Edureka
Python Interview Questions And Answers 2019 | EdurekaPython Interview Questions And Answers 2019 | Edureka
Python Interview Questions And Answers 2019 | Edureka
 
Introduction to python
Introduction to pythonIntroduction to python
Introduction to python
 
BD-ACA week7a
BD-ACA week7aBD-ACA week7a
BD-ACA week7a
 
pycon-2015-liza-daly
pycon-2015-liza-dalypycon-2015-liza-daly
pycon-2015-liza-daly
 
Most Asked Python Interview Questions
Most Asked Python Interview QuestionsMost Asked Python Interview Questions
Most Asked Python Interview Questions
 

Similar to BDACA - Tutorial5

Meetup Junio Data Analysis with python 2018
Meetup Junio Data Analysis with python 2018Meetup Junio Data Analysis with python 2018
Meetup Junio Data Analysis with python 2018DataLab Community
 
A Gentle Introduction to Coding ... with Python
A Gentle Introduction to Coding ... with PythonA Gentle Introduction to Coding ... with Python
A Gentle Introduction to Coding ... with PythonTariq Rashid
 
PYTHON-Chapter 4-Plotting and Data Science PyLab - MAULIK BORSANIYA
PYTHON-Chapter 4-Plotting and Data Science  PyLab - MAULIK BORSANIYAPYTHON-Chapter 4-Plotting and Data Science  PyLab - MAULIK BORSANIYA
PYTHON-Chapter 4-Plotting and Data Science PyLab - MAULIK BORSANIYAMaulik Borsaniya
 
Comparing EDA with classical and Bayesian analysis.pptx
Comparing EDA with classical and Bayesian analysis.pptxComparing EDA with classical and Bayesian analysis.pptx
Comparing EDA with classical and Bayesian analysis.pptxPremaGanesh1
 
Statistics in Data Science with Python
Statistics in Data Science with PythonStatistics in Data Science with Python
Statistics in Data Science with PythonMahe Karim
 
Data Science with Python course Outline.pptx
Data Science with Python course Outline.pptxData Science with Python course Outline.pptx
Data Science with Python course Outline.pptxFerdsilinks
 
Turbocharge your data science with python and r
Turbocharge your data science with python and rTurbocharge your data science with python and r
Turbocharge your data science with python and rKelli-Jean Chun
 
DataCamp Cheat Sheets 4 Python Users (2020)
DataCamp Cheat Sheets 4 Python Users (2020)DataCamp Cheat Sheets 4 Python Users (2020)
DataCamp Cheat Sheets 4 Python Users (2020)EMRE AKCAOGLU
 
Sea Amsterdam 2014 November 19
Sea Amsterdam 2014 November 19Sea Amsterdam 2014 November 19
Sea Amsterdam 2014 November 19GoDataDriven
 
PPT on Data Science Using Python
PPT on Data Science Using PythonPPT on Data Science Using Python
PPT on Data Science Using PythonNishantKumar1179
 
Jupyter Notebooks for machine learning on Kubernetes & OpenShift | DevNation ...
Jupyter Notebooks for machine learning on Kubernetes & OpenShift | DevNation ...Jupyter Notebooks for machine learning on Kubernetes & OpenShift | DevNation ...
Jupyter Notebooks for machine learning on Kubernetes & OpenShift | DevNation ...Red Hat Developers
 
Analysis using r
Analysis using rAnalysis using r
Analysis using rPriya Mohan
 
Congrats ! You got your Data Science Job
Congrats ! You got your Data Science JobCongrats ! You got your Data Science Job
Congrats ! You got your Data Science JobRohit Dubey
 
Python-for-Data-Analysis.pptx
Python-for-Data-Analysis.pptxPython-for-Data-Analysis.pptx
Python-for-Data-Analysis.pptxParveenShaik21
 
Python data structures - best in class for data analysis
Python data structures -   best in class for data analysisPython data structures -   best in class for data analysis
Python data structures - best in class for data analysisRajesh M
 
The ABC of Implementing Supervised Machine Learning with Python.pptx
The ABC of Implementing Supervised Machine Learning with Python.pptxThe ABC of Implementing Supervised Machine Learning with Python.pptx
The ABC of Implementing Supervised Machine Learning with Python.pptxRuby Shrestha
 
Data Science With Python | Python For Data Science | Python Data Science Cour...
Data Science With Python | Python For Data Science | Python Data Science Cour...Data Science With Python | Python For Data Science | Python Data Science Cour...
Data Science With Python | Python For Data Science | Python Data Science Cour...Simplilearn
 

Similar to BDACA - Tutorial5 (20)

Meetup Junio Data Analysis with python 2018
Meetup Junio Data Analysis with python 2018Meetup Junio Data Analysis with python 2018
Meetup Junio Data Analysis with python 2018
 
A Gentle Introduction to Coding ... with Python
A Gentle Introduction to Coding ... with PythonA Gentle Introduction to Coding ... with Python
A Gentle Introduction to Coding ... with Python
 
PYTHON-Chapter 4-Plotting and Data Science PyLab - MAULIK BORSANIYA
PYTHON-Chapter 4-Plotting and Data Science  PyLab - MAULIK BORSANIYAPYTHON-Chapter 4-Plotting and Data Science  PyLab - MAULIK BORSANIYA
PYTHON-Chapter 4-Plotting and Data Science PyLab - MAULIK BORSANIYA
 
Comparing EDA with classical and Bayesian analysis.pptx
Comparing EDA with classical and Bayesian analysis.pptxComparing EDA with classical and Bayesian analysis.pptx
Comparing EDA with classical and Bayesian analysis.pptx
 
Statistics in Data Science with Python
Statistics in Data Science with PythonStatistics in Data Science with Python
Statistics in Data Science with Python
 
Data Science with Python course Outline.pptx
Data Science with Python course Outline.pptxData Science with Python course Outline.pptx
Data Science with Python course Outline.pptx
 
Turbocharge your data science with python and r
Turbocharge your data science with python and rTurbocharge your data science with python and r
Turbocharge your data science with python and r
 
DataCamp Cheat Sheets 4 Python Users (2020)
DataCamp Cheat Sheets 4 Python Users (2020)DataCamp Cheat Sheets 4 Python Users (2020)
DataCamp Cheat Sheets 4 Python Users (2020)
 
Sea Amsterdam 2014 November 19
Sea Amsterdam 2014 November 19Sea Amsterdam 2014 November 19
Sea Amsterdam 2014 November 19
 
PPT on Data Science Using Python
PPT on Data Science Using PythonPPT on Data Science Using Python
PPT on Data Science Using Python
 
Lecture3.pptx
Lecture3.pptxLecture3.pptx
Lecture3.pptx
 
Jupyter Notebooks for machine learning on Kubernetes & OpenShift | DevNation ...
Jupyter Notebooks for machine learning on Kubernetes & OpenShift | DevNation ...Jupyter Notebooks for machine learning on Kubernetes & OpenShift | DevNation ...
Jupyter Notebooks for machine learning on Kubernetes & OpenShift | DevNation ...
 
Analysis using r
Analysis using rAnalysis using r
Analysis using r
 
Congrats ! You got your Data Science Job
Congrats ! You got your Data Science JobCongrats ! You got your Data Science Job
Congrats ! You got your Data Science Job
 
Data Science.pptx
Data Science.pptxData Science.pptx
Data Science.pptx
 
Python-for-Data-Analysis.pptx
Python-for-Data-Analysis.pptxPython-for-Data-Analysis.pptx
Python-for-Data-Analysis.pptx
 
More on Pandas.pptx
More on Pandas.pptxMore on Pandas.pptx
More on Pandas.pptx
 
Python data structures - best in class for data analysis
Python data structures -   best in class for data analysisPython data structures -   best in class for data analysis
Python data structures - best in class for data analysis
 
The ABC of Implementing Supervised Machine Learning with Python.pptx
The ABC of Implementing Supervised Machine Learning with Python.pptxThe ABC of Implementing Supervised Machine Learning with Python.pptx
The ABC of Implementing Supervised Machine Learning with Python.pptx
 
Data Science With Python | Python For Data Science | Python Data Science Cour...
Data Science With Python | Python For Data Science | Python Data Science Cour...Data Science With Python | Python For Data Science | Python Data Science Cour...
Data Science With Python | Python For Data Science | Python Data Science Cour...
 

More from Department of Communication Science, University of Amsterdam

More from Department of Communication Science, University of Amsterdam (18)

BDACA - Lecture8
BDACA - Lecture8BDACA - Lecture8
BDACA - Lecture8
 
BDACA - Lecture7
BDACA - Lecture7BDACA - Lecture7
BDACA - Lecture7
 
BDACA - Lecture6
BDACA - Lecture6BDACA - Lecture6
BDACA - Lecture6
 
BDACA - Lecture4
BDACA - Lecture4BDACA - Lecture4
BDACA - Lecture4
 
BDACA - Tutorial1
BDACA - Tutorial1BDACA - Tutorial1
BDACA - Tutorial1
 
BDACA - Lecture1
BDACA - Lecture1BDACA - Lecture1
BDACA - Lecture1
 
BDACA1617s2 - Lecture4
BDACA1617s2 - Lecture4BDACA1617s2 - Lecture4
BDACA1617s2 - Lecture4
 
BDACA1617s2 - Lecture 1
BDACA1617s2 - Lecture 1BDACA1617s2 - Lecture 1
BDACA1617s2 - Lecture 1
 
Media diets in an age of apps and social media: Dealing with a third layer of...
Media diets in an age of apps and social media: Dealing with a third layer of...Media diets in an age of apps and social media: Dealing with a third layer of...
Media diets in an age of apps and social media: Dealing with a third layer of...
 
Conceptualizing and measuring news exposure as network of users and news items
Conceptualizing and measuring news exposure as network of users and news itemsConceptualizing and measuring news exposure as network of users and news items
Conceptualizing and measuring news exposure as network of users and news items
 
Data Science: Case "Political Communication 2/2"
Data Science: Case "Political Communication 2/2"Data Science: Case "Political Communication 2/2"
Data Science: Case "Political Communication 2/2"
 
Data Science: Case "Political Communication 1/2"
Data Science: Case "Political Communication 1/2"Data Science: Case "Political Communication 1/2"
Data Science: Case "Political Communication 1/2"
 
BDACA1516s2 - Lecture6
BDACA1516s2 - Lecture6BDACA1516s2 - Lecture6
BDACA1516s2 - Lecture6
 
BDACA1516s2 - Lecture4
 BDACA1516s2 - Lecture4 BDACA1516s2 - Lecture4
BDACA1516s2 - Lecture4
 
BDACA1516s2 - Lecture3
BDACA1516s2 - Lecture3BDACA1516s2 - Lecture3
BDACA1516s2 - Lecture3
 
BDACA1516s2 - Lecture2
BDACA1516s2 - Lecture2BDACA1516s2 - Lecture2
BDACA1516s2 - Lecture2
 
BDACA1516s2 - Lecture1
BDACA1516s2 - Lecture1BDACA1516s2 - Lecture1
BDACA1516s2 - Lecture1
 
Should we worry about filter bubbles?
Should we worry about filter bubbles?Should we worry about filter bubbles?
Should we worry about filter bubbles?
 

Recently uploaded

Transaction Management in Database Management System
Transaction Management in Database Management SystemTransaction Management in Database Management System
Transaction Management in Database Management SystemChristalin Nelson
 
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxINTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxHumphrey A Beña
 
Presentation Activity 2. Unit 3 transv.pptx
Presentation Activity 2. Unit 3 transv.pptxPresentation Activity 2. Unit 3 transv.pptx
Presentation Activity 2. Unit 3 transv.pptxRosabel UA
 
Oppenheimer Film Discussion for Philosophy and Film
Oppenheimer Film Discussion for Philosophy and FilmOppenheimer Film Discussion for Philosophy and Film
Oppenheimer Film Discussion for Philosophy and FilmStan Meyer
 
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSGRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSJoshuaGantuangco2
 
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONTHEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONHumphrey A Beña
 
Textual Evidence in Reading and Writing of SHS
Textual Evidence in Reading and Writing of SHSTextual Evidence in Reading and Writing of SHS
Textual Evidence in Reading and Writing of SHSMae Pangan
 
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Celine George
 
AUDIENCE THEORY -CULTIVATION THEORY - GERBNER.pptx
AUDIENCE THEORY -CULTIVATION THEORY -  GERBNER.pptxAUDIENCE THEORY -CULTIVATION THEORY -  GERBNER.pptx
AUDIENCE THEORY -CULTIVATION THEORY - GERBNER.pptxiammrhaywood
 
Choosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for ParentsChoosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for Parentsnavabharathschool99
 
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfInclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfTechSoup
 
EMBODO Lesson Plan Grade 9 Law of Sines.docx
EMBODO Lesson Plan Grade 9 Law of Sines.docxEMBODO Lesson Plan Grade 9 Law of Sines.docx
EMBODO Lesson Plan Grade 9 Law of Sines.docxElton John Embodo
 
4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptxmary850239
 
How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17Celine George
 
Measures of Position DECILES for ungrouped data
Measures of Position DECILES for ungrouped dataMeasures of Position DECILES for ungrouped data
Measures of Position DECILES for ungrouped dataBabyAnnMotar
 
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptxQ4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptxlancelewisportillo
 
4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptx4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptxmary850239
 
Integumentary System SMP B. Pharm Sem I.ppt
Integumentary System SMP B. Pharm Sem I.pptIntegumentary System SMP B. Pharm Sem I.ppt
Integumentary System SMP B. Pharm Sem I.pptshraddhaparab530
 

Recently uploaded (20)

Transaction Management in Database Management System
Transaction Management in Database Management SystemTransaction Management in Database Management System
Transaction Management in Database Management System
 
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxINTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
 
Presentation Activity 2. Unit 3 transv.pptx
Presentation Activity 2. Unit 3 transv.pptxPresentation Activity 2. Unit 3 transv.pptx
Presentation Activity 2. Unit 3 transv.pptx
 
Oppenheimer Film Discussion for Philosophy and Film
Oppenheimer Film Discussion for Philosophy and FilmOppenheimer Film Discussion for Philosophy and Film
Oppenheimer Film Discussion for Philosophy and Film
 
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSGRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
 
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONTHEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
 
Textual Evidence in Reading and Writing of SHS
Textual Evidence in Reading and Writing of SHSTextual Evidence in Reading and Writing of SHS
Textual Evidence in Reading and Writing of SHS
 
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptxFINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
 
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
 
AUDIENCE THEORY -CULTIVATION THEORY - GERBNER.pptx
AUDIENCE THEORY -CULTIVATION THEORY -  GERBNER.pptxAUDIENCE THEORY -CULTIVATION THEORY -  GERBNER.pptx
AUDIENCE THEORY -CULTIVATION THEORY - GERBNER.pptx
 
Choosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for ParentsChoosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for Parents
 
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfInclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
 
EMBODO Lesson Plan Grade 9 Law of Sines.docx
EMBODO Lesson Plan Grade 9 Law of Sines.docxEMBODO Lesson Plan Grade 9 Law of Sines.docx
EMBODO Lesson Plan Grade 9 Law of Sines.docx
 
4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx
 
How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17
 
Measures of Position DECILES for ungrouped data
Measures of Position DECILES for ungrouped dataMeasures of Position DECILES for ungrouped data
Measures of Position DECILES for ungrouped data
 
Paradigm shift in nursing research by RS MEHTA
Paradigm shift in nursing research by RS MEHTAParadigm shift in nursing research by RS MEHTA
Paradigm shift in nursing research by RS MEHTA
 
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptxQ4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
 
4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptx4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptx
 
Integumentary System SMP B. Pharm Sem I.ppt
Integumentary System SMP B. Pharm Sem I.pptIntegumentary System SMP B. Pharm Sem I.ppt
Integumentary System SMP B. Pharm Sem I.ppt
 

BDACA - Tutorial5

  • 1. Statistics in Python Pandas Exercise Big Data and Automated Content Analysis Week 5 – Wednesday »Statistics with Python« Damian Trilling d.c.trilling@uva.nl @damian0604 www.damiantrilling.net Afdeling Communicatiewetenschap Universiteit van Amsterdam 7 March 2018 Big Data and Automated Content Analysis Damian Trilling
  • 2. Statistics in Python Pandas Exercise Today 1 Statistics in Python General considerations Useful packages 2 Pandas Working with dataframes Plotting and calculating with Pandas 3 Exercise Big Data and Automated Content Analysis Damian Trilling
  • 4. Statistics in Python Pandas Exercise General considerations General considerations After having done all your nice text processing (and got numbers instead of text!), you probably want to analyse this further. You can always export to .csv and use R or Stata or SPSS or whatever. . . Big Data and Automated Content Analysis Damian Trilling
  • 5. Statistics in Python Pandas Exercise General considerations General considerations After having done all your nice text processing (and got numbers instead of text!), you probably want to analyse this further. You can always export to .csv and use R or Stata or SPSS or whatever. . . BUT: Big Data and Automated Content Analysis Damian Trilling
  • 6. Statistics in Python Pandas Exercise General considerations Reasons for not exporting and analyzing somewhere else • the dataset might be too big • it’s cumbersome and wastes your time • it may introduce errors and makes it harder to reproduce Big Data and Automated Content Analysis Damian Trilling
  • 7. Statistics in Python Pandas Exercise General considerations What statistics capabilities does Python have? • Basically all standard stuff (bivariate and multivariate statistics) you know from SPSS • Some advanced stuff (e.g., time series analysis) • However, for some fancy statistical modelling (e.g., structural equation modelling), you can better look somewhere else (R) Big Data and Automated Content Analysis Damian Trilling
  • 9. Statistics in Python Pandas Exercise Useful packages Useful packages numpy (numerical python) Provides a lot of frequently used functions, like mean, standard deviation, correlation, . . . scipy (scientic python) More of that ;-) statsmodels Statistical models (e.g., regression or time series) matplotlib Plotting seaborn Even nicer plotting Big Data and Automated Content Analysis Damian Trilling
  • 10. Statistics in Python Pandas Exercise Useful packages Example 1: basic numpy 1 import numpy as np 2 x = [1,2,3,4,3,2] 3 y = [2,2,4,3,4,2] 4 z = [9.7, 10.2, 1.2, 3.3, 2.2, 55.6] 5 np.mean(x) 1 2.5 1 np.std(x) 1 0.9574271077563381 1 np.corrcoef([x,y,z]) 1 array([[ 1. , 0.67883359, -0.37256219], 2 [ 0.67883359, 1. , -0.56886529], 3 [-0.37256219, -0.56886529, 1. ]]) Big Data and Automated Content Analysis Damian Trilling
  • 11. Statistics in Python Pandas Exercise Useful packages Characteristics • Operates (also) on simple lists • Returns output in standard datatypes (you can print it, store it, calculate with it, . . . ) • it’s fast! np.mean(x) is faster than sum(x)/len(x) • it is more accurate (less rounding errors) Big Data and Automated Content Analysis Damian Trilling
  • 12. Statistics in Python Pandas Exercise Useful packages Example 2: basic plotting 1 import matplotlib.pyplot as plt 2 x = [1,2,3,4,3,2] 3 y = [2,2,4,3,4,2] 4 plt.hist(x) 5 plt.plot(x,y) 6 plt.scatter(x,y) Figure: Examples of plots generated with matplotlib Big Data and Automated Content Analysis Damian Trilling
  • 14. Statistics in Python Pandas Exercise Working with dataframes When to use dataframes Native Python data structures (lists, dicts, generators) pro: • flexible (especially dicts!) • fast • straightforward and easy to understand con: • if your data is a table, modeling this as, e.g., lists of lists feels unintuitive • very low-level: you need to do much stuff ‘by hand’ Big Data and Automated Content Analysis Damian Trilling
  • 15. Statistics in Python Pandas Exercise Working with dataframes When to use dataframes Native Python data structures (lists, dicts, generators) pro: • flexible (especially dicts!) • fast • straightforward and easy to understand con: • if your data is a table, modeling this as, e.g., lists of lists feels unintuitive • very low-level: you need to do much stuff ‘by hand’ Pandas dataframes pro: • like an R dataframe or a STATA or SPSS dataset • many convenience functions (descriptive statistics, plotting over time, grouping and subsetting, . . . ) con: • not always necessary (‘overkill’) • if you deal with really large datasets, you don’t want to load them fully into memory (which pandas does) Big Data and Automated Content Analysis Damian Trilling
  • 17. Statistics in Python Pandas Exercise Plotting and calculating with Pandas More examples here: https://github.com/damian0604/bdaca/ blob/master/ipynb/basic_statistics.ipynb Big Data and Automated Content Analysis Damian Trilling
  • 18. Statistics in Python Pandas Exercise Plotting and calculating with Pandas OLS regression in pandas 1 import pandas as pd 2 import statsmodels.formula.api as smf 3 4 df = pd.DataFrame({’income’: [10,20,30,40,50], ’age’: [20, 30, 10, 40, 50], ’facebooklikes’: [32, 234, 23, 23, 42523]}) 5 6 # alternative: read from CSV file (or stata...): 7 # df = pd.read_csv(’mydata.csv’) 8 9 myfittedregression = smf.ols(formula=’income ~ age + facebooklikes’, data=df).fit() 10 print(myfittedregression.summary()) Big Data and Automated Content Analysis Damian Trilling
  • 19. 1 OLS Regression Results 2 ============================================================================== 3 Dep. Variable: income R-squared: 0.579 4 Model: OLS Adj. R-squared: 0.158 5 Method: Least Squares F-statistic: 1.375 6 Date: Mon, 05 Mar 2018 Prob (F-statistic): 0.421 7 Time: 18:07:29 Log-Likelihood: -18.178 8 No. Observations: 5 AIC: 42.36 9 Df Residuals: 2 BIC: 41.19 10 Df Model: 2 11 Covariance Type: nonrobust 12 ================================================================================= 13 coef std err t P>|t| [95.0% Conf. Int.] 14 --------------------------------------------------------------------------------- 15 Intercept 14.9525 17.764 0.842 0.489 -61.481 91.386 16 age 0.4012 0.650 0.617 0.600 -2.394 3.197 17 facebooklikes 0.0004 0.001 0.650 0.583 -0.002 0.003 18 ============================================================================== 19 Omnibus: nan Durbin-Watson: 1.061 20 Prob(Omnibus): nan Jarque-Bera (JB): 0.498 21 Skew: -0.123 Prob(JB): 0.780 22 Kurtosis: 1.474 Cond. No. 5.21e+04 23 ==============================================================================
  • 20. Statistics in Python Pandas Exercise Plotting and calculating with Pandas Other cool df operations df[’age’].plot() to plot a column df[’age’].describe() to get descriptive statistics df[’age’].value_counts() to get a frequency table and MUCH more. . . Big Data and Automated Content Analysis Damian Trilling
  • 21. Joanna will introduce you to the exercise ... and of course you can also ask questions about the last weeks if you still have some!