Analyzing Census Data: Large databases and challenges to statistical software
1. Analyzing the census:
Large databases and statistical software challenges
Rogério Jerônimo Barbosa
PhD Candidate, Sociology – USP
Researcher at the Center for Metropolitan Studies (CEM)
2. Presentation Structure
1. Objectives of this presentation
2. The Census Project
3. Statistical Software and Computer Processing
4. (Little) More Advanced Stuff...
5. Conclusions and a “to-do list”
3. 1. Objectives
• Share my personal experience with the Census databases
• Give some hints on how to analyse big databases
• Show how R can be a good environment/companion for “big data” analysis
5. 2. Census Project
• December 2011:
• Invited by Marta to become part of the project
• Jan/Apr 2012:
• Getting familiar with IBGE documentation and Census Databases
• We bought all PNADs and Census data (except for the 1960 edition)
• May 2012:
• The team started working
• April 2013:
• End of (team) activities
6. 2. Census Project
The team:
Rogério J Barbosa
PhD Candidate – Sociology/USP
Diogo Ferrari
PhD Candidate – Political Science/University of Michigan
Ian Prates
PhD Candidate – Sociology/USP
Leonardo Barone
PhD Candidate – Public Administration/FGV-SP
Murillo Marschner Alves de Brito
PhD Candidate – Sociology/USP
Patrick Silva
Graduate Student (Master’s) – Political Science/USP
7. 2. Census Project
• Challenges:
• Run (a lot of!!) descriptive analyses and statistical models using the six huge Census databases (20+ million cases), and sometimes other data too
• Standardize variables and measures
• Do it all as fast as possible
14. 3. Statistical Software and Computer Processing
• There is no such thing as a “Super Computer”
• Clusters do not have a “user friendly” interface: you
have to use command line (Linux Terminal)
• You write the command lines for the statistical analysis and upload them
• Then you write a “job” and submit it to the cluster queue
• Wait for your turn...
• Download a file with the results
• Clusters require parallel processing – otherwise you are not using their real power.
• Common statistical software doesn’t do that!
15. 3. Statistical Software and Computer Processing
• Parallel Computing
• With “whom” do you divide your processing tasks?
• Between Computers (clusters)
• Between “cores” of the same computer (this is feasible on personal computers!)
• How to do that?
• Implicitly: specialized statistical software (expensive)
• Explicitly: you write your parallel codes yourself! (hard)
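The “explicit” route on a personal computer can be sketched with the parallel package that ships with base R. A minimal sketch, dividing work between cores of the same machine (the task and chunk sizes below are made up for illustration):

```r
# Explicit parallelism with base R's 'parallel' package:
# split a job into chunks and farm them out to worker processes.
library(parallel)

chunks <- split(1:1e6, cut(1:1e6, 4))   # four chunks of indices

cl <- makeCluster(2)                    # a small cluster of 2 local workers
res <- parLapply(cl, chunks, function(idx) sum(sqrt(idx)))
stopCluster(cl)                         # always release the workers

total <- sum(unlist(res))               # combine the partial results
```

On Linux/Mac, `mclapply()` offers the same idea with less setup (fork-based, no explicit cluster object).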
16. 3. Statistical Software and Computer Processing
• Parallel computing: not everything is (easily) parallelizable
Minimizing the sum of squared residuals...
Specialized software uses (very complicated) approximations...
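OLS is actually one of the chunkable cases: the normal equations can be accumulated piece by piece, which is the textbook version of the idea behind chunked-regression packages such as biglm (which uses a numerically sturdier incremental QR rather than the raw cross-products). A sketch on simulated data, with arbitrary sizes:

```r
# Chunked OLS: accumulate X'X and X'y over chunks, then solve once.
# Only one chunk needs to be in memory at a time.
set.seed(1)
n <- 10000
X <- cbind(1, rnorm(n))             # intercept + one covariate
y <- 2 + 3 * X[, 2] + rnorm(n)

XtX <- matrix(0, 2, 2)
Xty <- numeric(2)
for (idx in split(1:n, cut(1:n, 10))) {  # pretend each chunk comes from disk
  Xi  <- X[idx, , drop = FALSE]
  XtX <- XtX + crossprod(Xi)             # accumulate X'X
  Xty <- Xty + crossprod(Xi, y[idx])     # accumulate X'y
}
beta <- solve(XtX, Xty)                  # same estimates as lm(y ~ X[, 2])
```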
17. 3. Statistical Software and Computer Processing
• Parallel computing: not everything is (easily) parallelizable
Iterative methods for obtaining maximum-likelihood estimators...
(Fisher scoring algorithm:
each step depends on the results of the previous one)
Specialized software uses (very complicated) approximations...
18. 3. Statistical Software and Computer Processing
• Summary of the problems:
• Clusters are hard to use
(we didn’t become friends with Jaguar and Puma...)
• We didn’t have the resources to buy the parallel versions of the standard software
• The fast software packages were not able to open the data
• We didn’t know enough advanced algebra to write our parallel modelling code in R explicitly
19. 3. Statistical Software and Computer Processing
• So we discovered... XDF files
[Diagram: XDF files stored on the HDD are streamed to RAM and the CPU with very fast access]
20. 3. Statistical Software and Computer Processing
• Diogo’s benchmark:

                             CrossTab     Plot a graph   OLS          Percentiles   TOTAL
Revolution R (4 Censuses)    < 1 min      < 25 s         < 3 min      < 30 s        1 min 40 s
SPSS (1 Census)              2 min 18 s   4 min 20 s     2 min 20 s   2 min 20 s    15+ min
21. 3. Statistical Software and Computer Processing
My trial:
• OLS regression
• 75 dummy variables for age
• Dummy for gender
• Interactions (age × gender)
Plotting the results: 4 seconds
22. 3. Statistical Software and Computer Processing
• Summary of the solutions:
• Some of us (including me) used SPSS for recoding and descriptive statistics
• Revolution R for modelling
• Stata and (conventional) R for other tasks involving smaller amounts of data
24. 4. (Little) More Advanced Stuff...
• My purpose: use R* for every analysis
* Or similar tools, like Python, Julia, etc.
• How to do that (given that conventional R is limited)?
25. 4. (Little) More Advanced Stuff...
1 – The “bigger” the better: better hardware makes it faster
• Better processor (multicore)
• More RAM
• Solid-state disks
2 – Update R’s algebra libraries
• Optimized BLAS (Basic Linear Algebra Subprograms) builds
• Tailored to your processor!!
• A bit difficult to do: compile BLAS + recompile R
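A quick way to see what your BLAS is worth is to time a large matrix cross-product, which R hands off to the BLAS. A sketch (the matrix size is arbitrary):

```r
# Time a dense cross-product: this call is delegated to the BLAS,
# so it is a rough probe of how fast your linear algebra backend is.
m <- matrix(rnorm(1000 * 500), 1000, 500)
t_blas <- system.time(crossprod(m))["elapsed"]
t_blas
```

With an optimized, multithreaded BLAS (OpenBLAS, ATLAS, MKL) this kind of operation is typically several times faster than with the unoptimized reference BLAS that R ships by default.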
26. 4. (Little) More Advanced Stuff...
3 – Use a 64-bit system and 64-bit software
4 – Use “professional” database management
• SQL for managing data
• ODBC connections for exporting it to R
• Import just the pieces you need at the moment
5 – Minimize copies of data stored in RAM
• R objects make redundant copies
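Item 4 can be sketched with R’s generic database interface, DBI. Here an in-memory RSQLite database stands in for a real database server, and the table and column names are invented for illustration:

```r
# Keep the microdata in a SQL database; pull into R only the
# columns and rows needed for the current analysis.
library(DBI)
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Pretend this is the full Census table living on disk
dbWriteTable(con, "census",
             data.frame(age    = c(23, 67, 40),
                        income = c(1200, 800, 2500),
                        state  = c("SP", "BA", "SP")))

# Import just the piece you need at the moment
sp <- dbGetQuery(con, "SELECT age, income FROM census WHERE state = 'SP'")
dbDisconnect(con)
```

The same `dbGetQuery()` call works unchanged against an ODBC or server backend; only the `dbConnect()` line differs.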
27. 4. (Little) More Advanced Stuff...
6 – Optimize your code
• Do not write a bunch of loops: vectorize!
• Use “lower-level” functions:
• lm.fit instead of lm
• If possible, use C++
My multilevel regression: 1 hour -> 9 seconds
• Use “lower-level” objects:
• Matrices instead of data.frames
• Use “integer” instead of “double”
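The lm.fit point can be sketched as follows (simulated data; the speed gap grows with the size of the data). `lm.fit()` skips formula parsing and data.frame handling, taking a ready-made design matrix, and returns the same coefficients as `lm()`:

```r
# Low-level lm.fit vs high-level lm: identical estimates,
# much less overhead when the model matrix is already built.
set.seed(42)
n <- 1e5
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)

X <- cbind(Intercept = 1, x)          # build the design matrix yourself
b_fast <- lm.fit(X, y)$coefficients   # matrix in, no formula machinery
b_slow <- coef(lm(y ~ x))             # convenient, but slower on big data
```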
28. 4. (Little) More Advanced Stuff...
6 – Optimize your code
Example: 7 million cases, 3 variables + survey weights
29. 4. (Little) More Advanced Stuff...
7 – Use bigdata packages
• ff/ffbase
• bigalgebra / bigmemory etc
• biglm / speedglm
8 – Use the garbage collector to free memory
• gc()
9 – Do not sort data!
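Item 8 in a minimal sketch (the object size is arbitrary): after removing a large object with `rm()`, a `gc()` call triggers collection and its report shows the memory coming back.

```r
# gc() reports memory use ("Vcells" counts 8-byte numeric cells)
# and triggers collection of objects that are no longer reachable.
big    <- matrix(0, 2000, 2000)              # ~30 MB of doubles
before <- gc()                               # report while 'big' is alive
rm(big)                                      # drop the only reference
after  <- gc()                               # collection frees the cells
freed  <- before["Vcells", "used"] - after["Vcells", "used"]
```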
31. Conclusions:
1 – Large databases are challenging...
(and if you are crazy enough, you can even have fun with them!)
2 – The Census project was a great opportunity for trying and learning new stuff!
To-do list:
1 – Learn more R, SQL and programming
2 – Learn more math (mainly linear algebra)
3 – Become friends with Puma and Jaguar