A client runs an annual commuter travel survey that collects extremely detailed information about its staff and regular visitors. The data was presented to our team as a 250-column Excel spreadsheet with confusingly coded questions and answers, and baffling geographic data. This session will explain how we turned it into a rich, usable database with the help of FME.
5. The survey
1. Please select the answer that applies to you.
• I work on the main campus
• I work at an off-campus site
• None of the above
If “None of the above” is selected, skip to the end of the survey.
2. How do you usually commute to campus?
• Drive alone in a passenger automobile
• Bicycle
• Carpool (two or more people)
• Campus Shuttle (not as a transfer from another mode)
• Motorcycle
• Public transit (e.g. bus)
• Walk
7. Required output
• A database (PostgreSQL / PostGIS)
• Access via desktop GIS (ArcGIS / QGIS)
• Possibly some kind of dashboard…
• Insight
8. Cleaning up the non-geographic data
• A lot of questions are not fully expressed in column headers
• Several column headers are repeated
• Not all questions are answered
• Some responses are “Other”, with the response in a second column
• Some questions allow multiple responses, so responses are collected into multiple columns
• A lot of questions are not of interest for this study
9. Step 1: extract the schema
ResponseID SchemaID
ExternalDataReference q1
Affiliation q2
Staff class/Class level q3
CAC q5
Home lat q6
Home lon q7
FPC q8
Please select the answer that applies to you. q9
Although you are not eligible for this survey, we thank you for / your time and interest. q10
Where on campus is your primary work site? Please click on one location. - X q11
Where on campus is your primary work site? Please click on one location. - Y q12
Which of the following is your off-campus work / site?(Note: Consider your commute to/from / this loc... q13
Which of the following is your off-campus work / site?(Note: Consider your commute to/from / this loc...-TEXT q14
From what city and zip code do you typically begin your / commute? / If you choose to, please also p...-City q16
From what city and zip code do you typically begin your / commute? / If you choose to, please also p...-Zip q17
From what city and zip code do you typically begin your / commute? / If you choose to, please also p...-Cross street 1 q18
From what city and zip code do you typically begin your / commute? / If you choose to, please also p...-Cross street 2 q19
How do you usually commute to campus/work? (i.e., / What is the primary mode you use during your ty... q20
How do you usually commute to campus/work? (i.e., / What is the primary mode you use during your ty...-TEXT q21
Which transit system do you use for the longest distance of your / commute? q22
Which transit system do you use for the longest distance of your / commute?-TEXT q23
How do you usually get to the bus stop or train station from / your home? q24
How do you usually get to the bus stop or train station from / your home?-TEXT q25
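The schema extraction step can be sketched in Python as follows. This is a minimal illustration, not the FME workspace used in the project: the function name `extract_schema` and the rule of numbering columns by position are assumptions. Numbering by position means that repeated or truncated headers still get distinct IDs.

```python
def extract_schema(headers):
    """Give every spreadsheet column a short positional ID (q1, q2, ...)."""
    schema = []
    for position, header in enumerate(headers, start=1):
        # Collapse line breaks and repeated spaces inside the header text.
        question = " ".join(str(header).split())
        schema.append({"schema_id": f"q{position}", "question": question})
    return schema


# Sample headers: the transit questions come from the survey; the duplicate
# at the end is added only to show that repeated headers stay distinct.
sample_headers = [
    "ExternalDataReference",
    "Which transit system do you use for the longest distance of your\ncommute?",
    "Which transit system do you use for the longest distance of your\ncommute?-TEXT",
    "ExternalDataReference",
]
schema = extract_schema(sample_headers)
for entry in schema:
    print(entry["schema_id"], entry["question"])
```

Keeping the full question text in its own lookup table lets the response data refer to questions by short ID only.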
13. Step 4: set up database with more columns
alter table survey_data
alter column raw_survey_data type jsonb using raw_survey_data::jsonb
, add column survey_date date
, add column response_id text
, add column affiliation text
, add column staff_class_level text
, add column work_location_description text
, add column geom_work geometry
, add column commute_start_city text
, add column commute_start_zip text
, add column commute_mode_primary text
, add column commute_mode_transit text
, add column commute_mode_transit_access_home text
, add column commute_mode_transit_access_work text
, add column typical_station_home text
, add column typical_station_work text
, add column commute_regularity text
, add column commute_mode_primary_regularity text
, add column typical_work_time_arrive text
, add column typical_work_time_depart text;
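One way to fill the new columns from the raw answers is sketched below in Python. The pairing of column names with schema IDs is read off the schema slide and is an assumption, as are the helper name `promote_answers`, the rule that a coded answer starting with "Other" defers to its companion -TEXT column, and the sample answers.

```python
# Hypothetical mapping: column name -> (answer ID, companion free-text ID).
COLUMN_SOURCES = {
    "affiliation": ("q2", None),
    "staff_class_level": ("q3", None),
    "commute_start_city": ("q16", None),
    "commute_start_zip": ("q17", None),
    "commute_mode_primary": ("q20", "q21"),
    "commute_mode_transit": ("q22", "q23"),
    "commute_mode_transit_access_home": ("q24", "q25"),
}


def promote_answers(raw):
    """Copy selected answers from the raw response dict into named columns."""
    row = {}
    for column, (answer_id, text_id) in COLUMN_SOURCES.items():
        value = raw.get(answer_id)
        # Assumption: an answer starting with "Other" means the real response
        # was typed into the companion -TEXT column.
        is_other = isinstance(value, str) and value.strip().lower().startswith("other")
        if text_id and is_other:
            value = raw.get(text_id) or value
        row[column] = value
    return row


# Invented sample response, keyed by schema ID.
example = promote_answers({"q2": "Staff", "q20": "Other", "q21": "Skateboard"})
print(example["commute_mode_primary"])
```

Unanswered questions come through as `None`, which maps naturally to NULL in the database columns.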
15. Cleaning up the non-geographic data
• A lot of questions are not fully expressed in column headers
• Several column headers are repeated
→ Schema loaded separately into database
• Not all questions are answered
→ PostgreSQL’s JSON capabilities used to reduce data volumes while still retaining raw data
• Some responses are “Other”, with the response in a second column
• Some questions allow multiple responses, so responses are collected into multiple columns
→ Questions automatically and manually assessed to ensure data integrity
• A lot of questions are not of interest for this study
→ Some questions are ignored for this study
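The JSON approach can be illustrated with a short Python sketch. It assumes the volume reduction comes from storing only the answered questions, keyed by schema ID, which the slides do not state explicitly. The function name `row_to_raw_json` and the sample values are hypothetical.

```python
import json


def row_to_raw_json(schema_ids, values):
    """Pack one spreadsheet row into a JSON object, skipping empty answers."""
    answered = {
        schema_id: value
        for schema_id, value in zip(schema_ids, values)
        # Assumption: None and blank strings both mean "not answered".
        if value is not None and str(value).strip() != ""
    }
    return json.dumps(answered, sort_keys=True)


# Invented sample row: only the first of three questions was answered.
raw = row_to_raw_json(["q1", "q2", "q3"], ["abc", "", None])
print(raw)  # {"q1": "abc"}
```

A string like this can be loaded into a `jsonb` column such as `raw_survey_data`, so the raw answers remain available without a 250-column table.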
16. The story so far…
Transport Scotland: http://www.transport.gov.scot/report/j9425-10.htm
17. Two questions were very creatively posed: where do you work on campus, and where do you park?
18. Where do you work on campus?
Linfield Nursing: http://www.linfield.edu/portland/about-portland/location/campus-map.html
580 194
502 270
41,616,668,701,171,800 3,360,666,809,082,030
504,5 3,298,699,951,171,870
563 250
4,951,040,344,238,280 3,744,800,109,863,280
486,5 3,238,699,951,171,870
507 3,296,800,231,933,590
487,5 3,748,699,951,171,870
463,5 2,688,699,951,171,870
414,5 2,808,699,951,171,870
584 196
46,261,907,958,984,300 2,562,952,575,683,590