Government agencies are collecting and producing data at an accelerating rate, and constituents want access to this data with decreasing latency. Meeting a digitally savvy polity's desire for data while ensuring that data is open, accessible, and interpretable by all comes with unique challenges. I'll share some of these while walking through how governments are building their own data products using open data as well as empowering civic hackers. I'll also walk through why data science at the government level is fundamentally different than data science in the private sector.
We're a SaaS business providing cloud-based solutions for data-driven government.
Data-as-a-service platform and cloud applications for government agencies
We make government data discoverable, reusable, and actionable
I’m a social scientist turned data scientist turned product manager, so I think a lot about:
how humans generate data and how that data gets encoded, and then how that encoded data gets turned into models.
Here I am at our annual employee summit – I was in the middle of talking to Dave Doyle, the City of Seattle open data program manager, who had just given closing remarks, and didn’t realize a group photo was forming. Or, as a coworker said to me after seeing this photo: “Man, you must really love that laptop.”
I own several of our backend services – all of the ways our customers ingress data onto our platform – as well as a chunk of our machine learning infrastructure.
What do we build to enable open data? Open data portals –
- web interface to a cloud-based data-as-a-service platform that lets agencies provide open data to their communities with built-in APIs, search, etc., making the data discoverable and reusable.
- this is probably what most people think of when they think of open data
- lots of cities, counties, and states, some NGOs and federal agencies as well
Open data also powers applications that help agencies plan and communicate with stakeholders:
- here’s the City of Seattle’s open budget
- Lets anyone explore the budget in nitty gritty detail, interactively
Becoming more popular – performance programs
- lets agencies provide transparency and accountability for the goals they've set for programs and initiatives
- think of it as a public dashboard with KPIs that anyone can check in on
- Budgets and performance dashboards change the consumption experience for the raw data that open data portals host
- They represent the overall maturation of open data, as we move from data for data's sake to solving specific problems with open data and putting open data in the path of government work, rather than treating it as a destination where data lives
- Even Steve Ballmer is getting in on the action
Side project to help Americans understand the flow of money in government
No LA Clippers salary cap explorer – but he’ll have $21 million or so freed up without Chris Paul in the upcoming season.
However, USAFacts is having the same engagement problems that I'll talk about later in this talk – at the National Governors Association last week, he said that after a big flurry of publicity, they're only getting about 4,000 visitors a day.
Just putting data up for people to consume often doesn't produce a ton of engagement
What’s the state of open data in 2017?
- I've broken the rules of giving a talk by discussing open data this far without really saying what it is and what we mean by it
Open data as both an idea and a practice really picked up steam during the Obama Administration – but the concept has been around for a while.
So what do we mean?
Data can be open in a couple of ways –
machine readable, available programmatically – this means data in widely accepted formats like CSV, JSON, XML, not locked up in a PDF or stored in physical copies that require a FOIA to get at
It also means APIs that allow programmatic retrieval, and to enable developers to build applications with the data – if you've used a non-city-created transit application to find out when your bus/train is coming, you've benefited from this kind of open data.
permissively licensed for reuse
This is obviously a stickier issue – as licensing always is – and differs from agency to agency, but there is a generally agreed upon idea that open data should be available for reuse by anyone that wants to use it (at the very least non-commercially) but often this means for commercial reuse as well. Some examples of data reuse in applications include restaurant inspections in Yelp reviews or data used by Zillow for estimating housing value.
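To make the "machine readable" point concrete, here's a minimal sketch – the transit-style records are invented for illustration – showing that CSV and JSON both parse into the same structured rows using nothing but the Python standard library. That's the whole promise: no scraping values out of a PDF.

```python
import csv
import io
import json

# A tiny, made-up slice of what an open dataset might look like
# in two common machine-readable formats.
CSV_TEXT = """stop_id,route,on_time_pct
1001,44,87.5
1002,8,91.2
"""

JSON_TEXT = ('[{"stop_id": "1001", "route": "44", "on_time_pct": "87.5"},'
             ' {"stop_id": "1002", "route": "8", "on_time_pct": "91.2"}]')

# csv.DictReader yields one dict per row, keyed by the header line.
rows_from_csv = list(csv.DictReader(io.StringIO(CSV_TEXT)))
rows_from_json = json.loads(JSON_TEXT)

# Either format yields the same structured records.
assert rows_from_csv == rows_from_json
```

Add an API on top of data like this and developers can build, say, that third-party transit app without ever emailing the agency.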
Sunlight Foundation, a non-profit dedicated to making governments accountable and transparent, has published guidelines for open data, many of which have been adopted by government agencies in creating their open data polices.
Over the past 10 years or so, we've seen a number of municipalities create policies or legislation mandating that any data that can be made open be made open. The implementation and wording differ from agency to agency, but the number keeps growing.
OpenDataPolicies.org – a project branched off from Sunlight – lists 101 current open data policies at the city, county, and state levels
Federal level: DATA Act: Digital Accountability and Transparency Act (2014) – how does the government spend its money?
This widespread adoption has produced real results for many agencies.
The city of Chicago saw their freedom of information act requests drop by 50% when they launched their open data portal. Chicago is a real leader in open data – constantly pushing us – and open sourcing a lot of their work for other governments to use. They recently relaunched their open data portal (data.cityofchicago.org) to great success.
The Dallas PD saw requests for data on officer-involved shootings drop to zero (dallasopendata.com) when they started releasing that data on their open data portal.
Fulfilling requests for records and freedom of information act requests are resource-intensive. They’re usually time-bound and failure to comply carries a penalty.
Beyond time savings and request-fulfilling efficiency, where are we with open data? What are some of the success stories? Hopefully the data scientists in the crowd will allow me a bit of selection on the dependent variable for a moment.
NYC has one of the most active open data programs in the country, thanks to a combination of a very talented staff and a legislated primary goal for all city agencies to share open data.
The Department of Information Technology and Telecommunications (best acronym -- DoITT) and MODA (the Mayor’s Office of Data Analytics) manage a tremendously complicated program with great success – acting as central hubs for many of the city's agencies and their data. NYC also has stringent retention policies, so a lot of moving pieces to manage.
A favorite dataset of mine is the NYC tree census of all trees in NYC – whenever we're testing out geospatial features, we usually use that dataset as one of our testers.
- Probably the most common thing many people think of when they think of Open Data – citizens serving as independent watchdogs, finding inefficiencies, injustices, and just plain mistakes. “Transparency”
Ben Wellington – a quantitative analyst at Two Sigma (which will be familiar if you’ve been to a recent PyData or SciPy conference) who runs the popular blog I Quant NY – did this analysis using NYC open data.
Something worth over $33,000 in this picture – and it’s not the Mercedes.
Using NYC Open Data, found 84 tickets over a 4.5-month period -- $33,000 a year in fines, not including towing fees.
One block over, another hydrant generating $24,000 a year – over $55,000 a year on two blocks
Open data for the same reason you crowdsource things
Can’t think of all possible questions to ask, instead, rely on motivated individuals to ask them
Of course, this cost the city $55,000 a year!
This has a bit of a bizarre twist, as there’s some confusion over whether these cars were parked legally or not.
Using Google Maps – there is a protected bike lane between the cars and the fire hydrant.
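The arithmetic behind stories like this is just annualization of an observed ticket count. A sketch, with the per-ticket fine parameterized – the $115 figure below is a placeholder for illustration, and the exact dollar amounts in Wellington's analysis come from the underlying ticket data, not this formula:

```python
def annualized_fine_revenue(tickets_observed: int,
                            months_observed: float,
                            fine_per_ticket: float) -> float:
    """Scale an observed ticket count up to a 12-month estimate."""
    tickets_per_year = tickets_observed / months_observed * 12
    return tickets_per_year * fine_per_ticket

# 84 tickets over 4.5 months, at a hypothetical $115 per ticket.
# Towing fees excluded, so this is a rough lower bound.
estimate = annualized_fine_revenue(84, 4.5, 115)
```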
New Orleans is a real leader in performance management – a city with a number of well-known and lesser-known problems that really embraced tying its decision-making to open data.
- NOLA combined data from the American Housing Survey, American Community Survey (both from the US Census) and FD admin data
Office of Performance & Analytics identified homes most at-risk for fire
Distributed 8,000 smoke detectors, reducing fire deaths & injuries
A family of 11 later that year escaped a house fire after a smoke alarm went off at 3am – that had been installed as part of this program
Also increased operational efficiency of FD by modeling where fires are most likely to occur
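A deliberately simplified sketch of what risk-ranking homes for smoke detector outreach can look like. The feature names and weights here are invented for illustration – NOLA's actual model combined Census survey data with fire department administrative data, and I'm not claiming this is their method:

```python
# Hypothetical 0-1 features and weights; higher score = riskier area.
HYPOTHETICAL_WEIGHTS = {
    "older_housing_stock": 2.0,   # e.g. share of pre-1980 homes
    "no_detector_reported": 3.0,  # survey-reported detector absence
    "prior_fire_nearby": 1.5,     # FD incident history in the area
}

def risk_score(features: dict) -> float:
    """Weighted sum of feature values."""
    return sum(HYPOTHETICAL_WEIGHTS[k] * v for k, v in features.items())

blocks = {
    "block_a": {"older_housing_stock": 0.9, "no_detector_reported": 0.7,
                "prior_fire_nearby": 0.4},
    "block_b": {"older_housing_stock": 0.2, "no_detector_reported": 0.1,
                "prior_fire_nearby": 0.0},
}

# Rank blocks so outreach teams visit the riskiest areas first.
ranked = sorted(blocks, key=lambda k: risk_score(blocks[k]), reverse=True)
```

The point isn't the model's sophistication – it's that joining survey data to admin data turns "distribute 8,000 detectors" into "distribute them where they'll save lives."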
Blight is a well-known problem in New Orleans, which has experienced natural disasters and social crises over the past century – homes left abandoned, boarded up in a city that is always at risk of being reclaimed by the earth.
Blight is problematic for a number of reasons – crime, pest infestation, public safety – and a blighted lot blocks new development and revenue collection while depressing home values elsewhere.
- Blight affected up to ¼ of all residential addresses in NOLA after Katrina
Formed BlightSTAT – cross-department task force to reduce blight using data
Decreased blight by 30% – especially impressive when benchmarked against peer cities where abandonment rates are climbing
cut response time from initial inspection to a hearing in half – a reduction of over 3 months
- Jackson, MS – like many cities and like New Orleans, plagued by aging infrastructure
- the kind of infrastructure that sees school buses fall into sinkholes TWICE in three years – these are different buses on different streets.
- more than half of the city's bridges in dire need of repairs
Last year, one of these bridges targeted for repairs totally collapsed due to flooding – after closing, thankfully, but an example of the critical state of affairs.
Estimates to repair Jackson's infrastructure range from 750 million to 1 billion dollars – that's 5-6 times the annual revenue of the city all told.
City has one lever it can pull to generate revenue --
1 percent sales tax voted in, with revenue reserved for capital projects
Jackson knew that the only way to keep the public's trust was to be open and transparent about how this revenue was being used and how it was tracking to meet its goals
This slide, which you saw earlier, is actually part of Jackson's performance program – called JackStats
Of particular interest given the bus-sinkhole problem is Operation Orange Cone
By using 311 data to more efficiently dispatch repair crews and identify problem areas, Jackson filled over 69,000 potholes in 2 years – a 60% increase in pothole-filling compared to before Operation Orange Cone.
Some of these complaints dated to 2010!
AND also decreased 311 calls
Now residents can track the progress of Operation Orange Cone in a number of places and see how the project is doing on time and on budget, as well as getting up-to-date information about what streets are planned for resurfacing.
Operation Orange Cone was supposed to be a two-week pilot program, but has run continuously for 2 years now.
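The dispatch-prioritization idea behind Operation Orange Cone can be sketched in a few lines: group open 311 complaints by location and send crews to the worst areas first instead of chasing calls one at a time. The complaint records and street names below are invented for illustration:

```python
from collections import Counter

# Hypothetical 311 pothole complaints.
complaints = [
    {"street": "State St", "opened": "2015-03-02"},
    {"street": "Capitol St", "opened": "2010-06-14"},  # years-old backlog
    {"street": "State St", "opened": "2015-03-09"},
    {"street": "State St", "opened": "2015-04-01"},
]

# Count complaints per street, then dispatch to the worst areas first.
by_street = Counter(c["street"] for c in complaints)
dispatch_order = [street for street, _ in by_street.most_common()]
```

Batching work by problem area is also what lets you resurface a whole street once rather than patching it three times.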
From the agency’s perspective, you need a strategy for success.
Open data programs are most successful when they see high engagement with constituents and residents.
Engagement is key – the mayor/governor wants to see that these open data programs are providing value. One way they do that is via constituent engagement. Checking the open data box doesn’t drive eyeballs.
Need to figure out what problem you're solving and who you're solving it for. (Selfishly, I might say that open data programs need a good product manager but that’s probably another talk).
Just releasing open data for civic hackers only helps a small portion of the population.
When Jackson, MS needed to fix its roads to keep buses from falling into sinkholes, they made the data part of the planning process, the decisions that were made, and the reporting on the actions that resulted from those decisions.
Putting a CSV on an open data portal doesn’t drive engagement, and is at worst a good way to have stale, out-of-date data that no one trusts or relies on.
Each of the previous success stories began with some problem to be solved or goal to be accomplished – that’s what makes open data compelling. Data is only part of the story – what happens with the data is the magic of open data.
Data-informed decision-making (credit to Greg Reda at PyData Seattle) needs data.
For that data to be effective, that data needs to be up to date and authoritative. Government information workers need to be able to trust that data and integrate it into their workflows.
“Open data portals” are a red herring – it shouldn’t be where old CSVs go to retire – it should be where government workers look for data they need to do their job.
For every data scientist with a phd working at the NSA or the Census, there’s an analyst working at the department of public works in a city that has to work on 20 projects at once. They’re working with limited resources, limited time, and a public that often doesn’t care a whole lot about how busy they are.
Public service isn’t just a catchphrase – there’s real service here.
Your work is out there for the public to see. When your audience is everyone, you have to show your work. This means opening your data and explaining your models.
Sometimes your work is going to be featured in the local news, and there will often be press releases about it.
On the other hand, your work is going to be out there for everyone to see! I bet a lot of you don't get to talk about your work.
Over 30,000 emails published for anyone to read. Released every Friday. This used to be a FOIA request that the local news outlets would make every week, now they just release them automatically to save them the trouble. Complex workflow that is half-automated, half-manual (can't release PII, constituent information, etc.).
Government moves slowly – and upgrade cycles are no exception. Getting data from one agency to another is a real challenge.
Talking to an analyst at a major west coast city – people drive around different lots in the city to verify permitting status, then literally fill out forms by hand, which are then delivered to a central office for data entry at a later date. By the time they’re digitized, who knows what has changed?
Budgets run by department, but problems span departments – there’s no “department of homelessness”, but the problem requires action across agencies. Each agency has its own budget, and may use a different database or ERP system to track their data. It’s not as simple as a JOIN.
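Here's a toy illustration of why it's not as simple as a JOIN – two hypothetical department extracts describe the same property, but one keys on a parcel ID with a differently formatted address field, so naive equality matching finds nothing until you normalize. All names and values are invented:

```python
# Public Works keys records by street address; Finance by parcel ID.
public_works = [{"address": "123 N Main St", "last_inspection": "2016-08-01"}]
finance = [{"parcel": "04-117", "addr": "123 NORTH MAIN STREET"}]

ABBREVIATIONS = {"NORTH": "N", "STREET": "ST"}

def normalize(address: str) -> str:
    """Uppercase and apply common abbreviations so keys can match."""
    words = address.upper().split()
    return " ".join(ABBREVIATIONS.get(w, w) for w in words)

# Build a lookup from normalized address -> finance record, then join.
finance_by_addr = {normalize(r["addr"]): r for r in finance}
joined = [
    {**pw, "parcel": finance_by_addr[normalize(pw["address"])]["parcel"]}
    for pw in public_works
    if normalize(pw["address"]) in finance_by_addr
]
```

Real record linkage across agency systems is far messier than a two-entry abbreviation table, but this is the shape of the problem.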
You can’t just put that new library you read about on Hacker News into production – IT departments vary wildly in their permissiveness, and changes to infrastructure are extremely difficult. CIOs or CDOs may make software decisions for entire departments.
You’re probably not going to be using Spark to build a predictive machine learning model that runs on a cluster and scales to millions of predictions a minute.
Some agencies allow users to submit data via forms – and it’s messy. Different languages, profanity, and then some. I was speaking to an open data program manager from a city in the south and he was telling me that when they were cleaning up this data, retention laws meant that they had to go in and indicate where data had been changed from its original form.
You think data scientists in the private sector complain about how hard it is to clean their data…
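One sketch of what "indicate where data had been changed" can look like in practice: keep the original value alongside the cleaned one and flag the change, rather than overwriting in place. The profanity list and cleaning rules here are invented placeholders – the point is the provenance, not the scrubbing:

```python
PROFANITY = {"darn"}  # placeholder word list for illustration

def clean_with_provenance(raw: str) -> dict:
    """Clean a free-text field but keep the original alongside it,
    so the record shows where data was altered (as retention rules
    may require)."""
    cleaned = " ".join(raw.split())  # collapse runs of whitespace
    cleaned = " ".join(
        "[redacted]" if w.lower().strip(".,!") in PROFANITY else w
        for w in cleaned.split()
    )
    return {"original": raw, "cleaned": cleaned, "changed": cleaned != raw}

record = clean_with_provenance("This  darn pothole ate   my tire")
```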
Accessibility is a compliance matter – data you release and websites you host it on have to be accessible to people with disabilities, and there’s a shifting landscape of standards.
You have no idea who will be accessing your data – it’s open! – and you can’t assume that they’ll be subject matter experts. Metadata and documentation are key – but of course they are time-consuming.
On the other hand – your audience is everyone! You get to work on things that affect all sorts of people in very real ways. Your predictive models have real impact – like saving lives through fire detector placement.