The U.S. Department of Commerce collects, processes and disseminates data on a range of issues that impact our nation. Whether it's data on the economy, the environment, or technology, data is critical in fulfilling the Department's mission of creating the conditions for economic growth and opportunity. It is this data that provides insight, drives innovation, and transforms our lives. The U.S. Department of Commerce has become known as "America's Data Agency" because of its tens of thousands of datasets, including satellite imagery, material standards and demographic surveys.
But having a host of data and ensuring that this data is open and accessible to all are two separate issues. The latter, expanding open data access, is now a key pillar of the Commerce Department's mission. It was this focus on enhancing open data that led to the creation of the Commerce Data Service (CDS).
The mission at the Commerce Data Service is to enable more people to use big data from across the department in innovative ways and across multiple fields. In this talk, I will explore how we are using big data to create a data-driven government.
This talk was given as a keynote at Texas Tech University's Big Data Symposium.
35. Six conditions for data awesomeness
• A reason for existence
• Access to the field
• Access to actionable data
• Ethical intervention points
• Methodologically defensible yet intellectually accessible
• Path to sustainability
44. Case: Who is export-ready and to what degree?
Unsupervised learning with a hint of supervised learning
Differentiated services for new markets
45. Case: A trade specialist in rural America may need to drive 2 hours to meet a potential exporter.
Conversion scoring problem
Know your utility before you go
46. Case: Which positions in a company are likely to use which services?
Transition probabilities
Sets expectations
53. Data Science I: Basics / Working with Teams (Git and GitHub) / Intro to Object-Oriented Programming (Python & JavaScript) / Using APIs (Intro to REST) / Intro to Photoshop / Intro to Python / Basic SQL (Using Sqlite3) / Building APIs / Intro to R / Intro to JavaScript / Intro to Data Analysis with Python / Data Wrangling with pandas / Agile Development / HTML + CSS / Storytelling with Data / Excel / Intro to Machine Learning / Visual Analytics with Python / Data Storytelling with R
54. 2016 Season (Scale Experiment)
14 three-hour courses taught by Commerce Data Service staff
2 two-week intensives on data science and data visualization via General Assembly
Option to be a data scientist or data engineer-in-residence
62. • Find the right users
• Understand security
• Find affordable housing
• Determine hail risk
• Predict rainfall and flooding
• Determine human activity using satellite data
• Help with water management
64. Contribute
• a novel analysis or question posed to the data
• visually arresting graphics and engagement with the public
• open, free code and data for the public to use
67. Purpose
How might we create a better ‘conversation’ and/or experience with data around income inequality?
68. Intention
Create a basis of knowledge for Americans on income inequality initially…
Eventually a one-stop hub for making income-related decisions combining Census and BLS data.
69. American Community Survey (ACS)
● Accessible via American FactFinder (AFF).
● AFF doesn’t show distributions of individuals.
70. Current Population Survey (CPS)
● Limits:
  ● Medians falling in the upper, open-ended interval are plugged with "$250,000"
  ● The data sets aggregate everyone above $100,000 together
  ● Limitations on job-to-job comparison
  ● Granularity of breakdowns
71. ACS Public Use Microdata Sample (PUMS)
● Very Rich Data Set
● Difficult To Use
74. The lives of too many girls of color are characterized by:
Early Sexual Abuse, Chronic Aversive Stress ➪ School Failure ➪ Sexual Exploitation ➪ Prison.
75. Annual Suspension Rates
Every year, girls of color are suspended from school at higher rates than any other group.
12% of African-American girls
7% of Native American girls
6% of white boys
2% of white girls
Many of these girls are disproportionately funneled through the juvenile justice system.
76. Girls are the fastest growing segment of the juvenile justice system.
                         US Population   Detained and Committed
African American Girls   14%             32%
Native American Girls    1%              3.5%
80. Dr Tyrone W A Grandison
Deputy Chief Data Officer
tgrandison@doc.gov
commerce.gov/dataservice
github.com/CommerceDataService
Editor's Notes
On October 28, 2011, a Delta II rocket took off from Vandenberg Air Force Base in California.
Onboard was the Suomi NPP satellite, a nearly 2000 kg satellite with the mission of adding to the environmental and climate data records of the Earth, helping us to better understand society.
The satellite mission was made possible by a partnership between the National Oceanic and Atmospheric Administration (NOAA) and NASA.
Onboard, NPP carries various instruments that collect information about the earth system.
One particular instrument, the Visible Infrared Imaging Radiometer Suite or VIIRS -- a 277 kg imaging device -- holds the potential to understand earth in unprecedented ways.
While NPP flies in a sun-synchronous orbit, the VIIRS instrument goes to work. It can see everything from atmospheric conditions, clouds, and the earth radiation budget to clear-air land/water surfaces, sea surface temperature, ocean color, and low light visible imagery.
It also captures nighttime lights, enabling far-ranging applications.
Looking at the continental US, nighttime lights are distributed in non-random patterns.
On a macroscale, we can see all the interconnectedness of large cities to towns with the arteries in between.
We can also see activity on the high seas, with boats and oil rigs off the Gulf Coast.
And it’s more than a pretty picture. It’s data. It’s big data.
In fact, the US nighttime lights profile can be turned into a histogram.
Think about taking a photo of the US from space using your nifty digital camera and then having a histogram of the lights.
We basically are binning the light so we know how many pixels fall into each level of light intensity.
And that light intensity holds the potential to understand population dynamics -- we could ballpark the number of people on the ground -- allowing researchers to tie it to labor force estimates and economic output.
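The binning described above can be sketched in a few lines. This is only an illustration, not the team's pipeline: the brightness values, their units, the bin width, and the function name `light_histogram` are all assumptions made for the example.

```python
from collections import Counter


def light_histogram(pixels, bin_width):
    """Count how many pixels fall into each light-intensity bin.

    pixels: iterable of non-negative brightness values (made-up units).
    bin_width: width of each intensity bin (an assumed, tunable choice).
    Returns a dict mapping the lower edge of each bin to a pixel count.
    """
    counts = Counter()
    for value in pixels:
        lower_edge = int(value // bin_width) * bin_width
        counts[lower_edge] += 1
    return dict(sorted(counts.items()))


# Hypothetical brightness values for a tiny patch of imagery.
patch = [0, 1, 2, 2, 3, 7, 8, 12, 40, 41]
hist = light_histogram(patch, bin_width=10)
print(hist)
```

Most pixels in this invented patch land in the dimmest bin, with a couple of bright outliers, which is the general shape the notes describe.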
This representation of data holds clues to how society collectively behaves. Let's put it into an example.
Let’s zoom in a bit on the 35 largest metro areas in the US.
See the spider web patterns and the clustering of light. That indicates patterns in urban development, sprawl, economic activity, and residential activity.
And using nighttime lights we can quantify it.
In fact, when we break down satellite imagery into histograms, we can see clear differences in the amount and intensity of light.
Cities with less light will have smaller histograms.
Cities with more light and higher population density will have a tail to the right.
The more clustered the central business district is in small cities, the longer the right tail.
In New York, the light distribution has a mix of dim and bright lights. But in Las Vegas, it’s dimmer with one super bright urban core.
One intensely bright pixel in one city will not mean the same thing as an equally bright pixel in another.
The clustering, residential patterns, and employment will also differ.
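One way to put a number on "a tail to the right" is skewness, which is positive when a distribution has a long right tail. The sketch below assumes made-up brightness values for two hypothetical cities. Skewness is just one possible summary and not necessarily the measure the team uses.

```python
from statistics import mean


def skewness(values):
    """Population skewness: third central moment over the cubed standard deviation."""
    m = mean(values)
    n = len(values)
    second = sum((v - m) ** 2 for v in values) / n
    third = sum((v - m) ** 3 for v in values) / n
    return third / second ** 1.5


# Hypothetical brightness values, not real VIIRS data.
even_city = [10, 20, 30, 40, 50]          # light spread evenly
core_city = [5, 5, 5, 5, 5, 5, 5, 5, 90]  # dim, with one very bright core

print(round(skewness(even_city), 2))
print(round(skewness(core_city), 2))
```

The evenly lit example is symmetric, so its skewness is zero, while the city with a single bright core has a strongly positive value.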
Our team is experimenting with ways to convert the signal into more timely measures of society and the economy, and to find where we can develop derivative data series.
The key to new data-driven societal insights is somewhere in that data.
But we're certainly not the first to take a crack at it and it doesn’t take much effort to find brilliant scientists at Commerce who are finding ways to use the data.
For example, Dr Chris Elvidge -- a remote sensing scientist based out of NOAA’s Boulder Research Facility -- has spent most of his career drumming up ways of using nighttime imagery.
Using VIIRS, he has found ways to detect:
illegal fishing,
the location and spread of wildfires and
gas flares that add greenhouse emissions.
Also, VIIRS can help estimate GDP and other social indicators, especially in the rural parts of the developing world, as well as measure the ROI of electrification projects.
The data is there. It’s collected every day. And there is more there than many of us could imagine.
Just from the VIIRS instrument, we collect about 2.5 terabytes of raw data per day that expands out to much more when we consider all the processed data.
This is what Commerce is about.
We collect some of the highest value data around and find ways to use it to advance and better society and the economy.
This is what my team is about.
I’m part of the leadership team of the Commerce Data Service, a new data startup within the Office of the Secretary, where I lead data science initiatives advancing the missions of the 12 bureaus of Commerce.
The Data Service was established in November last year and we've been quickly growing and moving to take on some of the hard problems across the bureaus...
Bureaus like the Census Bureau, NOAA, the Patent + Trademark Office, Bureau of Economic Analysis among other agencies that produce about 36% of the federal open data available through data.gov.
Essentially, we're one of the data big dogs.
As the Deputy Chief Data Officer of the US Department of Commerce, I have the extraordinary privilege of working with some of the brightest scientists and policy makers in the country.
We have satellites and radar stations that help us understand the environment.
We conduct well over 200 of the highest quality demographic and economic surveys in the world, which support research on trade, urban planning and schooling.
And it's not for nothing. I'd like to take you through what it means to work on data projects in government.
Government takes on the hardest problems and we need data to take on those problems.
If any one person needs help and asks for help, it’s the government that needs to step up to the challenge, whether it’s for defense, homelessness, housing, healthcare, education or the economy.
According to the Census Bureau, we have nearly 320 million Americans. That’s 320 million customers.
At the Commerce Data Service, we are doing our part by helping to make government more data-driven.
But given the nature of our portfolio, we have to work differently.
I often hear people start a data conversation with “what’s your stack?”, “how fast is your GPU cluster?”, “are you a spark guy?”. This indicates to me that someone is starting a project with technology first.
Well, the thing is, our modes of interaction with our customers are not usually through micro-touches such as purchases, likes, views.
The actions of a government are mostly in long touches -- hard conversations, in person services, laws and policies to create the right conditions.
This is a hard realization for me.
The first conversation a data scientist needs to have when starting a gov project is with the people out in the field.
It's humbling, it's tough, but ultimately, there is more to algorithmic accuracy than the data. There’s the operational awareness.
Both are equally important. We need to take a hard look at what data can actually do.
In government, data science projects need to start with conversations around signal + purpose.
Signal pertains to the substance of data. It’s about whether that data even makes sense for what you want to do: whether it matches the right time frames, the geographic resolution, and the fidelity and reliability of the way it’s collected. There are data systems that can detect wildfires, but as amazing as they are, if they are slightly off the decision time scale, they can’t be used.
Data is an amazing national resource, but it needs to be shaped and understood.
For data to effect change, we need adoption of products. Adoption is achieved through understanding purpose. We’re here to do good. We need to have a purpose to do good.
A great mission might not have good data.
Great data might not have an actionable purpose.
Jointly, signal and purpose are a way to proxy for viability.
Ultimately, in government we do not have simple 1 or 2 dimensional problems, because data is only one of n dimensions of a project when considering all else in the world.
Thus, to ensure we're doing right by the public, we've worked out a set of six conditions for data and delivery awesomeness:
A reason for existence: Why is there a policy, program or process? How does it work? What is the system blueprint -- tech and social? This is the key for developing a theory of change.
Access to the field. We need to speak with people who actually act on information and understand how they view new products and data. It's ultimately about them.
Access to actionable data. We need to be able to dive quickly and deeply into the data to find signal, as a data product without signal in the data is just a pretty picture.
Ethical intervention points. Using the social blueprint, we need to find an intervention point where a data science product would make sense.
Methodologically defensible yet intellectually accessible. Many data scientists like to go down the path of algorithmic splendor, but we can't do that in our world as it alienates too many stakeholders. So, our work needs to be methodologically bulletproof by research standards but explainable by a generalist. Once we have buy-in, we can re-introduce that splendor.
Path to sustainability. Lastly, projects need an endpoint or a reason to be sustainable. And this is born out of testing.
These conditions allow us to create change, influence strategy, and seed innovation.
And we apply this to all projects in our current portfolio of 40 projects.
The vast majority are in the R&D phase, but I'd like to talk about a few projects that are now in the open.
One of our efforts uses data science to help strengthen export services.
And to broaden and deepen impact, the International Trade Administration (ITA) and the US Commercial Service, which has trade specialists in 100 cities and 75 countries worldwide, are collaborating with the Commerce Data Service to incorporate data into their US national field strategy.
Example client
We call this the New Exporters Project and it’s an effort to experiment using data science to combine ITA’s client data with commercial data sources to find untapped markets.
In a given year, ITA reaches thousands of businesses, providing everything from business matchmaking services to market reports to company due diligence.
ITA is looking to reach far more businesses through their business disruption initiative. By fine-tuning services by customer segment, they can reach a far broader audience of businesses.
Here are a few examples of what data science can do:
Think about all the companies that are export-ready and don’t know it. Using a combination of unsupervised learning and supervised learning, we’re developing fine-tuned ways of searching for untouched companies, figuring out which company types are more likely to use which types of services, and migrating to a market-wide view.
Consider the trade specialist in rural America who may need to drive 2 hours to meet a potential exporter. That’s a huge time spend. We’re developing scoring models to figure out the potential utility of our services ahead of time, before that long drive.
For example, smaller manufacturing facilities may be associated with lighter touch services like market reports – so an emailed report may actually be a better first step. Likewise, small to medium sized businesses with a larger market cap in certain industries may be able to afford to invest in developing international relationships.
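The idea of knowing the likely utility of a visit before making the drive can be sketched as a simple expected-value comparison. Everything below is hypothetical: the conversion probabilities, the dollar values, the travel cost per hour, and the function names are illustrative assumptions, not ITA's actual scoring model.

```python
def expected_utility(p_convert, value_if_converted, travel_hours, cost_per_hour):
    """Expected payoff of an in-person visit, net of round-trip travel time."""
    return p_convert * value_if_converted - 2 * travel_hours * cost_per_hour


def recommend_first_touch(p_convert, value_if_converted, travel_hours,
                          cost_per_hour=50.0):
    """Suggest a visit only when its expected utility is positive."""
    utility = expected_utility(p_convert, value_if_converted,
                               travel_hours, cost_per_hour)
    return "in-person visit" if utility > 0 else "emailed market report"


# Hypothetical prospects: conversion probability, value, one-way drive hours.
print(recommend_first_touch(0.05, 2000.0, 2.0))  # low odds, long drive
print(recommend_first_touch(0.40, 2000.0, 2.0))  # better odds, same drive
```

In a real setting the conversion probability would come from a fitted model rather than being typed in by hand. The sketch only shows how such a score could feed the decision about the first touch.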
Which positions in a company will use which services?
It may be that different positions in a given company ask for one service over another -- but creating a rule of thumb is a statistical research problem. Having biz dev in a title may be associated with more light touches. A CEO title may actually be a wildcard. So, having a good lead-off offering could be the difference between use and non-use.
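Probabilities of this kind can be estimated by simple counting. The sketch below assumes a hypothetical log of (job title, service used) pairs. The titles, services, and counts are invented for illustration and do not come from real client data.

```python
from collections import Counter, defaultdict


def transition_probabilities(observations):
    """Estimate P(service | title) from (title, service) pairs by counting."""
    counts = defaultdict(Counter)
    for title, service in observations:
        counts[title][service] += 1
    return {
        title: {svc: n / sum(svc_counts.values()) for svc, n in svc_counts.items()}
        for title, svc_counts in counts.items()
    }


# Hypothetical interaction log, not real client data.
log = [
    ("biz dev", "market report"),
    ("biz dev", "market report"),
    ("biz dev", "market report"),
    ("biz dev", "matchmaking"),
    ("CEO", "market report"),
    ("CEO", "matchmaking"),
]
probs = transition_probabilities(log)
print(probs["biz dev"])
print(probs["CEO"])
```

In this invented log the biz dev contacts lean toward the lighter-touch service, while the CEO split is even, which is the kind of "wildcard" pattern that makes a rule of thumb hard to state without more data.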
Exporting is clearly a Commerce priority. We’re just getting started.
One of the priorities at Commerce is data education and upskilling – both internally and externally.
More data skills will improve efficiency. The smallest behavioral change may scale. So, at Commerce, we’ve launched an internal initiative called the Commerce Data Academy.
Back in December, the Data Service launched the Commerce Data Academy to show what’s possible through data.
We started with a pilot of 4 three-hour classes taught by General Assembly.
And as it was a pilot, we didn’t think that we would end up with 422 registrations and a 90% attendance rate. Who would’ve thought?
We then started to think… what if we went big? Hail Mary it. And expand the offering to cover JavaScript, machine learning, and basic programming.
And we scaled it to 14 three-hour classes taught by our Data Service staff, with 2 two-week-long intensives taught by General Assembly.
We’ve seen a huge bump.
Now we have 3,500 registrations.
In addition, the 10 most committed public servants from the Academy are now on detail with our shop to exercise those new skills to build products and capacity for their home agencies.
This model has worked out so well that at least one other agency has forked the CDA model.
Four times more courses led to 6.9x growth in interest, which really tells us that there is unlimited potential to disrupt the skills space.
The upshot is that showing our skills in the open has established data skills as a “thing” within the Department of Commerce, and there is now a new internal market for data products.
Another area we are focusing on is Data Usability.
Commerce has some of the most highly valued data sets. Unfortunately, they are often under-utilized or unused, primarily because they are difficult to find, hard to understand and even harder to process (because many do not understand the collection constraints involved in the production of the data).
Usability of data depends on context, examples, and a compelling purpose. And to help open data move to open knowledge, we’re stepping up our game.
We launched the Commerce Data Usability Project to publish long form tutorials that illustrate data use cases, code, and narrative around high-value, high potential data. And it's targeted at undergraduate and graduate students -- the next generation of data scientists who are hungry to learn.
We’ve partnered with private sector companies, academia, and nonprofits to show how data is being used around the country.
We have a nice bench of contributors and more always coming.
- Mapbox has contributed two tutorials on how to get started with interactive web maps using NOAA Global Weather Forecast data;
- Zillow has produced a tutorial on analyzing housing affordability combining their data and Census data;
- Earth Genome illustrated how to manipulate digital elevation model data that plays a key role in wetlands models.
We are highlighting the power of contextualizing and illuminating #OpenData.
How many people here believe that #OpenData can currently help them find their customers and users? The Commerce Data Service provides very specific detail on doing just that using data from the Census American Community Survey (ACS). See http://commercedataservice.github.io/tutorial_acs_rank/.
#OpenData from the Department can help businesses:
- understand their computer security (http://commercedataservice.github.io/tutorial_nist_nvd/);
- find affordable housing options for their employees (http://commercedataservice.github.io/tutorial_zillow_acs/);
- determine weather risk (http://commercedataservice.github.io/tutorial_noaa_hail/);
- predict rainfall and flooding issues (http://commercedataservice.github.io/tutorial_mapbox_part1/);
- determine hotbeds of human activity using satellite data (http://commercedataservice.github.io/tutorial_viirs_part1/);
- address water management concerns (http://commercedataservice.github.io/tutorial_earthgenome/).
Microsoft and Columbia University have signed up to release, in the coming weeks, a series of tutorials on how to begin using analytical tools. Many more to come and we welcome collaborations.
There is agreement out there that a product gets used if people are furnished with a basic understanding of what that product is.
In data and tech, free and balanced education really is a powerful tool. More and more organizations want to show how open data works for them.
Our tutorials are designed to engage data audiences, encourage adoption of datasets and associated workflows, and facilitate innovation. To do this, we’ve ensured that all tutorials are built according to the following guidelines:
A novel analysis or question posed to the data
Visually arresting graphics
Open and free code and data for the public to use. It is important to note that we are language, method, and approach agnostic.
This is what you have to do if you want to contribute to the initiative.
Income Inequality is one of the formidable challenges of our time.
However, it is a hard topic … and not many people talk about or interact with it because of this.
Our mission was to use data to drive this conversation.
We want to create a data-driven platform to focus on this issue.
The first thing we have to do is examine the data sources.
The ACS does not have the detail that we require.
The Census Current Population Survey (CPS) has limitations that preclude us from having a conversation on the detailed data.
These limitations include:
Medians falling in the upper, open-ended interval are plugged with “$250,000”
The data sets aggregate everyone above $100,000 together
Limitations on job-to-job comparison
Granularity of breakdowns
The ACS PUMS is the data set that we chose to use.
Very Rich Data Set:
Individual and Household Data sets
Income breakdowns by types
Job breakdown by industry
Geographic breakdown below State
Difficult to Use:
USA individual file alone is 2 Excel files!!!
Data Dictionary 138 pages!!!
Very specific ways to match variables that are difficult to understand
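Part of what makes microdata harder to use than published tables is that each record stands for many people, so summaries have to be weighted. Below is a minimal sketch of a weighted median over a handful of made-up records. PINCP (total person income) and PWGTP (person weight) are PUMS variable names, while the values and the function name `weighted_median` are invented for illustration.

```python
def weighted_median(records, value_key="PINCP", weight_key="PWGTP"):
    """Return the value at which cumulative weight first reaches half the total."""
    ordered = sorted(records, key=lambda r: r[value_key])
    total = sum(r[weight_key] for r in ordered)
    running = 0
    for r in ordered:
        running += r[weight_key]
        if running >= total / 2:
            return r[value_key]
    raise ValueError("no records supplied")


# Made-up records: each one represents PWGTP people with income PINCP.
sample = [
    {"PINCP": 18000, "PWGTP": 30},
    {"PINCP": 42000, "PWGTP": 50},
    {"PINCP": 75000, "PWGTP": 15},
    {"PINCP": 260000, "PWGTP": 5},
]
print(weighted_median(sample))
```

An unweighted median of these four incomes would fall between 42,000 and 75,000, which shows why ignoring the weights can distort the picture.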
MIDAAS is an API and website that unpacks the ACS PUMS data and creates a forum for us to have that discussion.
Another issue is the School-to-Prison pipeline.
We’re just warming up. That’s just a few of the 40 projects. Big ones on the way. Stay tuned.