1. The U.S. Census Bureau faces challenges from the rise of big data sources produced outside of traditional government surveys. These new sources are generated faster and more cheaply than surveys.
2. To remain a reliable source of demographic and economic information, the Census Bureau must integrate these new big data sources with traditional surveys. This requires linking massive datasets and developing new statistical modeling techniques.
3. The Census Bureau is exploring ways to use new big data sources like web search data, social media, and e-commerce transactions to improve surveys and provide more timely, detailed information. However, maintaining privacy and developing new technology are difficult.
U.S. Census Bureau's Big Data Activities
1. Big Data activities
at the U.S. Census Bureau
Cavan Capps
Big Data Lead
U.S. Census Bureau
February 13, 2014
Prepared for
MIT Libraries Program on Information Science Brown Bag Talk
Feb 2014
2. Big Data Challenge at the Census Bureau
“Designed Data” vs. “Organic Data”
“The world is now producing large amounts of data ... data from Internet
searches, credit card transactions, retail scanners, and social media.”
“There also are more and more digital administrative data (e.g., tax
records, social security records, Medicare/Medicaid records, food stamp
records, HUD records). Some of these data are not directly linked to the
populations we study; some have item missing data problems; none
offer a real replacement for our surveys, but many will be useful as
auxiliary data sources.”
3. Big Data Challenge at the Census Bureau
Big Data is about creating information to make Big
Decisions from novel, and often massive data sources.
4. Big Data creates new Statistical Agency Challenges
A recent meeting of International Statistical Agencies observed:
1. The volume of data generated outside the government statistical
systems is increasing much faster than the volume of data collected
by the statistical systems; almost all of these data are digitized in
electronic files.
2. As this occurs, the leaders expect that relative cost, timeliness, and
effectiveness of traditional survey and census approaches of the
agencies may become less attractive.
5. Big Data creates new Statistical Agency Challenges
A recent meeting of International Statistical Agencies observed:
3. Blending together multiple available data sources
(administrative, commercial electronic transactions and internet webpage data, search frequency data, Twitter, Facebook, etc.) with
traditional surveys and censuses (using paper, telephone, face-to-face interviewing) to create high quality, timely statistics that tell a
coherent story of economic, social and environmental progress must
become a major focus of central government statistical agencies.
4. This requires efficient record linkage capabilities, the building of
master universe frames that act as core infrastructure to the blending
of data sources, and the use of modern statistical modeling to
combine data sources with highest accuracy.
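The record linkage capability named in item 4 can be illustrated with a minimal sketch. It assumes a small master frame and an incoming file that share only a name and an address; the field names, the sample records, and the 0.85 similarity cutoff are all hypothetical, and the fuzzy comparison uses Python's standard `difflib` rather than any method described in the talk.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    kept = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in text.lower())
    return " ".join(kept.split())


def link_records(frame, incoming, threshold=0.85):
    """Link each incoming record to at most one master-frame record.

    An exact match on the normalized name+address key is tried first.
    Otherwise the most similar frame key is accepted only if its similarity
    ratio reaches `threshold` (an illustrative cutoff, not a tuned one).
    """
    keyed = {normalize(r["name"] + " " + r["address"]): r["frame_id"] for r in frame}
    links = {}
    for rec in incoming:
        key = normalize(rec["name"] + " " + rec["address"])
        if key in keyed:
            links[rec["src_id"]] = keyed[key]
            continue
        best_key, best_score = None, 0.0
        for cand in keyed:
            score = SequenceMatcher(None, key, cand).ratio()
            if score > best_score:
                best_key, best_score = cand, score
        links[rec["src_id"]] = keyed[best_key] if best_score >= threshold else None
    return links


# Hypothetical master frame and incoming commercial file.
frame = [
    {"frame_id": "F1", "name": "Acme Hardware Inc.", "address": "12 Main St"},
    {"frame_id": "F2", "name": "Bayside Bakery", "address": "400 Harbor Rd"},
]
incoming = [
    {"src_id": "A", "name": "ACME HARDWARE, INC", "address": "12 Main St."},
    {"src_id": "B", "name": "Bayside Bakry", "address": "400 Harbor Rd"},
    {"src_id": "C", "name": "Zenith Motors", "address": "9 Elm Ave"},
]
links = link_records(frame, incoming)
```

Comparing every incoming record against every frame record does not scale to massive files; production linkage systems typically restrict comparisons to blocks of candidate records, which is one reason the slide pairs record linkage with high-speed infrastructure.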
6. Big Data creates new Statistical Agency Challenges
A recent meeting of International Statistical Agencies observed:
5. The Agencies will need to develop the analytical capabilities to
distill insights from more integrated views of the world and impart
a stronger systems view across different government and private
sector information systems to provide more geographical and
industry detail.
6. There are growing demands from researchers and policy-related
organizations to analyze the micro-data collected by the agencies, to
extract more timely and detailed information from the data.
7. Big Data Development Challenges for Statistical Agencies
The Meeting Recommended that Statistical Agencies develop:
1. High-speed, “big data” software/hardware systems for record
linkage and extraction of key information from massive files.
2. Efficient and sophisticated imputation procedures needed to make
the combined data sources jointly useful.
3. More use of statistical modeling for statistical estimation, to provide
more:
   1. Timely estimates
   2. Small area estimates
   3. New measures
4. New ways to give secure access to micro-data for legitimate policy
and research purposes, to increase the impact of their work.
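As a rough illustration of the imputation procedures in item 2, the sketch below fills a missing value with the median of reported values from records in the same class. This is a deliberately simple stand-in, not the Bureau's procedure; the field names (`industry`, `sales`) and the records are hypothetical.

```python
from statistics import median


def impute_by_class(records, class_key, value_key):
    """Fill missing values with the median of reported values in the same class.

    Records whose class has no reported value are left missing (None).
    Every output record carries an `imputed` flag so users can tell
    reported values from filled ones.
    """
    donors = {}
    for rec in records:
        if rec[value_key] is not None:
            donors.setdefault(rec[class_key], []).append(rec[value_key])
    completed = []
    for rec in records:
        filled = dict(rec)
        filled["imputed"] = False
        if rec[value_key] is None and rec[class_key] in donors:
            filled[value_key] = median(donors[rec[class_key]])
            filled["imputed"] = True
        completed.append(filled)
    return completed


# Hypothetical combined file with item-missing data.
records = [
    {"id": 1, "industry": "retail", "sales": 120.0},
    {"id": 2, "industry": "retail", "sales": None},
    {"id": 3, "industry": "retail", "sales": 80.0},
    {"id": 4, "industry": "construction", "sales": 300.0},
    {"id": 5, "industry": "mining", "sales": None},
]
completed = impute_by_class(records, "industry", "sales")
```

Keeping the `imputed` flag alongside each value is a design choice: it lets downstream estimation account for the extra uncertainty that imputation introduces.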
8. In Summary, massive challenges for the Statistical Agencies:
1. The Internet and Private E-Transactions are generating data faster
and more cheaply than Statistical agencies can afford to do.
2. To serve as reliable sources of information on the Demographics, Economy
and Social change in the U.S., these data need to be mashed
together with traditional surveys and adjusted for bias.
3. The sizes of the files and the number of computations to mash up the
data will be larger.
4. Spoiled by the Internet, users expect more timely and detailed data
provided at lower costs.
5. Privacy/Confidentiality must be maintained.
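One common way to carry out the bias adjustment mentioned in item 2 is post-stratification: reweighting an unrepresentative source so that its group shares match known population shares. The sketch below assumes that technique; the age groups, counts, shares, and group means are made-up numbers, and the talk does not say which adjustment the Bureau uses.

```python
def poststratify(sample_counts, population_shares):
    """Return one weight per stratum so the weighted sample matches population shares."""
    total = sum(sample_counts.values())
    return {
        stratum: population_shares[stratum] / (count / total)
        for stratum, count in sample_counts.items()
    }


def weighted_mean(stratum_means, sample_counts, weights):
    """Mean over all records, with each record weighted by its stratum weight."""
    num = sum(stratum_means[s] * sample_counts[s] * weights[s] for s in sample_counts)
    den = sum(sample_counts[s] * weights[s] for s in sample_counts)
    return num / den


# Hypothetical organic source that over-represents younger users.
sample_counts = {"18-34": 600, "35-64": 300, "65+": 100}
population_shares = {"18-34": 0.30, "35-64": 0.50, "65+": 0.20}
stratum_means = {"18-34": 0.8, "35-64": 0.5, "65+": 0.2}

weights = poststratify(sample_counts, population_shares)
raw = weighted_mean(stratum_means, sample_counts, {s: 1.0 for s in sample_counts})
adjusted = weighted_mean(stratum_means, sample_counts, weights)
```

With these inputs the unadjusted mean is pulled toward the over-represented group, while the adjusted mean equals the population-share-weighted average of the group means. Post-stratification only corrects for differences captured by the chosen strata.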
9. Big Data Projects at the Census Bureau
The Census Bureau “Big Data”
Information Life Cycle
Data Collection
- Multi-Mode Data Survey Collection model
- New Data sources (Web, E-Transactions, Admin Recs)
Data Integration & Analysis
- Record Linkage
- Small Area Estimation modeling & “Now Casting”
Data Release
- Data Review for Release
- Confidentialize data for public release
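The slide does not say how data are confidentialized. As one simple illustration, the sketch below applies primary cell suppression to a table of counts: any nonzero count below a threshold is withheld. The threshold of 3 and the table itself are assumptions, and real disclosure avoidance also needs complementary suppression or other protections so withheld cells cannot be recovered from totals.

```python
def suppress_small_cells(table, min_count=3):
    """Withhold (set to None) every nonzero count below `min_count`.

    This is primary suppression only; it does not protect against
    recovering a withheld cell by subtraction from row or column totals.
    """
    return {
        cell: (None if 0 < count < min_count else count)
        for cell, count in table.items()
    }


# Hypothetical table of establishment counts by county and industry.
table = {
    ("County A", "retail"): 41,
    ("County A", "mining"): 2,
    ("County B", "retail"): 0,
    ("County B", "mining"): 7,
}
published = suppress_small_cells(table)
```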
10. Big Data: Current Process vs. Future Process (exploring)
Current Process
• Designed Data
• Proprietary Software
• Batch Processing
• Long processing times
Future Process (exploring)
• Designed & Organic Data
• Next Generation Open-Source & Proprietary Software
• More Parallel Processing
• Faster processing times
11. Big Data Collection: Improving Survey Logistics & Cost
Improving Survey Collection and Imputation Operations (Adaptive
Design)
1. Multi-modal data collection to reduce operational costs of
data collection
– More effective use of existing data such as
administrative records
– Incorporating new data into decennial operations
• Paradata from Internet Data Capture
• Information from Social Media Feeds
2. Edits and Imputations
3. Data Review
12. Big Data Collection: Evaluating Web Data as Inputs
Potential Internet Data Collection
1. Examine Google & Bing search frequency trend data
2. Examine Twitter and other social media trend data
3. Examine “Web Scraping” of housing data, price data, local
tax data, crime data, corporate profits etc.
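The “Web Scraping” item can be sketched with Python's standard `html.parser`. The example parses a hypothetical permit listing held in a string; the page layout, column names, and values are invented for illustration, and a real scraper would also need to fetch pages over HTTP and respect each site's terms of use.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical page content from a local jurisdiction's public permit listing.
PAGE = """
<table>
  <tr><th>Permit</th><th>Type</th><th>Units</th></tr>
  <tr><td>2014-0001</td><td>New residential</td><td>4</td></tr>
  <tr><td>2014-0002</td><td>Alteration</td><td>0</td></tr>
</table>
"""

parser = TableParser()
parser.feed(PAGE)
header, *body = parser.rows
permits = [dict(zip(header, row)) for row in body]
new_units = sum(int(p["Units"]) for p in permits if p["Type"] == "New residential")
```

Because every jurisdiction publishes in its own layout, a parser like this would have to be adapted site by site, which is part of what makes the scraped sources costly to standardize.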
13. Big Data Collection: Evaluating Commercial E-Transaction
Input Data
1. Housing:
– Foreclosures: Use vendor data on new residential properties in
foreclosure to aid analysis of data on new construction and sales.
– Building Permits: Web scrape opportunity to access local jurisdictions and
state agencies posting public records online.
2. Construction:
– Difficulty obtaining electronic data from numerous state and local agencies
– Data are needed immediately to tabulate the monthly economic indicators.
3. Retail Sales: Evaluating electronic payment processing to fill data gaps such
as geographical detail and revenue measures by firm size
– New data products
– Improvements to current data quality
14. Big Data Integration & Analysis: (Current processes)
Data Integration Expertise:
• Record linkage
– Gov’t Admin Records to other Gov’t Admin Records
– Gov’t Admin Records to Gov’t Surveys
– Commercial records to Gov’t Admin Records
• Model based integration
– Small Area Poverty & Income Estimates
– Small Area Health & Income Estimates
– Longitudinal Economic & Housing Dynamics
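Model-based small area estimates of the kind listed above are often built as a composite: a direct survey estimate for an area is shrunk toward a model-based (synthetic) estimate, with more shrinkage where the direct estimate is noisier. The sketch below shows that general idea in the style of an area-level model; it treats the model variance as known and takes the synthetic estimates as given, both simplifying assumptions, and all numbers are invented. It is not the Bureau's production methodology.

```python
def composite_estimate(direct, sampling_var, synthetic, model_var):
    """Shrink a direct survey estimate toward a model-based (synthetic) one.

    gamma = model_var / (model_var + sampling_var), so an area with a noisy
    direct estimate (large sampling_var) leans more on the synthetic estimate.
    Returns the composite estimate and the weight gamma given to the direct one.
    """
    gamma = model_var / (model_var + sampling_var)
    return gamma * direct + (1.0 - gamma) * synthetic, gamma


MODEL_VAR = 0.0004  # assumed known here; in practice it is estimated

# Hypothetical poverty rates: a large county with a precise direct estimate
# and a small county with a noisy one.
large, gamma_large = composite_estimate(0.12, 0.0001, 0.15, MODEL_VAR)
small, gamma_small = composite_estimate(0.30, 0.0036, 0.16, MODEL_VAR)
```

In this example the large county keeps most of the weight on its own survey estimate, while the small county's estimate moves most of the way toward the model prediction.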
15. Big Data Integration & Analysis: Exploring “Now Casting”
Exploring “Now Casting” to improve Statistical Timeliness:
1. Some “real time” Internet data correlates with Official Statistics:
– Google search data modeled to match BLS unemployment &
CDC Flu spread
– Univ. of Michigan Twitter unemployment
– MIT Billion Prices Project match to BLS CPI
2. Census experiments with Gov’t Pension data
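The basic mechanics of “now casting” can be sketched as a regression: fit the historical relationship between a fast indicator and the official statistic, then apply it to the indicator's latest value before the official figure is published. The series below are made up, and a single-predictor least squares line is an assumption chosen for brevity; the projects cited on the slide use their own, more elaborate models.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Made-up history: a search-volume index and the official rate published later.
search_index = [40.0, 50.0, 60.0, 70.0]
official_rate = [5.0, 5.5, 6.0, 6.5]

intercept, slope = fit_line(search_index, official_rate)

# The indicator for the current month is available before the official figure.
nowcast = intercept + slope * 80.0
```

A relationship fitted on past data can break down when search or posting behavior changes, so nowcasts of this kind need to be re-benchmarked against the official series as it is released.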
16. Big Data Lab
1. Setting up an experimental Cluster
2. Testing performance of Hardware
3. Testing value of Software
– Open Source Big Data Software:
Hadoop, Mahout, Distributed R, HBase, Pig, Hive,
Cassandra, MongoDB, Flume, Neo4j, igraph,
AllegroGraph
– Internally Developed software:
TEA, DataWeb, Matching software
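Several of the tools listed above, Hadoop in particular, are built around the map/shuffle/reduce pattern, which is what lets a job move from batch processing on one machine to parallel processing on a cluster. The sketch below imitates that pattern in plain Python on in-memory partitions; it does not use Hadoop itself, and the records and field names are hypothetical.

```python
from collections import defaultdict


def map_phase(record):
    """Emit (key, value) pairs for one input record."""
    yield record["state"], record["amount"]


def shuffle(pairs):
    """Group emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single total."""
    return key, sum(values)


def run_job(partitions):
    """Run map over every partition, then shuffle and reduce."""
    pairs = [pair for part in partitions for rec in part for pair in map_phase(rec)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


# Hypothetical transaction file split into two partitions.
partitions = [
    [{"state": "MD", "amount": 10}, {"state": "VA", "amount": 7}],
    [{"state": "MD", "amount": 5}],
]
totals = run_job(partitions)
```

Each partition's map step depends only on its own records, so on a cluster the partitions can be processed on different machines at the same time; only the shuffle needs to move data between them.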
17. On the Horizon, Development of Big Data Center
Research, capacity building and economic Big Data
Processing:
1. Proposal to create a new center that will include members from academia and
Census staff to:
   1. Help lead Census Bureau work on practices to make sense of Big Data,
   developing principles to apply Big Data to federal statistics.
   2. Facilitate the Census Bureau (CB) as an unbiased provider of information collected as Big Data
   3. Validate new techniques and data sources at a low cost (field staff
   allow us to do ground checks, survey questions)
   4. Lead on methods to integrate Big Data and develop standards
   5. The Center should provide a way to bring both faculty and graduate
   students to Census to facilitate Big Data capacity building at the Census
   Bureau
2. We will explore partnerships with others doing research in this area:
universities and Silicon Valley
Editor's Notes
This work by Cavan Capps <www.linkedin.com/pub/cavan-capps/12/201/523> is licensed under the Creative Commons Attribution-Share Alike 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/3.0/us/ or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.
The use of Big Data isn’t new for the Census Bureau. We’ve been using administrative records, such as tax data, for decades to improve our collections. However, there is a new generation of Big Data – as the electronic environment flourishes – that we must keep up with. We must research ways to utilize these new data sources in our collections in order to increase efficiencies and to reduce costs and the time it takes to disseminate statistics. At the same time, we must also continue to maintain the quality of the official statistics. I’ll be addressing these aspects throughout my talk today.
I was asked today to talk about two specific questions. I’ll address these questions broadly and then share some case studies of how Big Data is being used in our programs at the Census Bureau. I’ll also briefly touch on a Big Data source coming from the Census Bureau and ways the private sector could use our data in concert with Big Data.
Bob Groves called traditional survey data specifically created to measure something “Designed data.” The private sector maintains vast troves of transactional data, much of which is “data exhaust,” or data created as a by-product of other transactions. With the use of mobile phones, much of this data can be associated with individuals and their locations. The public sector in most countries also maintains enormous datasets in the form of census data, health indicators, and tax and expenditure information. … The global internet is currently offering near real-time data on durable and nondurable goods prices, housing sales, and other relevant events. This data exhaust can also be termed “Organic data,” which has its own strengths and weaknesses.
The Census Bureau is the largest statistical agency in the U.S. Many of the Nation’s economic indicators and other critical socio-economic measures come from the Census Bureau. Similarly, much of the data we collect and process are critical inputs to major economic indicators and measures produced by other statistical organizations. We cannot afford to ignore the opportunities offered by these new data sources and techniques.
The first question I was asked to consider is whether the Census Bureau is working on Big Data projects and how these differ from other projects. My initial response is a resounding yes: the Census Bureau is incorporating Big Data solutions to improve the efficiency of its operations throughout the information lifecycle. We are exploring new sources of data and processing techniques to improve our products and increase the efficiency of our operations. The intent is for these projects to result in enterprise-wide solutions that support all surveys and census operations across the Census Bureau. This is different from our current processes and technology, which are developed to support individual surveys and census operations. Examples of these efforts include…
Data Collection. For data collection, we are moving to a Multi-Mode Data Collection model (for survey and census data collection) that utilizes different collection modes based on survey response rate, quality, cost and several other factors to effectively collect data. We are architecting a Big Data environment that makes it easier to collect large volumes of data from various sources, integrate with internal and external sources of data, and make real-time decisions about effective collection modes.
In terms of Data Analysis, we are researching Big Data methodological techniques, such as modeling or mashing (or integrating) together a variety of data sources, that allow us to work effectively with the Big Data. We’re also exploring technology solutions, such as High Performance and Distributed Computing Environments, to improve the effectiveness and speed of data analytics, aided by better visualization techniques that incorporate geographic information.
And for Data Release, we are exploring the use of correlated “Big Data” sources to improve and speed data review and to test that the released data maintain privacy and confidentiality.
Currently, most of the Census statistical processing is based on designed surveys or designed measures from administrative data. Most of the processing is batch processing in SAS. Depending on the size of the data, processing times can be lengthy. Most speed improvements have been achieved by increasing the size of the machine.
In the future, as more data are combined from various sources of organic as well as designed data, data sizes may grow rapidly. User expectations are also growing: users expect data to be released in a more timely manner, with more geographic, historical and industrial detail. The stress to deliver this information while maintaining strict confidentiality will explode. As a result, new estimation and data processing paradigms are being explored.
The use of alternative data sources such as administrative records or Big Data offers a number of opportunities for improving the current construction statistics produced by the Census Bureau and reducing data collection costs for these programs. For example:
Foreclosures
Data on residential properties in various stages of foreclosure could aid in our analysis of data on new residential construction and sales. These data are currently collected by a couple of private data vendors (for Bill’s info: CoreLogic and RealtyTrac). The Census Bureau has purchased address-level files from a data vendor for analysis related to household surveys, but they did not easily allow for calculation of totals needed for analysis of national data. We also purchased annual totals by state from another vendor for use in data analysis; however, the vendor does not allow purchasers to disseminate data to the public.
Manufactured Homes
Census conducts the Manufactured Housing Survey (MHS) for the U.S. Department of Housing and Urban Development, or HUD, to provide data that they are required to collect on manufactured home placements. By law, manufactured homes must be inspected at the factory. These inspections are conducted by the Institute for Building Technology and Safety (IBTS), which provides information on the inspections that becomes the universe and sampling frame for the Manufactured Housing Survey. If we could partner with HUD and IBTS as well as manufactured home manufacturers and dealers to follow through on the inspection forms to collect information on the placement of the home, we could use this information to tabulate data on placements. The data would have no sampling error and data collection costs would be drastically reduced.
Public Construction
Our estimates of construction spending include spending on construction funded by federal, state, and local governments, collected using voluntary surveys. Much of the information on government spending can be gleaned from publicly available budget documents. We do this to supplement and benchmark the data that we collect, but we could partner with government agencies that conduct construction (especially at the federal level) to obtain data files that would reduce our data collection costs and improve data quality. We have contacts at most agencies, but we have not yet undertaken a concerted effort to obtain the detailed electronic files that we need.
Property Owners
Census Bureau surveys collect information from homeowners on owner-occupied properties. Data on non-owner-occupied properties are more difficult to obtain because the owner of the property must be located. Various administrative sources such as the Business Register, tax data, and local deed records could provide information on property owners and their individual properties. Data on improvements to non-owner-occupied properties are no longer included in the construction spending estimates because the cost of finding the owners was prohibitive. The Rental Housing Finance Survey (RHFS, a HUD-sponsored survey) had the same problem. Reducing the cost could make it feasible to improve the construction spending estimates and would allow Census to conduct other surveys more cost effectively. However, startup costs to create an up-to-date list of property owners could be significant.
Building Permits
The largest opportunity for using administrative records for the construction area is data on building permits issued by local governments. Issuance of building permits in the U.S. is mostly at the local level, where approximately 20,000 unique jurisdictions issue permits. Some states are capturing data on all permits issued in their states, but this is not as prevalent.
Building Permits Survey
Census conducts a monthly and annual Building Permits Survey (BPS) to obtain data on the numbers of new housing units authorized from local jurisdictions. Because of cost and respondent burden concerns, data on nonresidential permits and permits for alterations and repairs are not collected.
As more and more jurisdictions computerize their operations and more states begin compiling permit data from their jurisdictions, we have the opportunity to capture individual permits (which are public records) for use in our estimates. Information on individual new residential permits could replace the current Building Permits Survey data collection, and it also has tremendous potential for updating the Master Address File used for many household surveys and for the decennial Census. Staff working on this survey are partnering with colleagues in the Census Bureau’s Geography Division to encourage local governments to work toward providing files of permits. Lists of individual permits would also greatly improve the annual population estimates, which currently rely on the use of statistical algorithms to allocate the Building Permits Survey jurisdiction totals to more local areas.
Survey of Construction
The Survey of Construction (SOC), which collects data on housing starts and new home sales, requires field representatives to list individual permits in a sample of jurisdictions to create the sampling frame. Use of individual permits received from jurisdictions could eliminate this expensive operation. Use of Certificate of Occupancy permits would also eliminate the need to follow up cases in sample until the building is completed. This would reduce the cost of interviewing by about one-third and save up to $1 million per year.
To collect data on spending on nonresidential construction, we currently purchase a list of new projects from a third party vendor each month. This list is incomplete and expensive. If we could acquire data on nonresidential permits from local jurisdictions, it would be much less expensive and more complete.
There are many opportunities when looking at Big Data for use in official construction statistics. There are also many challenges. We have had discussions with our government counterparts about how we could assist governments with automation and with standardizing the format of their data files, but jurisdictions have local regulations and custom computer systems that make standardization challenging. Likewise, these surveys are voluntary, so obtaining all permits in the U.S. would not be feasible without legislative changes. An iterative approach would be needed, starting with obtaining information from large jurisdictions with automated systems that are willing to participate.
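The notes above mention that population estimates rely on statistical algorithms to allocate Building Permits Survey jurisdiction totals to more local areas, without describing those algorithms. As a minimal sketch of what such an allocation can look like, the example below splits a jurisdiction total in proportion to each local area's share of existing housing units. Proportional allocation, the tract names, and the counts are all assumptions made for illustration.

```python
def allocate_total(total, shares):
    """Split an integer total across areas in proportion to `shares`.

    Uses largest-remainder rounding so the allocated counts are whole
    numbers that sum exactly to `total`.
    """
    weight_sum = sum(shares.values())
    exact = {area: total * weight / weight_sum for area, weight in shares.items()}
    alloc = {area: int(value) for area, value in exact.items()}
    leftover = total - sum(alloc.values())
    by_remainder = sorted(exact, key=lambda a: exact[a] - alloc[a], reverse=True)
    for area in by_remainder[:leftover]:
        alloc[area] += 1
    return alloc


# Hypothetical jurisdiction with 7 newly authorized units and three tracts
# weighted by their existing housing-unit counts.
alloc = allocate_total(7, {"tract_1": 50, "tract_2": 30, "tract_3": 20})
```

With lists of individual permits, as the notes propose, this modeling step would be unnecessary because each permit could be assigned directly to its own location.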