FME Data Transformation for the Geographic Support System Initiative
1. FME Data Transformation for
the Geographic Support
System Initiative
Jay E. Spurlin
Software Architect and Development Manager for the
GSS-I Feature Source Evaluation software system
April 8, 2013
2. U.S. Census Bureau
• The Census Bureau serves as the leading
source of quality data about the nation's
people and economy. We honor privacy,
protect confidentiality, share our expertise
globally, and conduct our work openly. We
are guided on this mission by our strong and
capable workforce, our readiness to innovate,
and our abiding commitment to our
customers.
3. Geography Division
• The Geography Division plans, coordinates, and
administers all geographic and cartographic activities
needed to facilitate the Census Bureau's statistical
programs throughout the US and its territories. We
manage the Census Bureau's programs to
continuously update features, boundaries and
geographic entities in TIGER and the Master Address
File (MAF). We also conduct research into geographic
concepts, methods, and standards needed to facilitate
the Census Bureau's data collection and dissemination
programs.
4. GSS-I
• In support of the 2020 Decennial Census, the Census Bureau
is evaluating what areas should be targeted for a traditional,
on-the-ground address canvassing operation and in which
areas a traditional canvassing operation is not necessary.
• The task the Census Bureau is undertaking is determining
how to decide which areas should be considered for targeting
– GEO has evaluated the MAF/TIGER database and assigned
quality indicators to each of the census tracts
– A Targeted Address Canvassing strategy has been developed
that contains an inventory of criteria for evaluation
5. GSS-I
• The Geographic Partnership program is now underway.
– GEO is receiving both address and spatial data from invited partners
• This data is at the state, county, and local level.
• The data is being evaluated and integrated with the MAF/TIGER database.
• The next step is to determine what level of feedback we can give to the partners
about their data.
• GEO is also working with statisticians on predictive modeling to help
determine where to target.
• The combination of the evaluation of the current MAF/TIGER
database, the partner data, and the predictive modeling will
contribute to the recommendation on which areas of the country
should be considered for targeting.
6. The Geographic Partnership Program
• A partner provides a set of source files
• The source files are moved inside the Census firewall via a secure web-exchange module
• The content inventory of the files undergoes initial verification
• The files are preserved, as supplied, for later reference
• A more detailed content assessment is done, including verification the files meet the
minimum guidelines for content and metadata
• The files are prepared for automated processing, including re-projection and mapping to a
standardized schema
• A series of (mostly) automated checks is run, which provides metrics about the data in the
files
• An interactive review is conducted, in which the files and their associated metrics are
reviewed and a decision is made how to capture any new data
• Any data that are not useful for updating the MAF/TIGER database get removed from the
files
• Features or addresses are added or modified, using an automated conflate and review
process – or – an interactive update process
7. Feature Source Evaluation Software
• A number of MAF/TIGER spatial layers will be extracted for the extent of the partner
entity
• An analyst will use the supplied data and metadata to map the provided source
schema to a standardized schema, and the supplied road centerline file will be
converted to an ArcSDE layer, re-projected, and the name and MTFCC mappings
applied
• The feature names in the source file will be standardized to the parsed, MAF/TIGER
naming conventions
• The standardized feature names will be checked to see if any contain illegal
characters or prohibited or generic names
• A topological check will be run, to gauge the topological stability of the source file
• A completeness / change detection check will be run to attempt to identify areas in
the source file that contain features not found in MAF/TIGER
• A comparison will be run between the universe of feature names in the source file
and the universe of feature names found in MAF/TIGER within the extent of the entity
• All intersections that meet the requirements for CE95 assessment will be identified
8. Previous FME Technology Architecture
• FME Workspaces were developed using FME Workbench 2012 on
desktop workstations, running 32-bit Windows XP Service Pack 3
• FME Server 2012 (FME Engine only), on batch servers running
Linux Redhat Enterprise 5 connected to a SAN (Storage Area
Network)
• Linux Batch Server
– Cronacle job-queueing system
– Perl and shell scripts
– FME Server (command-line invocation of FME Engines)
– Oracle Run-Time Client
• MAF/TIGER (Oracle Database)
• Shapefiles on SAN
9. New FME Technology Architecture
• FME Workspaces are developed using FME Workbench 2012 SP3 on
desktop workstations, running 32-bit Windows XP Service Pack 3
• FME Server 2012 SP3 (FME Server Console), on batch servers running
Linux Redhat Enterprise 5
• FME Server 2012 SP3, on Windows server, with SAN (Storage Area
Network) disk(s) mounted via Samba
• Linux Batch Server
– Cronacle job-queueing system
– Perl and shell scripts
– FME Server Console (remote job submission to FME Server)
– Oracle Run-Time Client
• Windows Web Server
– ArcGIS for Server
– FME Server (full installation)
• MAF/TIGER (Oracle Database)
• Shapefiles on SAN
• ArcSDE Geodatabase
11. Topology Check
• The Topology Check workspace compiles a number of topology- and
tolerance-based metrics:
– Gaps – endpoints within 5 meters of any line segment
– Overshoots – line segments extending less than 5 meters beyond an
intersection
– Tiny Features – features with a total length less than 5 meters
– Floating Features – features or connected sets of features that are not
connected to the rest of the road network
– Exact Duplicates – features whose geometry and name are identical to
another feature
– Coincident – features whose geometry overlaps with another feature
– Crossing – features that cross but do not intersect at a node
– Multi-part – features that consist of multiple geometry parts
– Cutbacks – features containing angles less than 25 degrees
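Two of the simpler metrics above can be sketched in plain Python. This is an illustrative sketch only, not the FME workspace itself: the feature dictionaries, coordinate lists, and function names are assumptions made for the example, and coordinates are assumed to be in a projected system measured in meters.

```python
import math

def length(feature):
    """Total length of a polyline given as a list of (x, y) vertices, in meters."""
    coords = feature["coords"]
    return sum(math.dist(a, b) for a, b in zip(coords, coords[1:]))

def tiny_features(features, tolerance=5.0):
    """Tiny Features metric: features with a total length under the tolerance (5 m)."""
    return [f for f in features if length(f) < tolerance]

def exact_duplicates(features):
    """Exact Duplicates metric: features whose geometry AND name match another feature."""
    seen, dupes = {}, []
    for f in features:
        key = (tuple(f["coords"]), f["name"])
        if key in seen:
            dupes.append(f)
        else:
            seen[key] = f
    return dupes
```

The remaining metrics (gaps, overshoots, crossings, floating features) need proximity and intersection tests over the whole network, which is what the FME transformers in the workspace provide.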
12. Completeness / Change Detection Check
• The MAF/TIGER road centerline features and the
feature source file road centerline features will be
compared using an FME workspace.
• The MAF/TIGER features will be Buffered to a
distance of 15 meters, then “overlaid” with the
source file features.
• Any source file feature parts that fall outside of the
Buffer areas will be chained together, and the total
length of difference (and of each part) will be
reported as an evaluation metric.
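The buffer-and-overlay comparison can be approximated with a point-to-segment distance test. A minimal sketch, assuming projected coordinates in meters; the function names and data shapes are invented for illustration, and the real check is done with FME buffering and overlay transformers.

```python
import math

def point_seg_dist(p, a, b):
    """Distance from point p to the line segment a-b."""
    ax, ay = a; bx, by = b; px, py = p
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.dist(p, a)
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)))
    return math.dist(p, (ax + t * dx, ay + t * dy))

def outside_buffer_length(source, reference_segments, buffer_m=15.0):
    """Sum the length of source-file line parts whose endpoints both fall
    outside the 15 m buffer around every reference (MAF/TIGER) segment."""
    def inside(p):
        return any(point_seg_dist(p, a, b) <= buffer_m for a, b in reference_segments)
    total = 0.0
    for a, b in zip(source, source[1:]):
        if not inside(a) and not inside(b):
            total += math.dist(a, b)
    return total
```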
13. CE95 Qualifying Intersection Identification
• Qualifying intersections must meet the
following criteria:
– Must consist of three roads (a “T” intersection)
or four roads (an “X” intersection)
– Must consist of only secondary roads or local
roads
– Must meet at 90 or 180 degree angles, with a
15 degree plus/minus tolerance
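The criteria above amount to a test on the "node star" of each intersection: count the rays, check the road classes, and check the angles between adjacent rays. A hedged sketch in Python; the MTFCC codes shown (S1200 for secondary roads, S1400 for local roads) and the function signature are assumptions made for the example.

```python
# Assumed MTFCC codes for secondary and local roads (illustrative).
SECONDARY_OR_LOCAL = {"S1200", "S1400"}

def qualifies_ce95(ray_angles, mtfccs, tol=15.0):
    """Check one node star: 3 rays ('T') or 4 rays ('X'), only secondary or
    local roads, and adjacent rays meeting at 90 or 180 degrees (+/- tol)."""
    if len(ray_angles) not in (3, 4):
        return False
    if not all(m in SECONDARY_OR_LOCAL for m in mtfccs):
        return False
    angles = sorted(a % 360 for a in ray_angles)
    # Angles between consecutive rays, including the wrap-around gap.
    between = [angles[i + 1] - angles[i] for i in range(len(angles) - 1)]
    between.append(360 - angles[-1] + angles[0])
    return all(abs(g - 90) <= tol or abs(g - 180) <= tol for g in between)
```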
14. Thank You!
Questions?
For more information:
Jay E. Spurlin
jay.e.spurlin@census.gov
U.S. Census Bureau
http://www.census.gov/geo/www/gss/
Editor's Notes
I work in the Geography Division – or GEO, as we refer to it. We manage MAF/TIGER (Topologically Integrated Geographic Encoding and Referencing), which is a geospatial database system. The data is stored in Oracle Spatial Topology Manager format, and is used in support of various censuses and surveys of the Census Bureau.
This is the basic set of steps through which a set of partner-supplied source files proceeds. Currently, this is a highly manual process and most of the processing is done on shapefiles using ArcGIS for Desktop.

A partner provides a set of source files – this could be through a Regional Office contact, Community TIGER, or via a direct upload. The source files are moved inside the Census firewall via a secure web-exchange module. The content inventory of the files undergoes initial verification, to make sure someone has not accidentally supplied their laundry list. The files are preserved, as supplied, for later reference. This provides a re-start point, if it is ever necessary – as well as a reference against which future submissions could be compared to determine change over time.

A more detailed content assessment is done, including verification that the files meet the minimum guidelines for content and metadata. The files are prepared for automated processing, including re-projection and mapping to a standardized schema. The feature names are standardized to fit the parsed MAF/TIGER naming convention, and metadata is used to derive the MAF/TIGER Feature Classification Code (or MTFCC) for each record.

A series of (mostly) automated checks is run, which provides metrics about the data in the files. For addresses, this includes a range of geocoding checks and comparisons for the addresses and for the address point locations, if they were provided. For the spatial features, I'll talk more about these checks in a moment.

An interactive review is conducted, in which the files and their associated metrics are reviewed and an assessment is made as to how many new features or addresses have been supplied, as well as how many attribute or shape updates.
Based on this review, a decision is made about how to capture any new data – whether the data can continue through an automated update process or should be handled through an interactive update process. If the automated process is appropriate, then any data that are not useful for updating the MAF/TIGER database get removed from the files. Features or addresses are added and/or modified, using the method chosen during the interactive review – either an automated conflate and review process, or an interactive update process.
For the purposes of this discussion, we will focus on the Feature Source Evaluation software – in contrast to the Address Source Evaluation software. There are two separate, dedicated software systems for the evaluation of spatial features and addresses, though the architecture of the GSS-I is integrated to include both. The business model, hardware and software architecture, technology architecture, and security models have been integrated; it is really only the application architectures that have been separated out – and that only because there are established, separate areas of development expertise for spatial features, geographic entities, and addresses.

The list of functionality on this slide indicates the first set of functions targeted for production release at the end of March 2013. Other checks have been proposed, and will likely be added to the software at a future date. Basically, each of the pieces of functionality listed corresponds to a module in the Feature Source Evaluation software system.

A number of MAF/TIGER spatial layers will be extracted for the extent of the partner entity. These will include the road centerline layer, a number of geographic entity boundaries for reference, and the topological edge layer with the primary feature name for each edge. These layers will be extracted using automated FME workspaces, but they are fairly simple and obvious – they basically just read from Oracle Spatial using an SDO_FILTER SQL query, narrow the selection with an AreaOnAreaOverlayer or Clipper, and write to an ArcSDE geodatabase, so I don't plan to show any examples of those.

An analyst will use the supplied data and metadata to map the provided source schema to a standardized schema, and the supplied road centerline file will be converted to an ArcSDE layer, re-projected if necessary, and the name and MTFCC mappings applied.
We will look at some example transformers in a few minutes. The feature names in the source file will be standardized to the parsed MAF/TIGER naming conventions. In production, this will be a Java application, but for the current, manual procedures, an FME workspace is making an HTTPFetcher call to a published web service to do the feature name standardization, with a Decelerator to keep from overloading the web service.

The standardized feature names will be checked to see if any contain illegal characters or prohibited or generic names; another Java application.

A topological check will be run, to gauge the topological stability of the source file. This will be accomplished using a fairly complicated FME Workspace, which we will look at in detail shortly.

A completeness / change detection check will be run to attempt to identify areas in the source file that contain features not found in MAF/TIGER. This will also be accomplished using an FME Workspace, which we will also look at in a moment.

A comparison will be run between the universe of feature names in the source file and the universe of feature names found in MAF/TIGER within the extent of the entity; this will be another Java application.

All intersections that meet the requirements for conducting the CE95 accuracy assessment will be identified. The CE95 accuracy value is stated as a distance in meters, and denotes the circular standard error confidence – a 95% chance that each coordinate falls within that distance from "ground truth". This is the final FME workspace that we will be looking at today.
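The HTTPFetcher-plus-Decelerator pattern described above is essentially a throttled loop over the feature names. A rough Python equivalent, with the standardization service abstracted as a callable, since the actual endpoint is not shown here:

```python
import time

def standardize_names(names, call_service, min_interval=0.5):
    """Decelerator-style throttle: invoke the standardization service for
    each raw name, waiting at least min_interval seconds between calls so
    the web service is not overloaded. call_service is any callable that
    takes a raw name and returns its standardized form."""
    results = []
    last = 0.0
    for name in names:
        wait = min_interval - (time.monotonic() - last)
        if wait > 0:
            time.sleep(wait)
        last = time.monotonic()
        results.append(call_service(name))
    return results
```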
Previously, our general technology architecture as it related to FME was very simple. FME Server was installed on our production Linux batch servers, and FME Engines were invoked via the command line from Perl scripts driven by Cronacle-based control systems.

To keep things simple and better highlight the differences in architecture, the illustrations on this slide and the next depict only the production, batch configuration as it relates to FME.
The technology architecture for FME was restructured for GSS-I to support products and processes that depend on ArcGIS on Windows. The Geography Division deployment of ArcGIS for Server is limited to Windows servers, because a Linux deployment was not seen as a viable option, for various reasons. This prompted us to research and develop a new technology architecture pattern for utilizing FME. The old pattern is still in use, as well, but this new pattern will be applied for the GSS-I and several other new software systems.
One of the business functions for which we are utilizing FME is crosswalking (or transmogrification, as some of our subject matter experts have taken to calling it). This mapping of each source file schema to a standard schema is configured using FME Workbench, and the data transformation is done using FME Server. Source schemas can – and do – vary widely. As you might imagine, the string manipulation and filter transformers come in extremely handy while doing these mappings.

The example on the left shows the use of the AttributeValueMapper transformer to transform a set of road type identifiers into MAF/TIGER Feature Classification Codes. The example on the right shows the use of the StringSearcher transformer to find all instances of a street classification code that end with the digit '5' – then set the MTFCC value to the code that designates the feature as a "Ramp".
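The two transformer examples can be mimicked with a lookup table and a regular expression. A sketch under stated assumptions: the source road-type codes here are hypothetical, and the mapping to MTFCC values (including S1630 for "Ramp") is illustrative rather than the actual crosswalk used in production.

```python
import re

# Hypothetical source road-type codes; real partner schemas vary widely.
TYPE_TO_MTFCC = {
    "1": "S1100",  # primary road
    "2": "S1200",  # secondary road
    "4": "S1400",  # local road
}

def map_mtfcc(record):
    """AttributeValueMapper-style lookup: translate a source road-type code
    into an MTFCC, falling back to local road when the code is unmapped."""
    record["MTFCC"] = TYPE_TO_MTFCC.get(record.get("road_type"), "S1400")
    return record

def flag_ramps(record):
    """StringSearcher-style rule: any classification code ending in the
    digit '5' is reclassified as a ramp."""
    if re.search(r"5$", str(record.get("road_type", ""))):
        record["MTFCC"] = "S1630"
    return record
```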
The topology check workspace uses various transformers to collect metrics about certain types of features or feature interactions in the feature source file. Please note – not all of these are technically "wrong" topologically; they are only meant to be markers for identifying general topology or network stability and to predict MAF/TIGER update behavior. The list of metrics might shrink or grow with time, as more partner files get processed and we learn more about which situations indicate data issues or cause problems during the update of the MAF/TIGER database.

{show topology workspace and explain}

The road centerlines are projected to the North American Lambert Conformal Conic projection, which preserves shape (and thereby angle).
{show the change detection workspace and explain}
For the CE95 accuracy assessment, qualifying intersections must be perpendicular 'T' or 'X' intersections (plus or minus 15 degrees) on secondary and/or local roads.

{show CE95 QI workspace and explain}

The road type selection is accomplished using a TestFilter. The names of the attributes that contain the MTFCC code (road type) and road name are passed in via published parameters. The road centerlines are projected to the North American Lambert Conformal Conic projection, which preserves shape (and thereby angle). The TopologyBuilder is used to find all of the intersection nodes. 'T' and 'X' intersections are identified by counting the number of rays emanating from each node star (the number of elements in the _node_angle list). The _fme_arc_angle values are exposed with an AttributeExposer, and a composite test in a Tester transformer checks the angle ranges. The nodes are projected back to NAD83.

The requirement was to create at least 200 randomly selected nodes, with the goal of assessing the accuracy of 100 of them. A RandomNumberGenerator and Sorter are used to randomly sort and output all the nodes, allowing the user to weed through as many as necessary. The CoordinateExtractor is used to expose the coordinate x and y values as attributes. The StringConcatenator is used to string together all of the road names, which were preserved from the line segments during the topology build.
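The RandomNumberGenerator-and-Sorter step amounts to a random shuffle of the qualifying nodes. A small sketch; the function name is invented, and the seed parameter is added only so the example is reproducible:

```python
import random

def select_assessment_nodes(nodes, seed=None):
    """Randomly order the qualifying intersection nodes so reviewers can
    work down the list (at least 200 output, roughly 100 assessed)."""
    rng = random.Random(seed)
    shuffled = list(nodes)
    rng.shuffle(shuffled)
    return shuffled
```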