Big (Geo) Data Science




Robert Cheetham
cheetham@azavea.com
   @rcheetham
Web/Mobile

Geospatial

UI/UX Design

High Performance Computing

R&D
B Corporation
   • Projects w/ Social Value
   • Summer of Maps
   • Pro Bono Program
   • Donate share of profits

Research-Driven
  • 10% Research Program
  • Academic Collaborations
  • Open Source
Spatial Temporal Forecasting with Philadelphia Crime Data
How Phila PD uses Maps

 Customized Map Products




            Weekly CompStat Meetings




   Web Crime Analysis
INCT & PARS – main database sources
over 5,000 incidents daily, over 2 million annually



[Data flow diagram. Components shown: Complainant, Verizon 911, 911 Operator, Radio Dispatcher, CAD, Police Officer, Incident Report (completed by officer), District 48 Desk, INCT, PARS, daily download & geocoding routines, and maps distributed through intranet, printing and CompStat to Districts X, Y and Z]
The Context

1,500,000 people
7,000 police
1,000 civilian employees
2,000,000 new incidents / year
3 crime analysts
What we did

•   Weekly CompStat
•   Lots of maps
•   Automation of map creation
•   Web-based systems
… but what if we could…

 Accelerate the cycle
 Proactively notify
 Automate the process
Prototype
[Architecture diagram. Components shown: VB & MapObjects, ArcView, .ini file, process documentation, Shapefiles and GRIDs, MS SQL Server crime incidents database]
… but there was a problem …
…it was crap …
… sort of.
We needed ….

1. Better Statistics

2. Notification

3. Simplicity
Crime Analysis – What has happened?
   – Mapping (spatial / temporal densities)
   – Trending
   – Intelligence Dashboard
Early Warning – What is out of the ordinary?
   – Statistical & Threshold-based Hunches (data mining)
   – Alerting
Risk Forecasting – What is likely to happen next?
   – Near Repeat Pattern
   – Load Forecasting
Crime Analysis
Intelligence Dashboard
Crime Analysis
Early Warning
Early Warning

• Geographic Early Warning System
   – A system to alert staff of an unusual situation in a particular
     location
   – Ingests data sets to automatically “cook on” and only
     involves staff when a statistically unusual situation is found


[Diagram. Components shown: Operational Databases, HunchLab Database, Geostatistical Engine, Alerting System]
Early Warning
What is a Hunch?

• A proposed hypothesis, saved into the system, and
  continually tested for validity
• Incident Attribute Requirements
   – Location (x, y)
   – Time (timestamp)
   – Classification
• Hunch Attributes
   – Location (area)
   – Time (recent / historic periods)
   – Classification
• Analyses
   – Statistical Hunch
   – Threshold Hunch
Hunch Parameters: Location

•   Address & Radius
•   Precinct/County/Country
•   Custom Drawn Area
•   Mass Hunch
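
The "Address & Radius" option can be pictured as a simple distance filter. The sketch below is only an illustration, not HunchLab code: the function names, the `lat`/`lon` field names and the use of the haversine great-circle formula are all assumptions.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/lon points (degrees)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    # min() guards against tiny floating-point overshoot above 1.0
    return 2 * EARTH_RADIUS_M * math.asin(min(1.0, math.sqrt(a)))

def incidents_within(incidents, center, radius_m):
    """Keep incidents (dicts with hypothetical 'lat' and 'lon' keys) inside the radius."""
    lat0, lon0 = center
    return [i for i in incidents
            if haversine_m(lat0, lon0, i["lat"], i["lon"]) <= radius_m]
```

A precinct or custom drawn area would need a point-in-polygon test instead of a distance check.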
Hunch Parameters: Time

• Statistical Hunch
   – Recent Past
   – Historic Past
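
One way to read "recent past vs. historic past" is as a test of whether the recent count is unusually high given the historic rate. The sketch below assumes incidents arrive roughly as a Poisson process. The slides do not say which test is used, so the function names, the Poisson model and the 0.05 cutoff are all assumptions.

```python
import math

def poisson_upper_tail(observed, expected):
    """P(X >= observed) for X ~ Poisson(expected)."""
    if observed <= 0:
        return 1.0
    term = math.exp(-expected)      # P(X = 0)
    cdf = term
    for k in range(1, observed):    # accumulate P(X = k) for k < observed
        term *= expected / k
        cdf += term
    return max(0.0, 1.0 - cdf)

def statistical_hunch(recent_count, recent_days,
                      historic_count, historic_days, alpha=0.05):
    """Compare the recent count with what the historic daily rate predicts."""
    expected = historic_count / historic_days * recent_days
    p_value = poisson_upper_tail(recent_count, expected)
    return {"expected": expected, "p_value": p_value, "alert": p_value < alpha}
```

A threshold hunch would be simpler still: alert whenever `recent_count` exceeds a fixed number chosen by the analyst.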
Hunch Parameters: Classification

• Category
• Time of Day
• Narrative
Hunch Helper
Email Alert
Hunch Details
Risk Forecasting
Predictive Analytics?

• Prediction vs. Forecasting
Near Repeat Pattern Analysis
Contagious Crime?

• Near repeat pattern analysis
      • “If one burglary occurs, how does the risk change nearby?”
What Do We Mean By Near Repeat?

• Repeat victimization
   – Incident at the same location at a later time (likely related)
• Near repeat victimization
   – Incident at a nearby location at a later time (likely related)

• Incident A (place, time) --> Incident B (place, time)
Near Repeat Pattern Analysis

• The goal:
   – Quantify short term risk due to near-repeat victimization
      • “If one burglary occurs, how does the risk of burglary for the
        neighbors change?”


• What we know:
   – Incident A (place, time) --> Incident B (place, time)
      • Distance between A and B
      • Timeframe between A and B


• What we need to know:
   – What distances/timeframes are not simply random?
Near Repeat Pattern Analysis

• The process
   –   Observe the pattern in historic data
   –   Simulate the pattern in randomized historic data
   –   Compare the observed pattern to the simulated patterns
   –   Apply the non-random pattern to new incidents

• An example
   – 180 days of burglaries in Division 6 of Philadelphia
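
The observe / simulate / compare steps can be sketched as a Knox-style permutation test: count the incident pairs that are close in both space and time, then shuffle the timestamps many times to see how often chance alone produces as many. This is a minimal sketch under stated assumptions, not the speakers' implementation. The function names, the single distance/time band and the default of 199 simulations are hypothetical.

```python
import math
import random

def count_close_pairs(points, times, max_dist, max_days):
    """Count incident pairs within max_dist of each other and within max_days."""
    n = len(points)
    count = 0
    for i in range(n):
        for j in range(i + 1, n):
            close_in_space = math.dist(points[i], points[j]) <= max_dist
            close_in_time = abs(times[i] - times[j]) <= max_days
            if close_in_space and close_in_time:
                count += 1
    return count

def knox_test(points, times, max_dist, max_days, n_sim=199, seed=0):
    """Return (observed pair count, Monte Carlo p-value)."""
    rng = random.Random(seed)
    observed = count_close_pairs(points, times, max_dist, max_days)
    shuffled = list(times)
    as_extreme = 0
    for _ in range(n_sim):
        rng.shuffle(shuffled)  # break any link between place and time
        if count_close_pairs(points, shuffled, max_dist, max_days) >= observed:
            as_extreme += 1
    return observed, (as_extreme + 1) / (n_sim + 1)
```

Here `points` are (x, y) coordinates in a projected unit such as feet or meters, and `times` are day numbers. A fuller analysis would repeat the test over a grid of distance and time bands to find which ones are not simply random.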
Near Repeat Pattern Analysis
Near Repeat Pattern Analysis
Near Repeat Pattern Analysis
Near Repeat Pattern Analysis
Near Repeat Pattern Analysis

• How can you test your own data?
   – Near Repeat Calculator
      • http://www.temple.edu/cj/misc/nr/
• Papers
   – Near-Repeat Patterns in Philadelphia Shootings (2008)
      • One city block & two weeks after one shooting
           – 33% increase in likelihood of a second event




                                             Jerry Ratcliffe
                                           Temple University
Contagious Crime?
Workload Forecasting
Improving CompStat

• Workload forecasting
      • “Given the time of year, day of week, time of day and
        general trend, what counts of crimes should I expect?”
What Do We Mean By Load Forecasting?

 • Workload forecasting
         • Generating aggregate crime counts for a future timeframe
           using cyclical time series analysis



                                    Measure cyclical patterns


                                                +
                                    Identify non-cyclical trend

                                    Forecast expected count

bit.ly/gorrcrimeforecastingpaper
Load Forecasting

• Measure cyclical patterns
      • Take historic incidents (for example: last five years)
      • Generate multiplicative seasonal indices
          – For each time cycle:
              » time of year
              » day of week
              » time of day
          – Count incidents within each time unit (for example: Monday)
          – Calculate average per time unit if incidents were evenly
            distributed
          – Divide counts within each time unit by the calculated average to
            generate multiplicative indices
              » Index ~ 1 means at the average
              » Index > 1 means above average
              » Index < 1 means below average
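
The index calculation above can be sketched for one cycle, day of week. The function name is hypothetical, and the sketch assumes the historic window covers whole weeks so that every weekday occurs equally often.

```python
from collections import Counter
from datetime import date, timedelta

def day_of_week_indices(incident_dates):
    """Multiplicative day-of-week indices (0 = Monday ... 6 = Sunday)."""
    counts = Counter(d.weekday() for d in incident_dates)
    # Count each weekday would have if incidents were evenly distributed
    average = len(incident_dates) / 7
    return {dow: counts.get(dow, 0) / average for dow in range(7)}

# Made-up example: four weeks, one incident per day plus one extra each Saturday
start = date(2024, 1, 1)  # a Monday
example_dates = [start + timedelta(days=n) for n in range(28)]
example_dates += [d for d in example_dates if d.weekday() == 5]
example_indices = day_of_week_indices(example_dates)
```

In this made-up data, Saturday gets an index above 1 and every other day falls slightly below 1. The same pattern applies to time of year and time of day by changing the key used for counting.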
Load Forecasting
Load Forecasting
Load Forecasting
Load Forecasting
Load Forecasting

• Identify non-cyclical trend
      • Take recent daily counts (for example: last year daily counts)
      • Remove cyclical trends by dividing by indices




      • Run a trending function on the new counts
          – Simple average
              » Last X Days
          – Smoothing function
              » Exponential smoothing
              » Holt’s linear exponential smoothing
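
A minimal sketch of the de-seasonalizing step and two of the trending functions named above. The function names and the initialization choices (first value as the starting level, first difference as the starting trend) are assumptions, and the smoothing constants are left to the caller.

```python
def deseasonalize(counts, indices):
    """Divide each daily count by its seasonal index (indices must be non-zero)."""
    return [count / index for count, index in zip(counts, indices)]

def simple_exponential_smoothing(series, alpha):
    """Return the final smoothed level of the series."""
    level = series[0]
    for value in series[1:]:
        level = alpha * value + (1 - alpha) * level
    return level

def holt_linear(series, alpha, beta):
    """Holt's linear exponential smoothing: return the final (level, trend)."""
    level = series[0]
    trend = series[1] - series[0]
    for value in series[1:]:
        previous_level = level
        level = alpha * value + (1 - alpha) * (level + trend)
        trend = beta * (level - previous_level) + (1 - beta) * trend
    return level, trend
```

When several cycles apply to the same day, the index passed to `deseasonalize` would be the product of that day's individual indices.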
Load Forecasting

• Forecast expected count
      • Project trend into future timeframe
          – Always flat
              » Simple average
              » Exponential smoothing
          – Linear trend
              » Holt’s linear exponential smoothing
      • Multiply by seasonal indices to reseasonalize the data
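
Projecting and reseasonalizing can be sketched in a few lines. The function name is hypothetical, and clamping negative projections to zero is an assumption added so that a falling trend cannot produce a negative count.

```python
def forecast_counts(level, trend, future_indices):
    """Project the de-seasonalized trend forward, then reapply seasonal indices.

    future_indices holds one seasonal index per future time unit.
    Use trend = 0 for the flat projections (simple average, exponential smoothing).
    """
    forecasts = []
    for step, index in enumerate(future_indices, start=1):
        deseasonalized = level + trend * step
        forecasts.append(max(0.0, deseasonalized) * index)
    return forecasts
```

For example, `forecast_counts(100.0, 0.0, [1.0, 1.25, 0.5])` scales a flat level of 100 by each day's index.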
Load Forecasting




                                   Measure cyclical patterns


                                             +
                                   Identify non-cyclical trend

                                   Forecast expected count



bit.ly/gorrcrimeforecastingpaper
Improving CompStat
How Do We Know It’s Accurate?

• Testing
      • Generated forecasting techniques (examples)
            – Commonly Used
                » Average of last 30 days
                » Average of last 365 days
                » Last year’s count for the same time period
            – Advanced Combinations
                » Different cyclical indices (example: day of year vs. month of year)
                » Different levels of geographic aggregation for indices
                » Different trending functions
      • Scoring methodologies (examples)
            – Mean absolute percent error (with some enhancements)
            – Mean percent error
            – Mean squared error
      • Run thousands of forecasts through testing framework
      • Choose the right technique in the right situation
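
The three scoring methods can be sketched as follows. The slides do not describe the "enhancements" to MAPE, so none are reproduced here. Skipping periods whose actual count is zero is an assumption made only to avoid division by zero, and the function names are hypothetical.

```python
def mean_absolute_percent_error(actual, forecast):
    """MAPE in percent; periods with an actual count of zero are skipped."""
    pairs = [(a, f) for a, f in zip(actual, forecast) if a != 0]
    return 100 * sum(abs(a - f) / abs(a) for a, f in pairs) / len(pairs)

def mean_percent_error(actual, forecast):
    """Signed version of MAPE: reveals bias, since over- and under-forecasts cancel."""
    pairs = [(a, f) for a, f in zip(actual, forecast) if a != 0]
    return 100 * sum((a - f) / a for a, f in pairs) / len(pairs)

def mean_squared_error(actual, forecast):
    """Average squared error; penalizes large misses more heavily."""
    return sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual)
```

Scoring every candidate technique on the same held-out periods is what makes it possible to choose the right technique for each situation.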
Ongoing Research
Research Topics

• Risk Forecasting
   – Load forecasting enhancements
      • Weather and special events




   – Combining short and long term risk forecasts (Temple)
      • Socioeconomic changes in neighborhoods
   – Risk Terrain Modeling (Rutgers)
      • Context of crime at the microplace
Research Topics
Research Topics

• Risk Forecasting
   – Offender Management
      • Prioritize offenders based upon statistical models using past
        behaviors
• Evaluation
   – Automate Randomized Controlled Trials
Data Processing for Big (Geo) Data
A Story
Robert’s Rules of Housing
Factor                                     Importance
Close to Center City                       somewhat important
Walk to Grocery Store                      vital
Nearby Restaurants                         very important
Library                                    nice to have
Near a Park                                somewhat important
Biking / walking distance from our work    very important
Biking distance to fencing                 somewhat important
Your factors might include…
                      Child Care
                      Local School Rankings
                      Farmer's Market
                      Car Share
                      Public Transit
We stand on the shoulders of giants
Not a new idea … Design with Nature
Not a new idea … Dana Tomlin
Desktop GIS
Weighted Overlay


[Diagram: four input raster layers, weighted x5, x1, x3 and x2, are added together to produce a single output layer]
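
A weighted overlay is a cell-by-cell weighted sum of raster layers. The sketch below uses plain nested lists and a hypothetical function name. The x5, x1, x3 and x2 weights come from the slide, but the layer values are made up for illustration.

```python
def weighted_overlay(layers, weights):
    """Cell-by-cell weighted sum of equally sized rasters (lists of rows)."""
    rows, cols = len(layers[0]), len(layers[0][0])
    return [
        [sum(w * layer[r][c] for layer, w in zip(layers, weights))
         for c in range(cols)]
        for r in range(rows)
    ]

# Made-up 1 x 2 rasters: 1 means the cell satisfies the factor, 0 means it does not
layers = [
    [[1, 0]],
    [[1, 1]],
    [[0, 1]],
    [[1, 0]],
]
result = weighted_overlay(layers, weights=[5, 1, 3, 2])
```

Each output cell depends only on the same cell in each input, which makes this a local operation.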
Summary

      Geography-driven Decisions

      Iterative

      Individual

      Web [and Mobile]

      Growing data sets
Web Challenges
Web is different from the Desktop

  Lots of simultaneous users

  Stateless environment

  HTML+JS+CSS

  Users are less skilled

  Users are less patient
But wait … there’s a problem
 10 – 60 second calculation time

 Multiple simultaneous users …

 … that are impatient
Data Challenges
Big Data – Social Media
Big Data – Science
Big Data – Citizen Science
Big Data – Cities
Early Prototype
Specific Optimization Goals
 New Raster File Structure

 Distributed processing

 Binary messaging protocol
Optimization: File Format
 Limit data type and range

 1D arrays are fast to read/write

 Tiled

 Pyramids

 Azavea Raster Grid (ARG)
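
The first two points can be illustrated with a raster held as one flat, row-major array of unsigned bytes. This is not the actual ARG layout, which the slides do not specify. The class and method names are hypothetical, and tiling and pyramids are left out.

```python
from array import array

class FlatRaster:
    """Raster stored as a single 1D array of unsigned bytes (values 0-255)."""

    def __init__(self, cols, rows, fill=0):
        self.cols = cols
        self.rows = rows
        self.cells = array("B", [fill]) * (cols * rows)

    def index(self, col, row):
        """Map a (col, row) cell to its position in the flat array."""
        if not (0 <= col < self.cols and 0 <= row < self.rows):
            raise IndexError("cell outside raster")
        return row * self.cols + col

    def get(self, col, row):
        return self.cells[self.index(col, row)]

    def set(self, col, row, value):
        # array("B") raises OverflowError for values outside 0-255
        self.cells[self.index(col, row)] = value

    def to_bytes(self):
        """One byte per cell, ready to write to disk in a single call."""
        return self.cells.tobytes()

    @classmethod
    def from_bytes(cls, cols, rows, data):
        if len(data) != cols * rows:
            raise ValueError("data length does not match raster size")
        raster = cls(cols, rows)
        raster.cells = array("B", data)
        return raster
```

Limiting cells to a single byte keeps files small, and the flat layout means a whole raster can be read or written without per-row bookkeeping.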
Optimization: Distributed Processing
 Parallelizable - Local Ops and Focal Ops

 Support multiple
  –   Threads
  –   Cores
  –   CPUs
  –   Machines


 Considered
  – Hadoop
  – Amazon Map Reduce
  – Beowulf
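
Local operations parallelize well because each output cell depends only on the matching input cells, so tiles can be processed independently. The sketch below shows that decomposition with Python's standard thread pool; the function names are hypothetical. In standard CPython, threads illustrate the structure rather than deliver a real speedup for pure-Python arithmetic, and a production system would spread tiles across cores or machines.

```python
from concurrent.futures import ThreadPoolExecutor

def local_add(tile_a, tile_b):
    """Local operation: add two tiles cell by cell (tiles are flat lists)."""
    return [a + b for a, b in zip(tile_a, tile_b)]

def parallel_local_op(op, tiles_a, tiles_b, max_workers=4):
    """Apply a local operation to matching tile pairs, one task per pair."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the input tiles
        return list(pool.map(op, tiles_a, tiles_b))
```

Focal operations need a border of neighboring cells around each tile, so their tiles would have to overlap before being handed to workers.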
Success!!
  Reduced from 10-60 seconds to <500 milliseconds
Optimizing one process sub-optimizes others
   Complex to configure and maintain
   Limited to one operation
   No interpolation
   No mixing
    – cell sizes
    – extents
    – projections
 etc.
 Broader set of functionality

 Both raster and vector

 Scala + Akka

 Open source
Faster is Different
Regional/State:    84 ms
National:          84 ms
Large Country:     115 ms
Continental:       271 ms
Planet:            1.2 – 2.0 s
Ongoing R&D
GPUs
GPU Results
  Re-wrote a few Map
   Algebra operations:
    Local
    Neighborhood
    Zonal
    Viewshed
    etc.
  15 – 120x
  Large grids
  Large kernels
New Spatial Operations
 Vector

 Neighborhood/Focal

 Spatial Statistics

 Integration
Urban Forest Ecosystem Modeling
Crime Analysis, Early Warning and Forecasting
Open Source Geoprocessing

       GDAL

       GeoServer

       PostGIS

      R

       GeoDa
Many Thanks!
© Photo used with permission from Alphafish, via Flickr.com
Big (Geo) Data Science

                 [We are hiring]


Robert Cheetham
cheetham@azavea.com
   @rcheetham

Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 

Data Philly Meetup - Big (Geo) Data

  • 1. Big (Geo) Data Science Robert Cheetham cheetham@azavea.com @rcheetham
  • 3. B Corporation • Projects w/ Social Value • Summer of Maps • Pro Bono Program • Donate share of profits Research-Driven • 10% Research Program • Academic Collaborations • Open Source
  • 4. Spatial Temporal Forecasting with Philadelphia Crime Data
  • 5. How Phila PD uses Maps Customized Map Products Weekly CompStat Meetings Web Crime Analysis
  • 6. INCT & PARS – main database sources: over 5,000 incidents daily, over 2 million annually. [Data-flow diagram; labels: Complainant, Verizon 911, 911 Operator, Radio Dispatcher, CAD, Police Officer, Incident Report Completed by Officer, District 48 Desk, INCT, PARS, Daily download & Geocoding Routines, Maps distributed through Intranet, Printing, CompStat, District X, District Y, District Z]
  • 7. The Context 1,500,000 people 7,000 police 1,000 civilian employees 2,000,000 new incidents / year 3 crime analysts
  • 8. What we did • Weekly CompStat • Lots of maps • Automation of map creation • Web-based systems
  • 9. … but what if we could… • Accelerate the cycle • Proactively notify • Automate the process
  • 10. Prototype [Architecture diagram; labels: VB & MapObjects, ArcView, .ini file, Process Documentation, Shapefiles and GRIDs, MS SQL Server, Crime Incidents Database]
  • 12. … but there was a problem …
  • 15. We needed …. 1. Better Statistics 2. Notification 3. Simplicity
  • 17. Crime Analysis – What has happened? – Mapping (spatial / temporal densities) – Trending – Intelligence Dashboard Early Warning – What is out of the ordinary? – Statistical & Threshold-based Hunches (data mining) – Alerting Risk Forecasting – What is likely to happen next? – Near Repeat Pattern – Load Forecasting
  • 18. Crime Analysis – Mapping (spatial / temporal densities) – Trending – Intelligence Dashboard Early Warning – Statistical & Threshold-based Hunches (data mining) – Alerting Risk Forecasting – Near Repeat Pattern – Load Forecasting
  • 23. Early Warning • Geographic Early Warning System – A system to alert staff of an unusual situation in a particular location – Ingests data sets to automatically “cook on” and only involves staff when a statistically unusual situation is found [Diagram; labels: Operational Databases, HunchLab Database, Geostatistical Engine, Alerting System]
  • 25. What is a Hunch? • A proposed hypothesis, saved into the system, and continually tested for validity • Incident Attribute Requirements – Location (x, y) – Time (timestamp) – Classification • Hunch Attributes – Location (area) – Time (recent / historic periods) – Classification • Analyses – Statistical Hunch – Threshold Hunch
  • 26. Hunch Parameters: Location • Address & Radius • Precinct/County/Country • Custom Drawn Area • Mass Hunch
  • 27. Hunch Parameters: Time • Statistical Hunch – Recent Past – Historic Past
  • 28. Hunch Parameters: Classification • Category • Time of Day • Narrative
  • 35. Contagious Crime? • Near repeat pattern analysis • “If one burglary occurs, how does the risk change nearby?”
  • 36. What Do We Mean By Near Repeat? • Repeat victimization – Incident at the same location at a later time (likely related) • Near repeat victimization – Incident at a nearby location at a later time (likely related) • Incident A (place, time) --> Incident B (place, time)
  • 37. Near Repeat Pattern Analysis • The goal: – Quantify short term risk due to near-repeat victimization • “If one burglary occurs, how does the risk of burglary for the neighbors change?” • What we know: – Incident A (place, time) --> Incident B (place, time) • Distance between A and B • Timeframe between A and B • What we need to know: – What distances/timeframes are not simply random?
  • 38. Near Repeat Pattern Analysis • The process – Observe the pattern in historic data – Simulate the pattern in randomized historic data – Compare the observed pattern to the simulated patterns – Apply the non-random pattern to new incidents • An example – 180 days of burglaries in Division 6 of Philadelphia
  • 43. Near Repeat Pattern Analysis • How can you test your own data? – Near Repeat Calculator • http://www.temple.edu/cj/misc/nr/ • Papers – Near-Repeat Patterns in Philadelphia Shootings (2008) • One city block & two weeks after one shooting – 33% increase in likelihood of a second event • Jerry Ratcliffe, Temple University
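
The observe / simulate / compare process described in the near repeat slides can be sketched as a Knox-style permutation test. This is only an illustrative sketch, not the method used by HunchLab or the Near Repeat Calculator: the function names, the single distance/time band, and the choice to randomize by shuffling timestamps across locations are all assumptions made for the example.

```python
import random
from itertools import combinations
from math import hypot


def count_close_pairs(points, times, max_dist, max_days):
    """Count incident pairs within max_dist (map units) and max_days of each other."""
    count = 0
    for i, j in combinations(range(len(points)), 2):
        (x1, y1), (x2, y2) = points[i], points[j]
        near_in_space = hypot(x1 - x2, y1 - y2) <= max_dist
        near_in_time = abs(times[i] - times[j]) <= max_days
        if near_in_space and near_in_time:
            count += 1
    return count


def near_repeat_test(points, times, max_dist, max_days, n_sim=199, seed=0):
    """Compare the observed close-pair count with counts from shuffled data.

    Shuffling the timestamps keeps every location and every time, but breaks
    the link between them (an assumed randomization scheme).
    Returns (observed, expected, ratio, pseudo_p_value).
    """
    rng = random.Random(seed)
    observed = count_close_pairs(points, times, max_dist, max_days)
    shuffled = list(times)
    simulated = []
    for _ in range(n_sim):
        rng.shuffle(shuffled)
        simulated.append(count_close_pairs(points, shuffled, max_dist, max_days))
    expected = sum(simulated) / n_sim
    ratio = observed / expected if expected else float("inf")
    # Share of runs (plus the observed one) at least as extreme as observed.
    p_value = (1 + sum(s >= observed for s in simulated)) / (n_sim + 1)
    return observed, expected, ratio, p_value
```

A ratio well above 1 with a small pseudo p-value suggests that pairs in that distance/time band occur more often than chance alone would produce; in practice the test would be repeated over several distance and time bands.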
  • 46. Improving CompStat • Workload forecasting • “Given the time of year, day of week, time of day and general trend, what counts of crimes should I expect?”
  • 47. What Do We Mean By Load Forecasting? • Workload forecasting • Generating aggregate crime counts for a future timeframe using cyclical time series analysis • Measure cyclical patterns + Identify non-cyclical trend = Forecast expected count • bit.ly/gorrcrimeforecastingpaper
  • 48. Load Forecasting • Measure cyclical patterns • Take historic incidents (for example: last five years) • Generate multiplicative seasonal indices – For each time cycle: » time of year » day of week » time of day – Count incidents within each time unit (for example: Monday) – Calculate average per time unit if incidents were evenly distributed – Divide counts within each time unit by the calculated average to generate multiplicative indices » Index ~ 1 means at the average » Index > 1 means above average » Index < 1 means below average
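
As a minimal sketch of the index calculation described on this slide, the following computes multiplicative day-of-week indices. The function name is hypothetical, and the sketch assumes the history covers whole weeks so that every weekday occurs equally often.

```python
from collections import Counter
from datetime import date, timedelta


def day_of_week_indices(incident_dates):
    """Return 7 multiplicative indices (Monday=0 ... Sunday=6).

    Each index is the weekday's incident count divided by the count that
    weekday would have if incidents were spread evenly across the week.
    """
    counts = Counter(d.weekday() for d in incident_dates)
    even_share = len(incident_dates) / 7  # average count per weekday
    return [counts.get(weekday, 0) / even_share for weekday in range(7)]


# Hypothetical history: 4 whole weeks, one incident per day, three on Saturdays.
start = date(2024, 1, 1)  # a Monday
history = []
for offset in range(28):
    day = start + timedelta(days=offset)
    history.extend([day] * (3 if day.weekday() == 5 else 1))

indices = day_of_week_indices(history)
```

In this made-up history the Saturday index is above 1 and every other weekday is below 1; the same pattern applies to the time-of-year and time-of-day cycles with a different grouping key.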
  • 53. Load Forecasting • Identify non-cyclical trend • Take recent daily counts (for example: last year daily counts) • Remove cyclical trends by dividing by indices • Run a trending function on the new counts – Simple average » Last X Days – Smoothing function » Exponential smoothing » Holt’s linear exponential smoothing
  • 54. Load Forecasting • Forecast expected count • Project trend into future timeframe – Always flat » Simple average » Exponential smoothing – Linear trend » Holt’s linear exponential smoothing • Multiply by seasonal indices to reseasonalize the data
  • 55. Load Forecasting • Measure cyclical patterns + Identify non-cyclical trend = Forecast expected count • bit.ly/gorrcrimeforecastingpaper
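
The trend and forecast steps can be sketched as: divide recent counts by their seasonal indices, fit Holt's linear exponential smoothing, project forward, then multiply by the indices again. This is a rough sketch under stated assumptions: the function names, the smoothing parameters (alpha 0.3, beta 0.1), the simple initialization, and the use of a single day-of-week index are all illustrative choices, not values from the talk.

```python
def holt_forecast(series, alpha=0.3, beta=0.1, horizon=7):
    """Holt's linear exponential smoothing; needs at least two observations."""
    level = series[0]
    trend = series[1] - series[0]  # simple initial trend estimate (assumption)
    for value in series[1:]:
        previous_level = level
        level = alpha * value + (1 - alpha) * (level + trend)
        trend = beta * (level - previous_level) + (1 - beta) * trend
    # Project the final level and trend forward, one step per future period.
    return [level + (step + 1) * trend for step in range(horizon)]


def forecast_counts(daily_counts, weekdays, indices, future_weekdays,
                    alpha=0.3, beta=0.1):
    """Deseasonalize, trend with Holt, then reseasonalize.

    indices: multiplicative index per weekday (all assumed to be > 0).
    weekdays / future_weekdays: weekday number (0-6) of each past / future day.
    """
    deseasonalized = [c / indices[wd] for c, wd in zip(daily_counts, weekdays)]
    base = holt_forecast(deseasonalized, alpha, beta, horizon=len(future_weekdays))
    return [b * indices[wd] for b, wd in zip(base, future_weekdays)]
```

Replacing `holt_forecast` with a plain average or single exponential smoothing gives the "always flat" variants listed on the slide.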
  • 57. How Do We Know It’s Accurate? • Testing • Generated forecasting techniques (examples) – Commonly Used » Average of last 30 days » Average of last 365 days » Last year’s count for the same time period – Advanced Combinations » Different cyclical indices (example: day of year vs. month of year) » Different levels of geographic aggregation for indices » Different trending functions • Scoring methodologies (examples) – Mean absolute percent error (with some enhancements) – Mean percent error – Mean squared error • Run thousands of forecasts through testing framework • Choose the right technique in the right situation
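
For reference, the three scoring measures named above can be written in their plain textbook form as below. The "enhancements" mentioned on the slide are not described in the deck and are not reproduced here; the sign convention (actual minus forecast) is an assumption.

```python
def mean_absolute_percent_error(actual, forecast):
    """Average size of the error as a percentage of the actual count.

    Undefined when any actual value is 0, so callers must filter those out.
    """
    return 100 * sum(abs((a - f) / a) for a, f in zip(actual, forecast)) / len(actual)


def mean_percent_error(actual, forecast):
    """Signed version of the above: reveals systematic over- or under-forecasting."""
    return 100 * sum((a - f) / a for a, f in zip(actual, forecast)) / len(actual)


def mean_squared_error(actual, forecast):
    """Average squared error: penalizes large misses more heavily."""
    return sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual)


# Hypothetical example: one over-forecast and one under-forecast of equal relative size.
actual = [100, 200]
forecast = [110, 180]
mape = mean_absolute_percent_error(actual, forecast)
mpe = mean_percent_error(actual, forecast)
mse = mean_squared_error(actual, forecast)
```

Here the signed errors cancel in the mean percent error while the absolute measure does not, which is why the two are usually read together.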
  • 59. Research Topics • Risk Forecasting – Load forecasting enhancements • Weather and special events – Combining short and long term risk forecasts (Temple) • Socioeconomic changes in neighborhoods – Risk Terrain Modeling (Rutgers) • Context of crime at the microplace
  • 61. Research Topics • Risk Forecasting – Offender Management • Prioritize offenders based upon statistical models using past behaviors • Evaluation – Automate Randomized Controlled Trials
  • 62. Data Processing for Big (Geo) Data
  • 64. Robert’s Rules of Housing • Close to Center City → somewhat important • Walk to Grocery Store → vital • Nearby Restaurants → very important • Library → nice to have • Near a Park → somewhat important • Biking / walking distance from our work → very important • Biking distance to fencing → somewhat important
  • 65. Your factors might include… • Child Care • Local School Rankings • Farmer's Market • Car Share • Public Transit
  • 66. We stand on the shoulders of giants
  • 67. Not a new idea … Design with Nature
  • 68. Not a new idea … Dana Tomlin
  • 70. Weighted Overlay: (layer × 5) + (layer × 1) + (layer × 3) + (layer × 2) = result
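
A weighted overlay is a cell-by-cell (local) operation: each input raster is multiplied by its weight and the results are summed. The sketch below uses plain nested lists and the weights shown on the slide (5, 1, 3, 2); the layer names and cell values are made up for illustration, and which factor carries which weight is an assumption.

```python
def weighted_overlay(layers, weights):
    """Return a raster where each cell is the weighted sum of the input cells.

    layers: list of equally sized 2D grids (lists of rows).
    weights: one numeric weight per layer.
    """
    rows, cols = len(layers[0]), len(layers[0][0])
    return [
        [sum(w * layer[r][c] for layer, w in zip(layers, weights))
         for c in range(cols)]
        for r in range(rows)
    ]


# Hypothetical 2x2 suitability layers (1 = criterion met, 0 = not met).
grocery = [[1, 0], [0, 1]]
restaurants = [[1, 1], [1, 1]]
park = [[0, 1], [0, 0]]
library = [[0, 0], [1, 1]]

score = weighted_overlay([grocery, restaurants, park, library], [5, 1, 3, 2])
```

Because every output cell depends only on the same cell in each input, the work can be split across tiles, threads, or machines without coordination.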
  • 71. Summary • Geography-driven decisions • Iterative • Individual • Web [and mobile] • Growing data sets
  • 73. Web is different from the Desktop • Lots of simultaneous users • Stateless environment • HTML+JS+CSS • Users are less skilled • Users are less patient
  • 74. But wait … there’s a problem • 10 – 60 second calculation time • Multiple simultaneous users … • … that are impatient
  • 76. Big Data – Social Media
  • 77. Big Data – Science
  • 78. Big Data – Citizen Science
  • 79. Big Data – Cities
  • 82. Specific Optimization Goals • New raster file structure • Distributed processing • Binary messaging protocol
  • 83. Optimization: File Format • Limit data type and range • 1D arrays are fast to read/write • Tiled • Pyramids • Azavea Raster Grid (ARG)
  • 84. Optimization: Distributed Processing • Parallelizable: Local Ops and Focal Ops • Support multiple – Threads – Cores – CPUs – Machines • Considered – Hadoop – Amazon Map Reduce – Beowulf
  • 85. Success!! Reduced from 10-60 seconds to <500 milliseconds
  • 86. Optimizing one process sub-optimizes others • Complex to configure and maintain • Limited to one operation • No interpolation • No mixing – cell sizes – extents – projections • etc.
  • 88. • Broader set of functionality • Both raster and vector • Scala + Akka • Open source
  • 101. Regional/State: 84 ms • National: 84 ms • Large Country: 115 ms • Continental: 271 ms • Planet: 1.2 – 2.0 s
  • 103. GPUs
  • 105. GPU Results • Re-wrote a few Map Algebra operations: – Local – Neighborhood – Zonal – Viewshed – etc. • 15 – 120x • Large grids • Large kernels
  • 106. New Spatial Operations • Vector • Neighborhood/Focal • Spatial Statistics • Integration
  • 108. Crime Analysis, Early Warning and Forecasting
  • 109. Open Source Geoprocessing • GDAL • GeoServer • PostGIS • R • GeoDa
  • 110. Many Thanks! © Photo used with permission from Alphafish, via Flickr.com
  • 111. Big (Geo) Data Science [We are hiring] Robert Cheetham cheetham@azavea.com @rcheetham