BIG DATA ANALYTICS ON
THE INTERNET
Dr. Shaozhong SHI
drshishaozhong@gmail.com
Drawing data from geographically
dispersed data stores over the Internet
 A showcase of remote, international access to and
use of open data and applications over the Internet is
presented.
 It shows how automation in big data analytics can be
achieved on the Internet.
 It shows the importance of standardisation and
accessibility of data.
 It illustrates, with a live example, how Open Source
tools can be utilised for advancing big data analytics.
 It shows the design of a new application with use of
Open Source tools such as Pandas, NumPy and
Matplotlib.
 It explains how full automation in sourcing and
processing data and generating analytical output can
be achieved.
 It shows the importance of the standardisation of data
and the role of geographical identifiers in automated
data processing.
Some key solutions for working
across multiple Pandas dataframes
(tables)
 This PowerPoint show covers some keys which
are important to data linkage, data integration,
working across multiple Pandas dataframes
(tables), and automation in processing.
 These are key solutions for automated exact
processing of records.
 The showcase implementation is provided in an
IPython notebook at the link below:
 http://dev.mapofagriculture.com:9999/ipython/notebooks/sshaozhong/2016-05-16_Automatic_Aggregation_Disaggregation_Showcase.ipynb
Original Online Data from USGS
 The original data used is a large, well-structured
Excel sheet at the following USGS website:
 http://water.usgs.gov/pubs/sir/2006/5012/excel/Nutrient_Inputs_1982-2001jan06.xls
 It is used as the input to the program. It is
geo-indexed with Federal Information Processing
Standards (FIPS) codes.
 The data is read in the newly developed program
and stored as a Pandas dataframe table.
 A subset of data was extracted for creation of a
Pandas dataframe table to serve as the input table.
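As a minimal sketch of this step, the snippet below shows one way a sheet could be read into a Pandas dataframe and a subset of columns extracted. The column names (`FIPS`, `County`, `farm_N_1987` and so on) and the helper names `load_sheet` and `extract_subset` are hypothetical and are not taken from the USGS file or the showcase notebook. `pd.read_excel` accepts a URL as well as a local path, although reading a legacy `.xls` file needs a suitable Excel engine to be installed. The demonstration uses a tiny in-memory table with made-up values instead of the real download.

```python
import pandas as pd


def load_sheet(source):
    """Read an Excel sheet (local path or URL) into a dataframe.

    FIPS codes are read as text so that leading zeros are preserved.
    The column name "FIPS" is an assumption for illustration.
    """
    return pd.read_excel(source, dtype={"FIPS": str})


def extract_subset(table, id_columns, value_prefix):
    """Keep the identifier columns plus every column whose name starts with value_prefix."""
    value_columns = [c for c in table.columns if c.startswith(value_prefix)]
    return table[id_columns + value_columns].copy()


# Tiny in-memory stand-in for the downloaded sheet (made-up values).
full_table = pd.DataFrame({
    "FIPS": ["01001", "01003"],
    "County": ["Autauga", "Baldwin"],
    "farm_N_1987": [100.0, 200.0],
    "farm_N_1988": [110.0, 210.0],
    "manure_P_1987": [5.0, 6.0],
})

# Extract the identifier columns and the hypothetical farm nitrogen columns.
input_table = extract_subset(full_table, ["FIPS", "County"], "farm_N_")
print(input_table)
```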
Original data:
Nitrogen Input from Fertilizer Use (kilograms)
in each year between 1987 and 2001
A subset of a large spreadsheet
Characterisation of the new algorithm for spatial
statistical aggregation and disaggregation
 The primary questions that this work set out to answer are
whether automated means can be designed and developed
for use in data integration and integrated processing of
agricultural census datasets, and whether automated
aggregation by states and dis-aggregation of values at
state level into values at county level can be achieved.
 To this end, exploratory design, development and testing
were carried out. An integrated set of algorithms was
researched, designed, implemented and tested on the Map
of Agriculture platform.
 The integrated algorithms are collectively called Data
Linkage for Data Integration and Automated Aggregation
and Dis-aggregation.
 Automated Aggregation and Dis-aggregation is a
prototype program that was developed to enable rapid
development of data integration and integrated
processing with Open Source Python tools and libraries.
 It uses Python with the Pandas and NumPy libraries.
 It is characterised by efficient, exact data integration,
data inflow to and outflow from Pandas dataframe tables,
and integrated processing.
 The two sets of algorithmic solutions implemented are
automatic online sourcing of structured data and
Automated Aggregation and Dis-aggregation itself.
 The first is to access, read in and take a set of data.
 The second is to carry out integrated processing for
aggregating county level statistics into state level
statistics and dis-aggregating state level statistics into
county level statistics by rule.
 Automated aggregation uses addition and summing.
 A loop sums up farm and non-farm statistics at
county level for each year from 1987 to 2001.
 Aggregated state level statistics are produced by using
the State FIPS codes as the key.
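The aggregation steps described above can be sketched as follows. This is an illustration under stated assumptions, not the code from the showcase notebook: the column names (`StateFIPS`, `farm_1987`, `nonfarm_1987`, `total_1987` and so on) are hypothetical and the values are made up.

```python
import pandas as pd

YEARS = range(1987, 2002)  # 1987 to 2001 inclusive

# Two made-up counties in state "01" and one in state "04".
county = pd.DataFrame({
    "StateFIPS": ["01", "01", "04"],
    "FIPS": ["01001", "01003", "04001"],
})
for year in YEARS:
    county[f"farm_{year}"] = [10.0, 20.0, 30.0]
    county[f"nonfarm_{year}"] = [1.0, 2.0, 3.0]

# Step 1: loop over the years, adding farm and non-farm inputs per county.
for year in YEARS:
    county[f"total_{year}"] = county[f"farm_{year}"] + county[f"nonfarm_{year}"]

# Step 2: sum the county totals within each state, keyed by the state FIPS code.
total_columns = [f"total_{year}" for year in YEARS]
state = county.groupby("StateFIPS")[total_columns].sum()

print(state[["total_1987", "total_2001"]])
```

The resulting `state` table is indexed by `StateFIPS`, so later steps can look values up by state code rather than by row position.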
Working of the processing: Aggregation
[Figure: input table]
[Figure: output of adding farm and non-farm statistics, repeated for all years]
[Figure: output of applying groupby with StateFIPS]
 The output of automatic aggregation is a Pandas dataframe
table which is indexed with the State FIPS codes.
 Dis-aggregation:
 The showcase uses a rule which assumes that county level
statistics contribute to state level statistics in proportion
to the county's share of the area within the state.
 The total land areas of the states are collected from the
aggregated output table through a vLookup-style lookup
and stored in a Python dictionary as a geo-referenced
dataset.
 These are mapped exactly into the right positions in a new
column in the intermediary table for producing
dis-aggregated statistics.
 Output of dis-aggregating statistics on Nitrogen Input
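One way to sketch the lookup step described above is with a plain Python dictionary and `Series.map`. The table layouts, the column names (`area`, `state_area`) and the values are assumptions for illustration. The slides do not show how the notebook builds its dictionary.

```python
import pandas as pd

# State-level table indexed by state FIPS code (made-up values).
state = pd.DataFrame(
    {"area": [300.0, 500.0], "total_1987": [33.0, 80.0]},
    index=pd.Index(["01", "04"], name="StateFIPS"),
)

# County-level intermediary table (made-up values).
county = pd.DataFrame({
    "StateFIPS": ["01", "01", "04"],
    "FIPS": ["01001", "01003", "04001"],
    "area": [100.0, 200.0, 500.0],
})

# Collect the state values into a dictionary keyed by state FIPS code.
state_area = state["area"].to_dict()

# Look up each county's state value by key and place it in a new column.
county["state_area"] = county["StateFIPS"].map(state_area)

print(county)
```

Because the lookup is driven by the FIPS key, each value lands in the row of the matching state regardless of how the rows are ordered.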
 Then, the ratio between each county and its state is
calculated.
 A loop is used to calculate dis-aggregated statistics for all
counties for each of the years from 1987 to 2001.
 This results in a Pandas dataframe table of
dis-aggregated statistics.
 New approach to dis-aggregating tabular statistics into
smaller geographical units (no intersection of geometric
objects is required):
 The ratio between each county and its state is
calculated. A loop is used to calculate dis-aggregated
Nitrogen input statistics for all counties for each of the
years between 1987 and 2001. The total of a state times
the ratio yields a dis-aggregated sum for the county. This
logic of dis-aggregation has been used in areal
interpolation as a technique for spatial disaggregation
(Flowerdew and Green, 1992, 1994; Goodchild, Anselin
and Deichmann, 1993). This results in a Pandas
dataframe table of dis-aggregated statistics.
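The rule "state total times ratio" can be sketched as below. This is a minimal illustration, assuming the ratio is a county's area divided by its state's area. The column names (`ratio`, `est_1987` and so on), the `state_totals` dictionary and all values are hypothetical.

```python
import pandas as pd

YEARS = range(1987, 2002)  # 1987 to 2001 inclusive

# County table with each county's area and its state's total area (made-up values).
county = pd.DataFrame({
    "StateFIPS": ["01", "01", "04"],
    "FIPS": ["01001", "01003", "04001"],
    "area": [100.0, 300.0, 500.0],
    "state_area": [400.0, 400.0, 500.0],
})

# State totals per year, keyed by state FIPS code (made-up values).
state_totals = {year: {"01": 40.0, "04": 80.0} for year in YEARS}

# Ratio between each county and its state.
county["ratio"] = county["area"] / county["state_area"]

# Loop over the years: state total times ratio gives the county estimate.
for year in YEARS:
    state_value = county["StateFIPS"].map(state_totals[year])
    county[f"est_{year}"] = state_value * county["ratio"]

print(county[["FIPS", "ratio", "est_1987"]])
```

Since the ratios of the counties in a state sum to one, the county estimates add back up to the state total, which gives a simple consistency check on the output table.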
 Hitherto, areal interpolation and dasymetric mapping
(Flowerdew and Green, 1992, 1994; Goodchild, Anselin
and Deichmann, 1993) are the only known approaches
and methods for spatially dis-aggregating statistics that
are relevant to the current work, particularly regarding
the processing of tabular statistics in vector GIS
datasets. The current work uses the logic of areal
interpolation, as far as the datasets involved
currently allow. The difference between the current
implementation of calculations and areal interpolation
is that the current implementation does not involve
intersection of area features/polygons.
 There is a degree of uncertainty related to the
estimates. Improvement in estimation requires further
research in the future. Nevertheless, it is a step
forward in enabling estimation given the situation
where no data are collected at county level. It offers a
means to provide a quantitative indication. It is
particularly useful to the processing of tabular
statistics or when patterns need to be visualised at
large scales.
 The algorithmic solutions are characterised by their
capability to track the geo-referenced data entries
throughout cycles of processing, and by exact
geo-referenced data retrieval and mapping, namely data
inflow to and outflow from Pandas dataframe tables.
 The dis-aggregation algorithm/procedure can be used
for direct processing of tabular statistics without
involving intersection of polygons, particularly in
situations where neatly nested geospatial boundary
files of US states and counties are used.
 The new algorithm can carry out automatic online
sourcing of datasets and integrated processing with
Open Source Python libraries. The new algorithm
can be further extended for linking geodata from
various sources, and for creation of indexed tabular
datasets with geographical identifiers.
 It can carry out automatic aggregation and dis-
aggregation of agricultural census datasets for all
states and counties in the USA.
 The Federal Information Processing Standards (FIPS)
codes were used as geographical identifiers for
geo-referenced data entries. They play a critical role in
retrieving data from databases and mapping data into the
right positions. They make vLookup solutions efficient for
retrieving data and mapping it to exact positions in
tables as desired.
 Geographical identifiers serve as the key and are critically
important in linking data between tables and creating
geo-indexed tabular datasets. Geographical identifiers track
attribute data entries in reference to geospatial objects.
 This vLookup solution can be modified and used for other
geodata projects.
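As a small illustration of why FIPS codes work well as keys: a county FIPS code has five digits, and its first two digits are the state FIPS code. If the codes are read as numbers, leading zeros are lost and need restoring before the codes can be used for lookups. The helper name `normalise_county_fips` and the column names below are hypothetical.

```python
import pandas as pd


def normalise_county_fips(codes):
    """Return county FIPS codes as zero-padded five-character strings."""
    return codes.astype(str).str.zfill(5)


# County codes as they might arrive if a sheet stored them as integers.
table = pd.DataFrame({"FIPS": [1001, 1003, 4001, 36061]})

# Restore the leading zeros, then derive the state key from the first two digits.
table["FIPS"] = normalise_county_fips(table["FIPS"])
table["StateFIPS"] = table["FIPS"].str[:2]

print(table)
```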
Output
 The output of the program includes an
aggregated statistical table by states and a
dis-aggregated table by counties.
Dis-aggregating wheat statistics into
all counties
 Data columns of StateFIPS, State
Abbreviation, County name, county FIPS
and ratio are taken from the table of
dis-aggregated nitrogen input to form a new
Pandas DataFrame table.
 Data on wheat extracted from QuickStats
are used. These data are state level
statistics. The data are dis-aggregated
into all counties.
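Reusing the ratio table for a second state level dataset could look like the sketch below. It uses a Pandas merge on the state key in place of a dictionary lookup, which is a substitution made here for illustration and not necessarily what the notebook does. The wheat figures, the column names and the ratios are made up.

```python
import pandas as pd

# Ratio table taken over from the nitrogen dis-aggregation (made-up values).
ratios = pd.DataFrame({
    "StateFIPS": ["01", "01", "04"],
    "FIPS": ["01001", "01003", "04001"],
    "ratio": [0.25, 0.75, 1.0],
})

# State level wheat statistics (made-up values).
wheat_state = pd.DataFrame({"StateFIPS": ["01", "04"], "wheat": [200.0, 50.0]})

# Attach each state's value to its counties; validate guards against duplicate state rows.
wheat_county = ratios.merge(
    wheat_state, on="StateFIPS", how="left", validate="many_to_one"
)

# State value times ratio gives the county estimate.
wheat_county["wheat_est"] = wheat_county["wheat"] * wheat_county["ratio"]

print(wheat_county[["FIPS", "wheat_est"]])
```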
 Output of dis-aggregating wheat statistics
Issues encountered
 Data type issues were encountered and resolved.
 A clear understanding of data types and of methods for
changing and handling them is required.
 After the groupby command is applied to a Pandas
dataframe, the original row index no longer carries
meaning. The use of FIPS codes ensures that data indexing
and linkage in records are maintained throughout
processing cycles. Mapping geo-referenced data into
exact positions in columns is very important.
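Both issues can be reproduced on a toy table. The sketch below is illustrative only, with hypothetical column names and made-up values. It converts text values to numbers before summing, then shows that the result of `groupby` is indexed by the group key and no longer by the original row positions.

```python
import pandas as pd

county = pd.DataFrame({
    "StateFIPS": ["04", "01", "01"],
    "value": ["30", "10", "20"],  # numbers stored as text
})

# Data type issue: summing text would concatenate strings, so convert first.
county["value"] = pd.to_numeric(county["value"])

# After groupby the index is the group key, not the original row positions.
grouped = county.groupby("StateFIPS")["value"].sum()
print(grouped.index.tolist())

# Keeping the key as an ordinary column makes later linkage explicit.
flat = county.groupby("StateFIPS", as_index=False)["value"].sum()
print(flat)
```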
Update Geo-databases and Create
digital models in Geographical Information
Systems to visualise spatial variation
 A standard Geographical Information System has a digital
map associated with a tabular database of records.
 Areal interpolation and dasymetric mapping techniques
have gained popularity in using tabular records and
combining these with area boundary files for creating
map models.
 The approach presented in this talk is based on the use
of neatly nested area boundary files in the
administrative hierarchy of areas of the USA.
 No intersection of digital boundaries is needed.
Analytical example: Change over time
Analytical example: Rate of Change
References
 https://www.nass.usda.gov/Quick_Stats/
 https://www.python.org/downloads/
 https://www.scipy.org/scipylib/download.html
 http://matplotlib.org/downloads.html
 https://pypi.python.org/pypi/pylab
 Contact
 4 Haythrop Close, Downhead Park, Milton Keynes,
Buckinghamshire, United Kingdom, MK15 9DD
 Mobile: +44-7909844462
 EMail: drshishaozhong@gmail.com
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 

GI2016 ppt shi (big data analytics on the internet)

  • 1. BIG DATA ANALYTICS ON THE INTERNET Dr. Shaozhong SHI drshishaozhong@gmail.com
  • 2. Drawing data from geographically dispersed data stores over the Internet
      • A showcase of internationally remote access to, and use of, open data and applications over the Internet is presented.
      • It shows how automation in big data analytics can be achieved on the Internet.
      • It shows the importance of standardisation and accessibility of data.
      • It illustrates, with a live example, how Open Source tools can be utilised for advancing big data analytics.
  • 3. Drawing data from geographically dispersed data stores over the Internet
      • It shows the design of a new application that uses Open Source tools such as Pandas, NumPy and Matplotlib.
      • It explains how full automation in sourcing and processing data and generating analytical output can be achieved.
      • It shows the importance of the standardisation of data and the role of geographical identifiers in automated data processing.
  • 4. Some key solutions for working across multiple Pandas dataframes (tables)
      • This presentation covers some key points that are important for data linkage, data integration, working across multiple Pandas dataframes (tables), and automation in processing.
      • These are key solutions for automated, exact processing of records.
      • The showcase implementation is provided in an IPython notebook at the link below:
      • http://dev.mapofagriculture.com:9999/ipython/notebooks/sshaozhong/2016-05-16_Automatic_Aggregation_Disaggregation_Showcase.ipynb
  • 5. Original Online Data from USGS
      • The original data is a large, well-structured Excel sheet at the following USGS website:
      • http://water.usgs.gov/pubs/sir/2006/5012/excel/Nutrient_Inputs_1982-2001jan06.xls
      • It is used as the input to the program. It is geo-indexed with Federal Information Processing Standards (FIPS) codes.
      • The data is read by the newly developed program and stored as a Pandas dataframe table.
      • A subset of the data was extracted to create a Pandas dataframe table that serves as the input table.
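The reading step described on this slide might be sketched as follows. This is a minimal illustration, not the notebook's actual code: the helper names `load_usgs_table` and `pad_fips` and the column name `StateFIPS` are assumptions, and the URL is the one printed on the slide, which may no longer resolve.

```python
import pandas as pd

# URL as printed on the slide (joined across the slide's line break).
USGS_URL = ("http://water.usgs.gov/pubs/sir/2006/5012/excel/"
            "Nutrient_Inputs_1982-2001jan06.xls")


def load_usgs_table(source=USGS_URL, sheet_name=0):
    """Read the spreadsheet into a DataFrame (hypothetical helper).

    Reading a legacy .xls file needs the optional xlrd package.
    """
    return pd.read_excel(source, sheet_name=sheet_name)


def pad_fips(df, column, width):
    """Return a copy in which `column` holds zero-padded FIPS strings.

    Spreadsheets often store FIPS codes as numbers, which drops leading
    zeros (state 01 becomes 1); padding restores a stable text key.
    """
    out = df.copy()
    out[column] = out[column].astype(int).astype(str).str.zfill(width)
    return out
```

Keeping FIPS codes as fixed-width strings is one way to make them safe join keys across tables; the original program may handle this differently.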
  • 6. Original data: nitrogen input from fertilizer use (kilograms) for each year between 1987 and 2001 (a subset of a large spreadsheet)
  • 7. Characterisation of the new algorithm for spatial statistical aggregation and disaggregation
      • The primary questions that this work set out to answer are whether automated means can be designed and developed for data integration and integrated processing of agricultural census datasets,
      • and whether aggregation by state, and dis-aggregation of state-level values into county-level values, can be automated.
      • To this end, exploratory design, development and testing were carried out. An integrated set of algorithms was researched, designed, implemented and tested on the Map of Agriculture platform.
      • The integrated algorithms are collectively called Data Linkage for Data Integration and Automated Aggregation and Dis-aggregation.
  • 8. Characterisation of the new algorithm for spatial statistical aggregation and disaggregation
      • The Automated Aggregation and Dis-aggregation is a prototype program that was developed to enable rapid development of data integration and integrated processing with Open Source Python tools and libraries.
      • The Automated Aggregation and Dis-aggregation uses Python with the Pandas and NumPy libraries.
      • It is characterised by efficient, exact data integration, data inflow to and outflow from Pandas dataframe tables, and integrated processing.
  • 9. Characterisation of the new algorithm for spatial statistical aggregation and disaggregation
      • The two sets of algorithmic solutions implemented are automatic online sourcing of structured data
      • and Automated Aggregation and Dis-aggregation itself. The first accesses, reads in and takes a set of data.
      • The second carries out integrated processing that aggregates county-level statistics into state-level statistics and dis-aggregates state-level statistics into county-level statistics by rule.
      • Automated aggregation uses addition and summing.
      • A loop sums the farm and non-farm statistics at county level for each year from 1987 to 2001.
      • Aggregated state-level statistics are produced by using the State FIPS codes as the key.
  • 10. Working of the processing. Aggregation: input
  • 11. Working of the processing. Aggregation: output of adding farm and non-farm statistics, repeated for all years
  • 12. Working of the processing. Aggregation: output of applying groupby with StateFIPS as the key
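The aggregation shown on slides 9 to 12 could be sketched along the following lines. The column names (`farm1987`, `nonfarm1987`, `total1987`, and so on) and the function name `aggregate_by_state` are assumptions made for illustration; only `StateFIPS`, the farm/non-farm sum and the use of groupby come from the slides.

```python
import pandas as pd

YEARS = range(1987, 2002)  # 1987 to 2001 inclusive


def aggregate_by_state(counties, years=YEARS):
    """Sum farm and non-farm inputs per county, then total them by state.

    `counties` is assumed to hold a StateFIPS column plus one
    farm<year> and one nonfarm<year> column per year (hypothetical names).
    """
    out = counties[["StateFIPS"]].copy()
    # Loop over years: county total = farm + non-farm.
    for year in years:
        out[f"total{year}"] = (counties[f"farm{year}"]
                               + counties[f"nonfarm{year}"])
    # Group by the State FIPS key; the result is indexed by StateFIPS.
    return out.groupby("StateFIPS").sum()
```

The returned table is indexed by State FIPS code rather than by row number, which is what lets later steps look values up by state.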
  • 13. Characterisation of the new algorithm for spatial statistical aggregation and disaggregation
      • The output of automatic aggregation is a Pandas dataframe table which is indexed with the State FIPS codes.
      • Dis-aggregation:
      • The showcase uses a rule which assumes that county-level statistics contribute to state-level statistics in proportion to the county's area within the state.
      • The total land areas of the states are collected from the aggregated output table through a vLookup. They are stored in a Python dictionary as a geo-referenced dataset.
      • These are mapped exactly into the right positions in a new column of the intermediary table used for producing dis-aggregated statistics.
  • 14. Output of dis-aggregating statistics on nitrogen input
  • 15. Characterisation of the new algorithm
      • Then, the ratio between each county and its state is calculated.
      • A loop is used to calculate dis-aggregated statistics for all counties for each of the years from 1987 to 2001.
      • This results in a Pandas dataframe table as a dis-aggregated table.
  • 16. Characterisation of the new algorithm for spatial statistical aggregation and disaggregation
      • A new approach to dis-aggregating tabular statistics into smaller geographical units (no intersection of geometric objects is required):
      • The ratio between each county and its state is calculated. A loop is used to calculate dis-aggregated nitrogen input statistics for all counties for each of the years between 1987 and 2001. The total of a state times the ratio yields a dis-aggregated sum for the county. This logic of dis-aggregation has been used in areal interpolation as a technique for spatial disaggregation (Flowerdew and Green, 1992, 1994; Goodchild, Anselin and Deichmann, 1993). This results in a Pandas dataframe table as a dis-aggregated table.
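The dis-aggregation rule on slides 13 to 16 (county estimate = state total × county area / state area) could be sketched as below. This is an illustration under stated assumptions rather than the showcase code: the column names `area`, `CountyFIPS`, `total<year>` and `est<year>` and the function name `disaggregate` are hypothetical.

```python
import pandas as pd


def disaggregate(counties, state_totals, years):
    """Spread state totals over counties in proportion to county area.

    counties:     one row per county, with StateFIPS, CountyFIPS and area.
    state_totals: table indexed by StateFIPS with a total<year> column
                  per year (hypothetical column names).
    """
    # State land area kept as a plain dictionary keyed by State FIPS,
    # mirroring the vLookup-style dictionary described on slide 13.
    state_area = counties.groupby("StateFIPS")["area"].sum().to_dict()

    out = counties[["StateFIPS", "CountyFIPS", "area"]].copy()
    out["state_area"] = out["StateFIPS"].map(state_area)
    out["ratio"] = out["area"] / out["state_area"]

    # Loop over years: county estimate = state total * ratio.
    for year in years:
        state_value = out["StateFIPS"].map(state_totals[f"total{year}"])
        out[f"est{year}"] = state_value * out["ratio"]
    return out
```

Because the ratios within a state sum to one, the county estimates add back up to the state total, and no polygon intersection is involved at any point.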
  • 17. Characterisation of the new algorithm for spatial statistical aggregation and disaggregation
      • Hitherto, areal interpolation and dasymetric mapping (Flowerdew and Green, 1992, 1994; Goodchild, Anselin and Deichmann, 1993) have been the only known approaches for spatially dis-aggregating statistics that are relevant to the current work, particularly regarding the processing of tabular statistics in vector GIS datasets. The current work uses the logic of areal interpolation, as far as the datasets involved currently allow. The difference between the current implementation and areal interpolation is that the current implementation does not involve intersection of area features/polygons.
  • 18. Characterisation of the new algorithm for spatial statistical aggregation and disaggregation
      • There is a degree of uncertainty in the estimates. Improving the estimation requires further research. Nevertheless, it is a step forward in enabling estimation where no data are collected at county level. It offers a means of providing a quantitative indication. It is particularly useful for the processing of tabular statistics or when patterns need to be visualised at large scales.
  • 19. Characterisation of the new algorithm for spatial statistical aggregation and disaggregation
      • The algorithmic solutions are characterised by their ability to track geo-referenced data entries throughout cycles of processing, and by exact geo-referenced data retrieval and mapping, namely data inflow to and outflow from Pandas dataframe tables.
      • The dis-aggregation algorithm/procedure can be used to process tabular statistics directly without involving intersection of polygons, particularly when neatly nested geospatial boundary files of US states and counties are used.
  • 20. Characterisation of the new algorithm for spatial statistical aggregation and disaggregation
      • The new algorithm can carry out automatic online sourcing of datasets and integrated processing with Open Source Python libraries. The new algorithm can be further extended for linking geodata from various sources, and for creation of indexed tabular datasets with geographical identifiers.
      • It can carry out automatic aggregation and dis-aggregation of agricultural census datasets for all states and counties in the USA.
  • 21. Characterisation of the new algorithm for spatial statistical aggregation and disaggregation
      • The Federal Information Processing Standards (FIPS) codes were used as geographical identifiers for geo-referenced data entries. They play a critical role in retrieving data from databases and mapping data into the right positions. They enable efficient vLookup solutions for retrieving data and mapping it to exact positions in tables as desired.
      • Geographical identifiers serve as the key and are critically important in linking data between tables and creating geo-indexed tabular datasets. Geographical identifiers track attribute data entries in reference to geospatial objects.
      • This vLookup solution can be modified and used for other geodata projects.
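One way to express this kind of FIPS-keyed linkage in Pandas is a validated left merge, sketched below. The slides describe a dictionary-based vLookup; the merge here is a swapped-in alternative shown for illustration, and the function name `link_on_state_fips` and the `StateAbbr` column in the usage note are assumptions.

```python
import pandas as pd


def link_on_state_fips(counties, states):
    """Attach state-level columns to county rows using StateFIPS as key.

    validate="many_to_one" makes Pandas raise if a StateFIPS value occurs
    more than once in `states`; indicator=True adds a _merge column that
    shows which county rows found no matching state record.
    """
    linked = counties.merge(
        states, on="StateFIPS", how="left",
        validate="many_to_one", indicator=True,
    )
    missing = linked.loc[linked["_merge"] == "left_only", "StateFIPS"].unique()
    if len(missing) > 0:
        raise KeyError(f"No state record for StateFIPS: {sorted(missing)}")
    return linked.drop(columns="_merge")
```

For example, linking a county table to a small table of `StateFIPS` and `StateAbbr` values places each abbreviation on every county row of that state, and a county whose code is absent from the state table is reported instead of being silently left blank.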
  • 22. Output
      • The output of the program includes an aggregated statistical table by states and a dis-aggregated table by counties.
  • 23. Dis-aggregating wheat statistics into all counties
      • The data columns StateFIPS, State Abbreviation, County name, county FIPS and ratio are taken from the table of dis-aggregated nitrogen input to form a new Pandas DataFrame table.
      • Wheat data extracted from QuickStats are used. These data are state-level statistics. The data are dis-aggregated into all counties.
  • 24. Dis-aggregating wheat statistics into all counties
      • Output of dis-aggregating wheat statistics
  • 25. Issues encountered
      • Data type issues were encountered and resolved.
      • A clear understanding of data types, and of the methods for changing and handling them, is required.
      • After the groupby command is applied to a Pandas dataframe, the original indexing is no longer meaningful. The use of FIPS codes ensures that data indexing and linkage between records are maintained throughout processing cycles. Mapping geo-referenced data into exact positions in columns is very important.
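The indexing issue noted on this slide can be reproduced with a few rows of made-up data. The values and column names below are illustrative assumptions, not figures from the USGS table.

```python
import pandas as pd

# Made-up rows; FIPS codes are kept as strings so leading zeros survive.
counties = pd.DataFrame({
    "StateFIPS": ["02", "01", "01"],
    "CountyFIPS": ["013", "001", "003"],
    "value": [5.0, 10.0, 30.0],
})

# After groupby the result is indexed by StateFIPS (sorted), so the
# original row numbers 0, 1, 2 no longer identify anything.
state_totals = counties.groupby("StateFIPS")["value"].sum()

# The FIPS key restores the link: each county row looks up its own
# state's total, whatever order the rows are in.
counties["state_total"] = counties["StateFIPS"].map(state_totals)

# as_index=False keeps StateFIPS as an ordinary column instead.
flat = counties.groupby("StateFIPS", as_index=False)["value"].sum()
```

Mapping by key rather than by position is what keeps each value in the right row when the grouped table has a different length and order from the county table.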
  • 26. Update Geo-databases and Create digital models in Geographical Information Systems to visualise spatial variation
      • A standard Geographical Information System has a digital map associated with a tabular database of records.
      • Areal interpolation and dasymetric mapping techniques have gained popularity for combining tabular records with area boundary files to create map models.
      • The approach presented in this talk is based on the use of neatly nested area boundary files in the administrative hierarchy of areas of the USA.
      • No intersection of digital boundaries is needed.
  • 29. References
      • https://www.nass.usda.gov/Quick_Stats/
      • https://www.python.org/downloads/
      • https://www.scipy.org/scipylib/download.html
      • http://matplotlib.org/downloads.html
      • https://pypi.python.org/pypi/pylab
    Contact
      • 4 Haythrop Close, Downhead Park, Milton Keynes, Buckinghamshire, United Kingdom, MK15 9DD
      • Mobile: +44-7909844462
      • EMail: drshishaozhong@gmail.com