The Actionable Intelligence Retrieval System (AIRS), developed under IARPA's Knowledge Discovery and Dissemination (KDD) program, is an integrated prototype that aligns disparate data models, applies advanced analytics algorithms, and allows analysts to search for information across many data sources. AIRS comprises three essential components: an ontology that aligns the data, a suite of advanced analytics algorithms, and the integrated prototype itself. The document outlines AIRS' research areas and the tasks undertaken to develop its capabilities for retrieving and analyzing information from multiple data sources.
2. Alignment of Data Models
- Single representation for all data sources
- Easily plug in new data sources

[Diagram: an Event node with Type "Crop Failure", OccursOn "Apr – Jul 2008", OccursAt "Western Afghanistan", and RecordedBy an Observer, aligned from three source documents: a Report (Observer ID: 556AS4, Date: 26 Apr 08, Event: Crop Failure Extent), a Transcript (Date: 15 May 08, Event: "Crop in certain areas dire as lack of rain …"), and a Newsletter (Date: 10 Apr 08, Situation Description: "Crop outlook for early summer …"). The data model removes each source's perspective; perspective detection matters because perspective confounds data integration.]
3. Advanced Analytics Algorithms
- "Easy" analyst questions: identify all event information (timeline)
- "Harder" analyst questions: identify similar events
- "Hardest" analyst questions: identify predictor events

[Diagram: the three question types arranged along an axis running from quantitative to qualitative analysis.]
4. Probe Tasks
- Fully automated tasks
- Test system plumbing
- Ex: Find all associates of Jim Johnson and list each person's affiliation to Jim. Use only data sets A, E, M.
- 20 questions like these

Analyst Tasks
- Manual tasks executed by actual analysts
- Test usability and applicability of the developed algorithms on realistic tasks
- Ex: Find all information that may have predicted an attack was imminent in Khost, Afghanistan on 3 June, 2008.
- 10 questions like these
5. Many Sources, Many Records, Many Types

[Chart: eight data sources (DS 1 through DS 8) plotted against record counts on a 1K / 100K / 1M scale, spanning many record types: reports, articles, blogs, transcripts, structured data, DOMEX, semi-structured data, and social media.]
7. 9 High-Level Research Areas, 30 Research Tasks in Phase 2

ALIGNMENT
1. Ontology Development
2. Structured Data Alignment
3. Unstructured Data Alignment
4. Alignment Reasoner
5. Alignment Optimization

ADVANCED ANALYTICS
6. Workflow Optimization
7. Application of Analyst Context
8. Data Association for Entity Resolution
9. Distributed Graph Matching

Phase 2 tasks (performer in parentheses), each scheduled for an April and/or August milestone at Lab or Pre-Prototype maturity: Tasks 1.1.3, 1.1.4, 1.1.5, 1.2.2, 5.1.1, 6.1.4, 6.1.5, 6.3.1, 9.1.2 (CUBRC); Task 6.1.2 (CUBRC/UB); Tasks 2.1.2, 2.1.3.a, 2.1.3.c (ISS); Tasks 3.1.2.a, 3.1.3, 3.1.4, 3.2.1.a, 3.2.1.b, 3.2.1.c, 3.2.1.d, 3.2.3 (GDIT); Tasks 4.2.1, 7.3.1 (Securboration); Tasks 7.4.1, 8.1.1, 8.3.1, 8.3.2, 9.1.1, 9.2.1, 9.3.1 (UB).
8. Analytics Data Flow

[Diagram: an analyst's query enters through Visualization, is expanded into ranked KDD RDF queries (Query Expansion, single-threaded), and is executed against the Data Services using the aligned models (Query Execution). The Data Services layer offers Search, Graph Creation for structured and unstructured data (both parallelized), Association of entities and events (parallelized), and SPARQL query, reading and writing the Global Model built over the raw data sources. Invoked algorithms analyze and evaluate the results, abductive Requery proposes follow-on queries, and answers return to the analyst through Visualization.]
9. Backbone of Project

[Diagram: the ontology stack. Basic Formal Ontology and the Relation Ontology sit at the top; beneath them are the mid-level ontologies (Agent, Event, Geospatial, Artifact, Time, Extended Relation, Information Technology, and Quality); then the AIRS Mid-Level Ontology, which defines the input and output format; and at the base the Counterterrorism Ontology, covering most counterterrorism processes.]
10. Information Entity Ontology
- 76 local classes
- 21 equivalence class axioms
- 1 superclass axiom
- 28 local object properties
- 7 datatype properties

Agent Ontology
- 787 local classes
- 231 equivalence class axioms (mostly persons with roles, e.g., Physician, Lawyer)
- 70 local object properties (mostly familial relationships)
- SPARQL Inferencing Notation (SPIN) rules that infer familial relationships from the primitive relationships child_of and parent_of and the qualities of male and female gender

[Sample document annotated with the types #Note, #Paragraph, #SectionOfText, #Person, and #Place.]
11. Analytics Query 'Soup-to-Nuts' Graph

Example analyst query: "Documents where Smyth is a Person && has Associates && footnote contains 'XY' && from data set 4 or 5."

[Diagram: the ontology translates the analyst query into a SPARQL query that runs against data sets 4 and 5, reaching from the raw data up to the extracted types.]
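To make the translation concrete, here is a minimal sketch of running such a query with Apache Jena. This is not the AIRS codebase; the ex: vocabulary (ex:hasAssociate, ex:hasFootnote, ex:fromDataSet, etc.) and the file name airs-sample.rdf are hypothetical stand-ins:

```java
import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.ResultSet;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;

public class SoupToNutsQuery {
    public static void main(String[] args) {
        // Load an aligned RDF export (hypothetical file name).
        Model model = ModelFactory.createDefaultModel();
        model.read("airs-sample.rdf");

        // Hypothetical vocabulary approximating the slide's analyst query.
        String query = String.join("\n",
            "PREFIX ex: <http://example.org/airs#>",
            "SELECT DISTINCT ?doc WHERE {",
            "  ?doc ex:mentions ?smyth .",
            "  ?smyth a ex:Person ; ex:hasName \"Smyth\" ;",
            "         ex:hasAssociate ?assoc .",
            "  ?doc ex:hasFootnote ?fn .",
            "  ?fn ex:hasTextValue ?t . FILTER(CONTAINS(?t, \"XY\"))",
            "  ?doc ex:fromDataSet ?ds . FILTER(?ds IN (4, 5))",
            "}");

        try (QueryExecution qe = QueryExecutionFactory.create(query, model)) {
            ResultSet results = qe.execSelect();
            while (results.hasNext()) {
                System.out.println(results.next().get("doc"));
            }
        }
    }
}
```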
12. Analytics Data Flow (repeat of the slide 8 diagram).
13. Architecture Implementation

[Diagram: the column alignment pipeline, built on the Spring Framework. A Column Alignment Request fans out to several learners (a Data Value Characterization learner, a Context-Based Alignment learner, a Categorical Data Value Characterization learner, and a Lucene Base Alignment learner); their outputs feed the Alignment Mega-Data Cube and Mega-Learner, which emit the Column Alignment Prediction.]

Data Value Characterization
- Used metadata, data values, regular expressions, and neural networks to classify columns
- Combined with a collection of heuristics
- Date/Time
- Person's name, alias, and birth date
- Recognizing unstructured data within structured data
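As an illustration of the regular-expression side of data value characterization, a column can be labeled by voting over its sample values. The patterns below are invented stand-ins, not the AIRS learners, which also combined metadata and neural networks:

```java
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

public class ColumnCharacterizer {
    // Hypothetical patterns; the real learners combined metadata,
    // data values, regexes, and neural networks.
    private static final Map<String, Pattern> PATTERNS = Map.of(
        "DATE_TIME", Pattern.compile("\\d{1,2}/\\d{1,2}/\\d{2,4}"),
        "PERSON_NAME", Pattern.compile("[A-Z][a-z]+ [A-Z][a-z]+"));

    /** Labels a column with the type matching the most sample values. */
    public static String characterize(List<String> sampleValues) {
        String best = "UNKNOWN";
        long bestVotes = 0;
        for (Map.Entry<String, Pattern> e : PATTERNS.entrySet()) {
            long votes = sampleValues.stream()
                .filter(v -> e.getValue().matcher(v).matches())
                .count();
            if (votes > bestVotes) {
                bestVotes = votes;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(characterize(
            List.of("15/05/2008", "26/04/2008", "10/04/2008"))); // DATE_TIME
    }
}
```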
15. Method
1. Document Type Identification: determine the document type with pattern-based configurations
2. Passage & Metadata Retrieval: with the document type known, identify and extract data using (a) a template/grammar process or (b) a generic heuristic process
3. Document Genre Association: link associated document genres

[Diagram: an identification configuration drives the Document Type Identification process, which annotates each document with its type; the Template/Grammar process (driven by template grammars) or the Generic Heuristic process then retrieves passages and metadata; finally, the Document Genre Association process emits passages, document metadata, and genre links.]
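A minimal sketch of step 1, pattern-based document type identification; the document types and patterns here are hypothetical, not the AIRS identification configuration:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

public class DocumentTypeIdentifier {
    // Hypothetical pattern-based configuration: first match wins.
    private static final Map<String, Pattern> CONFIG = new LinkedHashMap<>();
    static {
        CONFIG.put("REPORT", Pattern.compile("(?m)^Observer ID:"));
        CONFIG.put("TRANSCRIPT", Pattern.compile("(?i)\\btranscript\\b"));
        CONFIG.put("NEWSLETTER", Pattern.compile("(?i)\\bnewsletter\\b"));
    }

    public static String identify(String documentText) {
        for (Map.Entry<String, Pattern> e : CONFIG.entrySet()) {
            if (e.getValue().matcher(documentText).find()) {
                return e.getKey();
            }
        }
        return "UNKNOWN"; // falls through to the generic heuristic process
    }

    public static void main(String[] args) {
        System.out.println(identify("Observer ID:556AS4\nDate: 26 Apr 08"));
    }
}
```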
16. Methods
- Extraction of entity types (people, places, locations, facilities, etc.)
- Extraction of events and relationships: uses an external file of patterns to extract attributes, relationships, and events
- Speed is 100 - 250K per second for information extraction

[Diagram: the pattern language lets roles such as Purchaser and Seller be defined quickly for an extraction pattern.]
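As a rough illustration of pattern-driven role extraction in the spirit of the Purchaser/Seller example, here is a sketch with an invented pattern; it is not the AIRS pattern language:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PurchaseEventExtractor {
    // Invented pattern: "<purchaser> bought|purchased <item> from <seller>".
    private static final Pattern PURCHASE = Pattern.compile(
        "(?<purchaser>[A-Z][a-z]+(?: [A-Z][a-z]+)*) (?:bought|purchased) "
        + "(?<item>.+?) from (?<seller>[A-Z][a-z]+(?: [A-Z][a-z]+)*)");

    public static void main(String[] args) {
        Matcher m = PURCHASE.matcher(
            "John Doe purchased two trucks from Acme Motors.");
        if (m.find()) {
            System.out.println("Purchaser: " + m.group("purchaser"));
            System.out.println("Item:      " + m.group("item"));
            System.out.println("Seller:    " + m.group("seller"));
        }
    }
}
```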
17. Developed Tools

Create Corpora Tool
1. Pulls down documents from data sources (uses samples)
2. Performs document analysis
3. Generates core types (~20 minutes for full markup of 1,200 documents)
18. Developed Tools
- Corner case coverage
- Text-to-RDF tool
19. Analytics Data Flow (repeat of the slide 8 diagram).
20. [Diagram: the data service provides many data sources, keyword-based query, fast core analytics, and dynamic graph generation, built from structured data processing, natural language processing, custom analytics, and a keyword index behind a single data service. Design goals: consistent running time (5-minute goal), a realist ontology, and scalability (Hadoop).]
21. Purpose
To create a component that selects the workflow definition that satisfies a set of QoS requirements, maximizing the expected outcome of the workflow.

Method
Solve the composite service problem:
- The problem is decomposed into a sequence of functionalities.
- Functionalities (service classes) can be executed by many candidate services.
- Candidates have associated benefits/costs (QoS parameters).
- Candidates are substitutes and complements within a service class.
- Given QoS requirements, e.g., algorithm runtime ≤ 5 minutes.
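One way to make the selection concrete is exhaustive search over the candidate combinations. This is a minimal sketch with invented candidates and QoS values, not the component's actual algorithm:

```java
import java.util.ArrayList;
import java.util.List;

public class CompositeServiceSelector {
    // A candidate service with its QoS parameters (hypothetical values).
    record Candidate(String name, double runtimeMin, double benefit) {}

    static double bestBenefit = -1;
    static List<Candidate> best;

    /** Recursively picks one candidate per service class within the budget. */
    static void search(List<List<Candidate>> classes, int i, double runtime,
                       double benefit, List<Candidate> chosen, double budgetMin) {
        if (runtime > budgetMin) return;            // QoS requirement violated
        if (i == classes.size()) {
            if (benefit > bestBenefit) {
                bestBenefit = benefit;
                best = List.copyOf(chosen);
            }
            return;
        }
        for (Candidate c : classes.get(i)) {
            chosen.add(c);
            search(classes, i + 1, runtime + c.runtimeMin(),
                   benefit + c.benefit(), chosen, budgetMin);
            chosen.remove(chosen.size() - 1);
        }
    }

    public static void main(String[] args) {
        List<List<Candidate>> classes = List.of(
            List.of(new Candidate("fast-ER", 1.0, 0.6),
                    new Candidate("LR-ER", 3.5, 0.9)),    // entity resolution
            List.of(new Candidate("keyword", 0.5, 0.5),
                    new Candidate("graph-match", 2.5, 0.8)));
        search(classes, 0, 0, 0, new ArrayList<>(), 5.0); // 5-minute budget
        System.out.println(best + " benefit=" + bestBenefit);
    }
}
```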
22. 
- Implemented in the prototype system as runtime QoS

[Pipeline: structured and unstructured processing write the search model, a SPARQL query runs, and results are written to visualization, all within 5 minutes.]

- Developers must adhere to the QoS parameters
- A phenomenal feedback loop developed with analysts; analysts understood and diagnosed the system
- Chose two additional QoS metrics for Phase 3 (memory)
23. Method

Event representation and similarity measures (maximum F-scores in parentheses):
- Location: string similarity; spatial/hierarchical similarity
- Event time: Euclidean distance
- Description: TF-IDF (0.80); semantic similarity (0.64)

Combination and classification: dynamic weighting (0.80) vs. static weighting; logistic regression (0.75), neural network (0.77), SVM (0.75).

Major research tasks:
- Identified a succinct, easily extractable event representation
- Tested location and description similarity measures
- Tested event similarity algorithms
- Tested performance on natural language and structured data sources
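A minimal sketch of combining per-field similarities into one event-similarity score; the static weights and decay constants here are invented for illustration, and the deck's dynamic weighting and trained classifiers (e.g., logistic regression) are not reproduced:

```java
public class EventSimilarity {
    record Event(double lat, double lon, long epochDay, double[] tfidf) {}

    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb) + 1e-12);
    }

    /** Weighted combination of location, time, and description similarity. */
    static double similarity(Event e1, Event e2) {
        double km = haversineKm(e1.lat(), e1.lon(), e2.lat(), e2.lon());
        double locSim = Math.exp(-km / 50.0);   // decay over ~50 km (invented)
        double timeSim = Math.exp(
            -Math.abs(e1.epochDay() - e2.epochDay()) / 7.0);
        double descSim = cosine(e1.tfidf(), e2.tfidf());
        return 0.3 * locSim + 0.2 * timeSim + 0.5 * descSim; // invented weights
    }

    static double haversineKm(double lat1, double lon1,
                              double lat2, double lon2) {
        double r = 6371, dLat = Math.toRadians(lat2 - lat1),
               dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                 * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * r * Math.asin(Math.sqrt(a));
    }

    public static void main(String[] args) {
        // Two records of the same incident (coordinates approximate; day
        // 13975 is roughly 2008-04-06 as an epoch-day count).
        Event gtd  = new Event(36.34, 43.13, 13975, new double[]{0.5, 0.8, 0.1});
        Event wits = new Event(36.55, 43.00, 13975, new double[]{0.6, 0.7, 0.2});
        System.out.printf("similarity = %.3f%n", similarity(gtd, wits));
    }
}
```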
24. Matching the same event across datasets:

GTD 200804060007: "04/06/2008: On Sunday, unknown gunmen set up a fake checkpoint and intercepted two college buses, one carrying male students and one carrying female students, in Mosul, Nineveh province, Iraq. The bus carrying the female students managed to escape but the gunmen held the 42 male college students…"

WITS 200804509: "On 6 April 2008, in the morning, in Jurn, Ninawa, Iraq, armed assailants stopped two school buses carrying students to Mosul University at a fake checkpoint. The assailants then fired upon one of the busses as it managed to escape, wounding three students and damaging the bus. Assailants kidnapped all 42 students on board the second bus…"

[Map: Jurn lies about 25 km from Mosul.] Jurn ≈ Mosul, yet Gaza ≠ Sderot: close distance ≠ similarity.
25. Processing Pipelines for a Speed vs. Quality Decision

[Diagram: text files in an RDF input directory (ontology models 1-4) flow either to local solvers, FastestEntityResolutionSolverLocal.java or LREntityResolutionSolverLocal.java, which implement JavaJobRunner, or through EntityResolutionSubproblemConstruction.java (implements JavaJobRunner but runs MapReduce jobs), which writes subproblems such as (1,2) and (3,4) to a subproblem directory for the MapReduce solvers FastestEntityResolutionSolverMR.java and LREntityResolutionSolverMR.java (implement MapReduceJobRunner). All paths emit a new ontology model to the RDF output directory, associating persons, locations, organizations, dates, and artifacts.]
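The JavaJobRunner / MapReduceJobRunner naming suggests a small runner abstraction behind the speed-vs-quality choice. The interface below is a hypothetical sketch, since the deck shows only the class names:

```java
import java.nio.file.Path;

/** Hypothetical runner abstraction behind the speed-vs-quality pipelines. */
interface JobRunner {
    void run(Path rdfInputDir, Path rdfOutputDir) throws Exception;
}

/** Single-process solver: fastest, lower-quality resolution (sketch). */
class FastestEntityResolutionSolverLocal implements JobRunner {
    @Override
    public void run(Path in, Path out) {
        // Would do greedy pairwise matching over the input models.
        System.out.println("Local fast ER: " + in + " -> " + out);
    }
}

/** MapReduce solver: Lagrangian-relaxation quality at cluster scale (sketch). */
class LREntityResolutionSolverMR implements JobRunner {
    @Override
    public void run(Path in, Path out) {
        // Would submit Hadoop jobs over the subproblem splits.
        System.out.println("MapReduce LR ER: " + in + " -> " + out);
    }
}

public class PipelineChooser {
    public static void main(String[] args) throws Exception {
        boolean preferSpeed = true; // the speed vs. quality decision
        JobRunner runner = preferSpeed
            ? new FastestEntityResolutionSolverLocal()
            : new LREntityResolutionSolverMR();
        runner.run(Path.of("rdf-in"), Path.of("rdf-out"));
    }
}
```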
26. Method
Lagrangian relaxation of an integer programming formulation of the clustering problem. The algorithm iteratively adjusts scores to resolve inconsistencies, and also provides a performance guarantee (optimality gap) on the solutions.

[Example: three records with pairwise scores P1-P2 = 55, P1-P3 = 65, P2-P3 = -85. Charts: objective value versus iteration number (1-46), and run time per iteration in minutes versus number of processors (0-16).]
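For reference, a standard integer-programming form of such a clustering problem and its Lagrangian relaxation (a sketch consistent with the slide's description, not taken from the deck): let $w_{ij}$ be the pairwise match score and $x_{ij} \in \{0,1\}$ indicate that records $i$ and $j$ are co-clustered.

```latex
\max_{x_{ij} \in \{0,1\}} \sum_{i<j} w_{ij}\, x_{ij}
\qquad \text{s.t.} \quad x_{ij} + x_{jk} - x_{ik} \le 1 \quad \forall\, i, j, k
```

Dualizing the transitivity constraints with multipliers $\lambda_{ijk} \ge 0$ gives

```latex
L(\lambda) = \max_{x_{ij} \in \{0,1\}} \sum_{i<j} w_{ij}\, x_{ij}
 - \sum_{i,j,k} \lambda_{ijk} \left( x_{ij} + x_{jk} - x_{ik} - 1 \right)
```

which decomposes over the $x_{ij}$. Subgradient updates on $\lambda$ iteratively adjust the effective pairwise scores to resolve inconsistencies, and $L(\lambda)$ upper-bounds the optimum, yielding the reported optimality gap.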
27. Results

[Diagram: clustering AIRS search results. Among 300 distinct results, clusters of similar content (e.g., an Arrest cluster and a Trial cluster) group similar information together.]
28. Analyst Context and Current State
- An analyst may come to the system with some information:
  - "There was a Terrorist Act at time X"
  - "I am interested in this suspected Insurgent"
  - "I want to know about a relationship between groups A and B"
- Initial queries may produce statements aligned with the CTO

Abductive Requery is applied:
- Select weighted fragments whose bound variables match CTO elements used in Context/State
- Select the rules those fragments correspond to, weighting by the selected fragments
- Combine the rule statements with the known Context/State
- Produce a subsequent query with known values 'filled in'

Context:
  "Jane Doe" wife "John Doe"

Fragment 1:
  ?p1 wife ?p2 .

Rule:
  CONSTRUCT {
    ?p1 wife ?p2 .
    ?p2 husband ?p1 .
  }
  WHERE {
    ?p1 bride ?w1 .
    ?p2 groom ?w1 .
    ?w1 rdf:type Wedding .
  }

Produced query:
  SELECT ?w1
  WHERE {
    "Jane Doe" bride ?w1 .
    "John Doe" groom ?w1 .
    ?w1 rdf:type Wedding .
  }
29. Analytics Data Flow (repeat of the slide 8 diagram).
30. 
- Developed on the Hadoop/MapReduce framework
- Distributed services used in AIRS:
  - Algorithms are written within the MapReduce and HDFS (file system) environment; single-threaded algorithms are a single-"slot" algorithm
  - Oozie is the workflow coordination service; all jobs are monitored, dispatched, and logged
  - HBase and HDFS are used as distributed data stores for document metadata and RDF graphs

[Stack diagram: the AIRS software runs over the HBase and MySQL databases, the Oozie workflow coordination service, and the MapReduce processing framework, all on the Hadoop Distributed File System (HDFS) over the server/cluster hardware.]
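For flavor, here is a minimal Hadoop MapReduce job of the shape AIRS algorithms would take in this environment: counting triples per predicate over line-oriented N-Triples input. The job is a generic sketch (the naive whitespace split is an assumption), not an AIRS algorithm:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PredicateCount {
    public static class PredicateMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        @Override
        protected void map(LongWritable key, Text line, Context ctx)
                throws IOException, InterruptedException {
            // Naive N-Triples split: subject predicate object .
            String[] parts = line.toString().split("\\s+");
            if (parts.length >= 3) {
                ctx.write(new Text(parts[1]), ONE); // emit the predicate
            }
        }
    }

    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> vals, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : vals) sum += v.get();
            ctx.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "predicate-count");
        job.setJarByClass(PredicateCount.class);
        job.setMapperClass(PredicateMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```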
31. SELECT DISTINCT ?personNameText
WHERE
{
  ?act rdf:type event:Act .
  ?act ro:has_participant ?person .
  ?person rdf:type agent:Person .
  ?person ero:designated_by ?personName .
  ?personName ero:bearer_of ?personNameBearer .
  ?personNameBearer info:has_text_value ?personNameText .
}

Execution proceeds pattern by pattern:
1. Initial query: ?act rdf:type event:Act
2. Merging query: ?act ro:has_participant ?person
3. Merging query: ?person rdf:type agent:Person
4. Merging query: ?person ero:designated_by ?personName
5. Merging query: ?personName ero:bearer_of ?personNameBearer
6. Merging query: ?personNameBearer info:has_text_value ?personNameText
7. Distinct query step: filter to distinct ?personNameText values, save a result iterator, and return the results to the user
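A rough sketch of this merging strategy: variable bindings are joined against each successive triple pattern. The in-memory triple store and plain string terms are simplifications, not the AIRS implementation:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MergingQuery {
    record Triple(String s, String p, String o) {}

    /** Merges bindings with matches for a (term1 pred term2) pattern;
     *  terms starting with '?' are variables, others are constants. */
    static List<Map<String, String>> merge(List<Map<String, String>> bindings,
                                           List<Triple> store,
                                           String term1, String pred, String term2) {
        List<Map<String, String>> out = new ArrayList<>();
        for (Map<String, String> b : bindings) {
            for (Triple t : store) {
                if (!t.p().equals(pred)) continue;
                if (!matches(term1, t.s(), b) || !matches(term2, t.o(), b)) continue;
                Map<String, String> nb = new HashMap<>(b);
                if (term1.startsWith("?")) nb.put(term1, t.s());
                if (term2.startsWith("?")) nb.put(term2, t.o());
                out.add(nb);
            }
        }
        return out;
    }

    static boolean matches(String term, String value, Map<String, String> b) {
        if (term.startsWith("?")) {
            String bound = b.get(term);
            return bound == null || bound.equals(value);
        }
        return term.equals(value);
    }

    public static void main(String[] args) {
        List<Triple> store = List.of(
            new Triple("act1", "rdf:type", "event:Act"),
            new Triple("act1", "ro:has_participant", "person1"),
            new Triple("person1", "rdf:type", "agent:Person"));
        List<Map<String, String>> b = List.of(new HashMap<>());
        b = merge(b, store, "?act", "rdf:type", "event:Act");
        b = merge(b, store, "?act", "ro:has_participant", "?person");
        b = merge(b, store, "?person", "rdf:type", "agent:Person");
        System.out.println(b); // [{?act=act1, ?person=person1}]
    }
}
```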
32. "Raw" Algorithms
- Accept Model Query
- Data Association Query
- Data Association Only Query
- Ingestion Query
- Ingestion Only
- SPARQL Translation

[Diagram: an AIRS query flows through in-process structured translation, data association, and structured ingestion.]

"Secondary" Algorithms
- Cluster Results
- Extract All Organizations
- Extract All Persons
- Filter By Date
- Find Events
- Topic Filters (32 variants): leadership, corruption, dirty bombs, drugs, etc.
33. Probe Task - Wrapper Algorithms

[Chart: total wrapper lines of code per probe task (tasks 1-20), ranging up to roughly 1,400 lines.]

- Total lines: 13,958*
  - Wrapper code: 6,778 (49%)
  - Implementation code: 3,186 (23%)*
  - Validation code: 3,994 (29%)

* Less code developed before Test & Evaluation
34. Task: Find Life Events of an Individual

[Timeline, days 0-5: tune life event extraction (NLP & SDA), then develop the algorithm (glue code) to align events. New analytic capabilities in days.]
35. Over 1,200 workflows were issued by analysts over a 3-day period.
36. Cluster Monitoring (Ganglia)
• System Load
• CPU Usage
• Memory Usage
• Network Bandwidth
37. • Fast translation technologies for structured and unstructured data
• Many analytics successes, with more to come in Phase 3
• All open-source software, written entirely in Java
• Full Government Purpose Rights
• Installation manual and user manual ready to go
- Left side shows KDD-developed resources; right side shows the CUBRC KDD architecture. Data flows through the system as follows:
- Visualization: The analyst issues a query through the visualization tool.
- Query Expansion: Expands the query using the semantic index developed during alignment; the expanded queries are ranked by relevance to the original question.
- Query Execution: The query is executed against the data services API; the aligned data models negotiate which sources the query executes against.
- Graph Creation: Query results are passed to the unstructured and structured graph-creation modules, a highly parallelized process in which the Hadoop framework is leveraged to create and assemble RDF graphs quickly.
- Global Model: Graphs are written as part of a global model that connects all graphs for all analyst queries, yet leaves them easily segregable by query, analyst, date of creation, data sources used, etc.
- Association: Takes an RDF graph and associates all people, places, events, etc. that are the same. Again, a highly parallelized process in which the Hadoop framework is leveraged to reduce the overall comparison time between entities when determining similarity.
- Task Answering: Dirty graph-matching techniques use templates to find specific answers to analyst questions, employing fuzzy computations to determine answers when the underlying graph structure is missing information.
- Re-query: Finds the likely queries an analyst would want to issue next. For example, if the graph indicated that someone worked as a software engineer but not which company the person worked for, abductive re-query would suggest a query to search for the name of the company.
- Visualization: The analyst is presented with the answers in the visualization tool, along with re-query options.
1. Ontology (the backbone of this project)
-- Why is an ontology important? It speaks the language.
-- Here are our ontologies.
-- Here is the data that we have developed.
-- Maybe some statistics on the explosion of data.
-- How overlaying a model to truly network information together is the best approach.
-- Show the exotic queries from Phase 1; very, very powerful.
-- A query can go from the raw data to the extracted types.
This architecture will allow us to integrate more machine-learning algorithms and create a hybrid system for producing alignment predictions.
- Supports weighting of the alignment learners.
- A learner can itself be a Mega-Learner, so the system supports multiple levels of prediction.
- All learners can utilize the data contained within the Learner Context.
- Each learner posts its alignment result and score to the Alignment Data Cube for other learners to access if needed.
- The Alignment Data Cube is similar in architecture to a data cube used in data mining.
- All scores are normalized between 0 and 1.
- Data Value Characterization: a regex determines the overall categorization of the data in the column.
- Lucene-Based Alignment: a TF/IDF-based learner that utilizes WordNet to expand the search terms.
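A minimal sketch, assuming a weighted-average combiner over the learners' normalized [0, 1] scores; the AlignmentLearner interface and the two example learners are hypothetical, standing in for the regex data-value characterizer and the Lucene TF/IDF learner described above:

import java.util.*;

public class HybridAligner {
    interface AlignmentLearner {
        // Returns a normalized score in [0, 1] that sourceColumn aligns to targetElement.
        double score(String sourceColumn, String targetElement);
    }

    private final Map<AlignmentLearner, Double> weights = new LinkedHashMap<>();

    void register(AlignmentLearner learner, double weight) {
        weights.put(learner, weight);
    }

    // Weighted average of the learners' scores; in the real system each raw
    // score would also be posted to the Alignment Data Cube for other learners.
    double combinedScore(String sourceColumn, String targetElement) {
        double total = 0, weightSum = 0;
        for (var e : weights.entrySet()) {
            total += e.getValue() * e.getKey().score(sourceColumn, targetElement);
            weightSum += e.getValue();
        }
        return weightSum == 0 ? 0 : total / weightSum;
    }

    public static void main(String[] args) {
        HybridAligner aligner = new HybridAligner();
        // Regex-style data-value characterizer (hypothetical pattern for dates).
        aligner.register((col, tgt) ->
            col.matches("\\d{4}-\\d{2}-\\d{2}.*") && tgt.equals("Date") ? 1.0 : 0.0, 0.4);
        // Stand-in for a Lucene/TF-IDF similarity score.
        aligner.register((col, tgt) -> 0.7, 0.6);
        System.out.println(aligner.combinedScore("2015-06-01 ...", "Date")); // 0.82
    }
}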