1. Mining and Utilizing Dataset Relevancy from Oceanographic Dataset (MUDROD): Metadata, Usage Metrics, and User Feedback to Improve Data Discovery and Access
Dr. Chaowei (Phil) Yang (PI, George Mason University)
Mr. Edward M. Armstrong (Co-I, NASA Jet Propulsion Laboratory)
AIST-14-82, NNX15AM85G
Final Review, June 22, 2017
2. Mining and Utilizing Dataset Relevancy from Oceanographic Data (MUDROD)
PI: Chaowei (Phil) Yang, George Mason University
Co-Is: E. Armstrong, T. Huang, D. Moroni, JPL
AIST-14-82, NNX15AM85G; TRLin = 5
Objective
• Improve data discovery, selection, and access to NASA observational data.
• Provide an intuitive interface to federated data holdings.
• Enable new user communities to discover and access data for their projects.
• Reduce the time for scientists to discover, download, and reformat data.
• Implement an extensible ontology framework.
• Improve discovery accuracy of oceanographic data.
• Build a foundation for managing Big Data.
• Demonstrate MUDROD at PO.DAAC.
Approach
• Set up the collaboration and testing environment.
• Design the MUDROD knowledge base system.
• Reconstruct sessions from web logs.
• Calculate vocabulary similarity from logs and metadata.
• Update semantic search and conduct alpha testing.
• Integrate MUDROD alpha into PO.DAAC.
• Enhance the knowledge base to include GCMD.
• Demonstrate the prototype.
Key Milestones
• Start 06/15
• Identify use cases 01/16
• Design search, query, reasoning engine 03/16
• Ontological system implementation 07/16
• Complete beta test at PO.DAAC 12/16
• Integrated test 02/17
• PO.DAAC metadata discovery demo (TRL 7) 05/17
(Slide graphic: MUDROD Engine Architecture)
3. Mining and Utilizing Dataset Relevancy from Oceanographic Data (MUDROD)
PI: Chaowei (Phil) Yang, George Mason University
Co-Is: T. Huang, D. Moroni, E. Armstrong, JPL
AIST-14-0082; TRLin = 5, TRLout = 7
Objective
• Improve data discovery, selection, and access to NASA observational data.
• Provide an intuitive interface to federated data holdings.
• Enable new user communities to discover and access data for their projects.
• Reduce the time for scientists to discover, download, and reformat data.
• Implement an extensible ontology framework.
• Improve discovery accuracy of oceanographic data.
• Build a foundation for managing Big Data.
• Demonstrate MUDROD at PO.DAAC.
Accomplishments
• Developed a framework to:
– Analyze web logs to discover user knowledge (query and data relationships)
– Construct a knowledge base by combining semantics and a profile analyzer
– Improve data discovery
• Resulted in better ranking, recommendation, and ontology navigation
4. Team Members: Investigators
Dr. Chaowei Phil Yang (Professor and Director, NSF Spatiotemporal Innovation Center, George Mason University)
Mr. Thomas Huang (Principal Scientist, NASA Jet Propulsion Laboratory)
Mr. Edward M. Armstrong (Senior Data Engineer, NASA Jet Propulsion Laboratory)
Mr. David Moroni (Data Engineer, NASA Jet Propulsion Laboratory)
5. Team Members: Students and Post-Docs
Ms. Yun Li (Ph.D. student, George Mason University; Research Assistant)
Mr. Yongyao Jiang (Ph.D. student, George Mason University; Research Assistant)
Dr. Lewis J. McGibbney (Data Scientist, NASA Jet Propulsion Laboratory)
Mr. Chris Finch (Software Engineer, NASA Jet Propulsion Laboratory)
Mr. Frank Greguska (Software Engineer, NASA Jet Propulsion Laboratory)
Mr. Gary Chen
6. Presentation Contents
• Background / Objectives / Technical Advancements
• TRL assessment
• Summary of Accomplishments and Plans
• Schedule
• Financials
• Publications
• List of Acronyms
7. Background / Objectives / Tech Advance (1): Data Discovery Problems
• Keyword-based matching (traditional search engines)
– User query: ocean wind
– Final query: ocean AND wind
• Reveal the real intent of a user query
– ocean wind = "ocean wind" OR "greco" OR "surface wind" OR "mackerel breeze" …
• PO.DAAC UWG Recommendation 2014-07
• NASA ESDSWG Search Relevance Recommendations 2016 & 2017
8. Background / Objectives / Tech Advance (2): Objectives
• Analyze web logs to discover user knowledge (query and data relationships)
• Construct a knowledge base by combining semantics and a profile analyzer
• Improve data discovery by 1) better ranking; 2) recommendation; 3) ontology navigation
9. Background / Objectives / Tech Advance (3): Tech Advance (four technological modules)
• Web log preprocessing
• Semantic analysis of user queries & navigation
• Machine learning based search ranking
• Data recommendation
10. Presentation Contents
• Background and Objectives
• TRL assessment
• Summary of Accomplishments and Plans
• Schedule
• Financials
• Publications
• List of Acronyms
11. TRL Assessment (1): Accomplishments Since Last Review
• Performance improvement
– Improved log mining performance using cloud computing
• Hybrid recommendation algorithm
• Stemmed MUDROD ontologies from a number of sources
– Created an OWL manifestation based on the NASA GCMD-DIF data model
– Created a PO.DAAC-specific extension of the SWEET concept
• Ingested PO.DAAC extractions into the ESIP Ontology Portal
• Improved the web UI
• Tested MUDROD and ingested the latest PO.DAAC logs
• Deployed MUDROD on a JPL server
12. TRL Assessment (2): Current TRL Rationale
Component | Current TRL | Project-end TRL | Description
Semantic search engine:
– Search Dispatcher | 7 | 7 | Translates a user search query into a set of new semantic queries
– Similarity calculator | 7 | 7 | Calculates semantic similarity from web logs, metadata, and ontology
– Recommendation module | 7 | 7 | Recommends datasets similar to the clicked dataset
– Ranking module | 7 | 7 | Re-ranks the search results with the RankSVM ML algorithm
Knowledge base:
– Ontology | 7 | 7 | Extensions of the SWEET ontology for Earth science data
– Triple store | 7 | 7 | ESIP ontology repository
Vocabulary linkage discovery engine:
– Profile analyzer | 7 | 7 | Extracts user browsing patterns from raw web logs
Web services/GUI:
– Ranking service/presenter | 7 | 7 | Provides and presents the ranked results
– Recommendation service/presenter | 7 | 7 | Provides and presents the related datasets
– Ontology navigation service/presenter | 7 | 7 | Provides and presents related searches
13. Presentation Contents
• Background and Objectives
• TRL assessment
• Summary of Accomplishments and Plans
• Schedule
• Financials
• Publications
• List of Acronyms
14. Overview: Functions/Modules
• Web log preprocessing
• Semantic analysis of user queries & navigation
• Machine learning based search ranking
• Data recommendation
15. Web log processing
16. Web logs
• Requests sent from a client (e.g., browser, command-line tool) are recorded by the server
• Log files provided by PO.DAAC (HTTP(S), FTP)
Example access-log entry:
68.180.228.99 - - [31/Jan/2015:23:59:13 -0800] "GET /datasetlist/... HTTP/1.1" 200 84779 "/ghrsst/" "Mozilla/5.0 ..."
Fields:
• Client IP: 68.180.228.99
• Request date/time: [31/Jan/2015:23:59:13 -0800]
• Request: "GET /datasetlist/... HTTP/1.1"
• HTTP code: 200
• Bytes returned: 84779
• Referrer/previous page: "/ghrsst/"
• User agent/browser: "Mozilla/5.0 ..."
(An illustrative parsing sketch follows.)
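MUDROD itself is implemented in Java (see the Software status slide); purely as an illustration, a minimal Python sketch of parsing a combined-format log line like the one above. The field names are my own labels, not MUDROD's:

```python
import re

# Combined Log Format: IP, identd, user, [time], "request", status, bytes,
# "referrer", "user agent".
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"')

def parse_log_line(line):
    """Return the fields of one access-log entry, or None if it does not match."""
    m = LOG_PATTERN.match(line)
    return m.groupdict() if m else None

line = ('68.180.228.99 - - [31/Jan/2015:23:59:13 -0800] '
        '"GET /datasetlist/... HTTP/1.1" 200 84779 "/ghrsst/" "Mozilla/5.0 ..."')
print(parse_log_line(line))
```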
17. Data preprocessing
Goal: reconstruct user browsing patterns (search history & clickstream) from a set of raw logs.
Pipeline: Web logs → User identification → Crawler detection → Structure reconstruction → Session identification → Search history & Clickstream
Additional steps include word normalization, stop-word removal, and stemming.
(A session-identification sketch follows.)
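As an illustration of the session identification step (not MUDROD's Java implementation), a common approach is to split one user's requests into sessions wherever the gap between consecutive requests exceeds an inactivity timeout; the 30-minute value below is a conventional choice, not necessarily MUDROD's:

```python
from datetime import datetime, timedelta

# Hypothetical sketch: group one user's timestamped requests into sessions
# using a 30-minute inactivity timeout.
TIMEOUT = timedelta(minutes=30)

def sessionize(requests):
    """requests: list of (timestamp, url) sorted by timestamp -> list of sessions."""
    sessions, current, last_time = [], [], None
    for ts, url in requests:
        if last_time is not None and ts - last_time > TIMEOUT:
            sessions.append(current)  # inactivity gap: close the running session
            current = []
        current.append((ts, url))
        last_time = ts
    if current:
        sessions.append(current)
    return sessions

reqs = [(datetime(2015, 1, 31, 23, 59), "/datasetlist/"),
        (datetime(2015, 2, 1, 0, 5), "/dataset/GHRSST-X"),
        (datetime(2015, 2, 1, 9, 0), "/search?q=ocean+wind")]
print(len(sessionize(reqs)))  # -> 2 sessions
```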
18. Reconstructed session structure
19. Data preprocessing results
1. User search history; 2. Clickstream
Jiang, Y., Y. Li, C. Yang, E. M. Armstrong, T. Huang & D. Moroni (2016) Reconstructing Sessions from Data Discovery and Access Logs to Build a Semantic Knowledge Base for Improving Data Discovery. ISPRS International Journal of Geo-Information, 5, 54.
20. Semantic similarity
21. Existing ontology (SWEET)
• SWEET (Raskin and Pan 2003)
• Focuses on only two relations
• The closer two concepts sit in the ontology, the more similar they are
22. User search history
• Create a query–user matrix
• Calculate binary cosine similarity between queries
(Slide shows a conceptual example; an illustrative sketch follows.)
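A minimal numpy sketch of this step (illustrative only; the queries and matrix entries are made up): each query is a binary vector over the users who issued it, and similarity is the cosine of two such vectors.

```python
import numpy as np

# Hypothetical query-user matrix: rows = queries, columns = users,
# entry 1 if that user ever issued that query.
queries = ["ocean temperature", "sea surface temperature", "gravity"]
M = np.array([[1, 1, 0, 1],
              [1, 1, 0, 0],
              [0, 0, 1, 0]], dtype=float)

def binary_cosine(a, b):
    """Cosine of two binary vectors: |a AND b| / sqrt(|a| * |b|)."""
    return (a @ b) / np.sqrt(a.sum() * b.sum())

print(binary_cosine(M[0], M[1]))  # queries sharing users score high (~0.82)
print(binary_cosine(M[0], M[2]))  # no shared users -> 0.0
```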
23. Clickstream
• Hypothesis: similar queries result in similar clicking behavior
• If two queries are similar, the data that get clicked after they are searched are more likely to be similar
(Slide graphic: Query a / Query b → Data → Similar?)
24. Metadata
• Hypothesis: semantically related terms tend to appear in the same metadata records more frequently
• Essentially the same idea as the clickstream analysis
• Perform Latent Semantic Analysis (LSA) over the term–metadata matrix (sketch follows)
(Slide graphic: Query a / Query b → Metadata → Similar?)
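An illustrative LSA sketch over a tiny hypothetical term–metadata matrix (the terms and counts are invented): truncated SVD projects terms into a low-rank latent space where terms with similar co-occurrence patterns end up close together.

```python
import numpy as np

# Hypothetical term-metadata matrix: rows = vocabulary terms,
# columns = metadata records, entries = occurrence counts.
terms = ["sst", "sea surface temperature", "salinity"]
X = np.array([[2, 0, 3, 1],
              [1, 0, 2, 1],
              [0, 4, 0, 0]], dtype=float)

# Truncated SVD: keep the k largest singular values; term vectors are U_k * S_k.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
term_vecs = U[:, :k] * s[:k]

def cosine(a, b):
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(term_vecs[0], term_vecs[1]))  # "sst" vs "sea surface temperature": high
print(cosine(term_vecs[0], term_vecs[2]))  # "sst" vs "salinity": low
```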
25. Integration
• All four results (search history, clickstream, metadata, ontology) can be converted to triples of the form (Concept A, Concept B, Similarity)
• Problem:
– None of them is perfect (uncertainty in the data, hypotheses, and methods)
– Metadata and ontology may contain terms unknown to search-engine end users
– Similarity values from different methods are sometimes inconsistent
26. Integration
• Take the maximum similarity across all of the components (a large similarity appears to be more reliable)
• The adjustment increment becomes larger when the similarity exists in more sources
(One possible reading of this rule is sketched below.)
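The slide leaves the exact formula unstated; the following is only a hedged sketch of one plausible reading, where the 0.05 boost per additional agreeing source is my own assumption, not a MUDROD constant:

```python
def integrate(similarities, boost=0.05):
    """similarities: per-source scores for one concept pair, e.g.
    {"search_history": 0.66, "clickstream": 0.94, "metadata": 0.72}.
    Assumed scheme: start from the maximum (treated as most reliable) and
    add a small increment for every additional source reporting the pair."""
    present = [s for s in similarities.values() if s is not None]
    score = max(present) + boost * (len(present) - 1)
    return min(score, 1.0)  # cap at 1

pair = {"search_history": 0.66, "clickstream": 0.94,
        "metadata": 0.72, "ontology": None}
print(integrate(pair))  # 0.94 + 2 * 0.05, capped at 1.0
```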
27. Semantic Similarity Calculation Workflow
28. Results and evaluation
Example: related-term lists for the query "ocean temperature"
Search history | sea surface temperature (0.66), sea surface topography (0.56), ocean wind (0.56), aqua (0.49)
Clickstream | sea surface temperature (0.94), sst (0.94), group high resolution sea surface temperature dataset (0.89), ghrsst (0.87)
Metadata | sst (0.96), ghrsst (0.77), sea surface temperature (0.72), surface temperature (0.63), reynolds (0.58)
SWEET | none
Integrated list | sst (1.0), sea surface temperature (1.0), ghrsst (1.0), group high resolution sea surface temperature dataset (0.99), reynolds sea surface temperature (0.74)
Overall accuracy, judged by domain experts:
Sample group | Overall accuracy
Most popular 10 queries | 88%
Least popular 10 queries | 61%
Randomly selected 10 queries | 83%
Jiang, Y., Y. Li, C. Yang, K. Liu, E. M. Armstrong, T. Huang & D. Moroni (2017) A Comprehensive Approach to Determining the Linkage Weights among Geospatial Vocabularies - An Example with Oceanographic Data Discovery. International Journal of Geographical Information Science (minor revision)
29. Search ranking
30. Background
• Ranking is a long-standing problem in geospatial data discovery
• A query typically returns hundreds, even thousands, of matches
• Result sets can only grow as more Earth observation data are collected
31. Objective and Methods
• Put the most desired data at the top of the result list
• What features can represent users' search preferences for geospatial data?
• How can the ranking function balance all of these features?
• Identified eleven features from:
– Geospatial metadata attributes
– Query–metadata content overlap
– User behavior from web logs
32. Ranking features – Metadata attributes
Feature | Description
Release date | The date when the data was published
Processing level (PL) | The processing level of image products, ranging from level 0 to level 4
Version number | The published version of the data
Spatial resolution | The spatial resolution of the data
Temporal resolution | The temporal resolution of the data
• Five metadata features, verified by domain experts
• Query-independent: static, depends on the data itself, and does not change with the query
33. Spatial query–metadata overlap
• Spatial similarity between the query area and the coverage of a particular dataset
• Overlap area normalized by the original areas of the query and the data (the 3-D volume version of this ratio is sketched after the spatiotemporal similarity slide below)
34. Ranking features – User behavior
• All-time popularity, monthly popularity, user popularity, and semantic popularity (retrieved from web logs)
• Semantic popularity: the number of times the data has been clicked after a search for a particular query or its highly related queries (query-dependent)
35. RankSVM
• One of the well-recognized ML ranking algorithms
• Converts a ranking problem into a classification problem that a regular SVM algorithm can solve
• Three main steps (sketched below):
1) Standardize features (mean = 0, std = 1)
– SVM is not scale invariant; unscaled features can be over-optimized and take longer to train
2) For every pair of training examples, calculate the feature difference
3) The ranking problem becomes a binary classification problem, and SVM finds the optimal decision boundary
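A hedged sketch of the pairwise transformation on toy data (MUDROD's own implementation uses Spark MLlib in Java; the two features and relevance labels here are invented). A linear SVM trained on pairwise feature differences yields a weight vector that scores items directly:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Toy example: X = feature vectors of datasets returned for one query,
# y = graded relevance labels (higher = more relevant).
X = np.array([[0.9, 120.0], [0.2, 15.0], [0.7, 300.0], [0.1, 5.0]])
y = np.array([3, 1, 2, 0])

X = StandardScaler().fit_transform(X)  # step 1: mean 0, std 1

# Step 2: feature differences for every pair with different labels;
# class +1 if the first item should rank above the second, else -1.
diffs, labels = [], []
for i in range(len(X)):
    for j in range(len(X)):
        if y[i] != y[j]:
            diffs.append(X[i] - X[j])
            labels.append(1 if y[i] > y[j] else -1)

# Step 3: a linear SVM on the differences learns a weight vector w;
# scoring any item by w . x recovers a ranking function.
svm = LinearSVC(C=1.0).fit(np.array(diffs), np.array(labels))
scores = X @ svm.coef_.ravel()
print(np.argsort(-scores))  # item indices from predicted best to worst
```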
36. Architecture
• All of these steps (except training) finish within 2 seconds
• No mainstream open-source ML library provided a ranking algorithm
• Implemented it ourselves with the aid of Spark MLlib
(Slide diagram: user query → semantic query via the knowledge base → top-K retrieval from the index → feature extractor fed by web logs and user clicks → ranking model, trained offline by the learning algorithm on training data → re-ranked results.)
37. NDCG(K) for five different ranking methods at varying K (1–40)
Jiang, Y., Y. Li, C. Yang, K. Liu, E. M. Armstrong, T. Huang, D. Moroni & L. J. McGibbney (2017) Towards intelligent geospatial discovery: a machine learning ranking framework. International Journal of Digital Earth (minor revision)
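For reference, a compact sketch of the NDCG@K metric used on this slide, following its standard definition (the toy relevance list is made up):

```python
import numpy as np

def ndcg_at_k(relevances, k):
    """NDCG@K: DCG of the ranking's top-k graded relevances, normalized by
    the DCG of the ideal (descending-sorted) ordering of the same items.
    relevances[i] is the relevance of the item ranked at position i."""
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = 1.0 / np.log2(np.arange(2, rel.size + 2))  # 1 / log2(rank + 1)
    dcg = (rel * discounts).sum()
    ideal = np.sort(np.asarray(relevances, dtype=float))[::-1][:k]
    idcg = (ideal * discounts[:ideal.size]).sum()
    return dcg / idcg if idcg > 0 else 0.0

print(ndcg_at_k([3, 2, 3, 0, 1], k=5))  # 1.0 only for a perfectly ordered list
```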
38. Data recommendation
39. How to recommend geospatial data?
• Use geospatial metadata for content-based recommendation
– Metadata spatiotemporal similarity
– Metadata attribute similarity
– Metadata content similarity
• Leverage user behavior data for collaborative-filtering (CF) recommendation
– Session-based co-occurrence of data
40. Geographic metadata
Attribute type | Attribute name | Attribute description
Spatiotemporal attributes | DatasetCoverage-EastLon | The east longitude of the bounding rectangle
Spatiotemporal attributes | DatasetCoverage-WestLon | The west longitude of the bounding rectangle
Spatiotemporal attributes | DatasetCoverage-NorthLat | The north latitude of the bounding rectangle
Spatiotemporal attributes | DatasetCoverage-SouthLat | The south latitude of the bounding rectangle
Spatiotemporal attributes | DatasetCoverage-StartTimeLong | The start time of the data
Spatiotemporal attributes | DatasetCoverage-StopTimeLong | The end time of the data
Categorical geographic attributes | DatasetRegion-Region | Region of the dataset, such as global or Atlantic
Categorical geographic attributes | Dataset-ProjectionType | Projection type, e.g., cylindrical lat-lon
Categorical geographic attributes | Dataset-ProcessingLevel | Data processing level
Categorical geographic attributes | DatasetPolicy-DataFormat | Data format, e.g., HDF, NetCDF
Categorical geographic attributes | DatasetSource-Sensor-ShortName | Short name of the sensor
Ordinal geographic attributes | Dataset-TemporalResolution | Temporal resolution of the dataset
Ordinal geographic attributes | Dataset-TemporalRepeat | Temporal repeat of the dataset
Ordinal geographic attributes | Dataset-SpatialResolution | Spatial resolution of the dataset
Descriptive attributes | Dataset-description | Describes the content of the dataset
41. Spatiotemporal similarity
• Spatial variables: NorthLat, SouthLat, WestLon, EastLon
• Temporal variables: DatasetCoverage-StartTimeLong, StopTimeLong
• Use the volume overlap ratio to calculate similarity (sketch below):

$$\mathit{spatiotemporal\_sim}(r_i, r_j) = 0.5 \left( \frac{\mathit{volume}(r_i \cap r_j)}{\mathit{volume}(r_i)} + \frac{\mathit{volume}(r_i \cap r_j)}{\mathit{volume}(r_j)} \right)$$

$$\mathit{volume}(r) = (\mathit{eastlon} - \mathit{westlon}) \times (\mathit{northlat} - \mathit{southlat}) \times (\mathit{endtime} - \mathit{starttime})$$
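A direct sketch of this volume overlap ratio in Python (illustrative; the global/tropical example boxes are invented), treating each dataset as a lon x lat x time box:

```python
def volume(r):
    """r = (westlon, eastlon, southlat, northlat, starttime, endtime);
    times as plain numbers (e.g., epoch seconds)."""
    return (r[1] - r[0]) * (r[3] - r[2]) * (r[5] - r[4])

def intersection(ri, rj):
    w, e = max(ri[0], rj[0]), min(ri[1], rj[1])
    s, n = max(ri[2], rj[2]), min(ri[3], rj[3])
    t0, t1 = max(ri[4], rj[4]), min(ri[5], rj[5])
    if e <= w or n <= s or t1 <= t0:
        return None  # no spatiotemporal overlap
    return (w, e, s, n, t0, t1)

def spatiotemporal_sim(ri, rj):
    """Volume overlap ratio per the slide-41 formula, averaged over both datasets."""
    inter = intersection(ri, rj)
    if inter is None:
        return 0.0
    v = volume(inter)
    return 0.5 * (v / volume(ri) + v / volume(rj))

a = (-180, 180, -90, 90, 0, 100)   # global coverage, time 0-100
b = (-180, 180, -40, 40, 50, 100)  # partial spatial and temporal overlap
print(spatiotemporal_sim(a, b))    # ~0.61
```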
42. Categorical similarity
• Fixed number of values
• No intrinsic ordering
• Example, sensor-name: "AMSR-E", "MODIS", "AVHRR-3", and "WindSat"

$$\mathit{categorical\_var\_sim}(v_i, v_j) = \frac{|v_i \cap v_j|}{|v_i \cup v_j|}$$
43. Ordinal similarity
An ordinal attribute is similar to a categorical attribute, but its values have a clear order, e.g., spatial resolution.
• Convert the values into ranks from 1 to R
• Normalize these ranks for the similarity calculation (sketch below)

$$\mathit{ordinal\_var\_sim}(v_i, v_j) = 1 - \left| \mathit{norm\_rank}(v_i) - \mathit{norm\_rank}(v_j) \right|$$

$$\mathit{norm\_rank}(v_i) = \frac{\mathit{Rank}_{v_i} + 1}{R + 1}$$
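A small sketch of the two attribute similarities from slides 42 and 43 (illustrative; the absolute-value bars in the ordinal formula are my reconstruction of the garbled slide notation):

```python
def categorical_sim(vi, vj):
    """Jaccard similarity of two categorical value sets (slide 42)."""
    vi, vj = set(vi), set(vj)
    return len(vi & vj) / len(vi | vj)

def ordinal_sim(rank_i, rank_j, R):
    """Slide 43: normalize ranks (1..R) and compare their distance."""
    norm = lambda r: (r + 1) / (R + 1)
    return 1.0 - abs(norm(rank_i) - norm(rank_j))

print(categorical_sim({"AMSR-E", "MODIS"}, {"MODIS", "WindSat"}))  # 1/3
print(ordinal_sim(1, 3, R=5))  # nearby resolution ranks score higher
```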
44. Descriptive similarity
Step 1: Phrase extraction (an illustrative sketch follows)
1. Extract term candidates from the metadata description with POS (part-of-speech) tagging
2. Introduce "occurrence" and "strength" to filter terms from the candidates
– "occurrence": the number of occurrences of a term
– "strength": the number of words in a term
Original text:
Aquarius Level 3 sea surface salinity (SSS) standard mapped image data contains gridded 1 degree spatial resolution SSS averaged over daily, 7 day, monthly, and seasonal time scales. This particular data set is the seasonal climatology, Ascending sea surface salinity product for version 4.0 of the Aquarius data set, which is the official end of prime mission public data release from the AQUARIUS/SAC-D mission. Only retrieved values for Ascending passes have been used to create this product. The Aquarius instrument is onboard the AQUARIUS/SAC-D satellite, a collaborative effort between NASA and the Argentinian Space Agency Comision Nacional de Actividades Espaciales (CONAE). The instrument consists of three radiometers in push broom alignment at incidence angles of 29, 38, and 46 degrees relative to the shadow side of the orbit. Footprints for the beams are: 76 km (along-track) x 94 km (cross-track), 84 km x 120 km and 96 km x 156 km, yielding a total cross-track swath of 370 km. The radiometers measure brightness temperature at 1.413 GHz in their respective horizontal and vertical polarizations (TH and TV). A scatterometer operating at 1.26 GHz measures ocean backscatter in each footprint that is used for surface roughness corrections in the estimation of salinity. The scatterometer has an approximate 390 km swath.
Extracted terms:
Radiometers Measure Brightness Temperature, AQUARIUS/SAC Mission, Image Data, Broom Alignment, Resolution SSS, AQUARIUS/SAC Satellite, Scatterometer, Scatterometer, Aquarius Data, Argentinian Space Agency Comision Nacional, Incidence Angles, Time Scales, Actividades Espaciales, Cross-track Swath, Official End, Aquarius Instrument, Shadow Side, Ascending Sea Surface Salinity Product, Level, Level, Surface Roughness Corrections, Data Release, Salinity, Density, Salinity, Density, AQUARIUS, L3, SSS, SMIA, SEASONAL-CLIMATOLOGY, V4
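Illustrative only (MUDROD's extractor is written in Java): a minimal noun-phrase candidate extractor in the spirit of Step 1, using NLTK's POS tagger with a simple adjective-plus-noun chunk grammar. The grammar and thresholds are my assumptions, not the project's:

```python
import nltk
from collections import Counter

# Requires NLTK tokenizer and tagger data, e.g.:
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")
GRAMMAR = nltk.RegexpParser("NP: {<JJ>*<NN.*>+}")  # adjectives followed by nouns

def extract_phrases(text, min_occurrence=1, min_strength=1):
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    candidates = []
    for subtree in GRAMMAR.parse(tagged).subtrees(lambda t: t.label() == "NP"):
        candidates.append(" ".join(word for word, tag in subtree.leaves()))
    counts = Counter(candidates)
    # "occurrence" = how often the phrase appears; "strength" = words per phrase
    return [p for p, n in counts.items()
            if n >= min_occurrence and len(p.split()) >= min_strength]

text = ("The radiometers measure brightness temperature at 1.413 GHz. "
        "A scatterometer measures ocean backscatter in each footprint.")
print(extract_phrases(text, min_strength=2))
```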
45. Metadata abstract semantic similarity
Step 2: Represent each metadata record in the phrase vector space (whose dimension is lower than the word feature space)
          | Term 1 | Term 2 | Term 3 | Term 4 | Term 5 | Term 6 | Term 7 | … | Term N
dataset 1 |   1    |   0    |   1    |   0    |   0    |   0    |   0    |   |   1
dataset 2 |   0    |   0    |   0    |   1    |   1    |   0    |   0    |   |   0
…
dataset k |   1    |   0    |   1    |   0    |   0    |   0    |   0    |   |   1
Step 3: Calculate cosine similarity between the dataset vectors
46. Session-based recommendation
Calculate metadata similarity based on session-level co-occurrence (sketch below):

$$\mathit{Similarity}(i, j) = \frac{N(i \cap j)}{N(i) \cdot N(j)}$$

• N(i): the number of sessions in which dataset i was viewed or downloaded
• N(j): the number of sessions in which dataset j was viewed or downloaded
• N(i ∩ j): the number of sessions in which both datasets i and j were viewed or downloaded
Session–dataset matrix (rows = datasets, columns = sessions; 1 = viewed or downloaded):
       | Session 1 | Session 2 | … | Session N
Data 1 |     1     |     0     |   |     1
Data 2 |     0     |     0     |   |     0
…
Data k |     1     |     0     |   |     1
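A small sketch of this co-occurrence similarity on a hypothetical session–dataset matrix (the matrix entries are invented):

```python
import numpy as np

# Hypothetical session-dataset matrix: rows = datasets, columns = sessions,
# entry 1 if the dataset was viewed or downloaded in that session.
M = np.array([[1, 0, 1, 1],
              [0, 0, 0, 1],
              [1, 0, 1, 0]])

def cooccurrence_sim(M, i, j):
    """Slide 46 formula: N(i AND j) / (N(i) * N(j))."""
    n_i, n_j = M[i].sum(), M[j].sum()
    n_ij = (M[i] & M[j]).sum()
    return n_ij / (n_i * n_j) if n_i and n_j else 0.0

print(cooccurrence_sim(M, 0, 2))  # co-viewed in two sessions -> 2 / (3 * 2)
print(cooccurrence_sim(M, 1, 2))  # never co-viewed -> 0.0
```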
47. Hybrid recommendation
Recommendation method | Pros | Cons | Strategy
Descriptive similarity | Natural language processing methods can be adopted to find latent semantic relationships | Many datasets have nearly the same abstract with only a few words/values changed; it is hard to extract detailed attributes from the description | Used as the base method of the recommendation algorithm
Attribute similarity (spatiotemporal, ordinal, categorical) | As structured data, geographic metadata have many variables | Variable values may be null or wrong; the quality depends on the weight assigned to each variable | Supplement to the semantic similarity
Session co-occurrence | Reflects users' preferences | Cold-start problem: newly published data have no usage data | Fine-tunes the recommendation list

The hybrid score is a weighted combination (sketch below):

$$\mathit{Recommend}(i) = W_{ss} \, \mathit{DescriptiveSimilarity}(i) + \sum W_{cv} \, \mathit{CategoricalSimilarity}(i) + \sum W_{ov} \, \mathit{OrdinalSimilarity}(i) + W_{stv} \, \mathit{SpatioTemporalSimilarity}(i) + W_{so} \, \mathit{SessionSimilarity}(i)$$
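A minimal sketch of the weighted combination; the weight values are placeholders of my own, since the slide does not give MUDROD's actual weights:

```python
# Hypothetical weights (not MUDROD's tuned values).
WEIGHTS = {"descriptive": 0.4, "categorical": 0.15, "ordinal": 0.15,
           "spatiotemporal": 0.2, "session": 0.1}

def recommend_score(sims, weights=WEIGHTS):
    """Weighted combination of the component similarities for one candidate,
    per the Recommend(i) formula above; missing components count as 0."""
    return sum(weights[name] * sims.get(name, 0.0) for name in weights)

candidate = {"descriptive": 0.8, "categorical": 0.5, "ordinal": 0.9,
             "spatiotemporal": 0.61, "session": 0.33}
print(recommend_score(candidate))  # higher score = recommended earlier
```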
48. Quantitative Evaluation
• The hybrid similarity outperforms the other similarities because it integrates metadata attributes and user preferences.
Y. Li, Jiang, Y., C. Yang, K. Liu, E. M. Armstrong, T. Huang, D. Moroni & L. J. McGibbney (2017) A Geospatial Data Recommender System based on Metadata and User Behaviour (revision)
49. Plan Forward
• Add more features (e.g., temporal similarity)
• Create training data from web logs for RankSVM
• Develop a query understanding module to better interpret the user's search intent (e.g., "ocean wind level 3" -> "ocean wind" AND "level 3")
• Support Solr
• Support near real-time data ingestion to dynamically update the knowledge base
• Integrate with DOMS and OceanXtremes to support an ocean science analytics center
• Leverage advanced computing techniques to speed up processing
50. What has been done
• Offline/sequential log analysis
– Time- and computation-intensive
• Ranking
– Uses manually labeled data as training data
– RankSVM
• Offline recommendation
– Cannot respond to what a user clicked in the last few minutes
• Offline similarity calculation
– Cannot be updated in real time
– Cannot parse multi-concept queries
51. Software status
• Java 8
• Git
• Apache Maven 3.X
• Elasticsearch v5.X
• Kibana v4
• Apache Spark v2.0.0
• Apache Tomcat 7.X
• https://github.com/mudrod/mudrod
52. What remains to be done
• Support for Solr
• Add faceted search to the UI and backend
• Support for real-time ranking, recommendation, and similarity calculation
• Support for DOMS and OceanXtremes
53. Real-time ranking
54. Real-time recommendation
(Slide diagram: web logs supply the sequentially clicked datasets in a session to train the model; when a user clicks a metadata record, the knowledge base returns metadata-based similar datasets, the hybrid similarity calculator scores the top-k datasets, and the recommendation results are presented.)
55. Proposed timeline (Sept.–Oct. through July–Aug., in two-month periods)
Tasks: Support for Solr; Add faceted search; Support for DOMS & OceanXtremes; Real-time ranking; Real-time recommendation; Real-time similarity
(Slide shows a Gantt-style chart mapping each task to its two-month period.)
56. Research points
• Semantic similarity
– Online matrix analysis
– Query parsing
• Ranking
– Generate training data from user clicks
– Online deep learning algorithms
• Near real-time log ingestion
– Spark Streaming APIs
– Real-time sessionization
• High-performance log mining
– Use parallel computing to speed up processing
• Recommendation
– Session-based real-time recommendation
– Personalized recommendation
57. Summary
• Log mining enables a data portal to integrate implicit user preferences
• Word similarity retrieved by data mining expands a given query to improve search recall and precision
• The rich set of ranking features and the ML algorithm provide substantial advantages over other ranking methods
• The recommendation algorithm can discover latent data relevancy
• The proposed architecture enables a loosely coupled software structure for a data portal and avoids the cost of replacing the existing system
58. Presentation Contents
• Background and Objectives
• TRL assessment
• Summary of Accomplishments and Plans Forward
• Schedule
• Financials
• Publications
• List of Acronyms
59. Project Schedule
60. Costs Status and Plan (JPL part, $K)
(Slide shows a chart of the monthly funding plan, cumulative obligation plan, cumulative actuals, and variance from Jun '15 through Jun '17.)
61. Costs Status and Plan
(Slide shows the project-level cost table: funding plan, cumulative cost actuals, cumulative cost plan, and variance over the project, ending at $648,838.)
• Actual costs are below or above plan because GMU's payment cycle differs from the end-of-month plan dates
• Impact: None
62. Online Products
1. PO.DAAC Labs website: https://podaac.jpl.nasa.gov/podaac_labs
2. PO.DAAC forum page: https://podaac.jpl.nasa.gov/forum/viewforum.php?f=88
3. Open source on GitHub under Apache License 2.0: https://github.com/mudrod/mudrod
4. Demo site: http://mudrod.jpl.nasa.gov/
5. PD leverage: http://pd.cloud.gmu.edu/
63. Demonstration
MUDROD deployed in PO.DAAC Labs: https://podaac.jpl.nasa.gov/podaac_labs
64. Publications
Journal / Conference Papers (over 10 peer-reviewed publications)
1. Jiang, Y., Y. Li, C. Yang, E. M. Armstrong, T. Huang & D. Moroni (2016) Reconstructing Sessions from Data Discovery and Access Logs to Build a Semantic Knowledge Base for Improving Data Discovery. ISPRS International Journal of Geo-Information, 5, 54.
2. Li, Y., Y. Jiang, C. Yang, K. Liu, E. M. Armstrong, T. Huang & D. Moroni (2016) Leverage cloud computing to improve data access log mining. IEEE Oceans 2016.
3. Jiang, Y., Y. Li, C. Yang, K. Liu, E. M. Armstrong, T. Huang & D. Moroni (2017) A Comprehensive Approach to Determining the Linkage Weights among Geospatial Vocabularies - An Example with Oceanographic Data Discovery. International Journal of Geographical Information Science (minor revision).
4. Jiang, Y., Y. Li, C. Yang, K. Liu, E. M. Armstrong, T. Huang, D. Moroni & L. J. McGibbney (2017) Towards intelligent geospatial discovery: a machine learning ranking framework. International Journal of Digital Earth (minor revision).
5. Li, Y., Y. Jiang, C. Yang, K. Liu, E. M. Armstrong, T. Huang, D. Moroni & L. J. McGibbney (2017) A Geospatial Data Recommender System based on Metadata and User Behaviour (revision).
6. Jiang, Y., Y. Li, C. Yang, K. Liu, E. M. Armstrong, T. Huang, D. Moroni & L. J. McGibbney (2017) A smart web-based data discovery system for ocean sciences (ongoing).
7. Yang, C., et al. (2017) Big Data and cloud computing: innovation opportunities and challenges. International Journal of Digital Earth, 10(1), pp. 13-53. (the 2nd most-read paper in IJDE's decadal history)
8. Yang, C.P., Yu, M., Xu, M., Jiang, Y., Qin, H., Li, Y., Bambacus, M., Leung, R.Y., Barbee, B.W., Nuth, J.A. and Seery, B. (2017, March) An architecture for mitigating near Earth object's impact to the Earth. In Aerospace Conference, 2017 IEEE (pp. 1-13). IEEE.
9. Yu, M. and Yang, C. (2016) Improving the Non-Hydrostatic Numerical Dust Model by Integrating Soil Moisture and Greenness Vegetation Fraction Data with Different Spatiotemporal Resolutions. PLoS ONE, 11(12), e0165616.
10. Vance, T.C., Merati, N., Yang, C. and Yuan, M. (2016, September) Cloud computing for ocean and atmospheric science. In OCEANS 2016 MTS/IEEE Monterey (pp. 1-4). IEEE. (also as a book with the same name)
11. Liu, K., Yang, C., Li, W., Gui, Z., Xu, C. and Xia, J. (2016) Using semantic search and knowledge reasoning to improve the discovery of Earth science records: an example with the ESIP semantic testbed. In Mobile Computing and Wireless Networks: Concepts, Methodologies, Tools, and Applications (pp. 1375-1389). IGI Global.
12. Xia, J., Yang, C., Liu, K., Gui, Z., Li, Z., Huang, Q. and Li, R. (2015) Adopting cloud computing to optimize spatial web portals for better performance to support Digital Earth and other global geospatial initiatives. International Journal of Digital Earth, 8(6), pp. 451-475.
65. Students Cultivated
• Ph.D. dissertations focused on MUDROD: Yongyao Jiang and Yun Li (GMU), to be completed in phase 2 (OceanWorks)
• Graduate students who worked on tasks: Fei Hu, Kai Liu (graduating summer 2017), Manzhu Yu (graduating summer 2017), Han Qin (graduated), Mengchao Xu, Ishan Shams (GMU)
• Undergraduate students: Joseph Michael / Univ. of Richmond (HBCU), Alena / Virginia Tech (female)
• Other: approximately 20 presentations at AGU, ESIP, AAG, GeoInformatics, and other national and international conferences
66. List of Acronyms
• MUDROD: Mining and Utilizing Dataset Relevancy from Oceanographic Dataset
• HTTP: Hypertext Transfer Protocol
• FTP: File Transfer Protocol
• SWEET: Semantic Web for Earth and Environmental Terminology
• PO.DAAC: Physical Oceanography Distributed Active Archive Center
• LSA: Latent Semantic Analysis
• SVM: Support Vector Machine
• NDCG: Normalized Discounted Cumulative Gain
67. Acknowledgements
1. NASA AIST Program (NNX15AM85G)
2. PO.DAAC SWEET Ontology Team (initially funded by ESTO)
3. Hydrology DAAC: Rahul Ramachandran (providing the earlier version of NOESIS)
4. ESDIS for providing testing logs of CMR
5. All team members at JPL and GMU
