Presentation given at the 2018 IEEE International Conference on Intelligent Systems (IS 2018) for the article:
L. Mazzola, P. Siegfried, A. Waldis, M. Kaufmann, A. Denzler. (2018). "A Domain Specific ESA Inspired Approach for Document Semantic Description". In Proceedings of the IEEE IS2018, At: Madeira Island, Portugal, pg. xx-xx. ISBN: 978-1-5386-7097-2/18
Document semantic characterization
1. A Domain Specific ESA-inspired Approach for Document Semantic Description
Luca Mazzola, Patrick Siegfried, Andreas Waldis, Michael Kaufmann, and Alexander Denzler
HSLU - Lucerne University of Applied Sciences,
School of Information Technology,
6343 - Rotkreuz,
Switzerland
9th IEEE International Conference on Intelligent Systems – IS2018
IEEE_IS2018 25/09/2018
2. Slide 2, 25-Sep-18
- DSS: Decision Support System for
  - job placement
  - further education suggestion
  - profile (CV) similarity identification
- Data-driven
  - Automatically evolving (no rule definition needed)
  - Limits the cold-start problem
Motivation
• DSS
• Data-driven
• Limited cold-start
3. Slide 3, 25-Sep-18
- Unstructured/semi-structured documents
  - CV/résumé
  - job offer
  - education description (high school, professional instruction, Bachelor, Master, executive ed.,…)
  - other general-purpose documents (e.g., websites)
- Mixing with on-the-job training:
  - No formal learning objective, no uniform description
  - Competences gained through job experience must be considered
Issues
• Unstructured data
• Different origin/standard
• Informal and semiformal
4. Slide 4, 25-Sep-18
- Externally available, crowd-sourced corpus: Wikipedia
  - Good quality
  - Concepts = existing page titles
  - Vocabulary = page content (stems)
  - Metric = normalized TF-IDF
  - As suggested by ESA, but transposed
- Domain-specific filtering
  - Noise reduction by removal of “irrelevant” concepts/vocabulary
Our Approach
• Wikipedia as data source (ESA)
• nTF-IDF
• Domain specific (noise limiting)
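As an illustration of the concept/vocabulary/metric mapping above, here is a minimal sketch of a normalized TF-IDF matrix. The slides do not specify the exact normalization or TF-IDF variant, so the L2 row normalization, the relative term frequency and the logarithmic IDF are assumptions; the function name `ntfidf_matrix` and the input layout (page title mapped to a list of stems) are hypothetical.

```python
import math
from collections import Counter


def ntfidf_matrix(pages):
    """Build a concept-by-stem weight matrix.

    pages: dict mapping a page title (concept) to its list of stems.
    Returns a dict mapping each concept to {stem: weight}; every
    non-empty row is L2-normalized (assumed normalization).
    """
    n_pages = len(pages)

    # Document frequency: in how many pages does each stem occur?
    df = Counter()
    for stems in pages.values():
        df.update(set(stems))

    matrix = {}
    for title, stems in pages.items():
        tf = Counter(stems)
        # Relative term frequency times logarithmic inverse document frequency.
        row = {
            stem: (count / len(stems)) * math.log(n_pages / df[stem])
            for stem, count in tf.items()
        }
        norm = math.sqrt(sum(w * w for w in row.values()))
        if norm > 0:
            # Drop zero weights (stems present in every page) and normalize.
            matrix[title] = {s: w / norm for s, w in row.items() if w > 0}
        else:
            matrix[title] = {}
    return matrix
```

With this layout, a stem that occurs in every page receives an IDF of zero and disappears from all rows, which already removes some uninformative vocabulary before any domain filtering.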
5. Slide 5, 25-Sep-18
Semantic matrix building process
• Enriching (no disambiguation, virtual pages for redirects)
• Filtering
Data characterization:
DEWiki: ~2.5M
CVs: ~27K
Job offers: ~30K
Education descriptions: ~1.1K
Valid “concepts”: ~40K
Valid “stems”: ~66K
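The filtering step that reduces the corpus to the valid concepts and stems could look roughly like the sketch below. The actual criteria are not given on the slide; here a stem is assumed to be valid when it also occurs in the domain documents (CVs, job offers, education descriptions), and a concept is kept when enough valid stems remain. The function name `filter_domain` and the threshold `min_stems` are hypothetical.

```python
def filter_domain(matrix, domain_stems, min_stems=1):
    """Keep only domain-relevant stems and the concepts they support.

    matrix: dict mapping concept -> {stem: weight}.
    domain_stems: set of stems observed in the domain documents.
    min_stems: assumed threshold; concepts with fewer remaining
        stems are discarded as noise.
    """
    filtered = {}
    for concept, row in matrix.items():
        # Remove vocabulary that never appears in the domain corpus.
        kept = {stem: w for stem, w in row.items() if stem in domain_stems}
        if len(kept) >= min_stems:
            filtered[concept] = kept
    return filtered
```

Raising `min_stems` would make the filter stricter, which trades coverage for less noise.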
6. Slide 6, 25-Sep-18
Reference Model building
• Additional distribution data
• Dynamic filtering
7. Slide 7, 25-Sep-18
- Develop a metric to compare documents based on a common set of attributes
- Compare two given documents:
  - identify similarities
  - extract common “concepts”
- Compare a given document against a set:
  - assign relevant CVs to a job post
  - match educational experiences to a CV on a common skill set
  - find CVs similar to a given one
Requirements
• Set of requirements
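A minimal sketch of how these requirements could be served by a vector comparison follows. It assumes documents are projected onto the concept space by summing the matrix weights of their stems, and that cosine similarity is the comparison metric; neither choice is confirmed by the slides, and the names `concept_vector`, `cosine` and `rank_candidates` are hypothetical.

```python
import math
from collections import Counter


def concept_vector(doc_stems, matrix):
    """Project a document (list of stems) onto the concept space."""
    counts = Counter(doc_stems)
    vec = {}
    for concept, row in matrix.items():
        # Sum the concept's stem weights, scaled by the stem counts in the document.
        score = sum(counts[s] * w for s, w in row.items() if s in counts)
        if score > 0:
            vec[concept] = score
    return vec


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_candidates(query_vec, candidates):
    """Return (name, similarity) pairs, most similar first."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Comparing two documents is then a single `cosine` call, the concepts shared by both vectors give the common “concepts”, and `rank_candidates` covers the one-against-a-set cases such as assigning CVs to a job post.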
8. Slide 8, 25-Sep-18
- Ranked matching between 17 CVs and 44 educational experiences
- Gold standard: manual annotation by business partner (ordered top-3 educations for each CV)
- Weighted according to the table
- Expected value for pure random assignment: E[Q] ~ 0.32
- Obtained result: Q = 6.62, sd[Q] = 1.68
- Additional analysis, for 5 representative cases:
Non-randomness verification
• Wikipedia as data source (ESA)
• nTF-IDF
• Domain specific (noise limiting)
Rank     #1     #2     #3
Top-1    2      -      -
Top-2    1/2    3/2    -
Top-3    1/3    3/3    5/3
Top-5    1/5    3/5    5/5
Top-10   1/10   3/10   5/10
9. Slide 9, 25-Sep-18
- We identified a set of 10 heterogeneous documents in German:
  - Doc1 Automobil-Mechatroniker EFZ (educ exp)
  - Doc2 Software Entwickler (job offer)
  - Doc3 B.Sc. Medizin-Informatiker/in BFH (educ exp)
  - Doc4 AutoMechatroniker (job offer)
  - Doc5 Webpage of «Data Intelligence» team at HSLU (website)
  - Doc6 Dipl. Pflegefachperson HF/FH (Privatabteilung) (job offer)
  - Doc7 Luzerner Kantonsspital website - general page (website)
  - Doc8 Zuger Kantonsspital website - «about us» (website)
  - Doc9 Visa hat technische Probleme in ganz Europa (news, 01 Jun)
  - Doc10 Bayer übernimmt Monsanto für 63 Milliarden (news, 07 Jun)
- Analysis to discover relationships (similarities) amongst them
Experiment
• Experiment setup
noise, from http://www.20min.ch
13. Slide 13, 25-Sep-18
- An ESA-inspired approach for document comparison
- Able to work on heterogeneous documents
  - Language
  - Structure
- Domain filtering for better specificity (less noise)
- Better results than random assignment
- Positive manual evaluation by humans
- Clustering capabilities
  - Meaningful
  - Able to spot and “separate” outliers in a dataset (noise)
Achievements
• New approach
• Good performance
• Outlier “detection”
14. Slide 14, 25-Sep-18
- Language dependent
  - Currently in German
- No interpretation of the absolute distance between documents
  - Only comparisons are meaningful
- No completely meaningful explicit signature of a document (such as the one offered by ESA)
- Computational complexity of model creation
  - However, dynamic adjustment partially compensates
Limits
• Language dependency
• Adopted metrics
• Explicit semantic interpretation
15. Slide 15, 25-Sep-18
- Use of a granular approach
  - Using the CV semi-structure, if available
- Customizable metrics for stem weighting
- Different metrics for vector comparison
- Multilanguage version
  - Using the Wikipedia metadata for “translated” pages
- Granular map of the CH educational landscape
Next Steps
• Improve model (metrics)
• Multilanguage support
• Towards a Map of CH education
16. Questions
Dr. Luca Mazzola
Research Associate
Research
T direct +41 41 757 68 90
luca.mazzola@hslu.ch
Rotkreuz