SlideShare uma empresa Scribd logo
1 de 20
Author- Paper Identification Problem
Guided By
Prof Duc Tran
Team :
Karthik Reddy Vakati
Nachammai C
Pooja Mishra
Problem Statement
To determine the correct author from the author’s dataset
for a particular paper.
Ambiguity in author names might cause a paper to be
assigned to the wrong author, which leads to noisy
author profiles.
Challenge is to determine which papers in an author
profile were truly written by a given author
Data Points
The data points include all papers written by an author, his
affiliation (University, Technical Society, Groups).
 Paper-Author -( PaperId , AuthorId, Name, Affiliation)
The meta data includes journals written by him and
conferences attended by an author.
 Paper -( Id, Title, Year, ConferenceId , JournalId,
Keywords)
 Author -( Id, Name, Affiliation)
 Conference-(Id, ShortName,FullName,HomePage)
 Journal -(Id, ShortName, FullName, HomePage)
Machine Learning Task
• Building the model using train dataset and testing
• Feature Engineering
• Algorithms
• Model Tuning
• Results
• Evaluation
Steps Taken to Solve Problem
 Data preprocessing and cleaning
 Feature engineering
 Extracting feature values
 Creating final input file
 Choose a ML algorithm - Random Forest/Gradient
Boost Model
 Building the model
 Tuning the model
 Test the model on test data
 Evaluating the results
Data preprocessing and cleaning
Issues with data
Few had attributes spilled over 3 rows
Some rows had more attributes than the required
number of attributes
Wrote a Perl script to Clean data and format it
Removed stop words using NLTK package in python
Converted all text to lower case
Removed special characters
Removed noise from years field
Assigned ID to each keyword and normalized it
Issues with data-I
Feature Engineering Steps
• Aggregation: combining multiple features into one.
How did we use: Elaborated train file with AuthorID, PaperID and
Confirmation combined with name and affiliation from Author file and
PaperAuthor file.
• Construction: Creating new features out of original ones
How did we use: minyear and maxyear into active years
• Discretization: Converting continuous features or variables to discretized
or nominal features
How did we use: The year the paper is published. The max and min years
the author was actively publishing papers.
Author features
• distance between the author names in paper-author and author files
• matched substring ratio between the author names in paper-author and author
files
• keywords used by a particular author(less weight)
• count keywords for author
• count the no of co-authors for a given co-author
• weighted TF-IDF measure of all author keywords inside author's papers
• count different papers of author
• years during which the author wrote many papers
• number of times an author is repeated (sum for distinct ids)
• list of distinct ids assigned to the same author
Paper features
• year of paper
• count authors of paper
• count duplicated papers for paper
• count duplicated authors for paper
• count keywords in paper
• how many time the exact same set of authors is repeated in
different papers (without
duplicates)
Paper-Author features
• correct affiliation from the table PaperAuthor: binary feature
• which year the author publishes (for the first year papers of author this
feature equals 1 for the second year papers - 2 and so on)
• count sources: number of times pair author-paper is appeared in the table
PaperAuthor table and Author table the same
Machine Learning Models Used
Models used earlier
• K-means clustering
• ZeroR
• Tree J-48
• Naïve Bayes
Machine Learning Models Used
• RandomForest
 Using Weka, Mahout and H20
• Gradient Boost
 Using H20
Build Random forest using weka
With the elaborated train file
Build Random forest using H20
Feature values extraction & importance
Count of paper for each author
Maximum active year for given author
Maximum active year for given author
Jaccard distance between author name in author file and
paper author file
Jaccard distance between affiliation in author file and paper
author file
The year paper was published
Normalized Keyword ids
1. Put the data in HDFS.
$HADOOP_HOME/bin/hadoop fs -mkdir testdata $HADOOP_HOME/bin/hadoop fs -put testdata
2. Build the job files.
$MAHOUT_HOME/ run: mvn clean install -DskipTests
3. Generate a file descriptor for the dataset.
$HADOOP_HOME/bin/hadoop jar
$MAHOUT_HOME/core/target/mahout-core--job.jar org.apache.mahout.classifier.df.tools.Describe -p
testdata/KDDTrain+.arff -f testdata/KDDTrain+.info -d N 3 C 2 N C 4 N C 8 N 2 C 19 N L
4. Run the model
$HADOOP_HOME/bin/hadoop jar
$MAHOUT_HOME/examples/target/mahout-examples--job.jar
org.apache.mahout.classifier.df.mapreduce.BuildForest -Dmapred.max.split.size=1874231 -d
testdata/KDDTrain+.arff -ds testdata/KDDTrain+.info -sl 5 -p -t 100 -o nsl-forest
5. Using the Decision Forest to Classify new data.
$MAHOUT_HOME/examples/target/mahout-examples--job.jar
org.apache.mahout.classifier.df.mapreduce.TestForest -i nsl-kdd/KDDTest+.arff -ds nsl-kdd/KDDTrain+.info -m
nsl-forest -a -mr -o predictions
Random Forest using Mahout Steps
Evaluation Metrics
The mean average precision for N users at position n is
the average of the average precision of each user, i.e.,
MAP@n=∑i=1Nap@ni/N
• Mean Average Precision
Lessons Learned!
Since we were given train and test data supervised learning is the best fit. Hence
classification algorithms work better for author paper identification problem rather
than clustering.
When you want to find patterns or structures in the provided data use
unsupervised learning models such as clustering.
Choosing the features is the most important thing and we can extract the feature
values from the given data and use it to build the model.
Construct an initial set of features and try to build the model and test for its
accuracy. This might not give you better results always. Construct features from
features.
Choosing features from different data points will give better results than just
choosing them from only one.
Choose initial set of weights for each feature based on its importance. This will
help in model tuning.
Thank you!!

Mais conteúdo relacionado

Semelhante a ML Paper Identification Using Author Name and Affiliation

Author paper identification problem
Author  paper identification problemAuthor  paper identification problem
Author paper identification problemPooja Mishra
 
Author paper midterm
Author paper midtermAuthor paper midterm
Author paper midtermPooja Mishra
 
Authorcontext:ire
Authorcontext:ireAuthorcontext:ire
Authorcontext:ireSoham Saha
 
Mapreduce Algorithms
Mapreduce AlgorithmsMapreduce Algorithms
Mapreduce AlgorithmsAmund Tveit
 
IRE- Algorithm Name Detection in Research Papers
IRE- Algorithm Name Detection in Research PapersIRE- Algorithm Name Detection in Research Papers
IRE- Algorithm Name Detection in Research PapersSriTeja Allaparthi
 
Unit 2 Principles of Programming Languages
Unit 2 Principles of Programming LanguagesUnit 2 Principles of Programming Languages
Unit 2 Principles of Programming LanguagesVasavi College of Engg
 
Candidate selection tutorial
Candidate selection tutorialCandidate selection tutorial
Candidate selection tutorialYiqun Liu
 
SIGIR 2017 - Candidate Selection for Large Scale Personalized Search and Reco...
SIGIR 2017 - Candidate Selection for Large Scale Personalized Search and Reco...SIGIR 2017 - Candidate Selection for Large Scale Personalized Search and Reco...
SIGIR 2017 - Candidate Selection for Large Scale Personalized Search and Reco...Aman Grover
 
STAT Requirement Analysis
STAT Requirement AnalysisSTAT Requirement Analysis
STAT Requirement Analysisstat
 
Data_Modeling_MongoDB.pdf
Data_Modeling_MongoDB.pdfData_Modeling_MongoDB.pdf
Data_Modeling_MongoDB.pdfjill734733
 
CityLABS Workshop: Working with large tables
CityLABS Workshop: Working with large tablesCityLABS Workshop: Working with large tables
CityLABS Workshop: Working with large tablesEnrico Daga
 
Software tools for high-throughput materials data generation and data mining
Software tools for high-throughput materials data generation and data miningSoftware tools for high-throughput materials data generation and data mining
Software tools for high-throughput materials data generation and data miningAnubhav Jain
 
Software tools, crystal descriptors, and machine learning applied to material...
Software tools, crystal descriptors, and machine learning applied to material...Software tools, crystal descriptors, and machine learning applied to material...
Software tools, crystal descriptors, and machine learning applied to material...Anubhav Jain
 
PyCascading for Intuitive Flow Processing with Hadoop (gabor szabo)
PyCascading for Intuitive Flow Processing with Hadoop (gabor szabo)PyCascading for Intuitive Flow Processing with Hadoop (gabor szabo)
PyCascading for Intuitive Flow Processing with Hadoop (gabor szabo)PyData
 
Graph Techniques for Natural Language Processing
Graph Techniques for Natural Language ProcessingGraph Techniques for Natural Language Processing
Graph Techniques for Natural Language ProcessingSujit Pal
 
AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)Paul Chao
 
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習 Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習 Herman Wu
 
CSMR: A Scalable Algorithm for Text Clustering with Cosine Similarity and Map...
CSMR: A Scalable Algorithm for Text Clustering with Cosine Similarity and Map...CSMR: A Scalable Algorithm for Text Clustering with Cosine Similarity and Map...
CSMR: A Scalable Algorithm for Text Clustering with Cosine Similarity and Map...Victor Giannakouris
 

Semelhante a ML Paper Identification Using Author Name and Affiliation (20)

Author paper identification problem
Author  paper identification problemAuthor  paper identification problem
Author paper identification problem
 
Author paper midterm
Author paper midtermAuthor paper midterm
Author paper midterm
 
Authorcontext:ire
Authorcontext:ireAuthorcontext:ire
Authorcontext:ire
 
Mapreduce Algorithms
Mapreduce AlgorithmsMapreduce Algorithms
Mapreduce Algorithms
 
IRE- Algorithm Name Detection in Research Papers
IRE- Algorithm Name Detection in Research PapersIRE- Algorithm Name Detection in Research Papers
IRE- Algorithm Name Detection in Research Papers
 
Unit 2 Principles of Programming Languages
Unit 2 Principles of Programming LanguagesUnit 2 Principles of Programming Languages
Unit 2 Principles of Programming Languages
 
Candidate selection tutorial
Candidate selection tutorialCandidate selection tutorial
Candidate selection tutorial
 
SIGIR 2017 - Candidate Selection for Large Scale Personalized Search and Reco...
SIGIR 2017 - Candidate Selection for Large Scale Personalized Search and Reco...SIGIR 2017 - Candidate Selection for Large Scale Personalized Search and Reco...
SIGIR 2017 - Candidate Selection for Large Scale Personalized Search and Reco...
 
STAT Requirement Analysis
STAT Requirement AnalysisSTAT Requirement Analysis
STAT Requirement Analysis
 
Data_Modeling_MongoDB.pdf
Data_Modeling_MongoDB.pdfData_Modeling_MongoDB.pdf
Data_Modeling_MongoDB.pdf
 
CityLABS Workshop: Working with large tables
CityLABS Workshop: Working with large tablesCityLABS Workshop: Working with large tables
CityLABS Workshop: Working with large tables
 
R tutorial
R tutorialR tutorial
R tutorial
 
Software tools for high-throughput materials data generation and data mining
Software tools for high-throughput materials data generation and data miningSoftware tools for high-throughput materials data generation and data mining
Software tools for high-throughput materials data generation and data mining
 
Software tools, crystal descriptors, and machine learning applied to material...
Software tools, crystal descriptors, and machine learning applied to material...Software tools, crystal descriptors, and machine learning applied to material...
Software tools, crystal descriptors, and machine learning applied to material...
 
PyCascading for Intuitive Flow Processing with Hadoop (gabor szabo)
PyCascading for Intuitive Flow Processing with Hadoop (gabor szabo)PyCascading for Intuitive Flow Processing with Hadoop (gabor szabo)
PyCascading for Intuitive Flow Processing with Hadoop (gabor szabo)
 
Graph Techniques for Natural Language Processing
Graph Techniques for Natural Language ProcessingGraph Techniques for Natural Language Processing
Graph Techniques for Natural Language Processing
 
AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)
 
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習 Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
 
DynamodbDB Deep Dive
DynamodbDB Deep DiveDynamodbDB Deep Dive
DynamodbDB Deep Dive
 
CSMR: A Scalable Algorithm for Text Clustering with Cosine Similarity and Map...
CSMR: A Scalable Algorithm for Text Clustering with Cosine Similarity and Map...CSMR: A Scalable Algorithm for Text Clustering with Cosine Similarity and Map...
CSMR: A Scalable Algorithm for Text Clustering with Cosine Similarity and Map...
 

Último

UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and workingUNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and workingrknatarajan
 
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxAsutosh Ranjan
 
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...RajaP95
 
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...Call Girls in Nagpur High Profile
 
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)Suman Mia
 
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...ranjana rawat
 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSSIVASHANKAR N
 
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCollege Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCall Girls in Nagpur High Profile
 
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVHARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVRajaP95
 
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escortsranjana rawat
 
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Call Girls in Nagpur High Profile
 
Introduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxIntroduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxupamatechverse
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Christo Ananth
 
Introduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxIntroduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxupamatechverse
 

Último (20)

UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and workingUNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
 
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCRCall Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
 
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
 
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptx
 
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
 
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
 
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
 
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
 
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
 
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCollege Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
 
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVHARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
 
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
 
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
 
Introduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxIntroduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptx
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
 
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
 
Introduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxIntroduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptx
 
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
 

ML Paper Identification Using Author Name and Affiliation

  • 1. Author- Paper Identification Problem Guided By Prof Duc Tran Team : Karthik Reddy Vakati Nachammai C Pooja Mishra
  • 2. Problem Statement To determine the correct author from the author’s dataset for a particular paper. Ambiguity in author names might cause a paper to be assigned to the wrong author, which leads to noisy author profiles. Challenge is to determine which papers in an author profile were truly written by a given author
  • 3. Data Points The data points include all papers written by an author, his affiliation (University, Technical Society, Groups).  Paper-Author -( PaperId , AuthorId, Name, Affiliation) The meta data includes journals written by him and conferences attended by an author.  Paper -( Id, Title, Year, ConferenceId , JournalId, Keywords)  Author -( Id, Name, Affiliation)  Conference-(Id, ShortName,FullName,HomePage)  Journal -(Id, ShortName, FullName, HomePage)
  • 4. Machine Learning Task • Building the model using train dataset and testing • Feature Engineering • Algorithms • Model Tuning • Results • Evaluation
  • 5. Steps Taken to Solve Problem  Data preprocessing and cleaning  Feature engineering  Extracting feature values  Creating final input file  Choose a ML algorithm - Random Forest/Gradient Boost Model  Building the model  Tuning the model  Test the model on test data  Evaluating the results
  • 6. Data preprocessing and cleaning Issues with data Few had attributes spilled over 3 rows Some rows had more attributes than the required number of attributes Wrote a Perl script to Clean data and format it Removed stop words using NLTK package in python Converted all text to lower case Removed special characters Removed noise from years field Assigned ID to each keyword and normalized it
  • 8. Feature Engineering Steps • Aggregation: combining multiple features into one. How did we use: Elaborated train file with AuthorID, PaperID and Confirmation combined with name and affiliation from Author file and PaperAuthor file. • Construction: Creating new features out of original ones How did we use: minyear and maxyear into active years • Discretization: Converting continuous features or variables to discretized or nominal features How did we use: The year the paper is published. The max and min years the author was actively publishing papers.
  • 9. Author features • distance between the author names in paper-author and author files • matched substring ratio between the author names in paper-author and author files • keywords used by a particular author(less weight) • count keywords for author • count the no of co-authors for a given co-author • weighted TF-IDF measure of all author keywords inside author's papers • count different papers of author • years during which the author wrote many papers • number of times an author is repeated (sum for distinct ids) • list of distinct ids assigned to the same author
  • 10. Paper features • year of paper • count authors of paper • count duplicated papers for paper • count duplicated authors for paper • count keywords in paper • how many time the exact same set of authors is repeated in different papers (without duplicates)
  • 11. Paper-Author features • correct affiliation from the table PaperAuthor: binary feature • which year the author publishes (for the first year papers of author this feature equals 1 for the second year papers - 2 and so on) • count sources: number of times pair author-paper is appeared in the table PaperAuthor table and Author table the same
  • 12. Machine Learning Models Used Models used earlier • K-means clustering • ZeroR • Tree J-48 • Naïve Bayes
  • 13. Machine Learning Models Used • RandomForest  Using Weka, Mahout and H20 • Gradient Boost  Using H20
  • 14. Build Random forest using weka With the elaborated train file
  • 15. Build Random forest using H20
  • 16. Feature values extraction & importance Count of paper for each author Maximum active year for given author Maximum active year for given author Jaccard distance between author name in author file and paper author file Jaccard distance between affiliation in author file and paper author file The year paper was published Normalized Keyword ids
  • 17. 1. Put the data in HDFS. $HADOOP_HOME/bin/hadoop fs -mkdir testdata $HADOOP_HOME/bin/hadoop fs -put testdata 2. Build the job files. $MAHOUT_HOME/ run: mvn clean install -DskipTests 3. Generate a file descriptor for the dataset. $HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/core/target/mahout-core--job.jar org.apache.mahout.classifier.df.tools.Describe -p testdata/KDDTrain+.arff -f testdata/KDDTrain+.info -d N 3 C 2 N C 4 N C 8 N 2 C 19 N L 4. Run the model $HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples--job.jar org.apache.mahout.classifier.df.mapreduce.BuildForest -Dmapred.max.split.size=1874231 -d testdata/KDDTrain+.arff -ds testdata/KDDTrain+.info -sl 5 -p -t 100 -o nsl-forest 5. Using the Decision Forest to Classify new data. $MAHOUT_HOME/examples/target/mahout-examples--job.jar org.apache.mahout.classifier.df.mapreduce.TestForest -i nsl-kdd/KDDTest+.arff -ds nsl-kdd/KDDTrain+.info -m nsl-forest -a -mr -o predictions Random Forest using Mahout Steps
  • 18. Evaluation Metrics The mean average precision for N users at position n is the average of the average precision of each user, i.e., MAP@n=∑i=1Nap@ni/N • Mean Average Precision
  • 19. Lessons Learned! Since we were given train and test data supervised learning is the best fit. Hence classification algorithms work better for author paper identification problem rather than clustering. When you want to find patterns or structures in the provided data use unsupervised learning models such as clustering. Choosing the features is the most important thing and we can extract the feature values from the given data and use it to build the model. Construct an initial set of features and try to build the model and test for its accuracy. This might not give you better results always. Construct features from features. Choosing features from different data points will give better results than just choosing them from only one. Choose initial set of weights for each feature based on its importance. This will help in model tuning.