The task of extracting keyphrases from free-text documents is becoming increasingly important as the uses for such technology expand
and as the amount of electronic textual content grows rapidly.
Keyphrases play an important role in digital libraries, web content, and content management systems, especially for cataloging and
information retrieval.
We propose an approach for a basic text mining tool for Arabic keyphrase extraction. The approach relies on AI techniques: heuristic knowledge (linguistic rules) combined with statistical machine learning.
Paper: http://goo.gl/Bgu4Y
Documentation: http://goo.gl/KrJUM
350M Arabic speakers
120K Arabic wiki pages
2000% growth in Arabic tweets from 2010 to 2011
Arabic on the web has increased 2500% since 2000
66M users use the Arabic language on the Internet
---------------------------------------------------------
Arabic Internet users grew by 2500% and reached 65M users
Egypt, KSA, and UAE will spend about $2.1 billion in the electronic retail sector
Generating metadata that gives a high-level description of a document's contents. This supports text-mining tasks such as document and Web page retrieval.
Summarizing documents for prospective readers. Keyphrases can represent a highly condensed summary of the document in question (Avanzo & Magnini, 2005).
Highlighting important topics within the body of the text to facilitate speed reading (skimming), which lets the reader decide whether the document is relevant.
Measuring the similarity between documents, making it possible to cluster and categorize documents (Karanikolas & Skourlas, 2006).
Searching: results are more precise when keyphrases are used as the basis for search indexes or as a way of browsing a collection of documents.
Just points
Speak about correction and why we do it.
Removing diacritics and why we do it.
Segmentation:
Speak about the Segmenter module from Stanford.
Segmenting into sentences and its importance in feature calculation (NPL).
Segmentation into words.
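A minimal sketch of the diacritics-removal step, for illustration only. One likely reason for the step is that diacritized and undiacritized spellings of the same word should count as one term. The function name remove_diacritics and the decision to also drop the tatweel (elongation) character are assumptions of this sketch, not details taken from the tool.

```python
import re

# Arabic diacritic marks (tashkeel) U+064B..U+0652 and the superscript alef U+0670.
# Dropping the tatweel (U+0640) as well is an assumption of this sketch.
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670\u0640]")

def remove_diacritics(text: str) -> str:
    """Return text with Arabic diacritics and tatweel removed (hypothetical helper)."""
    return DIACRITICS.sub("", text)
```

After this step, a fully vocalized word and its bare spelling compare equal, so frequency counts are not split between the two forms.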
Remove all unused special characters.
Remove non-Arabic characters.
Replace the question mark and exclamation mark with the Arabic ones.
Leave only significant special characters: ، ; :
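The cleaning rules above could be sketched roughly as follows. This is an illustration under stated assumptions, not the tool's actual code: the name clean_text is hypothetical, keeping "." and "!" (so sentence boundaries survive) is an assumption, and diacritics are left untouched here on the assumption that they are stripped in a separate step.

```python
import re

# Kept by this sketch: Arabic letters and diacritics (U+0621..U+0652), whitespace,
# the marks named in the notes (Arabic comma U+060C, ';', ':'), the Arabic
# question mark U+061F, and '!' and '.' (the last two are an assumption).
NOT_KEPT = re.compile(r"[^\u0621-\u0652\s\u060C;:\u061F!.]")

def clean_text(text: str) -> str:
    """Hypothetical cleaner: drop non-Arabic characters and unused symbols."""
    # Latin question mark -> Arabic question mark. '!' has no separate Arabic
    # code point, so it is kept as is.
    text = text.replace("?", "\u061F")
    # Replace Latin letters, digits, and other symbols with a space.
    text = NOT_KEPT.sub(" ", text)
    # Collapse the runs of whitespace left behind.
    return re.sub(r"\s+", " ", text).strip()
```

Removed characters are replaced with a space instead of being deleted, so that two Arabic words separated only by a symbol are not glued together.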
Almost all clitics are separated off as separate words. This includes clitic pronouns, prepositions, and conjunctions. However, the clitic determiner (definite article) "Al" (ال) is not separated off. Inflectional and derivational morphology is not separated off.
[GALE ROSETTA: These separated off clitics are not overtly marked as proclitics/enclitics, although we do have a facility to strip off the '+' and '#' characters that the IBM segmenter uses to mark enclitics and proclitics, respectively. See the example below using the option -escaper edu.stanford.nlp.trees.international.arabic.IBMArabicEscaper]
Parentheses are rendered -LRB- and -RRB-
Quotes are rendered as (ASCII) straight single and double quotes (' and "), not as curly quotes or LaTeX-style quotes (unlike the Penn English Treebank).
Dashes are represented with the ASCII hyphen character (U+002D).
Non-break space is not used.
(Not words) means the POS tagger runs on the whole sentence to detect the correct part of speech for every word.
Another example
Word: التوقعات المرئية
Stem: مرء
Lemma: مرئي
Mention another test set that includes different domains (sport, psychology, science, and religion).
Don’t forget to mention that the documents have different authors.
Define both precision and recall for the audience.
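For reference when defining the two measures: precision is the share of extracted keyphrases that are correct, and recall is the share of reference keyphrases that were extracted. The sketch below assumes exact string matching against a reference set. That is an assumption made for illustration (human judgment may accept near matches), and the function name precision_recall is hypothetical.

```python
def precision_recall(extracted, reference):
    """Exact-match precision and recall for one document (illustrative sketch)."""
    extracted, reference = set(extracted), set(reference)
    matched = extracted & reference  # keyphrases found in both sets
    precision = len(matched) / len(extracted) if extracted else 0.0
    recall = len(matched) / len(reference) if reference else 0.0
    return precision, recall
```

For example, if the system extracts 4 keyphrases and 2 of them appear among 3 reference keyphrases, precision is 2/4 and recall is 2/3.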
Mention that Sakhr is not eligible for comparison because of its limitation of categorizing keyphrases into sections.
Mention that KP-Miner is the only available website to compare with.
We used the same test set as KP-Miner to make a fair comparison.
Don’t forget human judgment
Speak in more detail.
Linguistic features include treating special characters such as the semicolon and double colon as signals of the importance of text.
Linguistic features include checking whether the candidate is a sub-phrase of another candidate.
Statistical features include characterizing each writer's style: where he mentions the topic and where he gives the details (تفصيل واجمال, detail and summary).
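The sub-phrase check could look roughly like the sketch below. It assumes candidates are whitespace-separated word sequences and that "sub" means a contiguous run of words inside a longer candidate. Both are assumptions for illustration, and the name is_subphrase is hypothetical.

```python
def is_subphrase(candidate: str, other: str) -> bool:
    """True if candidate's words occur contiguously inside a longer phrase."""
    c, o = candidate.split(), other.split()
    if not c or len(c) >= len(o):
        return False  # empty, identical in length, or longer: not a proper sub-phrase
    # Slide a window of len(c) words over the longer phrase.
    return any(o[i:i + len(c)] == c for i in range(len(o) - len(c) + 1))
```

Comparing whole words instead of raw substrings avoids false hits where one word merely appears inside another word.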