Using Stanza NLP and TensorFlow to create a summary of a book
1. Book Summarizer
Using Stanza and TensorFlow to create a summary of a book
Rafael Moreira
Olu Amusan
Kishen Patel
Jared Kelly
Blake Myers
2. Project Abstract
Natural Language Processing (NLP) remains one of the most popular applications
of machine learning today. Our project seeks to improve knowledge assimilation
and learning through text summarization.
In this project, we intend to use a Python package called Stanza, which is a
collection of accurate and efficient tools for natural language analysis. Stanza
processes raw text and carries out syntactic analysis as well as named-entity
recognition.
3. Proposed Project Design
We propose to use Term Frequency and Inverse Document Frequency to first give
weights to all relevant terms in a document.
Next, we will calculate the weight of each sentence in a document as a function of
its component terms.
Finally, we will rank and return the n heaviest weighted sentences as a summary of
the document in question.
This method is described in Mihalcea & Ceylan (2007), and a tutorial using a
similar method can be found on Medium.com. We wish to extend this tutorial to
use Stanza instead of spaCy, and then to improve upon the
method with ideas from Mihalcea & Ceylan.
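The three-step design above (weight terms, weight sentences, return the n heaviest sentences) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the project's implementation: each sentence is treated as a "document" when computing inverse document frequency, the sentence weight is the mean TF-IDF of its terms, and the regex tokenizer and the function names `tokenize`, `tfidf_weights`, and `summarize` are hypothetical stand-ins.

```python
import math
import re
from collections import Counter


def tokenize(sentence):
    """Lowercase a sentence and split it into word tokens (crude stand-in for a real tokenizer)."""
    return re.findall(r"[a-z']+", sentence.lower())


def tfidf_weights(sentences):
    """Return one {term: tf-idf} dict per sentence.

    Assumption: each sentence counts as one 'document' for the IDF part.
    """
    tokenized = [tokenize(s) for s in sentences]
    n_sentences = len(tokenized)
    # Document frequency: number of sentences that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        total = len(tokens) or 1
        weights.append(
            {t: (c / total) * math.log(n_sentences / df[t]) for t, c in tf.items()}
        )
    return weights


def summarize(sentences, n=2):
    """Return the n highest-weighted sentences, kept in their original order."""
    weights = tfidf_weights(sentences)
    # Sentence weight: mean tf-idf of its distinct terms (an assumed choice).
    scores = [sum(w.values()) / (len(w) or 1) for w in weights]
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:n]
    return [sentences[i] for i in sorted(top)]
```

Averaging rather than summing the term weights is one possible design choice; it keeps long sentences from winning simply because they contain more terms.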
4. Stanza
Stanza is a suite of NLP tools, much like the more popular spaCy or NLTK
toolkits. It contains useful tools for converting natural language into lists
of sentences or words, lemmatization, POS tagging, morpho-syntactic analysis,
parsing, and named-entity recognition. Stanza uses the standard Universal
Dependencies formalism.
We’ve chosen the newer Stanza, from the Stanford NLP group,
because of its massive multilinguality. The tool currently has pre-trained
neural model support for 66 human languages.
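As a rough illustration of how the tools listed above fit together, the sketch below loads a Stanza pipeline and pulls content-word lemmas out of each sentence, which is the kind of input a term-weighting step could use. The helper names `build_pipeline` and `content_lemmas` and the choice of POS tags to keep are assumptions made for this example; the processor list may also need adjusting for a given Stanza version or language (some languages additionally need `mwt`).

```python
def build_pipeline(lang="en"):
    """Download the models if needed and load a Stanza pipeline.

    Requires the stanza package and, on first use, network access.
    """
    import stanza  # imported lazily so content_lemmas works without it

    stanza.download(lang)
    return stanza.Pipeline(lang=lang, processors="tokenize,pos,lemma")


def content_lemmas(doc, keep_upos=("NOUN", "PROPN", "VERB", "ADJ")):
    """Return (sentence text, content-word lemmas) for each sentence in a Stanza Document.

    keep_upos is an assumed filter on Universal POS tags, not something
    prescribed by Stanza.
    """
    result = []
    for sentence in doc.sentences:
        lemmas = [
            word.lemma.lower()
            for word in sentence.words
            if word.upos in keep_upos and word.lemma
        ]
        result.append((sentence.text, lemmas))
    return result
```

A typical call would be `nlp = build_pipeline()` followed by `content_lemmas(nlp("Some raw text."))`; the first call downloads the English models, so it can take a while.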
5. Milestones
Design Proposal
Design our proposal while aligning our thoughts on the summarizer and the
Python NLP package of choice. Agree on meeting dates and project approach.
Review Stanza Doc.
Review the Stanza documentation in light of the tutorial we are extending, and
create the framework of our solution as we move on to implementation.
Build Summarizer
Begin implementation (coding) of the summarizer and improve accuracy with
additional iterations and training.
6. Workflow
Using GitHub for version control.
Using Discord for communication and Zoom for collaboration.
https://discord.gg/XZqKYNBF
Meetings: Thursday 9:00am
Participants
Rafael Moreira - RafaelAlvesMoreira@my.unt.edu
Olu Amusan -
Kishen Patel - KishenPatel@my.unt.edu
Blake Myers -
Jared Kelly - jared.kelly@unt.edu
7. Resources and Related Projects
https://medium.com/better-programming/extractive-text-summarization-using-spacy-in-python-88ab96d1fd97
● This tutorial implements extractive text summarization using spaCy in Python. Our goal is to
implement a similar text summarization algorithm using Stanza instead.
Mihalcea, R., & Ceylan, H. (2007). Explorations in Automatic Book Summarization. Proceedings of the
2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural
Language Learning, 380–389.
● An article on summarization methods.
Qi, P., Zhang, Y., Zhang, Y., Bolton, J., & Manning, C. D. (2020). Stanza: A Python Natural Language
Processing Toolkit for Many Human Languages. Proceedings of the 58th Annual Meeting of the
Association for Computational Linguistics: System Demonstrations, 101–108.
https://doi.org/10.18653/v1/2020.acl-demos.14
● The paper introducing Stanza
8. Resources and Related Projects
Lin, C.-Y. (2004). ROUGE: A Package for Automatic Evaluation of Summaries. Text Summarization Branches
Out, 74–81. https://www.aclweb.org/anthology/W04-1013/
● This is the paper that introduces the ROUGE (Recall-Oriented Understudy for Gisting
Evaluation) method for evaluating summaries.
9. What we have worked on so far
Found text material for summarization that has a human-made
summary available.
Applied multiple text summarization methods:
● Extractive text summarization using:
○ Stanza library
● Abstractive text summarization:
○ Keras Library
● Evaluation method:
○ ROUGE (Compare F-scores, precision and recall)
10. What we plan to work on
● Text summarization evaluation
We intend to use ROUGE as well as BLEU for evaluating the summarized text. A
human-written summary will also be fed into the evaluation for comparison.
A combination of scores from ROUGE, BLEU and human grading will be used to
evaluate model performance.
● User interface
A simple platform will be developed where a user can upload or submit text and
receive the summarized text in return.
Stanza is a Python natural language analysis package. It contains tools, which can be used in a pipeline, to convert a string containing human language text into lists of sentences and words, to generate base forms of those words, their parts of speech and morphological features, to give a syntactic structure dependency parse, and to recognize named entities. The toolkit is designed to be parallel among more than 60 languages, using the Universal Dependencies formalism.
ROUGE, or Recall-Oriented Understudy for Gisting Evaluation, is a set of metrics and a software package used for evaluating automatic summarization and machine translation software in natural language processing. The metrics compare an automatically produced summary or translation against one or more human-produced reference summaries or translations.
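To make the comparison concrete, here is a minimal sketch of a ROUGE-N style score for a single candidate and a single reference. It is a simplification for illustration, not the official ROUGE package: it assumes lowercased whitespace tokenization, no stemming or stopword removal, and the function name `rouge_n` is hypothetical.

```python
from collections import Counter


def rouge_n(candidate, reference, n=1):
    """Return (precision, recall, f1) based on clipped n-gram overlap.

    Assumption: tokens are obtained by lowercasing and splitting on whitespace.
    """
    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    # Multiset intersection counts each n-gram at most as often as it
    # appears in both texts.
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1
```

Recall measures how much of the human reference is covered by the generated summary, precision measures how much of the generated summary appears in the reference, and the F-score balances the two.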
BLEU (bilingual evaluation understudy) is an algorithm for evaluating the quality of text which has been machine-translated from one natural language to another. Quality is considered to be the correspondence between a machine's output and that of a human