In the quest to improve the quality of education, Flexudy leverages the
power of AI to help people learn more efficiently.
During the talk, I will show how we trained an automatic extractive text
summariser based on concepts from Reinforcement Learning, Deep Learning and Natural Language Processing. I will also talk about how we use pre-trained NLP models to generate simple questions for self-assessment.
3. Hi, I am Pascal Zoleko
My Projects:
Flexudy: PR & AI for Education (Study, Work)
PR & AI for Privacy: People Analytics
(PR & AI = Pattern Recognition & Artificial Intelligence)
4. Problems we want to solve
1. Too much to read
2. Too long to read
3. Abstracts are sometimes too bold.
4. Abstracts are sometimes too vague.
5. Abstracts are not available for all
kinds of text documents (e.g. web pages).
Some students (learners) … :
6. Read and forget
7. Can’t continuously evaluate their
knowledge on a subject.
8. Can’t revise while on
the train, bus etc.
5. Flexudy Education Today.
Automatic Text Summarisation (NLP, Ranking, Reinforcement Learning, Rules):
a simple overview, good enough to give an idea about the text.
Simple Question Generation (Deep Learning): fill in the blanks.
Simple but useful to remember the keywords found in a text.
Demo Video: we won't have enough time for this, but we can discuss it.
9. The Summariser Pipeline.
“Let AI do all the work
and then reap the fruits
of its labour.”
Step by Step
I will avoid technical terms as much as possible.
I made no assumptions about the audience.
So, no Maths!
10. Summary generation algorithm. Easy, but not trivial.
1. Get user text to be summarised
2. For each sentence in the text
3. Decide if sentence should be added to the summary.
4. If yes, then append the sentence to the summary
5. Format the summary and return
How do we train our summariser?
Reinforcement Learning with the Cross-Entropy method.
Can be improved by using other state-of-the-art
(SOTA) algorithms, e.g. Deep Q-Networks.
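The five steps above can be sketched in a few lines of Python. The `keep` callback stands in for the trained agent's YES/NO decision; the length threshold used in the example is a hypothetical placeholder, not the real scoring model.

```python
# Sketch of the summary generation loop described above.

def split_sentences(text):
    # Naive splitter for illustration; the real pipeline uses a proper
    # sentence tokenizer.
    return [s.strip() + "." for s in text.split(".") if s.strip()]

def summarise(text, keep):
    summary = []
    for sentence in split_sentences(text):   # step 2: for each sentence
        if keep(sentence):                   # step 3: YES/NO decision
            summary.append(sentence)         # step 4: append if YES
    return " ".join(summary)                 # step 5: format and return

text = "Short one. This sentence is clearly long enough to keep. No."
print(summarise(text, keep=lambda s: len(s) > 20))
```

At inference time, only this loop and the trained decision function are needed; the reward machinery described later is used purely for training.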
11. First, a quick recap.
The central idea behind Reinforcement Learning:
1. Agent
2. Reward
3. Environment
The agent receives observations from the environment and takes actions in it.
Money & Environment icons made by Freepik from www.flaticon.com
12. How does it translate to our use case?
The agent is a trainable non-linear function. It observes features of a
sentence from the original text, outputs a prediction in [0, 1], and
receives the sentence score as its reward before moving on to the next sentence.
Note: although the environment is fully observable, we decided
to observe sentences one at a time.
Can easily be improved: by observing many at a time.
13. We need data to train the agent.
Sources: Gutenberg, arXiv.org, Wikipedia.
A broad corpus for higher coverage.
Data collection took ~50% of our development time.
Data is handpicked from different domains:
Biology, History, Physics, Psychology etc.
Our current (English) implementation used ~300 documents,
so there is obviously a lot of overfitting.
The next release will be trained on a lot more data.
14. Then prepare the data.
Generate random chunks of text of [25K - 50K] characters
(e.g. 28K or 45.5K characters).
Chunks are kept small to keep training episodes short,
which works better for RL with cross-entropy.
Chunking is also cheap data augmentation: with few documents,
as in our case (~300), we would otherwise obtain overfitting.
The source documents range from 30K to ~400K characters;
we generated 12K chunks in our case.
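A minimal sketch of the chunking step, assuming uniformly random chunk sizes and start offsets; boundary handling is simplified and the original implementation may differ.

```python
import random

def make_chunks(document, low=25_000, high=50_000, n_chunks=5, seed=0):
    """Cut a document into random chunks of low..high characters."""
    rng = random.Random(seed)
    chunks = []
    for _ in range(n_chunks):
        # Pick a random chunk size within the [25K, 50K] band, capped
        # by the document length, then a random start offset.
        size = rng.randint(low, min(high, len(document)))
        start = rng.randint(0, len(document) - size)
        chunks.append(document[start:start + size])
    return chunks

doc = "x" * 120_000  # stands in for a ~120K-character document
chunks = make_chunks(doc)
assert all(25_000 <= len(c) <= 50_000 for c in chunks)
```

Because chunks from the same document overlap randomly, each document yields many distinct training episodes, which is the cheap augmentation mentioned above.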
15. Training step by step.
Start a new training episode. For each chunk in the batch:
1. Tokenise and extract sentences.
2. Extract sentence features.
3. For each sentence, the agent observes it and makes a decision:
add the sentence to the summary? YES / NO.
4. If YES, get a reward. Rewards are accumulated.
5. If there are no more sentences, save all episode steps
and the final reward.
A sentence represents a step in RL jargon.
16. Feature Extraction
Part-of-speech ratios.
Dependency ratios.
Word embeddings: we use GloVe. Could BERT be better? We will try that soon.
Sentence position in the document.
Ratio of skipped sentences.
Etc. Be creative.
Possible improvement:
SOTA sentence embeddings, more complex features (minimising the similarity of sentences),
Named Entity Recognition.
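The dependency-free features above can be sketched as follows. The part-of-speech and dependency ratios would come from a parser such as spaCy, and the embeddings from GloVe; this sketch shows only the positional features, and the function name and dictionary keys are illustrative, not the original API.

```python
def sentence_features(sentence, index, total_sentences, skipped_so_far):
    """Hand-crafted features for one sentence (positional subset only)."""
    words = sentence.split()
    return {
        "length": len(words),
        # Position of the sentence in the document, normalised to [0, 1].
        "position": index / max(total_sentences - 1, 1),
        # Ratio of sentences skipped so far (seen but not added).
        "skip_ratio": skipped_so_far / max(index, 1),
    }

feats = sentence_features("Cross entropy measures coding cost.", 2, 5, 1)
assert feats["position"] == 0.5
```

In the real pipeline this vector would be concatenated with the POS/dependency ratios and embedding features before being fed to the agent.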
17. Decision Making
Extract sentence features, the agent observes, then a random choice:
add sentence to summary? YES / NO.
1. Decisions are always random:
Yes or No (1 or 0) with probabilities
P(Decision = 1) and
P(Decision = 0) respectively.
2. Probabilities are based on
softmax predictions.
3. In early episodes, softmax predictions
are arbitrary.
4. We use a fully connected (FC)
Neural Network:
five FC layers, each with a high dropout probability
to minimise overfitting.
Possible improvement:
Sequence models, 1D Convolutional Neural Networks.
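The sampling step can be sketched as below. A single linear layer with toy weights stands in for the five-layer FC network with dropout described above; the point is that the decision is *sampled* from the softmax probabilities, not taken by argmax.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 2))  # toy weights: 3 features -> 2 logits (NO, YES)

def decide(features, rng):
    """Sample a YES/NO decision from softmax probabilities."""
    logits = features @ W
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                  # softmax over {NO, YES}
    return rng.choice(2, p=probs), probs  # sample 0 (NO) or 1 (YES)

decision, probs = decide(np.array([0.4, 0.5, 0.1]), rng)
assert decision in (0, 1)
```

With untrained weights the probabilities are arbitrary, exactly as in the early episodes mentioned above; training shifts them toward the elite episodes' decisions.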
18. Reward
Rewards are positive and negative:
positive if constraints are met, otherwise negative.
If YES, get a reward. Rewards are accumulated.
How are rewards computed?
With the TextRank algorithm.
We forked SummaNLP's
implementation and
modified it to our needs.
What are the constraints?
The number of selected sentences S should
not exceed an integer M,
with M <= the total number of sentences.
M is the theoretical maximum number
of sentences in any generated summary.
In our case, M = 20.
For example: if a sentence with score x is
selected for the summary (i.e. YES is predicted),
but S >= M, then x = -x.
In other words, we punish the agent for
exceeding the upper bound.
Possible improvement:
Try different algorithms, e.g. LexRank. Combine algorithms. Manually rank sentences.
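The sign-flip rule can be stated in a few lines; the sentence score x would come from the modified TextRank implementation, and the function name here is illustrative.

```python
M = 20  # maximum number of sentences in any generated summary

def reward(score, selected_so_far, max_sentences=M):
    """Per-sentence reward: the TextRank-style score, negated once the
    summary exceeds the M-sentence budget."""
    if selected_so_far >= max_sentences:
        return -score  # punish the agent for exceeding the upper bound
    return score

assert reward(0.8, 5) == 0.8    # within budget: positive reward
assert reward(0.8, 20) == -0.8  # budget exceeded: negated reward
```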
19. The steps are repeated for every sentence
and every chunk in the batch.
Sentences are the steps; chunks are the episodes. Each episode E_j
consists of steps 1 … K with per-step scores s_1 … s_K, and its
episode score is the sum:
Score(E_j) = ∑_{i=1}^{K} s_i
20. The learning step.
1. Select the episodes with the best scores,
i.e. episodes with scores at least as high as some p-th percentile.
We chose p = 90 based on our empirical analysis.
2. Train the agent on these elite episodes:
their decisions become our new "ground truth".
Note: the score is not fed into the Neural Network (agent).
The score is no longer needed at inference time.
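The elite-selection step of the cross-entropy method can be sketched as follows; the episode representation is a placeholder, since only the filtering logic is shown.

```python
import numpy as np

def select_elites(episodes, scores, percentile=90):
    """Keep only episodes whose score reaches the p-th percentile.

    The surviving episodes' (observation, decision) pairs serve as the
    new "ground truth" for supervised training of the agent.
    """
    bound = np.percentile(scores, percentile)
    return [ep for ep, s in zip(episodes, scores) if s >= bound]

episodes = ["ep_a", "ep_b", "ep_c", "ep_d", "ep_e"]
scores = [1.0, 3.5, 2.0, 9.0, 4.0]
elites = select_elites(episodes, scores)
```

Each training iteration then fits the network on the elite decisions with an ordinary classification loss, which is why the score itself never enters the network.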
21. Training curves: loss, reward bound and reward mean over episodes.
At first the agent is careless, then it becomes shy,
and eventually it has learned from experience.
22. But wait, aren’t we just implicitly learning
the TextRank scoring algorithm ?
23. Yes, but:
1. The model does not depend on vocabulary.
2. Transfer learning can be used to improve the agent:
- For a particular use case or in general.
- By simply changing the scoring function when training on new data.
3. The pipeline is flexible.
- Easily integrate new algorithms and architectures.
4. In practice, summaries are usually generated faster.
24. An honest example: Summarise this page
https://en.wikipedia.org/wiki/Cross_entropy
25. An honest example: TextRank results - 17 sentences
- In information theory, the cross entropy between two probability distributions p and q over the same underlying set …
- The cross entropy of the distribution q relative to a distribution p over a given set is defined as follows:
- The definition may be formulated using the Kullback–Leibler divergence D_KL(p ‖ q) of …
- For discrete probability distributions p and q with the same support X …
- H(p, q) = −∑_{x∈X} p(x) log q(x) …
- H(p, q) = −∫_X P(x) log Q(x) dr(x) …
- Therefore, cross entropy can be interpreted as the expected message-length per datum when a wrong distribution q is assumed while …
- That is why the expectation is taken over the true probability distribution p and not q.
- There are many situations where cross-entropy needs to be measured but the distribution of p is unknown.
- This is a Monte Carlo estimate of the true cross entropy, where the test set is treated as samples from p(x) [citation needed].
- If the estimated probability of outcome i is q_i, while the frequency (empirical probability) of outcome i …
- (1/N) log ∏_i q_i^{N p_i} = ∑_i p_i log q_i = −H(p, q) …
- When comparing a distribution q against a fixed reference distribution p, cross entropy and KL divergence are …
- This has led to some ambiguity in the literature, with some authors attempting to resolve the inconsistency by redefining cross-entropy to be D_KL(p …
- The true probability p_i is the true label, and the given distribution q_i is the predicted value of the …
- Having set up our notation, p ∈ {y, 1 − y} and q ∈ {ŷ, 1 − ŷ} …
- J(w) = (1/N) ∑_{n=1}^{N} H(p_n, q_n) = −(1/N) ∑_{n=1}^{N} [y_n log ŷ_n + (1 − y_n) log(1 − ŷ_n)] …
26. An honest example: Flexudy results - 12 sentences
- In information theory, the cross entropy between two probability distributions p and q over the …
- In information theory, the Kraft–McMillan theorem establishes that any directly decodable coding scheme for coding a message to …
- Therefore, cross entropy can be interpreted as the expected message-length per datum when a wrong distribution q …
- An example is language modeling, where a model is created based on a training set T, and then its cross-entropy …
- In this example, p is the true distribution of words in any corpus, and q is the distribution of …
- In these cases, an estimate of cross-entropy is calculated using the following formula: H(T, q) =
- … N. This is a Monte Carlo estimate of the true cross entropy, where the test set is treated as samples from p(x) …
- Cross-entropy minimization is frequently used in optimization and rare-event probability estimation; see the cross-entropy method.
- This has led to some ambiguity in the literature, with some authors attempting to resolve the inconsistency by redefining cross- …
- Cross entropy can be used to define a loss function in machine learning and optimization.
- The output of the model for a given observation, given a vector of input features x, can be interpreted as a …
- The typical cost function that one uses in logistic regression is computed by taking the average of all cross-entropies in the sample.
28. We cannot tell.
We do not yet have evidence to support
such a claim.
// TODO - Evaluate Flexudy with BLEU and ROUGE scores
29. An honest example II: Summarise this page
https://en.wikipedia.org/wiki/Renaissance
30. An honest example II: Flexudy results - 11 sentences
- The School of Athens (1509–1511), Raphael
Topics
Humanism Age of Discovery Architecture Dance Fine arts
- Depicting the Hebrew prophet-prodigy-king David as a muscular Greek athlete, the Christian humanist ideal can be seen in the ..
- REN-ə-sahnss)[2][a] was a period in European history marking the transition from the Middle Ages to Modernity and covering …
- In addition to the standard periodization, proponents of a long Renaissance put its beginning in the 14th century and its end in the 17th …
- The traditional view focuses more on the early modern aspects of the Renaissance and argues that it was a break from the past, …
The intellectual basis of the Renaissance was its version of humanism, derived from the concept of Roman Humanitas and the rediscovery …
- Early examples were the development of perspective in oil painting and the recycled knowledge of how to make concrete.
- Although the invention of metal movable type sped the dissemination of ideas from the later 15th century, the changes of the Renaissance …
- As a cultural movement, the Renaissance encompassed innovative flowering of Latin and vernacular literatures, beginning with the …
- In politics, the Renaissance contributed to the development of the customs and conventions of diplomacy, and in science to an …
- Various theories have been proposed to account for its origins and characteristics, focusing on a variety of factors including the …
- Other major centres were northern Italian city-states such as Venice, Genoa, Milan, Bologna, and finally Rome during the …
The first 2 sentences make absolutely no sense
32. Future work
1. Try new architectures and algorithms, e.g. 1D Convolutions.
2. Support formulas, e.g. Mathematics:
combine Reinforcement Learning and Logic (Symbolic AI).
3. Manual annotation to improve sentence selection.
4. Collect more data.
5. Use SOTA sentence embeddings.
6. Improve sentence boundary detection algorithms.
7. Implement co-reference resolution to deal with pronouns.
33. References
1. Deep Reinforcement Learning Hands-On by Maxim Lapan
2. A Survey Automatic Text Summarization by Oguzhan Tas & Farzad Kiyani
3. Deep Transfer Reinforcement Learning for Text Summarization
by Yaser Keneshloo, Naren Ramakrishnan & Chandan K. Reddy
4. Variations of the Similarity Function of TextRank for Automated Summarization
by Federico Barrios, Luis Argerich & Rosa W.
5. Natural language understanding with Bloom embeddings, convolutional
neural networks and incremental parsing by Matthew Honnibal & Ines Montani
34. To learn more about the meetup, follow the link:
https://www.meetup.com/Erlangen-Artificial-Intelligence-Machine-Learning-Meetup
Presented at the Erlangen Artificial Intelligence & Machine Learning Meetup.