In the quest to improve the quality of education, Flexudy leverages the
power of AI to help people learn more efficiently.
During the talk, I will show how we trained an automatic extractive text
summariser based on concepts from Reinforcement Learning, Deep Learning and Natural Language Processing. I will also talk about how we use pre-trained NLP models to generate simple questions for self-assessment.
3. Hi, I am Pascal Zoleko
My Projects:
Flexudy: PR & AI for Education (Study, Work)
PR & AI for Privacy: People Analytics
(PR & AI = Pattern Recognition & Artificial Intelligence)
4. Problems we want to solve
1. Too much to read
2. Too long to read
3. Abstracts are sometimes too bold.
4. Abstracts are sometimes too vague.
5. Abstracts are not available for all
kinds of text documents (e.g. web pages).
Some students (learners) … :
6. Read and forget
7. Can’t continuously evaluate their
knowledge on a subject.
8. Can’t revise while on
the train, bus etc.
5. Flexudy Education Today.
Automatic Text Summarisation (NLP, Ranking, Reinforcement Learning, Rules):
a simple overview, good enough to give an idea about the text.
Simple Question Generation (Deep Learning): fill in the blanks.
Simple but useful to remember the keywords found in a text.
Demo Video: we won't have enough time for this, but we can discuss it.
9. The Summariser Pipeline.
“Let AI do all the work
and then reap the fruits
of its labour.”
Step by Step
I will avoid technical terms as much as possible.
I made no assumptions about the audience.
So, no Maths!
10. Summary generation algorithm. Easy, but not trivial.
1. Get user text to be summarised
2. For each sentence in the text
3. Decide if sentence should be added to the summary.
4. If yes, then append the sentence to the summary
5. Format the summary and return
How do we train our summariser?
Reinforcement Learning with the Cross-Entropy method.
Can be improved by using other state-of-the-art
(SOTA) algorithms, e.g. Deep Q-Networks.
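The five steps above can be sketched in a few lines of Python. The `keep` callback stands in for the trained agent's YES/NO decision; the length threshold used in the example is a hypothetical placeholder, not the real scoring model.

```python
# Sketch of the summary generation loop described above.

def split_sentences(text):
    # Naive splitter for illustration; the real pipeline uses a proper
    # sentence tokenizer.
    return [s.strip() + "." for s in text.split(".") if s.strip()]

def summarise(text, keep):
    summary = []
    for sentence in split_sentences(text):   # step 2: for each sentence
        if keep(sentence):                   # step 3: YES/NO decision
            summary.append(sentence)         # step 4: append if YES
    return " ".join(summary)                 # step 5: format and return

text = "Short one. This sentence is clearly long enough to keep. No."
print(summarise(text, keep=lambda s: len(s) > 20))
```

At inference time, only this loop and the trained decision function are needed; the reward machinery described later is used purely for training.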
11. First, a quick recap.
The central idea behind Reinforcement Learning:
1. Agent
2. Reward
3. Environment
The agent receives observations from the environment and takes actions in it.
Money & Environment icons made by Freepik from www.flaticon.com
12. How does it translate to our use case?
The agent is a trainable non-linear function. It observes features of a
sentence from the original text, outputs a prediction in [0, 1], and
receives the sentence score as its reward before moving on to the next sentence.
Note: although the environment is fully observable, we decided
to observe sentences one at a time.
Can easily be improved: by observing many at a time.
13. We need data to train the agent.
Sources: Gutenberg, arXiv.org, Wikipedia.
A broad corpus for higher coverage.
Data collection took ~50% of our development time.
Data is handpicked from different domains:
Biology, History, Physics, Psychology etc.
Our current (English) implementation used ~300 documents,
so there is obviously a lot of overfitting.
The next release will be trained on a lot more data.
14. Then prepare the data.
Generate random chunks of text of [25K - 50K] characters
(e.g. 28K or 45.5K characters).
Chunks are kept small to keep training episodes short,
which works better for RL with cross-entropy.
Chunking is also cheap data augmentation: with few documents,
as in our case (~300), we would otherwise obtain overfitting.
The source documents range from 30K to ~400K characters;
we generated 12K chunks in our case.
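A minimal sketch of the chunking step, assuming uniformly random chunk sizes and start offsets; boundary handling is simplified and the original implementation may differ.

```python
import random

def make_chunks(document, low=25_000, high=50_000, n_chunks=5, seed=0):
    """Cut a document into random chunks of low..high characters."""
    rng = random.Random(seed)
    chunks = []
    for _ in range(n_chunks):
        # Pick a random chunk size within the [25K, 50K] band, capped
        # by the document length, then a random start offset.
        size = rng.randint(low, min(high, len(document)))
        start = rng.randint(0, len(document) - size)
        chunks.append(document[start:start + size])
    return chunks

doc = "x" * 120_000  # stands in for a ~120K-character document
chunks = make_chunks(doc)
assert all(25_000 <= len(c) <= 50_000 for c in chunks)
```

Because chunks from the same document overlap randomly, each document yields many distinct training episodes, which is the cheap augmentation mentioned above.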
15. Training step by step.
Start a new training episode. For each chunk in the batch:
1. Tokenise and extract sentences.
2. Extract sentence features.
3. For each sentence, the agent observes it and makes a decision:
add the sentence to the summary? YES / NO.
4. If YES, get a reward. Rewards are accumulated.
5. If there are no more sentences, save all episode steps
and the final reward.
A sentence represents a step in RL jargon.
16. Feature Extraction
Part-of-speech ratios.
Dependency ratios.
Word embeddings: we use GloVe. Could BERT be better? We will try that soon.
Sentence position in the document.
Ratio of skipped sentences.
Etc. Be creative.
Possible improvement:
SOTA sentence embeddings, more complex features (minimising the similarity of sentences),
Named Entity Recognition.
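The dependency-free features above can be sketched as follows. The part-of-speech and dependency ratios would come from a parser such as spaCy, and the embeddings from GloVe; this sketch shows only the positional features, and the function name and dictionary keys are illustrative, not the original API.

```python
def sentence_features(sentence, index, total_sentences, skipped_so_far):
    """Hand-crafted features for one sentence (positional subset only)."""
    words = sentence.split()
    return {
        "length": len(words),
        # Position of the sentence in the document, normalised to [0, 1].
        "position": index / max(total_sentences - 1, 1),
        # Ratio of sentences skipped so far (seen but not added).
        "skip_ratio": skipped_so_far / max(index, 1),
    }

feats = sentence_features("Cross entropy measures coding cost.", 2, 5, 1)
assert feats["position"] == 0.5
```

In the real pipeline this vector would be concatenated with the POS/dependency ratios and embedding features before being fed to the agent.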
17. Decision Making
Extract sentence features, the agent observes, then a random choice:
add sentence to summary? YES / NO.
1. Decisions are always random:
Yes or No (1 or 0) with probabilities
P(Decision = 1) and
P(Decision = 0) respectively.
2. Probabilities are based on
softmax predictions.
3. In early episodes, softmax predictions
are arbitrary.
4. We use a fully connected (FC)
Neural Network:
five FC layers, each with a high dropout probability
to minimise overfitting.
Possible improvement:
Sequence models, 1D Convolutional Neural Networks.
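The sampling step can be sketched as below. A single linear layer with toy weights stands in for the five-layer FC network with dropout described above; the point is that the decision is *sampled* from the softmax probabilities, not taken by argmax.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 2))  # toy weights: 3 features -> 2 logits (NO, YES)

def decide(features, rng):
    """Sample a YES/NO decision from softmax probabilities."""
    logits = features @ W
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                  # softmax over {NO, YES}
    return rng.choice(2, p=probs), probs  # sample 0 (NO) or 1 (YES)

decision, probs = decide(np.array([0.4, 0.5, 0.1]), rng)
assert decision in (0, 1)
```

With untrained weights the probabilities are arbitrary, exactly as in the early episodes mentioned above; training shifts them toward the elite episodes' decisions.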
18. Reward
Rewards are positive and negative:
positive if constraints are met, otherwise negative.
If YES, get a reward. Rewards are accumulated.
How are rewards computed?
With the TextRank algorithm.
We forked SummaNLP's
implementation and
modified it to our needs.
What are the constraints?
The number of selected sentences S should
not exceed an integer M,
with M <= the total number of sentences.
M is the theoretical maximum number
of sentences in any generated summary.
In our case, M = 20.
For example: if a sentence with score x is
selected for the summary (i.e. YES is predicted),
but S >= M, then x = -x.
In other words, we punish the agent for
exceeding the upper bound.
Possible improvement:
Try different algorithms, e.g. LexRank. Combine algorithms. Manually rank sentences.
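The sign-flip rule can be stated in a few lines; the sentence score x would come from the modified TextRank implementation, and the function name here is illustrative.

```python
M = 20  # maximum number of sentences in any generated summary

def reward(score, selected_so_far, max_sentences=M):
    """Per-sentence reward: the TextRank-style score, negated once the
    summary exceeds the M-sentence budget."""
    if selected_so_far >= max_sentences:
        return -score  # punish the agent for exceeding the upper bound
    return score

assert reward(0.8, 5) == 0.8    # within budget: positive reward
assert reward(0.8, 20) == -0.8  # budget exceeded: negated reward
```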
19. The steps are repeated for every sentence
and every chunk in the batch.
Sentences are the steps; chunks are the episodes. Each episode E_j
consists of steps 1 … K with per-step scores s_1 … s_K, and its
episode score is the sum:
Score(E_j) = ∑_{i=1}^{K} s_i
20. The learning step.
1. Select the episodes with the best scores,
i.e. episodes with scores at least as high as some p-th percentile.
We chose p = 90 based on our empirical analysis.
2. Train the agent on these elite episodes:
their decisions become our new "ground truth".
Note: the score is not fed into the Neural Network (agent).
The score is no longer needed at inference time.
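The elite-selection step of the cross-entropy method can be sketched as follows; the episode representation is a placeholder, since only the filtering logic is shown.

```python
import numpy as np

def select_elites(episodes, scores, percentile=90):
    """Keep only episodes whose score reaches the p-th percentile.

    The surviving episodes' (observation, decision) pairs serve as the
    new "ground truth" for supervised training of the agent.
    """
    bound = np.percentile(scores, percentile)
    return [ep for ep, s in zip(episodes, scores) if s >= bound]

episodes = ["ep_a", "ep_b", "ep_c", "ep_d", "ep_e"]
scores = [1.0, 3.5, 2.0, 9.0, 4.0]
elites = select_elites(episodes, scores)
```

Each training iteration then fits the network on the elite decisions with an ordinary classification loss, which is why the score itself never enters the network.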
21. Training curves: loss, reward bound and reward mean over episodes.
At first the agent is careless, then it becomes shy,
and eventually it has learned from experience.
22. But wait, aren’t we just implicitly learning
the TextRank scoring algorithm ?
23. Yes, but:
1. The model does not depend on vocabulary.
2. Transfer learning can be used to improve the agent:
- For a particular use case or in general.
- By simply changing the scoring function when training on new data.
3. The pipeline is flexible.
- Easily integrate new algorithms and architectures.
4. In practice, summaries are usually generated faster.
24. An honest example: Summarise this page
https://en.wikipedia.org/wiki/Cross_entropy
25. An honest example: TextRank results - 17 sentences
- In information theory, the cross entropy between two probability distributions p and q over the same underlying set …
- The cross entropy of the distribution q relative to a distribution p over a given set is defined as follows:
- The definition may be formulated using the Kullback–Leibler divergence D_KL(p ‖ q) of …
- For discrete probability distributions p and q with the same support X …
- H(p, q) = −∑_{x∈X} p(x) log q(x) …
- H(p, q) = −∫_X P(x) log Q(x) dr(x) …
- Therefore, cross entropy can be interpreted as the expected message-length per datum when a wrong distribution q is assumed while …
- That is why the expectation is taken over the true probability distribution p and not q.
- There are many situations where cross-entropy needs to be measured but the distribution of p is unknown.
- This is a Monte Carlo estimate of the true cross entropy, where the test set is treated as samples from p(x) [citation needed].
- If the estimated probability of outcome i is q_i, while the frequency (empirical probability) of outcome i …
- (1/N) log ∏_i q_i^{N p_i} = ∑_i p_i log q_i = −H(p, q) …
- When comparing a distribution q against a fixed reference distribution p, cross entropy and KL divergence are …
- This has led to some ambiguity in the literature, with some authors attempting to resolve the inconsistency by redefining cross-entropy to be D_KL(p …
- The true probability p_i is the true label, and the given distribution q_i is the predicted value of the …
- Having set up our notation, p ∈ {y, 1 − y} and q ∈ {ŷ, 1 − ŷ} …
- J(w) = (1/N) ∑_{n=1}^{N} H(p_n, q_n) = −(1/N) ∑_{n=1}^{N} [y_n log ŷ_n + (1 − y_n) log(1 − ŷ_n)] …
26. An honest example: Flexudy results - 12 sentences
- In information theory, the cross entropy between two probability distributions p and q over the …
- In information theory, the Kraft–McMillan theorem establishes that any directly decodable coding scheme for coding a message to …
- Therefore, cross entropy can be interpreted as the expected message-length per datum when a wrong distribution q …
- An example is language modeling, where a model is created based on a training set T, and then its cross-entropy …
- In this example, p is the true distribution of words in any corpus, and q is the distribution of …
- In these cases, an estimate of cross-entropy is calculated using the following formula: H(T, q) =
- … N. This is a Monte Carlo estimate of the true cross entropy, where the test set is treated as samples from p(x) …
- Cross-entropy minimization is frequently used in optimization and rare-event probability estimation; see the cross-entropy method.
- This has led to some ambiguity in the literature, with some authors attempting to resolve the inconsistency by redefining cross- …
- Cross entropy can be used to define a loss function in machine learning and optimization.
- The output of the model for a given observation, given a vector of input features x, can be interpreted as a …
- The typical cost function that one uses in logistic regression is computed by taking the average of all cross-entropies in the sample.
28. We cannot tell.
We do not yet have evidence to support
such a claim.
// TODO - Evaluate Flexudy with BLEU and ROUGE scores
29. An honest example II: Summarise this page
https://en.wikipedia.org/wiki/Renaissance
30. An honest example II: Flexudy results - 11 sentences
- The School of Athens (1509–1511), Raphael
Topics
Humanism Age of Discovery Architecture Dance Fine arts
- Depicting the Hebrew prophet-prodigy-king David as a muscular Greek athlete, the Christian humanist ideal can be seen in the ..
- REN-ə-sahnss)[2][a] was a period in European history marking the transition from the Middle Ages to Modernity and covering …
- In addition to the standard periodization, proponents of a long Renaissance put its beginning in the 14th century and its end in the 17th …
- The traditional view focuses more on the early modern aspects of the Renaissance and argues that it was a break from the past, …
The intellectual basis of the Renaissance was its version of humanism, derived from the concept of Roman Humanitas and the rediscovery …
- Early examples were the development of perspective in oil painting and the recycled knowledge of how to make concrete.
- Although the invention of metal movable type sped the dissemination of ideas from the later 15th century, the changes of the Renaissance …
- As a cultural movement, the Renaissance encompassed innovative flowering of Latin and vernacular literatures, beginning with the …
- In politics, the Renaissance contributed to the development of the customs and conventions of diplomacy, and in science to an …
- Various theories have been proposed to account for its origins and characteristics, focusing on a variety of factors including the …
- Other major centres were northern Italian city-states such as Venice, Genoa, Milan, Bologna, and finally Rome during the …
The first 2 sentences make absolutely no sense
32. Future work
1. Try new architectures and algorithms, e.g. 1D Convolutions.
2. Support formulas, e.g. Mathematics:
combine Reinforcement Learning and Logic (Symbolic AI).
3. Manual annotation to improve sentence selection.
4. Collect more data.
5. Use SOTA sentence embeddings.
6. Improve sentence boundary detection algorithms.
7. Implement co-reference resolution to deal with pronouns.
33. References
1. Deep Reinforcement Learning Hands-On by Maxim Lapan
2. A Survey Automatic Text Summarization by Oguzhan Tas & Farzad Kiyani
3. Deep Transfer Reinforcement Learning for Text Summarization
by Yaser Keneshloo, Naren Ramakrishnan & Chandan K. Reddy
4. Variations of the Similarity Function of TextRank for Automated Summarization
by Federico Barrios, Luis Argerich & Rosa W.
5. Natural language understanding with Bloom embeddings, convolutional
neural networks and incremental parsing by Matthew Honnibal & Ines Montani
34. To learn more about the meetup, follow the link:
https://www.meetup.com/Erlangen-Artificial-Intelligence-Machine-Learning-Meetup
Presented at the Erlangen Artificial Intelligence & Machine Learning Meetup.