http://www.yukawa.kyoto-u.ac.jp/contents/seminar/detail.php?SNUM=51633
Tatsuji Takahashi¹, Yu Kohno¹,²
Seminar on science of complex systems
(organized by Yukio-Pegio Gunji),
Yukawa Institute for Theoretical Physics,
Kyoto University,
Jan. 20, 2014
¹ Tokyo Denki University,
² JSPS (from Apr. 2014)
A toy model of human cognition: Utilizing fluctuation in uncertain and non-stationary environments
7. Contents
The loosely symmetric (LS) model
Cognitive properties or cognitive biases
Analysis of reconstruction of LS
Result: Efficacy in reinforcement learning
Utilization of fluctuation in non-stationary environments
14. A toy model of human cognition
Modeling that focuses on deviations from rational standards:
cognitive biases, the differences from “machines”
Principal properties implemented in a form as simple as possible,
so that the model can be analyzed and applied easily
Intuition of human beings
simple, again: not the policies (or strategies) that are learnt
through education and culture
20. LS as a toy model of cognition
We treat the loosely symmetric (LS) model proposed by Shinohara (2007). LS:
models cognitive biases
is merely a function over the co-occurrence information between two events
faithfully describes the causal intuition of humans,
which forms the basis of decision-making and action for adaptation in the world
28. The loosely symmetric (LS) model
A quasi-probability function LS(·|·), like conditional probability P(·|·).
Defined over the co-occurrence information of events p and q.
The relationship from p to q: LS(q|p).
LS describes the causal intuition of human beings the most faithfully
(among more than 40 existing models).

Co-occurrence table (prior event p in rows, posterior event q in columns):

          q     ¬q
  p       a     b
  ¬p      c     d

P(q|p) = a / (a + b)

LS(q|p) = (a + (b/(b+d))·d) / (a + (b/(b+d))·d + b + (a/(a+c))·c)
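Written out in code, the two definitions above read as follows (a minimal Python sketch; the guards for zero margins are our addition, not part of the slide's definition):

```python
def p_cond(a, b):
    """Conditional probability P(q|p) = a / (a + b)."""
    return a / (a + b) if a + b > 0 else 0.0

def ls(a, b, c, d):
    """Loosely symmetric model LS(q|p) (Shinohara, 2007).

    a, b, c, d are the co-occurrence counts of
    (p, q), (p, not-q), (not-p, q), (not-p, not-q).
    Returning 0.5 on an empty table is our guard, not the slide's.
    """
    bd = b / (b + d) * d if b + d > 0 else 0.0   # (b/(b+d))·d
    ac = a / (a + c) * c if a + c > 0 else 0.0   # (a/(a+c))·c
    den = a + bd + b + ac
    return (a + bd) / den if den > 0 else 0.5

# Same ratio a/(a+b) = 0.75, but LS also weighs the not-p column:
print(p_cond(6, 2), ls(6, 2, 2, 6))  # 0.75 vs. ~0.682
```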
35. The loosely symmetric (LS) model
Inductive inference of causal relationship:
how do humans form the intensity of the causal relationship from p to q,
when p is the candidate cause of the effect q in focus?
The functional form f(a, b, c, d) for the human causal intuition:

LS(q|p) = (a + (b/(b+d))·d) / (a + (b/(b+d))·d + b + (a/(a+c))·c)

Meta-analysis as in Hattori & Oaksford (2007): correlation r with human
causal judgments across eight experiments, for LS and for the ΔP rule:

  Experiment   AS95   BCC03.1   BCC03.3   H03    H06    LS00   W03.2   W03.6
  r for LS     0.95   0.98      0.98      0.98   0.97   0.85   0.95    0.85
  r for ΔP     0.88   0.92      0.84      0.00   0.71   0.88   0.28    0.46
41. In 2-armed bandit problems
LS used as the value function in reinforcement learning
(more on bandit problems later).
The agent evaluates the actions according to the causal intuition of humans.
Very good adaptation to the environment, both in the short term and the long term.
[Figure: accuracy rate (0.5 to 1.0) vs. step (1 to 1000, log scale) for LS, CP,
ToW(0.5), SM(0.3), and SM(0.7).]
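A minimal sketch of such an agent, under one assumption the slides leave implicit: for each arm, its own wins and losses supply a and b, and the other arm's supply c and d, with the higher LS value chosen greedily. The function names and parameters here are illustrative, not from the original:

```python
import random

def ls(a, b, c, d):
    """Loosely symmetric value (same definition as sketched above)."""
    bd = b / (b + d) * d if b + d > 0 else 0.0
    ac = a / (a + c) * c if a + c > 0 else 0.0
    den = a + bd + b + ac
    return (a + bd) / den if den > 0 else 0.5

def run_ls_agent(p=(0.4, 0.6), steps=1000, seed=0):
    """Greedy LS agent on a 2-armed Bernoulli bandit; returns accuracy."""
    rng = random.Random(seed)
    wins, losses = [0, 0], [0, 0]
    best = max(range(2), key=lambda i: p[i])
    correct = 0
    for _ in range(steps):
        # Arm i: a, b = own wins/losses; c, d = the other arm's (our assumption).
        values = [ls(wins[i], losses[i], wins[1 - i], losses[1 - i])
                  for i in range(2)]
        i = max(range(2), key=lambda k: values[k])  # greedy on LS values
        if rng.random() < p[i]:
            wins[i] += 1
        else:
            losses[i] += 1
        correct += (i == best)
    return correct / steps

print(run_ls_agent())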
51. The loosely symmetric (LS) model
From the analysis of LS, we found the following cognitive properties:
Ground-invariance (like visual attention, Takahashi et al., 2010)
Comparative valuation
psychology: Tversky & Kahneman, Science, 1974.
brain science: Daw et al., Nature, 2006.
Idiosyncratic, asymmetric risk attitude as in prospect theory
Kahneman & Tversky, Am. Psy., 1984; Boorman et al., Neuron, 2009.
Satisficing
Simon, Psy. Rev., 1956; Kolling et al., Science, 2012.
57. Principal human cognitive biases
Humans:
Satisficing: do not optimize but satisfice;
they become satisfied when the outcome is better than the reference level.
Comparative valuation: evaluate states and actions in a relative manner.
Asymmetric risk attitude: recognize gain and loss asymmetrically.
60. Satisficing
[Figure: arms A1 and A2 relative to a reference level, in two settings.]
All arms over the reference: no pursuit of arms beyond the given reference level.
All arms under the reference: search hard for an arm over the reference level.

Risk attitude (reliability consideration):
Risk-avoiding over the reference: between past records of winning (o) and
losing (x) with equal expected value 0.75, choose 15/20 rather than 3/4
(a comparison considering reliability).
Risk-seeking under the reference: between records with equal expected
value 0.25, gamble on 1/4 rather than 5/20 (the reflection effect).

Comparative evaluation: choosing A1 and losing lowers the value of A1 and
raises the value of A2 (see-saw), so the agent tries arms other than A1;
under absolute evaluation it would keep choosing A1.
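Both preferences follow from the LS value itself, under the same contingency assignment we assumed earlier (own record as a, b; the alternative's as c, d). A quick numerical check in Python:

```python
def ls(a, b, c, d):
    """Loosely symmetric value (same sketch as above)."""
    bd = b / (b + d) * d if b + d > 0 else 0.0
    ac = a / (a + c) * c if a + c > 0 else 0.0
    den = a + bd + b + ac
    return (a + bd) / den if den > 0 else 0.5

# Over the reference (both records have expected value 0.75):
print(ls(15, 5, 3, 1))   # 15/20 -> ~0.679
print(ls(3, 1, 15, 5))   # 3/4   -> ~0.523: the reliable 15/20 is preferred

# Under the reference (both records have expected value 0.25):
print(ls(5, 15, 1, 3))   # 5/20  -> ~0.321
print(ls(1, 3, 5, 15))   # 1/4   -> ~0.477: the gamble 1/4 wins (reflection)
```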
61. The generalized LS with variable reference (LSVR)
LSVR is a generalization of LS with an autonomously adjusted reference parameter.
[Figure: abstract image of the variable reference.]
66. n-armed bandit problem (nABP)
The simplest framework in reinforcement learning, exhibiting the
exploration-exploitation dilemma and the speed-accuracy tradeoff.
The task is to maximize the total reward acquired from n actions
(sources) with unknown reward distributions.
A one-armed bandit is a slot machine that gives a reward (win) or not (lose).
An n-armed bandit is a slot machine with n arms that have different
probabilities of winning.
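A minimal Bernoulli-bandit environment along these lines (a sketch; the uniform draw of the win probabilities follows the experimental setup quoted in the results below, and the class name is ours):

```python
import random

class BernoulliBandit:
    """n-armed bandit: arm i wins (reward 1) with probability p[i]."""

    def __init__(self, n, rng=None):
        self.rng = rng or random.Random()
        self.p = [self.rng.random() for _ in range(n)]  # uniform on [0, 1]

    def pull(self, i):
        """Return 1 (win) or 0 (lose) for arm i."""
        return 1 if self.rng.random() < self.p[i] else 0

    def best(self):
        """Index of the optimal arm (for scoring only; hidden from the agent)."""
        return max(range(len(self.p)), key=lambda i: self.p[i])

bandit = BernoulliBandit(n=100)
print(bandit.pull(0), bandit.best())
```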
71. Performance indices for nABP
Accuracy: the average percentage of choosing the optimal action.
Regret (expected loss): the difference between the accumulated rewards
actually acquired and those of the best possible sequence of actions
(one with accuracy = 1.0 all through the trial).
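Under these definitions, both indices can be computed from the trace of chosen arms; a sketch for the stationary case (in the non-stationary settings below, the optimal arm would have to be re-evaluated at every step):

```python
def accuracy(chosen, best):
    """Fraction of steps on which the optimal arm was chosen."""
    return sum(i == best for i in chosen) / len(chosen)

def expected_regret(chosen, p):
    """Sum over steps of p_best - p_chosen (expected-loss form of regret)."""
    p_best = max(p)
    return sum(p_best - p[i] for i in chosen)

p = [0.4, 0.6]
chosen = [0, 1, 1, 1]              # arms chosen on four steps
print(accuracy(chosen, 1))         # 0.75
print(expected_regret(chosen, p))  # 0.2
```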
72. Result
n = 100; the reward probability for each action is taken uniformly from [0, 1].
[Figure: accuracy rate (left) and expected loss (right) vs. steps (0 to 1e6)
for LS, LS-VR, and UCB1-tuned, each with γ = 0.999.]
Accuracy: highest. Regret: smallest.
The more actions there are, the better the performance of LSVR becomes.
Kohno & Takahashi, 2012; in prep.
74. Result in non-stationary environment 1
n = 16; the reward probability is taken from [0, 1].
The probabilities are totally reset every 10,000 steps.
[Figure: accuracy rate (left) and expected loss (right, 0 to 300) vs. steps
(0 to 50,000) for LS, LS-VR, and UCB1-tuned, each with γ = 0.999.]
Accuracy: highest. Regret: smallest.
Kohno & Takahashi, in prep.
80. Result in non-stationary environment 2
n = 20; the initial probability is taken from [0, 1]. The probability of
each action is reset with probability 0.0001 at each step.
[Figure: accuracy rate vs. steps (0 to 50,000) for LS, LS-VR, and UCB1-tuned,
each with γ = 0.999. Accuracy here is the rate of choosing the action that is
optimal at the time of the choice.]
Even when a not-well-tried action becomes the new optimum, the agent can
switch to the optimal action.
If the reward were given deterministically, this would be impossible.
Efficient search utilizing uncertainty and fluctuation in non-stationary
environments.
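The two reset schedules can be sketched as one environment class (an illustration of the setups described above; the class and parameter names are ours):

```python
import random

class NonStationaryBandit:
    """Bernoulli bandit whose win probabilities change over time.

    reset_every: redraw all probabilities every `reset_every` steps
                 (environment 1: reset_every=10_000).
    reset_prob:  each step, each arm's probability is independently
                 redrawn with this probability
                 (environment 2: reset_prob=0.0001).
    """

    def __init__(self, n, reset_every=None, reset_prob=0.0, rng=None):
        self.rng = rng or random.Random()
        self.p = [self.rng.random() for _ in range(n)]
        self.reset_every = reset_every
        self.reset_prob = reset_prob
        self.t = 0

    def pull(self, i):
        self.t += 1
        # Environment 1: synchronous reset of all arms on a fixed schedule.
        if self.reset_every and self.t % self.reset_every == 0:
            self.p = [self.rng.random() for _ in range(len(self.p))]
        # Environment 2: rare, independent per-arm resets.
        for k in range(len(self.p)):
            if self.rng.random() < self.reset_prob:
                self.p[k] = self.rng.random()
        return 1 if self.rng.random() < self.p[i] else 0

env1 = NonStationaryBandit(n=16, reset_every=10_000)
env2 = NonStationaryBandit(n=20, reset_prob=0.0001)
```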
85. Results
[Figure, stationary: accuracy rate vs. steps (0 to 1e6) for LS, LS-VR, and
UCB1-tuned, each with γ = 0.999.]
The more options there are, the better the performance of LSVR becomes.
[Figure, non-stationary 2: accuracy rate vs. steps (0 to 50,000).]
LSVR can trace the unobserved change, amplifying fluctuation.
[Figure, non-stationary, synchronous resets: accuracy rate vs. steps (0 to 50,000).]
LSVR can trace the change in non-stationary environments.
91. Discussion
The cognitive biases of humans, when combined:
Work effectively for adaptation under uncertainty.
Conflate an action and the set of the actions through comparative valuation.
Symbolize the whole situation into a virtual action.
Utilize the fluctuation that stems from uncertainty, enabling
adaptation to non-stationary environments.
94. Conflating part and whole
Comparative valuation conflates the information of a single action with
that of the whole set of actions.
This is universal in living systems, from slime molds (Latty & Beekman, 2011)
to neurons (Royer & Paré, 2003) to animals and human beings.
100. Relative evaluation is especially important
★ Relative evaluation:
★ is what even slime molds and real neural networks (conservation of
synaptic weights) do. Behavioral economics found that humans evaluate
actions and states comparatively.
★ weakens the dilemma between exploitation and exploration through a
see-saw-like competition among arms:
★ Through failure (low reward), a choice of the greedy action may quickly
trigger a switch to the previously second-best, non-greedy arm.
★ Through success (high reward), a choice of the greedy action may quickly
narrow the focus onto the currently greedy action, lessening the possibility
of choosing non-greedy arms by decreasing the values of the other arms.
[Figure: under absolute evaluation, choosing A1 and losing leaves the values
of A1 and A2 unchanged; under relative evaluation the values see-saw, so the
agent tries arms other than A1.]
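The see-saw can be checked directly on the LS values (same sketch and contingency assignment as above; the counts are hypothetical):

```python
def ls(a, b, c, d):
    """Loosely symmetric value (same sketch as above)."""
    bd = b / (b + d) * d if b + d > 0 else 0.0
    ac = a / (a + c) * c if a + c > 0 else 0.0
    den = a + bd + b + ac
    return (a + bd) / den if den > 0 else 0.5

# Before: A1 has 6 wins / 2 losses, A2 has 2 wins / 2 losses.
print(ls(6, 2, 2, 2), ls(2, 2, 6, 2))  # ~0.667, ~0.462
# After one loss on A1 (6/3 vs. 2/2): A1 drops and A2 rises, the see-saw.
print(ls(6, 3, 2, 2), ls(2, 2, 6, 3))  # ~0.615, ~0.478
```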
104. Symbolization of the whole and comparative valuation with multiple actions
[Figure: n slot machines A1, A2, ..., An, plus a virtual machine Ag
representing the whole.]
106. Comparative valuation with a virtual action representing the whole
[Figure: the virtual machine Ag, representing the whole, is compared
(“>” or “<”?) with each of the arms A1, A2, ..., An.]
119. Conclusion
The cognitive biases that look irrational are, when appropriately combined
as in humans, actually rational for adapting to uncertain environments and
for survival through evolution.
Applicable in engineering, to machine learning and robot control.
Implications for brain science (the brain as a machine-learning device):
Modeling PFC and vmPFC.
Brain science and the three cognitive biases:
Satisficing: Kolling et al., Science, 2012.
Comparative valuation of state-action value: Daw et al., Nature, 2006.
Idiosyncratic risk evaluation: Boorman et al., Neuron, 2009.
125. Applications of bandit problems
★ Monte-Carlo tree search over game trees (Go AI)
★ Online advertisement
★ e.g., A/B testing
★ Design of medical treatment
★ Reinforcement learning