Quantifying reflection

Creating a gold-standard for evaluating automated reflection detection

Published in: Education, Technology, Business

  1. Quantifying reflection: Creating a gold-standard for evaluating automated reflection detection
     Thomas Ullmann, Fridolin Wild, Peter Scott, Knowledge Media Institute, The Open University
  2. Outline
     • A model for reflection
     • Related work on quantification of reflection
     • Methodology
     • Data collected
     • Results and discussion
     • Outlook
  3. Reflection is creative sense-making of the past
  4. State of the art in quantifying reflection
     (Reference. Scale. Unit of analysis. Findings.)
     • Dyment & O’Connell (2011). Scale: depth of reflection. Unit of analysis: studies (writings). Findings: meta review: five studies low, four medium, two studies high levels of reflection.
     • Wong et al. (1995). Scale: depth of reflection, habitual to critical. Unit of analysis: 45 students. Findings: content analysis and interviews: 76% reflectors, 11% critical reflectors.
     • Wald et al. (2012). Scale: reflective to non-reflective. Unit of analysis: 93 writings. Findings: 2nd year students, self-selected best of reflective field notes: 30% critically reflective, 11% transformative reflective.
     • Plack et al. (2005). Scale: frequencies of elements and depth of reflection. Unit of analysis: 43 journals. Findings: 43% reflection, 42% critical reflection; for frequencies see next slide.
     • Hatton & Smith (1995). Scale: units of reflection; dialogic versus descriptive. Unit of analysis: ‘units’ (in writings of 60 students). Findings: after instruction: 30% dialogic reflection; 19 reflective units on average per 8-12 pages.
     • Ross (1989). Scale: depth of reflection. Unit of analysis: 134 papers of 25 students. Findings: 22% highly reflective, 34% moderately reflective.
     • Williams et al. (2002). Scale: action classification. Unit of analysis: 56 student journals. Findings: 23% verify learning, 36% new understanding, 39% future behaviour.
  5. Plack et al. (%)
  6. Williams et al.
  7. Summary: Related work
     • More research on level than on elements
     • Wide range for ‘level of depth’
     • Measurements on students or writings/journals level
     • Mostly in the context of instructed reflective writing
     • Typically: Mapping from evidence to depth/breadth
     => No re-usable instrument to measure reflection
  8. The dimensions of reflection
     • Documentation of insights, plans, and intentions.
     • Switch point of view.
     • Argumentation and reasoning.
     • Identification of a conflict.
     • Awareness building over affective factors.
     • Explication of self-awareness, e.g., inner monologues, description of feelings.
     Source: Ullmann, Wild, Scott (2012): Comparing automatically detected reflective texts with human judgements. http://ceur-ws.org/Vol-931/paper8.pdf
  9. Example accounts (anonymised)
     (Dim: Type. Example.)
     • SA: Identification of a conflict. “[Victor] and [Morgan], you are right that I should have applied better my own learning instead of using the Uni ones.”
     • CA: Reasoning. “I imagine this is probably in order to have a focus and provide enough detail rather than skim over the whole area.”
     • TP: Switch point of view. “When I am doing FRT work, I often think about how the parents view me when they know I haven't got children!”
     • OD: Documentation of an insight. “After I saw how this lifted her mood and eased her anxiety, I will remember that what we can view sometimes to be small can actually make a significant difference.”
     • OD: Intention. “I would like to be involved in helping with the site, too - although I'm a novice! I imagine this is probably in order to have a focus and provide enough detail rather than skim over the whole area.”
     • OD: New understanding. “This has helped me reflect on my own life and experiences whilst allowing me to empathise with others in their own circumstances; I feel proud of what I have achieved so far as the work/life/study balance is always difficult to navigate, but I'm lucky that I have a supportive family to help.”
     • None. “Bye the way, Audacity is also run under the CC Attribution.”
  10. Methodology: creating a gold standard
     • Corpus selection: OU LMS forum posts, 4 subjects, 2 years; mid-range length postings
     • Sanitize: de-identification
     • Chunking (for cues): sentence level
     • Sample: 1000 random (500 pers., 500 non-pers.)
     • Batching: expand grid, 10 batches
     • Crowdsourcing: control questions, 5 raters each
     • ‘Spam’ filtering: justification valid, ‘gold questions’ passed
     • Objectification: ‘majority vote’, interrater reliability
  11. Crowdsourcing
     • Crowdflower: the ‘virtual pedestrian area’
     • Pre-tests showed:
       – Really simple questions needed for HITs
       – But: Quick answer options increase spam
       – Short texts easier than long texts (less spam, smaller costs)
       – Shuffling of answers to avoid artefacts
     • Check: larger than usual number of raters (5+) to see how reliable judgements are
  12. Example questionnaire
  13. OU Forum Corpus
  14. Countries (origin of request)
     • In total 411 raters
     • Most of them from the USA (N=202)
     • GB (N=94)
     • India (N=45)
     • 14 other nations (N=70)
  15. Across batches (3M)
  16. Frequency distribution (3M)
  17. Frequencies by courses (3M)
  18. Interrater Reliability
     • Raw data
       – Baseline: control questions: Krippendorff’s α = 0.43
       – Control questions + survey data: α = 0.32
       – Survey data: α = 0.22
     • ‘Objectified’ data
       – Majority vote of 3 to all raters agree: survey data α = 0.36 (623 out of 1,000 sentences)
       – Majority vote of 4 to all agree: survey data α = 0.581 (301 sentences)
       – Majority vote of 5 (to all) agree: α = 0.98 (with outliers) (107 out of 1,000 sentences)
  19. Discussion
     • Agreement of 5 of course increases IRR
       – (to 0.98 unfiltered)
       – when omitting ‘over answering’: to 1.0
       – But: reduces to single category sentences
     • Agreement of 3 deemed good enough
       – since questions were single choice, whereas multiple answers are correct
     • Sentences are a reduction, but allow zooming in on markers
     • Context: Forum texts
     • Personal vs. non-personal sentences
  20. Questions? Answers? bit.ly/tel-advances
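
The ‘objectification’ step and the reliability figures on slides 18 and 19 can be illustrated with a small sketch. It is not the authors' code: the function names `majority_label` and `krippendorff_alpha_nominal` are made up for this example, the ratings are invented toy data, and the category codes (SA, CA, TP, OD, None) are borrowed from slide 9 purely as labels. The sketch assumes single-choice answers, no missing ratings, and the nominal form of Krippendorff's α computed from a coincidence matrix.

```python
from collections import Counter
from itertools import permutations


def majority_label(ratings, min_votes):
    """Return the most frequent label if at least `min_votes` raters chose it, else None."""
    label, votes = Counter(ratings).most_common(1)[0]
    return label if votes >= min_votes else None


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data; `units` is a list of per-sentence rating lists."""
    coincidences = Counter()
    for ratings in units:
        m = len(ratings)
        if m < 2:
            continue  # a unit with fewer than two ratings has no pairable values
        # Every ordered pair of ratings from different raters counts 1 / (m - 1).
        for a, b in permutations(ratings, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    # Disagreement sums; the common factor 1 / n is dropped because it cancels in the ratio.
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(totals[a] * totals[b] for a in totals for b in totals if a != b) / (n - 1)
    if expected == 0:
        return 1.0  # only one category in use; treated as full agreement (a convention of this sketch)
    return 1.0 - observed / expected


if __name__ == "__main__":
    # Invented toy ratings (NOT data from the study): five raters per sentence.
    toy = [
        ["SA", "SA", "SA", "SA", "SA"],
        ["OD", "OD", "OD", "OD", "CA"],
        ["CA", "CA", "CA", "TP", "None"],
        ["None", "None", "None", "None", "None"],
        ["TP", "OD", "CA", "SA", "None"],
    ]
    for min_votes in (3, 4, 5):
        kept = [r for r in toy if majority_label(r, min_votes) is not None]
        alpha = krippendorff_alpha_nominal(kept)
        print(f"min_votes={min_votes}: kept {len(kept)} sentences, alpha={alpha:.2f}")
```

Raising `min_votes` keeps fewer sentences and drops those with the most disagreement, which is the trade-off between sample size and α that slides 18 and 19 describe.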