1. Decoder Ring
http://decoder-ring.net
Jeff Beeman jeff.beeman@asu.edu @doogiemac
GLS Conference 2010
2. Background
• Fall 2009 semester
• Seminars w/ Jim & Betty
• Wanted to emulate the kind of work I had been reading (Gee, Hayes, Steinkuehler, Duncan, etc.)
• The process for doing that work seemed painful to me
3. Traditional process
Find content → Copy into Word docs → Take notes / highlight phrases → Come up w/ equations & charts → Manually transfer data to Excel
(At least how I see it)
4. Traditional process
Find content → Copy into Word docs → Take notes / highlight phrases → Come up w/ equations & charts → Manually transfer data to Excel
Wasting time... and it’s BORING
5. I’m lazy
• I want to:
• use technology to solve repetitive, boring problems for me
• write something once, use it many times
• take advantage of work others have already done
• work with a lot of data
8. Better process
Find content → Create importer → Import content → Analyze content
Get someone else to do this
9. Initial requirements
• Abstracted, flexible, powerful data model
• Sustainable, low-cost framework
• Web-based to facilitate collaboration
• Facilitate importing and browsing large data sets
• Automated reporting
11. Data model
Collection: Name, Description
Post: Title, Body, Author, Post date, Parent post (optional), External identifier
User: Username, Avatar, Creation date, Attributes (rank, sex, etc.)
Taxonomy: Name
Term: Name, Description
All data normalized into Collections, Posts, Users, and Taxonomies (sketched in code below)
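The talk doesn't show code for the model, but as a rough illustration the normalized entities might look like the Python dataclasses below. The field names come from the slide; the types, defaults, and class structure are my assumptions.

```python
# A minimal sketch of the Decoder Ring data model as Python dataclasses.
# Field names follow the slide; types and defaults are assumed.
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional

@dataclass
class Collection:
    name: str
    description: str = ""

@dataclass
class User:
    username: str
    avatar: Optional[str] = None          # URL to an avatar image (assumed)
    creation_date: Optional[datetime] = None
    attributes: dict = field(default_factory=dict)  # rank, sex, etc.

@dataclass
class Term:
    name: str
    description: str = ""

@dataclass
class Taxonomy:
    name: str
    terms: list = field(default_factory=list)  # list of Term

@dataclass
class Post:
    title: str
    body: str
    author: User
    post_date: datetime
    external_identifier: str              # ID of the post on the source site
    parent_post: Optional["Post"] = None  # optional, for threaded replies
```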
16. Getting the content
Collections
Posts
Users
Seems to be by far the most difficult part of doing this work.
17. Again, I’m lazy
• I have a tool that has a normalized, predictable data model.
• I can “scrape” websites or other data sets and put them into the data model.
19. Reduced to as little work as possible
• Given a common file format, data is quick and easy to import into Decoder Ring
• Bad news: Scrapers need to be written for every site
• Good news: They’re very quick to write (average 4 - 8 hours each); see the sketch below
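The talk doesn't include scraper code, but a site-specific scraper in this spirit might look like the sketch below: fetch a thread page, pull posts out of the markup, and write them to a common file format ready for import. The URL, CSS selectors, and CSV layout are all hypothetical.

```python
# Hypothetical site-specific scraper: pulls posts from one forum thread and
# normalizes them into a common format (CSV here; the actual Decoder Ring
# import format isn't specified in the talk). All selectors are made up.
import csv
import requests
from bs4 import BeautifulSoup

FORUM_URL = "http://example-forum.net/thread/123"  # hypothetical
FIELDNAMES = ["external_identifier", "author", "post_date", "body"]

def scrape_thread(url):
    html = requests.get(url).text
    soup = BeautifulSoup(html, "html.parser")
    posts = []
    for node in soup.select("div.post"):  # selector is site-specific
        posts.append({
            "external_identifier": node.get("id", ""),
            "author": node.select_one(".author").get_text(strip=True),
            "post_date": node.select_one(".date").get_text(strip=True),
            "body": node.select_one(".body").get_text(strip=True),
        })
    return posts

def write_common_format(posts, path):
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
        writer.writeheader()
        writer.writerows(posts)

if __name__ == "__main__":
    write_common_format(scrape_thread(FORUM_URL), "thread_123.csv")
```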
24. This is great, but...
• It’s making things faster, but what does it do that’s new?
• Collaboration, networking of researchers
• Immediate reporting provides insight where it may not otherwise be seen
• Still some difficulties:
• How do you effectively communicate how to use / apply a taxonomy?
26. Todo
• Per-collection taxonomy visibility
• Per-collection access control
• Cross-collection reports
• Search-based reports (e.g. taxonomy term activity for all posts with the word "tutorial")
• More accurate and faster search (Solr): e.g. all posts with "violence" near the words "games OR video games OR entertainment" (see the query sketch below)
• More robust hosting infrastructure (more users, collections)
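For the Solr item above, here is a hedged sketch of what such a proximity query could look like using standard Lucene phrase-proximity syntax, where "violence games"~10 matches the two words within ten positions of each other; each OR-alternative needs its own clause. The Solr URL, core name, and body field are assumptions.

```python
# Query a Solr index for "violence" near any of the alternative terms,
# using Lucene phrase-proximity syntax. URL, core, and field are assumed.
import requests

SOLR_SELECT = "http://localhost:8983/solr/decoder_ring/select"

query = (
    'body:("violence games"~10 OR '
    '"violence video games"~10 OR '
    '"violence entertainment"~10)'
)

resp = requests.get(SOLR_SELECT, params={"q": query, "wt": "json"})
for doc in resp.json()["response"]["docs"]:
    print(doc.get("title"))
```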
27. Long-term todo
• DR could "learn" over time about taxonomies and language: e.g. what words commonly appear in phrases tagged "scientific learning"? (see the sketch below)
• Comparisons with external data: e.g. thread activity corresponding to product release announcements (Starcraft II thread)
• Web-based content import: once a parser is written, the ability to queue up import via the DR website
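The "learning" idea in the first bullet could start as simple word-frequency counting over the phrases coded with a given term. A minimal sketch, with made-up example phrases standing in for real coded data:

```python
# Count which words most often co-occur with a taxonomy tag by tallying
# word frequencies across tagged phrases. The phrases are hypothetical.
from collections import Counter
import re

tagged_phrases = [  # phrases coded "scientific learning" (made-up examples)
    "I tested each build against the boss and recorded the results",
    "we compared damage numbers and formed a hypothesis",
]

counts = Counter(
    word
    for phrase in tagged_phrases
    for word in re.findall(r"[a-z']+", phrase.lower())
)
print(counts.most_common(10))
```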
Editor's Notes
Why scraping data is difficult but possible
- Many sites use different terminology and structure for what are essentially similar data types (post vs. discussion vs. thread; user vs. account)
- Unpredictable markup on websites -- often BAD markup
- Picture of malformed HTML
- Creating a generic scraper tool would be sloppy, inaccurate, and error-prone
- Fortunately, writing site-specific scrapers is a pretty straightforward process
- Roughly 4 hours per scraper, and getting to be less as I gain more experience