The Leyline: A Comparative Approach To Designing a Graphical Provenance-Based Search UI

Soroush Ghorashi, Carlos Jensen
Oregon State University
HICSS 2013
What is the problem?
                     Computers are increasingly “black holes” for information

—    Storage abundant and cheap, no incentives to delete or archive

—    Collaboration and sharing are growing

—    Information increasingly flowing across devices

                       More information available, harder to (re)find anything

Manual Folder Navigation [Barreau and Nardi 1995; Teevan et al. 2004; Bergman et al. 2008]

—    Collaborators use conflicting naming schemes

—    Overlapping projects introduce uncertainty

Keyword Search

—    Larger repositories and information reuse lead to long lists of hits for common keywords

—    Multiple copies and drafts of files
Solution?
    What about: “Leveraging provenance to enrich file search”
—  Provenance: The history of a document’s ownership, transformations, as
   well as sources and derivatives
   [Figure: example provenance graph with the nodes "RE: presentation draft" (email), data.html, presentation.ppt, and presentation-v2.ppt, connected by edges labeled "attachment save", "copy/paste", and "save as"]




—  Track provenance events: Make available in search queries, use in results
   presentation

—  Allow for fundamentally different types of queries
—  People remember related documents [Gonçalves, 2004; Blanc-Brude, 2007]
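
One way to picture the provenance idea above is as a small directed graph whose edges are provenance events. The following is only a minimal sketch: the class `ProvenanceGraph`, its method names, and the edge directions (source resource to derived resource) are assumptions made for illustration and are not taken from the Leyline implementation. The resource names and event labels come from the example figure.

```python
from collections import defaultdict


class ProvenanceGraph:
    """Hypothetical directed graph whose edges are provenance events."""

    def __init__(self):
        # resource -> list of (event, derived resource)
        self._out = defaultdict(list)
        # resource -> list of (event, source resource)
        self._in = defaultdict(list)

    def record(self, source, event, target):
        """Store one provenance event leading from source to target."""
        self._out[source].append((event, target))
        self._in[target].append((event, source))

    def sources(self, resource):
        """Resources that the given resource was derived from."""
        return sorted(src for _, src in self._in.get(resource, ()))

    def derivatives(self, resource):
        """Resources that were derived from the given resource."""
        return sorted(dst for _, dst in self._out.get(resource, ()))

    def derived_via(self, event):
        """Resources produced by the given event type."""
        return sorted(
            {dst for edges in self._out.values() for ev, dst in edges if ev == event}
        )


# Edge directions below are an assumption about the figure.
graph = ProvenanceGraph()
graph.record("RE: presentation draft", "attachment save", "presentation.ppt")
graph.record("data.html", "copy/paste", "presentation.ppt")
graph.record("presentation.ppt", "save as", "presentation-v2.ppt")

print(graph.sources("presentation.ppt"))
print(graph.derivatives("presentation.ppt"))
```

A query such as "the presentation I built from an email attachment and a web page" then becomes a lookup over sources and event types rather than over keywords.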
Research Goals
—  Phase 1: Analyze information reuse, information
  flow, and provenance events in real-world settings



—  Phase 2: Investigate the effectiveness of
  provenance cues in desktop search



—  Phase 3: Develop and evaluate provenance-based
  search tools (if appropriate)
Phase 1: Study Real-World Work Practices (2008/2010)
3 month user study at Intel Corporation
   —  Logging subjects’ activities on their computers
       —  Data cleaned for personal and sensitive information
   —  Recorded provenance and information access events

   —  Participants
       —  17 information workers, 43 workdays average
       —  9 observation sessions
       —  Exit interview with test

   —  Findings
       —  126,620 unique resources
           —  7,448 resources per subject
           —  Min: 3,211; Max: 17,570; σ: 3,326

File use per person-day
Resource type     Uses
Web*              89.9
Email             73.7
Word               4.4
Excel              2.5
PowerPoint         2.1
Text               0.4
PDF                0.2
Total            173.2

Provenance events by type (pie chart)
CopyPaste         63%
SaveAs            15%
MoveFile           6%
FileRename         5%
DownloadFile       3%
AttachmentAdd      3%
AttachmentSave     3%
UploadFile         2%

C. Jensen et al., "The life and times of files and information: a study of desktop provenance." In Proceedings of the 28th international
Conference on Human Factors in Computing Systems (Atlanta, GA, April 10 - 15, 2010). CHI '10. ACM, New York, NY, pp. 767-776.
Phase 1 contd.
      Provenance networks are more common than we expected!
      —  521 significant graphs (3+ nodes)
      —  Average 5.8 resources per graph
      —  53.7% of files related to at least one other file in their own network
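
The graph statistics above (networks with 3+ nodes, average resources per graph) can be computed with a connected-components pass. The sketch below is illustrative only: the function names and the sample edges are invented, and treating provenance links as undirected when grouping resources is an assumption, not a detail confirmed by the study.

```python
from collections import defaultdict


def provenance_networks(edges):
    """Group resources into connected networks, ignoring edge direction."""
    neighbors = defaultdict(set)
    for source, target in edges:
        neighbors[source].add(target)
        neighbors[target].add(source)

    seen = set()
    networks = []
    for start in neighbors:
        if start in seen:
            continue
        # Iterative depth-first search collects one network.
        stack, members = [start], set()
        seen.add(start)
        while stack:
            node = stack.pop()
            members.add(node)
            for nxt in neighbors[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        networks.append(members)
    return networks


def significant(networks, min_nodes=3):
    """Keep only networks with at least min_nodes resources."""
    return [n for n in networks if len(n) >= min_nodes]


# Invented sample data: three networks of 3, 2, and 4 resources.
edges = [
    ("a.doc", "b.doc"), ("b.doc", "c.xls"),
    ("d.ppt", "e.ppt"),
    ("f.txt", "g.txt"), ("f.txt", "h.txt"), ("h.txt", "i.txt"),
]
nets = significant(provenance_networks(edges))
average_size = sum(len(n) for n in nets) / len(nets)
print(len(nets), average_size)
```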




Phase 1 contd.
   Half of subjects remembered more about their documents after seeing a provenance graph.

   “It looks like it comes from the IAP tool, and all the green boxes are my Excel spreadsheets that I exported to. The word documents are probably what I copied the Excel data to, probably for email.”

   “I recall uploading those to the SharePoint site!”

   “Oh, I see what’s going on. I tend to open a spreadsheet and sometimes I’ll have more than one open at the same time…”

   “2.4 might have been embedded in a doc, so I had to copy it out from there.”

   “Yeah, that’s what I did, I turned it into Excel… I saved it, and then I changed the name because I wanted to make sure it was distinguished from other files I have with the same name for a different group.”

   “Looks like I copied and pasted from the website into a doc… It’s kind of complicated what I did here. I took 2.2, copied and pasted info into an Excel spreadsheet. And then yeah, there’s number 7, a spreadsheet as well.”




Can We Use Provenance More Directly?
Textual query in most traditional keyword search tools

What about drawing queries?
Phase 2: Provenance in Search? Is it Appropriate?
      Can provenance be used effectively in search?

—  How complex a query do we need to find a file?
—  List of all unique walks in provenance graphs
  —  Find longest repeating strings for each subject
  —  Worst case unique query: Longest repeating string + 1
  —  With/without provenance event type to examine impact



       Example walk: Outlook--AS--Word--CP--PowerPoint--SA--PowerPoint--CP--PowerPoint
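
The "longest repeating string + 1" estimate above can be sketched as follows. This is a hedged illustration rather than the authors' analysis code: the function names, the choice to measure query length in nodes, and the three sample walks are all assumptions. Walks use the `Node--EVENT--Node` notation of the example walk on this slide.

```python
from collections import Counter


def parse_walk(text):
    """Split 'A--ev--B--ev--C' into (nodes, events)."""
    tokens = text.split("--")
    return tokens[0::2], tokens[1::2]


def longest_repeat(walks, use_event_types=True):
    """Largest node count n such that some n-node path occurs at least twice."""
    parsed = [parse_walk(w) for w in walks]
    longest = 0
    n = 1
    while True:
        counts = Counter()
        for nodes, events in parsed:
            for i in range(len(nodes) - n + 1):
                key = tuple(nodes[i:i + n])
                if use_event_types:
                    # Include the events linking the n nodes in the key.
                    key += tuple(events[i:i + n - 1])
                counts[key] += 1
        # If no path of n nodes repeats, no longer path can repeat either.
        if not any(c > 1 for c in counts.values()):
            return longest
        longest = n
        n += 1


def worst_case_query_length(walks, use_event_types=True):
    """One node more than the longest repeated path."""
    return longest_repeat(walks, use_event_types) + 1


# Invented sample walks for illustration.
walks = [
    "Outlook--AS--Word--CP--PowerPoint--SA--PowerPoint",
    "Web--CP--Word--CP--PowerPoint--SA--PowerPoint",
    "Outlook--CP--Word--CP--PowerPoint--SA--PowerPoint",
]
print(worst_case_query_length(walks, use_event_types=True))
print(worst_case_query_length(walks, use_event_types=False))
```

In this toy data the first and third walks differ only in one event type, so ignoring event types makes the worst-case query one node longer.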
Phase 2 contd.
—  Maximum query length for a repository of ~7500 items:
  —  Considering the type of provenance events
      —  3 to 9, median 4
  —  Without considering the type of provenance events
      —  3 to 10, median 4.5

   Provenance events like copy/paste and versioning are too common to add value!

—  Provenance search grows linearly
  —  1 node per 200 links

   Provenance can be used to narrow search space quickly.
Tool Analysis
   Categorizing tools that use provenance-like data to enhance search




—  Provenance Types

—  Provenance Monitoring

—  Provenance Use

—  UI Approach

—  Evaluation
Tool Analysis contd.
Name       Provenance Types         Provenance Monitoring           Provenance Use       UI Approach           Evaluation

Feldspar   File meta-data,          Extracting relations from       Query formulation,   Flow-chart like,      Canned data,
           keyword, static          Google Desktop’s database       Search process       List view model       limited within-
           relations between        using its API                                        (real-time results    subjects user
           resources                                                                     updating)             study

Quill      Meta-data such as        Built-in System Monitor to      Query formulation,   Narrative-based,      Multiple user
           author, storage place,   record meta-data about the      Search process       List of resources’    studies
           date, physical place     user’s documents, email                              thumbnails (real-
           tag (home, work,         attachments, web pages,                              time results
           etc.)                    applications and calendar                            updating)

SIS        File meta-data (such     Microsoft Desktop Search        Query formulation,   Text input with       Longitudinal
           as kind, date, author,   database, fuzzy matching (car   Search process,      selectable filters,   study using real
           email attributes)        and cars are same), fielded     Results              List view of          data on subjects’
                                    search (author is “john doe”)   presentation         results with a        PCs (234 people),
                                                                                         preview and           6 weeks
                                                                                         meta-data

Phlat      File meta-data (such     Microsoft Desktop Search        Query formulation,   Text input with       Longitudinal
           as kind, date, author,   database, Extra meta-data as    Search process,      selectable filters,   study using real
           email attributes).       tags (Labeling system)          Results              List view of          data on subjects’
           Contextual cues such                                     presentation         results with a        PCs (225 people),
           as user defined tags                                                          preview and           8 months
                                                                                         meta-data

YouPivot   Environmental            Integrated system monitor to    Query formulation,   Textual input and     Canned data,
           factors as contextual    record contextual cues and      Search process       selectable filters,   limited within-
           cues, user defined       their occurrences                                    List view of          subjects user
           marks                                                                         results               study
Tool Analysis
                         Feldspar
—  Feldspar – Chau et al. 2008
  —  Desktop search
     —  Uses associations between files and resources
        —  extracted from Google Desktop database
     —  Keyword and meta-data search
  —  Flowchart-like user interface
  —  Real-time results, fast
  —  Evaluated with canned data
     —  Within-subjects study
Tool Analysis
               Stuff I’ve Seen, Phlat
—  Stuff I’ve Seen (SIS) – Dumais et al. 2003, Phlat – Cutrell et al. 2006
   —    Similar to Windows Desktop Search
   —    Keyword and meta-data search
   —    Ranks the results using contextual cues
   —    Textual input
   —    List view of results with snippet and meta-data
   —    Unified labeling (Phlat)
   —    Longitudinal study
Tool Analysis
                           YouPivot
—  YouPivot – Hailpern et al. 2011
   —    Search web browsing history
   —    Internal system monitor
   —    Uses keyword for search and contextual cues to filter the results
   —    Timeline view for user activities
   —    Textual input, list view of results
   —    TimeMarks to filter the results
   —    Evaluated with canned data
         —  Within-subjects study
Phase 3: Design Goals
—  Use dynamic relations
   between files

—  Integration with keyword
   search

—  Graphical UI

—  Allowing all kinds of
   graphical queries

—  Internal system monitor

—  Result exploration
Phase 3: System Requirements
—  Provenance + Keyword search

—  Streamline query composition
   using a drag-drop graphical
   sketchpad

—  Allow for flexible exploration
   and discovery

—  Integration with Windows
   Explorer to allow exploration of
   workflow and information
   provenance
Phase 3 contd.
      Exact pattern matching is NP-complete!
                      (subgraph isomorphism problem)


—  Introducing * links
   —  Partial matching
   —  Easier to solve
   —  Better matches user recall


—  Use G-Ray algorithm [Tong et al. 2007]
   —  Best-effort matching
   —  Fast, scalable, flexible and forgiving
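
To illustrate why * links make queries more forgiving, here is a small path-pattern matcher. It is a plain backtracking sketch and is not the G-Ray algorithm. The slides do not define * links precisely, so the reading used here (a * link matches any chain of one or more provenance events) is an assumption, as are the function names and the sample data.

```python
def reachable(graph, start):
    """All nodes reachable from start via one or more edges."""
    seen, stack = set(), list(graph.get(start, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, ()))
    return seen


def match_path(graph, kinds, pattern):
    """Return all node sequences matching a path pattern.

    pattern is [kind, link, kind, ..., kind], where link is "->"
    (a direct edge) or "*" (assumed: any chain of one or more edges).
    """
    wanted, links = pattern[0::2], pattern[1::2]
    results = []

    def extend(partial):
        step = len(partial)
        if step == len(wanted):
            results.append(tuple(partial))
            return
        last = partial[-1]
        if links[step - 1] == "*":
            candidates = reachable(graph, last)
        else:
            candidates = set(graph.get(last, ()))
        for node in sorted(candidates):
            if kinds[node] == wanted[step]:
                extend(partial + [node])

    for node in sorted(kinds):
        if kinds[node] == wanted[0]:
            extend([node])
    return results


# Invented sample repository: an email attachment revised twice, then reused.
graph = {
    "mail-1": ["draft.doc"],
    "draft.doc": ["draft-v2.doc"],
    "draft-v2.doc": ["talk.ppt"],
    "page.html": ["talk.ppt"],
}
kinds = {
    "mail-1": "email", "draft.doc": "word", "draft-v2.doc": "word",
    "talk.ppt": "ppt", "page.html": "web",
}

exact = match_path(graph, kinds, ["email", "->", "ppt"])
partial = match_path(graph, kinds, ["email", "*", "ppt"])
print(exact, partial)
```

The exact query finds nothing because the user skipped the intermediate Word drafts, while the * query still recovers the presentation. This matches the point that partial matching fits what users actually recall.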
Phase 3: The Leyline
Phase 3: Preliminary Evaluation
                                            Is UI approach reasonable?

       —  User Study
            —  Used file repository modeled after those found at Intel
            —  Participant selection
                —  Questionnaire to examine knowledge of search tools
                —  Graduate students
            —  Interactive tutorial
            —  9 Experiment tasks
             “Find the Word document you created using information copy/pasted from an email, a web page, and an Excel document. Find the emails that have this Word document as an attachment.”
            —    Tasks ordered randomly
            —    Think aloud protocol
            —    4 minutes for each task
            —    Exit interview about their experience




S. Ghorashi, C. Jensen, “Leyline: provenance-based search using a graphical sketchpad”, In Proceedings of the 6th Symposium on Human-
Computer Interaction and Information Retrieval (HCIR'12). ACM, New York, NY, USA, Article 2 , 10 pages.
Phase 3: Preliminary Evaluation contd.
       —  Average completion time: 106 seconds
            —    Simple tasks (72 seconds – 93 seconds)
            —    Hard tasks (126 seconds – 155 seconds)

       —  Query complexity (#nodes & #edges)
            —    Average of 2.8 nodes and 2 edges
            —    System scales well (Completion time vs. Complexity)

       —  Observations
            —    Importance of target document
            —    Working on one resource or relation at a time
            —    Saw marked learning effect

       —  Interviews
            —    Overall likability rating: 4.2 out of 5 (σ = 0.4)
            —    Wanted Leyline in real life
            —    No one complained about effort/time requirement
            —    Areas for improvement
                      —    Query composition history panel
                      —    Customization options
                      —    Support more resource types

Conclusion
—  Provenance events are very common in real-world
  settings, and potentially helpful in search

—  Provenance alone can quickly and effectively identify
  unique files/resources (assuming perfect recall)

—  A graphical sketchpad is a viable UI for query
  composition
  —  Isn’t going to replace keyword search, but valuable addition

—  Users quickly learned how to use our system, and
  wanted the tool
What about the future?
—  Incorporate the feedback and lessons learned into a new
  prototype
—  Expand feature set to include:
  —  Auto-completion and suggestion features to speed up the
      search process
  —  Support a broader set of files and resources
  —  Possibly support other computer platforms
—  Prepare for longitudinal study
  —    How do people adapt and use the Leyline?
  —    How does the Leyline scale in a large database?
  —    Does the Leyline change exploration?
  —    Does the Leyline work in a collaborative environment?
Thank you
—  Thanks to Intel for early funding and subjects!


—  For more information:
  —  Soroush Ghorashi
      —  (ghorashi@eecs.oregonstate.edu)
  —  Carlos Jensen
      —  (cjensen@eecs.oregonstate.edu)

Mais conteúdo relacionado

Destaque

Doorbraak presentatie
Doorbraak presentatieDoorbraak presentatie
Doorbraak presentatienobnob
 
CURRICULUM VITAE
CURRICULUM VITAECURRICULUM VITAE
CURRICULUM VITAEalealbiazul
 
Viñetas políticas mayojunio2014
Viñetas políticas mayojunio2014Viñetas políticas mayojunio2014
Viñetas políticas mayojunio2014Vinetaspoliticas
 
Vacolba dossier corporativo
Vacolba   dossier corporativoVacolba   dossier corporativo
Vacolba dossier corporativoPlanimedia
 
Top 8 features of bizmail that you cant miss
Top 8 features of bizmail that you cant missTop 8 features of bizmail that you cant miss
Top 8 features of bizmail that you cant missNet4 India Ltd.
 
Invitacion%20 sipp%201%262 barcelona%202015
Invitacion%20 sipp%201%262 barcelona%202015Invitacion%20 sipp%201%262 barcelona%202015
Invitacion%20 sipp%201%262 barcelona%202015ali navarro
 
Mi Materia Favorita es Quimica
Mi Materia Favorita es QuimicaMi Materia Favorita es Quimica
Mi Materia Favorita es QuimicaNoelia Encalada
 
14 Really Useful Websites
14 Really Useful Websites14 Really Useful Websites
14 Really Useful WebsitesBrightCarbon
 
Goals on every level - Delivery Leads Melbourne
Goals on every level - Delivery Leads MelbourneGoals on every level - Delivery Leads Melbourne
Goals on every level - Delivery Leads MelbourneTom Sommer
 
Cómo seducir a la nueva audiencia periodismo digital
Cómo seducir a la nueva audiencia periodismo digitalCómo seducir a la nueva audiencia periodismo digital
Cómo seducir a la nueva audiencia periodismo digitalCarlos Osorio Gamarra
 
Exposicion de empresa monopolista y oligopolista
Exposicion de empresa monopolista y oligopolistaExposicion de empresa monopolista y oligopolista
Exposicion de empresa monopolista y oligopolistaDianita León
 

Destaque (18)

Doorbraak presentatie
Doorbraak presentatieDoorbraak presentatie
Doorbraak presentatie
 
Ioana y Nerea. Aracne
Ioana y Nerea. AracneIoana y Nerea. Aracne
Ioana y Nerea. Aracne
 
CURRICULUM VITAE
CURRICULUM VITAECURRICULUM VITAE
CURRICULUM VITAE
 
Viñetas políticas mayojunio2014
Viñetas políticas mayojunio2014Viñetas políticas mayojunio2014
Viñetas políticas mayojunio2014
 
Vacolba dossier corporativo
Vacolba   dossier corporativoVacolba   dossier corporativo
Vacolba dossier corporativo
 
Iv bimestre
Iv bimestreIv bimestre
Iv bimestre
 
Top 8 features of bizmail that you cant miss
Top 8 features of bizmail that you cant missTop 8 features of bizmail that you cant miss
Top 8 features of bizmail that you cant miss
 
Hoja de vida - María de Lourdes López López
Hoja de vida - María de Lourdes López López Hoja de vida - María de Lourdes López López
Hoja de vida - María de Lourdes López López
 
Bondia Lleida 23062011
Bondia Lleida 23062011Bondia Lleida 23062011
Bondia Lleida 23062011
 
Invitacion%20 sipp%201%262 barcelona%202015
Invitacion%20 sipp%201%262 barcelona%202015Invitacion%20 sipp%201%262 barcelona%202015
Invitacion%20 sipp%201%262 barcelona%202015
 
Mi Materia Favorita es Quimica
Mi Materia Favorita es QuimicaMi Materia Favorita es Quimica
Mi Materia Favorita es Quimica
 
14 Really Useful Websites
14 Really Useful Websites14 Really Useful Websites
14 Really Useful Websites
 
Goals on every level - Delivery Leads Melbourne
Goals on every level - Delivery Leads MelbourneGoals on every level - Delivery Leads Melbourne
Goals on every level - Delivery Leads Melbourne
 
Cómo seducir a la nueva audiencia periodismo digital
Cómo seducir a la nueva audiencia periodismo digitalCómo seducir a la nueva audiencia periodismo digital
Cómo seducir a la nueva audiencia periodismo digital
 
Reglamento interno 2014
Reglamento interno 2014Reglamento interno 2014
Reglamento interno 2014
 
KKKKu Klux Klan
KKKKu Klux KlanKKKKu Klux Klan
KKKKu Klux Klan
 
Exposicion de empresa monopolista y oligopolista
Exposicion de empresa monopolista y oligopolistaExposicion de empresa monopolista y oligopolista
Exposicion de empresa monopolista y oligopolista
 
PEDAGOGÍA TRADICIONAL
PEDAGOGÍA TRADICIONALPEDAGOGÍA TRADICIONAL
PEDAGOGÍA TRADICIONAL
 

Semelhante a Leyline: A provenance-based desktop search

Supporting PDF accessibility evaluation: Early results from the FixRep project
 Supporting PDF accessibility evaluation: Early results from the FixRep project Supporting PDF accessibility evaluation: Early results from the FixRep project
Supporting PDF accessibility evaluation: Early results from the FixRep projectUKOLN (dev), University of Bath
 
Jen Ferguson "A tale of two projects"
Jen Ferguson "A tale of two projects"Jen Ferguson "A tale of two projects"
Jen Ferguson "A tale of two projects"The TMC Library
 
Distro Recipes 2013 : Contribution of RDF metadata for traceability among pro...
Distro Recipes 2013 : Contribution of RDF metadata for traceability among pro...Distro Recipes 2013 : Contribution of RDF metadata for traceability among pro...
Distro Recipes 2013 : Contribution of RDF metadata for traceability among pro...Anne Nicolas
 
Presentation distro recipes-2013
Presentation distro recipes-2013Presentation distro recipes-2013
Presentation distro recipes-2013olberger
 
Empowering Transformational Science
Empowering Transformational ScienceEmpowering Transformational Science
Empowering Transformational ScienceChelle Gentemann
 
Flexible Resources In 3 6 And E4
Flexible Resources In 3 6 And E4Flexible Resources In 3 6 And E4
Flexible Resources In 3 6 And E4szbra
 
The current architecture of TYPO3 5.0
The current architecture of TYPO3 5.0The current architecture of TYPO3 5.0
The current architecture of TYPO3 5.0Robert Lemke
 
Pain points for preservation services / workflows in repositories
Pain points for preservation services /  workflows in repositories Pain points for preservation services /  workflows in repositories
Pain points for preservation services / workflows in repositories prwheatley
 
Sharing massive data analysis: from provenance to linked experiment reports
Sharing massive data analysis: from provenance to linked experiment reportsSharing massive data analysis: from provenance to linked experiment reports
Sharing massive data analysis: from provenance to linked experiment reportsGaignard Alban
 
Big Data, Beyond the Data Center
Big Data, Beyond the Data CenterBig Data, Beyond the Data Center
Big Data, Beyond the Data CenterGilles Fedak
 
Research Data Management: What is it and why is the Library & Archives Servic...
Research Data Management: What is it and why is the Library & Archives Servic...Research Data Management: What is it and why is the Library & Archives Servic...
Research Data Management: What is it and why is the Library & Archives Servic...GarethKnight
 
A Survey on Approaches for Frequent Item Set Mining on Apache Hadoop
A Survey on Approaches for Frequent Item Set Mining on Apache HadoopA Survey on Approaches for Frequent Item Set Mining on Apache Hadoop
A Survey on Approaches for Frequent Item Set Mining on Apache HadoopIJTET Journal
 
Tds — big science dec 2021
Tds — big science dec 2021Tds — big science dec 2021
Tds — big science dec 2021Gérard Dupont
 
Make your data great now
Make your data great nowMake your data great now
Make your data great nowDaniel JACOB
 
Everyone's A Mechanic
Everyone's A MechanicEveryone's A Mechanic
Everyone's A MechanicBrad Houston
 
Data management for TA's
Data management for TA'sData management for TA's
Data management for TA'saaroncollie
 
fiwalk With Me: Building Emergent Pre-Ingest Workflows for Digital Archival R...
fiwalk With Me: Building Emergent Pre-Ingest Workflows for Digital Archival R...fiwalk With Me: Building Emergent Pre-Ingest Workflows for Digital Archival R...
fiwalk With Me: Building Emergent Pre-Ingest Workflows for Digital Archival R...Mark Matienzo
 
On demand access to Big Data through Semantic Technologies
 On demand access to Big Data through Semantic Technologies On demand access to Big Data through Semantic Technologies
On demand access to Big Data through Semantic TechnologiesPeter Haase
 

Semelhante a Leyline: A provenance-based desktop search (20)

QQML presentation
QQML presentationQQML presentation
QQML presentation
 
Supporting PDF accessibility evaluation: Early results from the FixRep project
 Supporting PDF accessibility evaluation: Early results from the FixRep project Supporting PDF accessibility evaluation: Early results from the FixRep project
Supporting PDF accessibility evaluation: Early results from the FixRep project
 
Jen Ferguson "A tale of two projects"
Jen Ferguson "A tale of two projects"Jen Ferguson "A tale of two projects"
Jen Ferguson "A tale of two projects"
 
Distro Recipes 2013 : Contribution of RDF metadata for traceability among pro...
Distro Recipes 2013 : Contribution of RDF metadata for traceability among pro...Distro Recipes 2013 : Contribution of RDF metadata for traceability among pro...
Distro Recipes 2013 : Contribution of RDF metadata for traceability among pro...
 
Presentation distro recipes-2013
Presentation distro recipes-2013Presentation distro recipes-2013
Presentation distro recipes-2013
 
Empowering Transformational Science
Empowering Transformational ScienceEmpowering Transformational Science
Empowering Transformational Science
 
Flexible Resources In 3 6 And E4
Flexible Resources In 3 6 And E4Flexible Resources In 3 6 And E4
Flexible Resources In 3 6 And E4
 
The current architecture of TYPO3 5.0

Leyline: A provenance-based desktop search

  • 1. The Leyline: A Comparative Approach To Designing a Graphical Provenance-Based Search UI
      Soroush Ghorashi, Carlos Jensen
      Oregon State University
      HICSS 2013
  • 5. What is the problem?
      Computers are increasingly “black holes” for information
      - Storage is abundant and cheap, with no incentive to delete or archive
      - Collaboration and sharing are growing
      - Information increasingly flows across devices
      More information is available, yet it is harder to (re)find anything
      Manual folder navigation [Barreau and Nardi 1995; Teevan et al. 2004; Bergman et al. 2008]
      - Collaborators use conflicting naming schemes
      - Overlapping projects introduce uncertainty
      Keyword search
      - Larger repositories and information reuse lead to long lists of hits for common keywords
      - Multiple copies and drafts of files
  • 6. Solution?
      What about: “Leveraging provenance to enrich file search”
      - Provenance: the history of a document’s ownership and transformations, as well as its sources and derivatives
      [Figure: example provenance graph connecting “RE: presentation draft”, data.html, presentation.ppt, and presentation-v2.ppt through attachment-save, copy/paste, and save-as events]
      - Track provenance events: make them available in search queries and use them in results presentation
      - Allow for fundamentally different types of queries
      - People remember related documents [Gonçalves, 2004; Blanc-Brude, 2007]
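As a rough illustration of what tracking provenance events could look like, the sketch below stores each event as a directed, labeled edge between two resources and answers one provenance question: which resources did a file ultimately come from? The class and method names (`ProvenanceEvent`, `ProvenanceGraph`, `record`, `ancestors`) are hypothetical, and the edge directions are assumed for illustration; the slide's figure does not confirm them and this is not the Leyline implementation.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ProvenanceEvent:
    source: str  # resource the information came from
    target: str  # resource that received it
    kind: str    # event type, e.g. "AttachmentSave", "CopyPaste", "SaveAs"


class ProvenanceGraph:
    """Hypothetical in-memory store of provenance events."""

    def __init__(self):
        # Maps each target resource to the events that led into it.
        self._incoming = defaultdict(list)

    def record(self, source, target, kind):
        event = ProvenanceEvent(source, target, kind)
        self._incoming[target].append(event)
        return event

    def ancestors(self, resource):
        """Return every resource that `resource` was directly or indirectly derived from."""
        seen, stack = set(), [resource]
        while stack:
            for event in self._incoming.get(stack.pop(), []):
                if event.source not in seen:
                    seen.add(event.source)
                    stack.append(event.source)
        return seen


# Assumed edge directions, using the resource names from the slide's figure.
graph = ProvenanceGraph()
graph.record("RE: presentation draft", "presentation.ppt", "AttachmentSave")
graph.record("data.html", "presentation.ppt", "CopyPaste")
graph.record("presentation.ppt", "presentation-v2.ppt", "SaveAs")

print(sorted(graph.ancestors("presentation-v2.ppt")))
```

A query such as "the presentation I built from an email attachment" then becomes a question about the ancestors of a node rather than about its file name or contents.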
  • 7. Research Goals
      - Phase 1: Analyze information reuse, information flow, and provenance events in real-world settings
      - Phase 2: Investigate the effectiveness of provenance cues in desktop search
      - Phase 3: Develop and evaluate provenance-based search tools (if appropriate)
  • 8. Phase 1: Study Real-World Work Practices (2008/2010)
      3-month user study at Intel Corporation
      - Logging subjects’ activities on their computers
      - Data cleaned for personal and sensitive information
      - Recorded provenance and information access events
      - Participants
        - 17 information workers, 43 workdays average
        - 9 observation sessions
        - Exit interview with test
      - Findings
        - 126,620 unique resources
        - 7,448 resources per subject
        - Min: 3,211; Max: 17,570; σ: 3,326

      File use per person-day
        Web*          89.9
        Email         73.7
        Word           4.4
        Excel          2.5
        PowerPoint     2.1
        Text           0.4
        PDF            0.2
        Total        173.2

      [Chart: share of provenance events by type]
        CopyPaste        63%
        SaveAs           15%
        MoveFile          6%
        FileRename        5%
        DownloadFile      3%
        AttachmentAdd     3%
        AttachmentSave    3%
        UploadFile        2%

      C. Jensen et al., "The life and times of files and information: a study of desktop provenance." In Proceedings of the 28th International Conference on Human Factors in Computing Systems (Atlanta, GA, April 10-15, 2010). CHI '10. ACM, New York, NY, pp. 767-776.
  • 9. Phase 1 contd.
      Provenance networks are more common than we expected!
      - 521 significant graphs (3+ nodes)
      - Average 5.8 resources per graph
      - 53.7% of files related to at least one other file in their own network
      [Jensen et al., CHI '10]
  • 10. Phase 1 contd.
      Half of subjects remembered more about their documents after seeing a provenance graph.
      - “It looks like it comes from the IAP tool, and all the green boxes are my Excel spreadsheets that I exported to. The word documents are probably what I copied the Excel data to, probably for email.”
      - “I recall uploading those to the SharePoint site!”
      - “Oh, I see what’s going on. I tend to open a spreadsheet and sometimes I’ll have more than one open at the same time…”
      - “2.4 might have been embedded in a doc, so I had to copy it out from there.”
      - “Yeah, that’s what I did, I turned it into Excel… I saved it, and then I changed the name because I wanted to make sure it was distinguished from other files I have with the same name for a different group.”
      - “Looks like I copied and pasted from the website into a doc… It’s kind of complicated what I did here. I took 2.2, copied and pasted info into an Excel spreadsheet. And then yeah, there’s number 7, a spreadsheet as well.”
      [Jensen et al., CHI '10]
  • 12. Can We Use Provenance More Directly?
      Textual query in most traditional keyword search tools
      What about drawing queries?
  • 13. Phase 2: Provenance in Search? Is it Appropriate?
      Can provenance be used effectively in search?
      - How complex a query do we need to find a file?
        - List all unique walks in the provenance graphs
        - Find the longest repeating strings for each subject
        - Worst-case unique query: longest repeating string + 1
        - With/without provenance event type, to examine its impact
      Example walk: Outlook--AS--Word--CP--PowerPoint--SA--PowerPoint--CP--PowerPoint
  • 14. Phase 2 contd.
      - Maximum query length for a repository of ~7,500 items:
        - Considering the type of provenance events: 3 to 9, median 4
        - Without considering the type of provenance events: 3 to 10, median 4.5
      Provenance events like copy/paste and versioning are too common to add value!
      - Provenance search grows linearly
        - 1 node per 200 links
      Provenance can be used to narrow the search space quickly.
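The Phase 2 query-length analysis can be sketched as follows. This is one possible reading of the method on the slides, not the authors' code: each walk is written as an alternating list of resource types and event codes (as in the Outlook--AS--Word example), the longest run of nodes that occurs more than once is found, and one node is added to get a worst-case unique query length. The function names and the toy walks are made up for illustration.

```python
from collections import Counter


def node_windows(walk, k, with_events):
    """Yield every run of k consecutive nodes in a walk.

    A walk alternates node types and event codes: [node, event, node, ...].
    With event types, a run keeps the events between its nodes;
    without them, only the node types are compared.
    """
    if with_events:
        seq, step, width = walk, 2, 2 * k - 1
    else:
        seq, step, width = walk[0::2], 1, k
    for start in range(0, len(seq) - width + 1, step):
        yield tuple(seq[start:start + width])


def worst_case_query_length(walks, with_events=True):
    """Longest run (in nodes) that occurs at least twice, plus one."""
    longest = 0
    max_nodes = max((len(w[0::2]) for w in walks), default=0)
    for k in range(1, max_nodes + 1):
        counts = Counter(win for w in walks for win in node_windows(w, k, with_events))
        if not any(c > 1 for c in counts.values()):
            break  # if no run of k nodes repeats, no longer run can repeat
        longest = k
    return longest + 1


# Toy walks (hypothetical data, not from the study).
walks = [
    ["Outlook", "AS", "Word", "CP", "PowerPoint"],
    ["Web", "CP", "Word", "CP", "PowerPoint"],
    ["Outlook", "AS", "Word", "SA", "Word"],
    ["Outlook", "CP", "Word", "CP", "PowerPoint", "SA", "PowerPoint"],
]

print(worst_case_query_length(walks, with_events=True))
print(worst_case_query_length(walks, with_events=False))
```

In this toy data, dropping the event types makes two three-node walks indistinguishable, so the bound grows by one node, which mirrors the direction of the with/without comparison reported on the slide.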
  • 15. Tool Analysis
      Categorizing tools that use provenance-like data to enhance search
      - Provenance Types
      - Provenance Monitoring
      - Provenance Use
      - UI Approach
      - Evaluation
  • 16. Tool Analysis contd.
      [Table columns: Name, Provenance Types, Provenance Monitoring, Provenance Use, UI Approach, Evaluation]

      Feldspar
        Provenance types: File meta-data, keyword, static relations between resources
        Provenance monitoring: Extracting relations from Google Desktop’s database using its API
        Provenance use: Query formulation, search process
        UI approach: Flow-chart like, list view model (real-time results updating)
        Evaluation: Canned data, limited within-subjects user study

      Quill
        Provenance types: Meta-data such as author, storage place, date, physical place tag (home, work, etc.)
        Provenance monitoring: Built-in system monitor to record meta-data about the user’s documents, email attachments, web pages, applications and calendar
        Provenance use: Query formulation, search process
        UI approach: Narrative-based, list of resources’ thumbnails (real-time results updating)
        Evaluation: Multiple user studies

      SIS
        Provenance types: File meta-data (such as kind, date, author, email attributes)
        Provenance monitoring: Microsoft Desktop Search database, fuzzy matching (car and cars are same), fielded search (author is “john doe”)
        Provenance use: Query formulation, search process, results presentation
        UI approach: Text input with selectable filters, list view of results with a preview and meta-data
        Evaluation: Longitudinal study using real data on subjects’ PCs (234 people), 6 weeks

      Phlat
        Provenance types: File meta-data (such as kind, date, author, email attributes); contextual cues such as user-defined tags
        Provenance monitoring: Microsoft Desktop Search database, extra meta-data as tags (labeling system)
        Provenance use: Query formulation, search process, results presentation
        UI approach: Text input with selectable filters, list view of results with a preview and meta-data
        Evaluation: Longitudinal study using real data on subjects’ PCs (225 people), 8 months

      YouPivot
        Provenance types: Environmental factors as contextual cues, user-defined marks
        Provenance monitoring: Integrated system monitor to record contextual cues and their occurrences
        Provenance use: Query formulation, search process
        UI approach: Textual input and selectable filters, list view of results
        Evaluation: Canned data, limited within-subjects user study
  • 17. Tool Analysis: Feldspar
      - Feldspar (Chau et al. 2008)
      - Desktop search
      - Uses associations between files and resources, extracted from the Google Desktop database
      - Keyword and meta-data search
      - Flowchart-like user interface
      - Real-time results, fast
      - Evaluated with canned data
      - Within-subjects study
  • 18. Tool Analysis: Stuff I’ve Seen, Phlat
      - Stuff I’ve Seen (SIS) (Dumais et al. 2003), Phlat (Cutrell et al. 2006)
      - Similar to Windows Desktop Search
      - Keyword and meta-data search
      - Ranks the results using contextual cues
      - Textual input
      - List view of results with snippet and meta-data
      - Unified labeling (Phlat)
      - Longitudinal study
  • 19. Tool Analysis: YouPivot
      - YouPivot (Hailpern et al. 2011)
      - Searches web browsing history
      - Internal system monitor
      - Uses keywords for search and contextual cues to filter the results
      - Timeline view for user activities
      - Textual input, list view of results
      - TimeMarks to filter the results
      - Evaluated with canned data
      - Within-subjects study
  • 20. Phase 3: Design Goals
      - Use dynamic relations between files
      - Integration with keyword search
      - Graphical UI
      - Allowing all kinds of graphical queries
      - Internal system monitor
      - Result exploration
  • 21. Phase 3: System Requirements
      - Provenance + keyword search
      - Streamline query composition using a drag-and-drop graphical sketchpad
      - Allow for flexible exploration and discovery
      - Integration with Windows Explorer to allow exploration of workflow and information provenance
  • 24. Phase 3 contd.
      The exact pattern matching problem is NP-complete! (sub-graph isomorphism problem)
      - Introducing * links
      - Partial matching
        - Easier to solve
        - Better matches user recall
      - Use the G-Ray algorithm [Tong et al. 2007]
        - Best-effort matching
        - Fast, scalable, flexible and forgiving
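To make the idea of forgiving, best-effort matching concrete, here is a deliberately naive sketch. It is not G-Ray and not the Leyline matcher: it simply enumerates directed paths and ranks them by how many query constraints they satisfy, so an imperfect query still returns its closest candidates. It assumes a path-shaped query and treats a `*` link as "any single event type"; both are assumptions made for illustration, as are the function name `match_path` and the sample data.

```python
def match_path(nodes, edges, query_types, query_links):
    """Rank directed paths by the number of satisfied query constraints.

    nodes:       dict mapping resource name -> resource type
    edges:       list of (source, target, event_kind)
    query_types: wanted resource type at each position of the path
    query_links: wanted event kind between positions; "*" accepts any kind
    """
    outgoing = {}
    for src, dst, kind in edges:
        outgoing.setdefault(src, []).append((dst, kind))

    results = []

    def extend(path, score):
        depth = len(path)
        if depth == len(query_types):
            results.append((score, path))
            return
        wanted = query_links[depth - 1]
        for dst, kind in outgoing.get(path[-1], []):
            if dst in path:
                continue  # do not revisit a resource within one path
            link_ok = wanted == "*" or kind == wanted
            type_ok = nodes[dst] == query_types[depth]
            extend(path + (dst,), score + int(link_ok) + int(type_ok))

    for name, kind in nodes.items():
        extend((name,), int(kind == query_types[0]))

    # Best score first; ties broken by path for a stable order.
    results.sort(key=lambda item: (-item[0], item[1]))
    return results


# Hypothetical repository.
nodes = {
    "mail": "Email",
    "data.html": "Web",
    "deck.ppt": "PowerPoint",
    "deck-v2.ppt": "PowerPoint",
    "notes.doc": "Word",
}
edges = [
    ("mail", "deck.ppt", "AttachmentSave"),
    ("data.html", "deck.ppt", "CopyPaste"),
    ("deck.ppt", "deck-v2.ppt", "SaveAs"),
    ("data.html", "notes.doc", "CopyPaste"),
]

# "An email led, somehow, to a presentation that was then saved as another one."
ranked = match_path(nodes, edges, ["Email", "PowerPoint", "PowerPoint"], ["*", "SaveAs"])
for score, path in ranked:
    print(score, " -> ".join(path))
```

Exhaustive enumeration like this grows quickly with graph size, which is why a scalable approximate matcher is attractive for a real repository; the sketch only shows why partial scores tolerate gaps in what the user recalls.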
  • 25. Phase 3: The Leyline
  • 26. Phase 3: Preliminary Evaluation
      Is the UI approach reasonable?
      - User study
        - Used a file repository modeled after those found at Intel
      - Participant selection
        - Questionnaire to examine knowledge of search tools
        - Graduate students
      - Interactive tutorial
      - 9 experiment tasks
        Sample task: “Find the word document you created using information copy/pasted from an email, a web page, and an excel document. Find the emails that have this word document as an attachment.”
      - Tasks ordered randomly
      - Think-aloud protocol
      - 4 minutes for each task
      - Exit interview about their experience
      S. Ghorashi, C. Jensen, “Leyline: provenance-based search using a graphical sketchpad”, In Proceedings of the 6th Symposium on Human-Computer Interaction and Information Retrieval (HCIR'12). ACM, New York, NY, USA, Article 2, 10 pages.
  • 27. Phase 3: Preliminary Evaluation contd.
      - Average completion time: 106 seconds
        - Simple tasks: 72 to 93 seconds
        - Hard tasks: 126 to 155 seconds
      - Query complexity (#nodes & #edges)
        - Average of 2.8 nodes and 2 edges
        - System scales well (completion time vs. complexity)
      - Observations
        - Importance of target document
        - Working on one resource or relation at a time
        - Saw marked learning effect
      - Interviews
        - Overall likability rating: 4.2 out of 5 (σ = 0.4)
        - Wanted Leyline in real life
        - No one complained about effort/time requirement
      - Areas for improvement
        - Query composition history panel
        - Customization options
        - Support more resource types
      [Ghorashi and Jensen, HCIR'12]
  • 28. Conclusion
      - Provenance events are very common in real-world settings, and potentially helpful in search
      - Provenance alone can quickly and effectively identify unique files/resources (assuming perfect recall)
      - A graphical sketchpad is a viable UI for query composition
      - It isn’t going to replace keyword search, but it is a valuable addition
      - Users quickly learned how to use our system, and wanted the tool
  • 29. What about the future?
      - Incorporate the feedback and lessons learned into a new prototype
      - Expand the feature set to include:
        - Auto-completion and suggestion features to speed up the search process
        - Support for a broader set of files and resources
        - Possibly support for other computer platforms
      - Prepare for a longitudinal study
        - How do people adapt and use the Leyline?
        - How does the Leyline scale in a large database?
        - Does the Leyline change exploration?
        - Does the Leyline work in a collaborative environment?
  • 30. Thank you
      - Thanks to Intel for early funding and subjects!
      - For more information:
        - Soroush Ghorashi (ghorashi@eecs.oregonstate.edu)
        - Carlos Jensen (cjensen@eecs.oregonstate.edu)