Secondary data analysis
  with digital trace data

Examples from FLOSS research

         Andrea Wiggins
         13 July 2011
Secondary Data Analysis
•   Uses existing data produced or collected by
    someone else, usually for a different purpose
    •   Databases
    •   Repositories
    •   Surveys
    •   Emails
    •   Social networks
Digital Trace Data
•   Records of activity (trace data) undertaken through
    an online information system (thus digital)
•   Increasingly common in studies of online
    phenomena
    •   Large volumes of available data
    •   Can be complete: a census, not a sample
    •   May be more reliably recorded than other data

Characteristics


1. Found data (not produced for research)
2. Event-based data (not summary data)
3. Events occur over time, so it is longitudinal data




Requirements
•   Understand the original data source
    •   How it was collected, potential problems
    •   Limitations of the sample
    •   What the data describe
•   Match with appropriate analysis methods and measures
    •   New types of data may require new measures
    •   Theoretical coherence is very important
Advantages
•   Data may be “complete”
    •   Usually no response bias (exception: cookies)
    •   May cover long periods of time and large groups
    •   Multiple different data types, but mostly textual
•   Data are often easy to acquire
    •   APIs or scraping web pages (with caution)
    •   Databases, archives, or repositories of research data
•   But remember: you usually get what you pay for!
Disadvantages
•   Often difficult to know limitations of data
    •   Data may be poorly documented
    •   Original creator may not be available for comment
•   Volume of data can be overwhelming
    •   Sampling strategies needed, e.g., temporal, random
    •   Substantial time required for data preparation: 90% of effort
    •   Exceptions are everywhere and will break analyses, but can
        only be discovered through trial and error
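The two sampling strategies named above, temporal and random, can be sketched in a few lines. This is a minimal illustration only: the event layout (a dict with a `timestamp` key), the function names, and the seed are assumptions made for the example, not part of the original talk.

```python
import random
from datetime import datetime

def temporal_sample(events, start, end):
    """Keep events whose timestamp falls in the half-open window [start, end)."""
    return [e for e in events if start <= e["timestamp"] < end]

def random_sample(events, k, seed=0):
    """Draw up to k events uniformly at random, without replacement.

    A fixed seed keeps the sample repeatable across runs.
    """
    rng = random.Random(seed)
    return rng.sample(events, min(k, len(events)))

# Hypothetical event log: 24 events spread over the months of 2005.
events = [
    {"id": i, "timestamp": datetime(2005, 1 + i % 12, 1)}
    for i in range(24)
]

first_quarter = temporal_sample(events, datetime(2005, 1, 1), datetime(2005, 4, 1))
subset = random_sample(events, 5)
```

Temporal sampling preserves the event sequence inside the window, which matters for longitudinal questions; random sampling does not, so it suits questions about the population of events rather than their order.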

Example: Email Networks
•   Data source: email listservs for FLOSS projects
•   Analysis approach: create social networks
    •   Within discussion threads, individuals are nodes, and links
        are reply-to messages
    •   Some conceptual issues for interpretation, choice of
        measures
•   Technical challenges
    •   Temporal aggregation
    •   Identity resolution
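The network construction described above (individuals as nodes, reply-to messages as links) can be sketched as follows, with a very simple form of identity resolution. The message fields (`id`, `sender`, `in_reply_to`), the alias table, and the addresses are hypothetical; real listserv archives need considerably more cleaning than this.

```python
from collections import Counter

def build_reply_network(messages, aliases=None):
    """Return a Counter mapping (replier, original_sender) to number of replies.

    `aliases` maps lowercase alternate addresses to one canonical identity.
    """
    aliases = aliases or {}

    def resolve(address):
        # Identity resolution: normalize case, then collapse known aliases.
        key = address.lower()
        return aliases.get(key, key)

    sender_of = {m["id"]: resolve(m["sender"]) for m in messages}
    edges = Counter()
    for m in messages:
        parent = m.get("in_reply_to")
        if parent in sender_of:  # skip thread starters and missing parents
            src, dst = resolve(m["sender"]), sender_of[parent]
            if src != dst:  # ignore replies to oneself
                edges[(src, dst)] += 1
    return edges

# Hypothetical thread of five messages.
messages = [
    {"id": "m1", "sender": "alice@example.org", "in_reply_to": None},
    {"id": "m2", "sender": "bob@example.org", "in_reply_to": "m1"},
    {"id": "m3", "sender": "Alice@Example.org", "in_reply_to": "m2"},
    {"id": "m4", "sender": "alice@home.example", "in_reply_to": "m2"},
    {"id": "m5", "sender": "bob@example.org", "in_reply_to": "m2"},
]
aliases = {"alice@home.example": "alice@example.org"}
edges = build_reply_network(messages, aliases)
```

Without the alias table, the two addresses used by the same person would appear as two separate nodes, which is one way unresolved identities distort network measures.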
Temporal Aggregation
Figures from Howison et al., 2006
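One simple form of temporal aggregation is to bin reply events into consecutive fixed-width windows and build one network per window. The sketch below assumes events are `(timestamp, source, target)` tuples; the 90-day window is an arbitrary illustrative value, not the one used in the study.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def aggregate_by_window(events, origin, window_days=90):
    """Group (timestamp, source, target) events into fixed-width windows.

    Returns a dict mapping window index (0 = first window after `origin`)
    to the list of (source, target) links observed in that window.
    """
    width = timedelta(days=window_days)
    windows = defaultdict(list)
    for timestamp, source, target in events:
        index = (timestamp - origin) // width  # integer window number
        windows[index].append((source, target))
    return dict(windows)

# Hypothetical reply events.
events = [
    (datetime(2005, 1, 10), "a", "b"),
    (datetime(2005, 2, 20), "b", "a"),
    (datetime(2005, 5, 1), "c", "a"),
]
windows = aggregate_by_window(events, origin=datetime(2005, 1, 1))
```

The window width is an analytic choice: short windows give sparse networks, while long windows can hide change over time.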
Network Workflow
Network Results
•   Different levels of correlation between venues, suggesting
    different types of interactions
•   User venues more decentralized than developer venues,
    reflecting greater number of participants
•   Overall trend toward decentralization could be result of
    different influences

Cleaning up before shutting down
•   Observed anomalous patterns in trackers for both projects:
    periodic centralization spikes
•   A single user makes batch bug closings (up to 279!)
    –   Fire’s (feature request) tracker housekeeping appears
        to be preparation for project closure
    –   Gaim’s tracker housekeeping was more regular and repeated
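To see why batch activity by one user shows up as a centralization spike, it helps to compute a centralization score on toy networks. The slides do not say which measure was used; the sketch below assumes Freeman's degree centralization for an undirected simple graph, which is 1 for a star and 0 when every node has the same degree.

```python
def degree_centralization(edges):
    """Freeman degree centralization of an undirected simple graph.

    `edges` is an iterable of node pairs. Only nodes that appear in at
    least one edge are counted; self-loops and duplicate edges are ignored.
    """
    neighbors = {}
    for a, b in edges:
        if a == b:
            continue
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    n = len(neighbors)
    if n < 3:
        return 0.0  # the measure is undefined for fewer than three nodes
    degrees = [len(v) for v in neighbors.values()]
    top = max(degrees)
    # Normalize by the maximum possible sum, attained by a star graph.
    return sum(top - d for d in degrees) / ((n - 1) * (n - 2))

# One user touching every other participant: a star.
star = [("hub", leaf) for leaf in ("a", "b", "c", "d")]
# Evenly spread interaction: a ring.
ring = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")]

star_score = degree_centralization(star)
ring_score = degree_centralization(ring)
```

A period in which one person closes many bugs resembles the star case, so the score for that period jumps relative to neighbouring periods.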
Example: Classification
•   Replication of success-tragedy classification
    •   Classification criteria originally drawn from
        interviews with community members
    •   Data extracted from repositories
•   Technical challenges
    •   Merging data from two repositories
    •   Processing large volume of data in multiple steps
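Merging data from two repositories usually comes down to joining records on a shared key and deciding which source wins when both supply a value. The sketch below is a hypothetical illustration: it assumes records are dicts keyed by project name and that the first source takes precedence wherever it has a non-missing value. The slides do not describe the actual merge rules.

```python
def merge_project_records(primary, secondary):
    """Merge two {project_name: {field: value}} mappings.

    Fields from `primary` win when present and not None; `secondary`
    fills the gaps. Projects found in only one source are kept.
    """
    merged = {}
    for name in primary.keys() | secondary.keys():
        record = dict(secondary.get(name, {}))
        record.update(
            {k: v for k, v in primary.get(name, {}).items() if v is not None}
        )
        merged[name] = record
    return merged

# Hypothetical records from two sources.
source_a = {
    "alpha": {"downloads": 120, "founded": None},
    "beta": {"downloads": 5},
}
source_b = {
    "alpha": {"downloads": 100, "founded": "2003-04-01"},
    "gamma": {"founded": "2005-01-01"},
}
merged = merge_project_records(source_a, source_b)
```

Matching on project name alone is fragile in practice, since names can differ between repositories; a stable project identifier is preferable when one exists.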
Variables
•   Inputs: project names and 5 threshold values for
    classification tests, e.g. number of downloads
•   Project statistics retrieved from repositories
    •   Founding date
    •   Data collection date
    •   Dates for all releases
    •   Number of downloads
    •   URL
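The variables listed above map naturally onto a small record type plus one function per classification test. The sketch below is an assumption-laden illustration: the field names, the `>=` direction of the test, and the threshold values are hypothetical, and the five actual tests are not given on the slide.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class ProjectStats:
    """Statistics retrieved for one project (field names are illustrative)."""
    name: str
    founded: date
    collected: date
    releases: list = field(default_factory=list)  # dates of all releases
    downloads: int = 0
    url: str = ""

    def age_days(self):
        """Days between founding and data collection."""
        return (self.collected - self.founded).days

    def release_count(self):
        return len(self.releases)

def passes_download_test(project, min_downloads):
    """One hypothetical threshold test: enough downloads at collection time."""
    return project.downloads >= min_downloads

example = ProjectStats(
    name="example",
    founded=date(2004, 1, 1),
    collected=date(2006, 1, 1),
    releases=[date(2004, 6, 1), date(2005, 3, 1)],
    downloads=42,
    url="https://example.org/example",
)
```

Keeping thresholds as function arguments rather than constants makes it easy to rerun the classification with different values when replicating earlier work.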
Classification workflow
Classification Results
Class            Original         Our results      Difference
unclassifiable   3,186            3,296            +110
II               13,342 (12%)     16,252 (14%)     +2,910 (+2%)
IG               10,711 (10%)     12,991 (11%)     +2,280 (+1%)
TI               37,320 (35%)     36,507 (31%)     -813 (-4%)
TG               30,592 (28%)     32,642 (28%)     +2,050 (0%)
SG               15,782 (15%)     16,045 (14%)     +263 (-1%)
other            8,422            0
Total            119,355          117,733
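As a quick consistency check, the counts in the Classification Results table can be recomputed directly. The snippet uses only the counts shown in the table; the variable names are illustrative.

```python
# (original count, replication count) per class, copied from the table.
rows = {
    "unclassifiable": (3186, 3296),
    "II": (13342, 16252),
    "IG": (10711, 12991),
    "TI": (37320, 36507),
    "TG": (30592, 32642),
    "SG": (15782, 16045),
    "other": (8422, 0),
}

# Difference column: replication minus original.
differences = {label: ours - original for label, (original, ours) in rows.items()}

# Column totals, to compare against the Total row.
original_total = sum(original for original, _ in rows.values())
our_total = sum(ours for _, ours in rows.values())
```

The recomputed differences and both column totals agree with the table. The "other" class accounts for 8,422 projects in the original and none in the replication.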

Thanks!



•   Questions?





Mais conteúdo relacionado

Mais procurados

Software Ecosystems = Big Data
Software Ecosystems = Big DataSoftware Ecosystems = Big Data
Software Ecosystems = Big DataTom Mens
 
HotSoS16 Tutorial "Text Analytics for Security" by Tao Xie and William Enck
HotSoS16 Tutorial "Text Analytics for Security" by Tao Xie and William EnckHotSoS16 Tutorial "Text Analytics for Security" by Tao Xie and William Enck
HotSoS16 Tutorial "Text Analytics for Security" by Tao Xie and William EnckTao Xie
 
Software Analytics: Towards Software Mining that Matters
Software Analytics: Towards Software Mining that MattersSoftware Analytics: Towards Software Mining that Matters
Software Analytics: Towards Software Mining that MattersTao Xie
 
Being Reproducible: SSBSS Summer School 2017
Being Reproducible: SSBSS Summer School 2017Being Reproducible: SSBSS Summer School 2017
Being Reproducible: SSBSS Summer School 2017Carole Goble
 
Micropublication WormBase Workshop International Worm Meeting 2015
Micropublication WormBase Workshop International Worm Meeting 2015Micropublication WormBase Workshop International Worm Meeting 2015
Micropublication WormBase Workshop International Worm Meeting 2015raymond91105
 
Scientific Software - what happens after the grant?
Scientific Software - what happens after the grant?Scientific Software - what happens after the grant?
Scientific Software - what happens after the grant?James Howison
 
Modern tools for sharing and synthesizing neuroimaging results
Modern tools for sharing and synthesizing neuroimaging resultsModern tools for sharing and synthesizing neuroimaging results
Modern tools for sharing and synthesizing neuroimaging resultsKrzysztof Gorgolewski
 
User Expectations in Mobile App Security
User Expectations in Mobile App SecurityUser Expectations in Mobile App Security
User Expectations in Mobile App SecurityTao Xie
 
Software Mining and Software Datasets
Software Mining and Software DatasetsSoftware Mining and Software Datasets
Software Mining and Software DatasetsTao Xie
 
Large Scale Studies: Malware Needles in a Haystack
Large Scale Studies: Malware Needles in a HaystackLarge Scale Studies: Malware Needles in a Haystack
Large Scale Studies: Malware Needles in a HaystackMarcus Botacin
 
Intro to Reproducible Research
Intro to Reproducible ResearchIntro to Reproducible Research
Intro to Reproducible ResearchC. Tobin Magle
 
Getting (and giving) credit for all that we do
Getting (and giving) credit for all that we doGetting (and giving) credit for all that we do
Getting (and giving) credit for all that we domhaendel
 
Being FAIR: FAIR data and model management SSBSS 2017 Summer School
Being FAIR:  FAIR data and model management SSBSS 2017 Summer SchoolBeing FAIR:  FAIR data and model management SSBSS 2017 Summer School
Being FAIR: FAIR data and model management SSBSS 2017 Summer SchoolCarole Goble
 
Automating the process of continuously prioritising data, updating and deploy...
Automating the process of continuously prioritising data, updating and deploy...Automating the process of continuously prioritising data, updating and deploy...
Automating the process of continuously prioritising data, updating and deploy...Ola Spjuth
 
Avoiding the tower of babel - The Role of Data Description Standards in Biome...
Avoiding the tower of babel - The Role of Data Description Standards in Biome...Avoiding the tower of babel - The Role of Data Description Standards in Biome...
Avoiding the tower of babel - The Role of Data Description Standards in Biome...Krzysztof Gorgolewski
 
Lab Notebooks as Data Management (SLA Winter Virtual Conference 2012)
Lab Notebooks as Data Management (SLA Winter Virtual Conference 2012)Lab Notebooks as Data Management (SLA Winter Virtual Conference 2012)
Lab Notebooks as Data Management (SLA Winter Virtual Conference 2012)Kristin Briney
 
Research Data (and Software) Management at Imperial: (Everything you need to ...
Research Data (and Software) Management at Imperial: (Everything you need to ...Research Data (and Software) Management at Imperial: (Everything you need to ...
Research Data (and Software) Management at Imperial: (Everything you need to ...Sarah Anna Stewart
 
FAIRDOM - FAIR Asset management and sharing experiences in Systems and Synthe...
FAIRDOM - FAIR Asset management and sharing experiences in Systems and Synthe...FAIRDOM - FAIR Asset management and sharing experiences in Systems and Synthe...
FAIRDOM - FAIR Asset management and sharing experiences in Systems and Synthe...Carole Goble
 

Mais procurados (20)

Software Ecosystems = Big Data
Software Ecosystems = Big DataSoftware Ecosystems = Big Data
Software Ecosystems = Big Data
 
HotSoS16 Tutorial "Text Analytics for Security" by Tao Xie and William Enck
HotSoS16 Tutorial "Text Analytics for Security" by Tao Xie and William EnckHotSoS16 Tutorial "Text Analytics for Security" by Tao Xie and William Enck
HotSoS16 Tutorial "Text Analytics for Security" by Tao Xie and William Enck
 
Software Analytics: Towards Software Mining that Matters
Software Analytics: Towards Software Mining that MattersSoftware Analytics: Towards Software Mining that Matters
Software Analytics: Towards Software Mining that Matters
 
20171003 lancaster data conversations Chue-Hong
20171003 lancaster data conversations Chue-Hong20171003 lancaster data conversations Chue-Hong
20171003 lancaster data conversations Chue-Hong
 
Being Reproducible: SSBSS Summer School 2017
Being Reproducible: SSBSS Summer School 2017Being Reproducible: SSBSS Summer School 2017
Being Reproducible: SSBSS Summer School 2017
 
Micropublication WormBase Workshop International Worm Meeting 2015
Micropublication WormBase Workshop International Worm Meeting 2015Micropublication WormBase Workshop International Worm Meeting 2015
Micropublication WormBase Workshop International Worm Meeting 2015
 
Scientific Software - what happens after the grant?
Scientific Software - what happens after the grant?Scientific Software - what happens after the grant?
Scientific Software - what happens after the grant?
 
Modern tools for sharing and synthesizing neuroimaging results
Modern tools for sharing and synthesizing neuroimaging resultsModern tools for sharing and synthesizing neuroimaging results
Modern tools for sharing and synthesizing neuroimaging results
 
User Expectations in Mobile App Security
User Expectations in Mobile App SecurityUser Expectations in Mobile App Security
User Expectations in Mobile App Security
 
Software Mining and Software Datasets
Software Mining and Software DatasetsSoftware Mining and Software Datasets
Software Mining and Software Datasets
 
Large Scale Studies: Malware Needles in a Haystack
Large Scale Studies: Malware Needles in a HaystackLarge Scale Studies: Malware Needles in a Haystack
Large Scale Studies: Malware Needles in a Haystack
 
Intro to Reproducible Research
Intro to Reproducible ResearchIntro to Reproducible Research
Intro to Reproducible Research
 
Getting (and giving) credit for all that we do
Getting (and giving) credit for all that we doGetting (and giving) credit for all that we do
Getting (and giving) credit for all that we do
 
Being FAIR: FAIR data and model management SSBSS 2017 Summer School
Being FAIR:  FAIR data and model management SSBSS 2017 Summer SchoolBeing FAIR:  FAIR data and model management SSBSS 2017 Summer School
Being FAIR: FAIR data and model management SSBSS 2017 Summer School
 
Automating the process of continuously prioritising data, updating and deploy...
Automating the process of continuously prioritising data, updating and deploy...Automating the process of continuously prioritising data, updating and deploy...
Automating the process of continuously prioritising data, updating and deploy...
 
Avoiding the tower of babel - The Role of Data Description Standards in Biome...
Avoiding the tower of babel - The Role of Data Description Standards in Biome...Avoiding the tower of babel - The Role of Data Description Standards in Biome...
Avoiding the tower of babel - The Role of Data Description Standards in Biome...
 
ROHub
ROHubROHub
ROHub
 
Lab Notebooks as Data Management (SLA Winter Virtual Conference 2012)
Lab Notebooks as Data Management (SLA Winter Virtual Conference 2012)Lab Notebooks as Data Management (SLA Winter Virtual Conference 2012)
Lab Notebooks as Data Management (SLA Winter Virtual Conference 2012)
 
Research Data (and Software) Management at Imperial: (Everything you need to ...
Research Data (and Software) Management at Imperial: (Everything you need to ...Research Data (and Software) Management at Imperial: (Everything you need to ...
Research Data (and Software) Management at Imperial: (Everything you need to ...
 
FAIRDOM - FAIR Asset management and sharing experiences in Systems and Synthe...
FAIRDOM - FAIR Asset management and sharing experiences in Systems and Synthe...FAIRDOM - FAIR Asset management and sharing experiences in Systems and Synthe...
FAIRDOM - FAIR Asset management and sharing experiences in Systems and Synthe...
 

Destaque

With Great Data Comes Great Responsibility
With Great Data Comes Great ResponsibilityWith Great Data Comes Great Responsibility
With Great Data Comes Great ResponsibilityAndrea Wiggins
 
National Park System Property Designations
National Park System Property DesignationsNational Park System Property Designations
National Park System Property DesignationsAndrea Wiggins
 
secondary data analysis for MS advance research one Lecture eight
secondary data analysis for MS advance research one Lecture eightsecondary data analysis for MS advance research one Lecture eight
secondary data analysis for MS advance research one Lecture eightUniversity of Balochistan
 
Content Analysis vs secondary analysis
Content Analysis vs secondary analysisContent Analysis vs secondary analysis
Content Analysis vs secondary analysisDr. Cupid Lucid
 
Secondary data collection.mjm
Secondary data collection.mjmSecondary data collection.mjm
Secondary data collection.mjmmanjunath
 
Quantitative Methods II (#SOC2031). Seminar #11: Secondary analysis. Big data...
Quantitative Methods II (#SOC2031). Seminar #11: Secondary analysis. Big data...Quantitative Methods II (#SOC2031). Seminar #11: Secondary analysis. Big data...
Quantitative Methods II (#SOC2031). Seminar #11: Secondary analysis. Big data...David Rozas
 
Ch11 Agency Records, Content Analysis, and Secondary Data
Ch11 Agency Records, Content Analysis, and Secondary DataCh11 Agency Records, Content Analysis, and Secondary Data
Ch11 Agency Records, Content Analysis, and Secondary Datayxl007
 
Secondary Data Analysis
Secondary Data AnalysisSecondary Data Analysis
Secondary Data AnalysisKeith Lyons
 
Harvard Housing.Marketing Research.Case Study
Harvard Housing.Marketing Research.Case StudyHarvard Housing.Marketing Research.Case Study
Harvard Housing.Marketing Research.Case StudySkalla Marketing
 
Business Research Methods. problem definition literature review and qualitati...
Business Research Methods. problem definition literature review and qualitati...Business Research Methods. problem definition literature review and qualitati...
Business Research Methods. problem definition literature review and qualitati...Ahsan Khan Eco (Superior College)
 
Primary & secondary data
Primary & secondary dataPrimary & secondary data
Primary & secondary datahezel3210
 

Destaque (13)

Birds
BirdsBirds
Birds
 
With Great Data Comes Great Responsibility
With Great Data Comes Great ResponsibilityWith Great Data Comes Great Responsibility
With Great Data Comes Great Responsibility
 
Moselle
MoselleMoselle
Moselle
 
National Park System Property Designations
National Park System Property DesignationsNational Park System Property Designations
National Park System Property Designations
 
secondary data analysis for MS advance research one Lecture eight
secondary data analysis for MS advance research one Lecture eightsecondary data analysis for MS advance research one Lecture eight
secondary data analysis for MS advance research one Lecture eight
 
Content Analysis vs secondary analysis
Content Analysis vs secondary analysisContent Analysis vs secondary analysis
Content Analysis vs secondary analysis
 
Secondary data collection.mjm
Secondary data collection.mjmSecondary data collection.mjm
Secondary data collection.mjm
 
Quantitative Methods II (#SOC2031). Seminar #11: Secondary analysis. Big data...
Quantitative Methods II (#SOC2031). Seminar #11: Secondary analysis. Big data...Quantitative Methods II (#SOC2031). Seminar #11: Secondary analysis. Big data...
Quantitative Methods II (#SOC2031). Seminar #11: Secondary analysis. Big data...
 
Ch11 Agency Records, Content Analysis, and Secondary Data
Ch11 Agency Records, Content Analysis, and Secondary DataCh11 Agency Records, Content Analysis, and Secondary Data
Ch11 Agency Records, Content Analysis, and Secondary Data
 
Secondary Data Analysis
Secondary Data AnalysisSecondary Data Analysis
Secondary Data Analysis
 
Harvard Housing.Marketing Research.Case Study
Harvard Housing.Marketing Research.Case StudyHarvard Housing.Marketing Research.Case Study
Harvard Housing.Marketing Research.Case Study
 
Business Research Methods. problem definition literature review and qualitati...
Business Research Methods. problem definition literature review and qualitati...Business Research Methods. problem definition literature review and qualitati...
Business Research Methods. problem definition literature review and qualitati...
 
Primary & secondary data
Primary & secondary dataPrimary & secondary data
Primary & secondary data
 

Semelhante a Secondary data analysis with digital trace data

Neo4j Theory and Practice - Tareq Abedrabbo @ GraphConnect London 2013
Neo4j Theory and Practice - Tareq Abedrabbo @ GraphConnect London 2013Neo4j Theory and Practice - Tareq Abedrabbo @ GraphConnect London 2013
Neo4j Theory and Practice - Tareq Abedrabbo @ GraphConnect London 2013Neo4j
 
Data Description Registry Interoperability WG at Research Data Alliance Third...
Data Description Registry Interoperability WG at Research Data Alliance Third...Data Description Registry Interoperability WG at Research Data Alliance Third...
Data Description Registry Interoperability WG at Research Data Alliance Third...amiraryani
 
DATA SCIENCE AND BIG DATA ANALYTICSCHAPTER 2 DATA ANA.docx
DATA SCIENCE AND BIG DATA ANALYTICSCHAPTER 2 DATA ANA.docxDATA SCIENCE AND BIG DATA ANALYTICSCHAPTER 2 DATA ANA.docx
DATA SCIENCE AND BIG DATA ANALYTICSCHAPTER 2 DATA ANA.docxrandyburney60861
 
Web Scale Discovery Reality Check
Web Scale Discovery Reality CheckWeb Scale Discovery Reality Check
Web Scale Discovery Reality CheckJeff Wisniewski
 
Incentivising the uptake of reusable metadata in the survey production process
Incentivising the uptake of reusable metadata in the survey production processIncentivising the uptake of reusable metadata in the survey production process
Incentivising the uptake of reusable metadata in the survey production processLouise Corti
 
Industrial Data Science
Industrial Data ScienceIndustrial Data Science
Industrial Data ScienceNiko Vuokko
 
Electronic Lab Notebooks
Electronic Lab NotebooksElectronic Lab Notebooks
Electronic Lab NotebooksKristin Briney
 
Graham Pryor
Graham PryorGraham Pryor
Graham PryorEduserv
 
Data Lake Acceleration vs. Data Virtualization - What’s the difference?
Data Lake Acceleration vs. Data Virtualization - What’s the difference?Data Lake Acceleration vs. Data Virtualization - What’s the difference?
Data Lake Acceleration vs. Data Virtualization - What’s the difference?Denodo
 
Large Scale Search, Discovery and Analytics in Action
Large Scale Search, Discovery and Analytics in ActionLarge Scale Search, Discovery and Analytics in Action
Large Scale Search, Discovery and Analytics in ActionGrant Ingersoll
 
Hydra Project Management Survey
Hydra Project Management SurveyHydra Project Management Survey
Hydra Project Management SurveyMark Notess
 
Towards an Agile approach to building application profiles
Towards an Agile approach to building application profilesTowards an Agile approach to building application profiles
Towards an Agile approach to building application profilesPaul Walk
 
2016 Ocean Sciences Meeting tutorial
2016 Ocean Sciences Meeting tutorial2016 Ocean Sciences Meeting tutorial
2016 Ocean Sciences Meeting tutorialJosh Young
 
Early Lessons from Building Sensor.Network: An Open Data Exchange for the Web...
Early Lessons from Building Sensor.Network: An Open Data Exchange for the Web...Early Lessons from Building Sensor.Network: An Open Data Exchange for the Web...
Early Lessons from Building Sensor.Network: An Open Data Exchange for the Web...benaam
 
FAIRDOM data management support for ERACoBioTech Proposals
FAIRDOM data management support for ERACoBioTech ProposalsFAIRDOM data management support for ERACoBioTech Proposals
FAIRDOM data management support for ERACoBioTech ProposalsFAIRDOM
 
Introduction to Digital Preservation
Introduction to Digital PreservationIntroduction to Digital Preservation
Introduction to Digital PreservationBill LeFurgy
 
Data Virtualization Reference Architectures: Correctly Architecting your Solu...
Data Virtualization Reference Architectures: Correctly Architecting your Solu...Data Virtualization Reference Architectures: Correctly Architecting your Solu...
Data Virtualization Reference Architectures: Correctly Architecting your Solu...Denodo
 
Tutorial Data Management and workflows
Tutorial Data Management and workflowsTutorial Data Management and workflows
Tutorial Data Management and workflowsSSSW
 

Semelhante a Secondary data analysis with digital trace data (20)

Neo4j Theory and Practice - Tareq Abedrabbo @ GraphConnect London 2013
Neo4j Theory and Practice - Tareq Abedrabbo @ GraphConnect London 2013Neo4j Theory and Practice - Tareq Abedrabbo @ GraphConnect London 2013
Neo4j Theory and Practice - Tareq Abedrabbo @ GraphConnect London 2013
 
Data Description Registry Interoperability WG at Research Data Alliance Third...
Data Description Registry Interoperability WG at Research Data Alliance Third...Data Description Registry Interoperability WG at Research Data Alliance Third...
Data Description Registry Interoperability WG at Research Data Alliance Third...
 
DATA SCIENCE AND BIG DATA ANALYTICSCHAPTER 2 DATA ANA.docx
DATA SCIENCE AND BIG DATA ANALYTICSCHAPTER 2 DATA ANA.docxDATA SCIENCE AND BIG DATA ANALYTICSCHAPTER 2 DATA ANA.docx
DATA SCIENCE AND BIG DATA ANALYTICSCHAPTER 2 DATA ANA.docx
 
Web Scale Discovery Reality Check
Web Scale Discovery Reality CheckWeb Scale Discovery Reality Check
Web Scale Discovery Reality Check
 
NISO Virtual Conference Scientific Data Management: Caring for Your Instituti...
NISO Virtual Conference Scientific Data Management: Caring for Your Instituti...NISO Virtual Conference Scientific Data Management: Caring for Your Instituti...
NISO Virtual Conference Scientific Data Management: Caring for Your Instituti...
 
Incentivising the uptake of reusable metadata in the survey production process
Incentivising the uptake of reusable metadata in the survey production processIncentivising the uptake of reusable metadata in the survey production process
Incentivising the uptake of reusable metadata in the survey production process
 
Industrial Data Science
Industrial Data ScienceIndustrial Data Science
Industrial Data Science
 
Data cycle health
Data cycle healthData cycle health
Data cycle health
 
Electronic Lab Notebooks
Electronic Lab NotebooksElectronic Lab Notebooks
Electronic Lab Notebooks
 
Graham Pryor
Graham PryorGraham Pryor
Graham Pryor
 
Data Lake Acceleration vs. Data Virtualization - What’s the difference?
Data Lake Acceleration vs. Data Virtualization - What’s the difference?Data Lake Acceleration vs. Data Virtualization - What’s the difference?
Data Lake Acceleration vs. Data Virtualization - What’s the difference?
 
Large Scale Search, Discovery and Analytics in Action
Large Scale Search, Discovery and Analytics in ActionLarge Scale Search, Discovery and Analytics in Action
Large Scale Search, Discovery and Analytics in Action
 
Hydra Project Management Survey
Hydra Project Management SurveyHydra Project Management Survey
Hydra Project Management Survey
 
Towards an Agile approach to building application profiles
Towards an Agile approach to building application profilesTowards an Agile approach to building application profiles
Towards an Agile approach to building application profiles
 
2016 Ocean Sciences Meeting tutorial
2016 Ocean Sciences Meeting tutorial2016 Ocean Sciences Meeting tutorial
2016 Ocean Sciences Meeting tutorial
 
Early Lessons from Building Sensor.Network: An Open Data Exchange for the Web...
Early Lessons from Building Sensor.Network: An Open Data Exchange for the Web...Early Lessons from Building Sensor.Network: An Open Data Exchange for the Web...
Early Lessons from Building Sensor.Network: An Open Data Exchange for the Web...
 
FAIRDOM data management support for ERACoBioTech Proposals
FAIRDOM data management support for ERACoBioTech ProposalsFAIRDOM data management support for ERACoBioTech Proposals
FAIRDOM data management support for ERACoBioTech Proposals
 
Introduction to Digital Preservation
Introduction to Digital PreservationIntroduction to Digital Preservation
Introduction to Digital Preservation
 
Data Virtualization Reference Architectures: Correctly Architecting your Solu...
Data Virtualization Reference Architectures: Correctly Architecting your Solu...Data Virtualization Reference Architectures: Correctly Architecting your Solu...
Data Virtualization Reference Architectures: Correctly Architecting your Solu...
 
Tutorial Data Management and workflows
Tutorial Data Management and workflowsTutorial Data Management and workflows
Tutorial Data Management and workflows
 

Mais de Andrea Wiggins

Online Communities in Citizen Science & BirdCams
Online Communities in Citizen Science & BirdCamsOnline Communities in Citizen Science & BirdCams
Online Communities in Citizen Science & BirdCamsAndrea Wiggins
 
Free as in Puppies: Compensating for ICT Constraints in Citizen Science
Free as in Puppies: Compensating for ICT Constraints in Citizen ScienceFree as in Puppies: Compensating for ICT Constraints in Citizen Science
Free as in Puppies: Compensating for ICT Constraints in Citizen ScienceAndrea Wiggins
 
Crowdsourcing Citizen Science Data Quality with a Human-Computer Learning Net...
Crowdsourcing Citizen Science Data Quality with a Human-Computer Learning Net...Crowdsourcing Citizen Science Data Quality with a Human-Computer Learning Net...
Crowdsourcing Citizen Science Data Quality with a Human-Computer Learning Net...Andrea Wiggins
 
Online Communities in Citizen Science
Online Communities in Citizen ScienceOnline Communities in Citizen Science
Online Communities in Citizen ScienceAndrea Wiggins
 
Citizen Science Phenotypes
Citizen Science PhenotypesCitizen Science Phenotypes
Citizen Science PhenotypesAndrea Wiggins
 
The Evolving Landscape of Citizen Science
The Evolving Landscape of Citizen ScienceThe Evolving Landscape of Citizen Science
The Evolving Landscape of Citizen ScienceAndrea Wiggins
 
Citizen Science 101: What Every Researcher Should Know About Crowdsourcing Sc...
Citizen Science 101: What Every Researcher Should Know About Crowdsourcing Sc...Citizen Science 101: What Every Researcher Should Know About Crowdsourcing Sc...
Citizen Science 101: What Every Researcher Should Know About Crowdsourcing Sc...Andrea Wiggins
 
Data Management for Citizen Science
Data Management for Citizen ScienceData Management for Citizen Science
Data Management for Citizen ScienceAndrea Wiggins
 
Crowdsourcing Scientific Work: A Comparative Study of Technologies, Processes...
Crowdsourcing Scientific Work: A Comparative Study of Technologies, Processes...Crowdsourcing Scientific Work: A Comparative Study of Technologies, Processes...
Crowdsourcing Scientific Work: A Comparative Study of Technologies, Processes...Andrea Wiggins
 
Mechanisms for Data Quality and Validation in Citizen Science
Mechanisms for Data Quality and Validation in Citizen ScienceMechanisms for Data Quality and Validation in Citizen Science
Mechanisms for Data Quality and Validation in Citizen ScienceAndrea Wiggins
 
Open Source & Citizen Science
Open Source & Citizen ScienceOpen Source & Citizen Science
Open Source & Citizen ScienceAndrea Wiggins
 
From Conservation to Crowdsourcing: A Typology of Citizen Science
From Conservation to Crowdsourcing: A Typology of Citizen ScienceFrom Conservation to Crowdsourcing: A Typology of Citizen Science
From Conservation to Crowdsourcing: A Typology of Citizen ScienceAndrea Wiggins
 
Motivation by Design: Technologies, Experiences, and Incentives
Motivation by Design: Technologies, Experiences, and IncentivesMotivation by Design: Technologies, Experiences, and Incentives
Motivation by Design: Technologies, Experiences, and IncentivesAndrea Wiggins
 
Data Intensive Collaboration in Science and Engineering: CSCW workshop themes
Data Intensive Collaboration in Science and Engineering: CSCW workshop themesData Intensive Collaboration in Science and Engineering: CSCW workshop themes
Data Intensive Collaboration in Science and Engineering: CSCW workshop themesAndrea Wiggins
 
Open Source, Open Science, & Citizen Science
Open Source, Open Science, & Citizen ScienceOpen Source, Open Science, & Citizen Science
Open Source, Open Science, & Citizen ScienceAndrea Wiggins
 
Reclassifying Success and Tragedy in FLOSS Projects
Reclassifying Success and Tragedy in FLOSS ProjectsReclassifying Success and Tragedy in FLOSS Projects
Reclassifying Success and Tragedy in FLOSS ProjectsAndrea Wiggins
 
Intellectual Diversity in the iSchools: Past, Present and Future
Intellectual Diversity in the iSchools: Past, Present and FutureIntellectual Diversity in the iSchools: Past, Present and Future
Intellectual Diversity in the iSchools: Past, Present and FutureAndrea Wiggins
 
Distributed Scientific Collaboration: Research Opportunities in Citizen Science
Distributed Scientific Collaboration: Research Opportunities in Citizen ScienceDistributed Scientific Collaboration: Research Opportunities in Citizen Science
Distributed Scientific Collaboration: Research Opportunities in Citizen ScienceAndrea Wiggins
 
Designing Virtual Organizations for Citizen Science
Designing Virtual Organizations for Citizen ScienceDesigning Virtual Organizations for Citizen Science
Designing Virtual Organizations for Citizen ScienceAndrea Wiggins
 

Secondary data analysis with digital trace data

  • 1. Secondary data analysis with digital trace data: Examples from FLOSS research. Andrea Wiggins, 13 July 2011
  • 2. Secondary Data Analysis
      • Uses existing data produced or collected by someone else, usually for a different purpose
          • Databases
          • Repositories
          • Surveys
          • Emails
          • Social networks
  • 3. Digital Trace Data
      • Records of activity (trace data) undertaken through an online information system (thus digital)
      • Increasingly common in studies of online phenomena
          • Large volumes of available data
          • Can be complete: a census, not a sample
          • May be more reliably recorded than other data
  • 4. Characteristics
      1. Found data (not produced for research)
      2. Event-based data (not summary data)
      3. Events occur over time, so it is longitudinal data
  • 5. Requirements
      • Understand the original data source
          • How it was collected, potential problems
          • Limitations of the sample
          • What the data describe
      • Match with appropriate analysis methods and measures
          • New types of data may require new measures
          • Theoretical coherence is very important
  • 6. Advantages
      • Data may be “complete”
          • Usually no response bias (exception: cookies)
          • May cover long periods of time and large groups
          • Multiple different data types, but mostly textual
      • Data are often easy to acquire
          • APIs or scraping web pages (with caution)
          • Databases, archives, or repositories of research data
      • But remember: you usually get what you pay for!
  • 7. Disadvantages
      • Often difficult to know limitations of data
          • Data may be poorly documented
          • Original creator may not be available for comment
      • Volume of data can be overwhelming
          • Sampling strategies needed, e.g., temporal, random
      • Substantial time required for data preparation: 90% of effort
      • Exceptions are everywhere and will break analyses, but can only be discovered through trial and error
  • 8. Example: Email Networks
      • Data source: email listservs for FLOSS projects
      • Analysis approach: create social networks
          • Within discussion threads, individuals are nodes, and links are reply-to messages
          • Some conceptual issues for interpretation, choice of measures
      • Technical challenges
          • Temporal aggregation
          • Identity resolution
  • 9. Temporal Aggregation (figures from Howison et al., 2006)
  • 11. Network Results
      • Different levels of correlation between venues, suggesting different types of interactions
      • User venues more decentralized than developer venues, reflecting greater number of participants
      • Overall trend toward decentralization could be result of different influences
      • Observed anomalous patterns in trackers for both projects: periodic centralization spikes
      • Cleaning up before shutting down
          • A single user makes batch bug closings (up to 279!)
              – Fire’s (feature request) tracker housekeeping appears to be preparation for project closure
              – Gaim’s tracker housekeeping was more regular and repeated
  • 12. Example: Classification
      • Replication of success-tragedy classification
          • Classification criteria originally drawn from interviews with community members
          • Data extracted from repositories
      • Technical challenges
          • Merging data from two repositories
          • Processing large volume of data in multiple steps
  • 13. Variables
      • Inputs: project names and 5 threshold values for classification tests, e.g. number of downloads
      • Project statistics retrieved from repositories
          • Founding date
          • Data collection date
          • Dates for all releases
          • Number of downloads
          • URL
  • 15. Classification Results

      Class            Original        Our results     Difference
      unclassifiable   3 186           3 296           +110
      II               13 342 (12%)    16 252 (14%)    +2 910 (+2%)
      IG               10 711 (10%)    12 991 (11%)    +2 280 (+1%)
      TI               37 320 (35%)    36 507 (31%)    -813 (-4%)
      TG               30 592 (28%)    32 642 (28%)    +2 050 (0%)
      SG               15 782 (15%)    16 045 (14%)    +263 (-1%)
      other            8 422           0
      Total            119 355         117 733
  • 16. Thanks! Questions?
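
The email-network example (slide 8) describes individuals as nodes and reply-to messages as links, with temporal aggregation and identity resolution as the main technical challenges. A minimal sketch of that idea follows. It is not the procedure used in the study: the message records, the `ALIASES` table, and the function names `resolve` and `reply_networks` are all hypothetical and exist only for illustration.

```python
from collections import Counter, defaultdict
from datetime import date

# Hypothetical messages: (message id, sender, id of the message replied to, date sent).
MESSAGES = [
    ("m1", "alice@example.org", None, date(2005, 1, 3)),
    ("m2", "bob@example.org", "m1", date(2005, 1, 4)),
    ("m3", "Alice@Example.org", "m2", date(2005, 1, 20)),
    ("m4", "carol@example.org", "m1", date(2005, 2, 2)),
    ("m5", "bob@example.org", "m4", date(2005, 2, 10)),
]

# Hypothetical alias table: maps each known address to one identity.
ALIASES = {
    "alice@example.org": "alice",
    "bob@example.org": "bob",
    "carol@example.org": "carol",
}


def resolve(address):
    """Very crude identity resolution: normalize case, then look up an alias."""
    key = address.strip().lower()
    return ALIASES.get(key, key)


def reply_networks(messages, window=lambda d: (d.year, d.month)):
    """Build one weighted, directed reply network per time window.

    Each edge (replier, original author) counts the replies sent in the
    window returned by `window(date_sent)`; the default window is a month.
    """
    author = {mid: resolve(sender) for mid, sender, _, _ in messages}
    networks = defaultdict(Counter)
    for _, sender, parent, sent in messages:
        if parent is None or parent not in author:
            continue  # thread starter, or a reply to a message we do not hold
        source, target = resolve(sender), author[parent]
        if source != target:  # ignore self-replies
            networks[window(sent)][(source, target)] += 1
    return dict(networks)


if __name__ == "__main__":
    for period, edges in sorted(reply_networks(MESSAGES).items()):
        print(period, dict(edges))
```

Changing the `window` argument (for example `window=lambda d: d.year`) changes the temporal aggregation, and therefore which ties appear in the same network. That is why the choice of window matters when interpreting network measures.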