Detecting Bad Data
CARMA Research Module
Jeff Stanton

May 18-20, 2006 Internet Data Collection Methods (Day 2-2)
Sources of Data Problems in Online Studies
Technical errors:
  Programming errors: Not common, but damaging when they occur
  Server errors: Can halt the collection of data
  Transmission errors: Uncommon and usually isolated to one record or field
Response fraud:
  Inadvertent multiple response and malicious multiple response
  Missing data
  Intentionally malicious patterns of response leading to outliers or self-contradictory data

Response Fraud
Deindividuation: Anonymous respondents, working at a distance from the researcher, have limited accountability to the research process
Participant incentives introduce mixed motives: respondents must complete the instrument to earn the incentive, but not to any particular level of quality
Minimal frauds: Skipping questions, not thinking through the answers
Maximal frauds: A robot that answers randomly

Duplicate Detection
Fingerprint each row, e.g., with the sum of the numeric columns multiplied by the SD of the same columns
Create a new variable that contains this unique “checksum” value for each row/case
Sort the dataset on the checksum
Create a lag difference variable that subtracts each row's checksum from the checksum of the neighboring row
Sort on the lag variable and investigate all cases of zero or small differences

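The checksum-and-lag procedure can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the SPSS code mentioned in the course: the function names, the `tolerance` parameter, and the sample responses are all hypothetical.

```python
from statistics import stdev

def checksum(row):
    """Fingerprint a row: sum of its numeric values times their SD."""
    return sum(row) * stdev(row)

def near_duplicates(rows, tolerance=1e-9):
    """Return pairs of case indices whose checksums differ by <= tolerance."""
    # Sort case indices by checksum so likely duplicates become neighbors.
    order = sorted(range(len(rows)), key=lambda i: checksum(rows[i]))
    pairs = []
    for prev, curr in zip(order, order[1:]):
        # Lag difference between neighboring checksums in the sorted order.
        lag_diff = checksum(rows[curr]) - checksum(rows[prev])
        if lag_diff <= tolerance:
            pairs.append((min(prev, curr), max(prev, curr)))
    return pairs

# Hypothetical data: case 2 is an exact copy of case 0.
responses = [
    [1, 2, 3, 4, 5],
    [5, 5, 4, 4, 3],
    [1, 2, 3, 4, 5],
    [2, 2, 2, 3, 1],
]
print(near_duplicates(responses))  # → [(0, 2)]
```

A matching checksum is a prompt to inspect the rows, not proof of duplication: a row with the same values in a different order produces the same sum and SD.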
Bogus Response Detection
Calculate common univariate statistics using the complete row of responses for each subject
Create new variables for the univariate summaries (mean, SD, skewness, kurtosis, maximum, minimum)
Sort the cases by the mean value
Look for extreme outliers on the high and low ends
Sort the cases by standard deviation, skewness, kurtosis, maximum, minimum
Look for anomalies and trace them back to the original data for that subject

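A minimal Python sketch of the row-wise summaries follows. It assumes simple population-moment formulas for skewness and kurtosis, which are not the bias-corrected versions a statistics package may report; the function name and the three sample cases are hypothetical.

```python
from statistics import mean, pstdev

def row_summary(row):
    """Univariate summaries computed across one respondent's answers."""
    m = mean(row)
    sd = pstdev(row)
    if sd == 0:
        # Identical answers throughout: skewness and kurtosis are undefined.
        skew = kurt = float("nan")
    else:
        z = [(x - m) / sd for x in row]
        skew = mean([v ** 3 for v in z])
        kurt = mean([v ** 4 for v in z]) - 3  # excess kurtosis
    return {"mean": m, "sd": sd, "skew": skew, "kurt": kurt,
            "max": max(row), "min": min(row)}

# Hypothetical cases: B never varies, C alternates between the extremes.
cases = {
    "A": [3, 4, 2, 5, 3, 4],
    "B": [5, 5, 5, 5, 5, 5],
    "C": [1, 5, 1, 5, 1, 5],
}
summaries = {name: row_summary(row) for name, row in cases.items()}

# Sorting by SD puts the zero-variance case first and the alternator last.
by_sd = sorted(summaries, key=lambda name: summaries[name]["sd"])
print(by_sd)  # → ['B', 'A', 'C']
```

The same sort can be repeated on each of the other summaries; cases at either end are the ones to trace back to the raw data.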
Multivariate Outlier Detection
Use Mahalanobis distance to detect outliers
  Regress an arbitrary dependent variable on a set of related items, so that the items act as the predictors
  Sort by Mahalanobis distance: Larger distances are suggestive of outliers
Use autocorrelation to detect unusual data patterns
  Flip the data: Cases become variables and variables become cases
  Run an autocorrelation function
  Look at the ACF graphs to find oddly regular patterns of responding (autocorrelations in excess of .5 across one or more lags)
I have provided example SPSS code in the utilities area of the LMS for each of these tests

Mahalanobis

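For readers without SPSS, the squared Mahalanobis distance of each case can also be computed directly from the item means and the sample covariance matrix. The sketch below is one way to do that in plain Python, not the regression-based SPSS route; all function names and the six sample cases are hypothetical.

```python
def column_means(rows):
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def covariance_matrix(rows):
    """Sample covariance matrix (divisor n - 1) of the columns."""
    n = len(rows)
    means = column_means(rows)
    centered = [[x - m for x, m in zip(row, means)] for row in rows]
    k = len(means)
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(k)]
            for i in range(k)]

def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [list(a[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def mahalanobis_sq(rows):
    """Squared Mahalanobis distance of every case from the centroid."""
    means = column_means(rows)
    cov = covariance_matrix(rows)
    distances = []
    for row in rows:
        d = [x - m for x, m in zip(row, means)]
        # d' S^-1 d, computed as d . (solution of S x = d)
        distances.append(sum(di * xi for di, xi in zip(d, solve(cov, d))))
    return distances

# Hypothetical data: two items that rise together, except in the last case.
rows = [[1, 1], [2, 2], [3, 3], [4, 4], [5, 5], [1, 5]]
distances = mahalanobis_sq(rows)
ranked = sorted(range(len(rows)), key=lambda i: distances[i], reverse=True)
print(ranked[0])  # index of the case with the largest distance → 5
```

The last case has unremarkable values on each item taken alone; it stands out only because its combination of values breaks the pattern the other cases follow.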
Plot, Sort, and Examine

An ACF Indicating No Pattern

An ACF with a Suspicious Pattern

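The autocorrelation check can be sketched in Python by treating one respondent's answers, in item order, as a short series. The function names, the use of the absolute value against the .5 cutoff, and the sample response pattern are assumptions made for illustration.

```python
def acf(series, max_lag):
    """Sample autocorrelations for lags 1..max_lag (max_lag < len(series))."""
    n = len(series)
    m = sum(series) / n
    dev = [x - m for x in series]
    denom = sum(d * d for d in dev)
    if denom == 0:
        # A constant series has no variance, so the ACF is undefined.
        return [float("nan")] * max_lag
    return [
        sum(dev[t] * dev[t + lag] for t in range(n - lag)) / denom
        for lag in range(1, max_lag + 1)
    ]

def suspicious_lags(series, max_lag, cutoff=0.5):
    """Lags whose autocorrelation exceeds the cutoff in absolute value."""
    return [lag for lag, r in enumerate(acf(series, max_lag), start=1)
            if abs(r) > cutoff]

# Hypothetical respondent who cycles 1, 2, 3, 4, 5 across 20 items.
patterned = [1, 2, 3, 4, 5] * 4
print(suspicious_lags(patterned, max_lag=6))  # → [5]
```

The cycle of length five shows up as a large autocorrelation at lag 5 (0.75 here), which is the kind of oddly regular spike to look for in the ACF graph.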
Common Missing Data Mitigation Techniques
Item imputation
  For composite scales expressed as the average of a set of items, ignore missing values that appear on only a small subset of the items
Mean substitution
  Suppresses variability
Time series imputation
  Mean of neighboring points; suppresses spikes
Regression imputation
  Works well for highly intercorrelated variables
Full information maximum likelihood imputation
  Available in some SEM programs

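Three of the simpler techniques can be illustrated in Python, with missing values represented as `None`. This is a sketch under assumptions: the function names, the `max_missing` threshold standing in for "a small subset", and the sample values are all hypothetical.

```python
from statistics import mean, variance

def scale_score(items, max_missing=1):
    """Average the answered items; None if too many items are missing."""
    answered = [x for x in items if x is not None]
    if not answered or len(items) - len(answered) > max_missing:
        return None
    return mean(answered)

def mean_substitute(values):
    """Replace each missing value with the mean of the observed values."""
    m = mean([v for v in values if v is not None])
    return [m if v is None else v for v in values]

def neighbor_impute(series):
    """Fill an interior gap with the mean of its two observed neighbors."""
    out = list(series)
    for i in range(1, len(series) - 1):
        if (series[i] is None and series[i - 1] is not None
                and series[i + 1] is not None):
            out[i] = (series[i - 1] + series[i + 1]) / 2
    return out

print(scale_score([4, None, 5, 3]))        # → 4
print(scale_score([4, None, None, 3]))     # → None

observed = [2, 6, 4]
filled = mean_substitute([2, None, 6, 4])
print(filled)                              # → [2, 4, 6, 4]
# Mean substitution shrinks the variance relative to the observed values.
print(variance(filled) < variance(observed))  # → True

print(neighbor_impute([10, None, 14, 13]))  # → [10, 12.0, 14, 13]
```

The variance comparison shows why mean substitution suppresses variability: every filled-in value sits exactly at the mean and adds nothing to the sum of squared deviations.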
Excel Tips
Your friend the “fill” function
The power of “Paste Special”
Sorting: Click on Data/Sort

Excel Statistical Formulas
=FIND(<find text>, <within text>, <start>)
  Looks for the string <find text> within the string <within text>, starting the search at position <start>, and returns the position of the first occurrence
  Example: =FIND("=", "fish=head", 1) returns 5
=LEN(<string>)
  Returns the number of characters in a string
  Example: =LEN("Ouch") returns 4
=RIGHT(<string>, <length>)
  Returns the rightmost <length> characters in the string
  Example: =RIGHT("fishhead", 4) returns "head"
  =LEFT(<string>, <length>) works similarly
=AVERAGE(value, value…)
  Gives the arithmetic mean of a collection of cells and/or numeric values
=STDEV(value, value…) // STDEVP(value, value…)
  Gives the sample/population standard deviation of a collection of cells and/or numeric values
=SUM(value, value…)
  Gives the sum of a collection of cells and/or numeric values
=CORREL(vector1, vector2)
  Gives the Pearson correlation between two vectors
=IF(<test>, <value if true>, <value if false>)
  Makes a logical test and returns a different value depending on whether the test is true or false
  Example: =IF(1=1, "Yes!", "No…")

Summary of Bad Data Problems
Multiple submissions: The same participant clicks on Submit, then Back, then Submit, then Back…
Unmotivated responding: The participant uses the same option over and over again
Malicious patterns: The participant enters some unusually regular pattern of responses
There are at least five errors of these kinds in the exercise dataset (see below)
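Unmotivated responding of the same-option-over-and-over kind can be screened with a longest-run count. The Python sketch below is one possible check, not a procedure taken from the slides; the function names, the threshold of six, and the sample cases are hypothetical.

```python
from itertools import groupby

def longest_run(row):
    """Length of the longest streak of identical consecutive answers."""
    return max((len(list(group)) for _, group in groupby(row)), default=0)

def flag_straightliners(cases, threshold):
    """IDs of cases whose longest identical streak reaches the threshold."""
    return [case_id for case_id, row in cases.items()
            if longest_run(row) >= threshold]

# Hypothetical cases: r2 picks the same option for every item.
cases = {
    "r1": [3, 4, 2, 5, 3, 4, 2, 1],
    "r2": [4, 4, 4, 4, 4, 4, 4, 4],
    "r3": [1, 2, 2, 2, 3, 5, 4, 4],
}
print(flag_straightliners(cases, threshold=6))  # → ['r2']
```

A sensible threshold depends on the number of items and on how plausible a genuine streak is for the scale in question, so flagged cases still need to be examined by hand.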

 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 

CARMA Internet Research Module: Detecting Bad Data

  • 1. Detecting Bad Data CARMA Research Module Jeff Stanton
  • 2. May 18-20, 2006 Internet Data Collection Methods (Day 2-2): Sources of Data Problems in Online Studies
      ◦ Technical errors:
          ▪ Programming errors: Not common, but damaging when they occur
          ▪ Server errors: Can halt the collection of data
          ▪ Transmission errors: Uncommon and usually isolated to one record or field
      ◦ Response fraud:
          ▪ Inadvertent multiple response and malicious multiple response
          ▪ Missing data
          ▪ Intentionally malicious patterns of response leading to outliers or self-contradictory data
  • 3. Response Fraud
      ◦ Deindividuation: Anonymous respondents, working at a distance from the researcher, have limited accountability to the research process
      ◦ Participant incentives introduce mixed motives: necessity of completing the instrument, but not to any particular level of quality
      ◦ Minimal frauds: skipping questions, not thinking through the answers
      ◦ Maximal frauds: A robot that randomly answers
  • 4. Duplicate Detection
      ◦ Fingerprint each row, e.g., with sum of numeric columns, multiplied by SD of same columns
      ◦ Create a new variable that contains this unique “checksum” value for each row/case
      ◦ Sort the dataset on the checksum
      ◦ Create a lag difference variable that subtracts the checksum for each neighboring row
      ◦ Sort on the lag variable and investigate all cases of zero or small differences
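The duplicate-detection steps on this slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the SPSS code the slides refer to: the function names `checksum` and `near_duplicates`, the tolerance `tol`, and the sample `responses` are all hypothetical.

```python
from statistics import stdev


def checksum(row):
    """Fingerprint a row: sum of the numeric columns times their sample SD."""
    return sum(row) * stdev(row)


def near_duplicates(rows, tol=1e-9):
    """Return (case_a, case_b, lag_difference) for neighboring checksums within tol."""
    sums = [checksum(r) for r in rows]
    # Sort case indices by checksum so similar fingerprints become neighbors.
    order = sorted(range(len(rows)), key=sums.__getitem__)
    flagged = []
    for prev, cur in zip(order, order[1:]):
        diff = sums[cur] - sums[prev]  # lag difference between neighbors
        if diff <= tol:
            flagged.append((prev, cur, diff))
    return flagged


# Hypothetical data: four cases, five items each.
responses = [
    [1, 2, 3, 4, 5],
    [5, 5, 4, 4, 3],
    [1, 2, 3, 4, 5],  # exact copy of case 0
    [2, 2, 2, 3, 1],
]
print(near_duplicates(responses))
```

A flagged pair is only a candidate. This checksum ignores item order, so [1, 2, 3, 4, 5] and [5, 4, 3, 2, 1] collide, and every row with zero SD gets a checksum of 0. That is why the final step is to investigate each hit against the original rows.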
  • 5. Bogus Response Detection
      ◦ Calculate common univariate statistics using the complete row of responses for each subject
      ◦ Create new variables for the univariate summaries (mean, sd, skew, kurt, max, min)
      ◦ Sort the cases by the mean value
      ◦ Look for extreme outliers on the high and low ends
      ◦ Sort the cases by standard deviation, skewness, kurtosis, maximum, minimum
      ◦ Look for anomalies and trace them back to the original data for that subject
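A rough Python sketch of the row-wise summaries described on this slide follows. The helper name `row_summary` and the sample `cases` are hypothetical. Skewness and kurtosis here use the simple moment-based formulas (excess kurtosis), which is an assumption: statistical packages such as SPSS apply small-sample adjustments and will report slightly different values.

```python
from statistics import mean, pstdev, stdev


def row_summary(row):
    """Univariate summaries computed across one respondent's answers."""
    m = mean(row)
    s_pop = pstdev(row)
    if s_pop == 0:
        # Zero variance (same answer everywhere): shape statistics are undefined.
        skew = kurt = None
    else:
        n = len(row)
        skew = sum((x - m) ** 3 for x in row) / (n * s_pop ** 3)
        kurt = sum((x - m) ** 4 for x in row) / (n * s_pop ** 4) - 3
    return {"mean": m, "sd": stdev(row), "skew": skew, "kurt": kurt,
            "max": max(row), "min": min(row)}


# Hypothetical data: three respondents, six items each.
cases = {
    "A": [1, 2, 3, 4, 5, 3],
    "B": [3, 3, 3, 3, 3, 3],  # same option every time
    "C": [5, 5, 5, 5, 5, 1],
}
summaries = {cid: row_summary(row) for cid, row in cases.items()}

# Sort by each summary in turn and inspect both ends of the ordering.
by_sd = sorted(summaries, key=lambda cid: summaries[cid]["sd"])
print(by_sd)
```

Sorting on `sd` puts the respondent who chose the same option throughout at the top of the list. The same sort can be repeated for `mean`, `max`, and `min`.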
  • 6. Multivariate Outlier Detection
      ◦ Use Mahalanobis distance to detect outliers
          ▪ Regress a set of related items on an arbitrary dependent variable
          ▪ Sort by Mahalanobis distance: Larger distances are suggestive of outliers
      ◦ Use autocorrelation to detect unusual data patterns
          ▪ Flip the data: Cases become variables and variables become cases
          ▪ Run an autocorrelation function (ACF)
          ▪ Look at the ACF graphs to find oddly regular patterns of responding (autocorrelations in excess of .5 at one or more lags)
      ◦ I have provided example SPSS code in the utilities area of the LMS for each of these tests
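The slide obtains Mahalanobis distances through a regression run on an arbitrary dependent variable. The sketch below swaps in a direct calculation of the squared distance, D² = (x − μ)ᵀ S⁻¹ (x − μ), using the sample covariance matrix S. All function names and the sample `rows` are hypothetical, and the small Gaussian-elimination solver is there only to keep the example free of third-party libraries.

```python
from statistics import mean


def covariance_matrix(rows):
    """Column means and sample covariance matrix of a list of cases."""
    n, p = len(rows), len(rows[0])
    mu = [mean(r[j] for r in rows) for j in range(p)]
    cov = [[sum((r[i] - mu[i]) * (r[j] - mu[j]) for r in rows) / (n - 1)
            for j in range(p)] for i in range(p)]
    return mu, cov


def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def mahalanobis_sq(rows):
    """Squared Mahalanobis distance of every case from the centroid."""
    mu, cov = covariance_matrix(rows)
    out = []
    for r in rows:
        d = [x - m for x, m in zip(r, mu)]
        z = solve(cov, d)  # z = S^-1 d, without forming the inverse
        out.append(sum(di * zi for di, zi in zip(d, z)))
    return out


# Hypothetical data: two related items; the last case breaks the pattern.
rows = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 5], [1, 5]]
d2 = mahalanobis_sq(rows)
# Sort cases from largest to smallest distance, as the slide suggests.
ranked = sorted(range(len(rows)), key=lambda i: d2[i], reverse=True)
print(ranked)
```

With only two items and six cases the distances are small in absolute terms. The point of the sketch is the ordering: the case that contradicts the relationship between the items sorts to the top.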
  • 7. Mahalanobis
  • 8. Plot, Sort, and Examine
  • 9. An ACF Indicating No Pattern
  • 10. An ACF with a Suspicious Pattern
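The autocorrelation check from the Multivariate Outlier Detection slide can be approximated without plotting. In the sketch below, the function names `acf` and `suspicious` and the sample data are hypothetical; the .5 threshold is the one quoted on the slide. Because the function runs along one respondent's answers in item order, the "flip the data" step needed in SPSS does not apply here.

```python
from statistics import mean


def acf(series, max_lag):
    """Sample autocorrelations of one respondent's answers at lags 1..max_lag."""
    m = mean(series)
    dev = [x - m for x in series]
    denom = sum(d * d for d in dev)
    if denom == 0:
        return [None] * max_lag  # constant responding: ACF is undefined
    return [
        sum(dev[t] * dev[t + k] for t in range(len(dev) - k)) / denom
        for k in range(1, max_lag + 1)
    ]


def suspicious(series, max_lag=5, threshold=0.5):
    """True if any autocorrelation exceeds the threshold in absolute value."""
    return any(r is not None and abs(r) > threshold for r in acf(series, max_lag))


# Hypothetical data: 20 items answered 1-2-3-4-5 over and over,
# and 20 items answered without an obvious pattern.
patterned = [1, 2, 3, 4, 5] * 4
irregular = [3, 1, 4, 1, 5, 2, 2, 5, 3, 4, 1, 3, 5, 2, 4, 4, 1, 5, 2, 3]
print(acf(patterned, 5))
print(suspicious(patterned))
```

A constant row returns undefined autocorrelations and is not flagged by this check, so it still needs the standard-deviation screen from the Bogus Response Detection slide.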
  • 11. Common Missing Data Mitigation Techniques
      ◦ Item imputation: For composite scales expressed as the average of a set of items, ignore any missing values that appear on a small subset of the items
      ◦ Mean substitution: Suppresses variability
      ◦ Time series imputation: Mean of neighboring points; suppresses spikes
      ◦ Regression imputation: Works well for highly intercorrelated variables
      ◦ Full information maximum likelihood imputation: Available in some SEM programs
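The three simplest techniques on this slide can be sketched as follows. The function names and the `max_missing` cutoff are hypothetical: the slide says only "a small subset", so the default of one missing item is an assumption to adjust per scale. Missing values are represented as `None`.

```python
from statistics import mean


def scale_score(items, max_missing=1):
    """Average the answered items; give up if too many items are missing."""
    present = [x for x in items if x is not None]
    if len(items) - len(present) > max_missing:
        return None
    return mean(present)


def mean_substitute(column):
    """Replace missing values in one variable with that variable's mean."""
    m = mean(x for x in column if x is not None)
    return [m if x is None else x for x in column]


def neighbor_mean(series):
    """Fill an isolated gap in a time series with the mean of its two neighbors."""
    out = list(series)
    for i in range(1, len(series) - 1):
        before, after = series[i - 1], series[i + 1]
        if series[i] is None and before is not None and after is not None:
            out[i] = (before + after) / 2
    return out


print(scale_score([4, None, 5, 3]))
print(mean_substitute([2, None, 4]))
print(neighbor_mean([1, None, 5]))
```

Mean substitution gives every filled case the same value, which is why it suppresses variability. The neighbor mean smooths over a point that might have been a genuine spike.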
  • 12. Excel Tips
      ◦ Your friend the “fill” function
      ◦ The power of “Paste Special”
      ◦ Sorting: Click on Data/Sort
  • 13. Excel Statistical Formulas
      ◦ =find(<find text>, <within text>, <start>): Looks for the string <find text> within the string <within text> and returns the position of the first occurrence at or after position <start>. Example: =find("=", "fish=head", 1)
      ◦ =Len(<string>): Returns the number of characters in a string. Example: =Len("Ouch")
      ◦ =Right(<string>,<length>): Returns the rightmost <length> characters in the string. Example: =Right("fishhead",4)
      ◦ =Left(<string>,<length>): Works similarly, returning the leftmost <length> characters
      ◦ =average(value, value…): Gives the arithmetic mean of a collection of cells and/or numeric values
      ◦ =stdev(value, value…) and =stdevp(value, value…): Give the sample and population standard deviation, respectively, of a collection of cells and/or numeric values
      ◦ =sum(value, value…): Gives the sum of a collection of cells and/or numeric values
      ◦ =correl(vector1, vector2): Gives the Pearson correlation between two vectors
      ◦ =if(<test>,<value if true>,<value if false>): Makes a logical test and returns a different value depending on whether the test is true or false. Example: =if(1=1, "Yes!", "No…")
  • 14. Summary of Bad Data Problems
      ◦ Multiple submissions: Same participant clicks on Submit, then Back, then Submit, then Back…
      ◦ Unmotivated responding: Participant uses the same option over and over again
      ◦ Malicious patterns: Participant enters some unusually regular pattern of responses
      ◦ There are at least five errors of these kinds in the exercise dataset (see below)