Big Data Project Presentation
Team Members: Shrinivasaragav Balasubramanian, Shelley Bhatnagar
STACK OVERFLOW DATASET ANALYSIS
Dataset Overview:
• The dataset is obtained from the Stack Exchange Data Dump at the Internet Archive.
• Link to the dataset: https://archive.org/details/stackexchange
• Each Stack Exchange site is published as a separate archive of XML files compressed with 7-Zip.
• We chose the Stack Overflow segment of the dump, which is originally around 20 GB, and reduced it to 3 GB for analysis.
Dataset Overview:
• The Stack Overflow dataset consists of the following files, which we treat as tables in our database design:
  • Posts
  • PostLinks
  • Tags
  • Users
  • Votes
  • Badges
  • Comments
Mission:
• Since the dataset is in XML format, we designed a parser for each file (i.e., each table) to process the data and load it into HDFS.
• The parsers were built as a Java application that implements a Mapper and a Reducer and configures a Hadoop job to parse the data.
• The JAR is run in Hadoop distributed mode, and the parsed data is written to HDFS.
• Each file in the dataset contains more than 12 million entries.
• Each table has 6-7 attributes on average, with missing attributes and empty fields that make entries inconsistent; the parser handles these cases.
Posts Table:
• The Posts table has an attribute named PostTypeId, which is 1 if the post is a question and 2 if the post is an answer to a question.
• Since most of our analysis centered on this table, we split it into PostQuestions and PostAnswers to simplify the analysis.
• E.g. <row Id="1258222" PostTypeId="2" ParentId="1238775"
CreationDate="2009-08-11T02:29:20.380" Score="1"
Body="&lt;p&gt;Lisp. There are so many Lisp systems out there defined in
terms of rules not imperative commands. Google ahoy...&lt;/p&gt;&#xA;"
OwnerUserId="16709" LastActivityDate="2009-08-11T02:29:20.380"
CommentCount="0" />
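The split into questions and answers can be illustrated with a minimal Python sketch. The project itself used a Java MapReduce parser, so this is not the team's code. The sample rows, the function names `parse_row` and `split_posts`, and the choice to skip malformed lines are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample lines in the dump's <row .../> format (values are illustrative).
SAMPLE_ROWS = [
    '<row Id="1238775" PostTypeId="1" CreationDate="2009-08-06T13:00:00.000" Score="3" />',
    '<row Id="1258222" PostTypeId="2" ParentId="1238775" Score="1" OwnerUserId="16709" />',
    '<row Id="broken" PostTypeId=',  # truncated line, expected to be skipped
]


def parse_row(line):
    """Return the attributes of one <row/> line as a dict, or None if it is malformed."""
    try:
        element = ET.fromstring(line.strip())
    except ET.ParseError:
        return None
    if element.tag != "row":
        return None
    return dict(element.attrib)


def split_posts(lines):
    """Split rows into questions (PostTypeId 1) and answers (PostTypeId 2)."""
    questions, answers = [], []
    for line in lines:
        row = parse_row(line)
        if row is None:
            continue  # assumption: inconsistent entries are dropped
        post_type = row.get("PostTypeId")  # a missing attribute yields None
        if post_type == "1":
            questions.append(row)
        elif post_type == "2":
            answers.append(row)
    return questions, answers


questions, answers = split_posts(SAMPLE_ROWS)
print(len(questions), "question(s),", len(answers), "answer(s)")
```

Reading attributes with `row.get(...)` rather than indexing keeps rows with missing attributes from raising an error, which is one simple way to tolerate the inconsistent entries mentioned earlier.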
Analysis using Posts
• Trending questions that are highly viewed and highly scored by users.
• Questions that do not have any answers.
• Questions that have been marked closed, per category.
• Dead questions with no activity in the past 2 years.
• The most viewed questions in each category.
• The highest-scored questions in each category.
• The count of questions posted in each category over a timeframe (say 2 years).
• The list of tags other than standard tags.
• The top posted questions in each category.
Analysis on Posts (cont)
• The rank of each post in the dataset.
• The approximate time a user can expect to wait for a correct answer or working solution to a post in a given category.
Analysis on Users:
• The user profile with the most views.
• The top users by reputation points.
• The most valuable users in the dataset.
• The number of users that have been awarded badges.
• The count of users creating an account in a given timeframe (say 6 months).
• Recommending users to contribute an answer in a similarly liked category.
• The inactive accounts over a range of time.
• The total number of dead accounts.
• The number of users holding each badge.
Analysis on Comments
• The comments that have a count greater than the average count.
• The users posting the most comments.
• The question posts that have the highest number of comments.
Analysis on Votes
• The number of spam comments in the dataset.
• The users that contribute to the spam posts.
• The posts that are scheduled to be deleted from the data dump over a period (say 6 months).
• The top users with votes marked as favorite.
Overview of Internal Page Rank Analysis:
• A page rank is calculated to measure the weight of each question a user contributes to the dump.
• Each post written as a question may be linked to several similar posts from users with similar doubts.
• Similarly, each answer to a post can be referenced by another post.
• Hence, Page Rank is a "vote" by all the other posts in the dataset.
• A link to a post counts as a vote of support; the absence of a link indicates a lack of support.
Page Rank Formula:
• Suppose a post with PostId = A has posts T1 ... Tn pointing to it. We take a damping factor d between 0 and 1 and define C(A) as the number of outgoing links from post A. The Page Rank of the post is then:
• PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
How is Page Rank Calculated?
• The Page Rank of each post depends on the Page Rank of the posts linking to it.
• It is calculated without knowing the final Page Rank values of those posts.
• Thus we run the calculation repeatedly, and each pass brings us closer to the final value.
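The repeated calculation can be sketched in a few lines of Python. This is a minimal illustration of the formula PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)), not the Pig UDF used in the project. The three-post link graph, the function name `page_rank`, and the fixed iteration count are assumptions; C(T) is taken to be the number of outgoing links of T, and every post in the example has at least one outgoing link.

```python
def page_rank(links, d=0.25, iterations=50, initial=1.0):
    """Repeatedly apply PR(A) = (1 - d) + d * sum(PR(T) / C(T)).

    links maps each post id to the list of distinct post ids it links to.
    T ranges over the posts that link to A; C(T) is len(links[T]).
    """
    posts = set(links)
    for targets in links.values():
        posts.update(targets)

    ranks = {post: initial for post in posts}  # starting guess
    for _ in range(iterations):
        new_ranks = {}
        for post in posts:
            incoming = sum(
                ranks[source] / len(targets)
                for source, targets in links.items()
                if post in targets
            )
            new_ranks[post] = (1 - d) + d * incoming
        ranks = new_ranks
    return ranks


# Hypothetical link graph: A links to B and C, B links to C, C links to A.
example_links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = page_rank(example_links)
for post in sorted(ranks):
    print(post, round(ranks[post], 4))
```

Each pass uses only the values from the previous pass, which is why no final value needs to be known in advance. In this example, post C, which receives links from both A and B, ends up with the highest rank.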
Choosing the Damping Factor:
• The "damping factor" is quite subtle.
• If it is too high, the numbers take a long time to settle.
• If it is too low, you get repeated overshoot.
• We ran an analysis to find a suitable damping factor.
• The damping factor chosen for this dataset is 0.25.
• No matter where the initial guess starts, once the values settle, the average Page Rank of all posts will be 1.0.
Example
Web Application: Internal Page Rank Analysis
Predicting First Activity Time On A Post
• The analysis provides an estimated time within which a user can expect activity on a post.
• The dataset was categorized according to tags.
• For each posted question, we took the fastest reply and calculated the time difference between posting the question and receiving that first reply.
• This difference was averaged over all posts in a category to predict the activity time for a post.
How This Works In The Application
• In the application, users provide the tags they would use for their post.
• Based on the tags provided, the application calculates the average time to first activity for each tag and then averages those results.
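The prediction can be sketched as follows: take the earliest reply per question, average the delay per tag, then average the per-tag values for the tags a user supplies. This is a minimal Python illustration, not the application's code. The sample questions, the function names, and the choice to ignore unanswered questions and unknown tags are assumptions.

```python
from collections import defaultdict
from datetime import datetime

TIME_FORMAT = "%Y-%m-%dT%H:%M:%S.%f"  # matches CreationDate values in the dump


def first_reply_hours(question_time, answer_times):
    """Hours between a question and its earliest answer, or None if unanswered."""
    if not answer_times:
        return None
    asked = datetime.strptime(question_time, TIME_FORMAT)
    first = min(datetime.strptime(t, TIME_FORMAT) for t in answer_times)
    return (first - asked).total_seconds() / 3600.0


def average_hours_per_tag(questions):
    """Average first-reply delay per tag; each question is (tags, created, answer_times)."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for tags, created, answer_times in questions:
        hours = first_reply_hours(created, answer_times)
        if hours is None:
            continue  # assumption: unanswered questions are left out of the average
        for tag in tags:
            totals[tag] += hours
            counts[tag] += 1
    return {tag: totals[tag] / counts[tag] for tag in totals}


def predict_hours(user_tags, per_tag_average):
    """Average the per-tag estimates for the tags the user provided."""
    known = [per_tag_average[tag] for tag in user_tags if tag in per_tag_average]
    return sum(known) / len(known) if known else None


# Hypothetical sample data.
sample_questions = [
    (["java", "hadoop"], "2014-01-01T10:00:00.000",
     ["2014-01-01T12:00:00.000", "2014-01-01T11:00:00.000"]),
    (["java"], "2014-01-02T10:00:00.000", ["2014-01-02T13:00:00.000"]),
    (["hadoop"], "2014-01-03T10:00:00.000", []),
]
per_tag = average_hours_per_tag(sample_questions)
print(per_tag, predict_hours(["java", "hadoop"], per_tag))
```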
How We Did It
• Created a graph structure based on Posts and Related Posts.
• The graph consists of Nodes and Edges.
• Each Node has several Edges, and each Edge leads to another Node, which again has several Edges.
• Created a Pig UDF to which all the Posts and Related Posts are sent as a group.
• A graph is created from this input.
• Rank is calculated based on how many incoming links each Node has.
• The more incoming links a Node has, the higher its Page Rank.
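The graph construction can be sketched in plain Python, with (post, related post) pairs standing in for the grouped input that the Pig UDF receives. The pair format, the sample ids, and the function names are assumptions. Ordering by incoming-link count is only a rough stand-in for the full Page Rank calculation.

```python
from collections import defaultdict


def build_graph(post_links):
    """Build outgoing and incoming adjacency sets from (post_id, related_post_id) pairs."""
    outgoing = defaultdict(set)
    incoming = defaultdict(set)
    for post_id, related_id in post_links:
        outgoing[post_id].add(related_id)  # edge: post_id -> related_id
        incoming[related_id].add(post_id)
    return outgoing, incoming


def order_by_incoming(incoming):
    """Post ids sorted by number of incoming links, highest first (ties by id)."""
    return sorted(incoming, key=lambda post: (-len(incoming[post]), post))


# Hypothetical links: posts 1, 2 and 4 point to 3; 3 points to 4; 1 points to 2.
sample_links = [(1, 3), (2, 3), (4, 3), (3, 4), (1, 2)]
outgoing, incoming = build_graph(sample_links)
print(order_by_incoming(incoming))
```

Using sets means a link that appears twice in the input is counted once.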
Hive HBase Integration
• Integrated Hive with the existing HBase table.
• hbase.columns.mapping must be provided, whereas hbase.table.name is optional.
• We use the HBaseStorageHandler to allow Hive to interact with HBase.
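To show how the two properties fit together, the Python helper below assembles a HiveQL statement of the kind used to expose an existing HBase table to Hive. It is a sketch under assumptions: the table name, the column names, the column family `stats`, and the helper name `hbase_backed_table_ddl` are hypothetical and not taken from the project.

```python
def hbase_backed_table_ddl(hive_table, columns, columns_mapping, hbase_table=None):
    """Build a CREATE EXTERNAL TABLE statement for a Hive table backed by HBase.

    columns is a list of (name, hive_type) pairs.
    columns_mapping is the required hbase.columns.mapping value.
    hbase_table sets the optional hbase.table.name property; if omitted,
    Hive assumes the HBase table has the same name as the Hive table.
    """
    column_defs = ", ".join(f"{name} {hive_type}" for name, hive_type in columns)
    ddl = (
        f"CREATE EXTERNAL TABLE {hive_table} ({column_defs})\n"
        "STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'\n"
        f'WITH SERDEPROPERTIES ("hbase.columns.mapping" = "{columns_mapping}")'
    )
    if hbase_table is not None:
        ddl += f'\nTBLPROPERTIES ("hbase.table.name" = "{hbase_table}")'
    return ddl + ";"


# Hypothetical table: the row key maps to user_id, stats:question_count to an int column.
print(hbase_backed_table_ddl(
    "questions_by_user",
    [("user_id", "string"), ("question_count", "int")],
    ":key,stats:question_count",
    hbase_table="questions_by_user",
))
```

The mapping string lists one entry per Hive column, in order, with `:key` standing for the HBase row key.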
Hive Thrift Server
• HiveServer is an optional service that allows a remote client to submit requests to Hive, using a variety of programming languages, and retrieve results.
• We used the Hive Thrift Server to connect to the Hive tables from the web application.
• Starting the Hive Thrift Server: hive --service hiveserver
• Connection String:
Mahout User Based Recommender
• Provides suggestions to users about questions they can answer in other categories.
• We take the User ID, Category ID and interaction level as input to the Mahout user-based recommender.
How Did We Implement It
• We used Pig queries to join the various tables and produce an output containing User ID, Category ID and interaction level.
• We used this output as the input to the Mahout user-based recommender.
• We converted the interaction level values to the range 0 to 5.
• We used PearsonCorrelationSimilarity as the similarity measure and NearestNUserNeighborhood as the neighborhood.
• We then used the user-based recommender to provide 3 suggestions of other categories in which the user can contribute by answering questions.
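Mahout is a Java library, so the Python sketch below does not call it. It only mimics the same steps in plain code: rescaling interaction levels to 0 to 5, Pearson similarity between users, a nearest-N neighborhood, and up to 3 suggested categories. The sample users, category names, function names, and the similarity-weighted average used to score categories are assumptions, and Mahout's own estimation may differ in detail.

```python
from math import sqrt


def scale_to_range(value, lowest, highest, top=5.0):
    """Linearly rescale a raw interaction level into the range 0..top."""
    if highest == lowest:
        return 0.0
    return top * (value - lowest) / (highest - lowest)


def pearson(prefs_a, prefs_b):
    """Pearson correlation over the categories both users interacted with."""
    common = sorted(set(prefs_a) & set(prefs_b))
    if len(common) < 2:
        return 0.0
    xs = [prefs_a[c] for c in common]
    ys = [prefs_b[c] for c in common]
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def recommend(user, prefs, neighbours=2, how_many=3):
    """Suggest categories the user has not touched, weighted by similar users."""
    scored = [(pearson(prefs[user], prefs[o]), o) for o in prefs if o != user]
    # Keep the N most similar users with positive correlation.
    nearest = sorted((s, o) for s, o in scored if s > 0)[::-1][:neighbours]
    totals, weights = {}, {}
    for similarity, other in nearest:
        for category, level in prefs[other].items():
            if category in prefs[user]:
                continue
            totals[category] = totals.get(category, 0.0) + similarity * level
            weights[category] = weights.get(category, 0.0) + similarity
    estimates = {c: totals[c] / weights[c] for c in totals}
    return sorted(estimates, key=lambda c: (-estimates[c], c))[:how_many]


# Hypothetical interaction levels, already scaled to 0..5.
sample_prefs = {
    "u1": {"java": 5.0, "hadoop": 3.0, "python": 1.0},
    "u2": {"java": 4.5, "hadoop": 3.5, "python": 1.5, "hive": 4.0, "pig": 2.0},
    "u3": {"java": 1.0, "hadoop": 3.0, "python": 5.0, "scala": 5.0},
}
print(recommend("u1", sample_prefs))
```

In this sample, u3's preferences run opposite to u1's, so only u2 ends up in the neighborhood and only u2's extra categories are suggested.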
Web Application: Mahout Recommender
Web Application
• We incorporated our analysis into a web application.
• The web application retrieves the required data using HBase and Hive.
• Screenshots of the application and of the analysis performed are attached below.
• We used Google Charts to display our analysis as graphs.
Questions Posted By User: Used HBase
Tag Count Analysis: Most Used Tags
Dead Accounts Analysis
Closed Questions Analysis
Comments To Answers Analysis
Top Questions Analysis
Trending Posts Analysis
Monthly Deleted Posts
Answered Vs Unanswered Questions
Finding Average Answer Time
Internal Page Rank Analysis
Mahout Recommender
Problem Faced:
• Performance depends on input sizes and the MapReduce file system chunk size.
• Queries that required sorting created many temporary files that were written to disk.
• The performance of MapReduce is evaluated by reviewing the counters for the map tasks.
• We faced significant problems in the parser implemented to read the XML files.
• The number of spilled records was significantly higher than the number of records read by the map task, which resulted in a NullPointerException with the message:
INFO mapreduce.Job: Job job_local1747290386_0001 failed with
state FAILED due to: NA

STACK OVERFLOW DATASET ANALYSIS

  • 1. Big Data Project Presentation Team Members: Shrinivasaragav Balasubramanian, Shelley Bhatnagar STACK OVERFLOW DATASET ANALYSIS
  • 2.  The Dataset is obtained from Stack Exchange Data Dump at the Internet Archive.  The link to the Dataset is as follows : https://archive.org/details/stackexchange  Each site under Stack Exchange is formatted as a separate archive consisting of XML files zipped via 7-zip that includes various files.  We chose the Stack Overflow Data Segment under the Stack Exchange Dump which originally is around ~ 20 GB and we brought it to 3 GB for performing analysis. Dataset Overview:
  • 3.  Stack Overflow Dataset consists of following files that are treated as tables in our Database Design:  Posts  PostLinks  Tags  Users  Votes  Batches  Comments Dataset Overview:
  • 4.
  • 5.  Since our dataset is in xml format, we designed parsers for each file i.e table, to process the data easily and dump the data into HDFS.  The parsers were designed into a Java Application, implementing Mapper and Reducer while configuring a job in Hadoop to parse the data.  The Jar is run in Hadoop Distributed Mode and the parsed data is dumped into HDFS.  Each file in dataset consists of 12 million + entries.  Each table had 6-7 attributes in average while also consisting of missing attributes, empty fields and hence inconsistent data entries which the parser took care of. Mission:
  • 6.  The Posts table consisted of an attribute named PostTypeId which is 1 if the Post is a Question Post and 2 is the Post is an answer to the Question.  Since most of our analysis was centered on this table, we divided the table into PostQuestions and PostAnswers to make the analysis simple.  Eg. <row Id="1258222" PostTypeId="2" ParentId="1238775“ CreationDate="2009-08-11T02:29:20.380" Score="1" Body="&lt;p&gt;Lisp. There are so many Lisp systems out there defined in terms of rules not imperative commands. Google ahoy...&lt;/p&gt;&#xA;" OwnerUserId="16709" LastActivityDate="2009-08-11T02:29:20.380" CommentCount="0" /> Posts Table:
  • 7.  The trending Questions that are viewed and scored highly by users.  The Questions that doesn’t have any answers.  The Questions that have been marked closed for each category.  The Questions that are dead and have no activity past 2 years.  The most viewed questions in each category.  The most scored questions in each category  The count of posted questions of each category over a timeframe (say 2 years).  The list of tags other than standard tags.  The top posted Questions in each category. Analysis using Posts
  • 8.  The RANK of the Post in the dataset.  Approximate time for a User Post in a category to expect a correct answer or a working solution. Analysis on Posts (cont)
  • 9.  The User profile with maximum views.  The top users with maximum reputation points.  Most valuable users in the dataset.  The numbers of users that have been awarded batches.  The count of users creating account in a given timeframe (say 6 months).  Recommending users to contribute an answer for a similarly liked category.  The inactive accounts over a range of time.  Total Number of dead accounts.  The Number of users bearing various batches Analysis on Users:
  • 10.  The comments that have a count greater than the average count.  The users posting the maximum number of comments.  The Question Posts that have the highest number of comments. Analysis on Comments
  • 11.  The number of spam comments in the dataset.  The Users that contribute to the spam posts.  The Posts that are scheduled to be deleted from the data dump over a period (say 6 months).  The top users carrying votes titled as favorite. Analysis on Votes
  • 12.  A Page Rank is calculated to find the weight of a question contributed by a user to the dump.  Each Post written as a question may be linked to several other similar Posts by users with similar doubts.  Similarly, each answer to a Post can be referenced by another Post.  Hence, Page Rank is a “VOTE” by all the other Posts in the dataset.  A link to a Post counts as a vote of support; the absence of a link indicates a lack of support. Overview of Internal Page Rank Analysis:
  • 13.  Thus, if a Post with PostId = A has Posts T1 ... Tn pointing to it, we take a damping factor d between 0 and 1 and define C(A) as the number of links going out of Post A; the Page Rank of Post A is then given as follows:  PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)) Page Rank Formula:
  • 14.  The Page Rank of each Post depends on the Posts linked to it.  It is calculated without knowing the final Page Rank values of those Posts.  Thus we run the calculation repeatedly, and each pass takes us closer to the final value. How is Page Rank Calculated?
  • 15.  The “damping factor” is quite subtle.  If it’s too high, it takes ages for the numbers to settle;  if it’s too low, you get repeated overshoot.  We performed analysis to find the optimal damping factor.  The damping factor chosen for this dataset is 0.25.  No matter where we start the guess, once settled, the average Page Rank of all pages will be 1.0. Choosing the Damping Factor:
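The formula, the repeated calculation and the damping factor from the three Page Rank slides above can be tried on a toy graph. This is a minimal sketch in Python, not the team's implementation: the three-post graph and the `start` and `iterations` parameters are assumptions, and the average of 1.0 holds here because every post in the example has at least one outgoing link.

```python
def page_rank(links, d=0.25, iterations=50, start=1.0):
    """Iterate PR(A) = (1-d) + d * sum(PR(T)/C(T)) over the posts T that link to A.

    links maps each post id to the list of post ids it links to.
    """
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    # C(T): number of links going out of each post.
    out_count = {n: len(links.get(n, [])) for n in nodes}
    incoming = {n: [] for n in nodes}
    for source, targets in links.items():
        for target in targets:
            incoming[target].append(source)
    ranks = {n: start for n in nodes}  # initial guess
    for _ in range(iterations):
        # Each pass recomputes every rank from the previous pass's values.
        ranks = {
            n: (1 - d) + d * sum(ranks[t] / out_count[t] for t in incoming[n])
            for n in nodes
        }
    return ranks

# Hypothetical toy graph: A links to B and C, B links to C, C links back to A.
toy_links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = page_rank(toy_links)
for post in sorted(ranks):
    print(post, round(ranks[post], 4))
print("average:", round(sum(ranks.values()) / len(ranks), 4))
```

Post C collects links from both other posts and ends up with the highest rank, and calling `page_rank` with a different `start` value settles on the same numbers.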
  • 17. Web Application: Internal Page Rank Analysis
  • 18.  The analysis predicts and provides an estimated time within which a user can expect activity on a Post.  The analysis involved categorizing the dataset according to the tags.  For each posted question, the fastest reply was taken into consideration, and the time difference between posting the question and getting the first reply was calculated.  This difference was averaged over all the Posts belonging to a category, thereby predicting the activity on a Post. Predicting First Activity Time On A Post
  • 19.  In the application, a user can provide the tags they would use for their Post.  Based on the tags provided, the application calculates the average time taken for an activity on each tag and then averages the per-tag results. How This Works In The Application
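The two averaging steps described above (per tag, then across the tags a user enters) can be sketched as follows. The input layout, the function names and the sample timestamps are hypothetical; the slides do not show the actual queries.

```python
from datetime import datetime
from statistics import mean

def first_reply_hours(question_time, answer_times):
    """Hours between a question and its earliest answer, or None if it has no answer."""
    if not answer_times:
        return None
    asked = datetime.fromisoformat(question_time)
    first = min(datetime.fromisoformat(t) for t in answer_times)
    return (first - asked).total_seconds() / 3600

def average_by_tag(questions):
    """questions: iterable of (tags, question_time, answer_times) tuples."""
    waits = {}
    for tags, asked, answers in questions:
        hours = first_reply_hours(asked, answers)
        if hours is None:
            continue  # unanswered questions do not contribute to the average
        for tag in tags:
            waits.setdefault(tag, []).append(hours)
    return {tag: mean(values) for tag, values in waits.items()}

def estimate_for_tags(tag_averages, user_tags):
    """Average the per-tag averages for the tags the user entered."""
    known = [tag_averages[t] for t in user_tags if t in tag_averages]
    return mean(known) if known else None

# Made-up sample data for illustration only.
questions = [
    (["java"], "2009-08-11T02:00:00", ["2009-08-11T04:00:00", "2009-08-11T03:00:00"]),
    (["java", "hadoop"], "2009-08-12T10:00:00", ["2009-08-12T13:00:00"]),
    (["hadoop"], "2009-08-13T00:00:00", []),
]
tag_averages = average_by_tag(questions)
print(estimate_for_tags(tag_averages, ["java", "hadoop"]))
```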
  • 20.  We created a graph structure based on Posts and Related Posts.  The graph comprises Nodes and Edges.  Each Node has several Edges, and each Edge leads to another Node, which in turn has several Edges.  We created a Pig UDF to which all the Posts and Related Posts are sent as a group.  Based on this input, a graph is created.  Rank is calculated based on how many incoming links each Node has.  The more incoming links, the higher the Page Rank. How We Did It
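The graph construction can be illustrated in plain Python rather than as a Pig UDF. This sketch assumes each input pair reads as "post links to related post"; that direction, the function names and the sample ids are assumptions.

```python
def build_graph(pairs):
    """Build outgoing and incoming edge sets from (post_id, related_post_id) pairs."""
    outgoing, incoming = {}, {}
    for post, related in pairs:
        # Register both ends so posts with no incoming links still appear as nodes.
        for node in (post, related):
            outgoing.setdefault(node, set())
            incoming.setdefault(node, set())
        outgoing[post].add(related)
        incoming[related].add(post)
    return outgoing, incoming

def rank_by_incoming(incoming):
    """Order posts by number of incoming links, highest first, ties broken by id."""
    return sorted(incoming, key=lambda n: (-len(incoming[n]), n))

# Hypothetical post ids for illustration.
pairs = [("q1", "q3"), ("q2", "q3"), ("q3", "q1")]
outgoing, incoming = build_graph(pairs)
print(rank_by_incoming(incoming))
```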
  • 21.  We integrated Hive with the existing HBase table.  We need to provide hbase.columns.mapping, whereas hbase.table.name is optional.  We use the HBaseStorageHandler to allow Hive to interact with HBase. Hive HBase Integration
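For illustration, a Hive table definition of this kind can be assembled as shown below. The helper is a small Python sketch that only builds the statement text; the table name, columns and the `stats` column family are hypothetical, while the storage handler class and the two property names are the ones Hive uses for HBase-backed tables.

```python
def hbase_backed_table_ddl(hive_table, columns, mapping, hbase_table=None):
    """Build a CREATE EXTERNAL TABLE statement for a Hive table stored in HBase.

    columns: list of (name, hive_type) pairs.
    mapping: one HBase column spec per Hive column; ':key' stands for the row key.
    """
    if len(columns) != len(mapping):
        raise ValueError("need exactly one HBase mapping entry per Hive column")
    cols = ", ".join(f"{name} {hive_type}" for name, hive_type in columns)
    ddl = (
        f"CREATE EXTERNAL TABLE {hive_table} ({cols}) "
        "STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' "
        f"WITH SERDEPROPERTIES ('hbase.columns.mapping' = '{','.join(mapping)}')"
    )
    # hbase.table.name is only needed when the HBase table name differs from the Hive one.
    if hbase_table is not None:
        ddl += f" TBLPROPERTIES ('hbase.table.name' = '{hbase_table}')"
    return ddl

ddl = hbase_backed_table_ddl(
    "user_questions",                                   # hypothetical Hive table
    [("user_id", "string"), ("question_count", "int")],
    [":key", "stats:question_count"],                   # hypothetical column family
)
print(ddl)
```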
  • 22.  HiveServer is an optional service that allows a remote client to submit requests to Hive, using a variety of programming languages, and retrieve results.  We used the Hive Thrift Server to connect to the Hive tables from the Web Application.  Starting the Hive Thrift Server: hive --service hiveserver  Connection String: Hive Thrift Server
  • 23.  Providing suggestions to users about questions they can answer in other categories.  We take the User ID, Category ID and interaction level as the input to the Mahout user-based recommender. Mahout User Based Recommender
  • 24.  We used Pig queries to join the various tables and produce an output containing User ID, Category ID and interaction level.  We used this output as the input to the Mahout user-based recommender.  We converted the interaction level values to the range 0 to 5.  We used PearsonCorrelationSimilarity as the similarity measure and NearestNUserNeighborhood as the neighborhood.  We then used the user-based recommender to provide 3 suggestions of other categories to which the user can contribute by answering questions. How Did We Implement It
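The steps above (scaling to 0 to 5, Pearson similarity, nearest-N neighborhood, top 3 suggestions) can be illustrated without Mahout. This is a plain Python sketch of the general technique, not Mahout's API or its exact scoring; the function names, the min-max scaling and the sample preferences are assumptions.

```python
from math import sqrt

def scale_to_range(values, low=0.0, high=5.0):
    """Min-max scale raw interaction counts into [low, high]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [high for _ in values]
    return [low + (v - lo) * (high - low) / (hi - lo) for v in values]

def pearson(a, b):
    """Pearson correlation over the categories both users interacted with."""
    common = [c for c in a if c in b]
    if len(common) < 2:
        return 0.0
    xs = [a[c] for c in common]
    ys = [b[c] for c in common]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    norm = sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return cov / norm if norm else 0.0

def recommend(prefs, user, n_neighbors=2, how_many=3):
    """prefs: {user_id: {category_id: interaction_level}}; returns category ids."""
    sims = sorted(
        ((pearson(prefs[user], prefs[o]), o) for o in prefs if o != user),
        reverse=True,
    )
    # Keep the N most similar users, dropping those with no positive correlation.
    neighbors = [(s, o) for s, o in sims[:n_neighbors] if s > 0]
    scores, weights = {}, {}
    for sim, other in neighbors:
        for cat, level in prefs[other].items():
            if cat in prefs[user]:
                continue  # only suggest categories the user has not touched yet
            scores[cat] = scores.get(cat, 0.0) + sim * level
            weights[cat] = weights.get(cat, 0.0) + sim
    ranked = sorted(scores, key=lambda c: (-scores[c] / weights[c], c))
    return ranked[:how_many]

# Made-up interaction levels, already on the 0 to 5 scale.
prefs = {
    "u1": {"java": 5.0, "hadoop": 3.0, "hive": 1.0},
    "u2": {"java": 4.5, "hadoop": 3.0, "hive": 1.5, "pig": 4.0, "hbase": 2.0},
    "u3": {"java": 1.0, "hadoop": 3.0, "hive": 5.0, "mahout": 5.0},
}
print(scale_to_range([0, 10, 40]))
print(recommend(prefs, "u1"))
```

User u2 has tastes that correlate with u1, so u2's other categories are suggested; u3 correlates negatively and is left out of the neighborhood.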
  • 25. Web Application: Mahout Recommender
  • 26.  We were able to incorporate our analysis into a Web Application.  The Web Application retrieves the required data using HBase and Hive.  Attached below are screenshots of the application and the analysis that has been performed.  We used Google Charts to display our analysis as graphs. Web Application
  • 27. Questions Posted By User: Used HBase
  • 28. Tag Count Analysis: Most Used Tags
  • 37. Internal Page Rank Analysis
  • 39.  Performance depends on input sizes and the MapReduce FS chunk size.  For queries that required sorting of data, many temp files were created and written to disk.  The performance of MapReduce is evaluated by reviewing the counters for the map tasks.  We faced significant problems in the parser implemented to read the XML files.  The number of spilled records was significantly higher than the number of records read by the map tasks, and the job ended in a NullPointerException with the message: INFO mapreduce.Job: Job job_local1747290386_0001 failed with state FAILED due to: NA Problem Faced: