The document presents a study on classifying derogatory comments using machine learning models. It aims to build a model that can detect and classify toxic, harmful, or abusive comments on social media. The researchers collected a dataset from Kaggle containing Wikipedia comments labeled as obscene, severely toxic, threatening, identity-attacking, or insulting. They propose using Naive Bayes-SVM (NB-SVM) as a baseline model and comparing the performance of BERT and LSTM models on this task. Previous works that used NB-SVM, BERT, and LSTM models for related tasks are summarized in a table, showing accuracy rates from 87% to 98%. The proposed system architecture involves preprocessing the data, using NB-SVM for initial classification, and then feeding the output into BERT and LSTM models for final classification.
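The core of the NB-SVM baseline mentioned above is scaling word features by their naive Bayes log-count ratios before training a linear classifier. A minimal pure-Python sketch of that ratio computation follows; the toy comments are invented for illustration, and a real NB-SVM would feed the ratio-scaled features into a linear SVM rather than summing them directly.

```python
import math
from collections import Counter

def nb_log_count_ratios(pos_docs, neg_docs, alpha=1.0):
    """Naive Bayes log-count ratios: the feature weighting at the heart of NB-SVM."""
    vocab = set(w for d in pos_docs + neg_docs for w in d.split())
    p = Counter(w for d in pos_docs for w in d.split())  # toxic-class counts
    q = Counter(w for d in neg_docs for w in d.split())  # clean-class counts
    p_total = sum(p[w] + alpha for w in vocab)
    q_total = sum(q[w] + alpha for w in vocab)
    return {w: math.log(((p[w] + alpha) / p_total) /
                        ((q[w] + alpha) / q_total)) for w in vocab}

def score(doc, r):
    # Sum of ratios over tokens; positive means the comment leans toxic.
    return sum(r.get(w, 0.0) for w in doc.split())

toxic = ["you are an idiot", "idiot go away"]
clean = ["have a nice day", "thanks for the help"]
r = nb_log_count_ratios(toxic, clean)
print(score("what an idiot", r) > 0)   # → True
```

Words frequent in the toxic class get large positive weights, words frequent in the clean class get negative ones, which is why this simple weighting is a strong baseline for toxicity tasks.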
A Blended Approach to Analytics at Data Tactics Corporation - Rich Heimann
1) The document summarizes a presentation by Richard Heimann from Data Tactics Corporation about their blended approach to big data analytics.
2) Data Tactics uses both academic and industry experts with backgrounds in fields like mathematics, computer science, and statistics to perform analytics using methods like clustering, regression, and text analysis.
3) Heimann explains that their approach combines traditional analytics techniques ("horizontals") with specialized domain knowledge ("verticals") to solve specific client problems.
1) Data Tactics Corporation uses a blended approach to big data analytics with a team of data scientists from various educational backgrounds and skills in both common and specialized analytical techniques.
2) The presentation discussed why analytics are important for business and practical reasons, noting that no single algorithm works best for all problems and algorithms must be designed for specific domains or styles of problems.
3) Shiny and BDE (Big Data Engineering) were presented as tools that can help scale analytics from academic publications to web and enterprise levels when combined with techniques like machine learning, parallel and distributed processing, and object-oriented approaches.
The document discusses MapReduce and how it can be used for sequential web access-based recommendation systems. It explains that MapReduce separates large-scale, unstructured data processing from the computation itself, allowing jobs to run efficiently across many machines. A MapReduce job could process web server logs to build a pattern tree for recommendations, with the tree continuously updated from new data. When making recommendations for a user, their access pattern would be compared against the tree generated from all user data.
This document summarizes a project that uses machine learning and natural language processing to develop a plug-in called Ushine that improves the human review process of crisis reports on the Ushahidi platform. The plug-in detects languages, identifies private information, locations, URLs, and suggests categories for reports. It also detects duplicate reports. The plug-in is intended to guide but not replace human review. Evaluation of the plug-in and future work are discussed. Contact information is provided for collaborating on the open source project.
Data Science for Social Good and Ushahidi - Final Presentation - Ushahidi
The Data Science for Social Good Fellows (dssg.io) collaborated with Ushahidi (Ushahidi.com)
Presented: August 20, 2013
Video - https://www.youtube.com/watch?v=4eK8HjVG2m0
Tool - http://dssg.ushahididev.com/
The document discusses MapReduce and how it can be used for recommendation systems. It describes how MapReduce works by mapping data into key-value pairs and then reducing them, which allows large amounts of sparse, unstructured data to be processed efficiently across many machines. It then gives an example of how MapReduce could be used to build a sequential web access-based recommendation system by mapping log data into a pattern tree that is continuously updated and used to provide recommendations.
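The map-into-key-value-pairs-then-reduce flow described above can be simulated in a few lines. This sketch counts word occurrences, the canonical MapReduce example; in a real cluster the map and reduce phases would run in parallel across machines, with the framework handling the shuffle/sort between them.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(records):
    # Map: emit one (key, value) pair per token, here (word, 1).
    for line in records:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    # Shuffle/sort groups pairs by key; reduce then sums each group.
    out = {}
    for key, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        out[key] = sum(v for _, v in group)
    return out

logs = ["a b a", "b c"]
print(reduce_phase(map_phase(logs)))  # → {'a': 2, 'b': 2, 'c': 1}
```

The recommendation-system variant would emit (user, page) pairs in the map phase and assemble per-user access sequences in the reduce phase before building the pattern tree.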
Ordering the chaos: Creating websites with imperfect data - Andy Stretton
The document discusses strategies for dealing with messy and imperfect data when creating websites. It describes how the Chembio Hub uses techniques like automatically tagging untagged data using significant terms analysis in Elasticsearch and creating database views to normalize different schemas. Filling gaps in tagging by querying search engines and considering flat file databases are also proposed. The goal is to enable sharing of chemical and biological research data across Oxford departments in a sustainable way without requiring perfect data formats or extensive curation.
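The significant-terms technique mentioned above surfaces terms that are unusually frequent in a subset of documents compared to the whole index, which makes them candidate tags for untagged records. A sketch of such an Elasticsearch request body follows; the index and field names ("compounds", "abstract", "tags") are illustrative assumptions, not taken from the Chembio Hub's actual schema.

```python
# Request body for an Elasticsearch significant_terms aggregation over
# documents that lack a "tags" field (hypothetical schema).
query = {
    "query": {"bool": {"must_not": {"exists": {"field": "tags"}}}},
    "aggs": {
        "suggested_tags": {
            "significant_terms": {"field": "abstract", "size": 5}
        }
    },
}
# es.search(index="compounds", body=query) would then return terms that are
# over-represented in the untagged subset, usable as candidate tags.
```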
This document outlines a proposed chatbot project using natural language processing. It lists the team members and provides an abstract, description of the existing system using two dimensional arrays, and proposed system using recurrent neural networks and NLP. It also includes hardware and software specifications. References for related work on chatbots are provided at the end.
This document discusses principles for designing systems and databases for high performance. It emphasizes that:
1. Performance requirements should be specified upfront and the architecture should support meeting those requirements.
2. The database design impacts performance, so logical and physical designs like normalization, indexing, partitioning, and caching should be considered.
3. Application architecture also affects performance through techniques like minimizing network traffic and SQL parsing overhead.
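The impact of indexing on the query plan, one of the physical-design points above, can be seen directly with the standard library's SQLite bindings. The table and index names here are invented for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INT, total REAL)")
con.executemany("INSERT INTO orders (customer_id, total) VALUES (?, ?)",
                [(i % 100, i * 1.5) for i in range(1000)])

def plan(sql):
    # EXPLAIN QUERY PLAN rows carry the human-readable detail in column 3.
    return " ".join(row[3] for row in con.execute("EXPLAIN QUERY PLAN " + sql))

q = "SELECT total FROM orders WHERE customer_id = 42"
before = plan(q)   # full table scan: every row is examined
con.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
after = plan(q)    # index search: only matching rows are touched
print(before)
print(after)
```

The same query goes from scanning all 1000 rows to a direct index lookup, which is why indexing decisions belong in the upfront performance design rather than being bolted on later.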
Advantages Of The Extensible Markup Language (XML)Ashley Jean
Tim Berners-Lee created HTML at CERN around 1990, laying the foundation for the World Wide Web. As a simple markup language, HTML allowed researchers to share documents over early computer networks by defining common tags to structure and format text-based content. Though limited in capabilities, HTML established a common language that could be understood by any computer or operating system. This open standard enabled the World Wide Web to rapidly expand and evolve into the vast digital network it is today, fundamentally transforming how people access and share information on a global scale.
Software evolution research is a thriving area of software engineering research. Recent years have seen a growing interest in a variety of evolution topics, as witnessed by the growing number of publications dedicated to the subject. Without attempting to be complete, this talk provides an overview of emerging trends in software evolution research, such as the extension of the traditional boundaries of software, growing attention to the social and socio-technical aspects of software development processes, and interdisciplinary research that applies techniques from other research areas to study software evolution, and software evolution techniques to other areas. As a large body of software evolution research is empirical in nature, we are confronted with important challenges pertaining to the reproducibility of the research and its generalizability.
Aniket Kittur, Bongwon Suh, Bryan Pendleton, Ed H. Chi.
He Says, She Says: Conflict and Coordination in Wikipedia.
In Proc. of ACM Conference on Human Factors in Computing Systems (CHI 2007), pp. 453-462, April 2007. ACM Press. San Jose, CA.
http://www-users.cs.umn.edu/~echi/papers/2007-CHI/2007-Wikipedia-coordination-PARC-CHI2007.pdf
The document discusses developing an open domain chatbot using sequence modeling and machine translation techniques. It provides background on early rule-based chatbots and modern data-driven approaches. The proposed methodology collects data, performs word embeddings, uses an encoder-decoder model with attention to generate responses, and evaluates the model using metrics like F1 score.
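The attention step in the encoder-decoder model described above weights each encoder state by how well it matches the current decoder query. A minimal pure-Python sketch of scaled dot-product attention follows, with tiny hand-picked vectors for illustration.

```python
import math

def softmax(xs):
    m = max(xs)                       # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(query, keys, values):
    # Scaled dot-product attention: score each key against the query,
    # normalize with softmax, return the weighted sum of values.
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    context = [sum(w * v[i] for w, v in zip(weights, values))
               for i in range(len(values[0]))]
    return context, weights

ctx, w = attention([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [[1.0], [0.0]])
print(w[0] > w[1])   # the key matching the query gets more weight → True
```

In the full model the query is the decoder's hidden state and the keys/values are encoder states, letting the decoder focus on the relevant part of the input when generating each response token.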
Title of PresentationStudent’s nameFeel free to adjust the c.docx - herthalearmont
Title of Presentation
Student’s name
Feel free to adjust the color and scheme of this template. Color and design are recommended in an appealing visual presentation.
Introduction
Includes
The name of the student evaluated and the topic
Also should detail the purpose and flow of the presentation
Format of Paper
Evaluate the following three questions regarding the overall format of the paper.
Were all required sections included?
Were they clearly distinguished from one another?
If not, were reasons given for not including some?
Historical Timeline and Predecessor Assessment Evaluation
Assess the following three components as detailed on the Student Evaluation Form
Sources
Content
Writing Skills
Remember that graphics go a long way in a visual presentation. Add them to play up the visual appeal of this slide but be sure to cite them in proper APA format.
Add additional slides as needed for this section.
Analysis of Impact Evaluation
Assess the following three components as detailed on the Student Evaluation Form
Sources
Content
Writing Skills
Remember that graphics go a long way in a visual presentation. Add them to play up the visual appeal of this slide but be sure to cite them in proper APA format.
Add additional slides as needed for this section.
Ethical Considerations Evaluation
Assess the following three components as detailed on the Student Evaluation Form
Sources
Content
Writing Skills
Remember that graphics go a long way in a visual presentation. Add them to play up the visual appeal of this slide but be sure to cite them in proper APA format.
Add additional slides as needed for this section.
Concluding Remarks
Summarize the areas of the writer's strengths and weaknesses as presented in your presentation and remember to always end on a positive note!
References
Reference all sources used in completing this assignment.
Remember that in-text citations are just as important in a presentation as they are in papers.
The references listed here should be a list of what you have posted on your previous slides, including any images that you used, unless they are clipart.
Jennifer Robbins: ARTIFACT EAST Keynote (Providence, 11/4/13) - JenRobbins
This document summarizes the evolution of web design processes and workflows. It describes how processes have shifted from waterfall to more agile approaches, with integrated teams working collaboratively. It also discusses how design deliverables have changed from static mockups to interactive prototypes using HTML and CSS. Key aspects of new workflows include prototyping with code from the start, designing responsively with a content-first approach, and involving clients throughout for feedback.
Understanding Human Conversations with AI - Rajath D M
Presentation from my talk about using AI to understand natural language human conversations and derive insights, trigger workflows, etc. It also briefly covers Symbl's conversational intelligence API platform.
One does not simply crowdsource the Semantic Web: 10 years with people, URIs,... - Elena Simperl
Crowdsourcing can be used to achieve goals for semantic applications and the Semantic Web. There are different considerations for the design of a crowdsourcing project, including what tasks will be completed, who will complete them, how the process will work, and what incentives will motivate participants. Effective crowdsourcing relies on using the right crowd for the right task, and diversity within the crowd can improve outcomes. Various workflows and incentive structures should be tested to maximize quality, efficiency and participation when crowdsourcing semantic data.
The document discusses different types of digital data including structured, semi-structured, and unstructured data and provides examples and sources for each type. It also discusses challenges with big data including the exponential growth of data, hosting solutions, data retention, lack of skilled professionals, and issues with capturing, storing, processing, and visualizing large volumes of data. The document aims to introduce readers to big data and analytics by defining key concepts, discussing characteristics and challenges of big data, and comparing traditional business intelligence approaches to modern big data analytics.
Detecting the presence of cyberbullying using computer software - Ashish Arora
This document discusses methods for detecting cyberbullying using machine learning techniques. It proposes using a machine learning online patrol crawler that utilizes support vector machines to classify social media posts as containing cyberbullying or not. The crawler is trained on manually identified examples of cyberbullying comments involving negativity, profanity, and attacks on attributes. It also discusses using sentiment analysis and social graphs to identify bullies and victims on Twitter by analyzing tweets containing gender-based terms. The goal is to accurately classify sentiment and identify instances of bullying to increase visibility of the problem.
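The social-graph idea in the summary above, identifying likely bullies and victims from who sends and receives negative messages, can be sketched with simple counting over labeled mention edges. The data, user names, and thresholds here are invented for illustration; the real system classifies the negativity with an SVM rather than taking labels as given.

```python
from collections import Counter

# Edges are (sender, target, is_negative) mentions between users.
mentions = [
    ("u1", "u2", True), ("u1", "u2", True), ("u1", "u3", True),
    ("u2", "u1", False), ("u3", "u4", False),
]
sent = Counter(s for s, _, neg in mentions if neg)      # negative messages sent
received = Counter(t for _, t, neg in mentions if neg)  # negative messages received

THRESHOLD = 2  # arbitrary cutoff for this toy example
bullies = [u for u, n in sent.items() if n >= THRESHOLD]
victims = [u for u, n in received.items() if n >= THRESHOLD]
print(bullies, victims)  # → ['u1'] ['u2']
```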
Respond to Discussion minimum 150 WordsThroughout my life I have.docx - ronak56
The document discusses different word processing programs such as Microsoft Word, StarOffice, and Corel WordPerfect. It notes some key differences between the programs, such as Word and StarOffice using a ruler bar by default while WordPerfect does not. StarOffice also uses the left side of the screen for shortcuts. All three programs allow automatic underlining, but StarOffice sometimes underlines common words incorrectly. The help systems in StarOffice are also described as not being as usable and comprehensive as the other programs. The document also discusses the importance of being able to convert and share files between different word processing programs when working with other companies.
The speaker discusses the semantic web and its potential to make data on the web smarter and more connected. He outlines several approaches to semantics like tagging, statistics, linguistics, semantic web, and artificial intelligence. The semantic web allows data to be self-describing and linked, enabling applications to become more intelligent. The speaker demonstrates a prototype semantic web application called Twine that helps users organize and share information about their interests.
Highcharts and Elsevier share recent research into making interactive web charts more accessible. Our usability studies focused on three areas, including stacked column charts, scatter plots, and charts with drill-down interactivity. We will share design considerations for keyboard navigation and the understandability of non-visual representations of data visualizations.
Jordan Ryan Molina offers a class syllabus covering fundamentals of computer science including history of computing, programming, algorithms, data storage, operating systems, networking, the internet, and social issues. The class progresses from basic concepts like binary and hardware to object-oriented programming in Java, and considers both technical topics and how technology impacts society. Students will learn through explanations, examples, and hands-on programming exercises.
Global Situational Awareness of A.I. and where it's headed - vikram sood
You can see the future first in San Francisco.
Over the past year, the talk of the town has shifted from $10 billion compute clusters to $100 billion clusters to trillion-dollar clusters. Every six months another zero is added to the boardroom plans. Behind the scenes, there’s a fierce scramble to secure every power contract still available for the rest of the decade, every voltage transformer that can possibly be procured. American big business is gearing up to pour trillions of dollars into a long-unseen mobilization of American industrial might. By the end of the decade, American electricity production will have grown tens of percent; from the shale fields of Pennsylvania to the solar farms of Nevada, hundreds of millions of GPUs will hum.
The AGI race has begun. We are building machines that can think and reason. By 2025/26, these machines will outpace college graduates. By the end of the decade, they will be smarter than you or I; we will have superintelligence, in the true sense of the word. Along the way, national security forces not seen in half a century will be unleashed, and before long, The Project will be on. If we’re lucky, we’ll be in an all-out race with the CCP; if we’re unlucky, an all-out war.
Everyone is now talking about AI, but few have the faintest glimmer of what is about to hit them. Nvidia analysts still think 2024 might be close to the peak. Mainstream pundits are stuck on the wilful blindness of “it’s just predicting the next word”. They see only hype and business-as-usual; at most they entertain another internet-scale technological change.
Before long, the world will wake up. But right now, there are perhaps a few hundred people, most of them in San Francisco and the AI labs, that have situational awareness. Through whatever peculiar forces of fate, I have found myself amongst them. A few years ago, these people were derided as crazy—but they trusted the trendlines, which allowed them to correctly predict the AI advances of the past few years. Whether these people are also right about the next few years remains to be seen. But these are very smart people—the smartest people I have ever met—and they are the ones building this technology. Perhaps they will be an odd footnote in history, or perhaps they will go down in history like Szilard and Oppenheimer and Teller. If they are seeing the future even close to correctly, we are in for a wild ride.
Let me tell you what we see.
More related content
Similar to Hate Speech / Toxic Comment Detection - Data Mining (CSE-362) Project
Semelhante a Hate Speech / Toxic Comment Detection - Data Mining (CSE-362) Project (20)
Global Situational Awareness of A.I. and where its headedvikram sood
You can see the future first in San Francisco.
Over the past year, the talk of the town has shifted from $10 billion compute clusters to $100 billion clusters to trillion-dollar clusters. Every six months another zero is added to the boardroom plans. Behind the scenes, there’s a fierce scramble to secure every power contract still available for the rest of the decade, every voltage transformer that can possibly be procured. American big business is gearing up to pour trillions of dollars into a long-unseen mobilization of American industrial might. By the end of the decade, American electricity production will have grown tens of percent; from the shale fields of Pennsylvania to the solar farms of Nevada, hundreds of millions of GPUs will hum.
The AGI race has begun. We are building machines that can think and reason. By 2025/26, these machines will outpace college graduates. By the end of the decade, they will be smarter than you or I; we will have superintelligence, in the true sense of the word. Along the way, national security forces not seen in half a century will be un-leashed, and before long, The Project will be on. If we’re lucky, we’ll be in an all-out race with the CCP; if we’re unlucky, an all-out war.
Everyone is now talking about AI, but few have the faintest glimmer of what is about to hit them. Nvidia analysts still think 2024 might be close to the peak. Mainstream pundits are stuck on the wilful blindness of “it’s just predicting the next word”. They see only hype and business-as-usual; at most they entertain another internet-scale technological change.
Before long, the world will wake up. But right now, there are perhaps a few hundred people, most of them in San Francisco and the AI labs, that have situational awareness. Through whatever peculiar forces of fate, I have found myself amongst them. A few years ago, these people were derided as crazy—but they trusted the trendlines, which allowed them to correctly predict the AI advances of the past few years. Whether these people are also right about the next few years remains to be seen. But these are very smart people—the smartest people I have ever met—and they are the ones building this technology. Perhaps they will be an odd footnote in history, or perhaps they will go down in history like Szilard and Oppenheimer and Teller. If they are seeing the future even close to correctly, we are in for a wild ride.
Let me tell you what we see.
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Aggregage
This webinar will explore cutting-edge, less familiar but powerful experimentation methodologies which address well-known limitations of standard A/B Testing. Designed for data and product leaders, this session aims to inspire the embrace of innovative approaches and provide insights into the frontiers of experimentation!
Build applications with generative AI on Google CloudMárton Kodok
We will explore Vertex AI - Model Garden powered experiences, we are going to learn more about the integration of these generative AI APIs. We are going to see in action what the Gemini family of generative models are for developers to build and deploy AI-driven applications. Vertex AI includes a suite of foundation models, these are referred to as the PaLM and Gemini family of generative ai models, and they come in different versions. We are going to cover how to use via API to: - execute prompts in text and chat - cover multimodal use cases with image prompts. - finetune and distill to improve knowledge domains - run function calls with foundation models to optimize them for specific tasks. At the end of the session, developers will understand how to innovate with generative AI and develop apps using the generative ai industry trends.
The Ipsos - AI - Monitor 2024 Report.pdfSocial Samosa
According to Ipsos AI Monitor's 2024 report, 65% Indians said that products and services using AI have profoundly changed their daily life in the past 3-5 years.
Codeless Generative AI Pipelines
(GenAI with Milvus)
https://ml.dssconf.pl/user.html#!/lecture/DSSML24-041a/rate
Discover the potential of real-time streaming in the context of GenAI as we delve into the intricacies of Apache NiFi and its capabilities. Learn how this tool can significantly simplify the data engineering workflow for GenAI applications, allowing you to focus on the creative aspects rather than the technical complexities. I will guide you through practical examples and use cases, showing the impact of automation on prompt building. From data ingestion to transformation and delivery, witness how Apache NiFi streamlines the entire pipeline, ensuring a smooth and hassle-free experience.
Timothy Spann
https://www.youtube.com/@FLaNK-Stack
https://medium.com/@tspann
https://www.datainmotion.dev/
milvus, unstructured data, vector database, zilliz, cloud, vectors, python, deep learning, generative ai, genai, nifi, kafka, flink, streaming, iot, edge
Learn SQL from basic queries to Advance queriesmanishkhaire30
Dive into the world of data analysis with our comprehensive guide on mastering SQL! This presentation offers a practical approach to learning SQL, focusing on real-world applications and hands-on practice. Whether you're a beginner or looking to sharpen your skills, this guide provides the tools you need to extract, analyze, and interpret data effectively.
Key Highlights:
Foundations of SQL: Understand the basics of SQL, including data retrieval, filtering, and aggregation.
Advanced Queries: Learn to craft complex queries to uncover deep insights from your data.
Data Trends and Patterns: Discover how to identify and interpret trends and patterns in your datasets.
Practical Examples: Follow step-by-step examples to apply SQL techniques in real-world scenarios.
Actionable Insights: Gain the skills to derive actionable insights that drive informed decision-making.
Join us on this journey to enhance your data analysis capabilities and unlock the full potential of SQL. Perfect for data enthusiasts, analysts, and anyone eager to harness the power of data!
#DataAnalysis #SQL #LearningSQL #DataInsights #DataScience #Analytics
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeWalaa Eldin Moustafa
Dynamic policy enforcement is becoming an increasingly important topic in today’s world where data privacy and compliance is a top priority for companies, individuals, and regulators alike. In these slides, we discuss how LinkedIn implements a powerful dynamic policy enforcement engine, called ViewShift, and integrates it within its data lake. We show the query engine architecture and how catalog implementations can automatically route table resolutions to compliance-enforcing SQL views. Such views have a set of very interesting properties: (1) They are auto-generated from declarative data annotations. (2) They respect user-level consent and preferences (3) They are context-aware, encoding a different set of transformations for different use cases (4) They are portable; while the SQL logic is only implemented in one SQL dialect, it is accessible in all engines.
#SQL #Views #Privacy #Compliance #DataLake
End-to-end pipeline agility - Berlin Buzzwords 2024Lars Albertsson
We describe how we achieve high change agility in data engineering by eliminating the fear of breaking downstream data pipelines through end-to-end pipeline testing, and by using schema metaprogramming to safely eliminate boilerplate involved in changes that affect whole pipelines.
A quick poll on agility in changing pipelines from end to end indicated a huge span in capabilities. For the question "How long time does it take for all downstream pipelines to be adapted to an upstream change," the median response was 6 months, but some respondents could do it in less than a day. When quantitative data engineering differences between the best and worst are measured, the span is often 100x-1000x, sometimes even more.
A long time ago, we suffered at Spotify from fear of changing pipelines due to not knowing what the impact might be downstream. We made plans for a technical solution to test pipelines end-to-end to mitigate that fear, but the effort failed for cultural reasons. We eventually solved this challenge, but in a different context. In this presentation we will describe how we test full pipelines effectively by manipulating workflow orchestration, which enables us to make changes in pipelines without fear of breaking downstream.
Making schema changes that affect many jobs also involves a lot of toil and boilerplate. Using schema-on-read mitigates some of it, but has drawbacks since it makes it more difficult to detect errors early. We will describe how we have rejected this tradeoff by applying schema metaprogramming, eliminating boilerplate but keeping the protection of static typing, thereby further improving agility to quickly modify data pipelines without fear.
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...sameer shah
"Join us for STATATHON, a dynamic 2-day event dedicated to exploring statistical knowledge and its real-world applications. From theory to practice, participants engage in intensive learning sessions, workshops, and challenges, fostering a deeper understanding of statistical methodologies and their significance in various fields."
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
Hate Speech / Toxic Comment Detection - Data Mining (CSE-362) Project
1. Hate Speech / Toxic Comment Detection
Harshit Agrawal (18074019)
Ashish Kumar (18075068)
Sachin Srivastava (18075070)
Data Mining (CSE-362) Project
Odd Semester 2020-21
Under the guidance of Dr. Bhaskar Biswas
Department of Computer Science and Engineering
IIT (BHU) Varanasi
27.09.2020
3. Introduction
The project Hate Speech / Toxic Comment Detection aims at detecting and removing textual material containing hate or toxicity.
● Online bullying has become a serious issue in recent years.
● The threat of abuse and harassment online means that many people stop expressing themselves.
● Toxic comments spread negativity, racial harassment and depression.
● Removing hate speech makes it easier for everyone to express their opinions freely over the internet.
4. Motivation & Applications
● Social networking: a message or tweet can be reported to the support team.
● Online meetings / webinars: a message can be hidden from attendees and shown only to the moderators.
● Chatbot training: the model can give a large negative feedback for any insulting reply towards the user.
● Threatening messages: directly reported to concerned authorities for immediate action.
6. Outline
● Problem: description of the problem statement and challenges.
● Visualisation: cleaning and visualisation of the dataset.
● Preprocessing: preprocessing of data into machine-readable format.
● Training: training the model using various methods.
● Conclusion: conclusion and scope of improvement.
8. Problem Description
Statement
Given any text or paragraph containing a few lines in natural language, the objective is to classify it as belonging to one or more of the following categories: clean, obscene, threatening, insulting, toxic, severely toxic and identity hate.
(Diagram: the input, a text / paragraph in natural language, is fed to the model, whose output is one or more of the seven classes above.)
9. Problem Description
● The model outputs the probability of the text belonging to each category.
● Based on a certain threshold (which can be tuned as a hyperparameter), the text can be classified as belonging to a category / set of categories.
● Multi-class classification problem: seven different classes (clean, obscene, threatening, insulting, toxic, severely toxic, identity hate).
● Multi-label classification problem: any text or paragraph can be abusive in multiple ways; it can be both threatening and insulting to someone.
11. Labels
● The dataset has been labelled on a crowdsourced annotation platform.
● Different users may have different opinions about a comment (judgement bias).
● Some offensive ‘ASCII arts’ may or may not be labelled as toxic by users, depending on how they were rendered.
● Some extremely long comments got labelled as toxic without any significant reason.
● “Give us your address and we'll come and kill you”: not marked as toxic!
12. Challenges
Noisy Data: real-world dataset from Internet users, full of slang, contractions, non-English or meaningless words, and spam.
Unbalanced Dataset: a large number of comments are clean. Unclean samples are relatively rare, and with six classes of toxicity the problem is further aggravated.
Computational Capability: natural language processing models like RNNs require huge processing power and powerful GPUs for efficient training.
Memory Issues: more than 1 lakh (100,000) training samples, which are difficult to store and process in internal memory at a time.
13. Solution Approach
● Dataset
● Data Cleaning & Visualization: get qualitative high-level insights into the data.
● Data Pre-processing: remove punctuations, stop words, stemming, etc.
● Training the model: Support Vector Machines, Extremely Random Trees, XGBoost, Recurrent Neural Networks, Transfer Learning / Conversational AI; with Binary Relevance and Classifier Chains for problem transformation.
● Data Post-processing: reason out the possible source of errors.
15. Data Cleaning & Visualization
● Checking for missing or null values.
● Adding an extra column for ‘clean’ comments.
16. Data Cleaning & Visualization
● Class Imbalance: the toxicity is not spread evenly across all the classes. Out of all the hate tags, ‘toxic’ tags are 43.58%, whereas ‘threat’ tags are 1.36%.
● Clean comments are ~140k out of the ~160k total comments.
18. Data Cleaning & Visualization
● The comments obtained are from random users on the internet.
● The dataset is noisy; there is no fixed standard of writing across all the comments.
● The dataset must be cleaned before further steps to obtain meaningful results:
● Remove non-alphanumeric characters (IP addresses, time, date) and punctuations.
● Convert everything to lower-case letters.
● Detect and remove extremely long words (which may cause overfitting), non-English and meaningless words.
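The cleaning steps above can be sketched as a small function. This is an illustrative approximation, not the project's actual code; the regex patterns and the 20-character word cap are assumptions:

```python
import re

def clean_comment(text, max_word_len=20):
    """Toy cleaning pass: lowercase, strip IP addresses, drop
    non-alphanumeric characters, and remove extremely long words."""
    text = text.lower()
    # remove IPv4 addresses such as 172.16.254.1
    text = re.sub(r"\b(?:\d{1,3}\.){3}\d{1,3}\b", " ", text)
    # keep only letters, digits and whitespace (drops punctuation)
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # drop extremely long tokens, which may cause overfitting
    tokens = [t for t in text.split() if len(t) <= max_word_len]
    return " ".join(tokens)

print(clean_comment("Hey man!!, I'm not trying to edit war. (@172.16.254.1)"))
# → hey man i m not trying to edit war
```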
21. Data Cleaning & Visualization
Are longer comments more toxic?
(Figure: violin plot and box plot.)
No significant correlation between length and toxicity is observed.
22. Data Cleaning & Visualization
Are longer comments more toxic?
(Figure: violin plot and box plot.)
A slight correlation is observed: fewer words increase the chances of toxicity.
23. Data Cleaning & Visualization
A cursory look at the comments reveals many that look like spam:
EPIC FAIL! EPIC FAIL EPIC FAIL! EPIC FAIL EPIC FAIL! ...
How to classify spammers?
Though spam detection is a wide area of research in its own right, we can use a very simple approach which suffices for our application:
take comments with a lot of repeated words (i.e. a very low % of unique words).
Are spammers more toxic?
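The repeated-words heuristic can be sketched as follows. This is illustrative only; the exact threshold used to flag a spammer is not specified in the slides:

```python
import re

def unique_word_pct(comment):
    """Fraction of distinct words in a comment; heavily repeated words
    (spam-like comments) give a low value."""
    words = re.findall(r"[a-z]+", comment.lower())
    if not words:
        return 1.0
    return len(set(words)) / len(words)

spam = "EPIC FAIL! EPIC FAIL EPIC FAIL! EPIC FAIL"
print(round(unique_word_pct(spam), 2))  # → 0.25
```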
24. Data Cleaning & Visualization
A lower percentage of unique words increases the chances of toxicity.
(Figure: bar plot.)
Are spammers more toxic?
25. Data Cleaning & Visualization
A higher percentage of unique words DOES NOT decrease the chances of toxicity.
(Figure: KDE plot.)
Are spammers more toxic?
26. Data Cleaning & Visualization
Multi-label classification: a lot of comments have only one tag, yet there do exist some comments having 5 or 6 hate tags.
(Figure: heatmap of the correlation between the tags.)
27. Data Cleaning & Visualization
COMMENT:
[18:00 3/7/2010]: Hey man!!, I'm really not trying to edit war. It's just that this guy is constantly removing relevant information and talking to me through edits instead of my talk page. He seems to care more about the formatting than the actual info. (@172.16.254.1)
CLEANED COMMENT (after data cleaning):
hey man i m really not trying to edit war it s just that this guy is constantly removing relevant information and talking to me through edits instead of my talk page he seems to care more about the formatting than the actual info
29. Data Preprocessing
Cleaned Data → Data Preprocessing → Machine Readable Format
STOP WORDS (examples): out, what, most, any, off, too, have, more, or, the, ours, both, whom, and, of, aren, her, does, from, if, not, own, this, it, a, it's, hers, why, who, now, been, me, ...
STEMMING (example): argue, argued, argues, arguing → argu
30. Data Preprocessing
Conversion of cleaned data to machine readable format.
Stop Words
● Commonly used words which don’t add any meaning to the text.
● E.g.: ‘is’, ‘are’, ‘this’, ‘at’, etc.
● Removed the stop words using NLTK stopwords corpus.
31. Data Preprocessing
Stemming
● Transforming a word into its base stem, i.e., a set of characters to construct a
word and its derivatives.
● Helps in reducing the vocabulary of words, which convey the same meaning.
● E.g. ‘argue’, ‘argued’, ‘argues’ and ‘arguing’ are reduced to the base stem ‘argu’.
● Used NLTK implementation of the Porter Stemming Algorithm - it contains a
set of rules to transform a word into its base stem.
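The two preprocessing steps can be approximated in a few lines of plain Python. This is a toy stand-in: the real pipeline uses NLTK's stop-word corpus and the full Porter algorithm, whereas a tiny stop-word set and a crude suffix-stripper are assumed here for illustration:

```python
# Assumed miniature stop-word set; the project uses the NLTK corpus.
STOP_WORDS = {"i", "m", "is", "are", "this", "at", "to", "the", "that", "a"}

def toy_stem(word):
    # Strip a few common suffixes; the real Porter algorithm has many
    # more rules and conditions.
    for suffix in ("ing", "ed", "es", "e", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(cleaned_comment):
    # Stop-word removal followed by stemming.
    tokens = [t for t in cleaned_comment.split() if t not in STOP_WORDS]
    return " ".join(toy_stem(t) for t in tokens)

print([toy_stem(w) for w in ["argue", "argued", "argues", "arguing"]])
# → ['argu', 'argu', 'argu', 'argu']
```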
32. Data Preprocessing
CLEANED COMMENT:
hey man i m really not trying to edit war it s just that this guy is constantly removing relevant information and talking to me through edits instead of my talk page he seems to care more about the formatting than the actual info
COMMENT IN MACHINE-READABLE FORM (after data preprocessing):
hey man realli tri edit war guy constantli remov relev inform talk edit instead talk page seem care format actual info
34. Support Vector Machines
● Constructs a Maximum Marginal Hyperplane that best divides the dataset into classes.
● Hyperplane: the decision plane that separates a set of objects belonging to different classes.
● Support Vectors: the data points which are closest to the hyperplane.
● Margin: the gap between the hyperplane and the closest points of each class.
35. Support Vector Machines
● Generates optimal hyperplane iteratively to
minimize the error.
● Objective: Find hyperplane with maximum
possible margin between support vectors.
● Can be used for both linearly separable as
well as non-linearly separable data.
36. Support Vector Machines
● For non-linear and inseparable planes, SVM uses the kernel trick.
● It transforms the input space to a higher-dimensional space.
● Here, the data points are plotted on the x-axis and z-axis, where z = x² + y².
● These data points can then be segregated linearly.
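The z = x² + y² mapping can be checked numerically. The specific points below are made up for the demonstration:

```python
# Points inside the unit circle (class A) vs. well outside it (class B);
# not linearly separable in (x, y), but separable along z = x^2 + y^2.
inner = [(0.1, 0.2), (-0.3, 0.1), (0.2, -0.2)]
outer = [(2.0, 1.0), (-1.5, 1.5), (0.0, -2.0)]

def z(point):
    # the transformation from the slide: z = x^2 + y^2
    x, y = point
    return x * x + y * y

# After the mapping, the plane z = 1 separates the two classes.
print(max(z(p) for p in inner) < 1 < min(z(p) for p in outer))  # → True
```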
37. Encoding Text Data
Count vectoriser and TF-IDF vectoriser:
● Used for text feature extraction
● Transform text → meaningful representation of numbers
● To fit Machine Learning algorithm for prediction.
● Tokenizing, counting and normalizing are used to extract features from text.
● Tokenizing: the text is split into tokens, and an integer value is assigned to each token.
● Counting: counting the occurrence of each token in a document.
● Normalizing: down-weighting those tokens that occur in most of the documents.
38. Encoding Text Data
Count vectoriser:
● Converts text to lowercase, removes punctuations & stopwords.
● Builds vocabulary of text, containing text from all documents
● Encodes each document as a vector of dimension |V| (a |D| x |V| matrix overall), where |V| is the size of the vocabulary and |D| is the number of documents.
● Each element of the vector represents the frequency of the token in the
document.
● Issue: Certain words which occur in most of the documents won’t be very
meaningful in the encoded vectors.
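The behaviour described above can be sketched with a toy count vectoriser (not scikit-learn's implementation; lower-casing is included, stop-word filtering omitted for brevity):

```python
def count_vectorize(documents):
    """Build a vocabulary across all documents and encode each document
    as a vector of token frequencies over that vocabulary."""
    vocab = sorted({w for doc in documents for w in doc.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    vectors = []
    for doc in documents:
        vec = [0] * len(vocab)
        for w in doc.lower().split():
            vec[index[w]] += 1  # frequency of the token in this document
        vectors.append(vec)
    return vocab, vectors

docs = ["good movie", "bad bad movie"]
vocab, vectors = count_vectorize(docs)
print(vocab)    # → ['bad', 'good', 'movie']
print(vectors)  # → [[0, 1, 1], [2, 0, 1]]
```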
39. Encoding Text Data
TF-IDF vectoriser:
● Reduces the impact of tokens that occur in most of the documents.
● The value of a term in the vector increases proportionally with its frequency in the document.
● If the token appears in most of the documents, then that term is penalized.
● Term Frequency - TF(t, d): Number of times a term ‘t’ appears in document ‘d’.
● Inverse Document Frequency - IDF(t): Inverse of document frequency DF(t) -
number of documents that contain a term t.
40. Encoding Text Data
TF-IDF vectoriser:
● TF(t, d): High frequency of term in document increases weight of TF-IDF (local
parameter).
● IDF(t): Low document frequency of term in whole corpus increases weight of
TF-IDF (global parameter).
● TF-IDF: Considers both local and global impact of a word in vocabulary.
● Performs better than Count Vectorizer for feature extraction.
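The TF and IDF definitions above combine into a single weight, sketched here. Real implementations such as scikit-learn use smoothed IDF variants:

```python
import math

def tf_idf(term, doc, corpus):
    """Toy TF-IDF weight: term frequency times log inverse document
    frequency, following the definitions on the slides."""
    tf = doc.count(term)                       # TF(t, d): frequency in the document
    df = sum(1 for d in corpus if term in d)   # DF(t): documents containing the term
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf

corpus = [["the", "cat", "sat"], ["the", "dog", "ran"], ["the", "cat", "slept"]]
print(round(tf_idf("cat", corpus[0], corpus), 3))  # rare-ish term: positive weight
print(tf_idf("the", corpus[0], corpus))            # appears everywhere: weight 0.0
```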
41. Encoding Text Data
Count Vectorizer:
● A simple way to tokenize text and build a vocabulary of words.
● Directly gives the frequency of a token w.r.t. its index in the vocabulary.
TF-IDF Vectorizer:
● Compares the number of times a word appears in a document with the number of documents containing that word.
● Assigns values by considering the whole corpus, penalizing words that have a high document frequency.
42. Problem transformation
● Transforms the multi-label problem into separate single-label classification problems.
● Methods used: Binary Relevance, Classifier Chains.
● Binary Relevance: each label is treated as a separate single-label classification problem.
● Classifier Chains: the first classifier is trained on the input; each subsequent classifier is trained on the input plus the outputs of all previous classifiers in the chain.
44. Problem transformation
Classifier Chains:
● Forms chains in order to preserve label correlation.
● Generally performs better than binary relevance when correlation between labels is involved.
(Figure: four different single-label problems; yellow: input space, white: target variable.)
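The two transformations can be contrasted schematically. The keyword-matching "classifiers" below are hypothetical stand-ins for real trained models:

```python
labels = ["toxic", "insult", "threat"]

def base_predict(label, features):
    """Hypothetical per-label model: fires if the label word appears among
    the features (a stand-in for a real trained classifier)."""
    return 1 if label in features else 0

def binary_relevance(text):
    # One independent single-label problem per label.
    features = text.split()
    return [base_predict(lab, features) for lab in labels]

def classifier_chain(text):
    # Each classifier's input is the text features PLUS the predictions of
    # all previous classifiers in the chain (label correlation preserved).
    features = text.split()
    preds = []
    for lab in labels:
        chained_features = features + [f"prev_{p}" for p in preds]
        preds.append(base_predict(lab, chained_features))
    return preds

print(binary_relevance("you toxic insult"))  # → [1, 1, 0]
```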
45. Decision Trees
● A non-parametric supervised learning method used for classification and
regression.
● The goal is to create a model that predicts the value of a target variable by
learning simple decision rules inferred from the data features.
● In decision analysis, a decision tree can be used to visually and explicitly
represent decisions and decision making
● Growing a tree involves deciding on which features to choose and what
conditions to use for splitting, along with knowing when to stop
47. Decision Trees
● Tree models where the target variable can take a discrete set of values are
called classification trees
● In these tree structures, leaves represent class labels and branches represent
conjunctions of features that lead to those class labels
● Advantages:
○ Able to handle multi-output problems.
○ The cost of using the tree is logarithmic in the number of training data points.
○ Uses a white box model
○ Requires little data preparation
48. Decision Trees
Shortcomings:
● Decision-tree learners can create over-complex trees: overfitting
● The problem of learning an optimal decision tree is known to be NP-complete.
● There are concepts that are hard to learn because decision trees do not express
them easily, such as XOR, parity or multiplexer problems.
● Decision tree learners create biased trees if some classes dominate.
● Decision trees can be unstable because small variations in the data might result
in a completely different tree being generated.
Some of these problems can be mitigated by using decision trees with an ensemble.
49. Decision Trees Ensembling
● Ensemble methods combine several decision trees to produce better predictive performance than a single decision tree.
● A group of weak learners come together to form a strong learner!
● Ensembled trees are composed of a large number of decision trees, where the final decision is obtained by taking into account the prediction of every tree:
● specifically, by majority vote in classification problems, and by the arithmetic mean in regression problems.
● All of these ensemble methods take a decision tree and then apply either bagging (bootstrap aggregating) or boosting as a way to reduce variance and bias.
51. Bagging
● Bagging (Bootstrap Aggregation) is a general technique for combining the predictions of many models.
● It is used when our goal is to reduce the variance of a decision tree.
● The idea is to create several subsets of the data from the training sample, chosen randomly with replacement.
● Each subset of data is then used to train its own decision tree.
● As a result, we end up with an ensemble of different models.
● The combination of all the predictions from the different trees is used, which is more robust than a single decision tree.
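The bootstrap-and-vote procedure can be sketched as follows (toy example; the "trees" are imagined and only the sampling and voting mechanics are shown):

```python
import random

random.seed(0)
data = list(range(10))  # stand-in training set

def bootstrap_sample(data):
    # same size as the original, drawn with replacement
    return [random.choice(data) for _ in data]

# one subset per (imaginary) tree in the ensemble
subsets = [bootstrap_sample(data) for _ in range(5)]

def majority_vote(predictions):
    # final classification decision across the ensemble
    return max(set(predictions), key=predictions.count)

# five hypothetical trees voting on one example:
print(majority_vote([1, 0, 1, 1, 0]))  # → 1
```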
53. Random Forest
● Fitting decision trees on bootstrapped data serves to decorrelate them slightly.
● Since each node chooses what feature to split on in a greedy manner, trees can
still end up very correlated with each other.
● Random forests decorrelate the individual decision trees.
● When the CART algorithm is choosing the optimal split to make at each node,
the random forest will choose a random subset of features and only consider
those for the split.
● From that random subset of features, the algorithm will still select the optimal
feature and optimum split to make at each node.
55. Extremely Randomized Trees
● Much more computationally efficient than random forests.
● Performance is almost always comparable. In some cases, they may even
perform better!
● Adds one more step of randomization to the random forest algorithm.
● Random Forest chooses the optimum split to make for each feature while Extra
Trees chooses it randomly - so it is faster.
● Both of them will subsequently choose the best feature to split on by comparing
those chosen splits.
● Hence the name Extra Trees (Extremely Randomized Trees).
56. Extremely Randomized Trees
● Random forest uses bootstrap replicas: it subsamples the input data with replacement.
● Extra Trees uses the whole original sample.
● In the Extra Trees sklearn implementation there is an optional parameter that allows users to bootstrap replicas, but by default it uses the entire input sample.
● These differences motivate the reduction of both bias and variance:
● using the whole original sample instead of a bootstrap replica reduces bias;
● choosing the split point of each node randomly reduces variance.
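The difference in how a split threshold is chosen can be illustrated on one numeric feature (toy data and a simple misclassification criterion, not the full CART machinery):

```python
import random

random.seed(42)
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]  # one numeric feature, sorted
labels = [0, 0, 0, 1, 1, 1]

def split_error(threshold):
    # misclassifications if we predict class 1 for values above the threshold
    return sum(int(v > threshold) != y for v, y in zip(values, labels))

# Random-forest style: evaluate every candidate midpoint, keep the best.
candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
best = min(candidates, key=split_error)

# Extra-Trees style: draw the threshold uniformly at random in the feature
# range instead of searching, which is much cheaper per node.
random_threshold = random.uniform(min(values), max(values))

print(best, split_error(best))  # the exhaustive search finds a perfect split
```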
58. Boosting
● Boosting is another ensemble technique to create a collection of predictors.
● In this technique, learners are trained sequentially.
● Early learners fit simple models to the data; the data is then analysed for errors.
● In other words, we fit consecutive trees (on random samples), and at every step the goal is to correct the net error from the prior tree.
● When an input is misclassified by a hypothesis, its weight is increased so that the next hypothesis is more likely to classify it correctly.
● Combining the whole set at the end converts weak learners into a better-performing model.
60. Gradient Boosting
● Gradient Boosting is an extension over boosting method.
● Gradient Boosting = Boosting + Gradient Descent
● It uses gradient descent algorithm for optimizing the given loss function.
● An ensemble of trees are built one by one and individual trees are summed
sequentially.
● Next tree tries to recover the loss (difference between actual and predicted
values).
● It supports any differentiable loss function.
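The residual-fitting loop can be reduced to a one-dimensional sketch in which each "weak learner" is just a constant correction, the simplest possible stand-in for a tree:

```python
y = [3.0, 5.0, 9.0, 11.0]          # regression targets
prediction = [0.0] * len(y)        # initial ensemble prediction
learning_rate = 0.5

for step in range(10):
    # residuals: the negative gradient of the squared-error loss
    residuals = [t - p for t, p in zip(y, prediction)]
    # "fit" a weak learner to the residuals; here it is just their mean
    correction = sum(residuals) / len(residuals)
    # add the new learner's scaled output to the ensemble
    prediction = [p + learning_rate * correction for p in prediction]

print([round(p, 2) for p in prediction])  # → [6.99, 6.99, 6.99, 6.99]
```

Each iteration closes half the remaining gap to mean(y) = 7, so the predictions converge toward it.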
61. Extreme Gradient Boosting
● XGBoost is like gradient boosting on ‘steroids’!
● XGBoost improves upon the base GBM framework through systems
optimization and algorithmic enhancements.
63. Extreme Gradient Boosting
System Optimization:
● Parallelization: XGBoost approaches the process of sequential tree building in a parallelized manner, using all of the CPU cores during training.
● ‘Out-of-core’ computing: optimize available disk space while handling very
large datasets that do not fit into memory.
● Cache Optimization of data structures and algorithm to make best use of
hardware.
● Tree Pruning: XGBoost uses ‘max_depth’ parameter instead of negative loss
criterion, and starts pruning trees backward. This ‘depth-first’ approach
improves computational performance significantly.
64. Extreme Gradient Boosting
Algorithmic Enhancements:
● Regularization: It penalizes more complex models through both LASSO (L1) and
Ridge (L2) regularization to prevent overfitting.
● Sparsity Awareness: XGBoost handles different types of sparsity patterns in
the data as well as missing values more efficiently.
● Weighted Quantile Sketch: XGBoost employs the distributed weighted
Quantile Sketch algorithm to effectively find the optimal split points among
weighted datasets.
● Cross-validation: the algorithm comes with a built-in cross-validation method at each iteration, taking away the need to explicitly program this search.
65. Recurrent Neural Networks
● Neural networks where the output from the previous step is fed as input to the current step.
● They have a hidden state which remembers some information about the sequence.
66. LSTM
● Vanilla RNNs face vanishing gradients. To mitigate the vanishing gradient problem and capture long-term dependencies, LSTM (Long Short-Term Memory) models were developed.
● LSTMs use three gates :
○ Input Gate
○ Output Gate
○ Forget Gate
● With the help of these gates and Cell state LSTMs are able to preserve past
information of sequential data.
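A single scalar LSTM step can be written out to show where the gates act. The shared weight values below are made up, and a real LSTM learns separate weight matrices per gate:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w=0.5, u=0.3):
    # Simplification: one shared scalar pre-activation for every gate.
    pre = w * x + u * h_prev
    f = sigmoid(pre)              # forget gate: how much old cell state to keep
    i = sigmoid(pre)              # input gate: how much new candidate to admit
    o = sigmoid(pre)              # output gate: how much cell state to expose
    c_tilde = math.tanh(pre)      # candidate cell state
    c = f * c_prev + i * c_tilde  # gated cell-state update (preserves the past)
    h = o * math.tanh(c)          # hidden state emitted at this step
    return h, c

h, c = 0.0, 0.0
for x in [1.0, -0.5, 2.0]:  # a short input sequence
    h, c = lstm_step(x, h, c)
print(-1.0 < h < 1.0)  # → True: the hidden state stays bounded by tanh
```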
67. Bi-LSTM
● Bi-LSTM couples two LSTMs such that one LSTM is trained on the sequence in the direction of increasing index and the other in the direction of decreasing index.
● Using the hidden states of both LSTMs, Bi-LSTM uses both past and future context to give the output.
68. Word Embeddings
● Learned representation for text where words that have the same meaning have
a similar representation.
● Low-dimensional and dense: the number of features is much smaller than the size of the vocabulary.
● Vector values are learned in a way that resembles a neural network.
69. Pretrained Embeddings
● Unstructured textual data needs to be processed into some structured,
vectorized formats which can be understood and acted upon by machine
learning algorithms.
● Obtained by unsupervised training of a model on a large corpus of text.
● Captures the syntax and semantic meaning of a word by training on a large corpus.
● Increases the efficiency of the model.
● Several pretrained word embeddings are available, namely Word2Vec, FastText and GloVe.
70. Word2Vec
● Statistical method for efficiently learning a standalone word embedding from a
text corpus.
● King - Man + Woman = Queen
● Two popular models are SkipGram and CBOW.
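The King − Man + Woman analogy can be illustrated with hand-crafted 2-d toy vectors (axis 0 roughly "royalty", axis 1 roughly "gender"); real embeddings learn such directions from data:

```python
# Toy analogy demo: king - man + woman lands nearest to queen.
import numpy as np

emb = {
    "king":  np.array([1.0,  1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0,  1.0]),
    "woman": np.array([0.0, -1.0]),
}
target = emb["king"] - emb["man"] + emb["woman"]
nearest = min(emb, key=lambda w: np.linalg.norm(emb[w] - target))
print(nearest)   # → queen
```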
71. Word2Vec
CBOW Model
● Learns the embeddings by predicting the current word based on its context.
● Trains faster and works better for frequent words.
SkipGram Model
● Learns by predicting the surrounding words given the current word.
● Works well with small data and can represent rare words.
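The difference between the two models shows up in how training pairs are built; a minimal sketch (window size and tokenization are illustrative):

```python
# CBOW predicts the current word from its context; skip-gram predicts each
# surrounding word from the current word.
def cbow_pairs(tokens, window=1):
    pairs = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if context:
            pairs.append((context, target))      # (context words, word to predict)
    return pairs

def skipgram_pairs(tokens, window=1):
    pairs = []
    for i, center in enumerate(tokens):
        for ctx in tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]:
            pairs.append((center, ctx))          # (current word, surrounding word)
    return pairs

sent = ["the", "cat", "sat"]
print(cbow_pairs(sent))      # [(['cat'], 'the'), (['the', 'sat'], 'cat'), (['cat'], 'sat')]
print(skipgram_pairs(sent))  # [('the', 'cat'), ('cat', 'the'), ('cat', 'sat'), ('sat', 'cat')]
```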
72. FastText
● Represents each word as a bag of character n-grams, allowing the embedding
to capture prefixes and suffixes. E.g., the word apple could be broken down
into character n-grams such as ap, app, ple.
● Uses the skip-gram model for learning embeddings.
● Works well on rare words, since each word is broken down into n-grams to
obtain its embedding.
● Even works on words not present in the vocabulary, because the character
n-gram vectors are shared with other words.
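A character n-gram extractor in the spirit of FastText can be sketched as follows; note that real FastText also adds "<" and ">" boundary markers and sums the n-gram vectors together with a whole-word vector:

```python
# Break a word into character n-grams; unseen words can still be embedded
# by combining the vectors of n-grams they share with known words.
def char_ngrams(word, n_min=3, n_max=4):
    grams = []
    for n in range(n_min, n_max + 1):
        grams += [word[i:i + n] for i in range(len(word) - n + 1)]
    return grams

print(char_ngrams("apple"))   # ['app', 'ppl', 'ple', 'appl', 'pple']
```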
73. GloVe
● Stands for Global Vectors.
● An unsupervised learning algorithm for obtaining vector representations for
words.
● Learns by constructing a co-occurrence matrix that counts how frequently a
word appears in a context.
● Performs well on word-analogy and named-entity-recognition tasks.
● Does not rely only on local statistics (local context information of words), but
also incorporates global statistics (word co-occurrence) to obtain word
vectors.
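GloVe's starting point, the co-occurrence matrix, is easy to sketch; the toy sentence and window size below are illustrative:

```python
# Count how often each word c appears in the context window of each word w.
from collections import defaultdict

def cooccurrence(tokens, window=1):
    X = defaultdict(lambda: defaultdict(int))
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                X[w][tokens[j]] += 1       # symmetric context counts
    return X

X = cooccurrence(["the", "cat", "sat", "on", "the", "mat"])
print(X["the"]["cat"], X["the"]["mat"])    # 1 1
```

GloVe then fits word vectors so that their dot products approximate the logarithms of these counts, which is how global statistics enter the embeddings.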
74. Receiver Operating Characteristic
● An ROC curve is a graph showing the performance of a classification model at
all classification thresholds.
● This curve plots two parameters:
○ True Positive Rate (TPR) : Fraction of positive data points classified as positive.
○ False Positive Rate (FPR) : Fraction of negative data points classified as positive.
75. Receiver Operating Characteristic
● An ROC curve plots TPR vs. FPR at different classification thresholds.
● Lowering the classification threshold classifies more items as positive, thus
increasing both False Positives and True Positives.
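Computing TPR and FPR at a given threshold is straightforward; sweeping the threshold over the toy scores below (illustrative data) traces out the ROC curve:

```python
# TPR = fraction of positives called positive; FPR = fraction of negatives
# called positive, both at a given score threshold.
def tpr_fpr(y_true, scores, threshold):
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    p = sum(y_true)
    n = len(y_true) - p
    return tp / p, fp / n

y = [1, 1, 0, 0]
s = [0.9, 0.4, 0.6, 0.2]
print(tpr_fpr(y, s, 0.5))   # (0.5, 0.5)
print(tpr_fpr(y, s, 0.3))   # (1.0, 0.5): lowering the threshold raises TPR and FPR
```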
76. Area Under the ROC Curve
● AUC measures the entire two-dimensional area underneath the entire ROC
curve from (0,0) to (1,1).
● The greater the area under the curve, the better the classifier.
● AUC provides an aggregate measure of performance across all possible
classification thresholds.
● AUC can be interpreted as the probability that the model ranks a random
positive example more highly than a random negative example.
● AUC ranges in value from 0 to 1. A model whose predictions are 100% wrong
has an AUC of 0.0; one whose predictions are 100% correct has an AUC of 1.0.
77. Area Under the ROC Curve
● AUC is desirable for the following two reasons:
○ AUC is scale-invariant. It measures how well predictions are ranked, rather than their absolute
values.
○ AUC is classification-threshold-invariant. It measures the quality of the model's predictions
irrespective of what classification threshold is chosen.
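The ranking interpretation of AUC can be computed directly by comparing every positive/negative score pair (toy data is illustrative; for large datasets the pairwise loop would be replaced by a rank-based formula):

```python
# AUC as the probability that a random positive is scored above a random
# negative; ties count as half a win.
def auc_by_ranking(y_true, scores):
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc_by_ranking([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.2]))  # → 0.75
```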
78. Results and Comparison
Model                                         Mean AUC_ROC score
Support Vector Machines (Binary Relevance)    0.66
Support Vector Machines (Classifier Chains)   0.67
Logistic Regression (Binary Relevance)        0.73
Logistic Regression (Classifier Chains)       0.76
Extra Trees                                   0.93
XGBoost                                       0.96
LSTM without pretrained embeddings            0.97
LSTM with FastText embeddings                 0.96
LSTM with GloVe embeddings                    0.88
LSTM with Word2Vec embeddings                 0.85
79. Conclusion and Further Works
● Classical models like SVM and Logistic Regression fail to achieve a high
AUC_ROC score.
● The Classifier Chain method has a slight edge over Binary Relevance.
● Tree-based ensemble methods and recurrent neural networks can achieve a
very high AUC_ROC score.
● Transfer learning reduces the execution time but does not improve
performance.
● Further improvements could be achieved by experimenting with more complex
neural networks and state-of-the-art Transformers.