Data Science Competition

Jeong-Yoon Lee

Presented at the 27th Annual KSEA South-Western Regional Conference (KSEA-SWRC 2017) on February 25, 2017

Data Science Competition
2. 25. 2017
The 27th Annual KSEA South-Western Regional Conference
Jeong-Yoon Lee, Ph.D.

Chief Data Scientist, Conversion Logic
Ph.D. in Computer Science, USC
M.S. in Electrical Engineering, USC
B.S. in Electrical Engineering, SNU
KDD Cup Winner 2012 & 2015
Top 10, Kaggle 2015

Why Compete
• For fun
• For experience
• For learning
• For networking

Fun
• Competing with others
• Incremental improvement

Data Science Competitions
• KDD Cup - Since 1997
• Netflix Prize - 2006 - 2009
• Kaggle - Since 2010

Competition Structure
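• Training Data - Features and labels provided
• Test Data - Features provided; labels withheld
• Submission - Predicted labels for the test data, scored as Public LB Score and Private LB Score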
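As a rough illustration of the public/private leaderboard (LB) idea, the sketch below scores one submission on two disjoint parts of the hidden test labels. The 30% public fraction, the accuracy metric, and the function names are assumptions for illustration; each competition defines its own split and metric.

```python
import random


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the hidden labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def leaderboard_scores(y_true, y_pred, public_fraction=0.3, seed=0):
    """Score a submission separately on a public and a private part of the test set.

    public_fraction and seed are hypothetical; organizers fix the split in advance.
    """
    idx = list(range(len(y_true)))
    random.Random(seed).shuffle(idx)
    n_public = int(len(idx) * public_fraction)
    public, private = idx[:n_public], idx[n_public:]

    def score(rows):
        return accuracy([y_true[i] for i in rows], [y_pred[i] for i in rows])

    return score(public), score(private)


# Hidden test labels (known only to the organizer) and one submission.
y_true = [0, 1, 1, 0, 1, 0, 1, 1, 0, 0]
y_pred = [0, 1, 0, 0, 1, 0, 1, 1, 1, 0]
public_score, private_score = leaderboard_scores(y_true, y_pred)
print(public_score, private_score)
```

Only the public score is visible during the competition; final rankings use the private score, which is why tuning against the public LB alone can mislead.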

Kaggle
• 250+ competitions since 2010
• 500K+ users
• 50K+ competitors
• $3MM+ in prize money paid out

Misconceptions about Competitions
• No ETL
• No EDA
• Not worth it
• Not for production

No ETL? - Deloitte Western Australia Rental Prices

No ETL? - Outbrain Click Prediction
2B page views. 16.9MM clicks. 700MM users. 560 sites

No ETL? - YouTube-8M Video Understanding Challenge
1.7TB feature-level data. 31GB video-level data.

No EDA?
• Most competitions provide actual labels - typical EDA
• Anonymized data - more creative EDA
o People decode age, states, time intervals, income, etc.
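One way such decoding can work, sketched under a hypothetical scenario: an integer column (say, age) was standardized before release, so its values still sit on an evenly spaced grid. The function name and the assumption that the smallest gap equals one original unit are illustrative only.

```python
def recover_integer_grid(values, tol=1e-9):
    """Rescale an anonymized numeric column so its values land on integers.

    Assumes the smallest gap between distinct values corresponds to one
    original unit, which only holds if two original values differed by 1.
    """
    uniq = sorted(set(values))
    gaps = [b - a for a, b in zip(uniq, uniq[1:]) if b - a > tol]
    step = min(gaps)
    base = uniq[0]
    return [round((v - base) / step) for v in values]


# Hypothetical original column and its standardized (anonymized) version.
ages = [23, 35, 35, 41, 24, 60]
mean = sum(ages) / len(ages)
std = (sum((a - mean) ** 2 for a in ages) / len(ages)) ** 0.5
anonymized = [(a - mean) / std for a in ages]

decoded = recover_integer_grid(anonymized)
print(decoded)
```

The decoded values equal the original ages up to an unknown offset (here, the minimum age), which is often enough to guess what the column represents.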

Not worth it?
• Performance matters
• You walk easier when you can run

Not for Production?
• Kaggle Kernel
o Max execution time: 10 minutes
o Max file output: 500MB
o Memory limit: 8GB

Ensemble Pipeline at Conversion Logic

Best Practices
• Feature Engineering
• Algorithms
• Cross Validation
• Ensemble

Feature Engineering
• Numerical - Log, Log(1 + x), Normalization, Binarization
• Categorical - One-hot encoding, TF-IDF (text), Weight-of-Evidence
• Time series - Stats, FFT, MFCC, ERP (EEG)
• Numerical/Time series to Categorical - RF/GBM*
• Dimensionality Reduction - PCA, SVD, Autoencoder
* http://www.csie.ntu.edu.tw/~r01922136/kaggle-2014-criteo.pdf
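A few of the transformations listed above, sketched in plain Python. The function names, the additive smoothing of 0.5, and the sign convention for weight of evidence (log of the positive-class share over the negative-class share) are assumptions for illustration; libraries and practitioners differ on these details.

```python
import math
from collections import Counter


def log1p_transform(xs):
    """Log(1 + x): compresses heavy right tails and is defined at x = 0."""
    return [math.log1p(x) for x in xs]


def one_hot(values):
    """Return the sorted category list and one indicator row per value."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


def weight_of_evidence(values, labels, smoothing=0.5):
    """Per-category log ratio of P(category | y=1) to P(category | y=0)."""
    pos = Counter(v for v, y in zip(values, labels) if y == 1)
    neg = Counter(v for v, y in zip(values, labels) if y == 0)
    cats = set(values)
    k = len(cats)
    n_pos, n_neg = sum(pos.values()), sum(neg.values())
    woe = {}
    for c in cats:
        p = (pos[c] + smoothing) / (n_pos + smoothing * k)
        q = (neg[c] + smoothing) / (n_neg + smoothing * k)
        woe[c] = math.log(p / q)
    return woe


# Toy data: category "a" is mostly positive, "b" mostly negative.
cats = ["a", "a", "a", "b", "b", "b"]
y = [1, 1, 0, 0, 0, 1]
woe = weight_of_evidence(cats, y)
print(log1p_transform([0, 9, 99]))
print(one_hot(cats)[0])
print(woe)
```

In a competition setting, target-based encodings such as weight of evidence are usually computed out-of-fold so that a row's own label does not leak into its feature.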

Algorithms
Algorithm | Tool | Note
Gradient Boosting Machine | XGBoost, LightGBM | The most popular algorithm in competitions
Random Forests | Scikit-Learn, randomForest | Used to be popular before GBM
Extremely Randomized Trees | Scikit-Learn |
Neural Networks / Deep Learning | Keras, MXNet, CNTK, Torch | Blends well with GBM. Best at image and speech recognition competitions
Logistic/Linear Regression | Scikit-Learn, Vowpal Wabbit | Fastest. Good for ensemble.
Support Vector Machine | Scikit-Learn |
FTRL | Vowpal Wabbit | Competitive solution for CTR estimation competitions
Factorization Machine | libFM | Winning solution for KDD Cup 2012
Field-aware Factorization Machine | libFFM | Winning solution for CTR estimation competitions (Criteo, Avazu)

Cross Validation
Training data are split into five folds of equal size, each preserving the overall dropout (target) rate (stratified).
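A minimal sketch of a stratified five-fold split in plain Python, assuming binary labels where 1 marks the target event. The function name and the round-robin assignment are illustrative choices, not the speaker's implementation; Scikit-Learn offers a ready-made equivalent.

```python
import random
from collections import defaultdict


def stratified_kfold(labels, n_splits=5, seed=42):
    """Assign row indices to folds so each fold keeps the overall class mix."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    rng = random.Random(seed)
    folds = [[] for _ in range(n_splits)]
    for rows in by_class.values():
        rng.shuffle(rows)
        # Deal each class out round-robin so every fold gets an equal share.
        for pos, i in enumerate(rows):
            folds[pos % n_splits].append(i)
    return folds


# Toy labels: 20% positives, 80% negatives.
labels = [1] * 20 + [0] * 80
folds = stratified_kfold(labels)
for k, fold in enumerate(folds):
    rate = sum(labels[i] for i in fold) / len(fold)
    print(k, len(fold), rate)
```

Each fold serves once as the validation set while the remaining four are used for training, so every training row receives exactly one out-of-fold prediction.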

Ensemble
* for other types of ensemble, see http://mlwave.com/kaggle-ensembling-guide/
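The slide text does not spell out which ensemble method is used, so the sketch below shows only one of the simplest options, a weighted average of model predictions. The model names, the predictions, and the function name are made up for illustration.

```python
def average_ensemble(predictions, weights=None):
    """Blend per-model prediction lists into one list by weighted averaging."""
    n_models = len(predictions)
    if weights is None:
        weights = [1.0] * n_models  # equal weights unless stated otherwise
    total = sum(weights)
    return [
        sum(w * p for w, p in zip(weights, row)) / total
        for row in zip(*predictions)
    ]


# Hypothetical predicted probabilities from three models on three rows.
gbm = [0.9, 0.2, 0.6]
nn = [0.7, 0.4, 0.8]
lr = [0.8, 0.3, 0.4]

blend = average_ensemble([gbm, nn, lr])
weighted = average_ensemble([gbm, nn, lr], weights=[2, 1, 1])
print(blend)
print(weighted)
```

Averaging tends to help most when the individual models make different kinds of errors, which is one reason a neural network and a GBM are often blended.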

Why Compete
• For fun
• For experience
• For learning
• For networking

One Last Thing
Google: 20K applications per week
Conversion Logic: 200 applications per week


IAC 2024 - IA Fast Track to Search Focused AI Solutions

Enterprise Knowledge

BooK Now Call us at +918448380779 to hire a gorgeous and seductive call girl for sex. Take a Delhi Escort Service. The help of our escort agency is mostly meant for men who want sexual Indian Escorts In Delhi NCR. It should be noted that any impersonator will get 100 attention from our Young Girls Escorts in Delhi. They will assume the position of reliable allies. VIP Call Girl With Original Photos Book Tonight +918448380779 Our Cheap Price 1 Hour not available 2 Hours 5000 Full Night 8000 TAG: Call Girls in Delhi, Noida, Gurgaon, Ghaziabad, Connaught Place, Greater Kailash Delhi, Lajpat Nagar Delhi, Mayur Vihar Delhi, Chanakyapuri Delhi, New Friends Colony Delhi, Majnu Ka Tilla, Karol Bagh, Malviya Nagar, Saket, Khan Market, Noida Sector 18, Noida Sector 76, Noida Sector 51, Gurgaon Mg Road, Iffco Chowk Gurgaon, Rajiv Chowk Gurgaon All Delhi Ncr Free Home Deliver

08448380779 Call Girls In Diplomatic Enclave Women Seeking Men

Delhi Call girls

What are drone anti-jamming systems? The drone anti-jamming systems and anti-spoof technology protect against interference, jamming, and spoofing of the UAVs. To protect their security, countries are beginning to research drone anti-jamming systems, also known as drone strike weapons. The anti-jam and anti-spoof technology protects against interference, jamming and spoofing. A drone strike weapon is a drone attack weapon that can attack and destroy enemy drones. So what is so unique about this amazing system?

What Are The Drone Anti-jamming Systems Technology?

Antenna Manufacturer Coco

Data Science Competition

1. Data Science Competition. February 25, 2017. The 27th Annual KSEA South-Western Regional Conference. Jeong-Yoon Lee, Ph.D.

2. Jeong-Yoon Lee, Ph.D. Chief Data Scientist, Conversion Logic. Ph.D. in Computer Science, USC. M.S. in Electrical Engineering, USC. B.S. in Electrical Engineering, SNU. KDD Cup Winner 2012 & 2015. Top 10, Kaggle 2015.

3. Why Data Science Competition

4. Why Compete • For fun • For experience • For learning • For networking

5. Fun • Competing with others • Incremental improvement

11. Data Science Competition

12. Data Science Competitions Since 1997 2006 - 2009 Since 2010

13. Competition Structure Training Data Test Data Feature Label Provided Submission Public LB Score Private LB Score

14. Kaggle • 250+ competitions since 2010 • 500K+ users • 50K+ competitors • $3MM+ prize paid out

15. Kaggle

16. Kaggle

17. Misconceptions on Competitions

18. Misconceptions on Competitions • No ETL • No EDA • Not worth it • Not for production

19. No ETL? - Deloitte Western Australia Rental Prices

20. No ETL? - Outbrain Click Prediction: 2B page views. 16.9MM clicks. 700MM users. 560 sites.

21. No ETL? - YouTube-8M Video Understanding Challenge: 1.7TB feature-level data. 31GB video-level data.

22. No ETL?

23. No EDA? • Most competitions provide actual labels - typical EDA • Anonymized data - more creative EDA o People decode age, states, time intervals, income, etc.

24. No EDA? • Anonymized data - more creative EDA

25. Not worth it? • Performance matters • You walk easier when you can run

26. Not for Production? • Kaggle Kernel o Max execution time: 10 minutes o Max file output: 500MB o Memory limit: 8GB

27. Ensemble Pipeline at Conversion Logic

28. Best Practices

29. Best Practices • Feature Engineering • Algorithms • Cross Validation • Ensemble

30. Feature Engineering • Numerical - Log, Log(1 + x), Normalization, Binarization • Categorical - One-hot-encode, TF-IDF (text), Weight-of-Evidence • Timeseries - Stats, FFT, MFCC, ERP (EEG) • Numerical/Timeseries to Categorical - RF/GBM* • Dimensionality Reduction - PCA, SVD, Autoencoder * http://www.csie.ntu.edu.tw/~r01922136/kaggle-2014-criteo.pdf
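A few of the transformations listed on slide 30 can be sketched with the Python standard library alone. The function names below are illustrative and do not come from the talk, and min-max scaling is assumed here as one common reading of "Normalization".

```python
import math


def log1p_transform(values):
    """Log(1 + x): compresses long right tails and is defined at x = 0."""
    return [math.log1p(v) for v in values]


def min_max_normalize(values):
    """Rescale values to [0, 1] (assumed meaning of 'Normalization')."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - lo) / (hi - lo) for v in values]


def binarize(values, threshold):
    """Map each value to 1 if it exceeds the threshold, else 0."""
    return [1 if v > threshold else 0 for v in values]


def one_hot_encode(categories):
    """Return the sorted vocabulary and one indicator row per input value."""
    vocab = sorted(set(categories))
    index = {c: i for i, c in enumerate(vocab)}
    rows = []
    for c in categories:
        row = [0] * len(vocab)
        row[index[c]] = 1
        rows.append(row)
    return vocab, rows


# Hypothetical toy inputs for illustration only.
logged = log1p_transform([0.0, 1.0])
scaled = min_max_normalize([2, 4, 6])
flags = binarize([1, 5, 10], threshold=4)
vocab, encoded = one_hot_encode(["b", "a", "b"])
```

In practice a library implementation would usually be preferred; the sketch only makes the arithmetic behind each bullet explicit.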

31. Algorithms (Algorithm - Tool - Note)
• Gradient Boosting Machine - XGBoost, LightGBM - The most popular algorithm in competitions
• Random Forests - Scikit-Learn, randomForest - Used to be popular before GBM
• Extremely Random Trees - Scikit-Learn
• Neural Networks/Deep Learning - Keras, MXNet, CNTK, Torch - Blends well with GBM. Best at image and speech recognition competitions
• Logistic/Linear Regression - Scikit-Learn, Vowpal Wabbit - Fastest. Good for ensemble.
• Support Vector Machine - Scikit-Learn
• FTRL - Vowpal Wabbit - Competitive solution for CTR estimation competitions
• Factorization Machine - libFM - Winning solution for KDD Cup 2012
• Field-aware Factorization Machine - libFFM - Winning solution for CTR estimation competitions (Criteo, Avazu)

32. Cross Validation: Training data are split into five folds such that the sample size and dropout rate are preserved across folds (stratified).
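One way to obtain such a split is to shuffle each class separately and deal its samples round-robin into the folds. The sketch below assumes binary labels and takes "dropout rate" to mean the share of positive labels. The function name `stratified_folds` and the toy label counts are hypothetical.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=5, seed=0):
    """Assign each sample a fold id so every fold keeps the class mix.

    Each class starts dealing at fold 0, so fold sizes can differ by
    at most one sample per class.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    fold_of = [None] * len(labels)
    for y in sorted(by_class):
        indices = by_class[y]
        rng.shuffle(indices)  # randomize order within the class
        for position, i in enumerate(indices):
            fold_of[i] = position % n_folds
    return fold_of


# Hypothetical toy data: 20 positives and 80 negatives.
labels = [1] * 20 + [0] * 80
fold_of = stratified_folds(labels, n_folds=5)

for k in range(5):
    members = [i for i in range(len(labels)) if fold_of[i] == k]
    positives = sum(labels[i] for i in members)
    print(k, len(members), positives / len(members))
```

With these toy counts every fold receives 20 samples, 4 of them positive, so the overall positive rate of 0.2 is preserved in each fold.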

34. Ensemble * for other types of ensemble, see http://mlwave.com/kaggle-ensembling-guide/

35. KDDCup 2015 Solution

36. Why Competition • For fun • For experiences • For learning • For networking

37. One Last Thing • Google: 20K applications per week • Conversion Logic: 200 applications per week

38. Thank You

Editor's Notes

I am Jeong-Yoon Lee, Chief Data Scientist at Conversion Logic. I am going to tell you a little bit about our attribution approach.
states, age, time interval, weekday,
Training data are split into five folds while the sample size and dropout rate are preserved across folds. For validation, each single and ensemble model is trained five times. Each time, one fold is held out and the remaining four folds are used for training. Then, predictions for the hold-out folds are combined to form the model’s CV prediction. CV predictions are used in AUC score calculation and/or as inputs in ensemble model training. For test, each single and ensemble model is retrained with the whole training data. Then, predictions for test data are used for submission and/or as inputs in ensemble model prediction.
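The validation and test procedure in this note can be sketched as follows. The `fit`/`predict` interface, the midpoint-threshold model, and the toy data are hypothetical stand-ins chosen to keep the example dependency-free. They are not the models used in the talk.

```python
def cv_and_test_predictions(fit, predict, X, y, X_test, fold_of, n_folds=5):
    """Return (CV predictions, test predictions) for one model."""
    cv_pred = [None] * len(X)
    for k in range(n_folds):
        train = [i for i in range(len(X)) if fold_of[i] != k]
        hold = [i for i in range(len(X)) if fold_of[i] == k]
        # Train on four folds, predict the held-out fold.
        model = fit([X[i] for i in train], [y[i] for i in train])
        for i, p in zip(hold, predict(model, [X[i] for i in hold])):
            cv_pred[i] = p
    # For test, retrain on the whole training data.
    full_model = fit(X, y)
    return cv_pred, predict(full_model, X_test)


def auc(scores, labels):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for s, t in zip(scores, labels) if t == 1]
    neg = [s for s, t in zip(scores, labels) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy model: threshold halfway between the two class means.
def fit_midpoint(X, y):
    pos = [x for x, t in zip(X, y) if t == 1]
    neg = [x for x, t in zip(X, y) if t == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def predict_midpoint(threshold, X):
    return [1.0 if x > threshold else 0.0 for x in X]


# Hypothetical toy data: one numeric feature, two well-separated classes.
X = [float(i) for i in range(10)] + [float(i) for i in range(20, 30)]
y = [0] * 10 + [1] * 10
fold_of = [i % 5 for i in range(len(X))]  # each fold holds 2 of each class

cv_pred, test_pred = cv_and_test_predictions(
    fit_midpoint, predict_midpoint, X, y, [3.0, 25.0], fold_of)
cv_auc = auc(cv_pred, y)
```

Because every CV prediction comes from a model that never saw that sample, `cv_pred` can serve both for scoring and as an input feature for an ensemble model without leaking labels.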
Stage-I Ensemble: We trained 15 stage-I ensemble classifiers with different subsets of CV predictions of 64 individual classifiers. Stage-II Ensemble: We trained 2 stage-II ensemble classifiers with different subsets of CV predictions of 15 stage-I ensemble classifiers. Stage-III Ensemble: We trained a stage-III ensemble classifier with CV predictions of 5 classifiers: 1 stage-II ensemble, 3 stage-I ensembles, and 1 individual classifier.
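The staging idea, where CV predictions from one stage become input features for the next, can be sketched at a much smaller scale. The three base models, their CV predictions, and the simple blend learner below are all hypothetical. The talk does not say which learner was used for the ensemble classifiers.

```python
def out_of_fold(fit, predict, rows, y, fold_of, n_folds):
    """CV predictions: each row is scored by a model that never saw it."""
    oof = [None] * len(rows)
    for k in range(n_folds):
        train = [i for i in range(len(rows)) if fold_of[i] != k]
        hold = [i for i in range(len(rows)) if fold_of[i] == k]
        model = fit([rows[i] for i in train], [y[i] for i in train])
        for i, p in zip(hold, predict(model, [rows[i] for i in hold])):
            oof[i] = p
    return oof


def select_columns(cv_preds, names):
    """One input row per sample, built from the chosen models' CV predictions."""
    return [list(vals) for vals in zip(*(cv_preds[n] for n in names))]


# Hypothetical stand-in learner: average the inputs, learn a cut-off.
def fit_blend(rows, y):
    means = [sum(r) / len(r) for r in rows]
    pos = [m for m, t in zip(means, y) if t == 1]
    neg = [m for m, t in zip(means, y) if t == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def predict_blend(cutoff, rows):
    # Score centered at 0.5 on the learned cut-off, clamped to [0, 1].
    return [min(1.0, max(0.0, 0.5 + sum(r) / len(r) - cutoff)) for r in rows]


# Hypothetical CV predictions of three individual classifiers.
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
fold_of = [0, 1, 2, 3, 4, 0, 1, 2, 3, 4]
individual = {
    "gbm": [0.1, 0.2, 0.3, 0.2, 0.6, 0.9, 0.8, 0.7, 0.4, 0.9],
    "nn": [0.2, 0.1, 0.4, 0.3, 0.2, 0.8, 0.9, 0.6, 0.7, 0.8],
    "lr": [0.3, 0.3, 0.2, 0.4, 0.4, 0.7, 0.6, 0.8, 0.6, 0.7],
}

# Stage I: each ensemble is trained on a different subset of models.
stage1 = {
    "s1_a": out_of_fold(fit_blend, predict_blend,
                        select_columns(individual, ["gbm", "nn"]),
                        y, fold_of, 5),
    "s1_b": out_of_fold(fit_blend, predict_blend,
                        select_columns(individual, ["nn", "lr"]),
                        y, fold_of, 5),
}

# Stage II: trained on the CV predictions of the stage-I ensembles.
stage2_cv = out_of_fold(fit_blend, predict_blend,
                        select_columns(stage1, ["s1_a", "s1_b"]),
                        y, fold_of, 5)
```

Reusing the same fold assignment at every stage keeps each sample's prediction out-of-fold all the way up, which is one reason to fix the folds once, as the cross-validation note describes.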
