Feature engineering is one of the most important, yet elusive, skills to master if you want to be a good data scientist. Machine learning competitions are hardly ever won with strong modeling techniques alone; it is the combination of creative feature engineering and powerful modeling techniques that makes the difference. This tutorial gives the audience practical tips and tricks to improve the performance of machine learning algorithms. We will look broadly at feature engineering for applied machine learning, touching on subjects such as categorical vs. numerical variables, data cleaning, feature extraction, transformations, and imputation.
2. Feature Engineering
• Better data beats big data
• Applied machine learning is data infrastructure, feature engineering, and modeling
• Feature engineering is turning your data into something a model understands
• Creativity, inquisitiveness, agility
4. One-hot Encoding
• Encode a categorical variable with k distinct values into a one-of-k array of size k
• Bag of words
• Works well for linear algorithms and neural networks
• "NL" -> one_of_k("NL") -> [0, 0, 0, 1] (sketch below)
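A minimal one-of-k sketch with pandas; the library choice and the toy country column are assumptions, not from the slides:

    import pandas as pd

    df = pd.DataFrame({"country": ["DE", "FR", "US", "NL"]})

    # get_dummies expands the k=4 unique values into 4 indicator columns,
    # sorted alphabetically; which slot is hot depends on that column order.
    one_hot = pd.get_dummies(df["country"]).astype(int)
    print(one_hot.loc[3].tolist())  # row for "NL" -> [0, 0, 1, 0]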
5. Hash Encoding
• Encode k values into a one-of-h array of size h, where h can be much smaller than k
• Collisions are possible (distinct values can share a slot)
• Fast and memory-friendly
• "NL" -> hash("NL") -> [0, 0, 1] (sketch below)
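A minimal sketch of the hashing trick in plain Python; h=3 matches the slide's three-slot example, and md5 stands in for whatever hash function is used (Python's built-in hash() is salted per process, so a stable hash is the safer choice):

    import hashlib

    def hash_encode(value, h=3):
        # Map a value to one of h slots; distinct values may collide.
        idx = int(hashlib.md5(value.encode()).hexdigest(), 16) % h
        out = [0] * h
        out[idx] = 1
        return out

    print(hash_encode("NL"))  # e.g. [0, 0, 1]; the hot slot depends on the hash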
6. Label Encoding
• Give each of k values a unique numerical ID
• Works well for tree-based algorithms
• Dimensionality-friendly: one column regardless of k
• "NL" -> unique_id("NL") -> [3] (sketch below)
7. Binary Encoding
• Uses the binary representation of the label ID
• Can encode over 4 billion categorical values into 32 bits
• "NL" -> binary(unique_id("NL")) -> [1, 0, 1, 1, 1, 1, 1] (sketch below)
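A minimal sketch; the ID 95 is chosen so the output reproduces the slide's example, and 7 bits are shown for brevity (32 bits cover over 4 billion IDs):

    def binary_encode(label_id, bits=7):
        # Write the label ID in base 2, most significant bit first.
        return [int(b) for b in format(label_id, f"0{bits}b")]

    print(binary_encode(95))  # [1, 0, 1, 1, 1, 1, 1]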
8. Count Encoding
• Replace a categorical value with its count in the train set
• Captures the popularity of the value
• "NL" -> count_in_train("NL") -> [5] (sketch below)
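A minimal sketch over a toy train column in which "NL" happens to appear five times, matching the slide's example:

    from collections import Counter

    train = ["NL", "US", "NL", "NL", "DE", "NL", "NL"]
    counts = Counter(train)

    print([counts["NL"]])  # [5]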
9. Rank Count Encoding
• Uniquely rank a variable's count in the train set
• Avoids collisions (tied counts) and dampens outliers
• "NL" -> rank(count_in_train("NL")) -> [7] (sketch below)
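A minimal sketch; breaking ties on the value itself is one way to make ranks unique, and the toy data (which yields a different rank than the slide's [7]) is an assumption:

    from collections import Counter

    train = ["NL"] * 5 + ["US"] * 3 + ["DE"] * 3 + ["FR"] * 2
    counts = Counter(train)

    # Sort by (count, value): "US" and "DE" tie on count but still get
    # distinct ranks, which is the collision-avoidance the slide mentions.
    ranked = sorted(counts, key=lambda v: (counts[v], v))
    rank = {v: i + 1 for i, v in enumerate(ranked)}

    print([rank["NL"]])  # [4]: the most frequent value gets the top rank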
10. Likelihood Encoding
• Replace a categorical value with the mean of the target for that value
• Take care to avoid overfitting, e.g., with out-of-fold means
• "NL" -> mean_of_target("NL") -> [0.66] (sketch below)
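A minimal sketch using out-of-fold target means, one common way to limit the overfitting the slide warns about; the toy frame and the column names are assumptions:

    import numpy as np
    import pandas as pd
    from sklearn.model_selection import KFold

    df = pd.DataFrame({
        "country": ["NL", "NL", "NL", "US", "US", "DE", "DE", "NL"],
        "target":  [1,    1,    0,    0,    1,    0,    0,    1],
    })

    df["country_lik"] = np.nan
    for tr, val in KFold(n_splits=4, shuffle=True, random_state=0).split(df):
        # Each row's encoding is computed from the other folds only.
        fold_means = df.iloc[tr].groupby("country")["target"].mean()
        df.loc[df.index[val], "country_lik"] = df.iloc[val]["country"].map(fold_means)

    # Values unseen in a fold's train part fall back to the global mean.
    df["country_lik"] = df["country_lik"].fillna(df["target"].mean())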
11. Embedding Encoding
• Use a model to create a dense embedding of the values
• Faster and more memory-friendly than one-hot for high-cardinality variables
• "NL", "F" -> nn_embed(["NL", "F"]) -> [0.66, 0.71, 0.05] (sketch below)
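A minimal lookup sketch with PyTorch's nn.Embedding (the framework is an assumption; the slides do not name one). In practice the embedding weights are learned as part of a model trained against the target; here they are just the random initialization:

    import torch
    import torch.nn as nn

    vocab = {"NL": 0, "F": 1, "US": 2}
    embed = nn.Embedding(num_embeddings=len(vocab), embedding_dim=3)

    ids = torch.tensor([vocab["NL"], vocab["F"]])
    print(embed(ids))  # a 2x3 tensor: one 3-dim vector per input value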
24. Case Study
• Predict fraudsters from "name" and "email" form fields
• Expansion (sketch below)
• Temporal
• Aggregate statistics
• Randomness
• Interactions
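A sketch of what expanding the email field could look like; the specific features and the free-provider list are illustrative assumptions, not the tutorial's exact set:

    def email_features(email):
        local, _, domain = email.partition("@")
        return {
            "email_len": len(email),
            "local_has_digits": any(c.isdigit() for c in local),
            "domain": domain,  # can itself be fed to an encoder above
            "is_free_provider": domain in {"gmail.com", "hotmail.com"},
        }

    print(email_features("jane.doe123@gmail.com"))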
25. Conclusion
• Use XGBoost
• Label encode categorical variables
• Impute NaNs with -999, and set XGBoost's missing parameter to match (sketch below)
• Use subsampling to quickly test new variables
• Try everything until you reach a plateau or a deadline
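A minimal end-to-end sketch of this recipe with xgboost's scikit-learn API; the synthetic data and the subsample value are assumptions:

    import numpy as np
    import xgboost as xgb

    rng = np.random.default_rng(0)
    X = rng.random((100, 5))
    X[X < 0.1] = np.nan                  # simulate missing values
    X = np.nan_to_num(X, nan=-999.0)     # impute NaNs with -999 ...
    y = rng.integers(0, 2, 100)

    # ... tell XGBoost to treat -999 as missing, and subsample rows so
    # new-variable experiments run quickly.
    model = xgb.XGBClassifier(missing=-999.0, subsample=0.5)
    model.fit(X, y)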