With huge amounts of heterogeneous data available, processing it and extracting knowledge require ever more complex software architectures. In this context, Apache Spark provides a powerful and efficient approach to large-scale data processing. This talk briefly introduces Spark's machine learning library (MLlib) along with a general overview of the Spark framework, describing how to launch applications on a cluster. A demo then shows how to simulate a Spark cluster on a local machine using images available in a public Docker Hub repository. Finally, another demo shows how to save time by unit testing jobs before running them on a cluster.
2. Agenda
• Problems and Motivation
• Spark and MLlib overview
• Launching applications in a Spark cluster
• Simulating a Spark cluster using Docker
• Demo: deploying a Spark cluster in a local machine
• Unit tests for Spark jobs
3.
• How to set up a Spark cluster (infrastructure + configuration)?
• How to test and/or debug a Spark job?
• The whole team should share the same environment
4. Run Spark Locally with Docker
• Lightweight cluster
• One machine
• Same environment for the whole team
• Easily deployed on any platform
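A local cluster like the one demoed here can be described in a small Docker Compose file. The sketch below is illustrative only: the image name, `SPARK_MODE` variable, and ports are assumptions about the particular Spark image used, not part of the talk's repository.

```yaml
# Minimal sketch of a one-machine Spark cluster (assumed image/env names).
version: "2"
services:
  master:
    image: bitnami/spark          # any Spark image with master/worker modes works
    environment:
      - SPARK_MODE=master         # image-specific setting (assumption)
    ports:
      - "8080:8080"               # master web UI
      - "7077:7077"               # master RPC endpoint workers connect to
  worker:
    image: bitnami/spark
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://master:7077
```

Running `docker-compose up --scale worker=2` would then give every team member the same two-worker cluster on their own machine.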
5.
• Easy to develop (APIs in Java, Scala, Python, R)
• High-quality algorithms
• Fast to run
• Lazy evaluation
• In-memory storage
http://spark.apache.org/mllib/
12. Example to Run
• MLlib's FP-Growth algorithm
• Data from the digital publishing domain
• Problem: find frequent patterns in navigation profiles
• Write the results to MongoDB
http://github.com/joelplucas/fpgrowth-spark-example
14. Unit Testing using Spark Testing Base
• Launched at Strata NYC 2015 by Holden Karau (and maintained by the community)
• Supports unit tests in Java, Scala and Python