Big data talking stories in Healthcare

Azure Data Platform Services
HDInsight Clusters in Azure
Data Storage: Apache Hive, Apache HBase, Azure Data Catalog
Data Transformations: Apache Storm, Apache Spark, Azure Data Factory
Healthcare / Life Sciences Use Cases

Published in: Technology

  1. Azure Data Platform Services
     Mostafa Elzoghbi
     http://mostafa.rocks
     Twitter: @MostafaElzoghbi
  2. Session takeaways and objectives
     • Azure Data Platform Services
     • HDInsight Clusters in Azure
     • Data Storage: Apache Hive, Apache HBase, Azure Data Catalog
     • Data Transformations: Apache Storm, Apache Spark, Azure Data Factory
     • Healthcare / Life Sciences Use Cases
  3. Azure Data Platform Services
  4. What is HDInsight
     HDInsight is the Microsoft Azure cloud implementation of the rapidly expanding Apache Hadoop technology stack, a go-to solution for big data analysis. It includes implementations of Apache Spark, HBase, Storm, Pig, Hive, Sqoop, Oozie, Ambari, and others.
     HDInsight also integrates with business intelligence (BI) tools such as Power BI, Excel, SQL Server Analysis Services, and SQL Server Reporting Services.
     HDInsight is available on Windows and Linux:
     • HDInsight on Linux: a Hadoop cluster on Ubuntu
     • HDInsight on Windows: a Hadoop cluster on Windows Server 2012 R2
  5. HDInsight clusters on Azure
     HDInsight provides cluster types and custom configurations for:
     • Hadoop (HDFS)
     • HBase
     • Storm
     • Spark
     • R Server
     • Hive Interactive, Kafka (Preview)
     HDInsight has powerful programming extensions for languages and platforms including C#, Java, and .NET.
     Business value proposition: skip purchasing and maintaining hardware.
  6. HDInsight clusters on Azure
  7. What is HBase
     Apache HBase is an open-source NoSQL database that is built on Hadoop and modeled after Google's Bigtable.
     HBase provides random access and strong consistency for large amounts of unstructured and semistructured data in a schemaless database organized by column families.
     Data is stored in the rows of a table, and data within a row is grouped by column family.
     The open-source code scales linearly to handle petabytes of data on thousands of nodes. It can rely on data redundancy, batch processing, and other features that are provided by distributed applications in the Hadoop ecosystem.
  8. Order No | Customer Name | Customer Phone | Company Name | Company Address
     12012015 | Mostafa | 101-232-2345 | Microsoft | Redmond, WA

     Column families: Customer, Company
     Order No | Customer Name | Customer Phone | Company Name | Company Address
     12012015 | Mostafa | 101-232-2345 | Microsoft | Redmond, WA
  9. What is HBase
     HBase commands:
     • create: equivalent to CREATE TABLE in T-SQL
     • get: equivalent to SELECT statements in T-SQL
     • put: equivalent to UPDATE and INSERT statements in T-SQL
     • scan: equivalent to SELECT (no WHERE condition) in T-SQL
     The HBase shell is your query tool for executing CRUD commands against an HBase cluster.
     Data can also be managed using the HBase C# API, which provides a client library on top of the HBase REST API.
     An HBase database can also be queried through Hive using HiveQL.
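
The command-to-T-SQL mapping above can be sketched in a few lines of Python. This is an in-memory toy, not the HBase shell or any HBase client API: the `ToyTable` class, its methods, and the `Customer:Name` style column names are hypothetical and exist only to show what create, put, get, and scan do to a row grouped by column family.

```python
from collections import defaultdict


class ToyTable:
    """In-memory stand-in for an HBase table (hypothetical, not the HBase API)."""

    def __init__(self, column_families):
        # 'create' declares the table together with its column families.
        self.column_families = set(column_families)
        self.rows = defaultdict(dict)  # row key -> {"family:qualifier": value}

    def put(self, row_key, column, value):
        # 'put' inserts or overwrites one cell, like INSERT/UPDATE in T-SQL.
        family = column.split(":", 1)[0]
        if family not in self.column_families:
            raise KeyError(f"unknown column family: {family}")
        self.rows[row_key][column] = value

    def get(self, row_key):
        # 'get' reads a single row by its key, like SELECT with a key filter.
        return dict(self.rows.get(row_key, {}))

    def scan(self):
        # 'scan' walks every row in key order, like SELECT with no WHERE clause.
        for row_key in sorted(self.rows):
            yield row_key, dict(self.rows[row_key])


orders = ToyTable(["Customer", "Company"])
orders.put("12012015", "Customer:Name", "Mostafa")
orders.put("12012015", "Customer:Phone", "101-232-2345")
orders.put("12012015", "Company:Name", "Microsoft")
orders.put("12012015", "Company:Address", "Redmond, WA")

print(orders.get("12012015"))
for key, cells in orders.scan():
    print(key, sorted(cells))
```

The row key (the order number) is the only index; every other value is addressed as family:qualifier, which is why a column can be added to a row without changing a schema.
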
  10. What is Hive
      Apache Hive is a data warehouse system for Hadoop that enables data summarization, querying, and analysis by using HiveQL (a query language similar to SQL).
      Hive understands how to work with structured and semi-structured data, such as text files where the fields are delimited by specific characters.
      Hive also supports custom serializers/deserializers (SerDe) for complex or irregularly structured data.
      Hive can also be extended through user-defined functions (UDFs). A UDF allows you to implement functionality or logic that isn't easily modeled in HiveQL.
      Queries can be submitted through WebHCat or HiveServer2 (faster) and can run on the Tez execution engine.
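
One way to add logic that is awkward in HiveQL is to stream rows through an external script with Hive's TRANSFORM clause, which passes each row to the script as a tab-delimited line and reads tab-delimited lines back. The Python sketch below shows only the per-row logic on a sample list; the three-column layout (order number, name, phone) and the `mask_phone` rule are assumptions for illustration, and a real script would read from standard input instead.

```python
def mask_phone(phone):
    """Keep the last four digits of a phone number and mask the rest."""
    digits = [c for c in phone if c.isdigit()]
    return "***-***-" + "".join(digits[-4:])


def transform(lines, delimiter="\t"):
    """Apply mask_phone to the third field of each delimited row."""
    for line in lines:
        fields = line.rstrip("\n").split(delimiter)
        fields[2] = mask_phone(fields[2])  # assumed layout: order, name, phone
        yield delimiter.join(fields)


# Hypothetical input: one tab-delimited row, as Hive would stream it.
sample = ["12012015\tMostafa\t101-232-2345\n"]
masked = list(transform(sample))
print(masked[0])
```
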
  11. What is Apache Storm
      Apache Storm is a distributed, fault-tolerant, open-source computation system that allows you to process data in real time with Hadoop.
      Apache Storm on HDInsight allows you to create distributed, real-time analytics solutions in the Azure environment by using Apache Hadoop.
      • Write Storm components in C#, Java, and Python.
      • Scale up or down in Azure without impacting running Storm topologies.
      • Easy provisioning and use in the Azure portal, plus development templates in Visual Studio.
  12. Apache Storm Components
      Apache Storm apps are submitted as topologies. A topology is a graph of computation that processes streams.
      • Stream: An unbounded collection of tuples. Streams are produced by spouts and bolts, and they are consumed by bolts.
      • Tuple: A named list of dynamically typed values.
      • Spout: Consumes data from a data source and emits one or more streams.
      • Bolt: Consumes streams, performs processing on tuples, and may emit streams. Bolts are also responsible for writing data to external storage, such as a queue, HDInsight, HBase, a blob, or other data store.
      • Nimbus: Similar to the JobTracker in Hadoop; it distributes jobs and monitors failures.
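
The relationship between spouts, bolts, tuples, and a topology can be sketched with plain Python generators. This is a single-process illustration, not the Storm API: the function names, the sensor readings, and the 38.0 threshold are all made up for the example, and a real stream is unbounded and distributed across workers rather than a short list.

```python
from collections import Counter


def reading_spout():
    """Spout: emits a stream of tuples from a data source (here, a fixed list)."""
    readings = [("sensor-a", 36.6), ("sensor-b", 39.2), ("sensor-a", 38.5)]
    for sensor_id, temperature in readings:
        # A tuple is a named list of values; a dict stands in for it here.
        yield {"sensor_id": sensor_id, "temperature": temperature}


def threshold_bolt(stream, threshold=38.0):
    """Bolt: consumes a stream and emits only tuples at or above the threshold."""
    for tup in stream:
        if tup["temperature"] >= threshold:
            yield tup


def count_bolt(stream):
    """Terminal bolt: aggregates tuples; a real bolt might write to HBase or a blob."""
    counts = Counter()
    for tup in stream:
        counts[tup["sensor_id"]] += 1
    return counts


# The "topology" wires the spout and bolts into a graph (here, a simple chain).
alerts = count_bolt(threshold_bolt(reading_spout()))
print(dict(alerts))
```
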
  13. What is Apache Spark
      Apache Spark™ is a fast and general engine for large-scale data processing.
      • Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.
      • Write applications quickly in Java, Scala, Python, R.
      • Combine SQL, streaming, and complex analytics.
      • Spark's in-memory computation capabilities make it a good choice for iterative algorithms in ML and graph computations.
      • Support for R Server & Azure Data Lake.
  14. Part 3: Single Slide DEMO – Apache Spark
  15. Azure Data Lake (ADL)
      Azure Data Lake includes all the capabilities required to make it easy for developers, data scientists, and analysts to store data of any size, shape, and speed, and do all types of processing and analytics across platforms and languages.
      It removes the complexities of ingesting and storing all of your data while making it faster to get up and running with batch, streaming, and interactive analytics.
      Azure Data Lake works with existing IT investments for identity, management, and security for simplified data management and governance.
      Data Lake Analytics: a no-limits analytics job service to power intelligent action.
  16. Azure Data Lake (ADL)
  17. Azure Data Lake Analytics
      Azure Data Lake Analytics is a new service, built to make big data analytics easy. Key capabilities:
      • Dynamic scaling: do analytics on terabytes or even exabytes of data.
      • Develop faster, debug, and optimize smarter using familiar tools: U-SQL jobs.
      • U-SQL: simple and familiar, powerful, and extensible. It is a SQL language with the expressive power of C#, so you can process and analyze data with skills you already have.
      • Affordable and cost effective.
      • Works with all your Azure data: Data Lake Analytics is optimized to work with Azure Data Lake, providing the highest level of performance, throughput, and parallelization for your big data workloads. Data Lake Analytics can also work with Azure Blob storage and Azure SQL Database.
  18. Part 3: Single Slide DEMO – Azure Data Lake
  19. Healthcare & Pharmaceuticals Use Cases
  20. Gene Analysis, Part 1: What They Did
      Provider of enzymes for the food and pharma industry (Pharmaceuticals)
      Global provider of natural ingredients such as cultures and enzymes for the food, pharmaceutical, nutritional, and agricultural industries.
      Challenge
      • Collect clinical trial data from electronic laboratory notebooks (ELNs)
      • Collect structured and unstructured data from sources like automated equipment (robots, temperature sensors, and other devices)
      • Analyze this data and look for patterns
      Solution
      • Chose Azure HDInsight and SQL Server on-premises
      • Examine patterns in the chemical composition and physical properties of cultures and enzymes, e.g. examine the gene composition of our bacteria and how it actually behaves in yogurt
  21. Gene Analysis, Part 2: How They Did It
      Provider of enzymes for the food and pharma industry
      Collect data in Azure Blobs
      • Extract data from all the files (GBs in size)
      • Transpose into JSON using .NET
      HDInsight processes data for insights
      • Hive is used to run queries
      • Mainly Select statements (joins/unions)
      • Maximum 20 lines of code
      Use SQL Server for reporting
      Use Hive ODBC connector
      Diagram labels: SQL Server on-premises; Lab Information Management System; automated equipment (robots, temperature sensors, etc.)
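
The "extract, then transpose into JSON" step above can be sketched as follows. The slide says the customer did this in .NET; Python is used here only as a stand-in, and the CSV input format, the `rows_to_json_lines` helper, and the field names (`sample_id`, `temperature_c`, `ph`) are assumptions for illustration, not details from the case study.

```python
import csv
import io
import json


def rows_to_json_lines(text):
    """Convert CSV text with a header row into one JSON document per line."""
    reader = csv.DictReader(io.StringIO(text))
    return [json.dumps(row, sort_keys=True) for row in reader]


# Hypothetical instrument export: a header row followed by two readings.
raw = "sample_id,temperature_c,ph\nS-001,37.5,4.6\nS-002,38.1,4.4\n"
json_lines = rows_to_json_lines(raw)
for line in json_lines:
    print(line)
```

Writing one JSON record per line keeps each record independent, which tends to suit line-oriented processing of files stored in blobs; values stay as strings here because the sketch does no type conversion.
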
  22. Data Lake Analytics
      Part 1: What They Did | Data Lake to deliver better analytics
      Healthcare Application Provider (Healthcare)
      Leading provider of healthcare software applications for clinical, financial, pharmacy, etc.
      Challenge
      Multiple healthcare applications operating in product silos
      Want to implement a data lake and provide distributed processing for better analytics and predictions for their end customers:
      • Population, risk, and care management
      • Clinical decision support using predictions
      • Real-time quality measures to help providers meet their regulatory requirements
      • Capacity management predictions
      • Enrich clinical data with NLP on unstructured physicians' notes to reduce over/under treatment and readmissions, and to speed up claims processing
      Solution
      Chose Azure HDInsight, HBase, and a custom .NET application
      Begin by storing all medical records in Azure
      Acquire and clean data, then insert it into the data lake
  23. Data Lake Analytics
      Part 2: How They Did It | Data Lake to deliver better analytics
      Healthcare Application Provider
      Collect data in Azure Blobs
      • Have medical records in JSON format
      HDInsight processes data for insights
      • MapReduce preprocesses data
      • Used to insert data into data lake
      • Developed .NET application to interface with HBase
      Diagram labels: Azure Blobs; Azure HDInsight; Medical Records in JSON; .NET Application; Insert into Blobs
  24. References
      • HDInsight Documentation (R, Storm, Spark, Kafka, ADL, etc.): https://azure.microsoft.com/en-us/services/hdinsight/
      • Spark Programming Guide: http://spark.apache.org/docs/latest/programming-guide.html
      • edx.org: Free Apache Spark courses
      • Get started with Data Science VMs in Azure: https://blogs.technet.microsoft.com/machinelearning/tag/dsvm/