Introduction to Data Science - Week 4 - Tools and Technologies in Data Science


  1. DSA-105 Introduction to Data Science
     Week 4: Tools and Technologies in Data Science
     Ferdin Joe John Joseph, PhD
     Faculty of Information Technology, Thai-Nichi Institute of Technology
  2. Week 4 Agenda
     • Tools and Technologies in Data Science
  3. Tools and Technologies in 2017
  4. Programming Languages
     • Python
     • R
     • Java
     • C++
     • Perl
     • Matlab/Octave
  5. Databases
     • MySQL
     • NoSQL
     • Microsoft SQL Server
     • Oracle
  6. Data Analytics Tools
     • SAS
     • Tableau
     • IBM SPSS Statistics
     • Microsoft Excel
     • Statistica
     • RapidMiner
     • SAP
  7. API
     • Scala
     • TensorFlow
     • Amazon Web Services
  8. Servers and Application Frameworks
     • Hadoop
     • Spark
     • Microsoft Azure
     • Jupyter
  9. Tools and Technologies in 2018
  10. Recruiters’ Requirements 2018
  11. R: The Most Popular Language for Data Science

      Once the data scientist has completed the often time-consuming process of “cleaning” and preparing the data for analysis, R is a popular software package for actually doing the math and visualizing the results. An open-source statistical modeling language, R has traditionally been popular in the academic community, which means that many data scientists are already familiar with it.

      R has thousands of extension packages that allow statisticians to undertake specialized tasks, including text analysis, speech analysis, and tools for genomic sciences. The center of a thriving open-source ecosystem, R has become increasingly popular as programmers have created add-on packages for handling big datasets and for the parallel processing techniques that have come to dominate statistical modeling today.

      • parallel helps R take advantage of parallel processing on both multicore Windows machines and clusters of POSIX (OS X, Linux, UNIX) machines.
      • snow helps divide R calculations across a cluster of computers, which is useful for computationally intensive processes like simulations or AI learning processes.
      • RHadoop and RHIPE allow programmers to interface with Hadoop from R, which is particularly important for the “MapReduce” approach of dividing the computing problem among the separate nodes of a cluster and then recombining, or “reducing,” all of the partial results into a single answer.

      R is used in industries like finance, health care, marketing, business, and pharmaceutical development. Organizations such as Bank of America, Bing, Facebook, and Foursquare use R to analyze their data, make marketing campaigns more effective, and produce reports.
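The “MapReduce” idea described on this slide can be sketched in a few lines. This is a minimal, single-machine illustration in Python, not Hadoop's or RHadoop's actual API. The function names (`map_chunk`, `reduce_counts`, `word_count`) and the sample chunks are assumptions made up for the example.

```python
from collections import Counter
from functools import reduce


def map_chunk(chunk):
    """Map step: count the words in one chunk of text, independently of the others."""
    return Counter(chunk.lower().split())


def reduce_counts(left, right):
    """Reduce step: merge two partial counts into one."""
    return left + right


def word_count(chunks):
    """Apply the map step to every chunk, then reduce the partial results."""
    # In a real cluster each map_chunk call could run on a different machine.
    partial_counts = [map_chunk(chunk) for chunk in chunks]
    return reduce(reduce_counts, partial_counts, Counter())


# Hypothetical input: three chunks standing in for data held on separate machines.
chunks = ["big data big models", "data cleaning", "big clusters"]
totals = word_count(chunks)
print(totals["big"], totals["data"])
```

Because each chunk is counted on its own, the map step needs no shared state, which is what makes the work easy to spread across machines.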
  12. Java & the Java Virtual Machine

      Organizations that seek to write custom analytics tools from scratch increasingly use the venerable language Java, as well as other languages that run on the Java Virtual Machine (JVM). Java is an object-oriented language whose syntax derives from C++, and because Java runs on a platform-agnostic virtual machine, programs can be compiled once and run anywhere.

      The upside of using the JVM over a language compiled to run directly on the processor is the reduction in development time. This simpler development process has been a draw for data analytics, making JVM-based data mining tools very popular. Also, Hadoop, the popular open-source, distributed big data storage and analysis software, is written in Java.
  13. Java & the Java Virtual Machine

      Java has rich open-source libraries for data mining, including Mahout and Weka, and the JVM provides robust memory management and exception handling. Other programming languages that can be used with the JVM include:

      • Scala: This programming language has efficiency comparable to Java because it runs on the JVM. It has also become increasingly popular in data mining because it permits developers to use object-oriented programming (OOP) as well as functional programming. Users of Scala include The Guardian, LinkedIn, Foursquare, Novell, Siemens, Twitter, and the Spark data mining environment from the UC Berkeley AMPLab.
      • Clojure: A dialect of Lisp, a language that dates to the late 1950s and is long associated with artificial intelligence research, Clojure is a primarily (although not 100%) functional language that also runs on the JVM. Clojure keeps data immutable and was designed for running concurrent processes. These features are important because, in contrast, object-oriented code executing concurrent processes will sometimes attempt to write to the same variable simultaneously. Keeping data structures immutable avoids this problem. Clojure has access to Java libraries and offers the same development efficiencies as Java. Clojure can use the Lisp macro facility to integrate with Hadoop and SQL. Users of Clojure include Netflix, Zendesk, Citibank, Walmart Labs, and Spotify.
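The point about immutable data can be illustrated with a small Python analogue. This is not Clojure code, and the `Reading` class and `recalibrate` function are hypothetical names chosen for the sketch. A frozen dataclass cannot be modified after creation, so an “update” produces a new value and leaves the original untouched for any other code that holds it.

```python
from dataclasses import dataclass, replace, FrozenInstanceError


@dataclass(frozen=True)
class Reading:
    """An immutable record: fields cannot be reassigned after creation."""
    sensor: str
    value: float


def recalibrate(reading, offset):
    """Return a new Reading with an adjusted value; the input is never modified."""
    return replace(reading, value=reading.value + offset)


original = Reading(sensor="s1", value=10.0)
adjusted = recalibrate(original, 0.5)

# Attempting to write to the frozen object raises an error instead of silently
# changing data that another thread might be reading.
try:
    original.value = 99.0
    mutated = True
except FrozenInstanceError:
    mutated = False

print(original.value, adjusted.value, mutated)
```

Since no code path can overwrite `original`, two concurrent tasks can share it without coordinating writes, which is the property the slide attributes to Clojure's data structures.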
  14. Python: A High-Level Programming Language with Excellent Data Libraries

      Python is a high-level language, meaning that its creators automated certain housekeeping processes in order to make code easier to write. Python has robust libraries that support statistical modeling (SciPy and NumPy), data mining (Orange and Pattern), and visualization (Matplotlib). Scikit-learn, a library of machine learning techniques very useful to data scientists, has attracted developers from Spotify, OkCupid, and Evernote, but can be challenging to master.
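As a rough illustration of how little housekeeping a high-level language requires, the sketch below computes a few summary statistics. To stay runnable without extra installs it uses Python's built-in `statistics` module as a stand-in for the heavier libraries named on the slide. The `summarize` helper and the sample data are assumptions made for the example.

```python
import statistics


def summarize(values):
    """Return a few descriptive statistics for a list of numbers."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "pstdev": statistics.pstdev(values),  # population standard deviation
    }


# Hypothetical sample data.
data = [2, 4, 4, 4, 5, 5, 7, 9]
summary = summarize(data)
print(summary)
```

There is no manual memory allocation, type declaration, or loop bookkeeping here; libraries such as NumPy and SciPy extend the same style to much larger arrays and more advanced models.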
  15. Excel: Powerful Data Analytics on a Smaller Scale

      Excel can accomplish a lot of sophisticated analysis, and it is easy to use and widely available. It is not the best choice for analyzing truly massive, unstructured datasets (for example, some 30 million healthcare records distributed via Hadoop across dozens of servers), but it is surprisingly powerful for a variety of data analytics projects at a small scale. These can include clustering, optimization, and predictive modeling using supervised AI learning or forecasting techniques.
  16. SAS (Statistical Analysis System): Data Mining Software Suite

      Used for advanced analytics, data management, and social media analytics, SAS is a robust suite that’s popular for business intelligence analysis of large and unstructured datasets. In 2015, SAS topped the Gartner Magic Quadrant list in terms of “ability to execute” in the category of advanced analytics platforms due to the breadth and quality of its predictive modeling and data mining techniques. SAS offers a well-regarded visualization tool and integration with open-source tools like R, Hadoop, and Python. It also puts significant effort into making its tools backwards compatible, an important feature when looking at older historical datasets.
  17. SAS (Statistical Analysis System): Data Mining Software Suite

      Say, for example, a company’s sales records were prepared for use by SAS in 1998. With backwards compatibility, they can still be read today. In large organizations, employee turnover over the years puts a premium on the continuity of tools. So, when a data scientist retires, you won’t lose the ability to access their work, even if they preferred older software that no one new to the position knows how to use.

      SAS can be costly, has a complicated licensing structure that some customers have found annoying, and has a steep learning curve. Although it’s expensive and complicated, it’s a very popular option, with more than 65,000 customers.
  18. IBM: SPSS Modeler and SPSS Analytics

      The Forrester Wave report from Forrester Research ranks IBM’s advanced data analytics platform as the top offering in the advanced analytics category for its breadth of tools that handle all elements of big data modeling: loading, “cleaning,” preparing, and then predictive modeling, whether using statistical or machine learning techniques. Other makers of highly rated commercial tools for advanced data analytics include SAP, KNIME, RapidMiner, Oracle, and Alteryx.
  19. IBM: SPSS Modeler and SPSS Analytics

      SPSS Modeler and SPSS Statistics were acquired by IBM in 2009 and have a loyal following among statisticians. These tools integrate with Hadoop to facilitate file system computing on big datasets. The Social Media Analytics product helps data scientists harvest data from Twitter, Facebook, and other platforms to perform customer sentiment analysis.

      Gartner reports that the IBM advanced analytics platform has lower customer satisfaction ratings than average, largely due to weak customer support, inadequate documentation, and a challenging installation process.
  20. SQL vs. NoSQL Databases: Tackling the “Messiness” of Big Data

      Another important distinction in the world of data is SQL databases vs. NoSQL databases, each of which is well suited to different types of datasets. Here’s a quick look at what makes them different in the context of data analysis.

      The traditional “relational” database was designed for an era in which data was far more expensive to collect and to store, and much more carefully organized. Structured Query Language (SQL) has been the means by which programmers transfer data to and from those neatly categorized rows and columns.

      However, only 5% of the world’s information is structured data. The rest consists of articles, photos, videos, social media posts, machine-to-machine communication, product inventory, and technical documents. So data scientists turned to a different approach to data storage called “NoSQL.”
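The contrast between the two storage styles can be sketched with Python's standard library alone. The relational half uses the built-in `sqlite3` module with an in-memory database. The document half uses plain dictionaries and `json` as a stand-in for a NoSQL document store, not any particular product's API. The table, field names, and records are hypothetical.

```python
import json
import sqlite3

# Relational style: a fixed schema of rows and columns, queried with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO customers (id, name, city) VALUES (?, ?, ?)",
    [(1, "Ann", "Bangkok"), (2, "Bo", "Chiang Mai")],
)
bangkok_names = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM customers WHERE city = ? ORDER BY id", ("Bangkok",)
    )
]
conn.close()

# Document style: each record describes itself, and fields may differ per record.
documents = [
    {"id": 1, "name": "Ann", "posts": [{"text": "hello", "tags": ["intro"]}]},
    {"id": 2, "name": "Bo", "photo": {"format": "jpeg", "width": 640}},
]
with_posts = [doc["name"] for doc in documents if "posts" in doc]
serialized = json.dumps(documents[1])  # documents are typically stored as JSON-like text

print(bangkok_names, with_posts)
```

The relational table rejects anything that does not fit its columns, while the document list accepts a post history for one customer and a photo for another without a schema change.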
  21. Databases

      • MySQL: Open-Source RDBMS. Owned by Oracle since its acquisition of Sun Microsystems (announced in 2009), MySQL is a widely used RDBMS (relational database management system) and one part of the LAMP software stack. This free, open-source database management system is used by web applications like WordPress, Drupal, Facebook, Twitter, and YouTube.
      • MongoDB: The most popular NoSQL database system on the market is the open-source MongoDB, which has been used by MetLife, The Weather Channel, Bosch, and Expedia. MongoDB has well-regarded customer service, and the tool is particularly popular with startups. One of the fastest-growing big data projects used with MongoDB is Apache Spark, a distributed computing framework from the Apache Software Foundation that’s designed to operationalize real-time analytics. Paired with MongoDB, Spark allows organizations to put real-time analytics reporting to use. Other commonly used open-source NoSQL databases include HBase and Cassandra. MariaDB, which the original slide also lists here, is an open-source relational fork of MySQL.
      • Oracle: Oracle has nearly 50% of the traditional relational database market, with products such as Oracle Database and Oracle TimesTen. The database behemoth has also entered the market for unstructured data storage with Oracle NoSQL, and for open-source SQL databases that compete with its proprietary offerings. Oracle’s products are popular and considered top-notch by many, but they are expensive.
  22. Databases

      • SQL Server DBMS: Enterprise-Level Database Management. Microsoft SQL Server DBMS is a competitive enterprise-level database management system that includes support for SQL or NoSQL architectures, in-memory computing, the cloud, and analytics on transactions. Existing customers are generally impressed with its performance. Other strong performers in the market include SAP, IBM, EnterpriseDB, InterSystems, and MarkLogic.
  23. File System Computing

      Hadoop: File System Computing

      What is “file system computing”? It’s a way to store and tackle the analytics for truly massive datasets. For example, 2 billion data points from sensors on an auto assembly line, stored on a cluster of servers that are each connected to multiple drives, would be enormous. Because this kind of dataset is too large to extract from the drives to a place where it can be analyzed, software like Hadoop was created.

      Hadoop is an open-source software tool specially designed to help data scientists manage the unwieldiness of big data. It eliminates the need to extract data from the storage devices altogether, bringing the analytics to the data so it can be processed in place. It has increasingly become the industry standard for file system computing projects involving big data, with prominent users including Facebook, Yahoo, and The New York Times.

      There are many other platforms that do file system computing, such as SciDB, but Hadoop has risen to the top with user contributions that extend its functionality, like Hive, Pig, Spark, and MapReduce. Even software giants like Microsoft and IBM have created their own Hadoop tools rather than reinventing the wheel.
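The idea of “bringing the analytics to the data” can be sketched as follows. This is a single-process illustration under stated assumptions, not Hadoop code. Each inner list stands in for sensor readings stored on one server, and the `local_summary` and `combine` functions are hypothetical names. Only a small (sum, count) pair leaves each partition, never the raw readings.

```python
def local_summary(partition):
    """Runs where the data lives: reduce one partition to a (sum, count) pair."""
    return sum(partition), len(partition)


def combine(summaries):
    """Runs centrally, on the small summaries only, to get the overall mean."""
    total = sum(s for s, _ in summaries)
    count = sum(n for _, n in summaries)
    return total / count if count else None


# Hypothetical sensor readings, split across three storage nodes.
partitions = [[20.0, 22.0], [19.0], [21.0, 23.0, 21.0]]
summaries = [local_summary(p) for p in partitions]
overall_mean = combine(summaries)
print(summaries, overall_mean)
```

However large each partition grows, the amount of data sent for combining stays at two numbers per node, which is why processing in place scales where extracting the full dataset does not.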
  24. R installation procedure

      Follow the procedure at the link below to install the R software.
      https://bit.ly/2MKNB4j

      This will help you learn R along with basic mathematics and statistics over the next month. The concepts you have learned so far in Java are enough to accomplish this.
  25. Next Week…
      • Basic Mathematics I
