introduction to data science

Data science is prominent today because it lets us predict the future behavior of people and systems alike.
This course therefore introduces the processes involved in data science.

introduction to data science

  1. ICT 3202 - Introduction to Data Science, by Engr. Johnson C. Ubah (B.Eng, M.Eng, HCNA, ASM)
  2. Course description: This course is an introduction to data science. The major goals of this course are to learn how to use tools for acquiring, cleaning, analyzing, exploring, and visualizing data; making data-driven inferences and decisions; and effectively communicating results. For practical purposes one may work with Python, Octave/Matlab, ...
  3. Fields to be covered: data mining, statistics, machine learning, information visualization, network analysis, natural language processing, algorithms, software engineering, databases, distributed systems, and big data.
  4. Introduction: Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data. Data science is related to data mining and big data.
  5. Introduction to data science: Data science is a "concept to unify statistics, data analysis, machine learning and their related methods" in order to "understand and analyze actual phenomena" with data [3]. It employs techniques and theories drawn from many fields within the context of mathematics, statistics, computer science, and information science.
  6. 6. Big data Big Data refers to a huge volume of data that can be structured, semi-structured and unstructured. It comprises of 5 Vs i.e. Volume: It refers to an amount of data or size of data that can be in quintillion when comes to big data. Variety: It refers to different types of data like social media, web server logs etc.
  7. 7. Big Data Velocity: It refers to how fast data is growing, data is exponentially growing and at a very fast rate. Veracity: It refers to an uncertainty of data like social media means if the data can be trusted or not. Value: It refers to the data which we are storing and processing is worth and how we are getting benefit from this huge amount of data.
  8. Structured data: Structured data is the easiest data to search and organize, because it is usually held in rows and columns and its elements can be mapped to fixed, pre-defined fields. Structured data is often managed with Structured Query Language (SQL), a query language developed at IBM in the 1970s for relational databases.
  9. Structured data (continued): Examples of structured data include financial data such as accounting transactions, address details, demographic information, customer star ratings, machine logs, and location data from smartphones and smart devices. Today, most estimates put structured data at less than 20 percent of all data.
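To make the idea of fixed, pre-defined fields concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The `ratings` table and its rows are invented for illustration and do not come from the slides.

```python
import sqlite3

# Hypothetical table of customer star ratings (invented sample data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ratings (customer TEXT, product TEXT, stars INTEGER)")
conn.executemany(
    "INSERT INTO ratings VALUES (?, ?, ?)",
    [("ada", "kettle", 5), ("ben", "kettle", 3), ("ada", "toaster", 4)],
)

# Every row has the same fields, so one declarative query can group and aggregate.
avg_by_product = dict(
    conn.execute(
        "SELECT product, AVG(stars) FROM ratings GROUP BY product ORDER BY product"
    )
)
conn.close()

print(avg_by_product)
```

Because the schema is known in advance, the query needs no parsing logic of its own, which is what makes structured data easy to search.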
  10. Unstructured data: A much bigger share of all the data in our world is unstructured. Unstructured data is data that cannot be held in a row-column database and has no associated data model. Think of the text of an email message. This lack of structure makes unstructured data more difficult to search, manage and analyse.
  11. Unstructured data (continued): Other examples of unstructured data include photos, video and audio files, text files, social media content, satellite imagery, presentations, PDFs, open-ended survey responses, websites and call center transcripts/recordings. Instead of spreadsheets or relational databases, unstructured data is usually stored in data lakes, NoSQL databases, applications and data warehouses.
  12. Semi-structured data: Beyond structured and unstructured data there is a third category, which is a mix of the two. Semi-structured data has some defining or consistent characteristics but does not conform to a structure as rigid as a relational database expects. It carries organizational properties such as semantic tags or metadata that make it easier to organize, but the data itself remains fluid.
  13. Semi-structured data (continued): Email messages are a good example. While the actual content is unstructured, a message also contains structured data such as the name and email address of the sender and recipient, the time sent, etc. Another example is a digital photograph. The image itself is unstructured, but if the photo was taken on a smartphone, for example, it would be date and time stamped, geotagged, and would carry a device ID.
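The email example can be sketched with Python's standard `email` package. The message below is made up for illustration; the point is only that the headers parse into well-defined fields while the body stays free text.

```python
from email import message_from_string
from email.utils import parseaddr, parsedate_to_datetime

# Hypothetical message, invented for this sketch.
raw = (
    "From: Ada <ada@example.com>\n"
    "To: Ben <ben@example.com>\n"
    "Date: Mon, 06 Jan 2020 09:30:00 +0000\n"
    "Subject: Lunch?\n"
    "\n"
    "Are you free at noon? The new place on the corner looks good.\n"
)
msg = message_from_string(raw)

# Structured part: headers map onto fixed, typed fields.
structured = {
    "sender": parseaddr(msg["From"])[1],
    "recipient": parseaddr(msg["To"])[1],
    "sent": parsedate_to_datetime(msg["Date"]).isoformat(),
}

# Unstructured part: the body is plain text with no schema.
body = msg.get_payload()

print(structured)
print(body)
```

The `structured` dictionary could be loaded straight into a table, whereas the body would need text analysis before it could be queried.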
  14. Big data can be analyzed for insights that lead to better decisions and strategic business moves.
  15. How much data does it take to be called big data? As a rough rule of thumb, data of 1 TB or more is called big data. Analysts predicted that by 2020 there would be 5,200 GB of data for every person in the world. Examples: on average, people send about 50 million tweets per day, and Walmart processes 1 million customer transactions per hour.
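To get a feel for these volumes, the quoted figures can be converted into per-second rates. This short sketch takes the slide's numbers at face value and does not verify them.

```python
SECONDS_PER_DAY = 24 * 60 * 60
SECONDS_PER_HOUR = 60 * 60

tweets_per_day = 50_000_000        # figure quoted on the slide
transactions_per_hour = 1_000_000  # figure quoted on the slide

tweets_per_second = tweets_per_day / SECONDS_PER_DAY
transactions_per_second = transactions_per_hour / SECONDS_PER_HOUR

print(f"{tweets_per_second:.0f} tweets/s, {transactions_per_second:.0f} transactions/s")
```

Both rates are in the hundreds of events per second, sustained around the clock.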
  16. Why is big data important? The importance of big data lies not in how much data we have but in what we can get out of it. We can analyze data to reduce cost and time, make smarter decisions, and so on. Challenges: storing such a huge amount of data efficiently, and processing it to extract valuable information within a given timeframe. Solution: frameworks such as Hadoop and Spark.
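Hadoop popularized the MapReduce style of processing, in which work is split into a map step, a grouping (shuffle) step, and a reduce step. The following is only a single-machine sketch of that idea in plain Python. It does not use the Hadoop or Spark APIs, and the function names and sample text are hypothetical.

```python
from collections import defaultdict

def map_phase(chunk):
    """Emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(pairs):
    """Group all emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Combine the values for each key into a single count."""
    return {key: sum(values) for key, values in groups.items()}

# In a real cluster each chunk would live on a different node.
chunks = ["big data is big", "data mining finds patterns in data"]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
word_counts = reduce_phase(shuffle(mapped))

print(word_counts)
```

Each chunk can be mapped independently of the others, which is what lets this kind of job be spread across many machines.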
  17. Data mining: Data mining, also known as Knowledge Discovery of Data, refers to extracting knowledge from a large amount of data, such as big data. It is mainly used in statistics, machine learning and artificial intelligence. It is one step of the “Knowledge discovery in databases” process.
  18. Data mining basics: Data mining consists mainly of five components: (1) extract, transform and load data into a warehouse; (2) store and manage the data; (3) provide data access (communication); (4) analyze (process); (5) user interface (present data to the user).
  19. Need for data mining: Analyzing relationships and patterns in stored transaction data yields information that supports better business decisions. Data mining helps with credit ratings, targeted marketing, fraud detection (for example, judging which types of transactions are likely to be fraudulent by checking a user's past transactions), and customer relationship management (for example, identifying which customers are loyal and which are likely to leave for another company).
  20. Data mining can uncover four kinds of relationships: (1) Classes: used to locate the target. (2) Clusters: group data items according to logical relationships. (3) Associations: relationships between data items. (4) Sequential patterns: used to anticipate behavioral patterns and trends.
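As one concrete illustration of the association relationship, a rule such as "customers who buy bread also buy butter" is commonly scored by its support and confidence. The sketch below computes both in plain Python over a handful of invented shopping baskets. The slides do not prescribe this particular measure, so treat it as one possible example.

```python
# Hypothetical shopping baskets, invented for illustration.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= basket for basket in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Of the transactions containing `antecedent`, the fraction that also contain `consequent`."""
    return support(antecedent | consequent) / support(antecedent)

rule_support = support({"bread", "butter"})
rule_confidence = confidence({"bread"}, {"butter"})

print(f"support={rule_support:.2f}, confidence={rule_confidence:.2f}")
```

Here bread and butter appear together in two of four baskets, and two of the three bread baskets also contain butter.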
  21. Challenges in data mining: (1) mining different types of knowledge in databases; (2) handling noisy and incomplete data; (3) efficiency and scaling of data mining algorithms; (4) handling relational and complex types of data; (5) protecting data security, integrity, and privacy.
  22. Head-to-head comparison, big data vs data mining: Big data and data mining are two different concepts. Big data refers to a large amount of data, whereas data mining refers to a deep dive into the data to extract key knowledge, patterns or information from a small or large amount of data.
  23. Head-to-head comparison, big data vs data mining: The main concept in data mining is to dig deep into the patterns and relationships in data, which can then be used in artificial intelligence, predictive analysis, etc. The main concept in big data is the source, variety and volume of the data, and how to store and process that amount of data.
  24. Head-to-head comparison, big data vs data mining: Analyzing big data to produce a business solution or a business definition plays a crucial role in determining growth. Data mining does not depend on big data, since it can be done on a small or large amount of data. Big data, however, depends on data mining: if we cannot find the value in a large amount of data, that data is of no use.
  25. Head-to-head comparison, big data vs data mining:

      | Feature | Data mining | Big data |
      | --- | --- | --- |
      | Focus | Mainly on the fine details of the data | Mainly on the many relationships between data |
      | View | A close-up view of the data | The big picture of the data |
      | Data | Expresses the "what" of the data | Expresses the "why" of the data |
      | Volume | Can be used on small or big data | Refers to a large data set |
  26. Head-to-head comparison, big data vs data mining:

      | Feature | Data mining | Big data |
      | --- | --- | --- |
      | Definition | A technique for analyzing data | More a concept than a precise term |
      | Data types | Structured data, relational and dimensional databases | Structured, semi-structured and unstructured data (in NoSQL) |
      | Analysis | Mainly statistical analysis; focuses on prediction and discovery of business factors on a small scale | Mainly data analysis; focuses on prediction and discovery of business factors on a large scale |
      | Result | Mainly strategic decision making | Dashboards and predictive measures |
  27. Big data refers only to a large amount of data, and all big data solutions depend on the availability of data. It can be considered a combination of business intelligence and data mining. Data mining applies different kinds of tools and software to big data to return specific results; it is mainly “looking for a needle in a haystack”. In short, big data is the asset and data mining is the manager of that asset, used to produce beneficial results.
  28. Thank you! Questions?
