
BD-zero lecture.pptx

  1. Zero Lecture: Big Data Analytics Lab. VISHAL CHOUDHARY
  2. 8CS4-21: Big Data Analytics Lab
     - Credit: 2
     - Max. Marks: 50 (IA: 30, ETE: 20)
     - 0L+0T+2P
     - End Term Exam: 2 Hours
  3. List of Experiments:
     1. Implement the following data structures in Java: i) Linked Lists ii) Stacks iii) Queues iv) Sets v) Maps
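The linked-list part of experiment 1 is worked through on the later slides. As a minimal sketch of the remaining four structures, the standard java.util classes can be used; the class name `CollectionsDemo` below is invented for illustration and is not from the slides:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

public class CollectionsDemo {
    public static void main(String[] args) {
        // Stack (LIFO): Deque is the recommended stack in modern Java
        Deque<Integer> stack = new ArrayDeque<>();
        stack.push(1);
        stack.push(2);
        System.out.println(stack.pop()); // 2 (last in, first out)

        // Queue (FIFO): add at the tail, poll from the head
        Queue<String> queue = new ArrayDeque<>();
        queue.add("first");
        queue.add("second");
        System.out.println(queue.poll()); // first

        // Set: duplicate elements are ignored
        Set<String> set = new HashSet<>();
        set.add("a");
        set.add("a");
        System.out.println(set.size()); // 1

        // Map: key/value lookup
        Map<String, Integer> map = new HashMap<>();
        map.put("hadoop", 1);
        System.out.println(map.get("hadoop")); // 1
    }
}
```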
  4. 2. Perform setting up and installing Hadoop in its three operating modes: Standalone, Pseudo-distributed, Fully distributed.
     Hadoop mainly works in three different modes:
     - Standalone Mode
     - Pseudo-distributed Mode
     - Fully Distributed Mode
     1. Standalone Mode
     In Standalone Mode none of the daemons run, i.e. NameNode, DataNode, Secondary NameNode, JobTracker, and TaskTracker. The JobTracker and TaskTracker handle processing in Hadoop 1; in Hadoop 2 the ResourceManager and NodeManager take their place. Standalone Mode also means that Hadoop is installed on only a single system. By default Hadoop is configured to run in this Standalone (or Local) mode, and we mainly use it for learning, testing, and debugging. Hadoop runs fastest in this mode of the three, because HDFS (Hadoop Distributed File System), the storage component of Hadoop, is not used; Hadoop instead reads and writes through the local file system, much like the file systems available for Windows, i.e. NTFS (New Technology File System) and FAT32 (File Allocation Table with 32-bit table entries). When Hadoop works in this mode there is no need to configure the files hdfs-site.xml, mapred-site.xml, or core-site.xml for the Hadoop environment. In this mode all processes run in a single JVM (Java Virtual Machine), so it can only be used for small development purposes.
  5. 2. Pseudo-distributed Mode (Single-Node Cluster)
     - In Pseudo-distributed Mode we also use only a single node, but the cluster is simulated: all the processes inside the cluster run independently of each other. All the daemons, that is NameNode, DataNode, Secondary NameNode, ResourceManager, and NodeManager, run as separate processes in separate JVMs (Java Virtual Machines); because they run as different Java processes, the mode is called pseudo-distributed.
     - Since only a single node is set up, all the master and slave processes are handled by one system. The NameNode and ResourceManager act as masters, and the DataNode and NodeManager act as slaves. The Secondary NameNode also acts as a master; its purpose is simply to keep an hourly backup of the NameNode. In this mode:
     - Hadoop is used both for development and for debugging.
     - HDFS (Hadoop Distributed File System) is utilized for managing the input and output processes.
     - The configuration files mapred-site.xml, core-site.xml, and hdfs-site.xml must be changed to set up the environment.
  6. 3. Fully Distributed Mode (Multi-Node Cluster)
     - This is the most important mode: multiple nodes are used, a few of them running the master daemons (NameNode and ResourceManager) and the rest running the slave daemons (DataNode and NodeManager). Here Hadoop runs on a cluster of machines, and the data is distributed across the different nodes. This is the production mode of Hadoop. To understand it in physical terms:
     - In the other modes you download Hadoop as a tar or zip file, install it on one system, and run all the processes there. In fully distributed mode you extract that tar or zip file on each node of the Hadoop cluster and then use a particular node for a particular process. Once you distribute the processes among the nodes, you define which nodes work as masters and which of them work as slaves.
  7. Fully Distributed Mode
  8. 3. Implement the following file management tasks in Hadoop:
     - Adding files and directories
     - Retrieving files
     - Deleting files
     Hint: A typical Hadoop workflow creates data files (such as log files) elsewhere and copies them into HDFS using one of the command-line utilities.
  9. 4. Run a basic word count MapReduce program to understand the MapReduce paradigm. MapReduce is a programming model and an associated implementation for processing large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.
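As a rough preview of this paradigm, the two phases of word count can be sketched in plain Java with no Hadoop cluster involved; the class and method names below are invented for the sketch, and a real Hadoop job would instead extend the framework's Mapper and Reducer classes:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WordCountSketch {
    // "Map" phase: emit one (word, 1) pair per token in the input line
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) pairs.add(new SimpleEntry<>(word, 1));
        }
        return pairs;
    }

    // "Reduce" phase: sum all values grouped under the same key
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new HashMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            counts.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = reduce(map("big data big analytics"));
        System.out.println(counts); // big -> 2, data -> 1, analytics -> 1
    }
}
```

In real Hadoop the grouping between the two phases (the shuffle) is done by the framework; here the HashMap in reduce() plays that role.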
  10. 5. Write a MapReduce program that mines weather data.
     - Weather sensors collecting data every hour at many locations across the globe gather a large volume of log data, which is a good candidate for analysis with MapReduce, since it is semi-structured and record-oriented.
     - This data is stored in a line-oriented ASCII format, where each row represents a single record. Each row has many fields such as longitude, latitude, daily max-min temperature, and daily average temperature; for simplicity, we will focus on the main element, the temperature. We will use data from the National Centers for Environmental Information (NCEI), which holds a massive amount of historical weather data that we can use for data analysis.
     - Hadoop MapReduce is a software framework for easily writing applications which process big amounts of data in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner. The term MapReduce actually refers to the following two different tasks that Hadoop programs perform:
  11. - The map task: the first task, which takes input data and converts it into a set of data where individual elements are broken down into tuples (key/value pairs).
     - The reduce task: takes the output from a map task as input and combines those data tuples into a smaller set of tuples. The reduce task is always performed after the map task.
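The two tasks above can be previewed for the weather problem of experiment 5 in plain Java. This is a hedged sketch: it assumes a made-up record layout of "year temperature" per line, not the real NCEI format, and the class name is invented. The map step parses each record into a (year, temperature) pair; the reduce step keeps the maximum value per year:

```java
import java.util.HashMap;
import java.util.Map;

public class MaxTemperatureSketch {
    // Map: parse "year temperature" records into (year, temp) pairs.
    // Reduce: keep the maximum temperature seen for each year.
    static Map<String, Integer> maxTempByYear(String[] records) {
        Map<String, Integer> max = new HashMap<>();
        for (String record : records) {
            String[] fields = record.split("\\s+");
            String year = fields[0];                // the key
            int temp = Integer.parseInt(fields[1]); // the value
            max.merge(year, temp, Math::max);       // the reduce step
        }
        return max;
    }

    public static void main(String[] args) {
        String[] records = { "1950 22", "1950 34", "1951 28", "1951 19" };
        System.out.println(maxTempByYear(records)); // 1950 -> 34, 1951 -> 28
    }
}
```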
  12. 6. Implement matrix multiplication with Hadoop MapReduce. MapReduce is a technique in which a huge program is subdivided into small tasks that run in parallel to make computation faster and save time; it is mostly used in distributed systems. It has two important parts:
     - Mapper: takes raw data input and organizes it into key/value pairs. For example, in a dictionary you search for the word "Data" and its associated meaning is "facts and statistics collected together for reference or analysis". Here the key is "Data" and the value associated with it is "facts and statistics collected together for reference or analysis".
     - Reducer: responsible for processing data in parallel and producing the final output.
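The usual MapReduce scheme for matrices emits, for every output cell (i, j), the partial products a[i][k] * b[k][j], and the reducer sums the values grouped under each (i, j) key. A minimal single-machine sketch of that scheme in plain Java (the class name is invented; a real Hadoop job would spread the emitted pairs across the cluster):

```java
import java.util.HashMap;
import java.util.Map;

public class MatrixMultiplySketch {
    static int[][] multiply(int[][] a, int[][] b) {
        int n = a.length, m = b[0].length, common = b.length;
        // "Map" phase: emit ((i,j), a[i][k] * b[k][j]) for every shared k;
        // "Reduce" phase: merge() sums the values under the same (i,j) key.
        Map<String, Integer> cells = new HashMap<>();
        for (int i = 0; i < n; i++)
            for (int k = 0; k < common; k++)
                for (int j = 0; j < m; j++)
                    cells.merge(i + "," + j, a[i][k] * b[k][j], Integer::sum);
        // Collect the reduced cells back into a result matrix
        int[][] c = new int[n][m];
        for (Map.Entry<String, Integer> e : cells.entrySet()) {
            String[] ij = e.getKey().split(",");
            c[Integer.parseInt(ij[0])][Integer.parseInt(ij[1])] = e.getValue();
        }
        return c;
    }

    public static void main(String[] args) {
        int[][] a = { { 1, 2 }, { 3, 4 } };
        int[][] b = { { 5, 6 }, { 7, 8 } };
        int[][] c = multiply(a, b);
        System.out.println(c[0][0] + " " + c[0][1]); // 19 22
        System.out.println(c[1][0] + " " + c[1][1]); // 43 50
    }
}
```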
  13. 7. Install and run Pig, then write Pig Latin scripts to sort, group, join, project, and filter your data.
     - Pig is a high-level programming language useful for analyzing large data sets. Pig was a result of a development effort at Yahoo!
     - In a MapReduce framework, programs need to be translated into a series of map and reduce stages. However, this is not a programming model that data analysts are familiar with, so, to bridge this gap, an abstraction called Pig was built on top of Hadoop.
     - A Pig Latin program consists of a series of operations or transformations which are applied to the input data to produce output. These operations describe a data flow which is translated into an executable representation by the Hadoop Pig execution environment. Underneath, the results of these transformations are a series of MapReduce jobs of which the programmer is unaware. So, in a way, Pig in Hadoop allows the programmer to focus on the data rather than on the nature of execution.
     - Pig Latin uses familiar keywords from data processing, e.g. Join, Group, and Filter.
  14. 8. Install and run Hive, then use Hive to create, alter, and drop databases, tables, views, functions, and indexes. Apache Hive is a data warehouse software project built on top of Apache Hadoop for providing data query and analysis. Hive gives an SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop.
  15. 9. Solve some real-life big data problems.
  16. Program 1: Linked list in Java
     LinkedList is part of the Collection framework present in the java.util package. This class is an implementation of the linked-list data structure, a linear data structure where the elements are not stored in contiguous locations and every element is a separate object with a data part and an address part. The elements are linked using pointers and addresses, and each element is known as a node. Due to their dynamic size and ease of insertions and deletions, linked lists are preferred over arrays. They also have a few disadvantages: nodes cannot be accessed directly; instead we need to start from the head and follow the links to reach the node we wish to access.
  17. Create and use a linked list:

     import java.util.*;

     public class Test {
         public static void main(String args[]) {
             LinkedList<String> ll = new LinkedList<String>();
             // Adding elements to the linked list
             ll.add("A");
             ll.add("B");
             ll.addLast("C");
             ll.addFirst("D");
             ll.add(2, "E");
             System.out.println(ll);
             // Removing elements from the linked list
             ll.remove("B");
             ll.remove(3);
             ll.removeFirst();
             ll.removeLast();
             System.out.println(ll);
         }
     }
  18. Performing various operations on LinkedList
     1. Adding elements: to add an element to a LinkedList, we can use the add() method. This method is overloaded to perform multiple operations based on different parameters:
     - add(Object): adds an element at the end of the LinkedList.
     - add(int index, Object): adds an element at a specific index in the LinkedList.
  19. 2. Changing elements: after adding the elements, if we wish to change an element, it can be done using the set() method. Since a LinkedList is indexed, the element we wish to change is referenced by its index. Therefore, this method takes an index and the updated element which needs to be inserted at that index.
  20. // Java program to change elements in a LinkedList

     import java.util.*;

     public class GFG {
         public static void main(String args[]) {
             LinkedList<String> ll = new LinkedList<>();
             ll.add("Geeks");
             ll.add("Geeks");
             ll.add(1, "Geeks");
             System.out.println("Initial LinkedList " + ll);
             ll.set(1, "For");
             System.out.println("Updated LinkedList " + ll);
         }
     }
  21. 3. Removing elements: to remove an element from a LinkedList, we can use the remove() method. This method is overloaded to perform multiple operations based on different parameters:
     - remove(Object): simply removes an object from the LinkedList. If there are multiple such objects, the first occurrence is removed.
     - remove(int index): since a LinkedList is indexed, this method takes an integer value and removes the element present at that specific index. After removing the element, all the elements are moved to the left to fill the space and the indices of the objects are updated.
  22. // Java program to remove elements in a LinkedList

     import java.util.*;

     public class GFG {
         public static void main(String args[]) {
             LinkedList<String> ll = new LinkedList<>();
             ll.add("Geeks");
             ll.add("Geeks");
             ll.add(1, "For");
             System.out.println("Initial LinkedList " + ll);
             ll.remove(1);
             System.out.println("After the Index Removal " + ll);
             ll.remove("Geeks");
             System.out.println("After the Object Removal " + ll);
         }
     }
  23. 4. Iterating over the LinkedList: there are multiple ways to iterate through a LinkedList. The most common are the basic for loop combined with the get() method to fetch the element at a specific index, and the enhanced for loop.
  24. // Java program to iterate over the elements in a LinkedList

     import java.util.*;

     public class GFG {
         public static void main(String args[]) {
             LinkedList<String> ll = new LinkedList<>();
             ll.add("Geeks");
             ll.add("Geeks");
             ll.add(1, "For");
             // Using the get() method and the basic for loop
             for (int i = 0; i < ll.size(); i++) {
                 System.out.print(ll.get(i) + " ");
             }
             System.out.println();
             // Using the enhanced for loop
             for (String str : ll)
                 System.out.print(str + " ");
         }
     }
