Big Data Analytics
Q1. (a) Explain the Stream Data Model Architecture with a neat diagram.
In analogy to a database-management system, we can view a stream processor as a kind of
data-management system, the high-level organization of which is suggested in Fig.
Any number of streams can enter the system. Each stream can provide elements at its own
schedule; they need not have the same data rates or data types, and the time between elements
of one stream need not be uniform. The fact that the rate of arrival of stream elements is not
under the control of the system distinguishes stream processing from the processing of data
that goes on within a database-management system. The latter system controls the rate at
which data is read from the disk, and therefore never has to worry about data getting lost as it
attempts to execute queries. Streams may be archived in a large archival store, but we assume
it is not possible to answer queries from the archival store. It could be examined only under
special circumstances using time-consuming retrieval processes. There is also a working store,
into which summaries or parts of streams may be placed, and which can be used for answering
queries. The working store might be disk, or it might be main memory, depending on how fast
we need to process queries. But either way, it is of sufficiently limited capacity that it cannot
store all the data from all the streams.
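A minimal Python sketch of this organization, with a capacity-limited working store and a write-only archival store; the window size, the stream ids, and the average query are illustrative assumptions, not from the text:

```python
from collections import deque

class StreamProcessor:
    """Toy stream processor: the working store holds only the most
    recent `window` elements per stream; older elements are moved to
    an archival store that queries never touch."""
    def __init__(self, window=1000):
        self.window = window
        self.working = {}   # stream id -> recent elements (limited capacity)
        self.archive = []   # archival store: append-only, not queried

    def ingest(self, stream_id, element):
        buf = self.working.setdefault(stream_id, deque(maxlen=self.window))
        if len(buf) == buf.maxlen:
            # working store is full: evict the oldest element to the archive
            self.archive.append((stream_id, buf[0]))
        buf.append(element)

    def query_avg(self, stream_id):
        """Ad-hoc query answered from the working store only."""
        buf = self.working.get(stream_id, ())
        return sum(buf) / len(buf) if buf else None
```

Note that the system has no control over when `ingest` is called or at what rate, which is exactly what distinguishes this from a DBMS reading from disk at its own pace.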
Q2. What is a Bloom filter? Determine the probability of a false positive in a Bloom filter.
A Bloom filter consists of:
1. An array of n bits, initially all 0’s.
2. A collection of hash functions h1, h2, . . . , hk. Each hash function maps “key” values
to n buckets, corresponding to the n bits of the bit-array.
3. A set S of m key values.
The purpose of the Bloom filter is to allow through all stream elements whose keys are in S,
while rejecting most of the stream elements whose keys are not in S.
The model to use is throwing darts at targets. Suppose we have x targets and y darts. Any dart
is equally likely to hit any target. After throwing the darts, how many targets can we expect to
be hit at least once?
The probability that a given dart will not hit a given target is (x − 1)/x.
The probability that none of the y darts will hit a given target is ((x − 1)/x)^y.
We can write this expression as (1 − 1/x)^(x(y/x)).
Using the approximation (1 − ε)^(1/ε) ≈ 1/e for small ε, we conclude that the probability
that none of the y darts hit a given target is e^(−y/x).
For the Bloom filter, the targets are the n bits and the darts are the km applications of
hash functions (k hash functions applied to each of the m keys in S), so the probability
that a given bit remains 0 is e^(−km/n). A false positive occurs when all k bits for a key
not in S happen to be 1, which has probability (1 − e^(−km/n))^k.
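A minimal Python sketch of a Bloom filter together with the false-positive formula derived from the dart-throwing analysis; the salted-SHA-1 hashing scheme and all parameter values are illustrative assumptions, not from the text:

```python
import hashlib
import math

class BloomFilter:
    """Bloom filter with an n-bit array and k hash functions
    (derived here from salted SHA-1, an illustrative choice)."""
    def __init__(self, n, k):
        self.n, self.k = n, k
        self.bits = [0] * n

    def _positions(self, key):
        # k bit positions for a key, one per salted hash function
        for i in range(self.k):
            digest = hashlib.sha1(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.n

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means "definitely not in S"; True means "probably in S"
        return all(self.bits[pos] for pos in self._positions(key))

def false_positive_prob(n, k, m):
    """Probability a key not in S passes: all k probed bits are 1,
    each set with probability 1 - e^(-km/n) by the dart model."""
    return (1 - math.exp(-k * m / n)) ** k
```

Keys in S are always let through; the formula quantifies how often a key outside S slips past.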
Q3. Explain the Girvan-Newman algorithm. Detect communities for the following graph using the Girvan-Newman algorithm (edge betweenness values are shown on the graph).
In order to find the betweenness of the edges, we need to count the shortest paths that go
through each edge.
Girvan - Newman Algorithm visits each node X once and computes the number of
shortest paths from X to each of the other nodes that go through each of the edges.
The algorithm begins by performing a breadth first search [BFS] of the graph, starting
at the node X.
The edges that go between nodes at the same level can never be part of a shortest path
from X.
Each DAG edge (an edge between adjacent BFS levels) will be part of at least one shortest
path from the root X.
To complete the betweenness calculation, we repeat this calculation with every node as the
root and sum the contributions.
After calculations, following graph shows final betweenness values:
We can cluster by taking the edges in order of increasing betweenness and adding them to
the graph one at a time.
Equivalently, we can remove the edges with the highest betweenness values to split the
graph into communities.
In the example graph we remove edge BD to get two communities as follows:
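The whole procedure can be sketched in Python. Since the figure is not reproduced here, the graph below is an assumed example using the labels A-G, with BD bridging the communities {A, B, C} and {D, E, F, G}:

```python
from collections import deque, defaultdict

def edge_betweenness(graph):
    """Edge betweenness for an unweighted, undirected graph given as
    {node: set(neighbours)}: a BFS from every node counts the shortest
    paths through each edge; credits are summed and finally halved."""
    bet = defaultdict(float)
    for s in graph:
        sigma = {v: 0 for v in graph}       # shortest-path counts from s
        dist = {v: -1 for v in graph}
        sigma[s], dist[s] = 1, 0
        order, preds = [], {v: [] for v in graph}
        q = deque([s])
        while q:
            v = q.popleft()
            order.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    q.append(w)
                if dist[w] == dist[v] + 1:  # (v, w) is a DAG edge
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in graph}     # credit flowing up from leaves
        for w in reversed(order):
            for v in preds[w]:
                credit = sigma[v] / sigma[w] * (1 + delta[w])
                bet[tuple(sorted((v, w)))] += credit
                delta[v] += credit
    # every shortest path was counted once from each of its two endpoints
    return {e: b / 2 for e, b in bet.items()}

def components(graph):
    """Connected components, used to read off the communities."""
    seen, comps = set(), []
    for s in graph:
        if s in seen:
            continue
        comp, stack = set(), [s]
        while stack:
            v = stack.pop()
            if v not in comp:
                comp.add(v)
                stack.extend(graph[v] - comp)
        seen |= comp
        comps.append(comp)
    return comps

# Assumed example graph: BD is the bridge between {A,B,C} and {D,E,F,G}.
edges = [("A","B"), ("A","C"), ("B","C"), ("B","D"),
         ("D","E"), ("D","F"), ("D","G"), ("E","F"), ("F","G")]
graph = defaultdict(set)
for u, v in edges:
    graph[u].add(v)
    graph[v].add(u)

bet = edge_betweenness(graph)
u, v = max(bet, key=bet.get)        # highest-betweenness edge: BD
graph[u].discard(v)
graph[v].discard(u)                 # remove it to split the graph
print(sorted(map(sorted, components(graph))))
```

On this graph, every shortest path between the two triangles crosses BD, so BD gets the highest betweenness and removing it yields the two communities.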
Q5. Explain the Flajolet-Martin algorithm. Perform FM for the stream 1, 3, 2, 1, 2, 3, 4, 3, 1, 2, 3, 1.
The Flajolet-Martin algorithm approximates the number of unique objects in a stream or a
database in one pass. If the stream contains n elements with m of them unique, this algorithm
runs in O(n) time and needs O(log m) memory.
Algorithm:
1. Create a bit vector (bit array) of sufficient length L, such that 2^L > n, the number
of elements in the stream. Usually a 64-bit vector is sufficient, since 2^64 is quite
large for most purposes.
2. The i-th bit in this vector/array represents whether we have seen a hash-function value
whose binary representation ends in 0^i, i.e., has at least i trailing zeros. So initialize
each bit to 0.
3. For each element a in the stream, compute its hash h(a) and let r(a) be the number of
trailing zeros in the binary representation of h(a); set bit r(a) of the vector to 1.
4. Let R = max(r(a)) over all elements seen. The estimated number of distinct elements
is N = 2^R.
Example: S = 1, 3, 2, 1, 2, 3, 4, 3, 1, 2, 3, 1
h(x) = (6x + 1) mod 5
Assume hash values are written using |b| = 5 binary bits.
h(1) = 7 mod 5 = 2 = 00010, so r(1) = 1
h(3) = 19 mod 5 = 4 = 00100, so r(3) = 2
h(2) = 13 mod 5 = 3 = 00011, so r(2) = 0
h(4) = 25 mod 5 = 0 = 00000, so r(4) = 5
R = max(r(a)) = 5
So the number of distinct elements is estimated as N = 2^R = 2^5 = 32.
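The FM computation above can be sketched in Python; the 5-bit width mirrors the |b| = 5 assumption in the example:

```python
def trailing_zeros(n, width=5):
    """Trailing zeros in the width-bit binary form of n;
    a hash value of 0 counts as all `width` bits zero."""
    if n == 0:
        return width
    r = 0
    while n % 2 == 0:
        n //= 2
        r += 1
    return r

def flajolet_martin(stream, h, width=5):
    """Estimate the distinct count as 2^R, where R is the maximum
    trailing-zero count over all hashed stream elements."""
    R = 0
    for x in stream:
        R = max(R, trailing_zeros(h(x), width))
    return 2 ** R

stream = [1, 3, 2, 1, 2, 3, 4, 3, 1, 2, 3, 1]
h = lambda x: (6 * x + 1) % 5       # the hash function from the example
print(flajolet_martin(stream, h))   # -> 32
```

The estimate 32 far exceeds the true count of 4; a single hash function gives a crude estimate, which is why FM is normally run with many hash functions and the results combined.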
Q6. Write pseudocode for PageRank calculation using MapReduce. What is the role of combiners
in performing the PageRank calculation?
Combiners: (2 Marks)
There are two reasons
1. We might wish to add terms for v′_i, the i-th component of the result vector v, at the
Map tasks. This improvement is the same as using a combiner, since the Reduce
function simply adds terms with a common key. Recall that for a MapReduce
implementation of matrix-vector multiplication, the key is the value of i for which a
term m_ij v_j is intended.
2. We might not be using MapReduce at all, but rather executing the iteration step at a
single machine or a collection of machines.
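A Python sketch of one PageRank iteration in map/combine/reduce style; the tiny link structure, the taxation constant β = 0.85, and the function names are illustrative assumptions:

```python
from collections import defaultdict

def pagerank_iteration(links, ranks, beta=0.85):
    """One PageRank iteration expressed as map + combine/reduce.
    links[p]: pages p links to; ranks[p]: current rank of p."""
    n = len(links)

    # Map: each page emits (destination, share-of-rank) pairs,
    # splitting beta times its rank equally over its out-links.
    emitted = []
    for page, outs in links.items():
        for dest in outs:
            emitted.append((dest, beta * ranks[page] / len(outs)))

    # Combine/Reduce: sum terms with a common key (the destination page).
    # A combiner would perform this partial summing on the Map side,
    # shrinking the data shuffled to the Reduce tasks.
    new_ranks = defaultdict(lambda: (1 - beta) / n)  # taxation term
    for dest, contrib in emitted:
        new_ranks[dest] += contrib
    for page in links:          # pages with no in-links keep just the tax
        new_ranks[page]
    return dict(new_ranks)

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}   # assumed tiny web
ranks = {p: 1 / 3 for p in links}
print(pagerank_iteration(links, ranks))
```

Because Reduce only adds terms sharing a key, pre-summing those terms in a combiner changes nothing about the result, only the volume of intermediate data.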
Q7. Explain the CURE clustering algorithm with an example.
The CURE (Clustering Using Representatives) algorithm is a large-scale clustering algorithm
in the point-assignment class which assumes a Euclidean space. It does not assume anything
about the shape of clusters; they need not be normally distributed, and can even have strange
bends, S-shapes, or rings.
Instead of representing clusters by their centroid, it uses a collection of representative points,
as the name implies.
The CURE algorithm is divided into two phases:
1. Initialization in CURE
2. Completion of the CURE Algorithm
Initialization in CURE:
1. Take a small sample of the data and cluster it in main memory. In principle, any
clustering method could be used, but as CURE is designed to handle oddly shaped
clusters, it is often advisable to use a hierarchical method in which clusters are merged
when they have a close pair of points.
2. Select a small set of points from each cluster to be representative points. These points
should be chosen to be as far from one another as possible, using the same idea as picking
dispersed initial points in K-means.
3. Move each of the representative points a fixed fraction of the distance between its
location and the centroid of its cluster. Perhaps 20% is a good fraction to choose. Note
that this step requires a Euclidean space, since otherwise, there might not be any notion
of a line between two points.
Completion of the CURE Algorithm:
The next phase of CURE is to merge two clusters if they have a pair of representative points,
one from each cluster, that are sufficiently close. The user may pick the distance that defines
“close.” This merging step can repeat, until there are no more sufficiently close clusters.
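The shrinking step of the initialization and the merging step of the completion phase can be sketched in Python; the 2-D points, the 20% shrink fraction, and the merge threshold are illustrative assumptions:

```python
import math

def shrink(reps, fraction=0.2):
    """Move each representative point `fraction` of the way toward the
    centroid of its cluster (step 3 of the initialization)."""
    d = len(reps[0])
    centroid = [sum(p[i] for p in reps) / len(reps) for i in range(d)]
    return [tuple(p[i] + fraction * (centroid[i] - p[i]) for i in range(d))
            for p in reps]

def merge_close(clusters, threshold):
    """Completion phase: repeatedly merge two clusters whenever some pair
    of representatives, one from each, is closer than `threshold`."""
    clusters = [list(c) for c in clusters]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if any(math.dist(p, q) < threshold
                       for p in clusters[i] for q in clusters[j]):
                    clusters[i] += clusters[j]   # merge j into i
                    del clusters[j]
                    merged = True
                    break
            if merged:
                break
    return clusters

# Shrinking pulls outliers inward: centroid is (5, 0), so each point
# moves 20% of the way toward it.
print(shrink([(0.0, 0.0), (10.0, 0.0)]))   # -> [(1.0, 0.0), (9.0, 0.0)]
print(merge_close([[(0, 0)], [(1, 0)], [(10, 0)]], threshold=2))
```

Note the use of Euclidean distance throughout, reflecting the algorithm's requirement of a Euclidean space.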