Cwkaa 2010

COMSM1402 Advanced Algorithms 2010
Raphaël Clifford
November 5, 2010

Due: The coursework should be handed in at the start of the lecture on
Friday, 17 December. This is both the normal and late deadline. Online
submissions can be made up to midnight. Your marks will be based on the best
five answers out of the first six questions plus your mark for question seven.
Problems:
1. The purpose of this question is to show a weakly universal class of hash
functions H for which E[M] = √(n − 1), where M is the maximum load
assuming n items are hashed into n slots using a universal family of hash
functions. For positive n, we use the notation [n] := {0, . . . , n − 1}.
Define a family H of hash functions from [n] to [n] as follows. Let ℓ be
an integer, with 1 ≤ ℓ ≤ n. For each V ⊂ [n] of cardinality ℓ, we define
the hash function h_V : [n] → [n] by the following property: h_V maps each
element of V onto 0, and h_V maps [n] \ V injectively into [n] \ {0}. Note
that h_V is not uniquely determined by this property, but we can always
choose one h_V satisfying this property (verify). Define
H := {h_V : V ⊂ [n], |V| = ℓ}.
Argue that H is weakly universal if ℓ ≤ √(n − 1). Note that the maximum
load always equals ℓ.
[10 points]
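
For small n, the family in question 1 can be checked by brute force. The sketch below is only a sanity check, not a proof. It assumes the usual definition of weak universality (for every pair x ≠ y, the probability over a uniformly chosen h in H that h(x) = h(y) is at most 1/n), and the helper names make_h and max_collision_probability are hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def make_h(V, n):
    """Build one valid h_V: V goes to 0, the rest of [n] goes injectively to 1, 2, ..."""
    h = {x: 0 for x in V}
    rest = [x for x in range(n) if x not in V]
    # len(rest) = n - |V| <= n - 1, so the targets 1..len(rest) stay inside [n] minus {0}.
    for target, x in enumerate(rest, start=1):
        h[x] = target
    return h


def max_collision_probability(n, ell):
    """Largest Pr[h(x) = h(y)] over all pairs x != y, for h uniform in the family."""
    family = [make_h(set(V), n) for V in combinations(range(n), ell)]
    worst = Fraction(0)
    for x, y in combinations(range(n), 2):
        collisions = sum(1 for h in family if h[x] == h[y])
        worst = max(worst, Fraction(collisions, len(family)))
    return worst


if __name__ == "__main__":
    n = 10
    for ell in range(1, 6):
        p = max_collision_probability(n, ell)
        print(ell, p, p <= Fraction(1, n))
```

Exact fractions are used so that the comparison against 1/n is not affected by floating-point rounding.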
2. The following approach is useful in streaming algorithms; you should think
about why this might be. Suppose that we have a sequence of items,
passing by one at a time. We want to maintain a sample of one item that
has the property that it is uniformly distributed over all the items that
we have seen at each step. Moreover, we want to accomplish this without
knowing the total number of items in advance or storing all of the items
that we see. Consider the following algorithm, which stores just one item
in memory at all times. When the first item appears, it is stored in the
memory. When the kth item appears, it replaces the item in memory with
probability 1/k. Explain why this algorithm solves the problem.
Now suppose instead we want a sample of s items instead of just one,
without replacement. That is, we don’t want to get the same item multiple
times in our sample. If this weren’t an issue, we could get a sample of s
items with replacement just by running s independent copies of the above.
Generalize the above process to that case. (Hint: start by taking the first
s items and storing them as your sample. With what probability should
each new item come into the sample?) [10 points]
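
The one-item procedure in question 2 can be written down directly. The sketch below implements it as stated (the kth item replaces the stored item with probability 1/k), and also shows one standard generalisation to s items, often called reservoir sampling or Algorithm R; that second function is a well-known technique and not necessarily the answer intended here. The function names are hypothetical.

```python
import random
from collections import Counter


def sample_one(stream, rng):
    """Keep a single item; the k-th item replaces it with probability 1/k."""
    kept = None
    for k, item in enumerate(stream, start=1):
        if rng.random() < 1 / k:  # always true for k = 1, since random() < 1.0
            kept = item
    return kept


def sample_s(stream, s, rng):
    """Reservoir sampling (Algorithm R): keep s items without replacement."""
    reservoir = []
    for k, item in enumerate(stream, start=1):
        if k <= s:
            reservoir.append(item)  # the first s items fill the reservoir
        else:
            j = rng.randrange(k)    # uniform over 0 .. k-1
            if j < s:               # this happens with probability s/k
                reservoir[j] = item
    return reservoir


if __name__ == "__main__":
    rng = random.Random(0)
    counts = Counter(sample_one(range(10), rng) for _ in range(20000))
    print(sorted(counts.items()))   # each item should appear roughly 2000 times
    print(sample_s(range(100), 5, rng))
```

Neither function needs to know the length of the stream in advance, and each stores only the current sample.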

3. The simplest variant of cuckoo hashing is as follows. There is a table with
m cells. Each element x can hash into exactly two locations, given by hash
functions h_1(x) and h_2(x). When an item is placed into the hash table,
if at least one of these two locations is free, the item is placed in the free
location. If neither location is free, x is placed in one of the two locations,
and kicks out the element y that is in that location. Then y is placed in
its alternative location. If that location is free, then all is well, and y is
placed there. Otherwise, y must kick out the element in that location,
and this new element must try to move to its alternative location, and so
on.
It is possible that, at some point, the process will loop. The loop can
either be found explicitly, or a limit on the number of times elements can
be kicked out can be enforced and the whole dataset rehashed if this limit
is ever reached.
One way to generalise this is to use more than two hash functions, so that
each element has more than two candidate locations and the element to kick
out is chosen randomly at each step. The task is to implement a generalised variant
of cuckoo hashing. You should make a choice about how you will create
the hash functions and explain it clearly in terms of the randomness and
independence you are using. You could, for example, simply toss some
coins if you only need a small number of random bits to start off. Feel
free to try different hash function families and report on what effect, if
any, this has. You may also want to experiment with creating random
numbers using methods described in the lectures or otherwise. In your
experiments, use a table of size 8192, and add elements until the first time
you cannot add an element. (For convenience, you may assume an element
cannot be added if, after repeating the kick out step 20 times, you are not
done.) Using 2 hash functions and then 3 hash functions, and running
the experiment 1000 times, examine how full the hash table can be before
problems start to occur. Compare your results with the bounds from the
theory and discuss what you find. For this problem, please submit your
code.
You can choose any programming language you like, but please include
clear instructions on how to run your code on a lab machine in a file called
readme.txt that is included with your submission.
[10 points]
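
As a starting point for the experiment in question 3, the insertion procedure can be sketched as follows. This is one possible reading of the description, not a reference solution. The hash family x -> ((a*x + b) mod P) mod m with P = 2^61 - 1, the use of random integer keys, and all function names are assumptions made for illustration; the kick limit of 20 and the table size of 8192 come from the question.

```python
import random

P = (1 << 61) - 1  # assumed prime modulus for the illustrative hash family


def make_hashes(d, m, rng):
    """Draw d functions x -> ((a*x + b) mod P) mod m with random a and b."""
    params = [(rng.randrange(1, P), rng.randrange(P)) for _ in range(d)]
    return [lambda x, a=a, b=b: ((a * x + b) % P) % m for a, b in params]


def insert(table, hashes, x, rng, max_kicks=20):
    """Insert x; return False if still not done after max_kicks kick-out steps."""
    for h in hashes:
        if table[h(x)] is None:
            table[h(x)] = x
            return True
    pos = rng.choice([h(x) for h in hashes])
    for _ in range(max_kicks):
        x, table[pos] = table[pos], x  # place the item, pick up the old occupant
        # The displaced item tries its other locations (fall back to pos if it has none).
        alternatives = [h(x) for h in hashes if h(x) != pos] or [pos]
        free = [q for q in alternatives if table[q] is None]
        if free:
            table[free[0]] = x
            return True
        pos = rng.choice(alternatives)
    return False  # the item still in hand is dropped; the caller stops here


def load_at_first_failure(m=8192, d=2, seed=None):
    """Add random keys until the first failed insertion; return the load factor."""
    rng = random.Random(seed)
    hashes = make_hashes(d, m, rng)
    table = [None] * m
    stored = 0
    while stored < m and insert(table, hashes, rng.randrange(P), rng):
        stored += 1
    return stored / m


if __name__ == "__main__":
    for d in (2, 3):
        runs = [load_at_first_failure(d=d, seed=s) for s in range(20)]
        print(d, sum(runs) / len(runs))
```

The main block uses 20 runs per setting only to keep the example quick; the question asks for 1000.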
4. This question has two parts. A naive implementation of a van Emde Boas
tree uses O(|U|) space, where |U| is the universe size. Explain in detail
how this can be reduced to O(n) space (where n is the number of elements
to be stored). What are the complexities of the different operations in your
reduced-space data structure?
The van Emde Boas tree layout can be used to implement a number of
other data structures and to speed up important applications. Find an
example from the literature and explain in detail how the van Emde Boas
tree improves the time complexities of the relevant operations. Your
explanation should give suitable citations and ideally provide proofs of any
results you report.
[10 points]

5. Consider the following pattern matching problem involving wildcard
symbols. A single character wildcard is said to match any other symbol in the
input alphabet.
INPUT: Text T = t_1 . . . t_n, pattern P = p_1 . . . p_m. At most ℓ of the
pattern characters p_i are non-wildcards (i.e. normal characters) and the rest
are single character wildcards.
OUTPUT: The Hamming distance between P and every substring of T of
length m.

Example: let P = ab?ab, T = b?bbabba and ℓ = 4. The output is
3, 0, 2, 4.

(a) Give an algorithm that solves this problem.
(b) What is the asymptotic time complexity of your algorithm? Make
sure to explain your working carefully.

The better the time complexity, the more marks will be awarded. In
particular, extra marks will be given for fast solutions whose running time
is parameterised by ℓ as well as n and m. A Θ(nm) time solution will gain
no marks.
You can assume it takes no more than log_2 n bits (i.e. a single word of
memory) to represent any of the input symbols and that simple arithmetic
operations on the input symbols, including addition and multiplication,
take constant time.
[10 points]
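
To make the expected output in question 5 concrete, the following is a naive reference checker. It is not a fast solution of the kind the question asks for: it compares every non-wildcard pattern character against every alignment, so it is quadratic when most pattern characters are non-wildcards. It assumes "?" denotes the wildcard and, since the example text contains "?", that a wildcard in either string matches anything. The function name is hypothetical.

```python
def hamming_with_wildcards(text, pattern, wildcard="?"):
    """Hamming distance between pattern and each length-m substring of text."""
    n, m = len(text), len(pattern)
    # Only non-wildcard pattern positions can ever contribute a mismatch.
    solid = [(j, c) for j, c in enumerate(pattern) if c != wildcard]
    distances = []
    for i in range(n - m + 1):
        mismatches = sum(
            1
            for j, c in solid
            if text[i + j] != wildcard and text[i + j] != c
        )
        distances.append(mismatches)
    return distances


if __name__ == "__main__":
    print(hamming_with_wildcards("b?bbabba", "ab?ab"))  # [3, 0, 2, 4]
```

A checker like this is mainly useful for testing a faster algorithm against small random inputs.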
6. (a) The recurrence for the running time of the algorithm for computing
a suffix array presented in lectures is T(n) = T(2n/3) + O(n). Show
how to modify the algorithm to give one whose recurrence is T(n) =
T(3n/7) + O(n). Is 3/7 the best possible, or can you do better?
(b) Suppose we have a pattern p and a text t and we want to find for every
position in t the longest substring of p that matches there exactly.
Give a fast algorithm to solve this problem together with its analysis.
The better the time complexity, the more marks will be awarded.
[10 points]
7. For this question you are asked to write a two page summary of a research
paper. I would like you to choose a highly cited paper from one of the
leading algorithms conferences to write about. Luckily there is already a
website (http://www.cs.utah.edu/~suresh/citations/) that has been
through the papers written from 1997–2006 for FOCS, STOC and SODA
(look up what these stand for) and counted the citation numbers for you,
although these numbers are now underestimates in most cases. Alternatively,
you may choose a paper from any of the conferences listed at
http://www.cs.tau.ac.il/~iftgam/eventlist.htm. You should check
on http://scholar.google.com that any paper you choose has a current
citation count of at least one hundred.
Please post the title of the paper, its authors, the conference name and
the number of citations on the unit forum as soon as you have made your
choice. You may not, of course, choose the same paper as someone else.
Your two page review should include:
• A short one or two paragraph summary of the paper.
• A deeper, more extensive outline of the main points of the paper,
including for example assumptions made, arguments presented, data
analyzed, and conclusions drawn.
• Any limitations or extensions you see for the ideas in the paper.
• Your opinion of the paper; primarily, the quality of the ideas and its
real or potential impact.

[30 points]

Academic Integrity: All the work you hand in should be your own. If you
work with other students, you should list them on your coursework along with
a brief explanation of which topics you discussed. In general, any source other
than the lectures should be explicitly cited at the point where it is used.