What you can do with a Tall-and-Skinny
QR Factorization on Hadoop:
Large regressions, Principal Components

Slides  bit.ly/16LS8Vk
Code    github.com/dgleich/mrtsqr
@dgleich
dgleich@purdue.edu



DAVID F. GLEICH
ASSISTANT PROFESSOR
COMPUTER SCIENCE
PURDUE UNIVERSITY





                                                                              1
                                  David Gleich · Purdue
    bit.ly/16LS8Vk
Why you should stay …
you like advanced machine learning techniques

you want to understand how to compute the
singular values and vectors of a huge matrix
(that’s tall and skinny)

you want to learn about large-scale regression,
and principal components from a matrix
perspective




What I’m going to assume
you know 


MapReduce

Python

Some simple matrix manipulation




Tall-and-Skinny matrices (m ≫ n)

A has many rows (like a billion)
and a few columns (under 10,000).

Used in
  regression and general linear models with many samples
  block iterative methods
  panel factorizations
  approximate kernel k-means
  big-data SVD/PCA

[Image from the tinyimages collection]
If you have tons of small
records, then there is probably
a tall-and-skinny matrix
somewhere




Tall-and-skinny matrices are
common in BigData

A : m x n, m ≫ n

Key is an arbitrary row-id.
Value is the 1 x n array for a row.

Each submatrix Ai is the input to a map task.

[Diagram: A split into row blocks A1, A2, A3, A4]
PCA of 80,000,000 images

A: 80,000,000 images (rows) by 1000 pixels (columns)

[Figure: fraction of variance captured by the first 100 principal
components; the first 16 columns of V displayed as images.
Caption: "The 16 most important principal component basis
functions (by rows)."]

Constantine & Gleich, MapReduce 2011.
Regression with 80,000,000 images

A: 80,000,000 images (rows) by 1000 pixels (columns)

The goal was to approximate how much red there was in a picture
from the grayscale pixels only. We get a measure of "redness" and
how much each pixel contributes to the whole.

Model the sum of red-pixel values in each image as a linear
combination of the gray values in each image: if ri is the sum of
the red components in all pixels of image i, and Gi,j is the gray
value of the jth pixel of image i, then we want to find
min over s of sum_i ( ri − sum_j Gi,j sj )². There is no particular
importance to this regression problem; we use it merely as a
demonstration.

The coefficients sj are displayed as an image at the right. They
reveal regions of the image that are not as important in
determining the overall red value of the grayscale component of an
image. The color scale varies from light blue (strongly negative)
to red (strongly positive). The computation took 30 minutes using
the Dumbo framework and a two-iteration job with 250 intermediate
reducers.

We also solved a principal component problem to find a principal
component basis for each image. Let G be the matrix of Gi,j's from
the regression and let ui be the mean of the ith …
Let’s talk about QR!




                                                            9
                 David Gleich · Purdue
   bit.ly/16LS8Vk
QR Factorization and the
Gram Schmidt process

                                    Consider a set of vectors v1 to
                                    vn. Set u1 to be v1.
                                    
                                    Create a new vector u2 by
                                    removing any “component” of
                                    u1 from v2.
                                    
                                    Create a new vector u3 by
                                    removing any “component” of
                                    u1 and u2 from v3.
                                    
                                    …
"Gram-Schmidt process" from Wikipedia
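The process above can be sketched in a few lines of numpy (a minimal illustration for this talk's notation, not code from the slides):

```python
import numpy as np

def gram_schmidt(V):
    # Orthonormalize the columns of V: each new u_k is v_k with its
    # components along the previously computed u_1..u_{k-1} removed.
    U = np.zeros_like(V, dtype=float)
    for k in range(V.shape[1]):
        u = V[:, k].astype(float).copy()
        for j in range(k):
            u -= (U[:, j] @ V[:, k]) * U[:, j]  # remove component along u_j
        U[:, k] = u / np.linalg.norm(u)
    return U

V = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
U = gram_schmidt(V)
# The columns of U are orthonormal: U.T @ U is the identity.
```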
QR Factorization and the
Gram Schmidt process

v1 = a1 u1
v2 = b1 u1 + b2 u2
v3 = c1 u1 + c2 u2 + c3 u3

In matrix form:

[ v1 v2 v3 ... ] = [ u1 u2 u3 ... ] [ a1 b1 c1 ... ]
                                    [ 0  b2 c2 ... ]
                                    [ 0  0  c3 ... ]
                                    [ .  .  .   .  ]
QR Factorization and the
Gram Schmidt process

v1 = a1 u1
v2 = b1 u1 + b2 u2
v3 = c1 u1 + c2 u2 + c3 u3

For this problem:  V = U R

All vectors in U are at right angles,
i.e. they are decoupled.

What it's usually written as by others:  A = QR
QR Factorization and the
Gram Schmidt process

v1 = a1 u1
v2 = b1 u1 + b2 u2
v3 = c1 u1 + c2 u2 + c3 u3

A = Q R   (tall A equals tall Q times a small triangular R)

All vectors in Q are at right angles,
i.e. they are decoupled.
PCA of 80,000,000 images

A: 80,000,000 images (rows) by 1000 pixels (columns)

MapReduce: zero-mean the rows of A to get X, then TSQR gives R.
Post-processing: the SVD of R gives V (the principal components)
and the top 100 singular values; the first 16 columns of V are
shown as images.

Constantine & Gleich, MapReduce 2010.
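The pipeline above can be checked at small scale. A sketch with made-up data: in the real pipeline, centering and TSQR run in MapReduce and only the tiny R is factored on one machine.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((10000, 20))   # stand-in for the 80M x 1000 matrix
X = A - A.mean(axis=0)                 # the zero-mean step

R = np.linalg.qr(X, mode='r')          # at scale, TSQR produces this R
_, S, Vt = np.linalg.svd(R)            # tiny n-by-n SVD, post-processing

# S holds the singular values of X; the rows of Vt are the principal
# components (the columns of V shown as images on the slide).
```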
Input 500,000,000-by-100 matrix
Each record 1-by-100 row
HDFS Size 423.3 GB
Time to compute  colsum( A ) 161 sec.
Time to compute R in qr( A ) 387 sec.





The rest of the talk!
Full TSQR code in hadoopy

import random, numpy, hadoopy

class SerialTSQR:
  def __init__(self,blocksize,isreducer):
    self.bsize=blocksize
    self.data = []
    if isreducer: self.__call__ = self.reducer
    else: self.__call__ = self.mapper

  def compress(self):
    R = numpy.linalg.qr(numpy.array(self.data),'r')
    # reset data and re-initialize to R
    self.data = []
    for row in R:
      self.data.append([float(v) for v in row])

  def collect(self,key,value):
    self.data.append(value)
    if len(self.data)>self.bsize*len(self.data[0]):
      self.compress()

  def close(self):
    self.compress()
    for row in self.data:
      key = random.randint(0,2000000000)
      yield key, row

  def mapper(self,key,value):
    self.collect(key,value)

  def reducer(self,key,values):
    for value in values: self.mapper(key,value)

if __name__=='__main__':
  mapper = SerialTSQR(blocksize=3,isreducer=False)
  reducer = SerialTSQR(blocksize=3,isreducer=True)
  hadoopy.run(mapper, reducer)
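Why this works: the R factor of the stacked local R factors equals the R factor of the whole matrix, up to the sign of each row. A small check (not from the talk):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((1000, 5))
A1, A2 = A[:500], A[500:]

# local QRs, as in the map tasks
R1 = np.linalg.qr(A1, mode='r')
R2 = np.linalg.qr(A2, mode='r')

# QR of the stacked local Rs, as in the reduce task
R_tsqr = np.linalg.qr(np.vstack([R1, R2]), mode='r')
R_full = np.linalg.qr(A, mode='r')

# the two R factors are identical up to the sign of each row
```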
Communication avoiding QR (Demmel et al. 2008)
on MapReduce (Constantine and Gleich, 2010)

Algorithm
  Data    Rows of a matrix
  Map     QR factorization of rows
  Reduce  QR factorization of rows

Mapper 1 (Serial TSQR):
  [A1; A2] → qr → Q2 R2;  [R2; A3] → qr → Q3 R3;
  [R3; A4] → qr → Q4 R4 → emit R4
Mapper 2 (Serial TSQR):
  [A5; A6] → qr → Q6 R6;  [R6; A7] → qr → Q7 R7;
  [R7; A8] → qr → Q8 R8 → emit R8
Reducer 1 (Serial TSQR):
  [R4; R8] → qr → Q R → emit R
The rest of the talk!
Full TSQR code in hadoopy

import random, numpy, hadoopy

class SerialTSQR:
  def __init__(self,blocksize,isreducer):
    self.bsize=blocksize
    self.data = []
    if isreducer: self.__call__ = self.reducer
    else: self.__call__ = self.mapper

  def compress(self):
    R = numpy.linalg.qr(numpy.array(self.data),'r')
    # reset data and re-initialize to R
    self.data = []
    for row in R:
      self.data.append([float(v) for v in row])

  def collect(self,key,value):
    self.data.append(value)
    if len(self.data)>self.bsize*len(self.data[0]):
      self.compress()

  def close(self):
    self.compress()
    for row in self.data:
      key = random.randint(0,2000000000)
      yield key, row

  def mapper(self,key,value):
    self.collect(key,value)

  def reducer(self,key,values):
    for value in values: self.mapper(key,value)

if __name__=='__main__':
  mapper = SerialTSQR(blocksize=3,isreducer=False)
  reducer = SerialTSQR(blocksize=3,isreducer=True)
  hadoopy.run(mapper, reducer)




Too many maps cause too
much data to one reducer!

Each image is 5k.
Each HDFS block has 12,800 images.
6,250 total blocks.
Each map outputs a 1000-by-1000 matrix.
One reducer gets a 6.25M-by-1000 matrix (50GB).




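The arithmetic behind those numbers, as a quick sanity check (assuming 8-byte doubles):

```python
blocks = 6250                      # total HDFS blocks = total map tasks
rows = blocks * 1000               # each map emits a 1000-by-1000 R factor
gigabytes = rows * 1000 * 8 / 1e9  # 1000 columns of 8-byte doubles
# rows is 6,250,000 and gigabytes is 50.0, matching the slide
```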
Iteration 1: Mappers 1-1 through 1-4 (Serial TSQR) each consume a
block Ai and emit a local factor R1..R4. An identity map and a
shuffle route these, in splits S1..S3, to Reducers 1-1 through 1-3
(Serial TSQR), which emit second-level factors R2,1..R2,3.

Iteration 2: after another shuffle, Reducer 2-1 (Serial TSQR)
combines the second-level factors and emits the final R.
Input 500,000,000-by-100 matrix
Each record 1-by-100 row
HDFS Size 423.3 GB
Time to compute  colsum( A ) 161 sec.
Time to compute R in qr( A ) 387 sec.





Hadoop streaming isn't
always slow!

Synthetic data test on 100,000,000-by-500 matrix (~500GB)
Codes implemented in MapReduce streaming
Matrix stored as TypedBytes lists of doubles
Python frameworks use Numpy+ATLAS matrix.
Custom C++ TypedBytes reader/writer with ATLAS matrix.

            Iter 1          Iter 2          Overall
            Total (secs.)   Total (secs.)   Total (secs.)
  Dumbo       960             217             1177
  Hadoopy     612             118              730
  C++         350              37              387
  Java        436              66              502




Use multiple iterations for
problems with many columns

  Cols.   Iters.   Split (MB)   Maps   Secs.
    50      1          64       8000    388
     –      –         256       2000    184
     –      –         512       1000    149
     –      2          64       8000    425
     –      –         256       2000    220
     –      –         512       1000    191
  1000      1         512       1000    666
     –      2          64       6000    590
     –      –         256       2000    432
     –      –         512       1000    337

Increasing split size improves performance
(accounts for Hadoop data movement).

Increasing iterations helps for problems with
many columns.

(1000 columns with 64-MB split size overloaded
the single reducer.)

More about how to
compute a regression

  min ‖Ax − b‖²
  = min sum_i ( sum_j Aij xj − bi )²

[Diagram: Mapper 1 (Serial TSQR) runs TSQR on the blocks of A
while applying the same orthogonal transformations to b,
e.g. b2 = Q2ᵀ b1.]
TSQR code in hadoopy for
regressions

import random, numpy, hadoopy

class SerialTSQR:
  def __init__(self,blocksize,isreducer):
    […]

  def compress(self):
    Q,R = numpy.linalg.qr(numpy.array(self.data),'full')
    # reset data and re-initialize to R
    self.data = []
    for row in R:
      self.data.append([float(v) for v in row])
    # apply the same orthogonal transformation to the right-hand side
    self.rhs = list(numpy.dot(Q.T, numpy.array(self.rhs)))

  def collect(self,key,valuerhs):
    self.data.append(valuerhs[0])
    self.rhs.append(valuerhs[1])
    if len(self.data)>self.bsize*len(self.data[0]):
      self.compress()

  def close(self):
    self.compress()
    for i,row in enumerate(self.data):
      key = random.randint(0,2000000000)
      yield key, (row, self.rhs[i])

  def mapper(self,key,value):
    self.collect(key,unpack(value))

  def reducer(self,key,values):
    for value in values: self.mapper(key,value)

if __name__=='__main__':
  mapper = SerialTSQR(blocksize=3,isreducer=False)
  reducer = SerialTSQR(blocksize=3,isreducer=True)
  hadoopy.run(mapper, reducer)
More about how to
compute a regression

  min ‖Ax − b‖²
  = min ‖QRx − b‖²

Orthogonal or "right angle" matrices
don't change vector magnitude:

  = min ‖QᵀQRx − Qᵀb‖²
  = min ‖Rx − Qᵀb‖²

This is a tiny linear system!

def compute_x(output):
  R,y = load_from_hdfs(output)
  x = numpy.linalg.solve(R,y)
  write_output(x,output+'-x')

[Diagram: MapReduce produces R and Qᵀb from A and b via
"QR for Regression".]
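The same solve, end to end at small scale. A sketch with synthetic data: load_from_hdfs and write_output above are the real pipeline's I/O, replaced here by in-memory arrays.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((10000, 50))
b = rng.standard_normal(10000)

# at scale, TSQR produces R and y = Q^T b inside MapReduce
Q, R = np.linalg.qr(A)      # reduced QR: Q is 10000x50, R is 50x50
y = Q.T @ b

x = np.linalg.solve(R, y)   # the tiny triangular solve from the slide
# x matches the direct least-squares solution of min ||Ax - b||^2
```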
We do a similar step for the
PCA and compute the 1000-
by-1000 SVD on one machine




Getting the matrix Q is tricky!




What about the matrix Q?

We want Q to be numerically orthogonal:
  norm( QᵀQ − I )

A condition number measures problem sensitivity.

[Figure: loss of orthogonality vs. condition number (10⁵ to 10²⁰).
Prior work (AR⁻¹) fails as the condition number grows; AR⁻¹ with
iterative refinement and Direct TSQR stay numerically orthogonal.]

Prior methods all failed without any warning.

Constantine & Gleich, MapReduce 2011.
Benson, Gleich, Demmel, Submitted.
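The failure mode can be reproduced at small scale. An assumed illustration (not the paper's experiment): for an ill-conditioned A, the indirect construction Q = AR⁻¹ drifts far from orthogonality, while Householder QR (what Direct TSQR uses on each block) does not.

```python
import numpy as np

rng = np.random.default_rng(3)
# build A with condition number ~1e10 via its SVD
U, _ = np.linalg.qr(rng.standard_normal((1000, 20)))
V, _ = np.linalg.qr(rng.standard_normal((20, 20)))
A = U @ np.diag(np.logspace(0, -10, 20)) @ V.T

R = np.linalg.qr(A, mode='r')
Q_indirect = A @ np.linalg.inv(R)   # the AR^-1 construction
Q_direct, _ = np.linalg.qr(A)       # Householder QR

err_indirect = np.linalg.norm(Q_indirect.T @ Q_indirect - np.eye(20))
err_direct = np.linalg.norm(Q_direct.T @ Q_direct - np.eye(20))
# err_indirect is many orders of magnitude larger than err_direct
```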
Taking care of business by
keeping track of Q

1. Each mapper outputs its local Q and R in separate files
   (Mapper 1: A1 → Q1, R1; …; A4 → Q4, R4).

2. Collect R1..R4 on one node (Task 2); compute the final R
   and the small pieces Q11..Q41 of its Q factor.

3. Distribute the pieces Q11..Q41 and form the true Q
   (Mapper 3): block i of Q is Qi times Qi1.
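A small-scale sketch of the three steps, with numpy standing in for the map and reduce tasks (an assumed illustration, not the mrtsqr code):

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((400, 10))

# 1. each mapper outputs its local Q_i and R_i
QRs = [np.linalg.qr(Ai) for Ai in np.split(A, 4)]

# 2. collect the R_i on one node; QR of the stack gives the final R
#    and the small pieces Q_11..Q_41
Q2, R = np.linalg.qr(np.vstack([Ri for _, Ri in QRs]))
pieces = np.split(Q2, 4)

# 3. distribute the pieces and form the true Q block by block
Q = np.vstack([Qi @ piece for (Qi, _), piece in zip(QRs, pieces)])

# Q is numerically orthogonal and Q @ R reproduces A
```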
Code available from
github.com/arbenson/mrtsqr
…
it isn’t too bad.




Future work … more columns!

With ~3000 columns, one 64MB chunk is a local
QR computation. 

Could “iterate in blocks of 3000” columns to
continue … maybe “efficient” for 10,000 columns

Need different ideas for 100,000 columns
(randomized methods?)




Questions?

www.cs.purdue.edu/~dgleich
@dgleich
dgleich@purdue.edu






 
Anti-differentiating Approximation Algorithms: PageRank and MinCut
Anti-differentiating Approximation Algorithms: PageRank and MinCutAnti-differentiating Approximation Algorithms: PageRank and MinCut
Anti-differentiating Approximation Algorithms: PageRank and MinCutDavid Gleich
 
Tall and Skinny QRs in MapReduce
Tall and Skinny QRs in MapReduceTall and Skinny QRs in MapReduce
Tall and Skinny QRs in MapReduceDavid Gleich
 
Anti-differentiating approximation algorithms: A case study with min-cuts, sp...
Anti-differentiating approximation algorithms: A case study with min-cuts, sp...Anti-differentiating approximation algorithms: A case study with min-cuts, sp...
Anti-differentiating approximation algorithms: A case study with min-cuts, sp...David Gleich
 
Direct tall-and-skinny QR factorizations in MapReduce architectures
Direct tall-and-skinny QR factorizations in MapReduce architecturesDirect tall-and-skinny QR factorizations in MapReduce architectures
Direct tall-and-skinny QR factorizations in MapReduce architecturesDavid Gleich
 
MapReduce Tall-and-skinny QR and applications
MapReduce Tall-and-skinny QR and applicationsMapReduce Tall-and-skinny QR and applications
MapReduce Tall-and-skinny QR and applicationsDavid Gleich
 
A history of PageRank from the numerical computing perspective
A history of PageRank from the numerical computing perspectiveA history of PageRank from the numerical computing perspective
A history of PageRank from the numerical computing perspectiveDavid Gleich
 
Iterative methods for network alignment
Iterative methods for network alignmentIterative methods for network alignment
Iterative methods for network alignmentDavid Gleich
 
Gaps between the theory and practice of large-scale matrix-based network comp...
Gaps between the theory and practice of large-scale matrix-based network comp...Gaps between the theory and practice of large-scale matrix-based network comp...
Gaps between the theory and practice of large-scale matrix-based network comp...David Gleich
 
The power and Arnoldi methods in an algebra of circulants
The power and Arnoldi methods in an algebra of circulantsThe power and Arnoldi methods in an algebra of circulants
The power and Arnoldi methods in an algebra of circulantsDavid Gleich
 
Tall-and-skinny QR factorizations in MapReduce architectures
Tall-and-skinny QR factorizations in MapReduce architecturesTall-and-skinny QR factorizations in MapReduce architectures
Tall-and-skinny QR factorizations in MapReduce architecturesDavid Gleich
 
How does Google Google: A journey into the wondrous mathematics behind your f...
How does Google Google: A journey into the wondrous mathematics behind your f...How does Google Google: A journey into the wondrous mathematics behind your f...
How does Google Google: A journey into the wondrous mathematics behind your f...David Gleich
 
Spacey random walks and higher-order data analysis
Spacey random walks and higher-order data analysisSpacey random walks and higher-order data analysis
Spacey random walks and higher-order data analysisDavid Gleich
 
Relaxation methods for the matrix exponential on large networks
Relaxation methods for the matrix exponential on large networksRelaxation methods for the matrix exponential on large networks
Relaxation methods for the matrix exponential on large networksDavid Gleich
 
A dynamical system for PageRank with time-dependent teleportation
A dynamical system for PageRank with time-dependent teleportationA dynamical system for PageRank with time-dependent teleportation
A dynamical system for PageRank with time-dependent teleportationDavid Gleich
 
Fast relaxation methods for the matrix exponential
Fast relaxation methods for the matrix exponential Fast relaxation methods for the matrix exponential
Fast relaxation methods for the matrix exponential David Gleich
 
Vertex neighborhoods, low conductance cuts, and good seeds for local communit...
Vertex neighborhoods, low conductance cuts, and good seeds for local communit...Vertex neighborhoods, low conductance cuts, and good seeds for local communit...
Vertex neighborhoods, low conductance cuts, and good seeds for local communit...David Gleich
 
Higher-order organization of complex networks
Higher-order organization of complex networksHigher-order organization of complex networks
Higher-order organization of complex networksDavid Gleich
 
MapReduce for scientific simulation analysis
MapReduce for scientific simulation analysisMapReduce for scientific simulation analysis
MapReduce for scientific simulation analysisDavid Gleich
 
Personalized PageRank based community detection
Personalized PageRank based community detectionPersonalized PageRank based community detection
Personalized PageRank based community detectionDavid Gleich
 

Destaque (20)

A multithreaded method for network alignment
A multithreaded method for network alignmentA multithreaded method for network alignment
A multithreaded method for network alignment
 
Anti-differentiating Approximation Algorithms: PageRank and MinCut
Anti-differentiating Approximation Algorithms: PageRank and MinCutAnti-differentiating Approximation Algorithms: PageRank and MinCut
Anti-differentiating Approximation Algorithms: PageRank and MinCut
 
Tall and Skinny QRs in MapReduce
Tall and Skinny QRs in MapReduceTall and Skinny QRs in MapReduce
Tall and Skinny QRs in MapReduce
 
Anti-differentiating approximation algorithms: A case study with min-cuts, sp...
Anti-differentiating approximation algorithms: A case study with min-cuts, sp...Anti-differentiating approximation algorithms: A case study with min-cuts, sp...
Anti-differentiating approximation algorithms: A case study with min-cuts, sp...
 
Direct tall-and-skinny QR factorizations in MapReduce architectures
Direct tall-and-skinny QR factorizations in MapReduce architecturesDirect tall-and-skinny QR factorizations in MapReduce architectures
Direct tall-and-skinny QR factorizations in MapReduce architectures
 
MapReduce Tall-and-skinny QR and applications
MapReduce Tall-and-skinny QR and applicationsMapReduce Tall-and-skinny QR and applications
MapReduce Tall-and-skinny QR and applications
 
A history of PageRank from the numerical computing perspective
A history of PageRank from the numerical computing perspectiveA history of PageRank from the numerical computing perspective
A history of PageRank from the numerical computing perspective
 
Iterative methods for network alignment
Iterative methods for network alignmentIterative methods for network alignment
Iterative methods for network alignment
 
Gaps between the theory and practice of large-scale matrix-based network comp...
Gaps between the theory and practice of large-scale matrix-based network comp...Gaps between the theory and practice of large-scale matrix-based network comp...
Gaps between the theory and practice of large-scale matrix-based network comp...
 
The power and Arnoldi methods in an algebra of circulants
The power and Arnoldi methods in an algebra of circulantsThe power and Arnoldi methods in an algebra of circulants
The power and Arnoldi methods in an algebra of circulants
 
Tall-and-skinny QR factorizations in MapReduce architectures
Tall-and-skinny QR factorizations in MapReduce architecturesTall-and-skinny QR factorizations in MapReduce architectures
Tall-and-skinny QR factorizations in MapReduce architectures
 
How does Google Google: A journey into the wondrous mathematics behind your f...
How does Google Google: A journey into the wondrous mathematics behind your f...How does Google Google: A journey into the wondrous mathematics behind your f...
How does Google Google: A journey into the wondrous mathematics behind your f...
 
Spacey random walks and higher-order data analysis
Spacey random walks and higher-order data analysisSpacey random walks and higher-order data analysis
Spacey random walks and higher-order data analysis
 
Relaxation methods for the matrix exponential on large networks
Relaxation methods for the matrix exponential on large networksRelaxation methods for the matrix exponential on large networks
Relaxation methods for the matrix exponential on large networks
 
A dynamical system for PageRank with time-dependent teleportation
A dynamical system for PageRank with time-dependent teleportationA dynamical system for PageRank with time-dependent teleportation
A dynamical system for PageRank with time-dependent teleportation
 
Fast relaxation methods for the matrix exponential
Fast relaxation methods for the matrix exponential Fast relaxation methods for the matrix exponential
Fast relaxation methods for the matrix exponential
 
Vertex neighborhoods, low conductance cuts, and good seeds for local communit...
Vertex neighborhoods, low conductance cuts, and good seeds for local communit...Vertex neighborhoods, low conductance cuts, and good seeds for local communit...
Vertex neighborhoods, low conductance cuts, and good seeds for local communit...
 
Higher-order organization of complex networks
Higher-order organization of complex networksHigher-order organization of complex networks
Higher-order organization of complex networks
 
MapReduce for scientific simulation analysis
MapReduce for scientific simulation analysisMapReduce for scientific simulation analysis
MapReduce for scientific simulation analysis
 
Personalized PageRank based community detection
Personalized PageRank based community detectionPersonalized PageRank based community detection
Personalized PageRank based community detection
 

Semelhante a What you can do with a tall-and-skinny QR factorization in Hadoop: Principal components and large regressions

Line Detection on the GPU
Line Detection on the GPU Line Detection on the GPU
Line Detection on the GPU Gernot Ziegler
 
Conference_paper.pdf
Conference_paper.pdfConference_paper.pdf
Conference_paper.pdfNarenRajVivek
 
Interpretability of Convolutional Neural Networks - Eva Mohedano - UPC Barcel...
Interpretability of Convolutional Neural Networks - Eva Mohedano - UPC Barcel...Interpretability of Convolutional Neural Networks - Eva Mohedano - UPC Barcel...
Interpretability of Convolutional Neural Networks - Eva Mohedano - UPC Barcel...Universitat Politècnica de Catalunya
 
SIGGRAPH 2018 - Full Rays Ahead! From Raster to Real-Time Raytracing
SIGGRAPH 2018 - Full Rays Ahead! From Raster to Real-Time RaytracingSIGGRAPH 2018 - Full Rays Ahead! From Raster to Real-Time Raytracing
SIGGRAPH 2018 - Full Rays Ahead! From Raster to Real-Time RaytracingElectronic Arts / DICE
 
Algebraic methods for design QC-LDPC codes
Algebraic methods for design QC-LDPC codesAlgebraic methods for design QC-LDPC codes
Algebraic methods for design QC-LDPC codesUsatyuk Vasiliy
 
Easy edd phd talks 28 oct 2008
Easy edd phd talks 28 oct 2008Easy edd phd talks 28 oct 2008
Easy edd phd talks 28 oct 2008Taha Sochi
 
Compression using JPEG
Compression using JPEGCompression using JPEG
Compression using JPEGSabih Hasan
 
Object Segmentation (D2L7 Insight@DCU Machine Learning Workshop 2017)
Object Segmentation (D2L7 Insight@DCU Machine Learning Workshop 2017)Object Segmentation (D2L7 Insight@DCU Machine Learning Workshop 2017)
Object Segmentation (D2L7 Insight@DCU Machine Learning Workshop 2017)Universitat Politècnica de Catalunya
 
H2O.ai's Distributed Deep Learning by Arno Candel 04/03/14
H2O.ai's Distributed Deep Learning by Arno Candel 04/03/14H2O.ai's Distributed Deep Learning by Arno Candel 04/03/14
H2O.ai's Distributed Deep Learning by Arno Candel 04/03/14Sri Ambati
 
Implement principal component analysis (PCA) in python from scratch
Implement principal component analysis (PCA) in python from scratchImplement principal component analysis (PCA) in python from scratch
Implement principal component analysis (PCA) in python from scratchEshanAgarwal4
 
Alberto Massidda - Scenes from a memory - Codemotion Rome 2019
Alberto Massidda - Scenes from a memory - Codemotion Rome 2019Alberto Massidda - Scenes from a memory - Codemotion Rome 2019
Alberto Massidda - Scenes from a memory - Codemotion Rome 2019Codemotion
 
ICIAM 2019: A New Algorithm Model for Massive-Scale Streaming Graph Analysis
ICIAM 2019: A New Algorithm Model for Massive-Scale Streaming Graph AnalysisICIAM 2019: A New Algorithm Model for Massive-Scale Streaming Graph Analysis
ICIAM 2019: A New Algorithm Model for Massive-Scale Streaming Graph AnalysisJason Riedy
 
Deep Generative Models II (DLAI D10L1 2017 UPC Deep Learning for Artificial I...
Deep Generative Models II (DLAI D10L1 2017 UPC Deep Learning for Artificial I...Deep Generative Models II (DLAI D10L1 2017 UPC Deep Learning for Artificial I...
Deep Generative Models II (DLAI D10L1 2017 UPC Deep Learning for Artificial I...Universitat Politècnica de Catalunya
 
Technical aptitude questions
Technical aptitude questionsTechnical aptitude questions
Technical aptitude questionssadiqkhanpathan
 
20190927 generative models_aia
20190927 generative models_aia20190927 generative models_aia
20190927 generative models_aiaYi-Fan Liou
 
Simulation (AMSI Public Lecture)
Simulation (AMSI Public Lecture)Simulation (AMSI Public Lecture)
Simulation (AMSI Public Lecture)Christian Robert
 
Convolutional Neural Networks for Image Classification (Cape Town Deep Learni...
Convolutional Neural Networks for Image Classification (Cape Town Deep Learni...Convolutional Neural Networks for Image Classification (Cape Town Deep Learni...
Convolutional Neural Networks for Image Classification (Cape Town Deep Learni...Alex Conway
 

Semelhante a What you can do with a tall-and-skinny QR factorization in Hadoop: Principal components and large regressions (20)

Line Detection on the GPU
Line Detection on the GPU Line Detection on the GPU
Line Detection on the GPU
 
Conference_paper.pdf
Conference_paper.pdfConference_paper.pdf
Conference_paper.pdf
 
Robust watermarking technique sppt
Robust watermarking technique spptRobust watermarking technique sppt
Robust watermarking technique sppt
 
Interpretability of Convolutional Neural Networks - Eva Mohedano - UPC Barcel...
Interpretability of Convolutional Neural Networks - Eva Mohedano - UPC Barcel...Interpretability of Convolutional Neural Networks - Eva Mohedano - UPC Barcel...
Interpretability of Convolutional Neural Networks - Eva Mohedano - UPC Barcel...
 
Logic presentation
Logic presentationLogic presentation
Logic presentation
 
SIGGRAPH 2018 - Full Rays Ahead! From Raster to Real-Time Raytracing
SIGGRAPH 2018 - Full Rays Ahead! From Raster to Real-Time RaytracingSIGGRAPH 2018 - Full Rays Ahead! From Raster to Real-Time Raytracing
SIGGRAPH 2018 - Full Rays Ahead! From Raster to Real-Time Raytracing
 
Algebraic methods for design QC-LDPC codes
Algebraic methods for design QC-LDPC codesAlgebraic methods for design QC-LDPC codes
Algebraic methods for design QC-LDPC codes
 
Easy edd phd talks 28 oct 2008
Easy edd phd talks 28 oct 2008Easy edd phd talks 28 oct 2008
Easy edd phd talks 28 oct 2008
 
Compression using JPEG
Compression using JPEGCompression using JPEG
Compression using JPEG
 
Object Segmentation (D2L7 Insight@DCU Machine Learning Workshop 2017)
Object Segmentation (D2L7 Insight@DCU Machine Learning Workshop 2017)Object Segmentation (D2L7 Insight@DCU Machine Learning Workshop 2017)
Object Segmentation (D2L7 Insight@DCU Machine Learning Workshop 2017)
 
H2O.ai's Distributed Deep Learning by Arno Candel 04/03/14
H2O.ai's Distributed Deep Learning by Arno Candel 04/03/14H2O.ai's Distributed Deep Learning by Arno Candel 04/03/14
H2O.ai's Distributed Deep Learning by Arno Candel 04/03/14
 
Implement principal component analysis (PCA) in python from scratch
Implement principal component analysis (PCA) in python from scratchImplement principal component analysis (PCA) in python from scratch
Implement principal component analysis (PCA) in python from scratch
 
Alberto Massidda - Scenes from a memory - Codemotion Rome 2019
Alberto Massidda - Scenes from a memory - Codemotion Rome 2019Alberto Massidda - Scenes from a memory - Codemotion Rome 2019
Alberto Massidda - Scenes from a memory - Codemotion Rome 2019
 
ICIAM 2019: A New Algorithm Model for Massive-Scale Streaming Graph Analysis
ICIAM 2019: A New Algorithm Model for Massive-Scale Streaming Graph AnalysisICIAM 2019: A New Algorithm Model for Massive-Scale Streaming Graph Analysis
ICIAM 2019: A New Algorithm Model for Massive-Scale Streaming Graph Analysis
 
Deep Generative Models II (DLAI D10L1 2017 UPC Deep Learning for Artificial I...
Deep Generative Models II (DLAI D10L1 2017 UPC Deep Learning for Artificial I...Deep Generative Models II (DLAI D10L1 2017 UPC Deep Learning for Artificial I...
Deep Generative Models II (DLAI D10L1 2017 UPC Deep Learning for Artificial I...
 
Technical aptitude questions
Technical aptitude questionsTechnical aptitude questions
Technical aptitude questions
 
20190927 generative models_aia
20190927 generative models_aia20190927 generative models_aia
20190927 generative models_aia
 
Simulation (AMSI Public Lecture)
Simulation (AMSI Public Lecture)Simulation (AMSI Public Lecture)
Simulation (AMSI Public Lecture)
 
ICRA Nathan Piasco
ICRA Nathan PiascoICRA Nathan Piasco
ICRA Nathan Piasco
 
Convolutional Neural Networks for Image Classification (Cape Town Deep Learni...
Convolutional Neural Networks for Image Classification (Cape Town Deep Learni...Convolutional Neural Networks for Image Classification (Cape Town Deep Learni...
Convolutional Neural Networks for Image Classification (Cape Town Deep Learni...
 

Mais de David Gleich

Engineering Data Science Objectives for Social Network Analysis
Engineering Data Science Objectives for Social Network AnalysisEngineering Data Science Objectives for Social Network Analysis
Engineering Data Science Objectives for Social Network AnalysisDavid Gleich
 
Correlation clustering and community detection in graphs and networks
Correlation clustering and community detection in graphs and networksCorrelation clustering and community detection in graphs and networks
Correlation clustering and community detection in graphs and networksDavid Gleich
 
Spectral clustering with motifs and higher-order structures
Spectral clustering with motifs and higher-order structuresSpectral clustering with motifs and higher-order structures
Spectral clustering with motifs and higher-order structuresDavid Gleich
 
Using Local Spectral Methods to Robustify Graph-Based Learning
Using Local Spectral Methods to Robustify Graph-Based LearningUsing Local Spectral Methods to Robustify Graph-Based Learning
Using Local Spectral Methods to Robustify Graph-Based LearningDavid Gleich
 
Spacey random walks and higher order Markov chains
Spacey random walks and higher order Markov chainsSpacey random walks and higher order Markov chains
Spacey random walks and higher order Markov chainsDavid Gleich
 
Localized methods in graph mining
Localized methods in graph miningLocalized methods in graph mining
Localized methods in graph miningDavid Gleich
 
PageRank Centrality of dynamic graph structures
PageRank Centrality of dynamic graph structuresPageRank Centrality of dynamic graph structures
PageRank Centrality of dynamic graph structuresDavid Gleich
 
Localized methods for diffusions in large graphs
Localized methods for diffusions in large graphsLocalized methods for diffusions in large graphs
Localized methods for diffusions in large graphsDavid Gleich
 
Fast matrix primitives for ranking, link-prediction and more
Fast matrix primitives for ranking, link-prediction and moreFast matrix primitives for ranking, link-prediction and more
Fast matrix primitives for ranking, link-prediction and moreDavid Gleich
 
Recommendation and graph algorithms in Hadoop and SQL
Recommendation and graph algorithms in Hadoop and SQLRecommendation and graph algorithms in Hadoop and SQL
Recommendation and graph algorithms in Hadoop and SQLDavid Gleich
 
Sparse matrix computations in MapReduce
Sparse matrix computations in MapReduceSparse matrix computations in MapReduce
Sparse matrix computations in MapReduceDavid Gleich
 
Matrix methods for Hadoop
Matrix methods for HadoopMatrix methods for Hadoop
Matrix methods for HadoopDavid Gleich
 

Mais de David Gleich (12)

Engineering Data Science Objectives for Social Network Analysis
Engineering Data Science Objectives for Social Network AnalysisEngineering Data Science Objectives for Social Network Analysis
Engineering Data Science Objectives for Social Network Analysis
 
Correlation clustering and community detection in graphs and networks
Correlation clustering and community detection in graphs and networksCorrelation clustering and community detection in graphs and networks
Correlation clustering and community detection in graphs and networks
 
Spectral clustering with motifs and higher-order structures
Spectral clustering with motifs and higher-order structuresSpectral clustering with motifs and higher-order structures
Spectral clustering with motifs and higher-order structures
 
Using Local Spectral Methods to Robustify Graph-Based Learning
Using Local Spectral Methods to Robustify Graph-Based LearningUsing Local Spectral Methods to Robustify Graph-Based Learning
Using Local Spectral Methods to Robustify Graph-Based Learning
 
Spacey random walks and higher order Markov chains
Spacey random walks and higher order Markov chainsSpacey random walks and higher order Markov chains
Spacey random walks and higher order Markov chains
 
Localized methods in graph mining
Localized methods in graph miningLocalized methods in graph mining
Localized methods in graph mining
 
PageRank Centrality of dynamic graph structures
PageRank Centrality of dynamic graph structuresPageRank Centrality of dynamic graph structures
PageRank Centrality of dynamic graph structures
 
Localized methods for diffusions in large graphs
Localized methods for diffusions in large graphsLocalized methods for diffusions in large graphs
Localized methods for diffusions in large graphs
 
Fast matrix primitives for ranking, link-prediction and more
Fast matrix primitives for ranking, link-prediction and moreFast matrix primitives for ranking, link-prediction and more
Fast matrix primitives for ranking, link-prediction and more
 
Recommendation and graph algorithms in Hadoop and SQL
Recommendation and graph algorithms in Hadoop and SQLRecommendation and graph algorithms in Hadoop and SQL
Recommendation and graph algorithms in Hadoop and SQL
 
Sparse matrix computations in MapReduce
Sparse matrix computations in MapReduceSparse matrix computations in MapReduce
Sparse matrix computations in MapReduce
 
Matrix methods for Hadoop
Matrix methods for HadoopMatrix methods for Hadoop
Matrix methods for Hadoop
 

Último

Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 

Último (20)

Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf

What you can do with a tall-and-skinny QR factorization in Hadoop: Principal components and large regressions

  • 1. What you can do with a Tall-and-Skinny QR Factorization on Hadoop: Large regressions, Principal Components. Slides: bit.ly/16LS8Vk · @dgleich · Code: github.com/dgleich/mrtsqr · dgleich@purdue.edu. David F. Gleich, Assistant Professor, Computer Science, Purdue University. David Gleich · Purdue · bit.ly/16LS8Vk
  • 2. Why you should stay … you like advanced machine learning techniques; you want to understand how to compute the singular values and vectors of a huge matrix (that’s tall and skinny); you want to learn about large-scale regression and principal components from a matrix perspective.
  • 3. What I’m going to assume you know: MapReduce, Python, some simple matrix manipulation.
  • 4. Tall-and-Skinny matrices (m ≫ n): many rows (like a billion), a few columns (under 10,000). Used in regression and general linear models with many samples, block iterative methods, panel factorizations, approximate kernel k-means, and big-data SVD/PCA. (Example image from the tinyimages collection.)
  • 5. If you have tons of small records, then there is probably a tall-and-skinny matrix somewhere.
  • 6. Tall-and-skinny matrices are common in BigData. A : m × n, m ≫ n, stored as key–value pairs: the key is an arbitrary row-id, the value is the 1 × n array for a row. Each submatrix Ai is the input to a map task.
  • 7. PCA of 80,000,000 images (1000 pixels each). [Figure: fraction of variance captured versus number of principal components; the first 16 columns of V shown as images — “the 16 most important component basis functions.”] Constantine & Gleich, MapReduce 2011.
  • 8. Regression with 80,000,000 images. We predict the sum of red-pixel values in each image as a linear combination of the gray values in each image. Formally, if ri is the sum of the red components in all pixels of image i, and Gi,j is the gray value of the jth pixel in image i, then we want to find min Σi (ri − Σj Gi,j sj)². There is no particular importance to this regression problem; we use it merely as a demonstration. The coefficients sj are displayed as an image at the right. They reveal which regions of the image are important in determining the overall red value of a picture from the grayscale component only. The color scale varies from light-blue (strongly negative) through white (0) to red (strongly positive), measuring how much “redness” each pixel contributes to the whole. The computation took 30 minutes using the Dumbo framework and a two-iteration job with 250 intermediate reducers. We also solved a principal component problem to find a principal component basis for each image. Let G be the matrix of Gi,j’s from the regression and let ui be the mean of the ith …
  • 9. Let’s talk about QR!
  • 10. QR Factorization and the Gram–Schmidt process. Consider a set of vectors v1 to vn. Set u1 to be v1. Create a new vector u2 by removing any “component” of u1 from v2. Create a new vector u3 by removing any “component” of u1 and u2 from v3. … This is the “Gram–Schmidt process” (figure from Wikipedia).
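The steps above translate directly into a few lines of NumPy (a minimal classical Gram–Schmidt sketch for illustration; this is not the talk’s Hadoop code):

```python
import numpy

def gram_schmidt(V):
    """Orthogonalize the columns of V, returning Q with orthonormal
    columns and upper-triangular R such that V = Q @ R."""
    m, n = V.shape
    Q = numpy.zeros((m, n))
    R = numpy.zeros((n, n))
    for j in range(n):
        u = V[:, j].copy()
        for i in range(j):
            # remove the "component" of u_i from v_j
            R[i, j] = numpy.dot(Q[:, i], V[:, j])
            u -= R[i, j] * Q[:, i]
        R[j, j] = numpy.linalg.norm(u)
        Q[:, j] = u / R[j, j]
    return Q, R
```

In practice one uses a Householder-based QR (as `numpy.linalg.qr` does) for numerical stability, but the classical process above is the clearest way to see where the triangular R comes from.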
  • 11. QR Factorization and the Gram–Schmidt process. v1 = a1 u1; v2 = b1 u1 + b2 u2; v3 = c1 u1 + c2 u2 + c3 u3. In matrix form, [v1 v2 v3 …] = [u1 u2 u3 …] R, where R is the upper-triangular matrix whose columns are (a1, 0, 0, …), (b1, b2, 0, …), (c1, c2, c3, …), ….
  • 12. QR Factorization and the Gram–Schmidt process. v1 = a1 u1; v2 = b1 u1 + b2 u2; v3 = c1 u1 + c2 u2 + c3 u3. For this problem, V = UR, where all vectors in U are at right angles, i.e. they are decoupled. What it’s usually written as by others: A = QR.
  • 13. QR Factorization and the Gram–Schmidt process. v1 = a1 u1; v2 = b1 u1 + b2 u2; v3 = c1 u1 + c2 u2 + c3 u3. [Pictorially: the tall matrix A equals a tall matrix Q times a small upper-triangular R.] All vectors in Q are at right angles, i.e. they are decoupled.
  • 14. PCA of 80,000,000 images (1000 pixels). Pipeline: zero-mean the rows → A → TSQR → R → SVD of R → ΣV (principal components); keep the top 100 singular values, and show the first 16 columns of V as images. The TSQR runs in MapReduce; the small SVD is post-processing. Constantine & Gleich, MapReduce 2010.
  • 15. Input: 500,000,000-by-100 matrix; each record is a 1-by-100 row; HDFS size 423.3 GB. Time to compute colsum(A): 161 sec. Time to compute R in qr(A): 387 sec.
  • 16. The rest of the talk! Full TSQR code in hadoopy:

    import random, numpy, hadoopy
    class SerialTSQR:
        def __init__(self,blocksize,isreducer):
            self.bsize=blocksize
            self.data = []
            if isreducer: self.__call__ = self.reducer
            else: self.__call__ = self.mapper
        def compress(self):
            R = numpy.linalg.qr(
                numpy.array(self.data),'r')
            # reset data and re-initialize to R
            self.data = []
            for row in R:
                self.data.append([float(v) for v in row])
        def collect(self,key,value):
            self.data.append(value)
            if len(self.data)>self.bsize*len(self.data[0]):
                self.compress()
        def close(self):
            self.compress()
            for row in self.data:
                key = random.randint(0,2000000000)
                yield key, row
        def mapper(self,key,value):
            self.collect(key,value)
        def reducer(self,key,values):
            for value in values: self.mapper(key,value)
    if __name__=='__main__':
        mapper = SerialTSQR(blocksize=3,isreducer=False)
        reducer = SerialTSQR(blocksize=3,isreducer=True)
        hadoopy.run(mapper, reducer)
  • 17. Communication-avoiding QR (Demmel et al. 2008) on MapReduce (Constantine and Gleich, 2010). Data: rows of a matrix. Map: QR factorization of rows. Reduce: QR factorization of rows. [Diagram: mapper 1 runs a serial TSQR over blocks A1…A4 (compressing Q2 R2, Q3 R3, Q4 R4 along the way) and emits R4; mapper 2 does the same over A5…A8 and emits R8. Reducer 1 runs a serial TSQR on R4 and R8 and emits the final Q, R.]
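The algebraic fact behind this tree is that the R factor of A equals (up to row signs) the R factor of the stacked R factors of A’s row blocks. A small NumPy simulation of one map/reduce round (an illustration of the idea, not the Hadoop code; the block assignment to “mappers” is arbitrary):

```python
import numpy

def serial_tsqr(blocks):
    """One serial TSQR pass: stack the blocks and return only R."""
    return numpy.linalg.qr(numpy.vstack(blocks), 'r')

def tsqr_tree(A, nmappers=2, nblocks=8):
    """Simulate the map/reduce tree: each 'mapper' reduces its row
    blocks to a small R; one 'reducer' combines the emitted Rs."""
    rows = numpy.array_split(A, nblocks)
    mapper_Rs = [serial_tsqr(rows[i::nmappers]) for i in range(nmappers)]
    return serial_tsqr(mapper_Rs)
```

Because R is unique only up to the sign of each row, comparing |R| from the tree against |R| from a direct factorization shows the two agree.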
  • 18. The rest of the talk! Full TSQR code in hadoopy (this slide repeats the code from slide 16).
  • 19. Too many maps cause too much data to go to one reducer! Each image is 5k; each HDFS block holds 12,800 images; 6,250 total blocks. Each map outputs a 1000-by-1000 matrix, so one reducer gets a 6.25M-by-1000 matrix (50 GB).
  • 20. [Diagram: two-iteration TSQR. Iteration 1: mappers 1-1 through 1-4 each run serial TSQR on their row blocks of A and emit R1…R4; after a shuffle, reducers 1-1 through 1-3 run serial TSQR and emit R2,1…R2,3. Iteration 2: an identity map shuffles the R2,i to a single reducer 2-1, which runs serial TSQR and emits the final R.]
  • 21. Input: 500,000,000-by-100 matrix; each record is a 1-by-100 row; HDFS size 423.3 GB. Time to compute colsum(A): 161 sec. Time to compute R in qr(A): 387 sec.
  • 22. Hadoop streaming isn’t always slow! Synthetic data test on a 100,000,000-by-500 matrix (~500 GB). Codes implemented in MapReduce streaming; matrix stored as TypedBytes lists of doubles; the Python frameworks use a Numpy+ATLAS matrix; the C++ code uses a custom TypedBytes reader/writer with an ATLAS matrix. Iter 1 / Iter 2 / Overall total (secs.): Dumbo 960 / 217 / 1177; Hadoopy 612 / 118 / 730; C++ 350 / 37 / 387; Java 436 / 66 / 502.
  • 23. Use multiple iterations for problems with many columns. (Cols., Iters., Split (MB), Maps, Secs.): 50, 1, 64, 8000, 388; 50, 1, 256, 2000, 184; 50, 1, 512, 1000, 149; 50, 2, 64, 8000, 425; 50, 2, 256, 2000, 220; 50, 2, 512, 1000, 191; 1000, 1, 512, 1000, 666; 1000, 2, 64, 6000, 590; 1000, 2, 256, 2000, 432; 1000, 2, 512, 1000, 337. Increasing split size improves performance (accounts for Hadoop data movement). Increasing iterations helps for problems with many columns. (1000 columns with a 64-MB split size overloaded the single reducer.)
  • 24. More about how to compute a regression. min ‖Ax − b‖² = min Σi (Σj Aij xj − bi)². Carry the right-hand side through the mapper’s serial TSQR: factor the local block A1 = Q2 R2 and replace b1 with b2 = Q2ᵀ b1, so each mapper emits its R factor together with the transformed right-hand side.
  • 25. TSQR code in hadoopy for regressions:

    import random, numpy, hadoopy
    class SerialTSQR:
        def __init__(self,blocksize,isreducer):
            […]
        def compress(self):
            Q,R = numpy.linalg.qr(
                numpy.array(self.data), 'full')
            # reset data and re-initialize to R
            self.data = []
            for row in R:
                self.data.append([float(v) for v in row])
            self.rhs = list(
                numpy.dot(Q.T, numpy.array(self.rhs)))
        def collect(self,key,valuerhs):
            self.data.append(valuerhs[0])
            self.rhs.append(valuerhs[1])
            if len(self.data)>self.bsize*len(self.data[0]):
                self.compress()
        def close(self):
            self.compress()
            for i,row in enumerate(self.data):
                key = random.randint(0,2000000000)
                yield key, (row, self.rhs[i])
        def mapper(self,key,value):
            self.collect(key,unpack(value))
        def reducer(self,key,values):
            for value in values:
                self.mapper(key,value)
    if __name__=='__main__':
        mapper = SerialTSQR(blocksize=3,isreducer=False)
        reducer = SerialTSQR(blocksize=3,isreducer=True)
        hadoopy.run(mapper, reducer)
  • 26. More about how to compute a regression. min ‖Ax − b‖² = min ‖QRx − b‖². Orthogonal or “right angle” matrices don’t change vector magnitude, so this = min ‖QᵀQRx − Qᵀb‖² = min ‖Rx − Qᵀb‖². This is a tiny linear system! QR for regression:

    def compute_x(output):
        R,y = load_from_hdfs(output)
        x = numpy.linalg.solve(R,y)
        write_output(x,output+'-x')
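The whole pipeline collapses to this final local step. On a small in-memory example, the same reduction looks like the sketch below (illustrative only; `load_from_hdfs`/`write_output` on the slide are the talk’s helpers and are not used here):

```python
import numpy

def solve_via_qr(A, b):
    """Least squares via QR: min ||Ax - b|| reduces to Rx = Q^T b."""
    Q, R = numpy.linalg.qr(A)        # reduced QR: A (m x n) -> Q (m x n), R (n x n)
    y = Q.T @ b                      # the small transformed right-hand side
    return numpy.linalg.solve(R, y)  # tiny n x n triangular system
```

The expensive part is forming R and Qᵀb, which is exactly what the MapReduce TSQR computes at scale; the final solve is trivial on one machine.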
  • 27. We do a similar step for the PCA and compute the 1000-by-1000 SVD on one machine.
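This works because A = QR implies the singular values of A equal those of the small n-by-n R, and the right singular vectors V come from R’s SVD alone. A sketch of that post-processing step (assuming R has already been produced by the TSQR job):

```python
import numpy

def svd_from_r(R):
    """Given the n x n R factor of a tall A = QR, recover the singular
    values and right singular vectors V of A from R alone."""
    Ur, sigma, Vt = numpy.linalg.svd(R)
    return sigma, Vt.T
```

For the image PCA, R is only 1000-by-1000, so this SVD takes seconds on a single machine even though A has 80,000,000 rows.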
  • 28. Getting the matrix Q is tricky!
  • 29. What about the matrix Q? We want Q to be numerically orthogonal, measured by norm(QᵀQ − I); a condition number measures problem sensitivity. Prior work: AR⁻¹ (Constantine & Gleich, MapReduce 2011) and AR⁻¹ + iterative refinement; Direct TSQR (Benson, Gleich, Demmel, submitted). [Plot: as the condition number grows from 10⁵ to 10²⁰, the prior methods all failed without any warning.]
  • 30. Taking care of business by keeping track of Q. 1. Each mapper outputs its local Q and R in separate files (Ai = Qi Ri). 2. Collect the R factors on one node and factor the stack, [R1; R2; R3; R4] = [Q11; Q21; Q31; Q41] R, computing the final R and a small Q piece for each block. 3. Distribute the pieces Qi1 and form the true Q: the ith block of Q is Qi Qi1.
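The three steps can be simulated locally in NumPy (a sketch of the Direct TSQR idea; the real version runs step 1 as mappers, step 2 on one node, and step 3 as a second MapReduce pass):

```python
import numpy

def direct_tsqr(blocks):
    """Direct TSQR: return the blocks of a numerically orthogonal Q and
    the final R for the matrix formed by stacking `blocks`."""
    # 1. local QR of each row block: A_i = Q_i R_i
    local = [numpy.linalg.qr(Ai) for Ai in blocks]
    # 2. gather the small R_i on one node and factor the stack:
    #    [R_1; ...; R_p] = [Q_{1,1}; ...; Q_{p,1}] R
    Rstack = numpy.vstack([Ri for _, Ri in local])
    Q2, R = numpy.linalg.qr(Rstack)
    Q2blocks = numpy.split(Q2, len(blocks))
    # 3. distribute: the ith block of the true Q is Q_i @ Q_{i,1}
    return [Qi @ Q2i for (Qi, _), Q2i in zip(local, Q2blocks)], R
```

Stacking the returned blocks gives a Q with QᵀQ = I to machine precision, which is what the slide means by keeping track of Q.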
  • 31. Code available from github.com/arbenson/mrtsqr … it isn’t too bad.
  • 32. Future work … more columns! With ~3000 columns, one 64 MB chunk is a local QR computation. Could “iterate in blocks of 3000” columns to continue … maybe “efficient” for 10,000 columns. Need different ideas for 100,000 columns (randomized methods?).
  • 33. Questions? www.cs.purdue.edu/~dgleich · @dgleich · dgleich@purdue.edu

Editor’s Notes

  1. I think this took 30 minutes using our slowest codes. Our fastest codes should take it down to about 3-4 minutes. You’ll probably wait longer to get your job scheduled.
  2. I think this took 30 minutes using our slowest codes. Our fastest codes should take it down to about 3-4 minutes. You’ll probably wait longer to get your job scheduled.