Introduction to SPU Optimizations
                       Part 2: An Optimization Example



                              Pål-Kristian Engstad
                         pal engstad@naughtydog.com




                                      March 5, 2010



Pål-Kristian Engstad   pal engstad@naughtydog.com   Introduction to SPU Optimizations
Introduction




    As seen in part 1 of these slides, the SPUs have a very powerful instruction set,
    but how can we utilize this power efficiently?

    Solution:

        Analyze the problem.
        Get C/C++ code working.
        Optimize!




Analyzing for the SPU




        Analyze memory requirements.
        Partition your problem and data into sub-problems and sub-data until they
        fit on a single SPU.
        Re-arrange your data to facilitate a streaming model.

    SPUs only have 256 kB local memory, so in most cases you need to partition
    your input data to fit. Sometimes, this can be hard. If your input data is a
    “spaghetti-tree” with pointers to pointers, etc., then your task will be very hard.

    The SPU wants:

        Data in contiguous memory chunks, preferably each larger than 128 bytes.
        Data aligned to 128 byte boundaries, or at least to 16 byte boundaries.
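
As a rough illustration of the partitioning and alignment points above, the following sketch estimates how many elements fit in one streamed chunk and checks pointer alignment. The helper names, the 64 kB reserve for code and stack, and the idea of splitting the remainder across several buffers are assumptions made for this example, not figures from the slides.

```cpp
#include <cstddef>
#include <cstdint>

// 256 kB of local store; the reserve for code, stack and other buffers
// is an illustrative guess, not a measured number.
constexpr std::size_t kLocalStoreBytes = 256 * 1024;
constexpr std::size_t kReservedBytes   = 64 * 1024;

// True if p is aligned to `alignment` bytes (alignment must be a power of two).
inline bool isAligned(const void* p, std::size_t alignment) {
    return (reinterpret_cast<std::uintptr_t>(p) & (alignment - 1)) == 0;
}

// Elements per chunk when `buffers` equally sized chunks must fit at once
// (for example 2 for double buffering). Each chunk is rounded down to a
// whole number of 128-byte blocks.
inline std::size_t elemsPerChunk(std::size_t elemSize, std::size_t buffers) {
    std::size_t bytes = (kLocalStoreBytes - kReservedBytes) / buffers;
    bytes &= ~static_cast<std::size_t>(127);
    return bytes / elemSize;
}

// Number of chunks needed to stream `count` elements.
inline std::size_t chunkCount(std::size_t count, std::size_t perChunk) {
    return (count + perChunk - 1) / perChunk;
}
```

With a 48-byte element and two buffers, `elemsPerChunk(48, 2)` gives 2048 elements per chunk under these assumptions, so 10000 elements would be streamed in 5 chunks.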




Optimize for the SPU


    First things first. We have to:

        Get (C/C++) code working on a single SPU using synchronous memory
        transfers.
        Enable asynchronous memory transfers.
        Enable asynchronous execution (multiple SPUs).

    When everything works, optimize:

        Parallelize inner loops, using intrinsics.
        Translate inner loops to assembly.
        Employ software pipelining.
        Optimize setup, prologue and epilogue code.

    Also test your loops on known input data to make sure they still work.



Optimizing Example - Lighting Calculation




    Given a vertex position $P_v$ and a light position $P_l$, we calculate the light
    direction,
    $$\hat{l} = \frac{P_l - P_v}{\lVert P_l - P_v \rVert} = \frac{L}{\lVert L \rVert}, \tag{1}$$
    where we use the distance to the light $r = \lVert L \rVert = \sqrt{L \cdot L}$. The attenuation
    factor is a function of the distance to the light, using two constants $\alpha_1$ and $\alpha_2$,
    $$\alpha(r) = \mathrm{clamp}\left(\frac{\alpha_1}{r^2} - \alpha_2,\ 0,\ 1\right), \tag{2}$$
    and then finally get the color at that vertex $V_c$ through
    $$V_c = \alpha(r)\, \max(0,\ \hat{n}_v \cdot \hat{l})\, L_c, \tag{3}$$
    where $L_c$ is the $(r, g, b)$-color of the light source, and $\hat{n}_v$ is the vertex normal.








    // Cg-function:
    float3 lcalc(float3 vpos, float3 vnrm,
                 float att0, float att1,
                 float3 lpos, float3 lclr)
    {
       float3 L = lpos - vpos;
       float r = sqrt(dot(L, L));
       float3 ldir = L / r;
       float at = saturate(att0 / (r*r) - att1);
       return at * max(0, dot(vnrm, ldir)) * lclr;
    }








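    A trivial re-arrangement that avoids both divides is to notice that, with $d = L \cdot L$,
    $$\frac{1}{r} = \frac{1}{\sqrt{d}} = \mathrm{rsqrt}(d), \tag{4}$$
    and
    $$\frac{1}{r^2} = \frac{1}{(\sqrt{d})^2} = \left(\frac{1}{\sqrt{d}}\right)^2 = (\mathrm{rsqrt}(d))^2, \tag{5}$$
    thus: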








    // Core function
    float3 lcalc(float3 vpos, float3 vnrm,
                 float att0, float att1,
                 float3 lpos, float3 lclr)
    {
       float3 L = lpos - vpos;
       float d = dot(L, L);
       float irs = rsqrt(d);
       float3 ldir = irs * L;   // scalar * vector
       float at = min(att0 * irs * irs - att1, 1);
       return max(0, at) * max(0, dot(vnrm, ldir)) * lclr;
    }
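
To check that the re-arrangement does not change the result, here is a minimal scalar sketch in portable C++ that implements both forms and can be compared on known input. The `Vec3` type and the helper names are stand-ins invented for this example; they are not the types used on the slides.

```cpp
#include <algorithm>
#include <cmath>

struct Vec3 { float x, y, z; };

inline Vec3  sub(Vec3 a, Vec3 b)     { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
inline float dot(Vec3 a, Vec3 b)     { return a.x * b.x + a.y * b.y + a.z * b.z; }
inline Vec3  scale(float k, Vec3 a)  { return {k * a.x, k * a.y, k * a.z}; }

// Straightforward form: one square root and two divides.
inline Vec3 lcalcDivide(Vec3 vpos, Vec3 vnrm, float att0, float att1,
                        Vec3 lpos, Vec3 lclr) {
    Vec3  L    = sub(lpos, vpos);
    float r    = std::sqrt(dot(L, L));
    Vec3  ldir = scale(1.0f / r, L);
    float at   = std::min(std::max(att0 / (r * r) - att1, 0.0f), 1.0f);
    return scale(at * std::max(0.0f, dot(vnrm, ldir)), lclr);
}

// Re-arranged form: 1/r and 1/r^2 both come from one reciprocal square root.
inline Vec3 lcalcRsqrt(Vec3 vpos, Vec3 vnrm, float att0, float att1,
                       Vec3 lpos, Vec3 lclr) {
    Vec3  L    = sub(lpos, vpos);
    float irs  = 1.0f / std::sqrt(dot(L, L));
    Vec3  ldir = scale(irs, L);
    float at   = std::min(att0 * irs * irs - att1, 1.0f);
    return scale(std::max(0.0f, at) * std::max(0.0f, dot(vnrm, ldir)), lclr);
}
```

The two functions should agree up to floating-point rounding; a comparison with a small tolerance is the kind of known-input test recommended earlier.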








    We want to use the lcalc function to accumulate the light from several light
    sources. So, our algorithm is:

    foreach vertex do
      vertex.clr = float3(0,0,0);
      foreach light do
        vertex.clr += lcalc(vertex.pos, vertex.nrm,
                            light.att0, light.att1, light.pos, light.clr)
      endfor
    endfor







    Let’s see how well GCC can do. We’ll define some structures:

    #define ALIGN(x) __attribute__((aligned(x)))
    #define INLINE   inline __attribute__((always_inline))
    #define VECRET   /* __attribute__((vecreturn)) PPU only */
    struct float3 { float x, y, z; } VECRET;
    struct light { float3 pos; float att0;
                    float3 clr; float att1; } ALIGN(16);
    struct vertex { float3 pos; float dummy1;
                    float3 nrm; float dummy2;
                    float3 clr; float dummy3; } ALIGN(16);
    INLINE float3 lcalc(vertex &v, light& l) // For convenience.
      { return lcalc(v.pos, v.nrm, l.att0, l.att1, l.pos, l.clr); }







    Some math functions:

    #include <math.h>   // sqrtf, used by rsqrt below
    INLINE float3 mkvec3(float x, float y, float z)
       { float3 res = { x, y, z }; return res; }
    INLINE float dot(float3 a, float3 b)
       { return a.x*b.x + a.y*b.y + a.z*b.z; }
    INLINE float3 operator-(float3 a, float3 b)
       { return mkvec3(a.x-b.x, a.y-b.y, a.z-b.z); }
    INLINE void operator+=(float3& a, float3 b)
      { a.x += b.x; a.y += b.y; a.z += b.z; }
    INLINE float3 operator*(float k, float3 b)
       { return mkvec3(k*b.x, k*b.y, k*b.z); }
    INLINE float rsqrt(float x)        { return 1/sqrtf(x); }
    INLINE float max(float a, float b) { return (a > b) ? a : b; }
    INLINE float min(float a, float b) { return (a < b) ? a : b; }







    Our loop now becomes:

    void accLights(int nLights, light *light,
                   int nVerts, vertex *vtx)
    {
      for (int v = 0; v < nVerts; v++)
        for (int l = 0; l < nLights; l++)
          if (l == 0)
             vtx[v].clr = lcalc(vtx[v], light[l]);
          else
             vtx[v].clr += lcalc(vtx[v], light[l]);
    }

    Results: 231.6 cycles/vtx/light.








    Where does GCC go wrong?

        It inlines all code. [Good.]
        Cycles per instruction = 2.07.
        Lots of stalls, 50.7% due to dependencies.
        No vectorization!

    Let’s try to help the compiler.





    Our loop now becomes:

    void accLights(int nLights, light *light,
                   int nVerts, vertex *vtx) {
      if (nLights > 0)   // first light initializes clr; nVerts assumed a multiple of 4
        for (int v = 0; v < nVerts; v+=4)
          for (int i = 0; i < 4; i++)
            vtx[v+i].clr = lcalc(vtx[v+i], light[0]);
      for (int l = 1; l < nLights; l++)
        for (int v = 0; v < nVerts; v+=4)
          for (int i = 0; i < 4; i++)
            vtx[v+i].clr += lcalc(vtx[v+i], light[l]);
    }

    Results: 208.5 cycles/vtx/light.







    Let’s vectorize by hand. We will reorganize our vertexes into groups of 4,
    arranged such that the 4 x-values are located in a single vector register:

    typedef vector float real; // 4 real numbers
    struct vec3 {
       real xs, ys, zs;
       vec3() {}
       vec3(real x, real y, real z) : xs(x), ys(y), zs(z) {}
    };
    struct vertex4 { vec3 pos, nrm, clr; };

    Think of a hardware vector register as four real values, not as a vector. A vec3
    is four (3-)vectors, and a vertex4 is four vertexes.
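
The data re-arrangement itself can be sketched in portable C++. The example below packs four array-of-structures vertexes into one structure-of-arrays group. The `real4`, `vec3x4` and `vertex4p` types are hypothetical stand-ins for `vector float`, `vec3` and `vertex4`, so that the sketch compiles without the SPU toolchain; a real implementation would more likely do this once, offline or with shuffles.

```cpp
// Scalar vertex layout (array of structures), padded to 16 bytes per field.
struct float3 { float x, y, z; };
struct vertex {
    float3 pos; float dummy1;
    float3 nrm; float dummy2;
    float3 clr; float dummy3;
};

// Portable stand-ins for the vector types: four lanes per register.
struct real4    { float lane[4]; };
struct vec3x4   { real4 xs, ys, zs; };        // four 3-vectors
struct vertex4p { vec3x4 pos, nrm, clr; };    // four vertexes

// Write one 3-vector into lane `i` of a four-wide group.
inline void setLane(vec3x4& dst, const float3& src, int i) {
    dst.xs.lane[i] = src.x;
    dst.ys.lane[i] = src.y;
    dst.zs.lane[i] = src.z;
}

// Pack four consecutive vertexes so that all x values share a register,
// all y values share another, and so on.
inline vertex4p pack4(const vertex* v) {
    vertex4p out = {};
    for (int i = 0; i < 4; ++i) {
        setLane(out.pos, v[i].pos, i);
        setLane(out.nrm, v[i].nrm, i);
        setLane(out.clr, v[i].clr, i);
    }
    return out;
}
```

After packing, `out.pos.xs` holds the x coordinates of vertexes 0 to 3, which is the layout the vectorized loop expects.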







    As before, we need to define addition, subtraction, dot-product, scalar
    multiplication, and a couple of other functions:

    INLINE vec3 operator+(vec3 a, vec3 b)
     { return vec3(a.xs + b.xs, a.ys + b.ys, a.zs + b.zs); }
    INLINE vec3 operator-(vec3 a, vec3 b)
      { return vec3(a.xs - b.xs, a.ys - b.ys, a.zs - b.zs); }
    INLINE real dot(vec3 a, vec3 b)
      { return a.xs * b.xs + a.ys * b.ys + a.zs * b.zs; }
    INLINE vec3 operator*(real as, vec3 b)
      { return vec3(as * b.xs, as * b.ys, as * b.zs); }
    INLINE vec3 splat(float3 a)
      { return vec3(spu_splats(a.x), spu_splats(a.y), spu_splats(a.z)); }








    Also, since the light values are the same for each of the four components, we
    must “splat” the light components into the vector:

    struct lightenv { // 8 vector floats.
       vec3 pos, clr;
       real att0s, att1s;
       lightenv(light in) {
         pos = splat(in.pos); clr = splat(in.clr);
         att0s = spu_splats(in.att0);
         att1s = spu_splats(in.att1);
       }
    };







    The core function:

    // Light calculation for 4 vertexes
    static INLINE vec3 lcalc4(vertex4 v, lightenv l)
    {
      vec3 L = l.pos - v.pos;
      real Lsq = dot(L, L);
      real irs = rsqrtf4(Lsq);
      vec3 ldir = irs * L;
      real at = fminf4(l.att0s * irs * irs - l.att1s, ones);
      real dp = dot(v.nrm, ldir);
      return fmaxf4(zeros, at) * fmaxf4(zeros, dp) * l.clr;
    }






    The calling function:

    void accLights(int nLights, light *light,
                   int nVerts, vertex4 *vtx)
    {
      for (int l = 0; l < nLights; l++) {
        lightenv lgt(light[l]);
        for (int v = 0; v < nVerts/4; v++) {
          vec3 clr = lcalc4(vtx[v], lgt);
          vtx[v].clr = clr + ((l==0) ? zerovec : vtx[v].clr);
        }
      }
    }

    Result: 31.8 cycles/vertex/light! This is a factor of 6.56 better than the
    previous 208.5 cycles, but we can do much more...







    It is more than 4 times better since, in addition to having vectorized the code
    (a 4 times speedup), the compiler no longer has to do awkward shuffles when
    loading and storing values.

    However, using spusim, we still see that more than 50% of the cycles are
    wasted by dependency stalls. Of the rest, 76% are still single-issued. So, GCC
    does a terrible job of optimizing our loop.

    We can try to unroll the loop manually:








        for (int v = 0; v < nVerts/4; v+=4) {
          vec3 clr0 = lcalc4(vtx[v+0], lgt);
          vec3 clr1 = lcalc4(vtx[v+1], lgt);
          vec3 clr2 = lcalc4(vtx[v+2], lgt);
          vec3 clr3 = lcalc4(vtx[v+3], lgt);
          vtx[v+0].clr = clr0 + ((l==0) ? zerovec : vtx[v+0].clr);
          vtx[v+1].clr = clr1 + ((l==0) ? zerovec : vtx[v+1].clr);
          vtx[v+2].clr = clr2 + ((l==0) ? zerovec : vtx[v+2].clr);
          vtx[v+3].clr = clr3 + ((l==0) ? zerovec : vtx[v+3].clr);
        }

    Result: 15.0 cycles/vertex/light!! Pretty good, but spusim tells us that
    although most stalls are gone (about 10% remain), 60% of the cycles are still
    single-issue rather than dual-issue.
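
The unrolled loop above assumes the number of groups is a multiple of four. As a generic sketch of the same pattern in portable C++, the function below does four independent computations before the four stores, then finishes any leftover elements one at a time. The function name and the simple scale-and-bias operation are placeholders chosen for this example, not part of the lighting code.

```cpp
// Applies data[i] = data[i] * k + b to n elements.
// The main loop handles four elements per trip; the four results do not
// depend on each other, which gives the scheduler independent work to
// interleave. The second loop handles the remainder when n is not a
// multiple of four.
inline void scaleBiasUnrolled(float* data, int n, float k, float b) {
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        float r0 = data[i + 0] * k + b;
        float r1 = data[i + 1] * k + b;
        float r2 = data[i + 2] * k + b;
        float r3 = data[i + 3] * k + b;
        data[i + 0] = r0;
        data[i + 1] = r1;
        data[i + 2] = r2;
        data[i + 3] = r3;
    }
    for (; i < n; ++i)
        data[i] = data[i] * k + b;
}
```

Padding the input to a multiple of the unroll factor, as the slides implicitly do, avoids the tail loop altogether.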






    What is the minimum number of cycles in our loop? We have:

    L = l.pos - v.pos                                              //   3 fs
    Lsq = dot(L, L)                                                //   fm + 2 fma
    dp = dot(v.nrm, L)                                             //   fm + 2 fma
    irs = rsqrtf4(Lsq)                                             //   frsqest+fi+(2fm+fnms+fma)
    at = l.att0s * irs * irs - l.att1s                             //   fm, fms
    at = fminf4(at, ones)                                          //   fcgt, selb
    m0 = fmaxf4(zeros, at)                                         //   fcgt, selb
    m1 = fmaxf4(zeros, dp * irs)                                   //   fm, fcgt, selb
    ms = m0 * m1                                                   //   fm
    tmp = lclr * irs                                               //   3 fm
    clr = ms * tmp + pclr                                          //   3 fma

    So, with one more instruction to update the loop counter, that is 31
    even-pipeline instructions per 4 vertexes, and we should be able to software
    pipeline this down to 31/4 = 7.75 cycles per vertex!
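
The count behind that estimate can be written out as a small table. The sketch below tallies the listed instructions per pipeline, assuming that `frsqest` is the only listed instruction on the odd pipeline and that the loop-counter update goes to the even pipeline. Loads, stores and the branch are left out, as they are in the listing. The `Step` table and function names are made up for this example.

```cpp
#include <algorithm>

struct Step { const char* what; int even; int odd; };

// Instruction counts per source line, taken from the listing.
static const Step kSteps[] = {
    {"L = l.pos - v.pos",                  3, 0},  // 3 fs
    {"Lsq = dot(L, L)",                    3, 0},  // fm + 2 fma
    {"dp = dot(v.nrm, L)",                 3, 0},  // fm + 2 fma
    {"irs = rsqrtf4(Lsq)",                 5, 1},  // fi, 2 fm, fnms, fma + frsqest (odd)
    {"at = l.att0s * irs * irs - l.att1s", 2, 0},  // fm, fms
    {"at = fminf4(at, ones)",              2, 0},  // fcgt, selb
    {"m0 = fmaxf4(zeros, at)",             2, 0},  // fcgt, selb
    {"m1 = fmaxf4(zeros, dp * irs)",       3, 0},  // fm, fcgt, selb
    {"ms = m0 * m1",                       1, 0},  // fm
    {"tmp = lclr * irs",                   3, 0},  // 3 fm
    {"clr = ms * tmp + pclr",              3, 0},  // 3 fma
    {"loop counter update",                1, 0},  // assumed even pipeline
};

// Total instructions issued on one pipeline per loop iteration.
inline int pipeTotal(bool oddPipe) {
    int total = 0;
    for (const Step& s : kSteps) total += oddPipe ? s.odd : s.even;
    return total;
}

// The busier pipeline bounds the cycles per iteration (four vertexes).
inline int loopInitiationInterval() {
    return std::max(pipeTotal(false), pipeTotal(true));
}

inline double cyclesPerVertex() { return loopInitiationInterval() / 4.0; }
```

Under these assumptions the even pipeline is the bottleneck with 31 instructions, which gives the 7.75 cycles per vertex quoted above.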







    We can improve this loop a lot by using a process called software pipelining.
    It is a fairly mechanical process:

        Write out the instruction sequence in assembly.
        Add up even instructions and odd instructions. The maximum is the II,
        the initiation interval.
        Also note (write down) the latency of each instruction.
        Create a table with two columns and as many rows as the II.
        Insert instructions from the top adhering to latency demands.
        If you get to row II, then wrap around.

    In the following slides, we’ll show how we do it.
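
The table-filling procedure listed above can be sketched as a small greedy scheduler. This is a simplified illustration, not the author's tool: it takes instructions in program order, places each at the first free row at or after the cycle where its inputs are ready, and wraps around modulo the II. It ignores dependencies carried from one iteration to the next. All type and function names are invented for this example.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

enum Pipe { Even = 0, Odd = 1 };

struct Instr {
    Pipe pipe;
    int latency;             // cycles until the result can be used
    std::vector<int> deps;   // indices of earlier instructions it reads from
};

// row: position in the II-row table; stage: how many iterations the
// instruction is delayed (the ":1", ":2" style annotation).
struct Slot { int row; int stage; };

// II = the larger of the even and odd instruction counts.
inline int initiationInterval(const std::vector<Instr>& prog) {
    int count[2] = {0, 0};
    for (const Instr& in : prog) ++count[in.pipe];
    return std::max(count[Even], count[Odd]);
}

// Returns one Slot per instruction, or an empty vector if an instruction
// finds no free row in its pipeline column.
inline std::vector<Slot> moduloSchedule(const std::vector<Instr>& prog, int ii) {
    std::vector<std::vector<bool>> used(ii, std::vector<bool>(2, false));
    std::vector<int> cycle(prog.size(), 0);
    std::vector<Slot> out;
    for (std::size_t i = 0; i < prog.size(); ++i) {
        // Earliest cycle at which every input is available.
        int earliest = 0;
        for (int d : prog[i].deps)
            earliest = std::max(earliest, cycle[d] + prog[d].latency);
        // Walk forward, wrapping rows modulo II, until a free slot is found.
        int c = earliest;
        int tries = 0;
        while (tries < ii && used[c % ii][prog[i].pipe]) { ++c; ++tries; }
        if (tries == ii) return {};
        used[c % ii][prog[i].pipe] = true;
        cycle[i] = c;
        out.push_back({c % ii, c / ii});
    }
    return out;
}
```

For example, three odd-pipeline loads with latency 6 followed by three dependent even-pipeline subtracts and one multiply give an II of 4, and the later instructions land in rows that belong to later iterations of the loop.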





    We’ll start with

    vec3 L = l.pos - v.pos;

    This is going to be translated (in assembly) to:

    {o6}   lqx vposx, pxptr, iofs                   ;   vposx := mem[pxptr+iofs]
    {o6}   lqx vposy, pyptr, iofs                   ;   vposy := mem[pyptr+iofs]
    {o6}   lqx vposz, pzptr, iofs                   ;   vposz := mem[pzptr+iofs]
    {e6}   fs Lx, lposx, vposx                      ;   even pipe, latency = 6.
    {e6}   fs Ly, lposy, vposy                      ;   {text} are comments.
    {e6}   fs Lz, lposz, vposz

    The reason we use the x-form is that we then will have only one offset variable,
    simplifying our loop control. Of course, this means that we have to initialize
    pxptr, pyptr and pzptr accordingly. All light variables must also be initialized.

    Let’s start!


;;; Start of loop (page 1)
       {nop}                                    {o6:0} lqx vposx, pxptr, iofs; 0
       {nop}                                    {o6:0} lqx vposy, pyptr, iofs; 1
       {nop}                                    {o6:0} lqx vposz, pzptr, iofs; 2
       {nop}                                           {lnop}                ; 3
       {nop}                                           {lnop}                ; 4
       {nop}                                           {lnop}                ; 5
{e6:0} fs Lx, lposx, vposx                             {lnop}                ; 6
{e6:0} fs Ly, lposy, vposy                             {lnop}                ; 7
{e6:0} fs Lz, lposz, vposz                             {lnop}                ; 8
       {nop}                                           {lnop}                ; 9
       {nop}                                           {lnop}                ; 10
       {nop}                                           {lnop}                ; 11
       {nop}                                           {lnop}                ; 12
       {nop}                                           {lnop}                ; 13
       {nop}                                           {lnop}                ; 14
       {nop}                                           {lnop}                ; 15





    Then we just continue with

    Lsq = dot(L, L)
    dp = dot(v.nrm, L)

    Translated:

    {e6}   fm Lsq, Lx, Lx                           ;   Lsq   =   Lx*Lx
    {e6}   fma Lsq, Ly, Ly, Lsq                     ;   Lsq   =   Ly*Ly     + Lsq
    {e6}   fma Lsq, Lz, Lz, Lsq                     ;   Lsq   =   Lz*Lz     + Lsq
    {e6}   fm dp, vnrmx, Lx                         ;   dp    =   Nx*Lx
    {e6}   fma dp, vnrmy, Ly, dp                    ;   dp    =   Ny*Ly     + dp
    {e6}   fma dp, vnrmz, Lz, dp                    ;   dp    =   Nz*Lz     + dp




;;; Start of loop (page 1)
       {nop}                                    {o6:0}    lqx vposx,           pxptr,      iofs;    0
       {nop}                                    {o6:0}    lqx vposy,           pyptr,      iofs;    1
       {nop}                                    {o6:0}    lqx vposz,           pzptr,      iofs;    2
       {nop}                                    {o6:0}    lqx vnrmx,           nxptr,      iofs;    3
       {nop}                                    {o6:0}    lqx vnrmy,           nyptr,      iofs;    4
       {nop}                                    {o6:0}    lqx vnrmz,           nzptr,      iofs;    5
{e6:0} fs Lx, lposx, vposx                                {lnop}                               ;    6
{e6:0} fs Ly, lposy, vposy                                {lnop}                               ;    7
{e6:0} fs Lz, lposz, vposz                                {lnop}                               ;    8
       {nop}                                              {lnop}                               ;    9
       {nop}                                              {lnop}                               ;   10
       {nop}                                              {lnop}                               ;   11
{e6:0} fm Lsq, Lx, Lx                                     {lnop}                               ;   12
{e6:0} fm dp, vnrmx, Lx                                   {lnop}                               ;   13
       {nop}                                              {lnop}                               ;   14
       {nop}                                              {lnop}                               ;   15




;;; Continued (page 2)
       {nop}                                               {lnop}                           ;   16
       {nop}                                               {lnop}                           ;   17
       {nop}                                               {lnop}                           ;   18
{e6:0} fma Lsq, Ly, Ly, Lsq                                {lnop}                           ;   19
{e6:0} fma dp, vnrmy, Ly, dp                               {lnop}                           ;   20
       {nop}                                               {lnop}                           ;   21
       {nop}                                               {lnop}                           ;   22
       {nop}                                               {lnop}                           ;   23
       {nop}                                               {lnop}                           ;   24
{e6:0} fma Lsq_, Lz, Lz, Lsq                               {lnop}                           ;   25
{e6:0} fma dp_, vnrmz, Lz, dp                              {lnop}                           ;   26
       {nop}                                               {lnop}                           ;   27
       {nop}                                               {lnop}                           ;   28
       {nop}                                               {lnop}                           ;   29
       {nop}                  {o1:-}                       brnz ..., looptop                ;   30
;;; Must wrap around!

Notice that we rename the output variables from Lsq and dp to Lsq_ and dp_,
so that they don’t clash with registers when they are carried into the next loop iteration.



    The irs = rsqrtf4(Lsq) function is implemented using:

    {o4}   frsqest t0, Lsq
    {e7}   fi t1, Lsq, t0
    {e6}   fm t2, t1, Lsq
    {e6}   fm t3, t1, onehalf
    {e6}   fnms t2, t2, t1, one
    {e6}   fma irs, t3, t2, t1

    After this sequence, irs is good to 24 bits of precision. Notice the huge latency
    (∆L = 4 + 7 + 6 + 6 + 6 + 6 = 35), so the result is only ready in the next iteration.
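
The four arithmetic instructions after the estimate form one Newton-Raphson step for the reciprocal square root. The scalar sketch below mirrors them line by line in portable C++. Since `frsqest` and `fi` are not available off the SPU, `coarseRsqrt` is a made-up stand-in that truncates the exact value to about 12 bits; the real hardware estimate will differ.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Stand-in for frsqest + fi: a deliberately coarse 1/sqrt(x), produced by
// clearing the low 12 mantissa bits of the exact result (an assumption made
// for illustration only).
inline float coarseRsqrt(float x) {
    float y = 1.0f / std::sqrt(x);
    std::uint32_t bits;
    std::memcpy(&bits, &y, sizeof bits);
    bits &= 0xFFFFF000u;
    std::memcpy(&y, &bits, sizeof y);
    return y;
}

// One Newton-Raphson step: irs = t1 + (t1/2) * (1 - x * t1 * t1).
inline float refineRsqrt(float x, float t1) {
    float t2 = t1 * x;       // fm   t2, t1, Lsq
    float t3 = t1 * 0.5f;    // fm   t3, t1, onehalf
    t2 = 1.0f - t2 * t1;     // fnms t2, t2, t1, one
    return t3 * t2 + t1;     // fma  irs, t3, t2, t1
}

// Estimate followed by one refinement step.
inline float rsqrtRefined(float x) {
    return refineRsqrt(x, coarseRsqrt(x));
}
```

Each step roughly doubles the number of correct bits, which is why a single refinement of a 12-bit estimate is enough to approach single precision.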




;;; Start of loop (page 1)
       {nop}                                     {o6:0}    lqx vposx, pxptr,                iofs;    0
       {nop}                                     {o6:0}    lqx vposy, pyptr,                iofs;    1
       {nop}                                     {o6:0}    lqx vposz, pzptr,                iofs;    2
       {nop}                                     {o6:0}    lqx vnrmx, nxptr,                iofs;    3
       {nop}                                     {o6:0}    lqx vnrmy, nyptr,                iofs;    4
       {nop}                                     {o6:0}    lqx vnrmz, nzptr,                iofs;    5
{e6:0} fs Lx, lposx, vposx                       {o4:1}    frsqest t0, Lsq_                     ;    6
{e6:0} fs Ly, lposy, vposy                                 {lnop}                               ;    7
{e6:0} fs Lz, lposz, vposz                                 {lnop}                               ;    8
       {nop}                                               {lnop}                               ;    9
{e7:1} fi t1, Lsq_, t0                                     {lnop}                               ;   10
       {nop}                                               {lnop}                               ;   11
{e6:0} fm Lsq, Lx, Lx                                      {lnop}                               ;   12
{e6:0} fm dp, vnrmx, Lx                                    {lnop}                               ;   13
       {nop}                                               {lnop}                               ;   14
       {nop}                                               {lnop}                               ;   15

We annotate with :1 to indicate that these instructions are delayed into the
next iteration of the loop.


;;; Continued (page 2)
       {nop}                                              {lnop}                           ;   16
       {nop}                                              {lnop}                           ;   17
{e6:1} fm t2, t1, Lsq_                                    {lnop}                           ;   18
{e6:0} fma Lsq, Ly, Ly, Lsq                               {lnop}                           ;   19
{e6:0} fma dp, vnrmy, Ly, dp                              {lnop}                           ;   20
       {nop}                                              {lnop}                           ;   21
       {nop}                                              {lnop}                           ;   22
       {nop}                                              {lnop}                           ;   23
{e6:1} fm t3, t1, onehalf                                 {lnop}                           ;   24
{e6:0} fma Lsq_, Lz, Lz, Lsq                              {lnop}                           ;   25
{e6:0} fma dp, vnrmz, Lz, dp                              {lnop}                           ;   26
       {nop}                                              {lnop}                           ;   27
       {nop}                                              {lnop}                           ;   28
       {nop}                                              {lnop}                           ;   29
{e6:1} fnms t2, t2, t1, one {o1:-}                        brnz ..., looptop                ;   30




;;; Start of loop (page 1)
       {nop}                                    {o6:0}    lqx vposx, pxptr,                iofs;    0
       {nop}                                    {o6:0}    lqx vposy, pyptr,                iofs;    1
       {nop}                                    {o6:0}    lqx vposz, pzptr,                iofs;    2
       {nop}                                    {o6:0}    lqx vnrmx, nxptr,                iofs;    3
       {nop}                                    {o6:0}    lqx vnrmy, nyptr,                iofs;    4
{e6:2} fma irs, t3, t2, t1                      {o6:0}    lqx vnrmz, nzptr,                iofs;    5
{e6:0} fs Lx, lposx, vposx                      {o4:1}    frsqest t0, Lsq_                     ;    6
{e6:0} fs Ly, lposy, vposy                                {lnop}                               ;    7
{e6:0} fs Lz, lposz, vposz                                {lnop}                               ;    8
       {nop}                                              {lnop}                               ;    9
{e7:1} fi t1, Lsq_, t0                                    {lnop}                               ;   10
       {nop}                                              {lnop}                               ;   11
{e6:0} fm Lsq, Lx, Lx                                     {lnop}                               ;   12
{e6:0} fm dp, vnrmx, Lx                                   {lnop}                               ;   13
       {nop}                                              {lnop}                               ;   14
       {nop}                                              {lnop}                               ;   15





    Continuing with:

    at = l.att0s * irs * irs - l.att1s                              // fm, fms
    at = fminf4(at, ones)                                           // fcgt, selb

    This translates to:

    {e6}   fm ir, irs, irs
    {e6}   fms at, att0s, ir, att1s
    {e2}   fcgt c0, at, ones        ; c0.w = at.w > ones.w ? 111..111 : 000..000
    {e2}   selb at, at, ones, c0    ; at.w = c0.w ? ones.w : at.w
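
The compare-and-select idiom used here for the minimum works without branches: the compare produces an all-ones or all-zeros mask per lane, and the select picks bits from one of two inputs according to that mask. The one-lane sketch below emulates this in portable C++. The function names are invented for the example, and the mask-last argument order follows the SPU `selb` definition as I understand it.

```cpp
#include <cstdint>
#include <cstring>

inline std::uint32_t bitsOf(float f) {
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    return u;
}

inline float floatOf(std::uint32_t u) {
    float f;
    std::memcpy(&f, &u, sizeof f);
    return f;
}

// One lane of fcgt: all ones if a > b, otherwise all zeros.
inline std::uint32_t cmpGreater(float a, float b) {
    return a > b ? 0xFFFFFFFFu : 0u;
}

// One lane of selb: take bits from b where the mask is 1, from a where it is 0.
inline float selectBits(float a, float b, std::uint32_t mask) {
    return floatOf((bitsOf(a) & ~mask) | (bitsOf(b) & mask));
}

// min(a, b): if a > b pick b, otherwise keep a.
inline float minNoBranch(float a, float b) {
    return selectBits(a, b, cmpGreater(a, b));
}

// max(a, b): if b > a pick b, otherwise keep a.
inline float maxNoBranch(float a, float b) {
    return selectBits(a, b, cmpGreater(b, a));
}
```

A saturate to the range [0, 1] is then `maxNoBranch(0.0f, minNoBranch(x, 1.0f))`, which corresponds to two compare-and-select pairs per value.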




;;; Start of loop (page 1)
       {nop}                                    {o6:0}    lqx vposx, pxptr,                iofs;    0
       {nop}                                    {o6:0}    lqx vposy, pyptr,                iofs;    1
       {nop}                                    {o6:0}    lqx vposz, pzptr,                iofs;    2
       {nop}                                    {o6:0}    lqx vnrmx, nxptr,                iofs;    3
       {nop}                                    {o6:0}    lqx vnrmy, nyptr,                iofs;    4
{e6:2} fma irs, t3, t2, t1                      {o6:0}    lqx vnrmz, nzptr,                iofs;    5
{e6:0} fs Lx, lposx, vposx                      {o4:1}    frsqest t0, Lsq_                     ;    6
{e6:0} fs Ly, lposy, vposy                                {lnop}                               ;    7
{e6:0} fs Lz, lposz, vposz                                {lnop}                               ;    8
       {nop}                                              {lnop}                               ;    9
{e7:1} fi t1, Lsq_, t0                                    {lnop}                               ;   10
{e6:2} fm ir, irs, irs                                    {lnop}                               ;   11
{e6:0} fm Lsq, Lx, Lx                                     {lnop}                               ;   12
{e6:0} fm dp, vnrmx, Lx                                   {lnop}                               ;   13
       {nop}                                              {lnop}                               ;   14
       {nop}                                              {lnop}                               ;   15




;;; Continued (page 2)
       {nop}                                              {lnop}                           ;   16
       {nop}                                              {lnop}                           ;   17
{e6:1} fm t2, t1, Lsq_                                    {lnop}                           ;   18
{e6:0} fma Lsq, Ly, Ly, Lsq                               {lnop}                           ;   19
{e6:2} fms at, att0s, ir, att1s                           {lnop}                           ;   20
{e6:0} fma dp, vnrmy, Ly, dp                              {lnop}                           ;   21
       {nop}                                              {lnop}                           ;   22
       {nop}                                              {lnop}                           ;   23
{e6:1} fm t3, t1, onehalf                                 {lnop}                           ;   24
{e6:0} fma Lsq_, Lz, Lz, Lsq                              {lnop}                           ;   25
{e2:2} fcgt c0, at, ones                                  {lnop}                           ;   26
{e6:0} fma dp, vnrmz, Lz, dp                              {lnop}                           ;   27
{e2:2} selb at, c0, at, ones                              {lnop}                           ;   28
       {nop}                                              {lnop}                           ;   29
{e6:1} fnms t2, t2, t1, one {o1:-}                        brnz ..., looptop                ;   30
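
The t1, t2, t3 and irs instructions in the loop above form one Newton-Raphson refinement step for the reciprocal square root. A minimal scalar sketch in C, assuming the estimate produced by frsqest and fi is passed in as t1 (the function name refine_rsqrt is hypothetical, not from the slides):

```c
#include <assert.h>

/* One Newton-Raphson step for 1/sqrt(lsq), using the same temporaries
   as the loop. t1 is the initial estimate (frsqest + fi on the SPU). */
static float refine_rsqrt(float lsq, float t1)
{
    assert(lsq > 0.0f);
    float t2 = t1 * lsq;      /* fm   t2, t1, Lsq_     */
    float t3 = t1 * 0.5f;     /* fm   t3, t1, onehalf  */
    t2 = 1.0f - t2 * t1;      /* fnms t2, t2, t1, one  */
    return t3 * t2 + t1;      /* fma  irs, t3, t2, t1  */
}
```

With lsq = 4 and an estimate that is 1% too high (0.505), one step brings the result to within about 1e-4 of the exact value 0.5.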




Optimizing Example - Lighting Calculation

    Continuing with:

    m1 = fmaxf4(zeros, dp*irs)                                      // fm, fcgt, selb
    m0 = fmaxf4(zeros, at)                                          // fcgt, selb
    ms = m0 * m1                                                    // fm

    This translates to:

    {e6}   fm dprs,       dp, irs
    {e2}   fcgt c1,       zeros, dprs
    {e2}   selb m1,       c1, dprs, zeros
    {e2}   fcgt c2,       zeros, at
    {e2}   selb m0,       c2, at, zeros
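
The compare-and-select pattern can be modeled for a single 32-bit lane in C. This is a sketch only: it assumes the slides' selb operand order is (destination, mask, value where the mask is clear, value where the mask is set), which is how the listing reads, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* fcgt: all-ones mask if a > b, otherwise all-zeros (one lane). */
static uint32_t fcgt(float a, float b)
{
    return a > b ? 0xFFFFFFFFu : 0u;
}

/* selb: take bits from if_set where mask is 1, from if_clear where it is 0. */
static float selb(uint32_t mask, float if_clear, float if_set)
{
    uint32_t c, s, r;
    float out;
    assert(sizeof(float) == sizeof(uint32_t));
    memcpy(&c, &if_clear, sizeof c);
    memcpy(&s, &if_set, sizeof s);
    r = (c & ~mask) | (s & mask);
    memcpy(&out, &r, sizeof out);
    return out;
}

/* max(0, x) without a branch, as in the m1 and m0 computations. */
static float max_zero(float x)
{
    uint32_t c1 = fcgt(0.0f, x);   /* fcgt c1, zeros, dprs        */
    return selb(c1, x, 0.0f);      /* selb m1, c1, dprs, zeros    */
}
```

Because the mask is either all ones or all zeros per lane, the bitwise select never mixes bits of the two operands.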




;;; Start of loop (page 1) : lmove dst, src == rotm dst, src, zeros
       {nop}                 {o6:0} lqx vposx, pxptr, iofs; 0
       {nop}                 {o6:0} lqx vposy, pyptr, iofs; 1
       {nop}                 {o6:0} lqx vposz, pzptr, iofs; 2
       {nop}                 {o6:0} lqx vnrmx, nxptr, iofs; 3
       {nop}                 {o6:0} lqx vnrmy, nyptr, iofs; 4
{e6:2} fma irs, t3, t2, t1   {o6:0} lqx vnrmz, nzptr, iofs; 5
{e6:0} fs Lx, lposx, vposx   {o4:1} frsqest t0, Lsq_      ; 6
{e6:0} fs Ly, lposy, vposy          {lnop}                ; 7
{e6:0} fs Lz, lposz, vposz          {lnop}                ; 8
       {nop}                        {lnop}                ; 9
{e7:1} fi t1, Lsq_, t0              {lnop}                ; 10
{e6:2} fm ir, irs, irs              {lnop}                ; 11
{e6:0} fm Lsq, Lx, Lx               {lnop}                ; 12
{e6:2} fm dprs, dp__, irs           {lnop}                ; 13
{e6:0} fm dp, vnrmx, Lx      {o4:1} lmove dp__, dp_       ; 14
       {nop}                        {lnop}                ; 15




;;; Continued (page 2)
       {nop}                                              {lnop}                           ;   16
       {nop}                                              {lnop}                           ;   17
{e6:1} fm t2, t1, Lsq_                                    {lnop}                           ;   18
{e6:0} fma Lsq, Ly, Ly, Lsq                               {lnop}                           ;   19
{e6:2} fms at, att0s, ir, att1s                           {lnop}                           ;   20
{e6:0} fma dp, vnrmy, Ly, dp                              {lnop}                           ;   21
{e2:2} fcgt c1, zeros, dprs                               {lnop}                           ;   22
       {nop}                                              {lnop}                           ;   23
{e6:1} fm t3, t1, onehalf                                 {lnop}                           ;   24
{e6:0} fma Lsq_, Lz, Lz, Lsq                              {lnop}                           ;   25
{e2:2} fcgt c0, at, ones                                  {lnop}                           ;   26
{e6:0} fma dp_, vnrmz, Lz, dp                             {lnop}                           ;   27
{e2:2} selb at, c0, at, ones                              {lnop}                           ;   28
{e2:2} selb m1, c1, dprs, zeros                           {lnop}                           ;   29
{e6:1} fnms t2, t2, t1, one {o1:-}                        brnz ..., looptop                ;   30




;;; Start of loop (page 1)
{e2:3} fcgt c2, zeros, at       {o6:0} lqx vposx, pxptr, iofs ; 0
       {nop}                    {o6:0} lqx vposy, pyptr, iofs ; 1
{e2:3} selb m0, c2, at, zeros   {o6:0} lqx vposz, pzptr, iofs ; 2
       {nop}                    {o6:0} lqx vnrmx, nxptr, iofs ; 3
{e6:3} fm ms, m0, m1            {o6:0} lqx vnrmy, nyptr, iofs ; 4
{e6:2} fma irs, t3, t2, t1      {o6:0} lqx vnrmz, nzptr, iofs ; 5
{e6:0} fs Lx, lposx, vposx      {o4:1} frsqest t0, Lsq_       ; 6
{e6:0} fs Ly, lposy, vposy             {lnop}                 ; 7
{e6:0} fs Lz, lposz, vposz             {lnop}                 ; 8
       {nop}                           {lnop}                 ; 9
{e7:1} fi t1, Lsq_, t0                 {lnop}                 ; 10
{e6:2} fm ir, irs, irs                 {lnop}                 ; 11
{e6:0} fm Lsq, Lx, Lx                  {lnop}                 ; 12
{e6:2} fm dprs, dp__, irs              {lnop}                 ; 13
{e6:0} fm dp, vnrmx, Lx         {o4:1} lmove dp__, dp_        ; 14
       {nop}                           {lnop}                 ; 15




Optimizing Example - Lighting Calculation

    Finally, we can squeeze in our last 6 instructions:

    tmp = lclr * irs                       // 3 fm
    clr = ms * tmp + prev                  // 3 fma

    Translation:

    {e6}   fm tmpr, lclrr, irs
    {e6}   fm tmpg, lclrg, irs
    {e6}   fm tmpb, lclrb, irs
    {e6}   fma clrr, ms, tmpr, prevr ; prev color
    {e6}   fma clrg, ms, tmpg, prevg
    {e6}   fma clrb, ms, tmpb, prevb
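
A scalar sketch of these six instructions in C, assuming ms and irs have already been computed for the vertex. The Color struct and the function name are hypothetical and only illustrate the per-channel multiply followed by multiply-add:

```c
#include <assert.h>

typedef struct { float r, g, b; } Color;

/* tmp = lclr * irs; clr = ms * tmp + prev, applied per channel. */
static Color accumulate_light(Color prev, Color lclr, float ms, float irs)
{
    assert(ms >= 0.0f);   /* ms is a product of two values clamped at zero */
    Color clr;
    float tmpr = lclr.r * irs;      /* fm  tmpr, lclrr, irs         */
    float tmpg = lclr.g * irs;      /* fm  tmpg, lclrg, irs         */
    float tmpb = lclr.b * irs;      /* fm  tmpb, lclrb, irs         */
    clr.r = ms * tmpr + prev.r;     /* fma clrr, ms, tmpr, prevr    */
    clr.g = ms * tmpg + prev.g;     /* fma clrg, ms, tmpg, prevg    */
    clr.b = ms * tmpb + prev.b;     /* fma clrb, ms, tmpb, prevb    */
    return clr;
}
```

Calling it once per light, feeding each result back in as prev, accumulates the contributions of all lights for one vertex.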




;;; Start of loop (page 1)
{e2:3} fcgt c2, zeros, at         {o6:0} lqx vposx, pxptr, iofs ; 0
{e6:3} fm tmpr, lclrr, irs_       {o6:0} lqx vposy, pyptr, iofs ; 1
{e2:3} selb m0, c2, at, zeros     {o6:0} lqx vposz, pzptr, iofs ; 2
{e6:3} fm tmpg, lclrg, irs_       {o6:0} lqx vnrmx, nxptr, iofs ; 3
{e6:3} fm ms, m0, m1              {o6:0} lqx vnrmy, nyptr, iofs ; 4
{e6:2} fma irs, t3, t2, t1        {o6:0} lqx vnrmz, nzptr, iofs ; 5
{e6:0} fs Lx, lposx, vposx        {o4:1} frsqest t0, Lsq_       ; 6
{e6:0} fs Ly, lposy, vposy               {lnop}                 ; 7
{e6:0} fs Lz, lposz, vposz               {lnop}                 ; 8
{e6:3} fm tmpb, lclrb, irs_              {lnop}                 ; 9
{e7:1} fi t1, Lsq_, t0                   {lnop}                 ; 10
{e6:2} fm ir, irs, irs                   {lnop}                 ; 11
{e6:0} fm Lsq, Lx, Lx             {o4:2} lmove irs_, irs        ; 12
{e6:2} fm dprs, dp__, irs                {lnop}                 ; 13
{e6:0} fm dp, vnrmx, Lx           {o4:1} lmove dp__, dp_        ; 14
{e6:3} fma clrr, ms, tmpr, prevr         {lnop}                 ; 15

Here, we needed to extend the lifetime of irs, thus introducing the move to
irs_ in iteration 2, in order to have a clean copy in iteration 3.


;;; Continued (page 2)
       {nop}                                              {lnop}                           ;   16
{e6:3} fma clrg, ms, tmpg, prevg                          {lnop}                           ;   17
{e6:1} fm t2, t1, Lsq_                                    {lnop}                           ;   18
{e6:0} fma Lsq, Ly, Ly, Lsq                               {lnop}                           ;   19
{e6:2} fms at, att0s, ir, att1s                           {lnop}                           ;   20
{e6:0} fma dp, vnrmy, Ly, dp                              {lnop}                           ;   21
{e2:2} fcgt c1, zeros, dprs                               {lnop}                           ;   22
{e6:3} fma clrb, ms, tmpb, prevb                          {lnop}                           ;   23
{e6:1} fm t3, t1, onehalf                                 {lnop}                           ;   24
{e6:0} fma Lsq_, Lz, Lz, Lsq                              {lnop}                           ;   25
{e2:2} fcgt c0, at, ones                                  {lnop}                           ;   26
{e6:0} fma dp_, vnrmz, Lz, dp                             {lnop}                           ;   27
{e2:2} selb at, c0, at, ones                              {lnop}                           ;   28
{e2:2} selb m1, c1, dprs, zeros                           {lnop}                           ;   29
{e6:1} fnms t2, t2, t1, one {o1:-}                        brnz ..., looptop                ;   30




Optimizing Example - Lighting Calculation

    We’re almost done! We must finish up with a technique to select prevr, prevg
    and prevb, which must be zero for the first loop but hold the previous color on
    subsequent loops. We can do this by using a different select map. For light 0,
    we’ll use mask = m_0000, which creates all zeros, and mask = m_ABCD for all
    others. So, for each of the color’s (R, G, B) fields:

    {o6} lqx vclrr, crptr, oofs          ; load using output offset
    {o4} shufb prevr, vclrr, vclrr, mask ; copy vclrr or set to zeros
    ;; use prevr
    {o6} stqx clrr, crptr, oofs           ; store using output offset
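
The effect of the two masks can be checked with a byte-wise model of shufb in C. This sketch follows the control-byte rules of the SPU instruction set (the low five bits index the 32 bytes of the two sources; the patterns 10xxxxxx, 110xxxxx and 111xxxxx produce 0x00, 0xFF and 0x80). The qword type and the helper names are hypothetical, and the mask contents are an assumption based on the naming used in the slides:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef struct { uint8_t b[16]; } qword;

/* Byte-wise model of shufb rt, ra, rb, rc. */
static qword shufb(qword a, qword b, qword ctrl)
{
    qword r;
    for (int i = 0; i < 16; ++i) {
        uint8_t c = ctrl.b[i];
        if ((c & 0xC0) == 0x80) {
            r.b[i] = 0x00;                      /* 10xxxxxx: constant 0x00 */
        } else if ((c & 0xE0) == 0xC0) {
            r.b[i] = 0xFF;                      /* 110xxxxx: constant 0xFF */
        } else if ((c & 0xE0) == 0xE0) {
            r.b[i] = 0x80;                      /* 111xxxxx: constant 0x80 */
        } else {
            uint8_t idx = c & 0x1F;             /* byte of a:b to copy     */
            r.b[i] = idx < 16 ? a.b[idx] : b.b[idx - 16];
        }
    }
    return r;
}

/* m_ABCD: copy all four words of the first source. */
static qword mask_ABCD(void)
{
    qword m;
    for (int i = 0; i < 16; ++i) m.b[i] = (uint8_t)i;
    return m;
}

/* m_0000: every byte selects the constant zero. */
static qword mask_0000(void)
{
    qword m;
    memset(m.b, 0x80, sizeof m.b);
    return m;
}

/* prev is zero for light 0 and the stored color for every later light. */
static qword select_prev(qword vclr, int light_index)
{
    assert(light_index >= 0);
    return shufb(vclr, vclr, light_index == 0 ? mask_0000() : mask_ABCD());
}
```

Only the mask register changes between lights, so the loop body itself stays identical for the first and the following lights.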




;;; Start of loop (page 1)
{e2:3} fcgt c2, zeros, at    {o6:0} lqx vposx, pxptr, iofs; 0
{e6:3} fm tmpr, lclrr, irs_ {o6:0} lqx vposy, pyptr, iofs; 1
{e2:3} selb m0, c2, at, zeros{o6:0} lqx vposz, pzptr, iofs; 2
{e6:3} fm tmpg, lclrg, irs_ {o6:0} lqx vnrmx, nxptr, iofs; 3
{e6:3} fm ms, m0, m1         {o6:3} lqx vclrr, crptr, oofs; 4
{e6:2} fma irs, t3, t2, t1   {o6:3} lqx vclrg, cgptr, oofs; 5
{e6:0} fs Lx, lposx, vposx   {o4:1} frsqest t0, Lsq_      ; 6
{e6:0} fs Ly, lposy, vposy   {o6:3} lqx vclrb, cbptr, oofs; 7
{e6:0} fs Lz, lposz, vposz   {o6:0} lqx vnrmy, nyptr, iofs; 8
{e6:3} fm tmpb, lclrb, irs_ {o6:0} lqx vnrmz, nzptr, iofs; 9
{e7:1} fi t1, Lsq_, t0       {o4:3} shufb prevr, vclrr, vclrr, mask; 10
{e6:2} fm ir, irs, irs       {o4:3} shufb prevg, vclrg, vclrg, mask; 11
{e6:0} fm Lsq, Lx, Lx        {o4:2} lmove irs_, irs     ; 12
{e6:2} fm dprs, dp__, irs    {o4:3} shufb prevb, vclrb, vclrb, mask; 13
{e6:0} fm dp, vnrmx, Lx      {o4:1} lmove dp__, dp_     ; 14
{e6:3} fma clrr, ms, tmpr, prevr {lnop}                   ; 15




;;; Continued (page 2)
       {nop}                        {lnop}                ; 16
{e6:3} fma clrg, ms, tmpg, prevg    {lnop}                ; 17
{e6:1} fm t2, t1, Lsq_              {lnop}                ; 18
{e6:0} fma Lsq, Ly, Ly, Lsq         {lnop}                ; 19
{e6:2} fms at, att0s, ir, att1s     {lnop}                ; 20
{e6:0} fma dp, vnrmy, Ly, dp        {lnop}                ; 21
{e2:2} fcgt c1, zeros, dprs {o6:3} stqx clrr, crptr, oofs; 22
{e6:3} fma clrb, ms, tmpb, prevb    {lnop}                ; 23
{e6:1} fm t3, t1, onehalf    {o6:3} stqx clrg, cgptr, oofs; 24
{e6:0} fma Lsq_, Lz, Lz, Lsq        {lnop}                ; 25
{e2:2} fcgt c0, at, ones            {lnop}                ; 26
{e6:0} fma dp_, vnrmz, Lz, dp       {lnop}                ; 27
{e2:2} selb at, c0, at, ones {o6:3} stqx clrb, cbptr, oofs; 28
{e2:2} selb m1, c1, dprs, zeros     {lnop}                ; 29
{e6:1} fnms t2, t2, t1, one {o1:-} brnz ..., looptop      ; 30




Optimizing Example - Lighting Calculation

    We’re finally close to done. The last thing we need is to make sure our input
    and output offsets are correct throughout the execution. As seen, we count
    downwards by 0x90, which equals sizeof(vertex4), per loop.

    So, we initialize iofs = (numVerts/4 - 1) * 0x90, and that should take
    care of the input loads. For the output offsets we need a delay unit, initialized
    such that the first 3 times we’ll overwrite the last vertex4 positions. We’ll use
    the “delay machine” shuffle m_bcdA for this.

    {o4} shufb oofs, iofs, oofs, m_bcdA ; oofs.x <- oofs.y,
                                        ; oofs.y <- oofs.z,
                                        ; oofs.z <- oofs.w,
                                        ; oofs.w <- iofs.x




Optimizing Example - Lighting Calculation


    Example, (numVerts/4 = 4), at cycle 0, N=0x90:
    Loop 0: iofs = 3N, oofs = { 3N, 3N, 3N, 3N }
    Loop 1: iofs = 2N, oofs = { 3N, 3N, 3N, 2N }
    Loop 2: iofs = 1N, oofs = { 3N, 3N, 2N, 1N }
    Loop 3: iofs = 0N, oofs = { 3N, 2N, 1N, 0N }
    Loop 4: iofs =-1N, oofs = { 2N, 1N, 0N,-1N }
    Loop 5: iofs =-2N, oofs = { 1N, 0N,-1N,-2N }
    Loop 6: iofs =-3N, oofs = { 0N,-1N,-2N,-3N }
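
The table above can be reproduced with a small C model of the offset registers. This is a sketch under stated assumptions: the Offsets struct and function names are hypothetical, the four words of oofs are modeled as an int array, and each step applies the decrement of iofs followed by the m_bcdA shuffle, giving the state at cycle 0 of the next loop.

```c
#include <assert.h>

enum { N = 0x90 };   /* sizeof(vertex4), per the slides */

typedef struct {
    int iofs;        /* input offset                      */
    int oofs[4];     /* output offsets, words x, y, z, w  */
} Offsets;

/* iofs = (numVerts/4 - 1) * 0x90, with oofs filled with the same value. */
static Offsets offsets_init(int num_verts)
{
    assert(num_verts >= 4 && num_verts % 4 == 0);
    int start = (num_verts / 4 - 1) * N;
    Offsets o = { start, { start, start, start, start } };
    return o;
}

/* One pass of the loop body: ai iofs, iofs, -0x90 followed by
   shufb oofs, iofs, oofs, m_bcdA. */
static void offsets_step(Offsets *o)
{
    o->iofs -= N;
    o->oofs[0] = o->oofs[1];   /* oofs.x <- oofs.y */
    o->oofs[1] = o->oofs[2];   /* oofs.y <- oofs.z */
    o->oofs[2] = o->oofs[3];   /* oofs.z <- oofs.w */
    o->oofs[3] = o->iofs;      /* oofs.w <- iofs.x */
}
```

Starting from offsets_init(16), three steps give the "Loop 3" row and six steps give the "Loop 6" row.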




;;; Continued (page 2)
{e2:-} ai iofs, iofs, -0x90             {lnop}                                ; 16
{e6:3} fma clrg, ms, tmpg, prevg        {lnop}                                ; 17
{e6:1} fm t2, t1, Lsq_           {o4:3} shufb oofs, iofs, oofs, m_bcdA        ; 18
{e6:0} fma Lsq, Ly, Ly, Lsq             {lnop}                                ; 19
{e6:2} fms at, att0s, ir, att1s         {lnop}                                ; 20
{e6:0} fma dp, vnrmy, Ly, dp            {lnop}                                ; 21
{e2:2} fcgt c1, zeros, dprs      {o6:3} stqx clrr, crptr, oofs                ; 22
{e6:3} fma clrb, ms, tmpb, prevb        {lnop}                                ; 23
{e6:1} fm t3, t1, onehalf        {o6:3} stqx clrg, cgptr, oofs                ; 24
{e6:0} fma Lsq_, Lz, Lz, Lsq            {lnop}                                ; 25
{e2:2} fcgt c0, at, ones                {lnop}                                ; 26
{e6:0} fma dp_, vnrmz, Lz, dp           {lnop}                                ; 27
{e2:2} selb at, c0, at, ones     {o6:3} stqx clrb, cbptr, oofs                ; 28
{e2:2} selb m1, c1, dprs, zeros         {lnop}                                ; 29
{e6:1} fnms t2, t2, t1, one      {o1:3} brnz oofs, looptop                    ; 30




Optimizing Example - Lighting Calculation


    And now we are done! Well, not entirely – we’ll have to code the prologue (the
    entry code) and the epilogue (exit code), test it and make sure it works. But it
    will!

    Looking at the code, there are no stalls. The first four iterations take
    4 × 31 = 124 cycles and produce the first four results; every further interval
    of 31 cycles produces another four results. That means that our loop runs at
    7.75 cycles/vertex/light!
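
    The cycle counts above can be written as a small C sketch. It assumes a 31-cycle loop body, four pipeline stages (iterations 0 to 3) and four vertices per pass, and it ignores prologue and epilogue code; the names are hypothetical:

```c
#include <assert.h>

enum { LOOP_CYCLES = 31, STAGES = 4, VERTS_PER_PASS = 4 };

/* Cycles until `groups` groups of four vertices have left the pipeline:
   the first group needs all four stages, each later group one more pass. */
static int pipelined_cycles(int groups)
{
    assert(groups >= 1);
    return (STAGES + groups - 1) * LOOP_CYCLES;
}

/* Steady-state cost per vertex and light. */
static double steady_state_cycles_per_vertex(void)
{
    return (double)LOOP_CYCLES / VERTS_PER_PASS;
}
```

    For one group this gives the 124 cycles quoted above, and the steady-state figure is 31 / 4 = 7.75.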

    Was this hard to do?

        Fairly mechanical. We have tools that can schedule automatically!
        Requires simple loops.
        Usually a factor of 2-4 faster than what unrolled GCC code can achieve.
        Our “frontend” tool automatically colors registers.
        And it has some other nice features as well.
        So: Just do it!


Optimizing Example - Lighting Calculation




    Improvements on algorithm:

        The strange storage format can be changed to allow for regular
        array-of-structures.
        All we need to do is to add in transposing of input data.
             9 loads turn into 12 loads.
             Transposing 4 vector regs into 3 takes: 7o, 1e+6o, 2e+5o or 4e+4o.
             Transposing 3 vector regs into 4 takes: 6o, 1e+5o, 2e+4o.
             3 stores turn into 4 stores.
             Overall addition: odd = 3 + 3 × 4 + 6 + 1 = 22, even = 3 × 4 = 12. Since
             we had 10 odd for free: 22 − 10 = 12 extra cycles per loop, or 3.0 extra
             cycles/vertex/light.
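
    What the input and output transposes compute can be sketched in scalar C. This only illustrates the data movement, not the shuffle sequences counted above; the Vec3 and Vec3x4 types and the function names are hypothetical:

```c
#include <assert.h>

typedef struct { float x, y, z; } Vec3;              /* array-of-structures element   */
typedef struct { float x[4], y[4], z[4]; } Vec3x4;   /* one component per vector reg  */

/* Input side: gather four xyz structures into x, y and z vectors. */
static Vec3x4 transpose_in(const Vec3 v[4])
{
    Vec3x4 s;
    for (int i = 0; i < 4; ++i) {
        s.x[i] = v[i].x;
        s.y[i] = v[i].y;
        s.z[i] = v[i].z;
    }
    return s;
}

/* Output side: scatter the three component vectors back into structures. */
static void transpose_out(const Vec3x4 *s, Vec3 out[4])
{
    assert(s != 0 && out != 0);
    for (int i = 0; i < 4; ++i) {
        out[i].x = s->x[i];
        out[i].y = s->y[i];
        out[i].z = s->z[i];
    }
}
```

    On the SPU each of these loops would become a handful of shufb (odd pipeline) or select (even pipeline) instructions, which is where the instruction counts listed above come from.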




Optimizing Example - Lighting Calculation

    Conclusion:

    The original algorithm can be coded, without changing memory layouts, down
    to 10.75 cycles/vertex/light, or 7.75 cycles/vertex/light with our proposed
    change in memory layout.

    If you remember nothing else, then remember that:

        vectorization can give huge improvements without going to assembly,
        and that
        software loop-scheduling can give huge improvements on top of that.

    And finally, a question:

        Why on earth can’t the compiler do a better job?




           P˚
            al-Kristian Engstadpal engstad@naughtydog.com   Introduction to SPU Optimizations

Mais conteúdo relacionado

Mais procurados

A Simple Communication System Design Lab #4 with MATLAB Simulink
A Simple Communication System Design Lab #4 with MATLAB SimulinkA Simple Communication System Design Lab #4 with MATLAB Simulink
A Simple Communication System Design Lab #4 with MATLAB SimulinkJaewook. Kang
 
conference_poster_4
conference_poster_4conference_poster_4
conference_poster_4Jiayi Jiang
 
Fourier supplementals
Fourier supplementalsFourier supplementals
Fourier supplementalsPartha_bappa
 
論文紹介 Fast imagetagging
論文紹介 Fast imagetagging論文紹介 Fast imagetagging
論文紹介 Fast imagetaggingTakashi Abe
 
Digital Communication
Digital CommunicationDigital Communication
Digital CommunicationPartha_bappa
 
Ff tand matlab-wanjun huang
Ff tand matlab-wanjun huangFf tand matlab-wanjun huang
Ff tand matlab-wanjun huangSagar Ahir
 
DSP_FOEHU - Lec 08 - The Discrete Fourier Transform
DSP_FOEHU - Lec 08 - The Discrete Fourier TransformDSP_FOEHU - Lec 08 - The Discrete Fourier Transform
DSP_FOEHU - Lec 08 - The Discrete Fourier TransformAmr E. Mohamed
 
A Simple Communication System Design Lab #3 with MATLAB Simulink
A Simple Communication System Design Lab #3 with MATLAB SimulinkA Simple Communication System Design Lab #3 with MATLAB Simulink
A Simple Communication System Design Lab #3 with MATLAB SimulinkJaewook. Kang
 
An optimal and progressive algorithm for skyline queries slide
An optimal and progressive algorithm for skyline queries slideAn optimal and progressive algorithm for skyline queries slide
An optimal and progressive algorithm for skyline queries slideWooSung Choi
 
Image trnsformations
Image trnsformationsImage trnsformations
Image trnsformationsJohn Williams
 
fourier series and fourier transform
fourier series and fourier transformfourier series and fourier transform
fourier series and fourier transformVikas Rathod
 
Dft and its applications
Dft and its applicationsDft and its applications
Dft and its applicationsAgam Goel
 
Seven waystouseturtle pycon2009
Seven waystouseturtle pycon2009Seven waystouseturtle pycon2009
Seven waystouseturtle pycon2009A Jorge Garcia
 

Mais procurados (20)

A Simple Communication System Design Lab #4 with MATLAB Simulink
A Simple Communication System Design Lab #4 with MATLAB SimulinkA Simple Communication System Design Lab #4 with MATLAB Simulink
A Simple Communication System Design Lab #4 with MATLAB Simulink
 
conference_poster_4
conference_poster_4conference_poster_4
conference_poster_4
 
Fourier supplementals
Fourier supplementalsFourier supplementals
Fourier supplementals
 
Nabaa
NabaaNabaa
Nabaa
 
Matched filter
Matched filterMatched filter
Matched filter
 
Skyline queries
Skyline queriesSkyline queries
Skyline queries
 
論文紹介 Fast imagetagging
論文紹介 Fast imagetagging論文紹介 Fast imagetagging
論文紹介 Fast imagetagging
 
Digital Communication
Digital CommunicationDigital Communication
Digital Communication
 
Ff tand matlab-wanjun huang
Ff tand matlab-wanjun huangFf tand matlab-wanjun huang
Ff tand matlab-wanjun huang
 
DSP_FOEHU - Lec 08 - The Discrete Fourier Transform
DSP_FOEHU - Lec 08 - The Discrete Fourier TransformDSP_FOEHU - Lec 08 - The Discrete Fourier Transform
DSP_FOEHU - Lec 08 - The Discrete Fourier Transform
 
Lecture 9
Lecture 9Lecture 9
Lecture 9
 
A Simple Communication System Design Lab #3 with MATLAB Simulink
A Simple Communication System Design Lab #3 with MATLAB SimulinkA Simple Communication System Design Lab #3 with MATLAB Simulink
A Simple Communication System Design Lab #3 with MATLAB Simulink
 
An optimal and progressive algorithm for skyline queries slide
An optimal and progressive algorithm for skyline queries slideAn optimal and progressive algorithm for skyline queries slide
An optimal and progressive algorithm for skyline queries slide
 
Dsp Lab Record
Dsp Lab RecordDsp Lab Record
Dsp Lab Record
 
Fourier transforms
Fourier transforms Fourier transforms
Fourier transforms
 
Image trnsformations
Image trnsformationsImage trnsformations
Image trnsformations
 
Digital Signal Processing Assignment Help
Digital Signal Processing Assignment HelpDigital Signal Processing Assignment Help
Digital Signal Processing Assignment Help
 
fourier series and fourier transform
fourier series and fourier transformfourier series and fourier transform
fourier series and fourier transform
 
Dft and its applications
Dft and its applicationsDft and its applications
Dft and its applications
 
Seven waystouseturtle pycon2009
Seven waystouseturtle pycon2009Seven waystouseturtle pycon2009
Seven waystouseturtle pycon2009
 

Destaque

Adventures In Data Compilation
Adventures In Data CompilationAdventures In Data Compilation
Adventures In Data CompilationNaughty Dog
 
Lighting Shading by John Hable
Lighting Shading by John HableLighting Shading by John Hable
Lighting Shading by John HableNaughty Dog
 
Uncharted Animation Workflow
Uncharted Animation WorkflowUncharted Animation Workflow
Uncharted Animation WorkflowNaughty Dog
 
Uncharted 2: Character Pipeline
Uncharted 2: Character PipelineUncharted 2: Character Pipeline
Uncharted 2: Character PipelineNaughty Dog
 
The Technology of Uncharted: Drake’s Fortune
The Technology of Uncharted: Drake’s FortuneThe Technology of Uncharted: Drake’s Fortune
The Technology of Uncharted: Drake’s FortuneNaughty Dog
 
Creating A Character in Uncharted: Drake's Fortune
Creating A Character in Uncharted: Drake's FortuneCreating A Character in Uncharted: Drake's Fortune
Creating A Character in Uncharted: Drake's FortuneNaughty Dog
 
Amazing Feats of Daring - Uncharted Post Mortem
Amazing Feats of Daring - Uncharted Post MortemAmazing Feats of Daring - Uncharted Post Mortem
Amazing Feats of Daring - Uncharted Post MortemNaughty Dog
 
Multiprocessor Game Loops: Lessons from Uncharted 2: Among Thieves
Multiprocessor Game Loops: Lessons from Uncharted 2: Among ThievesMultiprocessor Game Loops: Lessons from Uncharted 2: Among Thieves
Multiprocessor Game Loops: Lessons from Uncharted 2: Among ThievesNaughty Dog
 
State-Based Scripting in Uncharted 2: Among Thieves
State-Based Scripting in Uncharted 2: Among ThievesState-Based Scripting in Uncharted 2: Among Thieves
State-Based Scripting in Uncharted 2: Among ThievesNaughty Dog
 
Naughty Dog Vertex
Naughty Dog VertexNaughty Dog Vertex
Naughty Dog VertexNaughty Dog
 

Destaque (11)

Adventures In Data Compilation
Adventures In Data CompilationAdventures In Data Compilation
Adventures In Data Compilation
 
Autodesk Mudbox
Autodesk MudboxAutodesk Mudbox
Autodesk Mudbox
 
Lighting Shading by John Hable
Lighting Shading by John HableLighting Shading by John Hable
Lighting Shading by John Hable
 
Uncharted Animation Workflow
Uncharted Animation WorkflowUncharted Animation Workflow
Uncharted Animation Workflow
 
Uncharted 2: Character Pipeline
Uncharted 2: Character PipelineUncharted 2: Character Pipeline
Uncharted 2: Character Pipeline
 
The Technology of Uncharted: Drake’s Fortune
The Technology of Uncharted: Drake’s FortuneThe Technology of Uncharted: Drake’s Fortune
The Technology of Uncharted: Drake’s Fortune
 
Creating A Character in Uncharted: Drake's Fortune
Creating A Character in Uncharted: Drake's FortuneCreating A Character in Uncharted: Drake's Fortune
Creating A Character in Uncharted: Drake's Fortune
 
Amazing Feats of Daring - Uncharted Post Mortem
Amazing Feats of Daring - Uncharted Post MortemAmazing Feats of Daring - Uncharted Post Mortem
Amazing Feats of Daring - Uncharted Post Mortem
 
Multiprocessor Game Loops: Lessons from Uncharted 2: Among Thieves
Multiprocessor Game Loops: Lessons from Uncharted 2: Among ThievesMultiprocessor Game Loops: Lessons from Uncharted 2: Among Thieves
Multiprocessor Game Loops: Lessons from Uncharted 2: Among Thieves
 
State-Based Scripting in Uncharted 2: Among Thieves
State-Based Scripting in Uncharted 2: Among ThievesState-Based Scripting in Uncharted 2: Among Thieves
State-Based Scripting in Uncharted 2: Among Thieves
 
Naughty Dog Vertex
Naughty Dog VertexNaughty Dog Vertex
Naughty Dog Vertex
 

Semelhante a SPU Optimizations - Part 2

2014-06-20 Multinomial Logistic Regression with Apache Spark
2014-06-20 Multinomial Logistic Regression with Apache Spark2014-06-20 Multinomial Logistic Regression with Apache Spark
2014-06-20 Multinomial Logistic Regression with Apache SparkDB Tsai
 
Multinomial Logistic Regression with Apache Spark
Multinomial Logistic Regression with Apache SparkMultinomial Logistic Regression with Apache Spark
Multinomial Logistic Regression with Apache SparkDB Tsai
 
Alpine Spark Implementation - Technical
Alpine Spark Implementation - TechnicalAlpine Spark Implementation - Technical
Alpine Spark Implementation - Technicalalpinedatalabs
 
Talk on Resource Allocation Strategies for Layered Multimedia Multicast Services
Talk on Resource Allocation Strategies for Layered Multimedia Multicast ServicesTalk on Resource Allocation Strategies for Layered Multimedia Multicast Services
Talk on Resource Allocation Strategies for Layered Multimedia Multicast ServicesAndrea Tassi
 
PVS-Studio team experience: checking various open source projects, or mistake...
PVS-Studio team experience: checking various open source projects, or mistake...PVS-Studio team experience: checking various open source projects, or mistake...
PVS-Studio team experience: checking various open source projects, or mistake...Andrey Karpov
 
how to calclute time complexity of algortihm
how to calclute time complexity of algortihmhow to calclute time complexity of algortihm
how to calclute time complexity of algortihmSajid Marwat
 
Lecture01a correctness
Lecture01a correctnessLecture01a correctness
Lecture01a correctnessSonia Djebali
 
My presentation at University of Nottingham "Fast low-rank methods for solvin...
My presentation at University of Nottingham "Fast low-rank methods for solvin...My presentation at University of Nottingham "Fast low-rank methods for solvin...
My presentation at University of Nottingham "Fast low-rank methods for solvin...Alexander Litvinenko
 
ADAPTIVE FUZZY KERNEL CLUSTERING ALGORITHM
ADAPTIVE FUZZY KERNEL CLUSTERING ALGORITHMADAPTIVE FUZZY KERNEL CLUSTERING ALGORITHM
ADAPTIVE FUZZY KERNEL CLUSTERING ALGORITHMWireilla
 
ADAPTIVE FUZZY KERNEL CLUSTERING ALGORITHM
ADAPTIVE FUZZY KERNEL CLUSTERING ALGORITHMADAPTIVE FUZZY KERNEL CLUSTERING ALGORITHM
ADAPTIVE FUZZY KERNEL CLUSTERING ALGORITHMijfls
 
Design and Analysis of Algorithms Lecture Notes
Design and Analysis of Algorithms Lecture NotesDesign and Analysis of Algorithms Lecture Notes
Design and Analysis of Algorithms Lecture NotesSreedhar Chowdam
 
Algorithm And analysis Lecture 03& 04-time complexity.
 Algorithm And analysis Lecture 03& 04-time complexity. Algorithm And analysis Lecture 03& 04-time complexity.
Algorithm And analysis Lecture 03& 04-time complexity.Tariq Khan
 
FPGA Implementation of A New Chien Search Block for Reed-Solomon Codes RS (25...
FPGA Implementation of A New Chien Search Block for Reed-Solomon Codes RS (25...FPGA Implementation of A New Chien Search Block for Reed-Solomon Codes RS (25...
FPGA Implementation of A New Chien Search Block for Reed-Solomon Codes RS (25...IJERA Editor
 
01 - DAA - PPT.pptx
01 - DAA - PPT.pptx01 - DAA - PPT.pptx
01 - DAA - PPT.pptxKokilaK25
 
Parallel R in snow (english after 2nd slide)
Parallel R in snow (english after 2nd slide)Parallel R in snow (english after 2nd slide)
Parallel R in snow (english after 2nd slide)Cdiscount
 

SPU Optimizations - Part 2

  • 1. Introduction to SPU Optimizations, Part 2: An Optimization Example. Pål-Kristian Engstad, pal engstad@naughtydog.com. March 5, 2010.
  • 2. Introduction
      As seen in part 1 of these slides, the SPUs have a very powerful instruction set, but how can we utilize this power efficiently?
      Solution:
        • Analyze the problem.
        • Get C/C++ code working.
        • Optimize!
  • 3. Analyzing for the SPU
        • Analyze memory requirements.
        • Partition your problem and data into sub-problems and sub-data until they fit on a single SPU.
        • Re-arrange your data to facilitate a streaming model.
      Each SPU has only 256 kB of local memory, so in most cases you need to partition your input data to fit. Sometimes, this can be hard. If your input data is a "spaghetti-tree" with pointers to pointers, etc., then your task will be very hard.
      The SPU wants:
        • Data in contiguous memory chunks, preferably each larger than 128 bytes.
        • Data aligned to 128-byte boundaries, or at least to 16-byte boundaries.
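The partitioning and alignment requirements above can be sketched in portable C++ (not SPU code). The 64 kB per-buffer budget and the names `kChunkBudget`, `isAligned`, `elemsPerChunk` and `chunkCount` are assumptions made for illustration; they do not come from the slides.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical budget: bytes of local store set aside for one input buffer.
// The real figure depends on code size, stack, and how many buffers are in flight.
constexpr std::size_t kChunkBudget = 64 * 1024;

// True if the address is a multiple of `alignment` (which must be a power of two).
inline bool isAligned(const void* p, std::size_t alignment)
{
    return (reinterpret_cast<std::uintptr_t>(p) & (alignment - 1)) == 0;
}

// Largest element count that fits the budget, rounded down to a multiple of 8.
// Assumes elemSize is a multiple of 16, so that every chunk is a whole number
// of 128-byte blocks and consecutive chunks keep their 128-byte alignment.
inline std::size_t elemsPerChunk(std::size_t elemSize)
{
    std::size_t n = kChunkBudget / elemSize;
    return n - (n % 8);
}

// Number of chunks needed to stream `totalElems` elements through the buffer.
inline std::size_t chunkCount(std::size_t totalElems, std::size_t elemSize)
{
    std::size_t per = elemsPerChunk(elemSize);
    return (totalElems + per - 1) / per;
}
```

With a 48-byte element, this yields chunks of 1360 elements (65280 bytes, a multiple of 128), so 10000 elements stream through in 8 chunks.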
  • 4. Optimize for the SPU
      First things first. We have to:
        • Get (C/C++) code working on a single SPU using synchronous memory transfers.
        • Enable asynchronous memory transfers.
        • Enable asynchronous execution (multiple SPUs).
      When everything works, optimize:
        • Parallelize inner loops, using intrinsics.
        • Translate inner loops to assembly.
        • Employ software pipelining.
        • Optimize setup, prologue and epilogue code.
      Also test your loops to make sure they work on known input data.
  • 5. Optimizing Example - Lighting Calculation
      Given a vertex position $P_v$ and a light position $P_l$, we calculate the light direction
      \[ \hat{l} = (P_l - P_v) / \|P_l - P_v\| = L / \|L\|, \tag{1} \]
      where we use the distance to the light $r = \|L\| = \sqrt{L \cdot L}$. The attenuation factor is a function of the distance to the light, using two constants $\alpha_1$ and $\alpha_2$,
      \[ \alpha(r) = \mathrm{clamp}\left(\frac{\alpha_1}{r^2} - \alpha_2,\ 0,\ 1\right), \tag{2} \]
      and then we finally get the color at that vertex, $V_c$, through
      \[ V_c = \alpha(r)\, \max(0,\ \hat{n}_v \cdot \hat{l})\, L_c, \tag{3} \]
      where $L_c$ is the $(r, g, b)$-color of the light source, and $\hat{n}_v$ is the vertex normal.
  • 6. Optimizing Example - Lighting Calculation
      // Cg-function:
      float3 lcalc(float3 vpos, float3 vnrm, float att0, float att1,
                   float3 lpos, float3 lclr)
      {
          float3 L = lpos - vpos;
          float r = sqrt(dot(L, L));                  // distance to the light
          float3 ldir = L / r;
          float at = saturate(att0 / (r*r) - att1);   // clamp to [0, 1]
          return at * max(0, dot(vnrm, ldir)) * lclr;
      }
  • 7. Optimizing Example - Lighting Calculation
      A trivial re-arrangement to avoid two divides is to notice that, with $d = L \cdot L$,
      \[ \frac{1}{r} = \frac{1}{\sqrt{d}} = \mathrm{rsqrt}(d), \tag{4} \]
      and
      \[ \frac{1}{r^2} = \frac{1}{(\sqrt{d})^2} = \left(\frac{1}{\sqrt{d}}\right)^2 = (\mathrm{rsqrt}(d))^2, \tag{5} \]
      thus:
  • 8. Optimizing Example - Lighting Calculation
      // Core function
      float3 lcalc(float3 vpos, float3 vnrm, float att0, float att1,
                   float3 lpos, float3 lclr)
      {
          float3 L = lpos - vpos;
          float d = dot(L, L);       // squared distance to the light
          float irs = rsqrt(d);      // 1/r
          float3 ldir = irs * L;
          float at = min(att0 * irs * irs - att1, 1);
          return max(0, at) * max(0, dot(vnrm, ldir)) * lclr;
      }
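As a quick sanity check of the rearrangement, here is a scalar C++ sketch that computes the attenuation both ways. The names `Vec3`, `dot3`, `attenDivide` and `attenRsqrt` are hypothetical helpers for this illustration, and `1.0f / std::sqrt` stands in for a hardware reciprocal square root.

```cpp
#include <cmath>

struct Vec3 { float x, y, z; };

inline float dot3(Vec3 a, Vec3 b) { return a.x*b.x + a.y*b.y + a.z*b.z; }

// Attenuation as first written: a square root followed by a divide.
inline float attenDivide(Vec3 L, float att0, float att1)
{
    float r = std::sqrt(dot3(L, L));
    float a = att0 / (r * r) - att1;
    return std::fmin(std::fmax(a, 0.0f), 1.0f);   // saturate
}

// Rearranged: one reciprocal square root, squared to obtain 1/r^2.
inline float attenRsqrt(Vec3 L, float att0, float att1)
{
    float irs = 1.0f / std::sqrt(dot3(L, L));     // stand-in for rsqrt(d)
    float a = std::fmin(att0 * irs * irs - att1, 1.0f);
    return std::fmax(0.0f, a);
}
```

The two versions agree up to floating-point rounding, which is worth confirming on known inputs before moving on to the vectorized code.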
  • 9. Optimizing Example - Lighting Calculation
      We want to use the lcalc function to accumulate the light from several light sources. So, our algorithm is:
      foreach vertex do
          vertex.clr = float3(0,0,0);
          foreach light do
              vertex.clr += lcalc(vertex.pos, vertex.nrm, light.att0, light.att1, light.pos, light.clr)
          endfor
      endfor
  • 10. Optimizing Example - Lighting Calculation
      Let's see how well GCC can do. We'll define some structures:
      #define ALIGN(x) __attribute__((aligned(x)))
      #define INLINE inline __attribute__((always_inline))
      #define VECRET /* __attribute__((vecreturn)) PPU only */
      struct float3 { float x, y, z; } VECRET;
      struct light {
          float3 pos; float att0;
          float3 clr; float att1;
      } ALIGN(16);
      struct vertex {
          float3 pos; float dummy1;   // dummy fields pad each float3 to 16 bytes
          float3 nrm; float dummy2;
          float3 clr; float dummy3;
      } ALIGN(16);
      INLINE float3 lcalc(vertex &v, light& l) // For convenience.
      {
          return lcalc(v.pos, v.nrm, l.att0, l.att1, l.pos, l.clr);
      }
  • 11. Optimizing Example - Lighting Calculation
      Some math functions:
      INLINE float3 mkvec3(float x, float y, float z) { float3 res = { x, y, z }; return res; }
      INLINE float dot(float3 a, float3 b) { return a.x*b.x + a.y*b.y + a.z*b.z; }
      INLINE float3 operator-(float3 a, float3 b) { return mkvec3(a.x-b.x, a.y-b.y, a.z-b.z); }
      INLINE void operator+=(float3& a, float3 b) { a.x += b.x; a.y += b.y; a.z += b.z; }
      INLINE float3 operator*(float k, float3 b) { return mkvec3(k*b.x, k*b.y, k*b.z); }
      INLINE float rsqrt(float x) { return 1/sqrtf(x); }
      INLINE float max(float a, float b) { return (a > b) ? a : b; }
      INLINE float min(float a, float b) { return (a < b) ? a : b; }
  • 12. Optimizing Example - Lighting Calculation
      Our loop now becomes:
      void accLights(int nLights, light *light, int nVerts, vertex *vtx)
      {
          for (int v = 0; v < nVerts; v++)
              for (int l = 0; l < nLights; l++)
                  if (l == 0)
                      vtx[v].clr = lcalc(vtx[v], light[l]);
                  else
                      vtx[v].clr += lcalc(vtx[v], light[l]);
      }
      Results: 231.6 cycles/vtx/light.
  • 13. Optimizing Example - Lighting Calculation
      Where does GCC go wrong?
        • It inlines all code. [Good.]
        • Cycles per instruction = 2.07.
        • Lots of stalls, 50.7% due to dependencies.
        • No vectorization!
      Let's try to help the compiler.
  • 14. Optimizing Example - Lighting Calculation
      Our loop now becomes:
      void accLights(int nLights, light *light, int nVerts, vertex *vtx)
      {
          // Assumes nVerts is a multiple of 4.
          if (nLights > 0)
              for (int v = 0; v < nVerts; v+=4)
                  for (int i = 0; i < 4; i++)
                      vtx[v+i].clr = lcalc(vtx[v+i], light[0]);   // first light initializes the color
          for (int l = 1; l < nLights; l++)
              for (int v = 0; v < nVerts; v+=4)
                  for (int i = 0; i < 4; i++)
                      vtx[v+i].clr += lcalc(vtx[v+i], light[l]);
      }
      Results: 208.5 cycles/vtx/light.
  • 15. Optimizing Example - Lighting Calculation
      Let's vectorize by hand. We will reorganize our vertexes into groups of 4, arranged so that the 4 x-values are located in a single vector register:
      typedef vector float real; // 4 real numbers
      struct vec3 {
          real xs, ys, zs;
          vec3() {}
          vec3(real x, real y, real z) : xs(x), ys(y), zs(z) {}
      };
      struct vertex4 { vec3 pos, nrm, clr; };
      Think of a hardware vector register as four real values, not as a vector. A vec3 is four (3-)vectors, and a vertex4 is four vertexes.
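The slides do not show how the input gets into this layout. A minimal portable sketch of the repacking follows, using a 16-byte aligned array of four floats as a stand-in for `vector float`; all type and function names here (`Float3`, `VertexAoS`, `Real4`, `Vec3x4`, `Vertex4`, `packVertexes`) are hypothetical.

```cpp
#include <cstddef>

struct Float3    { float x, y, z; };
struct VertexAoS { Float3 pos, nrm, clr; };     // one vertex per struct

// Portable stand-in for `vector float`: four lanes, 16-byte aligned.
struct alignas(16) Real4 { float lane[4]; };
struct Vec3x4  { Real4 xs, ys, zs; };           // four 3-vectors
struct Vertex4 { Vec3x4 pos, nrm, clr; };       // four vertexes

// Writes one 3-vector into the given lane of a group of four.
inline void pack(Vec3x4& dst, std::size_t lane, Float3 src)
{
    dst.xs.lane[lane] = src.x;
    dst.ys.lane[lane] = src.y;
    dst.zs.lane[lane] = src.z;
}

// Converts `count` vertexes (assumed to be a multiple of 4) into count/4 groups.
inline void packVertexes(const VertexAoS* in, std::size_t count, Vertex4* out)
{
    for (std::size_t i = 0; i < count; ++i) {
        Vertex4& group = out[i / 4];
        pack(group.pos, i % 4, in[i].pos);
        pack(group.nrm, i % 4, in[i].nrm);
        pack(group.clr, i % 4, in[i].clr);
    }
}
```

In practice this conversion would be done once, offline or at load time, so that the SPU only ever streams data that is already grouped.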
  • 16. Optimizing Example - Lighting Calculation
      As before, we need to define addition, subtraction, dot-product, scalar multiplication, and a couple of other functions:
      INLINE vec3 operator+(vec3 a, vec3 b)
      { return vec3(a.xs + b.xs, a.ys + b.ys, a.zs + b.zs); }
      INLINE vec3 operator-(vec3 a, vec3 b)
      { return vec3(a.xs - b.xs, a.ys - b.ys, a.zs - b.zs); }
      INLINE real dot(vec3 a, vec3 b)
      { return a.xs * b.xs + a.ys * b.ys + a.zs * b.zs; }
      INLINE vec3 operator*(real as, vec3 b)
      { return vec3(as * b.xs, as * b.ys, as * b.zs); }
      INLINE vec3 splat(float3 a)
      { return vec3(spu_splats(a.x), spu_splats(a.y), spu_splats(a.z)); }
  • 17. Optimizing Example - Lighting Calculation
      Also, since the light values are the same for each of the four components, we must "splat" the light components into the vector:
      struct lightenv { // 8 vector floats.
          vec3 pos, clr;
          real att0s, att1s;
          lightenv(light in) {
              pos = splat(in.pos);
              clr = splat(in.clr);
              att0s = spu_splats(in.att0);
              att1s = spu_splats(in.att1);
          }
      };
  • 18. Optimizing Example - Lighting Calculation
      The core function:
      // Light calculation for 4 vertexes
      static INLINE vec3 lcalc4(vertex4 v, lightenv l)
      {
          vec3 L = l.pos - v.pos;
          real Lsq = dot(L, L);
          real irs = rsqrtf4(Lsq);
          vec3 ldir = irs * L;
          real at = fminf4(l.att0s * irs * irs - l.att1s, ones);
          real dp = dot(v.nrm, ldir);
          return fmaxf4(zeros, at) * fmaxf4(zeros, dp) * l.clr;
      }
  • 19. Optimizing Example - Lighting Calculation
      The calling function:
      void accLights(int nLights, light *light, int nVerts, vertex4 *vtx)
      {
          for (int l = 0; l < nLights; l++) {
              lightenv lgt(light[l]);
              for (int v = 0; v < nVerts/4; v++) {
                  vec3 clr = lcalc4(vtx[v], lgt);
                  vtx[v].clr = clr + ((l==0) ? zerovec : vtx[v].clr);
              }
          }
      }
      Result: 31.8 cycles/vertex/light! This is a factor of 6.56 better, but we can do much more...
  • 20. Optimizing Example - Lighting Calculation
      It is more than 4 times better since, in addition to having vectorized the code (4 times speedup), the compiler no longer has to do awkward shuffles when loading and storing values.
      However, using spusim, we still see that more than 50% of the cycles are wasted by dependency stalls. Of the rest, 76% are still single-issued. So, GCC does a terrible job of optimizing our loop.
      We can try to unroll the loop manually:
  • 21. Optimizing Example - Lighting Calculation
      for (int v = 0; v < nVerts/4; v+=4) {
          vec3 clr0 = lcalc4(vtx[v+0], lgt);
          vec3 clr1 = lcalc4(vtx[v+1], lgt);
          vec3 clr2 = lcalc4(vtx[v+2], lgt);
          vec3 clr3 = lcalc4(vtx[v+3], lgt);
          vtx[v+0].clr = clr0 + ((l==0) ? zerovec : vtx[v+0].clr);
          vtx[v+1].clr = clr1 + ((l==0) ? zerovec : vtx[v+1].clr);
          vtx[v+2].clr = clr2 + ((l==0) ? zerovec : vtx[v+2].clr);
          vtx[v+3].clr = clr3 + ((l==0) ? zerovec : vtx[v+3].clr);
      }
      Result: 15.0 cycles/vertex/light!!
      Pretty good, but spusim tells us that although most stalls are gone (down to about 10%), 60% of the cycles are single-issue rather than dual-issue.
  • 22. Optimizing Example - Lighting Calculation
      What is the minimum number of cycles in our loop? We have:
      L   = l.pos - v.pos                    // 3 fs
      Lsq = dot(L, L)                        // fm + 2 fma
      dp  = dot(v.nrm, L)                    // fm + 2 fma
      irs = rsqrtf4(Lsq)                     // frsqest + fi + (2 fm + fnms + fma)
      at  = l.att0s * irs * irs - l.att1s    // fm, fms
      at  = fminf4(at, ones)                 // fcgt, selb
      m0  = fmaxf4(zeros, at)                // fcgt, selb
      m1  = fmaxf4(zeros, dp * irs)          // fm, fcgt, selb
      ms  = m0 * m1                          // fm
      tmp = lclr * irs                       // 3 fm
      clr = ms * tmp + pclr                  // 3 fma
      So, adding one instruction for adding the loop-counter, we should be able to software pipeline this down to 31/4 = 7.75 cycles per vertex!
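One way to arrive at the figure of 31 is to count only the even-pipe instructions in the list above, leaving out frsqest (which the deck later marks as an odd-pipe instruction) and adding one for the loop counter. That reading is an interpretation, not something the slide states explicitly; the table and the names `Step`, `kSteps`, `evenOpsPerIteration` and `minCyclesPerVertex` below are illustrative only.

```cpp
// Even-pipe instruction counts per loop iteration, read off the list above.
struct Step { const char* what; int evenOps; };

constexpr Step kSteps[] = {
    {"L = l.pos - v.pos",          3},  // 3 fs
    {"Lsq = dot(L, L)",            3},  // fm + 2 fma
    {"dp = dot(v.nrm, L)",         3},  // fm + 2 fma
    {"irs = rsqrtf4(Lsq)",         5},  // fi + 2 fm + fnms + fma (frsqest is odd-pipe)
    {"at = att0*irs*irs - att1",   2},  // fm, fms
    {"at = fminf4(at, ones)",      2},  // fcgt, selb
    {"m0 = fmaxf4(zeros, at)",     2},  // fcgt, selb
    {"m1 = fmaxf4(zeros, dp*irs)", 3},  // fm, fcgt, selb
    {"ms = m0 * m1",               1},  // fm
    {"tmp = lclr * irs",           3},  // 3 fm
    {"clr = ms * tmp + pclr",      3},  // 3 fma
    {"loop counter update",        1},
};

constexpr int evenOpsPerIteration()
{
    int n = 0;
    for (const Step& s : kSteps) n += s.evenOps;
    return n;
}

// One iteration handles 4 vertexes, and the even pipe issues at most one
// instruction per cycle, so this is a lower bound per vertex per light.
constexpr double minCyclesPerVertex() { return evenOpsPerIteration() / 4.0; }
```

The bound only holds if the odd pipe carries fewer instructions than the even pipe, which is the case here since it has little more than the loads, stores and the branch to do.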
  • 23. Optimizing Example - Lighting Calculation
      We can improve this loop a lot by using a process called software pipelining. It is a fairly mechanical process:
        • Write out the instruction sequence in assembly.
        • Add up even instructions and odd instructions. The maximum is the II, the initiation interval.
        • Also note (write down) the latency of each instruction.
        • Create a table with two columns and as many rows as the II.
        • Insert instructions from the top adhering to latency demands. If you get to row II, then wrap around.
      In the following slides, we'll show how we do it.
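The table-filling steps above can be sketched as a small greedy placement routine. This is a simplified illustration that handles a single pipe only and takes instructions in program order; it is not the author's tool, and `Instr`, `Slot` and `moduloSchedule` are hypothetical names.

```cpp
#include <cstddef>
#include <vector>

struct Instr {
    int latency;            // cycles until the result can be used
    std::vector<int> deps;  // indexes of earlier instructions whose results are read
};

struct Slot {
    int row;    // row in the table, 0 .. II-1
    int stage;  // how many iterations later the instruction runs (the ":n" annotation)
};

// Places the instructions of ONE pipe into a table with `ii` rows.
// Assumes instrs.size() <= ii and that deps only refer to earlier entries.
inline std::vector<Slot> moduloSchedule(const std::vector<Instr>& instrs, int ii)
{
    std::vector<bool> used(ii, false);
    std::vector<int> issue(instrs.size(), 0);
    std::vector<Slot> out;
    for (std::size_t i = 0; i < instrs.size(); ++i) {
        // Earliest cycle at which all inputs are available.
        int t = 0;
        for (int d : instrs[i].deps) {
            int ready = issue[d] + instrs[d].latency;
            if (ready > t) t = ready;
        }
        // If the row is taken, move down, wrapping around past row II-1.
        while (used[t % ii]) ++t;
        used[t % ii] = true;
        issue[i] = t;
        out.push_back({t % ii, t / ii});
    }
    return out;
}
```

For a chain of three dependent 6-cycle instructions and II = 4, this places them at rows 0, 2 and 1 with stages 0, 1 and 3: the third instruction wants row 0, finds it occupied, and slides down one row.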
  • 24. Optimizing Example - Lighting Calculation
      We'll start with
      vec3 L = l.pos - v.pos;
      This is going to be translated (in assembly) to:
      {o6} lqx vposx, pxptr, iofs   ; vposx := mem[pxptr+iofs]
      {o6} lqx vposy, pyptr, iofs   ; vposy := mem[pyptr+iofs]
      {o6} lqx vposz, pzptr, iofs   ; vposz := mem[pzptr+iofs]
      {e6} fs Lx, lposx, vposx      ; even pipe, latency = 6.
      {e6} fs Ly, lposy, vposy      ; {text} are comments.
      {e6} fs Lz, lposz, vposz
      We use the x-form of the load so that we need only one offset variable, which simplifies our loop control. Of course, this means that we have to initialize pxptr, pyptr and pzptr accordingly. All light variables must also be initialized.
      Let's start!
  • 25. ;;; Start of loop (page 1)
      {nop}                            {o6:0} lqx vposx, pxptr, iofs  ; 0
      {nop}                            {o6:0} lqx vposy, pyptr, iofs  ; 1
      {nop}                            {o6:0} lqx vposz, pzptr, iofs  ; 2
      {nop}                            {lnop}                         ; 3
      {nop}                            {lnop}                         ; 4
      {nop}                            {lnop}                         ; 5
      {e6:0} fs Lx, lposx, vposx       {lnop}                         ; 6
      {e6:0} fs Ly, lposy, vposy       {lnop}                         ; 7
      {e6:0} fs Lz, lposz, vposz       {lnop}                         ; 8
      {nop}                            {lnop}                         ; 9
      {nop}                            {lnop}                         ; 10
      {nop}                            {lnop}                         ; 11
      {nop}                            {lnop}                         ; 12
      {nop}                            {lnop}                         ; 13
      {nop}                            {lnop}                         ; 14
      {nop}                            {lnop}                         ; 15
  • 26. Optimizing Example - Lighting Calculation
      Then we just continue with
      Lsq = dot(L, L)
      dp = dot(v.nrm, L)
      Translated:
      {e6} fm Lsq, Lx, Lx          ; Lsq = Lx*Lx
      {e6} fma Lsq, Ly, Ly, Lsq    ; Lsq = Ly*Ly + Lsq
      {e6} fma Lsq, Lz, Lz, Lsq    ; Lsq = Lz*Lz + Lsq
      {e6} fm dp, vnrmx, Lx        ; dp = Nx*Lx
      {e6} fma dp, vnrmy, Ly, dp   ; dp = Ny*Ly + dp
      {e6} fma dp, vnrmz, Lz, dp   ; dp = Nz*Lz + dp
  • 27. ;;; Start of loop (page 1)
      {nop}                            {o6:0} lqx vposx, pxptr, iofs  ; 0
      {nop}                            {o6:0} lqx vposy, pyptr, iofs  ; 1
      {nop}                            {o6:0} lqx vposz, pzptr, iofs  ; 2
      {nop}                            {lnop}                         ; 3
      {nop}                            {lnop}                         ; 4
      {nop}                            {lnop}                         ; 5
      {e6:0} fs Lx, lposx, vposx       {lnop}                         ; 6
      {e6:0} fs Ly, lposy, vposy       {lnop}                         ; 7
      {e6:0} fs Lz, lposz, vposz       {lnop}                         ; 8
      {nop}                            {lnop}                         ; 9
      {nop}                            {lnop}                         ; 10
      {nop}                            {lnop}                         ; 11
      {nop}                            {lnop}                         ; 12
      {nop}                            {lnop}                         ; 13
      {nop}                            {lnop}                         ; 14
      {nop}                            {lnop}                         ; 15
  • 28. ;;; Start of loop (page 1)
      {nop}                            {o6:0} lqx vposx, pxptr, iofs  ; 0
      {nop}                            {o6:0} lqx vposy, pyptr, iofs  ; 1
      {nop}                            {o6:0} lqx vposz, pzptr, iofs  ; 2
      {nop}                            {o6:0} lqx vnrmx, nxptr, iofs  ; 3
      {nop}                            {o6:0} lqx vnrmy, nyptr, iofs  ; 4
      {nop}                            {o6:0} lqx vnrmz, nzptr, iofs  ; 5
      {e6:0} fs Lx, lposx, vposx       {lnop}                         ; 6
      {e6:0} fs Ly, lposy, vposy       {lnop}                         ; 7
      {e6:0} fs Lz, lposz, vposz       {lnop}                         ; 8
      {nop}                            {lnop}                         ; 9
      {nop}                            {lnop}                         ; 10
      {nop}                            {lnop}                         ; 11
      {e6:0} fm Lsq, Lx, Lx            {lnop}                         ; 12
      {e6:0} fm dp, vnrmx, Lx          {lnop}                         ; 13
      {nop}                            {lnop}                         ; 14
      {nop}                            {lnop}                         ; 15
  • 29. ;;; Continued (page 2)
      {nop}                            {lnop}                         ; 16
      {nop}                            {lnop}                         ; 17
      {nop}                            {lnop}                         ; 18
      {e6:0} fma Lsq, Ly, Ly, Lsq      {lnop}                         ; 19
      {e6:0} fma dp, vnrmy, Ly, dp     {lnop}                         ; 20
      {nop}                            {lnop}                         ; 21
      {nop}                            {lnop}                         ; 22
      {nop}                            {lnop}                         ; 23
      {nop}                            {lnop}                         ; 24
      {e6:0} fma Lsq_, Lz, Lz, Lsq     {lnop}                         ; 25
      {e6:0} fma dp_, vnrmz, Lz, dp    {lnop}                         ; 26
      {nop}                            {lnop}                         ; 27
      {nop}                            {lnop}                         ; 28
      {nop}                            {lnop}                         ; 29
      {nop}                            {o1:-} brnz ..., looptop       ; 30
      ;;; Must wrap around!
      Notice that we rename the output variables from Lsq and dp to Lsq_ and dp_, so that they don't clash with registers when they are carried into the next loop.
  • 30. Optimizing Example - Lighting Calculation
      The irs = rsqrtf4(Lsq) function is implemented using:
      {o4} frsqest t0, Lsq
      {e7} fi t1, Lsq, t0
      {e6} fm t2, t1, Lsq
      {e6} fm t3, t1, onehalf
      {e6} fnms t2, t2, t1, one
      {e6} fma irs, t3, t2, t1
      After this sequence, irs is good to 24 bits of precision. Notice the huge latency (ΔL = 4 + 7 + 6 + 6 + 6 + 6 = 35), so the result is only ready in the next iteration.
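The last four instructions of this sequence amount to one Newton-Raphson step on the estimate. A scalar C++ sketch, written in the same order as the fm / fm / fnms / fma instructions, makes the arithmetic easier to follow. Here `y0` stands in for the value produced by frsqest and fi, which portable code cannot reproduce exactly, and `refineRsqrt` is a hypothetical name.

```cpp
#include <cmath>

// One Newton-Raphson refinement of y0 ~ 1/sqrt(x):
//   y1 = y0 + (y0/2) * (1 - x*y0*y0)
inline float refineRsqrt(float x, float y0)
{
    float t2 = y0 * x;        // fm   t2, t1, Lsq
    float t3 = y0 * 0.5f;     // fm   t3, t1, onehalf
    t2 = 1.0f - t2 * y0;      // fnms t2, t2, t1, one   (one - t2*t1)
    return t3 * t2 + y0;      // fma  irs, t3, t2, t1
}
```

Starting from an estimate that is 4% off (0.48 for x = 4), a single step brings the error down to roughly 0.2%, and an exact input is left unchanged.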
  • 31. ;;; Start of loop (page 1)
      {nop}                            {o6:0} lqx vposx, pxptr, iofs  ; 0
      {nop}                            {o6:0} lqx vposy, pyptr, iofs  ; 1
      {nop}                            {o6:0} lqx vposz, pzptr, iofs  ; 2
      {nop}                            {o6:0} lqx vnrmx, nxptr, iofs  ; 3
      {nop}                            {o6:0} lqx vnrmy, nyptr, iofs  ; 4
      {nop}                            {o6:0} lqx vnrmz, nzptr, iofs  ; 5
      {e6:0} fs Lx, lposx, vposx       {o4:1} frsqest t0, Lsq_        ; 6
      {e6:0} fs Ly, lposy, vposy       {lnop}                         ; 7
      {e6:0} fs Lz, lposz, vposz       {lnop}                         ; 8
      {nop}                            {lnop}                         ; 9
      {e7:1} fi t1, Lsq_, t0           {lnop}                         ; 10
      {nop}                            {lnop}                         ; 11
      {e6:0} fm Lsq, Lx, Lx            {lnop}                         ; 12
      {e6:0} fm dp, vnrmx, Lx          {lnop}                         ; 13
      {nop}                            {lnop}                         ; 14
      {nop}                            {lnop}                         ; 15
      We annotate with :1 to indicate that these instructions are delayed into the next iteration of the loop.
  • 32. ;;; Continued (page 2)
      {nop}                            {lnop}                         ; 16
      {nop}                            {lnop}                         ; 17
      {e6:1} fm t2, t1, Lsq_           {lnop}                         ; 18
      {e6:0} fma Lsq, Ly, Ly, Lsq      {lnop}                         ; 19
      {e6:0} fma dp, vnrmy, Ly, dp     {lnop}                         ; 20
      {nop}                            {lnop}                         ; 21
      {nop}                            {lnop}                         ; 22
      {nop}                            {lnop}                         ; 23
      {e6:1} fm t3, t1, onehalf        {lnop}                         ; 24
      {e6:0} fma Lsq_, Lz, Lz, Lsq     {lnop}                         ; 25
      {e6:0} fma dp_, vnrmz, Lz, dp    {lnop}                         ; 26
      {nop}                            {lnop}                         ; 27
      {nop}                            {lnop}                         ; 28
      {nop}                            {lnop}                         ; 29
      {e6:1} fnms t2, t2, t1, one      {o1:-} brnz ..., looptop       ; 30
  • 33. ;;; Start of loop (page 1)
      {nop}                            {o6:0} lqx vposx, pxptr, iofs  ; 0
      {nop}                            {o6:0} lqx vposy, pyptr, iofs  ; 1
      {nop}                            {o6:0} lqx vposz, pzptr, iofs  ; 2
      {nop}                            {o6:0} lqx vnrmx, nxptr, iofs  ; 3
      {nop}                            {o6:0} lqx vnrmy, nyptr, iofs  ; 4
      {e6:2} fma irs, t3, t2, t1       {o6:0} lqx vnrmz, nzptr, iofs  ; 5
      {e6:0} fs Lx, lposx, vposx       {o4:1} frsqest t0, Lsq_        ; 6
      {e6:0} fs Ly, lposy, vposy       {lnop}                         ; 7
      {e6:0} fs Lz, lposz, vposz       {lnop}                         ; 8
      {nop}                            {lnop}                         ; 9
      {e7:1} fi t1, Lsq_, t0           {lnop}                         ; 10
      {nop}                            {lnop}                         ; 11
      {e6:0} fm Lsq, Lx, Lx            {lnop}                         ; 12
      {e6:0} fm dp, vnrmx, Lx          {lnop}                         ; 13
      {nop}                            {lnop}                         ; 14
      {nop}                            {lnop}                         ; 15
  • 34. Optimizing Example - Lighting Calculation
      Continuing with:
      at = l.att0s * irs * irs - l.att1s   // fm, fms
      at = fminf4(at, ones)                // fcgt, selb
      This translates to:
      {e6} fm ir, irs, irs
      {e6} fms at, att0s, ir, att1s
      {e2} fcgt c0, at, ones        ; c0.w = at.w > ones.w ? 111..111 : 000..000
      {e2} selb at, c0, at, ones
  • 35. ;;; Start of loop (page 1)
      {nop}                            {o6:0} lqx vposx, pxptr, iofs  ; 0
      {nop}                            {o6:0} lqx vposy, pyptr, iofs  ; 1
      {nop}                            {o6:0} lqx vposz, pzptr, iofs  ; 2
      {nop}                            {o6:0} lqx vnrmx, nxptr, iofs  ; 3
      {nop}                            {o6:0} lqx vnrmy, nyptr, iofs  ; 4
      {e6:2} fma irs, t3, t2, t1       {o6:0} lqx vnrmz, nzptr, iofs  ; 5
      {e6:0} fs Lx, lposx, vposx       {o4:1} frsqest t0, Lsq_        ; 6
      {e6:0} fs Ly, lposy, vposy       {lnop}                         ; 7
      {e6:0} fs Lz, lposz, vposz       {lnop}                         ; 8
      {nop}                            {lnop}                         ; 9
      {e7:1} fi t1, Lsq_, t0           {lnop}                         ; 10
      {e6:2} fm ir, irs, irs           {lnop}                         ; 11
      {e6:0} fm Lsq, Lx, Lx            {lnop}                         ; 12
      {e6:0} fm dp, vnrmx, Lx          {lnop}                         ; 13
      {nop}                            {lnop}                         ; 14
      {nop}                            {lnop}                         ; 15
  • 36. ;;; Continued (page 2)
      {nop}                            {lnop}                         ; 16
      {nop}                            {lnop}                         ; 17
      {e6:1} fm t2, t1, Lsq_           {lnop}                         ; 18
      {e6:0} fma Lsq, Ly, Ly, Lsq      {lnop}                         ; 19
      {e6:2} fms at, att0s, ir, att1s  {lnop}                         ; 20
      {e6:0} fma dp, vnrmy, Ly, dp     {lnop}                         ; 21
      {nop}                            {lnop}                         ; 22
      {nop}                            {lnop}                         ; 23
      {e6:1} fm t3, t1, onehalf        {lnop}                         ; 24
      {e6:0} fma Lsq_, Lz, Lz, Lsq     {lnop}                         ; 25
      {e2:2} fcgt c0, at, ones         {lnop}                         ; 26
      {e6:0} fma dp_, vnrmz, Lz, dp    {lnop}                         ; 27
      {e2:2} selb at, c0, at, ones     {lnop}                         ; 28
      {nop}                            {lnop}                         ; 29
      {e6:1} fnms t2, t2, t1, one      {o1:-} brnz ..., looptop       ; 30
  • 37. Optimizing Example - Lighting Calculation
    Continuing with:
        m1 = fmaxf4(zeros, dp*irs)   // fm, fcgt, selb
        m0 = fmaxf4(zeros, at)       // fcgt, selb
        ms = m0 * m1                 // fm
    This translates to:
        {e6} fm   dprs, dp, irs
        {e2} fcgt c1, zeros, dprs
        {e2} selb m1, c1, dprs, zeros
        {e2} fcgt c2, zeros, at
        {e2} selb m0, c2, at, zeros
        {e6} fm   ms, m0, m1
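As a cross-check of the clamping logic, here is a small scalar sketch in C of the same three statements for one lane. The names max0 and light_scale are hypothetical, and the assertion that irs is non-negative is an assumption based on irs being a reciprocal square root.

```c
#include <assert.h>

/* max(0, x) written the way the slide does it: compare 0 > x, then select. */
static float max0(float x) {
    int c = (0.0f > x);        /* fcgt c, zeros, x */
    return c ? 0.0f : x;       /* selb: x where the mask is 0, zeros where it is 1 */
}

/* ms = max(0, at) * max(0, dp * irs) for a single lane. */
static float light_scale(float dp, float irs, float at) {
    assert(irs >= 0.0f);       /* assumption: irs is a reciprocal square root */
    float m1 = max0(dp * irs); /* fm, fcgt, selb */
    float m0 = max0(at);       /* fcgt, selb     */
    return m0 * m1;            /* fm             */
}
```

A surface facing away from the light (negative dp) or a light beyond its attenuation range (negative at) both produce a scale of zero, so no branch is needed in the loop.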
  • 38. ;;; Start of loop (page 1) : lmove dst, src == rotm dst, src, zeros
    {nop}                             {o6:0} lqx vposx, pxptr, iofs           ; 0
    {nop}                             {o6:0} lqx vposy, pyptr, iofs           ; 1
    {nop}                             {o6:0} lqx vposz, pzptr, iofs           ; 2
    {nop}                             {o6:0} lqx vnrmx, nxptr, iofs           ; 3
    {nop}                             {o6:0} lqx vnrmy, nyptr, iofs           ; 4
    {e6:2} fma irs, t3, t2, t1        {o6:0} lqx vnrmz, nzptr, iofs           ; 5
    {e6:0} fs Lx, lposx, vposx        {o4:1} frsqest t0, Lsq_                 ; 6
    {e6:0} fs Ly, lposy, vposy        {lnop}                                  ; 7
    {e6:0} fs Lz, lposz, vposz        {lnop}                                  ; 8
    {nop}                             {lnop}                                  ; 9
    {e7:1} fi t1, Lsq_, t0            {lnop}                                  ; 10
    {e6:2} fm ir, irs, irs            {lnop}                                  ; 11
    {e6:0} fm Lsq, Lx, Lx             {lnop}                                  ; 12
    {e6:2} fm dprs, dp__, irs         {lnop}                                  ; 13
    {e6:0} fm dp, vnrmx, Lx           {o4:1} lmove dp__, dp_                  ; 14
    {nop}                             {lnop}                                  ; 15
  • 39. ;;; Continued (page 2)
    {nop}                             {lnop}                                  ; 16
    {nop}                             {lnop}                                  ; 17
    {e6:1} fm t2, t1, Lsq_            {lnop}                                  ; 18
    {e6:0} fma Lsq, Ly, Ly, Lsq       {lnop}                                  ; 19
    {e6:2} fms at, att0s, ir, att1s   {lnop}                                  ; 20
    {e6:0} fma dp, vnrmy, Ly, dp      {lnop}                                  ; 21
    {e2:2} fcgt c1, zeros, dprs       {lnop}                                  ; 22
    {nop}                             {lnop}                                  ; 23
    {e6:1} fm t3, t1, onehalf         {lnop}                                  ; 24
    {e6:0} fma Lsq_, Lz, Lz, Lsq      {lnop}                                  ; 25
    {e2:2} fcgt c0, at, ones          {lnop}                                  ; 26
    {e6:0} fma dp_, vnrmz, Lz, dp     {lnop}                                  ; 27
    {e2:2} selb at, c0, at, ones      {lnop}                                  ; 28
    {e2:2} selb m1, c1, dprs, zeros   {lnop}                                  ; 29
    {e6:1} fnms t2, t2, t1, one       {o1:-} brnz ..., looptop                ; 30
  • 40. ;;; Start of loop (page 1)
    {e2:3} fcgt c2, zeros, at         {o6:0} lqx vposx, pxptr, iofs           ; 0
    {nop}                             {o6:0} lqx vposy, pyptr, iofs           ; 1
    {e2:3} selb m0, c2, at, zeros     {o6:0} lqx vposz, pzptr, iofs           ; 2
    {nop}                             {o6:0} lqx vnrmx, nxptr, iofs           ; 3
    {e6:3} fm ms, m0, m1              {o6:0} lqx vnrmy, nyptr, iofs           ; 4
    {e6:2} fma irs, t3, t2, t1        {o6:0} lqx vnrmz, nzptr, iofs           ; 5
    {e6:0} fs Lx, lposx, vposx        {o4:1} frsqest t0, Lsq_                 ; 6
    {e6:0} fs Ly, lposy, vposy        {lnop}                                  ; 7
    {e6:0} fs Lz, lposz, vposz        {lnop}                                  ; 8
    {nop}                             {lnop}                                  ; 9
    {e7:1} fi t1, Lsq_, t0            {lnop}                                  ; 10
    {e6:2} fm ir, irs, irs            {lnop}                                  ; 11
    {e6:0} fm Lsq, Lx, Lx             {lnop}                                  ; 12
    {e6:2} fm dprs, dp__, irs         {lnop}                                  ; 13
    {e6:0} fm dp, vnrmx, Lx           {o4:1} lmove dp__, dp_                  ; 14
    {nop}                             {lnop}                                  ; 15
  • 41. Optimizing Example - Lighting Calculation
    Finally, we can squeeze in our last 6 instructions:
        tmp = lclr * irs        // 3 fm
        clr = ms * tmp + prev   // 3 fma
    Translation:
        {e6} fm  tmpr, lclrr, irs
        {e6} fm  tmpg, lclrg, irs
        {e6} fm  tmpb, lclrb, irs
        {e6} fma clrr, ms, tmpr, prevr   ; prev color
        {e6} fma clrg, ms, tmpg, prevg
        {e6} fma clrb, ms, tmpb, prevb
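The six instructions amount to one multiply and one multiply-add per color channel. A scalar sketch in C, assuming a hypothetical Color struct and hypothetical function names (the slides keep each channel in its own vector register instead):

```c
#include <assert.h>

/* Hypothetical scalar color; the SPU code holds r, g, b for four vertices
   in three separate vector registers. */
typedef struct { float r, g, b; } Color;

/* One channel: tmp = lclr * irs (fm), clr = ms * tmp + prev (fma). */
static float accumulate_channel(float lclr, float irs, float ms, float prev) {
    float tmp = lclr * irs;
    return ms * tmp + prev;
}

/* All three channels, mirroring the 3 fm + 3 fma on the slide. */
static Color accumulate_light(Color lclr, float irs, float ms, Color prev) {
    Color clr;
    assert(ms >= 0.0f);   /* assumption: ms is a product of two clamped terms */
    clr.r = accumulate_channel(lclr.r, irs, ms, prev.r);
    clr.g = accumulate_channel(lclr.g, irs, ms, prev.g);
    clr.b = accumulate_channel(lclr.b, irs, ms, prev.b);
    return clr;
}
```

Each channel reads its own prev value, which is why the blue channel must accumulate into prevb and not prevg.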
  • 42. ;;; Start of loop (page 1)
    {e2:3} fcgt c2, zeros, at         {o6:0} lqx vposx, pxptr, iofs           ; 0
    {e6:3} fm tmpr, lclrr, irs_       {o6:0} lqx vposy, pyptr, iofs           ; 1
    {e2:3} selb m0, c2, at, zeros     {o6:0} lqx vposz, pzptr, iofs           ; 2
    {e6:3} fm tmpg, lclrg, irs_       {o6:0} lqx vnrmx, nxptr, iofs           ; 3
    {e6:3} fm ms, m0, m1              {o6:0} lqx vnrmy, nyptr, iofs           ; 4
    {e6:2} fma irs, t3, t2, t1        {o6:0} lqx vnrmz, nzptr, iofs           ; 5
    {e6:0} fs Lx, lposx, vposx        {o4:1} frsqest t0, Lsq_                 ; 6
    {e6:0} fs Ly, lposy, vposy        {lnop}                                  ; 7
    {e6:0} fs Lz, lposz, vposz        {lnop}                                  ; 8
    {e6:3} fm tmpb, lclrb, irs_       {lnop}                                  ; 9
    {e7:1} fi t1, Lsq_, t0            {lnop}                                  ; 10
    {e6:2} fm ir, irs, irs            {lnop}                                  ; 11
    {e6:0} fm Lsq, Lx, Lx             {o4:2} lmove irs_, irs                  ; 12
    {e6:2} fm dprs, dp__, irs         {lnop}                                  ; 13
    {e6:0} fm dp, vnrmx, Lx           {o4:1} lmove dp__, dp_                  ; 14
    {e6:3} fma clrr, ms, tmpr, prevr  {lnop}                                  ; 15
    Here, we needed to extend the lifetime of irs, so we introduced the move to irs_ in iteration 2 in order to have a clean copy in iteration 3.
  • 43. ;;; Continued (page 2)
    {nop}                             {lnop}                                  ; 16
    {e6:3} fma clrg, ms, tmpg, prevg  {lnop}                                  ; 17
    {e6:1} fm t2, t1, Lsq_            {lnop}                                  ; 18
    {e6:0} fma Lsq, Ly, Ly, Lsq       {lnop}                                  ; 19
    {e6:2} fms at, att0s, ir, att1s   {lnop}                                  ; 20
    {e6:0} fma dp, vnrmy, Ly, dp      {lnop}                                  ; 21
    {e2:2} fcgt c1, zeros, dprs       {lnop}                                  ; 22
    {e6:3} fma clrb, ms, tmpb, prevb  {lnop}                                  ; 23
    {e6:1} fm t3, t1, onehalf         {lnop}                                  ; 24
    {e6:0} fma Lsq_, Lz, Lz, Lsq      {lnop}                                  ; 25
    {e2:2} fcgt c0, at, ones          {lnop}                                  ; 26
    {e6:0} fma dp_, vnrmz, Lz, dp     {lnop}                                  ; 27
    {e2:2} selb at, c0, at, ones      {lnop}                                  ; 28
    {e2:2} selb m1, c1, dprs, zeros   {lnop}                                  ; 29
    {e6:1} fnms t2, t2, t1, one       {o1:-} brnz ..., looptop                ; 30
  • 44. Optimizing Example - Lighting Calculation
    We're almost done! We must finish up with a technique to select prevr, prevg and prevb, which must be zero for the first light but hold the previously accumulated color for all later lights. We can do this by using a different select map. For light 0, we'll use mask = m_0000, which creates all zeros, and mask = m_ABCD for all others. So, for each of the color's (R, G, B) fields:
        {o6} lqx   vclrr, crptr, oofs          ; load using output offset
        {o4} shufb prevr, vclrr, vclrr, mask   ; copy vclrr or set to zeros
        ;; use prevr
        {o6} stqx  clrr, crptr, oofs           ; store using output offset
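To make the two masks concrete, here is a byte-level sketch in C of how shufb can either copy a register or zero it. The Quad type and all function names are hypothetical. The encoding of m_0000 as sixteen 0x80 control bytes is an assumption (the slides do not show the mask contents); it relies on the shufb rule that a control byte of the form 10xxxxxx produces 0x00.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef struct { uint8_t b[16]; } Quad;   /* hypothetical stand-in for a 128-bit register */

/* Byte-wise model of shufb rt, ra, rb, rc. */
static Quad shufb(Quad ra, Quad rb, Quad rc) {
    Quad rt;
    for (int i = 0; i < 16; ++i) {
        uint8_t c = rc.b[i];
        if ((c & 0xC0) == 0x80)      rt.b[i] = 0x00;   /* 10xxxxxx -> constant 0x00 */
        else if ((c & 0xE0) == 0xC0) rt.b[i] = 0xFF;   /* 110xxxxx -> constant 0xFF */
        else if ((c & 0xE0) == 0xE0) rt.b[i] = 0x80;   /* 111xxxxx -> constant 0x80 */
        else rt.b[i] = (c & 0x10) ? rb.b[c & 0x0F] : ra.b[c & 0x0F];
    }
    return rt;
}

/* m_ABCD: identity copy of the first operand. */
static Quad mask_ABCD(void) {
    Quad m;
    for (int i = 0; i < 16; ++i) m.b[i] = (uint8_t)i;
    return m;
}

/* m_0000: assumed encoding, every control byte yields 0x00. */
static Quad mask_0000(void) {
    Quad m;
    memset(m.b, 0x80, sizeof m.b);
    return m;
}

static Quad splat(float x) {
    Quad q;
    for (int i = 0; i < 4; ++i) memcpy(&q.b[4 * i], &x, sizeof x);
    return q;
}

static float lane(Quad q, int i) {
    float x;
    assert(i >= 0 && i < 4);
    memcpy(&x, &q.b[4 * i], sizeof x);
    return x;
}

/* prev = 0 for light 0, otherwise the color already stored for this vertex. */
static float select_prev(float stored, int light) {
    Quad vclr = splat(stored);
    Quad mask = (light == 0) ? mask_0000() : mask_ABCD();
    return lane(shufb(vclr, vclr, mask), 0);
}
```

Because only the mask register changes between light 0 and the remaining lights, the loop body itself stays identical for every light.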
  • 45. ;;; Start of loop (page 1)
    {e2:3} fcgt c2, zeros, at         {o6:0} lqx vposx, pxptr, iofs           ; 0
    {e6:3} fm tmpr, lclrr, irs_       {o6:0} lqx vposy, pyptr, iofs           ; 1
    {e2:3} selb m0, c2, at, zeros     {o6:0} lqx vposz, pzptr, iofs           ; 2
    {e6:3} fm tmpg, lclrg, irs_       {o6:0} lqx vnrmx, nxptr, iofs           ; 3
    {e6:3} fm ms, m0, m1              {o6:3} lqx vclrr, crptr, oofs           ; 4
    {e6:2} fma irs, t3, t2, t1        {o6:3} lqx vclrg, cgptr, oofs           ; 5
    {e6:0} fs Lx, lposx, vposx        {o4:1} frsqest t0, Lsq_                 ; 6
    {e6:0} fs Ly, lposy, vposy        {o6:3} lqx vclrb, cbptr, oofs           ; 7
    {e6:0} fs Lz, lposz, vposz        {o6:0} lqx vnrmy, nyptr, iofs           ; 8
    {e6:3} fm tmpb, lclrb, irs_       {o6:0} lqx vnrmz, nzptr, iofs           ; 9
    {e7:1} fi t1, Lsq_, t0            {o4:3} shufb prevr, vclrr, vclrr, mask  ; 10
    {e6:2} fm ir, irs, irs            {o4:3} shufb prevg, vclrg, vclrg, mask  ; 11
    {e6:0} fm Lsq, Lx, Lx             {o4:2} lmove irs_, irs                  ; 12
    {e6:2} fm dprs, dp__, irs         {o4:3} shufb prevb, vclrb, vclrb, mask  ; 13
    {e6:0} fm dp, vnrmx, Lx           {o4:1} lmove dp__, dp_                  ; 14
    {e6:3} fma clrr, ms, tmpr, prevr  {lnop}                                  ; 15
  • 46. ;;; Continued (page 2)
    {nop}                             {lnop}                                  ; 16
    {e6:3} fma clrg, ms, tmpg, prevg  {lnop}                                  ; 17
    {e6:1} fm t2, t1, Lsq_            {lnop}                                  ; 18
    {e6:0} fma Lsq, Ly, Ly, Lsq       {lnop}                                  ; 19
    {e6:2} fms at, att0s, ir, att1s   {lnop}                                  ; 20
    {e6:0} fma dp, vnrmy, Ly, dp      {lnop}                                  ; 21
    {e2:2} fcgt c1, zeros, dprs       {o6:3} stqx clrr, crptr, oofs           ; 22
    {e6:3} fma clrb, ms, tmpb, prevb  {lnop}                                  ; 23
    {e6:1} fm t3, t1, onehalf         {o6:3} stqx clrg, cgptr, oofs           ; 24
    {e6:0} fma Lsq_, Lz, Lz, Lsq      {lnop}                                  ; 25
    {e2:2} fcgt c0, at, ones          {lnop}                                  ; 26
    {e6:0} fma dp_, vnrmz, Lz, dp     {lnop}                                  ; 27
    {e2:2} selb at, c0, at, ones      {o6:3} stqx clrb, cbptr, oofs           ; 28
    {e2:2} selb m1, c1, dprs, zeros   {lnop}                                  ; 29
    {e6:1} fnms t2, t2, t1, one       {o1:-} brnz ..., looptop                ; 30
  • 47. Optimizing Example - Lighting Calculation
    We're finally close to done. The last thing we need is to make sure our input and output offsets are correct throughout the execution. As seen, we count downwards by 0x90, which equals sizeof(vertex4), per loop. So, we initialize iofs = (numVerts/4 - 1) * 0x90, and that should take care of the input loads. For the output offsets we need a delay unit, initialized such that the first 3 iterations overwrite the last vertex4 position. We'll use the "delay machine" shuffle m_bcdA for this.
        {o4} shufb oofs, iofs, oofs, m_bcdA   ; oofs.x <- oofs.y, oofs.y <- oofs.z, oofs.z <- oofs.w, oofs.w <- iofs.x
  • 48. Optimizing Example - Lighting Calculation
    Example (numVerts/4 = 4), at cycle 0, N = 0x90:
        Loop 0: iofs =  3N, oofs = { 3N,  3N,  3N,  3N }
        Loop 1: iofs =  2N, oofs = { 3N,  3N,  3N,  2N }
        Loop 2: iofs =  1N, oofs = { 3N,  3N,  2N,  1N }
        Loop 3: iofs =  0N, oofs = { 3N,  2N,  1N,  0N }
        Loop 4: iofs = -1N, oofs = { 2N,  1N,  0N, -1N }
        Loop 5: iofs = -2N, oofs = { 1N,  0N, -1N, -2N }
        Loop 6: iofs = -3N, oofs = { 0N, -1N, -2N, -3N }
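The table can be reproduced with a few lines of C. This is only a sketch of the offset bookkeeping, with a hypothetical Offsets struct and hypothetical function names. It models the ai and shufb pair as a single step per loop and says nothing about instruction timing within the loop.

```c
#include <assert.h>

enum { VERTEX4_SIZE = 0x90 };   /* N in the table: sizeof(vertex4) */

/* Hypothetical model of the two offset registers. */
typedef struct {
    int iofs;      /* input offset                                   */
    int oofs[4];   /* output offset delay line, x is the one in use  */
} Offsets;

static Offsets offsets_init(int groups) {
    Offsets s;
    assert(groups > 0);
    s.iofs = (groups - 1) * VERTEX4_SIZE;
    for (int i = 0; i < 4; ++i) s.oofs[i] = s.iofs;
    return s;
}

/* One loop: ai iofs, iofs, -0x90 followed by shufb oofs, iofs, oofs, m_bcdA. */
static Offsets offsets_step(Offsets s) {
    s.iofs -= VERTEX4_SIZE;
    s.oofs[0] = s.oofs[1];
    s.oofs[1] = s.oofs[2];
    s.oofs[2] = s.oofs[3];
    s.oofs[3] = s.iofs;
    return s;
}

static Offsets offsets_at_loop(int groups, int loop) {
    Offsets s = offsets_init(groups);
    for (int i = 0; i < loop; ++i) s = offsets_step(s);
    return s;
}

/* Scalar accessors, convenient for checking single table entries. */
static int iofs_at(int groups, int loop) {
    return offsets_at_loop(groups, loop).iofs;
}

static int oofs_at(int groups, int loop, int word) {
    Offsets s = offsets_at_loop(groups, loop);
    assert(word >= 0 && word < 4);
    return s.oofs[word];
}
```

With groups = 4, oofs_at(4, 3, 0) is 3N and oofs_at(4, 6, 0) is 0, matching the first and last rows in which a finished result is available.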
  • 49. ;;; Continued (page 2)
    {e2:-} ai iofs, iofs, -0x90       {lnop}                                  ; 16
    {e6:3} fma clrg, ms, tmpg, prevg  {lnop}                                  ; 17
    {e6:1} fm t2, t1, Lsq_            {o4:3} shufb oofs, iofs, oofs, m_bcdA   ; 18
    {e6:0} fma Lsq, Ly, Ly, Lsq       {lnop}                                  ; 19
    {e6:2} fms at, att0s, ir, att1s   {lnop}                                  ; 20
    {e6:0} fma dp, vnrmy, Ly, dp      {lnop}                                  ; 21
    {e2:2} fcgt c1, zeros, dprs       {o6:3} stqx clrr, crptr, oofs           ; 22
    {e6:3} fma clrb, ms, tmpb, prevb  {lnop}                                  ; 23
    {e6:1} fm t3, t1, onehalf         {o6:3} stqx clrg, cgptr, oofs           ; 24
    {e6:0} fma Lsq_, Lz, Lz, Lsq      {lnop}                                  ; 25
    {e2:2} fcgt c0, at, ones          {lnop}                                  ; 26
    {e6:0} fma dp_, vnrmz, Lz, dp     {lnop}                                  ; 27
    {e2:2} selb at, c0, at, ones      {o6:3} stqx clrb, cbptr, oofs           ; 28
    {e2:2} selb m1, c1, dprs, zeros   {lnop}                                  ; 29
    {e6:1} fnms t2, t2, t1, one       {o1:3} brnz oofs, looptop               ; 30
  • 50. Optimizing Example - Lighting Calculation
    And now we are done! Well, not entirely: we still have to code the prologue (the entry code) and the epilogue (the exit code), test it and make sure it works. But it will! Looking at the code, there are no stalls. After four iterations, taking 4 × 31 = 124 cycles, the first four results are produced, and every following interval of 31 cycles produces another four results. That means that our loop runs at 7.75 cycles/vertex/light!
    Was this hard to do?
        Fairly mechanical.
        We have tools that can schedule automatically! Requires simple loops.
        Usually a factor of 2-4 faster than what unrolled GCC can achieve.
        Our "frontend" tool automatically colors registers. And it has some other nice features as well.
    So: Just do it!
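The throughput figures follow from the loop length and the pipeline depth. A small sketch in C of that arithmetic, assuming four pipeline stages (the stage tags 0 to 3 in the listings) and ignoring prologue and epilogue cost; the constant and function names are hypothetical:

```c
#include <assert.h>

enum {
    LOOP_CYCLES     = 31,  /* cycles 0..30 of the scheduled loop    */
    PIPELINE_STAGES = 4,   /* iteration tags 0..3 in the listings   */
    VERTS_PER_LOOP  = 4    /* one vertex4 group per loop iteration  */
};

/* Loop iterations needed to push `groups` vertex4 groups through the pipeline. */
static int loop_iterations(int groups) {
    assert(groups > 0);
    return groups + PIPELINE_STAGES - 1;
}

/* Total cycles spent in the loop, without prologue and epilogue. */
static int total_loop_cycles(int groups) {
    return loop_iterations(groups) * LOOP_CYCLES;
}

/* Steady-state cost once the pipeline is full. */
static double steady_state_cycles_per_vertex(void) {
    return (double)LOOP_CYCLES / VERTS_PER_LOOP;
}
```

For a single group this gives 4 iterations and 124 cycles, the latency quoted on the slide. For numVerts/4 = 4 it gives 7 iterations, which corresponds to loops 0 through 6 in the offset example. The fill cost is amortized as the vertex count grows, so the per-vertex cost approaches 31 / 4 = 7.75 cycles.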
  • 51. Optimizing Example - Lighting Calculation
    Improvements on algorithm: The strange storage format can be changed to allow for a regular array-of-structures layout. All we need to do is to add transposing of the input data.
        9 loads turn into 12 loads.
        Transposing 4 vector regs into 3 takes: 7o, 1e+6o, 2e+5o or 4e+4o.
        Transposing 3 vector regs into 4 takes: 6o, 1e+5o, 2e+4o.
        3 stores turn into 4 stores.
    Overall addition: odd = 3 + 3 × 4 + 6 + 1 = 22, even = 3 × 4 = 12. Since we had 10 odd for free: 22 − 10 = 12 extra cycles per loop iteration, or 3.0 extra cycles/vertex/light.
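The overhead estimate can be written out step by step. The sketch below is one reading of the slide's sum, not something the slides state explicitly: 3 extra loads, three input transposes at 4 odd plus 4 even instructions each, one output transpose at 6 odd instructions, and 1 extra store. The figure of 10 free odd slots is taken from the slide as given, and all names are hypothetical.

```c
#include <assert.h>

/* Extra odd-pipeline instructions: 3 + 3 * 4 + 6 + 1. */
static int extra_odd_instructions(void) {
    int extra_loads      = 12 - 9;   /* 9 loads become 12              */
    int input_transposes = 3 * 4;    /* three transposes, 4 odd each   */
    int output_transpose = 6;        /* 3 regs into 4, all-odd variant */
    int extra_stores     = 4 - 3;    /* 3 stores become 4              */
    return extra_loads + input_transposes + output_transpose + extra_stores;
}

/* Extra cycles per loop iteration once the free odd slots are used up. */
static int extra_cycles_per_loop(int free_odd_slots) {
    assert(free_odd_slots >= 0);
    int extra = extra_odd_instructions() - free_odd_slots;
    return extra > 0 ? extra : 0;
}

/* Cost per vertex for a loop of loop_cycles cycles handling verts_per_loop vertices. */
static double cycles_per_vertex(int loop_cycles, int verts_per_loop) {
    assert(verts_per_loop > 0);
    return (double)loop_cycles / verts_per_loop;
}
```

With 10 free odd slots this gives 12 extra cycles on top of the 31-cycle loop, so cycles_per_vertex(31 + 12, 4) evaluates to 10.75, compared with cycles_per_vertex(31, 4) = 7.75 for the transposed storage format.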
  • 52. Optimizing Example - Lighting Calculation
    Conclusion: The original algorithm can be coded, without changing memory layouts, down to 10.75 cycles/vertex/light, or 7.75 cycles/vertex/light with our proposed change in memory layout. If you remember nothing else, then remember that: vectorization can give huge improvements without going to assembly, and that software loop-scheduling can give huge improvements on top of that. And finally, a question: Why on earth can't the compiler do a better job?