Quick Search algorithm and strstr

Cybozu Labs
2012/3/31 MITSUNARI Shigeo (@herumi)
x86/x64 optimization seminar 3 (#x86opti)
Agenda
- Quick Search
- vs. strstr of gcc on Core2Duo
- vs. strstr of gcc on Xeon
- fast strstr using strchr of gcc
- vs. my implementation on Xeon
  - restriction
- vs. strstr of VC2011 beta on Core i7
- feature of pcmpestri
- range version of strstr



2012/3/31 #x86opti                       2 /20
Quick Search algorithm (1/2)
- Simplified and improved Boyer-Moore algorithm
  - initialized table for "this is" (len = 7)

      char   't'   'h'   'i'   's'   ' '   other
      skip   +7    +6    +2    +1    +3    +8

  - the full 256-entry table (row = high nibble, column = low nibble of the
    character code) holds 8 everywhere except:

      code   0x20 ' '   0x68 'h'   0x69 'i'   0x73 's'   0x74 't'
      skip   3          6          2          1          7

- How to initialize the table for a given string [str, str + len)

    int tbl_[256];
    void init(const char *str, int len) {
        // default shift is len + 1 ("other" = +8 for len = 7)
        std::fill(tbl_, tbl_ + 256, len + 1);
        for (int i = 0; i < len; i++) {
            // later occurrences overwrite earlier ones; index as unsigned char
            tbl_[static_cast<unsigned char>(str[i])] = len - i;
        }
    }
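
The table values for "this is" can be reproduced with a small standalone sketch. The helper name makeQsTable is hypothetical (the slides use a member function init() writing to tbl_), and std::array is used here only to make the example self-contained.

```cpp
#include <array>
#include <string>

// Build the Quick Search shift table for `key`.
// makeQsTable is a hypothetical name used only for this illustration.
std::array<int, 256> makeQsTable(const std::string& key) {
    std::array<int, 256> tbl;
    const int len = static_cast<int>(key.size());
    tbl.fill(len + 1);  // characters that do not occur in the key
    for (int i = 0; i < len; i++) {
        // The rightmost occurrence wins because later i overwrite earlier ones.
        tbl[static_cast<unsigned char>(key[i])] = len - i;
    }
    return tbl;
}
```

For the key "this is", 'i' occurs at positions 2 and 5; the later write (7 - 5 = 2) is the one that remains, which is why the table shows +2 rather than +5.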
Quick Search algorithm (2/2)
- Searching phase
  - simple and fast
  - see http://www-igm.univ-mlv.fr/~lecroq/string/node19.html

    const char *find(const char *begin, const char *end) {
        while (begin <= end - len_) {
            if (memcmp(str_, begin, len_) == 0) return begin;
            // shift by the character just past the window; at the last
            // position this reads *end, so *end must be readable
            begin += tbl_[static_cast<unsigned char>(begin[len_])];
        }
        return end;
    }
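
A minimal self-contained sketch of the searching phase follows, for a range [begin, end) where *end may not be readable. The class name QuickSearch and the extra bounds check are assumptions for illustration; they are not taken from the slides or from str_util.hpp.

```cpp
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Quick Search over [begin, end) that never reads *end.
// QuickSearch is a hypothetical wrapper name, not the class from the slides.
class QuickSearch {
public:
    explicit QuickSearch(const std::string& key)
        : key_(key), tbl_(256, static_cast<int>(key.size()) + 1) {
        const int len = static_cast<int>(key_.size());
        for (int i = 0; i < len; i++) {
            tbl_[static_cast<unsigned char>(key_[i])] = len - i;
        }
    }

    // Returns a pointer to the first match, or end if there is none.
    const char* find(const char* begin, const char* end) const {
        const std::size_t len = key_.size();
        if (len == 0) return begin;
        while (static_cast<std::size_t>(end - begin) >= len) {
            if (std::memcmp(key_.data(), begin, len) == 0) return begin;
            // This was the last window that fits: stop instead of
            // reading begin[len], which would be *end.
            if (static_cast<std::size_t>(end - begin) == len) break;
            begin += tbl_[static_cast<unsigned char>(begin[len])];
        }
        return end;
    }

private:
    std::string key_;
    std::vector<int> tbl_;
};
```

The largest shift is len + 1, and it is only applied while at least len + 1 bytes remain, so begin never moves past end.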




Benchmark
- 2.13GHz Core 2 Duo + gcc 4.2.1 + Mac OS X 10.6.8
  - 33MB UTF-8 text

[Figure: cycles/byte to find each substring (lower is faster); series: strstr, org Qs]

  - Qs (Quick Search) is faster for long substrings
  - Remark: the text is assumed not to contain '\0', so that strstr can be used
A little modification of Qs
- avoid memcmp

  before (org Qs):

    const char *find(const char *begin, const char *end) {
        while (begin <= end - len_) {
            if (memcmp(str_, begin, len_) == 0) return begin;
            begin += tbl_[static_cast<unsigned char>(begin[len_])];
        }
        return end;
    }

  after (Qs'):

    const char *find(const char *begin, const char *end) {
        while (begin <= end - len_) {
            // compare byte by byte instead of calling memcmp
            for (size_t i = 0; i < len_; i++) {
                if (str_[i] != begin[i]) goto NEXT;
            }
            return begin;
        NEXT:
            begin += tbl_[static_cast<unsigned char>(begin[len_])];
        }
        return end;
    }
Benchmark again
- 2.13GHz Core 2 Duo + gcc 4.2.1 + Mac OS X 10.6.8
  - 33MB UTF-8 text

[Figure: cycles/byte to find each substring (lower is faster); series: strstr, org Qs, Qs']

  - modified Qs (Qs') is even faster
  - Should we use modified Qs'?
strstr on gcc 4.6 with SSE4.2
- Xeon X5650 2.67GHz on Linux
- strstr with SSE4.2 is faster than Qs' for substrings shorter than 9 bytes

[Figure: cycles/byte to find each substring (lower is faster); series: strstr, Qs']

- Is strstr of gcc the fastest implementation?
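strstr implementation by strchr
- First find a candidate location with strchr, then verify the match
  - strchr of gcc with SSE4.2 is fast

    const char *mystrstr_C(const char *str, const char *key) {
        size_t len = strlen(key);
        if (len == 0) return str; // empty key matches at the start, like strstr
        while (*str) {
            const char *p = strchr(str, key[0]); // candidate: first byte matches
            if (p == 0) return 0;
            // verify the rest; near the end of the text memcmp may read
            // up to len - 1 bytes past the terminating '\0'
            if (memcmp(p + 1, key + 1, len - 1) == 0) return p;
            str = p + 1;
        }
        return 0;
    }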



strstr vs. mystrstr_C
- Xeon X5650 2.67GHz + gcc 4.6.1
- mystrstr_C is 1.5 ~ 3 times faster than strstr
  - except for "ko-re-wa" (in UTF-8)
  - probably a penalty for many false candidates

[Figure: cycles/byte to find each substring (lower is faster); series: strstr, Qs', my_strstr_C]

real speed of SSE4.2 (pcmpistri)
- my_strstr is always faster than Qs'
  - 2 ~ 4 times faster than strstr of gcc

[Figure: cycles/byte to find each substring (lower is faster); series: strstr, Qs', my_strstr_C, my_strstr]

Implementation of my_strstr (1/2)
- https://github.com/herumi/opti/blob/master/str_util.hpp
  - written in Xbyak (for my convenience)
- Main loop

    // a : rax (or eax), c : rcx (or ecx)
    //   input  a   : ptr to text
    //          key : ptr to key
    //   use save_a, save_key, c

        movdqu(xm0, ptr [key]); // xm0 = *key
    L(".lp");
        pcmpistri(xmm0, ptr [a], 12);
            // 12(1100b) = [equal ordered:unsigned:byte]
        jbe(".headCmp"); // CF = 1 (candidate found) or ZF = 1 (end of text)
        add(a, 16);
        jmp(".lp");
    L(".headCmp");
        jnc(".notFound"); // CF = 0: end of text without a candidate
Implementation of my_strstr (2/2)
- Compare the tail in ".headCmp"

        ...
        add(a, c); // get position
        mov(save_a, a); // save a
        mov(save_key, key); // save key
    L(".tailCmp");
        movdqu(xm1, ptr [save_key]);
        pcmpistri(xmm1, ptr [save_a], 12);
        jno(".next");  // OF = 0: no match at offset 0
        js(".found");  // SF = 1: end of key reached, whole key matched
        // rare case: key is longer than 16 bytes, compare the next block
        add(save_a, 16);
        add(save_key, 16);
        jmp(".tailCmp");
    L(".next");
        add(a, 1);
        jmp(".lp");
Pros and Cons of my_strstr
- Pros
  - very fast
  - Is this implementation combined with Qs the fastest?
    - No, the overhead is larger in most cases (variable address offset)
- Cons
  - accesses up to 16 bytes beyond the end of the text
    - almost no problem except at a page boundary
    - allocate memory with a margin

[Figure: a 16-byte pcmpistri load starting near the end of a 4KiB readable page (offsets FF7..FFF) runs past the end of the text into the following non-readable page (offsets 000..003) and causes an access violation]
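
The page-boundary restriction can be expressed as a small check. This is a sketch under the assumption of 4KiB pages, as in the figure; both helper names are hypothetical and do not come from str_util.hpp.

```cpp
#include <cstddef>
#include <cstdint>

// True if a 16-byte load starting at p stays inside one 4KiB page.
// Assumes 4KiB pages; load16StaysInPage is a hypothetical name.
inline bool load16StaysInPage(const void* p) {
    const std::uintptr_t offset = reinterpret_cast<std::uintptr_t>(p) & 0xFFF;
    return offset <= 0x1000 - 16;
}

// Buffer size that leaves the 16-byte margin mentioned on the slide
// after textSize bytes of text (hypothetical helper).
inline std::size_t sizeWithMargin(std::size_t textSize) {
    return textSize + 16;
}
```

A load that starts at page offset FF0 or lower cannot touch the next page; one that starts at FF7, as in the figure, can.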
strstr of Visual Studio 11
- almost the same speed as my_strstr
- of course, safe to use
  - i7-2620 3.4GHz + Windows 7 + VS 11 beta

[Figure: cycles/byte to find each substring (lower is faster); series: strstr, Qs', my_strstr]



All benchmarks on i7-2600
- find "ko-re-wa" in 33MiB text
  - the results strongly depend on the text and the key

[Figure: bar chart, horizontal axis "rate for the timing of Qs(gcc)" from 0 to 14; series: strstr(before SSE4.2), Qs(gcc), Qs'(gcc), strstr(gcc;SSE4.2), strstr(VC;SSE4.2), my_strstr(SSE4.2)]

range version of strstr
- strstr is not available for strings that contain '\0'
- use std::string::find()
  - but it is not optimized for SSE4.2
    - naive but fast implementation in C

    const char *findStr_C(const char *begin, const char *end,
      const char *key, size_t keySize) {
      if (keySize == 0) return begin; // empty key matches at the start
      while (begin + keySize <= end) {
        // only scan start positions where the whole key still fits
        const char *p = (const char*)memchr(begin, key[0],
          end - begin - keySize + 1);
        if (p == 0) break;
        if (memcmp(p + 1, key + 1, keySize - 1) == 0) return p;
        begin = p + 1;
      }
      return end; }
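
The difference between strstr and a range-based search on text that contains '\0' can be seen in a short sketch. The two wrapper names are hypothetical; only std::strstr and std::string::find are standard calls.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Offset of key in text via strstr, or npos. strstr stops at the first '\0'.
// findWithStrstr is a hypothetical wrapper for illustration.
std::size_t findWithStrstr(const std::string& text, const char* key) {
    const char* base = text.c_str();
    const char* p = std::strstr(base, key);
    return p ? static_cast<std::size_t>(p - base) : std::string::npos;
}

// Offset via std::string::find, which searches the whole range [0, size()).
// findWithString is a hypothetical wrapper for illustration.
std::size_t findWithString(const std::string& text, const std::string& key) {
    return text.find(key);
}
```

For a text built as std::string("abc\0def", 7), the strstr-based search never sees "def", while std::string::find reports it at offset 4.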

- str_util.hpp provides findStr with SSE4.2
  - 4 ~ 5 times faster than findStr_C on i7-2600 + VC11
feature of pcmpestri
- very complex mnemonics

    pcmpestri xmm0, ptr [p], 12

    input:
      xmm0 : head of key
      rax  : keySize
      p    : pointer to text
      rdx  : text size
    output:
      rcx : pos of key if found
      CF  : if found
      ZF  : end of text
      SF  : end of key
      OF  : all match

    L(".lp");
        pcmpestri(xmm0, ptr [p], 12);
        lea(p, ptr [p + 16]); // lea does not change carry
        lea(d, ptr [d - 16]); // lea does not change carry
        ja(".lp");
        jnc(".notFound");
        // compare leading str...
Difference between Xeon and i7
- main loop of my_strstr

    L(".lp");
        pcmpistri(xmm0, ptr [a], 12);
        if (isSandyBridge) {
            lea(a, ptr [a + 16]);
            ja(".lp");            // a little faster on i7
        } else {
            jbe(".headCmp");
            add(a, 16);           // 1.1 times faster on Xeon
            jmp(".lp");
    L(".headCmp");
        }
        jnc(".notFound");
        // get position
        if (isSandyBridge) {
            lea(a, ptr [a + c - 16]);
        } else {
            add(a, c);
        }
other features of str_util.hpp
- strchr_any(text, key) [or findChar_any]
  - returns a pointer to the first occurrence of any character of key in the text

    // search character position of '?', '#', '$', '!', '/', ':'
    strchr_any(text, "?#$!/:");

  - same speed as strchr by using SSE4.2
  - max length of key is 16
- strchr_range(txt, key) [or findChar_range]
  - returns a pointer to the first occurrence of a character in range [key[0], key[1]], [key[2], key[3]], ...
  - also same speed as strchr and max len(key) = 16

    // search character position of [0-9], [a-f], [A-F]
    strchr_range(text, "09afAF");
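
What these two functions compute can be sketched with plain scalar reference versions. The names strchrAnyRef and strchrRangeRef are hypothetical, and returning nullptr when nothing matches is an assumption that mirrors strchr; the SSE4.2 versions in str_util.hpp are not reproduced here.

```cpp
#include <cstddef>
#include <cstring>

// Scalar reference: first character of text that occurs in key.
// Equivalent in effect to the standard strpbrk; hypothetical name.
const char* strchrAnyRef(const char* text, const char* key) {
    for (; *text; ++text) {
        if (std::strchr(key, *text)) return text;
    }
    return nullptr;  // assumption: no match is reported like strchr does
}

// Scalar reference: first character of text inside one of the ranges
// [key[0], key[1]], [key[2], key[3]], ... Hypothetical name.
const char* strchrRangeRef(const char* text, const char* key) {
    const std::size_t pairs = std::strlen(key) / 2;  // an odd trailing byte is ignored
    for (; *text; ++text) {
        const unsigned char c = static_cast<unsigned char>(*text);
        for (std::size_t i = 0; i < pairs; i++) {
            const unsigned char lo = static_cast<unsigned char>(key[2 * i]);
            const unsigned char hi = static_cast<unsigned char>(key[2 * i + 1]);
            if (lo <= c && c <= hi) return text;
        }
    }
    return nullptr;  // assumption: no match is reported like strchr does
}
```

With the key "09afAF", strchrRangeRef finds the first hexadecimal digit in the text; with "?#$!/:", strchrAnyRef finds the first URL delimiter.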
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 

Quick Search algorithm and strstr

  • 1. Quick Search algorithm and strstr Cybozu Labs 2012/3/31 MITSUNARI Shigeo(@herumi) x86/x64 optimization seminar 3(#x86opti)
  • 2. Agenda
      - Quick Search
      - vs. strstr of gcc on Core2Duo
      - vs. strstr of gcc on Xeon
      - fast strstr using strchr of gcc
      - vs. my implementation on Xeon
          - restriction
      - vs. strstr of VC2011 beta on Core i7
      - feature of pcmpestri
      - range version of strstr
  • 3. Quick Search algorithm (1/2)
      - Simplified and improved Boyer-Moore algorithm
      - Initialized table for "this is":
            char   't'  'h'  'i'  's'  ' '  other
            skip   +7   +6   +2   +1   +3   +8
      - Full 256-entry table (row = high nibble, column = low nibble of the character code):
               0 1 2 3 4 5 6 7 8 9 A B C D E F
            0  8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
            1  8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
            2  3 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
            3  8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
            4  8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
            5  8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
            6  8 8 8 8 8 8 8 8 6 2 8 8 8 8 8 8
            7  8 8 8 1 7 8 8 8 8 8 8 8 8 8 8 8
            8  8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
            9  8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
            A  8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
            B  8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
            C  8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
            D  8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
            E  8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
            F  8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
      - How to initialize the table for a given string [str, str + len):
            int tbl_[256];
            void init(const char *str, int len) {
                // default skip is len + 1, matching "other +8" for the 7-byte key above
                std::fill(tbl_, tbl_ + 256, len + 1);
                for (int i = 0; i < len; i++) {
                    // index as unsigned char so bytes >= 0x80 do not give a negative index
                    tbl_[static_cast<unsigned char>(str[i])] = len - i;
                }
            }
  • 4. Quick Search algorithm (2/2)
      - Searching phase
          - simple and fast
          - see http://www-igm.univ-mlv.fr/~lecroq/string/node19.html
            const char *find(const char *begin, const char *end) {
                while (begin <= end - len_) {
                    if (memcmp(str_, begin, len_) == 0) return begin;
                    // skip according to the byte just after the current window
                    begin += tbl_[static_cast<unsigned char>(begin[len_])];
                }
                return end;
            }
  • 5. Benchmark
      - 2.13GHz Core 2 Duo + gcc 4.2.1 + Mac 10.6.8
      - 33MB UTF-8 text
      - [Figure: cycles/byte to find each substring (lower is faster); series: strstr, org Qs]
      - Qs (Quick Search) is faster for long substrings
      - Remark: for strstr, the text is assumed not to contain '\0'
  • 6. A little modification of Qs
      - avoid memcmp
      - org Qs:
            const char *find(const char *begin, const char *end) {
                while (begin <= end - len_) {
                    if (memcmp(str_, begin, len_) == 0) return begin;
                    begin += tbl_[static_cast<unsigned char>(begin[len_])];
                }
                return end;
            }
      - modified Qs (Qs'):
            const char *find(const char *begin, const char *end) {
                while (begin <= end - len_) {
                    for (size_t i = 0; i < len_; i++) {
                        if (str_[i] != begin[i]) goto NEXT;
                    }
                    return begin;
                NEXT:
                    begin += tbl_[static_cast<unsigned char>(begin[len_])];
                }
                return end;
            }
  • 7. Benchmark again
      - 2.13GHz Core 2 Duo + gcc 4.2.1 + Mac 10.6.8
      - 33MB UTF-8 text
      - [Figure: cycles/byte to find each substring (lower is faster); series: strstr, org Qs, Qs']
      - modified Qs (Qs') is even faster
      - Should we use the modified Qs'?
  • 8. strstr on gcc 4.6 with SSE4.2
      - Xeon X5650 2.67GHz on Linux
      - strstr with SSE4.2 is faster than Qs' for substrings shorter than 9 bytes
      - [Figure: cycles/byte to find each substring (lower is faster); series: strstr, Qs']
      - Is strstr of gcc the fastest implementation?
  • 9. strstr implementation by strchr
      - First find a candidate location with strchr, then verify the match
          - strchr of gcc with SSE4.2 is fast
            // assumes key is non-empty
            const char *mystrstr_C(const char *str, const char *key) {
                size_t len = strlen(key);
                while (*str) {
                    // candidate: next occurrence of the first key byte
                    const char *p = strchr(str, key[0]);
                    if (p == 0) return 0;
                    // verify the rest of the key at the candidate
                    if (memcmp(p + 1, key + 1, len - 1) == 0) return p;
                    str = p + 1;
                }
                return 0;
            }
  • 10. strstr vs. mystrstr_C
      - Xeon X5650 2.6GHz + gcc 4.6.1
      - mystrstr_C is 1.5 ~ 3 times faster than strstr
          - except for "ko-re-wa" (in UTF-8)
          - maybe a penalty for many bad candidates
      - [Figure: cycles/byte to find each substring (lower is faster); series: strstr, Qs', mystrstr_C]
  • 11. real speed of SSE4.2 (pcmpistri)
      - my_strstr is always faster than Qs'
      - 2 ~ 4 times faster than strstr of gcc
      - [Figure: cycles/byte to find each substring (lower is faster); series: strstr, Qs', mystrstr_C, my_strstr]
  • 12. Implementation of my_strstr (1/2)
      - https://github.com/herumi/opti/blob/master/str_util.hpp
      - written in Xbyak (for my convenience)
      - Main loop
            // a : rax(or eax), c : rcx(or ecx)
            // input a : ptr to text
            //       key : ptr to key
            // use save_a, save_key, c
            movdqu(xm0, ptr [key]); // xm0 = *key
        L(".lp");
            pcmpistri(xmm0, ptr [a], 12);
            // 12(1100b) = [equal ordered:unsigned:byte]
            jbe(".headCmp");
            add(a, 16);
            jmp(".lp");
        L(".headCmp");
            jnc(".notFound");
  • 13. Implementation of my_strstr (2/2)
      - Compare tail in "headCmp"
            ...
            add(a, c);            // get position
            mov(save_a, a);       // save a
            mov(save_key, key);   // save key
        L(".tailCmp");
            movdqu(xm1, ptr [save_key]);
            pcmpistri(xmm1, ptr [save_a], 12);
            jno(".next");
            js(".found");
            // rare case
            add(save_a, 16);
            add(save_key, 16);
            jmp(".tailCmp");
        L(".next");
            add(a, 1);
            jmp(".lp");
  • 14. Pros and Cons of my_strstr
      - Pros
          - very fast
          - Is this implementation combined with Qs the fastest? No, the overhead is mostly larger (variable address offset)
      - Cons
          - may access up to 16 bytes beyond the end of the text
          - almost no problem except at a page boundary
          - allocate memory with a margin
      - [Figure: the text ends near the end of a readable 4KiB page (offsets FF7 to FFF); a pcmpistri access that starts there runs into the following unreadable page (offsets 000 to 003) and causes an access violation]
  • 15. strstr of Visual Studio 11
      - almost the same speed as my_strstr
      - of course, safe to use
      - i7-2620 3.4GHz + Windows 7 + VS 11beta
      - [Figure: cycles/byte to find each substring (lower is faster); series: strstr, Qs', my_strstr]
  • 16. All benchmarks on i7-2600
      - find "ko-re-wa" in 33MiB text
      - the results depend strongly on the text and the key
      - [Figure: bar chart, rate relative to the timing of Qs(gcc), axis 0 to 14; bars: strstr(before SSE4.2), Qs(gcc), Qs'(gcc), strstr(gcc;SSE4.2), strstr(VC;SSE4.2), my_strstr(SSE4.2)]
  • 17. range version of strstr
      - strstr is not available for strings that include '\0'
          - use std::string::find()
          - but it is not optimized for SSE4.2
      - naive but fast implementation in C
            const char *findStr_C(const char *begin, const char *end,
                                  const char *key, size_t keySize) {
                while (begin + keySize <= end) {
                    /* search only positions where the whole key still fits,
                       so the memcmp below never reads past end */
                    const char *p = memchr(begin, key[0], end - begin - keySize + 1);
                    if (p == 0) break;
                    if (memcmp(p + 1, key + 1, keySize - 1) == 0) return p;
                    begin = p + 1;
                }
                return end;
            }
      - str_util.hpp provides findStr with SSE4.2
          - 4 ~ 5 times faster than findStr_C on i7-2600 + VC11
  • 18. feature of pcmpestri
      - very complex mnemonic
            pcmpestri xmm0, ptr [p], 12
              inputs:
                xmm0 : head of key
                rax  : keySize
                p    : pointer to text
                rdx  : text size
              outputs:
                rcx  : pos of key if found
                CF   : if found
                ZF   : end of text
                SF   : end of key
                OF   : all match
        L(".lp");
            pcmpestri(xmm0, ptr [p], 12);
            lea(p, ptr [p + 16]);   // lea: do not change carry
            lea(d, ptr [d - 16]);   // lea: do not change carry
            ja(".lp");
            jnc(".notFound");
            // compare leading str...
  • 19. Difference between Xeon and i7
      - main loop of my_strstr
        L(".lp");
            pcmpistri(xmm0, ptr [a], 12);
            if (isSandyBridge) {
                lea(a, ptr [a + 16]);
                ja(".lp");            // a little faster on i7
            } else {
                jbe(".headCmp");
                add(a, 16);           // 1.1 times faster on Xeon
                jmp(".lp");
        L(".headCmp");
            }
            jnc(".notFound");
            // get position
            if (isSandyBridge) {
                lea(a, ptr [a + c - 16]);
            } else {
                add(a, c);
            }
  • 20. other features of str_util.hpp
      - strchr_any(text, key) [or findChar_any]
          - returns a pointer to the first occurrence of any character of key in the text
                // search character position of '?', '#', '$', '!', '/', ':'
                strchr_any(text, "?#$!/:");
          - same speed as strchr by using SSE4.2
          - max length of key is 16
      - strchr_range(txt, key) [or findChar_range]
          - returns a pointer to the first occurrence of a character in range [key[0], key[1]], [key[2], key[3]], ...
          - also same speed as strchr and max len(key) = 16
                // search character position of [0-9], [a-f], [A-F]
                strchr_range(text, "09afAF");
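
The Quick Search table setup and search loop from slides 3, 4 and 6 can be put together as one self-contained sketch. The class name `QuickSearch` and the `skip()` accessor are hypothetical names introduced here for illustration; they are not part of the slides or of str_util.hpp. The sketch assumes the default skip is `len + 1` (as in the "other +8" entry for the 7-byte key "this is") and, unlike the slide code, it stops before reading the byte after the window when the window already touches `end`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical wrapper (not from the slides): Quick Search over a byte range.
class QuickSearch {
    std::string key_;
    std::size_t tbl_[256];

public:
    explicit QuickSearch(const std::string& key) : key_(key) {
        assert(!key_.empty());  // assumption: the key is non-empty
        const std::size_t len = key_.size();
        // Bytes that never occur in the key allow the largest skip, len + 1.
        std::fill(tbl_, tbl_ + 256, len + 1);
        // Later occurrences overwrite earlier ones, so each entry ends up as
        // the distance from the last occurrence of the byte to the key's end.
        for (std::size_t i = 0; i < len; i++) {
            tbl_[static_cast<unsigned char>(key_[i])] = len - i;
        }
    }

    // Skip distance for byte c (exposed only to inspect the table).
    std::size_t skip(char c) const {
        return tbl_[static_cast<unsigned char>(c)];
    }

    // Returns a pointer to the first match in [begin, end), or end if none.
    const char* find(const char* begin, const char* end) const {
        const std::size_t len = key_.size();
        while (static_cast<std::size_t>(end - begin) >= len) {
            if (std::memcmp(key_.data(), begin, len) == 0) return begin;
            // The window already touches end: there is no next byte to look at.
            if (begin + len == end) break;
            // The skip is decided by the byte just after the current window.
            begin += tbl_[static_cast<unsigned char>(begin[len])];
        }
        return end;
    }
};
```

For example, constructing `QuickSearch qs("this is")` and calling `qs.find(text, text + size)` scans the range without requiring a terminating '\0', so it also works on text that contains '\0' bytes. Indexing the table through `unsigned char` matters for UTF-8 input, where bytes of 0x80 and above would otherwise become negative indices on platforms with signed `char`.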