2. Agenda
Quick Search
vs. strstr of gcc on Core2Duo
vs. strstr of gcc on Xeon
fast strstr using strchr of gcc
vs. my implementation on Xeon
restriction
vs. strstr of VC2011 beta on Core i7
feature of pcmpestri
range version of strstr
2012/3/31 #x86opti 2 /20
4. Quick Search algorithm(2/2)
Searching phase
simple and fast
see http://www-igm.univ-mlv.fr/~lecroq/string/node19.html
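const char *find(const char *begin, const char *end) {
  while (begin <= end - len_) {
    if (memcmp(str_, begin, len_) == 0) return begin;
    // index the shift table with an unsigned byte; assumes *end is readable
    begin += tbl_[static_cast<unsigned char>(begin[len_])];
  }
  return end;
}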
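The searching phase above relies on a shift table tbl_ built in a preprocessing phase that is not shown on this slide. The following is a minimal sketch of both phases, assuming the standard Quick Search bad-character rule (shift = len_ + 1 for bytes absent from the key, otherwise the distance from the last occurrence to the end of the key plus one). The class name QuickSearch is hypothetical; str_, len_ and tbl_ mirror the names used on the slide. Unlike the slide's version, this sketch stops before reading the byte at end.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical wrapper class; member names follow the slide.
class QuickSearch {
    std::string str_;   // the key
    size_t len_;        // key length
    size_t tbl_[256];   // shift for each possible byte value
public:
    explicit QuickSearch(const std::string& key) : str_(key), len_(key.size()) {
        assert(len_ > 0);
        // A byte that does not occur in the key lets the window jump past it.
        for (size_t i = 0; i < 256; i++) tbl_[i] = len_ + 1;
        // Otherwise align the last occurrence of that byte in the key with it.
        for (size_t i = 0; i < len_; i++) {
            tbl_[static_cast<unsigned char>(str_[i])] = len_ - i;
        }
    }

    // Returns the first match in [begin, end), or end if there is none.
    const char* find(const char* begin, const char* end) const {
        size_t remain = static_cast<size_t>(end - begin);
        while (remain >= len_) {
            if (std::memcmp(str_.data(), begin, len_) == 0) return begin;
            if (remain == len_) break;  // no byte after the window to inspect
            // shift <= len_ + 1 <= remain, so begin never passes end
            const size_t shift = tbl_[static_cast<unsigned char>(begin[len_])];
            begin += shift;
            remain -= shift;
        }
        return end;
    }
};
```

For example, QuickSearch("example").find(b, e) on the text "this is a simple example" inspects only four windows before it reaches the match.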
5. Benchmark
2.13GHz Core 2 Duo + gcc 4.2.1 + Mac 10.6.8
33MB UTF-8 text
[Chart: cycles/byte to find each substring, y-axis 0 to 10, lower is faster; series: strstr, org Qs]
Qs (Quick Search) is faster for long substrings
Remark: for strstr, the text is assumed not to contain '\0'
6. A little modification of Qs
avoid memcmp
// original (org Qs)
const char *find(const char *begin, const char *end) {
  while (begin <= end - len_) {
    if (memcmp(str_, begin, len_) == 0) return begin;
    begin += tbl_[static_cast<unsigned char>(begin[len_])];
  }
  return end;
}
// modified (Qs'): compare byte by byte instead of calling memcmp
const char *find(const char *begin, const char *end) {
  while (begin <= end - len_) {
    for (size_t i = 0; i < len_; i++) {
      if (str_[i] != begin[i]) goto NEXT;
    }
    return begin;
  NEXT:
    begin += tbl_[static_cast<unsigned char>(begin[len_])];
  }
  return end;
}
7. Benchmark again
2.13GHz Core 2 Duo + gcc 4.2.1 + Mac 10.6.8
33MB UTF-8 text
[Chart: cycles/byte to find each substring, y-axis 0 to 10, lower is faster; series: strstr, org Qs, Qs']
modified Qs (Qs') is faster
Should we use Qs'?
8. strstr on gcc 4.6 with SSE4.2
Xeon X5650 2.67GHz on Linux
strstr with SSE4.2 is faster than Qs'
for substrings shorter than 9 bytes
[Chart: cycles/byte to find each substring, y-axis 0 to 5, lower is faster; series: strstr, Qs']
Is gcc's strstr the fastest implementation?
9. strstr implementation by strchr
First find a candidate location with strchr,
then verify that the rest of the key matches
strchr of gcc with SSE4.2 is fast
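const char *mystrstr_C(const char *str, const char *key) {
  size_t len = strlen(key);
  if (len == 0) return str; // empty key: avoid len - 1 wrapping around
  while (*str) {
    const char *p = strchr(str, key[0]); // candidate: next occurrence of key[0]
    if (p == 0) return 0;
    // note: memcmp may read past the text's '\0' when fewer than len bytes remain
    if (memcmp(p + 1, key + 1, len - 1) == 0) return p;
    str = p + 1;
  }
  return 0;
}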
10. strstr vs. mystrstr_C
Xeon X5650 2.67GHz + gcc 4.6.1
mystrstr_C is 1.5 ~ 3 times faster than strstr
except for "ko-re-wa" (in UTF-8)
probably a penalty for the many bad candidates
[Chart: cycles/byte to find each substring, y-axis 0 to 10, lower is faster; series: strstr, Qs', my_strstr_C]
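The "many bad candidates" remark can be made concrete by counting how often the first byte of the key occurs in the text, since each occurrence costs one verification. The sketch below assumes the key "ko-re-wa" is written in hiragana; in UTF-8 every hiragana and katakana character starts with the byte 0xE3, so in Japanese text nearly every character becomes a candidate. The helper name countCandidates is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper: number of positions a strchr-based search has to
// verify, i.e. the number of occurrences of key[0] in text.
size_t countCandidates(const char* text, const char* key) {
    assert(text != 0 && key != 0);
    if (key[0] == '\0') return 0;  // strchr would match the terminator
    size_t n = 0;
    for (const char* p = std::strchr(text, key[0]); p != 0;
         p = std::strchr(p + 1, key[0])) {
        n++;
    }
    return n;
}
```

With an ASCII text such as "hello world" and the key "world" there is a single candidate, while a three-character hiragana text searched with a hiragana key yields one candidate per character.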
11. real speed of SSE4.2(pcmpistri)
my_strstr is always faster than Qs’
2 ~ 4 times faster than strstr of gcc
[Chart: cycles/byte to find each substring, y-axis 0 to 10, lower is faster; series: strstr, Qs', my_strstr_C, my_strstr]
12. Implementation of my_strstr(1/2)
https://github.com/herumi/opti/blob/master/str_util.hpp
written in Xbyak(for my convenience)
Main loop
// a : rax(or eax), c : rcx(or ecx)
// input a : ptr to text
// key : ptr to key
// use save_a, save_key, c
movdqu(xm0, ptr [key]); // xm0 = *key
L(".lp");
pcmpistri(xmm0, ptr [a], 12);
// 12(1100b) = [equal ordered:unsigned:byte]
jbe(".headCmp");
add(a, 16);
jmp(".lp");
L(".headCmp");
jnc(".notFound");
13. Implementation of my_strstr(2/2)
Compare the tail in "headCmp"
...
add(a, c); // get position
mov(save_a, a); // save a
mov(save_key, key); // save key
L(".tailCmp");
movdqu(xm1, ptr [save_key]);
pcmpistri(xmm1, ptr [save_a], 12);
jno(".next");
js(".found");
// rare case
add(save_a, 16);
add(save_key, 16);
jmp(".tailCmp");
L(".next");
add(a, 1);
jmp(".lp");
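To follow the branches in the two listings it helps to know what pcmpistri with immediate 12 (equal ordered, unsigned bytes) reports for one 16-byte block. The following is a scalar model of that behavior, not the code in str_util.hpp; the names PcmpResult and modelPcmpistri12 are hypothetical. Strings shorter than 16 bytes are treated as zero-terminated inside the block, as the instruction does.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

struct PcmpResult {
    int index;  // value left in ecx: first candidate offset, 16 if none
    bool cf;    // a candidate exists in this block
    bool zf;    // the text block contains '\0' (end of text)
    bool sf;    // the key block contains '\0' (end of key)
    bool of;    // there is a candidate at offset 0
};

// Scalar model of: pcmpistri xmm0(key), [text], 12
PcmpResult modelPcmpistri12(const std::string& key, const std::string& text) {
    unsigned char k[16] = {}, t[16] = {};
    for (size_t i = 0; i < 16 && i < key.size(); i++) k[i] = static_cast<unsigned char>(key[i]);
    for (size_t i = 0; i < 16 && i < text.size(); i++) t[i] = static_cast<unsigned char>(text[i]);
    int keyLen = 0, textLen = 0;
    while (keyLen < 16 && k[keyLen] != 0) keyLen++;
    while (textLen < 16 && t[textLen] != 0) textLen++;

    unsigned mask = 0;
    for (int j = 0; j < 16; j++) {
        bool match = true;
        // Key bytes that fall outside the 16-byte window are not compared,
        // so a match near the end of the block may be only partial.
        for (int i = 0; i < keyLen && j + i < 16; i++) {
            if (j + i >= textLen || k[i] != t[j + i]) { match = false; break; }
        }
        if (match) mask |= 1u << j;
    }

    PcmpResult r;
    r.index = 16;
    for (int j = 0; j < 16; j++) {
        if (mask & (1u << j)) { r.index = j; break; }
    }
    r.cf = mask != 0;
    r.zf = textLen < 16;
    r.sf = keyLen < 16;
    r.of = (mask & 1u) != 0;
    assert(r.cf == (r.index < 16));
    return r;
}
```

In this model the main loop's jbe leaves the loop when CF or ZF is set, and the following jnc reports "not found" when the text ended without a candidate. A candidate at offset 15 with a two-byte key shows why the tail comparison is needed: only the first key byte was actually checked.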
14. Pros and Cons of my_strstr
Pros
very fast
Is combining this implementation with Qs the fastest?
No, the overhead is usually larger (variable address offset)
Cons
may read up to 16 bytes beyond the end of the text
almost never a problem except at a page boundary
workaround: allocate memory with a margin
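[Figure: the text ends near the end of a readable 4KiB page (offsets FF7 ... FFF); a 16-byte pcmpistri read that starts there continues into the following page (000, 001, ...), which is not readable, and causes an access violation]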
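The page-boundary condition and the "allocate memory with margin" workaround can be sketched as follows. This assumes 4KiB pages, as on the slide; the helper names load16CrossesPage and copyWithMargin are hypothetical and are not part of str_util.hpp.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// True if a 16-byte load starting at p would touch the next 4KiB page.
// Offsets 0x000..0xFF0 within a page are safe; 0xFF1..0xFFF are not.
inline bool load16CrossesPage(const void* p) {
    return (reinterpret_cast<std::uintptr_t>(p) & 0xFFF) > 0xFF0;
}

// Copies text into a buffer that has 16 readable zero bytes after it,
// so a 16-byte load starting at any text byte stays inside the buffer.
std::vector<char> copyWithMargin(const char* text, size_t size) {
    assert(text != 0 || size == 0);
    std::vector<char> buf(size + 16, '\0');
    if (size > 0) std::memcpy(buf.data(), text, size);
    return buf;
}
```

Copying costs time and memory, so for large inputs it is usually preferable to allocate the original buffer with the extra 16 bytes in the first place.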
15. strstr of Visual Studio 11
almost the same speed as my_strstr
and of course safe to use
i7-2600 3.4GHz + Windows 7 + VS 11 beta
[Chart: cycles/byte to find each substring, y-axis 0 to 8, lower is faster; series: strstr, Qs', my_strstr]
16. All benchmarks on i7-2600
find "ko-re-wa" in 33MiB text
the results depend strongly on the text and the key
[Bar chart: rate relative to the timing of Qs(gcc), x-axis 0 to 14; bars: strstr(before SSE4.2), Qs(gcc), Qs'(gcc), strstr(gcc;SSE4.2), strstr(VC;SSE4.2), my_strstr(SSE4.2)]
17. range version of strstr
strstr cannot be used for strings that include '\0'
use std::string::find()
but it is not optimized for SSE4.2
a naive but fast implementation in C
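const char *findStr_C(const char *begin, const char *end,
                      const char *key, size_t keySize) {
  if (keySize == 0) return begin; // avoid keySize - 1 wrapping around
  while (begin + keySize <= end) {
    // search only where the whole key still fits, so memcmp stays inside [begin, end)
    const char *p = (const char *)memchr(begin, key[0], end - begin - keySize + 1);
    if (p == 0) break;
    if (memcmp(p + 1, key + 1, keySize - 1) == 0) return p;
    begin = p + 1;
  }
  return end;
}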
str_util.hpp provides findStr with SSE4.2
4 ~ 5 times faster than findStr_C on i7-2600 + VC11
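The difference between strstr and a range search shows up as soon as the text contains '\0'. The sketch below uses std::search as a portable reference with the same (begin, end, key, keySize) interface and the same "return end when not found" convention as findStr_C; the name findStrRef is hypothetical and makes no claim about how findStr in str_util.hpp is implemented.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Reference range search: first occurrence of [key, key + keySize) in
// [begin, end), or end if there is none. Bytes equal to '\0' are ordinary data.
const char* findStrRef(const char* begin, const char* end,
                       const char* key, size_t keySize) {
    assert(begin <= end);
    return std::search(begin, end, key, key + keySize);
}

// strstr stops at the first '\0', so it cannot see anything behind it.
inline bool strstrFinds(const std::string& text, const char* key) {
    return std::strstr(text.c_str(), key) != nullptr;
}
```

For the 8-byte text "ab\0cd\0ef", strstrFinds reports that "ef" is absent, while findStrRef locates it at offset 6 and also handles a key that itself contains '\0'.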
18. feature of pcmpestri
a very complex instruction
pcmpestri xmm0, ptr [p], 12
inputs
  xmm0 : head of key
  rax : keySize
  p : pointer to text
  rdx : text size
outputs
  rcx : pos of key if found
  CF : if found
  ZF : end of text
  SF : end of key
  OF : all match
L(".lp");
pcmpestri(xmm0, ptr [p], 12);
lea(p, ptr [p + 16]);
lea(d, ptr [d - 16]); // lea does not change the carry flag
ja(".lp");
jnc(".notFound");
// compare leading str...
19. Difference between Xeon and i7
main loop of my_strstr
L(".lp");
pcmpistri(xmm0, ptr [a], 12);
if (isSandyBridge) {
  lea(a, ptr [a + 16]);
  ja(".lp"); // a little faster on i7
} else {
  jbe(".headCmp");
  add(a, 16); // 1.1 times faster on Xeon
  jmp(".lp");
  L(".headCmp");
}
jnc(".notFound");
// get position
if (isSandyBridge) {
  lea(a, ptr [a + c - 16]);
} else {
  add(a, c);
}
20. other features of str_util.hpp
strchr_any(text, key) [or findChar_any]
returns a pointer to the first occurrence of any
character of key in the text
// search character position of '?', '#', '$', '!', '/', ':'
strchr_any(text,"?#$!/:");
same speed as strchr by using SSE4.2
max length of key is 16
strchr_range(text, key) [or findChar_range]
returns a pointer to the first occurrence of a
character in range [key[0], key[1]], [key[2], key[3]], ...
also the same speed as strchr, and max length of key is 16
// search character position of [0-9], [a-f], [A-F]
strchr_range(text,"09afAF");
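The semantics of the two functions can be pinned down with plain scalar reference versions. These are sketches for illustration only, not the SSE4.2 code in str_util.hpp; the names strchrAnyRef and strchrRangeRef are hypothetical, and returning a null pointer when nothing is found is an assumption made by analogy with strchr.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// First position in text holding any character of key, or 0 if none
// (same result as the standard strpbrk).
const char* strchrAnyRef(const char* text, const char* key) {
    assert(text != 0 && key != 0);
    for (; *text != '\0'; text++) {
        if (std::strchr(key, *text) != 0) return text;
    }
    return 0;
}

// First position in text holding a character inside one of the ranges
// [key[0], key[1]], [key[2], key[3]], ..., or 0 if none.
const char* strchrRangeRef(const char* text, const char* key) {
    assert(text != 0 && key != 0);
    for (; *text != '\0'; text++) {
        const unsigned char c = static_cast<unsigned char>(*text);
        for (const char* k = key; k[0] != '\0' && k[1] != '\0'; k += 2) {
            if (static_cast<unsigned char>(k[0]) <= c &&
                c <= static_cast<unsigned char>(k[1])) {
                return text;
            }
        }
    }
    return 0;
}
```

For example, strchrAnyRef("http://a/b?x", "?#$!/:") stops at the ':' at offset 4, and strchrRangeRef("xyz=7f", "09afAF") stops at the '7' at offset 4.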