I am
I have worked with MySQL since time immemorial, back in the MySQL AB days.
I work from Uppsala, Sweden, the former head office of MySQL. Swedish is my native tongue, which makes a difference, as you will see.
So what’s all this about? We switched our regex library in 8.0.4. At the time I blogged about it here.
The old one was written by Henry Spencer in 1986. Called regexp, it is a very good regex library. It has been used widely in the Unix realm and is part of the POSIX standard. It is also called "the book regexp library" because he updated it for the book Software Solutions in C in 1994.
It made its way into Tcl, Postgres, and even early Perl. Apparently Postgres still uses it.
Really good, great performance. But ASCII only; it worked byte by byte and lacks many features.
Not safe: it is easy to put it into an infinite loop.
You can only do a boolean search, not matching; it doesn't have a pattern buffer out of the box, and hence doesn't support search-and-replace.
And this was a quite popular request: four feature-request bugs just for getting the matched substring.
We had 51 "Affects me" votes in total. CTEs had 59, but that's a really popular feature.
Now we have four functions:
REGEXP_INSTR → the position of a match (before or after it)
REGEXP_LIKE → boolean match
REGEXP_REPLACE → replaces a match; supports captures
REGEXP_SUBSTR → the matched substring
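A quick sketch of the four (MySQL 8.0.4 or later; the results assume the default collation):

```sql
SELECT REGEXP_INSTR('abc def', '[a-z]+', 1, 2);   -- 5: where the 2nd match starts
SELECT REGEXP_LIKE('abc def', 'def$');            -- 1: boolean
SELECT REGEXP_REPLACE('a1b2c3', '[0-9]', '');     -- 'abc': matches replaced
SELECT REGEXP_SUBSTR('abc def', '[a-z]+', 1, 2);  -- 'def': the matched substring
```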
So here's the agenda. On top we have security, which is why we chose ICU; perhaps not the obvious choice given the candidates. It also has close ties to Unicode.
What I won't cover here is all the features of regular expressions. These are documented in our manual, and you can always head to the ICU documentation.
My ambition is to teach you how to work efficiently and securely with Unicode, and to give some insight into where common wisdom breaks down.
I presented here 3 years ago and had a really good time, so I wanted to go again. I told my boss what I was going to talk about; I hadn't really added anything new since last time, and all I could think of was the new regular expressions. "Tell 'em about that," he said. "They'll love that." So I submitted this talk as a 20-minute presentation. Not only did it get accepted, it got upgraded to 30 minutes. I couldn't think of much to say, so I asked around: "What do YOU want to know about regular expressions with Unicode?" Nobody had a clue. So that's why I just picked some common pitfalls that I consider tricky.
The way a malicious user can exploit regex matching is by exhausting memory or creating an infinite loop, consuming all the CPU time.
Out of the box, there is always a cap on runtime.
The runtime limit is specified in "steps" of the match engine, which is a bit vacuous. The correspondence with actual processor time depends on the speed of the processor and the details of the specific pattern, but is typically on the order of milliseconds.
Match the first A, capture it, then repeat that match. Backtrack, match the second, repeat that, and so on. Eventually the match fails because of the C.
Set conservatively to 32 (secure by default)
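A sketch of provoking the limit; regexp_time_limit is the system variable, and the exact error text may differ by version:

```sql
-- regexp_time_limit caps the match in "steps" of the engine; default 32.
SET SESSION regexp_time_limit = 32;

-- Nested quantifiers force the engine to retry every partition of
-- the a's before failing on the C: catastrophic backtracking.
SELECT REGEXP_LIKE(REPEAT('a', 30), '(a+)+C');
-- Expect an error along the lines of
-- "Timeout exceeded in regular expression match".
```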
Here I'm trying to run out of memory. You really have to provoke it; otherwise you reach the time limit first.
Match the empty string 120 times, repeat that 11 times, repeat that 11 times, and so on.
This is the backtracking stack used by the engine.
The limit is in bytes; here I'm choking it down to 239 bytes.
The default size is 8 MB (8 000 000 bytes). I never managed to DoS the server.
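The memory cap is the regexp_stack_limit system variable; a sketch of choking it down as described (the pattern is illustrative, and the exact error text may vary):

```sql
-- regexp_stack_limit caps the engine's backtracking stack, in bytes.
SET SESSION regexp_stack_limit = 239;

-- Nested repetitions of an empty-capable group grow the stack fast:
SELECT REGEXP_LIKE('a', '((((){120}){11}){11}){11}');
-- Expect an error about overflowing the regexp backtracking stack.
```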
So… about the ICU library.
What is the ICU library? A set of i18n libraries. What they provide is globalization and Unicode support for software applications. They have an open-source license; from what I gather it is compatible with the GNU licenses, but IANAL.
Used by Java, Apple, Amazon, IBM, …
The Unicode consortium is mostly known for emoji nowadays. New releases of Unicode typically contain new emoji, and you have to be able to search for them. Remember the Haha-papa, a.k.a. Sushi-beer, bug. And so regexps have to support them.
💬 5 billion emojis are sent daily on Facebook Messenger
📸 By mid-2015, half of all comments on Instagram included an emoji
🍑 Only 7% of people use the peach emoji as a fruit
The rest mostly use it as a butt or for other non-fruit uses
According to Emojipedia.
In a sense, ICU is Unicode: it supports all of Unicode.
We ship ICU with MySQL, and it can optionally be built bundled. We ship 59.1; I notice Ubuntu 18.04 ships 60.
There is the internationalization library, which contains the regexes and charsets. That is all we use right now, and all we bundle. The common library contains things like the BreakIterator, which helps when working with grapheme clusters. I won't go into grapheme clusters in this presentation; we don't handle those yet.
The data library is not currently used, and we don't ship it. It is fairly big and not needed for regexes.
Let me tell you a bit about Unicode.
Unicode specifies three encodings.
UTF-32:
+ constant size
+ maps 1-to-1 to Unicode code points
- space-consuming
UTF-8:
+ optimized for Western/ASCII text
+ small (for Western text)
+ self-synchronizing (what isn't???)
- variable size
De-facto standard for the web: 92.9%
UTF-16:
Generally regarded as the worst of both worlds.
- bigger than UTF-8
- not fixed-size like UTF-32
+ constant size within the BMP (the most common characters)
+ also self-synchronizing
Surrogate pairs: how UTF-16 encodes code points outside the BMP.
Broken in Java. How? A Java char is a UTF-16 code unit, so String.length() reports 2 for a single non-BMP character.
Alas, UTF-16 is what ICU uses.
So the way we use ICU is: unless the match starts at the first character, we count the code points before the starting position, convert the rest to UTF-16, and search with ICU. We use ICU's C API; there is also a C++ API.
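The byte / code point / code unit distinction in one query (assuming a utf8mb4 connection):

```sql
-- 💬 is one code point outside the BMP: four bytes in utf8mb4,
-- and a surrogate pair (two UTF-16 code units) once ICU sees it.
SELECT LENGTH('💬') AS bytes, CHAR_LENGTH('💬') AS code_points;
-- bytes = 4, code_points = 1
```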
So, I have two examples of how to work with Unicode.
You can specify case sensitivity in three ways. Mode modifiers inside the regexp have the highest priority.
If there are no mode modifiers, the match_parameter argument is used. It is a string of modifiers: 'c' means case-sensitive, 'i' means case-insensitive.
If there are none of those, we look at the collation. There are rules for computing which collation should be used in any comparison, and they apply here.
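The three levels, sketched (assuming a utf8mb4 connection whose default 0900 collation is case-insensitive):

```sql
-- 1. Mode modifier inside the pattern has the highest priority:
SELECT REGEXP_LIKE('abc', '(?i)ABC');                        -- 1
-- 2. Otherwise the match_parameter is used:
SELECT REGEXP_LIKE('abc', 'ABC', 'c');                       -- 0, case-sensitive
-- 3. Otherwise the collation decides:
SELECT REGEXP_LIKE('abc' COLLATE utf8mb4_0900_as_cs, 'ABC'); -- 0
```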
Case insensitivity seems simple at first: the text is normalized by transforming it to the same case, and then compared.
On the next slide we see how such a case mapping could look.
Totally obvious, right? One character maps to exactly one character. This is called simple case-insensitive matching.
Well there are some trickier cases.
The German Eszett (ß) is generally understood to be equivalent to two s's, so in full case-insensitive matching they should be equal. Since no other language has an Eszett, this folding is part of the default.
I could go on all day about case mapping; it's a 61-page document in the Unicode standard. But these are the essentials.
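A sketch of what full folding means in practice; assuming full case folding is in effect for the case-insensitive match, this should return 1:

```sql
-- Under full case folding, ß folds to "ss":
SELECT REGEXP_LIKE('Straße', 'STRASSE', 'i');
```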
This example is a little more complicated, for me at least.
Here one letter obviously maps to two letters. Actual letters, not just code points.
If you paste them and press backspace, the little i goes away.
In this case they're different, but it works the same way.
It's all Greek to me.
Full case folding is used when the pattern contains anything that looks like a character string, even just one character.
A match can never start within an expanded character.
The anchors here enforce a match that would 1) start in the middle, or 2) end in the middle.
This is consistent with how collations work with the equals predicate.
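A sketch of that consistency, assuming the utf8mb4_0900_ai_ci collation expands ß as described:

```sql
-- The 0900 collations expand ß to "ss" for comparison:
SELECT 'ß' = 'ss' COLLATE utf8mb4_0900_ai_ci;   -- 1
-- But a regexp match may not start or end inside the expansion,
-- so anchoring into the middle of ß should find no match:
SELECT REGEXP_LIKE('ß', '^s', 'i'), REGEXP_LIKE('ß', 's$', 'i');
```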
The collation name is hard to read.
Character set, language code, pb for phone-book order, accent-sensitive, case-sensitive.
Case folding can also be language-dependent. In the default case folding, capital I folds to small (dotted) i. However, in Turkish case folding, a dotted capital İ folds to a dotted lowercase i, and a dotless capital I folds to a dotless lowercase ı.
We use the default folding, so in a Turkish locale it is actually wrong.
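For instance, a sketch of the default, locale-independent folding MySQL applies:

```sql
SELECT REGEXP_LIKE('i', 'I', 'i');   -- 1: default folding, I folds to dotted i
SELECT REGEXP_LIKE('ı', 'I', 'i');   -- 0: but in Turkish, I should fold to dotless ı
```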
Another problem with full Unicode and regexps: you need to be careful when you send non-ASCII data from a client. Here is a cautionary tale. I changed the variables character_set_connection, character_set_client, and character_set_results, which is what SET NAMES does.
So, I create a table.
I populate it. Swedish letter å. Pronounced
Read back.
Check with a regular expression match.
So, everything is fine, right? Let's do a "trust but verify" here. I want to see what's actually in the table. The problem is that the result will always be converted to my character set. I could apply a function to it on the server side, but all functions will also convert their arguments. What to do? All functions save one: the HEX() function. It will tell the truth.
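Hypothetical table and column names, but the trick looks like this:

```sql
-- HEX() shows the stored bytes before any conversion to
-- character_set_results can paper over the problem:
SELECT c, HEX(c) FROM t1;
-- A correctly stored å in UTF-8 is C3A5; the double-encoded
-- mojibake from this tale shows up as C383C2A5.
```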
So here we have… what? Is this really an a-with-ring? Let's check.
This is not å in any encoding. What is going on?
My terminal is UTF-8, so when I type å on my Swedish keyboard, it sends c3a5 to the server. Now, when I set character_set_client to latin1, what I really said was "interpret these bytes as latin1". Fine: c3a5, that's an a-tilde followed by a yen sign. It stores that. But the table stores utf8, so it converts.
And that becomes c383 c2a5.
Now, when I do the SELECT, it reads character_set_results: oh yeah, you speak latin1, let me translate for you.
And so we're back full circle: the garbage converts back, and å is what I see.
This is especially tricky with latin1, since any byte sequence is valid latin1. No check ever fails.
So here's a power tip for troubleshooting your multilingual regexps: if you use hex codes and character set introducers, it's totally unambiguous, as you see here.
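For example, the introducer declares the literal's encoding explicitly, independent of character_set_client:

```sql
SELECT _utf8mb4 0xC3A5;   -- å, stated as UTF-8 bytes
SELECT _latin1 0xE5;      -- also å, stated as latin1 bytes
```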