In this presentation we explore some of the problems of Unicode and how they can be used for nefarious purposes to exploit a range of critical vulnerabilities, including SQL Injection, XSS and many others.
2. Introduction
• Standard for representing text for most of the world’s writing systems
• The most recent version is Unicode 6.0
• Widely adopted by most programming platforms, operating systems and the Web
• The most widely used Unicode encodings are UTF-8 and UTF-16
3. Introduction to UTF-8
• UTF-8 (UCS Transformation Format - 8-bit)
• Backwards compatible with ASCII
• Simple ASCII chars are represented by a single byte
• Other characters take up to 4 bytes (the RFC 3629 limit); the original design allowed up to 6 bytes carrying 31 bits in total
5. UTF-8 Encoding Rules
• Every ASCII character is also a valid UTF-8 character (7 bits, i.e. 128 characters)
• For every other UTF-8 byte sequence the first byte indicates the length of the sequence in bytes
• The remaining bytes of the sequence have 10 as their two most significant bits
• This makes it easy to find where a byte sequence starts and ends
• There are more rules but this is a good start...
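The two rules above (lead byte announces the length, continuation bytes start with the bits 10) can be sketched in a few lines of Python. This is a minimal illustration limited to the 1 to 4 byte forms; the helper names `sequence_length`, `is_continuation` and `split_sequences` are made up for this example and do not cover the remaining validity rules.

```python
def sequence_length(lead):
    """Return the sequence length announced by a UTF-8 lead byte."""
    if lead & 0b10000000 == 0b00000000:   # 0xxxxxxx: plain ASCII
        return 1
    if lead & 0b11100000 == 0b11000000:   # 110xxxxx
        return 2
    if lead & 0b11110000 == 0b11100000:   # 1110xxxx
        return 3
    if lead & 0b11111000 == 0b11110000:   # 11110xxx
        return 4
    raise ValueError("0x%02X is not a valid lead byte" % lead)


def is_continuation(byte):
    """Continuation bytes have 10 as their two most significant bits."""
    return byte & 0b11000000 == 0b10000000


def split_sequences(data):
    """Split a byte string into per-character byte sequences."""
    sequences = []
    i = 0
    while i < len(data):
        n = sequence_length(data[i])
        seq = data[i:i + n]
        if len(seq) != n or not all(is_continuation(b) for b in seq[1:]):
            raise ValueError("malformed sequence at offset %d" % i)
        sequences.append(seq)
        i += n
    return sequences


# "a" is one byte, the euro sign U+20AC is three bytes.
print(split_sequences("a\u20ac".encode("utf-8")))
```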
6. Interesting UTF-8 Characters
• Unicode also defines a number of function (formatting) characters, shown here as UTF-8 byte sequences
• Byte Order Mark (BOM) - 0xEF, 0xBB, 0xBF may be placed at the start of the document to indicate UTF-8
• Left to Right Mark (LRM) - 0xE2, 0x80, 0x8E is placed to indicate text orientation
• In HTML - &lrm;, &#x200E; or &#8206;
• Right to Left Mark (RLM) - 0xE2, 0x80, 0x8F is placed to indicate text orientation
• In HTML - &rlm;, &#x200F; or &#8207;
• Left to Right Embedding (LRE) - 0xE2, 0x80, 0xAA
• In HTML - &#x202A;
• Right to Left Embedding (RLE) - 0xE2, 0x80, 0xAB
• In HTML - &#x202B;
• There are more...
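The byte sequences on this slide can be checked by encoding the corresponding code points. A small sketch, assuming the standard code points U+FEFF, U+200E, U+200F, U+202A and U+202B; the table name `FORMAT_CHARS` and the helper `utf8_hex` are illustrative only.

```python
# Code points of the formatting characters listed on the slide.
FORMAT_CHARS = {
    "BOM": 0xFEFF,
    "LRM": 0x200E,
    "RLM": 0x200F,
    "LRE": 0x202A,
    "RLE": 0x202B,
}


def utf8_hex(code_point):
    """Return the UTF-8 bytes of a code point as '0x..' strings."""
    return " ".join("0x%02X" % b for b in chr(code_point).encode("utf-8"))


for label, cp in FORMAT_CHARS.items():
    print("%s U+%04X -> %s" % (label, cp, utf8_hex(cp)))
```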
7. Clarifications
• How exactly does the hex sequence 0xE2, 0x80, 0x8E map to &#x200E; in HTML?
• 0xE2, 0x80, 0x8E is the UTF-8 encoding of code point U+200E
• &#x200E; refers to the same code point, which is 0x20, 0x0E in UTF-16 (big-endian)
• also known as 0x0000200E in UTF-32
• There is no magic! You simply need to know which encoding system you are working with and find out what characters it supports.
• http://www.decodeunicode.org - is a good reference
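The mapping can be reproduced with the Python standard library: the HTML references all resolve to the single code point U+200E, and each encoding then serialises that code point differently. This is only a sketch; the big-endian codec names are chosen so that no BOM is prepended.

```python
import html

# All three HTML spellings resolve to the same code point.
for ref in ("&lrm;", "&#x200E;", "&#8206;"):
    assert html.unescape(ref) == "\u200e"

lrm = "\u200e"
print(lrm.encode("utf-8").hex())      # UTF-8 bytes
print(lrm.encode("utf-16-be").hex())  # UTF-16, big-endian, no BOM
print(lrm.encode("utf-32-be").hex())  # UTF-32, big-endian, no BOM
```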
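8. Multiple Representations
• The same character can be represented in multiple ways
• For example (the longer forms are overlong encodings, which the standard forbids but lenient decoders may still accept)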
• . (DOT) is represented as 0x2E
• It is also the equivalent of 0xC0, 0xAE
• It is also the equivalent of 0xE0, 0x80, 0xAE
• It is also the equivalent of 0xF0, 0x80, 0x80, 0xAE
• It is also the equivalent of 0xF8, 0x80, 0x80, 0x80, 0xAE
• It is also the equivalent of 0xFC, 0x80, 0x80, 0x80, 0x80, 0xAE
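The list above can be generated mechanically by padding the code point into a longer bit layout than it needs. The sketch below uses a hypothetical helper, `overlong`, based on the original 2 to 6 byte UTF-8 layout, and then checks how Python's own strict decoder reacts; other decoders may behave differently.

```python
def overlong(code_point, length):
    """Encode code_point in exactly `length` bytes (2..6), skipping the
    shortest-form check that a conforming encoder would apply."""
    if not 2 <= length <= 6:
        raise ValueError("length must be between 2 and 6")
    payload_bits = (7 - length) + 6 * (length - 1)
    if code_point >= 1 << payload_bits:
        raise ValueError("code point does not fit in this length")
    tail = []
    for _ in range(length - 1):
        tail.append(0x80 | (code_point & 0x3F))  # 10xxxxxx continuation byte
        code_point >>= 6
    lead = ((0xFF << (8 - length)) & 0xFF) | code_point
    return bytes([lead] + tail[::-1])


def strict_decoder_accepts(data):
    """True if Python's built-in UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


for n in range(2, 7):
    encoded = overlong(0x2E, n)  # 0x2E is the dot
    print(n, encoded.hex(), strict_decoder_accepts(encoded))
```

A conforming decoder rejects every one of these forms, so the risk lies in hand-written or legacy decoders that skip the shortest-form check.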
10. Half and Full Width Forms
• Graphic characters are traditionally classed as halfwidth and fullwidth characters
• In a fixed-width font a halfwidth character takes half the width of a fullwidth character
• In Unicode you can find characters which are presented in both their halfwidth and fullwidth forms
• http://www.unicode.org/charts/PDF/UFF00.pdf - for more information
11. Fullwidth Latin Characters
• Halfwidth and fullwidth notations make sense when used for characters such as those found in the Japanese and Chinese character sets
• The specification also covers Latin characters presented in their fullwidth forms
• As a result the following mappings are possible
• A - 0x41 (halfwidth) = Ａ - 0xEF, 0xBC, 0xA1 (fullwidth)
• B - 0x42 (halfwidth) = Ｂ - 0xEF, 0xBC, 0xA2 (fullwidth)
• etc.
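The mapping is regular: the fullwidth forms of the printable ASCII characters 0x21 to 0x7E sit at a fixed offset of 0xFEE0 (U+FF01 to U+FF5E). A short sketch, where `to_fullwidth` is a made-up helper, showing the conversion and how NFKC normalization folds the fullwidth forms back to ASCII:

```python
import unicodedata

FULLWIDTH_OFFSET = 0xFEE0  # U+FF21 (fullwidth A) minus U+0041 (A)


def to_fullwidth(text):
    """Replace printable ASCII (0x21..0x7E) with its fullwidth form."""
    return "".join(
        chr(ord(ch) + FULLWIDTH_OFFSET) if 0x21 <= ord(ch) <= 0x7E else ch
        for ch in text
    )


wide = to_fullwidth("AB")
print(wide.encode("utf-8").hex())           # bytes of the fullwidth forms
print(unicodedata.normalize("NFKC", wide))  # folded back to plain ASCII
```

The practical consequence is that a check performed before normalization and a consumer that runs after normalization can disagree about what the string contains.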
12. Security Considerations
• Visual Security Issues
• Internationalized names
• Left to Right and Right to Left representations
• Charset Translation Issues
• Occur when strings are normalized before and after translation between character sets
• Characters with multiple representations
• The same character can be represented in multiple ways
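13. Case Study: Windows Filename Mangling
• Consider the following file names ([RTLO] is the Right-to-Left Override character U+202E, [LTRO] the Left-to-Right Override character U+202D)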
• [RTLO]cod.stnemucodtnatropmi.exe
• [RTLO]cod.yrammusevituc[LTRO]n1c[LTRO].exe
• [RTLO]gpj.!nuf_stohsnee[LTRO]n1c[LTRO].scr
• Visually these file names are displayed differently
• exe.importantdocuments.doc
• n1c.executivesummary.doc
• n1c.screenshots_fun!.jpg
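The first file name can be reproduced to show the gap between what is stored and what is displayed. This is a sketch only: the reversal is a rough stand-in for bidirectional rendering of a string fully covered by a single override, and `has_bidi_controls` with its `BIDI_CONTROLS` set is a hypothetical check, not an exhaustive one.

```python
import os

RLO = "\u202e"  # RIGHT-TO-LEFT OVERRIDE

# Directional formatting characters mentioned in this deck, plus U+202C
# (POP DIRECTIONAL FORMATTING). Illustrative, not a complete list.
BIDI_CONTROLS = {"\u200e", "\u200f", "\u202a", "\u202b",
                 "\u202c", "\u202d", "\u202e"}


def real_extension(filename):
    """Extension as the file system sees it, ignoring display order."""
    return os.path.splitext(filename)[1]


def has_bidi_controls(filename):
    """True if the name contains a directional formatting character."""
    return any(ch in BIDI_CONTROLS for ch in filename)


name = RLO + "cod.stnemucodtnatropmi.exe"
shown = name[1:][::-1]  # approximate on-screen order under the override

print(shown)                    # what the user reads
print(real_extension(name))     # what the system executes
print(has_bidi_controls(name))  # a simple red flag for file names
```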
14. Case Study: The PAYPAL Scam
• What is the difference between paypal.com and paypaI.com (capital I), or between intel.com and lntel.com (lowercase L)?
• How about citybank.com?
• 0000000: d181 6974 7962 616e 6b2e 636f 6d ..itybank.com
• 0xd1, 0x81 is the Cyrillic letter с (U+0441), which looks like the Latin letter c although they are different characters
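The hex dump can be decoded to confirm which character hides at the start of the domain. A small sketch using only the standard library; `non_ascii_chars` is an illustrative helper name.

```python
import unicodedata

# Bytes taken from the hex dump on the slide.
raw = bytes.fromhex("d181 6974 7962 616e 6b2e 636f 6d")
domain = raw.decode("utf-8")


def non_ascii_chars(text):
    """List (index, code point, Unicode name) for every non-ASCII character."""
    return [
        (i, "U+%04X" % ord(ch), unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(text)
        if ord(ch) > 0x7F
    ]


print(domain == "citybank.com")  # compares against the all-Latin spelling
print(non_ascii_chars(domain))
```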
15. Case Study: Directory Traversal
• Let’s say an application shows images by requesting /getimage.jsp?name=image.jpg
• The attacker tries to retrieve an arbitrary file by requesting /getimage.jsp?name=../../../../boot.ini
• Unfortunately the attack fails because the application checks for the presence of the ../ character sequence
• ../ is 0x2E, 0x2E, 0x2F in hex
• ../ is also 0x2E, 0xC0, 0xAE, 0x2F when the second dot is written in overlong UTF-8
• Since 0x2E, 0xC0, 0xAE, 0x2F is not equal to 0x2E, 0x2E, 0x2F, the security check is bypassed; if the application then decodes the name with a lenient UTF-8 decoder, the file content is retrieved
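The bypass can be simulated end to end. Everything in this sketch is hypothetical: `naive_filter_allows` stands in for the application's check on the raw bytes, and `lenient_decode` stands in for a decoder that skips the shortest-form check (it only handles 1 and 2 byte sequences). Python's own decoder is strict and is shown for contrast.

```python
def naive_filter_allows(raw):
    """Hypothetical check: reject only the literal bytes of '../'."""
    return b"../" not in raw


def lenient_decode(data):
    """Hypothetical decoder that accepts overlong 2-byte sequences."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                                   # single ASCII byte
            out.append(chr(b))
            i += 1
        elif (b & 0xE0 == 0xC0 and i + 1 < len(data)
              and data[i + 1] & 0xC0 == 0x80):         # 110xxxxx 10xxxxxx
            out.append(chr(((b & 0x1F) << 6) | (data[i + 1] & 0x3F)))
            i += 2
        else:
            raise ValueError("unsupported or malformed sequence")
    return "".join(out)


# Second dot of every '../' written as the overlong pair 0xC0, 0xAE.
payload = b".\xc0\xae/" * 4 + b"boot.ini"

print(naive_filter_allows(payload))  # the check sees no '../'
print(lenient_decode(payload))       # the decoder restores the traversal

try:
    payload.decode("utf-8")          # a strict decoder refuses the input
except UnicodeDecodeError as exc:
    print("strict decoder:", exc.reason)
```

The order of operations is what matters here: validating after decoding with a strict decoder closes this particular gap.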