There are only two RFC-standard ways of encoding a request URI: hex (percent) encoding and UTF-8 encoding. Double percent hex encoding, double nibble hex encoding, first/second nibble hex encoding, 2/3-byte UTF-8 encoding, and %U encoding should all be blocked. Mismatched encoding should also be handled.
URI Hex Encoding
The encoding method consists of escaping the hexadecimal byte value of the encoded character with a ‘%’. If we wanted to hex encode a capital A, the encoding would look like %41, i.e. ‘A’. In double percent encoding, the percent sign is itself hex encoded and then followed by the hexadecimal byte value to be encoded, so %2541 = ‘A’.
In first nibble encoding, only the first nibble is hex encoded; i.e., the 4 of \x41 is encoded, so %%341 = ‘A’. During the first URL decoding pass, the %34 is decoded as the numeral 4, which leaves %41 for the second pass. During the second pass, the %41 is decoded as a capital A. Second nibble encoding works the same way: in %4%31, the %31 decodes to the numeral 1, again leaving %41.
GET /index%2541.html HTTP/1.1 (double percent)
GET /index%%34%31.html HTTP/1.1 (double nibble)
GET /index%%341.html HTTP/1.1 (first nibble)
GET /index%4%31.html HTTP/1.1 (second nibble)
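As a sketch, the multi-pass behavior these variants exploit can be modeled with a naive decoder that leaves malformed escapes untouched and is simply run twice (a hypothetical helper, not any particular server's code):

```python
HEX = set("0123456789abcdefABCDEF")

def percent_decode(s: str) -> str:
    """One percent-decoding pass; malformed escapes pass through unchanged."""
    out, i = [], 0
    while i < len(s):
        if s[i] == "%" and i + 2 < len(s) and s[i + 1] in HEX and s[i + 2] in HEX:
            out.append(chr(int(s[i + 1:i + 3], 16)))  # valid %XX escape
            i += 3
        else:
            out.append(s[i])  # literal character (or broken escape)
            i += 1
    return "".join(out)

# Each evasion survives the first pass as "%41" and decodes to 'A' on the second.
for path in ("index%2541.html", "index%%34%31.html",
             "index%%341.html", "index%4%31.html"):
    assert percent_decode(percent_decode(path)) == "indexA.html"
```

Note how the stray %4 in the second nibble example survives the first pass untouched (the following ‘%’ is not a hex digit), so the decoded 1 completes %41 for the second pass.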
UTF-8 Hex Encoding
HTTP web servers use UTF-8 encoding to represent Unicode code points outside of the ASCII range (0 – 127).
UTF-8 works by giving special meaning to the high bits in a byte. The two- and three-byte UTF-8 sequence layouts are shown below:
110xxxxx 10xxxxxx (two byte sequence)
1110xxxx 10xxxxxx 10xxxxxx (three byte sequence)
The first byte in a UTF-8 sequence is the most important because it indicates how many bytes make up the complete sequence: the count of high bits set before the first zero gives the length. The bits after that zero, together with the low six bits of each continuation byte, form the value being encoded. To encode UTF-8 in the URL, each byte of the sequence is escaped with a percent. For example, a capital letter A can be encoded in a two byte UTF-8 sequence as %C1%81 (11000001 10000001 = 1000001 = ‘A’); since ‘A’ is an ASCII character, this two-byte form is an overlong, strictly illegal encoding.
Similarly, consider how ‘A’ can be encoded in a three byte UTF-8 sequence.
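The bit assembly described above can be sketched as a bare illustration with no validity checks (a real decoder must reject overlong forms):

```python
def utf8_value(first: int, *cont: int) -> int:
    """Assemble a code point from a 1-, 2-, or 3-byte sequence:
    payload bits of the first byte, then 6 bits per continuation byte."""
    payload_bits = {0: 7, 1: 5, 2: 4}[len(cont)]  # bits left in the first byte
    value = first & ((1 << payload_bits) - 1)
    for c in cont:
        value = (value << 6) | (c & 0x3F)  # take the low six bits
    return value

assert utf8_value(0x41) == 0x41              # plain ASCII 'A'
assert utf8_value(0xC1, 0x81) == 0x41        # %C1%81 -> 'A' (overlong)
assert utf8_value(0xE0, 0x81, 0x81) == 0x41  # %E0%81%81 -> 'A' (overlong)
```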
GET /index%C1%81.html HTTP/1.1 (2-byte)
GET /index%E0%81%81.html HTTP/1.1 (3-byte)
GET /index%U0041.html HTTP/1.1 (%U encoding)
GET /scripts/default.id%u0061?AAA… (240 times)
Consider the “.” (dot) represented as 2E, C0 AE, E0 80 AE, F0 80 80 AE, F8 80 80 80 AE, or FC 80 80 80 80 AE as a 1-, 2-, 3-, 4-, 5-, or 6-byte UTF-8 encoding.
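These overlong forms can be generated by padding the code point into a wider sequence than it needs; a minimal sketch (all multi-byte forms of an ASCII character are illegal per RFC 3629 and shown only to illustrate the attack surface):

```python
# Lead-byte prefixes for 2- through 6-byte UTF-8 sequences.
LEAD = {2: 0xC0, 3: 0xE0, 4: 0xF0, 5: 0xF8, 6: 0xFC}

def overlong_utf8(cp: int, nbytes: int) -> bytes:
    """Encode code point cp into an nbytes-long UTF-8 sequence,
    even when fewer bytes would suffice (overlong, hence illegal)."""
    if nbytes == 1:
        return bytes([cp])
    out = [LEAD[nbytes] | (cp >> (6 * (nbytes - 1)))]
    for k in range(nbytes - 2, -1, -1):
        out.append(0x80 | ((cp >> (6 * k)) & 0x3F))
    return bytes(out)

for n in range(1, 7):
    print(overlong_utf8(0x2E, n).hex(" ").upper())
# Prints the six forms of "." listed above, 2E through FC 80 80 80 80 AE.
```

The same routine reproduces the C0 80 NULL discussed below: overlong_utf8(0x00, 2) yields C0 80.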
In most circumstances, Unicode attacks have succeeded because of poor security validation of the UTF-8 encoded character or string, and of the interpretation of illegal octet sequences. The following cases might occur:
• An application may prohibit the use of the NULL character when it is parsed as the single octet 00, but allow the illegal two-octet sequence C0 80 and interpret it as a NULL.
• A decoder may keep only the six least significant bits of a continuation byte. The two most significant bits, normally “10”, may then also be replaced with “00”, “01” or “11”. Thus the “.” (dot) may be represented as C0 AE, C0 2E, C0 6E or C0 EE.
11000000 10101110 (C0 AE)
11000000 00101110 (C0 2E)
11000000 01101110 (C0 6E)
11000000 11101110 (C0 EE)
• Various application components may prohibit the use of the string “..\” and the corresponding single octet sequence 2E 2E 5C, yet permit the illegal octet sequence 2E C0 AE 5C.
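The six-bit truncation case above can be sketched as a lenient two-byte decoder that ignores the top two bits of the trailing byte:

```python
def lenient_decode_2byte(lead: int, trail: int) -> str:
    """Lenient decode of a two-byte sequence: the trail byte is masked to
    its low six bits, so its top two bits are never validated."""
    cp = ((lead & 0x1F) << 6) | (trail & 0x3F)
    return chr(cp)

# All four trailing bytes collapse to "." under this decoder.
for trail in (0xAE, 0x2E, 0x6E, 0xEE):
    assert lenient_decode_2byte(0xC0, trail) == "."
```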
A successful attack may be made using valid or invalid URL encoding.
Valid URL encoding refers to the escape-encoding of each UTF-8 sequence octet. For example, the “/” UTF-8 sequence could be encoded as %C0%AF.
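A one-line sketch of this escape-encoding step, applied to the illegal two-byte form of “/”:

```python
def escape_octets(octets: bytes) -> str:
    """Percent-escape every octet of a (possibly overlong) UTF-8 sequence."""
    return "".join(f"%{b:02X}" for b in octets)

assert escape_octets(bytes([0xC0, 0xAF])) == "%C0%AF"
```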
An invalid URL encoding refers to the use of non-hexadecimal digits that may be incorrectly interpreted as an alternative, but valid, hexadecimal digit. For example, %C0 is interpreted as the character number (‘C’ − ‘A’ + 10) × 16 + (‘0’ − ‘0’) = 192.
If we apply the same principle, treating any letter as if it were a hex digit A–F,
%BG is interpreted as (‘B’ − ‘A’ + 10) × 16 + (‘G’ − ‘A’ + 10) = 192
%QF is interpreted as (‘Q’ − ‘A’ + 10) × 16 + (‘F’ − ‘A’ + 10) = 431, which, when truncated to a single byte (8 significant bits), yields 175, corresponding to %AF.
If the application’s algorithm accepts non-hexadecimal digits (such as ‘G’), then it may be possible to have variants of %C0 such as %BG. In the case of the “/”, it is possible to represent the character as %C0%AF or %BG%QF, for example.
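The flawed digit-classification logic described above can be sketched as follows (a hypothetical routine mirroring the arithmetic in the examples, not any specific server's code):

```python
def bad_hex_digit(c: str) -> int:
    """Broken hex-digit parser: any letter is accepted, not just A-F."""
    c = c.upper()
    if c.isdigit():
        return ord(c) - ord("0")
    return ord(c) - ord("A") + 10  # no range check on the letter

def bad_unescape(hh: str) -> int:
    """Decode a two-character escape like 'BG', truncating to one byte."""
    return (bad_hex_digit(hh[0]) * 16 + bad_hex_digit(hh[1])) & 0xFF

assert bad_unescape("C0") == 0xC0
assert bad_unescape("BG") == 0xC0  # (11 * 16) + 16 = 192
assert bad_unescape("QF") == 0xAF  # (26 * 16 + 15) mod 256 = 175
```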
%U encoding presents a different, Microsoft-specific way to encode Unicode code point values up to 65535 (two bytes): %U precedes four hexadecimal nibble values.
For example, here is a list of various code points that resolve to the capital letter "A": U+0041, U+0100, U+0102, U+0104, U+01CD, U+01DE, U+8721. Remember that many of these code points have multiple representations themselves. Since IIS is not case-sensitive, this leads to 30 different representations for the letter "A". There are 34 for "E", 36 for "I", 39 for "O", and 58 for "U". The string "AEIOU" can therefore be expressed in 83,060,640 different ways.
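The count quoted above follows directly from multiplying the per-letter representation counts:

```python
# Per-letter counts of equivalent representations, as stated in the text.
reprs = {"A": 30, "E": 34, "I": 36, "O": 39, "U": 58}

total = 1
for letter in "AEIOU":
    total *= reprs[letter]

assert total == 83060640  # 30 * 34 * 36 * 39 * 58
```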
Mismatched encoding mixes different encoding types to represent an ASCII character and is not a distinct encoding by itself. For example, let’s encode a capital A using the Microsoft %U encoding method. Since IIS double decodes a URL, we can use some of the other methods to encode the %U sequence itself. For instance, we can encode the U of the %U method with normal hex encoding, so a simple %U0041 becomes %%550041.
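A rough sketch of a double decode that handles both %XX and %uXXXX escapes (a simplification; real IIS behavior has more corner cases):

```python
import re

# Match %uXXXX first, then plain %XX; case-insensitive like IIS.
ESCAPE = re.compile(r"%u([0-9a-fA-F]{4})|%([0-9a-fA-F]{2})", re.IGNORECASE)

def decode_once(s: str) -> str:
    """One decoding pass over both escape styles; broken escapes are kept."""
    def sub(m):
        payload = m.group(1) or m.group(2)
        return chr(int(payload, 16))
    return ESCAPE.sub(sub, s)

# First pass: %55 -> 'U', leaving %U0041; second pass: %U0041 -> 'A'.
assert decode_once("%%550041") == "%U0041"
assert decode_once(decode_once("%%550041")) == "A"
```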
Try to figure out which ASCII character this encoding represents: %U0025%550%303%37