Description
When running olefile against several Word documents, I noticed that the code page is being parsed incorrectly:
$ olefile ebbd7703c87daedded17ac6a17a048c7 | egrep codepage
- codepage: -535
- codepage_doc: 1252
Manually looking at the summary stream you can see:
00000000: FE FF 00 00 06 02 02 00 00 00 00 00 00 00 00 00 ................
00000010: 00 00 00 00 00 00 00 00 01 00 00 00 E0 85 9F F2 ................
00000020: F9 4F 68 10 AB 91 08 00 2B 27 B3 D9 [30 00 00 00] .Oh.....+'..0...
00000030: 14 01 00 00 0D 00 00 00 [01 00 00 00 70 00 00 00] ............p...
00000040: 04 00 00 00 78 00 00 00 07 00 00 00 84 00 00 00 ....x...........
00000050: 08 00 00 00 94 00 00 00 09 00 00 00 A4 00 00 00 ................
00000060: 12 00 00 00 B0 00 00 00 0A 00 00 00 D0 00 00 00 ................
00000070: 0C 00 00 00 DC 00 00 00 0D 00 00 00 E8 00 00 00 ................
00000080: 0E 00 00 00 F4 00 00 00 0F 00 00 00 FC 00 00 00 ................
00000090: 10 00 00 00 04 01 00 00 13 00 00 00 0C 01 00 00 ................
000000A0: [02 00 00 00 E9 FD 00 00] 1E 00 00 00 04 00 00 00 ................
000000B0: E7 A2 A7 00 1E 00 00 00 08 00 00 00 6F 6E 6C 69 ............onli
000000C0: 6E 65 00 00 1E 00 00 00 08 00 00 00 48 43 47 72 ne..........HCGr
I surrounded the relevant locations with []:
- Offset 0x2c stores the offset (0x30) into the stream where the first property set begins
- Offset 0x38 stores the property ID (0x1) which signifies code page
- Offset 0x3c stores the offset (0x70), relative to the start of the property set, at which this value is located
- Calculating 0x70 + 0x30 yields 0xA0, which is where the code page property is stored
The first 2 bytes are the PropertyType (VT_I2) and the following 2 bytes are padding.
The next 4 bytes are the actual value, 0xFDE9 or 65001, which represents UTF-8 (https://docs.microsoft.com/en-us/windows/win32/intl/code-page-identifiers).
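The offset walk above can be reproduced with a few `struct` calls. This is only a sketch against the bytes copied from the hex dump (0x00 through 0xAF); the variable names are mine and none of this is olefile's own code.

```python
import struct

# Bytes 0x00-0xAF of the summary stream, copied from the hex dump above.
DUMP = """
FE FF 00 00 06 02 02 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 01 00 00 00 E0 85 9F F2
F9 4F 68 10 AB 91 08 00 2B 27 B3 D9 30 00 00 00
14 01 00 00 0D 00 00 00 01 00 00 00 70 00 00 00
04 00 00 00 78 00 00 00 07 00 00 00 84 00 00 00
08 00 00 00 94 00 00 00 09 00 00 00 A4 00 00 00
12 00 00 00 B0 00 00 00 0A 00 00 00 D0 00 00 00
0C 00 00 00 DC 00 00 00 0D 00 00 00 E8 00 00 00
0E 00 00 00 F4 00 00 00 0F 00 00 00 FC 00 00 00
10 00 00 00 04 01 00 00 13 00 00 00 0C 01 00 00
02 00 00 00 E9 FD 00 00 1E 00 00 00 04 00 00 00
"""
stream = bytes.fromhex("".join(DUMP.split()))

# Offset 0x2C: where the first property set begins.
section_offset = struct.unpack_from("<I", stream, 0x2C)[0]

# The property set starts with its size and property count (4 bytes each),
# followed by (property ID, offset) pairs. The first pair sits at 0x38.
prop_id, prop_offset = struct.unpack_from("<II", stream, section_offset + 8)

# The property offset is relative to the start of the property set.
value_pos = section_offset + prop_offset

# Typed value: 2-byte type, 2 bytes of padding, then the value itself.
prop_type, padding = struct.unpack_from("<HH", stream, value_pos)
value_u32 = struct.unpack_from("<I", stream, value_pos + 4)[0]

print(hex(section_offset), prop_id, hex(prop_offset), hex(value_pos))
print(prop_type, value_u32)
```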
I'm 90% sure the issue is that it is being represented as a u16 instead of a u32:
Lines 2231 to 2235 in 375a2d7
    if property_type == VT_I2: # 16-bit signed integer
        value = i16(s, offset)
        if value >= 32768:
            value = value - 65536
        size = 2
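For what it's worth, the reported -535 is consistent with this branch. The sketch below assumes `i16` returns an unsigned little-endian 16-bit value (which the `>= 32768` check implies); the `struct` calls stand in for olefile's helpers rather than calling them.

```python
import struct

# The four value bytes found at offset 0xA4 in the dump.
raw = bytes.fromhex("E9FD0000")

# Assumed behaviour of i16(): unsigned little-endian 16-bit read.
unsigned16 = struct.unpack_from("<H", raw, 0)[0]

# Same sign conversion as the VT_I2 branch above.
signed16 = unsigned16 - 65536 if unsigned16 >= 32768 else unsigned16

# Reading all four bytes as an unsigned 32-bit value instead.
unsigned32 = struct.unpack_from("<I", raw, 0)[0]

print(unsigned16, signed16, unsigned32)
```

So the raw 16 bits already hold 65001, and it is the conversion to a signed value that turns it into -535.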
Looking at https://docs.microsoft.com/en-us/openspecs/windows_protocols/ms-oleps/18a44a26-4a67-4894-8db2-52a701f2473f it specifies that:
Value (4 bytes at offset 204):
It then misleadingly describes it as a 2-byte signed integer.
I believe it should be read as a u32. I'm not sure whether this will break processing anywhere else, so it should either be vetted or handled as a one-off case for the code page:
    if property_type == VT_I2: # 32-bit signed integer
        value = i32(s, offset)
        size = 4
This change seemingly fixes the issue for me:
$ python3 olefile.py ~/Exclusions/malware/ebbd7703c87daedded17ac6a17a048c7 | egrep codepage
- codepage: 65001
- codepage_doc: 1252
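If changing the VT_I2 branch for every property turns out to be too risky, the one-off case for the code page could look roughly like this. `normalize_codepage` is a hypothetical helper for illustration, not an existing olefile function, and where it would be called from is left open.

```python
def normalize_codepage(value):
    """Map a code page that was parsed as a signed 16-bit integer
    back to its unsigned form (hypothetical helper, not in olefile)."""
    if -32768 <= value < 0:
        # Undo the signed conversion: -535 becomes 65001.
        return value + 65536
    return value


print(normalize_codepage(-535))
print(normalize_codepage(1252))
```

This leaves other VT_I2 properties untouched, at the cost of special-casing one property ID.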