[bug] incorrect code page parsing #144

@ctlayon

When running olefile against several Word documents, I noticed that the code page is parsed incorrectly:

$ olefile ebbd7703c87daedded17ac6a17a048c7 | egrep codepage
- codepage: -535
- codepage_doc: 1252

Looking at the summary stream manually, you can see:

00000000: FE FF 00 00 06 02 02 00  00 00 00 00 00 00 00 00  ................
00000010: 00 00 00 00 00 00 00 00  01 00 00 00 E0 85 9F F2  ................
00000020: F9 4F 68 10 AB 91 08 00  2B 27 B3 D9 [30 00 00 00]  .Oh.....+'..0...
00000030: 14 01 00 00 0D 00 00 00 [01 00 00 00 70 00 00 00]  ............p...
00000040: 04 00 00 00 78 00 00 00  07 00 00 00 84 00 00 00  ....x...........
00000050: 08 00 00 00 94 00 00 00  09 00 00 00 A4 00 00 00  ................
00000060: 12 00 00 00 B0 00 00 00  0A 00 00 00 D0 00 00 00  ................
00000070: 0C 00 00 00 DC 00 00 00  0D 00 00 00 E8 00 00 00  ................
00000080: 0E 00 00 00 F4 00 00 00  0F 00 00 00 FC 00 00 00  ................
00000090: 10 00 00 00 04 01 00 00  13 00 00 00 0C 01 00 00  ................
000000A0: [02 00 00 00 E9 FD 00 00]  1E 00 00 00 04 00 00 00  ................
000000B0: E7 A2 A7 00 1E 00 00 00  08 00 00 00 6F 6E 6C 69  ............onli
000000C0: 6E 65 00 00 1E 00 00 00  08 00 00 00 48 43 47 72  ne..........HCGr

I surrounded the relevant locations with []:

  1. Offset 0x2c stores the offset (0x30) into the stream where the first property set begins
  2. Offset 0x38 stores the property ID (0x1), which identifies the code page property
  3. Offset 0x3c stores the offset (0x70), relative to the start of the property set, at which this property's value is located
  4. Adding 0x70 + 0x30 yields 0xA0, which is where the code page property is stored
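
The four steps above can be reproduced with Python's struct module. The sketch below is illustrative only: the bytes are copied from the dump, and `find_property` is a hypothetical helper, not an olefile function. It assumes the property set starts with a 4-byte size and a 4-byte property count, followed by 8-byte (ID, offset) pairs, which is consistent with the dump.

```python
import struct

# First 0xB0 bytes of the summary stream, copied from the hex dump above.
STREAM = bytes.fromhex(
    "FE FF 00 00 06 02 02 00 00 00 00 00 00 00 00 00 "
    "00 00 00 00 00 00 00 00 01 00 00 00 E0 85 9F F2 "
    "F9 4F 68 10 AB 91 08 00 2B 27 B3 D9 30 00 00 00 "
    "14 01 00 00 0D 00 00 00 01 00 00 00 70 00 00 00 "
    "04 00 00 00 78 00 00 00 07 00 00 00 84 00 00 00 "
    "08 00 00 00 94 00 00 00 09 00 00 00 A4 00 00 00 "
    "12 00 00 00 B0 00 00 00 0A 00 00 00 D0 00 00 00 "
    "0C 00 00 00 DC 00 00 00 0D 00 00 00 E8 00 00 00 "
    "0E 00 00 00 F4 00 00 00 0F 00 00 00 FC 00 00 00 "
    "10 00 00 00 04 01 00 00 13 00 00 00 0C 01 00 00 "
    "02 00 00 00 E9 FD 00 00 1E 00 00 00 04 00 00 00"
)


def find_property(stream, wanted_id):
    """Hypothetical helper: return the absolute stream offset of a property."""
    # Step 1: offset 0x2C holds the start of the first property set.
    section_offset = struct.unpack_from("<I", stream, 0x2C)[0]
    # Assumed header layout: 4-byte size, then 4-byte property count.
    _size, count = struct.unpack_from("<II", stream, section_offset)
    for i in range(count):
        # Steps 2 and 3: each entry is (property ID, offset within the set).
        entry = section_offset + 8 + 8 * i
        prop_id, rel_offset = struct.unpack_from("<II", stream, entry)
        if prop_id == wanted_id:
            # Step 4: make the offset absolute.
            return section_offset + rel_offset
    return None


value_offset = find_property(STREAM, 0x01)
print(hex(value_offset))                             # 0xa0
print(STREAM[value_offset:value_offset + 8].hex())   # 02000000e9fd0000
```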

The first 2 bytes are the PropertyType (VT_I2), followed by 2 bytes of padding.
The next 4 bytes are the actual value, 0xFDE9 or 65001, which is the code page identifier for UTF-8 (https://docs.microsoft.com/en-us/windows/win32/intl/code-page-identifiers).
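
To see where the reported -535 comes from, the 8 bytes at 0xA0 can be decoded in several ways. This is a small sketch using only the bytes from the dump; the variable names are made up for illustration. Note that for these particular bytes an unsigned 16-bit read gives the same result as a 32-bit read, because the two trailing bytes are zero.

```python
import struct

# The 8 bytes at offset 0xA0 in the dump: type field, then value field.
raw = bytes.fromhex("02 00 00 00 E9 FD 00 00")

prop_type = struct.unpack_from("<H", raw, 0)[0]       # 2, i.e. VT_I2
as_signed_16 = struct.unpack_from("<h", raw, 4)[0]    # 0xFDE9 read as signed
as_unsigned_16 = struct.unpack_from("<H", raw, 4)[0]  # 0xFDE9 read as unsigned
as_unsigned_32 = struct.unpack_from("<I", raw, 4)[0]  # all 4 value bytes

print(prop_type, as_signed_16, as_unsigned_16, as_unsigned_32)
# 2 -535 65001 65001
```

The signed result matches the arithmetic in the snippet quoted from olefile: 65001 - 65536 = -535.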

I'm 90% sure the issue is that the value is being read as a u16 instead of a u32:

olefile/olefile/olefile.py

Lines 2231 to 2235 in 375a2d7

if property_type == VT_I2: # 16-bit signed integer
    value = i16(s, offset)
    if value >= 32768:
        value = value - 65536
    size = 2

The specification at https://docs.microsoft.com/en-us/openspecs/windows_protocols/ms-oleps/18a44a26-4a67-4894-8db2-52a701f2473f states:
Value (4 bytes at offset 204):

It then misleadingly describes it as a 2-byte signed integer.

I believe it should be read as a u32. I'm not sure whether this will break processing anywhere else, so it should either be vetted or handled as a one-off case for the code page:

if property_type == VT_I2: # 32-bit signed integer
    value = i32(s, offset)
    size = 4

This change seems to fix the issue for me:

$ python3 olefile.py ~/Exclusions/malware/ebbd7703c87daedded17ac6a17a048c7 | egrep codepage
- codepage: 65001
- codepage_doc: 1252
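
For the one-off option mentioned above, a minimal sketch might look like the following. It is not olefile code: `read_vt_i2` and `PID_CODEPAGE` are hypothetical names, and it deliberately swaps the 32-bit read for an unsigned 16-bit read of the code page (an assumption on my part), which gives the same 65001 for this file while leaving every other VT_I2 property signed.

```python
import struct

VT_I2 = 0x0002       # property type seen at offset 0xA0 in the dump
PID_CODEPAGE = 0x01  # property ID of the code page, as in the dump


def read_vt_i2(data, offset, property_id):
    """Hypothetical helper: decode a VT_I2 value starting at `offset`.

    Assumption: only the code page is treated as unsigned; all other
    VT_I2 properties keep the signed 16-bit interpretation.
    """
    if property_id == PID_CODEPAGE:
        return struct.unpack_from("<H", data, offset)[0]
    return struct.unpack_from("<h", data, offset)[0]


value_field = bytes.fromhex("E9 FD 00 00")
print(read_vt_i2(value_field, 0, PID_CODEPAGE))  # 65001
print(read_vt_i2(value_field, 0, 0x02))          # -535
```

Whether unsigned 16-bit or 32-bit is the right width for the special case still needs to be checked against the specification.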
