Document summary, title and more are not accessible with input bytes content #194

@tomdpsrd

Actual Behavior

Under Python 3, Document.summary(), Document.title() and related methods raise a TypeError when the Document is built from bytes rather than from a str.

Steps to Reproduce the Problem

Follow the README steps, passing the raw response bytes to Document:

>>> import requests
>>> from readability import Document

>>> response = requests.get('http://example.com')
>>> doc = Document(response.content)
>>> doc.title()
Traceback (most recent call last):
...
    RE_CHARSET.findall(page) + RE_PRAGMA.findall(page) + RE_XML.findall(page)
    ^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: cannot use a string pattern on a bytes-like object

How to correct

The str regexps should be changed to bytes regexps, since encoding.get_encoding is only used for bytes content.
In encoding.py:

RE_CHARSET = re.compile(br'<meta.*?charset=["\']*(.+?)["\'>]', flags=re.I)
RE_PRAGMA = re.compile(br'<meta.*?content=["\']*;?charset=(.+?)["\'>]', flags=re.I)
RE_XML = re.compile(br'^<\?xml.*?encoding=["\']*(.+?)["\'>]')
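
To check the proposed patterns in isolation, here is a minimal sketch that runs them against bytes input. The helper name declared_encodings and the sample pages are assumptions made for illustration; they are not part of readability's encoding.py, and this is not the library's actual get_encoding logic.

```python
import re

# Bytes patterns exactly as proposed above.
RE_CHARSET = re.compile(br'<meta.*?charset=["\']*(.+?)["\'>]', flags=re.I)
RE_PRAGMA = re.compile(br'<meta.*?content=["\']*;?charset=(.+?)["\'>]', flags=re.I)
RE_XML = re.compile(br'^<\?xml.*?encoding=["\']*(.+?)["\'>]')


def declared_encodings(page):
    """Hypothetical helper: list the encodings declared in a bytes page."""
    # Same expression as the line in the traceback; it only works when the
    # patterns and the page are both bytes.
    found = RE_CHARSET.findall(page) + RE_PRAGMA.findall(page) + RE_XML.findall(page)
    # Matches come back as bytes, so decode them before using them as names.
    return [name.decode("ascii", "ignore").lower() for name in found]


# Sample inputs (assumed, for illustration only).
html_page = b'<html><head><meta charset="utf-8"></head><body>hi</body></html>'
xml_page = b'<?xml version="1.0" encoding="ISO-8859-1"?><root/>'

html_encodings = declared_encodings(html_page)
xml_encodings = declared_encodings(xml_page)

# A str pattern applied to the same bytes reproduces the reported TypeError.
try:
    re.compile(r'<meta').findall(html_page)
    str_pattern_failed = False
except TypeError:
    str_pattern_failed = True
```

Because the matches are bytes, any caller that treats them as codec names would need the decode step shown in the helper, or an equivalent.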
