77917e9b55
This slightly modifies HTML_TITLE_REGEX to fix two parsing errors. The first occurred when the title tag was empty (e.g. "<title></title>"), which was parsed as "</title". The second occurred when the title was a single character (e.g. "<title>A</title>"), which the regex did not match, so the title fell back to link.base_url. Now empty title tags fall back to link.base_url, and single-character titles are parsed correctly. The regex is still a bit fragile and may mishandle some edge cases. I couldn't find any cases of incorrect behavior, but it might still be worth reworking more thoroughly for robustness. |
||
---|---|---|
__init__.py | ||
archive_org.py | ||
dom.py | ||
favicon.py | ||
git.py | ||
headers.py | ||
media.py | ||
mercury.py | ||
pdf.py | ||
readability.py | ||
screenshot.py | ||
singlefile.py | ||
title.py | ||
wget.py |
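
The two parsing errors described in the commit message can be reproduced with a minimal sketch. The patterns below are assumptions chosen to show the described behavior and may not match the actual HTML_TITLE_REGEX; the names `OLD_TITLE_REGEX`, `NEW_TITLE_REGEX`, and `extract_title` are hypothetical, and the `fallback` argument stands in for link.base_url.

```python
import re

FLAGS = re.IGNORECASE | re.MULTILINE | re.DOTALL | re.UNICODE

# Assumed "before" pattern: the leading "." forces at least two characters
# and is allowed to consume the "<" of the closing tag.
OLD_TITLE_REGEX = re.compile(r'<title.*?>(.[^<>]+)', FLAGS)

# Assumed "after" pattern: one or more characters that are not "<" or ">".
NEW_TITLE_REGEX = re.compile(r'<title.*?>([^<>]+)', FLAGS)


def extract_title(html, fallback, regex=NEW_TITLE_REGEX):
    """Return the captured title text, or `fallback` if the regex finds none."""
    match = regex.search(html)
    return match.group(1) if match else fallback


base_url = "example.com/page"  # hypothetical stand-in for link.base_url

# Old behavior: an empty title is parsed as "</title",
# and a single-character title is not matched at all.
print(extract_title("<title></title>", base_url, OLD_TITLE_REGEX))
print(extract_title("<title>A</title>", base_url, OLD_TITLE_REGEX))

# New behavior: an empty title falls back, a single character is parsed.
print(extract_title("<title></title>", base_url))
print(extract_title("<title>A</title>", base_url))
```

Dropping the leading `.` is what changes both cases in this sketch: the capture group can no longer start on the `<` of `</title>`, and it no longer requires a second character.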