archivebox/archivebox/extractors
Ben Muthalaly 77917e9b55 Fix HTML title parsing bugs.
This slightly modifies HTML_TITLE_REGEX to fix two parsing errors.
The first occurred when the title tag was empty (e.g. "<title></title>"),
which was parsed as "</title". The second occurred when the title was a
single character (e.g. "<title>A</title>"), which the regex did not
match, so the title fell back to link.base_url.

Now an empty title tag falls back to link.base_url, and
single-character titles are parsed correctly.

The way the regex works now is still a bit wonky for some edge cases.
I couldn't find any cases of incorrect behavior, but it still might be
worth reworking more completely for robustness.
2023-10-09 02:00:01 -05:00
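
The two edge cases described in the commit message can be illustrated with a minimal sketch. The pattern and the names `TITLE_PATTERN` and `extract_title` below are hypothetical and chosen for illustration only; this is not the actual HTML_TITLE_REGEX from title.py, and the `fallback` argument merely stands in for link.base_url.

```python
import re

# Hypothetical pattern for illustration; NOT ArchiveBox's HTML_TITLE_REGEX.
# The non-greedy group (.*?) may match zero characters, so an empty tag
# and a single-character title are both handled by the same expression.
TITLE_PATTERN = re.compile(
    r"<title[^>]*>(.*?)</title\s*>",
    re.IGNORECASE | re.DOTALL,
)


def extract_title(html: str, fallback: str) -> str:
    """Return the page title, or `fallback` if it is missing or empty.

    `fallback` plays the role of link.base_url in the commit message.
    """
    match = TITLE_PATTERN.search(html)
    if match is None:
        return fallback
    # Collapse internal whitespace and strip the ends.
    title = " ".join(match.group(1).split())
    # An empty (or whitespace-only) title falls back as well.
    return title or fallback
```

Under these assumptions, `extract_title("<title></title>", "example.com")` returns the fallback and `extract_title("<title>A</title>", "example.com")` returns `"A"`, which matches the behavior the commit message describes.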
__init__.py just use out_dir 2023-05-29 10:03:49 +02:00
archive_org.py enforce utf8 on literally all file operations because windows sucks 2021-03-27 01:16:29 -04:00
dom.py After a timeout, chrome will leave behind a SingletonLock, which prevents future instances of chrome from starting. When an extractor fails due to a timeout, remove this file. 2023-08-28 17:27:03 +02:00
favicon.py Add FAVICON_PROVIDER option for custom favicon service 2023-05-05 20:42:36 -05:00
git.py Refactor should_save_extractor methods to accept overwrite parameter 2021-01-21 15:56:32 -06:00
headers.py Refactor should_save_extractor methods to accept overwrite parameter 2021-01-21 15:56:32 -06:00
media.py Don't be strict on unicode errors 2022-09-12 20:40:45 +00:00
mercury.py improve readability and mercury error handling and fix output path to be relative 2021-02-16 15:53:11 -05:00
pdf.py After a timeout, chrome will leave behind a SingletonLock, which prevents future instances of chrome from starting. When an extractor fails due to a timeout, remove this file. 2023-08-28 17:27:03 +02:00
readability.py remove unused import 2022-02-09 10:48:51 +08:00
screenshot.py After a timeout, chrome will leave behind a SingletonLock, which prevents future instances of chrome from starting. When an extractor fails due to a timeout, remove this file. 2023-08-28 17:27:03 +02:00
singlefile.py add CHROME_TIMEOUT args 2023-03-14 20:29:41 +09:00
title.py Fix HTML title parsing bugs. 2023-10-09 02:00:01 -05:00
wget.py add timezone support, tons of CSS and layout improvements, more detailed snapshot admin form info, ability to sort by recently updated, better grid view styling, better table layouts, better dark mode support 2021-04-10 04:21:36 -04:00