archivebox

Ben Muthalaly 77917e9b55 Fix HTML title parsing bugs. This slightly modifies the HTML_TITLE_REGEX to fix two parsing errors. The first occurred when title tags were empty (e.g. "<title></title>") which was parsed as "</title". The second occurred when titles were a single character (e.g. "<title>A</title>") which was not matched by the regex, and so would fall back to link.base_url. Now when tags are empty, it falls back to link.base_url, and single character titles are parsed correctly. The way the regex works now is still a bit wonky for some edge cases. I couldn't find any cases of incorrect behavior, but it still might be worth reworking more completely for robustness.		2023-10-09 02:00:01 -05:00
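
The two edge cases named in the commit message above (an empty title tag and a single-character title) can be illustrated with a minimal sketch. The pattern below is a hypothetical stand-in, not ArchiveBox's actual HTML_TITLE_REGEX, and `parse_title` and its `base_url` argument are names invented for this example; the fallback value stands in for link.base_url.

```python
import re

# Hypothetical pattern for illustration only; this is NOT the real HTML_TITLE_REGEX.
# The lazy group (.*?) may match zero characters, so "<title></title>" yields ""
# instead of swallowing part of the closing tag, and "<title>A</title>" yields "A".
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title\s*>", re.IGNORECASE | re.DOTALL)


def parse_title(html: str, base_url: str) -> str:
    """Return the page title, or base_url when the title is missing or empty."""
    match = TITLE_RE.search(html)
    title = match.group(1).strip() if match else ""
    # An empty (or whitespace-only) title falls back to the URL.
    return title or base_url


if __name__ == "__main__":
    print(parse_title("<title></title>", "example.com"))
    print(parse_title("<title>A</title>", "example.com"))
```

A regex is kept here only to mirror the commit's approach; as the commit message itself notes, a more complete rework (for example with an HTML parser) may be more robust.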
__init__.py	just use out_dir	2023-05-29 10:03:49 +02:00
archive_org.py	enforce utf8 on literally all file operations because windows sucks	2021-03-27 01:16:29 -04:00
dom.py	After a timeout, chrome will leave behind a SingletonLock, which prevents future instances of chrome from starting. When an extractor fails due to a timeout, remove this file.	2023-08-28 17:27:03 +02:00
favicon.py	Add FAVICON_PROVIDER option for custom favicon service	2023-05-05 20:42:36 -05:00
git.py	Refactor `should_save_extractor` methods to accept `overwrite` parameter	2021-01-21 15:56:32 -06:00
headers.py	Refactor `should_save_extractor` methods to accept `overwrite` parameter	2021-01-21 15:56:32 -06:00
media.py	Don't be strict on unicode errors	2022-09-12 20:40:45 +00:00
mercury.py	improve readability and mercury error handling and fix output path to be relative	2021-02-16 15:53:11 -05:00
pdf.py	After a timeout, chrome will leave behind a SingletonLock, which prevents future instances of chrome from starting. When an extractor fails due to a timeout, remove this file.	2023-08-28 17:27:03 +02:00
readability.py	remove unused import	2022-02-09 10:48:51 +08:00
screenshot.py	After a timeout, chrome will leave behind a SingletonLock, which prevents future instances of chrome from starting. When an extractor fails due to a timeout, remove this file.	2023-08-28 17:27:03 +02:00
singlefile.py	add CHROME_TIMEOUT args	2023-03-14 20:29:41 +09:00
title.py	Fix HTML title parsing bugs.	2023-10-09 02:00:01 -05:00
wget.py	add timezone support, tons of CSS and layout improvements, more detailed snapshot admin form info, ability to sort by recently updated, better grid view styling, better table layouts, better dark mode support	2021-04-10 04:21:36 -04:00
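
The SingletonLock cleanup described in the dom.py, pdf.py, and screenshot.py entries can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the code from those files: `remove_singleton_lock` and `run_with_lock_cleanup` are hypothetical helpers, and the lock is assumed to sit directly inside the Chrome user data directory passed in as `user_data_dir`.

```python
import subprocess
from pathlib import Path


def remove_singleton_lock(user_data_dir: str) -> bool:
    """Delete a leftover SingletonLock; return True if one was removed.

    Assumption: the lock lives at <user_data_dir>/SingletonLock.
    """
    lock = Path(user_data_dir) / "SingletonLock"
    try:
        # unlink() removes the entry itself, including a dangling symlink.
        lock.unlink()
        return True
    except FileNotFoundError:
        return False


def run_with_lock_cleanup(cmd, user_data_dir: str, timeout: float):
    """Run a command; on timeout, clear the lock so later runs can start."""
    try:
        return subprocess.run(cmd, capture_output=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        # subprocess.run has already killed the child at this point.
        remove_singleton_lock(user_data_dir)
        raise
```

Re-raising the timeout keeps the failure visible to the caller, so the extractor can still be reported as failed while the next Chrome launch is no longer blocked by the stale lock.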