From 71ca0b27a86b9c2c8ba6807aba274a3f3d166f3d Mon Sep 17 00:00:00 2001
From: Nick Sweeting
Date: Thu, 4 Jan 2024 13:47:21 -0800
Subject: [PATCH] minor formatting and wording improvements to README

---
 README.md | 27 ++++++++++++++++++---------
 1 file changed, 18 insertions(+), 9 deletions(-)

diff --git a/README.md b/README.md
index dfde64fb..596abfff 100644
--- a/README.md
+++ b/README.md
@@ -623,7 +623,8 @@ Installing directly on **Windows without Docker or WSL/WSL2/Cygwin is not offici
 ## Archive Layout
-All of ArchiveBox's state (including the SQLite DB, archived assets, config, logs, etc.) is stored in a single folder called the "ArchiveBox Data Folder". Data folders can be created anywhere (`~/archivebox` or `$PWD/data` as seen in our examples), and you can create more than one for different collections.
+All of ArchiveBox's state (including the SQLite DB, archived assets, config, logs, etc.) is stored in a single folder called the "ArchiveBox Data Folder".
+Data folders can be created anywhere (`~/archivebox` or `$PWD/data` as seen in our examples), and you can create more than one for different collections.
@@ -633,7 +634,7 @@ All of ArchiveBox's state (including the SQLite DB, archived assets, config, log
 All `archivebox` CLI commands are designed to be run from inside an ArchiveBox data folder, starting with `archivebox init` to initialize a new collection inside an empty directory.
 ```bash
-mkdir ~/archivebox && cd ~/archivebox
+mkdir ~/archivebox && cd ~/archivebox # just an example, can be anywhere
 archivebox init
 ```
@@ -719,11 +720,12 @@ The paths in the static exports are relative, make sure to keep them next to you
+If you're importing pages with private content or URLs containing secret tokens you don't want public (e.g Google Docs, paywalled content, unlisted videos, etc.), **you may want to disable some of the extractor methods to avoid leaking that content to 3rd party APIs or the public**.
+
 Click to expand...
-If you're importing pages with private content or URLs containing secret tokens you don't want public (e.g Google Docs, paywalled content, unlisted videos, etc.), **you may want to disable some of the extractor methods to avoid leaking that content to 3rd party APIs or the public**.
 ```bash
 # don't save private content to ArchiveBox, e.g.:
@@ -757,11 +759,12 @@ archivebox config --set CHROME_BINARY=chromium # ensure it's using Chromium
 ### Security Risks of Viewing Archived JS
+Be aware that malicious archived JS can access the contents of other pages in your archive when viewed. Because the Web UI serves all viewed snapshots from a single domain, they share a request context and **typical CSRF/CORS/XSS/CSP protections do not work to prevent cross-site request attacks**. See the [Security Overview](https://github.com/ArchiveBox/ArchiveBox/wiki/Security-Overview#stealth-mode) page and [Issue #239](https://github.com/ArchiveBox/ArchiveBox/issues/239) for more details.
+
 Click to expand...
-Be aware that malicious archived JS can access the contents of other pages in your archive when viewed. Because the Web UI serves all viewed snapshots from a single domain, they share a request context and **typical CSRF/CORS/XSS/CSP protections do not work to prevent cross-site request attacks**. See the [Security Overview](https://github.com/ArchiveBox/ArchiveBox/wiki/Security-Overview#stealth-mode) page and [Issue #239](https://github.com/ArchiveBox/ArchiveBox/issues/239) for more details.
 ```bash
 # visiting an archived page with malicious JS:
@@ -790,12 +793,12 @@ The admin UI is also served from the same origin as replayed JS, so malicious pa
 ### Working Around Sites that Block Archiving
+For various reasons, many large sites (Reddit, Twitter, Cloudflare, etc.) actively block archiving or bots in general. There are a number of approaches to work around this.
+
 Click to expand...
-For various reasons, many large sites (Reddit, Twitter, Cloudflare, etc.) actively block archiving or bots in general. There are a number of approaches to work around this.
-
 - Set [`CHROME_USER_AGENT`, `WGET_USER_AGENT`, `CURL_USER_AGENT`](https://github.com/ArchiveBox/ArchiveBox/wiki/Configuration#curl_user_agent) to impersonate a real browser (instead of an ArchiveBox bot)
 - Set up a logged-in browser session for archiving using [`CHROME_DATA_DIR` & `COOKIES_FILE`](https://github.com/ArchiveBox/ArchiveBox/wiki/Chromium-Install#setting-up-a-chromium-user-profile)
 - Rewrite your URLs before archiving to swap in an alternative frontend thats more bot-friendly e.g.
@@ -810,11 +813,13 @@ In the future we plan on adding support for running JS scripts during archiving
 ### Saving Multiple Snapshots of a Single URL
+ArchiveBox appends a hash with the current date `https://example.com#2020-10-24` to differentiate when a single URL is archived multiple times.
+
 Click to expand...
-First-class support for saving multiple snapshots of each site over time will be [added eventually](https://github.com/ArchiveBox/ArchiveBox/issues/179) (along with the ability to view diffs of the changes between runs). For now **ArchiveBox is designed to only archive each unique URL with each extractor type once**. The workaround to take multiple snapshots of the same URL is to make them slightly different by adding a hash:
+Because ArchiveBox uniquely identifies snapshots by URL, it must use a workaround to take multiple snapshots of the same URL (otherwise they would show up as a single Snapshot entry). It makes the URLs of repeated snapshots unique by adding a hash with the archive date at the end:
 ```bash
 archivebox add 'https://example.com#2020-10-24'
@@ -822,7 +827,9 @@ archivebox add 'https://example.com#2020-10-24'
 archivebox add 'https://example.com#2020-10-25'
 ```
-The Re-Snapshot Button button in the Admin UI is a shortcut for this hash-date workaround.
+The Re-Snapshot Button button in the Admin UI is a shortcut for this hash-date multi-snapshotting workaround.
+
+Improved support for saving multiple snapshots of a single URL without this hash-date workaround will be [added eventually](https://github.com/ArchiveBox/ArchiveBox/issues/179) (along with the ability to view diffs of the changes between runs).
 #### Learn More
@@ -835,11 +842,13 @@ The