diff --git a/README.md b/README.md
index 02c4b2a5..b70f3d54 100644
--- a/README.md
+++ b/README.md
@@ -83,7 +83,7 @@ ls ./archive/*/index.json # or browse directly via the filesyste
-
+
## Key Features
@@ -106,7 +106,7 @@ ls ./archive/*/index.json # or browse directly via the filesyste
-### Quickstart
+# Quickstart
**🖥 Supported OSs:** Linux/BSD, macOS, Windows (w/ Docker, WSL/WSL2) **🎮 CPU Architectures:** amd64, x86, arm8, arm7 (raspi >=3)
@@ -337,22 +337,19 @@ archivebox config --set PUBLIC_ADD_VIEW=False
## Dependencies
-You don't need to install all the dependencies, ArchiveBox will automatically enable the relevant modules based on whatever you have available, but it's recommended to use the official [Docker image](https://github.com/ArchiveBox/ArchiveBox/wiki/Docker) with everything preinstalled.
+You don't need to install all the dependencies; ArchiveBox will automatically enable the relevant modules based on whatever you have available. For the best experience, though, it's recommended to use the official [Docker image](https://github.com/ArchiveBox/ArchiveBox/wiki/Docker) with everything preinstalled.
-If you so choose, you can also install ArchiveBox and its dependencies directly on any Linux or macOS systems using the [system package manager](https://github.com/ArchiveBox/ArchiveBox/wiki/Install) and the `archivebox setup` command.
+You can also install ArchiveBox and its dependencies using your [system package manager](https://github.com/ArchiveBox/ArchiveBox/wiki/Install) or `pip` directly on any Linux or macOS system, or on Windows (advanced users only).
```bash
# install archivebox with your system package manager
# apt/brew/pip/etc install ... (see Quickstart instructions above)
-# run the setup to auto install all the extractors and extras
-archivebox setup
-
-# see information about all the dependencies
-archivebox --version
+archivebox setup # auto install all the extractors and extras
+archivebox --version # see info and versions of installed dependencies
```
-ArchiveBox is written in Python 3 so it requires `python3` and `pip3` available on your system. It also uses a set of optional, but highly recommended external dependencies for archiving sites: `wget` (for plain HTML, static files, and WARC saving), `chromium` (for screenshots, PDFs, JS execution, and more), `youtube-dl` (for audio and video), `git` (for cloning git repos), and `nodejs` (for readability, mercury, and singlefile), and more.
+ArchiveBox is written in Python 3, so it requires `python3` and `pip3` to be available on your system when not using Docker. The optional dependencies used for archiving sites include `wget` (for plain HTML, static files, and WARC saving), `chromium` (for screenshots, PDFs, JS execution, and more), `youtube-dl` (for audio and video), `git` (for cloning git repos), `nodejs` (for readability, mercury, and singlefile), and more.
@@ -368,6 +365,7 @@ ArchiveBox supports many input formats for URLs, including Pocket & Pinboard exp
- [Pocket](https://getpocket.com/export), [Pinboard](https://pinboard.in/export/), [Instapaper](https://www.instapaper.com/user/export), [Shaarli](https://shaarli.readthedocs.io/en/master/Usage/#importexport), [Delicious](https://www.groovypost.com/howto/howto/export-delicious-bookmarks-xml/), [Reddit Saved](https://github.com/csu/export-saved-reddit), [Wallabag](https://doc.wallabag.org/en/user/import/wallabagv2.html), [Unmark.it](http://help.unmark.it/import-export), [OneTab](https://www.addictivetips.com/web/onetab-save-close-all-chrome-tabs-to-restore-export-or-import/), [and more...](https://github.com/ArchiveBox/ArchiveBox/wiki/Quickstart#2-get-your-list-of-urls-to-archive)
```bash
+# archivebox add --help
echo 'http://example.com' | archivebox add
archivebox add 'https://example.com/some/page'
archivebox add < ~/Downloads/firefox_bookmarks_export.html
@@ -410,25 +408,21 @@ All of ArchiveBox's state (including the index, snapshot data, and config file)
It does everything out-of-the-box by default, but you can disable or tweak [individual archive methods](https://github.com/ArchiveBox/ArchiveBox/wiki/Configuration) via environment variables or config file.
```bash
+# archivebox config --help
+archivebox config # see all currently configured options
archivebox config --set SAVE_ARCHIVE_DOT_ORG=False
archivebox config --set YOUTUBEDL_ARGS='--max-filesize=500m'
-archivebox config --help
```
The on-disk layout is optimized to be easy to browse by hand and durable long-term. The main index is a standard sqlite3 database (it can also be exported as static JSON/HTML), and the archive snapshots are organized by date-added timestamp in the `archive/` subfolder. Each snapshot subfolder includes a static JSON and HTML index describing its contents, and the snapshot extrator outputs are plain files within the folder (e.g. `media/example.mp4`, `git/somerepo.git`, `static/someimage.png`, etc.)
```bash
# to browse your index statically without running the archivebox server, run:
-archivebox list --html --with-headers > index.html
+archivebox list --html --with-headers > index.html # open index.html to view
archivebox list --json --with-headers > index.json
-# if running these commands with docker-compose, add -T:
-# docker-compose run -T archivebox list ...
-# then open the static index in a browser
-open index.html
-
-# or browse the snapshots via filesystem directly
-ls ./archive//
+# (if using docker-compose, add the -T flag when piping)
+docker-compose run -T archivebox list --csv > index.csv
```
@@ -458,13 +452,13 @@ archivebox config --set CHROME_BINARY=chromium # optional: switch to chromium t
#### Security Risks of Viewing Archived JS
-Be aware that malicious archived JS can also read the contents of other pages in your archive due to snapshot CSRF and XSS protections being imperfect. See the [Security Overview](https://github.com/ArchiveBox/ArchiveBox/wiki/Security-Overview#stealth-mode) page for more details.
+Be aware that malicious archived JS can access the contents of other pages in your archive when viewed. Because the Web UI serves all viewed snapshots from a single domain, they share a request context and typical CSRF/CORS/XSS/CSP protections do not work to prevent cross-site request attacks. See the [Security Overview](https://github.com/ArchiveBox/ArchiveBox/wiki/Security-Overview#stealth-mode) page for more details.
```bash
# visiting an archived page with malicious JS:
https://127.0.0.1:8000/archive/1602401954/example.com/index.html
-# example.com/index.js can now make a request to read everything:
+# example.com/index.js can now make a request to read everything from:
https://127.0.0.1:8000/index.html
https://127.0.0.1:8000/archive/*
# then example.com/index.js can send it off to some evil server
@@ -472,7 +466,7 @@ https://127.0.0.1:8000/archive/*
#### Saving Multiple Snapshots of a Single URL
-Support for saving multiple snapshots of each site over time will be [added soon](https://github.com/ArchiveBox/ArchiveBox/issues/179) (along with the ability to view diffs of the changes between runs). For now ArchiveBox is designed to only archive each URL with each extractor type once. A workaround to take multiple snapshots of the same URL is to make them slightly different by adding a hash:
+Support for saving multiple snapshots of each site over time will be [added eventually](https://github.com/ArchiveBox/ArchiveBox/issues/179) (along with the ability to view diffs of the changes between runs). For now ArchiveBox is designed to only archive each URL with each extractor type once. A workaround to take multiple snapshots of the same URL is to make them slightly different by adding a hash:
```bash
archivebox add 'https://example.com#2020-10-24'
@@ -486,7 +480,9 @@ Because ArchiveBox is designed to ingest a firehose of browser history and bookm
ArchiveBox can use anywhere from ~1gb per 1000 articles, to ~50gb per 1000 articles, mostly dependent on whether you're saving audio & video using `SAVE_MEDIA=True` and whether you lower `MEDIA_MAX_SIZE=750mb`.
-Storage requirements can be reduced by using a compressed/deduplicated filesystem like ZFS/BTRFS, or by turning off extractors methods you don't need.
+Storage requirements can be reduced by using a compressed/deduplicated filesystem like ZFS/BTRFS, or by turning off extractor methods you don't need. Don't store large collections on older filesystems like EXT3/FAT, as they may not be able to handle more than 50k directory entries in the `archive/` folder.
+
+Try to keep the `index.sqlite3` file on a local drive (not a network mount), ideally on an SSD for maximum performance. The `archive/` folder, however, can be on a network mount or a spinning HDD.