# Bookmark Archiver <img src="https://getpocket.com/favicon.ico" height="22px"/> <img src="https://pinboard.in/favicon.ico" height="22px"/> <img src="https://nicksweeting.com/images/bookmarks.png" height="22px"/> [![Twitter URL](https://img.shields.io/twitter/url/http/shields.io.svg?style=social)](https://twitter.com/thesquashSH)

**Browser Bookmarks (Chrome, Firefox, Safari, IE, Opera), Pocket, Pinboard, Shaarli, Delicious, Instapaper, Unmark.it**

(Your own personal Way-Back Machine) [DEMO: sweeting.me/pocket](https://home.sweeting.me/pocket)

Save an archived copy of all websites you star.
Outputs browsable html archives of each site, a PDF, a screenshot, and a link to a copy on archive.org, all indexed in a nice html file.

![](screenshot.png)

## Quickstart

**1. Get your bookmarks:**

Follow the links here to find instructions for exporting bookmarks from each service.

 - [Pocket](https://getpocket.com/export)
 - [Pinboard](https://pinboard.in/export/)
 - [Instapaper](https://www.instapaper.com/user/export)
 - [Shaarli](http://sebsauvage.net/wiki/lib/exe/fetch.php?media=php:php_shaarli:shaarli_cap16_dragbutton.png)
 - [Unmark.it](http://help.unmark.it/import-export)
 - [Chrome Bookmarks](https://support.google.com/chrome/answer/96816?hl=en)
 - [Firefox Bookmarks](https://support.mozilla.org/en-US/kb/export-firefox-bookmarks-to-backup-or-transfer)
 - [Safari Bookmarks](http://i.imgur.com/AtcvUZA.png)
 - [Opera Bookmarks](http://help.opera.com/Windows/12.10/en/importexport.html)
 - [Internet Explorer Bookmarks](https://support.microsoft.com/en-us/help/211089/how-to-import-and-export-the-internet-explorer-favorites-folder-to-a-32-bit-version-of-windows)

 (If any of these links are broken, please submit an issue and I'll fix it)

**2. Create your archive:**

```bash
git clone https://github.com/pirate/bookmark-archiver
cd bookmark-archiver/
./setup.sh
./archive.py ~/Downloads/bookmark_export.html   # replace this path with the path to your bookmarks export file
```

You can open `service/index.html` to view your archive.  (Favicons will appear next to each title once they have finished downloading.)

If you have any trouble, see the [Troubleshooting](#troubleshooting) section at the bottom.

## Manual Setup

If you don't like `sudo` running random setup scripts off the internet (which you shouldn't), you can follow these manual setup instructions.

**1. Install dependencies:** `chromium >= 59`, `wget >= 1.16`, `python3 >= 3.5`  (`google-chrome >= 59` also works well)

If you already have Google Chrome installed, or wish to use that instead of Chromium, follow the [Google Chrome Instructions](#google-chrome-instructions).
```bash
# On Mac:
brew cask install chromium  # If you already have Google Chrome/Chromium in /Applications/, skip this command
brew install wget python3

echo -e '#!/bin/bash\n/Applications/Chromium.app/Contents/MacOS/Chromium "$@"' > /usr/local/bin/chromium-browser  # see instructions for google-chrome below
chmod +x /usr/local/bin/chromium-browser
```

```bash
# On Ubuntu/Debian:
apt install chromium-browser python3 wget
```

```bash
# Check that everything worked:
chromium-browser --version && which wget && which python3 && which curl && echo "[√] All dependencies installed."
```

**2. Get your bookmark export file:**

Follow the instruction links above in the "Quickstart" section to download your bookmarks export file.

**3. Run the archive script:**

1. Clone this repo: `git clone https://github.com/pirate/bookmark-archiver`
2. `cd bookmark-archiver/`
3. `./archive.py ~/Downloads/bookmarks_export.html`

You may optionally pass a parser name after the export file, as in `archive.py export.html [pocket|pinboard|bookmarks]`, to enforce the use of a specific link parser.
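
As a rough illustration of what forcing a parser means, the sketch below returns the parser named on the command line when one is given and otherwise falls back to guessing from the file contents. The function names and the detection heuristics are hypothetical and are not taken from `archive.py`.

```python
# Parser names mirror the values accepted on the command line.
PARSERS = ("pocket", "pinboard", "bookmarks")


def guess_parser(export_text):
    """Very rough format guess; these heuristics are illustrative assumptions."""
    if export_text.lstrip().startswith("["):
        return "pinboard"   # assumes a JSON-list style export
    if "pocket" in export_text.lower():
        return "pocket"
    return "bookmarks"      # fall back to Netscape-format browser exports


def choose_parser(argv, export_text):
    """Return the forced parser if one was passed, else guess from the file."""
    if len(argv) > 2:
        forced = argv[2].lower()
        if forced not in PARSERS:
            raise ValueError("unknown parser: %r" % forced)
        return forced
    return guess_parser(export_text)
```

Forcing the parser is mostly useful when the guess picks the wrong format for an unusual export file.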

If you have any trouble, see the [Troubleshooting](#troubleshooting) section at the bottom.

## Details

`archive.py` is a script that takes a [Pocket-format](https://getpocket.com/export), [Pinboard-format](https://pinboard.in/export/), or [Netscape-format](https://msdn.microsoft.com/en-us/library/aa753582(v=vs.85).aspx) bookmark export file, and turns it into a browsable archive that you can store locally or host online.

The archiver produces a folder like `pocket/` containing an `index.html`, and archived copies of all the sites,
organized by starred timestamp.  It's powered by [headless](https://developers.google.com/web/updates/2017/04/headless-chrome) Chromium and good ol' `wget`.
NEW: Also submits each link to save on archive.org!

For each site it saves:

 - wget of site, e.g. `en.wikipedia.org/wiki/Example.html` with .html appended if not present
 - `screenshot.png` 1440x900 screenshot of site using headless chrome
 - `output.pdf` Printed PDF of site using headless chrome
 - `archive.org.txt` A link to the saved site on archive.org
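
To make the layout concrete, here is a minimal sketch that lists the files you might expect for one link. The `archive/<timestamp>/` folder structure is an assumption inferred from the example URL in "Publishing Your Archive", and `wget_output_path` only approximates wget's naming (it ignores query strings and content types).

```python
from urllib.parse import urlparse


def wget_output_path(url):
    """Approximate where wget saves a page, appending .html if not present."""
    parsed = urlparse(url)
    path = parsed.path or "/"
    if path.endswith("/"):
        path += "index.html"          # directory-style URLs are saved as index.html
    elif not path.endswith((".html", ".htm")):
        path += ".html"
    return parsed.netloc + path


def expected_outputs(timestamp, url, archive_dir="archive"):
    """List the per-site files described above (folder layout is assumed)."""
    folder = "{}/{}".format(archive_dir, timestamp)
    return [
        "{}/{}".format(folder, wget_output_path(url)),
        folder + "/screenshot.png",
        folder + "/output.pdf",
        folder + "/archive.org.txt",
    ]
```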

**Estimated Runtime:** 

I've found it takes about an hour to download 1000 articles, and they'll take up roughly 1GB.
Those numbers are from running it single-threaded on my i5 machine with a 50 Mbps downlink.  YMMV.  Users have also reported
running it with 50k+ bookmarks with success (though it will take more RAM while running).
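
For a back-of-the-envelope estimate, you can scale those figures linearly. This is only a sketch: it assumes runtime and disk usage grow proportionally with the number of links, which may not hold for your connection or the sites you save.

```python
def estimate(num_links, links_per_hour=1000, gb_per_1000_links=1.0):
    """Linearly extrapolate from ~1000 links per hour and ~1GB per 1000 links."""
    hours = num_links / links_per_hour
    gigabytes = num_links / 1000 * gb_per_1000_links
    return hours, gigabytes


hours, gigabytes = estimate(50000)
print("about {:.0f} hours and {:.0f} GB".format(hours, gigabytes))
```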

## Configuration

You can tweak parameters via environment variables, or by editing `archive.py` directly:
```bash
env RESOLUTION=1440,900 FETCH_PDF=False ./archive.py ~/Downloads/bookmarks_export.html
```

 - Archive methods: `FETCH_WGET`, `FETCH_PDF`, `FETCH_SCREENSHOT`, `FETCH_FAVICON`, `SUBMIT_ARCHIVE_DOT_ORG` values: [`True`]/`False`
 - Screenshot: `RESOLUTION` values: [`1440,900`]/`1024,768`/`...`
 - Outputted Files: `ARCHIVE_PERMISSIONS` values: [`755`]/`644`/`...`
 - Path to Chrome: `CHROME_BINARY` values: [`chromium-browser`]/`/usr/local/bin/chromium-browser`/`...`
 - Path to wget: `WGET_BINARY` values: [`wget`]/`/usr/local/bin/wget`/`...`

 (See defaults & more at the top of `archive.py`)
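
If you're curious how environment overrides like these can work, here is a minimal sketch. The variable names and defaults come from the list above, but `load_config` and its parsing rules are assumptions for illustration, not the actual code in `archive.py`.

```python
import os

# Defaults as documented above; the real values live at the top of archive.py.
DEFAULTS = {
    "FETCH_WGET": "True",
    "FETCH_PDF": "True",
    "FETCH_SCREENSHOT": "True",
    "FETCH_FAVICON": "True",
    "SUBMIT_ARCHIVE_DOT_ORG": "True",
    "RESOLUTION": "1440,900",
    "ARCHIVE_PERMISSIONS": "755",
    "CHROME_BINARY": "chromium-browser",
    "WGET_BINARY": "wget",
}

BOOLEAN_KEYS = {
    "FETCH_WGET", "FETCH_PDF", "FETCH_SCREENSHOT",
    "FETCH_FAVICON", "SUBMIT_ARCHIVE_DOT_ORG",
}


def load_config(environ=None):
    """Merge environment overrides into the defaults and parse typed values."""
    environ = os.environ if environ is None else environ
    config = {}
    for key, default in DEFAULTS.items():
        raw = environ.get(key, default)
        if key in BOOLEAN_KEYS:
            config[key] = raw.strip().lower() == "true"   # anything else counts as False
        elif key == "RESOLUTION":
            width, height = raw.split(",")
            config[key] = (int(width), int(height))
        else:
            config[key] = raw
    return config
```

Calling `load_config({"FETCH_PDF": "False"})` would turn off PDF output while leaving every other setting at its default.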

You can also tweak the outputted html index in `index_template.html`.  It just uses python
format strings (not a proper templating engine like jinja2), which is why the CSS is double-bracketed `{{...}}`.
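
Here's a tiny example of why the doubling is needed. With `str.format`, single braces mark placeholders, so literal CSS braces have to be written as `{{` and `}}`. The template and field names below are made up for illustration and are not the ones in `index_template.html`.

```python
# Literal CSS braces are doubled; {title} and {num_links} are placeholders.
TEMPLATE = """<style>
  body {{ font-family: sans-serif; }}
</style>
<h1>{title}</h1>
<p>{num_links} links archived</p>"""


def render(title, num_links):
    """Fill the placeholders; doubled braces come out as single braces."""
    return TEMPLATE.format(title=title, num_links=num_links)


print(render("My Archive", 3))
```

If you add CSS or inline JS to the template with single braces, `format` will raise a `KeyError`, `IndexError`, or `ValueError` because it tries to treat the brace contents as a placeholder.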

## Publishing Your Archive

The archive is suitable for serving on your personal server. You can upload the
archive to `/var/www/pocket` and allow people to access your saved copies of sites.


Just stick this in your nginx config to properly serve the wget-archived sites:

```nginx
location /pocket/ {
    alias       /var/www/pocket/;
    index       index.html;
    autoindex   on;
    try_files   $uri $uri/ $uri.html =404;
}
```

Make sure you're not running any content as CGI or PHP. You only want to serve static files!

URLs look like: `https://sweeting.me/archive/archive/1493350273/en.wikipedia.org/wiki/Dining_philosophers_problem`
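
The `try_files $uri $uri/ $uri.html =404` line is what lets extension-less URLs like the one above resolve to the `.html` files that wget saved. The sketch below mimics that lookup order in Python against a set of known paths. It is a simplified model for illustration, not how nginx is implemented, and it ignores the `alias` mapping.

```python
def resolve_try_files(uri, files, dirs):
    """Mimic `try_files $uri $uri/ $uri.html =404` against known paths."""
    if uri in files:
        return 200, uri                      # $uri: exact file match
    if uri.rstrip("/") in dirs:
        return 200, uri.rstrip("/") + "/"    # $uri/: directory (index.html or autoindex)
    if uri + ".html" in files:
        return 200, uri + ".html"            # $uri.html: page saved by wget with .html appended
    return 404, None


files = {"/pocket/archive/1493350273/en.wikipedia.org/wiki/Example.html"}
dirs = {"/pocket/archive/1493350273"}
print(resolve_try_files("/pocket/archive/1493350273/en.wikipedia.org/wiki/Example", files, dirs))
```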

## Info & Motivation

This is basically an open-source version of [Pocket Premium](https://getpocket.com/premium) (which you should consider paying for!).
I got tired of sites I saved going offline or changing their URLs, so I started
archiving a copy of them locally, similar to the Way-Back Machine provided
by [archive.org](https://archive.org).  Self-hosting your own archive allows you to save
PDFs & screenshots of dynamic sites in addition to static html, something archive.org doesn't do.

Now I can rest soundly knowing important articles and resources I like won't disappear off the internet.

My published archive as an example: [sweeting.me/pocket](https://home.sweeting.me/pocket).

## Security WARNING & Content Disclaimer

Hosting other people's site content has security implications for your domain. Make sure you understand
the dangers of hosting other people's CSS & JS files [on your domain](https://developer.mozilla.org/en-US/docs/Web/Security/Same-origin_policy).  It's best to put this on a domain
of its own to slightly mitigate [CSRF attacks](https://en.wikipedia.org/wiki/Cross-site_request_forgery).

You may also want to blacklist your archive in your `/robots.txt` so that search engines don't index
the content on your domain.
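
As a sketch of what such a rule could look like, the snippet below defines a `robots.txt` that disallows the `/pocket/` path used in the nginx example and checks it with Python's standard `urllib.robotparser`. The `example.com` URLs are placeholders, and well-behaved crawlers honoring the file is an assumption; it is not an access control.

```python
from urllib.robotparser import RobotFileParser

# Ask all crawlers to stay out of the archive path (path taken from the nginx example).
ROBOTS_TXT = """\
User-agent: *
Disallow: /pocket/
"""


def is_crawlable(url, robots_txt=ROBOTS_TXT, user_agent="*"):
    """Return True if the given robots.txt allows user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_crawlable("https://example.com/pocket/archive/1493350273/"))
print(is_crawlable("https://example.com/blog/"))
```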

Be aware that some sites you archive may not allow you to rehost their content publicly for copyright reasons.
It's up to you to host responsibly and respond to takedown requests appropriately.

## Google Chrome Instructions

I recommend Chromium instead of Google Chrome, since it's open source and doesn't send your data to Google.
Chromium may have some issues rendering some sites though, so you're welcome to try Google Chrome instead.
It's also easier to use Google Chrome if you already have it installed, rather than downloading Chromium all over.

1. Install Google Chrome and make sure a `google-chrome` executable is on your `$PATH`:

```bash
# On Mac:
# If you already have Google Chrome in /Applications/, skip this brew command
brew cask install google-chrome
brew install wget python3

echo -e '#!/bin/bash\n/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome "$@"' > /usr/local/bin/google-chrome
chmod +x /usr/local/bin/google-chrome
```

```bash
# On Linux:
wget -q -O - https://dl-ssl.google.com/linux/linux_signing_key.pub | sudo apt-key add -
sudo sh -c 'echo "deb [arch=amd64] http://dl.google.com/linux/chrome/deb/ stable main" >> /etc/apt/sources.list.d/google-chrome.list'
apt update; apt install google-chrome-beta python3 wget
```

2. Set the environment variable `CHROME_BINARY` to `google-chrome` before running:

```bash
env CHROME_BINARY=google-chrome ./archive.py ~/Downloads/bookmarks_export.html
```
If you're having any trouble trying to set up Google Chrome or Chromium, see the Troubleshooting section below.

## Troubleshooting

### Dependencies

**Python:**

On some Linux distributions the python3 package might not be recent enough.
If this is the case for you, resort to installing a recent enough version manually.
```bash
add-apt-repository ppa:fkrull/deadsnakes && apt update && apt install python3.6
```
If you still need help, [the official Python docs](https://docs.python.org/3.6/using/unix.html) are a good place to start.

**Chromium/Google Chrome:**

`archive.py` depends on being able to access a `chromium-browser`/`google-chrome` executable.  The executable used
defaults to `chromium-browser` but can be manually specified with the environment variable `CHROME_BINARY`:
```bash
env CHROME_BINARY=/usr/local/bin/chromium-browser ./archive.py ~/Downloads/bookmarks_export.html
```

1. Test to make sure you have Chrome on your `$PATH` with:
```bash
which chromium-browser || which google-chrome
```
If no executable is displayed, follow the setup instructions to install and link one of them.

2. If a path is displayed, the next step is to check that it's runnable:
```bash
chromium-browser --version || google-chrome --version
```
If no version is displayed, try the setup instructions again, or confirm that you have permission to access chrome.

3. If a version is displayed and it's `>=59`, make sure `archive.py` is running the right one:
```bash
env CHROME_BINARY=/path/from/step/1/chromium-browser ./archive.py bookmarks_export.html   # replace the path with the one you got from step 1
```
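
The same three checks can be scripted. This is a minimal sketch using only the standard library; the helper names are hypothetical and it is not part of `archive.py`. It assumes the version output looks like `Chromium 59.0.3071.109`, with the version as the first dotted number.

```python
import shutil
import subprocess


def find_chrome(candidates=("chromium-browser", "google-chrome")):
    """Step 1: return the first candidate found on $PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


def parse_major_version(output):
    """Pull the major version out of text like 'Chromium 59.0.3071.109'."""
    for token in output.split():
        head = token.split(".")[0]
        if head.isdigit():
            return int(head)
    return None


def chrome_major_version(binary):
    """Steps 2 and 3: run `<binary> --version`; return None if it cannot run."""
    try:
        result = subprocess.run(
            [binary, "--version"],
            stdout=subprocess.PIPE, stderr=subprocess.DEVNULL,
            timeout=10, check=True,
        )
    except (OSError, subprocess.SubprocessError):
        return None
    return parse_major_version(result.stdout.decode(errors="replace"))
```

If `find_chrome()` returns a path and `chrome_major_version(path)` is at least 59, pass that path as `CHROME_BINARY`.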

**Wget & Curl:**

If you're missing `wget` or `curl`, simply install them using `apt` or your package manager of choice.
See the "Manual Setup" instructions for more details.

### Archiving

**No links parsed from export file:**

Please open an [issue](https://github.com/pirate/bookmark-archiver/issues) with a description of where you got the export, and
preferably your export file attached (you can redact the links).  We'll fix the parser to support your format.

**Lots of skipped sites:**

If you have already run the archiver once, it won't re-download those sites on later runs. It only downloads new links.
If you haven't already run it, make sure you have a working internet connection and that the parsed URLs look correct.
You can check the `archive.py` output or `index.html` to see what links it's downloading.

If you're still having issues, try deleting or moving the `service/archive` folder and running `archive.py` again.
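
To picture why deleting the folder forces a fresh download, here is a hypothetical sketch of skip logic. It assumes each link gets its own folder named after its timestamp inside the archive directory; the actual check in `archive.py` may differ.

```python
import os


def links_to_download(links, archive_dir):
    """Keep only links whose timestamp folder does not exist yet (assumed layout)."""
    return [
        link for link in links
        if not os.path.isdir(os.path.join(archive_dir, link["timestamp"]))
    ]
```

Under that assumption, removing or moving the archive folder makes every link look new again, so the next run fetches everything.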

**Lots of errors:**

Make sure you have all the dependencies installed and that you're able to visit the links from your browser normally.
Open an [issue](https://github.com/pirate/bookmark-archiver/issues) with a description of the errors if you're still having problems.

**Lots of broken links from the index:**

Not all sites can be effectively archived with each method. That's why it's best to use a combination of `wget`, PDFs, and screenshots.
If it seems like more than 10-20% of sites in the archive are broken, open an [issue](https://github.com/pirate/bookmark-archiver/issues)
with some of the URLs that failed to be archived and I'll investigate.

### Hosting the Archive

If you're having issues trying to host the archive via nginx, make sure you already have nginx running with SSL.
If you don't, search around; there are plenty of tutorials to help get that set up.  Open an [issue](https://github.com/pirate/bookmark-archiver/issues)
if you have a problem with a particular nginx config.

## TODO

 - body text extraction using [fathom](https://hacks.mozilla.org/2017/04/fathom-a-framework-for-understanding-web-pages/)
 - auto-tagging based on important extracted words
 - audio & video archiving with `youtube-dl`
 - full-text indexing with elasticsearch
 - video closed-caption downloading for full-text indexing video content
 - automatic text summaries of articles with a summarization library
 - feature image extraction
 - http support (from my https-only domain)
 - try wgetting dead sites from archive.org (https://github.com/hartator/wayback-machine-downloader)

**Live Updating:** (coming soon... maybe...)

It's possible to pull links via the pocket API or public pocket RSS feeds instead of downloading an html export.
Once I write a script to do that, we can stick this in `cron` and have it auto-update on its own.

For now you just have to download `ril_export.html` and run `archive.py` each time it updates. The script
will run fast subsequent times because it only downloads new links that haven't been archived already.

## Links

 - [Hacker News Discussion](https://news.ycombinator.com/item?id=14272133)
 - [Reddit r/selfhosted Discussion](https://www.reddit.com/r/selfhosted/comments/69eoi3/pocket_stream_archive_your_own_personal_wayback/)
 - [Reddit r/datahoarder Discussion](https://www.reddit.com/r/DataHoarder/comments/69e6i9/archive_a_browseable_copy_of_your_saved_pocket/)
 - https://wallabag.org + https://github.com/wallabag/wallabag
 - https://webrecorder.io/
 - https://github.com/ikreymer/webarchiveplayer#auto-load-warcs
 - [Shaarchiver](https://github.com/nodiscc/shaarchiver) very similar project that archives Firefox, Shaarli, or Delicious bookmarks and all linked media, generating a markdown/HTML index