Update README.md

Nick Sweeting, 2017-05-29 18:53:08 -05:00 (commit b2312a088d, parent 8e52c52fef)

# Pocket/Pinboard/Browser Bookmark Website Archiver <img src="https://getpocket.com/favicon.ico" height="22px"/> <img src="https://pinboard.in/favicon.ico" height="22px"/> [![Twitter URL](https://img.shields.io/twitter/url/http/shields.io.svg?style=social)](https://twitter.com/thesquashSH)
(Your own personal Way-Back Machine) [DEMO: sweeting.me/pocket](https://home.sweeting.me/pocket)
Save an archived copy of all websites you star using Pocket, Pinboard, or Browser bookmarks.
Outputs browsable html archives of each site, a PDF, a screenshot, and a link to a copy on archive.org, all indexed in a nice html file.
NEW: Also submits each link to save on archive.org!
## Quickstart
`archive.py` is a script that takes a [Pocket](https://getpocket.com/export), [Pinboard](https://pinboard.in/export/), or [Browser Bookmark](https://support.google.com/chrome/answer/96816?hl=en) html export file, and turns it into a browsable archive that you can store locally or host online.
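For example, to archive a Pocket export saved in the current directory:
```bash
./archive.py pocket_export.html pocket   # see below for how to install dependencies
```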
**1. Install dependencies:** `google-chrome >= 59`, `wget >= 1.16`, `python3 >= 3.5` ([chromium](https://www.chromium.org/getting-involved/download-chromium) >= v59 also works well, yay open source!)
```bash
# On Mac:
brew install Caskroom/versions/google-chrome-canary wget python3
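# make a google-chrome wrapper script that launches the Canary binary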
echo -e '#!/bin/bash\n/Applications/Google\ Chrome\ Canary.app/Contents/MacOS/Google\ Chrome\ Canary "$@"' > /usr/local/bin/google-chrome
chmod +x /usr/local/bin/google-chrome
# On Linux:
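# add Google's signing key and package repo, then install chrome + the other deps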
wget -q -O - https://dl-ssl.google.com/linux/linux_signing_key.pub | sudo apt-key add -
sudo sh -c 'echo "deb [arch=amd64] http://dl.google.com/linux/chrome/deb/ stable main" >> /etc/apt/sources.list.d/google-chrome.list'
sudo apt update; sudo apt install google-chrome-beta python3 wget
# Check:
google-chrome --version && which wget && which python3 && echo "[√] All dependencies installed."
```
**2. Run the archive script:**
1. Download your export file e.g. `ril_export.html` from https://getpocket.com/export
2. Clone the repo `git clone https://github.com/pirate/pocket-archive-stream`
3. `cd pocket-archive-stream/`
4. `./archive.py ~/Downloads/ril_export.html [pocket|pinboard|bookmarks]`
It produces a folder `pocket/` containing an `index.html`, and archived copies of all the sites,
organized by timestamp. For each site it saves:

- a browsable html archive
- a PDF
- a screenshot
- a link to a copy on archive.org
You can tweak parameters like screenshot size, file paths, timeouts, etc. in `archive.py`.
You can also tweak the outputted html index in `index_template.html`. It just uses python
format strings (not a proper templating engine like jinja2), which is why the CSS is double-bracketed `{{...}}`.
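To illustrate, here's a hypothetical two-line template snippet (`{url}` and `{title}` are made-up field names; the real ones live in `index_template.html`):
```html
<style>body {{ background: #eee; }}</style>  <!-- doubled braces end up as literal CSS braces -->
<a href="{url}">{title}</a>                  <!-- single braces get filled in by str.format() -->
```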
**Estimated Runtime:** I've found it takes about an hour to download 1000 articles, and they'll take up roughly 1GB.
Those numbers are from running it single-threaded on my i5 machine with 50mbps down. YMMV.
**Troubleshooting:**
On some Linux distributions the python3 package might not be recent enough.
If that's the case for you, install a newer version manually, e.g. from the deadsnakes PPA:
```bash
sudo add-apt-repository ppa:fkrull/deadsnakes && sudo apt update && sudo apt install python3.6
```
If you still need help, [the official Python docs](https://docs.python.org/3.6/using/unix.html) are a good place to start.
To switch from Google Chrome to chromium, change the `CHROME_BINARY` variable at the top of `archive.py`.
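For example, something like this one-liner (an untested sketch: it assumes the variable is assigned as `CHROME_BINARY = '...'` and that your chromium binary is named `chromium-browser`):
```bash
# GNU sed; on macOS use `sed -i ''` instead of `sed -i`
sed -i "s/^CHROME_BINARY = .*/CHROME_BINARY = 'chromium-browser'/" archive.py
```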
If you're missing `wget` or `curl`, simply install them using `apt` or your package manager of choice.
**Live Updating:** (coming soon... maybe...)
It's possible to pull links via the pocket API or public pocket RSS feeds instead of downloading an html export.
Once I write a script to do that, we can stick this in `cron` and have it auto-update on its own.
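Until then, a hypothetical crontab entry might look something like this (path and schedule are placeholders, and you'd still need to refresh the export file yourself):
```bash
# re-run the archiver against the latest export every night at 4am
0 4 * * * cd /path/to/pocket-archive-stream && ./archive.py ~/Downloads/ril_export.html pocket
```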
Now I can rest soundly knowing important articles and resources I like won't disappear.
My published archive as an example: [sweeting.me/pocket](https://home.sweeting.me/pocket).
## Security WARNING & Content Disclaimer
Hosting other people's site content has security implications for your domain; make sure you understand
the dangers of hosting other people's CSS & JS files [on your domain](https://developer.mozilla.org/en-US/docs/Web/Security/Same-origin_policy). It's best to put the archive on a domain
of its own to slightly mitigate [CSRF attacks](https://en.wikipedia.org/wiki/Cross-site_request_forgery).
You may also want to blacklist your archive in your `/robots.txt` so that search engines don't index
the content on your domain.
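For example (assuming the archive is served under `/pocket/`, like the demo above):
```
# /robots.txt
User-agent: *
Disallow: /pocket/
```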
Be aware that some sites you archive may not allow you to rehost their content publicly for copyright reasons;
it's up to you to host responsibly and respond to takedown requests appropriately.
## TODO
- body text extraction using [fathom](https://hacks.mozilla.org/2017/04/fathom-a-framework-for-understanding-web-pages/)
- automatic text summaries of articles with a summarization library
- feature image extraction
- http support (from my https-only domain)
- try wgetting dead sites from archive.org (https://github.com/hartator/wayback-machine-downloader)
## Links
- [Hacker News Discussion](https://news.ycombinator.com/item?id=14272133)
- [Reddit r/selfhosted Discussion](https://www.reddit.com/r/selfhosted/comments/69eoi3/pocket_stream_archive_your_own_personal_wayback/)
- [Reddit r/datahoarder Discussion](https://www.reddit.com/r/DataHoarder/comments/69e6i9/archive_a_browseable_copy_of_your_saved_pocket/)
- https://wallabag.org + https://github.com/wallabag/wallabag
- https://webrecorder.io/
- https://github.com/ikreymer/webarchiveplayer#auto-load-warcs