ArchiveBox
The open-source self-hosted web archive.
▶️
Quickstart |
Demo |
Github |
Documentation |
Info & Motivation |
Community |
Roadmap
"Your own personal internet archive" (网站存档 / 爬虫)
💥 Attention: Big API changes are coming soon (including a proper config file format and pip install archivebox
)! Check out v0.4 and help us test it! 💥
Note: There are some important security design issues that need to be fixed before v0.4 can be pushed, all help is appreciated!
(This project is not abandoned, it's my primary side-project for the forseeable future, my day job is very busy right now.)
See the v0.4 release PR for more information.
**ArchiveBox takes a list of website URLs you want to archive, and creates a local, static, browsable HTML clone of the content from those websites (it saves HTML, JS, media files, PDFs, images and more).**
You can use it to preserve access to websites you care about by storing them locally offline. ArchiveBox imports lists of URLs, renders the pages in a headless, autheticated, user-scriptable browser, and then archives the content in multiple redundant common formats (HTML, PDF, PNG, WARC) that will last long after the originals disappear off the internet. It automatically extracts assets and media from pages and saves them in easily-accessible folders, with out-of-the-box support for extracting git repositories, audio, video, subtitles, images, PDFs, and more.
#### How does it work?
```bash
echo 'http://example.com' | ./archive
```
After installing the dependencies, just pipe some new links into the `./archive` command to start your archive.
ArchiveBox is written in Python 3.5 and uses wget, Chrome headless, youtube-dl, pywb, and other common unix tools to save each page you add in multiple redundant formats. It doesn't require a constantly running server or backend, just open the generated `output/index.html` in a browser to view the archive. It can import and export links as JSON (among other formats), so it's easy to script or hook up to other APIs. If you run it on a schedule and import from browser history or bookmarks regularly, you can sleep soundly knowing that the slice of the internet you care about will be automatically preserved in multiple, durable long-term formats that will be accessible for decades (or longer).