Considering different strategies for archiving web pages
I'm working on a project that involves archiving web pages for posterity, to fight the problem of link rot.
There are already sites like archive.org and archive.today. But the problem with them is that they save the content as their crawlers see it, not as you see it.
Here are my requirements:
- Should be able to archive content through the server
- Should be able to archive content through a browser extension, as the user sees it
- Should be able to archive both static and dynamic pages
- Should be able to remove scripts to prevent random JS from getting stored in the archive
Let's look at a few different strategies to achieve this.
1. Using wget
wget is the simplest option of them all: it can download a page along with its linked assets like images and stylesheets. But it doesn't execute JavaScript the way a browser does, so it can't archive dynamic pages.
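For static pages, wget's standard single-page mirroring flags do the job. As a sketch (the helper name is mine, and it only assembles the command so nothing runs until you pass it to `subprocess.run`):

```python
def archive_with_wget(url: str, dest_dir: str) -> list[str]:
    """Build a wget command that mirrors one page with its assets.

    Sketch only: run the returned list with subprocess.run(cmd, check=True),
    assuming wget is on PATH.
    """
    return [
        "wget",
        "--page-requisites",    # also fetch images, CSS, JS the page references
        "--convert-links",      # rewrite links so the local copy is browsable
        "--adjust-extension",   # save HTML with a .html extension
        "--span-hosts",         # page requisites often live on CDNs/other hosts
        "--directory-prefix", dest_dir,
        url,
    ]
```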
2. Using a WARC recorder
WARC is a web archive format that archive.org also uses for its own archives. While there are tools to record WARCs on the server side, recording them on the client side is complicated. The only Chrome extension I found, WARCreate, didn't work for me.
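The format itself is simple enough to sketch by hand: each record is a block of `WARC-*` headers followed by a payload (for a `response` record, the raw HTTP response). A minimal stdlib sketch of one uncompressed record, just to show the on-disk shape (a real archiver should use a library like warcio, which also handles gzip and digests):

```python
import uuid
from datetime import datetime, timezone

def warc_response_record(target_uri: str, http_payload: bytes) -> bytes:
    """Build a single uncompressed WARC 'response' record by hand."""
    headers = (
        f"WARC/1.1\r\n"
        f"WARC-Type: response\r\n"
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>\r\n"
        f"WARC-Date: {datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')}\r\n"
        f"WARC-Target-URI: {target_uri}\r\n"
        f"Content-Type: application/http; msgtype=response\r\n"
        f"Content-Length: {len(http_payload)}\r\n"
        f"\r\n"
    ).encode()
    # A record ends with two CRLFs after the payload
    return headers + http_payload + b"\r\n\r\n"

# Example: wrap a raw HTTP response captured elsewhere
payload = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html>hi</html>"
record = warc_response_record("https://example.com/", payload)
```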
3. Using a headless browser (server) + content script (client)
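In a setup like this, the content script would capture the rendered DOM (e.g. `document.documentElement.outerHTML`) and send it to the server, which can then strip `<script>` tags before storage, per the requirement above. A hedged sketch of just that stripping step using Python's stdlib `html.parser` (names are mine; it doesn't cover every HTML edge case, and a production system would use a proper sanitizer):

```python
from html.parser import HTMLParser

def _fmt_attrs(attrs):
    # Re-serialize attributes; does not re-escape quotes (sketch only)
    return "".join(f' {k}="{v}"' if v is not None else f" {k}" for k, v in attrs)

class ScriptStripper(HTMLParser):
    """Rebuilds HTML while dropping <script> elements and their contents."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []
        self.in_script = 0  # depth counter, in case of stray nested tags

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script += 1
        elif not self.in_script:
            self.out.append(f"<{tag}{_fmt_attrs(attrs)}>")

    def handle_startendtag(self, tag, attrs):
        if tag != "script" and not self.in_script:
            self.out.append(f"<{tag}{_fmt_attrs(attrs)}/>")

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = max(0, self.in_script - 1)
        elif not self.in_script:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.in_script:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.in_script:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if not self.in_script:
            self.out.append(f"&#{name};")

def strip_scripts(html: str) -> str:
    p = ScriptStripper()
    p.feed(html)
    p.close()
    return "".join(p.out)
```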