Considering different strategies for archiving web pages

I'm working on a project that involves archiving web pages for posterity, to fight the problem of link rot.

There are already sites like archive.org and archive.today, but the problem with them is that they save the content as they see it, not as you see it.

Here are my requirements:

  • Should be able to archive content through the server
  • Should be able to archive content through a browser extension, as the user sees it
  • Should be able to archive both static and dynamic pages
  • Should be able to strip scripts, so arbitrary JavaScript doesn't end up stored in the archive

Let's look at a few different strategies to achieve this.

1. Using wget

wget is the simplest option of them all, and it can download all the linked site data like images and stylesheets along with the page. The problem is that it doesn't execute JavaScript the way a browser does, so it can't archive dynamic pages.
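
For the static case, a server could simply shell out to wget. Here's a minimal sketch of that, assuming a Node backend; the function name, output directory, and URL are placeholders of my own.

```typescript
// Sketch: invoke wget from a Node/TypeScript server process.
// The flags fetch one page plus its images/stylesheets and rewrite links
// so the local copy works offline.
import { execFile } from "node:child_process";

function archiveWithWget(url: string, outDir: string): Promise<void> {
  return new Promise((resolve, reject) => {
    execFile(
      "wget",
      [
        "--page-requisites",  // also download images, CSS, etc.
        "--convert-links",    // rewrite links to point at the local copies
        "--adjust-extension", // add .html/.css extensions where missing
        "--span-hosts",       // follow asset links to CDNs on other hosts
        "--directory-prefix", outDir,
        url,
      ],
      (error) => (error ? reject(error) : resolve())
    );
  });
}

// Usage: archiveWithWget("https://example.com/article", "./archive");
```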

2. Using a WARC recorder

WARC is a web archive format, also used by archive.org for saving their archives. There are tools for recording WARCs on the server side, but recording them on the client side is complicated: the only Chrome extension I found, WARCreate, didn't work for me.
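
To give a sense of the format, here's a rough, simplified sketch of what a single WARC "response" record looks like. This is not spec-complete; real tools (e.g. wget's --warc-file option) also write request and warcinfo records, digests, and gzip compression.

```typescript
// Sketch: build one WARC response record (header block, blank line, payload).
import { randomUUID } from "node:crypto";

function buildWarcResponseRecord(targetUri: string, httpResponse: Buffer): Buffer {
  const headers = [
    "WARC/1.0",
    "WARC-Type: response",
    `WARC-Target-URI: ${targetUri}`,
    `WARC-Date: ${new Date().toISOString()}`,
    `WARC-Record-ID: <urn:uuid:${randomUUID()}>`,
    "Content-Type: application/http; msgtype=response",
    `Content-Length: ${httpResponse.length}`,
  ].join("\r\n");

  // Headers, a blank line, the captured HTTP response, then a record separator.
  return Buffer.concat([
    Buffer.from(headers + "\r\n\r\n"),
    httpResponse,
    Buffer.from("\r\n\r\n"),
  ]);
}
```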

3. Using a headless browser (server) + content script (client)

A headless browser is a full browser engine that can run without a GUI. Puppeteer is the best option here: it drives Chromium, so it should render pages exactly like Chromium would. Once the page has rendered, we can capture the DOM and remove any JavaScript, which gives us a working snapshot of any static page and most dynamic ones (heavily interactive apps like Google Maps wouldn't work). On the client side, a content script in a browser extension can do the same thing to the page exactly as the user sees it.
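
Here's a minimal sketch of the server-side capture with Puppeteer. It assumes puppeteer is installed; the function name and the choice of "networkidle0" as the wait condition are my own, and error handling is omitted for brevity.

```typescript
// Render the page headlessly, strip <script> tags from the live DOM,
// then serialize the result to an HTML string.
import puppeteer from "puppeteer";

async function capturePage(url: string): Promise<string> {
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: "networkidle0" });

    // Remove scripts so no JavaScript ends up in the archived copy.
    await page.evaluate(() => {
      document.querySelectorAll("script").forEach((el) => el.remove());
    });

    return await page.content(); // serialized DOM after rendering
  } finally {
    await browser.close();
  }
}
```

The client-side version is even simpler, since the page is already rendered in the user's tab. A hypothetical helper in the extension's content script could look like this:

```typescript
// Clone the DOM the user is currently looking at, drop <script> elements,
// and serialize it for upload to the archive server.
function captureCurrentPage(): string {
  const clone = document.documentElement.cloneNode(true) as HTMLElement;
  clone.querySelectorAll("script").forEach((el) => el.remove());
  return "<!DOCTYPE html>\n" + clone.outerHTML;
}
```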