How to detect if a page is an article

If you've ever used a service like Pocket or Instapaper, you will have used a feature usually called "Reader View" or "Readability". Safari and Firefox provide this too, both powered by the now-defunct Readability parser code.

What it does is take a page, detect which part of the page is the actual article content, and strip out everything else (menus, sidebars, headers, footers, ads, and so on), leaving just the article, which you can then read in a clean and peaceful layout.

But how do you detect if a page has an article in the first place?

The easiest way, of course, is to check whether the page uses the HTML5 <article> element. But things are rarely that easy: many article pages don't use it, and some pages use it for content that isn't an article.
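As a minimal sketch of the <article> check, assuming you already have a DOM document (from the browser, or from a server-side DOM library), it could look like the following. The helper name hasArticleElement is made up for illustration and is not part of any library.

```javascript
// Hypothetical helper (not from any library): returns true if the
// document contains at least one <article> element.
function hasArticleElement(document) {
    // querySelector returns null when nothing matches the selector
    return document.querySelector("article") !== null;
}
```

In a browser you could call hasArticleElement(document) directly. Keep in mind that this only checks for the presence of the element, so it inherits the weaknesses described above.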

Another way to do this detection is to check that the meta[property='og:type'] tag has content="article", like so:

<meta property="og:type" content="article">

This is a tag from the Open Graph protocol, which pages use to describe themselves to social networks and other consumers.
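A rough sketch of the og:type check, again assuming a DOM document is available. The helper name hasArticleOgType is hypothetical, and the trimming and lowercasing of the value are my own assumption about how lenient you would want to be, not something the Open Graph tag itself requires.

```javascript
// Hypothetical helper: returns true if the page declares
// <meta property="og:type" content="article">.
function hasArticleOgType(document) {
    const meta = document.querySelector("meta[property='og:type']");
    if (meta === null) {
        return false; // no og:type tag at all
    }
    // getAttribute returns null when the attribute is missing
    const content = meta.getAttribute("content") || "";
    // Assumption: tolerate stray whitespace and different casing
    return content.trim().toLowerCase() === "article";
}
```

Pages that declare a different og:type (for example "website") or no og:type at all will return false here, even if they do contain an article.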

A lot of Reader View parsers instead rely on the meta[property='og:image'] tag existing, and classify any page that has this tag as an article. I find this rather weird, because:

  1. Non-article pages can have og:image thumbnails
  2. Article pages do not absolutely need to have that thumbnail

The third way is to make use of the Mozilla Readability library, which provides an isProbablyReaderable function.

At the time of writing, the Mozilla Readability library was not published on npm (it has since been released there as @mozilla/readability), so to install it you need to pin it to a specific commit from GitHub:

npm install --save "mozilla/readability#878545f64d4cc389bbc4031a130ccbaf425c7211"

Also, the isProbablyReaderable function requires a DOM document to be passed to it, since it's designed to be used in a browser environment and not a server environment. We can use jsdom to generate a DOM document from some raw HTML text.

npm install --save jsdom

This is how you use the isProbablyReaderable function.

const { JSDOM } = require("jsdom");
const { isProbablyReaderable } = require("readability/Readability-readerable");

const html = "<html><head><title>Hello World</title></head></html>";
const pageUrl = "http://www.example.com";

// we need to pass the page URL to resolve relative links properly
const docDOM = new JSDOM(html, {
    url: pageUrl
});

// isProbablyReaderable returns a boolean directly, not a promise
const res = isProbablyReaderable(docDOM.window.document);
// res will be true / false
console.log(res);
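The three checks described in this post could also be combined into a single function. The following is only a sketch under a few assumptions: the name looksLikeArticle is hypothetical, the order of the checks is my own choice, and isProbablyReaderable is passed in as an argument and assumed to return a plain boolean rather than a promise.

```javascript
// Hypothetical helper combining the three heuristics from this post.
// `isProbablyReaderable` is passed in so this function does not depend
// on how the Readability library was installed or required.
function looksLikeArticle(document, isProbablyReaderable) {
    // 1. Explicit markup: an <article> element is present.
    if (document.querySelector("article") !== null) {
        return true;
    }

    // 2. Open Graph metadata: og:type is set to "article".
    const ogType = document.querySelector("meta[property='og:type']");
    if (ogType !== null) {
        const content = ogType.getAttribute("content") || "";
        if (content.trim().toLowerCase() === "article") {
            return true;
        }
    }

    // 3. Fall back to Readability's content-based heuristic.
    // Assumption: the function returns a boolean synchronously.
    return Boolean(isProbablyReaderable(document));
}
```

Treating any single positive signal as enough is a deliberate simplification. A stricter variant could require two of the three signals to agree.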

It's important to note that all of these techniques are heuristics, and none of them is guaranteed to be 100% accurate. Right now no technique can be, because of the wide variability in the HTML used by different pages and sites on the internet.