Many modern browsers have a “reading mode” which, when activated, attempts to automatically extract the content from a page and present it in a more reader-friendly layout. In Firefox, this is called “Reader View”, and can be found as a little book icon in the URL bar or under View > Enter Reader View.
I personally find this to be particularly useful when browsing on my mobile phone, as it allows me to scale the text size to my heart’s content without fiddling with zoom and side-scrolling. There’s even a “low-light reading” mode, which shows white text on a black background.
This “Reader View” is not available on every page, however. On some pages, it might not make any sense to have the option, because there is little content, because the content is very layout-dependent, or simply because the algorithm in charge of extracting the content from the page can’t tell where the main content of the page is.
Being a web developer, I naturally want to know how the content-grabbing algorithm works, so that my websites can take full advantage of Reader View. I have found it to be a nice trade-off between developing a whole mobile-dedicated version of a site and ignoring mobile devices completely, especially when revamping an old website where rebuilding from scratch really isn’t an option.
Firefox’s Reader View
To get an idea of how these algorithms work and how to optimize my websites for them, I had a look at how Firefox does it. I read some of the source code of Readability.js, which is a stand-alone version of the library used by Firefox’s Reader View.
The algorithm is centered around paragraph tags. First, it tries to identify parts of the page which are definitely not content, like forms, and removes them. Then it loops through the paragraph nodes on the page and assigns each one a score based on, to quote the source, “how content-y they look”. In other words, it gives them points for things like the number of commas, the length of the text, or class names which are likely to indicate main content. Incidentally, a paragraph with fewer than 25 characters is immediately discarded.
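To make the scoring idea more concrete, here is a minimal sketch of what such a function could look like. It is not the code from Readability.js: the `ParagraphInfo` type, the two regular expressions, and every weight (one point per comma, a capped length bonus, plus or minus 25 for class names) are assumptions chosen for illustration. Only the 25-character cutoff comes from the description above.

```typescript
// Hypothetical, simplified view of a paragraph; not a type from Readability.js.
interface ParagraphInfo {
  text: string;
  className: string;
}

// Illustrative patterns only (assumed); a real library keeps its own lists.
const POSITIVE_HINT = /article|body|content|entry|main|post|text/i;
const NEGATIVE_HINT = /comment|footer|sidebar|sponsor|widget/i;

// Paragraphs shorter than this are discarded, as described above.
const MIN_LENGTH = 25;

// Returns null for a discarded paragraph, otherwise a "content-y" score.
function scoreParagraph(p: ParagraphInfo): number | null {
  const text = p.text.trim();
  if (text.length < MIN_LENGTH) {
    return null;
  }

  let score = 1; // base point for being a paragraph at all (assumed)
  score += text.split(",").length - 1; // one point per comma (assumed weight)
  score += Math.min(Math.floor(text.length / 100), 3); // length bonus, capped at 3 (assumed)
  if (POSITIVE_HINT.test(p.className)) {
    score += 25; // class name suggests main content (assumed weight)
  }
  if (NEGATIVE_HINT.test(p.className)) {
    score -= 25; // class name suggests clutter (assumed weight)
  }
  return score;
}
```

The exact numbers matter less than the shape: long, comma-rich paragraphs inside sensibly named containers score well, and short snippets never enter the race.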
Scores then “bubble up” the DOM tree: each paragraph adds its score to its ancestor nodes. A direct parent gets the full score added to its total, a grandparent only half, a great-grandparent a third, and so on. This allows the algorithm to identify not only paragraphs with a lot of good content, but also higher-level elements which are likely to be the main content section.
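The bubbling step can be sketched in a few lines. This is an illustration of the description above, not the library's implementation: `ScoredNode`, `makeNode`, `bubbleUp` and the `maxDepth` cutoff are hypothetical names and choices, and the exact divisors used by Readability.js may differ between versions.

```typescript
// Hypothetical minimal tree node; a real implementation works on DOM elements.
interface ScoredNode {
  tag: string;
  parent: ScoredNode | null;
  score: number;
}

function makeNode(tag: string, parent: ScoredNode | null = null): ScoredNode {
  return { tag, parent, score: 0 };
}

// Adds a paragraph's score to its ancestors, divided by their distance:
// the parent (distance 1) gets the full score, the grandparent (2) half,
// the great-grandparent (3) a third. The maxDepth cutoff is an assumption.
function bubbleUp(paragraph: ScoredNode, paragraphScore: number, maxDepth = 3): void {
  let ancestor = paragraph.parent;
  for (let distance = 1; ancestor !== null && distance <= maxDepth; distance++) {
    ancestor.score += paragraphScore / distance;
    ancestor = ancestor.parent;
  }
}

// Two paragraphs sharing one wrapper, nested as body > main > div > p.
const body = makeNode("body");
const main = makeNode("main", body);
const wrapper = makeNode("div", main);
const firstParagraph = makeNode("p", wrapper);
const secondParagraph = makeNode("p", wrapper);

bubbleUp(firstParagraph, 6);
bubbleUp(secondParagraph, 12);
```

After both calls the wrapper has collected the full score of both paragraphs, while the nodes further up hold progressively smaller totals. That is how a container close to the text ends up standing out as the main content section.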
I haven’t looked into other browsers yet, but my guess is that if it works well for Firefox, it’ll work well for most others too. After all, they are all pursuing the same goal: extracting the content of a page and serving it to the reader in a nicer format.
So how do I improve my website?
In order for these Reader View algorithms to work for your website, you want them to correctly identify the main content-heavy sections of your page. This means you want the more content-heavy nodes on your page to get high scores in their algorithm. To be fair, they already do a pretty good job of it, so your task is mostly to stay out of their way.
So here are some rules of thumb to improve the quality of the page in the eyes of these algorithms:
- Use paragraph tags in your content! Many people tend to overlook them in favor of <br /> tags. While the result may look similar, many content-related algorithms (not only Reader View ones) rely heavily on paragraph tags.
- Use HTML5 semantic elements in your markup, like <aside>. These help programs reading your page (not just Reader View) distinguish the different sections of your content.
- Wrap your main content in one container, like a <div> element. This container will receive score points from all the paragraph tags inside it, and be identified as the main content section.
- Keep your DOM tree shallow in content-dense areas. If you have a lot of elements breaking your content up, you’re only making life harder for the algorithm: there won’t be a single element that stands out as the parent of many content-heavy paragraphs, only many separate ones with low scores.
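For the first rule, old content that separates its paragraphs with double <br /> tags can often be migrated mechanically. Below is a minimal sketch, assuming the input is a simple block of text where two or more consecutive <br /> tags mark a paragraph break. The `brToParagraphs` helper is a hypothetical name, and a regular expression like this is only suitable for simple legacy snippets, not for arbitrary HTML.

```typescript
// Splits a legacy text block on runs of two or more <br> tags and wraps each
// resulting chunk in <p>. A single <br> inside a chunk is kept as a line break.
function brToParagraphs(html: string): string {
  return html
    .split(/(?:<br\s*\/?>\s*){2,}/i)
    .map((chunk) => chunk.trim())
    .filter((chunk) => chunk.length > 0)
    .map((chunk) => `<p>${chunk}</p>`)
    .join("\n");
}

const legacy = "First paragraph.<br /><br />Second paragraph.<br>\n<br>Third one,<br>with a line break.";
const converted = brToParagraphs(legacy);
```

Here `converted` holds three <p> elements, one per original block, with the single <br> in the last one left untouched. Placing the result inside one container element then also takes care of the third rule.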