"Precomposing" a SPA may become the Holy Grail to SEO
The state of the art of making SPAs crawlable is to render HTML snapshots on the server and serve them to crawlers. Rendering those snapshots, however, adds development or service costs, which are only tolerated because SEO is important and there is no better solution yet.
As I found out, this changed when the Googlebot began executing JavaScript: any SPA (easily turned into a "precomposed SPA") can be served in its original form to both humans and crawlers – removing the need for HTML snapshots altogether.
Unfortunately, this approach currently works only with Google. The Bingbot – which is used by Bing and Yahoo – is not as advanced. Hopefully, it will also execute JavaScript in the future, and then we will have the Holy Grail of SEO in our hands.
tl;dr: What is a precomposed SPA? On first load a regular SPA makes a few AJAX calls to get the data / templates needed to render the requested page. Those AJAX calls, however, must be avoided to allow the Googlebot to crawl the page. I suggest doing this by "precomposing" the SPA: based on the route of the requested page, the needed data / templates are inlined into the index.html. This way the SPA skips fetching data / templates and renders the page immediately – making the initial page load even faster at the same time.
State of the art
To make a Single Page Application (SPA) accessible to crawlers, the best solution at this time is to serve HTML snapshots to them. An HTML snapshot is a pure HTML representation of the page that the SPA would render in the browser. There are a few approaches to render the snapshots on the server:
- Render the page in a headless browser: The server can spin up a headless browser like PhantomJS and run the original SPA inside it to render the page that the crawler requested. Once the rendering is completed, the produced HTML page is served to the crawler. On the one hand, this approach has the benefit that the SPA itself doesn't need extra functionality for producing HTML snapshots. On the other hand, the infrastructure has to be built for that, which adds development and testing costs to your project.
- Use a cloud service: Quite a few cloud services have evolved around the first approach, reducing the implementation effort to one line of code that forwards crawler requests to their infrastructure. If your project budget allows you to buy their service, this is certainly the easiest solution.
- Have an isomorphic code base: If JavaScript is used on the server as well (e.g. node.js), you may decide to develop your application logic in an isomorphic way. Then the SPA can be executed on the server even without a headless browser. Although this design decision wouldn't be made just for the sake of SEO, if the code base is isomorphic anyway, this approach is simpler than the first one and cheaper than the second one.
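To illustrate the plumbing all three approaches share, here is a minimal sketch of an Express server that detects crawler requests and answers them with a snapshot. The user agent list and the renderSnapshot stub are assumptions made up for this example; the stub stands in for whatever produces the snapshot – a headless browser, a cloud service, or an isomorphic execution of the SPA.

// Minimal sketch: serve HTML snapshots to crawlers, the regular SPA to humans.
var express = require('express');
var app = express();

// Stand-in for the actual snapshot rendering (headless browser,
// cloud service, or isomorphic execution of the SPA).
function renderSnapshot(url, callback) {
    callback(null, '<!doctype html><html><body>Snapshot of ' + url + '</body></html>');
}

// Incomplete, illustrative list of crawler user agents.
var CRAWLER_UA = /googlebot|bingbot|yahoo|baiduspider/i;

app.use(function (req, res, next) {
    if (CRAWLER_UA.test(req.headers['user-agent'] || '')) {
        renderSnapshot(req.url, function (err, html) {
            if (err) { return next(err); }
            res.send(html); // The crawler gets the pure HTML representation
        });
    } else {
        next(); // Humans get the original SPA
    }
});

app.use(express.static('public'));
app.listen(3000);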
The Googlebot began executing JavaScript
After Google officially announced that their bots execute JavaScript, hopes went up that SPAs are now SEO-friendly out of the box. To test that, I wrote a small SPA based on Angular.js. It shows a list of JavaScript projects based on data retrieved via a REST API. With Google's new Fetch as Google feature, the rendering actually looked as expected:
However, the SPA's content did not show up in Google's search results:
Actually, the Googlebot executes JavaScript only with certain restrictions, which Google did not publish in detail. So I decided to do some reverse engineering. To keep a long story of "guess, upload, ask Google to index, wait, check, repeat" short, I found out that the Googlebot executes JavaScript if one of the following holds:
- A script is loaded via a <script> tag and executed right after being loaded (e.g. via an IIFE),
- A function is bound to and executed on the 'DOMContentLoaded' event, or
- A function is bound to and executed on the window.onload event.
The Googlebot does NOT execute JavaScript if either:
- The code is executed after an AJAX call returns or
- The code is executed after a timeout.
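Expressed in code, my observations boil down to patterns like these (a rough sketch derived from my tests, not an official specification):

// Patterns the Googlebot executed in my tests:
(function () {
    document.title = 'Executed: an IIFE runs right after the script is loaded';
})();

document.addEventListener('DOMContentLoaded', function () {
    // Executed: bound to and run on the 'DOMContentLoaded' event
});

window.onload = function () {
    // Executed: bound to and run on the window.onload event
};

// Patterns the Googlebot did NOT execute in my tests:
var xhr = new XMLHttpRequest();
xhr.onload = function () {
    // Not executed: runs only after the AJAX call returns
};
xhr.open('GET', '/api/projects'); // illustrative endpoint
xhr.send();

setTimeout(function () {
    // Not executed: runs only after a timeout
}, 100);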
Please be aware that this is a rough picture derived from numerous tests, and this behavior might be subject to change. Nonetheless, with that knowledge it was easy to alter the Angular.js based SPA mentioned above to become SEO-friendly:
What I did was turn the SPA into what I call a "precomposed SPA".
Turning a SPA into a "precomposed SPA"
Let us have a look at the initial page load of the original non-SEO-friendly SPA:
The index.html contains several script tags that load the Angular.js sources. The last JavaScript file loaded through a script tag is app.js. It initializes the Angular.js app, which kicks off the page rendering. The app then retrieves the template "list.html" and the JSON data "projects" through AJAX calls to render and populate the page.
As the Googlebot doesn't wait for "list.html" and "projects" to return, those AJAX calls need to be removed. By inlining the template and the data into the index.html, the SPA can skip those AJAX calls and render the page immediately:
Notice that the page loaded twice as fast in this case. You may sell a precomposed SPA for its initial page load performance and get the SEO-friendliness for free – if you are willing to argue this way.
This is how it works in detail
- A crawler or a human requests a page.
- The server analyzes the URL of the requested page, which user is logged in, etc. and deduces
  - which templates will be needed to render the page
  - as well as what data will be needed to render the page.
- The server inlines into the index.html (sketched below)
  - the templates, by loading them from disk and usually adding them as script tags, and
  - the data, by making an internal request, e.g. to the database, and adding a script that sets a global variable.
- The server serves the composed index.html to the crawler or human.
- Finally, the SPA includes some if-then-else logic that checks whether the needed templates and data are already available (inlined) and, if so, skips the AJAX calls.
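In server code the composition could look like the following sketch for Express. The queryProjects helper is an assumption invented for this example, standing in for the internal database request; my serve-spa middleware mentioned below packages the same logic in a reusable form.

// Sketch of the server-side composition for the example app (Express).
var express = require('express');
var fs = require('fs');
var app = express();

// Stand-in for an internal request, e.g. to the database.
function queryProjects(callback) {
    callback(null, [ { name: 'Fire Up!', site: 'https://github.com/analog-nico/fire-up' } ]);
}

app.get('/', function (req, res, next) {
    // The route tells us the page needs partials/list.html and the projects data.
    var index = fs.readFileSync('public/index.html', 'utf8');
    var template = fs.readFileSync('public/partials/list.html', 'utf8');
    queryProjects(function (err, projects) {
        if (err) { return next(err); }
        // Inline the data as a global variable and the template as a script tag.
        var dataScript = '<script>window.initialData = ' + JSON.stringify(projects) + ';</script>';
        var templateScript = '<script type="text/ng-template" id="partials/list.html">' + template + '</script>';
        // Serve the composed index.html to crawler and human alike.
        res.send(index
            .replace('</head>', dataScript + '</head>')
            .replace('<body>', '<body>' + templateScript));
    });
});

app.use(express.static('public'));
app.listen(3000);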
The index.html of the example above is this one:
<!doctype html>
<html lang="en" ng-app="project">
  <head>
    <title>SEO Test</title>
    <script src="/bower_components/angular/angular.min.js"></script>
    <!-- Some more scripts... -->
    <script src="/scripts/app.js"></script>
    <base href="/">
    <!-- This is the inlined data. -->
    <script>
      window.initialData = [{"name":"Fire Up!","site":"https://github.com/analog-nico/fire-up","description":"Fire Up! is a dependency injection container designed specifically for node.js with a powerful but sleek API.","_id":"buxU63zl1diDgm04"},{"name":"Assume.js","site":"https://github.com/analog-nico/assumejs","description":"Assume your node.js production server won't fail. And get notified if you were wrong.","_id":"qPvRiYoj5VeUbasM"},{"name":"serve-spa","site":"https://github.com/analog-nico/serve-spa","description":"Express middleware to serve single page applications in a performant and SEO friendly way","_id":"s8txhcZQatd8o2bb"}];
    </script>
  </head>
  <body>
    <!-- This is the inlined template. -->
    <script type="text/ng-template" id="partials/list.html">
      <input type="text" ng-model="search" class="form-control" placeholder="Search">
      <table class="table table-striped">
        <!-- Some more template markup... -->
      </table>
    </script>
    <div class="container">
      <h1>Open Source Projects</h1>
      <div ng-view></div>
    </div>
  </body>
</html>
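The if-then-else logic that skips the AJAX calls lives in app.js. Here is a rough sketch for the projects list; the controller and endpoint names are illustrative and may differ from the actual sources:

// Rough sketch of the skip logic in app.js: use the inlined data
// if it is there, otherwise fall back to the AJAX call.
angular.module('project').controller('ListCtrl', function ($scope, $http) {
    if (window.initialData) {
        // Precomposed page load: the data was inlined into index.html.
        $scope.projects = window.initialData;
    } else {
        // Regular page load: fetch the data through an AJAX call.
        $http.get('/api/projects').then(function (response) {
            $scope.projects = response.data;
        });
    }
});

The inlined template needs no extra logic at all: Angular.js automatically registers script tags of type text/ng-template in its $templateCache, so the router resolves partials/list.html without an AJAX call.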
I uploaded the full sources to GitHub.
Why I think this may become the Holy Grail of SEO
I understand that crawlers are bound to very limited resources considering the vast number of pages they crawl. And I also understand that it is difficult to do the heavy lifting of executing JavaScript. But I think Google found a sweet spot by executing JavaScript not in full but to a certain extent, and thus was able to pioneer this. Looking into the future, I believe that other crawlers will follow, and with that the need to serve HTML snapshots disappears.
In my opinion just implementing a precomposed SPA has the following benefits:
- The additional implementation effort to develop a SPA as a precomposed SPA is minimal. I actually cannot think of a more minimal approach, considering the extent to which the Googlebot executes JavaScript. (Turning an existing SPA into a precomposed one, though, might involve some more diligence, depending on its current design.)
- The implementation effort is also justified by the faster page loading achieved at the same time. I would argue for investing the money in the performance improvement and getting SEO for free.
- The implementation makes no distinction based on who requested the page. The same page is served to a crawler and a human. Thus hardly any additional testing effort arises. (For 100% test coverage, each page needs to be loaded and checked to verify that the AJAX calls are skipped. IMHO this effort is negligible and can be automated, as shown in the sketch after this list.)
- Just to be clear, this approach is independent of the server-side language.
- The required code changes can be baked into the libraries we use in our stack. E.g. I wrote the middleware "serve-spa" for Express on node.js to handle the server-side logic. Synth is an example of a full-stack framework that does this out of the box, reducing all the efforts listed above to none.
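As an example of the automated check mentioned above, a small script could request each page and assert that data and template were inlined, which implies that the SPA will skip its AJAX calls. The route list is an assumption for this example and should contain all routes of your SPA:

// Sketch of an automated check: request each page and assert that
// the data and the template were inlined into the served HTML.
var http = require('http');
var assert = require('assert');

['/'].forEach(function (route) { // add all routes of your SPA here
    http.get({ host: 'localhost', port: 3000, path: route }, function (res) {
        var html = '';
        res.on('data', function (chunk) { html += chunk; });
        res.on('end', function () {
            assert(html.indexOf('window.initialData') !== -1, route + ': data not inlined');
            assert(html.indexOf('text/ng-template') !== -1, route + ': template not inlined');
            console.log(route + ' looks precomposed');
        });
    });
});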
Are we there yet?
In my opinion, Google, Bing, and Yahoo must be able to crawl a precomposed SPA before I would argue that this approach is ready for production. (Of course you may already use it for its performance gain, but probably not for SEO.) Unfortunately, the Bingbot does not execute JavaScript, and Yahoo's search is powered by Bing. Anyhow, since Bing is competing with Google, I believe it will not take too long until the Bingbot catches up.
To get a 100% solution, we would also have to consider many other crawlers, from search engines targeting specific languages like Baidu to sites that compile link previews like embed.ly. And that, I believe, will take forever if we – the web community – don't give it a push.
I think the best point to start pushing is to specify a contract between crawlers that support JavaScript and websites that depend on JavaScript. Starting with my research for this blog post, I made a draft. Please join me in detailing the "JavaScript-enabled crawling" specification and add your voice by starring the project on GitHub. I appreciate it!