Behind the screens at The Guardian

Serving 1.4 billion page views a month with 40 to 50 developers based out of London.

1 Dec 2019

The Guardian is not a tech startup; it is a 198-year-old British news publisher. Like all the others, it has had to evolve from a traditional print newspaper into a high-traffic, monetised digital property in order to survive the world’s rapid shift to digital media consumption.

In an age where the digital news industry is under increasing pressure to monetise a market that’s grown to expect free information, Guardian News & Media (which also includes the Observer) reported an £800,000 operating profit for its 2018-19 fiscal year. Three years earlier it had recorded an annual loss of £57m. This turnaround was achieved without a paywall, which would have provided additional cost relief through lower bandwidth usage, and 56% of its revenue now comes from digital channels.

Today, the Guardian serves approximately 1.4 billion page views a month with a team of between 40 and 50 software developers, all based out of London. In this post, Richard Beddington, an outgoing software developer, provides some insight into the work and culture of the Guardian’s software development team.

Can you talk about some of the publishing tools that are developed at the Guardian and what technology has been chosen to build them?

Most of our backend services are written in Scala, using the Play framework. We have a set stack at the Guardian but you don’t have to adhere to it. The only thing we’re tightly coupled to is AWS. We’re a big Scala shop; we moved from Java about six or seven years ago. Most of our stuff is built with Play and runs on EC2 instances behind load balancers.

Content API

We have an internal service which is at the core of everything, called CAPI. This is our Content API that anyone can query to access content published by the Guardian. It’s based off Elasticsearch. We have a live stack which feeds the live site and a preview stack that helps drive the tooling as it displays unpublished articles.
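As a rough illustration of what querying the Content API can look like, here is a minimal sketch that only builds a search URL. The base URL and the `q`, `api-key` and `page-size` parameter names are assumptions based on the public Open Platform documentation and should be checked there; `buildSearchUrl` and `SearchParams` are hypothetical names, not part of any Guardian library.

```typescript
// Assumed base URL of the public Content API; verify against the official docs.
const CAPI_BASE = "https://content.guardianapis.com";

interface SearchParams {
  query: string;     // free-text search term (assumed parameter name: q)
  apiKey: string;    // key issued to the caller (assumed parameter name: api-key)
  pageSize?: number; // results per page (assumed parameter name: page-size)
}

// Build a search URL without performing any network request.
function buildSearchUrl(params: SearchParams): string {
  const pairs: string[] = [
    "q=" + encodeURIComponent(params.query),
    "api-key=" + encodeURIComponent(params.apiKey),
  ];
  if (params.pageSize !== undefined) {
    pairs.push("page-size=" + String(params.pageSize));
  }
  return CAPI_BASE + "/search?" + pairs.join("&");
}
```

A client would then fetch this URL and read the JSON response; the response shape is deliberately left out here because the interview does not describe it.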

Most of our content comes from CAPI with the exception of the data that describes the front page layouts. Those are managed by a different service called Facia Tool that sits between the tools team and dotcom.

Web Rendering

The dotcom team are mainly focussed on ingesting data from CAPI and rendering that into the final content that readers see. A lot of our frontend stuff has been written in the Twirl templates that come with Play, but we’re in the process of moving this over to React and a general TypeScript setup. We’re currently working on a project called Dotcom Rendering, which is a Node service that you can throw some data at and it will render an article page, button or other component for you, so that other clients can pull it through. I think we’re serving about 20% of our traffic off of that now, and that’s a React server-rendered tier. We don’t hydrate any of our React at the moment; it’s completely static, because it’s an article page and we don’t have too much dynamic content there.

We wrote a blog post about how we came to the decision to move from a SASS oriented approach to server-side rendering with CSS-in-JS using React and Emotion. We did this to take hold of our CSS, because that’s what was slowing us down on dotcom. Over time our CSS became increasingly difficult to touch and it doesn’t help when you’ve got developer turnover every couple of years. So we’re working with this new approach now.

Then we use Fastly for caching; our hit rate is about 95%. Dotcom Rendering is aiming to take about 100 requests a second to render the cache misses. This stack is pretty solid; it’s been going for a while with a lot of people working on it.
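Taking the two figures quoted above at face value, the relationship between hit rate and origin load is simple arithmetic: only the misses reach the rendering tier. The sketch below is illustrative; the function names are hypothetical, and it assumes every miss results in exactly one origin request.

```typescript
// Requests per second that miss the cache and reach the origin.
function originRate(edgeRequestsPerSecond: number, hitRate: number): number {
  if (hitRate < 0 || hitRate > 1) throw new Error("hitRate must be between 0 and 1");
  return edgeRequestsPerSecond * (1 - hitRate);
}

// Edge traffic that a given origin capacity can absorb at a given hit rate.
function supportedEdgeRate(originCapacityPerSecond: number, hitRate: number): number {
  if (hitRate < 0 || hitRate >= 1) throw new Error("hitRate must be in [0, 1)");
  return originCapacityPerSecond / (1 - hitRate);
}

// 100 origin requests a second at a 95% hit rate is roughly 2000 a second at the edge.
const edgeRate = supportedEdgeRate(100, 0.95);
```

The same arithmetic shows why hit rate matters so much: dropping from 95% to 90% doubles the load on the rendering tier for the same edge traffic.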

Composer

Composer is basically our CMS that is used by journalists and other staff to write content. It was built with Angular a while ago and we’ve recently rebuilt the rich-text editor part of this with ProseMirror and that is essentially what feeds CAPI. Composer sits on its own EC2 servers where you save stuff to a database and that then puts out a Kinesis stream to CAPI, which ingests and transforms that into its own data model and that comes out in a different stream. So basically a whole bunch of streams are talking to each other.

I’d estimate 70 - 80% of our content gets written in Composer. We’ve also integrated it with our print systems so that you can send articles out to Adobe InCopy. So stuff that’s been written in our web application ends up making it into the print edition of our paper. Often the Sports desk does their writing in InCopy first and will then update it in Composer afterwards. Sports results come later than everything else and they’re the last thing that drives the paper.

You published a blog post talking about why you’re building a new rich-text editor, based on ProseMirror, to replace your previous text editor, Scribe. Could you talk about how that transition has gone?

Like any piece of legacy software, we just started encountering limitations and difficulties. There are a few people like Sébastien Cevey and Oliver Ash who developed Scribe about five years ago, and these guys are like the stalwarts of editorial tools; they’ve done a lot of good work on Scribe and also Grid, our image management tool. I remember speaking to Oliver and him saying he never expected Scribe to last as long as it did. Five or six years ago there just weren’t text editors that did the job we needed.

Ultimately, the problem was having HTML as a model. There are so many different ways of representing the same visual content in HTML, and tracking that and adding annotations couldn’t necessarily be part of that model. All this sort of stuff was proving quite difficult to manage at a fundamental level using HTML.

I think there was a case for maybe updating Scribe but it would have needed a whole new layer added to it, which was not rendering to HTML but was rendering to a model that would then be rendered to HTML, and this is what had been done in ProseMirror. I only arrived a couple of months after we decided to use ProseMirror but it seemed like a very sensible decision.
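To make the "model first, HTML second" idea concrete, here is a minimal sketch. It is not ProseMirror's API; the types and function names are hypothetical. The point is that bold italic text can be written in HTML with the two tags nested in either order, whereas a model that stores marks as an unordered set and renders them in a fixed order always produces one canonical output.

```typescript
type Mark = "bold" | "italic";

interface TextNode {
  text: string;
  marks: Mark[]; // the order of this array carries no meaning
}

const TAG: Record<Mark, string> = { bold: "strong", italic: "em" };
const MARK_ORDER: Mark[] = ["bold", "italic"]; // fixed nesting order, outermost first

function escapeHtml(s: string): string {
  return s.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

function renderNode(node: TextNode): string {
  let html = escapeHtml(node.text);
  // Wrap innermost-first by walking the fixed order backwards,
  // so the first mark in MARK_ORDER ends up outermost.
  for (let i = MARK_ORDER.length - 1; i >= 0; i--) {
    const mark = MARK_ORDER[i];
    if (node.marks.indexOf(mark) !== -1) {
      html = "<" + TAG[mark] + ">" + html + "</" + TAG[mark] + ">";
    }
  }
  return html;
}

function renderParagraph(nodes: TextNode[]): string {
  return "<p>" + nodes.map(renderNode).join("") + "</p>";
}
```

Whether a node's marks are listed as `["bold", "italic"]` or `["italic", "bold"]`, `renderParagraph` yields the same HTML, which is what makes diffing, tracking and annotating tractable on the model rather than on the markup.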

I learnt so much from rolling out ProseMirror to our editorial team. We built it as this great new text editor thing, and to us it really felt like that. It solved a lot of small annoying edge case bugs that no one was annoyed with anymore. And so we explained this to the editorial staff. We were like, “this is going to be your new text editor, it’s going to be great”, but the look and feel of the thing hadn’t changed; it was the engine underneath that was changing.

And actually, when we rolled it out, because we didn’t know the full spec of the text editor at the time we were missing features, because there are just so many features in our text editor. So what happened was that we started annoying people, even though we rolled it out behind a switch and asked people to opt-in. We still ended up annoying people because they were like “What’s changed? Nothing’s changed”. It was a bit painful but we got there in the end and now it’s 100% ProseMirror and everything is golden and rosy and you get all the benefits of a more robust foundation that you can build on.

The next phase will be to make the rich-text editor portable so that we can lift it out of our Composer Angular app. Currently, we’ve got ProseMirror at the core, then we’ll have a wrapper around that which will be the Guardian Text Editor. At the moment it just sits in Composer but we want to use it in other places that deal with rich-text content, where currently we have plain textarea fields or other inputs where people have to put in all sorts of weird tags and markup. So the goal is to have a ProseMirror-based standalone editor for our rich-text requirements and then build extensions on that to allow people to inject other special content types.

Media Atom Maker

We also have Media Atom Maker, which allows us to make video atoms. Atoms are little things that you can inject into a piece of content, as a reference to something. So, say you have a reference to a video that you’ve put in twenty different articles. If you were to change that video, all of those articles would reflect that new update. We have media atoms, explainer atoms and so on.

So quite often you’ll see something like “What is Brexit?” and that will be in thirty or forty articles. If somebody makes an important edit or change to that, it’s reflected in all of the Brexit articles where people have referenced that piece of information. So we have this whole atom rendering tier which manages how these are rendered on mobile, web and wherever else they might need to be.
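The atom idea boils down to storing a reference in the article rather than a copy of the content. A minimal sketch, with hypothetical type and function names and a deliberately simplistic text renderer standing in for the real atom rendering tier:

```typescript
interface Atom {
  id: string;
  title: string;
  body: string;
}

// An article is a list of blocks; an atom is embedded by id, never by value.
type Block =
  | { kind: "text"; text: string }
  | { kind: "atomRef"; atomId: string };

interface Article {
  headline: string;
  blocks: Block[];
}

// Resolve atom references at render time against the current atom store.
function renderArticle(article: Article, atoms: Record<string, Atom>): string {
  return article.blocks
    .map((block) => {
      if (block.kind === "text") return block.text;
      const atom = atoms[block.atomId];
      return atom
        ? "[" + atom.title + "] " + atom.body
        : "[missing atom " + block.atomId + "]";
    })
    .join("\n");
}
```

Because articles hold only the atom's id, editing the atom once in the store changes the output of every article that references it the next time it is rendered.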

But getting back to Media Atom Maker, that’s just one of these things that creates a media atom which can then be put into Composer.

We use YouTube as our hosting provider for video. So we upload a video there, and YouTube doesn’t allow you to change a video once uploaded. So when you upload an updated video you can just change the link in the atom and that would reflect in all places where it’s referenced. We’re also able to input a description for that atom and we have captions for that, and potentially at some point we might want to bring our ProseMirror implementation to Media Atom Maker so that you can add certain amounts of rich-text in a way that feels similar to Composer.

Image management

Grid, as mentioned, is our image management tool, which the BBC are also about to take on. It’s basically a tool for ingesting images from Getty Images, Associated Press and the Guardian’s own internal sources so that you can search and query images from different sources.

https://www.youtube.com/watch?v=ZoddCAH9EPE

At the moment we’re busy decoupling it from our own services, like AWS IAM, so that the BBC can use it to inject their own config. I think they’re using it on their Sports desk at the moment and have a couple of developers maintaining it, so we’re getting pull requests back from the BBC, which is epic. We’re speaking to a few other people about Grid too. It’s a really well architected thing that’s built on its own set of about 11 or 12 microservices, like an image resizer and so on.

So we have sub-editors and writers who use Composer, we have Image desk people using Grid, we have video editors using Media Atom Maker, and so if the frontend for Composer goes down, it doesn’t stop people on the image desk and the video desk from doing their work. So this way we get the benefit of targeting a specific desk’s requirements and not having to maintain everything in a single application.

We can still bring Grid into these different applications using iframes so that you can just drag an image into Composer from there. We use iframes a lot for throwing content from these separate services into different contexts, but most of the time people are working in their own app UIs for Composer, Grid, Media Atom Maker and so on, and then we pull it together behind the scenes and it all goes into CAPI.

We have about 30 different tools that the editorial team use. It’s a really nice setup. A lot of that work was done four or five years ago.

Deployments

We’ve got our own internal deployment system called Riff-Raff, which is what deploys all of our stuff. Basically, it has a model of assets and archives and different pieces like that, so we can choose the type of service we want to deploy to. Quite often you’ll have three instances behind a load balancer for any of our services. Riff-Raff will spin up another three that the load balancer isn’t pointing to, and it will ping their health check endpoints to make sure the new instances are running properly. As soon as it gets an OK from those, it points the load balancer at them and shuts down the other servers. And that’s basically our continuous deployment process.
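The deployment flow described above can be sketched as follows. This is not Riff-Raff's code; the interfaces and the `deploy` function are hypothetical, and the cloud operations are injected as plain functions so the logic can be read (and tested) without any AWS dependency.

```typescript
interface Instance {
  id: string;
  version: string;
}

interface LoadBalancer {
  targets: Instance[];
}

// Hypothetical operations that a real system would back with cloud API calls.
interface DeployDeps {
  launch: (version: string, count: number) => Promise<Instance[]>;
  isHealthy: (instance: Instance) => Promise<boolean>;
  terminate: (instances: Instance[]) => Promise<void>;
}

// Start a fresh set of instances, switch traffic only if all pass their
// health check, then shut down the old set. Returns true on success.
async function deploy(lb: LoadBalancer, version: string, deps: DeployDeps): Promise<boolean> {
  const old = lb.targets;
  const fresh = await deps.launch(version, old.length);
  const checks = await Promise.all(fresh.map((instance) => deps.isHealthy(instance)));
  if (!checks.every((ok) => ok)) {
    await deps.terminate(fresh); // abandon the new set, old set keeps serving
    return false;
  }
  lb.targets = fresh; // point the load balancer at the new instances
  await deps.terminate(old);
  return true;
}
```

A useful property of this shape is that a failed deploy never touches the instances that are currently serving traffic.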

Production monitoring

We have this other tool called Prodmon. Basically, it’s a tool that uses Selenium to make sure that it can use Composer to publish a piece of content, check that the content is on CAPI and check that you can visit it on dotcom. It does this every minute and has a webhook into our Prodmon Hangouts channel, so if anything goes wrong with it we get an error notification in there. We aim to know quicker than our users when something has gone wrong.
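A synthetic check of this kind is essentially an ordered list of steps that stops and alerts at the first failure. The sketch below is an assumption about the general shape, not Prodmon's implementation: the `Step` type, `runCheck`, and the `notify` callback (standing in for the chat webhook) are all hypothetical, and the Selenium-driven work is hidden behind each step's `run` function.

```typescript
interface Step {
  name: string;
  run: () => Promise<boolean>; // resolves to true when the step succeeded
}

// Run the steps in order; on the first failure (or thrown error), send one
// notification naming the step and stop. Returns true if every step passed.
async function runCheck(steps: Step[], notify: (message: string) => void): Promise<boolean> {
  for (let i = 0; i < steps.length; i++) {
    const step = steps[i];
    let ok = false;
    try {
      ok = await step.run();
    } catch (e) {
      ok = false; // a thrown error counts as a failed step
    }
    if (!ok) {
      notify("Prodmon check failed at step: " + step.name);
      return false;
    }
  }
  return true;
}
```

In use, the steps would be something like "publish in Composer", "find in CAPI" and "load on dotcom", with the whole check scheduled once a minute.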

So while we do run tests as part of our deployment process, we run Prodmon constantly because we don’t know if there’s potentially some increased load, a problem with Elasticsearch, one of our caches is missing and things like that. Prodmon now uses our Google Auth service and will also check that the emergency login service works if Google Auth is down.

We’ve got a big ELK stack for most of our logs where you can filter by app or anything else. Any session that’s happened or any content that goes through these service pipelines gets its own session ID, so that you can track a session through all of these different things. So you can search by session ID to see what’s happened and where things have failed. And this is all part of the Riff-Raff tooling. It’s no Facebook or Google, but it helps to get stuff seen very quickly.
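The session ID technique is easy to picture with plain data: every service tags its log lines with the same ID, so one filter reconstructs the journey across services. A minimal in-memory sketch follows; the `LogEntry` fields and function names are hypothetical, and a real setup would run the equivalent query in Kibana rather than in application code.

```typescript
interface LogEntry {
  timestamp: number; // milliseconds since epoch
  app: string;       // which service wrote the line
  sessionId: string; // shared across services for one piece of work
  level: "info" | "error";
  message: string;
}

// All entries for one session, across every app, in time order.
function traceSession(logs: LogEntry[], sessionId: string): LogEntry[] {
  return logs
    .filter((entry) => entry.sessionId === sessionId)
    .sort((a, b) => a.timestamp - b.timestamp);
}

// The earliest error in a trace, i.e. where things first went wrong.
function firstFailure(trace: LogEntry[]): LogEntry | undefined {
  for (let i = 0; i < trace.length; i++) {
    if (trace[i].level === "error") return trace[i];
  }
  return undefined;
}
```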

It’s a really resilient process. We do have production outages and things like that, but most of our production outages for dotcom are protected by Fastly, so most of the time it’s just serving from the cache. If you accidentally purge the production cache during an outage, which has happened before, then you’re in a world of pain. But we don’t often serve 500s; most of the time we’ll just serve stale content.
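The "serve stale instead of a 500" behaviour, and why a purge during an outage hurts, can be shown with a small in-memory cache. This is a sketch of the general technique, not Fastly configuration; `createStaleOnErrorCache` and its parameters are hypothetical, and the origin is modelled as a synchronous function for brevity.

```typescript
interface CachedResponse {
  status: number;
  body: string;
}

function createStaleOnErrorCache(ttlMs: number, fetchOrigin: (path: string) => string) {
  let entries: Record<string, { body: string; storedAt: number }> = {};
  return {
    get(path: string, now: number): CachedResponse {
      const cached = entries[path];
      if (cached !== undefined && now - cached.storedAt < ttlMs) {
        return { status: 200, body: cached.body }; // fresh hit
      }
      try {
        const body = fetchOrigin(path);
        entries[path] = { body: body, storedAt: now };
        return { status: 200, body: body };
      } catch (e) {
        // Origin is down: fall back to the expired copy if one exists.
        if (cached !== undefined) return { status: 200, body: cached.body };
        return { status: 500, body: "origin unavailable" };
      }
    },
    // Dropping every entry removes the safety net described above.
    purge(): void {
      entries = {};
    },
  };
}
```

While the origin is failing, expired entries keep being served; after `purge()` there is nothing left to fall back on, so the same request returns a 500.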

All of this is written and maintained internally with a team of about 40 - 50 engineers who are all mostly sat on the same floor of our offices in London. I think collaboration is naturally really good at the Guardian as a side-effect of the proximity that people work in. The rest of the Digital team consists of PMs, UX and so on which makes up the total team size of around 100 people.

The editorial tooling and the cross-team collaboration is one thing that leaves a bit to be desired, only for the fact that we don’t really have enough resources to embed. But stuff still gets done. At the moment our Content API, which drives everything, has only two people working on it, and they’re called the Journalism team because they’re also responsible for a few other bits and pieces. Given that CAPI is the heart of what we do, it’s a bit like, how are there only two people managing this?!

Apps

As for apps, they don’t actually get served off CAPI directly. We have MAPI (the mobile API), which is basically an API to serve the needs of the mobile applications. For example, when loading the Guardian home page in the app, MAPI will be in charge of over-fetching such that, even if you go offline, all of the text content of those articles is available.

Additionally, the mobile teams manage the mobile notifications, something that has been overhauled and massively improved to make sure that we can get notifications to our users before the other news agencies.

Postgres Migration

The Guardian also published a post about migrating from MongoDB to PostgreSQL. How have you found that’s gone?

Yes, so this was the editorial tools team that did this. The primary reason we moved from Mongo to Postgres was cost. We were paying for OpsManager which was the way we were administering our Mongo database and at the end of the day we just wanted to cut that down and bring it back in-house to something a bit more Amazon flavoured and so we decided to move over to Postgres.

Mongo is a document store and essentially difficult to query in certain ways but generally fine. Postgres has got a really good JSON field, and we want to evolve from that in future so that we can improve our querying. Initially we just moved most of our stuff to a JSON column in a Postgres database.

So what does the article database table look like now? Is it mostly just one JSON field?

A good deal of the content that you will actually see landing on an article page, by the time it goes through editorial tools, to the Content API, and all the way through to dotcom, is just one JSON field. We are breaking certain pieces out into fields; that’s potentially been happening over the last few months, but I’m not in the editorial team any more. Again, the hard bit there was the whole model for moving that across. But the main motivator for moving was cost.

We had this really funny issue during that migration. We had a proxy that sat between all requests, that would push all new content to Mongo and Postgres and it turned out that there was a memory leak in that proxy. So there was a load balancer with three instances of this proxy, and every seven or eight hours the proxy would go down because the memory leak would crash these instances, which would then cycle and bring themselves back up again, but you’d have about 10 minutes where they went down and nobody could really save anything which was a bit of a nightmare to be honest.

But, and this is probably my most pragmatic moment at the Guardian, because there were three instances I just restarted one instance after two hours and another instance after about five hours and so they’d all go down at separate times. Before, the memory leak was so consistent that they were going down at the same time. So we spent about two or three days trying to fix this issue and trying to work out where the memory leak was, but we couldn’t profile it to work it out. It was something to do with Akka streams that we were using.

And so I thought, let’s just nudge these things so that they’re all out of sync with each other and for the last two months we used that and the service never went down again because the instances were all just cycling every seven hours or so. It was literally the most pragmatic solve I’ve ever done and it took me about 5 minutes in the console to just restart an instance and then a few hours later just restart another instance and then we never looked at it again. And so that lasted for another two months while we were working on the migration.
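Why the nudge works can be checked with a small simulation. The numbers are assumptions taken loosely from the account above: each instance runs for seven hours, crashes, and takes ten minutes to come back, and the staggered case restarts the second and third instances two and five hours after the first. All names are hypothetical.

```typescript
const UPTIME_MIN = 7 * 60; // minutes an instance survives before the leak crashes it
const RECOVERY_MIN = 10;   // minutes an instance takes to come back up
const CYCLE_MIN = UPTIME_MIN + RECOVERY_MIN;

// True if an instance whose cycle started at startOffset is down at this minute.
function isDown(minute: number, startOffset: number): boolean {
  const phase = (((minute - startOffset) % CYCLE_MIN) + CYCLE_MIN) % CYCLE_MIN;
  return phase >= UPTIME_MIN;
}

// Count the minutes in the horizon during which every instance is down,
// which is when nobody can save anything.
function minutesWithNoInstance(startOffsets: number[], horizonMin: number): number {
  let outage = 0;
  for (let minute = 0; minute < horizonMin; minute++) {
    if (startOffsets.every((offset) => isDown(minute, offset))) outage++;
  }
  return outage;
}

const inSync = minutesWithNoInstance([0, 0, 0], 48 * 60);       // 60 minutes of full outage
const staggered = minutesWithNoInstance([0, 120, 300], 48 * 60); // 0 minutes of full outage
```

Under these assumptions, three synchronised instances produce six ten-minute outages over two days, while the staggered set always has at least one instance up. The leak is still there; it just never takes the whole service down at once.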

A lot of the work in the migration was validating that all of the content was there, so we had a lot of testing which was diffing the two versions, and we had Kibana dashboards telling us how many things were hitting each database and the data sizes they were handling. I think it was pretty seamless when they made the switch. So now we’re fully on Postgres, and I still don’t think as much has been done with it as we’d want, but it’s just a resource thing.
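The validation step amounts to comparing the same documents read from both stores. Here is a minimal sketch of such a diff, assuming each store can be dumped to a map of id to JSON document; the function and type names are hypothetical and this is not the team's actual tooling. Documents are serialised with sorted keys so that field ordering differences between the two databases do not show up as false mismatches.

```typescript
interface DiffReport {
  missingInNew: string[]; // ids present in the old store only
  missingInOld: string[]; // ids present in the new store only
  mismatched: string[];   // ids present in both but with different content
}

// Serialise a JSON-like value with object keys sorted, so key order is ignored.
function canonical(value: unknown): string {
  if (Array.isArray(value)) {
    return "[" + value.map((item) => canonical(item)).join(",") + "]";
  }
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const keys = Object.keys(obj).sort();
    return "{" + keys.map((k) => JSON.stringify(k) + ":" + canonical(obj[k])).join(",") + "}";
  }
  return JSON.stringify(value);
}

function diffStores(
  oldStore: Record<string, unknown>,
  newStore: Record<string, unknown>
): DiffReport {
  const report: DiffReport = { missingInNew: [], missingInOld: [], mismatched: [] };
  Object.keys(oldStore).forEach((id) => {
    if (!(id in newStore)) report.missingInNew.push(id);
    else if (canonical(oldStore[id]) !== canonical(newStore[id])) report.mismatched.push(id);
  });
  Object.keys(newStore).forEach((id) => {
    if (!(id in oldStore)) report.missingInOld.push(id);
  });
  return report;
}
```

An empty report over a representative sample is the kind of evidence that makes a cut-over feel safe; the counts in each list are also natural numbers to put on a dashboard.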

Obviously, when you import from a new article or an old article you have to try and work out how to merge these two pieces of content as best as possible. There’s all sorts of tooling that goes into making sure you’re not overwriting the wrong things. There are so many nuances to think about. It’s a thousand different things that just exist in there and we’ve got to make every single workflow as best as possible. There are contingencies for everything.

Culture

The culture at the Guardian is really just a flat hierarchy of developers. A lead dev can be working with an associate dev, and the associate dev will often be driving. I’ve never been anywhere else where experienced engineers bring their experience to the fore when it’s required but generally let others have the space to decide how to design a solution.

Every single idea is entertained and considered, and somehow meetings are kept succinct. If you’re in a meeting and you realise you aren’t really contributing or adding value then you can just politely excuse yourself to go find somewhere where you can.

There is also zero blame culture. Every time a mistake is made, the team doesn’t ask “Why did you do that?” it’s more a case of “What in our system has allowed you to do that?”. For example, if a junior publishes some broken code that 500’s on dotcom, the reaction to that is “How was this even possible that a dev could do that? Don’t worry, we’ll revert and we’ll put some stuff in place to make sure that that can’t happen again”.

Coding in the open

There’s an important distinction we make between open source and coding in the open. We practise the latter, in the sense that open source has this idea of good documentation and the ability to easily set things up locally. Whereas for us, the code for dotcom is completely on GitHub and you can browse all the way through it, but you probably couldn’t set it up too easily because of its coupling to AWS. Without spending a fair bit of money to get similar things set up, you couldn’t actually run it and do anything with it. So that’s a key component of coding in the open. I’d say 80% - 90% of our repos are open source and you can go and look at those.

And I think that fits with our ethos. We want people to see what we’re up to, even if it’s one or two people. Obviously, it’s a bit more than that, but we’ve got nothing to hide besides secret keys. One of our developers, Roberto Tyley, owns an open source project called BFG which is very good at going through your git repo and deleting any secrets you might not want to be in there.

Coding in the open is very good in that sense because it encourages a lot of good practices. Then you also get nice little things, like for the Daily Edition app that we’ve just launched. We have crosswords in there like we do on the website. We’re using React Native for the Daily Edition app. So to render our crosswords, we’re actually using an open source npm module that is a React version of the Guardian crosswords that someone outside of the Guardian did themselves.

Somebody saw that the Guardian crossword tool was not in React, wanted to do it and so one day we just bumbled across this and saw that someone’s taken our crossword JavaScript code, bunched it up into React and published it as a module. And so we have just downloaded that module, created some forks and eventually submitted some pull requests and now we’re using someone’s library that they’ve created. So that’s saved us a lot of time.

And so you get those wins sometimes. We’ve had a few other pull requests for bugs in the crosswords; people like to go on crosswords, and we’ve fixed a few bugs on crosswords. It’s not the reason we do it, but just one day you see someone submitting a PR with a fix for something, and you’re just like “Thanks, we really appreciate that! I forgot that people can see this repo”.

So by default everything goes out there to be seen, bar a lead dev or product manager coming in and saying “Ah by the way, this probably shouldn’t be public, it’s a bit sensitive”. So it’s not open source in a ‘supported with documentation and setup’ sense but at least it gives people eyes on what we’re up to.

We’ve worked with lots of the UK publishers that we like. Normally, it’s the FT or The Economist. The Times get in there too.

Self-development

We don’t have much by way of resource, so we don’t have week-long camps. We have Hack Days that last a day or two, and occasionally innovation weeks, which are more company focussed than a Hack Day. Every Friday we have Tech Time, which is like an hour where somebody presents something they’ve done or something interesting they’ve seen, just anything about tech. Or we have little ignite talks, which are like five-minute lightning talks, so there’s a keen developer mentality in there.

We have learning groups; we’re learning about Haskell every Tuesday at 2pm. Again, it’s not quite as full on as you might get at Facebook or something like that. We’ll take a small room for like an hour where like ten interested people go down there, but those things are available to you. Everybody’s really excited about them, from the Head of Engineering all the way down. They’re just like “Let’s just keep this going. We can’t provide this, but if you want to support this yourselves just get going”. Which is very much a ground-up thing; it’s not put down on anyone, and that’s the Guardian as a whole. Just a big flat organisation of people trying to do the best they can with a real sense of mission.

The reason people go to the Guardian is because that purpose is there, and you get amazing developers there because you’re working for the Guardian. It’s no Facebook or Google, but you’re working for something that’s trying to ask questions of the Facebooks and Googles, and so you get long-term people who do a lot of stuff for the Scala community and for wide swathes of different bits and pieces. They stay there for a long time, their domain knowledge is amazing, and you can just kind of speak to them about anything.

If you come up with an idea you can throw it into somewhere, like the yellow reader revenue banner at the bottom of our pages, this is the canonical example of people doing that. That was a digital first thing. People were quite against that internally, but the digital team decided to pick that up and run with it and try to see whether we could ask for money.

We run Objectives & Key Results (OKRs) internally every three months, so we had a three month window to set up this yellow banner and see whether it would work and set an A/B testing framework up around it. So that was set up and it just started making money immediately and we realised quite quickly that we weren’t going to have to put up a paywall, because people internally were thinking about doing that.

You can throw your opinions and features in and they might get filtered out because, say, we don’t have time to work on them or we don’t think they’re business critical. But for another idea we might be like “yeah, we’re fully embracing that”.

And it’s not always as it should be. Some of these ideas are golden and we still don’t have the resources to do them, and at some point, because we’re humans, you can’t just keep working and working on stuff outside of hours. Even though it might be a good project, sometimes we just don’t have the resources. It’s still a balance and we still have things to work out, but the culture there is just really positive overall.

Breaking even

What were the implications on the team leading up to break even?

The implications before break even were not great but understandable. In the six months leading up to it, a few people were diverted to help try and get to break even. Internally, that felt like a difficult thing to accept because we saw the valuable projects we were up to in the mid-term and long-term. To see us slightly divert away from that was a bit frustrating. But when we hit break even, you see all the positivity that brings in and the way we can present that. I’m not a marketing person, but I know there’s value in that and everybody felt it.

END.

As the Guardian approaches its 200th anniversary in 2021, it has set a goal to double its paid supporters from 1 million to 2 million by 2022. If you’re interested in any of their projects or want to get involved, you can find more information at their GitHub page or job openings.
