Mixcloud is an audio streaming platform for uploading and streaming radio shows, DJ mixes and podcasts. Since being founded in 2008 they’ve grown to host over 15 million shows by the likes of Wired, Harvard Business Review, TED Talks, Barack Obama, Carl Cox, Richie Hawtin, deadmau5, NTS Radio and Worldwide FM.
In 2017 they signed a direct licensing agreement with Warner Music and then signed additional agreements with Universal Music Group, Sony and Merlin in 2018 while raising a further $11.5m to continue their expansion globally. Mixcloud currently handles approximately 350 million GraphQL queries a day.
In a Behind the Screens interview, Mat Clayton (Co-founder at Mixcloud) gives insight into the tech that powers their platform, how it’s cheaper for them not to host with cloud providers, their love for GraphQL, why Relay is underrated and the media player libraries they use.
Stack & Infrastructure
Could you talk a bit about Mixcloud’s stack and infrastructure?
We are a Django application. 99% of it is synchronous Python and Django with MariaDB, a fork of MySQL, on the backend. Then we run Memcached for caching, Elasticsearch for search and we use RabbitMQ for queue management and background tasks. We also use Redis for certain high-speed counters we can’t afford to lose in the long term.
Back in the day, we were a Django app with MySQL. We fundamentally are the same now but we’ve just moved search into its own service and have supplemented things with caching using both Redis and Memcached for different types of caching. We obviously run a lot more copies of the web layer now than we did previously but it’s fundamentally the same code. We just have thousands of processes now instead of 10, it’s scaled up really nicely.
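The "high-speed counters we can't afford to lose" pattern is essentially a write-behind counter: fast increments are absorbed in memory (the role Redis plays in Mixcloud's stack) and periodically flushed to the durable database. A minimal sketch of that idea, with illustrative names rather than Mixcloud's actual code:

```python
import threading
from collections import Counter

class WriteBehindCounter:
    """Absorb fast increments in memory; flush them to storage in batches."""

    def __init__(self, flush_fn):
        self._counts = Counter()
        self._lock = threading.Lock()
        self._flush_fn = flush_fn  # persists one (key, amount) delta

    def incr(self, key, amount=1):
        with self._lock:
            self._counts[key] += amount

    def flush(self):
        # Swap out the current counts atomically, then persist the deltas.
        with self._lock:
            pending, self._counts = self._counts, Counter()
        for key, amount in pending.items():
            self._flush_fn(key, amount)

# Usage: a plain dict stands in for a MariaDB table of play counts.
durable = {}

def persist(key, amount):
    durable[key] = durable.get(key, 0) + amount

counter = WriteBehindCounter(flush_fn=persist)
for _ in range(1000):
    counter.incr("show:123:plays")
counter.flush()
```

The database sees one batched write per key per flush interval instead of one write per play, which is what makes the counters "high-speed".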
Do you use any cloud services?
There are always exceptions but no, the vast majority of the web app runs off co-located servers out of London. Two reasons for that: mainly cost, as it's substantially cheaper, but performance is also substantially better than the cloud.
We host from London with points-of-presence (POPs) spread out across the globe; we fundamentally just outsource that to Cloudflare. So when a user connects to the service they do the SSL termination with a Cloudflare POP and then piggyback onto a backhaul into our data centre here in London.
How do you achieve that cost optimisation because the common perception is that cloud services are more cost-effective?
It is a common perception. Storage is now comparable to what you can do yourself, so in terms of storage costs, we pay about the same to store data on any of the cloud providers compared to what we can do ourselves.
The main difference comes with bandwidth. As far as I can perceive, the bandwidth costs that cloud providers charge for egress (outgoing bandwidth) are substantially higher than what you could do yourself if you go and buy those peering arrangements. I think they’ve got a massive margin there, which once you hit a certain scale they’ll probably negotiate with you on. We just learned to survive without it. That’s my take, particularly from the perspective of our use case with audio streaming.
It’s very similar to video. The bandwidth costs for video streaming are substantially higher per user than most services, so when there was a big margin to be taken by the cloud service providers it actually ends up becoming a substantial cost for a business like ours which would ultimately not make the business model work. So I think that’s where the difference is, not in the storage but in the bandwidth. Particularly the egress.
Any unique challenges you face being a streaming service?
Storage is always a pain point for us. Streaming essentially has two problems: storing data and streaming data. So that's an area we've spent a lot of time on over the years. We've gone through multiple systems. That is also, I think, one of Mixcloud's biggest assets: we know how to do that stuff cheaply and reliably at scale.
Aside from keeping archives and backups for disaster recovery hosted in the cloud, we don’t stream from S3, Glacier or any of the cloud platforms for that matter because that would bankrupt us very quickly.
We handle hundreds of gigabytes of bandwidth per second and have petabytes of storage, which is kind of like trying to build S3 yourself. We sort of went for it out of necessity because to survive we had no other option. We couldn’t afford to host with one of the main providers, it would have bankrupted us and I would say that now we have about a 10x difference in cost at current price points in terms of what it would cost us to host in the cloud.
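To make the egress arithmetic concrete, here is a back-of-the-envelope comparison with entirely hypothetical prices; Mixcloud's real rates and volumes are not public, so both the sustained throughput and the per-gigabyte figures below are assumptions. At these assumed numbers the gap works out to roughly the 10x difference described above:

```python
# All figures are illustrative assumptions, not Mixcloud's real numbers.
GB_PER_SECOND = 200                  # assumed sustained egress ("hundreds of GB/s")
SECONDS_PER_MONTH = 60 * 60 * 24 * 30
monthly_egress_gb = GB_PER_SECOND * SECONDS_PER_MONTH

cloud_price_per_gb = 0.05            # hypothetical cloud egress price, $/GB
self_hosted_price_per_gb = 0.005     # hypothetical cost with your own peering, $/GB

cloud_cost = monthly_egress_gb * cloud_price_per_gb
self_cost = monthly_egress_gb * self_hosted_price_per_gb
print(f"cloud: ${cloud_cost:,.0f}/mo, "
      f"self-hosted: ${self_cost:,.0f}/mo, "
      f"ratio: {cloud_cost / self_cost:.0f}x")
```

The storage side of the bill is broadly comparable either way, as noted earlier; it is this egress line item that dominates for an audio streaming business.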
How have you handled performance bottlenecks during your growth?
Database tuning is always a thing, that never goes away. There are always bad queries that you have to optimise as the scale of the database grows. You find that queries which worked before don’t work so well anymore because they scale exponentially rather than linearly. So there’s always work to be done there.
There’s always new stuff that surprises us where we thought we had optimal setups. We could have one table that is a thousand rows, really small, and then we start pushing over some of the site traffic onto it and then we're like “ah crap, we’ve got 50 million rows there now” or “a billion rows” and we have to go through that journey again and again. So we take the same journey each time and keep applying it to new areas as we grow those product lines.
For example, Select is one of the new ones we’ve been working on, and that is still in its infancy with thousands of users on it, not millions. But that will go through that growth curve of pain on the database.
The nice thing though is the engineers now all know how to deal with it so they’re addressing those problems in advance. Not optimising prematurely, they just know how to design stuff that works at scale from day one rather than having to evolve it over time. Some of those design decisions from day one are great and you can kind of roll them out and there’s no real extra cost doing it the right way vs. the wrong way. It’s just that level of experience which guides you one way or the other. But some approaches can take ten times the engineering effort to build for scale and those are the ones where the team will opt for the simple approach first unless we know we’re going big very quickly in which case they will invest in scale. It doesn’t always work that way but generally, they’ve got a pretty good sense of it now.
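The growth pattern described above, where a query that was fine at a thousand rows becomes a problem at fifty million, usually comes down to scan complexity: without a usable index the database reads every row, while an indexed lookup is logarithmic. A toy sketch, with a sorted Python list standing in for a B-tree index:

```python
def full_scan(rows, target):
    """Linear scan, like a query with no usable index: comparisons grow with n."""
    comparisons = 0
    for value in rows:
        comparisons += 1
        if value == target:
            break
    return comparisons

def index_lookup(rows, target):
    """Binary search over a sorted column, as a B-tree index would do: ~log2(n)."""
    comparisons, lo, hi = 0, 0, len(rows)
    while lo < hi:
        mid = (lo + hi) // 2
        comparisons += 1
        if rows[mid] < target:
            lo = mid + 1
        else:
            hi = mid
    return comparisons

# The same worst-case lookup costs 1,000x more work at a million rows
# without an index, but barely changes with one.
for n in (1_000, 1_000_000):
    rows = list(range(n))
    print(n, full_scan(rows, n - 1), index_lookup(rows, n - 1))
```

Real query plans involve joins, sorts and cardinality estimates on top of this, which is why the tuning work never fully goes away.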
We also do a lot of work on profiling for measurement and optimisation. We open-sourced one of our projects, django-speedbar, which hooks into Django and gives you timing and profiling information for things like Memcached, Redis and MySQL when rendering pages or views. We have another version of that in-house because we’ve moved onto GraphQL and we’ve done some coupling there which probably wouldn’t make sense to open-source because it’s so closely tied into a lot of our other tech. But the open-source one is still pretty good. The version we use is like 95% the same, we just have a patch on top of the open-source one for our needs.
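django-speedbar hooks its timing into Django's own request cycle, so its real API differs; this standalone sketch just shows the shape of the idea, aggregating per-backend timings (MySQL, Memcached, and so on) for a single request:

```python
import time
from collections import defaultdict
from contextlib import contextmanager

class RequestProfiler:
    """Collect per-category call counts and durations for one request.

    A simplified stand-in for speedbar-style instrumentation, not the
    library's actual interface.
    """

    def __init__(self):
        self.calls = defaultdict(list)  # category -> list of durations (s)

    @contextmanager
    def record(self, category):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.calls[category].append(time.perf_counter() - start)

    def summary(self):
        return {cat: {"count": len(times), "total_s": sum(times)}
                for cat, times in self.calls.items()}

profiler = RequestProfiler()
with profiler.record("mysql"):
    time.sleep(0.01)   # stand-in for a database query
with profiler.record("memcached"):
    time.sleep(0.001)  # stand-in for a cache get
print(profiler.summary())
```

A per-view summary like this is enough to spot a page that is quietly issuing dozens of cache or database calls.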
Could you talk about why you decided to move to GraphQL and how that’s gone?
I think the world is slowly catching up with GraphQL. I think it took the world a little bit by surprise because you get people who are very dogmatic in terms of REST, and GraphQL went so against that, that it actually took people a long time to adapt. We’ve been using GraphQL in production for over 3 years now. 100% of desktop web, native mobile and our widget use it. The only platform we don’t have running on GraphQL now is mobile web and we will migrate that this year, almost certainly.
As for what motivated the move, it's really about two things. We use React on the frontend and Relay to connect the data layers between React and GraphQL. Relay is another open-source project from Facebook that very few people seem to talk about. Doing it this way allows us to break problems apart a bit better and decouple them. When you look at a web page, you see entities forming. You see things like user objects which have images, descriptions and biographies; it's essentially a graph. Then users perform actions like, in our case, favourites and follows on other nouns.
When you start breaking down your problem like that you find that actually REST is not a great way of representing it. Specifically, for us the use case where it really falls down is mobile. If you think about the typical REST design, you'd have user/2 and then you'd have /notifications. So if you wanted to render a list of notifications, you'd get your /notifications, which is one REST endpoint, you'd look at that and then within that endpoint you'd be like okay, I need to fetch data for user 2, 4 and 7 and I need the data for show A, B and C. So I need to hit all of those endpoints now and get all that data back, and only then have I got all the data that I need to render my final notifications list.
So you’ve got this process where you end up doing lots of round-trips, if you’re being pure in terms of how you’d do REST, and that’s not performant. So what people end up doing is they end up batching that all together and having a consolidated notifications endpoint which gives you all the data you need for notifications in one go.
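The round-trip problem can be simulated with in-memory "endpoints". The notification payload references users and shows only by id, so a strict REST client must issue one extra request per referenced entity before it can render anything. All the data here is made up:

```python
# Made-up fixture data standing in for three REST endpoints.
USERS = {2: {"name": "alice"}, 4: {"name": "bob"}, 7: {"name": "carol"}}
SHOWS = {"A": {"title": "Show A"}, "B": {"title": "Show B"}, "C": {"title": "Show C"}}
NOTIFICATIONS = [
    {"user_id": 2, "show_id": "A"},
    {"user_id": 4, "show_id": "B"},
    {"user_id": 7, "show_id": "C"},
]

def render_notifications_rest():
    """Count the round-trips a 'pure' REST client needs to render the list."""
    requests = 1  # GET /notifications
    rendered = []
    for note in NOTIFICATIONS:
        requests += 1  # GET /user/<id>
        user = USERS[note["user_id"]]
        requests += 1  # GET /show/<id>
        show = SHOWS[note["show_id"]]
        rendered.append(f'{user["name"]} on {show["title"]}')
    return requests, rendered

requests, rendered = render_notifications_rest()
print(requests, rendered)  # 7 round-trips for a 3-item list
```

The consolidated endpoint described above collapses those seven round-trips into one, at the cost of coupling the endpoint to one particular screen.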
Six weeks later you release a new version of mobile, and the nasty thing with mobile is that you've got lots of versions in production simultaneously. Whereas with desktop, when you do an upgrade, everyone who hits refresh gets the latest version. With mobile that's just not the case because you ship through the App Store. So what you have to do is either add new fields to that /notifications endpoint, which over six months slowly ends up with a lot of wasted computation generating fields that are no longer used, and then you have to come up with some deprecation plan. Or, what tends to happen is that you have /notifications/v3/, and then you end up in a support nightmare trying to figure out which one to use for current versions of the app and what your deprecation cycle is for later versions of the app to avoid over-fetching or under-fetching.
Over-fetching is when you get too much data back from the server and you just don’t need it, so it’s wasted computation and wasted bandwidth. Under-fetching is more critical, you ask the server for data and it doesn’t know how to respond to it anymore. For example, this happens with older versions where we’ve turned off endpoints, that would be the typical case you’d really want to avoid.
The nice thing with GraphQL is because the client declares what it needs we can have many different versions of the client in production and they’re all declaring their data requirements and the server satisfies those requirements on a per client basis, so we don’t end up with any over-fetching or under-fetching. We end up with this really streamlined query language backwards and forwards between the two. So for us, that was the killer use case, really.
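The "server satisfies each client's declared requirements" idea can be sketched in a few lines. The user record and field names below are invented stand-ins for a real GraphQL schema and resolver:

```python
# An invented user record; "biography" is a field only newer clients ask for.
USER = {
    "displayName": "NTS Radio",
    "city": "London",
    "country": "UK",
    "biography": "Radio station",
}

def resolve(requested_fields):
    """Return exactly the fields the client declared: no over-fetching,
    and unknown fields fail loudly instead of silently under-fetching."""
    missing = set(requested_fields) - set(USER)
    if missing:
        raise KeyError(f"unknown fields: {sorted(missing)}")
    return {field: USER[field] for field in requested_fields}

# An old app version and a new one, live simultaneously, each getting
# exactly what it declared.
old_client = resolve(["displayName", "city"])
new_client = resolve(["displayName", "city", "country", "biography"])
print(old_client)
print(new_client)
```

The server keeps one implementation; versioning lives in each client's declared query rather than in a zoo of /v2/, /v3/ endpoints.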
And then there are various other things like using it for typing. The server declares its API infrastructure, what it can provide, via a schema file. That schema file has got types in it and then from there, we can actually do full test coverage from the type system on the API all the way through the apps and all the way through the website as well.
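The typing benefit can also be sketched: the server publishes a schema with field types, and a client's field selection can be checked against it before it ever runs. The types and fields below are invented for illustration; real GraphQL tooling does this from the schema file:

```python
# An invented schema: object types map field names to type names.
SCHEMA = {
    "User": {"displayName": "String", "followerCount": "Int", "picture": "Image"},
    "Image": {"url": "String", "width": "Int", "height": "Int"},
}

def check_selection(type_name, selection):
    """Validate a nested field selection against the schema and return
    the resolved type of every selected field."""
    fields = SCHEMA[type_name]
    result = {}
    for field, sub in selection.items():
        if field not in fields:
            raise TypeError(f"{type_name} has no field {field!r}")
        field_type = fields[field]
        if field_type in SCHEMA:          # object type: recurse into it
            result[field] = check_selection(field_type, sub or {})
        else:                             # scalar type: record it
            result[field] = field_type
    return result

typed = check_selection("User", {"displayName": None, "picture": {"url": None}})
print(typed)
```

Because every selection resolves to known types, a typo in a query or a removed field becomes a build-time failure rather than a runtime surprise, which is what makes end-to-end test coverage from the type system possible.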
So there are a couple of key points, which are kind of hard to explain, but when you take a step back and look at what you want to achieve holistically, it's almost an act of genius how they've pulled this together. Honestly, when we started using it, we were like "this kind of fits a couple of things we want" but then 3 years in we're like "there's no way we'd ever go back to the old system, no way". Every engineer that joins our team comes in and says "what the hell did you do here? Why aren't you using REST?" I literally see it every single time. Give them a week and a half, maybe two weeks, and every single one says "I'm never using REST again" having used GraphQL at production scale. It's phenomenal to watch given how dominant REST was.
Can you talk about your use of Relay?
If we look at a React component that's rendering a display name with a city and a country for a user, we can bind those things together and say that this component now requires GraphQL to provide the display name, city and country from the user object. So if we do that across all of our components and then put a component around a group of them, it can automatically generate the GraphQL to execute from that. So it binds the API data layer to our visual UI.
It’s kind of incredible once you see it working but it takes three or four steps to be able to visualise what it does. This is one of the other things we’ve seen with developers who’ve come in and used it, they swear never to go back to writing queries again. Our engineers never have to write API calls. We just say what data we need in our components and our data layers are responsible for making sure that it’s there, synched, lockstepped and handles offline situations, as is necessary with mobile. Relay is the library which provides all of that.
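Relay generates and manages these queries itself from fragments colocated with components; this hand-rolled Python sketch only illustrates the core composition idea, with the fragment text being illustrative GraphQL rather than Mixcloud's schema:

```python
# Each "component" declares the fragment of data it needs.
PROFILE_FRAGMENT = """
fragment ProfileCard on User {
  displayName
  city
  country
}
"""

AVATAR_FRAGMENT = """
fragment Avatar on User {
  picture { url }
}
"""

def compose_query(fragments):
    """Compose component fragments into one query, as Relay does
    automatically at build time."""
    names = [f.split("fragment ", 1)[1].split(" ", 1)[0] for f in fragments]
    spreads = "\n    ".join("..." + name for name in names)
    bodies = "\n".join(f.strip() for f in fragments)
    return f"query UserPage {{\n  user(id: 2) {{\n    {spreads}\n  }}\n}}\n\n{bodies}"

print(compose_query([PROFILE_FRAGMENT, AVATAR_FRAGMENT]))
```

Because each component owns its fragment, adding or removing a field in one component changes the composed query without any other component, or any hand-written API call, being touched.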
The learning curve is really steep, but once you get over that it's incredibly efficient from a developer productivity point of view. However, expect to spend a week trying to boot the damn thing up. It's one of those things where for a small hack project it might be overkill, but I would still use it for one because we've now used it and understand it. It can be quite a climb to get it off the ground, but once you've got it going the benefits are substantial.
The middle ground is a client called Apollo, which gets a lot of PR. I’d say technically, on our evaluations, Apollo doesn’t look as strong as Relay but it is much more user-friendly for developers. If you want to get going quicker, it’s probably an easier way of doing it, but that doesn’t necessarily mean you’re going to get something that is built to scale to thousands of developers and hundreds of millions of users, it’s taken some shortcuts along the way. But it will probably get you a lot more efficiency than doing it all yourself so it might be a good middle ground. When we got started Apollo wasn’t an option so we were kind of forced into Relay or nothing and so we got used to it pretty quickly. So at this point, it’s better the devil we know.
Relay is definitely a solution built by a giant organisation like Facebook, as opposed to a third-party library whose authors have never really operated at the same scale. You can tell that the Relay team works in close proximity to the GraphQL and React teams, integrating the two together. It's the library which powers Facebook Marketplace, for example, so it's used by fairly big teams, but the open-source community seems to have not really taken to it.
What about the challenges of building media players?
We used to build our own players. We’ve got four or five platforms we need to support, the main ones being native iOS and Android, mobile web and desktop web. Mobile web and desktop web are very similar, right now we have two different players because we’re rebuilding mobile web so mobile web is very much on our old technology stack and desktop web is on our new technology stack.
For iOS, we take the built-in players from Apple and wrap them in an API which is to our liking. We don’t do much work on those, I kind of wish their player was more substantial, it’s got the fundamental building blocks but not much more beyond that.
The final one is ExoPlayer, that powers Android. We used to use MediaPlayer which is the low-level media player library provided by Android. We now use ExoPlayer which again is an open-source project by Google and that one I’m 90% certain powers lots of industry services on Android at this point. I’m fairly certain YouTube is on there, I’ve seen BBC, Sky and I think I’ve even seen Spotify mention it at this point. It’s a phenomenal bit of software open-sourced by the Google team on Android for audio and video playback. Full support for pretty much everything.
That, along with Shaka Player, I'm almost certain is built by very similar teams sat in the same areas of Google. They're very well done, incredible software, but they don't get much love; the public eye is not really on them. They are open-source and on GitHub, and both of them are projects like React, where you go in there, have a look, and can tell that these are very well written, well-tested, established libraries, written by incredible teams. So we're just moving onto them. They've done a better job than we could ever do.
Could you talk about your approach to testing and your thoughts on TDD?
I don’t think that anyone who says they’re full-on TDD is actually full-on TDD. The reality is that often I find teams who claim to be are idealistic and probably not shipping the best quantities of software.
We use GitHub for code hosting, we have CI toolchains backed onto that, and we use CodeShip for desktop and CircleCI for mobile on different Mixcloud products. That's just for legacy reasons, because CodeShip came out before CircleCI and we haven't moved.
Then on top of that, we do full integration tests, so things like Selenium on desktop. We're still building out our integration testing capabilities on mobile. We do have that capability as of this year, we invested heavily in January, but I would say the number of tests there is still quite minimal. Whereas desktop is pretty extensively covered, not just on deploy but on every single commit across the entire codebase. We believe very heavily in investing in that kind of toolchain because we think it's something which keeps on paying dividends; it takes a bit of investment to get it off the ground but then it just keeps on paying back.
Snapshot testing is another area of testing we do a lot of. Because we use Relay and React we have these components which take properties, those properties define what the layout would be and what it should look like and do. We invest pretty heavily in testing the different constructs, like putting in different sets of properties and seeing what the expected outcome is. We do lots of those which essentially just takes snapshots or freezes of what the app output should be and then if somebody does make a change to the codebase we can see whether those changes are percolating through to the areas we would expect or whether they’ve had unexpected consequences somewhere else. It’s very good for checking that the UI doesn’t radically shift based on data changes.
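The snapshot workflow can be sketched with a plain function standing in for a React component render; Jest does this for real component trees, and the renderer and snapshot store here are simplified stand-ins:

```python
import json

def render_user_card(props):
    """A plain function standing in for rendering a React component
    to a serialisable tree."""
    return {
        "type": "UserCard",
        "children": [
            {"type": "h2", "text": props["displayName"]},
            {"type": "span", "text": f'{props["city"]}, {props["country"]}'},
        ],
    }

SNAPSHOTS = {}  # stands in for committed .snap files

def assert_matches_snapshot(name, tree):
    serialised = json.dumps(tree, sort_keys=True, indent=2)
    if name not in SNAPSHOTS:
        SNAPSHOTS[name] = serialised   # first run: record the snapshot
    elif SNAPSHOTS[name] != serialised:
        raise AssertionError(f"snapshot {name!r} changed")

props = {"displayName": "NTS Radio", "city": "London", "country": "UK"}
assert_matches_snapshot("user-card", render_user_card(props))   # records it
assert_matches_snapshot("user-card", render_user_card(props))   # still matches
```

Any later change to the render output fails the comparison, so unintended consequences of a code change show up as a diff against the frozen snapshot.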
Lastly, could you talk a bit about the product engineering team at Mixcloud?
So the product team is 24 people right now: 2 support people, 2 product managers, myself as director, 2 mobile engineers, one on data, one on infrastructure, one on record label reporting, and a manager across that team. We have a design team of three who work across all platforms.
Then we have three Feature Lanes, which are pairs of people who work on features. They’re given a feature for a month at a time or two sprints, and they will focus on those exclusively. Those features go across all platforms simultaneously so they’d be working on iOS, Android and desktop at once. So they’re full-stack teams.
Then we have another team called Fast Lane, which is a team of three people, that is dedicated to rapid changes to the product. Anything that needs to be turned around in 24 hours, bug fixes, random requests or very small feature fixes, stuff like that.
This post is part of a new series of Behind the Screens technical interviews with teams who build web products. You can subscribe to updates via RSS or join the Able developer community to receive updates in your email digest.