Lessons learned from migrating to Python 3

18 Apr 2018

As the Python 2 End of Life date approaches, many developers will need to start looking at migrating their builds to Python 3. I sat down with Dan Palmer to discuss Thread's recent migration of their Django e-commerce website to Python 3.

Thread is an online personal styling service. When you sign up, their website asks you a few questions about your preferred styles, sizes, budget, owned items and so on, which their personal stylists use to provide surprisingly well-tailored recommendations. Headquartered in London with around 50 staff, Thread’s Django-powered e-commerce site receives approximately 850k visits a month.

Rhett: What does Thread’s architecture look like?

Dan: Thread has been strongly powered by Django from day one. The founding team have a lot of experience with Django and so the project was architected to scale as it increased in complexity. Our recommendations service is built with Flask using scikit-learn, there’s a lot of data science used at Thread, then we’ve also got some smaller services that power state machines and quite a few monitoring services. Mostly though, it’s a Django monolith. Monolith gets a bad rap as a term but I think Django lends itself very nicely to dividing things up into apps and creating this nice hierarchy of logic and models.

So today, Thread is about 370 Django apps with about the same number of models, roughly one model per app. We’ve got about 150 asynchronous jobs that are either on demand or scheduled. These are powered by our library called Django Lightweight Queue, which is mostly developed in-house. It can be thought of as a very simplified version of Celery: basically, Redis-backed queues with a very simple API that lets us call a function with arguments that then gets put into a queue and picked up by workers. Some of those queues have multiple processes on them as well, like our inventory scraping, for example. We scrape a lot of data from our suppliers and so there are about 10 to 20 workers for that. Some of our recommendation stuff has 150 workers on it, other things are just 1 worker, which caused a bit of an issue during the Python 3 migration.

Rhett: Why did you decide to migrate to Python 3?

Dan: I’ve been at Thread for four years now and it was fairly clear after the first year that Python 2 was going to disappear eventually. I think for a long time there was this limbo where people weren’t convinced that Python 3 was the future and the community was considering running both versions long term, in a way I guess that’s kind of what’s been happening to date. I remember the first PyCon UK I went to in 2015 and that year a lot of people were talking about migrating their bigger codebases to Python 3. The year after that not many people were talking about it, a lot of people seemed to have just done it.

By 2017, we had very few dependencies that were compatible with Python 2 only. These were either very small or had Python 3 forks that we could move to. The Python 2 End of Life date had been confirmed (as 1st of January 2020) and Django had committed to going Python 3 only too. We also run our systems on Debian and they had announced an end date for Python 2 support which was a year earlier. This was because of their release cycle, they didn’t want to have a period during which they were supporting Python 2 while it was no longer officially supported by the Python team.

So I guess we could have left the migration until next year but we put a lot of effort into making sure that we have nice code, that we are writing code in good long-term maintainable ways and ultimately that’s what Python 3 really does for Python. It makes it easier to write good code and so there were places where we were feeling the pain of not having that, particularly with things like type systems, which had to be managed with mypy, and a lot of the Unicode support which we’ve never had to deal with full on because we’re only operating in the UK currently. However, every now and then something would crop up and we’d get a product that had a Unicode symbol in it that we hadn’t catered for somewhere or we’d ingest a CSV file from some supplier and it wouldn’t work.

We have something called Tech 10% at Thread which is like Google’s 20% time, so Wednesday afternoons is when developers can work on whatever they think is important. During that time I took on the task of starting to move us to Python 3.

Rhett: How did you plan and prepare to migrate to Python 3?

Dan: Funnily enough, one of the things we didn’t do as well as we could have, was plan properly. As the preparation was only happening in the Tech 10% time we didn’t have the whole team working on it and it didn’t get the thorough planning that any other large piece of tech work at Thread would get, that’s something that I think we’ve learnt for the future.

Having said that, we did have a plan. We started by getting our dependencies up to date so that they would work on Python 3. After that, we started to adapt our coding practices so that they would create forward-compatible code and so we used linters quite a lot for that. We’ve got quite a few Flake8 linters and extra plugins for Flake8 which we use to keep things like trailing commas in check and stuff like that but we found a bunch of plugins that we could configure in ways to allow us to write more forward-compatible code.

For example, there is a plugin for Flake8 that allows you to ban certain imports. We went through the list of things that the six library has under six.moves, so for where things have moved in a standard library you can use six.moves as a compatible way between Python 2 and Python 3. We banned all of the Python 2 imports, moved everything over to six and that meant that we weren’t importing from the wrong place, so we could do that quite incrementally. A lot of those things started to come in incrementally and each week we’d do a small PR that would just change a couple of things. I think that was probably several months of work but it was only a few hours here and there and over several months we started to get to the point where the code looked a lot more like Python 3 code.
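
As a rough illustration of what an import-banning lint rule does, the sketch below walks a file's syntax tree and reports imports of modules that were renamed in Python 3, together with the six.moves name to use instead. It is not the Flake8 plugin Dan mentions: the BANNED table is a small illustrative subset and find_banned_imports is a hypothetical helper.

```python
import ast

# Illustrative subset of Python 2 module names and their six.moves replacements.
BANNED = {
    "urlparse": "six.moves.urllib.parse",
    "ConfigParser": "six.moves.configparser",
    "cPickle": "six.moves.cPickle",
    "Queue": "six.moves.queue",
}


def find_banned_imports(source):
    """Return (line, module, suggestion) for each banned import in source."""
    problems = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # level == 0 means an absolute import; relative imports are skipped.
            names = [node.module]
        else:
            continue
        for name in names:
            top_level = name.split(".")[0]
            if top_level in BANNED:
                problems.append((node.lineno, top_level, BANNED[top_level]))
    return sorted(problems)


example = "import os\nimport urlparse\nfrom ConfigParser import SafeConfigParser\n"
problems = find_banned_imports(example)
```

Because a rule like this only flags imports, it can be switched on one module name at a time, which fits the small weekly PRs described above.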

The stage after that was getting our tests to run. Our tests didn’t even run on Python 3 to begin with, not to mention pass. So when we had all the linters in place and we had fixed all of those issues the tests were a lot closer to running, then there were just a few little bugs that we could sort out to get them running. I think the first time they ran they were mostly alright, we probably had 80% of our tests passing and so that kicked off three or four weeks of using Tech 10% time to get all of the tests passing.

Most of the time it was a lot of CSV handling and stuff like that. We have to deal with several third-party warehouses and shipping companies who all have FTP servers with CSV files on them, they’ve never heard of an HTTP API so that was fun. Once we fixed up the tests we found places where we needed more tests. We would find a particular bug in say one CSV system and then we’d learn from that and go and write tests for the same thing in all of our other ones just to make sure that we weren’t having the same issues all over the code base.
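
CSV handling is a common source of breakage because Python 2's csv module works on byte strings while Python 3's works on text. One general approach, sketched below, is to decode downloaded bytes explicitly before parsing. The interview does not say how Thread structured its fix: read_supplier_csv, the UTF-8 default and the sample data are all assumptions for illustration.

```python
import csv
import io


def read_supplier_csv(raw_bytes, encoding="utf-8"):
    """Decode raw bytes explicitly, then parse the rows as text."""
    # newline="" lets the csv module handle line endings itself.
    text_stream = io.TextIOWrapper(
        io.BytesIO(raw_bytes), encoding=encoding, newline=""
    )
    return list(csv.DictReader(text_stream))


# Hypothetical file contents, as they might arrive from an FTP download.
# \u00e9 is an accented e and \u00a3 is the pound sign.
raw = "sku,name,price\n123,Caf\u00e9 shirt,\u00a325\n".encode("utf-8")
rows = read_supplier_csv(raw)
```

Making the encoding a parameter keeps the decision in one place per supplier, so a file in a different encoding becomes a one-argument change and an obvious test case.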

Once the tests were passing we added another CI job to Jenkins that would run all of the tests on Python 3 and make sure that we could build the package. Then the idea was to essentially treat the Python 3 build as a first-class citizen.

We had originally expected it to be just a couple of weeks of the Python 3 build being stable before we’d start to ship it but we knew that the changeover was going to be a big risk and so we wanted to plan ahead for it. We wanted to make sure that we did it at a time where we’d have enough developers in the office to handle any issues and also at a time that wasn’t a critical sales period. So we ended up putting it off for about three and a half months and keeping Python 2 and Python 3 compatibility during that time.

That was error-prone and a bit of a pain. I’d been running Python 3 in my dev environment for several months and about a month or so after we had Python 3 compatibility all of the development team were using Python 3 in their dev environments, which then resulted in flakey builds on Python 2 as we’d write something that worked for Python 3 without realising it and then push it live and the build would fail. So that caused a bit of friction, it started to cost us time and so we put a bit more pressure on getting Python 3 shipped and eventually set a date on a Tuesday in November.

I say we didn’t want the changeover to happen in a high sales period but we actually ended up doing it the week of Black Friday, which was another one of the things we learnt from this. We communicated about it really well within the tech team but we didn’t communicate well with everyone else. I think we were a bit more confident about it than perhaps we should have been.

The rest of the company didn’t know there was a high risk of it going wrong and so we came in one Tuesday morning and tried to push it live. It didn’t work and we ended up realising by about 8 am that it wasn’t going to work that day and so we rolled back. Then we came in early again the next day and tried again. A few things went wrong but less so and we ended up pushing through and just fixing issues as they came up and we’ve been on Python 3 since.

Rhett: Did you have any teething issues when migrating over to Python 3?

Dan: The main one that caused us to roll back on the Tuesday was basically an issue between our sessions and our caching system. It meant that all of our users would have been logged out and would have to log back in again. We have our sessions stored in signed cookies, and we store some of our user data in Memcached as pickled Python classes. When we moved to Python 3 there were some incompatibilities, down to Unicode issues again, that resulted in a difference between bytes and strings and it meant that we weren’t able to validate the cookies that people had against the data we had stored in Memcached.
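
The interview does not describe the exact fix, so the following is only a general sketch of the kind of bytes-versus-text mismatch involved. A value signed as a byte string under Python 2 can reappear as text under Python 3, and one way to keep signatures stable is to normalise everything to bytes at the signing boundary. The helpers to_bytes, sign and verify and the sample values are hypothetical, and this is not Django's signing implementation.

```python
import hashlib
import hmac


def to_bytes(value, encoding="utf-8"):
    """Return value as bytes, encoding text if needed."""
    if isinstance(value, bytes):
        return value
    return value.encode(encoding)


def sign(value, secret):
    """Hex HMAC-SHA256 signature that is the same for text and bytes input."""
    return hmac.new(to_bytes(secret), to_bytes(value), hashlib.sha256).hexdigest()


def verify(value, signature, secret):
    """Constant-time comparison of the expected and supplied signatures."""
    return hmac.compare_digest(sign(value, secret), signature)


# Hypothetical values: the same session key seen as bytes and as text.
sig_from_bytes = sign(b"session-abc123", "not-a-real-secret")
sig_from_text = sign("session-abc123", "not-a-real-secret")
```

Because both inputs are normalised before hashing, a signature produced from the byte form still verifies against the text form, which is the property needed to avoid invalidating existing cookies during a changeover.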

The Thread experience is mostly about browsing outfits and learning about how to dress, so it’s a much better user experience for us to have a relatively long expiry time on our cookies. We know that people dip in and out of browsing, so in terms of user experience, logging everyone out wasn’t something we thought was reasonable, and that’s why we decided not to go ahead on Tuesday and spent the day writing some forward compatibility into our caching and signing of cookies so that when we launched it the next day we wouldn’t have that issue.

Part of the rollout plan was that we’d get the web servers running first. We took everything down, put the web servers back up as well as the really critical queues like our checkout processing queue, for example. Then the idea was that throughout the day we would bring up the other queues one at a time and watch the logs for errors.

Our warehouse opens at 8 am and all the software they use is our Django app, so it had to be working by then which it was and so that was a success, but as we started to bring queues up we found there were still a fair number of things that weren’t working exactly how we’d hoped they would and we ended up finding out the next day we had caused a lot of items to go out of stock, post order.

Post order out of stock is when someone has bought something and then we email them to say that we actually don’t have it. It usually only happens in very rare circumstances and we have to refund the customer, it’s not a good experience at all. It’s a metric we track very closely and we try to minimise it. During this time it spiked and it turned out that some of the ways we had our queues running in a serial process caused an issue where we were running multiple instances of the same queue. They were all churning through things very quickly causing issues, throwing errors and marking lots of stuff as out of stock. Our Ops team probably spent about two or three days cleaning up the aftermath of that. That was probably the biggest issue we had.

We run meetings called 5 Whys when something goes wrong on a big scale to find out the root cause of the issue. We ended up running a 5 Whys for this issue and we ultimately concluded that the root cause was that as a tech team we didn’t communicate all of the risks and everything that might be affected well to the rest of the team. If we had done that then maybe they would have been able to spot issues sooner, maybe if we’d mentioned the risks they would have said "don’t do this on Black Friday" or they would have seen some of the systems it was going to touch and said "you might want to write some tests for this particular area because that is a very key thing for us".

So it wasn't a trouble-free launch but by the end of the week the only issues we were having were minor things like reports that only get generated once a week and we could very easily fix that and rerun the report. We haven’t really had any problems since.

Rhett: What would you recommend to people who are preparing to undertake the same migration to Python 3?

Dan: I would say plan out the steps for how you are going to get your codebase compatible with Python 3. Invest in tooling that’s going to help you know that it’s compatible. Write tests if you haven’t got tests on certain areas of your code base and you know that they deal with data files that might be in different formats or things that Python 3 has notably changed. We thought we had pretty good test coverage but there were certain bits where we found we were lacking. Invest in tools like linters and set up parallel builds of your software that run on Python 3.

Plan for how you're going to get to Python 3 compatibility in a way that doesn’t interrupt the rest of your dev team. That was something that I think we did get right, by the time I went to developers and said “Hey, do you want to use Python 3 on your machine?” everything mostly just worked and that meant that the rest of the team thought it was a good thing and bought into it.

I think the other thing would be to plan the launch, particularly if you’ve got a web service, one where it really matters that you’re up and running and that customers can still transact. So plan step by step very carefully what you’re going to do and at any point within that how you can realise that it’s not working and how you’re going to roll back from that to a known working state.

Communicate outside of your engineering team, make sure that people know what all the risks are, what might stop working and what they need to be watching. Make sure that everyone knows how to identify that something isn’t working because sometimes it’s difficult to know, particularly if you’ve got a large system.

If you're interested in learning more about Thread then check out their Thread Engineering publication on Medium. They're also hiring!
