Performance and Unsplash Availability Update

We've been battling some major issues with the performance and reliability of the Unsplash API over the past month, which has resulted in slower responses, increased 500s, and at one point the need to take the API completely offline for thirty minutes — something we've never done in the 2 years that the Unsplash API has been running.

These problems have affected unsplash.com, our official applications like Unsplash Instant, and thousands of third-party applications.

Needless to say, there is no one more disappointed (and stressed) than our backend team. They take the API's uptime and performance extremely seriously, and they've been working tirelessly the past month to identify and overhaul problem areas in the API.

While we can't promise that these issues are completely resolved, I wanted to share what has happened so far and what we're doing about it.


At the core of it, we've had three challenges come to a head at the same time: new search, stats, and significantly increased traffic.

At the start of the summer, I challenged the backend team to overhaul our search system to reduce the number of empty searches while increasing the engagement and accuracy of the results. The new search system started rolling out mid-summer. While the new search itself is, in theory, as performant as the previous version, the larger number of results being returned put much more stress on the overall system, specifically our Postgres store.

At the same time, our stats system began to show problems as a result of the increased stress. Stats are very heavy to compute and are constantly changing, which makes our caches roll over fairly often.

Bruno, one of our backend engineers, identified the problem by digging deeper into our caching strategy for the stats themselves. By calculating stats in the request queue, we were crashing the system whenever the same heavy calculation was repeatedly run on the fly for an uncached endpoint. Essentially, multiple cache misses would spike at once, and until the cache was written, every one of those misses would be making the same calculation. This would result in requests queueing up, clogging the system and affecting multiple other endpoints.
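For illustration, the original pattern looked roughly like this (a simplified sketch, not the actual code; compute_views_from_events stands in for the expensive Postgres aggregation):

# Simplified sketch of the old approach: the heavy calculation ran inside the
# request, inside the cache block, so every concurrent miss recomputed it.
def views
  Rails.cache.fetch("photo/#{@photo.id}/views", expires_in: 20.minutes) do
    compute_views_from_events(@photo.id) # hypothetical name for the heavy aggregation
  end
end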


To fix this, we needed to move the stats calculations out of the request queue and only allow cache reads. The simple solution would be to calculate all stats on a repeating cron job, but the overall size of the stats store would dwarf our caches and result in heavy inefficiencies, as only a portion of the stats ever needs to be read.

The solution we landed on was to create a new cache key and use the request queue to kick off a worker every time that key expired, while always reading the stats themselves from the cache. In practice, it looked something like:

# app/models/stats/photos.rb
def views
  # The "worker" key acts as a throttle: when it expires, the block runs once
  # and enqueues a background refresh instead of computing the stat inline.
  Rails.cache.fetch("photo/worker/#{@photo.id}/views", expires_in: 20.minutes) do
    RefreshPhotoStatsJob.perform_later(@photo.id, :views)
  end

  # The request itself only ever reads the cached value.
  from_cache(:views)
end
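The job itself isn't shown here, but conceptually it just recomputes the stat in the background and writes it into the cache that from_cache reads. A minimal sketch, with the class body, key name, and expiry as assumptions:

# app/jobs/refresh_photo_stats_job.rb (a sketch, not the real job)
class RefreshPhotoStatsJob < ApplicationJob
  queue_as :stats

  def perform(photo_id, stat)
    # The heavy Postgres aggregation now runs here, off the request queue.
    value = compute_stat(photo_id, stat) # hypothetical helper for the aggregation
    Rails.cache.write("photo/#{photo_id}/#{stat}", value, expires_in: 1.hour)
  end
end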

When we first introduced this, it resulted in some race conditions and heavy cache turnover, prompting dreaded zeroes to appear sporadically on stats.

The fix for this was to add a second layer of cold cache as a fallback:

# app/models/stats/photos.rb
def views
  # Same throttle as before: an expired "worker" key enqueues a background refresh.
  Rails.cache.fetch("photo/worker/#{@photo.id}/views", expires_in: 20.minutes) do
    RefreshPhotoStatsJob.perform_later(@photo.id, :views)
  end

  # Fall back to the longer-lived cold cache if the hot value has rolled over.
  from_hot_cache(:views) || from_cold_cache(:views)
end
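The from_hot_cache and from_cold_cache helpers aren't shown in the post, but the idea is a short-lived value backed by a much longer-lived copy of the same value, with the worker writing to both. A rough sketch, with the key names and expiries as assumptions:

# Sketch of the two cache layers (key names and expiries are assumptions)
def from_hot_cache(stat)
  Rails.cache.read("photo/#{@photo.id}/#{stat}/hot")  # short TTL, usually fresh
end

def from_cold_cache(stat)
  Rails.cache.read("photo/#{@photo.id}/#{stat}/cold") # long TTL, slightly stale fallback
end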

This made sure that at worst a cache miss would result in a slightly out-of-date value being read.

Now that the stats computation had been moved out of the request queue, we were feeling a lot better about the overall performance of the site.

Then came the traffic.

Multiple big API partnerships and a 20% week-over-week growth in visitors significantly increased the load on the system.

Normally we'd just throw money at the problem and increase the number of dynos and the size of the databases, but we had a new problem: we were running out of Redis connections.

Redis maxes out at 1000 connections per store, and while we use six different Redis instances (Redii?), the connections to our caching Redis were maxing out, resulting in connection errors.

We'd solved the same problem earlier in the year for Postgres by pooling connections with the excellent PgBouncer. However, the pooling we'd been using for Redis wouldn't cut it.

We ended up deciding to use the dyno number to split the Redis connections across two different caching Redis instances, cutting the number of connections per instance in half. Updating the caching workers to write to both caches meant that both stayed warm, resulting in fewer misses.
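In practice that can be as simple as picking the cache URL from the dyno's index at boot. A rough sketch of the idea (the env var names are assumptions; DYNO is the variable Heroku sets for each dyno):

# Sketch of splitting cache connections by dyno number (not our actual config)
require "redis"

# Heroku sets DYNO to something like "web.7". Even-numbered dynos use one
# caching Redis, odd-numbered dynos use the other, halving connections to each.
dyno_index  = ENV.fetch("DYNO", "web.1").split(".").last.to_i
cache_url   = dyno_index.even? ? ENV["CACHE_REDIS_A_URL"] : ENV["CACHE_REDIS_B_URL"]
CACHE_REDIS = Redis.new(url: cache_url)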

With our Redis caches no longer blocking our ability to scale, we can at least increase our servers to match the load.

Fundamentally though, we think we need to change how we load data to reduce the stress on Postgres. Most of our endpoints rely on either hitting the cache or fetching almost all of the data again from Postgres. Given that our data changes at very different frequencies, this results in terrible inefficiencies in our caches and their longevity.

While it’s not going to be straightforward to make a big change to how all of the API endpoints work, we've learned a lot about batching and efficient data loading from our next generation GraphQL API. Our hope is to combine serialization with batch loading in a way that moves most of the stress from Postgres to our caches.
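To give a flavour of what batch loading buys us: instead of each serialized photo triggering its own stats query, a request collects the IDs it needs and resolves them in one query. A very rough sketch (PhotoStat is a hypothetical model, and this isn't our GraphQL code):

# N+1 shape: one Postgres query per photo during serialization
photos.map { |photo| PhotoStat.find_by(photo_id: photo.id) }

# Batched shape: collect the ids, resolve them in a single query, serialize from the hash
stats_by_id = PhotoStat.where(photo_id: photos.map(&:id)).index_by(&:photo_id)
photos.map { |photo| stats_by_id[photo.id] }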


The last thing we want is for the Unsplash API to be seen as unreliable. Recovering from that perception is almost impossible.

We're incredibly frustrated when the site goes down and 500s appear. They affect thousands of Unsplash members and thousands of API applications with their own user bases.

Fixing these problems is our top priority on the backend right now. We have a plan that we feel confident executing, and we've made significant progress over the past month towards fixing bottlenecks in the system, but we still most likely have a lot of work left.

I wish I had a straightforward timeline where I could say these issues will all be over in the next week, but it's looking like it's not quite that simple. Know though that we are working extremely hard on fixing these issues and returning the site and API to being stable.

Luke Chesser