Chasing down a session-swapping bug on Vistaserv.net

By paul and caitlin, 09-JAN-2022

TL;DR

Recently, users reported to us that they were occasionally being logged in as some other user (!). A few days after our first attempted fix, it turned out to still be happening. We've now addressed the issue properly, and it was (unsurprisingly) due to a misconfiguration on our part. In the interests of transparency, we decided to share how we found and fixed the issue.

Background

On the 19th of December, 2021, two of our users reached out to us, saying that they had been simultaneously editing their homepages and were randomly getting logged in to each other's accounts. Thankfully they knew each other and so no harm was done, but of course, this represented a pretty serious security concern.

Our first thought was that the issue was somehow related to sessions leaking between requests. Vistaserv is a Ruby on Rails app, using Devise for user management and Puma as the HTTP server. For those who are unfamiliar, Rails keeps track of which users are logged in and who they are logged in as via session cookies. These session cookies are signed and encrypted little blobs of text that the server passes back to the client's browser to store, and which the browser automatically sends along with subsequent requests to the web server. The following diagram illustrates the flow.

Let's take a high-level look at how Rails session cookies work. The client on the left wants to request a page from the web server, running Rails, on the right.
  1. Initially, the (unauthenticated) user logs in with their username and password. No cookies are sent to the web server.
  2. The server checks that the username and password are correct. The server retrieves the numerical user ID for that user for future reference.
  3. The server creates a cookie, containing among other things the user ID. Before being sent out with the response, the cookie is signed and encrypted, so that it cannot be forged by a malicious user. The user's browser stores the cookie.
  4. When the user wants to visit some other page, the cookie will be sent as part of the request to the server.
  5. Because the cookie is signed and encrypted, the server can trust the user ID contained in the cookie. The server proceeds as if the request comes from the authenticated user and responds with whatever the user requested.
  6. The response (e.g., a private message for the user) is sent.

Example of a session cookie's contents:

{"session_id":"f00b00",
 "warden.user.user.key":[[620],"$SOMEHASH$Zxe"],
 "_csrf_token":"SOMEHASHq4="}

These cookies contain the user's ID (e.g., 620). They are opaque to the user (because of encryption) and not modifiable (because the server checks their signature each time). However, if you get hold of another user's cookie and present it to the server, the server will trust that you are that user, and you will have gained access to their account. There are a few ways this might happen. For instance, if the server is handling two simultaneous requests and for whatever reason gets mixed up about which session belongs to which connection, it might incorrectly send a cookie for session A back to user B. Or, if the cookies themselves are somehow getting mixed up, user A might start out in session A but find themselves in session B after the mixup.
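To make the trust model concrete, here is a toy signed-cookie scheme in shell. It is only a sketch of the general idea, not how Rails does it: Rails also encrypts the payload and uses its own message format, whereas this example only signs a plain-text payload with HMAC-SHA256 via the openssl command-line tool. The names SECRET, hmac, issue_cookie and verify_cookie are made up for illustration.

```shell
# Toy signed cookie: "payload.signature". Illustration only; real Rails
# cookies are also encrypted and use a different format.

SECRET="not-a-real-secret"   # hypothetical stand-in for the server-side key

# hmac PAYLOAD -> hex HMAC-SHA256 of PAYLOAD under SECRET
hmac() {
  printf '%s' "$1" | openssl dgst -sha256 -hmac "$SECRET" | awk '{print $NF}'
}

# issue_cookie USER_ID -> prints "user_id=<id>.<signature>"
issue_cookie() {
  payload="user_id=$1"
  printf '%s.%s\n' "$payload" "$(hmac "$payload")"
}

# verify_cookie COOKIE -> prints the user ID if the signature matches,
# otherwise prints nothing and returns 1
verify_cookie() {
  payload="${1%.*}"       # everything before the last dot
  signature="${1##*.}"    # everything after the last dot
  if [ "$(hmac "$payload")" = "$signature" ]; then
    printf '%s\n' "${payload#user_id=}"
  else
    return 1
  fi
}
```

Changing the user ID inside a cookie makes verify_cookie fail, because the signature no longer matches. An untouched cookie, on the other hand, verifies no matter which client presents it, which is exactly why a cookie ending up with the wrong user is so dangerous.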

Since there were a few (admittedly vague) reports of thread safety issues around Rails and Devise, we came to the conclusion that we should at least temporarily stop using Puma (which is multi-threaded & multi-worker) as our web server, and switch to Unicorn (single-threaded), to eliminate the possibility of thread safety issues causing sessions to somehow leak between unrelated requests. For example, if memory was being used in a thread-unsafe way while two threads were processing requests for two different users, perhaps sessions might leak between the two – that was our thinking, anyway.

We made the switch to Unicorn, and didn't hear of any more issues from our users, so we crossed our fingers that the issue was fixed. Since we hadn't experienced the issue ourselves, it was difficult to test. We were left feeling somewhat uneasy.

Unfortunately, on January 5th, another user posted in the chat room asking why they had been logged in to someone else's account! The issue clearly still existed. This time, we decided we had to be able to reproduce it ourselves, so that we could validate that whatever measure we took had really solved it before declaring it fixed. Of course, in hindsight, we should have made sure the first time around, but at the time the issue seemed sufficiently outlandish and infrequent that we gave it less weight than we should have. After all, Devise is considered quite battle-tested by now.

Our experiments

We needed to find a way to reliably reproduce the issue, so that we could test whether we had fixed it. It seemed as if it would be easiest to spin up a local development version of the Vistaserv app, create two users, "test1" and "test2", and then attempt to trigger the bug. The bug would in theory cause one of the users to get the session of the other user. We decided to initially simply make simultaneous requests as both test1 and test2 in the hopes of seeing the sessions get muddled.

We assumed the bug would be easier to trigger in a multithreaded Rails server (since we still had a lingering belief that this might be the root cause), so we rolled our local environment back to the version with Puma and tried to come up with a test bench.

First, we would need valid session cookies for both our test users. There's probably a smarter way to do this, but we simply used a web browser to log in as the test1 user, then used the browser's Developer Tools to get the value of the Rails session cookie. Repeat for test2.

The excellent curl utility supports cookies, but the format it expects them to be in is rather awkward. Again, this is not very elegant, but the easiest way for us to set the Rails session cookie correctly was to use curl to make an anonymous request and have it save the session cookie, and then manually replace its value with the authenticated cookie we had retrieved using our graphical browser.

$ curl --silent --output /dev/null http://lvh.me:3000 --cookie-jar -
Which tells us the format curl expects:
# Netscape HTTP Cookie File
# https://curl.se/docs/http-cookies.html
# This file was generated by libcurl! Edit at your own risk.

#HttpOnly_.lvh.me	TRUE	/	FALSE	0	_vistaserv_session_1	Pp1r%2FYa4boh1Q[...etc...]

Now, we can just replace the value after _vistaserv_session_1 with the authenticated cookie for test1 in one copy of the file and for test2 in another, save the two text-based cookie jar files, and we should be off to the races.

We wrote a little script to use those stored cookies, and make requests to our local development server, with a simple test to check whether the server's response included our expected username or not. We hoped that somehow hammering the local server with requests for test1 and test2 in parallel would expose the issue.

Here's the script we ended up using:
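The listing itself is missing from this copy of the post, so what follows is only a sketch of what such a script might look like, not the original. It assumes the cookie jars were saved as test1.cookies and test2.cookies in the current directory (hypothetical file names), that the page at / shows the logged-in username, and that the dev server is the lvh.me:3000 address used above. BASE_URL and check_body are names invented for this example.

```shell
#!/usr/bin/env bash
# hammer.sh (sketch): request a page with a stored session cookie and check
# that the response belongs to the expected user.
# Usage: ./hammer.sh test1

BASE_URL="${BASE_URL:-http://lvh.me:3000}"   # local dev server by default

# check_body EXPECTED_USER BODY -> prints a verdict; returns 1 on a mismatch
check_body() {
  case "$2" in
    *"$1"*) echo "ok: response is for $1" ;;
    *)      echo "MISMATCH: expected $1 in response"; return 1 ;;
  esac
}

# Only touch the network when a username was given on the command line.
if [ "$#" -ge 1 ]; then
  user="$1"
  # ASSUMPTION: the cookie jar for each user is stored as <user>.cookies.
  # curl treats a --cookie argument without "=" as a file to read cookies from.
  body="$(curl --silent --cookie "$user.cookies" "$BASE_URL/")"
  check_body "$user" "$body"
fi
```

Pointing BASE_URL at another host lets the same script be reused against a different environment without editing it.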

However, running this script in parallel with

  $ while true; do ./hammer.sh test1; done

in one terminal and

  $ while true; do ./hammer.sh test2; done

in another, didn't yield any issues. We tried a few approaches (with/without "remember me" cookie, using Unicorn instead of Puma, …), but unfortunately it appeared that our local development environment wasn't exhibiting the issue!

Next steps

Then it struck us that perhaps it was only happening in the production version of the site, caused by something outside of our application code. We were considering whether it'd be worth spinning up the original version which had been running when the issue was first reported, which would be a bit of a hassle, when we realised we could just run this very same test against the current version in production. We created test1 and test2 on Vistaserv.net, grabbed cookies and put them into our script, and… the problem immediately surfaced as soon as we ran the two hammer-scripts in parallel! The production configuration was looking mighty suspect at this point!

Let's take a moment to review the high-level route a request to Vistaserv takes, from the user's browser to the web app.

Basically, requests to www.vistaserv.net from the outside internet will all hit our CloudFront distribution. AWS CloudFront is a CDN (content delivery network) service, but for the purposes of our discussion, we can pretend that it's just a caching layer. Depending on whether CloudFront has seen a matching request recently, the request will either be served from the cache, or will be routed onward to the Vistaserv web app. In reality there is another hop before the web app, but that's not important for now.

This had all been set up quite a while ago, and honestly I have to admit that since Vistaserv was just a tiny hobby project, not enough thought went into it. Out of habit I run Cloudflare (another CDN service with a very similar name) in front of most of my services, since it's very cheap, it's good at caching static assets, and it's a nice anti-bot stopgap too. In this case, though, I recall we needed a bit more flexibility on incoming request routing than free-tier Cloudflare gave us. In particular, we are happy with flexible SSL/non-SSL for the members.vistaserv.net subdomain, to allow antique browsers to view homepages, but we deal with that differently on the www/root domain, because we don't want logins happening in the clear and potentially compromising our users' accounts. In any case, the crucial bit is that Vistaserv has AWS's CloudFront CDN in front of its origins. When I looked closer, I discovered that I had set the CloudFront distribution's CachePolicyId to CachingOptimized (probably due to a copy-paste from another configuration). This sounds fine, but as the documentation points out:

> The CachingOptimized policy is designed to optimize cache efficiency by minimizing the values that CloudFront includes in the cache key. CloudFront doesn't include any query strings or cookies in the cache key […] (emphasis mine)

And there was the answer! For a dynamic web app with logins, this is of course a super bad idea! Let's dig into an example to clarify what had been happening, and what exactly this "cache key" means.

What had been happening?

Let's consider an example. A user (the computer on the left of the image) is authenticated (with a fictional session ID of session_1, which tells the web server the user they're authenticated as), and wants to visit /page. What happens next is:
  1. The user's browser looks up www.vistaserv.net and gets the IP address of our CloudFront distribution. Their browser sends a GET request for the URL /page, and automatically sends any cookies it has which match vistaserv.net. In this case, it's just the one session cookie, session_1.
  2. CloudFront receives the web request, and will check whether the request's cache key is present in the cache. As we mentioned before, because of our CachingOptimized configuration, CloudFront will actually ignore any cookies, headers, etc., and will only consider the request path, that is /page, as part of the cache lookup key. From here, two options exist.
  3. If CloudFront's cache does not already contain an entry for /page, all is well: the request will be forwarded to the Vistaserv web app ("the origin", in official parlance) and the response will be sent back to the user. Depending on the cache headers the web app sets, the response may also be saved to CloudFront's cache for the next time a similar request is received.
  4. If, however, the cache does contain an entry for /page, CloudFront simply replies to the client with that cached response, and the Vistaserv web server never sees the request. The problem is, though, that this response might contain another user's cookie, illustrated in the image as session_2!
  5. This is bad news! The user, who initially made a request with a cookie for session_1, receives a response intended for session_2, some other user! This cookie is valid, so subsequent requests by the client will appear to be authenticated as the other user. The session swap has happened!
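The sequence above can be mimicked with a toy cache in a few lines of shell. This is a sketch of the idea only, not of how CloudFront is implemented: the cache is a plain string of "key, tab, response" lines, and KEY_MODE, cache_key, cache_lookup, origin and request are all names made up for this example. The pretend origin simply sends the caller's own session cookie back in its response.

```shell
# Toy model of a caching layer in front of an origin (not CloudFront's logic).

CACHE=""          # one "key<TAB>response" entry per line
KEY_MODE="path"   # "path": key ignores cookies; "path+cookie": key includes them

# cache_key PATH COOKIE -> the key used to look the request up in the cache
cache_key() {
  if [ "$KEY_MODE" = "path" ]; then
    printf '%s' "$1"
  else
    printf '%s %s' "$1" "$2"
  fi
}

# cache_lookup KEY -> prints the cached response; fails on a miss
cache_lookup() {
  printf '%s\n' "$CACHE" |
    awk -F '\t' -v k="$1" '$1 == k { print $2; hit = 1; exit } END { exit !hit }'
}

# origin PATH COOKIE -> response that carries the caller's session cookie
origin() {
  printf 'page=%s; Set-Cookie: %s' "$1" "$2"
}

# request PATH COOKIE -> sets RESPONSE, from the cache if possible
request() {
  key="$(cache_key "$1" "$2")"
  if RESPONSE="$(cache_lookup "$key")"; then
    return 0                      # hit: the origin never sees this request
  fi
  RESPONSE="$(origin "$1" "$2")"  # miss: ask the origin and store the answer
  CACHE="$(printf '%s\n%s\t%s' "$CACHE" "$key" "$RESPONSE")"
}

request /page session_2; echo "$RESPONSE"   # miss, served by the origin
request /page session_1; echo "$RESPONSE"   # hit, carries session_2's cookie
```

With KEY_MODE set to "path+cookie" (and an empty CACHE), the two requests get different cache keys and each user receives their own response, which is the behaviour a site with logins needs.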

Although the situation sketched above seems really dire, in reality this was not happening very frequently. In practice, Rails does set sensible caching headers on its responses – for example, if you visit any of the dynamic www.vistaserv.net pages:

$ curl --include https://www.vistaserv.net | grep --ignore-case cache
Cache-Control: max-age=0, private, must-revalidate

This Cache-Control header is saying to any caches that might be between the web app and the client, in effect, "you may only cache this response for a maximum of zero seconds." Basically, never cache this. Why were we hitting the issue then?

The CloudFront documentation also states that the CachingOptimized policy has a minimum TTL (time-to-live) of 1 second. The TTL is how long a cache will hold onto a response before going back to the origin for an update. This setting likely overrides the max-age=0 specified by the app server. So, if two requests for the same URL reach the same CloudFront cache within 1 second of one another, the later request will receive the cached response (including session cookie!) destined for the earlier request. Because the session cookie contains the user ID, once the user has someone else's session cookie, they are effectively logged in and authorised as them! Not good at all.
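The suspected override boils down to clamping the origin's max-age between the policy's minimum and maximum TTL. The function below is a sketch under that assumption, not a statement of CloudFront's exact rules. The name effective_ttl is invented here, and the maximum TTL of 31536000 seconds used in the example call is our assumption for CachingOptimized; only the 1-second minimum comes from the documentation quoted above.

```shell
# effective_ttl MAX_AGE MIN_TTL MAX_TTL -> seconds a response stays cached,
# assuming the cache clamps the origin's max-age into [MIN_TTL, MAX_TTL]
effective_ttl() {
  if [ "$1" -lt "$2" ]; then
    echo "$2"    # origin asked for less than the minimum: minimum wins
  elif [ "$1" -gt "$3" ]; then
    echo "$3"    # origin asked for more than the maximum: maximum wins
  else
    echo "$1"    # otherwise the origin's max-age is honoured
  fi
}

effective_ttl 0 1 31536000   # prints 1
```

Under this model, max-age=0 from Rails turns into a 1-second cache lifetime, while a policy with minimum and maximum TTL both set to 0 would give 0.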

Repairing the mess

To address the issue, we just had to modify the cache policy – for the time being, we set it to disable caching completely (that is, minimum TTL == maximum TTL == default TTL == 0s). We'll probably bring back static asset caching soon enough, but we will need to take a bit more care this time around. After modifying the cache policy, we invalidated the entire CloudFront cache, just to make sure all requests would get sent through to the origin (and especially to ensure that no cookies would still be lurking in cached responses visitors might receive).

Finally, since there were potentially users out there with someone else's (valid) session cookie sent to them in error, we also rotated Rails' secret_key_base, effectively making all current sessions and their cookies invalid, and forcing everyone to log in again. This seemed like a small price to pay to ensure things would be good going forward.

Summary

Caching is hard, but to be fair, this mistake is on us, and it was a silly one to make. The reason we didn't notice the issue ourselves, though, is that from our point of view everything worked fine. The site doesn't receive that much traffic, and so you'd have to be fairly unlucky for two authenticated users to hit the same page (not just any page, but the exact same URL, e.g., /sitebuilder/edit) within 1s of one another. Note that since CloudFront has a geographically distributed caching system, the two requests would additionally have to go via the same CloudFront regional edge cache (of which there are currently 13). This means that geographically distant users (for example, a user in Brazil visiting at the same time as someone from Romania) hitting the same URL at the same time would not see this issue.

In any case, our users can now rest easy knowing that in principle, no mysterious session swapping should happen from now on. A very big thank you to the users who alerted us to the issue!
