Designing a server-side application for secure storage of access tokens and other secrets

One of the projects I’m currently working on is an augmented inbox service. The primary goal of the service is to allow users to use email more efficiently and spend less time in their inbox.

It helps users achieve that by overlaying the inbox with all kinds of important contextual information about the sender or recipient. This overlay consists of different insights, metrics and suggestions which are derived from historical usage data and real-time information obtained from social media profiles. You can think of it as Rapportive with contextual data.

An early prototype of the overlay was built as a Chrome extension.

Historical data is obtained by analyzing the user’s inbox. The service works by connecting to GMail’s IMAP servers using the SASL XOAUTH2 mechanism [1].
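To make the mechanism a bit more concrete, here is a minimal sketch of how a client could build the XOAUTH2 initial response and hand it to Python's imaplib. The function names are my own for illustration and are not taken from the actual service; the string format is the one documented for XOAUTH2.

```python
def build_xoauth2_string(email_address: str, access_token: str) -> bytes:
    """Return the raw (not yet base64-encoded) SASL XOAUTH2 client response."""
    # Format: "user=<email>^Aauth=Bearer <token>^A^A", where ^A is the byte 0x01.
    return f"user={email_address}\x01auth=Bearer {access_token}\x01\x01".encode("utf-8")


def authenticate(imap_connection, email_address: str, access_token: str):
    """Authenticate an imaplib-style connection (hypothetical helper)."""
    auth_string = build_xoauth2_string(email_address, access_token)
    # imaplib base64-encodes whatever bytes the callback returns.
    return imap_connection.authenticate("XOAUTH2", lambda _challenge: auth_string)
```

With a real connection this would be called as authenticate(imaplib.IMAP4_SSL("imap.gmail.com"), email_address, access_token).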

To authenticate with the GMail IMAP servers, the service uses an access token which is obtained from the Google authorization servers using a user-specific refresh token and the OAuth 2.0 refresh token flow.
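As a rough illustration of the refresh token flow, the sketch below builds the token request using only the standard library. The endpoint URL is Google's token endpoint as I know it today and may differ from the one in use when this post was written; the function names and the client_id / client_secret parameters are assumptions for the example.

```python
import json
import urllib.parse
import urllib.request

# Assumption: current Google OAuth 2.0 token endpoint.
TOKEN_ENDPOINT = "https://oauth2.googleapis.com/token"


def build_refresh_request(client_id, client_secret, refresh_token):
    """Build (but do not send) the POST request for the refresh token grant."""
    body = urllib.parse.urlencode({
        "grant_type": "refresh_token",
        "client_id": client_id,
        "client_secret": client_secret,
        "refresh_token": refresh_token,
    }).encode("ascii")
    return urllib.request.Request(TOKEN_ENDPOINT, data=body, method="POST")


def fetch_access_token(client_id, client_secret, refresh_token, timeout=10):
    """Send the request and return the short-lived access token."""
    request = build_refresh_request(client_id, client_secret, refresh_token)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    return payload["access_token"]
```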

Since the service needs to periodically fetch the user’s emails, it needs to securely persist the refresh token so it can be re-used later on.

In this blog post I’m going to describe how I have approached and designed the server-side application architecture to provide a secure storage of refresh tokens.

Keep in mind that you can use a similar approach to securely store other user secrets (different service keys, access tokens, credentials, SSH keys and so on).

Background & Motivation

It doesn’t really matter what kind of application you are working on; you should always treat the user’s privacy and security as a top priority. This is especially important if you are handling sensitive, private or secret data (like refresh tokens in this case).

This means you should dedicate sufficient time and resources to designing, developing and reviewing your application and making sure it’s secure. Sadly, a lot of people and organizations don’t recognize that (or they are simply ignorant). Because of that, incidents like the recent Adobe breach (Adobe encrypted passwords using 3DES in ECB mode, seriously!) and the Buffer hack are a lot more catastrophic than they would have been if those companies had stored credentials properly and in a secure manner.

Application Design & Security Principles

Here are some of the main security principles I have adhered to while designing and working on the application:

  • Keep it simple.
  • Don’t roll your own crypto; use well-known, researched and tested principles, algorithms, methods and libraries.
  • Use a layered approach to security.
  • Design the services to adhere to the principle of least privilege.
  • To reduce the attack surface, design simple and small services.
  • Isolate different services and components.

Quick Note About Isolation

As noted above, I have used isolation and a layered approach to security. In this case, the service isolation consists of the following layers:

  • Virtualization (Xen)
  • Isolated private networks
  • Software firewall

Architecture Overview

This section contains a high-level application architecture overview and a short description of the important services and their roles.

High level architecture overview.

Note: Dashed line indicates a TLS connection and a red container indicates an isolated private network.

Cassandra

A Cassandra cluster is used for storing metrics, insights and other information about the email account.

PostgreSQL

PostgreSQL is used for storing account metadata. At the moment, this includes user <-> email account mappings and time zone information for each email account.

Web Application

Web Application is a simple Django application which is, at the moment, only responsible for two things:

  1. Performing an initial OAuth 2 token exchange and retrieving the refresh token from the Google authorization servers. This happens when the user first registers and connects their Gmail account.
  2. Logging the user in. This happens on subsequent requests after the user has already connected their account.
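For the first step, the web application has to send the user to Google's consent screen in a way that yields a refresh token. A minimal sketch of building that URL follows; the endpoint and the access_type=offline parameter come from Google's OAuth 2.0 web server flow as I understand it, while the function name and the example values are made up for illustration.

```python
import urllib.parse

# Assumptions: Google's authorization endpoint and the full-mailbox scope used for IMAP.
AUTH_ENDPOINT = "https://accounts.google.com/o/oauth2/v2/auth"
GMAIL_IMAP_SCOPE = "https://mail.google.com/"


def build_authorization_url(client_id, redirect_uri, state):
    """Return the URL the user is redirected to when connecting their account."""
    params = {
        "response_type": "code",
        "client_id": client_id,
        "redirect_uri": redirect_uri,
        "scope": GMAIL_IMAP_SCOPE,
        # Without offline access Google does not hand out a refresh token.
        "access_type": "offline",
        # Opaque value echoed back on the callback, used to guard against CSRF.
        "state": state,
    }
    return AUTH_ENDPOINT + "?" + urllib.parse.urlencode(params)
```

The authorization code returned to redirect_uri is then exchanged for the refresh token on the server side.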

API Service

A Tornado service which exposes a public API for retrieving metrics and insights from the metrics database (Cassandra).

Workers

This service consists of Celery worker processes which run different jobs:

  1. Retrieval job - This job fetches email messages from the IMAP servers, parses them and stores email metadata [2] in the database.
  2. Aggregation jobs - These jobs aggregate previously retrieved metrics for the following periods: daily, weekly and monthly.
  3. Processing jobs - These jobs process previously aggregated data and infer all kinds of insights from it.
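To illustrate the retrieval job, here is a minimal sketch of pulling only the stored header values out of a raw message with the standard library email package. The header list mirrors the one given in the second footnote; the function name and the shape of the returned dictionary are assumptions, not the actual worker code.

```python
from email.parser import BytesParser
from email.policy import default as default_policy

# Header names listed in the footnote on stored metadata.
STORED_HEADERS = (
    "Subject", "From", "To", "Date", "X-Received", "Received-SPF",
    "DKIM-Signature", "Authentication-Results", "Message-ID", "In-Reply-To",
)


def extract_metadata(message_uid, raw_message: bytes) -> dict:
    """Return the message UID plus whichever of the listed headers are present."""
    # headersonly=True stops parsing at the end of the header block,
    # so the body is never turned into a structured object.
    parser = BytesParser(policy=default_policy)
    message = parser.parsebytes(raw_message, headersonly=True)
    metadata = {"Message UID": message_uid}
    for name in STORED_HEADERS:
        values = message.get_all(name)
        if values:
            # A header such as X-Received can appear several times.
            metadata[name] = [str(value) for value in values]
    return metadata
```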

To be able to authenticate and fetch email messages from the IMAP servers, this service needs to have access to the access token for the email account in question.

OAuth2 web application flow. Source: https://developers.google.com/accounts/docs/OAuth2WebServer

The service obtains the access token by requesting it from the token storage get service (more on that below).

As such, this service is also the only one which has access to the token storage get service and access tokens.

Token Storage Service

The token storage service actually consists of two separate services. The first one (“token storage set service”) is responsible solely for securely storing refresh tokens. The second one (“token storage get service”) is responsible for retrieving the refresh token from the database, decrypting it using the private key and using the decrypted refresh token to obtain an access token from the Google authorization servers.

To reduce the attack surface area, both services are designed to be small and simple. Both of them are simple Tornado services which expose an HTTP API with a single method to the consumers.

On top of that, those services run in an isolated private network and only a small set of services (two to be exact) have access to it. The web application has access to the set service and the workers have access to the get service.

Authentication to those services is handled using certificates. Certificate based authentication is not ideal because it adds a lot of overhead and basically requires you to manage and run your own certificate authority, but that’s a complex topic for a different post. For now it suffices to say that we have a simple process in place which works fine for a small number of certificates.
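As a rough sketch of what certificate-based authentication can look like on the server side, the snippet below builds a TLS context that rejects any client which does not present a certificate signed by our own CA. The function names and file paths are placeholders for illustration; this is not the exact configuration we run.

```python
import ssl


def new_client_auth_context() -> ssl.SSLContext:
    """Return a server-side context that insists on a client certificate."""
    context = ssl.create_default_context(ssl.Purpose.CLIENT_AUTH)
    # Fail the handshake unless the client presents a verifiable certificate.
    context.verify_mode = ssl.CERT_REQUIRED
    return context


def make_server_context(ca_file, cert_file, key_file) -> ssl.SSLContext:
    """Load the service's own certificate and the private CA used to verify clients."""
    context = new_client_auth_context()
    # Only certificates issued by this CA are accepted from clients.
    context.load_verify_locations(cafile=ca_file)
    # Certificate and key the service presents to its clients.
    context.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return context
```

A context like this can be passed as ssl_options when creating a Tornado HTTPServer.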

Token Storage SET Service

This service exposes a method for storing an encrypted refresh token in a local token database. Refresh tokens are encrypted using public-key / asymmetric cryptography (more on that below).

The refresh token is encrypted using a public key on the web server, which is responsible for performing the initial OAuth 2.0 exchange and retrieving the refresh token from the Google authorization servers.

Token storage SET service work-flow.
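The work-flow can be sketched in a few lines. Everything here is a simplified stand-in: the in-memory database replaces the real token database, and encrypt_with_public_key is a hypothetical callable which in our setup would be backed by the crypto library holding the public key.

```python
class InMemoryTokenDatabase:
    """Stand-in for the local token database (illustration only)."""

    def __init__(self):
        self._rows = {}

    def save(self, email_address, encrypted_token):
        self._rows[email_address] = encrypted_token

    def load(self, email_address):
        return self._rows[email_address]


def encrypt_on_web_server(refresh_token, encrypt_with_public_key):
    """Runs on the web server, which only ever holds the public key."""
    return encrypt_with_public_key(refresh_token.encode("utf-8"))


def handle_set_request(database, email_address, encrypted_token):
    """Runs inside the set service: it stores ciphertext and never sees plaintext."""
    if not isinstance(encrypted_token, bytes) or not encrypted_token:
        raise ValueError("expected a non-empty encrypted token")
    database.save(email_address, encrypted_token)
```

The point of the split is that neither the web server nor the set service is able to decrypt what has been stored.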

Token Storage GET Service

This service exposes a single method for retrieving an access token for an email account. The service retrieves the access token by first retrieving the encrypted refresh token from a local token database, decrypting it using a private key and then using the decrypted refresh token to obtain a temporary access token from the Google authorization servers.

As you can see above, this service needs to have access to the private key to be able to decrypt the refresh token. As such, this is the only service which has access to the private key and the ability to decrypt the refresh token.

Token storage GET service work-flow.
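A minimal sketch of the work-flow, with the three steps injected as callables so the example stays self-contained. The names load_encrypted_token, decrypt_with_private_key and exchange_refresh_token are hypothetical; in the real service they would be the token database lookup, the private-key decryption and the call to the Google authorization servers.

```python
def get_access_token(email_address, load_encrypted_token,
                     decrypt_with_private_key, exchange_refresh_token):
    """Return a temporary access token for the given email account."""
    # 1. Fetch the ciphertext from the local token database.
    encrypted_token = load_encrypted_token(email_address)
    # 2. Decrypt it; this is the only place the private key is used.
    refresh_token = decrypt_with_private_key(encrypted_token).decode("utf-8")
    # 3. Trade the refresh token for a short-lived access token.
    #    Only the access token leaves the service, never the refresh token.
    return exchange_refresh_token(refresh_token)
```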

Public Key Cryptography & Keyczar

As noted above, public-key cryptography is used to protect and securely store the refresh tokens. More specifically, the RSA algorithm with a 4096-bit key is used.

There are multiple ways to do public-key cryptography in Python (a big chunk of the server-side application is written in Python); some of the more popular choices include:

Because of my previous experience and other benefits which are mentioned later on, I have decided to go with KeyCzar.

KeyCzar is an open source cryptographic toolkit developed by Google with C++, Java and Python bindings available. One of the main goals of KeyCzar is to make it easier for developers to use cryptography safely. Unlike the other existing libraries mentioned above, it exposes a higher-level API with saner default values, which makes it harder for a developer to use it in a wrong or potentially harmful way.

On top of that, it also includes a command-line tool which allows users to manage (create, rotate, revoke) key files.

Key Storage and Management

Storage and management of the cryptographic keys is out of scope of this blog post, but it’s worth noting that it’s also an important topic. All of the effort you have put into designing and making your application secure doesn’t matter if you don’t securely store the cryptographic keys which are used to protect your secrets.

If you are using Amazon EC2 and are hosted on Amazon cloud, you should have a look at CloudHSM. On the other hand, if you are self-hosted, you should have a look at YubiHSM, a secure and cost-effective alternative to other usually more expensive hardware based security modules.

In the future, Barbican might also prove itself as a viable, lower-security, software-based alternative.

Conclusion

This time I have mostly focused on the high-level server-side application architecture, but in future posts I plan to go into more detail about the following topics:

  • how we handle key management
  • how we handle isolation via isolated private networks
  • how we handle client-side security in the Chrome extension

Note: If you think I have done something wrong or something can be further improved, don’t hesitate to contact me.

  1. In a nutshell, the SASL XOAUTH2 mechanism allows clients to authenticate using a username + OAuth 2.0 access token instead of using the more traditional approach of a username + password.

  2. In the first version, we don’t touch or store the message body. We just store the following metadata items and header values (if available): Message UID, Subject, From, To, Date, X-Received, Received-SPF, DKIM-Signature, Authentication-Results, Message-ID, In-Reply-To. On top of that, this “raw” data is only stored until it’s aggregated (for the average case this means less than 24 hours) and at most one week.