<h2 id="making-stackstorm-fast"><a href="/2021/07/04/making-stackstorm-fast.html">Making StackStorm Fast</a></h2>
<p>In this post I will describe changes to the StackStorm database abstraction
layer which landed in <a href="https://stackstorm.com/2021/06/29/stackstorm-v3-5-0-released/">StackStorm v3.5.0</a>. Those changes will substantially
speed up action executions and workflow runs for most users.</p>
<div class="imginline">
<a href="http://www.stackstorm.com" target="_blank"><img src="/images/stackstorm-fast/st2_fast_2.png" class="inline" /></a>
</div>
<p>Based on the benchmarks and load testing we have performed, most actions
which return large results and workflows which pass large datasets around
should see speed ups in the range of 5-15x.</p>
<p>If you want to learn more about the details you can do that below. Alternatively
if you only care about the numbers, you can go directly to the
<a href="#numbers">Numbers, numbers, numbers</a> section.</p>
<h3 id="background-and-history">Background and History</h3>
<p>Today StackStorm is used for solving a very diverse set of problems – from IT
and infrastructure provisioning to complex CI/CD pipelines, automated remediation,
various data processing pipelines and more.</p>
<p>Solving a lot of those problems requires passing large datasets around – this
usually involves passing around large dictionary objects to the actions (which
can be in the range of many MBs) and then inside the workflow, filtering down
the result object and passing it to other tasks in the workflow.</p>
<p>This works fine when working with small objects, but it starts to break when
larger datasets are passed around (dictionaries over 500 KB).</p>
<p>In fact, passing large results around has been StackStorm’s Achilles’ heel for
many years now (see some of the existing issues -
<a href="https://github.com/StackStorm/st2/issues/3712">#3718</a>,
<a href="https://github.com/StackStorm/st2/issues/4798">#4798</a>,
<a href="https://github.com/StackStorm/st2web/issues/625">#625</a>). Things will still work, but
executions and workflows which handle large datasets will get progressively
slower and waste progressively more CPU cycles, and no one likes slow software
and wasted CPU cycles (looking at you, Bitcoin).</p>
<p>One of the more popular workarounds usually involves storing those larger
results / datasets in a 3rd party system (such as a database) and then querying
this system and retrieving the data inside the action.</p>
<p>There have been many attempts to improve that in the past (see
<a href="https://github.com/StackStorm/st2/pull/4837">#4837</a>,
<a href="https://github.com/StackStorm/st2/pull/4838">#4838</a>,
<a href="https://github.com/StackStorm/st2/pull/4846">#4846</a>) and we did make some smaller
incremental improvements over the years, but most of them were in the range of a
couple of 10% of an improvement maximum.</p>
<p>After an almost year-long break from StackStorm due to a busy work and life
situation, I used StackStorm again to scratch my own itch. I noticed the age-old
“large results” problem hadn’t been solved yet, so I decided to take a
look at the issue again and try to make more progress on the PR I originally
started more than a year ago (<a href="https://github.com/StackStorm/st2/pull/4846">#4846</a>).</p>
<p>It took many late nights, but I was finally able to make good progress on it.
This should bring substantial speed ups and improvements to all StackStorm
users.</p>
<h3 id="why-the-problem-exists-today">Why the problem exists today</h3>
<p>Before we look into the implemented solution, I want to briefly explain why
StackStorm today is slow and inefficient when working with large datasets.</p>
<p>The primary reason StackStorm is slow when working with large datasets is
that we utilize the <code class="language-plaintext highlighter-rouge">EscapedDictField()</code> and <code class="language-plaintext highlighter-rouge">EscapedDynamicField()</code>
mongoengine field types for storing execution results and workflow state.</p>
<p>Those field types seemed like good candidates when we started almost 7 years
ago (and they do work relatively OK for smaller results and other metadata-like
fields), but over the years, after people started to push more data
through the system, it turned out they are very slow and inefficient for storing
and retrieving large datasets.</p>
<p>The slowness boils down to two main reasons:</p>
<ul>
<li>Field keys need to be escaped. Since <code class="language-plaintext highlighter-rouge">.</code> and <code class="language-plaintext highlighter-rouge">$</code> are special characters
in MongoDB used for querying, they need to be escaped recursively in all the
keys of a dictionary which is to be stored in the database. This can get slow
with large and deeply nested dictionaries (see the sketch after this list).</li>
<li>The mongoengine ORM library we use to interact with MongoDB is known to be
very slow compared to using pymongo directly when working with large documents
(see <a href="https://github.com/MongoEngine/mongoengine/issues/1230">#1230</a> and
<a href="https://stackoverflow.com/questions/35257305/mongoengine-is-very-slow-on-large-documents-compared-to-native-pymongo-usage">https://stackoverflow.com/questions/35257305/mongoengine-is-very-slow-on-large-documents-compared-to-native-pymongo-usage)</a>.
This is mostly due to the complex and slow conversion of types mongoengine
performs when storing and retrieving documents.</li>
</ul>
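<p>To illustrate the first point, here is a minimal sketch of what such recursive
escaping involves. This is purely illustrative and not the actual StackStorm
implementation; the placeholder replacement characters are an assumption:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">def escape_chars(value):
    # MongoDB doesn't allow "$" and "." inside document keys, so they
    # need to be replaced with placeholder sequences before saving and
    # swapped back after reading.
    if isinstance(value, dict):
        return {
            key.replace("$", "\uff04").replace(".", "\uff0e"): escape_chars(item)
            for key, item in value.items()
        }
    elif isinstance(value, list):
        return [escape_chars(item) for item in value]
    return value</code></pre></figure>
<p>For a deeply nested, multi-MB dictionary this walk has to touch every key and
allocate a new copy of the whole structure, once on every write and once (in
reverse) on every read.</p>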
<p>Those fields are also bad candidates for what we are using them for. Data we
are storing (results) is a more or less opaque binary blob to the database,
but we are storing it in a very rich field type which supports querying on
field keys and values. We don’t rely on any of that functionality and as
you know, nothing comes for free – querying on dictionary field values
requires more complex data structures internally in MongoDB and in some
cases also indexes. That’s wasteful and unnecessary in our case.</p>
<h3 id="solving-the-problem">Solving the Problem</h3>
<p>Over the years there have been many discussions on how to improve that. A
lot of users said we should switch away from MongoDB.</p>
<p>To begin with, I should say I’m not a big fan of MongoDB, but
the actual database layer itself is not the problem here.</p>
<p>If switching to a different database technology was justified (i.e. the
bottleneck was the database itself and not our code or the libraries we depend
on), then I might say go for it, but the reality is that even then, such a
rewrite is not even close to being realistic.</p>
<p>We do have abstractions / an ORM in place for working with the database
layer, but as anyone who has worked on a software project which has grown
organically over time knows, those abstractions get broken, misused or worked
around over time (for good or bad reasons; that’s not even important for
this discussion).</p>
<p>The reality is that moving to a different database technology would likely
require many person-months of work and we simply don’t have that. The
change would also be much more risky, very disruptive and would likely result
in many regressions and bugs – I have participated in multiple major
rewrites in the past and no matter how many tests you have, how good
the coding practices, the team, etc. are, there will always be bugs and
regressions. Nothing beats miles on the code, and with a rewrite you are
replacing all those miles and battle tested / hardened code with new code
which doesn’t have any of that.</p>
<p>Luckily after a bunch of research and prototyping I was able to come up with a
relatively simple solution which is much less invasive, fully backward
compatible and brings some serious improvements all across the board.</p>
<h3 id="implemented-approach">Implemented Approach</h3>
<p>Now that we know that using <code class="language-plaintext highlighter-rouge">DictField</code> and <code class="language-plaintext highlighter-rouge">DynamicField</code> is slow and
expensive, the challenge is to find a different field type which offers
much better performance.</p>
<p>After prototyping and benchmarking various approaches, I found that
using a binary data field type is the most efficient solution for our problem –
when using that field type, we can avoid all the escaping and, most importantly,
the very slow type conversions inside mongoengine.</p>
<p>This also works very well for us, since execution results, workflow results,
etc. are just an opaque blob to the database layer (we don’t perform any direct
queries on the result values or similar).</p>
<p>That’s all good, but in reality StackStorm results are JSON dictionaries
which can contain all the simple types (dicts, lists, numbers, strings,
booleans – and, as I recently learned, apparently even sets, even though that’s
not an official JSON type; mongoengine and some JSON libraries just
“silently” serialize it to a list). This means we still need to serialize the data
in some fashion which can be deserialized quickly and efficiently on retrieval
from the database.</p>
<p>Based on micro benchmark results, I decided to settle on JSON,
specifically the orjson library, which offers very good performance on large
datasets. So with the new field type changes, the execution result and various
other fields are now serialized as a JSON string and stored in the database as a
binary blob (well, we did add some sugar coating on top of JSON, just to make it
a bit more future proof and allow us to change the format in the future if needed,
and also to implement things such as per-field compression, etc.).</p>
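<p>A minimal sketch of the core idea, assuming mongoengine’s <code class="language-plaintext highlighter-rouge">BinaryField</code> as the
base class and orjson for serialization (the actual field type in StackStorm
also implements the header / versioning and compression support mentioned
above, and handles more edge cases):</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">import orjson
from mongoengine.fields import BinaryField


class JSONDictField(BinaryField):
    """Store a dictionary as an orjson-serialized binary blob.

    This sidesteps key escaping and mongoengine's per-item type
    conversion, since the database only ever sees an opaque chunk
    of bytes.
    """

    def to_mongo(self, value):
        # Serialize the whole dictionary to bytes in a single call
        return orjson.dumps(value)

    def to_python(self, value):
        # Values read back from the database arrive as bytes
        if isinstance(value, bytes):
            return orjson.loads(value)
        return value</code></pre></figure>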
<p>Technically using some kind of binary format (think Protobuf, msgpack,
flatbuffers, etc.) may be even faster, but those formats are primarily meant
for structured data (think all the fields and types are known up front) and
that’s not the case with our result and other fields – they can contain
arbitrary JSON dictionaries. While you can design a Protobuf structure which
would support our schemaless format, that would add a lot of overhead and very
likely in the end be slower than using JSON + orjson.</p>
<p>So even though the change sounds and looks really simple (remember – simple
code and designs are always better!), in reality it took a lot of time to get
everything to work and the tests to pass (there were a lot of edge cases, code
breaking abstractions, etc.), but luckily all of that is behind us now.</p>
<p>This new field type is now used for various models (execution, live action,
workflow, task execution, trigger instance, etc.).</p>
<p>Most improvements should be seen in the action runner and workflow engine
service layer, but secondary improvements should also be seen in st2api (when
retrieving and listing execution results, etc.) and rules engine (when
evaluating rules against trigger instances with large payloads).</p>
<h3 id="-numbers-numbers-numbers"><a name="numbers"></a> Numbers, numbers, numbers</h3>
<p>Now that we know how the new changes and field type works, let’s look at the
most important thing – actual numbers.</p>
<h4 id="micro-benchmarks">Micro-benchmarks</h4>
<p>I believe all decisions like that should be made and backed up with data, so I
started with some micro benchmarks for my proposed changes.</p>
<p>Those micro benchmarks measure how long it takes to insert and read a document
with a single large field from MongoDB, comparing the old and the new field types.</p>
<p>We also have micro benchmarks which cover more scenarios (think small values,
document with a lot of fields, document with single large field, etc.), but
those are not referenced here.</p>
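<p>To give a feel for the approach, here is a heavily simplified sketch which only
measures the serialization step in isolation (the actual benchmarks also measure
the full database round trip; the dictionary shape here is illustrative):</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">import json
import timeit

import orjson

# Build a dictionary which is roughly 4 MB when serialized
data = {"key_%s" % i: "x" * 1024 for i in range(4096)}

for name, dumps in [("json", json.dumps), ("orjson", orjson.dumps)]:
    duration = timeit.timeit(lambda: dumps(data), number=10)
    print("%s: %.4f seconds for 10 iterations" % (name, duration))</code></pre></figure>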
<p><strong>1. Database writes</strong></p>
<div class="imginline">
<a href="/images/stackstorm-fast/image1.png" target="_blank"><img src="/images/stackstorm-fast/image1.png" class="inline" /></a>
<span class="image-caption">This screenshot shows that the new field type (json dict field) is
~10x faster over EscapedDynamicField and ~15x over EscapedDictField when saving 4 MB field
value in the database.</span>
</div>
<p><strong>2. Database reads</strong></p>
<div class="imginline">
<a href="/images/stackstorm-fast/image6.png" target="_blank"><img src="/images/stackstorm-fast/image6.png" class="inline" /></a>
<span class="image-caption">This screenshot shows that the new field is about ~7x faster
over EscapedDynamicField and ~40x over EscapedDictField..</span>
</div>
<p>P.S. You should only look at the relative change and not the absolute numbers.
Those benchmarks ran on a relatively powerful server. On smaller VMs
you may see different absolute numbers, but the relative change should be about
the same.</p>
<p>Those micro benchmarks also run daily as part of our CI to prevent regressions
and similar, and you can view the complete results <a href="https://github.com/StackStorm/st2/actions/workflows/microbenchmarks.yaml">here</a>.</p>
<h4 id="end-to-end-load-tests">End to end load tests</h4>
<p>Micro benchmarks always serve as a good starting point, but in the end we care
about the complete picture.</p>
<p>Things never run in isolation, so we need to put all the pieces together and
measure how it performs in real-life scenarios.</p>
<p>To measure this, I utilized some synthetic and some more realistic actions
and workflows.</p>
<p><strong>1. Python runner action</strong></p>
<p>Here we have a simple Python runner action which reads a 4 MB JSON file from
disk and returns it as an execution result.</p>
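<p>The action used here is roughly of this shape (a sketch; the class name, file
path and parameter name are illustrative, not the actual test code):</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">import json

from st2common.runners.base_action import Action


class ReturnLargeResult(Action):
    def run(self, file_path="/opt/data/large_result_4mb.json"):
        # Read a ~4 MB JSON file from disk and return it as the
        # execution result
        with open(file_path, "r") as fp:
            return json.load(fp)</code></pre></figure>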
<p>Old field type</p>
<div class="imginline">
<a href="/images/stackstorm-fast/image2.png" target="_blank"><img src="/images/stackstorm-fast/image2.png" class="inline" /></a>
</div>
<p>New field type</p>
<div class="imginline">
<a href="/images/stackstorm-fast/image3.png" target="_blank"><img src="/images/stackstorm-fast/image3.png" class="inline" /></a>
</div>
<p>With the old field type it takes 12 seconds and with the new one it takes 1 second.</p>
<p>For the actual duration, please refer to the “log” field. Previous versions of
StackStorm contained a bug and didn’t accurately measure / report action run time –
the end_timestamp – start_timestamp difference only measures how long it took for the
action execution to complete, but it didn’t include the actual time it took to persist
the execution result in the database (and with large results the actual persistence
could easily take many tens of seconds) – and an execution is not actually
completed until the data is persisted in the database.</p>
<p><strong>2. Orquesta Workflow</strong></p>
<p>In this test I utilized an Orquesta workflow which runs a Python runner action
which returns ~650 KB of data; this data is then passed to other tasks in the workflow.</p>
<p>Old field type</p>
<div class="imginline">
<a href="/images/stackstorm-fast/image4.png" target="_blank"><img src="/images/stackstorm-fast/image4.png" class="inline" /></a>
</div>
<p>New field type</p>
<div class="imginline">
<a href="/images/stackstorm-fast/image5.png" target="_blank"><img src="/images/stackstorm-fast/image5.png" class="inline" /></a>
</div>
<p>Here we see that with the old field type it takes 95 seconds and with the new
one it takes 10 seconds.</p>
<p>With workflows we see even larger improvements. The reason is that the
workflow-related models utilize multiple fields of this type and also
perform many more database operations (reads and writes) compared to simple
non-workflow actions.</p>
<hr />
<p>You don’t need to take my word for it. You can download StackStorm v3.5.0 and
test the changes with your workloads.</p>
<p>Some of the early adopters have already tested those changes with their
workloads before StackStorm v3.5.0 was released, and so far the feedback has
been very positive - speed ups in the range of 5-15x.</p>
<h3 id="other-improvements">Other Improvements</h3>
<p>In addition to the database layer improvements, which are the star of the v3.5.0
release, I also made various performance improvements in other parts of the
system:</p>
<ul>
<li>Various API and CLI operations have been sped up by switching to orjson for
serialization and deserialization, and via various other optimizations.</li>
<li>Pack registration has been improved by reducing the number of redundant
queries and similar.</li>
<li>Various code which utilizes <code class="language-plaintext highlighter-rouge">yaml.safe_load</code> has been sped up by switching
to the C versions of those functions (see the sketch after this list).</li>
<li>ISO8601 / RFC3339 datetime string parsing has been sped up by switching to the
<code class="language-plaintext highlighter-rouge">udatetime</code> library.</li>
<li>Service start up time has been sped up by utilizing the <code class="language-plaintext highlighter-rouge">stevedore</code> library more
efficiently.</li>
<li>The WebUI has been substantially sped up - we won’t retrieve and display very
large results by default anymore. In the past, the WebUI would simply freeze the
browser window / tab when viewing the history tab. Do keep in mind that right
now only the execution part has been optimized; in some other scenarios the
WebUI will still try to syntax-highlight very large datasets, which
will result in the browser freezing.</li>
</ul>
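<p>For the <code class="language-plaintext highlighter-rouge">yaml.safe_load</code> item above, the pattern boils down to the following,
falling back to the pure Python loader when PyYAML is not built against libyaml:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">import yaml

try:
    # C implementation, available when PyYAML is built against libyaml
    from yaml import CSafeLoader as SafeLoader
except ImportError:
    from yaml import SafeLoader


def fast_safe_load(stream):
    # Equivalent to yaml.safe_load(), but substantially faster when the
    # C loader is available
    return yaml.load(stream, Loader=SafeLoader)</code></pre></figure>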
<h3 id="conclusion">Conclusion</h3>
<p>I’m personally very excited about those changes and hope you are as well.</p>
<p>They help address one of StackStorm’s long-known pain points. And we are not
just talking about 10% here and there, but up to 10-15x improvements for
executions and workflows which work with larger datasets (> 500 KB).</p>
<p>That 10-15x speed up doesn’t just mean executions and workflows will complete
faster; it also means much lower CPU utilization and fewer wasted CPU cycles (as
described above, due to the various conversions, storing large fields in the
database, and to a lesser extent also reading them, was previously a very CPU
intensive task).</p>
<p>So in a sense, you can view those changes as getting additional resources /
servers for free – previously you might have needed to add new pods / servers
running StackStorm services, but with those changes you should be able to get
much better throughput (executions / second) with the existing resources
(you may even be able to scale down!). Hey, who doesn’t like free servers :)</p>
<p>This means many large StackStorm users will be able to save hundreds or even
thousands of dollars per month in infrastructure costs. If this change
benefits you and you can afford it, check the <a href="https://stackstorm.com/donate/">Donate</a> page to see how you can
help the project.</p>
<h3 id="thanks">Thanks</h3>
<p>I would like to thank everyone who has contributed to the performance
improvements in any way.</p>
<p>Thanks to everyone who helped review that massive PR with over 100
commits (Winson, Drew, Jacob, Amanda), and to @guzzijones and others who tested
the changes while they were still in development.</p>
<p>This also includes many of our long-term users such as Nick Maludy,
@jdmeyer3 and others who reported this issue a long time ago and worked
around the limitations when working with larger datasets in various different
ways.</p>
<p>Special thanks also to v3.5.0 release managers <a href="https://github.com/amanda11">Amanda</a> and <a href="https://github.com/winem">Marcel</a>.</p>
<h2 id="consuming-aws-eventbridge-events-inside-stackstorm"><a href="/2019/07/13/consuming-aws-eventbridge-events-in-stackstorm.html">Consuming AWS EventBridge Events inside StackStorm</a></h2>
<p>Amazon Web Services (AWS) recently launched a new product called <a href="https://aws.amazon.com/eventbridge/">Amazon
EventBridge</a>.</p>
<p>EventBridge has a lot of similarities to <a href="https://stackstorm.com">StackStorm</a>, a popular open-source
cross-domain event-driven infrastructure automation platform. In some ways, you
could think of it as a very lightweight and limited version of StackStorm
as a service (SaaS).</p>
<p>In this blog post I will show you how you can extend StackStorm functionality
by consuming thousands of different events which are available through Amazon
EventBridge.</p>
<h3 id="why">Why?</h3>
<p>First of all, you might ask why you would want to do that.</p>
<p><a href="https://exchange.stackstorm.org/">StackStorm Exchange</a> already offers many different packs which allows users
to integrate with various popular projects and services (including AWS). In fact,
StackStorm Exchange integration integration packs expose over 1500 different
actions.</p>
<div class="imginline">
<a href="" target="_blank"><img src="/images/2019-07-13-consuming-aws-eventbridge-events-in-stackstorm/exchange.png" class="inline" /></a>
<span class="image-caption">StackStorm Exchange aka Pack Marketplace.</span>
</div>
<p>Even though StackStorm Exchange offers integration with many different products
and services, those integrations are still limited, especially on the incoming
events / triggers side.</p>
<p>Since event-driven automation is all about the events which can trigger various
actions and business logic, the more events you have access to, the better.</p>
<p>Run a workflow which runs an Ansible provision, creates a CloudFlare DNS record,
adds the new server to Nagios and adds it to the load balancer when a new EC2
instance is started? Check.</p>
<p>Honk your Tesla Model S horn when your satellite passes and establishes a
contact with <a href="https://aws.amazon.com/ground-station/">AWS Ground Station</a>? Check.</p>
<p>Having access to many thousands of different events exposed through EventBridge
opens up almost unlimited automation possibilities.</p>
<p>For a list of some of the events supported by EventBridge, please refer to
<a href="https://docs.aws.amazon.com/eventbridge/latest/userguide/event-types.html">their documentation</a>.</p>
<h3 id="consuming-eventbridge-events-inside-stackstorm">Consuming EventBridge Events Inside StackStorm</h3>
<p>There are many possible ways to integrate StackStorm and EventBridge and
consume EventBridge events inside StackStorm. Some more complex than others.</p>
<p>In this post, I will describe an approach which utilizes AWS Lambda function.</p>
<p>I decided to go with the AWS Lambda approach because it’s simple and straightforward.
It looks like this:</p>
<div class="imginline">
<a href="https://exchange.stackstorm.org/" target="_blank"><img src="/images/2019-07-13-consuming-aws-eventbridge-events-in-stackstorm/eventbridge_stackstorm.png" class="inline" /></a>
<span class="image-caption">AWS / partner event -> AWS EventBridge -> AWS Lambda Function -> StackStorm Webhooks API</span>
</div>
<ol>
<li>Event is generated by AWS service or a partner SaaS product</li>
<li>EventBridge rule matches an event and triggers AWS Lambda Function (rule target)</li>
<li>AWS Lambda Function sends an event to StackStorm using StackStorm Webhooks
API endpoint</li>
</ol>
<h4 id="1-create-stackstorm-rule-which-exposes-a-new-webhook">1. Create StackStorm Rule Which Exposes a New Webhook</h4>
<p>First we need to create a StackStorm rule which exposes a new <code class="language-plaintext highlighter-rouge">eventbridge</code>
webhook. This webhook will be available at the
<code class="language-plaintext highlighter-rouge">https://<example.com>/api/v1/webhooks/eventbridge</code> URL.</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">wget https://gist.githubusercontent.com/Kami/204a8f676c0d1de39dc841b699054a68/raw/b3d63fd7749137da76fa35ca1c34b47fd574458d/write_eventbridge_data_to_file.yaml
st2 rule create write_eventbridge_data_to_file.yaml</code></pre></figure>
<figure class="highlight"><pre><code class="language-yaml" data-lang="yaml"><span class="na">name</span><span class="pi">:</span> <span class="s2">"</span><span class="s">write_eventbridge_data_to_file"</span>
<span class="na">pack</span><span class="pi">:</span> <span class="s2">"</span><span class="s">default"</span>
<span class="na">description</span><span class="pi">:</span> <span class="s2">"</span><span class="s">Test</span><span class="nv"> </span><span class="s">rule</span><span class="nv"> </span><span class="s">which</span><span class="nv"> </span><span class="s">writes</span><span class="nv"> </span><span class="s">AWS</span><span class="nv"> </span><span class="s">EventBridge</span><span class="nv"> </span><span class="s">event</span><span class="nv"> </span><span class="s">data</span><span class="nv"> </span><span class="s">to</span><span class="nv"> </span><span class="s">file."</span>
<span class="na">enabled</span><span class="pi">:</span> <span class="no">true</span>
<span class="na">trigger</span><span class="pi">:</span>
<span class="na">type</span><span class="pi">:</span> <span class="s2">"</span><span class="s">core.st2.webhook"</span>
<span class="na">parameters</span><span class="pi">:</span>
<span class="na">url</span><span class="pi">:</span> <span class="s2">"</span><span class="s">eventbridge"</span>
<span class="na">criteria</span><span class="pi">:</span>
<span class="na">trigger.body.detail.eventSource</span><span class="pi">:</span>
<span class="na">pattern</span><span class="pi">:</span> <span class="s2">"</span><span class="s">ec2.amazonaws.com"</span>
<span class="na">type</span><span class="pi">:</span> <span class="s2">"</span><span class="s">equals"</span>
<span class="na">trigger.body.detail.eventName</span><span class="pi">:</span>
<span class="na">pattern</span><span class="pi">:</span> <span class="s2">"</span><span class="s">RunInstances"</span>
<span class="na">type</span><span class="pi">:</span> <span class="s2">"</span><span class="s">equals"</span>
<span class="na">action</span><span class="pi">:</span>
<span class="na">ref</span><span class="pi">:</span> <span class="s2">"</span><span class="s">core.local"</span>
<span class="na">parameters</span><span class="pi">:</span>
<span class="na">cmd</span><span class="pi">:</span> <span class="s2">"</span><span class="s">echo</span><span class="nv"> </span><span class="se">\"</span><span class="s">{{trigger.body}}</span><span class="se">\"</span><span class="nv"> </span><span class="s">>></span><span class="nv"> </span><span class="s">~/st2.webhook.out"</span></code></pre></figure>
<p>You can have as many rules as you want with the same webhook URL parameter.
This means you can utilize the same webhook endpoint to match as many
different events and trigger as many different actions / workflows as you want.</p>
<p>In the <code class="language-plaintext highlighter-rouge">criteria</code> field we filter on events which correspond to new EC2
instance launches (<code class="language-plaintext highlighter-rouge">eventName</code> matches <code class="language-plaintext highlighter-rouge">RunInstances</code> and <code class="language-plaintext highlighter-rouge">eventSource</code>
matches <code class="language-plaintext highlighter-rouge">ec2.amazonaws.com</code>). StackStorm <a href="https://docs.stackstorm.com/rules.html#critera-comparison">rule criteria comparison
operators</a> are quite expressive so you can also get more creative than that.</p>
<p>As this is just an example, we simply write the body of the matched event to
a file on disk (<code class="language-plaintext highlighter-rouge">/home/stanley/st2.webhook.out</code>). In a real life scenario,
you would likely utilize an <a href="https://github.com/StackStorm/orquesta">Orquesta workflow</a> which runs your more or less
complex business logic.</p>
<p>This could involve steps and actions such as:</p>
<ul>
<li>Add new instance to the load-balancer</li>
<li>Add new instance to your monitoring system</li>
<li>Notify Slack channel new instance has been started</li>
<li>Configure your firewall for the new instance</li>
<li>Run Ansible provision on it</li>
<li>etc.</li>
</ul>
<h4 id="2-configure-and-deploy-aws-lambda-function">2. Configure and Deploy AWS Lambda Function</h4>
<p>Once your rule is configured, you need to configure and deploy AWS Lambda
function.</p>
<p>You can find code for the Lambda Python function I wrote here -
<a href="https://github.com/Kami/aws-lambda-event-to-stackstorm">https://github.com/Kami/aws-lambda-event-to-stackstorm</a>.</p>
<p>I decided to use the Lambda Python environment, but the actual handler is very
simple, so I could just as easily have used the JavaScript / Node.js environment instead.</p>
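<p>The gist of the handler is just forwarding the received event to the StackStorm
webhook API. A minimal sketch, assuming the webhook URL and StackStorm API key
are passed in via environment variables (the actual function in the repository
linked above handles configuration and errors more carefully):</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">import os
import json
import urllib.request


def handler(event, context):
    # Forward the EventBridge event to the StackStorm webhook API,
    # e.g. https://example.com/api/v1/webhooks/eventbridge
    request = urllib.request.Request(
        os.environ["ST2_WEBHOOK_URL"],
        data=json.dumps(event).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "St2-Api-Key": os.environ["ST2_API_KEY"],
        },
    )
    with urllib.request.urlopen(request) as response:
        return {"status": response.status}</code></pre></figure>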
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">git clone https://github.com/Kami/aws-lambda-event-to-stackstorm.git
<span class="nb">cd </span>aws-lambda-event-to-stackstorm
<span class="c"># Install python-lambda package which takes care of creating and deploying</span>
<span class="c"># Lambda bundle for your</span>
pip <span class="nb">install </span>python-lambda
<span class="c"># Edit config.yaml file and make sure all the required environment variables</span>
<span class="c"># are set - things such as StackStorm Webhook URL, API key, etc.</span>
<span class="c"># vim config.yaml</span>
<span class="c"># Deploy your Lambda function</span>
<span class="c"># For that command to work, you need to have awscli package installed and</span>
<span class="c"># configured on your system (pip install --upgrade --user awscli ; aws configure)</span>
lambda deploy
<span class="c"># You can also test it locally by using the provided event.json sample event</span>
lambda invoke</code></pre></figure>
<p>You can confirm that the function has been deployed by going to the AWS console
or by running the AWS CLI commands:</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">aws lambda list-function
aws lambda get-function <span class="nt">--function-name</span> send_event_to_stackstorm</code></pre></figure>
<p>And you can verify that it’s running by tailing the function logs:</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nv">LAMBDA_FUNCTION_NAME</span><span class="o">=</span><span class="s2">"send_event_to_stackstorm"</span>
<span class="nv">LOG_STREAM_NAME</span><span class="o">=</span><span class="sb">`</span>aws logs describe-log-streams <span class="nt">--log-group-name</span> <span class="s2">"/aws/lambda/</span><span class="k">${</span><span class="nv">LAMBDA_FUNCTION_NAME</span><span class="k">}</span><span class="s2">"</span> <span class="nt">--query</span> logStreams[<span class="k">*</span><span class="o">]</span>.logStreamName | jq <span class="s1">'.[0]'</span> | xargs<span class="sb">`</span>
aws logs get-log-events <span class="nt">--log-group-name</span> <span class="s2">"/aws/lambda/</span><span class="k">${</span><span class="nv">LAMBDA_FUNCTION_NAME</span><span class="k">}</span><span class="s2">"</span> <span class="nt">--log-stream-name</span> <span class="s2">"</span><span class="k">${</span><span class="nv">LOG_STREAM_NAME</span><span class="k">}</span><span class="s2">"</span></code></pre></figure>
<h4 id="2-create-aws-eventbridge-rule-which-runs-your-lambda-function">2. Create AWS EventBridge Rule Which Runs Your Lambda Function</h4>
<p>Now we need to create an AWS EventBridge rule which will match the events and
trigger the AWS Lambda function.</p>
<div class="imginline">
<a href="/images/2019-07-13-consuming-aws-eventbridge-events-in-stackstorm/eventbridge_rule.png" target="_blank"><img src="/images/2019-07-13-consuming-aws-eventbridge-events-in-stackstorm/eventbridge_rule.png" class="inline" /></a>
<span class="image-caption">AWS EventBridge Rule Configuration</span>
</div>
<p>As you can see in the screenshot above, I simply configured the rule to send
every event to the Lambda function.</p>
<p>This may be OK for testing, but for production usage, you should narrow this
down to the actual events you are interested in. If you don’t, you might get
surprised by your AWS Lambda bill - even on small AWS accounts, there are tons
of events being constantly generated by various services and account
actions.</p>
<h4 id="3-monitor-your-stackstorm-instance-for-new-aws-eventbridge-events">3. Monitor your StackStorm Instance For New AWS EventBridge Events</h4>
<p>As soon as you configure and enable the rule, new AWS EventBridge events
(trigger instances) should start flowing into your StackStorm deployment.</p>
<p>You can monitor for new instances using <code class="language-plaintext highlighter-rouge">st2 trace list</code> and
<code class="language-plaintext highlighter-rouge">st2 trigger-instance list</code> commands.</p>
<div class="imginline">
<a href="/images/2019-07-13-consuming-aws-eventbridge-events-in-stackstorm/st2_trace_list.png" target="_blank"><img src="/images/2019-07-13-consuming-aws-eventbridge-events-in-stackstorm/st2_trace_list.png" class="inline" /></a>
<span class="image-caption">AWS EventBridge event matched StackStorm rule
criteria and triggered an action execution.</span>
</div>
<p>And as soon as a new EC2 instance is launched, your action which was defined in
the StackStorm rule above will be executed.</p>
<h4 id="conclusion">Conclusion</h4>
<p>This post showed how easy it is to consume AWS EventBridge events inside
StackStorm and tie those two services together.</p>
<p>Gaining access to many thousands of different AWS and AWS partner events
inside StackStorm opens up many new possibilities and allows you to apply
cross-domain automation to many new situations.</p>
<h2 id="migrating-from-zerigo-to-rackspace-cloud-dns-using-libcloud"><a href="/2014/01/18/migrating-from-zerigo-to-rackspace-cloud-dns-using-libcloud.html">Migrating from Zerigo to Rackspace Cloud DNS using Libcloud</a></h2>
<p>In this blog post I’m going to describe how to migrate from <a href="http://www.zerigo.com/managed-dns">Zerigo DNS</a>
to <a href="http://www.rackspace.com/cloud/dns/">Rackspace Cloud DNS</a> using a ~80 lines long Python script which utilizes
<a href="https://libcloud.apache.org/">Libcloud</a>.</p>
<div class="imginline">
<a href="http://libcloud.apache.org" target="_blank">
<img src="/images/2013-12-11-libcloud-update-key-pair-management-methods-are-now-part-of-the-base-api/libcloud.png" class="inline" /></a>
</div>
<h3 id="background-and-motivation">Background and Motivation</h3>
<p>In September of last year, I wrote about how to <a href="/2013/09/07/exporting-libcloud-dns-zone-to-bind-zone-file-format-and-migrating-between-dns-providers.html">export a Libcloud zone to the
BIND zone format</a> and use the BIND zone file to migrate between DNS
providers.</p>
<p>At that time, my motivation for migrating away from Zerigo was mostly fueled
by a very unreliable service which was a consequence of DDoS attacks and less
than ideal service architecture.</p>
<p>I had a paid Zerigo plan, so back then I only migrated the most important
domains to a different provider. Not long after I had done this, Zerigo
announced that they had <a href="http://www.zerigo.com/article/akamai-dns-partnership">partnered with Akamai</a> and that going forward,
they would outsource the running of the DNS infrastructure to Akamai and, as such,
the service should be much more stable and reliable.</p>
<p>I thought great, I won’t need to migrate the rest of the domains away, but an
unpleasant surprise came earlier this month, when Zerigo announced pricing
changes (see <a href="http://www.zerigo.com/news/notice-zerigo-dns-change-of-plans">1</a>, <a href="http://www.zerigo.com/news/zerigo-price-increase-facts">2</a>, <a href="http://www.zerigo.com/news/on-grandfathering-pre-paid-dns-accounts">3</a> & <a href="https://gist.github.com/Kami/5199908f006383dbfdcc">4</a>).</p>
<p>Previously, I paid <strong>$19 per year</strong>, but with a new plan which
matches my current one, I would need to pay <strong>$25 per month</strong>. That’s with
an existing customer loyalty discount. New customers will need to pay
<strong>$38 per month</strong> (what a great deal: instead of paying 24 times more,
now I need to pay <strong>just</strong> 15 times more!). Yes, you have read this correctly:
that’s more than an order of magnitude more per year than I used to pay
before.</p>
<p>I honestly don’t mind paying for great software and services and I wouldn’t
mind paying a little more if the service improved, but that kind of price
increase is simply too much. That is especially true because all of the ~15
domains that I still have at Zerigo are used to host non-profit and community
websites, and paying $25 per month for them is simply too much.</p>
<h3 id="why-rackspace-cloud-dns">Why Rackspace Cloud DNS?</h3>
<p><em>Disclaimer: I used to work at Rackspace, but I don’t work there anymore and
I’m not affiliated with them in any way.</em></p>
<p>Before I dive further, let’s have a look at why you might want to use Rackspace
Cloud DNS.</p>
<p>The main reason for me to migrate to Rackspace is that they have a decent
API, they are supported in Libcloud and, best of all, the service is totally
free for existing cloud servers customers. On top of that, the service is
supposed to use Anycast.</p>
<p>All of that made it a good fit for hosting my non-profit domains there.</p>
<p>I also need to add that I haven’t used the service a lot before, so I can’t
really say much about the service reliability at this point. Only time and
monitoring will tell how reliable the service really is.</p>
<h3 id="migrating-from-zerigo-dns-to-rackspace-cloud-dns-using-libcloud">Migrating from Zerigo DNS to Rackspace Cloud DNS using Libcloud</h3>
<p>Instead of using Libcloud’s export to BIND zone file functionality, this script
works by talking directly to both of the provider APIs.</p>
<p>The reason for that is that this approach is more robust and makes
performing partial migrations and synchronizations easier. On top of that it
also works with other providers which don’t support importing a BIND zone file.</p>
<p>It’s also important to note that the script relies on some Libcloud fixes which
are currently only available in trunk. As such, you should use <code class="language-plaintext highlighter-rouge">pip</code> to
install latest version from Git inside a virtual environment:</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">pip <span class="nb">install</span> <span class="nt">-e</span> git+https://github.com/apache/libcloud.git@trunk#egg<span class="o">=</span>libcloud</code></pre></figure>
<p>After you have done this, you can use the script below to migrate all of
your zones from Zerigo to Rackspace:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">hashlib</span>
<span class="kn">from</span> <span class="nn">libcloud.dns.types</span> <span class="kn">import</span> <span class="n">Provider</span><span class="p">,</span> <span class="n">RecordType</span>
<span class="kn">from</span> <span class="nn">libcloud.dns.providers</span> <span class="kn">import</span> <span class="n">get_driver</span>
<span class="n">ZERIGO_USERNAME</span> <span class="o">=</span> <span class="s">''</span>
<span class="n">ZERIGO_API_KEY</span> <span class="o">=</span> <span class="s">''</span>
<span class="n">RACKSPACE_USERNAME</span> <span class="o">=</span> <span class="s">''</span>
<span class="n">RACKSPACE_API_KEY</span> <span class="o">=</span> <span class="s">''</span>
<span class="n">CONTACT_EMAIL</span> <span class="o">=</span> <span class="s">''</span> <span class="c1"># Rackspace requires a valid email for every domain
</span>
<span class="n">ZONE_TTL</span> <span class="o">=</span> <span class="mi">30</span> <span class="o">*</span> <span class="mi">60</span> <span class="c1"># Default zone TTL (in seconds) which should be used
</span><span class="n">MIN_TTL</span> <span class="o">=</span> <span class="mi">300</span> <span class="c1"># Minim TTL supported by the target provider
</span><span class="n">IGNORED_RECORD_TYPES</span> <span class="o">=</span> <span class="p">[</span><span class="n">RecordType</span><span class="p">.</span><span class="n">NS</span><span class="p">,</span> <span class="n">RecordType</span><span class="p">.</span><span class="n">PTR</span><span class="p">]</span>
<span class="n">source_cls</span> <span class="o">=</span> <span class="n">get_driver</span><span class="p">(</span><span class="n">Provider</span><span class="p">.</span><span class="n">ZERIGO</span><span class="p">)(</span><span class="n">ZERIGO_USERNAME</span><span class="p">,</span> <span class="n">ZERIGO_API_KEY</span><span class="p">)</span>
<span class="n">destination_cls</span> <span class="o">=</span> <span class="n">get_driver</span><span class="p">(</span><span class="n">Provider</span><span class="p">.</span><span class="n">RACKSPACE</span><span class="p">)(</span><span class="n">RACKSPACE_USERNAME</span><span class="p">,</span>
<span class="n">RACKSPACE_API_KEY</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">get_record_hash</span><span class="p">(</span><span class="n">record</span><span class="p">):</span>
<span class="s">"""
Return a hash for the provided record. This is used to determine if the
record already exists.
"""</span>
<span class="n">record_hash</span> <span class="o">=</span> <span class="n">hashlib</span><span class="p">.</span><span class="n">md5</span><span class="p">(</span><span class="s">'%s-%s-%s'</span> <span class="o">%</span> <span class="p">(</span><span class="n">record</span><span class="p">.</span><span class="n">name</span><span class="p">,</span> <span class="n">record</span><span class="p">.</span><span class="nb">type</span><span class="p">,</span>
<span class="n">record</span><span class="p">.</span><span class="n">data</span><span class="p">)).</span><span class="n">hexdigest</span><span class="p">()</span>
<span class="k">return</span> <span class="n">record_hash</span>
<span class="n">source_zones</span> <span class="o">=</span> <span class="n">source_cls</span><span class="p">.</span><span class="n">list_zones</span><span class="p">()</span>
<span class="n">destination_zones</span> <span class="o">=</span> <span class="n">destination_cls</span><span class="p">.</span><span class="n">list_zones</span><span class="p">()</span>
<span class="n">destination_domains</span> <span class="o">=</span> <span class="p">[</span><span class="n">zone</span><span class="p">.</span><span class="n">domain</span> <span class="k">for</span> <span class="n">zone</span> <span class="ow">in</span> <span class="n">destination_zones</span><span class="p">]</span>
<span class="c1"># 1. Create zones
</span><span class="k">for</span> <span class="n">zone</span> <span class="ow">in</span> <span class="n">source_zones</span><span class="p">:</span>
<span class="k">if</span> <span class="n">zone</span><span class="p">.</span><span class="n">domain</span> <span class="ow">in</span> <span class="n">destination_domains</span><span class="p">:</span>
<span class="k">print</span><span class="p">(</span><span class="s">'Zone "%s" already exists, skipping...'</span> <span class="o">%</span> <span class="p">(</span><span class="n">zone</span><span class="p">.</span><span class="n">domain</span><span class="p">))</span>
<span class="k">continue</span>
<span class="n">extra</span> <span class="o">=</span> <span class="p">{</span><span class="s">'email'</span><span class="p">:</span> <span class="n">CONTACT_EMAIL</span><span class="p">}</span>
<span class="k">print</span><span class="p">(</span><span class="s">'Creating zone: %s'</span> <span class="o">%</span> <span class="p">(</span><span class="n">zone</span><span class="p">.</span><span class="n">domain</span><span class="p">))</span>
<span class="n">destination_cls</span><span class="p">.</span><span class="n">create_zone</span><span class="p">(</span><span class="n">domain</span><span class="o">=</span><span class="n">zone</span><span class="p">.</span><span class="n">domain</span><span class="p">,</span> <span class="n">ttl</span><span class="o">=</span><span class="n">ZONE_TTL</span><span class="p">,</span>
<span class="n">extra</span><span class="o">=</span><span class="n">extra</span><span class="p">)</span>
<span class="n">destination_zones</span> <span class="o">=</span> <span class="n">destination_cls</span><span class="p">.</span><span class="n">list_zones</span><span class="p">()</span>
<span class="n">supported_record_type</span> <span class="o">=</span> <span class="n">destination_cls</span><span class="p">.</span><span class="n">list_record_types</span><span class="p">()</span>
<span class="c1"># 2. Create records
</span><span class="k">for</span> <span class="n">source_zone</span> <span class="ow">in</span> <span class="n">source_zones</span><span class="p">:</span>
<span class="n">destination_zone</span> <span class="o">=</span> <span class="p">[</span><span class="n">zone</span> <span class="k">for</span> <span class="n">zone</span> <span class="ow">in</span> <span class="n">destination_zones</span>
<span class="k">if</span> <span class="n">zone</span><span class="p">.</span><span class="n">domain</span> <span class="o">==</span> <span class="n">source_zone</span><span class="p">.</span><span class="n">domain</span><span class="p">][</span><span class="mi">0</span><span class="p">]</span>
<span class="n">source_records</span> <span class="o">=</span> <span class="n">source_zone</span><span class="p">.</span><span class="n">list_records</span><span class="p">()</span>
<span class="n">destination_records</span> <span class="o">=</span> <span class="n">destination_zone</span><span class="p">.</span><span class="n">list_records</span><span class="p">()</span>
<span class="k">for</span> <span class="n">source_record</span> <span class="ow">in</span> <span class="n">source_records</span><span class="p">:</span>
<span class="c1"># Rackspace doesn't have a special SPF record type
</span> <span class="k">if</span> <span class="n">source_record</span><span class="p">.</span><span class="nb">type</span> <span class="o">==</span> <span class="n">RecordType</span><span class="p">.</span><span class="n">SPF</span><span class="p">:</span>
<span class="n">source_record</span><span class="p">.</span><span class="nb">type</span> <span class="o">=</span> <span class="n">RecordType</span><span class="p">.</span><span class="n">TXT</span>
<span class="n">record_hash</span> <span class="o">=</span> <span class="n">get_record_hash</span><span class="p">(</span><span class="n">source_record</span><span class="p">)</span>
<span class="n">destination_record_hashes</span> <span class="o">=</span> <span class="p">[</span><span class="n">get_record_hash</span><span class="p">(</span><span class="n">record</span><span class="p">)</span> <span class="k">for</span> <span class="n">record</span>
<span class="ow">in</span> <span class="n">destination_records</span><span class="p">]</span>
<span class="k">if</span> <span class="n">source_record</span><span class="p">.</span><span class="n">name</span><span class="p">:</span>
<span class="n">fqdn</span> <span class="o">=</span> <span class="s">'%s.%s'</span> <span class="o">%</span> <span class="p">(</span><span class="n">source_record</span><span class="p">.</span><span class="n">name</span><span class="p">,</span> <span class="n">source_zone</span><span class="p">.</span><span class="n">domain</span><span class="p">)</span>
<span class="k">else</span><span class="p">:</span>
<span class="n">fqdn</span> <span class="o">=</span> <span class="n">source_zone</span><span class="p">.</span><span class="n">domain</span>
<span class="k">if</span> <span class="n">record_hash</span> <span class="ow">in</span> <span class="n">destination_record_hashes</span><span class="p">:</span>
<span class="k">print</span><span class="p">(</span><span class="s">'Record "%s" already exists, skipping...'</span> <span class="o">%</span> <span class="p">(</span><span class="n">fqdn</span><span class="p">))</span>
<span class="k">continue</span>
<span class="k">if</span> <span class="n">source_record</span><span class="p">.</span><span class="nb">type</span> <span class="ow">in</span> <span class="n">IGNORED_RECORD_TYPES</span><span class="p">:</span>
<span class="k">print</span><span class="p">((</span><span class="s">'Encountered ignored record type (type=%s,name=%s) '</span>
<span class="s">'skipping...'</span><span class="p">)</span> <span class="o">%</span> <span class="p">(</span><span class="n">source_record</span><span class="p">.</span><span class="nb">type</span><span class="p">,</span> <span class="n">fqdn</span><span class="p">))</span>
<span class="k">continue</span>
<span class="k">if</span> <span class="nb">type</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">supported_record_type</span><span class="p">:</span>
<span class="k">print</span><span class="p">((</span><span class="s">'Encountered unsupported record type (type=%s,name=%s)'</span>
<span class="s">', skipping...'</span><span class="p">)</span> <span class="o">%</span> <span class="p">(</span><span class="n">source_record</span><span class="p">.</span><span class="nb">type</span><span class="p">,</span> <span class="n">fqdn</span><span class="p">))</span>
<span class="k">continue</span>
<span class="n">extra</span> <span class="o">=</span> <span class="p">{}</span>
<span class="n">ttl</span> <span class="o">=</span> <span class="n">source_record</span><span class="p">.</span><span class="n">extra</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">'ttl'</span><span class="p">,</span> <span class="bp">None</span><span class="p">)</span>
<span class="n">priority</span> <span class="o">=</span> <span class="n">source_record</span><span class="p">.</span><span class="n">extra</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">'priority'</span><span class="p">,</span> <span class="bp">None</span><span class="p">)</span>
<span class="k">if</span> <span class="n">ttl</span><span class="p">:</span>
<span class="k">if</span> <span class="n">ttl</span> <span class="o"><</span> <span class="n">MIN_TTL</span><span class="p">:</span>
<span class="n">ttl</span> <span class="o">=</span> <span class="n">MIN_TTL</span>
<span class="n">extra</span><span class="p">[</span><span class="s">'ttl'</span><span class="p">]</span> <span class="o">=</span> <span class="n">ttl</span>
<span class="k">if</span> <span class="n">priority</span><span class="p">:</span>
<span class="n">extra</span><span class="p">[</span><span class="s">'priority'</span><span class="p">]</span> <span class="o">=</span> <span class="n">priority</span>
<span class="n">name</span> <span class="o">=</span> <span class="n">source_record</span><span class="p">.</span><span class="n">name</span>
<span class="nb">type</span> <span class="o">=</span> <span class="n">source_record</span><span class="p">.</span><span class="nb">type</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">source_record</span><span class="p">.</span><span class="n">data</span>
<span class="k">print</span><span class="p">(</span><span class="s">'Creating a record: %s'</span> <span class="o">%</span> <span class="p">(</span><span class="n">fqdn</span><span class="p">))</span>
<span class="n">destination_zone</span><span class="p">.</span><span class="n">create_record</span><span class="p">(</span><span class="n">name</span><span class="o">=</span><span class="n">name</span><span class="p">,</span> <span class="nb">type</span><span class="o">=</span><span class="nb">type</span><span class="p">,</span> <span class="n">data</span><span class="o">=</span><span class="n">data</span><span class="p">,</span>
<span class="n">extra</span><span class="o">=</span><span class="n">extra</span><span class="p">)</span></code></pre></figure>
<p>Before proceeding it’s worth knowing that there are some differences between
the providers and some limitations you should be aware of:</p>
<ul>
<li>Zerigo supports more record types. If you use more advanced record types
which are not supported by Rackspace, then Rackspace might not be a good
fit for you.</li>
<li>Rackspace only allows you to create <code class="language-plaintext highlighter-rouge">PTR</code> records for resources (cloud
servers & load balancers) which are hosted in their data centers.</li>
<li>Rackspace doesn’t support the <code class="language-plaintext highlighter-rouge">SPF</code> record type. This is not a big deal since
this record type has been deprecated anyway and <code class="language-plaintext highlighter-rouge">TXT</code> can be used instead.
The script transparently handles the remapping of <code class="language-plaintext highlighter-rouge">SPF</code> to <code class="language-plaintext highlighter-rouge">TXT</code> for you.</li>
<li>The minimum TTL supported by Zerigo is <code class="language-plaintext highlighter-rouge">180</code> seconds and the minimum supported
TTL by Rackspace is <code class="language-plaintext highlighter-rouge">300</code> seconds. If during the migration the script
encounters a TTL smaller than 300 seconds, it simply uses the smallest
possible TTL, which is 300 seconds.</li>
</ul>
<p>To use it, simply plug in your API credentials and run it:</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">python migrate_dns_providers.py</code></pre></figure>
<div class="imginline">
<a href="/images/2014-01-18-migrating-from-zerigo-to-rackspace-cloud-dns-using-libcloud/zerigo.png" class="fancybox"><img src="/images/2014-01-18-migrating-from-zerigo-to-rackspace-cloud-dns-using-libcloud/zerigo.png" class="inline" /></a>
<span class="image-caption">Zerigo control panel.</span>
</div>
<p>If the script for some reason fails half-way through (bad connectivity, API
issues, etc.), it’s safe to run it again since all the operations are
idempotent.</p>
<div class="imginline">
<a href="/images/2014-01-18-migrating-from-zerigo-to-rackspace-cloud-dns-using-libcloud/rax.png" class="fancybox"><img src="/images/2014-01-18-migrating-from-zerigo-to-rackspace-cloud-dns-using-libcloud/rax.png" class="inline" /></a>
<span class="image-caption">Rackspace Cloud DNS control panel after the migration.</span>
</div>
<p>After you have run the script, you should check if everything looks OK and if
it does, you can go ahead and change the nameservers for your domains to point
to the Rackspace Cloud DNS servers (<code class="language-plaintext highlighter-rouge">dns1.stabletransit.com</code> &
<code class="language-plaintext highlighter-rouge">dns2.stabletransit.com</code>).</p>
<h2 id="designing-a-server-side-application-for-secure-storage-of-access-tokens-and-other-secrets"><a href="/2013/12/27/designing-a-server-side-application-for-secure-storage-of-access-tokens-and-other-secrets.html">Designing a server-side application for secure storage of access tokens and other secrets</a></h2>
<p>One of the projects I’m currently working on is an augmented inbox service. The
primary goal of the service is to allow users to use email in a more efficient
manner and spend less time in their inbox.</p>
<p>It helps the user achieve that by overlaying the inbox with all kinds of important
contextual information about the sender or recipient. This overlay consists of
different insights, metrics and suggestions which are derived from the
historical usage data and real-time information obtained from social media
profiles. You can think of it as <a href="https://rapportive.com/">Rapportive</a> with contextual data.</p>
<div class="imginline">
<img src="/images/2013-12-27-designing-a-server-side-application-for-secure-storage-of-access-tokens-and-other-secrets/extension_prototype.png" class="inline" />
<span class="image-caption">Early prototype of the overlay served as a
Chrome Extension.</span>
</div>
<p>Historical data is obtained by analyzing the user’s inbox. The service works by
connecting to GMail’s IMAP servers using the SASL XOAUTH2 mechanism<sup id="fnref:fn1" role="doc-noteref"><a href="#fn:fn1" class="footnote" rel="footnote">1</a></sup>.</p>
<p>To authenticate with the GMail IMAP servers, the service uses an access token
which is obtained from the Google authorization servers using a user-specific
refresh token and the OAuth 2.0 refresh token flow.</p>
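<p>As a sketch, the refresh token exchange looks roughly like this (using the
requests library and Google’s current OAuth 2.0 token endpoint; the credential
values are placeholders):</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">import requests


def refresh_access_token(client_id, client_secret, refresh_token):
    # Exchange the long-lived refresh token for a short-lived access
    # token using the OAuth 2.0 refresh token grant
    response = requests.post(
        "https://oauth2.googleapis.com/token",
        data={
            "client_id": client_id,
            "client_secret": client_secret,
            "refresh_token": refresh_token,
            "grant_type": "refresh_token",
        },
    )
    response.raise_for_status()
    return response.json()["access_token"]</code></pre></figure>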
<p>Since the service needs to periodically fetch the user’s emails, it
needs to securely persist the refresh token so it can be re-used
later on.</p>
<p>In this blog post I’m going to describe how I have approached and designed the
server-side application architecture to provide secure storage of refresh
tokens.</p>
<p>Keep in mind that you can use a similar approach to securely store other user
secrets (different service keys, access tokens, credentials, SSH keys and so
on).</p>
<h3 id="background--motivation">Background & Motivation</h3>
<p>It doesn’t really matter what kind of application you are working on; you
should always treat users’ privacy and security as a top priority. This is
especially important if you are handling sensitive, private or secret data
(like refresh tokens in this case).</p>
<p>This means you should dedicate sufficient time and resources to designing,
developing and reviewing your application and making sure it’s secure. Sadly a
lot of people and organizations don’t recognize that (or they are simply
ignorant). Because of that, incidents like the recent <a href="http://nakedsecurity.sophos.com/2013/11/04/anatomy-of-a-password-disaster-adobes-giant-sized-cryptographic-blunder/">Adobe breach</a> (Adobe
encrypted passwords using 3DES in ECB mode, seriously!) and the <a href="http://open.bufferapp.com/buffer-has-been-hacked-here-is-whats-going-on/">Buffer hack</a>
are a lot more catastrophic than they would have been if those companies had stored
credentials properly and in a secure manner.</p>
<h3 id="application-design--security-principles">Application Design & Security Principles</h3>
<p>Here are some of the main security principles I have adhered to while designing
and working on the application:</p>
<ul>
<li>Keep it simple.</li>
<li>Don’t roll your own crypto, use well known, researched and tested principles,
algorithms, methods and libraries.</li>
<li>Use a <a href="http://en.wikipedia.org/wiki/Layered_security">layered approach to security</a>.</li>
<li>Design the services to adhere to the <a href="http://en.wikipedia.org/wiki/Principle_of_least_privilege">principle of least privilege</a>.</li>
<li>To reduce the attack surface, design simple and small services.</li>
<li>Isolate different services and components.</li>
</ul>
<h3 id="quick-note-about-isolation">Quick Note About Isolation</h3>
<p>As noted above, I have used isolation and a layered approach to security. In
this case, the service isolation consists of the following layers:</p>
<ul>
<li>Virtualization (Xen)</li>
<li>Isolated private networks</li>
<li>Software firewall</li>
</ul>
<h3 id="architecture-overview">Architecture Overview</h3>
<p>This section contains a high-level application architecture overview and a
short description of the important services and their roles.</p>
<div class="imginline">
<a href="/images/2013-12-27-designing-a-server-side-application-for-secure-storage-of-access-tokens-and-other-secrets/arch_overview.png" class="fancybox" rel="post">
<img src="/images/2013-12-27-designing-a-server-side-application-for-secure-storage-of-access-tokens-and-other-secrets/arch_overview_thumb.png" class="inline" />
</a>
<span class="image-caption">High level architecture overview.</span>
</div>
<p>Note: Dashed line indicates a TLS connection and a red container indicates an
isolated private network.</p>
<p><strong>Cassandra</strong></p>
<p>The Cassandra cluster is used for storing metrics, insights and other information
about the email account.</p>
<p><strong>PostgreSQL</strong></p>
<p>PostgreSQL is used for storing account meta data. At the moment, this includes
user <-> email account mappings and time zone information for each email
account.</p>
<p><strong>Web Application</strong></p>
<p>Web Application is a simple Django application which is, at the moment,
only responsible for two things:</p>
<ol>
<li>Performing an initial OAuth 2 token exchange and retrieving the refresh
token from the Google authorization servers. This happens when the user
first registers and connects their Gmail account (a condensed sketch of this
exchange follows the list).</li>
<li>Logging the user in. This happens on subsequent requests after the user
has already connected their account.</li>
</ol>
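<p>Here is that condensed sketch of the code-for-token exchange against the Google
authorization servers. The client credentials and redirect URI are placeholders
and error handling is omitted - treat it as an illustration of the flow, not the
application’s actual code:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">import requests

def exchange_authorization_code(code):
    # Google redirects back to us with a one-time authorization code which
    # we exchange for a refresh token (plus an initial access token)
    response = requests.post('https://accounts.google.com/o/oauth2/token', data={
        'code': code,
        'client_id': 'your-client-id',
        'client_secret': 'your-client-secret',
        'redirect_uri': 'https://example.com/oauth2/callback',
        'grant_type': 'authorization_code',
    })
    payload = response.json()
    # The refresh token is only returned on the initial authorization, so
    # this is the one chance we get to persist it
    return payload['refresh_token']</code></pre></figure>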
<p><strong>API Service</strong></p>
<p><a href="http://www.tornadoweb.org/en/stable/">Tornado</a> service which exposes a public API for retrieving metrics and
insights from the metrics database (Cassandra).</p>
<p><strong>Workers</strong></p>
<p>This service consists of <a href="http://www.celeryproject.org/">Celery</a> worker processes which run different
jobs (a stripped-down sketch follows the list):</p>
<ol>
<li>Retrieval job - This job fetches email messages from the IMAP servers,
parses them and stores email meta data<sup id="fnref:fn2" role="doc-noteref"><a href="#fn:fn2" class="footnote" rel="footnote">2</a></sup> in the database.</li>
<li>Aggregation jobs - Those jobs aggregate previously retrieved metrics for
the following periods: daily, weekly and monthly.</li>
<li>Processing Jobs - Those jobs process previously aggregated data and infer
all kinds of insights from it.</li>
</ol>
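<p>And here is the stripped-down sketch of how such jobs could be declared with
Celery. The task bodies, module layout and broker URL are made up for
illustration purposes:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">from celery import Celery

app = Celery('workers', broker='amqp://guest@localhost//')

@app.task
def retrieve_messages(account_id):
    # 1. Fetch new messages over IMAP, parse them and store the meta data
    pass

@app.task
def aggregate_metrics(account_id, period):
    # 2. Aggregate previously retrieved metrics for the given period
    # ('daily', 'weekly' or 'monthly')
    pass

@app.task
def process_insights(account_id):
    # 3. Infer insights from the previously aggregated data
    pass</code></pre></figure>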
<p>To be able to authenticate and fetch email messages from the IMAP servers, this
service needs to have access to the access token for the email account in
question.</p>
<div class="imginline">
<a href="https://developers.google.com/accounts/docs/OAuth2WebServer" target="_blank"><img src="/images/2013-12-27-designing-a-server-side-application-for-secure-storage-of-access-tokens-and-other-secrets/oauth2_webflow.png" class="inline" /></a>
<span class="image-caption">OAuth2 web application flow. Source: https://developers.google.com/accounts/docs/OAuth2WebServer</span>
</div>
<p>The service obtains an access token by asking the token storage get
service for it (more on that below).</p>
<p>As such, this service is also the only one which has access to the token
storage get service and access tokens.</p>
<p><strong>Token Storage Service</strong></p>
<p>The token storage service actually consists of two separate services. The first one
is responsible solely for securely storing refresh tokens (“token storage set
service”) and the second one (“token storage get service”) is responsible for
retrieving the refresh token from the database, decrypting it using the private key
and using the decrypted refresh token to obtain an access token from the Google
authorization servers.</p>
<p>To reduce the attack surface area, both services are designed to be small and
simple. Both of them are simple Tornado services which expose an HTTP API with
a single method to the consumers.</p>
<p>On top of that, those services run in an isolated private network and only a
small set of services (two to be exact) have access to it. The web application
has access to the set service and the workers have access to the get service.</p>
<p>Authentication to those services is handled using certificates. Certificate
based authentication is not ideal because it adds a lot of overhead and
basically requires you to manage and run your own certificate authority, but
that’s a complex topic for a different post. For now it suffices to say that
we have a simple process in place which works fine for a small number of
certificates.</p>
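<p>As an illustration, requiring clients to present a certificate signed by our
internal CA in a Tornado service mostly boils down to passing the right
<code class="language-plaintext highlighter-rouge">ssl_options</code> to the HTTP server.
A minimal sketch with placeholder paths and a dummy handler:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">import ssl

import tornado.httpserver
import tornado.ioloop
import tornado.web

class TokenHandler(tornado.web.RequestHandler):
    def get(self):
        self.write({'status': 'ok'})  # placeholder response

application = tornado.web.Application([(r'/token', TokenHandler)])

# cert_reqs=ssl.CERT_REQUIRED makes the TLS handshake fail for any client
# which doesn't present a certificate signed by our internal CA
server = tornado.httpserver.HTTPServer(application, ssl_options={
    'certfile': '/etc/pki/token-service.crt',
    'keyfile': '/etc/pki/token-service.key',
    'ca_certs': '/etc/pki/internal-ca.crt',
    'cert_reqs': ssl.CERT_REQUIRED,
})
server.listen(8443)
tornado.ioloop.IOLoop.instance().start()</code></pre></figure>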
<p><strong>Token Storage SET Service</strong></p>
<p>This service exposes a method for storing an encrypted refresh token in a local
token database. Refresh tokens are encrypted using public-key / asymmetric
cryptography (more on that below).</p>
<p>The refresh token is encrypted using a public key on the web server, which is
responsible for performing the initial OAuth 2.0 exchange and retrieving the
refresh token from the Google authorization servers.</p>
<div class="imginline">
<a href="/images/2013-12-27-designing-a-server-side-application-for-secure-storage-of-access-tokens-and-other-secrets/token_storage_set_service_flow.png" class="fancybox" rel="post">
<img src="/images/2013-12-27-designing-a-server-side-application-for-secure-storage-of-access-tokens-and-other-secrets/token_storage_set_service_flow_thumb.png" class="inline" />
</a>
<span class="image-caption">Token storage SET service work-flow.</span>
</div>
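<p>Conceptually, the core of the set service is just a couple of lines. Here is a
minimal sketch using KeyCzar (covered in more detail below); the keyset path and
the token database call are hypothetical and the HTTP handler is elided:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">from keyczar import keyczar

# The keyset only contains the public half of the RSA key pair, so this
# service can encrypt refresh tokens but can never decrypt them
encrypter = keyczar.Encrypter.Read('/etc/keys/refresh-token-public')

def store_refresh_token(token_db, account_id, refresh_token):
    ciphertext = encrypter.Encrypt(refresh_token)
    token_db.save(account_id, ciphertext)  # hypothetical database call</code></pre></figure>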
<p><strong>Token Storage GET Service</strong></p>
<p>This service exposes a single method for retrieving an access token for an
email account. The service retrieves the access token by first
retrieving the encrypted refresh token from a local token database, decrypting it
using a private key and then using the decrypted refresh token to obtain a
temporary access token from the Google authorization servers.</p>
<p>As you can see above, this service needs to have access to the private key to
be able to decrypt the refresh token. As such, this is the only service which
has access to the private key and the ability to decrypt the refresh token.</p>
<div class="imginline">
<a href="/images/2013-12-27-designing-a-server-side-application-for-secure-storage-of-access-tokens-and-other-secrets/token_storage_get_service_flow.png" class="fancybox" rel="post">
<img src="/images/2013-12-27-designing-a-server-side-application-for-secure-storage-of-access-tokens-and-other-secrets/token_storage_get_service_flow_thumb.png" class="inline" /></a>
<span class="image-caption">Token storage GET service work-flow.</span>
</div>
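<p>The core of the get service is the mirror image and it is the only place in the
whole system where the private key is readable. Again a rough sketch, with a
hypothetical keyset path, database call and token exchange helper:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">from keyczar import keyczar

# This keyset contains the private key - only the get service host has
# read access to it
crypter = keyczar.Crypter.Read('/etc/keys/refresh-token-private')

def get_access_token(token_db, account_id):
    ciphertext = token_db.load(account_id)  # hypothetical database call
    refresh_token = crypter.Decrypt(ciphertext)
    # Exchange the decrypted refresh token for a short-lived access token
    # using the OAuth 2.0 refresh token grant (hypothetical helper, see the
    # refresh token grant sketch earlier in the post)
    return exchange_refresh_token(refresh_token)</code></pre></figure>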
<h3 id="public-key-cryptography--keyczar">Public Key Cryptography & Keyczar</h3>
<p>As noted above, public-key cryptography is used to protect and securely store
the refresh tokens. More specifically, the <a href="http://en.wikipedia.org/wiki/RSA_(cryptosystem)">RSA algorithm</a> with a 4096 bit key
is used.</p>
<p>There are multiple ways to do public-key cryptography in Python (a big chunk of
the server side application is written in Python). Some of the more popular
choices include:</p>
<ul>
<li><a href="https://www.dlitz.net/software/pycrypto/">PyCrypto</a></li>
<li><a href="https://pypi.python.org/pypi/M2Crypto">M2Crypto</a></li>
<li><a href="http://www.keyczar.org/">KeyCzar</a></li>
<li>And it looks like in the near future we will have another option available -
<a href="https://cryptography.io/en/latest/">cryptography</a></li>
</ul>
<p>Because of my previous experience and other benefits which are mentioned later
on, I have decided to go with KeyCzar.</p>
<p>KeyCzar is an open source cryptographic toolkit developed by Google with
<a href="https://code.google.com/p/keyczar/wiki/CppTutorial">C++</a>, Java and <a href="https://code.google.com/p/keyczar/wiki/SamplePythonUsage">Python</a> bindings available. One of the main goals of
KeyCzar is to make it easier for developers to use cryptography safely. Unlike
other existing libraries mentioned above, it exposes a higher-level API with
more sane default values which make it harder for developers to use it in a
wrong or potentially harmful way.</p>
<p>On top of that, it also includes a <a href="https://code.google.com/p/keyczar/wiki/KeyczarTool">command-line tool</a> which allows users
to manage (create, rotate, revoke) key files.</p>
<h3 id="key-storage-and-management">Key Storage and Management</h3>
<p>Storage and management of the cryptographic keys is out of scope of this blog
post, but it’s worth noting that it’s also an important topic. All of the
effort you have put into designing and making your application secure doesn’t
matter if you don’t securely store cryptographic keys which are used to protect
your secrets.</p>
<p>If you are using Amazon EC2, you should have a
look at <a href="http://aws.amazon.com/cloudhsm/">CloudHSM</a>. On the other hand, if you are self-hosted, you should
have a look at <a href="https://www.yubico.com/products/yubihsm/">YubiHSM</a>, a secure and cost-effective alternative to other
usually more expensive hardware based security modules.</p>
<p>In the future, <a href="https://github.com/cloudkeep/barbican">Barbican</a> might also prove itself as a viable, lower
security software based alternative.</p>
<h3 id="conclusion">Conclusion</h3>
<p>This time I have mostly focused on the high-level server-side application
architecture, but in the future posts I plan to go into more details about
the following topics:</p>
<ul>
<li>how we handle key management</li>
<li>how we handle isolation via isolated private networks</li>
<li>how we handle client side security in the chrome extension</li>
</ul>
<p>Note: If you think I have done something wrong or something can be further
improved, don’t hesitate to <a href="/about.html#contact-info">contact me</a>.</p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:fn1" role="doc-endnote">
<p>In a nutshell, the <a href="https://developers.google.com/gmail/xoauth2_protocol">SASL XOAUTH2</a> mechanism allows clients to
authenticate using username + OAuth 2.0 access token instead of using a more
traditional approach of using a username + password. <a href="#fnref:fn1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:fn2" role="doc-endnote">
<p>In the first version, we don’t touch or store the message body. We just
store the following meta data items and header values (if available):
<code class="language-plaintext highlighter-rouge">Message UID</code>, <code class="language-plaintext highlighter-rouge">Subject</code>, <code class="language-plaintext highlighter-rouge">From</code>, <code class="language-plaintext highlighter-rouge">To</code>, <code class="language-plaintext highlighter-rouge">Date</code>, <code class="language-plaintext highlighter-rouge">X-Received</code>, <code class="language-plaintext highlighter-rouge">Received-SPF</code>,
<code class="language-plaintext highlighter-rouge">DKIM-Signature</code>, <code class="language-plaintext highlighter-rouge">Authentication-Results</code>, <code class="language-plaintext highlighter-rouge">Message-ID</code>, <code class="language-plaintext highlighter-rouge">In-Reply-To</code>. On
top of that, this “raw” data is only stored until it’s aggregated (for the
average case this means less than 24 hours) and at most one week. <a href="#fnref:fn2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
Libcloud update - Key pair management methods are now part of the base APIhttps://www.tomaz.me/2013/12/11/libcloud-update-key-pair-management-methods-are-now-part-of-the-base-api.html2013-12-11T00:00:00+01:00<h2 id="libcloud-update---key-pair-management-methods-are-now-part-of-the-base-api"><a href="/2013/12/11/libcloud-update-key-pair-management-methods-are-now-part-of-the-base-api.html">Libcloud update - Key pair management methods are now part of the base API</a></h2>
<p>Yesterday I <a href="https://github.com/apache/libcloud/pull/189">merged a Libcloud pull request</a> which promotes SSH key
pair management methods to be part of the base <a href="http://libcloud.apache.org/">Libcloud</a> compute API.</p>
<div class="imginline">
<a href="http://libcloud.apache.org" target="_blank">
<img src="/images/2013-12-11-libcloud-update-key-pair-management-methods-are-now-part-of-the-base-api/libcloud.png" class="inline" /></a>
</div>
<p>In this post I’m going to talk a bit about the project history and evolution
and show how to utilize this new functionality.</p>
<h3 id="history-and-background">History and Background</h3>
<p>Libcloud was originally developed in 2009 at <a href="http://en.wikipedia.org/wiki/Cloudkick">Cloudkick</a> to solve the problem
of talking to multiple different cloud provider APIs.</p>
<p>Later that year, the project <a href="http://incubator.apache.org/projects/libcloud.html">joined the Apache Incubator</a> and in May of 2011
it graduated from the incubator to a top level project.</p>
<div class="imginline">
<a href="http://xkcd.com/927/" target="_blank"><img src="/images/2013-12-11-libcloud-update-key-pair-management-methods-are-now-part-of-the-base-api/standards.png" class="inline" /></a>
<span class="image-caption">An example of how Libcloud did
<strong>not</strong> come to be.</span>
</div>
<p>The first couple of versions were simple and only exposed a small API (~6 methods)
for managing cloud / virtual servers.</p>
<div class="imginline">
<img src="/images/2013-12-11-libcloud-update-key-pair-management-methods-are-now-part-of-the-base-api/libcloud_apis.png" class="inline" />
<span class="image-caption">A list of methods supported in the first few
versions of Libcloud. Source: <a href="http://paul.querna.org/slides/libcloud-2010-06.pdf" target="_blank">
Apache Libcloud @ Open Source Bridge</a> presentation.</span>
</div>
<p>Down the road, the pace of cloud evolution and competition increased and providers
started adding more and more features and new services. On top of that, demand
from our users also grew, so it made sense for us to start thinking about
increasing the project scope and adding support for other cloud services.</p>
<p>As such, <a href="http://mail-archives.apache.org/mod_mbox/libcloud-dev/201105.mbox/%3CBANLkTi%3DLqBidHLHUwAJSAWSzd-qSpad%2BdA%40mail.gmail.com%3E">version 0.5.0</a> was born in 2011. This version represented a very
important milestone for the project. It was the first release which moved away
from only supporting the compute API and added support for managing cloud load
balancers and object storage.</p>
<p>Not long afterwards, <a href="http://mail-archives.apache.org/mod_mbox/libcloud-dev/201111.mbox/%3CCAJMHEmKkRPVeLjJ%2BCeTFU0wrW2QbyOz2bd3HVLi3Ydw283oDKQ%40mail.gmail.com%3E">version 0.6.0</a> which added support for a brand new
DNS API was released.</p>
<p>Since then, we haven’t added support for any other new APIs, but have spent a
lot of time improving the existing functionality, adding new features, adding
support for new providers and improving the library all around.</p>
<p>If you are curious about what we have been working on lately, you should have
a look at the <a href="https://libcloud.readthedocs.org/en/latest/upgrade_notes.html#libcloud-0-14-0">Upgrade Notes</a> and <a href="https://github.com/apache/libcloud/blob/trunk/CHANGES#L3">Changelog</a> for Libcloud 0.14.0,
a new stable version which should be released some time in the next couple of
weeks.</p>
<h3 id="ssh-key-pair-management-methods-promotion">SSH key pair management methods promotion</h3>
<p>Functionality for managing key pairs had already been available in some drivers as
part of the extension methods and arguments for quite some time. In
Libcloud, extension methods expose provider specific functionality and usually
differ from one provider to another.</p>
<p>Not long ago, we spent some time unifying those arguments and methods, but
there were still some minor differences between providers which made
for a not so pleasant experience for our users.</p>
<p>Because of that, I have decided to again evaluate how much sense it makes for
us to promote those methods to be part of the base Libcloud compute API. It’s
important to keep in mind that Libcloud acts as a lowest common denominator
which means that only functionality which is exposed by the majority of providers
supported in Libcloud can be part of the base API.</p>
<p>It turned out that most of the providers we support offer key pair management
functionality which means those methods are indeed a good candidate to be part
of the base API. Because of that, I decided to <a href="http://mail-archives.apache.org/mod_mbox/libcloud-dev/201312.mbox/%3CCAJMHEmKOsFYJZDZQLb_Z2q1Rs8Ke%2B%2BxUNnqqEbPjyTccTgPYHQ%40mail.gmail.com%3E">write up a proposal</a>.</p>
<div class="imginline">
<a href="https://libcloud.readthedocs.org/en/latest/compute/key_pair_management.html" target="_blank"><img src="/images/2013-12-11-libcloud-update-key-pair-management-methods-are-now-part-of-the-base-api/docs.png" class="inline" /></a>
<span class="image-caption">Documentation for the new functionality.</span>
</div>
<p>After some feedback and tweaks to the proposed interface, I implemented
the proposed changes and updated the existing code. To ease the migration and
make it less painful for users who rely on the existing extension methods, I
have decided to deprecate those methods and leave them in place until the next
major release.</p>
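<p>One common way to implement such a soft deprecation (not necessarily verbatim
what Libcloud does internally) is to keep the old <code class="language-plaintext highlighter-rouge">ex_</code>
prefixed methods around as thin wrappers which emit a deprecation warning and
delegate to the new base API method. A rough sketch with made up method names:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">import warnings

def deprecated(replacement):
    # Decorator which emits a DeprecationWarning pointing users at the
    # new base API method before delegating to the wrapped function
    def decorator(func):
        def wrapper(*args, **kwargs):
            warnings.warn('%s is deprecated, use %s instead' %
                          (func.__name__, replacement),
                          category=DeprecationWarning)
            return func(*args, **kwargs)
        return wrapper
    return decorator

class ExampleDriver(object):
    def create_key_pair(self, name):
        pass  # new base API method

    @deprecated(replacement='create_key_pair')
    def ex_create_keypair(self, name):
        # Old extension method kept around until the next major release
        return self.create_key_pair(name=name)</code></pre></figure>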
<h3 id="working-with-the-new-api">Working with the new API</h3>
<p>The example below demonstrates how to use the new SSH key pair management methods
which are now part of the base compute API.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">os</span>
<span class="kn">from</span> <span class="nn">pprint</span> <span class="kn">import</span> <span class="n">pprint</span>
<span class="kn">from</span> <span class="nn">libcloud.compute.types</span> <span class="kn">import</span> <span class="n">Provider</span>
<span class="kn">from</span> <span class="nn">libcloud.compute.providers</span> <span class="kn">import</span> <span class="n">get_driver</span>
<span class="n">cls</span> <span class="o">=</span> <span class="n">get_driver</span><span class="p">(</span><span class="n">Provider</span><span class="p">.</span><span class="n">EXOSCALE</span><span class="p">)</span>
<span class="n">driver</span> <span class="o">=</span> <span class="n">cls</span><span class="p">(</span><span class="s">'api key'</span><span class="p">,</span> <span class="s">'api secret key'</span><span class="p">)</span>
<span class="c1"># Create a new key pair. Most providers will return generated private key in
# the response which can be accessed at key_pair.private_key
</span><span class="n">key_pair</span> <span class="o">=</span> <span class="n">driver</span><span class="p">.</span><span class="n">create_key_pair</span><span class="p">(</span><span class="n">name</span><span class="o">=</span><span class="s">'test-key-pair-1'</span><span class="p">)</span>
<span class="n">pprint</span><span class="p">(</span><span class="n">key_pair</span><span class="p">)</span>
<span class="c1"># Import an existing public key from a file. If you have public key as a
# string, you can use import_key_pair_from_string method instead.
</span><span class="n">key_file_path</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">expanduser</span><span class="p">(</span><span class="s">'~/.ssh/id_rsa_test.pub'</span><span class="p">)</span>
<span class="n">key_pair</span> <span class="o">=</span> <span class="n">driver</span><span class="p">.</span><span class="n">import_key_pair_from_file</span><span class="p">(</span><span class="n">name</span><span class="o">=</span><span class="s">'test-key-pair-2'</span><span class="p">,</span>
<span class="n">key_file_path</span><span class="o">=</span><span class="n">key_file_path</span><span class="p">)</span>
<span class="n">pprint</span><span class="p">(</span><span class="n">key_pair</span><span class="p">)</span>
<span class="c1"># Retrieve information about previously created key pair
</span><span class="n">key_pair</span> <span class="o">=</span> <span class="n">driver</span><span class="p">.</span><span class="n">get_key_pair</span><span class="p">(</span><span class="n">name</span><span class="o">=</span><span class="s">'test-key-pair-1'</span><span class="p">)</span>
<span class="n">pprint</span><span class="p">(</span><span class="n">key_pair</span><span class="p">)</span>
<span class="c1"># Delete a key pair we have previously created
</span><span class="n">status</span> <span class="o">=</span> <span class="n">driver</span><span class="p">.</span><span class="n">delete_key_pair</span><span class="p">(</span><span class="n">key_pair</span><span class="o">=</span><span class="n">key_pair</span><span class="p">)</span>
<span class="n">pprint</span><span class="p">(</span><span class="n">status</span><span class="p">)</span></code></pre></figure>
<p>As you can see, I have used the Exoscale provider in my example, but it should work
exactly the same with other providers which support this functionality.
Currently those providers are Amazon EC2, OpenStack (and other OpenStack based
providers such as Rackspace) and CloudStack (and other CloudStack based
providers such as Exoscale and Ikoula).</p>
<p>For a full list of providers which support this functionality, please refer to
the <a href="https://libcloud.readthedocs.org/en/latest/compute/supported_providers.html#supported-methods-key-pair-management">supported providers / methods page</a>.</p>
<h3 id="conclusion">Conclusion</h3>
<p>I hope the addition of the SSH key pair management methods to the base compute
API will make it even easier for our users to work with multiple providers and
pave the way for the promotion of other methods which will make Libcloud
more suitable for more complex / advanced use cases.</p>
Libcloud and the road to 1.0 releasehttps://www.tomaz.me/2013/10/28/libcloud-and-the-road-to-1-0-release.html2013-10-28T00:00:00+01:00<h2 id="libcloud-and-the-road-to-10-release"><a href="/2013/10/28/libcloud-and-the-road-to-1-0-release.html">Libcloud and the road to 1.0 release</a></h2>
<p>Back in September of 2011, I was a <a href="http://twit.tv/show/floss-weekly/181">guest on FLOSS Weekly</a> where I was
interviewed about <a href="http://libcloud.apache.org/">Libcloud</a>.</p>
<div class="imginline">
<img src="/images/2013-10-28-libcloud-and-the-road-to-1-0-release/libcloud.png" class="inline" />
</div>
<p>If you are not familiar with <a href="http://twit.tv/show/floss-weekly">FLOSS Weekly</a>, it’s a weekly podcast (hence the
name) about <a href="http://en.wikipedia.org/wiki/Gratis_versus_libre">free and libre</a> open source software. I have been listening
to it for a long time (it’s a great way to spend time while you run errands / shop
for groceries / bike to work) and one of my favorite things is that it covers a
very wide range of guests and topics. Guests range from newcomers to the open source
world to people with 15+ years of experience and a long open source contribution
history. Same goes for projects. They range from small hobby projects you might
never have heard of to popular projects with very large communities and
ecosystems such as Arduino and OpenStack Swift.</p>
<h3 id="the-road-to-10">The road to 1.0</h3>
<p>Anyway, let’s get back on topic.</p>
<p>One of the questions was how stable Libcloud is and when users can expect a
1.0 release. At that time, some of the APIs such as DNS and storage had been added
just recently, but the compute API had been stable for quite a while and was used in
production in multiple places.</p>
<p>My answer was something along the lines that for the main part, Libcloud
already is production ready, the 0.x versioning scheme is just an artifact
left over from the past and a 1.0 version should hopefully be released some
time next year.</p>
<p>It has been more than 2 years since then and we have made numerous releases
during that time, but a version 1.0 still hasn’t been released yet.</p>
<p>You might ask why. The closest reason for that is that the documentation was
lacking and we simply hadn’t made the switch yet.</p>
<div class="imginline">
<a href="https://libcloud.readthedocs.org"><img src="/images/2013-10-28-libcloud-and-the-road-to-1-0-release/documentation.png" class="inline" /></a>
<span class="image-caption">New documentation which is available at
<a href="https://libcloud.readthedocs.org">https://libcloud.readthedocs.org</a>.</span>
</div>
<p>The documentation situation has been <a href="https://libcloud.readthedocs.org/en/latest/">improving lately</a>, so a while back I
thought it’s finally time to start working on a 1.0 release.</p>
<h3 id="10-does-it-even-matter">1.0, does it even matter?</h3>
<p>Some of you might ask why switch to 1.0 and not just continue with the 0.x series?</p>
<p>There are multiple reasons for that:</p>
<ul>
<li>
<p>A <code class="language-plaintext highlighter-rouge">1.0</code> release indicates production readiness to a lot of people. I’m
personally more of a rolling release guy and believe that the whole
“production ready” concept is often misleading and the switch is usually made
purely for marketing and political reasons and not technical ones.
In any case, a lot of users still associate 1.0 with production readiness so
it makes sense for us to switch and indicate to them that Libcloud is safe to be
used in production.</p>
</li>
<li>
<p>Move to 1.0 will finally allow us to use <a href="http://semver.org/">semantic versioning</a>. As noted
above, the current versioning scheme is mostly an artifact from the past.
Using semantic versioning will make it easier for our users to understand
what is going on and know what to expect with each release. If you want to
know other reasons, see <a href="https://libcloud.readthedocs.org/en/latest/">this</a> mailing list thread.</p>
</li>
</ul>
<div class="imginline">
<img src="/images/2013-10-28-libcloud-and-the-road-to-1-0-release/semantic_versioning.png" class="inline" />
<span class="image-caption">An example of semantic versioning scheme. Source: http://www.aosabook.org/en/eclipse.html</span>
</div>
<p>Some of you might also say that this switch seems kind of arbitrary. That is true,
but as noted above, base APIs have been stable for a long time and at this
point, we should just do it.</p>
<p>On top of that, Linux kernel did a similar thing with <a href="http://arstechnica.com/information-technology/2011/07/linux-kernel-version-bumped-up-to-30-as-20th-birthday-approaches/">transition from 2.6 to
3.0</a> and if Linux kernel can do it, we can do it as well :P</p>
<h3 id="road-to-10">Road to 1.0</h3>
<p>I would say that at this point the road to 1.0 is pretty short and my goal is
to get the release out some time in the next couple of months.</p>
<p>We are currently working on a 0.14.0 release. This release includes some
pretty big changes and improvements and will most likely be the last release
with backward-incompatible changes before the 1.0 one.</p>
<p>One of the more important changes this release brings is improved support for
providers with multiple regions. For more, see <a href="https://libcloud.readthedocs.org/en/latest/upgrade_notes.html#libcloud-0-14-0">Upgrade notes</a> and
<a href="https://git-wip-us.apache.org/repos/asf?p=libcloud.git;a=blob;f=CHANGES#l3">CHANGES</a> file.</p>
<p>And as far as backward-incompatible changes go, we have been doing our best to
avoid them as much as possible, even before the 1.0 release. Sadly that is not
always possible and we did end up with some backward incompatible changes
in the past, but all of those changes were very small and non-invasive. We also
didn’t press users to update their code as soon as possible and we supported the
old (deprecated) way of doing things for a long time to make the transition
easier.</p>
<p>Another thing which I would also like to see done before 1.0 release is an
improved and more user-friendly website. It is something which has been on my
todo for a long time and I have even started to work on it in the past, but I
was always distracted by other things and never made much progress.</p>
<div class="imginline">
<img src="/images/2013-10-28-libcloud-and-the-road-to-1-0-release/new_website.png" class="inline" />
<span class="image-caption">Sneak peek of a new website. Keep in mind that
this is a very early draft which is likely to change in the near future.</span>
</div>
<p>In any case, Jerry recently started working on an improved design based on
Bootstrap 3. Once the new design is ready, it should be a fairly easy and smooth
ride from then on.</p>
<h3 id="how-can-i-help">How can I help?</h3>
<p>As always, any kind of help and contributions are welcome and appreciated. One
of the things we need the most help with at the moment is documentation and
testing of the 0.14.0 release once it becomes available.</p>
<p>For information on how to contribute, see <a href="https://libcloud.readthedocs.org/en/latest/development.html#contributing">this page</a>.</p>
<h3 id="so-when">So, when?</h3>
<p>As noted above, two main pre-requisites for the 1.0 release are a new website
and a 0.14 release. Both of those things should be available in the near
future.</p>
<p>We obviously do a lot of testing and have a fairly comprehensive test suite,
but nothing is perfect and sadly, some bugs almost always manage to get
overlooked.</p>
<p>I imagine this will also be the case with 0.14.0 release and even more so
since it includes a large number of changes and improvements.</p>
<p>Because of that, I want to give 0.14 enough time in the wild before preparing
a 1.0 release. This means you can expect 1.0 release around 6 - 12 weeks
after 0.14 becomes available.</p>
Migrating from epydoc to Sphinx style docstrings using sed and some command line fuhttps://www.tomaz.me/2013/09/28/migrating-from-epydoc-to-sphinx-style-docstrings-using-sed-and-some-command-line-fu.html2013-09-28T00:00:00+02:00<h2 id="migrating-from-epydoc-to-sphinx-style-docstrings-using-sed-and-some-command-line-fu"><a href="/2013/09/28/migrating-from-epydoc-to-sphinx-style-docstrings-using-sed-and-some-command-line-fu.html">Migrating from epydoc to Sphinx style docstrings using sed and some command line fu</a></h2>
<p>This post describes how to migrate Python API documentation which uses
<a href="http://epydoc.sourceforge.net/">epydoc</a> style docstrings to <a href="http://sphinx-doc.org/">Sphinx</a> format using sed and some command
line fu.</p>
<h3 id="motivation">Motivation</h3>
<p>After a gentle nudge by <a href="http://alexgaynor.net/">Alex Gaynor</a>, we have recently finally started
to work on a task which was long overdue - improving documentation for the
<a href="http://libcloud.apache.org/">Libcloud</a> project.</p>
<p>Improving and updating documentation has been on my todo for a long time, but
I was always too busy and / or had an excuse to work on code or some other
non-documentation related part of the project.</p>
<p>I know there is no good excuse or apology for that, but I don’t want to digress
too much from the original title of this post, so I plan to go into more
details in a separate blog post. For now it suffices to say that we have
already made quite a lot of progress and as always,
<a href="http://ci.apache.org/projects/libcloud/docs/development.html#contributing">your contributions are very much appreciated and welcome</a>.</p>
<div class="imginline">
<a href="https://libcloud.apache.org/docs/"><img src="/images/2013-09-28-migrating-from-epydoc-to-sphinx-style-docstrings-using-sed-and-some-command-line-fu/libcloud_docs.png" class="inline" /></a>
<span class="image-caption">New documentation already looks way better than
the old one.</span>
</div>
<p>This task included writing new documentation and moving existing regular and
API documentation to Sphinx.</p>
<p>Existing documentation was stored in subversion (using Apache CMS) in Markdown
format. The move to Sphinx and reStructuredText was performed manually. The
reason for that is that the existing documentation was pretty poor and lacking
and the move didn’t just involve changing the format, but it also involved
rewriting the text and filling the gaps.</p>
<p>Existing API documentation and docstrings used epytext markup. Unlike the regular
documentation, the API documentation didn’t need rewriting and we just wanted to
migrate to the Sphinx style docstring format so we could use the
<a href="http://sphinx-doc.org/ext/autodoc.html">autodoc extension</a>.</p>
<h3 id="migrating-from-epydoc-to-sphinx-style-docstring-format">Migrating from epydoc to Sphinx style docstring format</h3>
<p>There are multiple ways to approach this task:</p>
<ol>
<li>Write a Sphinx extension which converts epytext tags to Sphinx format on the
fly</li>
<li>Update all the epytext tags in the code</li>
</ol>
<p>I decided to go with #2 and automate it using some command line fu. The reason
for that is that on the fly translation slows things down and, moving forward,
you end up with two styles of docstrings in your code (epytext for old and Sphinx
for new code).</p>
<p>The only downside of the second approach is that it touches a lot of code and in
case you have a lot of open pull requests, this could result in a bunch of merge
conflicts down the road, so keep that in mind.</p>
<p>The script which I used for the migration can be found below:</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="c">#!/usr/bin/env bash</span>
<span class="c">#</span>
<span class="c"># Script for migrating from epydoc to Sphinx style docstrings.</span>
<span class="c">#</span>
<span class="c"># WARNING: THIS SCRIPT MODIFIES FILES IN PLACE. BE SURE TO BACKUP THEM BEFORE</span>
<span class="c"># RUNNING IT.</span>
<span class="nv">DIRECTORY</span><span class="o">=</span><span class="nv">$1</span>
<span class="nv">SED</span><span class="o">=</span><span class="sb">`</span>which gsed gnused <span class="nb">sed</span><span class="sb">`</span>
<span class="k">for </span>value <span class="k">in</span> <span class="nv">$SED</span>
<span class="k">do
</span><span class="nv">SED</span><span class="o">=</span><span class="k">${</span><span class="nv">value</span><span class="k">}</span>
<span class="nb">break
</span><span class="k">done
if</span> <span class="o">[</span> <span class="o">!</span> <span class="nv">$DIRECTORY</span> <span class="o">]</span><span class="p">;</span> <span class="k">then
</span><span class="nb">echo</span> <span class="s2">"Usage: ./migrate_docstrings.sh <directory with your code>"</span>
<span class="nb">exit </span>1
<span class="k">fi
</span>OLD_VALUES[0]<span class="o">=</span><span class="s1">'@type'</span>
OLD_VALUES[1]<span class="o">=</span><span class="s1">'@keyword'</span>
OLD_VALUES[2]<span class="o">=</span><span class="s1">'@param'</span>
OLD_VALUES[3]<span class="o">=</span><span class="s1">'@return'</span>
OLD_VALUES[4]<span class="o">=</span><span class="s1">'@rtype'</span>
OLD_VALUES[5]<span class="o">=</span><span class="s1">'L{\([^}]\+\)}'</span>
OLD_VALUES[6]<span class="o">=</span><span class="s1">'C{\(int\|float\|str\|list\|tuple\|dict\|bool\|None\|generator\|object\)}'</span>
OLD_VALUES[7]<span class="o">=</span><span class="s1">'@\(ivar\|cvar\|var\)'</span>
NEW_VALUES[0]<span class="o">=</span><span class="s1">':type'</span>
NEW_VALUES[1]<span class="o">=</span><span class="s1">':keyword'</span>
NEW_VALUES[2]<span class="o">=</span><span class="s1">':param'</span>
NEW_VALUES[3]<span class="o">=</span><span class="s1">':return'</span>
NEW_VALUES[4]<span class="o">=</span><span class="s1">':rtype'</span>
NEW_VALUES[5]<span class="o">=</span><span class="s1">':class:`\1`'</span>
NEW_VALUES[6]<span class="o">=</span><span class="s1">'``\1``'</span>
NEW_VALUES[7]<span class="o">=</span><span class="s1">':\1'</span>
<span class="k">for</span> <span class="o">((</span> i <span class="o">=</span> 0 <span class="p">;</span> i < <span class="k">${#</span><span class="nv">OLD_VALUES</span><span class="p">[@]</span><span class="k">}</span> <span class="p">;</span> i++ <span class="o">))</span>
<span class="k">do
</span><span class="nv">old_value</span><span class="o">=</span><span class="k">${</span><span class="nv">OLD_VALUES</span><span class="p">[</span><span class="nv">$i</span><span class="p">]</span><span class="k">}</span>
<span class="nv">new_value</span><span class="o">=</span><span class="k">${</span><span class="nv">NEW_VALUES</span><span class="p">[</span><span class="nv">$i</span><span class="p">]</span><span class="k">}</span>
<span class="nv">cmd</span><span class="o">=</span><span class="s2">"find </span><span class="k">${</span><span class="nv">DIRECTORY</span><span class="k">}</span><span class="s2"> -name '*.py' -type f -print0 | xargs -0 </span><span class="k">${</span><span class="nv">SED</span><span class="k">}</span><span class="s2"> -i -e 's/</span><span class="k">${</span><span class="nv">old_value</span><span class="k">}</span><span class="s2">/</span><span class="k">${</span><span class="nv">new_value</span><span class="k">}</span><span class="s2">/g'"</span>
<span class="nb">echo</span> <span class="s2">"Migrating: </span><span class="k">${</span><span class="nv">old_value</span><span class="k">}</span><span class="s2"> -> </span><span class="k">${</span><span class="nv">new_value</span><span class="k">}</span><span class="s2">"</span>
<span class="nb">eval</span> <span class="s2">"</span><span class="nv">$cmd</span><span class="s2">"</span>
<span class="k">done</span></code></pre></figure>
<p>(script is also available as gist at
<a href="https://gist.github.com/Kami/6734885#file-migrate_docstrings-sh">https://gist.github.com/Kami/6734885</a>)</p>
<p>As you can see, the script is very simple and has some limitations (noted
below), but it worked very well for us. As usual, the <a href="http://en.wikipedia.org/wiki/Pareto_principle">80-20</a> rule also applies
in this case.</p>
<p>Limitations of this script:</p>
<ul>
<li>The script does a very simple search and replace and has no knowledge or context
of the surrounding code and text. This means that if you have some code which
looks like epytext docstrings, this script might unintentionally replace it.</li>
<li>I only added support for the tags we use. As such, the script doesn’t support
all the epytext tags. This shouldn’t be a big deal though. It’s fairly easy
to change it and add support for all of the tags. You can find a list
of all the available tags on <a href="http://epydoc.sourceforge.net/manual-fields.html">this page</a>.</li>
</ul>
135 days of commits and 50+ open source contributions laterhttps://www.tomaz.me/2013/09/26/135-days-of-commits-and-50-plus-open-source-contributions-later.html2013-09-26T00:00:00+02:00<h2 id="135-days-of-commits-and-50-open-source-contributions-later"><a href="/2013/09/26/135-days-of-commits-and-50-plus-open-source-contributions-later.html">135 days of commits and 50+ open source contributions later</a></h2>
<p>It has been more than 2 months since I left Rackspace to work on my own startup.
Those two months have been very busy and many different things have happened.</p>
<p>After a lot of development, brainstorming and working with our potential
customers, David and I have decided to stop working on a project we originally
started to work on after I left Rackspace (<a href="https://www.wadodo.com/presentation/">Wadodo</a>) and focus our effort on two
other projects.</p>
<p>I plan to go into more details why we have decided to do that in a future blog
post, but it suffices to say that we have decided to focus our efforts on
projects in other fields where we have more experience and connections
(personal training, big data and distributed systems).</p>
<div class="imginline">
<a href="https://www.wadodo.com/presentation/">
<img src="/images/2013-09-26-135-days-of-commits-and-50-plus-open-source-contributions-later/wadodo.png" class="inline" />
</a>
<span class="image-caption">Wadodo</span>
</div>
<p>The first of those projects has already been launched. It’s an online marketplace
for personal trainers and athletes called <a href="https://www.coachspree.com/">CoachSpree</a>. The second project is
well underway and should be launched in the near future.</p>
<p>To make those products a reality, I have spent a lot of time on coding and
non-coding (customer development and acquisition, …) tasks. In this post I’m
going to ignore other topics for a moment and focus solely on coding tasks, more
specifically on my open-source contributions.</p>
<p>Why, you might ask?</p>
<p>There are multiple reasons:</p>
<ol>
<li>It’s nice to look back and see things that have been accomplished.</li>
<li>To encourage more people to contribute to open source projects.</li>
<li>To show people there is always time to give back and contribute to open
source projects, even while working crazy hours on a startup.</li>
<li>To push myself to contribute even more in the future.</li>
</ol>
<h3 id="open-source-contributions-are-good-yo">Open source contributions are good, yo!</h3>
<p>During this period I have made more than 50 contributions to more than 30
different open source projects. A big chunk of changes I have contributed were
smaller bug fixes and feature additions, but there were also larger
contributions and feature additions.</p>
<div class="imginline">
<a href="https://github.com/Kami">
<img src="/images/2013-09-26-135-days-of-commits-and-50-plus-open-source-contributions-later/contributions_graph.png" class="inline" />
</a>
<span class="image-caption">Github activity graph (also includes commits to
private repositories)</span>
</div>
<p>It’s also important to keep in mind that the sheer contribution size and the
number of lines of code are not always a good indicator of how much time was actually
spent working on the issue.</p>
<p>Some of the smaller bug fixes I’ve contributed took substantially more time than
other larger contributions. The reason for that is that some of the smaller
contributions were fixes for some really nasty edge cases. And as it usually goes,
debugging nasty edge conditions can be very time consuming and many times it’s
really hard to write a test case which reproduces the issue.</p>
<p>Some projects I have contributed to are listed below. This list excludes
projects where I’m a primary author.</p>
<ul>
<li><a href="https://github.com/codasus/django-location-field">django-location-field</a></li>
<li><a href="https://github.com/jakewins/django-money">django-money</a></li>
<li><a href="https://github.com/jezdez/django-avatar">django-avatar</a></li>
<li><a href="https://github.com/senko/python-video-converter">python-video-converter</a></li>
<li><a href="https://github.com/jsocol/django-waffle">django-waffle</a></li>
<li><a href="https://bitbucket.org/ubernostrum/django-registration/">django-registration</a></li>
<li><a href="https://github.com/jorgebastida/django-dajax/">django-dajax</a></li>
<li><a href="https://github.com/BBC-News/wraith">wraith</a></li>
<li><a href="https://github.com/cyberdelia/django-pipeline">django-pipeline</a></li>
<li><a href="https://github.com/apache/cordova-android">cordova-android</a></li>
<li><a href="https://github.com/paulchakravarti/gmail-sender">gmail-sender</a></li>
<li><a href="https://github.com/paramiko/paramiko">paramiko</a></li>
<li><a href="https://github.com/racker/node-cassandra-client">node-cassandra-client</a></li>
<li><a href="https://github.com/datastax/python-driver">python-driver</a></li>
<li><a href="https://github.com/gotwarlost/istanbul">istanbul</a></li>
<li><a href="https://github.com/ICTO/ansible-jenkins">ansible-jenkins</a></li>
<li><a href="https://github.com/jorgebastida/django-dajax">django-dajax</a></li>
<li><a href="https://github.com/apache/libcloud">libcloud</a></li>
</ul>
<p>As you can see, there is a lot of Python, but there are also contributions in
other languages such as JavaScript (Node.js), Ruby and Bash.</p>
<h3 id="conclusion">Conclusion</h3>
<p>My goal is to continue and hopefully exceed this pace of open source contributions
and giving back to the community in the future.</p>
<p>I also hope this post will inspire other developers to contribute more. I would
especially like to see smaller web agencies here in Slovenia
contribute more.</p>
<p>I know a lot of such companies who rely on many open source projects and
technologies, but they run forks instead of contributing changes back upstream.</p>
<p>Let’s ignore the moral perspective of not giving back for a moment and focus solely
on the cost of forking. At a quick glance and with short-term thinking, forking
might seem like a time saving thing to do. It is true that it might save you
some time in the short term, but in most cases it’s going to result in a lot
of additional work and maintenance headaches in the future.</p>
<p>So keep that in mind next time you fork a project.</p>
10 secrets to sustainable open source communities; great presentation about open source communitieshttps://www.tomaz.me/2013/09/16/great-presentation-about-open-source-communities.html2013-09-16T00:00:00+02:00<h2 id="10-secrets-to-sustainable-open-source-communities-great-presentation-about-open-source-communities"><a href="/2013/09/16/great-presentation-about-open-source-communities.html">10 secrets to sustainable open source communities; great presentation about open source communities</a></h2>
<p>A while back I encountered a great presentation about open source
communities titled <a href="http://www.slideshare.net/eleddy/os-con2013">10 secrets to sustainable open source communities</a> which
was delivered by Elizabeth Leddy at OSCON 2013.</p>
<p>The presentation primarily talks about the author’s experience with the Plone
project and community, but the lessons and observations in it can be applied to
pretty much any open source project out there.</p>
<div class="imginline"><img src="/images/2013-09-16-great-presentation-about-open-source-communities/slide.png" class="inline" />
<span class="image-caption">One of the slides from the presentation</span></div>
<p>Why do I find this presentation so good, you might ask? Here are some highlights
from the presentation:</p>
<ul>
<li>Open source is more than just contributing code</li>
<li>In many cases open source projects outlast relationships and jobs</li>
<li>A community can act as your extended family</li>
<li>People move on (aka life happens) so you should plan accordingly</li>
<li>Measuring community success is important</li>
<li>Diversity is important</li>
<li>Communication is important</li>
<li>Transparency is important</li>
<li>In-person communication and meetups are important</li>
<li>Soft skills are important</li>
<li>Project governance is important</li>
</ul>
<p>For more, I encourage you to go <a href="http://www.slideshare.net/eleddy/os-con2013">check out the presentation</a>.</p>
Exporting Libcloud DNS zone to BIND zone file format and migrating between DNS providershttps://www.tomaz.me/2013/09/07/exporting-libcloud-dns-zone-to-bind-zone-file-format-and-migrating-between-dns-providers.html2013-09-07T00:00:00+02:00<h2 id="exporting-libcloud-dns-zone-to-bind-zone-file-format-and-migrating-between-dns-providers"><a href="/2013/09/07/exporting-libcloud-dns-zone-to-bind-zone-file-format-and-migrating-between-dns-providers.html">Exporting Libcloud DNS zone to BIND zone file format and migrating between DNS providers</a></h2>
<p>Because of the reliability issues and pretty much non-existent and
non-responsive customer service (if you don’t believe me, check <a href="https://twitter.com/search?q=zerigo&src=typd&mode=realtime">Twitter</a>,
which is full of complaints), I migrated some of my domains away from Zerigo to
a different DNS provider. To do that, I wrote a simple Python script which
allows you to export a <a href="http://libcloud.apache.org/">Libcloud</a> <a href="https://ci.apache.org/projects/libcloud/docs/dns/index.html">DNS zone</a> to the BIND zone file format.</p>
<h3 id="motivation--history">Motivation & History</h3>
<p>I have been a Zerigo user for almost 4 years now. One of the primary reasons
why I migrated most of my domains to Zerigo back then was that they
were one of the first DNS providers which offered a simple management REST API.</p>
<p>At the beginning things were working flawlessly, but after the <a href="http://investors.8x8.com/releasedetail.cfm?ReleaseID=585970">acquisition by
8x8</a> things started degrading. Things became especially
bad in the last year or so. During that time Zerigo has been a target of
multiple DDoS attacks which took down a majority or all of their DNS servers
(<a href="http://copperegg.com/zerigo-and-the-ddos-attack-of-july-23/">1</a>, <a href="http://news.softpedia.com/news/DNS-Provider-Zerigo-Hit-by-DDOS-Attack-362771.shtml">2</a>). Those attacks caused major disruptions for a lot of Zerigo
DNS users.</p>
<div class="imginline"><img src="/images/2013-09-07-exporting-libcloud-dns-zone-to-bind-zone-file-format-and-migrating-between-dns-providers/anycast.png" class="inline" />
<span class="image-caption">Anycast provides better resilience against DDoS attacks</span></div>
<p>Recently Zerigo <a href="http://www.zerigo.com/article/improving-zerigo-dns-server-reliability">posted an announcement</a> where they said that they have
implemented multiple measures to improve their DNS servers’ reliability. Sadly,
as demonstrated a couple of weeks ago when they were a target of another
DDoS attack, those measures didn’t help. The problem is that most of the
measures they have implemented are just small patches which don’t address
the root cause. To improve the reliability of their service, the first step they
would need to take is to move all of their DNS servers to <a href="http://en.wikipedia.org/wiki/Anycast">Anycast</a>. Anycast
has long been used by many commercial DNS providers to increase performance and
availability.</p>
<h3 id="exporting-libcloud-dns-zone-to-bind-zone-file-format">Exporting Libcloud DNS zone to BIND zone file format</h3>
<p>Yesterday I decided to migrate more of my domains to a different provider.
To expedite the migration I wrote a simple Python script which can take a
<a href="http://libcloud.apache.org/">Libcloud</a> <a href="https://ci.apache.org/projects/libcloud/docs/dns/index.html">DNS zone</a> and create a BIND zone file for it.</p>
<p>The advantage of this approach over writing a script which uses Libcloud to
directly re-create all the records under a different provider is that it’s more
efficient and as an output you get a file which you can use with any DNS
software or provider which supports the BIND zone file format.</p>
<p>Keep in mind that a similar thing can be achieved by using a <code class="language-plaintext highlighter-rouge">dig</code> tool (<code class="language-plaintext highlighter-rouge">dig
+nocmd example.com any +multiline +noall +answer</code>). The problem with the <code class="language-plaintext highlighter-rouge">dig</code>
approach is that unless zone transfers are enabled for your IP address, you
won’t receive all the records back.</p>
<p>Aforementioned script can be found on <a href="https://github.com/Kami/python-libcloud-dns-to-bind-zone">Github</a>.</p>
<h3 id="usage">Usage</h3>
<p>To use this script and create a BIND zone file, follow the steps below:</p>
<ol>
<li>Install the Python package</li>
</ol>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nv">$ </span>pip <span class="nb">install</span> <span class="nt">-e</span> git+https://github.com/Kami/python-libcloud-dns-to-bind-zone@master#egg<span class="o">=</span>libcloud_to_bind</code></pre></figure>
<p>The script is so simple that I haven’t published it to PyPI yet. I plan to
start a discussion on the Libcloud mailing list and if more people find it
useful and are OK with that, I will include this functionality in the core.</p>
<ol>
<li>Take a look at <a href="https://ci.apache.org/projects/libcloud/docs/dns/supported_providers.html">example.py</a> and modify it to suit your needs. For
example:</li>
</ol>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">libcloud.dns.types</span> <span class="kn">import</span> <span class="n">Provider</span>
<span class="kn">from</span> <span class="nn">libcloud.dns.providers</span> <span class="kn">import</span> <span class="n">get_driver</span>
<span class="kn">from</span> <span class="nn">libcloud_to_bind</span> <span class="kn">import</span> <span class="n">libcloud_zone_to_bind_zone_file</span>
<span class="n">DOMAIN_TO_EXPORT</span> <span class="o">=</span> <span class="s">'example.com'</span>
<span class="n">Zerigo</span> <span class="o">=</span> <span class="n">get_driver</span><span class="p">(</span><span class="n">Provider</span><span class="p">.</span><span class="n">ZERIGO</span><span class="p">)</span>
<span class="n">driver</span> <span class="o">=</span> <span class="n">Zerigo</span><span class="p">(</span><span class="s">'email'</span><span class="p">,</span> <span class="s">'api key'</span><span class="p">)</span>
<span class="n">zones</span> <span class="o">=</span> <span class="n">driver</span><span class="p">.</span><span class="n">list_zones</span><span class="p">()</span>
<span class="n">zone</span> <span class="o">=</span> <span class="p">[</span><span class="n">z</span> <span class="k">for</span> <span class="n">z</span> <span class="ow">in</span> <span class="n">zones</span> <span class="k">if</span> <span class="n">z</span><span class="p">.</span><span class="n">domain</span> <span class="o">==</span> <span class="n">DOMAIN_TO_EXPORT</span><span class="p">][</span><span class="mi">0</span><span class="p">]</span>
<span class="n">result</span> <span class="o">=</span> <span class="n">libcloud_zone_to_bind_zone_file</span><span class="p">(</span><span class="n">zone</span><span class="o">=</span><span class="n">zone</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">result</span><span class="p">)</span></code></pre></figure>
<p>Keep in mind that you can replace Zerigo with any other provider <a href="https://ci.apache.org/projects/libcloud/docs/dns/supported_providers.html">supported by
Libcloud</a>.</p>
<ol>
<li>Run the code</li>
</ol>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nv">$ </span>pypy example.py</code></pre></figure>
<p>(yes, Libcloud works just fine under <a href="http://pypy.org/">PyPy</a>)</p>
<p>Here is an example output for one of my domains:</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="p">;</span> Generated by Libcloud v0.13.0 on 2013-09-07 00:09:09
<span class="nv">$ORIGIN</span> tomaz.me.
<span class="nv">$TTL</span> 900
tomaz.me. 900 IN TXT <span class="s2">"v=spf1 include:_spf.google.com ~all"</span>
tomaz.me. 900 IN A 207.97.227.245
tomaz.me. 900 IN MX 30 aspmx5.googlemail.com.
testsrv.tomaz.me. 900 IN SRV 10 10 333 google.com.
mail._domainkey.atlantis.tomaz.me. 900 IN TXT <span class="s2">"v=DKIM1; k=rsa; t=y; p=MIGfMA0GCSqGSIb3DQEBAQUAA4GNADCBiQKBgQC6HeU4PBI+JuEWAe03Bzye1Gs+U2vXhbSloSNbXr9JDWMygyQtCjxN7brHahqFambBtmdQ5VmbukM+HFlKUoaNz7Q97KaKRQg8mDvSmLJkHmAw5PzZJXfzrfkoLmXhN6K4XnwLWJ0BFWPyEPdpwCX8v9v3kB0INJU4hNjwdy/+6wIDAQAB"</span>
www.tomaz.me. 900 IN CNAME kami.github.com.
tomaz.me. 900 IN MX 30 aspmx3.googlemail.com.
google._domainkey.tomaz.me. 900 IN TXT <span class="s2">"v=DKIM1; k=rsa; p=MIGfMA0GCSqGSIb3DQEBAQUAA4GNADCBiQKBgQDNCHa8VeffMv+X/fRkPgHC9MN2Eh5vQqMkWy4e/YnFbWgF1JilL1Yn9nN54A5WV7lZpCTIvuOC2CrQrIcaBpfr+8SjYsjGO91dz8cwgqZkl7mAjKs7nz8U0PsstuI9i4V3LsHC4NVGOirAgnKA4HXVhxGRuyE94+tuNJ6XDLJoNQIDAQAB"</span>
tomaz.me. 900 IN MX 20 alt2.aspmx.l.google.com.
atlantis.tomaz.me. 900 IN TXT <span class="s2">"v=spf1 ip4:178.63.79.14 ip4:178.63.79.48 ip4:178.63.79.49 ip4:178.63.79.50 ip6:2a01:4f8:121:3121::2"</span>
atlantis.tomaz.me. 900 IN PTR atlantis.tomaz.me.
tomaz.me. 900 IN TXT google-site-verification<span class="o">=</span>Rgex8ShgIRWUlb9j0Ivw5uHllb0p9skEdJqkSMqvX_o
test5.tomaz.me. 900 IN AAAA 2620:0:1cfe:face:b00c::3
tomaz.me. 900 IN SPF <span class="s2">"v=spf1 include:_spf.google.com ~all"</span>
tomaz.me. 900 IN MX 20 alt1.aspmx.l.google.com.
atlantis.tomaz.me. 900 IN A 178.63.79.14
test5.tomaz.me. 900 IN A 127.0.0.1
ponies.tomaz.me. 900 IN A 86.58.76.208
tomaz.me. 900 IN MX 30 aspmx2.googlemail.com.
atlantis.tomaz.me. 900 IN AAAA 2a01:4f8:121:3121::2
secure.tomaz.me. 900 IN A 86.58.76.208
tomaz.me. 900 IN MX 10 aspmx.l.google.com.
tomaz.me. 900 IN MX 30 aspmx4.googlemail.com.</code></pre></figure>