Gotchas when deploying (Go) apps on Google App Engine

My other half and I write our weekly shopping list on a notepad. It's not a spectacularly inefficient system, but it has its downsides: we can't both work on it at the same time, we often forget it when we go shopping, and we often think of things to add to it while we're away from home (and the wretched pad). In any case, it gave my inner geek a great excuse to digitise the process and also to write something else in Go, which I really like but have yet to make anything non-trivial with. And since I've been wanting to give Google App Engine (GAE) a go, this was the perfect excuse for that too.

As it turns out, I'm glad that my first time deploying to GAE was with a small, simple app, because if this had been part of a larger project I would have given up and used a different service. Most of these 'gotchas' were my own fault, partly because I failed to appreciate the great lengths GAE goes to in order to make an app scale, even if it's just intended for two users (not that it has any way of knowing what my intentions are!). This involves making certain behavioral choices for the developer and assuming that they will be understood.

For my part, I was just looking for a way to quickly push a very simple app, so it was not the best fit, especially since I'm used to the likes of Rackspace and EC2, where the developer has (almost) complete control. That said, now that I understand these quirks and why they exist, I would certainly consider using GAE in the future--just not for tiny personal projects. In the hope that others don't waste their time too, here's a list of the issues I faced when deploying to GAE and how I resolved them.
It's likely that these issues will affect any GAE deployment (not just Golang apps), but I only tested this language.

Note: This post does not aim to be a tutorial in GAE app creation or deployment. If you want the basics, head over to Google's tutorial.

1. Don't use mutable global variables for anything

(Updated following some comments on Reddit)

In general, you wouldn't use mutable global variables in a server application anyway. If the server gets rebooted, you'll lose whatever is stored there. But when testing a new cloud platform (GAE), you want to know what you can and can't do. If I deploy an application on EC2, I can use global state to store things if I feel like it, and everything will behave perfectly well until the VM gets rebooted. In my experience they don't get rebooted that often, so if the variable contains something that did not have strong persistence requirements to begin with, you may not notice it being reset or it may not annoy you very much.

For example, in this shopping list app I wanted to create a map of weekdays to booleans to indicate whether or not to cross that weekday off the list (if the list was complete for that day), and thought instead of using the Datastore I'd use a global array of booleans - big mistake.

In GAE, instances seem to get restarted much more often, and global state is obviously not shared between instances, so there's no point in using it to store anything. Memcache and the Datastore are the only ways of sharing state between instances (and since Memcache entries can be evicted at any time, the Datastore is the only one that really persists anything). As has been pointed out to me, the development server goes to great lengths to replicate and even exaggerate this behavior so that you aren't surprised by it in production.
The bottom line is: don't use mutable globals, especially in GAE!
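
For what it's worth, here's roughly what the alternative looks like with the classic appengine SDK's memcache package. This is a quick, untested sketch: the helper names and key scheme are made up, and anything that genuinely has to survive eviction should go in the Datastore instead.

import (
    "net/http"

    "appengine"
    "appengine/memcache"
)

// markDayDone records that a weekday's list is complete (illustrative only).
func markDayDone(r *http.Request, day string) error {
    c := appengine.NewContext(r)
    return memcache.Set(c, &memcache.Item{
        Key:   "done-" + day, // made-up key scheme
        Value: []byte("1"),
    })
}

// isDayDone reports whether a weekday has been crossed off; a cache miss
// simply means "not done" (or that the entry has been evicted).
func isDayDone(r *http.Request, day string) bool {
    _, err := memcache.Get(appengine.NewContext(r), "done-"+day)
    return err == nil
}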

2. MIME types must be specified for all static handlers

This puzzled me for a while because it results in cryptic error messages from appcfg.py when you try and deploy:
UnicodeDecodeError: 'ascii' codec can't decode byte 0x89 in position 0: ordinal not in range(128)

The fix is simple:

All static file handlers in your app.yaml file must include a mime_type attribute, despite what it says here.

GAE is supposed (by its own admission) to know the MIME type of some common extensions, like CSS, but if you leave the attribute out you sometimes get the above exception anyway. If you do, you're probably missing a mime_type in app.yaml for a file type that App Engine can't figure out on its own. It would be nice if it said so in a human-readable way, but there are a number of posts on SO (see e.g. here) relating to this, so it shouldn't be too difficult to figure out.
This is a particular pain point for web-font folders, since you'll have to write a separate handler for each font file type just to give each one its own mime_type (or at least this was true on my system).
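
To make this concrete, here's the sort of thing I mean - an illustrative sketch only, since your paths, extensions and the exact MIME strings will differ:

handlers:
- url: /css
  static_dir: css
  mime_type: text/css

- url: /fonts/(.*\.woff)$
  static_files: fonts/\1
  upload: fonts/.*\.woff$
  mime_type: application/font-woff

- url: /fonts/(.*\.ttf)$
  static_files: fonts/\1
  upload: fonts/.*\.ttf$
  mime_type: application/x-font-ttf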

3. Static files don't update when you appcfg.py update

The command python appcfg.py update [app path] is supposed to update the deployed version of your app with the changes that you have made to the local version. Unfortunately, if you expect it to update all your static files seamlessly, you're in for a world of pain. The process went something like this:
1. Deploy version 1 of your app.
2. Go to [your-app-id].appspot.com and confirm that everything works as planned.
3. Actually, you notice something is not quite right in the styling, and you end up making a small adjustment to the CSS.
4. Re-deploy your app with a version bump.
So, you expect that if you go back to [your-app-id].appspot.com, you'll see the new styling, right? Nope. Not even if you clear your browser cache. GAE is very good at caching static files - so good that when you push a new version, it will happily keep serving the old ones.

The fix

According to what I could find online, some 'cache-busting' technique may work for you (see also this post). This involves changing all references to static files in browser-facing code (HTML/JS/CSS) to include an identifier that is unique to your version number, preventing GAE from serving up old versions.
Example: Instead of linking to js/app.js in your index.html, link to js/app.js#[version number]. (A query string, e.g. js/app.js?v=[version number], is the more common form, since anything after a # is never sent to the server and so can't affect caching.)
Sounds good, but for some reason it didn't work for me and I still don't know why. Fortunately, GAE also uses a subdomain identifier to version your app, so if you want to access, say, version [xx] of your app, you can go to [xx].latest.[your-app-id].appspot.com. This worked for me, although it is not a pleasant solution by any means since I'll have to update the bookmark on my iPhone every time I make any small change to the app.
If I can find out why the cache busting technique failed for me I'll update this post.

A parallel 'for' loop in Python using decorators

A colleague who was new to Python recently ran into some performance problems with a data processing script he'd written, which we had good reason to think was being held up by slow-running database queries. The script was something like this:

 for id in entities:
     res = run_expensive_query(id)  # very slow
     process_result(res)            # relatively fast

Other than tweaking the query, the obvious thing to try was to parallelize the for loop, but since my colleague was new to Python (and programming in general), pointing him at the concurrent.futures module was only going to confuse him. So I thought I'd write a decorator to do the job.

The 'Static' version

This decorator works statically (you'll see what I mean in a moment). It's designed to work with scripts like the one above, where entities is computed once before the for loop runs rather than changing dynamically. The upside is the beginner-friendly syntactic sugar @distribute_over(entities). The way it works is to offload the processing to a ThreadPoolExecutor using executor.map and collect the items of the resulting iterator into a list, result:

from concurrent.futures import ThreadPoolExecutor

 def distribute_over(shards, threads=5):
     """Parallel for loop over an iterable 'shards', which is passed as the first argument of fn"""
     def wrapper(fn):
         def func_wrapper():
             result = []
             with ThreadPoolExecutor(max_workers=threads) as executor:
                 gen = executor.map(fn, shards)
                 for res in gen:
                     result.append(res)
             return result
         return func_wrapper
     return wrapper

Here's an example of usage:

 import time

 @distribute_over(range(10))
 def get_squares(i):
     """Return the ith square after sleeping for a second."""
     time.sleep(1)  # do something time-consuming
     return i**2

 print(get_squares())

As you can see, the shards must be determined before the function is even defined, which might be a problem in some cases. But it was perfect for my colleague's use case, since he could just replace his for loop by a function definition and decorate it with @distribute_over(entities), which is just about as readable as it gets.
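
In other words, his script turned into something roughly like this (handle is just a name I've picked for the example, and it assumes process_result is safe to call from several threads at once):

 @distribute_over(entities)
 def handle(id):
     res = run_expensive_query(id)  # still the slow part, but several calls now run concurrently
     process_result(res)            # relatively fast

 handle()  # runs the whole loop across the thread pool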

The 'Dynamic' version

This version is not quite as pretty in terms of syntax but it's more useful because you don't have to know shards ahead of time - you can just pass it in dynamically. Here is what it looks like:

 def distribute(fn, threads=5):
     """Parallel 'for' loop: each element of the iterable passed to the wrapped function is handed to fn."""
     def func_wrapper(shards):
         result = []
         with ThreadPoolExecutor(max_workers=threads) as executor:
             gen = executor.map(fn, shards)
             for res in gen:
                 result.append(res)
         return result
     return func_wrapper

And the example (with cubes this time):

 @distribute
 def get_cubes(i):
     """Return the ith cube after sleeping for a second."""
     time.sleep(1)
     return i**3

 print(get_cubes(range(10)))

Although it's more useful, it's not quite as obvious what's going on in the last line - and this is especially true if get_cubes were called something like analyze_entity, which would suggest the argument should be a single entity rather than a list! So naming the function to reflect that it will be called on an iterable is important to avoid confusion.

A 'Generator' version

As an afterthought, you could replace the result.append(res) line by yield res to do away with the result variable and end up with a generator instead, which you could call like

 for cube in get_cubes(range(10)):
      print(cube)

Not sure if that's any better (or if it even works, I haven't tried it), but it might satisfy some different use cases.
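
For the record, here's roughly what that would look like - I've called it distribute_lazy purely to keep it distinct from the list-building version above, and the same untested caveat applies:

 def distribute_lazy(fn, threads=5):
     """Like distribute, but yields each result as it becomes available instead of building a list."""
     def func_wrapper(shards):
         # Nothing runs until the caller starts iterating; executor.map then
         # submits every element of shards to the pool and yields results in input order.
         with ThreadPoolExecutor(max_workers=threads) as executor:
             for res in executor.map(fn, shards):
                 yield res
     return func_wrapper

Calling it looks exactly like the loop above; the difference is that results stream back one at a time rather than arriving all at once at the end.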

The code

The code is available from Github in this gist.

Importing CSV data in Postgres

Just a quick tip: if you need to import CSV data into a Postgres table, use COPY. If the table doesn't already exist, you'll have to create it first with CREATE TABLE; your COPY statement will then populate it. If the table already exists and is non-empty, COPY will append the data to it, provided it doesn't violate any constraints (e.g. uniqueness).

In its simplest form, here's what it looks like:

COPY MyTable FROM '/path/to/my/file' DELIMITER ',' CSV;

Headers

If the first row of your file contains a header, you can tell Postgres thusly:

COPY MyTable FROM '/path/to/my/file' DELIMITER ',' CSV HEADER;

Explicitly specifying columns

If your table has an auto-incrementing column (e.g. a row id) that isn't present in your file, you'll need to tell Postgres which columns the file does contain; otherwise COPY will expect a value for every column of the table and the import will fail. The syntax is as follows:

COPY MyTable ("Column1", "Column2", ... "ColumnN") FROM '/path/to/my/file' DELIMITER ',' CSV;

Importing multiple files

You could copy-paste COPY statements, but there's a neater way to do it using a FOREACH loop (requires PostgreSQL 9.1 or later). The idea is as follows; I haven't actually tested this, so it may require some modification!

DO
$BODY$
DECLARE
    files varchar[] := array['/path/1', '/path/2'];
    f varchar;
BEGIN
    FOREACH f IN ARRAY files
    LOOP
        EXECUTE 'COPY MyTable FROM ' || quote_literal(f) || ' DELIMITER '','' CSV';
    END LOOP;
END;
$BODY$ LANGUAGE plpgsql;

Hello world

Test. This text has emphasis. This text is strong. And because this post is called 'Hello world', here's such a thing:

package main

import "fmt"

func main() {
    fmt.Printf("Hello world.\n")
}