Surviving Django (if you care about databases)

2020-07-25

Django is currently the most commonly used full-stack web framework for Python. It has been around for a good 15 years, emerging as a winner from a period in which Python was already mature but its web development tools were still comparatively immature and fragmented.

Django allows the definition of objects in your program as "models" using the Model base class. They behave largely like normal Python classes, with added support to save into, and retrieve from, the relational database backing your application. If you don't need such database support, you didn't need a full-stack web framework in the first place.

Django tries to be independent from the database you choose. It sounds like a good idea, but only on paper. After working several years with Django systems, both written from scratch and inherited and maintained, I feel the "blessed" way of working with databases and Django leads you to use your database in a sub-optimal way, and unnecessarily complicates the development and maintenance cycle of your project.

I think the divergence between wishful thinking and reality starts from a fundamental misunderstanding between you and Django, which is not written in the contract you haven't signed with them anyway:

"independence from the relational database" is a feature needed by Django as a framework, not by the program you are writing.

Django needs it because a web framework not tied to a single database vendor is more valuable than one tied to a specific one - and that's fair enough. But you don't: your web program, more likely than not, will never have to switch from one database to another. So, You Ain't Gonna Need It™. Portability at all costs leads to at least two problems:

1) You will not be able to use all the features offered by your relational database.

2) Every change to your models, or to your database schema, will be more complicated than it should be.

You Ain't Gonna Need It

How many times have you worked on a project that, after 1-2 years of development, changed to a different database?

I can tell you how many times it happened to me, I counted them: exactly never.

Replacing your database vendor is a major, traumatic occurrence, almost as much as rewriting your program in a different language. If you replace your database, more likely than not, it is because you are interested in the features of the new database. You need to use those features, so sticking to the functionality common to the old and the new one will not solve any of your problems.

Did you maybe start your project with SQLite, and now it has grown enough to need a bigger database? If so, you are still at the phase in which your project is a toy: you haven't done anything yet that requires thinking in terms of concurrency. Even if you have to rewrite a few things, it's not going to be a lot.

Do you have a large MySQL project and now you have to migrate to PostgreSQL? That's not gonna happen: you have probably tweaked MySQL and built expertise in it. PostgreSQL might be a better database in some respects, but not so much better that you want to migrate all your data and start from scratch without frobbing, twiddling, tweaking the database configuration. Did you say you have High Availability and Disaster Recovery configured? That has to be converted too, of course.

In the above paragraph, replace the database vendors with any permutation of MySQL, PostgreSQL, MS SQL, Oracle. That's not gonna happen. Except maybe if an Oracle salesman gets hold of someone in your project with a modicum of decision-making power and talks them into buying some sort of expensive license, but that's not a technical problem, it's a political one, and it's up to you to decide if you are comfortable with it.

Have you got PostgreSQL in production, but you want to test with SQLite because it's easier to set up? If so, your tests are just a tick-box exercise: you are not testing anything remotely resembling your live system.

Choosing a database happens in the first days of a project, not when the project is mature. You may as well use all the features available in your database, not only the ones common enough that Django created a Python wrapper for them.

Use all the features

Scanning the schema of a Django program I've written and maintained for a few years I see:

  • Schemas (as in the "directory of the tables", not all the other meanings)
  • Custom domains
  • Collations
  • Triggers
  • Permissions
  • Partial indexes
  • Constraint exclusions
  • Views
  • Stored procedures
  • Partitioned tables
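
A couple of these, as they might appear in a schema file (the domain, table and column names here are invented for illustration):

-- a custom domain: a reusable data type carrying its own constraint
create domain email_address as text
    check (value ~ '^[^@]+@[^@]+$');

-- a partial index: enforce at most one active subscription per user
create unique index one_active_subscription
    on subscription (user_id) where active;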

If these features were used, it's because they made it possible to implement certain features the program needed in a simpler way than would have been possible in Python (if it would have been possible there at all). Auditing, for instance: Django doesn't have an audit feature except for the changes made in the admin. Even if you added some form of manual auditing to each save() method, it would not capture changes made outside Django. It wouldn't be very secure either: Django uses a single user to access the database, so if someone manages to hijack that user, they would be able to change the data in the database and alter the audit tables to hide their traces.

In PostgreSQL you can:

  • Create an "audit" user: it will have different permissions than the user of the Django application.
  • Create an "audit" schema, revoking from the Django user write permission on all the objects it will contain.
  • Create a function to append a record to an audit table owned by the "audit" user, but callable by the Django user.
  • Add a trigger to the tables to audit.
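
Hand-rolled, the core of such a setup might look something like this - a sketch only, with made-up table, function and trigger names, and leaving out the grants and indexes:

-- an append-only log table, owned by the "audit" user
create table audit.log (
    id bigserial primary key,
    logged_at timestamptz not null default now(),
    table_name text not null,
    operation text not null,
    old_row jsonb,
    new_row jsonb
);

-- owned by "audit" too; security definer lets the Django user call it
-- without having write access to the audit schema
create function audit.log_change() returns trigger
language plpgsql security definer as $$
begin
    if tg_op = 'INSERT' then
        insert into audit.log (table_name, operation, new_row)
        values (tg_table_name, tg_op, to_jsonb(new));
    elsif tg_op = 'UPDATE' then
        insert into audit.log (table_name, operation, old_row, new_row)
        values (tg_table_name, tg_op, to_jsonb(old), to_jsonb(new));
    else  -- DELETE
        insert into audit.log (table_name, operation, old_row)
        values (tg_table_name, tg_op, to_jsonb(old));
    end if;
    return null;  -- the return value is ignored for AFTER triggers
end
$$;

-- attach the trigger to each table to audit
create trigger audit_changes
    after insert or update or delete on myapp_mymodel
    for each row execute function audit.log_change();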

This setup requires Postgres-specific knowledge, which is fair for a feature that has to watch over the database data whatever the origin of a change is. But PostgreSQL being as extensible as it is, you can use an extension to automate the creation and maintenance of the audit triggers and functions.

So where do I put the schema?

Into an SQL file!

With its tiny .sql extension!

And with comments! Explaining why a certain index or constraint exist!

With constraints named meaningfully, available for manipulation, not auth_group_permissions_group_id_b120cbf9_fk_auth_group_id.

Sounds civilised to me.

The nice thing about doing this is that, if you do things carefully enough, Django will not notice anything at all.

Take the above audit example: Django is not meant to interact with it; everything will just happen under its nose. You can use views instead of tables for read-only models, you can use domains instead of more basic data types for your fields: Django won't see your triggers triggering, your constraints constraining, your permissions permitting - except when things go wrong, which will result in an error 500 and a Python traceback. That is at least better than bad data in the database. It also won't see your procedures proceeding, your domains dominating... you get the idea.
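
For instance, a read-only model backed by a view only needs to tell Django not to manage the table. A sketch (the model, view and field names are made up for the example):

from django.db import models

class MonthlySales(models.Model):
    """Read-only model backed by the sales_by_month view defined in the SQL schema."""
    month = models.DateField(primary_key=True)
    total = models.DecimalField(max_digits=12, decimal_places=2)

    class Meta:
        managed = False               # never created or migrated by Django
        db_table = 'sales_by_month'   # the view created in the .sql files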

Using psql to import the schema into an empty database means you can modularise the code and use \i to import "submodules". A typical pattern for me is to have a database.sql to create global objects (users, extensions, schemas), set the basic permissions and import the details into target schemas.

\i users.sql
revoke create on schema public from public;     -- safety

--- create a schema for the django app tables
create schema myapp;
grant usage on schema myapp to myapp, backup;

--- set the default permissions for all the objects that will be created there
alter default privileges in schema myapp
    grant select, insert, update, delete on tables to myapp;
alter default privileges in schema myapp
    grant select on tables to view, backup;
alter default privileges in schema myapp grant all on sequences to myapp;
alter default privileges in schema myapp grant select on sequences to backup;

--- import the tables into the schema
set search_path to myapp, public;
\i django.sql   -- django objects - users, groups, permissions tables
\i myapp.sql    -- your app models
reset search_path;

The database.sql file doesn't create the database itself, so you can create an empty one anywhere it's needed, then use psql -f database.sql "postgres://connection/url" to populate it. The myapp.sql file shouldn't contain any reference to the myapp schema where the objects are created, so the schema can be easily changed. Postgres doesn't have a statement like CREATE USER ... IF NOT EXISTS: in users.sql you can simulate it with:

do $$
begin
    perform 1 from pg_user where usename = 'myapp';
    if not found then
        create user myapp;
    end if;
end
$$ language plpgsql;

Migrations

Django has an amazingly complex system to perform model migrations. It is complex, amongst other reasons, because:

  • It actually migrates Python models, not database schemas. Even if you only change a field's help text, it generates a migration.

    class Migration(migrations.Migration):

        operations = [
            migrations.AlterField(
                model_name='foo',
                name='bar',
                field=models.BooleanField(help_text='I only changed this'),
            ),
        ]

    That's not useful at all for the database, but Django will create it for you, and if you remove it, it will add it back. Similarly, changing a choices list or a display label results in migrations with no database operation, only a Python operation, and practically no SQL purpose.

  • It allows access to the state of the models at points in time between migrations, using get_model(appname, modelname) and some Python code run against the returned model. But if that code happens to use any code from your application which imports models with a normal Python import, things will crash because of the mismatch between the model definitions and the schema in the database. And they won't crash immediately: only later, when you apply some unrelated migration. Not even when that migration is needed: only after it has already been applied and has nothing left to do in its lifetime, because a migration is a module that Django will keep on importing over and over. In a project I'm currently working on, I had to add this function to our codebase:

    from django.db import transaction

    def can_run(*models):
        """Return True if all the models' tables are usable in the database."""
        try:
            with transaction.atomic():
                for model in models:
                    model.objects.all().first()
        except Exception:
            return False
        else:
            return True
    

    And use it as a whack-a-mole device to stop already-applied migrations from exploding, adding early bailouts like:

    def data_migration(apps, schema_editor):
        if not can_run(Model1, Model2, Model3):
            return
    
        # ...the stuff I meant to do
    
    class Migration(migrations.Migration):
        operations = [
            migrations.RunPython(data_migration),
        ]
    
  • It detects changes to your models and infers how to change the schema. This is pretty much impossible to do in an automated way, except in the simplest cases. So the feature is incomplete, and when it thinks it knows what it's doing, often it doesn't and it's hard to tell it to stop.

    In a case I experienced, my model Foo needed some extra pepper added to its attribute bar. So we decided to fetch the data from the database into a hidden _bar attribute and then have a Python property to expose it:

     class Foo(models.Model):
    
    -    bar = models.CharField(
    +    _bar = models.CharField(
             max_length=255,
    +        db_column='bar',
         )
    +
    +    @property
    +    def bar(self):
    +        return something_more()
    

    No change was needed to the database, but Django insisted on creating a migration consisting pretty much of:

    ALTER TABLE foo DROP COLUMN bar;
    ALTER TABLE foo ADD COLUMN bar varchar(255);
    

    So goodbye to your data! But who needs a bar anyway. Although I think I could've used a bar, and a few drinks, if that migration had hit production.

Much of the complexity of this system is designed to give you a specific feature: migrations abstracted from your database. In my opinion, if there is something less likely to happen than writing a complex app to run on interchangeable databases, it's the need to replay the same history of migrations on a different database. The cost you pay for this YAGNI feature is a cumbersome and fragile system trying to have a life of its own whilst fighting yours. It is a direct application of Jeremy Miller's observation that any extensibility point that's never used isn't just wasted effort: it's likely to get in your way as well.

I may be a bit of a control freak, but I think there is too much magic here and too few hooks to intervene and correct it, all playing too close to my data for comfort. This comes with the advantage of being able to replay your migrations in the future on a different database vendor, but I struggle to find a valid use case for it that doesn't involve parallel universes and time machines.

So what do I use instead of migrations?

You can use SQL files! Yes, the same extension as before. It takes some discipline, but it rewards you with great control over the behaviour of the database and the safety of the data. The work involved consists of:

  • Knowing your database and its data definition and manipulation language, in order to know how to change the schema and the data associated with it.
  • If you have the database schema in source control, when you change the schema you will usually want to add a matching "schema patch". An ex-colleague of mine wrote a script to "prime" a patch with the right name and a comment containing the diff of the schema, to give an idea of what to do.
  • You can have a database table recording the schema patches already applied, and a script, run on deployment, that looks for unapplied patches, runs them, and records them in the table.

I have a patch_db.py script that I have used, with small variations, in several projects. It is meant to be executed as a database superuser, so that the application user can be given only limited privileges: enough to read and change the data it needs, with no need to be a superuser or to own the tables in order for their schema to be changed. A few features of this simple script include:

  • If the database has never been patched, it creates a schema_patch table and registers the patches currently available as already applied.
  • If the table already exists, it compares the content of the patches directory with the content of the table, then applies and registers the missing patches in alphabetical order. You can use the date as a prefix to make sure they are applied in the correct order.
  • It keeps an advisory lock for the entire run to make sure that there will not be two patching processes running concurrently by mistake.
  • The patching process is identifiable in PostgreSQL via its application_name, so it can be monitored and dealt with in case it behaves unexpectedly.
  • Patches can be applied one by one, with interactive user confirmation, or unattended for automatic deployment.
  • You can associate executable scripts to a patch, to run before or after it, to carry out procedures difficult to implement in SQL: you can have my_patch.pre.sh and my_patch.post.py run respectively before and after my_patch.sql is applied, receiving a PATCH_DSN environment variable to know where to connect to.

The script is independent from Django; if used in a Django project it might be reimplemented as a management command, but I have never felt compelled to do so.
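
To give an idea of the moving parts, here is a stripped-down sketch of the approach - not the actual script, and skipping most of the features above (interactive confirmation, pre/post scripts, application_name); the details are illustrative:

#!/usr/bin/env python3
"""Apply the .sql files in schema/patches/ not yet recorded in the schema_patch table."""

import subprocess
import sys
from pathlib import Path

import psycopg2

PATCHES_DIR = Path("schema/patches")

def main(dsn):
    conn = psycopg2.connect(dsn)
    conn.autocommit = True  # record each patch as soon as it is applied
    cur = conn.cursor()

    # session-level advisory lock: no two patching processes at the same time
    cur.execute("select pg_advisory_lock(42)")

    cur.execute("select to_regclass('schema_patch') is null")
    (first_run,) = cur.fetchone()
    cur.execute("""
        create table if not exists schema_patch (
            name text primary key,
            applied_at timestamptz not null default now())
        """)

    patches = sorted(PATCHES_DIR.glob("*.sql"))

    if first_run:
        # never patched before: assume the schema is current and only register
        for patch in patches:
            cur.execute("insert into schema_patch (name) values (%s)", (patch.name,))
        return

    cur.execute("select name from schema_patch")
    applied = {name for (name,) in cur.fetchall()}
    for patch in patches:
        if patch.name in applied:
            continue
        # run the patch with psql, so \i and friends keep working
        subprocess.run(
            ["psql", "-X", "-v", "ON_ERROR_STOP=1", "-f", str(patch), dsn],
            check=True)
        cur.execute("insert into schema_patch (name) values (%s)", (patch.name,))

if __name__ == "__main__":
    main(sys.argv[1])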

Once this script is integrated into your deployment procedures, here are a few examples of patches it can apply, found by scanning the patches directories of projects I have worked on:

  • Add a field to a table, populate it with data obtained querying other database tables.
  • Move a field to another table, split a table into two joined tables, change a relation from one-to-many to many-to-many without losing the current relations.
  • Partition previously unpartitioned data by creating partitions, moving data into them, and emptying the base table.
  • Add a new currency to the currencies table, change the name of that Balkan nation to "Republic of North Macedonia", remove that island from the list of European Union nations, or any other change to some "configuration table".
  • Fix data after a bug producing wrong data was discovered.
  • Add a new Django group and set the group permissions or add new objects' permissions to existing groups.
  • Create a normal column for a piece of data that was previously stored in an unstructured way in a JSON field, remove the data from the JSON and move it to the new field (see the example below).
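
For instance, that last one might look something like this (hypothetical table, column and key names, assuming details is a jsonb column):

BEGIN;
ALTER TABLE myapp_customer ADD vat_number text;
UPDATE myapp_customer SET vat_number = details->>'vat_number';
UPDATE myapp_customer SET details = details - 'vat_number';
ALTER TABLE myapp_customer ALTER vat_number SET NOT NULL;  -- if appropriate
COMMIT;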

In practice, Django migrations automatically create only the simplest of operations, like adding a new (empty) column or dropping a column. Renaming a column is already an operation fraught with perils, and manipulating data using Python models while the models themselves are being altered is a task of Schrödingerian complexity.

Is it difficult to write patches?

Most of the time, writing an SQL patch is simpler than writing a patch in the Django migration system. The latter, in order to be done successfully, requires knowledge not so much of the database as of the migration system itself.

Looking at a few examples, taken directly from Django's own writing migrations page:

Adding a non-nullable field in Django migrations involves creating the normal migration, i.e. the one creating a field=models.UUIDField(default=uuid.uuid4, unique=True), then creating two empty ones (it's suggested to use makemigrations myapp --empty because there's a certain amount of boilerplate involved), ending up with pretty much this sequence:

  • Create a field with the default, but nullable
  • Fill it with values (using a RunPython step)
  • Make it non-nullable

There is also a bit of a race condition with objects created while the migration is being applied.

The same steps, in SQL, take much less boilerplate: in a single patch you can run:

BEGIN;
ALTER TABLE myapp_mymodel ADD uuid uuid;
UPDATE myapp_mymodel SET uuid = uuid_generate_v4();
ALTER TABLE myapp_mymodel ALTER uuid SET NOT NULL;
COMMIT;

Or variations of the above (if you want the default generated by the database, a non-locking operation on a large table, a unique constraint...)
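
For instance, the first and last of those variations might read (a sketch; uuid_generate_v4() comes from the uuid-ossp extension, as above):

BEGIN;
ALTER TABLE myapp_mymodel ADD uuid uuid DEFAULT uuid_generate_v4();
ALTER TABLE myapp_mymodel ALTER uuid SET NOT NULL;
ALTER TABLE myapp_mymodel ADD CONSTRAINT myapp_mymodel_uuid_key UNIQUE (uuid);
COMMIT;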


Non-atomic migrations can be dealt with either in a single SQL script, using a PL/pgSQL DO statement, or with an additional script replacing the second step of the previous example (untested):

# migration add_uuid_nullable.post.py
import os
import psycopg2

conn = psycopg2.connect(os.environ["PATCH_DSN"])
conn.autocommit = True  # commit each batch on its own
cur = conn.cursor()
while True:
    # update at most 1000 rows per round, until nothing is left to update
    cur.execute("""
        UPDATE myapp_mymodel SET uuid = uuid_generate_v4()
        FROM (
            SELECT id FROM myapp_mymodel WHERE uuid IS NULL
            ORDER BY id LIMIT 1000
        ) x
        WHERE myapp_mymodel.id = x.id
        RETURNING myapp_mymodel.id
        """)
    if not cur.fetchall():
        break

Migrating data between third-party apps also runs into the whole get_model() machinery, and you have to fight its errors.

It's much simpler if you just leave Python out:

ALTER TABLE oldapp_oldmodel RENAME TO newapp_newmodel;

Or whatever INSERT and UPDATE it takes to get data into new tables.


Changing a ManyToManyField to use a through model is another example where Django considers your data expendable and makes you jump through sophisticated hoops if you romantically want to hold onto it.

The whole example is 64 lines of code that make sense only if you know the internal model of the schema-changing machinery and how to separate schema changes from Django model state operations (which, in my opinion, had no reason to be merged together in the first place, and the second has no reason to exist).

Dropping the complications, everything you need is here:

ALTER TABLE core_book_authors RENAME TO core_authorbook;
ALTER TABLE core_authorbook ADD is_primary bool default false;

The first statement, by the way, appears verbatim in a RunSQL example. I don't think this example would necessarily have been portable anyway, so I really struggle to see the point of all this.

Commands you don't want to use

Once you have introduced your migration procedures, and motivated your team to use them, here are a few operations you can use to replace the official commands involving Django migrations:

  • manage.py migrate: the first time (e.g. new dev setup, unit tests) you can run psql -f database.sql. If you do it in a test suite run, you will have tested your schema too. If an initial schema is already in place (deployment in staging and production, distributing changes to other developers) you can run something similar to the proposed patch_db.py and have tested the patches too.

  • manage.py makemigrations: run git diff schema/*.sql and work out what you have to do, like all the other times you wanted to use a Django command but it didn't do everything you needed.

  • manage.py sqlmigrate: replaced by cat.

  • manage.py showmigrations: you can use patch_db.py --dry-run.

  • manage.py squashmigrations: ...I am in awe of the complexity of squashmigrations. It reduces your many migrations to an undetermined number of fewer migrations, depending on how often you have been creative and used SQL or Python in them. You can mark operations as "elidable" if you want to have them dropped, but be careful of CircularDependencyError, although you can use --no-optimize...

    You can replace squashmigrations with rm.

    Once the migrations have been applied to all your testing and production systems, and to all your developers' databases, just remove them. Or leave them where they are: who cares. Even thousands of them won't do any harm to your program's start-up time: they are not automatically imported modules. If you want to keep them and not drown in patches, you can divide them into month or year directories and change one line of the patch_db script to find them.

Tips to write a migration

Writing a complex migration consists of trying out the operations to perform in a testing database. This is often an iterative procedure of trial and error, where mistakes are made and you have to go back to the state before migrating (and don't forget to add that semicolon!).

A way to write a migration patch of the type described here consists roughly of the following:

  • Create a file in the schema/patches/ directory called YYYY-MM-DD_some_meaningful_description.sql. If you happen to write more than one patch in the same day, and their order matters, you can add a ~02 after the date in the name of the second patch and so on, as ~ sorts after _.
  • Add a BEGIN; statement to the top of the file, followed by the statements you want to test.
  • Connect to your test database with a psql shell and use \i /path/to/your/patch.sql to try and apply your patch.
  • If applying the patch fails, run a ROLLBACK in psql, fix the error and try \i again.
  • If applying succeeds, you can explore the database to verify that the changes applied have produced the right results. You may want to run a SAVEPOINT x command now. In case you mistype a command while verifying the results, you can run ROLLBACK TO SAVEPOINT x and revert the transaction to the state just after the patch was applied.
  • Your patch applied ok, but you don't like the result? ROLLBACK, fix the patch then try again.
  • If you are happy with what you see, you can ROLLBACK this psql transaction, add a COMMIT; to the end of the patch script and run patch_db to apply the patch just created for good. If, by mistake or intention, you committed the psql changes, you can run patch_db interactively and "skip forever" the patch when asked for confirmation.

Another random bit of advice

The motivation for writing this article comes from knowledge sharing with the team I'm currently collaborating with, who are using Django the canonical way. I am convinced they could make better use of the tools they have. Let's see if they can be persuaded to drop Django migrations...

As I'm writing this entire brain dump, here's another trick that should be filed under "know what your tool is doing". Sometimes Django pages become slow because adding something apparently harmless, such as accessing a model attribute, results in unexpected queries running behind the scenes. If it happens in a loop, you get a "ripple load": you run 50 queries, each fetching one record, when you could have run a single query fetching 50 records, or just added an extra JOIN to a query you were already running.

How do you identify this problem? While you are developing with the development server, you can log all the operations performed on the database by adding the following logger configuration to your settings.py:

LOGGING = {
    'version': 1,
    'disable_existing_loggers': False,
    # ... your other settings
    'handlers': {
        'console': {'class': 'logging.StreamHandler'},
        # ... your other handlers
    },
    'loggers': {
        # ... your other loggers
        'django.db': {
            'handlers': ['console'],
            'level': 'DEBUG',
        },
    },
}

You will get used to the amount of SQL popping up in the dev server log at each request. If you accidentally introduce a ripple load, it'll be very easy to spot: a fast train of short statements will blaze into the log console. Whether it happens in Python code, in a template, or as a result of asynchronous Ajax calls (which request-oriented tools such as the Django Debug Toolbar don't capture), it will be unmissable. Looking at those statements, it's easy to understand which model caused them and the right select_related() to add.
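
As an illustration, with a hypothetical Book model having a foreign key to Author:

# Ripple load: one query for the books, then one more query per book
for book in Book.objects.all():
    print(book.author.name)

# Single query with a JOIN: the related authors are fetched together
for book in Book.objects.select_related('author'):
    print(book.author.name)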

You're welcome.

"Do you hate Django?"

No, I don't. I still like it, although with Python moving towards asyncio and static typing, and with the established web architecture of JSON APIs consumed by JavaScript front-ends, the appeal of other frameworks such as FastAPI is increasing. As opinionated about databases as I am, I quite like the Django ORM: it is very simple to use, well integrated with the Python objects it produces, and it allows you to write monsters of nested queries with great ease. Peeking at the logs of the SQL statements produced for you by the django.db logger fosters some admiration.

What I really don't like is when tools get in the way of someone's work. Sometimes tools are complex because the problem they try to solve is complex too. But you are free to ask yourself this question: do I really have that problem? You might conclude that you don't, and you can make choices against convention to simplify your life.

In the case of Django, the right economy for me is:

  • Get to know the ORM features inside out and take full advantage of them to query and manipulate data.
  • Avoid using the models to produce the database schema; use the native database language and tools instead.
  • Stay well clear of migrations.