Developing database-backed applications, I've often needed to extract a meaningful selection of data out of a production database: maybe to set up a development or testing environment, or to try out a complex migration on a smaller dataset.
However, a pg_dump of the production db is not necessarily a good idea:
- the production data may be YUGE;
- the production data may contain sensitive information;
- some of the data in the production db might be bulky and irrelevant, for instance audit tables.
Still, googling around doesn't turn up much help: people mostly advise using pg_dump with a plethora of --table options, but that means saving a script just to keep the unwieldy command line around, and it still doesn't let you modify the dumped data in any way, for instance to omit the password column. You might use COPY TO for that, but it's an entirely different beast from pg_dump and the resulting file is only data: not even the column names. It carries no context about how it should be restored (and now that you dropped the password column the data won't align anymore, so you won't be able to just run a COPY FROM...)
A few months ago I decided to scratch this itch properly, and I started writing a tool that, after some improvements, was deemed good enough to be stamped as 0.1. After some discussion about the meaning of dump and what a good synonym could be (with references ranging from relation breakups to more scatological ones), it was called pg_seldump.
The idea of the tool is to define a YAML file containing a description of the rules to apply to the different database objects to dump. For instance:
# A YAML file is similar in content to a JSON file, but easier for humans
# to edit: more compact, less noisy, and can be commented.
db_objects:

  # dump all the tables in the schema 'myapp'
  - schema: myapp
    action: dump

  # but not the ones starting with 'audit_'
  - schema: myapp
    names: "^audit_.*"
    action: skip

  # don't dump customers passwords, and replace their telephone no. with a dummy
  - schema: myapp
    name: customer
    no_columns:
      - password
    replace:
      telephone: "+44 7853 123456"
This description would produce a data-only dump: a text file containing the data and enough statements to restore it by piping the whole file into psql.
The program isn't finished yet, but it is already quite interesting: the pg_seldump command takes as input one or more configuration files like the one in the example and introspects the database looking for the objects to dump. The objects are matched against all the rules and, in a way inspired by CSS, the rule with the highest selectivity wins. It also understands which fields use which sequences, so if a field is dumped the state of its sequence will be saved as well.
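To give an idea of what I mean by selectivity, here is a rough sketch of the principle (just an illustration, not pg_seldump's actual matching code, and the scores are made up): an exact name beats a names pattern, which beats a schema-wide rule.

# Illustrative sketch only, not pg_seldump's real matching code: a
# CSS-like specificity score, where an exact 'name' beats a 'names'
# regexp, which beats a schema-only rule; the highest score wins.
def rule_specificity(rule):
    score = 0
    if "name" in rule:        # exact object name: most specific
        score += 100
    elif "names" in rule:     # regexp on the name: less specific
        score += 10
    if "schema" in rule:      # a schema filter adds a little weight
        score += 1
    return score

rules = [
    {"schema": "myapp", "action": "dump"},
    {"schema": "myapp", "names": "^audit_.*", "action": "skip"},
]
# among the rules matching a given table, pick the most specific one
winner = max(rules, key=rule_specificity)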
There is one large feature left to develop: automatically navigating foreign keys in order to collect all the data necessary to create a consistent sample. I started working on that feature yesterday, but there was something else I wanted to do first...
What's in a configuration file?
So far, the program has had a parser for the configuration file, creating "rule" objects out of what it found in the YAML. While it works, it's quite verbose, it stops at the first error, and it's hard to tell from it what is really allowed in the file. I find working with unstructured data really impractical: every single property is optional, and every typo (schmea) has to be checked for explicitly.
In order to check the configuration entirely before throwing it into the rest of the program, and in order to document precisely what a valid configuration file is, I took a look at JSON Schema. Now, that sounds wrong: validating YAML with JSON Schema? It turns out to be no problem at all, because the validation doesn't happen at the syntactical level: it is performed on the objects after parsing. As long as the YAML is limited to the same data types JSON can handle, it just works.
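A toy example (not part of pg_seldump) shows why this works:

import json
import yaml

doc_yaml = """
db_objects:
  - schema: myapp
    action: dump
"""
doc_json = '{"db_objects": [{"schema": "myapp", "action": "dump"}]}'

# once parsed, both documents are the same plain Python dicts and lists,
# which is all a JSON Schema validator gets to see
assert yaml.safe_load(doc_yaml) == json.loads(doc_json)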
JSON Schemas are quite simple to define. My first thought was noooo, another huge thing to learn just for a simple task... but actually it turned out not to be difficult at all: it didn't take long to put together a schema for the pg_seldump configuration file (in YAML, of course). With that in place it is possible to perform automatic validation of a document against the schema (for instance using the jsonschema Python package). Exciting times...
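I won't paste the real schema here, but a fragment in the same spirit, reconstructed from the error messages shown below and written as a Python dict rather than YAML, might look roughly like this:

# Not pg_seldump's actual schema: just an approximation consistent with
# the validation errors shown in the rest of this post.
config_schema = {
    "type": "object",
    "properties": {
        "db_objects": {
            "type": "array",
            "items": {
                "type": "object",
                "additionalProperties": False,
                "properties": {
                    "schema": {"type": "string"},
                    "name": {"type": "string"},
                    "names": {"type": "string"},
                    "action": {"enum": ["dump", "skip", "error", "ref"]},
                    "no_columns": {
                        "type": "array", "items": {"type": "string"}},
                    "replace": {"type": "object"},
                    "adjust_score": {"type": "integer"},
                },
            },
        },
    },
}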
However the results are not entirely satisfactory. For instance, this is a configuration file containing 4 errors:
1  # badconfig.yaml
2  db_objects:
3    - name: 10
4      action: download
5      kid: table
6      adjust_score: twelve
It's easy to use the validator to find all the errors at once:
>>> import jsonschema
>>> import yaml
>>> schema = yaml.load(open("seldump/schema/config.yaml"))
>>> validator = jsonschema.Draft7Validator(schema)
>>> config = yaml.load(open("badconfig.yaml"))
>>> print('\n'.join(e.message for e in validator.iter_errors(config)))
Additional properties are not allowed ('kid' was unexpected)
10 is not of type 'string'
'twelve' is not of type 'integer'
'download' is not one of ['dump', 'skip', 'error', 'ref']
However these error messages, as informative as they are, have lost the context of where they originated. Sure, 10 is not a string, but good luck finding the right "10" in a large file. The problem comes from the very characteristic that allowed us to validate a YAML file against a JSON schema itself expressed in YAML: everything is based on plain Python objects (scalars, dicts, lists), and the information about what line of what file each value came from is long lost. For the same reason, the order of the errors is undetermined.
In order to overcome the problem, the idea is to attach file and line information, available at parsing time, to the parsed objects. In a first approximation of the solution I had subclassed only the dictionaries, to attach their position to them. In case of error, the name of the file and the first line of the dictionary would be printed: better than nothing I guess, but not perfect (in the above example every error would have been reported at line 3).
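Something along these lines (an illustrative sketch, the names are not the ones used in pg_seldump):

# A dict subclass carrying the file and the line where the mapping starts.
class PosDict(dict):
    def __init__(self, *args, filename=None, line=None, **kwargs):
        super().__init__(*args, **kwargs)
        self.filename = filename
        self.line = line

rule = PosDict({"name": 10, "action": "download", "kid": "table"},
               filename="badconfig.yaml", line=3)

# whatever entry is wrong, the error can only be reported at the line
# where the whole mapping begins, i.e. badconfig.yaml:3
print(f"error at {rule.filename}:{rule.line}")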
With the introduction of the JSON Schema validator I went back to the customized YAML parser and pushed the customization further, subclassing lists and wrapping scalars to memorize their positions too. The scalars themselves don't remember the attribute they are attached to: the 10 above knows it came from line 3 of a file, but not that it was the tentative name of an object. So the idea was to further instrument the containers returned by the YAML parser, memorizing, item by item, what line each value was parsed from. When a schema validation error is found, jsonschema errors carry a useful path attribute describing in what object the error was found (for instance ['db_objects', 0, 'name'] for the "10" error). So, if the last element of the path is a string and the penultimate object is a dictionary, there is enough information to associate the error back to the original file and position.
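A simplified sketch of the technique (class and attribute names are made up for the example, and pg_seldump does more than this, e.g. to locate errors on unexpected keys):

import yaml

class PositionedDict(dict):
    """A dict that also remembers on what line each of its values was parsed."""
    def __init__(self):
        super().__init__()
        self.positions = {}     # key -> 1-based line number

class PosLoader(yaml.SafeLoader):
    def construct_yaml_map(self, node):
        data = PositionedDict()
        yield data
        for key_node, value_node in node.value:
            key = self.construct_object(key_node, deep=True)
            data[key] = self.construct_object(value_node, deep=True)
            data.positions[key] = value_node.start_mark.line + 1

PosLoader.add_constructor(
    "tag:yaml.org,2002:map", PosLoader.construct_yaml_map)

def error_position(config, error):
    """Return the line a jsonschema ValidationError refers to, if recoverable."""
    path = list(error.path)             # e.g. ['db_objects', 0, 'name']
    if not path:
        return None
    obj = config
    for step in path[:-1]:
        obj = obj[step]
    if isinstance(obj, PositionedDict) and isinstance(path[-1], str):
        return obj.positions.get(path[-1])
    return None

# usage:
# config = yaml.load(open("badconfig.yaml"), Loader=PosLoader)
# for error in validator.iter_errors(config):
#     print(error_position(config, error), error.message)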
The result, I think, is very user-friendly:
$ pg_seldump --test badconfig.yaml
ERROR at badconfig.yaml:3: name: 10 is not of type 'string'
ERROR at badconfig.yaml:4: action: 'download' is not one of ['dump', 'skip', 'error']
ERROR at badconfig.yaml:5: Additional properties are not allowed ('kid' was unexpected)
ERROR at badconfig.yaml:6: adjust_score: 'twelve' is not of type 'integer'
It wasn't possible to express everything in the JSON schema: demanding at most one of name and names in rule objects seems quite difficult, and... I don't think a regular expression exists to verify that a certain string is a valid regular expression, does it? So certain finer checks are still performed in a separate step after schema validation, but still before handing control to the rest of the program. However, emitting all the errors in one go, together with the precise location where they occurred, makes the program feel like an ally rather than an opponent. This is a program I would like to use.
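Those leftover checks are simple enough anyway; something in this spirit (again just a sketch, not the project's code):

import re

def check_rule(rule):
    """Finer checks that the JSON schema can't express (illustrative only)."""
    errors = []
    if "name" in rule and "names" in rule:
        errors.append("can't specify both 'name' and 'names'")
    if "names" in rule:
        try:
            re.compile(rule["names"])
        except re.error as e:
            errors.append(f"'names' is not a valid regular expression: {e}")
    return errors

# e.g. check_rule({"name": "customer", "names": "^audit_("})
# reports both problems in one go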
Enough of YAML and configuration! Now I'm looking forward to diving into the problem of navigating the graph of the tables across foreign keys and dynamically generating the nested queries to extract a consistent sample of data, watching out for the pitfalls of circular foreign keys and self-referencing tables...