Wednesday, May 20, 2009

The MySQL Sandcastle

Developer sandboxes are crucial for quality software creation: a place to work in a locally destructive way, trying new avenues to problem solving at minimal risk. There are several different ways to put a sandbox together, but all essentially built around the concept of a scaled down copy of the target production environment.

In the case of internet applications this means an interface server (usually web), processing environment, and persistence layer (disk or database storage). Most simplistically this is just another subdirectory or virtual host directed to the developers' local copy of the operating code, where they get to make their changes without affecting or being affected by the work of others (shared processor and disk constraints aside, and out of scope for this discussion). The persistence layer, most commonly a database, has some additional constraints though: getting enough good sample data in place to be a good representation of real-world activity, both in terms of permutations and raw number of records, can be cost prohibitive either in documentation and setup (assembling the samples) or simply in the size of the data set.

Working with smaller data sets tends to help with the validation of structure, and doing it on scaled down hardware is often used as a way of simulating (poorly) the probable performance of proportionately larger data sets in a higher-class processing environment (and typically under greater load). There are some things that just can't be worked out on a reduced scale, however, especially related to index and query tuning.

A solution we've recently implemented is the MySQL Sandcastle - a more elaborate construct on top of the typical sandbox concepts that has some significant benefits for web app development.

We started with a rich staging environment: a copy of the production database, minimally sanitized (personally identifying and/or financially sensitive data obscured) and trimmed down only as appropriate (retaining 90 days worth of operational data from history tables) for the hosting environment. The sanitation makes it more like a bridge between test and literal stage, but gives us exactly what we need to work with regardless.

On top of that staging environment we're taking advantage of the MySQL Merge storage engine, which essentially marries multiple table definitions under a single abstracted table interface. This is similar to a unioned view, except that the structural checking is done during creation, the contents are fully malleable to all standard CRUD operations, and there's no need to spool data to temporary tables (which most databases do in some form or other for these kinds of operations) during complex queries. A default insert mode also tells the table which of the component tables to add records to (exclusively).

Per the illustration then, we have the Primary Staging Repository (PSR), plus 2 additional databases per sandbox (STAGE_[user] and STAGE_[user]_M) which round out the environment. The STAGE_[user] database replicates the structure of the PSR but starts out completely empty. STAGE_[user]_M shadows this structure, but swaps out the engine definition from MyISAM to Merge in order to combine user stage and PSR. In order to keep PSR clean for all developers to continue their work, each user is granted an un-privileged account with full access to their own sandbox databases and read-only to the large store (Merge table definitions must be created with full CRUD permissions to the underlying tables as well, so these are created by a privileged user prior to turning the reins over to the per-user accounts), then accesses the environment only through the ..._M instance.

Obviously this restricts activity: any attempt to update or remove records which exist only in PSR will likely produce errors, and at the very least will be ineffectual and quite probably anomalous. The most effective usage pattern is for unit tests and general developer activity to still recreate data they intend to directly modify, leaving the large store as a good general sample set for read activities (which accounts for ~60-80% of all development activity anyway). The benefits are pretty big: developers get cheap access to large swaths of regularly refreshed data without having to continually repopulate or propagate in their own environments, can work destructively without interference, and even test structural changes (simply excluding PSR from the Merge and redefining the table) to the database before finally recombining efforts in Stage (which does work destructively against the PSR) for integration testing before promoting to production.

There are some other disadvantages as well: it's possible for local insertions to the underlying tables to create keys in identical ranges which will appear to the client as duplicate primary keys (in violation of the very clear and appropriate designation of "Primary Key") in the merged data set. Setting new exclusive AUTO_INCREMENT ranges per table doesn't help yet either, due to a bug in the engine that treats it like a combinatory (multi-column) key definition, using MAX( PK ) + 1 instead of the defined range of the target table. Merge tables are also restricted to MyISAM table types, excluding the oft popular and appropriate InnoDB (and other) options. Treading carefully easily avoids these, but it's good to be aware of them.

A more complete workaround would be updateable views, using a combination of exclusive UNION (rather than implicit UNION ALL) selects and a series of triggers which not only manage insert activity, but replicate into the developer shadow sandbox the intended target record prior to modification in the case of insert. Establishing this kind of more extensive sandcastle pattern would be far more capable, should be scriptable without great effort, but still carry a small number of its own gotchas (most notably that un-altered trigger definitions on the underlying target tables with any dependence on external tables would be operating locally rather than through the same abstracted view, and insert triggers may fire at unexpected times). For now the limited Merge version is sufficient for our needs however, so we haven't gone so far as to even kick the tires on this possible approach.

I have a few more changes to make to the sandbox recreation script before it's ready for wider consumption, but it's not an especially arduous process for someone enterprising to reproduce on their own with little effort.

Enjoy!

No comments: