Sam Boyer

Tag1 Quo: Engineering Versions at Scale

After a long hiatus, we're back! When we left off last fall, we were looking at the mechanics of version comparison. In this post, we'll get into more practical matters: the approach we actually took to building out Tag1 Quo's version management system.

When we started working on Quo, we knew that we were going to lean heavily on versions for pretty much all aspects of the system’s functionality. Many of the individual requests arriving at Quo’s APIs would necessarily have to trigger many, many version comparison checks, often not even as a final goal, but as a step towards some broader goal. Here are some of the questions we knew Quo would need to answer:

  • Given a version of a particular extension, are there any newer versions?
  • How many newer versions are there?
  • Are any of those newer versions security updates?
  • Given a site with a set of extensions, do any of those extensions have security updates?
  • Given a new [security] release for a particular extension, which sites that we monitor need to receive a notification?

This suggested two things: we needed to design a system where individual version comparison checks would be both highly efficient and correct. This created some interesting requirements, necessitating that we deviate from typical approaches to Drupal system design.

Efficiency

One can imagine a “typical” Drupal implementation of this version information: perhaps an entity, with individual fields for each of the six version components. Entities needing version information - like those representing an upstream module release, or an instance of a module on a client site - would keep a reference to a version entity. Comparing such entities by version, then, would be a question of loading up the version sub-entity and comparing the values in PHP. That would entail:

  1. Load up one version object from the database
    1. Call out to an entity controller
    2. Trigger field handlers and several database queries (barring caches)
    3. Construct the appropriate typed PHP object
    4. (Fire hooks all along the way)
  2. Repeat for the other version object (if not already loaded)
  3. Perform the (up to) six-step numerical comparison between the two objects’ numerical field components

That’s not cheap. And if we need to answer a question like “how many newer versions are there?”, then we have to perform this set of operations once for each known version in an extension’s universe. That number is fairly small - even the extensions with the largest number of releases only have on the order of hundreds. But if we have to do a check for a new release of an extension against all the sites Quo is tracking, then that scales linearly in the number of sites with the extension enabled. If the extension is, say, Views, that’s basically every client site we have.
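Setting aside the entity loading, the comparison and counting steps themselves can be sketched in plain PHP. This is a minimal illustration rather than Quo's code: it assumes each version has already been loaded and flattened into an array of six integers ordered from most to least significant, and the function names `compare_version_tuples` and `count_newer` are hypothetical.

```php
<?php
/**
 * Compare two versions given as six-element integer arrays.
 *
 * Hypothetical helper: the component order (most significant first) is
 * assumed for illustration.
 *
 * @return int -1 if $a is older than $b, 1 if newer, 0 if equal.
 */
function compare_version_tuples(array $a, array $b): int {
  if (count($a) !== 6 || count($b) !== 6) {
    throw new InvalidArgumentException('Expected six version components.');
  }
  // Walk from most to least significant; the first difference decides.
  for ($i = 0; $i < 6; $i++) {
    if ($a[$i] !== $b[$i]) {
      return $a[$i] < $b[$i] ? -1 : 1;
    }
  }
  return 0;
}

/**
 * Count the versions in $universe that are newer than $current.
 */
function count_newer(array $current, array $universe): int {
  $newer = 0;
  foreach ($universe as $candidate) {
    if (compare_version_tuples($candidate, $current) > 0) {
      $newer++;
    }
  }
  return $newer;
}
```

The loop itself is trivial; in the entity-based design, the expense comes from loading every element of `$universe` through the entity system before this code can even run.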

This is the point in a more typical Drupal build where we might entertain layering in caching: maybe we precompute each little version universe. Maybe we keep a marker on the instances of extensions on each client site, indicating whether or not they’re insecure. Or maybe a listing of all the versions within a universe that are insecure, or the most recent secure version in each line of development.

Each of these things might be useful in computing the answers to Quo’s big questions. They might allow us to avoid repeating the arduous steps described above, at least some of the time. And they might be sufficient for scaling Quo to handle tens or hundreds of thousands of client sites.

But scaling isn’t the only thing that matters.

Correctness

Quo is intended to be a service our clients can rely on. Like, really rely on. That kind of reliability entails designing with the goal of minimizing failure modes. In engineering practice, that tends to be one of the better ways of ensuring our systems are correct - that they behave as expected.

Some failures we can’t really do anything about: if a network partition occurs between Quo’s servers and a client’s machine, then reports about what’s installed can’t reach Quo. But this is a classic networked and distributed systems problem; we basically have to accept this as a failure mode, and focus on the ones we can do something about.

Assuming the client site’s messages can reach Quo’s servers, it’s crucial that Quo’s service be correct. In particular, Quo needs to avoid false negatives - telling users that their sites are secure, when they’re actually not.

All manner of bugs could result in such misinformation reaching users. If those bugs are errors in reporting or display logic, they’re relatively lower risk, as they cause no lasting harm. Bugs in how data is processed, however, are far more insidious, as they can result in inconsistent database state. Such inconsistent states require careful auditing and analysis to locate. Even if found, it can be nigh-impossible to determine what caused the inconsistency in the first place.

If we were to rely on the “typical” Drupal approach for representing versions, then wrap caching around it, it would expose Quo to the possibility of data inconsistencies through race conditions.

Outside of core developers working on low-level subsystems, most Drupal folks don’t think about race conditions. They don’t have to! PHP is single-threaded, which eliminates the possibility of data races in day-to-day programming. And Drupal’s core itself does much with database transactions to guard against races when operating on standard Drupal objects, such as entities. Most of all, though, the overwhelming majority of Drupal sites just don’t see enough resource contention for problems to arise.

Caching introduces the possibility of race conditions because it shifts responsibility for providing canonical representations of data outside of a single, well-defined “segment” that Drupal will keep consistent. Let’s say, for example, that we compute and cache the list of versions that are known to be insecure, per extension. If a message were to arrive from a client site with information about that extension at the same moment that a background job happens to ingest new release information about that extension, what might happen? The former is likely interested in reading from the cache that the latter might have to rewrite; there’s all manner of ways that either of the processes might get partial or stale information, and end up making a bad choice and potentially persisting it to cache, or even to the database.
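To make that interleaving concrete, here is a deliberately simplified, single-process simulation of it. Everything in it is hypothetical: the arrays stand in for a cache and for persisted site status, the extension name, version strings, and site name are made up, and the three "steps" are simply executed in the unlucky order rather than by real concurrent processes.

```php
<?php
// Hypothetical stand-ins for a cache and for persisted site status.
$cache = ['views' => ['insecure' => ['3.1']]];
$siteStatus = [];

// Step 1: the client-report handler reads the cached insecure list.
$snapshot = $cache['views']['insecure'];

// Step 2: before the handler finishes, a background job ingests a
// security release that marks 3.2 as insecure and rewrites the cache.
$cache['views']['insecure'] = ['3.1', '3.2'];

// Step 3: the handler decides using its stale snapshot and persists it.
$reported = '3.2';
$siteStatus['example.com']['views'] =
  in_array($reported, $snapshot, true) ? 'insecure' : 'secure';

// The persisted answer now disagrees with the current cache contents.
$actuallyInsecure = in_array($reported, $cache['views']['insecure'], true);
```

The handler has stored a "secure" verdict for a version the system already knows to be insecure: exactly the kind of false negative Quo needs to avoid.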

Lest we forget: cache invalidation IS one of the hard problems in computer science.

It’s not that these problems aren’t solvable. Drupal has a locking system, out of which we can cobble together protections against data races...more or less. But locking systems are notoriously hard to reason about, and our case has multiple points of ingress for both reads and writes, making the problem far worse. We’d have to create an absurdly complex and detailed plan to be assured that our version data cannot become inconsistent. And, of course, the more complex the plan, the more likely it is we’d miss something - maybe an inconsistency vector, or a deadlock that could tie up PHP worker threads, effectively DoSing Quo’s servers.

So, we don’t want to rely on locks for correctness and data consistency. If locks are out, then so is caching as a strategy for performance and scaling. And if caching is out, then with it goes the conventional Drupal approach of entity references and clicked-together fields.

We need something more.

Pushing the work down: Versions as fields

We know that a conventional approach won’t be good enough for Quo’s versions. OK then - what does an unconventional approach look like? Let’s step back and review what we know:

  • As described in this series’ preceding posts, we can express versions as a six-tuple of positive integers.
  • Comparisons of these integers will yield a correct relative ordering of versions, and the entities to which they’re attached.
  • The universe of known versions for a given extension is the union of the official releases and the LTS releases.
  • The universe of encountered versions for a given extension is whatever versions the client sites report that they have, if any at all.
  • PHP is slow, but C is fast!

That’s right - we’re going to push the work down into the database.

Because we can encode versions as these numerical coordinates, it’s easy to create a table that expresses all six dimensions, then compose SQL queries to perform almost all of the version-comparing business logic. Pushing the work down into highly optimized database C code drops the cost of each version comparison operation from the 1-10ms range towards hundreds of nanoseconds. That’s an improvement of 4-5 orders of magnitude - enough that we can stop worrying about caching entirely, and just do our version checks on the fly.
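As a rough, standalone illustration of letting the database do the comparing, the sketch below answers "how many newer versions are there?" with a single query. It is not Quo's schema or code: the table `release_version` and its columns `v1` through `v6` are hypothetical, it uses raw PDO against an in-memory SQLite database rather than Drupal's query builder, and it leans on SQL row-value comparison as a shorthand, which assumes the pdo_sqlite extension and SQLite 3.15 or later.

```php
<?php
// Assumes the pdo_sqlite extension and SQLite 3.15+ (row-value support).
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// Hypothetical schema: one row per known release, six integer columns
// ordered from most significant (v1) to least significant (v6).
$db->exec('CREATE TABLE release_version (
  extension TEXT NOT NULL,
  v1 INTEGER NOT NULL, v2 INTEGER NOT NULL, v3 INTEGER NOT NULL,
  v4 INTEGER NOT NULL, v5 INTEGER NOT NULL, v6 INTEGER NOT NULL
)');

$insert = $db->prepare('INSERT INTO release_version VALUES (?, ?, ?, ?, ?, ?, ?)');
$rows = [
  ['views', 7, 3, 0, 0, 0, 0],
  ['views', 7, 3, 1, 0, 0, 0],
  ['views', 7, 3, 2, 0, 0, 0],
  ['ctools', 7, 1, 9, 0, 0, 0],
];
foreach ($rows as $row) {
  $insert->execute($row);
}

// The database performs the entire "newer than" comparison.
$count = $db->prepare(
  'SELECT COUNT(*) FROM release_version
   WHERE extension = ?
     AND (v1, v2, v3, v4, v5, v6) > (?, ?, ?, ?, ?, ?)'
);
$count->bindValue(1, 'views', PDO::PARAM_STR);
// Bind the six components as integers so they compare numerically.
foreach ([7, 3, 0, 0, 0, 0] as $i => $component) {
  $count->bindValue($i + 2, $component, PDO::PARAM_INT);
}
$count->execute();
$newer = (int) $count->fetchColumn();
```

No version objects are ever constructed in PHP here; the only value that crosses back from the database is the count.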

By computing on the fly from the canonical version data, rather than some precomputed cache, we also gain data consistency guarantees, thereby reducing our failure modes. For any megaquery that we run to answer a question about versions, ACID database semantics guarantee that it will operate on a single, consistent database snapshot of the whole version universe. And, because we’re storing each bit of version information discretely, there’s little concern about incoming writes causing even temporary inconsistencies - either a version is fully in the known or found version universes, or it’s fully not.

Now, with all these custom querying requirements, this might seem like a good case for a custom table. But we still need to integrate all of this with the entity system: there are the entities for official releases, LTS releases, and for site-specific instances of an extension.

Instead of a custom table, we opted for versions as a custom field type, then attached them to each of these entity types. Defining the custom field provided just the right controls within Drupal’s APIs to express what we needed; ultimately we were able to distill down the essence of the version comparison logic into reusable query components, like this one:

$query
 ->condition(db_or()
   ->condition('field_version_minor', $minor, '>')
   ->condition(db_and()
     ->condition('field_version_minor', $minor, '=')
     ->condition(db_or()
       ->condition('field_version_prerelease_type', $prtype, '>')
       ->condition(db_and()
         ->condition('field_version_prerelease_type', $prtype, '=')
         ->condition(db_or()
           ->condition('field_version_prerelease_num', $prnum, '>')
           ->condition(db_and()
             ->condition('field_version_prerelease_num', $prnum, '=')
             ->condition('field_version_lts_patch_num', $ltsnum, '>')
           ))))));

That query fragment is a complete logical expression of the "newer than" ordering relationship, in the context of our six-part version coordinate system.
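The nesting in that fragment follows a simple recursive rule: a version is newer if the current component is greater, or if it is equal and the remaining components are newer. The sketch below spells that rule out as a standalone function that builds an equivalent SQL expression for any ordered list of columns. It is an illustration only, not Quo's code: `newer_than_sql` is a hypothetical helper that emits a raw SQL string with `?` placeholders instead of using Drupal's query builder.

```php
<?php
/**
 * Build a nested "newer than" SQL expression for an ordered column list.
 *
 * Hypothetical helper: returns [sql, args] with ? placeholders. Columns
 * must be ordered from most to least significant, and must come from
 * trusted code, since they are interpolated into the SQL string.
 */
function newer_than_sql(array $columns, array $values): array {
  if (count($columns) === 0 || count($columns) !== count($values)) {
    throw new InvalidArgumentException('Need matching, non-empty columns and values.');
  }
  $column = array_shift($columns);
  $value = array_shift($values);
  // Base case: the least significant component must be strictly greater.
  if (count($columns) === 0) {
    return ["$column > ?", [$value]];
  }
  // Recursive case: greater here, or equal here and newer further down.
  [$restSql, $restArgs] = newer_than_sql($columns, $values);
  return [
    "($column > ? OR ($column = ? AND $restSql))",
    array_merge([$value, $value], $restArgs),
  ];
}
```

Called with the four column names from the fragment above, in the same order, it yields an expression with the same OR/AND shape, with each value appearing once for the "greater than" test and once for the "equal" test.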


...Which is almost enough. But Drupalistas may note (dubiously) that that query contains hardcoded field names - clearly, this query fragment only works when operating on "field_version". This was the final aspect of our design.

As noted in the above list, there are two version sets in each extension - the set of known release versions, and the set of found versions. The former comprises both official and LTS releases. But for these to behave as a unified whole, it's not enough to use the version field. They have to use the same instance of the version field. Doing so colocates their data in the same table, "field_data_field_version," which the above query caters to specifically.

For most Drupal sites, it's not the best idea to have the correct behavior of your site depend on a field instance being shared by exactly the right two entities. In fact, field instance reuse for any nontrivial field is generally more trouble than it's worth. But that doesn't mean it's never a good idea, as Quo's case demonstrates. The best approach is to intimately understand the requirements of the problem you're working on, and deviate from general best practices only when there's overwhelming reason to do so. 
