For each of the CMS problem domains mentioned in the introduction, I’m going to start out by outlining the core requirements that are most relevant to NoSQL.
Typically these requirements will focus on the “repository” underlying the CMS – how content is represented, how it is structured, how it is stored and retrieved, how the repository is scaled to large traffic volumes and so on. This is not to diminish the importance of other requirements (such as the editorial UI/UX); however, NoSQL has a lot less direct relevance to those facets of a CMS.
Let’s kick off the series with my favourite CMS use case – the Content Production System (CPS).
Requirements for a CPS
A CPS has a number of requirements that are relevant to NoSQL solutions. These include:
- Richly structured content types
- Unstructured binary objects
- Relationships / references / associations
- The ability to evolve content models over time (what I call “schema evolution”)
- Branch / merge (in the Source Code Management (SCM) sense of the term)
- Snapshot based versioning
- ACID transactions
- Scalability to large content sets
- Geographic distribution
Let’s discuss each of these in more detail:
Richly Structured Content Types
In my experience, types in a WCM content model are generally more complex than those in other content management use cases (e.g. Document Management), with complex nested data structures within types being the norm rather than the exception.
While the specifics vary widely depending on the precise information architecture of the web site, some typical examples might be†:
- News Article:
  - a number of singleton fields such as “title”, “summary”, “author”, “date”, “body” etc.
  - an unbounded set of related image files, each of which has a “thumbnail” and a “high fidelity” rendition. These images may have further fields associated with them (provenance information, for example).
- Product:
  - a number of singleton fields such as “SKU”, “title”, “description”, etc.
  - a variety of images such as “thumbnail”, “high fidelity”, “left view”, “right view”, etc.
  - nested data structures such as a regional price list – a set of (country code, currency, price) tuples
- Recipe:
  - a set of ingredients, each of which has:
    - singleton fields such as “name”, “quantity”, “optional / required flag”, “substitutes” etc.
    - a nested data structure containing nutritional information
  - an ordered set of preparation instructions
† All of these are based on actual WCM content models I have seen used in live web sites.
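To make the Recipe shape a little more concrete, here is a minimal sketch of how such a type might look as nested data classes in Python. The class and field names (Recipe, Ingredient, Nutrition and so on) are my own illustrative assumptions, not taken from any particular CMS or from the sites mentioned above.

```python
from dataclasses import dataclass, field, asdict
from typing import List

@dataclass
class Nutrition:
    # Hypothetical nested structure; a real model would carry more fields.
    calories_kcal: float
    fat_g: float

@dataclass
class Ingredient:
    name: str
    quantity: str
    required: bool
    nutrition: Nutrition
    substitutes: List[str] = field(default_factory=list)

@dataclass
class Recipe:
    title: str
    ingredients: List[Ingredient]   # a set of nested structures
    instructions: List[str]         # ordered: position in the list matters

recipe = Recipe(
    title="Pancakes",
    ingredients=[
        Ingredient("flour", "200 g", True, Nutrition(728.0, 2.0)),
        Ingredient("butter", "20 g", False, Nutrition(143.0, 16.0), ["oil"]),
    ],
    instructions=["Mix the batter.", "Fry until golden."],
)

# asdict() converts the whole tree into plain dicts and lists, which is
# roughly the nested document shape a document-oriented store would persist.
document = asdict(recipe)
```

The point of the sketch is simply that the natural representation is a tree, not a flat row: ingredients nest inside the recipe, and nutritional information nests inside each ingredient.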
Unstructured Binary Objects
A no-brainer really – any CMS (WCM or otherwise) that is unable to efficiently store binary objects (regardless of MIME type) isn’t worthy of the moniker. Enough said.
Relationships aka References aka Associations
WCM content models are inherently interlinked; after all, that’s what the “hyper” in “hypertext” refers to! Continuing our examples above: a Product may contain references to complementary Products and other content types such as Technical Specifications, White Papers etc.; a News Article might refer to related News Articles, and so on.
In fact, the most highly interlinked parts of a WCM content model are often the content types representing the navigational model of the site. Regardless of whether the site uses a traditional single-root hierarchy, a multi-hierarchy (“faceted”) navigation scheme, a tag cloud or some wacky newfangled model dreamt up by a genius information architect, the content type(s) representing the navigational data structures are always highly interlinked with the non-navigational content types (the Products, News Articles, Recipes, etc.) and are often interlinked with themselves. This latter case is particularly true of hierarchically based navigational schemes, which continue to be the dominant navigational paradigm used in content-rich web sites.
The CPS therefore needs to track these references explicitly, for several reasons:
- without tracked references, it’s difficult to inform an author that they’re about to break links on the site by moving or deleting a content item that is the target of a reference
- without tracked references, it’s difficult to determine what needs to be deployed in order to ensure that all dependencies are met (i.e. so links won’t be broken, post-deployment)
- the graph of links provides useful information to authors about the dependencies within their content set, possible navigation paths through the site etc.
- coupled with usage analytics data, visualisations of the link graph can be a powerful tool for authors in revising, distilling and generally maintaining the relevance of the content they’re delivering
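As a rough illustration of the first two points, here is a small Python sketch that treats references as a directed graph. The content ids and the function names (inbound_links, deployment_closure) are hypothetical, invented purely for this example; a real CPS would answer these questions from its repository rather than from an in-memory dict.

```python
from collections import defaultdict

# Hypothetical content set: item id -> ids of the items it references.
references = {
    "nav/home": ["product/kettle", "news/launch"],
    "news/launch": ["product/kettle"],
    "product/kettle": ["spec/kettle"],
    "spec/kettle": [],
}

def inbound_links(references):
    """Invert the graph: item id -> set of items that point at it."""
    inbound = defaultdict(set)
    for source, targets in references.items():
        for target in targets:
            inbound[target].add(source)
    return inbound

def deployment_closure(references, item):
    """Everything that must be deployed along with `item` (itself included)."""
    seen, stack = set(), [item]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(references.get(current, []))
    return seen

# Items whose links would break if product/kettle were moved or deleted.
would_break = sorted(inbound_links(references)["product/kettle"])
# Items that must ship together with news/launch to avoid broken links.
must_deploy = sorted(deployment_closure(references, "news/launch"))
```

Both questions are cheap to answer once the references are stored as data in their own right; they are very hard to answer if links only exist as strings embedded in content bodies.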
Schema Evolution
A general guiding principle that I have followed throughout my technical career has been to avoid (as far as possible) anything that requires what I refer to as “crystal ball gazing” – making decisions now that require prediction of the future and that may be difficult to correct when that prediction turns out to be incorrect (as inevitably happens).
Content models are a classic example of this – in the decade or so that I’ve been working in content management professionally, I don’t recall a single instance where the content model was defined perfectly first time, up front, prior to use.
Unfortunately some of today’s CPSes make it extremely difficult to change the definition of a content type once that type has instances in existence – requiring (for example) a full dump / reload of the entire content set for even the most trivial of changes to the model.
This is the crux of the “schema evolution” requirement – any CMS worth a damn must provide the ability for the content model to evolve over time, regardless of whether content that uses that model exists or not.
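One way (among several) to get this property is to stamp every stored item with the version of the model it was written against and upgrade it lazily when it is read, so no dump / reload is ever needed. The following Python sketch shows the idea; the “schema_version” field, the migration functions and the model changes themselves are all assumptions made up for illustration, not a description of how any particular CPS does it.

```python
# Hypothetical upgrade steps, keyed by the schema version they upgrade FROM.
def v1_to_v2(doc):
    # Model v2 replaced the single "author" string with a list of "authors".
    doc = dict(doc)
    doc["authors"] = [doc.pop("author")]
    return doc

def v2_to_v3(doc):
    # Model v3 added an optional "summary" field with an empty default.
    doc = dict(doc)
    doc.setdefault("summary", "")
    return doc

MIGRATIONS = {1: v1_to_v2, 2: v2_to_v3}
CURRENT_VERSION = 3

def upgrade(doc):
    """Bring a stored document up to the current model, one step at a time."""
    version = doc.get("schema_version", 1)
    while version < CURRENT_VERSION:
        doc = MIGRATIONS[version](doc)
        version += 1
    return {**doc, "schema_version": version}

# An item written long ago, against the first version of the model.
old = {"schema_version": 1, "title": "Launch day", "author": "pm"}
new = upgrade(old)
```

Existing content stays untouched on disk until it is next read (or written), which is what lets the model change while instances of the old shape are still in existence.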
Branch / Merge
This is the ability for an author (or set of authors) to spin off from the main “branch” of editorial activity, work independently for some period of time and then merge their changes back into the main “branch”.
This is an optional (though common) requirement – some CPSes don’t provide this capability and some editorial teams don’t require it either.
That said, any web site that has a lifecycle that involves both frequent incremental revisions and infrequent major revisions that are prepared in parallel will benefit from this kind of functionality. Anyone who’s ever managed multiple concurrent software releases will grasp the issue (and its solution) immediately.
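For those who haven’t lived through it in an SCM, the core of the merge step can be sketched as a three-way merge over content sets: compare each side against their common ancestor and only flag a conflict when both sides changed the same item differently. This is a deliberately simplified Python sketch under my own assumptions (whole-item granularity, content sets as plain dicts, the function name three_way_merge); real tools also merge within an item.

```python
def three_way_merge(base, main, branch):
    """Merge two content sets (item id -> content) that share an ancestor.

    Returns (merged, conflicts). A missing key means the item does not
    exist on that side, so deletions are merged like any other change.
    """
    merged, conflicts = {}, []
    for key in sorted(set(base) | set(main) | set(branch)):
        b, m, r = base.get(key), main.get(key), branch.get(key)
        if m == r:        # both sides agree (or neither touched the item)
            result = m
        elif m == b:      # only the branch changed it
            result = r
        elif r == b:      # only main changed it
            result = m
        else:             # both changed it, differently
            conflicts.append(key)
            result = m    # keep main's copy pending manual resolution
        if result is not None:
            merged[key] = result
    return merged, conflicts

base = {"home": "v1", "about": "v1", "promo": "v1"}
main = {"home": "v2", "about": "v1", "promo": "v1-fix"}      # incremental edits
branch = {"home": "v1", "about": "redesign", "promo": "v2"}  # major revision
merged, conflicts = three_way_merge(base, main, branch)
```

Here the incremental fix to “home” and the redesign of “about” merge cleanly, while “promo” (edited on both sides) is reported for an author to resolve by hand.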
Snapshot Based Versioning
By “snapshot based versioning” I mean a versioning system that captures the full state of the content set at a given point in time, and can resurrect that state at any point in the future, regardless of what operations are executed by authors in the meantime (including deletes, renames and moves of assets).
Anyone who suffered through RCS / CVS in the good old days and is now using a sane SCM (Subversion or Mercurial, for example) will know exactly what I’m referring to here!
Surprisingly, some CPSes continue to use RCS style per-asset versioning, which means they are unable to resurrect deleted assets – a serious problem if your web site happens to fall under one of the regulations (e.g. HIPAA, SEC, FTC, etc.) that require that the complete state of a site be “resurrectable” for quite significant periods of time (often 7 years).
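The difference from per-asset versioning is easiest to see in a toy model. The Python sketch below versions the content set as a whole, so a deleted asset is still present in every snapshot taken before the delete. The SnapshotStore class and its methods are hypothetical, and copying the full content set on every commit is a simplification; I would expect a real repository to share unchanged content between snapshots.

```python
class SnapshotStore:
    """Toy repository that versions the whole content set, not single assets."""

    def __init__(self):
        self._working = {}     # path -> content (the live content set)
        self._snapshots = []   # one full copy of the content set per commit

    def put(self, path, content):
        self._working[path] = content

    def delete(self, path):
        del self._working[path]

    def commit(self):
        """Record the full current state and return its snapshot id."""
        self._snapshots.append(dict(self._working))
        return len(self._snapshots) - 1

    def checkout(self, snapshot_id):
        """Return the content set exactly as it stood at `snapshot_id`."""
        return dict(self._snapshots[snapshot_id])

store = SnapshotStore()
store.put("/press/results.html", "<h1>Results</h1>")
first = store.commit()
store.delete("/press/results.html")   # the asset is gone from the live set...
second = store.commit()
restored = store.checkout(first)      # ...but the earlier snapshot still has it
```

With per-asset versioning there is no equivalent of `checkout(first)`: once the asset and its history are deleted, there is nothing left to resurrect.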
ACID Transactions
Basically this boils down to the guarantee that modifications to the content set can be durably persisted to the CPS, either succeed or fail in their entirety and can be read back out in the case of success. To many this will seem a no-brainer, but when we move on to our review of NoSQL solutions we’ll find that some of them don’t necessarily provide this guarantee.
Note: while advantageous in some situations, I consider externally defined transactional boundaries (i.e. the ability to “batch up” numerous otherwise unrelated content modifications into an arbitrary ACID transaction) to be a “nice to have” requirement, rather than a hard requirement. Again we’ll see the impact of this when we review NoSQL technologies.
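To show what “succeed or fail in their entirety” looks like in practice, here is a short Python sketch using the standard library’s sqlite3 module. SQLite is used purely as a convenient stand-in that happens to provide the guarantee; it is not one of the technologies under review, and the table layout and the save_all helper are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE content (path TEXT PRIMARY KEY, body TEXT NOT NULL)")

def save_all(conn, items):
    """Persist every (path, body) pair, or none of them."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.executemany("INSERT INTO content VALUES (?, ?)", items)
        return True
    except sqlite3.IntegrityError:
        return False

ok = save_all(conn, [("/a", "first"), ("/b", "second")])
# The second batch violates the NOT NULL constraint part-way through, so
# its first row ("/c") must not survive either.
failed = save_all(conn, [("/c", "third"), ("/d", None)])
paths = [row[0] for row in conn.execute("SELECT path FROM content ORDER BY path")]
```

After the failed batch the store contains only the two items from the successful one, which is exactly the behaviour an author expects when a save goes wrong half way through.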
Scalability to Large Data Sets
Interestingly, scalability in the presence of large amounts of traffic is the area where NoSQL technologies garner the most attention, yet it is one of the least important requirements for a CPS. This is because even large (several hundred person) editorial teams are unable to generate the kind of traffic load that even a moderately successful web site can receive.
However, what does matter is that the CPS can scale in the presence of large amounts of data – typical content-heavy web sites these days contain tens to hundreds of thousands of discrete content items, many of which will contain several media assets (images, video, Flash applets, etc.) that themselves may be heavyweight (MB to GB in size).
Geographic Distribution
Basically this requirement is for those organisations that have geographically distributed editorial teams who wish to ensure good performance of the editorial tool, no matter where the editors are physically located.
Although in essence a “nice to have” requirement, I threw it in here because I’m hearing it increasingly often and some NoSQL solutions cater to it quite nicely.
Next up I’ll give a quick overview of some (but not all!) of the more relevant NoSQL technologies currently on the market, and we’ll compare them against the requirements we’ve defined here to see to what degree they are relevant to the CPS use case.