NoSQL and CMS – Comparing NoSQL with CPS Requirements

Now that we have both our CPS requirements mapped out, and a reasonable understanding of the NoSQL Universe, we can finally get to the fun stuff: mapping requirements to technologies.

For this kind of exercise I’m quite partial to ye olde comparison matrix – a table with the alternatives listed horizontally (in our case that’s the different classes of NoSQL technology) and with comparison criteria listed vertically (in our case our CPS requirements). You then play a “fill in the blanks” game for all of the cells in the table.

While simple and effective, comparison matrices have a tendency to be rather information-dense – for that reason I’ve elected to use boolean yes / no values in each cell, rather than more granular scoring systems (such as numerical scales). If we were doing a more thorough comparison (say, if we were actually implementing a CPS on top of a NoSQL technology) I would want to use a variety of more granular scoring systems.

Let’s dive in:

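| # | Requirement | Key / Value | BigTable-like | Document | Graph | Relational (for comparison) |
| 1 | Richly structured content types |  |  |  | [1] | [1] |
| 2 | Unstructured binary objects |  |  |  | [2] | [3] |
| 3 | Relationships / references / associations |  |  |  |  |  |
| 4 | Schema evolution [4] |  | [5] |  |  |  |
| 5 | Branch / merge |  |  |  |  |  |
| 6 | Snapshot-based versioning |  |  |  |  |  |
| 7 | ACID transactions [6] |  |  | [7] |  |  |
| 8 | Scalability to large content sets |  |  |  |  | [8] |
| 9 | Geographic distribution |  |  | [9] |  |  |
|   | REQUIREMENTS MET | 4 | 4 | 5 | 5 | 4 |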

One of the nice things about comparison matrices is that they immediately highlight the differentiating criteria for any given comparison, allowing one to focus the discussion on those areas where compromises will be necessary.

Looking at the matrix above, the eye is immediately drawn to rows 1, 2 and 3, which is where the variability between the different classes of NoSQL is found. In fact what this immediately tells us is that there’s no “perfect” class of NoSQL for our CPS requirements – none of them satisfy all three of the following requirements:

  • Richly structured content types
  • Unstructured binary objects
  • Relationships / references / associations

What’s worse are rows 5 and 6, and (to a lesser extent) row 9. These rows indicate that there are entire requirements that no extant NoSQL technology satisfies.

In looking at these 3 requirements in isolation, there is an obvious class of systems that do satisfy them – Source Code Management (SCM) systems, especially Distributed SCMs (Mercurial, Bazaar, git and the like). It is important to stress that DSCMs don’t meet many of the other CPS requirements however (for one thing the “data model” they expose is a filesystem, which is completely unsuited for CPS content modeling), so they’re by no means an appropriate CPS repository in and of themselves.

The conclusion I’ve drawn is that the ideal CPS repository would be a three-way love child between 2 classes of NoSQL solution (document and graph databases) and DSCM. Specifically:

  • Document database style data model (satisfies requirements #1, #2, #4 and #7) plus Graph database style relationships (satisfies requirement #3)
  • SCM style versioning and branch/merge (satisfies requirements #5 and #6)
  • DSCM or CouchDB style multi-master replication and conflict detection (satisfies requirement #9)

I think of this hybrid as a “Document-Relational [10] Version Control System” (DRVCS for the acronym junkies).
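To make the replication piece of this hybrid a little more concrete, here is a minimal Python sketch of multi-master conflict detection. It is an illustration only: the `Replica` class, the `replicate` function and the linear revision histories are assumptions made up for this example, and they are much simpler than what CouchDB or a real DSCM actually does.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """One master in a hypothetical multi-master topology."""
    name: str
    # doc_id -> list of revision ids, oldest first (a linear history)
    histories: dict[str, list[str]] = field(default_factory=dict)

    def update(self, doc_id: str, rev_id: str) -> None:
        self.histories.setdefault(doc_id, []).append(rev_id)


def replicate(source: Replica, target: Replica) -> list[str]:
    """Copy histories from source to target; return doc ids left in conflict."""
    conflicts = []
    for doc_id, theirs in source.histories.items():
        ours = target.histories.get(doc_id, [])
        if theirs[:len(ours)] == ours:
            # Target is behind (or equal): fast-forward to the source history.
            target.histories[doc_id] = list(theirs)
        elif ours[:len(theirs)] == theirs:
            # Target is already ahead: nothing to do.
            continue
        else:
            # Histories diverged: concurrent edits in different places.
            conflicts.append(doc_id)
    return conflicts


london = Replica("london")
sydney = Replica("sydney")
london.update("article-1", "rev-a")
replicate(london, sydney)            # sydney now has rev-a
london.update("article-1", "rev-b")  # edited in London...
sydney.update("article-1", "rev-c")  # ...and concurrently in Sydney
conflicts = replicate(london, sydney)
```

Here `conflicts` ends up containing `"article-1"`, and neither side's edit is silently overwritten. Deciding which revision wins (or how to merge them) is left to higher level logic, which in our case would be the CPS.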

All three kinds of system meet requirement #8 when used in isolation, and a cursory analysis doesn’t reveal any obvious reasons why the DRVCS couldn’t retain this characteristic (although the devil is often in the details when it comes to questions of scale).

The bigger questions here are whether anyone is considering implementing such a repository, and if not whether it’s possible to emulate the missing requirements on top of any of the extant NoSQL technologies.

To the first point, I’m not aware of any efforts aimed at implementing a DRVCS, although I’m by no means familiar with all of the repository development going on out there. I’m also sure I’ll receive comments that such-and-such a technology already meets all of these requirements. I’m skeptical that such a beast actually exists, but would love to be proven wrong – what would convince me is the addition of the technology to the comparison matrix above, with a brief description in each of the new column’s cells describing how the technology satisfies each requirement.

I’ve also spent a bit of time thinking about whether it would be possible to build out the missing features on top of an extant NoSQL technology, and at this stage my gut tells me that the branch / merge and snapshot based versioning requirements would be difficult (and probably impractical) to build on top of any of these technologies. These features are technically complex enough that they deserve first class support in the repository, rather than being implemented as an “emulation” layer on top of something else.

Next Up

Next up I’ll be reviewing the repository-level requirements for Presentation Management CMSes, and we’ll repeat this comparison process to see how NoSQL technologies stack up against that use case. Fingers crossed for a more positive outcome!



1 Standard practice in graph and relational databases is to decompose complex objects into separate vertexes / rows with connecting edges / foreign keys. I don’t particularly like this approach since it confuses true inter-object relationships (“links” between otherwise independent content items) with the internally rich data structures of those content items. With a relational database you can fake it by using 1-M foreign key constraints with CASCADE DELETE, although I see that as a weak substitute for “real” richly structured content types. Both relational and graph databases also impose “reconstitution costs” (joins) when an object is retrieved in its entirety – document databases don’t have this issue (this issue is nicely described in The Future of Databases).
2 While it is possible to store large binary objects in most graph databases, some (notably Neo4J) don’t recommend doing so.
3 While it is possible to store large binary objects in most relational databases (via LOB data types), this is generally frowned upon as relational databases typically aren’t very efficient at handling them and there are no standard ways to perform certain types of I/O within a single binary object (e.g. pulling out particular byte ranges, random I/O, etc.). In fact I was very close to putting a cross in this cell in the matrix – there are very good reasons why most CMSes that use a relational database choose to store binaries outside in a “real” filesystem.
4 Virtually all of the NoSQL classes mentioned here support Schema Evolution but not directly – instead they are “schemaless”, which means they don’t impose or require that the data model is declared to the repository up front. This is in fact an ideal way to support schema evolution, by allowing higher level CMS-specific logic to decide what a “schema” is, and how schema evolution will be supported, without the underlying persistence mechanism getting in the way. Compare this to the equivalent operation in a relational database (while recalling the good old days of multi-day “ALTER TABLE” marathons!) and you’ll quickly appreciate how liberating a schemaless repository can be.
5 Typical BigTable-like systems aren’t fully schemaless, in that some form of up front “data modeling” is required, and once instances of that schema exist the options for modifying those models are limited.
6 Most NoSQL solutions only support ACIDity in predefined ways – they don’t typically allow external logic to programmatically demarcate arbitrary transactional boundaries. For example document databases typically only support ACIDity for a single document at a time, but not for batches of documents (note that the graph database Neo4J is a notable exception to this general trend). In my opinion this is perfectly adequate for most CMS use cases, including the CPS use case.
7 MongoDB’s default behaviour is not what is traditionally thought of as ACID – in particular consistency and durability are relaxed in order to improve performance. That said MongoDB provides mechanisms for the implementer to increase both consistency and durability (at the cost of performance) – these facilities are described in this great guide.
8 Many in the NoSQL movement would tout this as the primary limitation of relational databases, however recall that for the Content Production use case, the scalability to large traffic volumes is not a requirement (an editorial team cannot generate anything approaching an internet scale traffic load), and the data volumes are not necessarily beyond what a relational database can handle either (a typical web site has 10s to 100s of thousands of discrete content items, and when stored in a relational data model that would typically equate to one or two orders of magnitude more rows). Some might argue I’m being somewhat lenient on relational databases in this analysis, but there are reasons they’ve enjoyed such widespread adoption for so long, beyond the momentum / familiarity arguments.
9 CouchDB stands out in this regard, with exceptional support for geographical distribution. It achieves this by offering a replication mechanism that can be used to create multi-master topologies with automatic conflict detection (so that concurrent updates in different places are detected, and can be resolved by higher level logic – in our case that would be done by the hypothetical CPS logic that’s sitting on top of CouchDB). CouchDB’s replication functionality is, in my opinion, a perfect fit for this requirement, and we’ll be hearing more about it in my upcoming post on NoSQL in the context of Document Management CMSes.
10 “Relational” isn’t the most accurate term to describe the “relationship / reference / association” part of this hypothetical system, but I haven’t come up with a better alternative.


NoSQL and CMS – Requirements for Content Production Systems

For each of the CMS problem domains mentioned in the introduction, I’m going to start out by outlining the core requirements that are most relevant to NoSQL.

Typically these requirements will focus on the “repository” underlying the CMS – how content is represented, how it is structured, how it is stored and retrieved, how the repository is scaled to large traffic volumes and so on. This is not to diminish the importance of other requirements (such as the editorial UI/UX), however NoSQL has a lot less direct relevance to those facets of a CMS.

Let’s kick off the series with my favourite CMS use case – the Content Production System (CPS).

Requirements for a CPS

A CPS has a number of requirements that are relevant to NoSQL solutions. These include:

  1. Richly structured content types
  2. Unstructured binary objects
  3. Relationships / references / associations
  4. The ability to evolve content models over time (what I call “schema evolution”)
  5. Branch / merge (in the Source Code Management (SCM) sense of the term)
  6. Snapshot based versioning
  7. ACID transactions
  8. Scalability to large content sets
  9. Geographic distribution

Let’s discuss each of these in more detail:

Richly Structured Content Types

In my experience, types in a WCM content model are generally more complex than those in other content management use cases (e.g. Document Management), with complex nested data structures within types being the norm rather than the exception.

While the specifics vary widely depending on the precise information architecture of the web site, some typical examples might be:

  • News Article:
    • a number of singleton fields such as “title”, “summary”, “author”, “date”, “body” etc.
    • an unbounded set of related image files, each of which has a “thumbnail” and a “high fidelity” rendition. These images may have further fields associated with them (provenance information, for example).
  • Product:
    • a number of singleton fields such as “SKU”, “title”, “description”, etc.
    • a variety of images such as “thumbnail”, “high fidelity”, “left view”, “right view”, etc.
    • nested data structures such as a regional price list – a set of (country code, currency, price) tuples
  • Recipe:
    • a set of ingredients, each of which has:
      • singleton fields such as “name”, “quantity”, “optional / required flag”, “substitutes” etc.
      • a nested data structure containing nutritional information
    • an ordered set of preparation instructions

All of these are based on actual WCM content models I have seen used in live web sites.
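As a rough illustration of what “richly structured” means in practice, here is how the Recipe example might look as a single nested document, sketched in Python. The field names and all of the values are invented for this example and are not taken from any real content model.

```python
import json

# Hypothetical "Recipe" content item, modelled as one self-contained document.
# All values below are placeholders invented for illustration.
recipe = {
    "type": "Recipe",
    "title": "Pancakes",
    "ingredients": [
        {
            "name": "flour",
            "quantity": "200 g",
            "required": True,
            "substitutes": ["buckwheat flour"],
            "nutrition": {"kcal": 700, "protein_g": 20},
        },
        {
            "name": "vanilla",
            "quantity": "1 tsp",
            "required": False,
            "substitutes": [],
            "nutrition": {"kcal": 10, "protein_g": 0},
        },
    ],
    # Order matters for instructions, so a list rather than a set.
    "instructions": ["Mix the dry ingredients.", "Add milk and eggs.", "Fry."],
}

# The whole item round-trips as a single unit, with no joins to reassemble it.
serialised = json.dumps(recipe)
restored = json.loads(serialised)
required_names = [i["name"] for i in restored["ingredients"] if i["required"]]
```

The point of the sketch is that the ingredients, their nutritional information and the ordered instructions all live inside the one content item, rather than being split out into separately managed objects.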

Unstructured Binary Objects

A no-brainer really – any CMS (WCM or otherwise) that is unable to efficiently store binary objects (regardless of MIME type) isn’t worthy of the moniker. Enough said.

Relationships aka References aka Associations

WCM content models are inherently interlinked; after all, that’s what the “hyper” in “hypertext” refers to! Continuing our examples above: a Product may contain references to complementary Products and other content types such as Technical Specifications, White Papers etc.; a News Article might refer to related News Articles, and so on.

In fact the most highly interlinked parts of a WCM content model are often the content types representing the navigational model of the site. Regardless of whether the site uses a traditional single-root hierarchy, a multi-hierarchy (“faceted”) navigation scheme, a tag cloud or some wacky newfangled model dreamt up by a genius information architect, the content type(s) representing the navigational data structures are always highly interlinked with the non-navigational content types (the Products, News Articles, Recipes, etc.) and are often interlinked with themselves. This latter case is particularly true of hierarchically based navigational schemes, which continue to be the dominant navigational paradigm used in content-rich web sites.

While it is possible to “manage” links via the humble hyperlink (and in fact this is the de-facto approach in several CPSes), this is less than ideal for various reasons:

  • it’s difficult to inform an author that they’re about to break links on the site by moving or deleting a content item that is the target of a reference
  • it’s difficult to determine what needs to be deployed in order to ensure that all dependencies are met (i.e. so links won’t be broken, post-deployment)
  • the graph of links, which could provide useful information to authors about the dependencies within their content set, possible navigation paths through the site etc., is not readily available
  • visualisations of the link graph, which (coupled with usage analytics data) can be a powerful tool for authors in revising, distilling and generally maintaining the relevance of the content they’re delivering, cannot easily be produced
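To show why explicitly managed references help with the first two points, here is a small Python sketch of a link graph. The `LinkGraph` class and the item ids are hypothetical; a real CPS would persist this information in its repository rather than in memory.

```python
from collections import defaultdict


class LinkGraph:
    """Tracks references between content items (ids are hypothetical)."""

    def __init__(self) -> None:
        self._outbound: dict[str, set[str]] = defaultdict(set)
        self._inbound: dict[str, set[str]] = defaultdict(set)

    def add_reference(self, source: str, target: str) -> None:
        self._outbound[source].add(target)
        self._inbound[target].add(source)

    def referrers(self, item: str) -> set[str]:
        """Items whose links would break if `item` were moved or deleted."""
        return set(self._inbound.get(item, set()))

    def deployment_closure(self, item: str) -> set[str]:
        """Everything that must be deployed with `item` so no link dangles."""
        seen: set[str] = set()
        stack = [item]
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            stack.extend(self._outbound.get(current, set()))
        return seen


graph = LinkGraph()
graph.add_reference("product-1", "whitepaper-7")
graph.add_reference("product-1", "product-2")
graph.add_reference("product-2", "spec-3")
graph.add_reference("news-9", "product-2")
```

With this in place, `graph.referrers("product-2")` tells an author which items they are about to break, and `graph.deployment_closure("product-1")` lists everything that needs to go out alongside that product.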

Schema Evolution

A general guiding principle that I have followed throughout my technical career has been to avoid (as far as possible) anything that requires what I refer to as “crystal ball gazing” – making decisions now that require prediction of the future and that may be difficult to correct when that prediction turns out to be incorrect (as inevitably happens).

Content models are a classic example of this – in the decade or so that I’ve been working in content management professionally, I don’t recall a single instance where the content model was defined perfectly first time, up front, prior to use.

Unfortunately some of today’s CPSes make it extremely difficult to change the definition of a content type once that type has instances in existence – requiring (for example) a full dump / reload of the entire content set for even the most trivial of changes to the model.

This is the crux of the “schema evolution” requirement – any CMS worth a damn must provide the ability for the content model to evolve over time, regardless of whether content that uses that model exists or not.
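One way a CMS might support this on top of a schemaless store, without a dump / reload, is to stamp each stored item with the version of the model it was written against and upgrade old items lazily when they are read. The Python sketch below assumes exactly that; the `schema_version` field, the upgrade functions and the specific model changes are all hypothetical.

```python
# Each upgrade function takes a document at version N and returns version N + 1.
def _v1_to_v2(doc: dict) -> dict:
    # Hypothetical change: a single "author" string becomes a list of authors.
    upgraded = {k: v for k, v in doc.items() if k != "author"}
    upgraded["authors"] = [doc["author"]] if doc.get("author") else []
    return upgraded


def _v2_to_v3(doc: dict) -> dict:
    # Hypothetical change: a new optional "summary" field with a default.
    return {**doc, "summary": doc.get("summary", "")}


UPGRADES = {1: _v1_to_v2, 2: _v2_to_v3}
CURRENT_VERSION = 3


def load(doc: dict) -> dict:
    """Bring a stored document up to the current model, one step at a time."""
    version = doc.get("schema_version", 1)
    while version < CURRENT_VERSION:
        doc = UPGRADES[version](doc)
        version += 1
    return {**doc, "schema_version": CURRENT_VERSION}


old = {"schema_version": 1, "title": "Hello", "author": "Pat"}
new = load(old)
```

Existing content is never rewritten in bulk: `old` stays as it was stored, and `new` is the same item seen through the current model.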

Branch / Merge

This is the ability for an author (or set of authors) to spin off from the main “branch” of editorial activity, work independently for some period of time and then merge their changes back into the main “branch”.

This is an optional (though common) requirement – some CPSes don’t provide this capability and some editorial teams don’t require it either.

That said, any web site that has a lifecycle that involves both frequent incremental revisions and infrequent major revisions that are prepared in parallel will benefit from this kind of functionality. Anyone who’s ever managed multiple concurrent software releases will grasp the issue (and its solution) immediately.
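For those who haven’t lived through it, the merge step can be sketched as a three-way merge against the common ancestor of the two branches. The Python below is a simplified illustration that treats a content set as a dict of item id to content; it is an assumption about how such a merge could work, not a description of any particular CPS or SCM.

```python
def three_way_merge(base: dict, ours: dict, theirs: dict):
    """Merge two branches of a content set against their common ancestor.

    Each argument maps an item id to its content; a missing key means the
    item does not exist on that branch. Returns (merged, conflicts).
    """
    _missing = object()
    merged, conflicts = {}, []
    for key in sorted(set(base) | set(ours) | set(theirs)):
        b = base.get(key, _missing)
        o = ours.get(key, _missing)
        t = theirs.get(key, _missing)
        if o == t:
            result = o             # both branches agree (or both deleted)
        elif o == b:
            result = t             # only "theirs" changed the item
        elif t == b:
            result = o             # only "ours" changed the item
        else:
            conflicts.append(key)  # changed differently on both branches
            continue
        if result is not _missing:
            merged[key] = result
    return merged, conflicts


base = {"home": "v1", "about": "v1", "promo": "v1"}
main = {"home": "v2", "about": "v1", "promo": "v1"}        # incremental fix
redesign = {"home": "v1", "about": "v3", "launch": "v1"}   # major revision, promo deleted
merged, conflicts = three_way_merge(base, main, redesign)
```

In this example the incremental fix to `home` and the redesign’s changes (including the deletion of `promo`) combine cleanly. An item changed differently on both branches would instead be reported in `conflicts` for an author to resolve.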

Snapshot Based Versioning

By “snapshot based versioning” I mean a versioning system that captures the full state of the content set at a given point in time, and can resurrect that state at any point in the future, regardless of what operations are executed by authors in the meantime (including deletes, renames and moves of assets).

Anyone who suffered through RCS / CVS in the good old days and is now using a sane SCM (Subversion or Mercurial, for example) will know exactly what I’m referring to here!

Surprisingly, some CPSes continue to use RCS style per-asset versioning, which means they are unable to resurrect deleted assets – a serious problem if your web site happens to fall under one of the regulatory regimes (e.g. HIPAA, or SEC and FTC rules) that require that the complete state of a site be “resurrectable” for quite significant periods of time (often 7 years).
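The difference from per-asset versioning is easiest to see in a toy example. The Python sketch below assumes a hypothetical `SnapshotRepository` in which every commit records the entire content set, so deletes and moves are captured along with edits. A real implementation would share unchanged content between snapshots rather than copying everything.

```python
class SnapshotRepository:
    """Toy repository where every commit captures the whole content set."""

    def __init__(self) -> None:
        self.working: dict[str, str] = {}          # path -> content
        self._snapshots: list[dict[str, str]] = []

    def commit(self) -> int:
        """Record the full current state; returns the snapshot number."""
        self._snapshots.append(dict(self.working))
        return len(self._snapshots) - 1

    def checkout(self, snapshot: int) -> dict[str, str]:
        """Return the complete content set as it was at `snapshot`."""
        return dict(self._snapshots[snapshot])


repo = SnapshotRepository()
repo.working["/news/launch.html"] = "We launched!"
repo.working["/about.html"] = "About us"
first = repo.commit()

del repo.working["/about.html"]                                             # delete
repo.working["/press/launch.html"] = repo.working.pop("/news/launch.html")  # move
second = repo.commit()
```

Calling `repo.checkout(first)` brings back both the deleted page and the article at its original path, which is exactly what a per-asset scheme struggles to do.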

ACID Transactions

Basically this boils down to the guarantee that modifications to the content set can be durably persisted to the CPS, either succeed or fail in their entirety and can be read back out in the case of success. To many this will seem a no-brainer, but when we move on to our review of NoSQL solutions we’ll find that some of them don’t necessarily provide this guarantee.

Note: while advantageous in some situations, I consider externally defined transactional boundaries (i.e. the ability to “batch up” numerous otherwise unrelated content modifications into an arbitrary ACID transaction) to be a “nice to have” requirement, rather than a hard requirement. Again we’ll see the impact of this when we review NoSQL technologies.
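To illustrate what “batching up” modifications means, here is a minimal Python sketch of the all-or-nothing part of the guarantee. The `ContentStore` class is a hypothetical in-memory stand-in, so it demonstrates atomicity only and says nothing about durability or isolation.

```python
import copy


class ContentStore:
    """In-memory stand-in for a repository; not durable, illustration only."""

    def __init__(self) -> None:
        self.items: dict[str, dict] = {}

    def apply_batch(self, operations) -> None:
        """Apply every operation or none of them.

        `operations` is a list of callables that mutate a dict of items.
        They run against a private copy, which replaces the live data only
        if all of them succeed.
        """
        staged = copy.deepcopy(self.items)
        for operation in operations:
            operation(staged)          # any exception aborts the whole batch
        self.items = staged


def failing(items):
    raise ValueError("validation failed")


store = ContentStore()
store.apply_batch([lambda items: items.update({"a": {"title": "A"}})])

try:
    store.apply_batch([
        lambda items: items.update({"b": {"title": "B"}}),
        failing,
    ])
except ValueError:
    pass  # the batch failed, so item "b" must not be visible
```

After the failed batch the store still contains only item `a`; the partial write of `b` never becomes visible.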

Scalability to Large Content Sets

Interestingly, scalability in the presence of large amounts of traffic is the area where NoSQL technologies garner the most attention, yet it is one of the least important requirements for a CPS. This is because even large (several hundred person) editorial teams are unable to generate the kind of traffic load that even a moderately successful web site can receive.

However what does matter is that the CPS can scale in the presence of large amounts of data – typical content-heavy web sites these days contain tens to hundreds of thousands of discrete content items, many of which will contain several media assets (images, video, Flash applets, etc.) that themselves may be heavyweight (MB to GB in size).

Geographic Distribution

Basically this requirement is for those organisations that have geographically distributed editorial teams who wish to ensure good performance of the editorial tool, no matter where the editors are physically located.

Although in essence a “nice to have” requirement, I threw it in here because I’m hearing it increasingly often and some NoSQL solutions cater to it quite nicely.

Next Up…

Next up I’ll give a quick overview of some (but not all!) of the more relevant NoSQL technologies currently on the market, and we’ll compare them against the requirements we’ve defined here to see to what degree they are relevant to the CPS use case.


Published on 2010-06-25 at 12:56 pm

Taxonomania!

While deconstruction is serious business for the curmudgeon, I occasionally like to take a break from the rigours of sowing chaos and discord by presenting some more constructive observations.

In this post I’d like to capture the mental picture I have of how Content Management fits together, neatly putting all of the pieces of the CM puzzle (DM, WCM, RM, AA, etc.) in their rightful place. As a bonus we will also learn how and why various products (including our good friend WordPress) fit into the Content Management menagerie.

I consider AA to be part of CM, as beer consumption appears to be an increasingly important part of the Content Professional’s technical proficiency.

A Hierarchy of CM Problem Domains

In my previous post I introduced the “Reversi Rule” and noted that for CM we came up with the rather broad definition of “the management of content”. To me this generality is a large part of the appeal of the term (particularly when compared to ECM, which is just downright confusing) – it generously includes the diverse array of human endeavours that could conceivably be classified as “Content Management”, it doesn’t say anything about what those specific problem domains look like (beyond requiring that they involve the “management of content”, for some reasonable definition of “management” and “content”) and it doesn’t exclude any of the broad range of actors who face these problems (including, but not limited to, enterprises).

So what specific value, then, does such a broad definition for Content Management provide us?

Perhaps I’m betraying my technologist background, but to me Content Management clearly forms the root of a hierarchy of increasingly specialised problem domains – in graphical format, this hierarchy might start to look something like this:

[Figure: cm_hierarchy.png – hierarchy of Content Management problem domains]

Note: this diagram does not attempt to capture all possible CM problem domains, although doing so would be an illuminating exercise.

This diagram clearly illustrates several important points:

  1. A vast array of activities can be referred to as “Content Management”.
  2. Many of these use cases have unique and highly specialised requirements, particularly as we get closer to the tips of the tree.
  3. Some of the management activities we think of as being common across the hierarchy actually have quite different semantics depending on the specific problem domain (versioning requirements are very different between Docroot Revision Control and Records Management, for example).
  4. File / folder-centric definitions of content are only part of the content management picture.

A graphical treatment also helps to highlight part of the reason why we’re all having so much trouble agreeing on what “Content Management” really is – we all tend to operate down at different tips of the tree, yet throw around our specific problem domain as The One True Form of Content Management™!

I think this gets to the root of Pie’s earlier loss of composure, yet he is arguably guilty of the same sin, albeit while standing on a different soap box.

What About the Technology?

Typically software products are a trailing indicator of business problems, so it’s no surprise to find that there are systems for almost all of the use cases identified on the diagram. In fact adding the word “System” or “Software” to most of the labels on the diagram will result in an extant product classification. There are a few exceptions (“Docroot Revision Control System” and “Structured Content Production System”, for example), however there are products on the market today that are admirably described by these two terms.

The Bonus Round

Going back to our (by now somewhat fatigued) example of WordPress, it clearly falls into the node labeled “Blogs”, and by adding “System” to the label we get “Blog System”. Sounds fair – I doubt anyone would dispute that WordPress is indeed a Blog System.

Now by looking at the diagram we can see that a Blog System is a specialised form of Presentation Management System, which itself is a specialised form of Web Content Management System, which is finally a specialised form of Content Management System. I can hear some incredulous voices: “are you asserting that WordPress is all of these things?”. Absolutely!

Let’s pick some more examples, to see if we can break this model:

  • Alfresco RM – clearly a Records Management System therefore also a Document Management System, therefore also a Content Management System.
  • Virage MediaBin – this is an easy one: the web site explicitly touts it as Digital Asset Management, so only one step and we arrive at Content Management System. NEXT!
  • Ektron eWebEditPro (here’s a potentially contentious one!) – again the web site tells us it’s HTML Editing Software, therefore a Web Content Management System and a Content Management System.

Interesting eh? All these vastly different systems (we’ve just picked 4 that are completely different from one another), yet all of them provide specialised facilities for the management of content central to various different problem domains. They’re all Content Management Systems!
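The classification game played above is just a walk from a node up to the root of the hierarchy. As a small Python sketch, using only the parent / child relationships mentioned in the examples (the `PARENT` table is a partial, hand-written assumption and not the full diagram):

```python
# Parent links for a few nodes mentioned in the text; the diagram has more.
PARENT = {
    "Blog": "Presentation Management",
    "Presentation Management": "Web Content Management",
    "HTML Editing": "Web Content Management",
    "Web Content Management": "Content Management",
    "Records Management": "Document Management",
    "Document Management": "Content Management",
    "Digital Asset Management": "Content Management",
}


def classifications(category: str) -> list[str]:
    """Walk from a category up to the root, naming each kind of system."""
    chain = [category]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return [f"{name} System" for name in chain]


wordpress = classifications("Blog")
alfresco_rm = classifications("Records Management")
```

Every chain, whatever its starting point, ends in “Content Management System”, which is the whole point of the exercise.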

To paraphrase Drew Carey, next time you’re at a social event without companionship or sustenance, I’d encourage you to play “pin the CMS tail on the product donkey” (allowing yourself the ability to extend the hierarchy above with categories that I left out) – I think you’ll mostly find it a trivial exercise.

In Conclusion

At this point you might still be asking yourself what all this means and whether there is any real value in such a broad definition for Content Management.

My answer to that would be that an inclusive definition such as this one comes closest to the true meanings of the words “Content” and “Management”, without requiring us to open the can of worms that would be involved in trying to define these two words in detail (which is impossible anyway, since their precise definitions depend on the specific problem domain).

More importantly, by not requiring us to come to some global agreement about what “content” and “management” mean, this definition can help us move beyond the historical divides within the profession (notably the divide between the Web Content Management and Document Management camps), by giving us common terminology that is compatible with how these terms are used today by all camps, while also being sufficiently well defined that everyone knows what’s implied (and just as importantly, not implied) when someone makes an assertion such as “Microsoft Word is a Content Management System”.

Published on 2010-05-07 at 5:58 pm