NoSQL and CMS – Comparing NoSQL with CPS Requirements

Now that we have both our CPS requirements mapped out, and a reasonable understanding of the NoSQL Universe, we can finally get to the fun stuff: mapping requirements to technologies.

For this kind of exercise I’m quite partial to ye olde comparison matrix – a table with the alternatives listed horizontally (in our case that’s the different classes of NoSQL technology) and with comparison criteria listed vertically (in our case our CPS requirements). You then play a “fill in the blanks” game for all of the cells in the table.

While simple and effective, comparison matrices have a tendency to be rather information-dense – for that reason I’ve elected to use boolean yes / no values in each cell, rather than more granular scoring systems (such as numerical scales). If we were doing a more thorough comparison (say, if we were actually implementing a CPS on top of a NoSQL technology) I would want to use a variety of more granular scoring systems.

Let’s dive in:

Requirement                                  | Key / Value | BigTable-like | Document | Graph | Relational (for comparison)
1  Richly structured content types           | No          | Yes           | Yes      | No ¹  | No ¹
2  Unstructured binary objects               | Yes         | No            | Yes      | Yes ² | Yes ³
3  Relationships / references / associations | No          | No            | No       | Yes   | Yes
4  Schema evolution ⁴                        | Yes         | Yes ⁵         | Yes      | Yes   | No
5  Branch / merge                            | No          | No            | No       | No    | No
6  Snapshot-based versioning                 | No          | No            | No       | No    | No
7  ACID transactions ⁶                       | Yes         | Yes           | Yes ⁷    | Yes   | Yes
8  Scalability to large content sets         | Yes         | Yes           | Yes      | Yes   | Yes ⁸
9  Geographic distribution                   | No          | No            | No ⁹     | No    | No
   REQUIREMENTS MET                          | 4           | 4             | 5        | 5     | 4

One of the nice things about comparison matrices is that they immediately highlight the differentiating criteria for any given comparison, allowing one to focus the discussion on those areas where compromises will be necessary.

Looking at the matrix above, the eye is immediately drawn to rows 1, 2 and 3, which is where the variability between the different classes of NoSQL is found. In fact what this immediately tells us is that there’s no “perfect” class of NoSQL for our CPS requirements – none of them satisfy all three of the following requirements:

  • Richly structured content types
  • Unstructured binary objects
  • Relationships / references / associations

What’s worse are rows 5 and 6, and (to a lesser extent) row 9. These rows indicate that there are entire requirements that no extant NoSQL technology satisfies.

Looking at these three requirements in isolation (branch / merge, snapshot-based versioning and geographic distribution), there is an obvious class of systems that does satisfy them – Source Code Management (SCM) systems, especially Distributed SCMs (Mercurial, Bazaar, git and the like). It is important to stress, however, that DSCMs don’t meet many of the other CPS requirements (for one thing, the “data model” they expose is a filesystem, which is completely unsuited to CPS content modeling), so they’re by no means an appropriate CPS repository in and of themselves.

The conclusion I’ve drawn is that the ideal CPS repository would be a three-way love child of two classes of NoSQL solution (document and graph databases) and a DSCM. Specifically:

  • Document database style data model (satisfies requirements #1, #2, #4 and #7) plus Graph database style relationships (satisfies requirement #3)
  • SCM style versioning and branch/merge (satisfies requirements #5 and #6)
  • DSCM or CouchDB style multi-master replication and conflict detection (satisfies requirement #9)

I think of this hybrid as a “Document-Relational¹⁰ Version Control System” (DRVCS, for the acronym junkies).
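
To make the idea a little more concrete, here is a purely hypothetical sketch (in Python, with invented field names) of what a single content item revision in such a DRVCS might carry: a document-style body, graph-style references and SCM-style version metadata.

    # Hypothetical shape of one DRVCS content item revision (all names invented for illustration).
    drvcs_item_revision = {
        "id": "article-123",
        "revision": "a41f9c",                # immutable revision identifier, SCM-style
        "branch": "summer-redesign",         # the branch this revision lives on
        "parents": ["9be0d2"],               # previous revision(s); two parents after a merge
        "body": {                            # document-database-style rich internal structure
            "title": "Widget launch",
            "images": [{"thumbnail": "img-1-t", "high_fidelity": "img-1-h"}],
        },
        "references": [                      # graph-database-style typed relationships
            {"rel": "related_product", "target": "product-456"},
        ],
    }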

All three kinds of system meet requirement #8 when used in isolation, and a cursory analysis doesn’t reveal any obvious reasons why the DRVCS couldn’t retain this characteristic (although the devil is often in the details when it comes to questions of scale).

The bigger questions here are whether anyone is considering implementing such a repository and, if not, whether it’s possible to emulate the missing requirements on top of any of the extant NoSQL technologies.

To the first point, I’m not aware of any efforts aimed at implementing a DRVCS, although I’m by no means familiar with all of the repository development going on out there. I’m also sure I’ll receive comments that such-and-such a technology already meets all of these requirements. I’m skeptical that such a beast actually exists, but would love to be proven wrong – what would convince me is the addition of that technology to the comparison matrix above, with a brief note in each of the new column’s cells describing how it satisfies each requirement.

I’ve also spent a bit of time thinking about whether it would be possible to build out the missing features on top of an extant NoSQL technology, and at this stage my gut tells me that the branch / merge and snapshot-based versioning requirements would be difficult (and probably impractical) to build on top of any of these technologies. These features are technically complex enough that they deserve first-class support in the repository, rather than being implemented as an “emulation” layer on top of something else.

Next Up

Next up I’ll be reviewing the repository-level requirements for Presentation Management CMSes, and we’ll repeat this comparison process to see how NoSQL technologies stack up against that use case. Fingers crossed for a more positive outcome!



1 Standard practice in graph and relational databases is to decompose complex objects into separate vertexes / rows with connecting edges / foreign keys. I don’t particularly like this approach since it confuses true inter-object relationships (“links” between otherwise independent content items) with the internally rich data structures of those content items. With a relational database you can fake it by using 1-M foreign key constraints with CASCADE DELETE, although I see that as a weak substitute for “real” richly structured content types. Both relational and graph databases also impose “reconstitution costs” (joins) when an object is retrieved in its entirety – document databases don’t have this issue (which is nicely described in The Future of Databases).
2 While it is possible to store large binary objects in most graph databases, some (notably Neo4J) don’t recommend doing so.
3 While it is possible to store large binary objects in most relational databases (via LOB data types), this is generally frowned upon as relational databases typically aren’t very efficient at handling them and there are no standard ways to perform certain types of I/O within a single binary object (e.g. pulling out particular byte ranges, random I/O, etc.). In fact I was very close to putting a cross in this cell in the matrix – there are very good reasons why most CMSes that use a relational database choose to store binaries outside in a “real” filesystem.
4 Virtually all of the NoSQL classes mentioned here support schema evolution, though not directly – instead they are “schemaless”, meaning they neither impose nor require an up-front declaration of the data model to the repository. This is in fact an ideal way to support schema evolution: it allows higher level, CMS-specific logic to decide what a “schema” is and how schema evolution will be supported, without the underlying persistence mechanism getting in the way. Compare this to the equivalent operation in a relational database (recall the good old days of multi-day “ALTER TABLE” marathons!) and you’ll quickly appreciate how liberating a schemaless repository can be.
5 Typical BigTable-like systems aren’t fully schemaless, in that some form of up front “data modeling” is required, and once instances of that schema exist the options for modifying those models are limited.
6 Most NoSQL solutions only support ACIDity in predefined ways – they don’t typically allow external logic to programmatically demarcate arbitrary transactional boundaries. For example, document databases typically only support ACIDity for a single document at a time, but not for batches of documents (the graph database Neo4J is a notable exception to this general trend). In my opinion this is perfectly adequate for most CMS use cases, including the CPS use case.
7 MongoDB’s default behaviour is not what is traditionally thought of as ACID – in particular, consistency and durability are relaxed in order to improve performance. That said, MongoDB provides mechanisms for the implementer to increase both consistency and durability (at the cost of performance) – these facilities are described in this great guide.
8 Many in the NoSQL movement would tout this as the primary limitation of relational databases; however, recall that for the Content Production use case scalability to large traffic volumes is not a requirement (an editorial team cannot generate anything approaching an internet-scale traffic load), and the data volumes are not necessarily beyond what a relational database can handle either (a typical web site has tens to hundreds of thousands of discrete content items, which, when stored in a relational data model, would typically equate to one or two orders of magnitude more rows). Some might argue I’m being somewhat lenient on relational databases in this analysis, but there are reasons they’ve enjoyed such widespread adoption for so long, beyond the momentum / familiarity arguments.
9 CouchDB stands out in this regard, with exceptional support for geographical distribution. It achieves this by offering a replication mechanism that can be used to create multi-master topologies with automatic conflict detection (so that concurrent updates in different places are detected, and can be resolved by higher level logic – in our case that would be done by the hypothetical CPS logic that’s sitting on top of CouchDB). CouchDB’s replication functionality is, in my opinion, a perfect fit for this requirement, and we’ll be hearing more about it in my upcoming post on NoSQL in the context of Document Management CMSes.
10 “Relational” isn’t the most accurate term to describe the “relationship / reference / association” part of this hypothetical system, but I haven’t come up with a better alternative.


NoSQL and CMS – A Brief Overview of the “nosql” Universe

Before we compare our list of CPS requirements against the universe of NoSQL, it helps to have an understanding of what NoSQL actually is. As it turns out, NoSQL doesn’t refer to a single technology but rather a grab bag of otherwise unrelated technologies that are only loosely “related” by virtue of not being based on relational (specifically SQL) technologies.

And you thought the NoSQL moniker was simply a clever marketing ploy to incite rage amongst relational aficionados! :-)

While Wikipedia lists no fewer than 7 major types of NoSQL solution, the “Big 4” that are discussed most frequently are:

  1. Key / Value
  2. BigTable-like
  3. Document
  4. Graph

Let’s look at each of them in turn.

Key / Value

These solutions are typically little more than sophisticated distributed hash tables, often adding direct support for some combination of persistence (durability of data across restarts), replication (physical duplication of data, typically across multiple servers) and sharding (partitioning of data into discrete subsets, each of which is stored on separate sets of servers). They often restrict what data can be used for keys, while allowing values to be any kind of data that can be serialised. Typically querying can only be done by key.
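
As a rough illustration of the programming model (not any particular product's API), a Key / Value store behaves much like a persistent, distributed dictionary: you can put and get by key, but that is about it.

    import json

    class TinyKVStore:
        """Toy key / value store: values are opaque serialised blobs, lookup is by key only."""
        def __init__(self):
            self._data = {}  # a real store layers persistence, replication and sharding on top

        def put(self, key, value):
            # serialise the value; the store neither knows nor cares what is inside it
            self._data[key] = json.dumps(value).encode("utf-8")

        def get(self, key):
            raw = self._data.get(key)
            return json.loads(raw) if raw is not None else None

    store = TinyKVStore()
    store.put("content:1234", {"title": "Hello", "body": "..."})
    print(store.get("content:1234"))   # retrieval by key works...
    # ...but "find all items whose title is 'Hello'" has no direct support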

Some examples:

BigTable-like

These solutions are based on Google’s BigTable system (used internally by Google, as well as being the primary data storage facility available to Google App Engine developers). They can be thought of as being one step above a Key / Value store in that the values are not simply unstructured binary blobs that are opaque to the store, but are instead structured data elements that can be used for additional non-key based queries.

In some respects these solutions are quite similar to a relational database, minus explicit foreign keys, and with support for different “rows” in the same “table” having different “columns” (this is somewhat of an over-simplification, but from a data/content modeling perspective is reasonably accurate).
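
A minimal sketch of that data / content modeling perspective (illustrative only, not a real client API): rows share a “table” without sharing a fixed column set, and the structured values allow simple non-key filtering.

    # Two "rows" in the same "table", each with its own set of "columns".
    articles = {
        "row-001": {"title": "Budget announced", "author": "A. Writer", "body": "..."},
        "row-002": {"title": "Election results", "body": "...", "infographic_ref": "img-42"},
    }

    # Lookup by row key, as in a key / value store...
    print(articles["row-002"]["title"])

    # ...but because the values are structured columns, simple non-key filtering is also possible.
    by_author = [key for key, columns in articles.items() if columns.get("author") == "A. Writer"]
    print(by_author)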

Some examples:

Document

Document databases store “flat” collections of structured documents – in this case “document” does not (as the name might suggest) mean binary documents (Word, PDF etc.), but rather structured data objects with potentially rich internal structures.

The current generation of document databases has, for the most part, standardised on JSON as the underlying data format for documents; however, I consider XML databases to fall under this umbrella as well (albeit many of those have additional facilities for slicing and dicing XML in various weird and unnatural ways that are illegal in some states).
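
For illustration, a single “document” might look something like the following (field names invented); the point is the rich internal structure carried by one self-contained object.

    import json

    # One structured document, not a binary Word / PDF file.
    recipe = {
        "_id": "recipe-042",
        "type": "recipe",
        "name": "Pancakes",
        "ingredients": [
            {"name": "flour", "quantity": "200g", "optional": False},
            {"name": "maple syrup", "quantity": "to taste", "optional": True},
        ],
        "steps": ["Mix the dry ingredients", "Add milk and eggs", "Fry in batches"],
    }
    print(json.dumps(recipe, indent=2))   # the whole object round-trips as a single JSON document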

Interestingly, query facilities vary widely between the extant document databases, with some offering query facilities on par with relational databases, while others don’t provide anything resembling a traditional query capability.

Some examples:

Graph

Graph databases are unquestionably the oldest NoSQL solution, with at least one example predating the relational model by several years!

In these databases, data is stored as a series of discrete objects (“vertexes”) connected by zero or more relationships (“edges”) to one another. Typically the objects are simple hash table data structures and cannot have rich internal data structures (in contrast to a document in a Document database).
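
A toy illustration of the vertex / edge model (not any particular product's API): vertexes are flat property maps, and traversal along explicit edges is the natural query style.

    # Flat property maps for vertexes, explicit labelled edges between them.
    vertexes = {
        "product-1": {"title": "Widget", "sku": "W-100"},
        "product-2": {"title": "Gadget", "sku": "G-200"},
        "spec-9":    {"title": "Widget technical specification"},
    }
    edges = [
        ("product-1", "RELATED_TO", "product-2"),
        ("product-1", "DOCUMENTED_BY", "spec-9"),
    ]

    def neighbours(vertex_id):
        """Follow all outgoing edges from a vertex."""
        return [(label, target) for source, label, target in edges if source == vertex_id]

    print(neighbours("product-1"))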

Some examples:

Summary

Borrowing a nice diagram from the Neo4J folks, one way of comparing the different classes of NoSQL solution is to plot them along two axes – size (how far the solution scales) and complexity (how rich a data model it supports) – with Key / Value stores at the simple-but-massively-scalable end of the spectrum and Graph databases at the rich-but-more-modestly-scalable end.

Note: when looking at this graph I mentally replace the “Complexity” label with “Sophistication of data modeling” – the diagram is equally accurate with that substitution and to my mind that’s a more interesting picture (not to mention more relevant to the discussion of CMS and NoSQL).

In the next post we’ll instead look at how these classes of NoSQL solution compare to the CPS requirements we previously identified.

Addendum

There are any number of good NoSQL primers available on the interwebitubes, and I’d encourage you to read them if you’re new to the topic. I particularly like:

  1. Slides 11 through 17 of “A NOSQL Overview And The Benefits Of Graph Databases (nosql east 2009)” by Neo4J’s Emil Eifrem.
  2. NoSQL – the Shift to a Non-Relational World
  3. NoSQL – Death to Relational Databases(?)
  4. Ricky Ho’s NOSQL Patterns (if you’re after a little more “meat”)

Before I get flamed to a burnt crisp by the CouchDB fanbois, yes I’m quite familiar with map/reduce “materialised” views – I simply don’t consider that to be a “real” query mechanism. This feature also runs afoul of my “avoid crystal ball gazing at all costs” principle, but that’s a topic for another day.


NoSQL and CMS – Requirements for Content Production Systems

For each of the CMS problem domains mentioned in the introduction, I’m going to start out by outlining the core requirements that are most relevant to NoSQL.

Typically these requirements will focus on the “repository” underlying the CMS – how content is represented, how it is structured, how it is stored and retrieved, how the repository is scaled to large traffic volumes and so on. This is not to diminish the importance of other requirements (such as the editorial UI/UX), however NoSQL has a lot less direct relevance to those facets of a CMS.

Let’s kick off the series with my favourite CMS use case – the Content Production System (CPS).

Requirements for a CPS

A CPS has a number of requirements that are relevant to NoSQL solutions. These include:

  1. Richly structured content types
  2. Unstructured binary objects
  3. Relationships / references / associations
  4. The ability to evolve content models over time (what I call “schema evolution”)
  5. Branch / merge (in the Source Code Management (SCM) sense of the term)
  6. Snapshot based versioning
  7. ACID transactions
  8. Scalability to large content sets
  9. Geographic distribution

Let’s discuss each of these in more detail:

Richly Structured Content Types

In my experience, types in a WCM content model are generally more complex than those in other content management use cases (e.g. Document Management), with complex nested data structures within types being the norm rather than the exception.

While the specifics vary widely depending on the precise information architecture of the web site, some typical examples might be:

  • News Article:
    • a number of singleton fields such as “title”, “summary”, “author”, “date”, “body” etc.
    • an unbounded set of related image files, each of which has a “thumbnail” and a “high fidelity” rendition. These images may have further fields associated with them (provenance information, for example).
  • Product:
    • a number of singleton fields such as “SKU”, “title”, “description”, etc.
    • a variety of images such as “thumbnail”, “high fidelity”, “left view”, “right view”, etc.
    • nested data structures such as a regional price list – a set of (country code, currency, price) tuples
  • Recipe:
    • a set of ingredients, each of which has:
      • singleton fields such as “name”, “quantity”, “optional / required flag”, “substitutes” etc.
      • a nested data structure containing nutritional information
    • an ordered set of preparation instructions

All of these are based on actual WCM content models I have seen used in live web sites.
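
To make the nesting concrete, here is roughly how the Product type above might be represented as a single structured object (field names are illustrative):

    product = {
        "type": "product",
        "sku": "W-100",
        "title": "Widget",
        "description": "A very fine widget.",
        "images": {
            "thumbnail": "widget-thumb.png",
            "high_fidelity": "widget-large.png",
            "left_view": "widget-left.png",
        },
        "regional_prices": [  # nested set of (country, currency, price) tuples
            {"country": "AU", "currency": "AUD", "price": 19.95},
            {"country": "US", "currency": "USD", "price": 14.95},
        ],
    }
    print(product["regional_prices"][0]["price"])   # nested data retrieved from one object, no joins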

Unstructured Binary Objects

A no-brainer really – any CMS (WCM or otherwise) that is unable to efficiently store binary objects (regardless of MIME type) isn’t worthy of the moniker. Enough said.

Relationships aka References aka Associations

WCM content models are inherently interlinked – after all, that’s what the “hyper” in “hypertext” refers to! Continuing our examples above: a Product may contain references to complementary Products and to other content types such as Technical Specifications, White Papers etc.; a News Article might refer to related News Articles, and so on.

In fact, the most highly interlinked part of a WCM content model is often the set of content types representing the navigational model of the site. Regardless of whether the site uses a traditional single-root hierarchy, a multi-hierarchy (“faceted”) navigation scheme, a tag cloud or some wacky newfangled model dreamt up by a genius information architect, the content type(s) representing the navigational data structures are always highly interlinked with the non-navigational content types (the Products, News Articles, Recipes, etc.) and are often interlinked with themselves. The latter is particularly true of hierarchically based navigational schemes, which continue to be the dominant navigational paradigm used in content-rich web sites.
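
As a small illustration of what becomes possible when relationships are explicit, first-class data (a toy sketch, with invented content):

    # Each item records its outgoing references; the repository can then answer
    # questions like "what would break if I deleted this item?"
    items = {
        "product-1": {"references": ["spec-9", "product-2"]},
        "news-7":    {"references": ["product-1"]},
        "spec-9":    {"references": []},
        "product-2": {"references": []},
    }

    def incoming_references(target):
        return [item_id for item_id, item in items.items() if target in item["references"]]

    print(incoming_references("product-1"))   # -> ['news-7']: deleting product-1 would break news-7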

While it is possible to “manage” links via the humble hyperlink (and in fact this is the de-facto approach in several CPSes), this is less than ideal for various reasons:

  • it’s difficult to inform an author that they’re about to break links on the site by moving or deleting a content item that is the target of a reference
  • it’s difficult to determine what needs to be deployed in order to ensure that all dependencies are met (i.e. so links won’t be broken, post-deployment)
  • the graph of links provides useful information to authors about the dependencies within their content set, possible navigation paths through the site etc.
  • coupled with usage analytics data, visualisations of the link graph can be a powerful tool for authors in revising, distilling and generally maintaining the relevance of the content they’re delivering

Schema Evolution

A general guiding principle that I have followed throughout my technical career has been to avoid (as far as possible) anything that requires what I refer to as “crystal ball gazing” – making decisions now that require prediction of the future and that may be difficult to correct when that prediction turns out to be incorrect (as inevitably happens).

Content models are a classic example of this – in the decade or so that I’ve been working in content management professionally, I don’t recall a single instance where the content model was defined perfectly first time, up front, prior to use.

Unfortunately some of today’s CPSes make it extremely difficult to change the definition of a content type once that type has instances in existence – requiring (for example) a full dump / reload of the entire content set for even the most trivial of changes to the model.

This is the crux of the “schema evolution” requirement – any CMS worth a damn must provide the ability for the content model to evolve over time, regardless of whether content that uses that model exists or not.
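
A tiny sketch of what this looks like on a schemaless repository (illustrative only): old and new shapes of a content type coexist, and the CMS layer decides how to interpret the older items.

    # Two articles persisted at different points in the content model's life.
    documents = [
        {"type": "article", "title": "Old piece"},                      # created before "summary" existed
        {"type": "article", "title": "New piece", "summary": "Short"},  # created after the model change
    ]

    def summary_of(doc):
        # Lazy, application-level migration: fall back gracefully for legacy content.
        return doc.get("summary", doc["title"])

    for doc in documents:
        print(summary_of(doc))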

Branch / Merge

This is the ability for an author (or set of authors) to spin off from the main “branch” of editorial activity, work independently for some period of time and then merge their changes back into the main “branch”.

This is an optional (though common) requirement – some CPSes don’t provide this capability and some editorial teams don’t require it either.

That said, any web site that has a lifecycle that involves both frequent incremental revisions and infrequent major revisions that are prepared in parallel will benefit from this kind of functionality. Anyone who’s ever managed multiple concurrent software releases will grasp the issue (and its solution) immediately.

Snapshot Based Versioning

By “snapshot based versioning” I mean a versioning system that captures the full state of the content set at a given point in time, and can resurrect that state at any point in the future, regardless of what operations are executed by authors in the meantime (including deletes, renames and moves of assets).

Anyone who suffered through RCS / CVS in the good old days and is now using a sane SCM (Subversion or Mercurial, for example) will know exactly what I’m referring to here!

Surprisingly, some CPSes continue to use RCS style per-asset versioning, which means they are unable to resurrect deleted assets – a serious problem if your web site happens to fall under one of the regulations (e.g. HIPAA, SEC, FTC, etc.) that require that the complete state of a site be “resurrectable” for quite significant periods of time (often 7 years).
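
A toy sketch of the snapshot idea (illustrative only): each snapshot records the complete mapping of assets to revisions, so deletes, renames and moves in later snapshots don't prevent an earlier state from being resurrected.

    # Each snapshot captures the full state of the (tiny) content set at a point in time.
    snapshots = {
        "v1": {"about.html": "rev-a1", "news/launch.html": "rev-b1"},
        "v2": {"about.html": "rev-a2"},   # launch.html was deleted between v1 and v2
    }

    def resurrect(snapshot_id):
        return dict(snapshots[snapshot_id])   # the complete earlier state, deleted assets included

    print(resurrect("v1"))   # launch.html comes back, even though it no longer exists in v2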

ACID Transactions

Basically this boils down to the guarantee that modifications to the content set are durably persisted to the CPS, either succeed or fail in their entirety, and can be read back out in the case of success. To many this will seem a no-brainer, but when we move on to our review of NoSQL solutions we’ll find that some of them don’t necessarily provide this guarantee.

Note: while advantageous in some situations, I consider externally defined transactional boundaries (i.e. the ability to “batch up” numerous otherwise unrelated content modifications into an arbitrary ACID transaction) to be a “nice to have” requirement, rather than a hard requirement. Again we’ll see the impact of this when we review NoSQL technologies.
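
To illustrate the distinction, here is a toy sketch of “batching up” otherwise unrelated modifications into a single all-or-nothing unit (purely illustrative, not any product's transaction API):

    content = {"article-1": {"status": "draft"}, "nav-home": {"children": []}}

    def apply_batch(store, changes):
        """Apply a batch of (item, field, value) changes so they succeed or fail together."""
        staged = {item_id: dict(fields) for item_id, fields in store.items()}   # work on a copy
        for item_id, field, value in changes:
            staged[item_id][field] = value   # any failure here leaves the real store untouched
        store.clear()
        store.update(staged)                 # publish only once every change has applied

    apply_batch(content, [("article-1", "status", "published"),
                          ("nav-home", "children", ["article-1"])])
    print(content)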

Scalability to Large Data Sets

Interestingly, scalability in the presence of large amounts of traffic is the area where NoSQL technologies garner the most attention, yet it is one of the least important requirements for a CPS. This is because even large (several hundred person) editorial teams are unable to generate the kind of traffic load that even a moderately successful web site can receive.

However what does matter is that the CPS can scale in the presence of large amounts of data – typical content-heavy web sites these days contain tens to hundreds of thousands of discrete content items, many of which will contain several media assets (images, video, fire applets, etc.) that themselves may be heavyweight (MB to GB in size).

Geographic Distribution

Basically this requirement is for those organisations that have geographically distributed editorial teams who wish to ensure good performance of the editorial tool, no matter where the editors are physically located.

Although in essence a “nice to have” requirement, I threw it in here because I’m hearing it increasingly often and some NoSQL solutions cater to it quite nicely.

Next Up…

Next up I’ll give a quick overview of some (but not all!) of the more relevant NoSQL technologies currently on the market, and we’ll compare them against the requirements we’ve defined here to see to what degree they are relevant to the CPS use case.


NoSQL and CMS – a Match made in Heaven?

As anyone who’s visited planet Earth in the last year or so knows, the NoSQL (“Not Only SQL”) movement is rapidly gaining both momentum and mind share, despite a number of prominent detractors. Rather than entering into a lengthy debate on the general pros and cons of NoSQL technologies, I’d like to reflect on the possible applications of these technologies to the specific problems of content management, a use case to which (to my mind) they seem particularly well suited. I briefly scraped the surface of this topic in a prior post.

As discussed previously, what is meant when we refer to a “CMS” varies quite significantly depending on the use case, and my initial focus will be on the impact of NoSQL on CMSes that target the Web Content Management (WCM) use case, followed by a post on the impact of NoSQL on the Document Management (DM) use case.

Even within the seemingly narrow confines of WCM, we’re discussing (at least!) two different problem domains (Content Production Systems and Presentation Management Systems) that have markedly different requirements (and are arguably unrelated), and I’ll discuss the impact of NoSQL on each of these areas in turn.

Let’s begin!
