CDPs - The Hard Parts

By Ferdinand Steenkamp

Customer Data Platforms have been a hot topic for almost a decade now. The term itself was first coined in 2013 as “packaged software that builds a unified, persistent customer database accessible to other systems”. Quite a mouthful. Despite the many challenges and potential downsides of implementing or integrating CDP solutions, the industry has seen large-scale adoption - an indication of the magnitude of the opportunity costs at play. Even minimal Customer Data Platforms yield comparatively large ROIs due to the rich customer insights they provide - a field which, historically, was more speculative than data-driven. In this article we discuss the hard parts of implementing or integrating a CDP into your existing platform.

Introduction to CDPs #

To repeat the earlier definition, a CDP is packaged software that builds a unified, persistent customer database accessible to other systems. In layman’s terms: it is a system that ingests customer-related data from many sources, with varying levels of quality and structure, and makes it useful to the organisation as a whole. A system qualifies as a CDP if it implements, at a minimum, the following set of features (as defined by the CDP Institute):

  • Accept All Sources
  • Retain All Detail
  • Persistent Data
  • Unified Profiles
  • Manage PII
  • External Access
  • Segment Extracts

We won’t be diving into the required or optional features here, as there are plenty of sources covering them in detail. I recommend looking at the RFP Guide for a comprehensive list of features with detailed descriptions of each.

A Customer Data Platform, in its simplest form, looks like this.

An infographic displaying a CDP broken up into three distinct sections, as described in the preceding text

CDPs vary in scope and complexity, but they all share the same basic attributes:

  • They pull data from many sources.
  • They clean, transform, and load the data into a persistent store.
  • Crucially, during this ingestion step, they use data-deduplication methods to create unified customer profiles.
  • They then expose a platform (something that can be built on) of value-driving features.
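To make the deduplication step concrete, here is a minimal sketch of one common approach: treat every record as a node, link records that share an identifier, and collapse each linked group into one profile using a union-find structure. The field names (`source_id`, `email`, `phone`) and the exact-match rule are assumptions made for illustration; real identity resolution usually adds fuzzy matching and confidence scoring on top.

```python
from collections import defaultdict

def unify_profiles(records):
    """Group records that share an email or phone number into one profile.

    Returns a sorted list of groups, each a list of source_ids.
    """
    parent = list(range(len(records)))

    def find(i):
        # Follow parent pointers to the group root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[max(root_a, root_b)] = min(root_a, root_b)

    first_seen = {}  # (field, normalised value) -> index of first record using it
    for idx, rec in enumerate(records):
        for field in ("email", "phone"):
            value = rec.get(field)
            if not value:
                continue  # missing identifiers are simply skipped
            ident = (field, value.strip().lower())
            if ident in first_seen:
                union(idx, first_seen[ident])
            else:
                first_seen[ident] = idx

    groups = defaultdict(list)
    for idx, rec in enumerate(records):
        groups[find(idx)].append(rec["source_id"])
    return sorted(groups.values())


records = [
    {"source_id": "crm-1", "email": "Ann@Example.com", "phone": None},
    {"source_id": "pos-7", "email": None, "phone": "555-0100"},
    {"source_id": "web-3", "email": "ann@example.com", "phone": "555-0100"},
    {"source_id": "crm-2", "email": "bob@example.com"},
]
profiles = unify_profiles(records)
```

Note how the third record bridges the first two: neither shares an identifier with the other directly, yet all three end up in the same profile.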

Every attribute described above poses its own set of unique challenges ranging from technical to behavioural in nature. For the remainder of this article we discuss these issues in more depth.

Complexity in Data Ingestion #

Data Quality Challenges #

Anyone who has worked with data in a large organisation knows that it can be chaotic. Instead of fighting this, data experts have gone in the opposite direction, embracing messy and unstructured data to assist and enrich structured data.

An infographic showing data coming from multiple sources, with varying degrees of structure, being transformed into useful data.

In a perfect world, data would be cleaned at the source. This is, unfortunately, not practical in enterprise data management. Theoretically one could enforce checks and filters to prevent ingestion of bad data into your CDP. But, in reality, this would leave your CDP bare compared to what it could be since you would be cutting out valuable context.

Pushing responsibility for data quality back onto the teams and systems that produce the data has proven to be an ineffective strategy in large organisations. Defined structure requires up-front context, which we don’t have. Furthermore, a large portion of valuable data comes from the frontline, where there is less incentive and skill to keep data quality high. We simply cannot expect to work exclusively with well-defined, clean data. In contrast, if we embrace low-quality and incomplete data, we can squeeze more value out of the platform. This, however, puts a lot more pressure on the CDP to take ownership of data processing and transformation. Low-quality incoming data does not excuse low-quality outgoing data. Your CDP is going to be a platform for the organisation to build on; the platform must be stable.
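As a rough illustration of what "embracing" messy data can look like in code, the sketch below normalises what it can, records what it could not fix, and only sets a record aside when nothing usable remains. The field names, the deliberately loose email pattern, and the quarantine rule are all assumptions for the example rather than a recommended standard.

```python
import re

# Intentionally loose: something@something.something with no spaces.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def normalise(raw):
    """Return (clean_record, issues) without raising on messy input."""
    issues = []
    email = (raw.get("email") or "").strip().lower()
    if email and not EMAIL_RE.match(email):
        issues.append("invalid_email")
        email = ""
    # Collapse repeated whitespace and standardise capitalisation.
    name = " ".join((raw.get("name") or "").split()).title()
    if not name:
        issues.append("missing_name")
    clean = {"name": name or None, "email": email or None, "raw": raw}
    return clean, issues

def ingest(rows):
    """Split rows into accepted records and quarantined ones."""
    accepted, quarantined = [], []
    for raw in rows:
        clean, issues = normalise(raw)
        clean["issues"] = issues
        # Keep anything with at least one usable field; partial data still has value.
        if clean["name"] or clean["email"]:
            accepted.append(clean)
        else:
            quarantined.append(clean)
    return accepted, quarantined


rows = [
    {"name": "  ann   lee ", "email": "ANN@Example.com "},
    {"name": "", "email": "not-an-email"},
    {"name": "bob", "email": None},
]
accepted, quarantined = ingest(rows)
```

Keeping the original `raw` payload and the list of `issues` next to each cleaned record means nothing is silently thrown away, and data quality can be tracked over time.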

Managing Diverse Data Sources #

Since data is being pulled from all over the organisation, you should expect regular disruptions due to faults or changes. Relational databases will have their table schemas adjusted. CSV dump files pulled over FTP will suddenly stop being updated. When your platform depends on many sources, maintaining integrity becomes a full-time job. There is no workaround for this problem; instead I would suggest focussing energy on monitoring and alerting. A platform that fails loudly can be fixed. A platform that fails silently can deliver stale or bad data for weeks or months, driving decision makers in the wrong direction.
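One simple way to "fail loudly" is a freshness check that runs on a schedule and raises as soon as any source falls behind its expected update interval. The sketch below assumes you already record a last-updated timestamp per source; the source names and thresholds are hypothetical.

```python
from datetime import datetime, timedelta, timezone

def find_stale_sources(last_updated, max_age, now=None):
    """Return the names of sources whose newest data is older than allowed."""
    now = now or datetime.now(timezone.utc)
    stale = [
        name
        for name, seen_at in last_updated.items()
        if now - seen_at > max_age[name]
    ]
    return sorted(stale)

def check_or_fail(last_updated, max_age, now=None):
    """Raise instead of returning quietly, so the scheduler or alerting notices."""
    stale = find_stale_sources(last_updated, max_age, now)
    if stale:
        raise RuntimeError("stale sources: " + ", ".join(stale))


now = datetime(2024, 1, 10, 12, 0, tzinfo=timezone.utc)
last_updated = {
    "crm_db": now - timedelta(minutes=5),
    "ftp_csv_dump": now - timedelta(days=3),
}
max_age = {
    "crm_db": timedelta(hours=1),
    "ftp_csv_dump": timedelta(days=1),
}
stale = find_stale_sources(last_updated, max_age, now)
```

In practice the exception would be wired into whatever alerting you already use; the point is that a silent CSV dump turns into a visible failure within one check interval.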

Merging data from disparate sources can also be a complicated task. This requires a lot of context; you might even find that adding a business analyst to the team could dramatically impact project success. Engineers often lack the business context required to discover meaningful industry insights or correlations.

Change Data Capture #

Change Data Capture (CDC) is one of the hardest data engineering tasks my team faces day to day. If done incorrectly, it can lead to massive networking costs and out-of-sync data issues. Databases normally have streaming replication features built on mechanisms like the write-ahead log (WAL) that allow you to replicate and capture changes in real time. However, a large portion of useful data comes from less sophisticated sources, such as CSV dumps, where manual diffing and polling could be a requirement.
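For sources without a change stream, manual diffing can be sketched as follows: hash every row of each dump by primary key, then compare the current snapshot with the previous one so that only the changes are shipped downstream. The `id` key column and the in-memory snapshots are assumptions; a large dump would need the previous hashes persisted somewhere instead.

```python
import csv
import hashlib
import io

def snapshot(csv_text, key):
    """Map each row's key to a hash of its full contents."""
    snap = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Sort the columns so a reordered header does not register as a change.
        canonical = "\x1f".join(f"{col}={row[col]}" for col in sorted(row))
        snap[row[key]] = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return snap

def diff(old, new):
    """Compare two snapshots and list the keys that changed."""
    return {
        "insert": sorted(k for k in new if k not in old),
        "update": sorted(k for k in new if k in old and new[k] != old[k]),
        "delete": sorted(k for k in old if k not in new),
    }


yesterday = "id,email\n1,a@x.com\n2,b@x.com\n3,c@x.com\n"
today = "id,email\n1,a@x.com\n2,b@new.com\n4,d@x.com\n"
changes = diff(snapshot(yesterday, "id"), snapshot(today, "id"))
```

Only the rows listed in `changes` need to be read and sent on, which is where the savings in network and processing cost come from.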

A graphic showing data being streamed from a relational database into a graph database, requiring a transformation step

Even if you are streaming data from one database to another, unless they are both of the same model[1], there will be a transformation step. This can sometimes be done by combining event streaming platforms (Apache Kafka) or message queues (RabbitMQ) with plugins, but could require manual intervention through scripting. If this is the case, your team’s engineering capability might be the bottleneck. Either way, I have yet to run into a CDC stream which didn’t leave me sweating; there are just too many “what-ifs”. When real-time is a requirement, the task becomes dramatically more complicated. And even without real-time, CDC might be the only viable way to keep costs at a realistic level.
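As an illustration of such a transformation step, the sketch below turns a change event from a relational `orders` table into a parameterised Cypher statement for a graph database. The event shape (`op` plus `row`), the column names, and the `Customer`/`Product`/`PURCHASED` graph model are all hypothetical; the function only builds the query and parameters and leaves execution to whichever driver you use.

```python
def order_event_to_cypher(event):
    """Translate one change event for an orders row into (query, params)."""
    row = event["row"]
    params = {
        "customer_id": row["customer_id"],
        "sku": row["sku"],
        "order_id": row["order_id"],
    }
    if event["op"] == "delete":
        # Remove only the relationship; the customer and product nodes stay.
        query = (
            "MATCH (:Customer {id: $customer_id})"
            "-[r:PURCHASED {order_id: $order_id}]->"
            "(:Product {sku: $sku}) DELETE r"
        )
        return query, params

    # Inserts and updates share one idempotent statement, so replaying
    # an event does not create duplicate nodes or relationships.
    params["quantity"] = int(row["quantity"])
    query = (
        "MERGE (c:Customer {id: $customer_id}) "
        "MERGE (p:Product {sku: $sku}) "
        "MERGE (c)-[r:PURCHASED {order_id: $order_id}]->(p) "
        "SET r.quantity = $quantity"
    )
    return query, params


insert_event = {
    "op": "insert",
    "row": {"order_id": 9, "customer_id": 1, "sku": "TENT-2", "quantity": "3"},
}
query, params = order_event_to_cypher(insert_event)
```

Making the upsert path idempotent is a deliberate choice: it takes care of one of the more common "what-ifs", namely the same event being delivered twice.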

Compliance and Privacy Concerns #

Interestingly enough, when discussing customer data with industry experts, compliance and privacy management is one of the leading concerns. A Customer Data Platform can greatly simplify the view you have of individual customers, but it can also be a dangerous tool if not managed correctly. When all your customer data is collected and cleaned up in one place, the last thing you want to do is give access to that information to everyone in your organisation. There are a couple of ways to ensure that this doesn’t happen:

  • Implement strict access control measures such as RBAC, ABAC or Fine-Grained Access Control.
  • Apply data masking, which allows data to remain useful without risking privacy breaches.
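The two measures above can be combined in a small sketch: each role sees only the fields it is allowed to see, and analysts work with a masked email and a keyed pseudonym instead of the raw value. The role names, field lists, and key handling are illustrative assumptions. Note also that a keyed hash is pseudonymisation rather than anonymisation, so the result should still be treated as personal data.

```python
import hashlib
import hmac

# Hypothetical role-to-field mapping; a real system would load this from policy.
ROLE_FIELDS = {
    "support": {"name", "email", "phone"},
    "analyst": {"customer_key", "segment", "email_masked"},
}

def mask_email(email):
    """Keep the first character and the domain, hide the rest."""
    local, _, domain = email.partition("@")
    return f"{local[:1]}***@{domain}"

def pseudonymise(value, secret_key):
    """Stable keyed hash, so records can still be joined without the raw value."""
    digest = hmac.new(secret_key, value.lower().encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]

def view_for(role, profile, secret_key):
    """Return only the fields the given role may see; unknown roles see nothing."""
    enriched = dict(
        profile,
        email_masked=mask_email(profile["email"]),
        customer_key=pseudonymise(profile["email"], secret_key),
    )
    allowed = ROLE_FIELDS.get(role, set())
    return {field: value for field, value in enriched.items() if field in allowed}


profile = {
    "name": "Ann Lee",
    "email": "ann@example.com",
    "phone": "555-0100",
    "segment": "loyal",
}
analyst_view = view_for("analyst", profile, b"demo-key-not-for-production")
```

Denying by default, so that an unknown role gets an empty view, keeps mistakes in the role mapping from turning into data leaks.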

All in all, the compliance wins of CDPs far outweigh the risks. Without CDPs, many organisations have no single source of truth for customer information. This means that they are at risk of being non-compliant. A major win in privacy over recent years has been the rise of CDPs with cookie-consent-management features. Before GDPR (or POPI) laws kicked in, many retailers had no way of knowing who their customers were, let alone request individual consent. It seems almost counter-intuitive that one of the greatest marketing tools of the last century has become a champion for consumer privacy protection - but that is exactly the case.

Segmentation and Personalisation Difficulties #

To achieve effective segmentation and personalisation you need accurate identity resolution. This is the cornerstone of a CDP. At Rockup, we believe that graph databases are a key competitive advantage when it comes to making sense of connected business processes residing in disparate data sources. By making a graph database the foundation of your CDP, you unlock powerful data science techniques that allow for accurate and resource-efficient projections. In other words, your models work and they work fast. Querying a graph across densely connected data can be an order of magnitude faster than joining across tables, meaning that your dependent applications can query data on the fly, as they need it - truly real time. Most CDPs rely on columnar stores, which are great for large-scale aggregations, but they fall flat when query depth comes into play.

Infographic displaying the difference between segmentation and personalisation

Segmentation and personalisation require an understanding of the data that connects your data - relationships. This is the reason we advocate for the use of graph databases when building CDPs - at the very least they should be used in conjunction with columnar stores. The more relationships we have access to when building projections, the more accurately we can predict user behaviours or desires. Unlike other data paradigms which suffer under the weight of query depth, graphs thrive. Segmentation and personalisation, to be frank, should not be as difficult as people would make it out to be. It is simply a matter of using the right tool for the job.
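To illustrate why query depth matters, here is a toy in-memory version of a relationship-based segment: starting from one customer, walk the customer-product relationships and collect every other customer reachable within a given number of hops. The adjacency dictionary and the `c:`/`p:` prefixes are assumptions for the example; a graph database performs the same kind of traversal natively, whereas a relational store typically needs one self-join per hop.

```python
from collections import deque

def within_hops(graph, start, max_hops):
    """Breadth-first search returning {node: hop_count} up to max_hops."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in graph.get(node, ()):
            if neighbour not in seen:
                seen[neighbour] = seen[node] + 1
                queue.append(neighbour)
    return seen

def similar_customers(graph, customer, max_hops):
    """Customers connected to `customer` through shared products."""
    reached = within_hops(graph, customer, max_hops)
    return sorted(n for n in reached if n.startswith("c:") and n != customer)


graph = {
    "c:ann": ["p:tent"],
    "p:tent": ["c:ann", "c:bob"],
    "c:bob": ["p:tent", "p:stove"],
    "p:stove": ["c:bob", "c:cat"],
    "c:cat": ["p:stove"],
}
close_segment = similar_customers(graph, "c:ann", 2)
wide_segment = similar_customers(graph, "c:ann", 4)
```

Widening the segment is just a matter of raising the hop limit, which is the property the section above relies on.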

Cost, Scalability and Performance #

As discussed above, CDPs that don’t use graph databases as the underlying data store are severely handicapped in their ability to query segments or rank identities in real time. This will drive costs through the roof as you will need to run batch jobs to counter the delayed results of running large-scale joins[2]. Even if you throw processing power ($$) at the problem, the platform will always be playing catch-up in terms of ranking quality and segmentation query speed.

If you are choosing to buy a pre-packaged vendor solution, ensure that you validate their underlying technologies. If an in-depth technology evaluation is technically out of scope for your team, be sure to carefully evaluate costs and capabilities. Many of these platforms have surprising hidden costs that become apparent after the integration is complete, at which point turning back is not an option. Otherwise, if you plan on building an in-house CDP, add a graph data expert to the project. They could greatly improve your analytical capabilities whilst decreasing query costs.

A big part of the exorbitant pricing of CDPs comes from proprietary database technologies owned by large cloud vendors (such as Google Cloud Bigtable or Amazon DynamoDB). If efficiency and cost are of importance to your team, we recommend looking at open source alternatives and picking the right model for your use case. For graph databases we recommend Neo4j and for wide column stores our current recommendation is ScyllaDB (though Apache HBase would likely find more traction in a corporate environment).

We have one recommendation that could save you a lot of money and reduce data processing times when implementing a CDP - a software engineer. Many of the services used to implement CDPs are over-powered (feature-wise), over-priced and slow. There are many scenarios where simple scripting can replace multiple expensive services. Data processing is best done as a software job. The transformation of data is, after all, the single purpose of all software.

Find an engineer who understands performance optimisation and knows how to profile their code. Software that does ETL on large datasets can be complicated to write, but if done correctly, it will be well worth your while. You can achieve order-of-magnitude improvements in data processing speed by understanding hardware-level optimisations. Speed improvement means:

  • You use less processing power, which makes the job cheaper.
  • You can process more data, since it’s faster.
  • You can run batch jobs more frequently, keeping data fresh.

If you plan on keeping your data real-time, we suggest offloading as much processing as you can to self-written software. The nice thing about this approach is that you don’t have to do this from the start. Build your system, run it, find the bottlenecks, and then gradually remove them.

Organisational Concerns #

Vendor Selection #

If you have decided to purchase a CDP solution from a vendor, I don’t have much advice other than: do your homework. There are plenty of sources, such as the CDP Institute. They have template RFPs and plenty of educational content. You can find vendor comparisons that have in-depth descriptions of all industry-standard CDP vendors along with feature lists. Be careful with your selection; you are about to make a large, risky investment.

Trading Complexity for Complexity #

I recently wrote about the tradeoffs between building and buying solutions to Complex Problems. In that article, I argue that buying isn’t always as attractive as people might believe. CDPs are complicated systems. The chances are good that your requirement calls for a small subset of features available on CDPs in the market. Your integration requirements could also be too involved for a general purpose platform. Taking on the complexity of an off-the-shelf CDP without the use case for it can cost you a lot of time and money with little return on investment - if any.

If you are getting started with your data journey, then you might be better off cleaning your data and doing segmentation straight at the source. In a world where everything has become complicated, layered, and abstracted, simplicity is a competitive advantage. It keeps your organisation lean and ready to pivot on a whim. You could achieve 80% of the benefit of CDPs for 20% of the effort[3]. Don’t trade the complexity of building simple solutions in-house for the complexity of buying and integrating complex solutions without good reason.

Adoption Hurdles #

We all know that new tools can be greeted less than enthusiastically in an organisation. Executives will have cost concerns, managers will have training concerns, and staff will have learning grievances. The three main hurdles you will face are:

  • Overcoming operational silos
  • Integrating with existing services
  • Inspiring change

Operational silos can be tough to break down, but if you want rich data for your CDP, you will need buy-in from all over the company. Many managers will push back on sharing their data; this is to be expected. The best way forward is to get buy-in from a small, targeted group. Build something simple and useful. When others see the results, they will follow.

The same will happen with teams that are hesitant to integrate. Over time they will see how their own products can benefit from richer customer data. Allow your success stories to do the marketing on your behalf. Customer data can be used by everyone within an organisation - that is why it must be democratised.

Migrating a workforce over to new tooling can be tough - unless the tooling improves their day-to-day experience. Listen to your team; if they are complaining about the move, they are likely not benefitting much. I have witnessed plenty of scenarios where ‘someone at the top’ makes a decision to change tooling to a platform that does exactly the same thing. A lot of training for zero benefit. This is an effective way to annoy large groups of people. Make sure that your new system is an actual improvement over the existing one.

Measuring ROI and Success #

If you can’t measure the value, there is no value. This sounds harsh, but it is a core tenet of data-driven decision making. The problem is that no-one is sure how to measure the success of CDPs[4]. Key performance indicators used to measure Customer Data Platform ROI are:

  • revenue
  • cost savings
  • user engagement
  • user satisfaction
  • predictive model accuracy
  • data quality over time

There is, however, no agreed-upon approach to measuring. This means it is difficult to compare solutions across industries as a whole. What we can say, however, is that every business is unique. The best person to decide what to measure is a person with deep context. That is why, again, we lean on business analysts to decide on KPIs that must be tracked throughout the process. Measuring ROI for CDPs can be bothersome, but it remains a crucial part of solving complex enterprise problems.
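For the financial KPIs, the arithmetic itself is the easy part. Below is a minimal sketch of the standard ROI ratio, net gain divided by cost, applied to the revenue and cost-saving indicators listed above. The figures are invented purely to show the calculation; deciding what counts as incremental revenue attributable to the CDP is the hard, context-dependent part.

```python
def cdp_roi(incremental_revenue, cost_savings, total_cost):
    """Return ROI as a fraction: (gains - cost) / cost."""
    if total_cost <= 0:
        raise ValueError("total_cost must be positive")
    gains = incremental_revenue + cost_savings
    return (gains - total_cost) / total_cost


# Hypothetical first-year figures, in the same currency.
roi = cdp_roi(incremental_revenue=300_000, cost_savings=50_000, total_cost=250_000)
print(f"ROI: {roi:.0%}")
```

The softer indicators, such as engagement, satisfaction, model accuracy and data quality, do not fit this formula directly and are usually tracked as separate trends alongside it.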

Conclusion: Navigating CDP Challenges #

Implementing or integrating Customer Data Platforms can present a range of challenges and pitfalls. However, the opportunity costs involved cannot be ignored. My advice, as always, is to start small and iterate fast. You should be especially cognisant of this if you are integrating into an off-the-shelf solution - they can offer a wealth of complicated features that yield little return. Focus on the simple gains and build a solid foundation. You will always get outsized returns from the simplest solutions.

CDPs are a long-term investment; take your time to get it right.

  1. ‘Database model’ refers to the structure or format of the data. Examples of common models are: relational, graph, columnar, document. 

  2. ‘Join’ here refers to a SQL JOIN, which is a SQL operation used to combine rows from two or more tables, based on a related column between them. This can be a particularly slow operation when performed on large databases. 

  3. Pareto Principle, 80-20 rule. 80% of the result for 20% of the effort. 

  4. Confusion around measuring CDP ROI 
