Critical Thinking: With No Further Study

Jay Quincy Allen
10 min read · Jun 16, 2022

Data Modeling, Data Warehouse and the “Modern Data Stack”

I. Data Modeling

Figure 1. “Effective Data Base Design” by William H. Inmon

I’ve used the following quote from this book countless times over the past 30 years or so in my journey as a software and database developer.

Figure 2. Preface from “Effective Data Base Design” by William H. Inmon

“It is a sad fact of life that when a person learns the mechanics of data base design, he or she often feels equipped to design an application with no further study. Perhaps the architects of today’s data base management systems have done a disservice by making it too easy to construct data bases. The freedom the beginner has in constructing data bases usually results in a poor design, which in turn triggers all sorts of other problems.” — from “Effective Data Base Design” by William H. Inmon, 1981

“This paragraph is just as prescient today as it was in 1981.” I have been saying that sentence in my professional life since I first read the book, around 1993. I had just finished reading “Building the Data Warehouse” and wanted to go back and see whether Mr. Inmon had written anything earlier. In 1993, I didn’t find the earlier work as applicable as it might have been in 1981, but I found value in it nonetheless, just as I do today. During the 90s and 00s, I used the quote in the context in which it was originally intended: the DBMS platforms made it all too easy to create a table, then create columns for that table. That’s all we need to know how to do, right? With no further study.

No. There’s a lot more to designing databases than just creating tables and columns that reflect our “human” view of data.

Around 2015, I had my second exposure to Hadoop and to the mentality of the data scientists, architects, managers, and engineers regarding data management in this “new era.” It revealed something a bit surprising to me: the age of “big data” was making this problem worse. In this case, they were writing data to tables in Microsoft® SQL Server that were exactly like the Hive tables they came from, sometimes hundreds of columns wide and tens of millions of rows deep. While I don’t claim to be an SQL Server expert, I have used it quite a bit over the decades, and I would have expected to know the maximum number of columns allowed in a table; SQL Server maximums were once common interview questions. I did not. It’s 1,024.
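To make the contrast concrete, here is a minimal sketch of the two shapes involved. The table and column names are invented for illustration, and I use Python’s built-in sqlite3 simply because it runs anywhere; the point is the shape of the schema, not the engine.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# The Hive-style shape: every attribute of every entity on one wide row.
# Customer facts repeat on every order row, inviting update anomalies.
con.execute("""
    CREATE TABLE orders_flat (
        order_id      INTEGER,
        customer_name TEXT,
        customer_city TEXT,
        product_name  TEXT,
        amount        REAL
    )
""")

# A normalized alternative: each fact lives in exactly one place, and
# foreign keys let the DBMS enforce integrity on our behalf.
con.executescript("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        city        TEXT NOT NULL
    );
    CREATE TABLE customer_order (
        order_id     INTEGER PRIMARY KEY,
        customer_id  INTEGER NOT NULL REFERENCES customer(customer_id),
        product_name TEXT NOT NULL,
        amount       REAL NOT NULL
    );
""")
```

Nothing about the second shape is exotic; it is what “further study” has recommended since long before Hadoop.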

This problem was still present, and it was getting worse. Not only was there ignorance of proper database modeling, but now a community of data folk was saying it wasn’t needed. Here are a few quotes I’ve read or heard on the issue:

“Data modeling is a thing of the past.(1)”
“The RDBMS is obsolete.(2)”
“We can load and read this data with much greater performance using newer technology(3) so we don’t need to be concerned with the shaping of the data anymore.(4)”
“I think all data engineering issues in building an analytical platform can be solved without data modeling, using just the capabilities of the modern data stack.(5)”

Sound familiar? There are five points made in those four sentences. The answers to those points in order are:
1. “No, it’s not.”
2. “No, it’s really not.”
3. “Yes, you can, within the same environment. So what?”
4. “Yes, you do.”
5. “No, they definitely cannot.”

Over the past decade, I’ve seen product managers and even Scrum masters dictate how a table should be designed, in both OLTP and decision-support projects. I thought it was pure ignorance at first, but later accepted it as a real trend. Still, there’s a lot of ignorance in it. Beyond the folks responsible for the “what” dictating database design to those responsible for the “how,” without properly understanding what they’re asking, it’s also a common theme among data engineers, scientists, analysts, and managers. Sorry, I’m sure I left out several data titles there. I’ll have to address the ridiculous metastasizing of data titles another time.

There’s a persistent attitude suggesting they have all forgotten Codd’s independence rules (rules 8–11 of Codd’s 12 rules). Or did they ever know about them at all? Probably not. Of course, these rules were written for a relational system that supposedly no vendor has ever truly built. Besides, who cares when I can read and write data lightning fast? Right? No. This demonstrates hubris and an unwillingness to learn anything outside one’s immediate sphere of influence, a sphere all too often shaped by vendors rather than by the leaders and senior practitioners of data management. The worst part is that a large swath of those leaders and senior practitioners are now themselves perpetuating misinformation. In many organizations, the blind are leading the blind.
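Rule 9, logical data independence, is the easiest of the four to show in a few lines. Here is a minimal sketch (invented tables, sqlite3 again for portability): the application reads only through a view, so the base schema can be refactored underneath it without touching the application’s query.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- The original base table the application was written against.
    CREATE TABLE employee (emp_id INTEGER PRIMARY KEY, name TEXT, dept TEXT);
    INSERT INTO employee VALUES (1, 'Ada', 'Engineering');
""")

# The application only ever queries through this view.
con.execute("CREATE VIEW employee_v AS SELECT emp_id, name, dept FROM employee")

# Later the base schema is refactored: dept moves to its own table.
# Rebuilding the view preserves logical data independence; the
# application's query is untouched.
con.executescript("""
    CREATE TABLE department (dept_id INTEGER PRIMARY KEY, dept_name TEXT);
    CREATE TABLE employee2 (emp_id INTEGER PRIMARY KEY, name TEXT,
                            dept_id INTEGER REFERENCES department(dept_id));
    INSERT INTO department VALUES (10, 'Engineering');
    INSERT INTO employee2 SELECT emp_id, name, 10 FROM employee;
    DROP VIEW employee_v;
    CREATE VIEW employee_v AS
        SELECT e.emp_id, e.name, d.dept_name AS dept
        FROM employee2 e JOIN department d ON d.dept_id = e.dept_id;
""")

print(con.execute("SELECT * FROM employee_v").fetchall())  # unchanged result
```

That is the discipline the rules encode: the consumer’s view of the data survives changes in how the data is stored and shaped.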

The tide may be turning. Organizations with sizeable investments in a “modern data stack” and little to no business value to show for it have started asking questions. Even the organizations leading the hype trains have felt the disturbance in the force. Why are these projects failing? It’s not just a lack of data modeling, but that’s certainly part of it. There are countless articles on big data projects failing; here’s a more recent one from Madhurjya Chowdhury. Notice the very first item on the list in that article? Integration. Today, more data folk than ever have no idea how to do it. It’s hard. So their response is often, “What possible benefit could come from all those tables in a data model?” or “Why would I go through so much trouble just to analyze and divide up this big file when I could use it as it is?” Sound familiar?

Do you still think new tech without further study is the answer? Adding “modern” to “data stack” is a relatively new thing. Data stacks have always been “modern.” Those technologies, like everything else, evolve. It seems redundant to say “modern.” Is that just an insecurity? It’s more likely a nod to thinking that MPP and column store are destined to completely supplant the RDBMS and any other tech that wasn’t invented after 2000. Also important to keep in mind: These “new” technologies aren’t even new.

Column store — TAXIR — 1969
MPP — Goodyear MPP — 1983 (This is only one of several MPP architectures introduced in the 1980s)

I don’t mention NoSQL since that *is* a relatively new term (Strozzi 1998, Oskarsson and Evans 2009), although the tech is not.

It’s easy to understand why vendors perpetuate misinformation. Their goal is to sell products. Anything seems to be fair in advertising these days. So what if we use the term data warehouse to mean something other than Data Warehouse? Well, it’s a problem, and not a new one. The age of the “modern data stack” has just made it much, much worse.

So why would practitioners accept and perpetuate misinformation? I can remember many times over the past 35 years when software engineers would go gangbusters for a new tech, regardless of its actual worth to the organization that would pay for it, invest in education on it, and end up perpetuating a mythology about it. That usually didn’t last very long. This time, with “big data” and the “modern data stack,” it has lasted for over a decade.


II. Data Warehouse

If the data practitioners out there would take a little time to do some reading, some critical thinking, and some tearing down of their own cognitive biases, they would find that Data Warehouse has never been sold as a technology product, because it is not one, no matter how many times you hear “Snowflake data warehouse” or “Redshift data warehouse” or any number of other vendors using the term imprecisely. I use the adjective “imprecisely” rather than “inappropriately” because it would take only a word or two more to make the term precise. “Platform” could be added for precision; “data mart platform” is what many of them should be called. Not “Data Warehouse.”

It’s easy to spot those who have taken the term and used it to describe a product rather than an architecture, as it was originally intended (“Building the Data Warehouse” by W. H. Inmon, January 1992, Wiley & Sons, Incorporated). Notice that the diagram of Data Warehouse architecture in Figure 4 makes no mention of a technology; in 1992, it was the RDBMS alone that would be used for the Data Warehouse. The Data Warehouse was defined as being:
1. subject-oriented
2. nonvolatile
3. integrated
4. time-variant

Figure 4. Top-Down or “Inmon” Data Warehouse, 1992
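Two of these characteristics, nonvolatile and time-variant, are the ones I most often see hand-waved away, so here is a minimal sketch of what they mean mechanically. This is my own illustration, not Inmon’s; the names are invented and sqlite3 stands in for whatever platform you use.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Nonvolatile and time-variant: rows are only ever appended, and every
# row carries the date it became known to the warehouse.
con.execute("""
    CREATE TABLE customer_history (
        customer_id INTEGER NOT NULL,
        city        TEXT NOT NULL,
        load_date   TEXT NOT NULL,
        PRIMARY KEY (customer_id, load_date)
    )
""")

def load_snapshot(rows, load_date):
    # No UPDATE or DELETE ever runs against the warehouse table.
    con.executemany(
        "INSERT INTO customer_history VALUES (?, ?, ?)",
        [(cid, city, load_date) for cid, city in rows],
    )

load_snapshot([(1, "Austin")], "2022-01-01")
load_snapshot([(1, "Denver")], "2022-06-01")  # a move; history is preserved

# "As of" questions fall out naturally from the time-variant design.
print(con.execute(
    "SELECT city FROM customer_history WHERE customer_id = 1 "
    "AND load_date <= '2022-03-01' ORDER BY load_date DESC LIMIT 1"
).fetchone())  # -> ('Austin',)
```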

The popularity of Kimball’s “The Data Warehouse Toolkit” (1996) brought about the era of “Inmon versus Kimball.” While Kimball’s work was largely about dimensional modeling, it also promoted an architecture, generally referred to as “bottom-up” (see Figure 5). In short, this architecture left out the integrated, time-variant, and nonvolatile characteristics of Data Warehouse architecture. As for time variance, you will find arguments suggesting that dimensional modeling’s slowly changing dimensions satisfy that criterion. Sorry, but a collection of dimensional models (a data mart) does not make a Data Warehouse, and slowly changing dimensions only get you so far in managing the temporality of data. Read a detailed explanation of the problems with managing temporality through SCDs in “Temporal Dimensional Modeling” (February 2019) by Lars Rönnbäck and Olle Regardt. As an architecture, bottom-up really supports only the subject-oriented characteristic of Data Warehouse. Therefore: not Data Warehouse.

Figure 5. Bottom-up Data Warehouse associated with Ralph Kimball
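For readers who have not built one, here is what a Type 2 slowly changing dimension amounts to, reduced to a minimal sketch with invented names (sqlite3 once more). A change expires the current row and inserts a new version:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE dim_customer (
        customer_sk INTEGER PRIMARY KEY,  -- surrogate key
        customer_id INTEGER NOT NULL,     -- business key
        city        TEXT NOT NULL,
        valid_from  TEXT NOT NULL,
        valid_to    TEXT,                 -- NULL means "current"
        is_current  INTEGER NOT NULL
    )
""")

def scd2_change(customer_id, new_city, change_date):
    # Type 2: expire the current version, then insert the new one.
    con.execute(
        "UPDATE dim_customer SET valid_to = ?, is_current = 0 "
        "WHERE customer_id = ? AND is_current = 1",
        (change_date, customer_id),
    )
    con.execute(
        "INSERT INTO dim_customer "
        "(customer_id, city, valid_from, valid_to, is_current) "
        "VALUES (?, ?, ?, NULL, 1)",
        (customer_id, new_city, change_date),
    )

con.execute(
    "INSERT INTO dim_customer "
    "(customer_id, city, valid_from, valid_to, is_current) "
    "VALUES (1, 'Austin', '2020-01-01', NULL, 1)"
)
scd2_change(1, "Denver", "2022-06-01")
```

Notice that valid_from/valid_to record a single timeline. The moment you need to distinguish when something changed in the real world from when the warehouse learned about it, a single pair of dates is no longer enough, which is among the problems Rönnbäck and Regardt examine.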

While the architecture was not Data Warehouse, Ralph Kimball’s promotion of dimensional modeling did bring about an evolution of Data Warehouse through polyglot persistence. This was the first use of the term “hybrid” in association with Data Warehouse. “Hybrid” has taken on a few different definitions since then; today it mostly describes a combination of on-premises and cloud infrastructure hosting the data warehousing landscape.

Hybrid, in the original sense, refers to using the integration layer as the single source for the data mart(s). So instead of “just the Data Warehouse” or “just the data marts,” both are present in the data warehousing landscape. The integration layer, sometimes called the “operational data store” (ODS) or simply “the Data Warehouse,” is the normalized model. Using it as the sole source, a data mart layer is added, which is dimensionally modeled. The reason for polyglot persistence is that the complete set of data (the ODS) is optimized for data integrity and writes, while the copies of that data in the data mart(s), the source for the visualizations, are designed for optimized reads. Now the source of our visualizations is optimized for reads both in technology and in the model, as the sketch after Figure 6 illustrates.

Figure 6. Hybrid of Top-Down and Bottom-Up
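A minimal sketch of the hybrid flow, again with invented names and sqlite3 standing in for both engines (in a real polyglot landscape the two layers would typically live on different platforms):

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Integration layer (ODS): normalized, optimized for integrity and writes.
con.executescript("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        city        TEXT NOT NULL
    );
    CREATE TABLE sale (
        sale_id     INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
        sale_date   TEXT NOT NULL,
        amount      REAL NOT NULL
    );
""")

# Data mart layer: dimensional, optimized for reads, and built solely
# from the integration layer; the ODS is its only source.
con.executescript("""
    CREATE TABLE dim_customer AS
        SELECT customer_id, name, city FROM customer;
    CREATE TABLE fact_sales AS
        SELECT sale_id, customer_id, sale_date, amount FROM sale;
""")
```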

Data Vault and Anchor Modeling are examples of the evolution of the Data Warehouse. Anchor Modeling is a modeling technique and not a Data Warehouse architecture per se, but it does represent an evolution of the architecture.
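For the curious, the core Data Vault constructs reduce to hubs (business keys), links (relationships), and satellites (history-bearing attributes). A minimal sketch with invented names, not a production design:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Hub: one row per business key, nothing else.
    CREATE TABLE hub_customer (
        customer_hk   TEXT PRIMARY KEY,   -- hash of the business key
        customer_id   INTEGER NOT NULL,   -- the business key itself
        load_date     TEXT NOT NULL,
        record_source TEXT NOT NULL
    );
    CREATE TABLE hub_product (
        product_hk    TEXT PRIMARY KEY,
        product_id    INTEGER NOT NULL,
        load_date     TEXT NOT NULL,
        record_source TEXT NOT NULL
    );
    -- Link: a relationship between hubs, also insert-only.
    CREATE TABLE link_purchase (
        purchase_hk   TEXT PRIMARY KEY,
        customer_hk   TEXT NOT NULL REFERENCES hub_customer(customer_hk),
        product_hk    TEXT NOT NULL REFERENCES hub_product(product_hk),
        load_date     TEXT NOT NULL,
        record_source TEXT NOT NULL
    );
    -- Satellite: descriptive attributes, historized by load date.
    CREATE TABLE sat_customer (
        customer_hk   TEXT NOT NULL REFERENCES hub_customer(customer_hk),
        load_date     TEXT NOT NULL,
        city          TEXT,
        record_source TEXT NOT NULL,
        PRIMARY KEY (customer_hk, load_date)
    );
""")
```

Note how the insert-only, load-dated pattern carries the nonvolatile and time-variant characteristics through the whole model.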

III. Conclusion

  • When you hear anything negative about Data Warehouse, it’s usually a straw man argument made by demonizing the architecture for things which are not failures of the architecture. It distracts you from being critical of the idea that it’s something that needed supplanting in the first place.
  • Modern tech makes data modeling obsolete? How does faster data processing make up for a lack of modeling? One is about performance; the other is about data integrity and avoiding update/delete anomalies (see the sketch after this list). How do complex data pipelines make up for lacking a holistic view of the information being managed? Not data, but information.
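Here is the delete flavor of the anomaly in a few lines, using the same invented flat shape as earlier (sqlite3 for portability):

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Flat design: customer facts exist only on order rows.
con.execute(
    "CREATE TABLE orders_flat "
    "(order_id INTEGER, customer_name TEXT, customer_city TEXT)"
)
con.execute("INSERT INTO orders_flat VALUES (1, 'Ada', 'Austin')")

# Deleting Ada's only order silently deletes everything we know about
# Ada: the classic delete anomaly. Faster scans will not bring her back.
con.execute("DELETE FROM orders_flat WHERE order_id = 1")
print(con.execute(
    "SELECT * FROM orders_flat WHERE customer_name = 'Ada'"
).fetchall())  # -> []
```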

New technologies are great. New patterns emerge, which means new opportunities for reimagining how we accomplish more strategic goals.

  • Subject-oriented
  • Time-variant
  • Non-volatile
  • Integrated

This was the answer. The question? “How do I best support analytical capabilities in my organization?” In most cases, the setting that produces this question hasn’t changed in 30 years. In this era of IoT, FinTech, social media, and ever-greater marketing and advertising opportunities for organizations to sell their products and services, new technology plays a major role in extending our data processing capabilities. But even in cases that truly involve “big data,” did these organizations suddenly lose their original analytics requirements? Anyone who has worked in data management for a significant time knows the lessons learned from throwing new technology at a problem without thinking it through. Is there a plan? A methodology? A true architecture? Not every situation requires every cloud tech you can throw at it; in fact, you may be making things worse. The “stack” of tech that’s chosen is useless without a sound, repeatable architecture. Most organizations can meet that original requirement with surprisingly few platforms and programming languages. It’s worth noting that many “big data” requirements could, in fact, have been handled by traditional means.

Keep it simple. What is the outcome your organization needs when it comes to data? How can you most efficiently provide that outcome so that it has the lowest possible cost of ownership, is easily extensible enough to accommodate change over time, and meets the original criteria put forth for an organization’s analytics needs? Subject-oriented, non-volatile, time-variant, and integrated.


My career in decision support has spanned many industries. Learning what is common across organizations and industries has been an invaluable resource.