Getting the Facts Right: A Guide to Modelling Robust Fact Tables
Fact tables are the foundation of a data model. They record the events and transactions that measure business activity. Without facts, there is no data. Because of this centrality, it is important to model fact tables correctly; otherwise, you can end up with fundamental flaws in your data model that affect both your upstream (ingestion and integration) and downstream (analysis and insights) activities.
In this piece, I will focus on the practical aspects of fact table modeling that, in my experience, are indispensable. Hopefully, this will serve as a good starting point for your modeling endeavors, or simply help you filter the plethora of information and concepts available on this topic.
To pin down abstract concepts, I will be utilizing the following data model throughout the article.
This is a sample data model for trading data. The core fact is trades. The main measurement columns are price and quantity. We also have four dimensions: Exchange (the exchange the trade occurred on), Date, Instrument (the financial instrument that was traded), and Counterparty (the party on the opposite side of the trade). Let’s delve into it.
Define the Grain and Stick to It
The ‘grain’ is simply the basic unit of measurement: it is what is being measured. What is the grain in our sample data model? A single trade. The grain should be the ‘lowest, most atomic … physical process that collects data’ (Kimball, 2003). This matters because data can be aggregated upwards from the grain, but never disaggregated below it. If I have data on individual trades, I can aggregate them at the daily, weekly, and monthly levels, but I can’t go backward. Therefore, the grain should be at the most granular level practically possible.
Once a grain is selected, it should be adhered to. This simply means that every column in the fact table should relate to that single unit of measurement. For instance, it is inappropriate to have a cumulative quantity or average price column in our Trades fact table. This would violate the grain of the table, which can lead to double counting and mixed-granularity issues in aggregation and filtering operations at the analysis/BI stage.
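The aggregate-up-but-never-down property of the grain can be seen in a few lines of SQL. A minimal sketch using Python’s built-in sqlite3 module; the table and column names mirror the article’s sample model and are illustrative, not a prescribed schema:

```python
import sqlite3

# Hypothetical trades fact table: one row per trade (the grain).
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE trades (
        trade_id      INTEGER PRIMARY KEY,
        date_id       TEXT,
        instrument_id TEXT,
        price         REAL,
        quantity      INTEGER
    )
""")
con.executemany(
    "INSERT INTO trades VALUES (?, ?, ?, ?, ?)",
    [
        (1, "2024-01-02", "AAPL", 185.0, 100),
        (2, "2024-01-02", "AAPL", 186.0, 50),
        (3, "2024-01-03", "AAPL", 184.5, 200),
    ],
)

# Rolling up from trade level to daily level is always possible...
daily = con.execute("""
    SELECT date_id, SUM(quantity) AS total_qty
    FROM trades
    GROUP BY date_id
    ORDER BY date_id
""").fetchall()
print(daily)  # [('2024-01-02', 150), ('2024-01-03', 200)]
# ...but had we stored only daily totals, the individual trades
# underneath them would be unrecoverable.
```

Note there is no cumulative or average column stored in the table itself; those are computed at query time, which keeps the grain intact.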
Resolve Many-to-Many Relationships in Fact Tables
If we refer to our sample data model, we can see that all dimensions have a one-to-many relationship with the fact table. This is usually the case. But what is the cardinality between dimensions? Let’s take an example and work it out. Between Exchanges and Instruments: can a single instrument be traded on multiple exchanges? Yes. Can a single exchange list multiple instruments? Yes. The cardinality between our dimensions is, therefore, many-to-many.
With this in mind, let’s take a step back and think about the role our fact table plays between dimensions. How would I answer questions like: which instruments were traded on a particular day? Or: which instruments trade on a particular exchange? In each case, I would have to traverse the fact table to establish a relationship between the two dimensions.
The fact table effectively acts as a bridge between the dimensions. This is what is meant by ‘resolving’ many-to-many relationships.
The next question we should be asking is whether a single fact table is sufficient for our analytical needs. A single fact table can only answer questions between dimensions with respect to the grain. Direct questions between dimensions cannot be answered sufficiently, because we can only establish relationships between dimension values that actually occur in the fact table. For instance, we can answer: what instruments were traded on X exchange? But we cannot answer: what instruments are listed on X exchange? The latter question also includes instruments that were never traded.
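The limitation is easy to demonstrate. A minimal sketch using Python’s built-in sqlite3 module; all table names, keys, and sample rows are illustrative stand-ins for the article’s model:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE exchanges   (exchange_id TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE instruments (instrument_id TEXT PRIMARY KEY, symbol TEXT);
    CREATE TABLE trades (
        trade_id      INTEGER PRIMARY KEY,
        exchange_id   TEXT,
        instrument_id TEXT,
        price         REAL,
        quantity      INTEGER
    );
    INSERT INTO exchanges VALUES ('NYSE', 'New York Stock Exchange'),
                                 ('LSE',  'London Stock Exchange');
    INSERT INTO instruments VALUES ('I1', 'AAPL'), ('I2', 'VOD'), ('I3', 'BP');
    -- Only AAPL traded on NYSE; VOD traded on LSE; BP did not trade at all.
    INSERT INTO trades VALUES (1, 'NYSE', 'I1', 185.0, 100),
                              (2, 'LSE',  'I2',  70.1, 500);
""")

# "What instruments were TRADED on NYSE?" -- answered by traversing
# the fact table between the two dimensions.
traded = con.execute("""
    SELECT DISTINCT i.symbol
    FROM instruments i
    JOIN trades t ON t.instrument_id = i.instrument_id
    WHERE t.exchange_id = 'NYSE'
""").fetchall()
print(traded)  # [('AAPL',)]
# BP never appears in this result, even if it is listed on NYSE,
# because it has no rows in the trades fact table.
```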
To answer direct questions between dimensions, we need another fact table that resolves the many-to-many relationship between the dimensions themselves. Such a table goes by various names (bridge table, junction table, factless fact table, mapping table), but they all refer to the same notion: directly resolving many-to-many relationships between dimensions. Let’s incorporate this into our model.
Now, with the Instrument_Exchange fact table, we’re able to answer questions relating to instruments and exchanges independent of trades. It is worth noting that multiple such tables may be required depending on business requirements.
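A minimal sketch of such a bridge table, again using Python’s built-in sqlite3; the `instrument_exchange` name echoes the article’s Instrument_Exchange table, and the sample rows are illustrative:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE instruments (instrument_id TEXT PRIMARY KEY, symbol TEXT);
    -- Bridge table: one row per (instrument, exchange) listing.
    -- It carries no measures, hence the 'factless fact table' name.
    CREATE TABLE instrument_exchange (
        instrument_id TEXT,
        exchange_id   TEXT,
        PRIMARY KEY (instrument_id, exchange_id)
    );
    INSERT INTO instruments VALUES ('I1', 'AAPL'), ('I2', 'BP');
    -- BP is listed on NYSE even though it may never appear in trades.
    INSERT INTO instrument_exchange VALUES ('I1', 'NYSE'),
                                           ('I2', 'NYSE'),
                                           ('I2', 'LSE');
""")

# "What instruments are LISTED on NYSE?" -- answerable with no trades at all.
listed = con.execute("""
    SELECT i.symbol
    FROM instruments i
    JOIN instrument_exchange ie ON ie.instrument_id = i.instrument_id
    WHERE ie.exchange_id = 'NYSE'
    ORDER BY i.symbol
""").fetchall()
print(listed)  # [('AAPL',), ('BP',)]
```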
Consider Surrogate Keys
It's useful to think of surrogate keys as ‘synthetic’ keys: keys that are deliberately engineered and added to the data model for functional reasons or out of necessity. The popular narrative is that, since fact tables already have a natural (composite) key, surrogate keys are a waste of memory. I would like to challenge this notion. First, it is simply not true in some cases. In our data model, for instance, a composite key comprising all foreign keys would not uniquely identify a trade: two trades can legitimately share the same date, instrument, exchange, and counterparty. In such cases, a surrogate key that uniquely identifies each record is mandatory; it allows genuinely identical facts with the same foreign key values to be inserted without being mistaken for duplicates.
Perhaps the biggest case for a surrogate key stems from the need to perform data fixes, backfills, or ‘upserts’ in the fact table. While such operations might seem infrequent, in my experience they happen more often than most people like to admit. A fact table that lacks a genuine primary key cannot do this effectively and will almost certainly suffer data duplication, leading to erroneous analytical results down the line. A surrogate key ensures that each record is uniquely identified, so updates land on exactly the intended rows. It also yields markedly better query performance for upserts and joins, since matching happens on a single compact column rather than a wide composite key.
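The upsert scenario can be sketched as follows with Python’s built-in sqlite3 (the `ON CONFLICT … DO UPDATE` syntax requires SQLite 3.24+); `trade_key` and the sample rows are illustrative, not a prescribed design:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE trades (
        trade_key     INTEGER PRIMARY KEY,   -- surrogate key
        date_id       TEXT,
        instrument_id TEXT,
        price         REAL,
        quantity      INTEGER
    )
""")

# Two legitimately identical trades: without the surrogate key,
# nothing would distinguish these rows during a fix or backfill.
con.executemany("INSERT INTO trades VALUES (?, ?, ?, ?, ?)", [
    (1, "2024-01-02", "AAPL", 185.0, 100),
    (2, "2024-01-02", "AAPL", 185.0, 100),
])

# A data fix (upsert) targets exactly one record via the surrogate key.
con.execute("""
    INSERT INTO trades VALUES (2, '2024-01-02', 'AAPL', 185.5, 100)
    ON CONFLICT (trade_key) DO UPDATE SET price = excluded.price
""")

rows = con.execute(
    "SELECT trade_key, price FROM trades ORDER BY trade_key"
).fetchall()
print(rows)  # [(1, 185.0), (2, 185.5)]
```

Only the second trade is corrected; its otherwise-identical twin is untouched, which is precisely the guarantee a natural composite key cannot give here.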
It's useful to think of surrogate keys as future-proofing a fact table. With virtually no downside, you’re hedging yourself against future data changes.