Lookup tables in a data warehouse
Chip Hartney (The Datamology Company)
First written: April 14, 2023
Last updated: April 19, 2023
Relational databases depend on key-based relationships between tables. The primary key (PK) in one table can be used as a foreign key (FK) in another table. In that manner, records in the two tables are understood to be related to each other. And, with the power of SQL, a user can merge those related records via a JOIN operation. The FK in one table is a reference to the PK of the other table. In that sense, any table with a primary key can act as a reference table for another.
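The mechanics can be sketched with a pair of hypothetical tables (all table and column names here are illustrative, not part of any particular model):

```sql
-- A minimal sketch of a PK/FK relationship.
CREATE TABLE MeetingType (
    MeetingTypeCode CHAR(1) PRIMARY KEY,          -- the PK
    MeetingTypeName VARCHAR(20) NOT NULL
);

CREATE TABLE Meeting (
    MeetingId       INT PRIMARY KEY,
    MeetingTypeCode CHAR(1) REFERENCES MeetingType (MeetingTypeCode)  -- the FK
);

-- A JOIN merges the related records from the two tables:
SELECT m.MeetingId, t.MeetingTypeName
FROM Meeting AS m
JOIN MeetingType AS t
  ON t.MeetingTypeCode = m.MeetingTypeCode;
```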
But the term “reference table” is typically used in a more specific sense. It is usually intended to describe a table that contains a set of domain values that, in turn, are used to describe something else. E.g., the codes used to describe a type of meeting (in-person, virtual) or the outcome of an activity (win, lose) or a status (active, dormant, defunct), etc.
The concept frequently extends beyond those simple lookups to reference data that acts as a constraint on the valid values in the referring tables. For example, we expect a US-based address to use a state from the predefined list of states (Alabama, Alaska, etc.) just as a meeting type must be one of the predefined values (in-person or virtual, for example). From a data warehouse (DW) point of view, such lists define the domain to which attribution is constrained.
In this article, I propose a definition of lookup tables and describe my approach to managing them in a data warehouse.
With respect to an enterprise data model, source systems almost certainly have their own reference tables. However, in designing a model that is independent of the source systems being utilized, we also design the domains independent of those source systems. Then we map the data coming from the source systems to these reference domains. I.e., a source system may declare a meeting to be of type “2” but, in our enterprise data model, we use “V” to indicate a virtual meeting. And, through a reference table of meeting types, we translate that “V” (a code) to “Virtual”.
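Such a mapping typically lives in the ETL. A minimal sketch, assuming a hypothetical staging table and source codes "1" and "2":

```sql
-- Hypothetical translation of source-system meeting-type codes ("1", "2")
-- to the enterprise domain codes ("P", "V"); all names are illustrative.
SELECT MeetingId,
       CASE SourceMeetingType
           WHEN '1' THEN 'P'   -- in-person
           WHEN '2' THEN 'V'   -- virtual
       END AS MeetingTypeCode
FROM StagedSourceMeeting;
```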
Bottom line: We think of lookups as the reference tables that are declared to constrain our data model as opposed to information coming from any of our source systems. If the source system defines the concept (and associated values), then the concept should be modeled as a standard concept (entity or object). If, however, our source systems do not define the concept, we establish it manually or populate it from external (expert/definitive) sources.
A lookup, therefore, is a reference table that is managed by the dev team as opposed to reference tables that are derived from source systems. The latter should be created in the standard enterprise model using ETLs that populate them based on collected data from the source systems.
Lookups, however, are created and managed directly by the dev team. By managed, we mean that the dev team controls the content of the table. If there is an ETL at all, it is used carefully and, probably, sparingly. And it likely wouldn't be based on collected data. The point is that such a lookup would be used to define and control what gets written to other parts of the enterprise model. That is (and here is the critical point), their content must be synchronized with the ETL logic (the actual code) used to populate that enterprise model! And only the dev team writes that code. So only they can determine what records need to be in the lookup.
Consider the case where we decide that the Client table requires an indication of client proximity ... near or far. Our first approach is likely to be one in which we simply add a Proximity column to the Client table and have the ETL populate it with "Near" or "Far" literals. Note that the ETL populates it and uses logic (written by the dev team) to determine what value should be written for any given record. This is all fine and good. SOP.
But we may then take it to the next level (one of storage efficiency), choosing to record a code instead of a value. I.e., "N" instead of "Near" and "F" instead of "Far". And we'd likely rename the Client table's Proximity column to ProximityCode. The ETL still uses the same logic, but writes "N" or "F" instead of the longer literals. And, of course, we would document (in the metadata) the meanings of the ProximityCode values ("N" and "F"). Again, all fine and good.
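The ETL logic at this level might be sketched as follows (the DistanceMiles column and the 100-mile threshold are assumptions for illustration):

```sql
-- Hypothetical ETL step: derive and write the code rather than the full value.
UPDATE Client
SET ProximityCode = CASE
        WHEN DistanceMiles <= 100 THEN 'N'   -- "Near"
        ELSE 'F'                             -- "Far"
    END;
```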
The final level is one in which we provide the consumers a way to translate the codes without having to write their own CASE statement (based on the metadata). So we add a ProximityCode lookup table with two records (N/Near and F/Far). Now a consumer can retrieve the longer names with a simple JOIN to the lookup table. Note that the lookup table was created by the dev team (who also INSERTed the two records). There is no dependence on any source system! There is no ETL processing source data that was first collected from a source system. It is simply a translation mechanism provided by the dev team to improve usability of the enterprise model. This is what we mean by a lookup table.
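A sketch of that final level, assuming the lookup table is named ProximityLookup (the names are illustrative):

```sql
-- The dev team creates and populates the lookup directly; no source data involved.
CREATE TABLE ProximityLookup (
    ProximityCode CHAR(1) PRIMARY KEY,
    ProximityName VARCHAR(20) NOT NULL
);

INSERT INTO ProximityLookup (ProximityCode, ProximityName)
VALUES ('N', 'Near'),
       ('F', 'Far');

-- A consumer translates the codes with a simple JOIN:
SELECT c.ClientId, p.ProximityName
FROM Client AS c
JOIN ProximityLookup AS p
  ON p.ProximityCode = c.ProximityCode;
```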
If we ever decide that we will start using a third proximity (such as "In-between"), a few things must happen:
· We add the I/In-between record to the lookup table.
· We change the ETL logic to choose the applicable code based on new criteria.
· We update the metadata.
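The first two steps can be sketched in SQL (the lookup name, mileage column, and thresholds are illustrative assumptions; the metadata update happens outside SQL):

```sql
-- 1. Add the new record to the lookup table:
INSERT INTO ProximityLookup (ProximityCode, ProximityName)
VALUES ('I', 'In-between');

-- 2. Change the ETL logic to choose among three codes:
UPDATE Client
SET ProximityCode = CASE
        WHEN DistanceMiles <= 50  THEN 'N'   -- "Near"
        WHEN DistanceMiles <= 200 THEN 'I'   -- "In-between"
        ELSE 'F'                             -- "Far"
    END;
```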
Note (crucially) that the dev team did all of the above. Again, there is no dependence at all on any source system nor any ETL based on data collected from a source system. This effort merely reflects our choice that we'd prefer to use three proximities going forward instead of two.
Because the ETL only affects records that it writes going forward, the dev team is faced with another decision: Should we rebuild history? I.e., should we reprocess all of the existing records (written with the old ETL logic) so that those records also use the new In-between proximity (where appropriate)? That is an entirely separate and orthogonal question. It does not change anything about our establishment of the third category. It's a judgment call that is made on a case-by-case basis. Note that, if you choose to build a really robust DW, you are theoretically able to rebuild history by simply DELETEing all records from the Client table and running the ETL again. This works only if the ETL robustly rebuilds missing/incorrect records and no source data required by the ETL has been removed from the system.
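That theoretical rebuild might look like this, assuming the ETL is packaged as a stored procedure (the procedure name is hypothetical):

```sql
-- Rebuild history by clearing the table and re-running the ETL.
-- Only valid if the ETL can fully reconstruct every row from
-- source data that is still available to it.
DELETE FROM Client;
EXEC usp_LoadClient;   -- re-applies the current logic, including the new 'I' code
```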
Finally, let's consider the case where you decide that you wish to use different translated values. Perhaps "Close", "Moderate", and "Distant" instead of "Near", "In-between", and "Far". All you need to do is UPDATE the values in the lookup table (and the metadata). Every report instantly changes. As for reproducing or auditing reports produced in the past, you are covered if the DW tracks system time. To reproduce something in the past, you use a query with a lookback clause such as (for SQL Server):
FOR SYSTEM_TIME AS OF <timestamp>
That clause causes the query to use the versions of the Client and lookup tables that existed at the specified datetime. If that datetime is before the ETL change was made, the Client table will contain only the "N" and "F" codes and the lookup table only the "Near" and "Far" values.
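A reproduction query might look like this (the table names and timestamp are illustrative; both tables are assumed to be system-versioned temporal tables):

```sql
-- Query both tables as they existed at the earlier point in time.
SELECT c.ClientId, p.ProximityName
FROM Client FOR SYSTEM_TIME AS OF '2023-01-15T00:00:00' AS c
JOIN ProximityLookup FOR SYSTEM_TIME AS OF '2023-01-15T00:00:00' AS p
  ON p.ProximityCode = c.ProximityCode;
```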