|
Effective periods in a data warehouse
Chip Hartney (The Datamology Company)
First written: October 13, 2022
Last updated: January 19, 2024
Contents
To communicate history in a data warehouse, with business time or system time, the data warehouse must present data records with pertinent temporal periods. Such temporal periods are thought of as “effective” periods. Each record has a point at which it became effective (i.e., started) and a point (perhaps not yet known) when it ceased to be effective (i.e., ended).
In this article, I describe effective periods, their applicability in practice, and their implementation within a data store.
For each data asset, you must determine the pertinent granularity of such temporal periods. That granularity could be any increment of time including:
· Century
· Year
· Quarter
· Month
· Week
· Day
· Hour
· Minute
· Second
· Millisecond
· Microsecond
The most common granularities in a data warehouse are day and fractions of a second (aka “timestamp”), hence the most common temporal periods are:
· Day: From a starting day to an ending day … in which case dates are used to define the end points and those end points are known as the start date and end date.
· Fractions of a second: From a starting point in time to an ending point in time … in which case timestamps are used to define the end points and those end points are known as the start time and end time.
In general, the method of describing the effective period depends on the type of subject being described:
· For immutable subjects, the data is understood to be constant throughout time. There is no effective period for such data.
· For mutable subjects, the subject changes state across time. Each state is recorded separately with its own effective period (start and end points in time). So, we use both “Effective date” and “Expiration date” for such data.
· For append subjects, the data never changes. Each record stands alone after it is recorded. PERIODs are rarely applicable to this kind of data. More commonly, these are point-in-time events for which only an “event” time is recorded.
In practice, the term “effective date” is often used (a bit misleadingly) to indicate the more general concept of “effective period”. Furthermore, the term is frequently used to describe effective periods that are not date-based. I.e., though the data may have a granularity of month, many practitioners will still use the term “effective date”. Sometimes this is because the granularity has not been decided yet and “date” is the most common. Therefore, take care to understand the context around the usage of that term.
The Allen relationships are a set of thirteen positional relationships between two time periods, as first defined in James F. Allen's 1983 article "Maintaining Knowledge about Temporal Intervals". Every possible positional relationship between two time periods is represented by one and only one member of this set:
Modern data stores support a PERIOD data type to ensure proper management of periods (including effective periods). They also provide more robust temporal capabilities (such as OVERLAP, ABUT, and INCLUDE functions).
I strongly recommend using a data store with such temporal capabilities.
Note that temporal DBMSs, especially those that provide a PERIOD data type, necessarily interpret effective periods in an inclusive/exclusive manner meaning that the “effective” point in time is included in the period and the “expiration” point in time is excluded.
If a PERIOD data type is not available for use, periods must be recorded with a pair of “begin” and “end” columns. In the case of date granularity, those columns would be:
· Effective Date
· Expiration Date
All processes, especially ETLs, must maintain the two columns in a manner that maintains the integrity of the data both within the record (effective date cannot exceed expiration date) and between records for the same entity (their effective periods cannot overlap).
To record effective periods with separate columns, the start and end points of the effective period can be interpreted in one of four ways.
Consider a period bounded by:
· Start date = August 1, 2022
· End date = August 4, 2022
The four possible interpretations are:
· Inclusive start and inclusive end … meaning the above period includes August 1, 2, 3, and 4.
· Inclusive start and exclusive end … meaning the above period includes August 1, 2, and 3.
· Exclusive start and inclusive end … meaning the above period includes August 2, 3, and 4.
· Exclusive start and exclusive end … meaning the above period includes August 2 and 3.
It is critical that periods be recorded and interpreted consistently to ensure that analytical data is accurately utilized. A consumer should never be faced with having to read the metadata for a particular data asset to understand how to interpret its effective period. Choose one approach and use it consistently.
Given the following information:
· The value was Red from the start of August 1 through the end of August 15.
· The value was Blue from the start of August 16 through the end of August 31.
In an inclusive/inclusive approach, the data would be stored as follows:
Key |
Start Date |
End Date |
Value |
1 |
2022-08-01 |
2022-08-15 |
Red |
1 |
2022-08-16 |
2022-08-31 |
Blue |
Generally, this approach is utilized for legacy reasons (before the advent of temporal data stores).
But it is useful if you want your consumers to be able to leverage the BETWEEN predicate in the SQL language (which uses inclusive/inclusive matching logic).
But inclusive/inclusive approaches are ill-advised for granularities that are subservient to the system clock. Many orgs utilized inclusive/inclusive approaches for fractional seconds in the past only to be tripped up when the system introduced support for a more precise timestamp (as happens, for example, when an application starts supporting microseconds). Using milliseconds, such periods would have been recorded as:
· 2022-08-16 12:00:00.000 – 2022-08-16 12:00:00.999
· 2022-08-16 12:00:01.000 – 2022-08-16 12:00:01.999
Once the system supported microseconds, the periods would have been upgraded as follows and a query for 2022-08-16 12:00:00.999001 would return no records at all!
· 2022-08-16 12:00:00.000000 – 2022-08-16 12:00:00.999000
· 2022-08-16 12:00:01.000000 – 2022-08-16 12:00:01.999000
Given the following information:
· The value was Red from the start of August 1 through the end of August 15.
· The value was Blue from the start of August 16 through the end of August 31.
In an inclusive/exclusive approach, the data would be stored as follows:
Key |
Start Date |
End Date |
Value |
1 |
2022-08-01 |
2022-08-16 |
Red |
1 |
2022-08-16 |
2022-09-01 |
Blue |
In this approach, the end point of one period exactly matches the start point of the following period. Always. That makes maintenance of the data much simpler and leaves you in a position to upgrade to temporal capabilities if/when they become available.
Not only is maintenance of temporal data made easier with the inclusive/exclusive approach, but it is easier to chain adjacent records together (in a query) by simply matching the end date of one period with the start data of another.
With this approach, SQL users cannot use the BETWEEN operator with implicit comparisons:
“Column Name” BETWEEN “Start Date” AND “End Date”
Instead, they must use a compound predicate with explicit comparisons:
“Column Name” >= “Start Date” AND “Column Name” < “End Date”
Given the following information:
· The value was Red from the start of August 1 through the end of August 15.
· The value was Blue from the start of August 16 through the end of August 31.
In an exclusive/inclusive approach, the data would be stored as follows:
Key |
Start Date |
End Date |
Value |
1 |
2022-07-31 |
2022-08-15 |
Red |
1 |
2022-08-15 |
2022-08-31 |
Blue |
Frankly, nobody (people or software) uses this approach because the start date values do not align with our intuition and common nomenclature.
Given the following information:
· The value was Red from the start of August 1 through the end of August 15.
· The value was Blue from the start of August 16 through the end of August 31.
In an exclusive/exclusive approach, the data would be stored as follows:
Key |
Start Date |
End Date |
Value |
1 |
2022-07-31 |
2022-08-16 |
Red |
1 |
2022-08-15 |
2022-09-01 |
Blue |
Frankly, nobody (people or software) uses this approach because the start date values do not align with our intuition and common nomenclature.
I believe that system time and business time should be managed in the same manner. If you utilize a temporal DBMS’s capability of managing system time automatically, that will be inclusive/exclusive. You should, therefore, use inclusive/exclusive for business time, too.
If you want to be able to later implement a temporal DBMS’s capability of managing periods (of any type) automatically without impacting your user community, you should use inclusive/exclusive now.