|
Managing clock time in a data
warehouse
Chip Hartney (The Datamology Company)
First written: January 25, 2023
Last updated: January 15, 2024
Contents
Daylight
saving time ambiguity
Time is maddeningly difficult to deal with. Always has been and, I expect, always will be. When it comes to clock time, it’s even more so.
In this article, I describe the complexities of clock time and comment on the implications of using various database column types to store such values. The perspective here is one of utilizing Microsoft SQL Server for a data warehousing solution in an Azure environment. But the concepts apply to most all RDBMSs and solutions.
People around the world, and therefore systems around the world, use different local clocks. A local time of 12:00 (noon) in Denver is not coincidental with the local time of 12:00 (noon) in London, for example. Similarly, while 10am is nominally 1 hour “before” 11am, 10am in Denver on June 1, 2022, is 6 hours later than 11am in London on the same day.
Two standard clocks, to which all others refer, are:
· GMT (Greenwich Mean Time) is technically based on the Earth’s rotation as measured at the prime meridian (which passes through Greenwich, England). In practice, it is established based on atomic clocks.
· UTC (Universal Time Coordinated) was established in 1972 and is based on the average of a number of atomic clocks.
Each is based on a continuous flow of time without adjustments (other than to occasionally synchronize them with their underlying atomic clocks). Because they use the same calendar and start of day, GMT and UTC are functionally and practically equivalent.
Nobody (I think) uses one of these standard clocks in real life. Instead, the local government decides the clock to be used and orients that local clock to a standard clock (so that everyone understands the local clock).
A time zone (TZ) is a geographic area in which clocks are nominally the same. Because we use a 24-hour clock and prefer increments of 1 hour, the world is roughly divided into 24 TZs, each offset 1 hour from the prior one. However, due to international borders and the international date line, there are roughly 37 TZs (last I checked).
If the world were a simple place, everyone in a given time zone would use the same clock.
But, of course, that’s not the case. Local governments within each TZ can, and usually do, decide to implement their own clock.
Arizona and Colorado, for example, both lie in the Mountain Time Zone of North America. But they follow different clocks!
Many governments around the world have also adopted Ben Franklin’s suggestion that adjusting their local clock during the summer is a good idea (though saving candles is probably no longer the reason). Such policies go by a variety of names (“daylight saving time” in the US, “British summer time” in the UK) and usually adjust by an hour (though some locations adjust by 30 or 45 minutes).
In the US, most residents of the Mountain Time Zone (as of January 2023, at least) observe daylight saving time (DST) which is established by the US govt as follows:
· Begins on the 2nd Sunday of March at 2:00am (when clocks “spring forward” one hour).
· Ends on the 1st Sunday of November at 2:00am (when clocks “fall back” one hour).
Colorado passed a law in October 2022 which will change Colorado clocks permanently to MDT. But that law will only take effect if federal law is also changed. The US Senate has passed such a law. The US House has (as of Jan 2023) not yet acted on it, but may soon.
If Colorado changes, one motivation for handling time zones explicitly may be reduced.
Governments around the world decide the following in establishing a local clock (often confusingly called a “time zone”):
· The local clock’s offset (in minutes) from the standard UTC clock.
· Whether/when/how any DST adjustments are made to the clock during the year.
There are many, many different clocks utilized around the world.
“US Mountain Time” is one such clock. It is the one that Colorado utilizes. In that clock (as of Jan 2023):
· The standard offset from UTC is 7 hours (earlier).
· DST is observed as described in the prior section.
Note: Legal changes are afoot (as of January 2023) in which Colorado may change its clock to:
· The standard offset from UTC is 6 hours (earlier).
· No DST adjustments are made during the year.
These clocks are defined in the following Windows Registry hive (by key) for use by any application (such as SQL Server):
KEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Time Zones.
Among them are:
· "Mountain Standard Time" which Colorado uses and includes DST adjustments (despite its name).
· "US Mountain Standard Time" which Arizona uses and includes no DST adjustments.
Given that this registry may not be up to date with the latest government decisions around the world, it is possible that the software on a Windows machine does not accurately reflect reality on the ground.
In SQL Server, use CURRENT_TIMEZONE to determine the time zone of the Managed Instance. The returned value is based on the time zone of the instance itself (assigned during instance creation), not the time zone of the underlying operating system.
Azure SQL Managed Instance offers a choice of time zones to meet the needs of existing applications that store date and time values and call date and time functions with an implicit context of a specific time zone. T-SQL functions like GETDATE() or CLR code observe the time zone set on the instance level. SQL Server Agent jobs also follow schedules according to the time zone of the instance.
Azure SQL Database does not support time zone settings; it always follows UTC.
For more info:
Azure SQL Managed Instance time zones - Azure SQL Managed Instance | Microsoft Learn
RDBMS offer (usually) the following data types to contain date and/or time values:
· TIME (time of day without associated date)
· DATE (date without associated time)
· DATETIME (date and time without anchoring to UTC)
· TIMESTAMP (date and time with anchoring to UTC)
In SQL Server, the datetime options are:
· SMALLDATETIME (4 bytes):
o 1900-01-01 through 2079-06-06
o no fractional seconds
o not anchored to UTC
· DATETIME (8 bytes):
o 1753-01-01 through 9999-12-31
o fractional second accuracy to .003
o not anchored to UTC
· DATETIME2(prec) (6-8 bytes):
o 0001-01-01 through 9999-12-31
o fractional second accuracy to .0000001
o not anchored to UTC
· DATETIMEOFFSET(prec) (8-10 bytes):
o 0001-01-01 through 9999-12-31
o fractional second accuracy to .0000001
o anchored to UTC via offset (in minutes)
Of the above, only DATETIMEOFFSET is able to store both the local time and the offset to UTC, meaning that it is the only data type in which we can fully handle times from different clocks including:
· Retain the local time of each value.
· Retain the offset to UTC.
· Sort the values correctly.
· Compare the values correctly.
Note that, in using DATETIMEOFFSET, the local time is converted (if you will) to its critical parts:
· The local time, itself.
· The offset to UTC at that moment.
In such, we faithfully record the pertinent moment in time exactly. We do not, however, store the “clock”. I.e., we cannot know which clock the time was based on. As an example, both of the following would be recorded identically in a DATETIMEOFFSET column as “2022-12-25T10:00:00.0000000 -07:00”:
· 10am on December 25, 2022, in Denver.
· 10am on December 25, 2022, in Phoenix.
That is location information which might be stored in other columns (such as the location of the ATM at which the transaction took place).
For comparison, the above values would be stored as “2022-12-25T10:00:00.0000000” in a DATETIME2 column. No UTC offset info is stored. Returning to our Denver/London scenario:
· 10:00am in Denver would be stored as “2022-06-01T10:00:00”.
· 11:00am in London would be stored as “2022-06-01T11:00:00”.
Because there is no offset stored, we would mistakenly conclude that that former preceded the latter when, in reality, the former was 6 hours after the latter.
In an RDBMS such as SQL Server, anchored datetime values can be stored in a DATETIMEOFFSET column:
· 10:00am in Denver (US Mountain Time Zone, DST) would be “2022-06-01T10:00:00 -0600” (6:00 behind UTC).
· 11:00am in London (West Europe Time Zone, BST) would be “2022-06-01T11:00:00 +0100” (1:00 ahead of UTC).
And it can be seen that 10am in Denver is 6 hours later than 11am in London on that day. I.e., we can now meaningfully sort datetime values.
If one chooses not to use an anchored datetime column, one must manipulate each time value to a standard time before storing it in the unanchored datetime column (like DATETIME2).
For example, one could choose to adjust all datetimes to UTC before storing the value in the DB:
· 10:00am in Denver would be adjusted by the -6-hour offset to 4:00pm UTC before storage.
· 11:00am in London would be adjusted by the +1-hour offset to 10:00am UTC before storage.
Again, we can determine the sequence (London before Denver) and the difference (6 hours).
But we would have no way to recover the local times from the stored times (which are not anchored).
If the org is not spread across time zones, one could choose to adjust all datetimes to the local time in the pertinent time zone (say US Mountain):
· 10:00am in Denver would not be adjusted (it is already in the correct time zone) before storage.
· 11:00am in London would be adjusted by the 7-hour difference between London and Denver to 04:00am before storage.
Again, we can determine sequence and difference. And the stored times have more meaning to the org because they match its local time.
But, again, we would have no way to recover the local times from the stored values.
Unless it happened to be the case that every time ever provided to the DW used the same local value. In such a case, there would never be any adjustments and every local value could safely be assumed to apply to the time zone of the org.
Therefore …
If every datetime value ever supplied to the DW is guaranteed to be based on one and only one clock and that clock is pertinent to every consumer of the data, an unanchored DATETIME column can be used to store those supplied datetime values. Doing so would have the benefit of saving a couple bytes of storage (not having to store the anchoring information … such as an offset).
If there will never be a need to know the local time that was the basis of the stored (converted) value, an unanchored DATETIME column can again be used to store those supplied datetime values (with the same benefit of saving some storage). However, if there is any possibility of being provided datetime values based on any other clock, then supplied datetime values must first be converted to the expected local time before being stored in that unanchored DATETIME column.
If it is possible to receive datetime values anchored to differing clocks and there may be a need to know the local time, then an anchored datetime column (such as DATETIMEOFFSET) must be used. In order to preserve the ability to determine the original local time, values must not be converted before storing them in the DW. The datetime portion of the DATETIMEOFFSET column will retain the local time.
Because a DW is supplied data (including datetime values) from other systems and because those systems may not indicate the clock on which those values are based, there can be ambiguity in one of those supplied values. Specifically, the hour between 1:00am and 2:00am occurs twice on the “fall back” day:
· 1:00am-2:00am MDT (US Mountain Time with DST) occurs first
· 1:00am-2:00am MST (US Mountain Time w/o DST) occurs next
Consider the time of 1:30am. It occurred twice. If the source system supplied an indication of the pertinent clock (MST or MDT), all is well. If not, there is ambiguity (which, being a DW, we abhor).
The opposite problem occurs when clocks spring forward:
· 1:00am-2:00am MST (US Mountain Time w/o DST) occurs first
· 2:00am-3:00am never occurs!
· 3:00am-4:00am MDT (US Mountain Time with DST) occurs next
There is no ambiguity … unless the source system provides a time in the non-existent gap (such as 2:30am)!
A policy must be implemented to handle these ambiguities so that we know what value to actually store in the DW. Note, we may choose to reject the invalid time in the spring forward gap outright.
SQL Server (as described for its AT TIME ZONE function) uses the following policy (which seems as good as any):
· In the fall-back scenario, the ambiguous time is presumed to be DST.
· In the spring-forward scenario, the ambiguous time is adjusted forward and then presumed to be DST.
A data warehouse is not typically concerned with hour-by-hour precision in reporting. I.e., we may well be interested in how many services were provided on a given day but may not be interested in their specific order on that day.