It is more difficult to conduct updates and deletes on a fact table, than to simply insert new rows. If the source system handles the updates and deletes, then this is the preferable option since it is already in place. Unfortunately, however, this is not always the case. How is it possible to handle unaudited updates and deletes?
If the changed rows are flagged with a ‘Modified’ date column, then this means we can more easily identify the rows that have been changed. If this isn’t the case, how is it possible to go about identifying which rows have been changed? There are a few options:
Truncate and re-load the fact table at every load – this isn’t a good suggestion because some fact tables may be very frequently updated. This would not provide a good service to users, who rely on the information from the fact table.
Ask your friendly DBA to look at helping you e.g. triggers, replication, or adding a ‘Modified’ date to the source transaction data – this is a better idea. It is better to push the issue ‘upstream’ so that the data is loaded as simply into the new data warehouse as possible. This also allows the opportunity for re-use, which is always good. If you’d like to know more about using replication to find unaudited changes in the fact table, please see Duncan Sutcliffe’s excellent post here.
It’s better to distinguish between ‘soft’ deletes and ‘hard’ deletes.
• ‘Soft’ deletes means that the row is not actually deleted, but simply flagged as deleted.
• ‘Hard’ deletes – it’s gone!
Soft deletes are obviously better; an additional column can be added as a flag, and the flag is set to denote whether the record is deleted or not.
The Kimball methodology argues that updates and deletes should be handled as transactions. However, that isn’t possible in every case, and it’s probably better to apply this strategy for incremental data such as sales. Let’s take a non-incremental example, where a company has phone record data. The phone record may hold information initially such as the start time of the call. However, any ‘on hold’ time or ‘end time’ associated with the call cannot be recorded until the call ends. Thus, the update needs to be done ‘in situ’ or there will be lots of additional rows with empty columns, all for one call.
Although the structure can be difficult to decide, Kimball usually has some good words of wisdom; the trick is knowing when to apply it to your particular circumstance and user requirements, and I do recommend reading Kimball as often as possible!