Wow. I know it has been a while since my last blog, but I have been mulling the topic on this one for a while. To set things up: I work for a certain wireless carrier who is in the process of being 'acquired' by another carrier. My job is as a business analyst for an internal BI team and we take data from a number of both internal and external data sources and 'munge' them together for a huge number of reports -- something like 400 and counting!
So the challenge, as you can imagine, is that we have been asked to be prepared to update / split out a number of these reports to support the 'Divestco' / 'Remainco' scenario; some reports would be for the parts of the business sold off -- others would be for the part of the company being acquired. We KNOW there will be changes in the upstream data sources to support the split -- we just don't know what those splits are (yet). So in preparation as a team we are documenting the various 'hops' that the data takes from the primary source through the final report(s).
Documentation. You know, the LAST thing that anyone does for a project. Or updates when changes occurs to fix a bug or change some logic. The goal, in the end, is to have a way to know which report(s) will be affected by any specific change to a given table. Which finally gets us to the topic -- how to best create / manage this type of documentation?
Currently we have started a data lineage spreadsheet -- and yes, this thing is getting WIDE! -- that looks something like this:
REPORT --> DELIVERY TYPE --> LEVEL 1 SOURCE(S) --> LEVEL 2 SOURCES(S) --> etc., etc...
So for example, if the report is an SSRS report, you might see something like this:
REPORTx --> SSRS --> DATASET --> DS1 --> SPROC --> sp_sprocname --> table 1
REPORTx --> SSRS --> DATASET --> DS1 --> SPROC --> sp_sprocname --> table 2
REPORTx --> SSRS --> DATASET --> DS1 --> SPROC --> sp_sprocname --> table 3
So Report X has one data set, which is driven by a stored procedure, which is sourced from 3 tables. Pretty straight
forward. However, there are some limitations:
forward. However, there are some limitations:
- Can you guess how many times a given sproc/view/table shows up? For a database guy can you imagine how painful it is to see the repeats :-)
- While there are repeats, those repeats could actually occur at any level -- so for one report you could see a given table at level 1, 5 or 13!
So what are other options? Well, the SQL devs on the team are looking at/considering some type of database. UI for entering data, and some type of relational back end to allow for re-use. MIGHT work -- but I wonder if we would really get all of the twists and turns.
There used to be really cool programs out there to help document SOME of these areas -- anyone remember SQL Scribe Documentation Builder by Geckoware Australia? GREAT package! However, with any 'automated' system like this you are limited to documenting each individual item (table/view/sproc) and you don't really get the connections.
Whether or not this merger, or some other merger, happens in the near future the fact that this documentation will
be very helpful for understanding the impact of upstream data changes. I am interested in what other people have found for a solution -- or are you like me and doing it the old fashioned way -- one cell at at time. Whether your company is being acquired -- or you have a contractor who may be gone tomorrow -- documentation of how your reports are put together seems to be a lost art.
be very helpful for understanding the impact of upstream data changes. I am interested in what other people have found for a solution -- or are you like me and doing it the old fashioned way -- one cell at at time. Whether your company is being acquired -- or you have a contractor who may be gone tomorrow -- documentation of how your reports are put together seems to be a lost art.