Sunday, November 27, 2011

Mergers, Documentation and... Data Lineage?

Wow. I know it has been a while since my last blog, but I have been mulling the topic on this one for a while.  To set things up:  I work for a certain wireless carrier who is in the process of being 'acquired' by another carrier.  My job is as a business analyst for an internal BI team and we take data from a number of both internal and external data sources and 'munge' them together for a huge number of reports -- something like 400 and counting!

So the challenge, as you can imagine, is that we have been asked to be prepared to update / split out a number of these reports to support the 'Divestco' / 'Remainco' scenario;  some reports would be for the parts of the business sold off -- others would be for the part of the company being acquired.  We KNOW there will be changes in the upstream data sources to support the split -- we just don't know what those splits are (yet).  So in preparation as a team we are documenting the various 'hops' that the data takes from the primary source through the final report(s).

Documentation.  You know, the LAST thing that anyone does for a project.  Or updates when changes occurs to fix a bug or change some logic.  The goal, in the end, is to have a way to know which report(s) will be affected by any specific change to a given table.  Which finally gets us to the topic -- how to best create / manage this type of documentation?

Currently we have started a data lineage spreadsheet -- and yes, this thing is getting WIDE! -- that looks something like this:

REPORT --> DELIVERY TYPE --> LEVEL 1 SOURCE(S) --> LEVEL 2 SOURCES(S) --> etc., etc...

So for example, if the report is an SSRS report, you might see something like this:

REPORTx --> SSRS --> DATASET --> DS1 -->  SPROC --> sp_sprocname --> table 1
REPORTx --> SSRS --> DATASET --> DS1 -->  SPROC --> sp_sprocname --> table 2
REPORTx --> SSRS --> DATASET --> DS1 -->  SPROC --> sp_sprocname --> table 3

So Report X has one data set, which is driven by a stored procedure, which is sourced from 3 tables.  Pretty straight
forward.  However, there are some limitations:
  • Can you guess how many times a given sproc/view/table shows up?  For a database guy can you imagine how painful it is to see the repeats :-)
  • While there are repeats, those repeats could actually occur at any level -- so for one report you could see a given table at level 1, 5 or 13!
For some of the reports we even had some Visio diagrams to show all the connections visually.  Makes for a nice picture, but it's a LOT of work to pull those together.

So what are other options?  Well, the SQL devs on the team are looking at/considering some type of database.  UI for entering data, and some type of relational back end to allow for re-use.  MIGHT work -- but I wonder if we would really get all of the twists and turns.  

There used to be really cool programs out there to help document SOME of these areas -- anyone remember SQL Scribe Documentation Builder by Geckoware Australia?  GREAT package!  However, with any 'automated' system like this you are limited to documenting each individual item (table/view/sproc) and you don't really get the connections.

Whether or not this merger, or some other merger, happens in the near future the fact that this documentation will
be very helpful for understanding the impact of upstream data changes.  I am interested in what other people have found for a solution -- or are you like me and doing it the old fashioned way -- one cell at at time.  Whether your company is being acquired -- or you have a contractor who may be gone tomorrow -- documentation of how your reports are put together seems to be a lost art.


Wednesday, May 12, 2010

Your Data Don't Match....

Inspriration to write has finally come again from one of Jim's fabulous rock-n-roll themed posts: The Data Rock Stars:  The Rolling Forecasts.  If you are on Twitter, and care about data, you MUST follow these fabulous guys and gals...

First we had Rush -- and my response with Styx -- and now Jim channels the Rolling Stones.  OK... I like the 'classic rock' formula as much as the next guy, but I wouldn't be true to myself if I didn't pay homage to the boys in Poison -- and yes, before you say it, I know that Loggins and Messina did it first.  And yes, that is a great version as well.  But today I was driving in the beautiful Pacific Northwest, sun streaming down, and had this 1989 cover cranked up -- and yet I still couldn't get Jim's blog out of my head...  (I know -- wierd...)

SO:  I give you "Your Data Don't Match..." in the style of Poison's "Your Mama Don't Dance..."

Hope you like it!



Your data don’t match and management sez “Heads will roll!”
Your data don’t match and your weekend is screwed, AGAIN!
But when the weekend comes around and it’s time to buy a round
Where do you go – you got to doc it!

The BRD’s are bad and the analysts don’t comprehend…
But if you don’t deliver something soon your career could soon be at an end
You know there’s no excuse, you have to document, you gotta bend
And diagram again…

And it’s all because


Everyone gather round the table let me get your sign-off with this pen…
You see I need to make sure we all agree on the data lineage from A to Z.
We just have to get the VP to release the funds for the vendors that number over ten…
We’re just about to move thinkin’ “Berry, this is a breeze”
There’s an email coming in and a guy says “Three C.Rs., that’s fair!”

Oowee – “you’re making me scream!”
Said the poor P.M.


Play me a little C.C.


Your data don’t match, no
It just won’t match, no
Your data don’t match and your weekend is screwed again…
Ah. Yeah.

Songwriters: Kenny Loggins and Jim Messina
Released in 1989 by Poison.

Sunday, February 21, 2010

Data as the Grand Illusion...

I have been lucky in my professional life to have met some really fascinating and knowledgeable people who have guided my data quality journey over the past few years. One of those persons is Jill Dyche (@JillDyche) with Baseline Consulting. After working with them as a customer at Microsoft, I then had the opportunity to sit down with her and her partner at Baseline Evan Levy during some fabulous off hours discussions during the Fall 2008 session of TDWI's World Conference in San Diego. Imagine being able raise a beer (or two) with some of the best in YOUR industry — and just hoping that some of that knowledge somehow would rub off on you and/or that you retain a tenth of that discussion...

Fast forward to 2009 — I continued to follow Jill and Evan via various blogs and web postings, and with the introduction of Twitter I now had another great outlet for getting my data quality info. It was via Jill that I got 'introduced' to Jim Harris and his outstanding @ocdqblog postings. It is really great to read the interactions between Jim, Jill and others of our industry, and to be able to learn from them...

So imagine my surprise and delight to find a tweet from Jill highlighting a recent post by Jim — where the challenge was for Jim to write an article about data quality in the style of a Rush song! TOO COOL! Now, for the record there is really only ONE Rush song I like — Tom Sawyer — but after reading Jim's post he not only captures the essence that IS Rush, but he manages to do a really great job of visualizing some of the challenges we have on selling data quality to our customers and co-workers.

So — I thought I would share my own take on this challenge and discuss data quality in the style of Styx's The Grand Illusion... I think too often we try to explain data quality issues and practices to those we are working with, and they just don't want to see it. I thought about following up with "Come Sail Away" as an encore, but figured that I had pushed this far enough.

Thank you Jim for a great read. Thank you Jill for your inspiration. Thank you Dennis DeYoung and Tommy Shaw for your fabulous music.

The Grand (Data) ILLUSION
Data can be a Grand Illusion
Run that report and see what's happening
Run your scripts, get a profiler just for show…

The staging db’s set, are audits running?
Suddenly your DBA’s heart is pounding
Wishing secretly they had backed up that cube.

But don't be fooled by those analysts
The managers or the CIOs
They’ll show you graphs of how your data should be
But they're just someone else's fantasy.

So if you think your data is complete confusion
Because you never match the source
Just remember that it's a Grand illusion
And without signed specs we’re placing blame…
And it’s all a game...

So if you think your data is complete confusion
Because your data steward said it was…
Just remember that it's a Grand illusion
And yet we can make it match again.
Auto correct that name...

Profiling spells competition, join us in our blind ambition
Get yourself a brand new cloud server
Someday soon we'll stop to ponder what the TSQL meant, I’ll always wonder
Jim’s made the grade and still we wonder if Jill and Evan are buying the next round?

I would love to hear what you think... drop me a note at berry at jbrcc dot com.