1 July 2016
The topic of investigating data quality as a formal, separate discipline is about 20 years old now. Classic books like Redman's (1997) "Data Quality for the Information Age" and English's (1999) "Improving Data Warehouse and Business Information Quality" opened up discussions in many companies and settings about whether data quality merits special and separate attention. Since then, a few dozen books have been written specifically about data quality. One of the main problems, though, seems to be that few people agree on what data quality really is. Depending on whom you ask, you are likely to get a wide variety of answers and definitions.
Ask an ETL programmer what data quality is, and they will point to the number of conflicts in your audit dimensions when merging disparate data sources. Front-end BI tool users might refer to the number of fields available, and to how richly those fields qualify and describe some unit of research interest, say a customer, order, or shipment. Other analysts will refer to the predictive power that certain attributes hold for producing models with great lift. And still others will talk about sparsely populated fields with too many n/a or missing values.
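To make the last perspective concrete, here is a minimal sketch (the field names and data are invented for illustration) that computes a per-field completeness rate, i.e. the share of non-missing values, one of the simplest quality metrics an analyst might report for a sparsely populated field:

```python
# Toy customer records; missing values appear as None or "n/a" (invented data).
records = [
    {"customer_id": 1, "email": "a@example.com", "phone": None},
    {"customer_id": 2, "email": "n/a",           "phone": "555-0101"},
    {"customer_id": 3, "email": None,            "phone": None},
]

# Values treated as "missing" -- extend to match your source systems.
MISSING = {None, "", "n/a", "N/A"}

def completeness(rows, field):
    """Fraction of rows in which `field` holds a real (non-missing) value."""
    filled = sum(1 for r in rows if r.get(field) not in MISSING)
    return filled / len(rows)

for f in ["customer_id", "email", "phone"]:
    print(f, round(completeness(records, f), 2))
```

Note that even this tiny metric already embeds a judgment call: whether the string "n/a" counts as data or as its absence, which is exactly the kind of definitional disagreement described above.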
What appears to be missing at this stage of maturity in our profession is an overarching framework to specify which aspect of data quality we are talking about, and where in the BI value stream it applies. Until we resolve this confusion, most conversations about data quality are likely to remain, well, pretty low quality... :-)