Going meta on metadata

Tom Breur

25 December

Any data professional can tell you that we have always depended on metadata to describe data and facilitate data access. It is often a make-or-break factor for BI projects. When metadata are in bad shape, people working with data cannot do their jobs well, and maybe even worse: the data cannot be trusted. People sometimes use the word “metadata” as if it’s a monolithic thing, but of course it isn’t. The reality is we depend on various types of metadata. Metadata is the glue that holds our applications together.

Bill Inmon, author of “Business Metadata: Capturing Enterprise Knowledge” (2007), makes a distinction between technical and business metadata. “Technical metadata […] serves the technical practitioner, programmer, or DBA. Technical metadata has been seen as important to managing the technical environment, especially as these environments get increasingly more complex. Business metadata concerns itself with assisting business people, nontechnical users, in understanding the data. It adds context to the data. It is meant to communicate with business people and not the technicians.” (Inmon et al., 2007, p. 2)

Books on this topic for the BI space have dealt mostly with technical metadata, like Marco (2000), Powerantz (2015), etc. A Google search most often points to material that librarians have developed around this topic. In this post, I focus on the BI space, where metadata is just as important, if less illuminated. Although there appears to be no definitive text on this subject, BI practitioners nowadays often distinguish three types of metadata: technical, business and operational. I would argue that this refinement mostly singles out operational metadata as a special kind of technical metadata. Other terms I have come across (in no particular order) are: Process metadata, Data Quality metadata, Physical metadata, Conceptual metadata, Quality metadata, Data Model metadata, Business Rule metadata, and Business Architecture metadata, and undoubtedly many more terms are in use today.

We refer to “technical metadata” when it is required for automated software access, to enable proper functioning of applications, etc. “Business metadata” are the ‘human language’ descriptions that non-technical business people can understand, and that they need to interpret the data and give it adequate context. The term “operational metadata” is used to reference information about lineage and data logistics, but also usage and privileges. Operational metadata also supports automation and code-generation efforts. So expect to find information like lineage, mappings and transformations, but also recorded details about data access.

Metadata appears in many forms. Technical metadata (in its narrower definition) is typically contained in a tool and formatted as required to run a database, for instance. An RDBMS requires you to define field type, length, etc., so that the engine ‘knows’ how to process the content. A 255-character text field needs to be assigned the proper block size so that no information gets truncated. A Boolean field can be handled more efficiently because it is so much smaller, numeric fields come in different sizes (“precision”), etc.
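To make this a bit more tangible, here is a minimal sketch of what technical metadata looks like when you ask a database for it. It uses SQLite through Python's standard sqlite3 module; the customer table and the describe_columns helper are made up for illustration. Note that SQLite records the declared type and length but does not enforce the length, so treat this as a view on the catalog rather than a statement about storage.

```python
import sqlite3

def describe_columns(conn, table):
    """Return (name, declared_type, not_null) for each column of `table`."""
    # PRAGMA table_info yields rows of (cid, name, type, notnull, dflt_value, pk).
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    return [(name, decl_type, bool(notnull))
            for _cid, name, decl_type, notnull, _default, _pk in rows]

# Hypothetical table, only here to have something to describe.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer ("
    " customer_id INTEGER NOT NULL,"
    " name VARCHAR(255) NOT NULL,"
    " is_active BOOLEAN,"
    " balance NUMERIC(12,2))"
)

columns = describe_columns(conn, "customer")
for column in columns:
    print(column)
```

Other engines expose the same kind of information through their own catalog views, so the query would differ even though the idea is the same.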

Business metadata probably comes in the widest variety of forms. Since collaboration around the context of data is so important, solutions need to provide (very) easy access, and facilitate linking of information sources. Standardization of descriptions and convenient access (and updates!) are key for success. Business metadata can be stored in just about any form: an Excel sheet, a Word document, a database, a wiki, or a purpose-built metadata tool that enables efficient collaboration and documentation. It can also be created as executable specifications that are directly used for test-driven ETL development and regression testing.

Operational metadata typically describe or log how data are moved, transformed or used. This covers data lineage, but also event processing such as the loading of files (or failure thereof), updates and inserts, who made those changes, change dates, etc. Lineage itself is a sizeable topic, worthy of some elaboration. For compliance or regulatory purposes, it might be required that access privileges are made explicit and documented. Most RDBMSs have this kind of functionality built in; for NoSQL platforms it may be (and often is) custom built.
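As an illustration of the event-logging part, the sketch below wraps a file load so that every attempt, successful or not, leaves behind a record of what was loaded, by whom and when. The names (LoadEvent, load_with_audit) and the in-memory lists that stand in for a target table and an audit table are assumptions of mine, not a reference to any particular tool.

```python
import csv
import io
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class LoadEvent:
    """One operational-metadata record describing a single load attempt."""
    source: str
    status: str
    rows_loaded: int
    loaded_by: str
    loaded_at: str
    error: str = ""

def load_with_audit(source_name, csv_text, target_rows, audit_log, user):
    """Load CSV text into target_rows and always append a LoadEvent."""
    loaded_at = datetime.now(timezone.utc).isoformat()
    try:
        rows = list(csv.DictReader(io.StringIO(csv_text)))
        if not rows:
            raise ValueError("no data rows found")
        target_rows.extend(rows)
        event = LoadEvent(source_name, "success", len(rows), user, loaded_at)
    except (csv.Error, ValueError) as exc:
        # A failed load is logged too: the failure is itself metadata.
        event = LoadEvent(source_name, "failed", 0, user, loaded_at, str(exc))
    audit_log.append(event)
    return event

target, audit = [], []
ok = load_with_audit("customers.csv", "id,name\n1,Ann\n2,Bob\n", target, audit, "etl_user")
bad = load_with_audit("empty.csv", "id,name\n", target, audit, "etl_user")
for event in audit:
    print(event)
```

The point of the design is that the log entry is written in both branches, so the audit trail does not depend on the load going well.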

Data lineage can fulfill various functions. We distinguish backward and forward lineage. Backward lineage describes where and how a particular data point originated; forward lineage tells you how and where a source field is being used. Backward lineage is crucial when you need to explain, for instance, how a particular calculation in a report was derived. Forward lineage is important when you try to assess which reports or downstream systems might be affected by a corrupt or missing data point. Note that this can have both technical and business implications. The distinction between technical and business metadata isn’t always hard and fast. Data lineage metadata is hard to present in a physical document because of the many-to-many relationships between sources and targets. Solutions that do text analytics on data processing code and publish this metadata in a graph database would be very valuable, especially in the “schema on read” world.
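Because lineage is many-to-many, it is naturally a graph, and both directions come down to the same traversal over different edge sets. The sketch below assumes a handful of hypothetical source-to-target mappings (the field names are invented) and walks them forward for impact analysis and backward for provenance.

```python
from collections import defaultdict

# Hypothetical mappings: each pair reads "target is derived from source".
MAPPINGS = [
    ("crm.orders.amount", "dwh.fact_sales.revenue"),
    ("crm.orders.discount", "dwh.fact_sales.revenue"),
    ("dwh.fact_sales.revenue", "report.monthly_revenue"),
    ("dwh.fact_sales.revenue", "report.sales_forecast"),
]

def build_index(mappings):
    """Index the mappings in both directions."""
    forward, backward = defaultdict(set), defaultdict(set)
    for source, target in mappings:
        forward[source].add(target)
        backward[target].add(source)
    return forward, backward

def trace(start, edges):
    """Return every node reachable from `start` by following `edges`."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for neighbour in edges.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return seen

forward, backward = build_index(MAPPINGS)

# Forward lineage: what is affected if this source field is corrupt?
impacted = trace("crm.orders.amount", forward)
# Backward lineage: where did this report figure come from?
origins = trace("report.monthly_revenue", backward)
print(sorted(impacted))
print(sorted(origins))
```

A graph database would do the same job at scale; the dictionaries here only serve to show why a flat document struggles to represent it.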

As part of data interchange, metadata (both technical and business metadata) need to be passed along with the actual content that gets transferred. Metadata can be embedded, as in a designated header, or can sometimes be deducible at run-time. When metadata are embedded, they are contained within the feed and recognizable as a separate part of the data that describes the content. Sometimes the metadata may be provided separately, as a descriptor that accompanies the data.
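One way to picture embedded metadata is a feed that carries its own field descriptions in a clearly separated part of the payload. The sketch below uses a JSON envelope with a "metadata" and a "records" section; that layout, the type labels and the function names are assumptions for the sake of the example, not an established interchange standard.

```python
import json

# Hypothetical type labels mapped to Python conversions.
CASTERS = {"integer": int, "decimal": float, "text": str}

def pack_feed(records, fields):
    """Sender side: ship the field descriptions together with the content."""
    return json.dumps({"metadata": {"fields": fields}, "records": records})

def unpack_feed(payload):
    """Receiver side: use the embedded metadata to interpret the content."""
    feed = json.loads(payload)
    types = {field["name"]: field["type"] for field in feed["metadata"]["fields"]}
    return [
        {name: CASTERS[types[name]](value) for name, value in record.items()}
        for record in feed["records"]
    ]

fields = [
    {"name": "order_id", "type": "integer", "description": "Unique order number"},
    {"name": "amount", "type": "decimal", "description": "Order value incl. VAT"},
    {"name": "customer", "type": "text", "description": "Customer display name"},
]
records = [{"order_id": "1001", "amount": "19.95", "customer": "Ann"}]

payload = pack_feed(records, fields)
typed = unpack_feed(payload)
print(typed)
```

Note that the same header carries technical metadata (the type) and business metadata (the description) side by side. Shipping the "metadata" section as a separate file would give you the descriptor variant.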

When metadata need to be deduced, AI technology can render plausible or proposed descriptions and labels, in order to automate as much of this “grunt work” as possible. There have been some exciting developments in this field, and as Data Science and Business Intelligence continue to evolve and mature, we can expect even more offerings. Since recording and maintenance of metadata so often cause bottlenecks in data management, contemporary solutions can add significant value by facilitating this burdensome task.
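To be clear, the following is not AI: it is a deliberately simple, rule-based sketch of the most basic form of deduction, namely proposing a column type from the values that show up. The function names and type labels are my own, and a real solution would go much further (descriptions, labels, semantics), but it shows the shape of the task being automated.

```python
def infer_type(values):
    """Propose a type label for a column from its raw string values."""
    non_empty = [v.strip() for v in values if v.strip()]
    if not non_empty:
        return "unknown"
    if all(v.lower() in {"true", "false"} for v in non_empty):
        return "boolean"
    # Try the narrowest numeric type first, then widen.
    for label, caster in (("integer", int), ("decimal", float)):
        try:
            for v in non_empty:
                caster(v)
            return label
        except ValueError:
            continue
    return "text"

def propose_schema(rows):
    """Collect values per column and propose a type for each column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return {name: infer_type(values) for name, values in columns.items()}

# Hypothetical sample rows, as they might arrive from an undocumented feed.
sample = [
    {"id": "1", "price": "9.50", "active": "true", "city": "Utrecht"},
    {"id": "2", "price": "12", "active": "False", "city": ""},
]
schema = propose_schema(sample)
print(schema)
```

The output is a proposal that a person still needs to confirm, which is exactly where the time saving is supposed to come from.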

Many organizations find that their data strategies and expansion plans are constrained by scarce, often tacit metadata knowledge. It is crucial expertise that is held by relatively few key members of their business intelligence teams. Invariably these people become a bottleneck in multiple projects. Having them participate in too many projects doesn’t help, because the more you ask people to switch tasks, the less they will be able to accomplish. What you often see is that formal, explicit documentation was either deemed too expensive, or “there was no time for it”… Obviously, that dynamic keeps the organization in an involuntary stranglehold by these same key resources. Set the data free!
