Choosing Technology for Data Science

Tom Breur

8 January 2017

Data Science is a critical capability for companies that are trying to compete on analytics. Because the field is so new, and many of the software solutions relatively immature, teams are at risk of getting bogged down in the weeds of techno-centric distractions. The reality is that for Big Data applications dealing with terabytes of data (or more!), the sheer volume poses a technical and intellectual challenge in its own right. But it is crucial to always, always keep in mind why you are doing this: how will the corporate bottom line benefit from insights, predictions, or maybe even simply a description of market dynamics? And which tools can get you there with the gentlest learning curve and a reasonable investment?

Business objectives need to drive the choice of technical solutions, not the other way around. As an a priori constraint, many settings carry sensitivities with regard to widespread, unfettered access to data. Those reservations have to be honored, or else you will trigger adverse sentiments and create unhelpful pushback. Perception matters. A lot. There is a tension between making all the data available to all the data consumers, while at the same time ensuring that no sensitive data ever gets abused or leaked. You just can't square that circle: there is no reward without some corresponding risk. There is always a balance between the new opportunities that come from using more data and the risks associated with security breaches. Be conscious of that balance, and address it explicitly where possible. Decoupled data architectures naturally draw a line between data storage and data presentation, and let you apply the appropriate level of security to your storage without compromising the efficiency of your presentation layer.
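As a minimal sketch of that decoupling idea: the presentation layer can mask sensitive fields at the boundary, so consumers never see raw storage records unless authorized. Field names and the masking rule below are illustrative assumptions, not anything prescribed in this article.

```python
# Hypothetical sketch of a decoupled presentation layer. Security is
# applied here, at the boundary between storage and presentation, so
# the storage layer itself stays untouched. Field names are made up.

SENSITIVE_FIELDS = {"ssn", "email"}

def present(record: dict, authorized: bool = False) -> dict:
    """Return a consumer-facing view of a stored record."""
    if authorized:
        return dict(record)
    # Mask sensitive fields for general data consumers.
    return {k: ("***" if k in SENSITIVE_FIELDS else v)
            for k, v in record.items()}

row = {"customer_id": 42, "ssn": "123-45-6789", "spend": 250.0}
masked = present(row)                    # view for general consumers
full = present(row, authorized=True)     # view for vetted users
```

The point is not this particular mechanism, but that storage and presentation are separate concerns: you can tighten or relax the presentation rules without touching how (or where) the data is stored.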

Tooling for Data Science projects is a big deal; don't take this lightly. Data Science is extremely knowledge-intensive work. Many software solutions have long learning cycles, and the technology itself is in constant flux. The difference between "the right" and a "cumbersome" tool for the same job can easily mean a tenfold difference in productivity. Since Data Scientists are scarce resources who are likely to be a bottleneck in many projects, catering to their needs has a huge impact on effectiveness. Creating a collaborative ecosystem of data engineers and data scientists, each focusing on what they do best, is a major contributor to productivity. Just because you are already sitting in the chair, would you ask your dentist to also cut your hair?

Interoperability with existing (and future…) data platforms is important. Moving data in and out of your Data Science platform is a non-value-added activity, so you want to minimize that overhead. Excessive data volumes can also create a burden (and cost!) on the available infrastructure. These are expenses that set you back before you have had any chance to earn your keep! Let's face it: a large share, or rather the majority, of Data Scientists' time will be spent preparing data for analysis. Therefore, being able to quickly stitch together data sets from disparate data silos and divergent technologies comes at a premium. Being able to outsource the data preparation work to data engineers will also free your data scientists for more valuable tasks.
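To make "stitching together disparate silos" concrete, here is a small sketch joining two toy sources: a CSV extract (say, from a marketing tool) and transactional data in a relational store. The table and column names are invented for illustration only.

```python
# Hedged sketch: combining records from two "silos" (a CSV export and
# a SQLite table) into one analysis-ready set. Names are made up.

import csv
import io
import sqlite3

# Silo 1: a CSV extract mapping customers to segments.
csv_data = io.StringIO("customer_id,segment\n1,premium\n2,basic\n")
segments = {int(r["customer_id"]): r["segment"]
            for r in csv.DictReader(csv_data)}

# Silo 2: order transactions in a relational database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, 100.0), (1, 50.0), (2, 30.0)])

# Stitch: total spend per customer, joined with the CSV segment.
stitched = [
    {"customer_id": cid, "segment": segments.get(cid), "total": total}
    for cid, total in conn.execute(
        "SELECT customer_id, SUM(amount) FROM orders GROUP BY customer_id")
]
```

Even in this toy form, most of the code is plumbing rather than analysis, which is exactly why this preparation work is a natural candidate to hand to data engineers.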

Since new technologies (like scalable, "as needed" cloud processing) do impact your capabilities, your Big Data strategy should be mindful of changing requirements and evolving technological options. Maximum agility allows you to benefit from new advances, provided that contractual and technical choices make that financially feasible. The 'traditional' vendors in particular are shielding themselves from market disruption by adjusting their business models, often aimed at "locking in" customers through deep discounts for long-term commitments.

Given the learning curve and the infrastructure backbone that needs to be in place to support Data Science tools, there is a (big) premium on keeping the number of solutions limited. The more platforms you run, the more support and maintenance work needs to be done. Also, and possibly more importantly, as the number of solutions grows, you wind up with ever fewer people who are well versed in any particular platform. This invariably causes bottlenecks when you manage resources and capabilities. So again, you need to strike a balance between finding "the perfect" tool for any given problem and keeping the number of solutions in your portfolio (that you then need to maintain) limited.

For a manager of smart data engineers, this is a constant challenge. You have to keep the right balance between testing new tools and building an efficient team that becomes expert on the tools already in place. You need a little bit of chaos to generate breakthrough ideas, but total chaos is not what you want. Delivering business value with the current set of tools (building trust) will create the leeway a team needs for experimenting.

Business objectives and tactical opportunities ought to drive the direction and priorities of Data Science efforts. Ideally these should be broken down into the smallest possible value-creating chunks. It is simply too hard and too risky to dream up the perfect solution and attempt to cover all requirements up front; before you get there, reality will have changed. Much better to grow in small steps and gather feedback on how well your efforts align with corporate goals, and on how much money they are bringing in! Technology is an important enabler, but don't let yourself get distracted by discussions around technology for its own sake.

