Sub-second analytical BI time to value still a pipe dream

Internet search engines with instant query responses may have misled enterprises into believing that all analytical queries should deliver split-second answers.

With the hype around Big Data analytics and the instant convenience of internet searches, enterprises might be forgiven for expecting to have all the answers to all their questions at their fingertips in near real time.


Unfortunately, getting trusted answers to complex questions is far more complicated and time-consuming than simply typing a search query. Behind the scenes of any internet search, a great deal of preparation has already been done in order to serve up the appropriate answers. Google, for instance, dedicates vast high-end resources, around the clock, to preparing the data needed to answer a search query instantly. Yet even Google cannot answer broad questions or make forward-looking predictions. Where the data is known and trusted, preparation and rules have already been applied, and the search parameters are limited – as on a property listings website – near-instant answers are possible; but this is not true BI or analytics.

Within the enterprise, matters become a lot more complicated. When the end user seeks an answer to a broad query – such as when a marketing firm wants to assess social media to find affinity for a certain range of products over a six-month period – a great deal of ‘churn’ must take place in the background to deliver answers. This is not a split-second process, and it may deliver only general trend insights rather than trusted, quality data that can serve as the basis for strategic decisions.

When end users run their own queries and are given the power to process their own BI/analytics, lengthy churn must take place. Every time a query, report or instance of data access is converted into useful BI/analytical information for end consumers, a long chain of preparation work has to be done along the way: identify data sources > access > verify > filter > pre-process > standardize > look up > match > merge > de-duplicate > integrate > apply rules > transform > format > present > distribute/channel.
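As a rough illustration of how many steps sit behind a single answer, the sketch below strings a handful of these preparation stages together in Python with pandas. It is purely a sketch: the source files, column names and quality rule are hypothetical, and in a real enterprise each stage typically involves separate tools, teams and rule sets.

```python
import pandas as pd

def prepare_for_query(source_paths: list[str]) -> pd.DataFrame:
    """Illustrative chain of preparation steps behind a single BI query."""
    # Access and verify: load each identified source, skipping unreadable rows.
    frames = [pd.read_csv(path, on_bad_lines="skip") for path in source_paths]

    # Standardize: align column names and key formats before integration.
    for df in frames:
        df.columns = [col.strip().lower() for col in df.columns]
        df["customer_id"] = df["customer_id"].astype(str).str.strip()

    # Integrate and de-duplicate: merge the sources, one row per customer.
    combined = pd.concat(frames, ignore_index=True)
    combined = combined.drop_duplicates(subset="customer_id", keep="last")

    # Apply rules and filter: one hypothetical quality rule before presentation.
    combined = combined[combined["email"].str.contains("@", na=False)]

    return combined
```

Even this toy chain touches every row of every source; at enterprise volumes, each stage multiplies the time it takes to reach an answer.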

Because most queries have to traverse, link and process millions of rows of data and possibly trillions of words from within the data sources, this background churn could take hours, days or even longer.

A recent TDWI study found that organisations are dissatisfied with the time it takes for the chain of processes involved in BI, analytics and data warehousing to deliver valuable data and insights to business users. The organisations attributed this, in part, to ill-defined project objectives and scope, a lack of skilled personnel, data quality problems, slow development and an inability to access all relevant data.

The problem is that most business users are not BI experts and do not all have analytical minds, so the discover-and-report approach may be iterative (and therefore slow), and in many cases the outputs are not of the expected quality. The results may also be inaccurate, because data quality rules may not have been applied and data linking may not be correct, as it would be in a typical data warehouse where data has been qualified and pre-defined/derived. In a traditional set-up – a structured data warehouse where all the preparation is done in one place, once only, and then shared many times, supported by quality data and predefined rules – it may be possible to get sub-second answers. But even in this scenario, sub-second insights are often not achieved, since time to insight also depends on properly designed data warehouses, server power and network bandwidth.

Users tend to confuse searching and discovery over flat, raw data that is already there with the next level: generating information and insight from it. In more complex BI/analytics, each time a query is run all the preparation work has to be done from the beginning, and the necessary churn can take a significant amount of time.

Therefore, demanding ever-faster BI ‘time to value’ and expecting sub-second answers could prove to be a costly mistake. While it is possible to get some form of output in sub-seconds, those outputs are unlikely to be qualified, trusted insights that can deliver real strategic value to the enterprise.

By Mervyn Mooi, Director at Knowledge Integration Dynamics (KID)

 


Remember governance in the rush to next generation platforms

South African enterprises are actively looking to next generation application and analytics platforms, but they need to be aware that the platforms alone cannot assure quality data and effective data governance, warns KID.

By Johann van der Walt, Director of Operations at Knowledge Integration Dynamics (KID)

There’s no doubt that SAP HANA as the platform for next-generation applications and analytics is a major buzzword among South African enterprises today. Enterprises know that if they want to stay relevant and competitive, it is an enabler that will allow them to process more data, faster.


However, HANA cannot address the problem of poor quality or obsolete data – all it can do is allow enterprises to get their poor quality data faster. Unless enterprises address their data quality and data management practices before they migrate, they will dilute the value they get from HANA.

With modern enterprises grappling with massive databases, often with a significant amount of poor quality and obsolete data, simply migrating all of this data to a next generation platform would be a mistake. For effective governance and compliance, enterprises need to be very conscious of where the data going into HANA comes from and how it gets there. Once the data has been migrated, they also need to ensure that they are able to maintain their data quality over time.

In our consulting engagements with local enterprises, we have discovered that most CIOs are well aware of their data quality flaws, and most are anxious to address data quality and governance. But they are often challenged in actually doing so.

A typical challenge they face is arriving at a unique dataset and addressing inconsistencies. Most CIOs know they are sitting with more data than they need, with as much as 50% to 90% of their data actually obsolete, but many battle to identify this obsolete data. Simply identifying inconsistencies in datasets – like a Johannesburg dialing code for a Cape Town-based customer – could require months of manual work by a large team of employees, and securing the budget necessary to do so can prove tricky.
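A consistency rule like the dialing-code example can often be expressed as an automated check rather than a manual review. The sketch below is a minimal Python/pandas illustration; the prefix mapping and column names are assumptions, not a description of any particular toolset.

```python
import pandas as pd

# Hypothetical mapping of South African landline prefixes to cities.
CITY_PREFIXES = {"Johannesburg": "011", "Cape Town": "021", "Durban": "031"}

def flag_dialing_code_mismatches(customers: pd.DataFrame) -> pd.DataFrame:
    """Return rows whose phone prefix disagrees with the recorded city."""
    expected = customers["city"].map(CITY_PREFIXES)
    digits = customers["phone"].astype(str).str.replace(r"\D", "", regex=True)
    return customers[expected.notna() & (digits.str[:3] != expected)]

# A Cape Town customer with a Johannesburg (011) number is flagged.
sample = pd.DataFrame({
    "customer_id": ["1001", "1002"],
    "city": ["Cape Town", "Johannesburg"],
    "phone": ["011 555 0100", "011 555 0200"],
})
print(flag_dialing_code_mismatches(sample))
```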

In our experience, an effective way to secure budget and launch a data governance project is to do so by piggybacking off a larger enterprise project – such as a SAP HANA migration. A move such as this gives the enterprise an ideal opportunity to run a data cleansing process as the first step towards a comprehensive data governance project. An effective place to begin is to identify what data is active, and what is obsolete, and then focus data quality improvement efforts on only the active data before migrating it to HANA. In this way, you are moving only the data that is needed, and it is accurate and cleansed.
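One simple way to separate active from obsolete records before migration is a last-activity cutoff, as in the sketch below. It is illustrative only: the 'last_activity' column and the two-year threshold are assumptions that the business would have to define for its own data.

```python
import pandas as pd

def split_active_obsolete(records: pd.DataFrame, cutoff_days: int = 730):
    """Split records into active and obsolete sets by last activity date."""
    cutoff = pd.Timestamp.now() - pd.Timedelta(days=cutoff_days)
    last_seen = pd.to_datetime(records["last_activity"], errors="coerce")

    active = records[last_seen >= cutoff]
    # Records with no parseable activity date are treated as obsolete here;
    # a real project would review them rather than discard them outright.
    obsolete = records[(last_seen < cutoff) | last_seen.isna()]
    return active, obsolete
```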

This is just the start of the governance journey. In the world of data quality, it is easier to get clean than to stay clean. Typically, when data quality is improved in projects like a migration, decay sets in over time. This is where data governance comes in: after the cleansing and migration to HANA, enterprises need to put the toolsets and policies in place to ensure continuously good data. In most companies, this entails a programme of passive data governance, in which the quality of data is monitored and addressed in a reactive way.
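In practice, passive governance can be as simple as recomputing a few quality metrics on a schedule and raising a flag when they slip. The sketch below shows one possible shape for such a check in Python; the metrics and thresholds are assumptions, not a prescribed standard.

```python
import pandas as pd

def quality_report(df: pd.DataFrame, key: str = "customer_id") -> dict:
    """Compute a few simple quality metrics for periodic, reactive review."""
    return {
        "row_count": len(df),
        "duplicate_keys": int(df[key].duplicated().sum()),
        "null_ratio": float(df.isna().mean().mean()),
    }

def flag_decay(report: dict, max_null_ratio: float = 0.05) -> list[str]:
    """Return warnings when metrics cross example thresholds (assumed values)."""
    warnings = []
    if report["duplicate_keys"] > 0:
        warnings.append(f"{report['duplicate_keys']} duplicate key(s) found")
    if report["null_ratio"] > max_null_ratio:
        warnings.append(f"null ratio {report['null_ratio']:.1%} exceeds threshold")
    return warnings
```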

However, some may also move to active data governance – an arguably more invasive approach in which the data capturing process is more closely controlled to ensure that the data meets governance rules. Active data governance might also be supported by being highly selective about who is allowed to input data, and by moving to a more centralised organisation for data management instead of a distributed, non-governed environment.

 
