Big data vs accuracy: don’t believe everything you read

Analysis of unstructured big data has the potential to complement and enhance structured data analysis. Applying big data analysis can deliver a range of interesting new insights that support decision-making within the organization. But companies should not believe everything that may be derived from the data, warns KID.

By Mervyn Mooi, director at Knowledge Integration Dynamics (KID)

Organizations around the world are buying into the hope that big data analytics, which includes combining structured and unstructured data, will deliver an all-knowing ‘crystal ball’ to drive competitiveness. Last year, Gartner reported that big data analytics services alone constituted a $40 billion market, and one that is growing fast.


A major part of the big data appeal is the promise of combining accurate and structured internal data with fast-changing unstructured external data, offering a complete picture of the market environment and the organization’s own position within it.

However, while unstructured external data could add useful new methods to information gathering and decision-making processes, it cannot be considered 100% accurate. In some cases, it will not even be close to accurate, and it cannot be counted on as a basis for making crucial business decisions.

What proportion of unstructured external data is brought into the big data mix, and how much credence is given to it, depends on the questions to be addressed, the organization’s willingness to accept discrepancies in the data when answering a particular question, and the importance of the decisions to be made based on the big data analysis. Searching for useful insights in unstructured external big data may also require a few passes before acceptable data is identified.

For example, a new car dealership looking for prospective customers might rely entirely on external data to build a leads list. They might use a search engine to identify companies in the area of the dealership; then narrow down the list to companies likely to need cars and likely to have the budget for new cars. The resulting leads list is a good start, but may still require verification calls to determine whether the prospective customers are still in business, still based in the area and likely to be interested.

A bank investigating new branches and new markets might combine its own structured customer data with unstructured external data such as a map, to plot a visual representation of where existing customers are, and where there are gaps with potential for marketing to new customers. This insight may require further clarification and does not guarantee new customers in the blank spots on the map, but it does give the bank useful information to work with.

When an organization is seeking insights for business-critical decisions, the ratio of qualified structured data to unstructured external data should be around 90:10, with unstructured external data serving to complement the analysis, not form the basis of it. This is because structured (high-value) data is traditionally bound by compliance and quality controls (such as the ACID properties of transactional systems) and can be trusted.

When using big data analytics, organizations should also note that deriving business value from big data is not an exact science, and there are no guarantees. For instance, a company using its own data in combination with unstructured data to assess interest in its products might count visits to its website as an indicator of its popularity.

While the visitor figures might be accurate, the assumptions made based on the figures could be completely wrong, since visitors to the site could have stumbled across it by accident or have been using it for comparison shopping and have no interest in buying the products.

Big data analytics is helpful for traversing high volumes of unstructured data and supplementing the company’s existing, qualified data. But, depending on the answers needed, big data will need to achieve greater degrees of accuracy and reliability before business-critical decisions can be based on its analysis.

Big data: don’t adopt if you can’t derive value from it

Amid massive big data hype, KID warns that not every company is geared to benefit from costly big data projects yet.

By Mervyn Mooi, director at Knowledge Integration Dynamics (KID)

Big data has been a hot topic for some time now, and unfortunately, many big data projects still fail to deliver on the hype. Recent global studies are pointing out that it’s time for enterprises to move from big data implementations and spend, to actually acting on the insights gleaned from big data analytics.



But turning big data analytics into bottom-line benefits requires a number of things, including market maturity, the necessary skills, and processes geared to actioning insights. In South Africa, very few companies have these factors in place to let them benefit from significant big data projects. Despite the hype about the potential value derived from big data, in truth, value derivation is still in its infancy.

Locally, we find the early adopters have been major enterprises like banks, where big data tools are necessary for sifting through massive volumes of structured and unstructured data to uncover trends and run affinity analysis and sentiment analysis. But while they have the necessary advanced big data tools, we often find that these new technologies are delivering little more than a sense of confirmation, rather than the surprise findings and bottom line benefits they hoped for.

This may be due to processes that result in slow application of new insights, as well as to a dire shortage of the new data science skills that marry technical, analytics and strategic business know-how. Currently, the process of big data management is often disjointed from start to finish: companies may be asking the right questions and gaining insights, but unless these insights are delivered rapidly and actually used effectively, the whole process is rendered ineffective. There is little point in having a multi-million rand big data infrastructure if the resulting insights aren’t applied at the right time in the right places.

The challenge now is around the positioning, management and resourcing of big data as a discipline. Companies with large big data implementations must also face the challenges of integration, security and governance at scale. We also find there are many misconceptions about big data, what it is, and how it should be managed. There is an element of fear about tackling the ‘brave new world’ of technology, when in reality, big data might be seen as the evolution of BI.

Most commonly, we see companies feeling pressured to adopt big data tools and strategies when they aren’t ready and are not positioned to benefit. As with many technologies, hype and ‘hard sell’ may convince companies to spend on big data projects when they are simply not equipped to use them. In South Africa, only the major enterprises, research organisations and perhaps players in highly competitive markets stand to benefit from big data investments. For most of the mid-market, there is little to be gained from being a big data early adopter.

We are already seeing cheaper cloud-based big data solutions coming to market, and – as with any new technology – we can expect more of these to emerge in future. Within a year or two, big data solutions will become more competitively priced, simpler and less demanding of skilled resources to manage, and may then become more viable for small to mid-market companies. Until then, many may find that more effective use of their existing BI tools, and even simple online searches, meet their current needs for market insights and information.

Unless there is a compelling reason to embark on a major big data project now, the big data laggards stand to benefit in the long run. This is particularly true for those small and mid-size companies currently facing IT budget constraints. These companies should be rationalizing, reducing duplication and waste, and looking to the technologies that support their business strategies, instead of constantly investing in new technology simply because it is the latest trend.



Companies still fail to protect data

Despite having comprehensive information security and data protection policies in place, most South African businesses are still wide open to data theft and misuse, says KID.

By Mervyn Mooi, Director at the Knowledge Integration Dynamics Group

Numerous pieces of legislation, including the Protection of Personal Information (POPI) Act, and governance guidelines like King III, are very clear about how and why company information, and the information companies hold on partners and customers, should be protected. The penalties and risks involved in not protecting data are well known too. Why then, is data held within South African companies still inadequately protected?

In our experience, South African organisations have around 80% of the necessary policies and procedures in place to protect data. But the physical implementation of those policies and procedures is only at around 30%. Local organisations are not alone – a recent IDC study has found that two-thirds of enterprises internationally are failing to meet best practice standards for data control.



The risks of data loss or misuse are present at every stage of data management – from gathering and transmission through to destruction of data. Governance and control are needed at every stage. A company might have its enterprise information systems secured, but if physical copies of data – like printed documents or memory sticks – are left lying around an office, or redundant PCs are sent for recycling without effective reformatting of the hard drives, sensitive data is still at risk. Many overlook the fact that confidential information can easily be stolen in physical form.

Many companies fail to manage information sharing by employees, partners and other businesses. For example, employees may unwittingly share sensitive data on social media: what may seem like a simple tweet about drafting merger documents with the other party might violate governance codes. Information shared with competitors in exploratory merger talks might be misused by the same competitors later.

We find that even larger enterprises with policies in place around moving data to memory sticks and mobile devices don’t clearly define what confidential information is, so employees tweet, post or otherwise share information without realizing they are compromising the company’s data protection policies. For example, an insurance firm might call a client and ask for the names of acquaintances who might also be interested in their product, but under the POPI Act, this is illegal. There are myriad ways in which sensitive information can be accessed and misused, with potentially devastating outcomes for the company that allows this to happen. In a significant breach, someone may lose their job, or there may be penalties or a court case as a result.

Most organisations are aware of the risks and may have invested heavily in drafting policies and procedures to mitigate them. But the best-laid governance policies cannot succeed without effective implementation. Physical implementation begins with analysing data risk: discovering, identifying and classifying data, and assessing its risk based on value, location, protection and proliferation. Once the type and level of risk have been identified, data stewards need to take tactical and strategic steps to ensure data is safe.

These steps within the data lifecycle need to include:

  • Standards-based data definition and creation, to ensure that security and privacy rules are implemented from the outset.
  • Strict provisioning of data security measures such as data masking, encryption/decryption and privacy controls to prevent unauthorised access to and disclosure of sensitive, private, and confidential information.
  • The organisation also needs to securely provision test and development data by automating data masking, data sub-setting and test data-generation capabilities.
  • Attention must also be given to data privacy and accountability by defining access based on privacy policies and laws – for instance, who may view personal, financial, health or confidential data, and when.
  • Finally, archiving must be addressed: the organisation must ensure that it securely retires legacy applications, manages data growth, improves application performance, and maintains compliance with structured archiving.
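To make the data-masking and pseudonymisation step above concrete, the sketch below shows one simple approach in Python. The field names, masking rule and salt are illustrative assumptions, not a prescribed standard:

```python
import hashlib

def mask_id_number(id_number: str) -> str:
    """Mask an ID number, keeping only the last three digits visible."""
    return "*" * (len(id_number) - 3) + id_number[-3:]

def pseudonymise(value: str, salt: str) -> str:
    """Replace a sensitive value with a stable, non-reversible token."""
    return hashlib.sha256((salt + value).encode()).hexdigest()[:12]

record = {"name": "A. Customer", "id_number": "8001015009087"}
masked = {
    "name": pseudonymise(record["name"], "demo-salt"),
    "id_number": mask_id_number(record["id_number"]),
}
```

Masking keeps a recognisable fragment for human verification, while pseudonymisation produces a stable token that can still be used for matching without disclosing the original value.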


Policies and awareness programmes alone are not enough to address the vulnerabilities in data protection or to ensure good governance. The necessary guidelines, tools and education exist, but to succeed, governance has to move off paper and into action. The impact of employee education is temporary – it must be refreshed regularly, and it must be enforced with systems and processes that entrench security within the database, at file level, server level, network level and in the cloud. This can be a huge task, but it is a necessary one when architecting for the future.

In the context of the above, a big question to ponder is: has your organisation mapped the rules, conditions, controls and standards (RCCSs), as translated from accords, legislation, regulation and policies, to your actual business / technical processes and data domains?


Remember governance in the rush to next generation platforms

South African enterprises are actively looking to next generation application and analytics platforms, but they need to be aware that the platforms alone cannot assure quality data and effective data governance, warns KID.

By Johann van der Walt, Director of Operations at Knowledge Integration Dynamics (KID)

There’s no doubt that SAP HANA as the platform for next-generation applications and analytics is a major buzzword among South African enterprises today. Enterprises know that if they want to stay relevant and competitive, it is an enabler that will allow them to process more data, faster.


However, HANA cannot address the problem of poor quality or obsolete data – all it can do is allow enterprises to get their poor quality data faster. Unless enterprises address their data quality and data management practices before they migrate, they will dilute the value they get from HANA.

With modern enterprises grappling with massive databases, often with a significant amount of poor quality and obsolete data, simply migrating all of this data to a next generation platform would be a mistake. For effective governance and compliance, enterprises need to be very conscious of where the data going into HANA comes from and how it gets there. Once the data has been migrated, they also need to ensure that they are able to maintain their data quality over time.

In our consulting engagements with local enterprises, we have discovered that most CIOs are well aware of their data quality flaws, and most are anxious to address data quality and governance. But they are often challenged in actually doing so.

A typical challenge they face is coming up with a unique dataset and addressing inconsistencies. Most CIOs know they are sitting with more data than they need, with as much as 50% – 90% of their data actually obsolete, but many battle to identify this obsolete data. Simply identifying inconsistencies in datasets – like a Johannesburg dialling code for a Cape Town-based customer – could require months of manual work by a large team of employees, and securing the budget necessary to do so can prove tricky.
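The dialling-code example lends itself to automation. A minimal Python sketch, using a hypothetical code-to-city lookup, that flags such inconsistencies for review:

```python
# Hypothetical lookup of South African dialling codes by city.
DIALLING_CODES = {"011": "Johannesburg", "021": "Cape Town", "031": "Durban"}

def find_inconsistencies(customers):
    """Flag records whose phone dialling code contradicts the stated city."""
    issues = []
    for c in customers:
        code = c["phone"][:3]
        expected_city = DIALLING_CODES.get(code)
        if expected_city and expected_city != c["city"]:
            issues.append((c["id"], c["city"], expected_city))
    return issues

customers = [
    {"id": 1, "city": "Cape Town", "phone": "011 555 0100"},  # JHB code
    {"id": 2, "city": "Durban", "phone": "031 555 0199"},
]
issues = find_inconsistencies(customers)
```

A screen like this does not decide which value is correct – that remains a stewardship call – but it reduces months of manual trawling to a review of the flagged exceptions.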

In our experience, an effective way to secure budget and launch a data governance project is to do so by piggybacking off a larger enterprise project – such as a SAP HANA migration. A move such as this gives the enterprise an ideal opportunity to run a data cleansing process as the first step towards a comprehensive data governance project. An effective place to begin is to identify what data is active, and what is obsolete, and then focus data quality improvement efforts on only the active data before migrating it to HANA. In this way, you are moving only the data that is needed, and it is accurate and cleansed.

This is just the start of the governance journey. It is easier to get clean than stay clean in the world of data quality. Typically, when data quality is improved in projects like migration, decay sets in over time. This is where data governance comes in: after the cleansing and migration to HANA, enterprises need to put the toolsets and policies in place to ensure continuously good data. In most companies, this entails a programme of passive data governance, in which the quality of data is monitored and addressed in a reactive way.
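Passive governance of this kind can be as simple as periodically scoring data against a set of quality rules and reporting violations for later, reactive correction. A sketch with hypothetical fields and rules:

```python
def quality_report(rows, rules):
    """Passive data governance sketch: score existing data against
    quality rules and report violations for reactive follow-up."""
    report = {}
    for name, rule in rules.items():
        failed = sum(1 for r in rows if not rule(r))
        report[name] = {"failed": failed, "total": len(rows)}
    return report

rows = [
    {"email": "client@example.com", "age": 34},
    {"email": "", "age": -1},
]
rules = {
    "email_present": lambda r: bool(r["email"]),
    "age_plausible": lambda r: 0 <= r["age"] <= 120,
}
report = quality_report(rows, rules)
```

Run on a schedule, a report like this gives data stewards a trend line of quality decay, which is exactly the signal passive governance relies on.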

However, some may also move to active data governance – an arguably more invasive approach in which the data capturing process is more closely controlled to ensure that the data meets governance rules. Active data governance might also be supported by being highly selective about who is allowed to input data, and by moving to a more centralised organisation for data management – instead of a distributed, non-governed environment.



Big data follows the BI evolution curve

Big data analysis in South Africa is still early in its maturity, and has yet to evolve in much the same way as BI did 20 years ago, says Knowledge Integration Dynamics.

By Mervyn Mooi, director at Knowledge Integration Dynamics (KID)

Big data analysis tools aren’t ‘magical insight machines’ spitting out answers to all business’s questions: as is the case with all business intelligence tools, there are lengthy and complex processes that must take place behind the scenes before actionable and relevant insights can be drawn from the vast and growing pool of structured and unstructured data in the world.


South African companies of all sizes have an appetite for big data analysis, but because the country’s big data analysis segment is relatively immature, they are still focused on their big data strategies and the complexity of actually getting the relevant data out of this massive pool of information. We find many enterprises currently looking at technologies and tools like Hadoop to help them collate and manage big data. There are still misconceptions around the tools and methodologies for effective big data analysis: companies are sometimes surprised to discover they are expecting too much, and that a great deal of ‘pre-work’, strategic planning and resourcing is necessary.

Much like the early days of BI, big data analysis starts as a relatively unstructured, ad hoc discovery process; but once patterns are established and models are developed, the process becomes a structured one.

And in the same way that BI tools depend on data quality and relationship linking, big data requires some form of qualification prior to being used. The data needs to be profiled for flaws, which need to be cleansed (quality); it must be put into relevant context (relationships); and it must be timeous in the context of what is being searched or reported on. Methods must be devised to qualify much of the unstructured data, as a big question remains around how trusted and accurate information from the internet will be.
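The qualification of unstructured data – quality, relevancy and timeliness – can be sketched as a simple screening function. The fields, topic keyword and freshness threshold below are illustrative assumptions, not a prescribed KID method:

```python
from datetime import datetime, timedelta

def qualify(record, topic="product", max_age_days=30):
    """Screen an item of external, unstructured data before analysis:
    basic quality, relevance and timeliness checks."""
    checks = {
        "quality": bool(record.get("source")) and bool(record.get("text")),
        "relevance": topic in record.get("text", "").lower(),
        "timeliness": datetime.now() - record["fetched"]
                      < timedelta(days=max_age_days),
    }
    return all(checks.values()), checks

item = {"source": "web", "text": "Positive product review",
        "fetched": datetime.now()}
ok, detail = qualify(item)
```

Real qualification would of course involve richer profiling and relationship linking; the point is that each unstructured item passes explicit, testable gates before it reaches the analysis.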

The reporting and application model that uses this structured and unstructured data must be addressed, and the models must be tried and tested. In the world of sentiment analysis and trends forecasting based on ever-changing unstructured data, automated models are not always the answer. Effective big data analysis also demands human intervention from highly skilled data scientists who have both business and technical experience. These skills are still scarce in South Africa, but we are finding a growing number of large enterprises retaining small teams of skilled data scientists to develop models and analyse reports.

As local big data analysis matures, we will find enterprises looking to strategise on their approaches, the questions they want to answer, what software and hardware to leverage and how to integrate new toolsets with their existing infrastructure. Some will even opt to leverage their existing BI toolsets to address their big data analysis needs.  BI and big data are already converging, and we can expect to see more of this taking place in years to come.

Fast data is old hat but customers now demand it in innovative ways

By Mervyn Mooi, director at Knowledge Integration Dynamics (KID)

People don’t just need fast data. ‘Fast data’ is really real-time data by another name, but the term implies that the data or information derived, received or consumed needs to be relevant and actionable. That means it must, for example, initiate or enforce a set of follow-up or completion tasks.



Fast data is the result of data and information throughput at high speed. Real-time data has always been an enabler for real-time action that allows companies to respond to customer, business and other operational situations and challenges – almost immediately.

Fast, actionable data is data that is handed to decision-makers or users at lightning speed. But it is the application of knowledge gleaned from the data that is paramount. Give your business people piles of irrelevant data at light speed and they will only get bogged down. Data consumers need the right insights at the right time to effectively marshal resources to meet demands.

The problem for some companies is that they are still grappling with big data. There are many more sources of data, there are more types of data, and many organisations are struggling to connect the data from beyond their private domains with that inside their domains. Big data fuels fast data, but it must do so in real time, after being clearly interpreted and prepared so that decision-makers can take action. And it must all lead back to improving customer service.



Why focus on customer service? Because, as Roxana Strohmenger, director, Data Insights Innovation at Forrester Research, says in a guest blog: “Bad customer experiences are financially damaging to a company.” The damage goes beyond immediate wallet share to include loyalty, which has potentially significant long-term financial implications.

Retailers, for example, are using the Internet of Things (IoT) to improve customer service. That’s essentially big data massaged and served directly to customers. The International Data Corporation (IDC) 2014 US Services Consumer Survey found that 34% of respondents said they use social media for customer support more than once a month. Customer support personnel who cannot access customer data quickly cannot efficiently help those people. In a 2014 report Forrester states: “Companies struggle to deliver reproducible, effective and personalised customer service that meets customer expectations.”

The concern for many companies is that they don’t get it right in time to keep up with their competition. They could spend years trying to regain market share at enormous expense.

So fast data can help, but how do you achieve it? In reality it differs little from any previous data programme that feeds your business decision-makers. The need has always been for reliable data, available as soon as possible, that helps people to make informed decisions. Today we find ourselves in the customer era. The advent of digital consumer technologies has given consumers a strong voice, with the associated ability to hold widespread sway over company image, brand perceptions and other consumers’ product choices. They can effectively influence loyalty and wallet share, so their needs must be met properly and quickly. Companies need to know what these people think so they can determine what they want and how to give it to them.

All of this comes back to working with data. Data warehouses provision information to create business insight. Business intelligence (BI), using a defined BI vision, supporting framework and strategy, delivers the insights that companies seek. Larger companies have numerous databases, data stores, repositories – call them what you will, their data sits in different places, often in different technologies. Decision-makers need to have a reliable view into all of it to get a consistent single view of customers, or risk erroneous decisions.

Data warehousing, BI and integration must be achieved in a strategic framework that leads back to the business goals – in this case, at least partly, improved customer service – so that the effort is cost-effective and efficient and delivers proper return on investment (ROI).

The following standard system development life-cycle process applies just as well in today’s world of immediacy, driven by digital technologies, as it did before:


  1. Audit what exists and fix what is broken
  2. Assess readiness and implement a roadmap to the desired outcomes
  3. Discovery – scope requirements and what resources are available to meet them
  4. Design the system – develop it or refine what exists
  5. Implement the system – develop, test and deploy
  6. Train – executives and administrators
  7. Project manage – business users must be involved from the beginning to improve ROI and aid adoption
  8. Maintain – this essentially maintains ROI

Fast data relies on task and delivery agility using these pillars, which are in fact age-old data disciplines that must be brought to bear in a world where there are new and larger sources of data. The trick is to work correctly with these new sources, employ proven methodologies, and roll these out for maximum effect for customer satisfaction.



Governance: still the biggest hurdle in the race to effective BI

By Mervyn Mooi, director at Knowledge Integration Dynamics (KID)

Whether you’re talking traditional big stack BI solutions or new visual analytics tools, it’s an unfortunate fact that enterprises still buy in to the candy-coated vision of BI without fully addressing the underlying factors that make BI successful, cost-effective and sustainable.

Information management is a double-edged sword. Well architected, governed and sustainable BI will deliver the kind of data business needs to make strategic decisions. But BI projects built on ungoverned, unqualified data / information and undermined by ‘rebel’ or shadow BI will deliver skewed and inaccurate information: and any enterprise basing its decisions on bad information is making a costly mistake. Too many organisations have been doing the latter, resulting in failed BI implementations and investment losses.

For more than a decade, we at Knowledge Integration Dynamics have been urging enterprises to formalise and architect their enterprise information management (EIM) competencies based on best-practice or industry standards, which follow an architected approach and are subjected to governance.


EIM is a complex environment that needs to be governed and which encompasses data warehousing, business intelligence (BI), traditional data management, enterprise information architecture (EIA), data integration (DI), data quality management (DQM), master data management (MDM), data management life cycle (DMLC), information life cycle management (ILM), records and enterprise content management (ECM), metadata management and security / privacy management.

Effective governance is an ongoing challenge, particularly in an environment in which business must move at an increasingly rapid pace and information changes all the time.

For example, tackling governance in the context of data quality starts with the matching and merging of historic data, to ensure design and storage conventions are aligned and all data is accurate according to set rules and standards. It is not just a matter of plugging in a BI solution that gives you results: it may require up to a year of careful design and architecture to integrate data from various departments and sources in order to feed the BI system. The conventions across departments within a single organisation are often dissimilar, and all data has to be integrated and qualified.

Even data as apparently straightforward as a customer’s ID number may be incorrect – with digits transposed, coded differently between source systems, or missing – so the organisation must decide which data source or integration rule to trust. This is necessary to ensure data warehouses comply with quality rules and with legislative standards, and to build the foundation of the 360-degree view of the customer that executive management aspires to. But integrating the data and addressing data quality is only one area where effective governance must be applied.
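Some of these ID-number flaws can be caught mechanically: South African ID numbers carry a Luhn check digit, so most mistyped or transposed digits fail validation before the harder question of which source to trust even arises. A minimal sketch:

```python
def luhn_valid(number: str) -> bool:
    """Luhn checksum, the check-digit scheme used by South African ID
    numbers (among others): a cheap automated screen that catches most
    mistyped or transposed digits."""
    total = 0
    for i, ch in enumerate(reversed(number)):
        d = int(ch)
        if i % 2 == 1:          # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0
```

A check like this only proves a number is internally consistent, not that it belongs to the right customer; it removes the obvious capture errors so that stewardship effort goes to the genuinely ambiguous records.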

Many organisations wrongly assume that in data, nothing changes. But in reality, the organisation must cater for constant change. For example, when reporting in a bank, customer records could be dramatically incorrect if the data fails to reflect that certain customers have moved to new cities, or that bank branch hierarchies have changed. Therefore, linking and change tracking is crucial in ensuring data integrity and accurate current and historic reporting.
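One common way to implement such linking and change tracking – a ‘type 2’ versioning sketch, not necessarily the approach any particular organisation uses – is to close the current record and append a new version, so both current and historic reports stay accurate:

```python
from datetime import date

def record_move(history, key, new_city, as_of):
    """Type-2 style change tracking: close the open record for this
    customer and append a new version, preserving full history."""
    for rec in history:
        if rec["key"] == key and rec["valid_to"] is None:
            if rec["city"] == new_city:
                return history          # nothing changed
            rec["valid_to"] = as_of     # close the old version
    history.append({"key": key, "city": new_city,
                    "valid_from": as_of, "valid_to": None})
    return history

history = [{"key": "C1", "city": "Durban",
            "valid_from": date(2010, 1, 1), "valid_to": None}]
record_move(history, "C1", "Cape Town", date(2015, 6, 1))
```

A report as at 2012 still places this customer in Durban, while a current report places them in Cape Town – exactly the integrity the article argues for.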

And automation takes you only so far: you can automate to the nth degree, but you still require data stewards to carry out certain manual verifications to ensure that the data is correct and remains so. Organisations also need to know who is responsible and accountable for their data and be able to monitor and control the lifecycle process from one end to the other. The goals are to eliminate multiple versions of the truth (results), have a trail back to sources and ensure that only the trusted version of the truth is integrated into systems.

Another challenge in the way of effective information management is the existence of ‘rebel’ or shadow data systems. In most organisations, departments frustrated by slow delivery from IT, or with unique data requirements, start working in silos: creating their own spreadsheets, duplicating data and processes, and not feeding the data back into the central architecture. This undermines effective data governance and results in huge overall costs for the company. Instead, all users should follow the correct processes and table their requirements, and the BI system should be architected to cater for these new requirements. It all needs to come through the central architecture: in this way, the entire ecosystem can be governed effectively, and data / information can be delivered from one place, making its management easier and more cost-effective.

The right information management processes also have to be put in place, and they must be sustainable. This is where many BI projects fail – an organization builds a solution and it lasts only a year, because no supporting frameworks were put in place to make it sustainable. Organisations need to take a standards-based, architected approach to ensure EIM and governance is sustained and perpetuated.

New BI solutions and best practice models emerge continually, but will not solve the business and operational problems if they are implemented in an ungoverned environment, much the way a beautiful luxury car may have all the features you need, but unless the driver is disciplined, it will not perform as it should.


Knowledge Integration Dynamics, Mervyn Mooi, (011) 462-1277,