Avoiding the data spaghetti junction

Mervyn Mooi.

 

Despite all their efforts and investments in quality centres of excellence, some enterprises are still grappling with data quality issues, and this at a time when data is more important than ever before.

The effects of poor data quality are felt throughout the enterprise, impacting everything from operations to customer experience, costing companies an estimated $3 trillion a year in the US alone.

Data quality will become increasingly crucial as organisations seek to build on their data to benefit from advances in analytics (including big data), artificial intelligence and machine learning.

We find organisations unleashing agile disruptors into their databases without proper controls in place; business divisions failing to standardise their controls and definitions; and companies battling to reconcile data too late in the lifecycle. The result is often a ‘spaghetti junction’ of siloed, duplicated and non-standardised data that cannot deliver its potential business value to the company.

Controls at source

Data quality as a whole has improved in recent years, particularly in banks and financial services facing the pressures of compliance.

However, this improvement is largely on the wrong side of the fence: it happens after the data has been captured. This may stem from challenges experienced decades ago, when validating data as thousands of clerks captured it could slow down systems and leave customers waiting in banks and stores while their details were entered.

But the practice has continued to this day in many organisations, which still qualify data only after capture and so add unnecessary layers of resources for data cleaning.

Ensuring data quality should instead start with pre-emptive controls: strict entry validation and verification rules, and data profiling of both structured and unstructured data.
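
As an illustration only, a minimal sketch of such pre-emptive, capture-time validation might look like the following Python fragment; the field names and rules are hypothetical, not a prescribed standard.

    import re
    from datetime import datetime

    # Hypothetical capture-time rules: reject bad records before they are stored.
    ID_PATTERN = re.compile(r"^\d{13}$")               # e.g. a 13-digit national ID
    EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

    def validate_customer_record(record: dict) -> list:
        """Return validation errors; an empty list means the record may be stored."""
        errors = []
        if not record.get("full_name", "").strip():
            errors.append("full_name is required")
        if not ID_PATTERN.match(record.get("id_number", "")):
            errors.append("id_number must be 13 digits")
        if not EMAIL_PATTERN.match(record.get("email", "")):
            errors.append("email is not well formed")
        try:
            datetime.strptime(record.get("date_of_birth", ""), "%Y-%m-%d")
        except ValueError:
            errors.append("date_of_birth must be YYYY-MM-DD")
        return errors

    # Capture screens or APIs would call this before committing a record,
    # so that cleansing is not deferred to a downstream team.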

Controls at the integration layer

Standardisation is crucial in supporting data quality, but in many organisations different rules and definitions are applied to the same data, resulting in duplication and an inability to gain a clear view of the business and its customers.

For example, the definition of the data entity called a customer may differ from one bank department to another: for the retail division, the customer is an individual, while for the commercial division, the customer is a registered business, with the directors of that business also registered as customers. The bank then has multiple versions of what a customer is, and when the data is integrated, multiple definitions and structures are involved.

To reduce this complexity, commonality must be found in definitions, and common structures and rules applied. Relationships in the data must also be understood, with data profiling applied to assess its quality.
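
By way of illustration, one way to apply a common definition at the integration layer is to map each division's records onto a single, canonical structure; the class and field names below are assumptions made for the sketch, not the bank's actual model.

    from dataclasses import dataclass

    # A hypothetical canonical definition of "customer", shared by all divisions.
    @dataclass
    class Customer:
        customer_id: str
        customer_type: str   # "individual" or "registered_business"
        legal_name: str
        source_system: str

    def from_retail(row: dict) -> Customer:
        # Retail captures individuals.
        return Customer(row["cust_no"], "individual", row["full_name"], "retail")

    def from_commercial(row: dict) -> Customer:
        # Commercial captures registered businesses; the directors would be
        # mapped separately as "individual" customers linked to the business.
        return Customer(row["reg_no"], "registered_business", row["registered_name"], "commercial")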

Controls at the physical layer

Reference data should also be standardised across the organisation: wherever a list of values exists, it should follow a single convention rather than a myriad of conventions across the various business units.
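
A simple, hedged illustration is a single shared reference table that every business unit resolves its local values against; the country codes below are assumed for the example.

    # Hypothetical shared reference data: each unit maps its local convention
    # to the one standard code before the value is stored or integrated.
    COUNTRY_REFERENCE = {
        "ZA": "ZA", "RSA": "ZA", "SOUTH AFRICA": "ZA",
        "UK": "GB", "GB": "GB", "UNITED KINGDOM": "GB",
    }

    def standardise_country(value: str) -> str:
        code = COUNTRY_REFERENCE.get(value.strip().upper())
        if code is None:
            raise ValueError(f"Unknown country value: {value!r}")
        return code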

The next prerequisites for data quality are cleaning and reconciliation. Incorrect, incomplete and corrupt records must be addressed; standardised conventions, definitions and rules applied; and a reconciliation performed: what you put in must balance with what you take out. Standardised reconciliation frameworks and processes support both data quality and compliance.
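
As a rough sketch of the "what goes in must balance with what comes out" principle, a reconciliation step can compare record counts and a control total between the source and the target of a load; the field name and tolerance below are illustrative assumptions.

    def reconcile(source_rows: list, target_rows: list, amount_field: str = "amount") -> dict:
        """Compare record counts and a control total between source and target."""
        result = {
            "source_count": len(source_rows),
            "target_count": len(target_rows),
            "source_total": sum(r.get(amount_field, 0) for r in source_rows),
            "target_total": sum(r.get(amount_field, 0) for r in target_rows),
        }
        result["balanced"] = (
            result["source_count"] == result["target_count"]
            and abs(result["source_total"] - result["target_total"]) < 0.01
        )
        return result

    # A load that does not balance is flagged for investigation
    # rather than silently published downstream.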

Controls at the presentation layer

On the front end, where data is consumed, there should be a common data portal with standard access controls providing a standard view into the data. While the consumption and application needs of each organisation vary, 99% of users do not need report authoring capabilities, and those who do should not be able to manipulate data out of context or in an unprotected way.

With a common data portal and standardised access controls, data quality can be better protected.
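
One hedged way to picture standardised access controls in such a portal is a simple role check that limits report authoring to a small, named group; the roles and permissions here are assumptions made for the sketch.

    # Hypothetical role model for a common data portal.
    ROLE_PERMISSIONS = {
        "viewer":  {"view_report"},
        "analyst": {"view_report", "export_report"},
        "author":  {"view_report", "export_report", "author_report"},
    }

    def can(user_role: str, action: str) -> bool:
        return action in ROLE_PERMISSIONS.get(user_role, set())

    # Most portal users would be viewers or analysts; only the small group
    # of authors may create or change report definitions.
    assert can("viewer", "author_report") is False
    assert can("author", "author_report") is True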

Several practices also support data quality, starting with a thorough needs analysis and the definition of data rules and standards in line with both business requirements and legislative compliance.

Architecture and design must be carefully planned, with an integration strategy adopted that takes into account existing designs and meta-data. Development initiatives must adhere to data standards and business rules, and the correctness of meta-data must be verified.
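
For example, a minimal sketch of verifying meta-data correctness during development is to compare a deployed structure against the agreed standard; the expected schema below is invented purely for illustration.

    # Hypothetical agreed standard for a table, kept with the design documents.
    EXPECTED_SCHEMA = {
        "customer_id": "VARCHAR(20)",
        "customer_type": "VARCHAR(30)",
        "legal_name": "VARCHAR(200)",
    }

    def verify_metadata(actual_schema: dict) -> list:
        """Return deviations between the deployed schema and the standard."""
        issues = []
        for column, expected_type in EXPECTED_SCHEMA.items():
            actual_type = actual_schema.get(column)
            if actual_type is None:
                issues.append(f"missing column: {column}")
            elif actual_type.upper() != expected_type:
                issues.append(f"{column}: expected {expected_type}, found {actual_type}")
        for column in actual_schema:
            if column not in EXPECTED_SCHEMA:
                issues.append(f"unexpected column: {column}")
        return issues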

Effective testing must be employed to verify the accuracy of results against the designs; and deployment must include monitoring, audits, reconciliation counts and other best practices.

With these controls and practices in place, the organisation achieves tight, well-governed and sustained data quality.

InfoFlow, Hortonworks to offer Hadoop skills courses

Veemal Kalanjee, MD of InfoFlow.

 

Local data management company InfoFlow has partnered with international software firm Hortonworks to provide enterprise Hadoop training courses to the South African market.

Hadoop is an open source software framework for storing data and running applications on clusters of commodity hardware.
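
By way of illustration only (and not part of any Hortonworks course material), applications can be run on Hadoop as simple programs that read records from standard input and write tab-separated key-value pairs, as in this Python word-count sketch written for the Hadoop Streaming interface.

    #!/usr/bin/env python3
    """Word-count sketch for Hadoop Streaming: the same script acts as mapper
    (default) or reducer (pass "reduce" as an argument). It can also be tried
    locally with: cat input.txt | python3 wc.py | sort | python3 wc.py reduce"""
    import sys
    from collections import defaultdict

    def map_stdin():
        for line in sys.stdin:
            for word in line.strip().split():
                print(f"{word}\t1")

    def reduce_stdin():
        counts = defaultdict(int)
        for line in sys.stdin:
            word, _, n = line.rstrip("\n").partition("\t")
            counts[word] += int(n or 0)
        for word, total in counts.items():
            print(f"{word}\t{total}")

    if __name__ == "__main__":
        reduce_stdin() if "reduce" in sys.argv else map_stdin()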

While the global Hadoop market is expected to soar, with revenue reaching $84.6 billion by 2021, the sector is witnessing a severe lack of trained and talented technical experts globally.

Hortonworks develops and supports open source Hadoop data platform software. The California-based company says its enterprise Hadoop is in high demand in SA, but to date, Hadoop skills have been scarce and costly to acquire locally.

InfoFlow provides software, consulting and specialised services in business intelligence solutions, data warehousing and data integration.

The company will deliver localised expert resources and Hadoop training support programmes to a wide range of local companies across the financial services, retail, telecommunications and manufacturing sectors.

“There is huge demand in SA for enterprise Hadoop skills, with large enterprises having to fly expensive resources into the country to give their enterprise Hadoop projects guidance and structure,” says Veemal Kalanjee, MD of InfoFlow.

“Instead of moving existing skills around between various clients, Hortonworks wants to take a longer-term approach by cross-skilling people through the training and leveraging the graduate programme run by InfoFlow.”

This partnership makes InfoFlow the only accredited Hortonworks training entity in Sub-Saharan Africa, adds Kalanjee.

The Hortonworks training will be added to InfoFlow’s broader portfolio of accredited Informatica Intelligent Data Platform graduate programmes across data management and data warehousing, governance, security, operations and data access.

The Hortonworks-InfoFlow partnership will bring to Johannesburg the only Hortonworks training and testing site in SA, according to the companies.

Local professionals will be able to attend classes focusing on a range of Hortonworks product training programmes at InfoFlow’s training centre in Fourways, Johannesburg.

The courses to be offered include: Hadoop 123; Essentials; Hadoop Admin Foundations; Hortonworks Data Platform and Developer Quick Start.

“There is currently no classroom-based training available on Hortonworks locally and if clients require this, the costs are too high. Having classroom-based training affords clients the ability to ask questions, interact on real-world challenges they are experiencing and apply the theory learnt in a lab environment, set up specifically for them.”

InfoFlow will have an accredited trainer early next year and will provide the instructor-led, classroom Hortonworks training at reduced rates, concludes Kalanjee.