Top Tips for Weeding Out Bad Data



As databases grow larger, more complex, and increasingly business-critical, bad data is posing a major threat to many organizations.

Data is like water. “Without clean water, you will have significant negative health impacts,” says Venkat Gupta, associate vice president and data estate modernization leader for Sogeti, a unit of business advisory firm Capgemini. For many organizations, bad data means losing trust with entire stakeholder ecosystems — be it their customers, suppliers, or even their employees, he warns.

Failing to prioritize data trust can lead to poor decision-making capabilities, bad customer experiences, regulatory penalties due to non-compliance issues, and more, Gupta says. “Weeding bad data can’t be an afterthought if an organization hopes to remain relevant in today’s competitive market.”

Bad data should be filtered out of production databases regularly, recommends Jeremy Rambarran, a professor at Touro University’s Graduate School of Technology. “It’s critical for businesses to ensure that their databases are storing accurate information so they can instill trust in their customer base,” he explains. If bad data isn’t routinely weeded out from the production environment, enterprises, particularly financial industry firms, will find themselves relying on, and basing decisions on, inaccurate data. On a global scale, bad data has the potential to negatively affect the world economy.

Eliminating, or at least reducing bad data, also lowers the risk of errors and bias impacting data analysis. “Bad data can skew outcomes and result in incorrect conclusions, making its removal crucial for accuracy and reliability,” says Kunal Shah, senior manager for data analytics at AI and analytics firm SAS. “Eliminating bad data enhances overall data quality, leading to more accurate and dependable insights and conclusions.”

Know Your Enemy

“Bad data” is a nebulous term. “The standard for data quality differs, based on organizational requirements,” Shah says. “However, completeness, relevance, accuracy, consistency, and timeliness apply to every organization across all industries.”
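
By way of illustration, a quick profiling pass can put numbers on a couple of those dimensions. The sketch below, in Python with pandas, measures per-column completeness and a simple timeliness check on a hypothetical customer table; the column names and the one-year freshness window are assumptions for illustration, not standards from the article.

```python
# A minimal sketch of profiling two quality dimensions -- completeness and
# timeliness -- on a hypothetical customer table. Columns are illustrative.
import pandas as pd

customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "email": ["a@example.com", None, "c@example.com", ""],
    "last_updated": pd.to_datetime(["2024-01-05", "2023-06-01", None, "2024-02-20"]),
})

# Treat empty strings the same as missing values before measuring.
normalized = customers.replace("", pd.NA)

# Completeness: share of non-null values per column.
completeness = normalized.notna().mean()
print(completeness)

# Timeliness: share of records updated within the past year (assumed window).
cutoff = pd.Timestamp.today() - pd.Timedelta(days=365)
timeliness = (normalized["last_updated"] > cutoff).mean()
print(f"timeliness: {timeliness:.0%}")
```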

Bad data often really means low-quality data. In this case, it’s up to the data owner to define the acceptable level of quality in terms of relevance, accuracy, age, or other criteria. “But bad data can also mean inappropriate data, in which case ‘appropriate’ would need to be defined,” says Erik Gfesser, director and chief architect at business advisory firm Deloitte Global. One enterprise’s highly useful data might be meaningless to another. Since many use cases aren’t particularly demanding, data quality doesn’t always have to adhere to the same standards. “As such, judgment often needs to be used to determine what’s appropriate,” he explains.

It’s also important to check for duplicate records, which can be caused by data entry errors or identical data being retrieved from multiple sources. “A clearly defined data governance program and an enterprise-level data pipeline design that’s shared enterprise-wide are the best ways to prevent duplicate records,” Shah recommends.
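
As one way to operationalize that, a deduplication check can run inside the shared pipeline. The following pandas sketch flags records that collide on a set of match keys; the keys ("email" and "phone") and the normalization step are illustrative assumptions, since the real match rules would come from the governance program Shah describes.

```python
# A minimal sketch of flagging duplicate records with pandas. The match
# keys ("email", "phone") are hypothetical stand-ins for whatever fields
# a governance program designates.
import pandas as pd

records = pd.DataFrame({
    "email": ["a@example.com", "A@Example.com", "b@example.com"],
    "phone": ["555-0100", "555-0100", "555-0199"],
    "source": ["crm", "web_form", "crm"],
})

# Normalize before matching so trivial formatting differences
# (case, whitespace) don't hide duplicates.
keys = records.assign(email=records["email"].str.strip().str.lower())

# Mark every record that shares a key combination with another record.
dupes = keys.duplicated(subset=["email", "phone"], keep=False)
print(records[dupes])
```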

It’s possible to identify outliers and detect anomalies by comparing values that appear to be significantly different from the rest of the data or by running statistical tests, such as regression analysis, hypothesis testing, or correlation analysis, to identify patterns in data, Shah says.
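
A rough sketch of those checks, using a z-score rule and an interquartile-range (IQR) rule on a single numeric column, is shown below. The column name, sample values, and thresholds are illustrative assumptions rather than recommended settings.

```python
# A minimal sketch of two common outlier checks on a numeric column:
# a z-score rule and an IQR rule. Thresholds here are conventional
# defaults, not prescriptions.
import pandas as pd

orders = pd.DataFrame({
    "order_amount": [52, 48, 61, 55, 49, 47, 53, 50, 58, 46,
                     51, 54, 49, 57, 52, 48, 60, 55, 47, 980],
})
values = orders["order_amount"]

# Z-score rule: flag values more than 3 standard deviations from the mean.
z_scores = (values - values.mean()) / values.std()
z_outliers = orders[z_scores.abs() > 3]

# IQR rule: flag values beyond 1.5 * IQR outside the quartiles.
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
iqr_outliers = orders[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]

print(z_outliers)
print(iqr_outliers)
```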

Best Practices

Enterprises should establish active data governance and management practices with a structured and systematic approach. “This involves setting up policies, procedures, frameworks, and technologies that govern the collection, storage, use, and sharing of data within the organization and to external partners,” Gupta says. “The goal is to ensure that data is correct, dependable, and accessible to authorized users.”

A strong and active data governance program will also bring engagement and alignment across IT, business units, and data management teams. “It’s a continuous process that needs to be measured, watched, and adjusted to meet changing business needs,” Gupta says.

The best way for an organization to ensure a clean data set is to leverage automated tools that can sift through datasets and flag malformed records, values that don’t comply with expected formats, and other irregularities, says Portia Crowe, Accenture Federal Services chief data strategist for the defense portfolio and applied intelligence. “Setting up validation rules and having good data policies can also help with identifying, mitigating, and rectifying where bad data is originating from.”
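
A stripped-down version of such validation rules might look like the sketch below. The fields, patterns, and allowed ranges are hypothetical; in practice they would be derived from the organization’s data policies.

```python
# A minimal sketch of rule-based validation on incoming records. Field
# names, patterns, and ranges are illustrative assumptions only.
import re

RULES = {
    "email": lambda v: bool(re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", str(v))),
    "age": lambda v: isinstance(v, int) and 0 <= v <= 120,
    "country_code": lambda v: isinstance(v, str) and len(v) == 2 and v.isupper(),
}

def validate(record: dict) -> list[str]:
    """Return the list of fields that fail their validation rule."""
    return [field for field, rule in RULES.items()
            if field not in record or not rule(record[field])]

good = {"email": "ana@example.com", "age": 34, "country_code": "US"}
bad = {"email": "not-an-email", "age": 207, "country_code": "usa"}

print(validate(good))  # []
print(validate(bad))   # ['email', 'age', 'country_code']
```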

Achieving Observability

In today’s organizations, DevOps teams ensure smooth and reliable software releases. Unfortunately, many enterprises continue to address data quality and lineage issues on an ad-hoc basis. “Applying the principles of observability to data pipelines can be a game changer,” Shah states.
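
One way to start, assuming pipeline batches arrive as DataFrames, is to emit a few basic health metrics — row counts, null rates, freshness — on every run so drift can be spotted over time. The function and field names below are illustrative, not part of any particular observability product.

```python
# A minimal sketch of recording per-run observability metrics for a data
# pipeline batch. Names and fields are illustrative assumptions.
import json
import pandas as pd

def pipeline_run_metrics(df: pd.DataFrame, timestamp_col: str) -> dict:
    """Summarize one batch for a data-quality dashboard or alerting hook."""
    now = pd.Timestamp.now(tz="UTC")
    latest = pd.to_datetime(df[timestamp_col]).max()
    return {
        "run_at": now.isoformat(),
        "row_count": int(len(df)),
        "null_rate": {col: float(df[col].isna().mean()) for col in df.columns},
        "freshness_minutes": round((now - latest).total_seconds() / 60, 1),
    }

batch = pd.DataFrame({
    "event_id": [1, 2, 3],
    "value": [10.5, None, 7.2],
    "event_time": pd.to_datetime(
        ["2024-03-01T10:00Z", "2024-03-01T10:05Z", "2024-03-01T10:09Z"]
    ),
})

print(json.dumps(pipeline_run_metrics(batch, "event_time"), indent=2))
```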

Since ensuring data quality is a continuous process, organizations should observe standard data governance practices and commit themselves to improvement and making informed decisions, Gupta says. “Regular assessments and feedback loops allow organizations to address emerging challenges, adapt to evolving requirements, and refine their data governance processes over time,” he adds.

