The volume of data held by the public sector is constantly increasing. It ranges from sensitive personal information used to deliver personalized services (such as health and social care) to non-personal information (such as environmental data). This data is useful to the organization that collects and holds it, but it becomes even more valuable when others are allowed to reuse it.
Open data is information that is freely available for everyone to use, reuse, and share: it is published via the internet, in an electronic format that supports ready reuse, and under an open license that permits that reuse.
There’s no question that the open data movement has helped make data more accessible. But what happens when you try to use open data in an actual project? You quickly run into serious issues, chief among them data integrity and data quality.
Data integrity is the foundation of data analytics: business intelligence only works if you can trust that your insights are accurate. Ensuring that your data is accurate and reliable at every stage of its life cycle is therefore essential.
Data integrity starts with gathering high-quality data from trusted sources in a timely fashion, integrating it into one place, and then analyzing it with confidence. Organizations can enrich this data with further attributes, such as location intelligence, to make better decisions based on all the information available.
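For illustration, here is a minimal sketch of such enrichment in Python; the postcode-to-coordinate lookup table and the facility records are hypothetical stand-ins for a real geocoding source:

```python
# Hypothetical lookup table mapping postcodes to coordinates; real
# location intelligence would come from a geocoding service or dataset.
POSTCODE_COORDS = {
    "1012AB": (52.3731, 4.8926),
    "3011AD": (51.9225, 4.4792),
}

def enrich_with_location(rows):
    """Attach latitude/longitude to each record where the postcode is known."""
    for row in rows:
        coords = POSTCODE_COORDS.get(row.get("postcode"))
        row["lat"], row["lon"] = coords if coords else (None, None)
    return rows

facilities = [
    {"name": "Library", "postcode": "1012AB"},
    {"name": "Clinic", "postcode": "9999ZZ"},  # postcode not in the lookup
]
print(enrich_with_location(facilities))
```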
The concepts of data quality and data integrity are often discussed in the open data domain. Data quality is the degree to which data conforms to a given standard or set of rules: whether it is accurate, complete, consistent, timely, valid, and unique. In other words, it is an assessment of how well the information meets your needs and expectations.
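As a concrete illustration, the rules behind those dimensions can be expressed as simple programmatic checks. Here is a minimal Python sketch covering completeness, uniqueness, and timeliness only; the records, field names, and threshold are hypothetical:

```python
from datetime import date, datetime

# Hypothetical open-data records; in practice these would be loaded
# from a published dataset (CSV, JSON, an API, ...).
records = [
    {"id": "A1", "postcode": "1012AB", "updated": "2024-03-01"},
    {"id": "A2", "postcode": "", "updated": "2024-03-01"},
    {"id": "A1", "postcode": "9999ZZ", "updated": "2019-01-15"},
]

def check_quality(rows, max_age_days=365):
    """Apply simple rule-based quality checks and report violations."""
    issues = []
    seen_ids = set()
    for i, row in enumerate(rows):
        # Completeness: required fields must be non-empty.
        if not row.get("postcode"):
            issues.append(f"row {i}: missing postcode")
        # Uniqueness: the identifier must not repeat.
        if row["id"] in seen_ids:
            issues.append(f"row {i}: duplicate id {row['id']}")
        seen_ids.add(row["id"])
        # Timeliness: records older than the threshold are flagged.
        updated = datetime.strptime(row["updated"], "%Y-%m-%d").date()
        age = (date.today() - updated).days
        if age > max_age_days:
            issues.append(f"row {i}: stale record ({age} days old)")
    return issues

for issue in check_quality(records):
    print(issue)
```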
The issue is that no single entity controls what gets published as open data. That makes it hard for anyone to check whether a dataset is accurate, or whether it has been changed along the way.
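One common mitigation, offered here as an illustration rather than anything the open data movement mandates, is for publishers to list a cryptographic checksum next to each download so re-users can at least detect that a file has changed. A Python sketch, with a hypothetical file name and digest:

```python
import hashlib

def sha256_of(path, chunk_size=1 << 16):
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Hypothetical values: the publisher would list this digest next to the
# download link for the (equally hypothetical) file below.
EXPECTED = "9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08"
DATASET = "air_quality_2024.csv"

try:
    ok = sha256_of(DATASET) == EXPECTED
    print("matches the published checksum" if ok else "differs from what was released")
except FileNotFoundError:
    print(f"{DATASET} not found; download it first")
```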
The good news is that we can learn from how open-source software projects deal with this issue. One of the most important lessons these projects teach is to test code before releasing it, and to validate the quality of a product extensively before making it available for public consumption.
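Carried over to data publishing, that practice looks like a test suite that gates each release of a dataset. A minimal sketch using Python's unittest, with a hypothetical in-memory dataset standing in for the release candidate:

```python
import unittest

class PublishedDatasetTests(unittest.TestCase):
    """Release gate: the dataset is only published when these pass."""

    def setUp(self):
        # Hypothetical in-memory dataset; normally loaded from the
        # candidate release artifact.
        self.rows = [
            {"id": "A1", "value": 12.5},
            {"id": "A2", "value": 7.0},
        ]

    def test_ids_are_unique(self):
        ids = [row["id"] for row in self.rows]
        self.assertEqual(len(ids), len(set(ids)))

    def test_values_in_plausible_range(self):
        for row in self.rows:
            self.assertGreaterEqual(row["value"], 0.0)
            self.assertLessEqual(row["value"], 1000.0)

if __name__ == "__main__":
    unittest.main()
```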
We should carry this idea further by ensuring that every organization that publishes open data adheres to rigorous standards when releasing its datasets. That helps publishers build a reputation as reliable sources and earns trust in their products and services.
Besides meeting rigorous quality standards, we also need to close the gap between those who publish open datasets and those who (re)use them. Publishing organizations have very little insight into how their data is actually used, which makes it difficult to anticipate utilization domains and improve the data for them. This applies to commercial and private-sector use cases as well as to other government agencies that rely on the same datasets.