Data holds great value, but only if you can find it and access it. That may sound simple, yet in the real world, it’s not. Some people think they’ve solved the problem if they’ve asked permission to use someone else’s data and have gotten a “yes” in response.
But getting permission is just scratching the surface of the challenges that arise in data sharing, whether it is between departments within an organization or between organizations. How do you find the right data? How do you even know it exists? Can you trust it? Is it stored in a form your applications can access?
The challenges of getting the data you need can form a significant barrier to its use, and as a result, a lot of data lies dormant. “Finding, accessing, and using data effectively is one of the biggest challenges companies will face,” says Janice Zdankus, VP of Innovation for Social Impact in Hewlett Packard Enterprise’s Office of the CTO.
Data has value for cross-cutting purposes
One reason permission alone isn’t enough is that data often wasn’t collected with the data consumer’s project in mind. Data science – from analytics to AI – frequently uses data in ways that differ from its original purpose. Customer transaction data, for instance, may have been collected for billing purposes, but it also provides a rich source of training data for online recommendation systems, fraud detection, and supply chain predictive analytics. That’s one reason it’s useful to run AI and analytics applications together, on the same system.
These examples illustrate the fact that context gives data value, as pointed out by Joanna McKenzie, Principal Data Scientist at The Data Lab – Innovation Centre in Scotland, in a recent AI and data panel “Data Sharing for Data Science.” Weather data, for instance, is obviously useful in making predictions about agricultural conditions. But one might produce better results by combining weather data with socio-economic or other human behavioral data in use cases such as predictive analytics for retail marketing or healthcare and disease prevention.
Watch the video “Data Saves Lives”
But to take advantage of the added value of combining diverse data sets, you may have to deal with mismatches in data type, in granularity, or many other parameters. Let’s look a bit deeper at some of the common challenges encountered in data sharing.
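As one illustration of the granularity mismatch mentioned above, the following minimal Python sketch rolls hourly weather readings up to daily averages so they can be joined with daily sales totals. The data values, field layout, and function names (`daily_mean_temperature`, `join_on_day`) are hypothetical and chosen only for illustration; real data sets would also need checks for time zones, units, and missing values.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

# Hypothetical hourly temperature readings: (ISO timestamp, degrees Celsius).
hourly_weather = [
    ("2021-06-01T06:00", 12.0),
    ("2021-06-01T12:00", 20.0),
    ("2021-06-01T18:00", 16.0),
    ("2021-06-02T06:00", 14.0),
    ("2021-06-02T12:00", 24.0),
]

# Hypothetical daily sales totals keyed by date string.
daily_sales = {"2021-06-01": 130, "2021-06-02": 175, "2021-06-03": 90}


def daily_mean_temperature(readings):
    """Coarsen hourly readings to one mean value per calendar day."""
    by_day = defaultdict(list)
    for timestamp, celsius in readings:
        day = datetime.fromisoformat(timestamp).date().isoformat()
        by_day[day].append(celsius)
    return {day: mean(values) for day, values in by_day.items()}


def join_on_day(weather_by_day, sales_by_day):
    """Keep only the days present in both data sets, in date order."""
    shared_days = sorted(weather_by_day.keys() & sales_by_day.keys())
    return [(day, weather_by_day[day], sales_by_day[day]) for day in shared_days]


combined = join_on_day(daily_mean_temperature(hourly_weather), daily_sales)
for row in combined:
    print(row)
```

Note that the join silently drops 2021-06-03, for which no weather readings exist. Whether to drop, flag, or fill such gaps is a decision that depends on the use case.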
What gets in the way?
The gap between data producers and data consumers underlies a major challenge in getting the data you need: data discovery. To find and use the data you need, you first have to know it exists. That’s not a trivial issue, as data generated by data producers is often siloed, whether within an organization or even among open-source data sets. Data discovery is even harder given the huge proliferation of different types and different sources of data.
Data trustworthiness is another fundamental challenge, particularly when there is a gap between the data consumer and the data producer. Even assuming all parties are responsible and have good intentions, it still may be difficult to find out exactly what a particular data set really is, how and when it was collected, who else is using it, and whether you can trust that it is a reliable source with an assured lifetime that fits your requirements. What if you need to revisit a data set to retrain AI models or to respond to regulatory audit requirements? Has the data been changed or “cleaned up” by someone using it for other reasons? Proper data curation and governance are challenging, especially when data is collected without knowing how it will be needed by multiple data users.
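One simple way to answer the question of whether a data set has been changed is to record a cryptographic fingerprint when the data is first used and compare it later. The sketch below is a minimal illustration using Python’s standard `hashlib` and `json` modules; the record layout and the helper names `fingerprint` and `is_unchanged` are assumptions made for this example, not part of any product mentioned in this article.

```python
import hashlib
import json


def fingerprint(records):
    """Return a SHA-256 hex digest of a list of JSON-serializable records.

    Keys are sorted so that dictionary key order does not affect the digest.
    The order of records in the list does affect it.
    """
    canonical = json.dumps(records, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def is_unchanged(records, expected_digest):
    """Check records against a digest recorded at an earlier time."""
    return fingerprint(records) == expected_digest


# Hypothetical data set as it looked when a model was first trained.
original = [
    {"customer_id": 1, "amount": 25.00},
    {"customer_id": 2, "amount": 13.50},
]
recorded_digest = fingerprint(original)

# The same data set after someone "cleaned up" one value.
cleaned = [
    {"customer_id": 1, "amount": 25.00},
    {"customer_id": 2, "amount": 14.00},
]

print(is_unchanged(original, recorded_digest))
print(is_unchanged(cleaned, recorded_digest))
```

A fingerprint only reveals that something changed, not what changed or why. Answering those questions still depends on the curation and governance practices described above.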
Data accessibility by multiple users and a variety of applications is a fundamental requirement for practical data sharing. Does the data infrastructure your data producers and data consumers are using support broad accessibility by different types of applications or tools? And don’t forget the issue of when and where you will use data: must large data sets be moved to you? Or can they be used in place, either through remote access or by moving applications and models to them?
The need to connect data producers and data consumers isn’t entirely new, but demand is growing. One response to these pressures is the rapid expansion of a new frontier: better data sharing for large, distributed data sets. New data initiatives are being launched, some as international partnerships, and new ways are being developed to deal more effectively with the challenges.
New approaches to make getting data easier
One new approach is called Dataspaces, a data system being developed by Hewlett Packard Enterprise (HPE) to facilitate data exchange between producers and consumers in multicloud, hybrid cloud, or on-premises settings. This platform-agnostic approach lets you find data across multiple technology stacks from multiple distributed owners. The goal is to enhance data sharing and collaboration, steer away from inaccessible siloed data, and provide improved governance for data you can trust.
Technology innovation in the form of a unifying data infrastructure for large-scale, highly distributed systems is also helpful in making exchanges between data producers and multiple data consumers feasible. An example is HPE Ezmeral Data Fabric, a software-defined, hardware-agnostic data infrastructure that stores, manages, and moves data across edge, cloud, and hybrid cloud architectures, with flexible multi-API access to support unified analytics by diverse applications. Data fabric not only makes automated data motion easy to do, but it also makes it possible to access data remotely, in many cases removing the need for data motion.
Another approach to making data easier to find and use involves inserting appropriate metadata into large data sets. Suparna Bhattacharya, Principal Technologist with HPE AI Research, is developing new techniques for automating metadata insertion. Along with Ted Dunning, CTO for Data Fabric at HPE, Suparna recently described how metadata can help deal with mismatches between diverse data sets used to improve agriculture even for individual small farms.
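To make the idea concrete, here is a minimal Python sketch of how descriptive metadata attached to two data sets could be compared to surface mismatches before the data is combined. It is not the technique being developed at HPE; the `DatasetMetadata` fields, the example values, and the `find_mismatches` helper are all hypothetical.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class DatasetMetadata:
    """Hypothetical descriptor stored alongside a data set."""
    name: str
    time_granularity: str  # e.g. "hourly", "daily", "seasonal"
    location_key: str      # how each record identifies a place
    source: str            # who produced the data


def find_mismatches(a, b, fields=("time_granularity", "location_key")):
    """Return {field: (value_in_a, value_in_b)} for fields that differ."""
    left, right = asdict(a), asdict(b)
    return {f: (left[f], right[f]) for f in fields if left[f] != right[f]}


weather = DatasetMetadata(
    name="weather_station_readings",
    time_granularity="hourly",
    location_key="lat_lon",
    source="regional weather service",
)
crop_yield = DatasetMetadata(
    name="farm_crop_yield",
    time_granularity="seasonal",
    location_key="field_id",
    source="farm cooperative",
)

mismatches = find_mismatches(weather, crop_yield)
for field, (left_value, right_value) in mismatches.items():
    print(f"{field}: {left_value} vs {right_value}")
```

In this example the metadata shows that the two data sets differ both in time granularity and in how they identify location, so a consumer knows in advance which conversions are needed before combining them.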
Watch the video “Data Feeds People”
One new open data-sharing initiative is the AgStack Foundation, tasked with building a global data infrastructure to modernize data usage for improved agriculture. The need is urgent, as about a third of the world’s food goes to waste while 800 million people go hungry. New approaches are required because legacy technology is not a good fit for the increasingly complex agriculture supply chain.
New data initiatives are not limited to agriculture: they are appearing across a variety of sectors as countries seek to help organizations unlock the value of data by making effective connections between data producers and data consumers.
A good example is GAIA-X, an interoperable data exchange project that started as a government-driven collaboration between Germany and France and is now a global effort supported by over 300 organizations. In the HPE Tech Talk “Will GAIA-X’s federated data model benefit business?” Robert Christiansen, VP for Strategy in the Office of the CTO, and Johannes Koch, HPE’s Managing Director in Germany, described the need to share data for a unified view, including widely distributed data in edge situations such as the global supply chain for automobile manufacturers or insights shared between researchers in healthcare. To help organizations prepare to participate in GAIA-X, HPE launched the HPE Solution Framework for GAIA-X in May 2021.
All of these approaches are works in progress, so watch for new developments as the trend toward better data sharing moves forward.
To find out more about the challenges of connecting data consumers with data producers and how they are being addressed, try these resources:
Watch the video “Data Spaces: Connecting to Data You Can Trust”
Read “Getting the most from your data-driven transformation: 10 key principles”
Find out more about HPE Ezmeral File and Object Store software.
About Ellen Friedman
Ellen Friedman is a principal technologist at HPE focused on large-scale data analytics and machine learning. Ellen worked at MapR Technologies for seven years prior to her current role at HPE, where she was a committer for the Apache Drill and Apache Mahout open source projects. She is a co-author of multiple books published by O’Reilly Media, including AI & Analytics in Production, Machine Learning Logistics, and the Practical Machine Learning series.
Copyright © 2021 IDG Communications, Inc.