Data Strategies for Getting Greater Business Value from Distributed Data
KEY PLAYERS INSIGHTS
DR. PHILIP RUSSOM
Organizations today have an opportunity that’s too good not to seize. If they could access the full wealth of their data that is distributed across cloud and on-premises systems, they could leverage enterprise data to achieve several compelling benefits. These include new forms of advanced analytics, greater operational efficiency, faster time-to-insight, and more impactful decisions (both strategic and operational). However, the greatest barrier to seizing this opportunity is that each of these distributed systems has its own data models, data semantics, interfaces, and restrictions relative to security and governance. Distributed data has long been a challenge to leveraging and reusing data on-premises, and the distribution of data is broadening as organizations embrace hybrid or multi-cloud data architectures.
Let’s drill down into the compelling business opportunities achieved by addressing distributed data, as well as into the technical and organizational challenges of distributed data. Finally, let’s consider several data strategies as solutions for getting greater business value from distributed data, with a focus on logical, virtual strategies such as data virtualization, data fabric, and logical data architectures.
Prepare to Seize the Opportunities of Distributed Data
Before we cover the potential benefits of distributed data, it is important to say a few words about how organizations can best prepare to seize this greater business value from distributed data.
Ideally, an organization will begin by establishing a data management team and populating it with data professionals who have diverse data management skills, the priorities being data integration, data quality, metadata, data virtualization, architecture, and data fabric. They should also build up a mature infrastructure for data management, including diverse data management tools, enterprise standards for data, governance, and support from IT in general.
Note that these preparations can be of any scale – large or small. For example, you can address modest volumes and sources of distributed data with data virtualization technology and skills. You can also take an existing data architecture (which may include a data warehouse, a data lake, and one or more operational databases) and build a logical layer over it that also reaches distributed data sources outside that architecture. Over time, you can build out your distributed data solutions to address massive volumes from dozens or hundreds of sources, probably via an enterprise-scale data fabric.
The Benefits of Addressing Distributed Data
Whether your distributed data solution is small and young or large and mature, you should always be driven by use cases that have recognizable business value. Here are a few examples:
Customer views. A common early-phase goal for such solutions is to start building a “single view of the customer.” Since a customer view can incorporate dozens or even hundreds of customer details, you would start with a handful and expand over time, via a combination of data virtualization and aggregation techniques.
Metrics for performance management. Many metrics in management dashboards are time-sensitive and therefore need frequent or on-demand refreshes, as is ably accomplished (in real time or close to it) with data virtualization or data fabric functionality.
Self-service. The business-friendly views of data created with data virtualization or data fabric are critical for self-service practices, such as data prep, exploration, and visualization.
A data abstraction layer. A mature program for data virtualization, data fabric, or logical data architecture will build out a sizable data abstraction layer as the heart of the deployed solution. The abstraction layer is where numerous data views are presented to users and tools. Through the abstraction layer, users and tools can browse, search, access, integrate, and use distributed data. The layer handles all the necessary interfaces to source systems.
For this reason, the abstraction layer insulates users from the complexity of distributed data, making it more accessible to a wider range of user types. It also enables easy change management; when changes occur in source systems, data engineers can simply adjust the layer, without having to adjust the many tools that access distributed data through the layer.
A single point of entry to distributed data. A data virtualization, data fabric, or logical data architecture solution can support access for large numbers of concurrent users and tools. Furthermore, most or all of these users and tools access distributed data via the abstraction layer. This single point of entry, in turn, simplifies valuable practices such as security, data governance, and enterprise data standards, and it makes them more consistent.
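The abstraction layer and single point of entry described above can be illustrated with a minimal Python sketch. It is not any vendor's API: the class names (`View`, `AbstractionLayer`), the two sources (an on-premises CRM and a cloud billing system), and the role names are all hypothetical, and the sources are stubbed with in-memory data. The sketch shows consumers querying a named view through one entry point, an access check applied in that one place, and a source-side change absorbed by adjusting only the layer.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set

Row = Dict[str, object]


@dataclass
class View:
    """A business-friendly view; `fetch` hides the source-specific interfaces."""
    name: str
    fetch: Callable[[], List[Row]]
    allowed_roles: Set[str] = field(default_factory=set)


class AbstractionLayer:
    """Single point of entry: consumers query views, never the sources."""

    def __init__(self) -> None:
        self._views: Dict[str, View] = {}

    def register(self, view: View) -> None:
        # Registering an existing name swaps the implementation behind the view.
        self._views[view.name] = view

    def query(self, view_name: str, role: str) -> List[Row]:
        view = self._views[view_name]
        # Access control is enforced once, here, for every consumer.
        if role not in view.allowed_roles:
            raise PermissionError(f"role {role!r} may not read {view_name!r}")
        return view.fetch()


# Hypothetical sources, stubbed with in-memory rows.
def crm_customers() -> List[Row]:
    return [{"customer_id": 1, "name": "Acme"}]


def billing_balances() -> List[Row]:
    return [{"cust": 1, "balance": 250.0}]


def customer_view() -> List[Row]:
    # Combine the two sources at query time; nothing is copied or stored.
    balances = {r["cust"]: r["balance"] for r in billing_balances()}
    return [{**c, "balance": balances.get(c["customer_id"])} for c in crm_customers()]


layer = AbstractionLayer()
layer.register(View("customer", customer_view, {"analyst"}))
rows = layer.query("customer", role="analyst")


# Change management: suppose the billing system renames its columns.
def billing_balances_v2() -> List[Row]:
    return [{"customer_id": 1, "amount_due": 250.0}]


def customer_view_v2() -> List[Row]:
    balances = {r["customer_id"]: r["amount_due"] for r in billing_balances_v2()}
    return [{**c, "balance": balances.get(c["customer_id"])} for c in crm_customers()]


# Only the layer is adjusted; the consumer's query call stays the same.
layer.register(View("customer", customer_view_v2, {"analyst"}))
rows_after_change = layer.query("customer", role="analyst")
```

In this sketch the consumer sees the same rows before and after the source change, which is the insulation the abstraction layer is meant to provide. A production solution would push queries down to the sources rather than fetch whole tables.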
Challenges to Leveraging and Reusing Enterprise Data
As mentioned above, data is increasingly distributed as organizations embrace hybrid and multi-cloud data architectures. Such architectures present their own sets of challenges:
Hybrid architecture. This is where distributed sources and targets are strewn across both traditional on-premises IT infrastructure and cloud infrastructure. Consequently, the data virtualization and data fabric tools you use must support – and be optimized for – a wide range of traditional and modern platforms.
Multi-cloud. Sometimes, an organization will commit to an application or data platform that is only available on a specific cloud service provider (CSP). Independent decisions of this sort can easily lead to the use of multiple CSPs in a single enterprise. Since each CSP has its own collection of interfaces, data egress rules, and performance optimizations, accessing and leveraging cloud-based distributed data becomes even more difficult to sort out.
Silos in the cloud. Many organizations have migrated data from on-premises to cloud through simplistic “lift-and-shift” methods, which make little or no improvement to data as it is moved. Lift-and-shift has its place, as a quick and easy preliminary step for getting data into the cloud. However, if it is not followed immediately by data re-engineering and other improvements, the resulting datasets in the cloud will remain sub-optimal silos of low quality.
In addition to the challenges already noted, addressing distributed data is easier said than done. It requires multiple, challenging steps. Here are a few:
Find enterprise data, often in siloed operational databases, that has potential for new value when shared across multiple IT systems and business processes. To assist with this process (as well as other similar processes for data exploration and discovery), some organizations are building an enterprise-scale data catalog and/or metadata repository.
Define the kind of business value that can result. If addressing distributed data is not compelling to business users, there may not be a good reason to integrate specific distributed data assets.
Respect distributed data’s requirements. For example, when accessing source data that is subject to data privacy rules or legislated regulations (common with data domains for customers, healthcare, financials, etc.), those policies must be extended to the aggregated data and other data products that result from your distributed data solutions.
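The last step, extending source policies to derived data products, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a governance standard: the `Dataset` class, the `derive` and `can_access` helpers, and the policy tag names are hypothetical. It also assumes a deliberately conservative rule in which a derived dataset inherits every policy of every source; real policies may allow a tag to be dropped after anonymization or sufficient aggregation.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Set


@dataclass(frozen=True)
class Dataset:
    name: str
    policies: FrozenSet[str]  # hypothetical policy tags, e.g. "privacy"


def derive(name: str, sources: Iterable[Dataset]) -> Dataset:
    """Build a derived dataset that inherits the union of its sources' policies."""
    inherited = frozenset().union(*(s.policies for s in sources))
    return Dataset(name, inherited)


def can_access(dataset: Dataset, clearances: Set[str]) -> bool:
    """A user may read a dataset only if cleared for every policy attached to it."""
    return dataset.policies <= clearances


# Hypothetical source datasets and their policy tags.
crm = Dataset("crm_customers", frozenset({"privacy"}))
claims = Dataset("claims", frozenset({"healthcare"}))
web = Dataset("web_clicks", frozenset())

# The aggregated data product carries the policies of everything it was built from.
customer_360 = derive("customer_360", [crm, claims, web])
```

With these definitions, a user cleared only for `"privacy"` can read `crm` but not `customer_360`, because the derived view also carries the `"healthcare"` tag from the claims source.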
Data Strategies for Getting Business Value from Distributed Data
Given the distribution and diversity of enterprise data assets, you probably need multiple strategies for addressing distributed data. Luckily, there are several options, all supported by known best practices and tool types.
Several common strategies for addressing distributed data are summarized in Figure 1. The vertical axis charts the amount of effort required of these strategies, with the least effort (which is what you want) at the top of the chart. The horizontal axis charts the amount of business value, with the greatest business value (which is what you want) on the right side of the chart.
For each source of distributed data that you consider, you will need to select one or more data strategies for the solution you will build. As you consider data strategies, consider the ones that provide the most business value, but with minimal technical effort. In Figure 1, the strategies plotted in the upper right-hand corner of the chart attain this desirable combination of high value and low effort. These strategies are based on logical and virtual techniques, including data virtualization, data fabric, and logical data architectures.
Strategies of high value and low effort are ideal, but not always possible or needed. For example, sometimes data collocation (also known as “lift-and-shift”) is sufficient as a first step. Likewise, limited dataset consolidation can enable valuable practices (with reasonable effort), such as customer views or operational reporting. In other cases, an organization might lack the resources or willpower for new solutions for distributed data, and so it may be forced to live with existing data silos. At the other extreme, a few organizations are re-engineering most of their data estate as they migrate it to the cloud. This raises the quality, modeling, and metadata of all datasets – and therefore business value, too – but can easily take three years or more to accomplish.
Conclusion
Distributed data is a business opportunity to be seized. It enables more impactful decisions, agile analytics, operational efficiency, faster time-to-insight, and many other benefits.
Distributed data is a tech challenge to be addressed. Addressing it makes more enterprise data accessible to leverage, and it embraces new technologies and data design methods.
Leveraging distributed data is best done with logical or virtual data strategies. Logical data architectures, data fabric, and data virtualization all excel with distributed data. That is because they were originally designed for distributed data environments, and over time they have been optimized and extended for them. Therefore, they are ideal for enterprise data that is distributed across IT systems deployed on premises, on clouds, or in hybrid combinations of these.
For more information about getting business value from distributed data, watch the on-demand session “Distributed Data Across Cloud and On-Premises: Opportunities and Challenges” from the Fast Data Strategy Virtual Summit 2023, where I cover this topic in greater detail, and where you can also hear from a variety of other thought leaders and industry experts.
Dr. Philip Russom is an independent industry analyst and thought leader for data management. Though independent today, Russom has worked at leading IT analyst firms, namely Gartner, TDWI, and Forrester. For decades, his coverage area as an analyst has focused on data and analytics, which includes the technologies and practices of data warehousing, data lakes, data integration, data quality, hybrid data architectures, cloud data management, advanced analytics, and a broad range of database management systems.