Rethinking data sharing in the next decade

Data is today recognized as the single most important resource for economic growth in the next decade. Connecting data in new ways creates new opportunities for innovation and new services. The public and private sectors have engaged in this development by sharing many valuable datasets. But are today's methods for data sharing suitable for the future?

By Safurudin Mahic · December 19, 2020 · 7 min read

The data economy

Data has become the oil of the 21st century, and over the next 5 years we will produce more data than in the previous 5,000 years of humanity. However, unlike oil, data is a renewable resource, and as long as there are computers, data will keep flowing through the fibre pipes that constitute the Internet. Not convinced? Take a look at how much data is generated each minute.

With the increasing amount of data flowing across devices and the growing number of data sources across the public and private sectors, the value of data has gained political attention around the world.

For instance, the EU says:

"Most economic activity will depend on data within a few years. The value of the European data economy for the 28 Member States is expected to grow from €377 billion in 2018 to €477 billion by 2020 and €1.054 billion by 2025"

Similarly, other governments and political bodies recognize the importance of data as a source of innovation, new business models and improvements to existing digital services.

Thus, with the increasing amount of data, over the next 5 years we will witness a large ecosystem emerge around both public and commercial data sources, which will contribute to this economy.

How is data shared today?

Data today is mostly shared via web-based APIs. Perhaps this is a consequence of the famous Vogels quote that APIs will rule the world. Perhaps it is a consequence of other uncontrolled factors in technological evolution. But the fact of the matter is that APIs are the dominant method for a data producer to share data with third-party consumers.

And the promises of APIs are many. Firstly, APIs are language independent. You can build APIs in almost every conceivable programming language there is, and everyone can read from an API in any other language; it takes only a couple of lines of code to fetch something from an API, as the sketch below shows. Secondly, APIs are platform independent. You can host APIs on premises, in the Cloud, on a Raspberry Pi, on your watch, you name it.

As such, the ecosystem around APIs is enormous. REST-based APIs have existed for two decades, and recently we have seen the emergence of alternatives such as GraphQL. Web-based APIs exist for almost anything imaginable, including how to brew beer. So what is the problem with APIs when it comes to sharing data?
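Before we get to that, here is the "couple of lines of code" claim made concrete - a minimal sketch in Python, assuming the `requests` library and a hypothetical endpoint (the URL and response fields are illustrative, not a real service):

```python
# Minimal sketch: fetching JSON from a hypothetical weather API.
# The URL, parameters and fields are illustrative, not a real service.
import requests

response = requests.get(
    "https://api.example.com/v1/observations",
    params={"station": "oslo", "limit": 10},
    timeout=10,
)
response.raise_for_status()  # fail loudly on 4xx/5xx
for observation in response.json():
    print(observation["timestamp"], observation["temperature"])
```

That low barrier on the consuming side is a big part of why APIs became the default.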

Shortcomings

Let's assume you are an entity that wants to make your data available, either as a completely public dataset or as a commercial dataset. We are in the year 2025, when the 40 or so billion IoT devices produce data on the fly. When even your vacuum cleaner generates vast amounts of data today, it is not hard to believe that over 50% of the data created in the world will by then be generated by IoT.

Let's also assume that your system generates a couple of gigabytes of data each day - by no means a large amount even in today's systems. That means you would have some terabytes of data at hand.

Let's further assume your dataset is popular and you have a couple of hundred potential consumers of said dataset.

What do you need to make your data available through an API?

  1. You need engineers to develop, maintain and operate the backend and the API. You probably need to make it performant, robust and secure, and to take care of other non-functional requirements (a minimal sketch of such an API follows this list).
  2. You need somewhere to host the API, either on premises or in a Cloud service.
  3. You need network access to the API and enough bandwidth.
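To make point 1 concrete, here is a minimal sketch of the producer side, assuming Python and Flask; the endpoint, field names and in-memory data source are hypothetical stand-ins, and a real deployment would still need all the non-functional work described above.

```python
# Minimal sketch of a producer-side API, assuming Flask.
# The endpoint and data are hypothetical stand-ins for a real backend.
from flask import Flask, jsonify

app = Flask(__name__)

# Stand-in for the real data source (database, object store, ...).
MEASUREMENTS = [
    {"id": 1, "sensor": "vacuum-42", "value": 0.7},
    {"id": 2, "sensor": "vacuum-42", "value": 0.9},
]

@app.route("/v1/measurements")
def list_measurements():
    # A real API would add auth, pagination and rate limiting here.
    return jsonify(MEASUREMENTS)

if __name__ == "__main__":
    app.run(port=8080)
```

Even this toy version hints at the cost: everything interesting - security, pagination, versioning, uptime - is still left for your engineers to build.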

What kind of problems arise for those who want to make use of your data, perhaps by combining it with other data to make a new service?

  1. They would need engineers to build a robust ingest pipeline, pulling the data out of your API. Anyone who has done this knows how error-prone and brittle these types of pipelines can be; making them robust takes extra care (see the sketch after this list).
  2. In order to make use of the data, they most likely need to copy it - perhaps even historic data, hammering your API to get all those terabytes. So if a dataset has many consumers, and some datasets already have thousands of consumers today, there will be multiple more or less synchronized copies floating around, while only your data is the real source of truth.
  3. Sharing data through an API is not the most human-friendly way of making data useful to others. In practice you are limiting data access to those who are tech-savvy, while in real life there are many business developers, analysts and others who could make equal use of the data to innovate and create new business models.
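Here is what the ingest side of point 1 might look like - a minimal sketch, assuming a hypothetical paginated endpoint, with the retry and resume logic that even a toy pipeline cannot do without:

```python
# Minimal sketch of a consumer-side ingest pipeline.
# The endpoint, pagination scheme and backoff policy are hypothetical;
# real pipelines must also survive schema drift, rate limits and outages.
import time
import requests

BASE_URL = "https://api.example.com/v1/measurements"  # hypothetical

def fetch_page(page, retries=3):
    """Fetch one page, retrying with exponential backoff on failure."""
    for attempt in range(retries):
        try:
            resp = requests.get(BASE_URL, params={"page": page}, timeout=30)
            resp.raise_for_status()
            return resp.json()
        except requests.RequestException:
            if attempt == retries - 1:
                raise  # give up after the last attempt
            time.sleep(2 ** attempt)

def store(batch):
    """Persist the copy - the consumer's own database or data lake."""
    ...

def ingest():
    page = 0
    while True:
        batch = fetch_page(page)
        if not batch:
            break  # assume an empty page marks the end of the data
        store(batch)
        page += 1
```

Every consumer of your dataset ends up writing some variation of this loop, and every one of them ends up maintaining their own copy of your terabytes.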

The questions arise

  1. Is an API really the best way to share such a data source?
  2. Are APIs really well suited if the data in question is binary data such as audio, images, video or files?
  3. Or if your data is real-time data, do you really want consumers to hammer your API endpoint to get up-to-date data?

Furthermore, what do you think would happen if you had 10 systems sharing datasets with others? Or 50? The points above multiply, and you realize that sharing data through APIs is resource-intensive, requiring both special expertise and system capabilities on both ends of the API.

In the example above, we have described only a single data producer. Imagine how many connections you would get in a graph of thousands of data producers and thousands of data consumers, all intertwined in a web sharing data back and forth via APIs. The number of point-to-point integrations grows with the product of producers and consumers: 1,000 producers and 1,000 consumers could mean up to a million bespoke pipelines.

Possibilities

Luckily, some have already faced the issues with the described model of data sharing, and possible solutions are emerging.

One example of such a solution is to make your dataset available through a Cloud provider data marketplace. AWS and Google both have such marketplaces.
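For the consumer, access to such a dataset can be as simple as reading objects from a public bucket. A minimal sketch, assuming the boto3 client and a hypothetical public S3 bucket:

```python
# Minimal sketch: downloading part of a public dataset from S3.
# The bucket and key are hypothetical; real marketplace datasets
# document their own bucket names and layouts.
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Anonymous access - public datasets need no credentials.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

s3.download_file(
    Bucket="example-open-dataset",       # hypothetical bucket
    Key="2020/12/measurements.parquet",  # hypothetical key
    Filename="measurements.parquet",
)
```

Notice that there is no producer-run API in this path at all: the Cloud provider serves the bytes, and the producer only has to keep the dataset current.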

An example of a dataset made available through this mechanism is the OpenStreetMap dataset, and there are hundreds of other datasets available in these marketplaces. The advantage of this mechanism is that the Cloud provider takes care of the infrastructure, and it is equally easy, or in some regards easier, for consumers to make use of the data.

Another possibility is to provide data through a messaging interface or streaming platform, decoupling the producer from consumers and providing a better model for serving real-time data to third parties than APIs, which are by nature request-response. There are many open source alternatives in this space, such as Kafka, and many Cloud providers also offer general-purpose managed services based on a messaging / data-push interface. A well-known example of a purpose-built system based on a data-push model is the notifications you receive on your phone from all the applications you use (a minimal sketch of the push model closes this article).

A third possibility, if you have huge amounts of binary data such as video, is to make such data available in cloud object storage buckets, such as S3, GCS or Azure Blob Storage.

No matter what type of data you have or will have in the future, in order to make good choices for how to share it, you must familiarize yourself with both the possibilities and the shortcomings of the alternatives I have mentioned. Perhaps even better alternatives for data sharing will emerge. Or perhaps we will be stuck with APIs forever. Time will tell.
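As that closing illustration of the push model: a minimal sketch, assuming a Kafka cluster and the kafka-python client, with a hypothetical broker address and topic.

```python
# Minimal sketch of the push model with Kafka (kafka-python client).
# The broker address and topic are hypothetical.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="broker.example.com:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# The producer publishes once; any number of decoupled consumers can
# subscribe to the topic at their own pace - no endpoint to hammer.
producer.send("measurements", {"sensor": "vacuum-42", "value": 0.9})
producer.flush()
```

The key difference from the API model is the direction of flow: consumers subscribe instead of polling, so the producer no longer has to scale its endpoint with the number of consumers.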