Governing your IoT data
How to manage enterprise-wide IoT initiatives
IoT solution governance involves having strategies, architectures, teams, and processes in place to develop, deliver, and maintain successful IoT solutions. Part 1 of this series helped you define your overall IoT governance practices. Two key components of any IoT solution are the devices and the data. Part 2 of this series focused on how to govern your devices, and Part 3 will focus on how to govern your data.
IoT devices generate and transfer a huge amount of data over the internet. An effective governance mechanism is required, both at the enterprise level and at the platform level, to ensure that this data is effectively used by intended stakeholders and not misused by others.
Data governance in IoT has the same characteristics as data governance in an enterprise, such as data collection, data quality, data storage, data processing, and data consumption. However, IoT solutions need to address a few additional areas in the data lifecycle:
- Deciding on the correct data to be collected through the sensors or devices
- Sending the data to the cloud over a network, and then storing it to a cloud platform
- Analyzing the data to predict or optimize outcomes
- Distributing the data to consumers or other applications
- Managing the privacy and security of the data across the entire lifecycle
When defining your IoT data governance practices, you must take into account the IoT context or use case, such as managing a large-scale infrastructure (a smart city), managing a utility infrastructure (an energy grid), or managing production or manufacturing lines.
Governing IoT data throughout the IoT data lifecycle
Each layer of an IoT architecture must manage the IoT data throughout its lifecycle. The IoT data lifecycle starts at the physical layer (devices and gateways) with data collection, proceeds through the communication layer with data transportation, then through the platform services layer with data storage and data processing, and ends at the application layer with data visualization or data reporting.
IoT data management is not only about the sensors, but also about the actuators and the edge gateways. As an example, a visual door lock may have a camera to take a photo of the visitor, may process the photo against a set of pre-approved faces on the edge gateway to automatically recognize a visitor, and actuate the lock to open the door. Thus, you have to manage the sensor data that is collected, manage the reference data that is used to process or filter the sensor data, and finally manage the control data that is sent to the actuator.
To process the data (either real-time data or historical data), data coming from sensors must be stored in appropriate storage. Depending on the data format, the frequency of the data, and the processing needs, you must choose suitable storage and transport the data securely to the storage platform. An end-to-end IoT data management solution needs to address secure transportation, storage, and consumption of the data.
The following sections provide more details about managing the IoT data lifecycle.
IoT data collection begins from the hundreds, possibly thousands, of sensors. These sensors might be incorporated in static devices, such as a smoke alarm, or in mobile devices, such as in a navigation device. These sensors might be manufactured by a widely varying set of manufacturers, and, therefore, might have a wide variety of standards in data formats and protocols. They will likely need regular calibration. Depending on the sensor standards, the data might need to be aggregated or otherwise filtered before sending to the server.
You need to model the data from these disparate sources to make it uniform based on the specific need of the IoT solution, and you must ensure that the data meets the quality requirements in order to transform the data from “raw” data to “usable” data. This phase is often seen as the first gateway between the worlds of unknown, unseen, or unorganized IoT data and structured, meaningful, or usable inputs for the next lifecycle phases.
Begin with standard metadata that defines the data structure that you want to enforce, both for your IoT solution context and possibly for your organization at the enterprise level. While the metadata standards for sensors in one factory or a section of that factory might be unique to that solution, you might want to establish enterprise-level metadata standards for a supply chain that begins on one continent and ends on another.
Due to the varied nature of the devices and of the business solutions, it is quite challenging to have a suitable metadata standard that is at the right level of detail. You might end up generalizing too much or too little. It’s important not to take the “one size fits all” approach. Depending on the business context or solution context, choose the granularity that works across the devices and manufacturers but that allows you to enforce your governance policies.
It could be critical for you to ensure that the sensors are well calibrated, especially in demanding industrial situations or in highly sensitive instruments. You will want to collect data at a sufficient granularity and at a sufficient frequency to allow you to determine, or at least infer, whether the sensor calibration is off. If it is not possible to determine whether calibration is off, you might have to set policies for regular manual verification, which might be more expensive.
IBM Watson IoT Platform supports defining a logical device schema to abstract out complexity and isolate IoT applications from vendor-specific device details. In device schemas, you can use meaningful attribute names for different data that is coming from devices, especially devices from different vendors. Once you have the schema defined, you can use that to create data rules and corresponding actions based on the logical device schema.
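The idea behind a logical device schema can be sketched in a few lines of Python. This is not the Watson IoT Platform API; the vendor names, field names, units, and the `to_logical` and `overheated` helpers are all hypothetical and only illustrate how vendor-specific attributes might be mapped to meaningful logical names so that a rule is written once.

```python
# Hypothetical mapping from vendor-specific payload fields to one logical schema.
# Vendor names, field names, and units are illustrative assumptions.
VENDOR_MAPPINGS = {
    "vendor_a": {
        "temp_c": ("temperature_c", float),
        "hum": ("humidity_pct", float),
    },
    "vendor_b": {
        # This vendor is assumed to report Fahrenheit, so convert to Celsius.
        "tempF": ("temperature_c", lambda v: (float(v) - 32.0) * 5.0 / 9.0),
        "relHumidity": ("humidity_pct", float),
    },
}


def to_logical(vendor, payload):
    """Translate a raw vendor payload into logical attribute names."""
    mapping = VENDOR_MAPPINGS[vendor]
    logical = {}
    for raw_name, value in payload.items():
        if raw_name in mapping:  # unknown fields are ignored
            name, convert = mapping[raw_name]
            logical[name] = convert(value)
    return logical


def overheated(event, limit_c=80.0):
    """A data rule written once against the logical schema."""
    return event["temperature_c"] > limit_c
```

With such a mapping in place, `overheated(to_logical("vendor_b", {"tempF": 212}))` and `overheated(to_logical("vendor_a", {"temp_c": 100}))` evaluate the same rule even though the two payloads use different names and units.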
In addition to data modeling, you must consider data quality, in the form of consistency, completeness, timeliness, and reliability.
- Consistency: Consistency of data refers to the correlation among a series of data points that are reported by the same sensor over a short period of time. For example, a temperature sensor is not expected to report very different temperature readings or wild swings in temperature within a few seconds. Or, a geolocation sensor is not expected to report locations that are many kilometers apart within a few seconds. If such a case happens, it’s likely that the sensor is faulty. A sensor might report readings that are consistent but incorrect if its calibration is off. In critical use cases, you might include multiple sensors and then statistically correlate readings to determine which readings are inconsistent. A sensor that regularly reports data that is inconsistent with readings from other sensors might automatically be reported for a health check.
- Completeness: Completeness of data refers to whether all supporting data points are available; for example, raw data that supports a predicted event or time series data that does not have gaps in it. Data is considered complete when detailed data points are available that can be traced back to a particular sensor at a particular point in time. Where relevant, completeness can refer to the ability to combine and correlate data from multiple sensors or even from other information systems.
- Timeliness: Because so much of IoT data is real-time data or near real-time data, determining whether the sensor data or derived data arrived on time at the required point in the network becomes critical. You want to be able to act on the data in time to prevent or preempt incidents. Where it’s important to correlate readings from multiple sensors, timeliness might include the ability to synchronize the data from those sensors.
- Reliability: Reliability is of course paramount, especially for mission critical applications. For reliability, the measurements from the sensors have to be accurate and repeatable, over a given lifetime of the sensor. Accuracy of the reading refers to the required precision of the measurement that must be achieved by the sensor. Repeatability refers to whether the sensor will produce the same measurement with the same accuracy when put in the same scenario. For example, if the geolocation sensor is brought back to the same street corner, the sensor needs to produce the same location readings to the precision required.
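Two of these quality dimensions lend themselves to simple automated checks. The following is a minimal sketch, assuming readings arrive as `(timestamp_in_seconds, value)` pairs; the function names, the rate limit, and the gap tolerance are illustrative assumptions, not part of any platform.

```python
def inconsistent_readings(samples, max_rate):
    """Consistency check: return the indexes of samples whose change from the
    previous sample exceeds max_rate units per second.

    Note that an isolated spike flags two indexes: the spike itself and the
    reading that follows it (the jump back to normal).
    """
    flagged = []
    for i in range(1, len(samples)):
        t0, v0 = samples[i - 1]
        t1, v1 = samples[i]
        dt = t1 - t0
        if dt > 0 and abs(v1 - v0) / dt > max_rate:
            flagged.append(i)
    return flagged


def find_gaps(samples, expected_interval, tolerance=0.5):
    """Completeness check: return (start, end) timestamp pairs where the spacing
    between consecutive samples exceeds expected_interval * (1 + tolerance)."""
    limit = expected_interval * (1 + tolerance)
    return [
        (t0, t1)
        for (t0, _), (t1, _) in zip(samples, samples[1:])
        if t1 - t0 > limit
    ]


if __name__ == "__main__":
    readings = [(0, 21.0), (1, 21.1), (2, 35.0), (3, 21.2), (7, 21.3)]
    print(inconsistent_readings(readings, max_rate=2.0))
    print(find_gaps(readings, expected_interval=1.0))
```

A check like this only detects sudden swings and missing samples. A sensor whose calibration has drifted can still pass both checks, which is why correlating readings across multiple sensors remains useful.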
IoT raw data comes from different discrete sources with varying formats, structures, and importance. Transmitting a high volume of raw data across a network in real time and storing that raw data is expensive. Data is typically sent from one step to the next by using either wired or wireless protocols. When you design your IoT system, you must carefully select an appropriate transport based on the nature of the data and your IoT devices. There are many transport protocols, including the standard protocols HTTP and MQTT, but also a number of proprietary protocols.
When we start talking about storing IoT data, several interesting questions must be addressed in your IoT governance:
- Where is the IoT data stored? How should the IoT data be stored? Is the cloud always the right place to store the data? No, not always. Consider the example of images that come from a moving car or the example of a submarine that collects underwater images. Even a very high-speed internet connection is not good enough to transfer and save the data in real time in the cloud. A local, intermediate storage is required in such scenarios, and then you periodically send filtered data (by preprocessing it, aggregating it, or fusing it) to a central database, which is used for further analysis. Using intermediate databases is a common strategy in IoT systems. Also, the technology can vary based on the size, type, and speed of the IoT data. When you choose an IoT database, do not start with any preconceived notion about the database. There are multiple options like SQL, NoSQL, Object/Document DB, File Storage, among many others. You can choose one or multiple database technologies based on the storage and access requirements. Even though the cost of storage is reducing day by day, cost is still an important factor due to the sheer volume of IoT data.
- How much IoT data should be stored? The objective of most IoT systems certainly doesn’t include storing precise and predefined datasets. Sometimes the data is not well known before it is stored, and it is fed into unsupervised machine learning algorithms to discover any meaningful patterns. Some amount of redundancy can be allowed, but it should be manageable in terms of disk space requirements.
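The intermediate-storage strategy described above can be sketched with SQLite from the Python standard library. This is only one possible approach; the table layout, the function names, and the choice of summary statistics are assumptions made for illustration. Raw readings are buffered locally, and only a per-sensor summary of the older readings is returned for upload to a central database.

```python
import sqlite3


def open_buffer(path=":memory:"):
    """Open (or create) a local buffer database for raw readings."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS readings (sensor_id TEXT, ts REAL, value REAL)"
    )
    return conn


def store(conn, sensor_id, ts, value):
    """Append one raw reading to the local buffer."""
    conn.execute("INSERT INTO readings VALUES (?, ?, ?)", (sensor_id, ts, value))


def summarize_and_purge(conn, before_ts):
    """Aggregate readings older than before_ts per sensor, delete those raw rows,
    and return the summaries that would be sent to the central database."""
    rows = conn.execute(
        "SELECT sensor_id, COUNT(*), MIN(value), MAX(value), AVG(value) "
        "FROM readings WHERE ts < ? GROUP BY sensor_id ORDER BY sensor_id",
        (before_ts,),
    ).fetchall()
    conn.execute("DELETE FROM readings WHERE ts < ?", (before_ts,))
    conn.commit()
    return [
        {"sensor_id": s, "count": n, "min": lo, "max": hi, "mean": avg}
        for s, n, lo, hi, avg in rows
    ]
```

Calling `summarize_and_purge` on a schedule keeps local disk usage bounded while the central database receives far fewer, smaller records than the raw stream.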
Data processing (analytics)
Devices and sensors generate a huge amount of raw data that needs to be processed to extract meaningful information for users and applications to consume or use. Before you can apply advanced analytics on the data, which is coming from different devices (of different vendors) in different formats, you may need to:
- Transform and standardize all the data and store it in a uniform format that your application understands
- Filter unwanted, repetitive data to improve the accuracy of your analytics
- Enrich the data by integrating other structured or unstructured data from other sources, such as enterprise information system data, weather data, or traffic data
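The three preparation steps above can be sketched as small composable functions. The record layout is an assumption: one hypothetical vendor is taken to send ISO 8601 timestamps and another epoch milliseconds, and the `site_by_device` lookup stands in for data from an enterprise information system.

```python
from datetime import datetime, timezone


def standardize(record):
    """Transform: convert a record to a uniform shape with epoch-second timestamps."""
    ts = record["ts"]
    if isinstance(ts, str):
        # Assumed ISO 8601 text with an explicit UTC offset.
        ts = datetime.fromisoformat(ts).timestamp()
    else:
        # Assumed epoch milliseconds.
        ts = ts / 1000.0
    return {"device": record["device"], "ts": ts, "value": float(record["value"])}


def drop_repeats(records):
    """Filter: drop records that repeat the previous value for the same device."""
    last = {}
    kept = []
    for r in records:
        if last.get(r["device"]) != r["value"]:
            kept.append(r)
        last[r["device"]] = r["value"]
    return kept


def enrich(records, site_by_device):
    """Enrich: attach the site of each device from another data source."""
    return [dict(r, site=site_by_device.get(r["device"], "unknown")) for r in records]


if __name__ == "__main__":
    raw = [
        {"device": "a", "ts": "2024-01-01T00:00:00+00:00", "value": "20.0"},
        {"device": "a", "ts": 1704067201000, "value": 20.0},
        {"device": "b", "ts": 1704067201000, "value": 21.5},
    ]
    cleaned = drop_repeats([standardize(r) for r in raw])
    print(enrich(cleaned, {"a": "plant-1"}))
```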
Processing data for an IoT solution requires that you:
- Manage a high volume of data that needs to be stored and analyzed in real time.
- Provide very fast access to data because a delay would make that data obsolete or not usable for you to make appropriate decisions (for example, consider the scenario of people driving connected cars).
- Handle errors and missed readings.
- Plan for and ensure data standardization and interoperability. You’ll need to deal with protocol and data format issues from combining multiple data formats and aggregating data from different devices and formats.
- Design your IoT solution for the types of analytical processing that are needed based on your use case, such as real-time, machine learning, cognitive, or edge analytics. For example, a machine learning model can be very useful for asset management solutions to detect possible equipment failures based on past data and a failure model.
Therefore, the IoT governance team has to decide where computing power is applied by defining these IoT data governance policies:
- Data collection policies, which determine how the data is collected and sent for further processing. Data can be sent as raw data or can be sent after initial processing, depending on the business need and on the device capabilities. Such policies also establish the security protocols that ensure that the data that is received is valid data.
- Data filtration policies, which define how data is managed after it is collected. Not every data point needs to be sent all the way to a data center. Sensor data is often processed for aggregation before you send it to the next point in the network. Because most IoT devices (at the time of publishing of this article) are meant as low-power devices, there is often a gateway or an aggregation point assigned to a certain area, which manages and collects data from sensors within that area. For example, a height sensor in an elevator might collect and send raw data collected every 100 ms to a controller gateway, the controller gateway might determine whether the elevator is moving or stuck between floors, and then the gateway might send the data on to another aggregator that reports on failure events across multiple elevators.
- Data upload policies, which define what data is meant for further processing and long-term storage. Among the huge amount of data that is generated by IoT devices, only the events of interest, supported by data points, need to be uploaded to the cloud for processing and long-term storage. So, typically, only processed, filtered data is uploaded. Raw data points are seldom sent for long-term storage. However, each point in the network might store some relevant data for detailed auditing, defect identification, or health-check purposes. In the above elevator example, the controller gateway might store raw height data for the last 24 hours so that it can be used for further diagnosis if a failure event is detected.
IBM Watson IoT Platform provides a rich set of analytical services to analyze IoT data that is produced by a wide variety of devices. Depending on the need of a specific application, developers can start with basic real-time processing that uses rules and actions and progress to creating advanced machine learning models that predict possible outcomes based on data sent by the devices.
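The elevator example can be sketched as a small gateway class that applies the filtration and upload policies together. Everything here is a hypothetical illustration: the class name, the floor height, the tolerances, and the rule that an elevator is "stuck" after a number of motionless samples between floors are assumptions, not a real controller design. Raw readings stay in a bounded local buffer sized for 24 hours at one sample per 100 ms, and only a failure event is passed upstream.

```python
from collections import deque

SAMPLES_PER_SECOND = 10                              # one reading every 100 ms
RETENTION_SAMPLES = 24 * 3600 * SAMPLES_PER_SECOND   # about 24 hours of raw data


class ElevatorGateway:
    """Hypothetical controller gateway for one elevator."""

    def __init__(self, floor_height_m=3.0, tolerance_m=0.05, stuck_after=50):
        self.raw = deque(maxlen=RETENTION_SAMPLES)  # oldest readings drop off
        self.floor_height_m = floor_height_m
        self.tolerance_m = tolerance_m
        self.stuck_after = stuck_after  # motionless samples between floors
        self._still = 0
        self._reported = False

    def _at_floor(self, height_m):
        """True when the height is within tolerance of a floor level."""
        offset = height_m % self.floor_height_m
        return min(offset, self.floor_height_m - offset) <= self.tolerance_m

    def ingest(self, height_m):
        """Store a raw reading locally. Return an event dict only when a stuck
        condition is first detected; otherwise return None."""
        moving = bool(self.raw) and abs(height_m - self.raw[-1]) > 0.001
        self.raw.append(height_m)
        if moving or self._at_floor(height_m):
            self._still = 0
            self._reported = False
            return None
        self._still += 1
        if self._still >= self.stuck_after and not self._reported:
            self._reported = True
            return {"event": "stuck_between_floors", "height_m": height_m}
        return None
```

With this split, the upstream aggregator receives at most one event per incident, while the 24-hour buffer in `raw` remains available on the gateway for diagnosis.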
Data consumption (visualization and reporting)
The data consumption phase covers data selection, enrichment, cleansing, reporting, visualizing, and other housekeeping activities. At a very high level, there are two strategies driving this phase.
- Handling of known data – The system is already familiar with the data, and the processing steps, such as filtering, pre-processing, analysis, and post-processing, are already defined. Reporting requirements are also clear, and the system is designed and implemented with a defined objective in terms of what to do with the data.
- Handling of unknown data – In this case, the “what to do with the data” part is not well defined in the beginning and is derived at a later stage after the initial analysis of the data. The data consumption volume is also more in this case. But, reporting requirements are less, as the generated reports might not be meaningful or useful at this stage.
To get value from the IoT solution, raw or processed data (by an IoT application) needs to be made available to external users and other applications in a secure way. The main objectives of this phase and the related governance policies are as follows:
- Data selection
Push/pull policies, which define whether to poll the devices continuously to fetch the data or to design devices to send the data to the server with no prompting.
Filtering policies, which help to optimize data throughput and define how sensor devices must apply filter criteria to the data-update events so that they are only raised when needed.
- Data sharing
Data sharing policies, which specify an agreement for who can view or use the IoT data and what the access mechanism will be.
Channel usage policies, which determine whether the data is consumed by using private or public channels.
- Usage
Consumption QoS policies, which identify and define the QoS parameters for the IoT data, such as performance, throughput, reliability, and availability of the data.
Usage tracking policies, which define how the IoT data is being used. Data collection procedures and filtering criteria can be fine-tuned accordingly.
Monetization policies, which define whether IoT data and patterns, analytics, and insights have monetary value. Monetization refers to the realization of financial benefits from the usage of IoT data.
Retention policies, which define how much data can be maintained. Even big data eventually gets too big and costly to maintain.
IBM Watson IoT Platform has built-in support for visualizing real-time data by using boards and cards. Developers can build custom visualizations based on the data that is stored in Watson IoT Platform by accessing the data through the secured API that the platform provides.
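A filtering policy of the kind listed under data selection is often implemented as a deadband: a device raises a data-update event only when the value has moved by at least a threshold since the last value that it sent. The following sketch assumes numeric readings; the class name and the threshold are illustrative.

```python
class DeadbandFilter:
    """Raise an update only when the value has changed by at least
    `threshold` since the last value that was sent."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.last_sent = None

    def should_send(self, value):
        if self.last_sent is None or abs(value - self.last_sent) >= self.threshold:
            self.last_sent = value  # remember what the consumer last saw
            return True
        return False


if __name__ == "__main__":
    f = DeadbandFilter(threshold=0.5)
    readings = [20.0, 20.1, 20.4, 20.6, 20.7, 19.9]
    print([r for r in readings if f.should_send(r)])
```

Comparing against the last sent value, rather than the previous reading, prevents a slow drift from going unreported.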
Managing IoT data privacy and security
The data that is collected and stored by different IoT devices, services, and systems is increasingly being scrutinized by regulatory agencies and governments. Tighter laws and regulations are being introduced around the protection of sensitive and personal data. It is becoming more important for developers and designers to know exactly what data is being stored and why it is being stored. Also, legislation or regulations that are specific to countries or regions can make the physical location of the IoT data an important consideration. In the US, there is the Health Insurance Portability and Accountability Act (HIPAA). In the EU, the General Data Protection Regulation (GDPR) was adopted in April 2016 and becomes enforceable in May 2018. And, if your IoT solution involves the use of drones, there are cyber laws for drones in Israel.
As you design your IoT solution, you need to keep these IoT data privacy and security guidelines in mind:
- Data categorization based on sensitivity. Data definitions need to clearly categorize data in terms of Personally Identifiable Information (PII), Sensitive Personal Information (SPI), secure information, and open information. Consider these examples: health monitoring data of an individual may be considered SPI; any data that is traceable to an individual may be considered PII; and CCTV footage of a public place may need to be secured so that hackers cannot replace the feed with false clips.
- Controls and validation on data uploaded. After data is uploaded, the data has to be secured against unauthorized access and manipulation. Necessary controls must be established against each category of data.
- Data privacy by design. The solution has to be designed from the beginning to ensure data privacy. The solution has to apply techniques such as data partitioning, pseudonymization, anonymization, or encryption on data to reduce the risk of unintended revelation and the risk of consumers being able to correlate the data to derive sensitive information.
- Access control. The IoT governance team should establish granular role-based access control to each section of the data, by leveraging the granularity of the data definitions. Access control must prevent access through generic logins and must enforce that each user uses individual credentials for all access.
- Protection in the context of multi-tenancy. Given that almost all IoT data is stored and managed in the cloud, multi-tenancy security becomes paramount. The person who is responsible for data processing must implement privacy design and controls to separate a data workspace for each person working with the data to prevent any data crossover, including when backing up and restoring IoT data.
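Two of these guidelines, pseudonymization and role-based access control tied to data categories, can be sketched with the Python standard library. The role names, the category assignments, and the helper names are hypothetical, and a keyed hash is only one of several pseudonymization techniques. Key management, which matters a great deal in practice, is outside the scope of this sketch.

```python
import hashlib
import hmac


def pseudonymize(identifier, secret_key):
    """Replace an identifier with a keyed hash (HMAC-SHA256). The same
    identifier and key always yield the same pseudonym, so records can still
    be correlated without exposing the original value."""
    return hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical mapping of roles to the data categories that they may read.
ROLE_PERMISSIONS = {
    "treating_doctor": {"SPI", "PII", "secure", "open"},
    "operations": {"secure", "open"},
    "public": {"open"},
}


def can_read(role, category):
    """Return True if the role is allowed to read data of this category.
    Unknown roles get no access."""
    return category in ROLE_PERMISSIONS.get(role, set())
```

For example, a health-monitoring record could carry `pseudonymize(patient_id, key)` in place of the patient identifier and be tagged with the category "SPI", so that `can_read("operations", "SPI")` denies access while `can_read("treating_doctor", "SPI")` allows it.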
Defining IoT data governance policies
Data governance for IoT requires well-documented policies to ensure that the data that is generated and used by the IoT solution conforms to all the requirements and standards. Data governance requires a significant focus on security policies to allow for the valid consumption of data.
The Data Policy Reference Model (see the diagram below) can assist organizations in understanding and creating a complete set of data governance policies. The model consists of:
- Policy life cycle management, which manages the authoring, transformation, enforcement, and monitoring of the policies.
- Data policy layers, which is a vertical dimension for policy classification and provides a level of abstraction for policies including business, architectural, and operational.
- Data policy domains, which is a horizontal dimension for policy classification and identifies the policy domains that each of the policy layers must address or at least consider. This includes business, process, service, information, and non-functional requirements.
- Data policy enablers, which assist in the proper management of the policy life cycle, including policy auditing and logging, which provides traceability, distribution and transformation, and monitoring and reporting.
- Data lifecycle management, which covers the policies related to data collection, transportation, storage, and processing.
The indicative policies for IoT data governance include these policies:
- Business policies:
- Privacy and security policies: All customer information must be encrypted, whether it is data in motion or data at rest. Patient data can be viewed only by a doctor who is treating the patient.
- Industry regulation policies: If a transaction is greater than $10,000, then a record of the transaction is sent to government revenue authorities.
- Architecture policies:
- Business rules can be used to provide ownership decisions as part of a maintenance process, routing to separate process paths based on parts type or condition.
- Preventative activities to be initiated when the appropriate threshold values are reached.
- For example, for data transportation, “Use devices supporting custom encryption if underlying network does not support secured transportation.” Or, for data storage, “All data more than 15 days old must be moved to a back-up storage.”
- Operational policies:
- Access control policies: Who can access what data at runtime. For example, “Sensor data coming from public sensors can be viewed by anyone, but sensor data that measures private asset status can only be viewed by authorized personnel.”
- Message protection policies: If the transportation layer is not secured, ensure that message data is encrypted before it is transported.
- Data integrity policies: How long can the data be trusted or used.
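Several of the indicative policies above are concrete enough to be expressed as executable rules, which makes them easier to test and audit. The following sketch encodes three of them; the function names are hypothetical, and a real deployment would more likely use a rules engine or policy service than hard-coded functions.

```python
from datetime import datetime, timedelta

REPORTING_THRESHOLD_USD = 10_000     # industry regulation policy
ARCHIVE_AFTER = timedelta(days=15)   # data storage policy


def must_report(transaction_usd):
    """Business policy: transactions above the threshold are reported
    to government revenue authorities."""
    return transaction_usd > REPORTING_THRESHOLD_USD


def should_archive(created_at, now):
    """Architecture policy: data more than 15 days old is moved to
    back-up storage."""
    return now - created_at > ARCHIVE_AFTER


def must_encrypt_message(transport_secured):
    """Operational policy: encrypt the message itself when the transport
    layer is not secured."""
    return not transport_secured


if __name__ == "__main__":
    now = datetime(2024, 1, 31)
    print(must_report(12_500))
    print(should_archive(datetime(2024, 1, 1), now))
    print(must_encrypt_message(transport_secured=False))
```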
In this article, we have discussed why data governance is a key aspect of designing and operating an IoT solution. We have described key data lifecycle phases, such as how data is to be collected, how data is transported and stored, and how data is to be processed in an IoT solution. Data security is a key dimension of the data governance solution and proper care must be taken to address data security issues at each phase of data lifecycle. Finally, we presented a common set of IoT data governance policies.