Digital twins and the Internet of Things

What is a digital twin? A digital twin is a virtual copy of a physical system. Thanks for reading, bye! Just kidding, but this simple definition is the essence of the term that was originally coined by Dr. Michael Grieves in 2002. And, by the way, this concept was first used by NASA for space exploration missions. Initially, NASA had physical twins: actual physical copies of the spaceship on earth reflecting the remote spaceship’s state.

Today, digital twins are affecting all industries, most notably manufacturing, automotive, construction, utilities, and healthcare.

Digital twins are not only used during the operation of a system, but they are also used during the design and build phases. But more on that later. Just remember for now: Reason, Realize, Run!

So let’s break down this article into two parts. First, I’ll describe the current technology that is used for digital twins and that makes up their architecture. Data management, databases, and (real-time) machine learning play a crucial part here. Once I’ve explained the technological foundations, I’ll discuss the different types of digital twins that currently exist.

Architecture of digital twins

Digital twins are connecting the physical realm and virtual world. But wait. Haven’t we already been doing this? Isn’t ERP (Enterprise Resource Planning) all about managing physical assets by virtual copies? And, isn’t each record in a client database a digital twin of a real person? Yes, it is. But, to make a digital twin really cool, we need to have two things in place:

  • Real-time data integration
  • Real-time machine learning

Real-time data integration

We have been doing batch data integration for decades. IBM even dedicated a whole tool suite to this task called IBM Information Server. But real-time data integration had not really been considered in this solution. Therefore, alternatives like IBM Streams, Apache Flink, Apache Spark Structured Streaming, Apache Kafka, and Node-RED emerged, just to name a few.

Apache Spark is particularly useful because it combines batch processing with streaming. Apache Spark Version 2.3 uses micro-batching and nearly matches the performance of Apache Flink and IBM Streams, although it still cannot match the extremely low latency that IBM Streams provides. Even so, it gets us closer to a reasonable solution for real-time data integration.

We can’t talk about real-time data integration in the IoT without talking about edge computing. (Edge computing is worth its own separate article, which we’ll publish soon.) You do not always have to integrate data into a centralized cloud storage. It can also be distributed over edges of all kinds, and it can be directly processed where it makes the most sense. Edge computing addresses three major concerns:

  • Network partitioning. The closer you are to the edge, the more unreliable your network connections can become. Therefore, a smarter way of local data processing mitigates the problem of disconnected edges. For example, you don’t want to wait for the irrigation system to start irrigating your fields until network access has been restored.

  • Network latency. Again, the closer you get to the edge, the higher the network latency in your solution tends to become. Decisions that are made on the edge avoid this latency, so they are made faster. Latency matters because most IoT sensor data loses its value within the first few seconds. For example, after a child runs in front of a self-driving car, you don’t want to wait another 250 ms for the brakes to stop the car.

  • Data privacy. IoT sensors, including cameras and microphones, capture highly valuable data, but they also raise the greatest concerns about data privacy. If data is processed directly on the edge, critical information never needs to leave the edge device. For example, occupancy in an elevator, once it is measured from a video stream, can be used to optimize scheduling and floor assignment, which reduces wait times and balances the workload across elevators. However, you never want the video stream from inside an elevator to leave the edge device.
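
To make these concerns a bit more concrete, here is a minimal sketch in Python of an edge device that decides locally and only uploads small summaries. Everything in it is an assumption made for illustration: the EdgeNode class, the moisture threshold, and the summary format are hypothetical and not taken from any IBM product.

```python
from collections import deque


class EdgeNode:
    """Hypothetical edge device: decides locally, uploads only summaries."""

    def __init__(self, moisture_threshold=0.30, max_buffered=1000):
        # Assumed rule: irrigate when soil moisture falls below the threshold.
        self.moisture_threshold = moisture_threshold
        # Bounded queue, so a long network outage cannot exhaust memory.
        self.outbox = deque(maxlen=max_buffered)

    def handle_reading(self, soil_moisture):
        """Make the decision on the device; never wait for the network."""
        irrigate = soil_moisture < self.moisture_threshold
        # Only a compact summary is queued for upload, not a raw data stream.
        self.outbox.append({"moisture": soil_moisture, "irrigate": irrigate})
        return irrigate

    def flush(self, network_up):
        """Send the queued summaries once the network is reachable again."""
        if not network_up:
            return []
        sent = list(self.outbox)
        self.outbox.clear()
        return sent


if __name__ == "__main__":
    node = EdgeNode()
    for moisture in (0.45, 0.28, 0.22):
        print(moisture, "-> irrigate:", node.handle_reading(moisture))
    print("uploaded while offline:", len(node.flush(network_up=False)))
    print("uploaded after reconnect:", len(node.flush(network_up=True)))
```

The decision never depends on the network (partitioning and latency), and what leaves the device is a summary rather than the raw signal (privacy). A real deployment would also need persistence and retry logic, which this sketch leaves out.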

Real-time machine learning

Traditional ERP systems are rule-based systems. The rules are manually implemented in software and mostly derived from domain experts in interviews and by looking at historic data and processes. A lot of manual work is involved, and those rules are changing very infrequently.

In digital twins, data is ingested and processed in real time. This allows for models of physical systems, such as black box models powered by machine learning or white box models that are defined by domain experts, to act on the data in real time. For example, an anomaly detector will raise an alert and shut down a production line in order to prevent further damage. Or, after simulating outcomes with different parameter sets on the digital twin, the real system is updated with the optimal parameter set.
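
As a rough illustration of a model acting on data as it arrives, the following sketch flags values that deviate strongly from a rolling window of recent readings. It is a deliberately simple statistical stand-in (a rolling z-score), not a trained machine learning model, and the class name, window size, and threshold are assumptions chosen for the example.

```python
from collections import deque
from statistics import mean, pstdev


class RollingAnomalyDetector:
    """Flags readings that are far from the mean of recent normal readings."""

    def __init__(self, window_size=20, z_threshold=3.0, min_samples=5):
        self.window = deque(maxlen=window_size)  # most recent normal readings
        self.z_threshold = z_threshold
        self.min_samples = min_samples

    def update(self, value):
        """Return True if the value looks anomalous, False otherwise."""
        is_anomaly = False
        if len(self.window) >= self.min_samples:
            mu = mean(self.window)
            sigma = pstdev(self.window)
            # Without any variation in the window, no z-score can be computed.
            if sigma > 0 and abs(value - mu) / sigma > self.z_threshold:
                is_anomaly = True
        if not is_anomaly:
            # Keep anomalies out of the baseline so they do not mask later ones.
            self.window.append(value)
        return is_anomaly


if __name__ == "__main__":
    detector = RollingAnomalyDetector()
    for reading in (10.0, 10.1, 9.9, 10.2, 9.8, 10.0, 10.1, 25.0):
        if detector.update(reading):
            print("alert: anomalous reading", reading)
```

In a production line scenario, the alert branch is where a shutdown command would be issued. Whether that decision is taken automatically or by an operator is a design choice outside the scope of this sketch.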

Most machine learning models are trained on data at rest. So, we need to store all real-time data somewhere where we can retrieve it in an efficient way:

  • Many machine learning algorithms can also be trained on data streams by using windows (you can find out more about windows here). One critical phase in implementing machine learning is hyperparameter tuning, where you re-run your model training many times with different parameter configurations to obtain the best result. Tuning is much harder on data streams: if you have a new idea or want to test a new algorithm, the data is already gone because it was never stored.

  • In real-time model training, your system performance must always keep up with the data arrival rate. Otherwise, buffers get overrun and the system starts trashing (throwing away data).

  • Temporally distant events can’t be considered together, because training on windows limits how far back in time the model can see.
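
The limitations above can be sketched in a few lines of Python. The first function groups timestamped events into fixed, non-overlapping (tumbling) windows, and the second simulates a bounded buffer that loses data when the consumer cannot keep up. Both function names and the event format are assumptions made for this illustration.

```python
from collections import defaultdict, deque


def tumbling_windows(events, window_seconds):
    """Group (timestamp, value) events into fixed, non-overlapping windows.

    Returns a dict that maps each window start time to the values in it.
    A model trained per window never sees values from two windows together.
    """
    windows = defaultdict(list)
    for timestamp, value in events:
        window_start = int(timestamp // window_seconds) * window_seconds
        windows[window_start].append(value)
    return dict(sorted(windows.items()))


def ingest(events, buffer_size):
    """Simulate a consumer that lags behind: a full buffer evicts old events.

    Returns the events that are still buffered and the number that were lost.
    """
    buffer = deque(maxlen=buffer_size)
    dropped = 0
    for event in events:
        if len(buffer) == buffer.maxlen:
            dropped += 1  # appending now evicts the oldest buffered event
        buffer.append(event)
    return list(buffer), dropped


if __name__ == "__main__":
    events = [(0.5, 1.0), (3.2, 2.0), (9.9, 3.0), (10.0, 4.0), (25.1, 5.0)]
    print(tumbling_windows(events, window_seconds=10))
    print(ingest(events, buffer_size=3))
```

Once an event has been evicted from the buffer or its window has closed, it cannot be used for another training run unless it was also written to durable storage.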

So we actually need to do both: process data on real-time streams and on historic data. We need to create a corpus of historic data, and there is just no way around this.

The IoT Data Management Challenge

Data processing on IoT data is definitely quite challenging, mainly because of the vast amounts of data arriving at high speeds. As we’ve learned before, it might be crucial to have access to historic data for model training. But before we talk about the best IoT data store, let’s consider something else that is equally important: metadata.

Digital twins often reflect thousands of sensor parameters. In order to not get lost, metadata databases are used. Here in Munich, we are using a graph database because this allows us to model the physical system in hierarchies. For example, the IBM Watson IoT Center is composed of 2 buildings, 33 floors each, various rooms per floor, and various sensors per room. So, using hierarchical graph queries, the relevant data sources can be selected that need to be taken into account for a particular downstream analytics task.
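
A hierarchical query of this kind can be sketched without a graph database at all. The example below uses plain Python dictionaries as a stand-in for the metadata graph. The node names, sensor names, and the sensors_below function are hypothetical, and a real graph database would express the same traversal in its own query language.

```python
# Hypothetical metadata graph: each node maps to its child nodes.
CHILDREN = {
    "site": ["building-A", "building-B"],
    "building-A": ["A/floor-1", "A/floor-2"],
    "building-B": ["B/floor-1"],
    "A/floor-1": ["A/1/room-101"],
    "A/floor-2": ["A/2/room-201"],
    "B/floor-1": ["B/1/room-101"],
    "A/1/room-101": ["sensor-temp-1", "sensor-co2-1"],
    "A/2/room-201": ["sensor-temp-2"],
    "B/1/room-101": ["sensor-temp-3"],
}

# Hypothetical sensor metadata: what each leaf node measures.
SENSOR_KIND = {
    "sensor-temp-1": "temperature",
    "sensor-temp-2": "temperature",
    "sensor-temp-3": "temperature",
    "sensor-co2-1": "co2",
}


def sensors_below(node, kind=None):
    """Return all sensors underneath a node, optionally filtered by kind."""
    found = []
    stack = [node]
    while stack:
        current = stack.pop()
        if current in SENSOR_KIND:
            if kind is None or SENSOR_KIND[current] == kind:
                found.append(current)
        else:
            # Not a sensor: keep descending through buildings, floors, rooms.
            stack.extend(CHILDREN.get(current, []))
    return sorted(found)


if __name__ == "__main__":
    print(sensors_below("building-A", kind="temperature"))
    print(sensors_below("site"))
```

The result of such a query is the list of sensor identifiers whose data a downstream analytics task needs to load.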

Trashing (throwing away data under load) is a common problem in any IT system, and IoT systems aren’t exempt from it. A lot of folks are using time-series databases in their IoT solutions because they promise high-throughput ingestion and efficient temporal queries. But let me share with you how I’m doing it. I’m using Cloud Object Storage with a simple folder scheme:


I’m creating an index for every 1-second temporal window for each and every sensor. The UUID avoids conflicts, and I can use it to look up the metadata in the metadata repository. I don’t even care if this data is distributed across multiple IoT edges, because I either keep track of an index of data partitions that contain a specific sensor UUID or I just ask for them all.

So I’m making heavy use of the fact that IoT sensor data is “append only” data that is written by a single thread per sensor UUID. Access to Cloud Object Storage can now be parallelized down to the lowest level of the folder hierarchy (the second), which means we get practically unlimited, linear scalability!

Inside the second folder, I’m using Apache Parquet files, where data is compressed (for faster I/O) and stored in a columnar format. I can now use Apache Spark SQL out of the box to get an SQL view over all my data, while backup, replication, and scaling are taken care of by Cloud Object Storage.
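
For illustration only, here is one way such object keys could be generated in Python. The layout used here (sensor UUID first, then year, month, day, hour, minute, and second, with a data.parquet object at the lowest level) is an assumption made for this sketch and not necessarily the exact scheme described above. The function names are hypothetical as well.

```python
import uuid
from datetime import datetime, timezone


def object_key(sensor_uuid, timestamp):
    """Build the key of the Parquet object for one sensor and one second.

    Assumed layout: <uuid>/<YYYY>/<MM>/<DD>/<HH>/<mm>/<SS>/data.parquet
    The timestamp should be timezone-aware; it is normalized to UTC.
    """
    ts = timestamp.astimezone(timezone.utc)
    return f"{sensor_uuid}/{ts.strftime('%Y/%m/%d/%H/%M/%S')}/data.parquet"


def key_prefix(sensor_uuid, timestamp, level):
    """Build a prefix that selects a whole day, hour, or minute of one sensor."""
    formats = {"day": "%Y/%m/%d", "hour": "%Y/%m/%d/%H", "minute": "%Y/%m/%d/%H/%M"}
    ts = timestamp.astimezone(timezone.utc)
    return f"{sensor_uuid}/{ts.strftime(formats[level])}/"


if __name__ == "__main__":
    sensor = uuid.uuid4()  # in practice, the UUID comes from the metadata store
    now = datetime.now(timezone.utc)
    print(object_key(sensor, now))
    print(key_prefix(sensor, now, "hour"))
```

Because each sensor writes only to its own prefix and never rewrites an object, writers do not need to coordinate with each other, and a reader can list a prefix to fetch exactly the time range it needs.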

Different types of digital twins

Now that you understand the basic architecture and data management challenges with digital twins, let’s explore the different types of digital twins.

As you remember, we started this article with “Reason, Realize, Run!” … so what do we mean by that? When we talk about digital twins, we most frequently think about a software solution that mirrors a production system in a digital way. But there is more to it. The production system had a history before it was built, and digital twins can support the complete development cycle of such a product. Therefore, Reason stands for the product planning phase, Realize stands for the product production phase, and Run stands for the product deployment phase. Three different digital twins, all working in orchestration. That is, data from a deployed product can influence planning and production of the new version of the product.

So, let’s dive into the different types of digital twins:

  • Part twin. A part twin, as the name implies, is tied to a part of a bigger system. Let’s consider the bearing of an energy production plant, for example. This bearing can have a digital twin at operation time, which describes its condition, such as the estimated Mean Time Between Failures (MTBF) or Mean Time to Failure (MTTF). That data can be derived (predicted or modeled) from current data (such as vibration sensor data or sound) but also from data from the design or build phase (such as what gear tooth shape was designed or what molding cutter was used for building it). And, of course, the findings that are discovered during operation of the part can be fed back to the design and build phase. Just remember, digital twins are about: Reason, Realize, Run!

  • Product twin. A product twin is basically a set of part twins that also reflects their interactions. From a software perspective, the product twin usually lives in the same application as its part twins, so part twins are accessed from the product twin by drilling down. An example of a product twin is the generator of an energy production plant, which contains multiple bearing part twins, among others.

  • System twin. A system twin is just one more step up from the product twin. Again, system twins are most likely implemented in the same software product and provide similar functionality as a product or part twin, but they offer a view of a whole system. To stay with the energy production plant example, a system twin (depending on its definition) might reflect the historic and current state and predict the future state of a particular power train of a power plant, the whole power plant, or even a partition of a power grid.
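
The drill-down relationship between these three types can be sketched as a simple tree in Python. The Twin class and its fields are hypothetical, and the MTBF shown here is only the naive observed value (operating hours divided by the number of failures), not the predicted value that a real digital twin would derive from sensor data.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Twin:
    name: str
    level: str                    # "part", "product", or "system"
    operating_hours: float = 0.0  # only meaningful for part twins here
    failures: int = 0
    children: List["Twin"] = field(default_factory=list)

    def drill_down(self) -> Iterator["Twin"]:
        """Yield this twin and every twin below it, top to bottom."""
        yield self
        for child in self.children:
            yield from child.drill_down()

    def mtbf_hours(self) -> Optional[float]:
        """Observed MTBF over all part twins at or below this twin."""
        parts = [t for t in self.drill_down() if t.level == "part"]
        hours = sum(p.operating_hours for p in parts)
        failures = sum(p.failures for p in parts)
        return hours / failures if failures else None


if __name__ == "__main__":
    bearing_1 = Twin("bearing-1", "part", operating_hours=10000, failures=2)
    bearing_2 = Twin("bearing-2", "part", operating_hours=10000, failures=0)
    generator = Twin("generator", "product", children=[bearing_1, bearing_2])
    power_train = Twin("power-train", "system", children=[generator])

    print([twin.name for twin in power_train.drill_down()])
    print("bearing-1 MTBF:", bearing_1.mtbf_hours())
    print("power-train MTBF:", power_train.mtbf_hours())
```

Part, product, and system twins share one class in this sketch, and the level is just a label on the node, so moving between them is a matter of walking up or down the tree.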

There are many additional types of digital twins, depending on whom you ask, but I think I’ve covered the most significant ones. And, in my opinion, there is no need to distinguish between them in a real software system because it’s all about drilling up and down.

Digital twins are in their infancy, but growing up fast!

“Just throw in as much data as you can and run some smart AI” – that’s maybe how marketing sells a digital twin. Actually, it’s not that wrong. Digital twins are benefitting from the fact that there exists an abundance of machine-generated data, which is a luxury that you don’t have in other data science disciplines (such as drugs that are “thrown” at populations after having been tested on “only” 3000 individuals). When there is a lot of data, the use of deep learning models starts to become feasible.

Now, I’m going to introduce you to an IBM product that I’ve contributed deep learning models to (you can learn about these models in this tutorial series). It’s called IBM Maximo Asset Health Insights. IBM Maximo is an asset management tool suite. Remember, digital twins need metadata. This is where asset management comes in. Each asset that is managed through Maximo can be transformed into a digital twin by attaching (real-time and historic) sensor data to it. Once this is done, Maximo Asset Health Insights can monitor the asset and predict failures in real time. So, for our clients, through that product, the vision of digital twins has already become a reality.


Digital twins are the new version of a control center that combines historical and current system state with a future, predicted state. Drill-down capabilities enable users to dive into individual products or product parts but also show the big picture that allows for highly complex optimization tasks. Finally, digital twins are not only useful during operation but unfold their complete potential when product design and fabrication are taken into account as well. Remember: Reason, Realize, Run!