Rhize Up

Rhize Up w/ David Schultz: Why Your Backend Matters

David Schultz Season 1 Episode 9

In this episode, we break down why it is critical to understand how data is organized and consumed. Different data structures are better for different types of analysis. And here's a hint, you are going to need something more than a data lake...

[David]

All right, well, let’s go ahead and fire this off. So good morning, good afternoon, good evening, and welcome to the Rhize Up podcast. We are back after a brief hiatus, and we’re going to get right into it.

So we’ve talked a little bit before about the unified namespace. We’ve talked about manufacturing data hubs and things like the data broker, which is a lot of how do we get data moved around? How do we connect all these various systems?

We talk about this connect, collect, store, analyze, visualize, and a good chunk of that’s just really been on that connecting and collecting. But once you get into storage, it sort of seems to be a little bit of hand-waving, and we just say, oh, we’re just going to go put it in some kind of database, and then when you need it back for some kind of retrieval, you just go ahead and get that. But as we talk about these types of things, we realize that there’s a lot of different ways to store data and a lot of different data structures.

And really, as I like to say, your back end matters. And joining us again today is a previous guest on the Rhize Up podcast, Andy German. So Andy, if you could just take a quick moment.

And even though people are very familiar with who you are, we might have some new people. Please take a moment to introduce yourself to our guests.

 

[Andy]

Yeah, sure. I’ll be quick. No one’s that interested, I don’t think.

But yeah, I’ve been with Rhize for about three and a half years now, got a long history in software development and software generally in the manufacturing space. Probably 10 years, 10 to 15 years in a technical sort of software engineering role, and probably about the last 10 years in kind of leadership and architecture roles and that kind of thing. So I’m customer-facing.

I like getting right in front of the customer and solving the actual problems. And yeah, I get exposed to a lot of the technology as well in doing that.

Time-series Database vs. Event Database

[David]

Yeah, a lot of been there and done that. So deep manufacturing knowledge, especially around the collecting and storing of data. So when I’m talking to customers, I generally refer to, and I’ve heard others do the same, really two different types of data that you have in a manufacturing environment.

There’s your time series data, that’s your telemetry data. It’s time value, quality, and that gets stored in some kind of historian or time series database. And then we have this event database, and that’s where we’re putting all of our events.

Could you talk a little bit more about some of the nuance that’s there? I know we’re going to be talking in a future episode about the difference between a process historian and storing in a time series database, but can you put a little more meat around what we mean by the differences between time series data and event data?

 

[Andy]

OK, right. Yeah, I think you mentioned in there about time series data fundamentally being TVQ. I think that’s the starting point for working with time series data.

And that matches up quite well, really, with how do I historize the process values that are coming from these PLCs effectively. So as a kind of a small area within manufacturing, a small sort of slice of the manufacturing domain, matching time series databases and time series concepts with data that’s streaming out of the process at level zero or level one sort of makes sense.

And you could view these discrete values that are coming out of the process.

You could view those as events, tiny little events. But again, they’ve got some meaning and not that much context. But we also, if we kind of look at a slightly broader scope in manufacturing and look at what else we’ve got that’s event driven, it’s not just process values.

There are more complex things going on, particularly during execution stage in the manufacturing environment, where these objects that are being published, these events or these things that are happening down on the line and in planning and in other areas. They’re more complex. These objects have got structure.

They’ve got many attributes. They might have relationships with other entities. They might have child objects or they might have other sort of relationships to hierarchies within the domain and that kind of thing.

And so when you start to think about these more complex objects, they’ve definitely got a time component, but it starts to become difficult to really think about storing those in a time series database because the time series database doesn’t really lend itself well to these complex sort of data structures. And so we’d look at another way of doing that. And there are other databases out there that would allow you to store complex data structures.
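To make that contrast concrete, here is a minimal sketch (with invented tag, order, material, and equipment names) of a flat TVQ sample next to the kind of structured execution event being described:

```python
# A time-series sample is essentially time-value-quality (TVQ): flat, no relationships.
tvq_sample = {
    "tag": "Line1/Filler/Temperature",
    "timestamp": "2024-05-01T08:30:15.250Z",
    "value": 72.4,
    "quality": "GOOD",
}

# An execution event is a structured object: nested attributes and references
# to other entities (order, material lot, equipment, personnel).
material_consumed_event = {
    "eventType": "MaterialConsumed",
    "timestamp": "2024-05-01T08:30:16Z",
    "workOrder": {"id": "WO-10042", "segment": "Mixing"},
    "material": {
        "definition": "RM-FLOUR-01",
        "lot": "LOT-2291",
        "quantity": 250.0,
        "unitOfMeasure": "kg",
    },
    "equipment": {"id": "MIXER-03", "class": "Mixer"},
    "operator": {"id": "EMP-117"},
}
```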

And a document storing system, for example, is an obvious one that comes to mind where you can dump complex documents into MongoDB, for example, and that gives you a way to persist those kind of events. But again, when data is non-trivial, when data needs to be meaningful, it tends to grow relationships. And then when data tends to grow relationships to other entities, that’s when a document storing system like Mongo would sort of tend to fall down.

And then if you’re trying to store that kind of data in a relational database, that makes sense. But again, when the relationships become very complicated, storing the data might be possible. But then when it comes to querying the data with complex relationships out of the relational database, that becomes quite cumbersome and not very convenient.

So then we move over to graph, and maybe there are other databases that you might want to use for different kinds of data. So we’ve got vector databases and that kind of thing. So I think it depends on the structures of the data and how you want to place it and how you want to consume it.

I think that’s part of what we need to consider.

The 3 Pillars of Manufacturing Data

[David]

Okay. So when we talked about time series and event data, there’s a lot more that goes into just the event data, because we’re going to get into that back end, and that structure really depends on what’s the type of events, what’s the relationship between all the events, what does all this stuff look like? So as we were getting ready for this, there’s sort of an evolution of the production and the creation of data in a manufacturing environment.

And you refer to these as the pillars. So if I understand right, we start with the definition, then we get into the demand, and then finally there’s the result. So let’s unpack that a little bit.

What do you mean by this data, these flavors or these pillars of manufacturing data?

 

[Andy]

Okay. So yeah, it can be quite difficult to talk about this stuff in very general terms because you end up talking in a really abstract way about time series or graph and it all becomes very technological and very technical. So there’s this sort of model that I use with my customers to try and sort of bring to life the differences between this data as the data lifecycle unfolds within a factory.

So making a start on these three pillars, the first one here is definition. And this data is what people would often refer to as master data. I want to avoid using that terminology.

I’ll come back to that. But definition data describes what the factory could do. So it describes the materials, the people, the equipment, the physical assets, the operations definitions, that kind of thing.

It’s all of the static data, but it’s not going to change that often and really is a representation of what is a very complex thing, the factory. And once you’ve got that definition in place in a factory, then the next thing that you would do or the next thing that an organization would want to do is to create demand on the factory. So creating demand on the factory really means understanding the sales forecast, the actual sales orders that we’ve got in.

And somebody in production planning generally will create demand on the factory based on the demand that’s coming in from the customers or the expected demand. So that’s orders. And whether you call them works orders or production orders or job orders and that kind of thing, we’ve got this activity in the factory.

That defines what the factory will do. And then the moment that an order has been dispatched into the factory and we start working on it, then we start getting the events. We start getting sort of real-time events coming through from the factory, which describe what the factory did do, what the factory has made.

 

So I’ve got sort of a summary of that there up on the screen. We’ve got definition, which is what the factory could do. Demand is what the factory will do.

And result is what the factory did do. Moving further from that or further into the detail here, what this gives us in the domain of definition, we’ve got some really complicated object models and data structures like ISA 95, where we’ve got materials and we’ve probably got a material taxonomy or a material class taxonomy. And then we’ll probably have a lot of definitions for the different raw materials, the different finished goods, how all those interrelate.

Same goes for the equipment. We probably have equipment classes and equipment hierarchies. Then we’ve got the operations definitions, the workmasters.

All this stuff starts off as a hierarchy of sort of operations definitions and segments. And then we also have relationships between entities that allow us to sort of route materials in different ways for different jobs and that kind of thing. So it becomes a very complicated object model with a lot of relationships.

And it’s characterized, really, by these sort of things that I’ve got down at the bottom in yellow. It’s characterized by complex data structures, as I’ve said. It’s very slow moving, so it’s not changing that often.

You’re not adding new entities to your definition model. We’re not adding lots and lots of new equipment every day, not normally, and not lots of materials or people every day. Normally, once it’s built up, the ebb and flow of definition data is a kind of an ongoing curation of that data.

And it’s long-lived as well. Unless your factory changes radically, the data entities that we place will probably just churn on versions rather than sort of radical change. And this stuff’s really foundational for later context, which we’ll touch on, and also for business logic, so driving business logic.

So you might have state machines in that definition that dictate that if you close an order, you can’t reopen it again. Or if you’ve got an order that’s in quarantine, then some quality steps need to take place before it can come back out of quarantine. Normally, those are these definitions that were coded up in a certain way.
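As a rough illustration of that kind of definition model, here is a small Python sketch loosely patterned on ISA 95 naming; the class and field names are illustrative only and are not the actual Rhize schema:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class MaterialClass:
    id: str
    parent: Optional["MaterialClass"] = None      # taxonomy: classes can nest

@dataclass
class MaterialDefinition:
    id: str
    classes: List[MaterialClass] = field(default_factory=list)

@dataclass
class EquipmentClass:
    id: str

@dataclass
class Equipment:
    id: str
    equipment_class: Optional[EquipmentClass] = None
    children: List["Equipment"] = field(default_factory=list)  # composition hierarchy

@dataclass
class OperationsSegment:
    id: str
    consumes: List[MaterialDefinition] = field(default_factory=list)
    equipment_requirements: List[EquipmentClass] = field(default_factory=list)

@dataclass
class OperationsDefinition:
    id: str
    version: str
    segments: List[OperationsSegment] = field(default_factory=list)

# Even this toy model is mostly relationships and recursion. It changes slowly
# and is curated over time rather than rewritten, which is the point about why
# definition data fits a graph better than flat tables.
```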

[David]

And then some… It sounds like the definitions are sort of the whats and the hows. And once you’ve established what the what and the how are, then it’s like, okay, now let’s move on to the when.

Exactly.

[Andy]

Yeah, okay. Yeah, exactly that. So now that we know what our factory can do, then we place demand on it.

And that’s just… It’s simpler. It’s scheduling and it’s creation, dispatching of orders.

So from a data structure point of view, it’s much simpler. But actually, from a business logic point of view, it can be one of the most complex parts of a manufacturing organization. The whole deciding what to schedule on what machine when to really fit that demand and the complexity that leaks back up into the supply chain is not something to be underestimated.

Anybody that’s been to sort of an automotive plant will understand just how sophisticated the planning gets in those kind of places. But it is… The data structures that you’re working with are not as complicated as the ones in definition.

We’ve got hourly to daily sort of data churn, when new orders are being created constantly every hour and modified constantly every hour. And then going out of fashion after only a few days, once an order has been completed, often the people that are working with this data are sort of moving on to new things. And then these orders sort of become the subject of later analysis, if you like.

So the demand data, the orders are an enabler for business context, because if you can tie some piece of data that sort of floats in some process value or some material, if you can tie that back to an order number, often that order number is the gateway to all the context that you’ve got on the left-hand side. So your order number sort of in the middle is what really ties the context back to this sort of complex sort of definition model that you’ll probably have in a graph. 

So yeah, that’s demand then, David, that’s the when.
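For a concrete picture of demand data, here is a hedged sketch of a single works order in a row-oriented transactional store; the table and column names are invented for the example:

```python
import sqlite3

# One works order as a row in a transactional (OLTP) store. The order number is
# the "gateway" key that ties later events back to the definition context.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE work_order (
        order_number    TEXT PRIMARY KEY,
        material_id     TEXT,
        quantity        REAL,
        uom             TEXT,
        scheduled_start TEXT,
        dispatched_to   TEXT,
        status          TEXT
    )
""")
conn.execute(
    "INSERT INTO work_order VALUES (?, ?, ?, ?, ?, ?, ?)",
    ("WO-10042", "FG-BREAD-STD", 5000, "EA",
     "2024-05-01T06:00:00Z", "LINE-1", "DISPATCHED"),
)
conn.commit()
# Anything recorded against this order later (events, telemetry, lab results)
# can be joined back through order_number to recover the full definition context.
```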

And then the final part is what the factory did do. And I’ve put that in past tense deliberately there. And I’ve not said what the factory is doing.

It’s what the factory did do. And this is kind of one of the fundamentals of the event-driven architectures is that when you receive an event, it’s something that has happened very recently, but it isn’t something that’s about to happen or needs to happen. The result is about an event that’s recorded something that’s happened on the shop floor.

And again, we get these results, the data structures are, again, quite simple, but they come in through different channels. So the results generally are characterized by sort of high frequency data in motion. So this is the stuff that’s coming through from your broker, and it will be mixed with stuff that’s coming through from operators transitioning orders to be running or recording quantities of materials or scanning barcodes, that kind of thing.

So we’ll have these more complex or human-driven events sat alongside machine-driven events and telemetry and that kind of thing. So for the three, we’ve got really different personalities of data across those three domains. On the left-hand side, we’ve got on definition, we’ve got this ongoing curation of data.

And then in demand, we’ve got this sort of daily or hourly sort of most often sort of manual manipulation and creation of orders to create the demand on the factory. And then in results, we’ve got high-speed event-driven data streaming in from brokers or OPC servers or from PLCs or other data channels, which we’ve got to deal with. And I mean, the trick with this is how do we combine these sort of three areas so that when we’re interested in data from one of these domains, we can mix that with data that sort of already exists in one of the other domains.

And obviously, this model is like all models: no model is correct, but some models are useful. If you look at the contrast I’m sort of trying to paint, it is useful to try and think about the way that this data behaves differently in different environments.

But there’s a huge overlap here.

 

[David]

Yeah, absolutely. Yeah, I mean, it seems like, you know, when I first started with the conversation of time series and event, it’s well, yeah, there’s events, but there’s also a lot of other data that goes into manufacturing of it. And I don’t even know how I would characterize all this.

 

So, you know, we’ve spent some time talking about the pillars, you know, the definition, the demand and the result. But I think we’ve gotten a little bit into the life cycle of the data, which is really how does, you know, where does data start and where does that data end up? Yeah.

Have we already talked about this, Andy, or can we just move on past that data lifecycle point?

The Data Lifecycle

[Andy]

Let’s touch on it. Again, it’s a sort of generic way of thinking about data. So this data lifecycle model here, we’ve got sort of creation and collection of data, the transmission of data.

So creation and collection, you know, where are we sourcing that data from? Is it from PLCs, barcode scanners, users, scales, that kind of thing? Or is it from another MES system?

Is it from SAP or is it coming through from a master data file? You know, we’ve got a lot of that, a lot of consideration to make for creation and collection. And then transmission tends to be a technical thing.

You know, how are we actually getting the data from one place to another? Are we querying? Are we publishing?

What are the technical aspects? And then storage, we generally store it in a database, but, you know, we could have data sat in sort of Kafka streams or, you know, we could be just sort of leaving data in a broker. And processing and transformation, that’s a key part of what we do at Rhize is the processing and transformation of data, not just to grab context from the graph, but also to position data into sort of different databases for different consumption later, for analysis, sharing and utilization, which are the other parts of the data life cycle.

And then we’ve got archiving and retention and disposal of data, which are sometimes not really considered when you’re sort of architecting the happy path, but important considerations when you’re sort of looking at the broad sort of cost of ownership of a heavyweight data architecture. And so, yeah, the data life cycle. So I think what sometimes happens if you talk about a specific area of technology or a specific set of use cases, you can tend to focus on just one bit.

So, you know, you’ll get sort of data scientists thinking more about processing and transformation, or storage, processing and transformation into analysis and sharing, whereas OT people are really focused on creation and collection and transmission. Maybe IT admins, maybe they’re focused on archiving and retention and disposal of data and that kind of thing. But I think it’s important to just keep in view that there’s quite a broad scope to data management.

Yeah. So that’s the data life cycle. I think it’s relevant when you’re mixing database types and mixing the location of data as well.

So some of it’s in cloud, some of it’s on-prem. You know, we’ve got data lakes and data warehouses and all that kind of thing. This stuff becomes important to manage explicitly, sort of at a governance level, a group IT level.

 

[David]

So we’ve gone a long way from just time series and events to, well, there’s pillars of that data. There’s the demand, there’s the definition, there’s the results. And well, now we also have a life cycle.

All of this data that you’re going to collect and store and analyze, you’re also going to have different places for it of when it’s created and how it gets used. And then, you know, eventually, where do we park it? So, you know, we talked a little, we haven’t even really gotten to the gist of it.

And I think now is a great time to do it. And you alluded to it earlier, there’s all these different data structures that can actually exist for it. Not only do we have the pillars and the life cycle, but there’s different data sets. So let’s really get into understanding, you know, what are these different types of, we’ll just call them generically, databases that can be used, and how should they really be, you know, utilized to maximize what it is we’re trying to do?

Maximizing Different Types of Databases

[Andy]

Yeah, sure. Okay. Okay.

This is, like I say, the meat of it, the bit that we wanted to get to. And databases, I’ll pull up a diagram here on my screen and sort of work through that a little bit. We’ve got discrete values that come up from the process, from PLCs and that kind of thing.

And it would probably land in some kind of OPC server or MQTT broker or some system that can sort of provide these discrete values. And that’s the starting point often when we’re sort of architecting systems. We’re kind of assuming that the stuff that’s coming up from the PLCs is getting sort of transformed into a format that can be consumed by a tag system.

We’ve got discrete values. We’ve got, you know, as you described it earlier, David, you know, the TVQ thing. It’s raw data.

There are no relationships in place, generally speaking. So we’ve only got the discrete objects that we’re dealing with. And we’ve got sort of high frequency data streams.

So you might have temperature values or, you know, valve open, closed values coming in, you know, millisecond sort of resolution and that kind of thing. So there are not many relationships that you get. There are no relationships at that level.

And then you might not even have much context either. So you might have an ID indicating which equipment this is from. And certainly most of the time you wouldn’t have an order number and you wouldn’t necessarily have a material sort of number associated with a value.

You probably won’t even get a unit of measure. And you’re just getting sort of raw values. So, yeah, that’s a tag system.

The data doesn’t really become meaningful until you start creating relationships. And the next sort of area of interest really is, you know, how do you represent complex relationships? And really the answer to that, really in sort of modern database design, it’s accepted really that graph databases are often the most appropriate way of representing sort of complex object models.

So, you know, an example of a complex object model is, you know, ISA 95, particularly the stuff that we’ve talked about on the definition side. So we’ve got complex relationships, we’ve got recursion, we’ve got the need to sort of traverse many times. We might have the requirement to sort of scale horizontally and that kind of thing.

 

So with a graph database, it lends itself to that kind of thing. And then, you know, access technologies like GraphQL, again, make it convenient to access a graph database. So in the past or up until, well, let me give you the example of ISA 95.

I think ISA 95 is a good example of an ontology of a very complex object model. It’s been really very difficult to implement using relational database technology in the past because of this idea of object relational impedance mismatch, which is jargon and it’s a right mouthful. But what we’re actually talking about there is that when we’ve got complex nested objects with relationships attached to them, it can be quite hard to translate that representation out into a relational database.

And this has been a kind of a bugbear of object-oriented software engineers for the last 25 years, however long we’ve been doing that. It’s been that they can really express themselves within their own software or object-oriented software environment. But when it comes to persisting or querying the data back out of the database, there’s always this object-relational translation layer that really gets in the way.

And I think in the past, that problem has made it very difficult and impractical to implement complex object models because relational databases have been the only outlet for doing that. But we’re here with graph databases now and suddenly that problem goes away. It brings with it all the challenges and that kind of thing, but your ability to express yourself and really place the data and relate it to each other in a way that makes sense for the problem that you’re solving, the manufacturing domain, really, you know, your ability to do that comes to life with a graph database.
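As an illustration of why graph plus an access technology like GraphQL is convenient for this kind of model, here is a sketch of a single query traversing several relationships at once; the field names are invented for the example and are not the actual Rhize API:

```python
# One query walks from an operations definition through its segments to the
# material and equipment requirements, with the nesting in the query mirroring
# the nesting in the model, rather than a join per hop.
EXAMPLE_QUERY = """
query {
  getOperationsDefinition(id: "OPDEF-BREAD-V3") {
    id
    version
    segments {
      id
      materialSpecifications {
        materialDefinition { id }
        quantity
        unitOfMeasure
      }
      equipmentSpecifications {
        equipmentClass { id }
      }
    }
  }
}
"""
```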

And that’s why often we use graph databases. And that’s why, particularly in the definition side of ISA 95, that sort of comes into its own. And I’ve touched on relational databases and I’m breaking this down into sort of two categories.

There could be a third category, actually, but I’m breaking this down into sort of row-oriented data, which is your traditional SQL Server OLTP transactional database, and column-oriented data, which lends itself more to analytics. So for the row-oriented data, this is what we’re used to seeing. And it lends itself really well to the demand, the bit in the middle, for the three pillars of manufacturing.

The idea that you’re working with an order as an entity and you’re dealing with groups of orders and that kind of thing, kind of lends itself to row-oriented data structures, row-oriented queries, that kind of thing. In fact, the reason it’s been so popular for so long is that it’s just so flexible; row-oriented relational database technology is really flexible. So low query latency for analytics on a small data set is okay.

We’ve got ACID compliance and bulletproof reliability. And your ability to delete and update records and frequently access records sort of lends itself to working traditional sort of transaction processing and that kind of thing. Okay, so that’s row-oriented data.

And again, it lends itself to the transaction processing and what we’d normally see as sort of traditional IT systems. And then, very closely related, as I said, to relational databases, column-oriented data is another way of dealing with data that’s got a slightly different query workload. So for transaction processing, as I said, records are king, rows are king.

In column-oriented data, the difference is that the way that the data is arranged on disk or in memory, let’s just stick with on disk for now, means that it is faster. The data structures on disk lend themselves to algorithms that are trying to do aggregation type stuff. So these are your OLAP databases, online analytics processing.

So when you need low query latency, where you’ve got massive datasets, where you’ve got maybe some time series in there, where you need to do some compression, where you’re working on large datasets and not necessarily on a record-by-record individual basis, then column-oriented data comes into its own. And the difference between row and column is just whereabouts on the disk the different elements are stored. So in row-oriented data, the data all associated with an individual record would be kept together.

And to traverse a lot of records would result in a table scan, which can be quite expensive. Whereas in a column-oriented database, the data that lives together are the data that’s related to a specific column. So if you’ve got a dataset that contains a lot of numerical data, then the column for height or temperature, all that data will be stored together.

It won’t be scattered across the disk. So when you’re trying to do aggregation on a column, the way that row-oriented databases organize that data on a disk, it would be fragmented and distributed more across the disk and difficult for an algorithm to access and gather and bring that data together. Whereas a column-oriented database, there’s an assumption that there’s going to be an aggregation workload.

And therefore, the data is oriented on the disk in a way that’s convenient for algorithms that are trying to do that work. And this is why with a lot of time series databases, I know Influx is one of the ones that uses a column-oriented data structure. Because typically, a time series database will be dealing with aggregations as a first class citizen.

And time series takes it one stage further and adds a time component to the storage that it’s using on the disk. So that when you’re using time series databases, you would probably be doing things like interpolation, grouping by time elements, last-observation-carried-forward queries, and other aggregations. So what you’d have with time series data is it will be optimized for doing that across time.
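To make that aggregation-style workload concrete, here is a sketch of the kind of time-bucketed query column and time-series stores are built for; the syntax is generic SQL and the table and tag names are invented, with real engines (Influx, Timescale, BigQuery and so on) each having their own dialect and time functions:

```python
# Group raw samples into hourly buckets and aggregate a single value column.
HOURLY_AVERAGE = """
SELECT
    date_trunc('hour', ts) AS hour,
    tag,
    avg(value)             AS avg_value,
    max(value)             AS max_value
FROM telemetry
WHERE tag = 'Line1/Filler/Temperature'
  AND ts BETWEEN '2024-05-01' AND '2024-05-02'
GROUP BY 1, 2
ORDER BY 1
"""
# Because all the 'value' entries sit together on disk, the engine can scan and
# aggregate that one column without touching the rest of each record. Time-series
# engines then add time-aware operations on top: interpolation, last-observation-
# carried-forward, windowed group-bys, and so on.
```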

So that’s column-oriented. You’re going to say something then, David?

[David]

Yeah, I was just going to say it sounds like a time series database, that we’ve talked a little bit about, is just a very specific instance of column-oriented data. Is that correct?

 

[Andy]

Correct. Yeah, yeah. And I think a lot of the data warehouses, you know, Google BigQuery, certainly one option for using Google BigQuery is to have it as a column-oriented database.

And they’ve not yet released time series, certainly not on general release, I don’t think. They’ve not yet released some of these time series features. But they’re going to catch up with that, if you know what I mean.

So time series could be viewed, maybe I might be wrong about this, but could be viewed as a kind of a subset of column-oriented databases.

 

[David]

Yeah, I think that’s fair. All right, so now we have this document store. Tell me a little bit about that.

 

[Andy]

Document store, MongoDB being sort of the most widely used example. So document store, we don’t really use it at Rhize. I have used it in a past life.

What a document store allows you to do is effectively store, let’s take MongoDB, JSON documents. So in a document store, the equivalent of tables, rather than containing rows, would contain documents, JSON documents. So in the case of MongoDB, it’s a binary format, so I think it’s BSON. So your document store allows for complex objects to be stored. But it doesn’t allow for relationships, not really, between documents or between documents in different parts of the system.

So as I said earlier, when data starts to become, or when you need data to start to become meaningful, or when you want to derive meaning from data, often relationships start to grow. And this is a weakness of document store systems. They’re very fast on ingest and fast to query.

But when it comes to sort of building relationships between different objects, the method is embedding. If you want quick access to something that’s related to a particular document that you’re dealing with, and it lives in another document, rather than create a relationship and traverse that relationship through a query in a document database, generally speaking, what you would do is embed, and therefore duplicate data inside that document. So that you’ve got the data sort of living alongside it.

So document stores kind of, you can end up with a lot of duplication, and you can end up with a lot of orchestration of queries and stitching of data back together in the application layer. So there are kind of a few niche use cases for document store. But I think a lot of people, if they overuse document stores, what can end up happening is that the application layer ends up with a kind of a referential integrity layer having to be built to make sure that if you delete a parent document that’s related to something else, then you go off and delete the child documents from other places.

Or you end up with logic in the application layer that actually synthesizes relationships between documents by sort of scanning full tables and matching foreign and primary keys across the two tables. You end up getting a kind of a quick win early on because it’s a schemaless database. You can kind of just dump data in there, but your quick win results in quite a lot of technical debt that then needs paying back in the ways I’ve described a little bit later.
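Here is a hedged sketch of the embedding pattern being described, with invented field names; the equipment details are duplicated into the event document rather than referenced through a relationship:

```python
# A production event document with an embedded copy of the equipment details,
# duplicated into every event that needs them.
production_event_doc = {
    "_id": "evt-88412",
    "type": "OrderStarted",
    "orderNumber": "WO-10042",
    "timestamp": "2024-05-01T06:02:11Z",
    "equipment": {
        "id": "LINE-1",
        "description": "Bread line 1",
        "site": "Plant A",
    },
}
# If the equipment description changes, every document carrying the embedded copy
# is now stale, and keeping copies consistent (or deleting related documents
# together) becomes application-layer logic rather than something the database
# enforces for you.
```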

 

[David]

Yeah, no good deed goes unpunished.

 

[Andy]

Well, sometimes you get there fast, right? And it delivers value, but then you’ve got to pay back at some stage. There’s your technical debt.

It happens everywhere. It catches you out with Mongo for sure.

Combining the Pillars and Database Knowledge

[David]

All right, perfect. So we’ve talked about the different pillars. We’ve talked about there’s actually the evolution or the lifecycle of it.

Now we’ve talked about, well, really in a manufacturing environment, we have sort of that tag data or discrete data. We have the graph database that builds relationships in. We have row-oriented for some transactional data.

We have the column-oriented for more of the analytic data, the time series being a very specific or a subset of that. And finally, there’s just the generic document store, I would say. So can you start bringing this back?

And what does this data structure look like? What does it mean within the context of, let’s just start with the pillars of, we have the definition, we have the demand, we have the results. How would you utilize these different data sets for capturing this kind of information so you could maximize the value of all that data that you’re collecting?

 

[Andy]

Yeah, okay. Well, I’ve talked about a couple of models there. As I said, all models are wrong, some are useful.

I’ll try and combine these models now, the ones I’ve just presented, to sort of try and bring to life why we’re talking about them in the first place. So again, there’s a diagram on screen now, sort of bringing the two together. And it just sort of exposes a way of thinking about how these things come together.

So definition data, what the factory could do. Demand data, what the factory will do. And result data, what the factory did do.

We’ve got a progression there between those three sort of areas. The definition data, it’s complex objects, it’s relationships, it lends itself to graph databases, very much lends itself to graph databases. And that’s how we’re able to sort of expose ISA-95 as a persistence ontology within the Rhize platform, is using a graph and we’re able to sort of represent those relationships there.

It’s hard to do that with row-oriented databases and then column-oriented databases. You end up with loads of joins and really complex queries and quite a lot of frustration.

So you end up watering down your complex object model so that it’s a poorer representation than it could be of your manufacturing environment.

You don’t have that constraint with a graph database. So that’s where our definition data goes, we put that into the graph, a complex object store. The demand data, again, we’ve said it lends itself well to row-oriented data structures, online transaction processing, because it’s orders and schedules.

Within Rhize, we’ve got the graph database there ready. It works well with that kind of workload, with that kind of concept. So we tend to keep that, the demand data, the scheduling data in the graph as well.

But again, row-oriented data attached to that. What we’re always doing at Rhize is, we’re never the full picture, we tend to use the different parts of our Swiss Army knife, depending on the circumstances. So we would often interface to a scheduling system or consume orders from SAP, for example.

And that might be one use of the row-oriented data. Okay, so demand, yeah, we’ve got these orders and we’re managing these orders daily, hourly. As I said, it lends itself to the row-oriented data.

But in Rhize, we’ve already got the graph database there available. It can cope with this workload quite well. And obviously, we’ve got the data structures that are available from ISA 95 to support that.

So what we do do a lot of within Rhize is to work with other systems. So we’re never normally the complete solution. We’re normally sort of working with different tools on our Swiss Army knife to sort of pull together the integration to make a full solution.

So we will probably be working with SAP or some other ERP system to onboard your order data. We might be working with a scheduling system. We might be deferring the scheduling tasks to another system because that functionality is sort of rich in that environment and people are used to it.

So we would tend to sort of work with what’s available there. And often, that’s sort of the row-oriented data or SQL Server effectively most of the time. And then if we talk about result data, this is the bit where it all comes together, actually.

And when people talk about event-driven architectures and event-driven manufacturing and that kind of thing, it’s this result data, what actually happens on the shop floor, that’s really the bit where the value is, where the value of the data is. This is the bit that people want to analyze. They want to understand what happened in the factory and why and what it was related to.

So this is the bit where it all comes together. And I’ve not mentioned discrete values so far. But discrete values will be a primary contributor to the result, part of our three pillars.

So we’ve got these discrete values that come in at very high frequency. And we would normally land that data. It would come through a broker, generally speaking.

And we would probably land that data, for the most part, into a time series database, a column-oriented sort of time series database. On the way through, part of the lifecycle, the data lifecycle is this idea that we transform, we contextualize. So as we receive these discrete values from a broker, we would tend to make associations with that data, with the usual tools, primary keys and foreign keys, that would be sort of queried and stored in the right place.

But we’d grab context before we persist through into time series. And the context allows us to maintain and preserve the relationship between the data that’s coming from the process, the discrete values, and the object model that was used to sort of define the factory. And same with the row-oriented data.

If we’ve got links to… We’re able to query across the different databases to bring it all together. And that’s the part where we can enable and facilitate contextualized analysis, or we can enable and facilitate the contextualization and pre-formatting and transformation of data as it comes into the result end of our data model.

And post that off in the right format, off to a data lake or some other persistence layer that would be used for the querying.
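As a sketch of that contextualization step, assuming placeholder lookup and write functions rather than a real broker or Rhize API, the idea is to attach the order and material keys to each raw value before it is persisted:

```python
def contextualize(raw: dict, lookup_running_order, write_point) -> None:
    """raw is a TVQ-style message, e.g. arriving from MQTT/OPC UA via a broker."""
    # Look up the order currently running on this equipment, e.g. via a query
    # against the graph / demand data (placeholder function).
    order = lookup_running_order(raw["equipment_id"])
    point = {
        "ts": raw["timestamp"],
        "tag": raw["tag"],
        "value": raw["value"],
        "quality": raw["quality"],
        # Context keys preserved alongside the value so later analysis can join
        # back to the order, material, and equipment model.
        "equipment_id": raw["equipment_id"],
        "order_number": order["order_number"] if order else None,
        "material_id": order["material_id"] if order else None,
    }
    # Persist into a time-series store, or stage for a data lake (placeholder).
    write_point(point)
```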

 

[David]

So it sounds like that we’ve talked about the pillars of the data, and then when we get into the definition of it, we really want to put that in a graph or some sort of a fairly highly contextualized relationship. And the knowledge graph really lends itself well to that. And then, of course, once we get into the demand, that can either be in a graph, or we can traditionally use a row-oriented data.

And that’s very common, and it lends itself very well. But where we get into the result, that’s where we’re going to see all of these different types of data sets being used. We’re going to have the discrete values going to the time series data.

And we talked about that’s a subset or a specialized, a tab-oriented, or excuse me, a column-oriented database. But when we want to get into the analysis of that data, what type of analysis you want to do really is going to determine and define the types of data that you want to use. So if your analysis is a relationship of data, then using a graph or a highly complex relationship, use the graph.

If you want something that’s more of a transaction, that’s when we’re going to want to start using more of the row-oriented. But now if you need to start doing some kind of analysis, so when we get in, you mentioned BigQuery earlier, where we’re going to do all kinds of analytics, that’s where you get into that columnar piece, because it’s going to lend itself well. Do I understand 

this correctly?

 

[Andy]

Yeah, you do. And the thing I want to reiterate is that there are no hard, fast rules about this. There isn’t a kind of a…

These are ideas that need to blend together. So, yeah, the idea that if you know that you’re going to be doing a lot of aggregation on massive data sets, then you’re going to be going in the direction of column-oriented data. But if you know you’re going to need to traverse a very complex graph structure for, I don’t know, probably genealogy or something like that, then graph might be it.

If you need to represent very complex structures like an ontology, graph might be it. You’ve got this blend, and sometimes these things blend together. So your column-oriented data, the data you want to push off into a data warehouse, you might want to push that off in the format of Parquet files because that’s what’s important.

But you might just want to break a few context items out from the graph and put them into those Parquet files, just to make later queries more convenient. So maybe some primary keys from your graph sort of object model will help you grab context later if that’s what your data scientists need to be able to do or want to be able to do. And actually making those decisions is a kind of a project by project, data model by data model sort of problem that really needs to be kind of, you’ve got two sides to the consideration.

One, am I representing my data structures correctly and to the right fidelity? And then the second part is, what’s my workload? How do I want to consume this data?

And depending on those two, depends on what emphasis you place on these considerations. But if you leave any of those out of the consideration and if you leave graph out or you leave row out, if you leave time series out or you leave out the other channels, like the broker or whatever that means, if you leave any of them out, you leave yourself actually with probably not the right number of tools to solve the problem that you’re likely to have in manufacturing, unless the problem is dead simple. And I think for large organizations with many factories and that kind of thing, the problems are never that simple.
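As a hedged sketch of the idea mentioned above, of breaking a few context keys out of the graph into the files you push to a data warehouse, assuming pyarrow is available and with invented column names, the context simply travels as plain columns alongside the telemetry in a Parquet file:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Telemetry plus a few denormalized context keys from the graph / order data, so
# downstream analysts can filter and join without touching the graph itself.
table = pa.Table.from_pydict({
    "ts":           ["2024-05-01T08:30:15Z", "2024-05-01T08:30:16Z"],
    "tag":          ["Line1/Filler/Temperature"] * 2,
    "value":        [72.4, 72.6],
    "order_number": ["WO-10042"] * 2,
    "material_id":  ["FG-BREAD-STD"] * 2,
    "equipment_id": ["LINE-1"] * 2,
})
pq.write_table(table, "telemetry_with_context.parquet")
```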

Semantic Model and Technical Model

[David]

And unfortunately, no, there’s a lot of complexity that goes along with it. So I wanted to talk a little bit about just data models. So I use the term semantic data model a lot, and I generally speak of this in terms of, say, an asset.

So for people that know me, that’s your motor model, that is your compressor model. That’s something that’s not just describing these discrete bits of information. We’ll just call them tags if you’re familiar with like PLC or a SCADA system, but we want to model it as an object.

It even could be say a mixing tank that has a mixer, it has a level. There’s a lot of things that describe what that is and that becomes that semantic data model. But then there’s also the technical model of what does this overall structure of the data look like?

So as I think as we start collecting all this data and think about here are the pillars of data, there’s the life cycle of the data, here’s the different data structures. Well, we do need to start exchanging this data a little bit. So can you talk a little bit about, let’s expand a little bit on the semantic model as well.

Let’s talk a little bit about the technical model there as well. And what does this mean in the context of the storage of the data?

 

[Andy]

Okay, yeah. So semantics and semantics being meaning, I suppose. When you’ve got, again, that landscape, large organizations with many factories and many, you might have MES and ERP and LIMS and all kinds of other systems and sort of siloed data stores in a lot of different places.

In IT, there’s always the technological imperative that you’ve got to create interoperability between systems from a technical point of view. If there’s no pipe between the systems, if there’s no technical way to integrate, then it’s kind of pointless. This is why we have REST interfaces and why we have brokers and HTTP and why we have databases and query languages and data structures and data formats.

Why we’ve got JSON and HTML and XML and all of these sort of technical, these technologies that facilitate and enable interoperability at a technical level. But the bit that’s just as important, but kind of gets missed a lot, is this idea of semantic interoperability. So when concepts that are shared by systems, necessarily shared by systems, like your equipment model, that’s the best example of a structure that’s shared by MES, by ERP, by LIMS, by other systems.

They all really need to be talking a common language when they’re referring to these things that are being shared. Like I said, the equipment models, the first thing that sort of springs to mind. And when systems do it differently, when semantically systems view equipment and physical assets in different ways and they represent those structures in different ways, it can really get, you might have the technical interop in place so that one system can call a REST API on another system. 

 

You might have that bit in place, but actually if the two systems have got a different semantic representation of the thing that they’re trying to exchange on, then an awful lot of technical work actually then has to take place, a translation layer, to try and make the conversion between the two systems. And it’s like, if you’ve got an MES and ERP, and some other system there, you might have orders, you might have works orders, or you might have production orders or job orders. They don’t always mean the same thing.

So, and an important part of integration in a complex landscape is to somehow bring together this semantic interoperability so that things can exchange information in a consistent way. And that’s the bit that’s really hard, I think. And it’s becoming more of a topic at the moment, a more important topic, or something’s being talked about more at the moment, just because of this sort of rise of large language models and this idea that we can use retrieval augmented generation to sort of ask questions about data and use large language models to sort of navigate the data that we’ve got.

And that’s kind of a fundamental starting point to be able to do that kind of stuff. If you’ve got an ambition to be able to use sort of large language models and AI generally, and use them specifically on your data set, the thing that’s getting in the way is not technical interop and not technology interop, it’s this semantic interoperability, where the LLM’s not got a chance of actually interpreting all of this data. And in order to sort of create this semantic interop, you need an ontology.

Maybe you need an ontology. There are different schools of thought here, but an ontology that describes a domain in detail and the relationships between the objects in detail is a starting point for that interop. Now, when you talk about the semantic layer in a UNS, that’s a great example of where a very small part of ISA 95 has been encoded into what is probably in most practical terms, a topic structure.

And if you use a sort of enterprise site area line model to sort of organize the equipment hierarchy into a sort of composition hierarchy of equipment, that agreement between systems, that agreement between the guys that are building the UNSs and the guys that are gonna consume the UNSs, that agreement on that semantic framework is a big part of the enablement of a UNS. If you were gonna choose those topic names, or those topic structures at random, it wouldn’t be a UNS because consuming the data from the UNS would be inconsistent depending on what decisions people have made. So the decision to go with ISA 95 for that part has enabled a kind of a semantic interoperability, but it doesn’t just stop at sort of the equipment hierarchy.
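A minimal sketch of that kind of ISA 95-flavoured topic convention, with invented enterprise, site, area, and line names:

```python
# The enterprise / site / area / line composition hierarchy becomes the topic
# path, so every producer and consumer interprets a topic the same way.
def uns_topic(enterprise: str, site: str, area: str, line: str, tag: str) -> str:
    return f"{enterprise}/{site}/{area}/{line}/{tag}"

topic = uns_topic("AcmeFoods", "PlantA", "Packaging", "Line1", "Filler/Temperature")
# -> "AcmeFoods/PlantA/Packaging/Line1/Filler/Temperature"
# The shared agreement on what each level means is the semantic part; the broker
# and MQTT are only the technical interoperability part.
```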

If you’re gonna get true value from the data, it needs to go further than the equipment hierarchy. It needs to, you know, we need to visit materials and people and recipes and routings and all that other stuff. And where you put that, you know, that’s the debate.

That’s the debate for architects and engineers is do you pick one of the systems in your landscape to be the representation? Do you choose or build a new ontology to make this sort of universal representation? And that’s kind of, yeah, it’s a big problem.

I’ve no solution for it, but I think the kind of interesting thing that’s going on at the moment is, do we really need to arrive at a sort of universal semantic interoperability for manufacturing? And if we do, how do we go about doing that? And I think, you know, maybe ISA 95 has got something to do with that.

 

Maybe there are other ontologies that could be reused. I think that’s the thing that’s coming through for me at the moment. And I can really see that being a problem as these technology interop problems start to get resolved more readily.

 

[David]

Yeah, you know, as we’ve evolved in this so-called digital transformation, there’s a lot of new ways and new approaches to the types of things we’re doing. I don’t think we’ve arrived on what that looks like, but certainly there’s a lot of very smart people that are coming together to see if we can’t figure out what this is going to look like long-term. And I suspect we’ll probably land on something.

And then just about as soon as we have all decided, yeah, this is what we’re gonna do, then something else is gonna occur and then we’ll be right back at it. So I suspect we’re gonna be iterating for a little while here. So, you know, so we’ve talked a little bit about, you know, the pillars of the data.

We’ve talked about the data lifecycle. We’ve talked about the different types of datasets and how they can be used. And even we’ve talked about, you know, the semantic interoperability, these semantic models.

As you say, we wanna make sure that technically exchanging data, that’s the easy part. Semantically, let’s make sure that, you know, what it is that we’re talking about, you know, we can get all our information around. So can you kind of bring everything together?

Let’s just, you know, what does this look like? You know, let’s pull back a little bit and just start looking at what does this look like long-term for everything? Can you tie this all up for us, Andy?

Final Thoughts

[Andy]

Oh, yeah, that’s a big ask actually, David. That’s right. Yeah, I think in the past, if I look back on, you know, 15 years ago, the options, the architecture options, and particularly if you’re using other people’s technologies, and I’ve always been lucky enough to sort of work in a product environment where we’re sort of building our own technologies.

But if you’re using other people’s technologies and trying to stitch them together, the solution options have been very one dimensional. So you use the vendor’s data model and their extensibility model and their programming languages that they provide. And the options for integration aren’t that great.

So, you know, there’s kind of this vendor lock-in idea and prefabricated architectures. I think we’re coming through that actually. And people are able to take a sort of a composition approach to system design where they can, you know, they’ve got more choices.

They can choose to use different kinds of databases and they can choose to use different sort of types of transport technology and sort of, and the technical interop between those elements, again, is maturing. And we’ve come through. So because we’ve got more choice, there needs to be a bit more kind of, there needs to be more sort of mental models, if you like, of how to approach the different situations with these options that we’ve got.

 

So if you’re composing an architecture, because you can use a lot of different kinds of systems and a lot of different kinds of databases, that’s when the problem with semantic interoperability starts to come in. Because you’re no longer just sort of using the data, you’re using what you’ve got available. You’ve got choices to make around that.

So I think as the sort of manufacturing architecture landscape evolves, people are going to be looking to mix the different database types, the different sort of transport technologies that are available. Maybe taking off the shelf products for implementing certain parts of functionality. Maybe implementing, you know, a full MES and a full ERP in the classical sense, but actually complementing this with a composition model around how they extract and transform and consume that data using sort of different technologies.

And as soon as you break away from that kind of vendor-centric approach and take a more sort of democratized approach to the way that you’re sort of solving these problems, that, yeah, that’s when you’re going to need sort of to start thinking about ontologies. And that’s when, you know, ontologies like ISA-95 sort of come into their own. And the consideration that how do you adopt ISA-95?

Can you force that into the MES vendor? Or do you need to create some sort of middleware that allows you to sort of act as a sort of almost not a single version of the truth of the data, but maybe a single version of the truth for the semantics? And how does that sort of, how does that work?

So I don’t know whether I’ve brought it together. I’ve certainly named a number of very difficult problems like that, David.

 

[David]

Oh, absolutely. There’s a lot to think about here, you know? And so if we go back to the very beginning, you know, and I’ve talked about it’s, and I’ll reference this again.

There was the time series database. There was the event database. Well, what we’ve really learned here is that, yeah, there’s a lot more to both of those events there.

So we have the pillars, where you’re going to have your definition. You’re going to have your demand. You’re going to have your result.

There is the life cycle of where does all that live and how does it live and how it moves its way through. There are different data structures that you want to use for those different pillars. And of course, you know, bringing it all together, you also want to ensure that you’re not only utilizing the right database, the right tool for the job, but we want to ensure that we’re also exchanging all that data.

So it’s not just technically interoperable, but we also want to maintain some semantic interoperability. So we’ve definitely come a long way from the time series and the event to, there’s a lot of things that go into it. The good news is that there’s a lot more systems that will talk to each other.

We want to think about exchanging data in a semantic interoperability way and ensure that we also park the data and store that data and persist that data in the right sort of data structure so that when we want to analyze it, we want to use it, we’re going to be able to maximize the ability. So does that seem to kind of conclude where it is that we’ve landed on this?



[Andy]

I think that rounds it off quite nicely. The only thing to add really is that the only thing you can be sure of is just kind of a long sequence of problems stretching off into the future. Absolutely.

This is what we’re in.

[David]

But as they say in programming, when you get a different error code, that’s kind of an exciting day. That’s right. That’s progress.

Anything else you want to add, Andy, before we sign off?

[Andy]

I think I’ve talked too much, actually, David. No, I think that’s all good for now. 

[David]

All right, perfect. Well, thank you for joining us and thank you everybody for being here for the Rhize Up podcast as I think we’ve really dissected why the back end of your data structure matters. And we’ll look forward to seeing you on future episodes of the Rhize Up podcast.
