Rhize Up

Rhize Up w/ David Schultz: Keeping Up with Time Series

David Schultz Season 1 Episode 10

This week, David joins Leonard Smit from Flow Software, and Nicholas Woolmer and Nicolas Hourcard from QuestDB, to talk all about time-series data. They discuss time-series databases, process historians and historian design, and strategies to ingest and store high-volume data.

[David]

All right, let’s go ahead and fire this thing off. Good morning, good afternoon, good evening, and welcome to the Rhize Up podcast. Today we’re going to be talking about time series databases and process historians.

And if you remember from a previous podcast, we talked about how there are all types of different data formats and different data backends that you might want to use. Today we’re going to get a little more into the weeds and talk about a very specific type of column-based database, which is that of a process historian and a time series database. And it’s important to understand that there is a distinction.

These things are not exactly the same thing. And I get asked a lot of, well, why do I need to go out and buy a process historian? Why can’t I just use one of these free open source software solutions?

And we’re all familiar with them, whether it’s InfluxDB, TimescaleDB, QuestDB, it’s all out there. And then, of course, Flow Software recently released their flavor that is called Timebase. I hope I got that name right, Lenny.

And so we’re going to talk a little bit about that because it’s very relevant to what it is that we’re doing. So with that, I’m being joined by some people from industry that are, I will call them, experts on time series data and what that means within manufacturing. So let’s go ahead and introduce our guests.

So Lenny, please start with, tell the audience who you are.

[Leonard (Lenny)]

Good morning, as David said, good morning, afternoon, or evening to all our listeners. You could have been doing so many other things, but you decided to listen to our podcast. So we’re really humbled and grateful for that.

My name is Leonard. Well, most people call me Lenny. I work for Flow Software.

I’m part of the customer success team at Flow. And yeah, we just recently released our own offering in this kind of space between data historians and time series databases. And for us, it was really about, how do we leverage more data and turn it into information?

And that’s what the product Flow does. But we obviously needed that solid foundation of historical data to actually achieve this. So we decided, well, let’s use that as an enabler.

And we delved our toes a little bit into this world of time series data. So yeah, very excited about launching that product. I will talk a little bit later about what it does and how it works, but yeah, very excited to join this podcast.

Thanks, David.

[David]

All right, perfect. And thank you for being here and participating and taking time out of your very busy schedule. I know you guys got a lot going on right now.

So from QuestDB, we have two people, even though they say Nick and Nick, we’re going to refer to them as Nick and Nicholas. So Nick, if you would please introduce yourself.

[Nic]

Hi, folks. Very nice to meet you all and be here. I’m Nick, co-founder and CEO of QuestDB.

And QuestDB is an open source database for time series data. It focuses on performance and ease of use, and it’s very well suited for applications in IoT. And we’ll talk a little bit more about this later in this podcast.

Thank you very much for inviting me.

[David]

All right, perfect. Nick, thank you for being here. And Nicholas.

[Nicholas Woolmer]

Hi there. My name’s Nicholas Woolmer, in this case, and I’m a core engineer at QuestDB. So I work on the product, the main database, primarily the open source side.

And prior to joining QuestDB, I was actually a user of QuestDB myself in the telematics industry. So I have some familiarity with IoT use cases and how they differ from other applications of time series, like finance and what have you. So yeah, thank you very much for having me here.

Difference Between Time Series and Process Historian

[David]

All right, Nicholas, thank you very much for being here. So let’s go ahead and get right into it, Lenny. So as I mentioned in the intro, I get asked a lot of why can’t I just use these commercial or these free open source softwares, you know, like, again, the ones that I already mentioned and just use that?

Why would I pay an expensive license fee for some kind of process historian that’s been out on the market for years? So, you know, really, what is the difference between a process historian and just a pure time series database? And what goes into that?

[Leonard (Lenny)]

I think there’s a lot of commonality, and I think sometimes that blurs the line between what the functionality and the purposes are between these two, because both of them store time series data. What do we mean by time series data? Well, it’s data that’s timestamped, or handled by a timestamp.

So I think that’s a little bit where the lines get blurred in some cases. If we look at pure data historians in relation to time series databases, these two have been designed for two completely different purposes. If you look at a data historian, its primary purpose and design is in the industrial space.

It’s designed to handle industrial data, industrial process data in the manufacturing environment, whereas time series databases serve a little bit wider range of applications. Nick spoke about IoT data as an example, but you can use time series databases for financial markets, IoT sensor data, and a little bit wider range of data that’s handled by a timestamp. So the domain-specific features of the two are widely different.

A data historian is domain-specific: it is designed to handle industrial types of data from industrial pieces of equipment, like PLCs, or from your SCADA solution, and it’s really tightly integrated with those OT-type systems. Whereas a time series database is a little bit more interoperable, right? They’re open source most of the time.

They use APIs to a great extent to make it a little bit more flexible and interoperable from that perspective. Typically, again, process historians live on premise, in the domain. They’re seen as critical applications within the industrial environment.

You need access to that data reliably, fast, to make your decisions. And normally, your time series databases are a little bit more scalable, to be deployed in the cloud, hybrid even, a little bit of on-prem versus cloud technology. And that sometimes makes it a little bit more cost-effective as well to host these time series databases.

But I think a big part of where data historians come into their own is in the features on top of the databases. So there are visualization tools that are specifically designed to visualize this industry-specific data. A variety of trending capabilities to analyze the data and make that work.

With time series databases, some of them do have visualization capability. Some of them rely on their openness, with their API solutions, so you can go and create your own visualization outside of those technologies. But a data historian is purpose-built for the visualization that needs to happen in industrial environments.

There’s a whole bunch of technical reasons that go into that. One classic example, I always bring it up. You guys are probably tired of me talking about it, but it’s just the classical case of boundary values.

That is very important within data historian worlds. That’s something that normally a time series database doesn’t handle very well. And that’s crucial in the industrial space when we do our analytics on top of that.

And then scalability. Normally, with process historians, I’m not talking about scalability in terms of the amount of tags that they can ingest or that sort of thing. But they’re potentially very limited to what the specific vendor wanted to achieve with their integration into their specific OT-related design.

Whereas time series databases are generally a little bit more open, a little bit more cost friendly. Some of them are completely free, and sometimes a lot easier to use and to get going than the typical process historians.

[David]

Sure. Yeah. And I think from a technical standpoint, how you ingest data between these two tools is, from what I have seen, fundamentally the big difference.

So when you’re writing to a process historian, you’re ingesting data. There is an evaluation that’s done first of, am I going to write this to the table? And that’s where we get into a lot of the compression.

And so there are some decisions, and this goes into the setup, versus a time series database where you send me a value, I’m going to plop it in. And there it goes.

And then, of course, that’s where you get into some of the technical capabilities of the tool: do I actually continue to persist, or do I have algorithms to do that? Do I understand that correctly? There’s a little bit of a difference.

And maybe Nick or Nicholas, you can offer some insights into that as well. But I think that’s a real fundamental point to understand: how they’re ingesting data.

[Leonard (Lenny)]

I think so. Yeah. Sorry.

Just to complete that, I think data integrity and accuracy is where we’re going with that data. The specific thing that a process historian does is what we call a TVQ: a timestamp, a value, and a quality component. And that quality component is the part that’s normally not in the realm of time series databases.
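
To make the TVQ idea concrete, here is a minimal sketch of such a record in Python. The names and quality codes are illustrative only (the codes loosely follow OPC-style quality flags), not any particular historian’s schema.

```python
# Illustrative TVQ record: timestamp, value, quality.
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum

class Quality(Enum):
    GOOD = 192       # loosely modeled on OPC-style quality codes
    UNCERTAIN = 64
    BAD = 0

@dataclass(frozen=True)
class TVQ:
    timestamp: datetime   # T: when the value was sampled
    value: float          # V: the process value itself
    quality: Quality      # Q: how much the value can be trusted

sample = TVQ(datetime.now(timezone.utc), 72.4, Quality.GOOD)
print(sample)
```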

[David]

OK, Nicholas, you were going to say.

[Nicholas Woolmer]

Yeah. So I think this is true. Obviously, the more domain-specific you get in a use case, you can add those lower-level, not necessarily niche, but more use-case-specific features.

And it’s certainly the case that time series databases on the whole tend to be a little more generic. You kind of get the product, you configure it the way that works for you, and you go from there on the ingestion side.

I think it’s clear that people often already have historians. And sometimes it’s a greenfield project where they’re going to have a different ingestion style. They kind of have different goals about durability and they’re starting new.

And then we see people who have historians and they have these bespoke features that they need, you know, they might have monitoring in some industrial plants or something similar. And then they’ll have a secondary TSDB to extract data to, and have different requirements for the lifecycle of that. So I think it’s certainly the case with a database.

Generally, data comes in and people want to keep it for at least some time. And yeah, therefore, you know, we can’t necessarily specialize to every detail in that use case and may have other priorities. I would just say one more thing: in our particular case, we adopt a relational model for our data.

So other time series databases, like InfluxDB historically, use this measurements, dimensions, and tags style, which is much more popular and a very specifically modeled way of storing your data, whereas we support a more relational model. And again, this is also a usability positive for many people who have sort of relational data, but who need more time series kind of features and would never invest in something like a historian. So I think those are some salient differences.

Applications of Open Source or Commercial TSDB's

[David]

All right, perfect. So when we start talking about different applications, you know, for me, maybe it’s a firmer line than it probably is, because, you know, what I’m learning here is that there’s somewhat of a spectrum: there’s the traditional process historian, and there is the true time series database.

I think the latter is what you’ll see in, at least, financial markets, you know, that’s where you’ll see it. But, you know, there’s a lot of stuff in the middle. And, you know, maybe there are applications that lend themselves better to one or the other.

So as manufacturers become data-driven organizations, we always continue to talk about, you know, the Level 2 process historian, and this is all our telemetry data. But there’s a lot of applications where I need the TVQs, I need time series data. So what are the applications where, whether it’s a commercial offering or a free open source offering, I would just want to use what I would generally term a time series database rather than a true process historian? So what do those look like?

So Nick, Nicholas, insights on that?

[Nic]

Yeah, I’ll say a few words about the primary driver for using QuestDB that we’ve seen, you know, in the last couple of years. And this really is around performance. So when there is a lot of data and the data shape is high cardinality, and what we mean by high cardinality is a large number of unique time series in the data set.

So, you know, if you have a lot of attributes with different sensors, stuff like that, the ability to sustain pretty extreme throughputs without pressuring the database is one of those things that, you know, led people to try specialized tools like QuestDB, a, you know, specialized time series database with a clear focus on performance. That’s the one thing on ingestion.

And you can imagine, right, like sustaining a few million rows per second nonstop is pretty hard. And this is, you know, this is an important thing for specific use cases. And, you know, we can give a few examples of those in a second.

And then not only high ingestion speed, but also the ability to query and understand this data in real time as it comes. So the kind of low latency side of the equation here. And those are insights, sort of real time insights as the data is being processed.

And typically underpinning pretty critical use cases. So, you know, you can think of real time monitoring and anomaly detection for batteries, space rockets, solar panels, you know, even nuclear reactors. If something goes wrong, we want to know as soon as possible.

And, you know, in the case of space rockets, that’s like a pretty, pretty niche use case, but pretty interesting. The ingestion rates there are very extreme, right? The run is actually fairly limited in time.

But the amount of data that is routed towards the database is very, very significant. And this is why a time series database focused on the performance side of things could make sense for these people. And then maybe like one last thing that could be interesting is the ability to extract performance from limited hardware.

Let’s say like a Raspberry Pi, for example, right? So we did a benchmark a few months ago, where QuestDB was able to process 300k rows per second on a Raspberry Pi. And the ability to actually extract that performance from, you know, very limited hardware as such can be interesting for application on the edge, right?

Where you don’t need like a beefy server, or you just cannot actually use a beefy server to collect the data before processing the data, perhaps, you know, centrally on, I don’t know, on a bigger server or even on the cloud. So that is typically a use case that we see more and more. And, you know, it’s linked to the design of the database, right?

So how, I think here it’s less of a distinction between time series database or historian. It’s more, hey, what was the focus on performance since day one from the outset? Has this engine been designed with performance in mind?

And I’ll give you the example of QuestDB. It’s been fully built from scratch. So it’s not built on top of anything else.

And some of the design decisions really were made to, you know, ingest time series data in the most efficient way, and then lay it out on disk so that it could be parallelized and sliced as much as possible to give very, very fast queries and, you know, get that sort of real-time capability. And this is in a package that’s very lightweight. So, you know, actually sub five megabytes for the entire database, which makes it easy to be deployed on, you know, this kind of Raspberry Pi device and stuff like that.

So those are things which I think are quite important. And then beyond the use case specifically, what we see is a trend where some companies are trying to move away from vendor lock-in. And this is a shift towards open formats that we see being more and more prominent in IoT.

And by the way, beyond IoT. So QuestDB is also used heavily in financial services. And this is where we see an adoption of open formats.

That’s quite impressive, and has been really accelerating in the last few months. And here we say, hey, you know, instead of having to rely on a tool and therefore a provider for the next 20 years, you can actually ingest and store data in an open format that should be the standard for the next 10, 20, 30 years, potentially for storing time series data. Here I mean Parquet, for example, Apache Parquet.

And you can actually move this data from, you know, the database itself moving to outside a database, to object stores, for example, where we fully decouple storage and compute. And then actually use best-in-class querying engine to process this data, typically historical data, less of a real-time thing, more of a, you know, large data volumes that have been accumulating over time. And companies are very intrigued with this model.

And, you know, the ability to not rely on a specific vendor is quite an interesting one.

[David]

Excellent. Nicholas, anything that you wanted to add to that? Just in terms of, you know, where does open source maybe fit better than a commercial time series or, you know, any additional thoughts?

[Nicholas Woolmer]

I think some strong points have been raised already. I would say more and more there’s a current movement in the database world, especially of, you know, why not just use Postgres, right? Just simplify your stack.

Use one thing for everything. You can get it to work. It’s going to be good enough.

And lots of people are starting to reconsider their choices. Do I really need this complexity? Do I really need this?

Do I really need that? And I think, as described with this open format story, you know, QuestDB is not a graph database, fundamentally. And if your use case is not just time series, but has some graph functionality, you know, are we in a position to give you the best results ever?

Do we want to turn people away when we can solve 90% of their problems, but just not this last 10%, potentially? And so giving people the kind of agency to not feel trapped in their decisions, to be able to move if they have to, and not be penalized for that, really opens up opportunities for development, where previously, you know, you couldn’t get approval for these things, and so on and so forth. And that goes not just for the storage format, but for the messaging format as well.

You know, IoT has very strong messaging standards with MQTT, with OPC UA. These are very, very strong formats, but the storage format, you know, maybe it’s stuck behind a proprietary database sometimes, maybe it’s stuck on mainframes and it’s not easy to retrieve. So there’s this integration part where you not only have an open format for your protocol to connect all these systems, but the data as well is not trapped away and not hard to extract.

So yeah, I think those are kind of the main reasons people start to find these things attractive.

Good Process Historian Design

[David]

Yeah, as you guys were talking about it, it seems like it’s, like everything, it depends. It comes down to what’s the application, is there high ingestion? You know, the rocket launching, there is just gob loads of data that’s there.

And then of course, how it’s going to be consumed. And of course, that back-end data format, that’s exactly what our previous podcast that I mentioned we first got going was all about. There’s different storage mechanisms, make sure that what it is you’re trying to do aligns with that.

So, you know, coming back to just good historian design, there’s a lot that goes into it. And I imagine, Lenny, this is pretty relevant and fresh in your mind, since, again, you guys just released it, or I guess it’s still in beta and it’s going to be formally released here at the end of the year, but that’s Timebase.

So what goes into good process historian design? Is there a separation of concerns? Are there, you know, what does that look like?

Do you want just one big behemoth or is it some way the Unix philosophy of do one thing, do it well? So what kind of goes into that?

[Leonard (Lenny)]

Yeah, quite a lot, to be honest, David. And I think there are some good things that the time series world taught us that we are trying to incorporate into the way that we’re designing our process historian. And I think, Nicholas, you pointed to it just a little bit in your last comment about, well, it’s great that we use open standards just to get the data in.

Now it sits in that proprietary format and you’re kind of in that vendor lock situation where it’s very difficult to get the data out to other systems or to other people just to analyze that data. So let’s start with designing it to handle data ingestion, right? So we spoke a little bit about it.

From a process historian perspective, you need to make sure that what you are inserting or ingesting into that database has got a data integrity component to it, right? So you mentioned it, David, there are rules that you can maybe apply, or there are specific things where you can say, okay, I don’t trust this data. Do I really want to ingest it?

And if I ingest it, how do I let the user know that this point or this value that I’ve now ingested into my historian is potentially not great? So we cater for that. Again, as I mentioned, we natively store a quality component with the value that comes back.

So making sure that we have that quality component with each record is very important. And I think one thing that’s also very important is to actually store every data point. So make sure that you’ve got access to all the raw data that’s been thrown at you.

There are some algorithms out there that try to reduce the amount of space that is used for storage, but those in a sense don’t give you all the raw data that’s available. So I think from that perspective, make sure that you can handle all the raw data that’s been thrown at you and store it with integrity. I think that’s the biggest part there.
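
The space-saving algorithms Lenny mentions are typically exception or deadband style filters: a new sample is only persisted when it differs enough from the last stored one. Here is a minimal, purely illustrative sketch of that idea; real historian compression (swinging-door and the like) is considerably more sophisticated.

```python
# Illustrative deadband-style exception filter: keep a sample only when it
# moves outside a deadband around the last stored value.
def should_store(last_stored: float | None, new_value: float, deadband: float) -> bool:
    if last_stored is None:
        return True  # always keep the first sample
    return abs(new_value - last_stored) > deadband

stored: list[float] = []
last = None
for value in [10.00, 10.02, 10.03, 10.60, 10.61, 12.00]:
    if should_store(last, value, deadband=0.5):
        stored.append(value)
        last = value

# [10.0, 10.6, 12.0] -- the 10.02, 10.03 and 10.61 raw samples are never stored,
# which is exactly the trade-off Lenny is warning about.
print(stored)
```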

Then obviously you need to be able to scale. You need to be able to handle increasing volumes of data as your system grows. And as you throw more tags and more tags and more tags at it, you need to be able to scale.

So you need a robust way to compress the data that you actually persist to disk. And that is where most of the process historians will have some sort of proprietary way that they’re going to compress and store it into these flat files. And that’s probably where a little bit of the closedness comes in.

But yeah, normally it’s trying to keep that scale, or that growth in those files, as small as possible and to really optimize the amount of space that you utilize on disk. Nick spoke about it. It’s good that you can throw millions of records into the file.

It’s almost more important how fast you can get it out. So when you design this, the performance side goes both ways. You need to almost have more emphasis on how fast you can retrieve the data and present that for real-time analysis than on actually ingesting the data.

There needs to be a security aspect to it, of course. There might be very sensitive data that we are storing. You need to make sure that it is protected.

So you need to use proper security measures to handle that data so you don’t have any breaches. And then, again, maybe something that process historians did a little bit differently than what we see now is just, how interoperable is your historian? How compatible are you with different solutions in the industrial space?

That’s where a little bit of vendor lock-in existed in the old process historians, definitely. I think open standards, Nicholas, you spoke about a few of those, MQTT, OPC UA. At least those open standards make interoperability with a whole bunch of different systems easier these days.

So it’s to make sure that you utilize those open standards so that you are widely interoperable to collect data from various sources on the plant floor. And then, obviously, it must be user-friendly to operate, right? You don’t want to sit for a day trying to install and configure just to get one OPC point to log.

[David]

You mean that can actually happen? 

[Leonard (Lenny)]

It can. It can definitely happen. I think those days are over, especially with the open source technology that’s available. Guys want to quickly spin up a Docker container, a microservice, get it running, get it to log. And these days you can achieve that by leveraging those technologies as well. So that’s one point that we took as well.

We’re leveraging .NET capability in our historian, which gives us that interoperability at the operating level. So we can spin up in microcontainers, Docker, Linux, Windows, those kinds of scenarios. So I think that’s important: how quickly and easily we can go and spin this up and get it going, and how easy it is for the user to actually get that going.

And then you need to be able to handle, unfortunately, things that go bump in the night. So how do you handle disconnections to your device? How do you handle disconnections between where you’re collecting the data and where you’re sending it to the historian?

How does the historian know that it doesn’t have any more connection to those underlying devices? So do you implement redundancy at that level? Do you have a way to mark your data to make sure that it’s reliable data, or, when the user queries that data, that they know, listen, this is not the latest data.

There’s something wrong actually with the data ingestion from that perspective as well. And then from a process historian perspective, yes, we unfortunately need the tools. We need the graphing tools, we need the trending tools on top of our historical data access to get a little bit more insight.

And how do we analyze this time series data and make it very easy for the user to do a little bit of analysis, to maybe do batch-to-batch comparison? David, I know we spoke about batch, golden batch comparison.

[David]

Yeah, the golden batch.

[Leonard (Lenny)]

In the past. But how do we make that easy for the user from a client perspective as well? So from a historian perspective, it’s a whole bunch of components that goes into it.

It’s not just the storing part of the system. We broke it down into four components. So the first component is plugins.

So what do you connect to? Is it MQTT? Is it OPC UA?

Then we’ve got a component that handles the comms just between that and the historian, to make sure that, listen, if there’s a break, it notifies the historian about that break. Then we’ve got the historian component that actually just handles collecting updates, compressing and storing to disk. And then we’ve got a fourth component that’s going to handle our analytics side, the trending capability of that. So all of those components together actually make up a process historian.

[David]

Yeah, so there’s both some technical things that go into it. And if I remember what was listed, there was a lot more than I expected. So it’s getting data in quickly.

It’s validating that integrity. It’s being able to get data out efficiently. It’s the security.

It’s the tools, the ability to get at it through open standards. I’m trying to think of what the… There were a couple more after that.

There’s a lot that goes into it. Then of course, it’s mainly how do we collect the data? How do we connect to that data?

How do we store that data? And how do we analyze that data? Which is a very common…

[Leonard (Lenny)]

And then there’s something that’s not technical, but that’s very crucial to also get right, and that’s a knowledge base or a community around what this does. How easy is it to access either a forum or an article explaining what this thing does, how it works? It’s the whole support around that as well.

So not just only the technology side, but then there’s the whole customer aspect as well, which is also something that people need to think about. If something goes wrong, what do you do? Customer experience is huge.

What’s the support structure? What does it look like to log a bug report or a fix request? That whole life cycle around how you get the user back into your development pipeline is also huge, especially if it’s a new product you just launched.

To get that interaction going, and to get it going right from the get-go. So that’s something completely different to also note that you have to have in place.

[David]

I’m a big believer in show me how to solve problems. Don’t tell me about how great it is. Give me some real life examples.

I like that.

[Leonard (Lenny)]

One thing that we liked about the time series world, and it’s something that we included in our design, is just the ease of getting data out. So getting data out is not just limited to our trending tool and our front-end or client tools. We did create a full-on REST API, again trying to focus on open standard technologies, REST in this case, so that people can get the data out.

It’s not locked. It’s not driven by us. It’s not something that you pay for to get out or whatever the case is.

It’s truly: here’s your data. You stored it. You should be able to access it with whatever tool you choose.

So quite a few learnings from the time series world, or the time series database world, that we tried to incorporate into our design.

Understanding SQL Limitations

[David]

All right. Excellent. A lot goes into it, Lenny.

So I’m going to have to listen to this again just so I understand what those are. So let’s talk a little bit about some other back-end design, and we talked about the ability to access data using some kind of open standard, whether it’s an API. I’ve also seen SQL being a way to get data out, and I think it’s important to understand there are SQL-based historians that are out there.

It’s important to understand there’s a difference between a historian where all I’m doing is taking values and writing them to a SQL table, which, as we’ve recently learned here, is really great for transactions, not necessarily for time series data, and historians that just use SQL as the query language that gets used to retrieve that data.

So Nick, can you talk a little bit about the SQL offerings that are out there? I mean, we’re not here to pick on anybody, but maybe why using a pure SQL back-end isn’t necessarily the best, or maybe there are, as we talked about, applications where it is, and understanding what some of those fundamental differences are, and maybe the limitations of just SQL in general in terms of doing time series data. Can you speak a little bit about that?

[Nicholas Woolmer]

Yeah, traditionally people, everybody has a SQL database, right? Even when we had this big explosion of document databases, MongoDB and this kind of movement, SQL has always been very consistent. We like the ACID guarantees or all that kind of thing.

But the fundamental model of most OLTP databases is some sort of tree structure, usually a B+ tree. And these have different performance dynamics to something like a flat array or a flat column. And generally in an analytics use case, if your data is already fully flat in an array, you can use SIMD, you can easily chunk it and break it down and aggregate the results back together. There are a lot of advantages that aren’t necessarily conveyed with a traditional OLTP database design. So in our case, what we find with time series is that it’s OLAP most of the time, right? It’s mostly analytics.
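
To give a rough sense of why flat columns matter for analytics, here is a purely illustrative comparison: aggregating a contiguous column with NumPy (vectorized, SIMD-friendly) versus walking row objects one at a time. It is only an analogy for columnar versus row-oriented storage, not a database benchmark.

```python
# Illustrative: contiguous column vs row-by-row aggregation.
import time
import numpy as np

n = 2_000_000
column = np.random.rand(n)                      # flat array: one contiguous block of floats
rows = [{"value": float(v)} for v in column]    # row-oriented: one small object per row

t0 = time.perf_counter()
col_mean = column.mean()                        # vectorized scan over the flat column
t1 = time.perf_counter()
row_mean = sum(r["value"] for r in rows) / n    # pointer-chasing, one row at a time
t2 = time.perf_counter()

print(f"columnar: {t1 - t0:.4f}s  row-wise: {t2 - t1:.4f}s")
```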

And then just when you ingest it, there’s this little bit of a window where maybe it’s come in and we need to update it a little, or maybe we have something that came in wrong and we need to send it again and deduplicate the data. There’s this kind of small window where we need to be a little bit more flexible. And then after that, we want to be performance and optimized.

So in the case of QuestDB, the approach was, instead of going for a write-optimized solution (like InfluxDB at the time had their time-structured merge trees, which are effectively an LSM tree) and then optimizing for reads, we took the opposite approach. We started with flat arrays, the most highly optimized format for querying.

And then we just kind of went backwards. What do we need in order to support concurrent ingestion from multiple sources? What do we need to support out-of-order data in a way that’s performant and doesn’t get in the way?

And by taking that approach, you get this ground-up kind of performance. But if you don’t start there, trying to retrofit it becomes much more difficult. And the other thing I’d say is that really interesting points were raised by Leonard on how to build a good historian.

This kind of package is a whole thing, right? You’ve got everything you need. It plugs into your hardware, you get the storage, you get the visualization, you can interoperate with other tools if you need to, right?

With this REST API. What we find is that people often already have some parts of a stack that they’re used to and they want to keep. They already have some MQTT broker, for example.

And they just want to transfer that into QuestDB to do some different things. Maybe they already use Power BI or they use Grafana. So they don’t really want to use the database for that because they’ve already done that bit for themselves.

And that’s where you start getting this lower level separation, all the way down to eventually the storage format where people can not have to sign up to a whole big deal just to get the bit that they need. And I think that generally very close to the industrial side on site, that’s where the biggest gap lies. And then as soon as you start getting up the business levels, different people need to look at different things.

That’s where this kind of separation becomes more important. So yeah, that’s one of the main reasons why performance works so differently. But also the fundamental advantage of SQL is that many developers are familiar with SQL.

And many of them are not familiar with time series or custom time series querying languages. And in our case, we take SQL, we add some extra syntax sugar to make it easier to downsample data, to join data by time. And those just little bits are familiar enough to developers that they can just get on with it or have a little bit of help from us to kind of solve their problems.

Whereas a custom query language is just much more blocking if you have no time series experience. So you get a bit of a trade-off: we don’t do all the OLTP stuff well, but we do it well enough. And in return, you can approach time series from a much more familiar standpoint.
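
For example, QuestDB’s SAMPLE BY and ASOF JOIN extensions are the kind of syntax sugar described here, for downsampling and time-aligned joins respectively. The table and column names below are hypothetical.

```python
# Hypothetical tables/columns; SAMPLE BY and ASOF JOIN are QuestDB's
# time-series extensions to standard SQL.
DOWNSAMPLE = """
    SELECT ts, avg(temperature) AS avg_temp
    FROM sensor_readings
    SAMPLE BY 15m              -- bucket raw points into 15-minute averages
"""

TIME_ALIGNED_JOIN = """
    SELECT r.ts, r.temperature, s.state
    FROM sensor_readings r
    ASOF JOIN machine_state s  -- pair each reading with the most recent earlier state row
"""
```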

[David]

Cool. So you just did use a few FLAs in there. So four letter acronyms.

So the OLTP and OLAP, could you just real briefly, what are those and what do they mean?

[Nicholas Woolmer]

Technically, I think they mean online transaction processing and online analytical processing. The big picture is that OLTP, transaction processing, is your traditional database management system, like Postgres or MySQL. OLAP, analytical processing, generally tends to be a more column-oriented type of database.

So this is, you know, ClickHouse or, in this case, QuestDB, DuckDB. So different priorities: one’s much more about mutations and one’s much more about querying.

[David]

Perfect. Thank you for that. And just a real quick follow-up.

Is there an application where you would say, you know what, SQL’s good. Maybe it’s digital maturity. It’s not a high ingestion rate.

It’s familiar. It integrates with maybe another tool that’s there. I mean, are there some decent reasons or good reasons to use that?

Or is it something like, you know what, it’s just really not fit form and function for what it is we’re trying to do?

[Nicholas Woolmer]

If you have a very update heavy workflow, right? You have few rows, lots of updates. You’re not ingesting a lot of new rows. You know, you don’t have a huge dataset. Maybe it’s only 50 megabytes. It’s not hundreds of gigabytes.

Then you’re going to be fine and maybe even better off in that scenario. And we want to encourage people to do that in that scenario. We don’t want to bring people to the database who are going to have a bad experience, you know, get stuck, get in trouble just because they like it, but it’s not a fit, you know?

So for those kinds of use cases, that kind of business data, yeah, keep it separate and just, you know, put in your stuff that fits the model a bit better, but may occasionally need joining between some of these tables or might need some of the ACID guarantees. That’s where, you know, you specialize a little, but you don’t specialize to the extreme and you still keep those niceties.

Utilizing the Cloud with Time Series Data

[David]

All right, perfect. Well, as I always say, though, you know, one of the best parts or the best answer as a consultant is, it depends. And it sounds like that’s much the case.

There’s a lot that goes here. So Lenny, let’s talk a little bit more about, you know, common architectures now is I’m going to, and this is what I refer to affectionately as the underpants gnomes, where I’m going to steal underpants and make profits. I’m going to shout out to South Park again.

Nobody should watch the show, but there it is, I’m guilty. And I’ve seen that, especially in the past few years: we’ve decided we’re going to take all of our data and we’re going to dump it in the cloud, and then we’re going to make profit because there’s going to be some sort of perceived value that’s out there. So I’m following along these ideas, and you talked earlier about some of the architectures, the ability that no longer am I just going to have it sitting on prem.

I’m going to dump all this data up into the cloud. Well, that’s all good. And then you get the bill and realize how expensive it is to ingest the data, how expensive it is to consume that data, and all the other things that go along with that.

So if I’m looking at, you know, whether it’s a cloud-first strategy or cloud native, or maybe it’s cloud preferred, I mean, there are a lot of things we can call it. How do I want to go about deciding what I put into the cloud, which is really just somebody else’s computer? And what does that look like for either a time series database or a process historian?

Can you share some insights, guidance on that? 

[Leonard (Lenny)]

You want the consultant answer? It depends. Yeah, no, there’s a lot of things that you need to consider.

Personally, I don’t think it’s a blanket approach of, oh, now we’re just going to the cloud, especially in this type of environment. I think, personally, a hybrid solution where edge works with cloud works well.

And I’ll explain a little bit why I say that. But again, it’s what is your end goal? What do you actually want to achieve by putting all of this data in the cloud?

Do you want to start looking at AI models or machine learning or whatever the case is? Or is this just something that you read and you can now actually do it? So I’m going to go ahead and do it.

So it’s horses for courses, and it’s very dependent on what the end goal and the end strategy of your organization is and where you want to go from that maturity side. Why would people want to go to the cloud? So what does cloud offer?

What are the selling points of what cloud gives us? Cloud gives us the notion that I don’t have physical infrastructure. I don’t have to build a massive server room or whatever the case is, with the cooling and the electricity and all of that bill at the end of the day.

Well, you’re still going to have a bill. You’re just going to get billed by the cloud provider instead, right? So it’s just switching the one for the other.

Nick mentioned it. Edge computing these days is getting faster. And the technology that we can actually extract from edge computing is great.

There’s no need for these massive server farms that you required in the olden days to get your process historian and process OT side systems running anymore. So you need to really think about what is the actual advantage you’re going to get from going cloud? What is the criticality of my data?

What is the latency of getting the data there and the latency of getting the data back? Are you happy that you’re going to want to pull a trend and you’re not going to get an immediate response when you do your real-time analytics, because maybe your latency is not that great? What happens if the internet goes down for a day?

You don’t have access to your cloud provider. So there’s all of these questions that you need to go through and really decide if it’s the right strategy for you or not. One thing that we do see is people have this notion that I’m just going to pump all my raw data either into a data lake or whatever the case is.

Well, that’s great. But what context is added to that raw data for the IT guy? Normally, now you’re not just separating on-prem versus cloud, you’re also separating this whole OT versus IT. Now you’re taking operational data that the OT guy understands, you’re putting it in the cloud, and now the IT team needs to take over because now it’s an IT-hosted cloud environment. Do they actually understand OT data? Do they understand how to query that data and get out the data that’s required, without the correct context having been added to it?

Probably not. How would they know that a value represents a specific machine state or what the unit of measure is of a particular point that you need to utilize in your retrieval? So I think, yes, there is definitely usages for that, especially if you want to start using AI and machine learning.

But in the historian world, in the more traditional process historian world, especially when you have to have these critical applications looking at these things, maybe a hybrid between edge and cloud is right. And then make sure that what you send to the cloud is really used for what’s needed. Nicholas, you spoke about how, as you move up in the hierarchy of the organization, the needs of users change, right?

So, yes, I’m an OT guy. I live on the edge. I need my millisecond resolution of data to do my fault finding.

Now you move up the stack. Now you do a daily aggregate of a value to give a production manager or a shift supervisor a number, right? Yes, you might use Power BI or one of those tools to build a report on that.

But you don’t need to have all of that raw data replicated in your data lake just for that user to be able to access one transactional record, which is essentially: what was the production for the day we’re showing? So move with your strategy, figure out what the data requirements are up the chain as well, and then potentially turn the time series back into transactions and send just the necessary transactions to the cloud for users to use.

It will reduce your costs. It will reduce the way that people analyze the data. So, yeah, it’s very important to understand what is the use case, business use case you are addressing with going to cloud.

And do not forget the business use case for the guy sitting down on the floor as well. Because, yes, you are going to start incurring quite a lot of costs, not just pushing data up, but you’re paying for getting data back again, which I feel is sometimes a little bit cheeky. But yes, that’s unfortunate how it works.

[David]

Yeah, as I say, the meter tracks every one and zero that it passes by it. That’s just the nature of it. So, you know, it sounds like we’re coming back to a common theme here.

It really just depends on the application, how you’re going to get the data and how you’re going to analyze the data. One of the recommendations I make is, you know, you want to put your… So Rick Bullotta, you know, he was on a previous podcast talking about the Uber broker, but he talks about this hot data, this warm data and this cold data.

For me, it’s that stuff that’s on the edge. It’s that millisecond data that that’s your hot data. You need ingestion right away.

And then going back to conventional architectures that as you move into further and further away, it becomes more warm and cold data. But you also want to aggregate that data ultimately coming down to what is it we’re trying to do. You need a lot of data for machine learning, perhaps, or some kind of AI application, depending on that.

But is that generally a good approach for people: to have a lot more data available at the plant floor and then, you know, either aggregate it or handle specialized use cases as you go to the cloud? Does that seem to be a good approach?

[Leonard (Lenny)]

I think so. It depends, again, what your initiative is. Yes, we need a lot of data, unfortunately, to train machine learning models.

So there is use cases where you would need a lot of data. But again, when the machine model gets executed back on site, potentially, that’s going to run on the edge with the hot data that you spoke about. Again, all about the use case, all about the use that you need to achieve and address.

I don’t think it’s just something that you have to do because it’s the buzzword or you have to now be digital transformed or whatever the case is. Really think about what is it going to address in your business and what is the business case behind that move and what are you trying to address?

[David]

Awesome. Nicholas, anything you want to add on just moving stuff to cloud or how people want to leverage that?

[Nic]

Yeah, I’ll say something very quickly. Actually, in the cloud, there are different ways to store the data, right? There is your kind of hot storage, let’s say your EBS volume on AWS.

And then there is an object store, which is for, say, colder data. And the cost of storing the data in each of those is wildly different. We found that a lot of people want to store historical data that’s perhaps less important to analyze in real time, but still needs to be stored somewhere.

And then you may want to query this data ad hoc in the future, for regulatory purposes or other reasons. And having a cheaper place to store this data is quite crucial. So this is one of the interesting design choices, I guess, of a database like QuestDB.

In this case, the ability to move data into different tiers of storage under the hood, fully, without involving any manual process from the end user. So you could say with SQL, hey, for data that’s older than two weeks, move it to S3. And the cool thing here is that not only would we move it to S3, but the data would be transformed into an open format, i.e. Parquet, which is super compressed.

So not only do you store the data in a cheaper storage tier, but the data is heavily compressed on top, making it even cheaper as a result. That is what we think could be an interesting mental model to think about real-time data, highly critical monitoring, which of course cannot stay on object stores, versus historical data, which might be less critical, but still needs to be stored somewhere where cost is a pretty key decision factor.
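
As a rough illustration of the open-format point itself (not of how QuestDB tiers data internally), writing a few time-series rows to Parquet with pyarrow shows how the data ends up compressed and readable by any Parquet-aware engine.

```python
# Illustrative only: Parquet as an open, compressed interchange format.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "ts": pa.array([1700000000000, 1700000001000, 1700000002000],
                   type=pa.timestamp("ms")),
    "device": ["pump_01", "pump_01", "pump_02"],
    "temperature": [72.3, 72.4, 68.9],
})

# ZSTD typically compresses repetitive sensor data very well.
pq.write_table(table, "sensors.parquet", compression="zstd")

# Any Parquet-aware engine (DuckDB, Spark, pandas, ...) can read the file back.
print(pq.read_table("sensors.parquet").to_pandas())
```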

[Nicholas Woolmer]

If I can interject, there’s two things that sprung up for me on this. So adding on to what Nick said, this trying to simplify for the user, like avoiding these terrible cost scenarios where misuse is easy and also expensive, that’s a huge priority. And it was mentioned about regulatory compliance.

There’s almost an opposite use case for some users. In finance, this is one where there’s a requirement to keep actual records for a long time. So whereas you might have a situation where you need dense data close to the edge and then you need less dense data after, you also sometimes have the opposite, where you only need to keep some summarized records locally that you refer to frequently, but you do need to keep a long-term batch of data.

And the thing to note about that is that there’s multiple costs to how this data is going to be expensive in the cloud. So yes, you may pay a storage cost just for having the data there, but then you pay a cost for the network transfer or you pay a cost to have security aspects on it, or you have a cost to query that data in some form. And that is a very complex calculation to figure out.

It can be very cheap to store data. And then you query that 10 years of data and suddenly this is expensive. And so the other aspect that really becomes critical is empowering the user to optimize that access to the data. One way of doing that is with something like a materialized view where, okay, you take your data, you make a view of it that’s specific for what you want to visualize on this chart, or it might be some summary per hour, for example. You store that extra bit of data, which is relatively cheap. And then when you want to access it, you only download it the one time, you cache it and you don’t pay those cloud costs anymore.

So that’s one aspect where it’s not just as simple as where do you store it and how much does that cost? It’s how do you make sure the user knows that they can access it quickly and cheaply? And it’s not always 100% obvious to the user how to do that best.
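
One simple version of the caching pattern described here, with entirely hypothetical names: pull the hourly summary once from a Postgres-compatible endpoint, keep a local copy, and let repeated dashboard loads read the cached file instead of paying cloud query and egress costs again. This is a sketch of the idea, not any product’s built-in feature.

```python
# Sketch: fetch a pre-aggregated summary once, cache it locally, reuse it.
# Host, credentials and table names are placeholders.
import os
import pandas as pd
import psycopg2

CACHE = "hourly_summary.parquet"

def load_hourly_summary() -> pd.DataFrame:
    if os.path.exists(CACHE):
        return pd.read_parquet(CACHE)           # reuse local copy: no cloud round trip
    conn = psycopg2.connect(host="example-cloud-host", port=5432,
                            user="reader", password="secret", dbname="telemetry")
    try:
        df = pd.read_sql(
            "SELECT date_trunc('hour', ts) AS hour, avg(temperature) AS avg_temp "
            "FROM sensor_readings GROUP BY 1 ORDER BY 1",
            conn)
    finally:
        conn.close()
    df.to_parquet(CACHE)                        # pay the transfer cost only once
    return df
```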

Approaching Client Experience

[David]

Yeah, pre-aggregating data, Lenny, man, that’d be something if somebody could do something like that for us, right? So speaking of client tools, Lenny talked a little bit about this earlier as well, and this is what I ran into a lot with process historians, or just time series data where we just want to have data to look at: how it’s stored and the mechanisms that go into that are really technical, into the weeds, but what really sells these things are the client tools. And that’s what people want.

I mean, I need this great looking trend chart or some kind of application that sits on there. So when you get into the client tools, the client experience, how do you guys approach that? Going back to, say, InfluxDB, it seemed like the stack that was there was the TIG stack, where you had your Telegraf and InfluxDB, and Grafana ended up becoming that user experience or the client tool of choice there.

So Nick, Nicholas, how do you guys approach that with QuestDB? What does that UI look like? And how do you go about making sure that you’ll be able to visualize this data in a way that your users are going to be able to consume that information?

[Nic]

I’ll touch on this very briefly and then pass it on to Nicholas. We chose essentially to double down on this TIG stack because we think it’s already well-established. And the best way to do that was to find a way to be compatible with the protocols underneath InfluxDB.

Therefore, the ingestion protocol that is in QuestDB is actually the same as the InfluxDB line protocol. The performance will be very different because the internals are different, but that protocol is the same. Therefore, we plug into this ecosystem, Telegraf being one very popular tool, and all the others that you mentioned.
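
For instance, with the official QuestDB Python client, sending rows over the InfluxDB line protocol looks roughly like this; the table and column names are made up for the example, and the address assumes a local QuestDB instance on its default HTTP port.

```python
# Minimal ILP-over-HTTP ingestion sketch using the questdb Python client.
from questdb.ingress import Sender, TimestampNanos

conf = "http::addr=localhost:9000;"   # hypothetical local instance
with Sender.from_conf(conf) as sender:
    sender.row(
        "sensor_readings",                 # table (typically auto-created on first write)
        symbols={"device": "pump_01"},     # indexed, repetitive string columns
        columns={"temperature": 72.3},     # regular value columns
        at=TimestampNanos.now(),           # designated timestamp
    )
    sender.flush()                         # rows are buffered, then flushed
```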

This is an easy way to tap into an existing ecosystem rather than trying to recreate one, which would be a lot more difficult. So that’s one. And two, that’s a bit more anecdotal, but I think it’s quite interesting. The open source nature of the product makes it very easy for people to integrate it in their libraries and especially pretty sophisticated slash specialized libraries that end up facilitating a bridge with IoT devices and et cetera. So I’ll just give an example. There’s somebody from a very large industrial company who is building an open source OPC UA gateway.

And this includes a sync to a lot of time series databases. And QuestDB is one of them. And it’s pretty easy to include an open source product into those kind of libraries.

And if you’re able then to leverage those libraries, it makes the communication and integration with all those sensor protocols we talked about a lot easier. This is anecdotal, but I think it’s a good example on how open source software can be leveraged to be very well integrated into such a, I guess, complex ecosystem.

[Nicholas Woolmer]

So to add onto this, so yes, influx line protocol for ingestion is one of the main ways that people get data in. That doesn’t really cover querying, right? Which is the other main capability.

For that, we support Postgres wire protocol. And in fact, we’re actually gonna release a rewrite of it soon, which is better, faster, more compatible. We’re not Postgres.

We have similar syntax. We don’t support the same things, but most Postgres clients work well enough. So if you already use Postgres and then you move to QuestDB for some of your data, you’re pretty much good to go for many of these simpler group by, or by kind of queries.

So that’s one aspect, and we provide HTTP as well. On the side of IoT-specific data, Nick touched on OPC UA and one of our users who has this interesting open source project to bridge this kind of gap from IoT to the database, and other databases. For MQTT specifically, you have the concept of a broker, right?

With the broker, you connect to it and you subscribe to it, and it kind of manages to ferry these messages between all these different systems. But one interesting challenge is that brokers are fundamentally stateless. Whereas when you start integrating this data with a database, databases are absolutely not stateless.

That’s the whole point. So in this open source ecosystem, you have different people who are trying to kind of find ways that this data can be translated most efficiently into the database that you want ahead of time for you so that the user doesn’t have to worry about these concerns. And we do the same thing on our end with the influx clients that we produce.

We make sure that they make it easy to buffer your data. We make sure that you get error messages back if the data hasn’t gone in, and we help you to make your sending idempotent.

And it’s no use having a great database that’s highly performant if nobody can use it. And so that’s why supporting common stuff like PostgreSQL and these things just massively ease the use of adoption. And the more people who adopt it, the more feedback we get, the better product we will get in the end.

So it’s definitely high on our priority list to make sure that we’re not just only focusing on the database features, but actually on the ease of use and the user experience. Most of the time, they’re interacting with the APIs and not with the internals.

[David]

Gotcha. So if I understand, it’s use your visualization tool of choice. We’re going to enable you to use that through these common standards.

Is that really what we’re getting at? 

[Nicholas Woolmer]

Exactly. Yeah. We’re not going to try and make it harder for you to do the stuff you already do.

It just doesn’t make sense.

Future of Historians?

[David]

Now, understood. I mean, as I said, back when I was more on the selling side of historians, it was the client tool that everybody just loved. And it seems like now, no, it’s really the backend.

How are we going to use the data? There are just so many great visualization tools that are also open source out there. Use the one that’s there, which is a foreign concept to me, but it totally makes sense, because I’m all about interoperability: I want to be able to pick and choose whatever the application calls for.

And then I can use the tools that I work with. So that brings us to the last point. I wanted to just visit real briefly, or maybe we can talk about it for a while.

It’s: what is the future of historians? As I talked about earlier, it seemed like, oh, we’re going to take all this data and we’re going to just pump it up into the cloud, problem solved. Well, then we realized that maybe that wasn’t the best.

So we want to optimize that before sending it out there. But what do you guys see as the future trend? I talked earlier about time series data and process historians. Is there going to be a lot more blurring, where really it’s going to get very specific down to the use case?

So I’ll just go around the horn real fast. Lenny, thoughts on the future of historians and time series databases in general?

[Leonard (Lenny)]

I don’t know, to be honest; we’ve seen so many things and so many technological advances in the past five years. I think the traditional process historian has to adapt a little bit, and it has to adapt to some of the features that time series databases give us.

So that’s one thing. I think there’s definitely a notion that I need to have the data easily available and easily accessible without having to jump through hoops. So the whole notion of having a proprietary, closed environment, I think those days are over, to be honest.

And that’s where this new open source world is giving us great ideas and insights on how to achieve that. I think people are still at an infancy stage, to be honest, when it comes to taking the data that sits in historians and actually applying it to analytics and machine learning, really mining the data that we are storing. Because I think we’ve just been so focused on: I need to get my data into the historian, I’ve got my trend tool, and that’s that.

So I think there’s massive opportunity, currently not unlocked, in the data we are actually storing within our environments. Are we going to move everything to the cloud? Are we not?

I don’t know. I don’t have the answer for that. I do feel that edge computing is something that we should not overlook.

I think, like Luke said, there are really cool ways we can leverage edge computing to do incredible stuff. We don’t necessarily have to push everything up to the cloud; I know there are reasons for that.

But I do feel that as we go forward, edge computing is going to really revolutionize the way we use the historical data that we produce on the plant floor.

[David]

Oh, go ahead. I’m sorry. No, no, you can go there.

I was just saying, and you referenced this earlier, it’s hybrid: we’re going to have cloud-managed services, whether it’s AWS, Google Cloud, or Azure, with a managed piece of tech sitting down there that facilitates some of that. I think hybrid seems to be the direction it’s going.

[Leonard (Lenny)]

I think so. I think the future is definitely going to be a more integrated space between them. Absolutely.

[David]

Nick, what are your thoughts? The future of historians and time series data?

[Nic]

From a time series database perspective, and actually we can probably make it a bit broader, the future of databases in general, or narrow it down to analytical databases: I think there are three things.

One, openness; two, AI; three, private cloud. So openness is essentially this idea that the data will be stored in an open format, and that this will lead to people being able to choose the best possible tool for each individual use case. And each one of those tools will be able to read and process those open formats, such as Parquet, but there are also other ones.

And you could think of Iceberg as a table format, et cetera. I think we’re moving there. I don’t know how many years it will take, but you’re starting to see a pretty clear trend where companies are willing to pay extremely substantial amounts of money to buy companies building an open format.

You can look at Tabular, apparently selling for more than $2 billion, and they were building one of those open formats I mentioned. I think this openness will become ubiquitous eventually.

And it will not be abnormal to have different tools that are a bit more specialized for a specific use case. And using more than one will be completely normal. So that is one thing I see happening in the next few years.
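
To give a feel for the kind of openness Nic describes, the same Parquet file can be read directly by very different tools, with no export step in between. A small sketch, where the file name is hypothetical:

```python
# The same open-format (Parquet) file read by two different tools.
import pandas as pd
import pyarrow.parquet as pq

# A pandas DataFrame for ad hoc analysis...
df = pd.read_parquet("sensor_readings.parquet")
print(df.describe())

# ...and an Arrow table handed off to some other engine or pipeline.
table = pq.read_table("sensor_readings.parquet")
print(table.schema)
```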

The second is AI, obviously. How do you access all those models that are proliferating every day? There’s a lot of exciting stuff with time series foundational models.

It’s very new, but we actually see quite a lot of potential, especially in industrial applications. Same thing there: I think there will be open formats and open protocols that will be leveraged to access those models in the best possible way.

And the technologies that leverage that sort of bridge the best will be in a very good place, because this will become a very important consideration, pretty much already now and in the coming years. And then finally, private cloud is an interesting one. There has been this trend towards managed cloud offerings over the last 10 years.

You could see a SaaS company offering an essentially managed cloud service as the default strategic route to scale the business. However, what we’ve been seeing, and what the industry has been saying as well, is that a lot of companies are moving back from public cloud to private cloud. And there is a new type of offering that’s emerging as a result.

We are actually building it ourselves and it’s called bring your own cloud. So the data stays in your own cloud account. It’s not on the public cloud.

It’s in your own cloud infrastructure, in your private account. And we essentially bring live monitoring, so the logs are sent to us, the provider, in order to monitor the database 24/7.

And there are also agents that automate a lot of the database deployment on top. So you get this managed-like experience, but there are no concerns about data privacy, about where the data is, or all of those security considerations that make the public cloud a hard sell in a lot of industries, including manufacturing.

So, yeah, those are the top three for me.

[David]

Excellent. And then, Nicholas, what’s the future bring?

[Nicholas Woolmer]

What’s the future? If we knew, we could all build highly successful businesses, make a lot of money, and make a difference, right? For me, there are a couple of things that were alluded to earlier.

One I mentioned was this recent move to simplify: what do you actually need? Is Postgres enough?

This kind of thing. And I think that that’s slowly transitioning into, OK, maybe the complexity isn’t having a lot of tools. Maybe the complexity is using tools for the wrong things.

Maybe it’s OK to have several things that each handle very small, very niche, very specific, but very helpful use cases. And that’s not creating a massive burden, because your data isn’t locked in anywhere, right?

You just point the tools at the data and you go. So I think that’s a really positive direction for everybody, because it means competition isn’t just weighted on which cloud vendor you’re in and what offering they have. It’s now: what is the quality of your offering?

Is your database better at querying certain types of data than others? Is it providing a better user experience than others? So for the community and for people using databases, it’s great competition.

I think the other thing with a more open source future is that the community, and the problems that span different companies, become shared. And we see this: we have a public Slack where you can talk to developers and get help directly.

There’s no tiered support; you just speak to people who can solve the problem. But we also see people talking to each other, saying, oh, I’ve had this problem too.

How did you fix it? And then they’ll open issues, and on those issues they’ll share: this feature is great, but with this tiny addition it would solve a lot more of our problems.

It would simplify our stack; we could delete half our code. And I think that’s such a strong way to get feedback for your product, and also to win users, both open source and paying users, because you’re getting the very best feedback and they’re getting the very best help you can give them.

And so that combination of open feedback, where everybody can see what’s happening, and direct competition on the quality of your offering, rather than just on other business details, is a fantastic direction to move in. So I’m personally quite excited to see how things pan out.

[David]

Thank you all for being here, and thank you to our audience. I greatly appreciate the participation and everyone taking time out of their day.

So again, we’ll see everybody next time and look forward to that conversation. See you soon.
