What's New In Data

Database Kernel Development, Streaming PII Obfuscation, and Change Data Capture with Alok Pareek

Striim

Alok Pareek, Co-founder and EVP of Products at Striim, joins What’s New in Data to dive into the game-changing innovations in Striim’s latest release. We explore how real-time data streaming is transforming analytics, operations, and decision-making across industries. Alok breaks down the challenges of building reliable, low-latency data pipelines and shares how Striim’s newest advancements help businesses process and act on data faster than ever. From cloud adoption to AI-driven insights, we discuss what’s next for streaming-first architectures and why the shift to real-time data is more critical than ever.

Learn more about our latest release on Striim's Release Highlight page.

What's New In Data is a data thought leadership series hosted by John Kutay, who leads data and products at Striim. What's New In Data hosts industry practitioners to discuss the latest trends, common patterns for real-world data projects, and analytics success stories.

John Kutay:

Welcome back to What's New in Data. I'm your host, John Kutay. In this episode, I'm thrilled to be joined by Alok Pareek, co-founder and EVP of Products at Striim, the first platform to truly unify change data capture and distributed stream processing into a single powerful solution. Alok is a longtime database industry veteran, starting his career at Oracle, where he worked on core database internals for disaster recovery. He then became CTO at GoldenGate, pioneering CDC before it was even mainstream. Now, at Striim, he's redefining how data moves, with strong consistency, sub-second latency, and cloud-native scale. We get into the future of real-time data, innovations in streaming architecture, and clever approaches to handling sensitive data and AI workflows. Let's dive right in.

Alok Pareek:

Doing terrific. Thank you for having me, John, it's a pleasure. I've been meaning to come on one of these famous podcasts that you do, so I'm excited to be here today.

John Kutay:

Yeah, absolutely, and I'm equally excited. You know, I've always wanted to catch up with you in this forum, which is just incredible, and it's super interesting, all the collective knowledge you've gained throughout the decades. You're really one of the leading minds on this topic. But I want to open it up to you to just tell the listeners about yourself.

Alok Pareek:

Sure, so I can get started, and of course that shows my age too, that I've been around the database industry a while. So I started my career on the Oracle Database team, which was a fascinating experience. This was right after my grad school. I did some work at Stanford; in fact, Hector, God bless him, I worked with him as his TA, and he was sort of a heavyweight in transaction processing. He taught me a lot, told me to keep things simple, and always told me about some of the hard problems in industry, which I'm hoping to share with you. Yeah, so I started my career at Oracle, which was a fascinating journey for 10 years, and it was a really good program. They sort of had this amazing program where I got to be part of different teams.

Alok Pareek:

Believe it or not, I've actually got up at five in the morning and listened to Oracle support calls with some really amazing Oracle people, like Sherry Yamaguchi and these folks who were responsible for some of the core practical ultra-large database implementations in the world on the maximum availability architecture. Then a little bit of consulting, still being sort of on the database side, where there were always these firefights: sometimes support would have an issue and then development would take some time to navigate that. So there was this kind of rapid-response team called REACT, which I think stood for RDBMS Escalation and Activation Team or something like that. And so there were these people who were also programmers and could understand support issues and respond to customers quickly.

Alok Pareek:

So I got to experience that, and I learned that at that time the hardest area where a lot of people faced challenges was, you know, the database would crash, or they would lose a file, or they'd have some sort of disk corruption. So that kind of attracted me to recovery and self-healing systems and things like that. And that's when I went into the recovery team and spent close to 10 years there. So this is kind of the core area of crash recovery and media recovery in databases. That's how I got my career started. And then I went on to other things like GoldenGate and eventually on to Striim.

John Kutay:

Nowadays, Oracle is a household name in data and AI. You can't go to CNBC.com without seeing an article about Oracle stock or Larry Ellison, and even the new stuff they're doing with AI and collaborating with OpenAI through their latest projects. But in the generation when you started there, it was still very much a competitive and open market, so the work you were doing really led to their market dominance, which is pretty incredible to hear about, how you worked back then. And I wanted to get into one aspect that you mentioned to me, which is, you know, now we have this definition of big data, which is terabytes and petabytes of data. You were working on one of the early big data projects with, let's say, roughly a terabyte of data in a transaction processing system. We'd love to hear about that.

Alok Pareek:

Sure, sure, yeah, that's a fun story. So, you know, when I joined, I think it was probably early versions of 6, I think late 5, early 6. And then, obviously, the way the database development team was working, we were working on 7 also, because that's usually a few years ahead. At that time there was an initiative to try and have a very large, one-terabyte database between HP, EMC, and Oracle, and so it was interesting. There's a bunch of engineers from HP and Oracle, and we all actually flew to EMC, which was in Massachusetts. I still remember it very vividly. It was a huge kind of warehouse type of setup, and there were just literally disk drives sprawled all over. At that time disks were big, and there were cables all around, and so a number of folks went from Oracle and also from HP, and we literally just went in there and had to hook up the servers and the disks and the cables and prepare the Oracle database.

Alok Pareek:

Now, what was interesting about that was, at that time, loading one terabyte was kind of a crazy concept, right? So we had to figure out things like, you know, how are we going to load these things in a reasonable amount of time? And that led to a bunch of work, including direct path loading, where you could bypass some of the logging concepts, et cetera, and efficient indexing around that. We also had to lay out the data on literally different disks so there's no contention (this stuff is classic now) between your actual data blocks and your index blocks, literally just physically at a file level, at a disk level. And yeah, and then we published that. I don't know, this was pre-internet, I think, so I know that white paper is available somewhere, but I don't know if it's publicly available or not. But yeah, it was very exciting to declare victory on a one-terabyte database collaboration between EMC, HP, and Oracle.

John Kutay:

Yeah, that's super cool and you know you mentioned some of the contention between the indexing versus the storage level and the read level. Can you get into some of the technical trade-offs you had to make there?

Alok Pareek:

Yeah, so, I mean, typically, you know, at that time, and this technology has also advanced now, though it still continues to be a bottleneck to a certain degree, but at that time, when we talk about just hard disk drives, ultimately there's an arm, right, that's got to move, and so you have to pay that seek-time penalty. So if, let's say, you go through the optimizer and you try to look up something, you've got to go through your B-tree or whatever the variant might be, maybe a hash index, so you have to go fetch the root block there and the index block and then the data block. If they're all on the same drive, then, to a certain degree, there's a queuing that gets involved there.

Alok Pareek:

So trying to actually just keep them on different disks was kind of a classic technique that was used, right? We would say, okay, let's separate: even our log files would be on a separate disk from the data files, for example. And then, between the data files, if you had different applications, you'd try to also keep those separate at a broader container level, like a tablespace, for example. So those were some of the techniques that we used to be gentle to the optimizer, so to speak.
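To make the trade-off concrete, here is a rough back-of-envelope model, in Python, of why separating index and data files across spindles helped: a single disk arm serializes every seek, while two disks can overlap work under concurrent lookups. The timing constants and the simple queueing model are illustrative assumptions, not measurements from that project.

SEEK_MS = 10.0       # assumed average seek time for a drive of that era
ROTATE_MS = 8.3      # assumed average rotational latency (~3600 RPM, half a turn)
IO_MS = SEEK_MS + ROTATE_MS

def batch_lookup_ms(lookups: int, index_ios: int, data_ios: int, shared_spindle: bool) -> float:
    """Total time for a batch of concurrent indexed lookups under a crude disk model."""
    index_total = lookups * index_ios
    data_total = lookups * data_ios
    if shared_spindle:
        # One arm services every index and data block in sequence.
        return (index_total + data_total) * IO_MS
    # Index and data on separate disks: the two arms overlap,
    # so the busier disk determines the elapsed time.
    return max(index_total, data_total) * IO_MS

# 100 concurrent lookups, each walking 3 index blocks and fetching 1 data block.
print("same disk     :", batch_lookup_ms(100, 3, 1, shared_spindle=True), "ms")
print("separate disks:", batch_lookup_ms(100, 3, 1, shared_spindle=False), "ms")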

John Kutay:

Yeah, absolutely, and this was very much the early days for those types of workloads, so ingesting a terabyte of data at that time was considered the extreme end of performance requirements. Have you seen any adoption of those techniques in more recent storage engines?

Alok Pareek:

Yeah, I mean, nowadays, you know, there are several ways. See, ultimately the problem of doing fast loading comes down to: can you partition the data set? Number one, right? So if you can partition the data set, then it's just a matter of how many threads, so to speak, how many cores you want to throw at that problem, and you can load that super fast. Now, what's interesting there is, when you do these massive loads, what happens to the indexing, right? It's kind of inefficient to build the index as you're going along loading the data; your tree gets skewed and whatnot. And I think that's pretty much common now, right, where you can do things like direct loads and then build an index after the load itself, so that you can take a pass at all the data, your keys are more well organized, and you're not doing random seeks all over the place. So yeah, those are the early techniques that we had. And I can specifically tell you, in Oracle for sure, we had this client utility called SQL*Loader, and we're still running into some of these issues, by the way, even today. We evolved it; I mean, a number of developers were involved.

Alok Pareek:

But the idea was, hey, I'm going to actually do an insert directly in place. So let's say you have a table which has a bunch of blocks, and the blocks are on some free list. You grab a free list and say, okay, I'm going to insert it in there; that's an in-place insert. But a faster way was: well, let's just go and see what the high-water mark of this table is at a segment level, and then just go ahead and start shunting all the bytes there. The advantage there, John, was that if you died in between, you could just do one simple undo of the whole thing. Right.

Alok Pareek:

But imagine if you actually did an in-place insert. Imagine that I have 10,000 blocks to write, hypothetically, and I'm going into different areas on disk. Then, if I have to undo it, I've got to go to each one of those blocks all over again and undo that, right? So all of a sudden there are implications, not just for the forward-going load performance, but also for recovery, because if it's a direct path load, you can write it contiguously, but if you have to go in place, then you have to go to specific locations, and that's inefficient. And, not to mention, with a direct path load, if something goes wrong, recovering from it is super fast, right? So those are some of the techniques which I think are now mainstream, although I would still say I get surprised that not all systems have them.
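A small Python sketch of the loading trade-off described here, under the assumption that a sorted list can stand in for an index: maintaining order on every insert is far more work than appending everything first and building the index in one pass, which is the intuition behind direct path loads followed by index builds.

import bisect
import random
import time

keys = [random.randrange(10_000_000) for _ in range(50_000)]

# (a) Maintain the "index" on every insert, like an in-place load would.
start = time.perf_counter()
index_incremental = []
for k in keys:
    bisect.insort(index_incremental, k)   # keep the structure ordered as we go
t_incremental = time.perf_counter() - start

# (b) Direct-path style: append all the rows first, build the index afterwards.
start = time.perf_counter()
index_bulk = list(keys)                   # raw load above the "high-water mark"
index_bulk.sort()                         # one organized pass over all the keys
t_bulk = time.perf_counter() - start

print(f"index maintained per insert: {t_incremental:.3f}s")
print(f"index built after the load : {t_bulk:.3f}s")
assert index_incremental == index_bulk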

John Kutay:

Yeah, and you know, that's some of the fun work we do here at Striim: we're constantly evaluating the new, popular, up-and-coming architectures, whether it's the hyperscaler cloud storage engines like S3, which have their own approach, or, for instance, Iceberg, where you pretty much have to manage the write process yourself. I mean, it's great for reads and for managing tables at scale across object storage. But it's some of the surprising stuff that we come across, where you sort of have to build this stuff again, right, and take a pass at it as if it's a nascent area for development. But, you know, one of the things that you touched on in your description was the recovery process and the logging process. You've always had a lot of experience in that area. I'd love to hear your perspectives on your experience with recovery and how it's evolved over the years.

Alok Pareek:

Yeah, yeah. So recovery is definitely close to my heart. I mean, I spent a lot of years in database recovery. In Oracle, early on, I used to own one component called Log Writer, and at the time I joined Oracle, I want to say there were only like five of these main background processes. This is a C code base, and one of them was Log Writer, along with Database Writer, Checkpointer, System Monitor, Process Monitor, et cetera.

Alok Pareek:

So when I took that over, the interesting thing was that logging was this one thing which made things really fast, right? Like, I mean, if you just imagine, if you didn't have logging, then obviously for every commit you'd have to go in and really update the in-place table records, right? Just imagine, if I update, like, 10 records, then I have to go in and make sure that, as part of my commit, I flush all those blocks to disk. And so logging is sort of like this, and some of it has been around for a while, since the ARIES system.

Alok Pareek:

But you're doing sequential write-ahead logging to effectively be super fast and efficient in your commit path, right? So I go touch a bunch of stuff; it allows me to dirty a bunch of blocks, and then in one operation in the log I can go in and commit or undo my work. It also allows you to steal dirty buffers before commit, so that alleviates the memory pressure. So this area was really interesting, and partly I was more interested in recovery because customers would really struggle with this thing, right? Just to give you an idea, John: imagine that you have, I don't know, 200 files that make up your database, and they're sitting on, you know, 10 disks. So you have all these files, then you have your logs, and then you have the metadata, which is your catalog, your dictionary. And you'd get into these funky scenarios where someone would be like, oh, my database is not coming up, and you're getting some errors. So there is an amazing dance between these files, the log, and the metadata that lived in what was called the control file, where we would keep the appropriate metadata to know what the state of this thing is. So, very early on, I used to just love that you could manipulate a few bytes here and there in just one file, in its header, and it could sometimes just get corrupted because of, you know, just magnets, right? I mean, disks are ultimately magnetic, and things will go wrong. So what would happen is the DBAs would try something. They would try to recover this thing by saying, okay, I think that file is corrupted, so let me go to my backup that I made two weeks ago, slap on that file, and then, you know, it would still not work sometimes.
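As a toy illustration of the write-ahead idea Alok is describing, here is a minimal Python sketch: a commit is one sequential, synced append to a log, and after a crash the table image is rebuilt by replaying that log. The file name and record format are invented for the example; real redo logs are far more involved.

import json
import os

LOG_PATH = "toy_redo.log"

def commit(txn_id: int, changes: dict) -> None:
    """Append one commit record and force it to disk; the data pages can be flushed lazily."""
    record = {"txn": txn_id, "changes": changes}
    with open(LOG_PATH, "a", encoding="utf-8") as log:
        log.write(json.dumps(record) + "\n")
        log.flush()
        os.fsync(log.fileno())    # durability comes from the sequential log write

def recover() -> dict:
    """Rebuild the state by replaying every committed change vector in order."""
    table: dict = {}
    if not os.path.exists(LOG_PATH):
        return table
    with open(LOG_PATH, encoding="utf-8") as log:
        for line in log:
            table.update(json.loads(line)["changes"])   # redo each change
    return table

commit(1, {"acct:42": 100})
commit(2, {"acct:42": 75, "acct:7": 25})
print(recover())   # the same state is recreated after a "crash"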

Alok Pareek:

So then it almost became this amazing thing where these recovery developers were like brain surgeons trying to figure out what was going on with the entire recovery, and you would find out that the backup was perhaps taken inappropriately, so it's not a proper backup, where you have to go through some proper steps and so forth. And so I learned that, going forward in recovery, you have to record enough stuff so that when things really go bad, you're able to recreate a picture of what's going on: looking at a state and then saying, how did I get there? And I've always maintained that in my head, even during my career in recovery at Oracle, and then moving on to replication and streaming and then AI and so forth. Right, these things are sort of all connected.

Alok Pareek:

The log is sort of an interesting thing that is kind of a unifying framework to me across a bunch of this work that I've done till now.

John Kutay:

Yeah, absolutely. And, you know, we've touched on all the reasons why it's technically crucial to have logging in your database for recovery purposes. I'm just going to jump to the other end of the spectrum: what happens if you just don't run your database with logging? Where are the risks? How will it operate? What does it actually mean for the business users of the database?

Alok Pareek:

Yeah, so, in fact, depending on the definition of the database, there are many databases that don't run with logging, right? I mean, if you take a look at especially analytic systems, logging might be a big overhead just because you're doing massive parallel loads, and there's no point in logging all of that stuff. So what you do there is, you run this thing and then you just take another backup, right? So, in fact, what the log does is it spreads out your state evolution, right? Each change vector can take you from one state to another.

Alok Pareek:

And then, if you bypass logging, now you're going from one jump of a state to another jump of a state.

Alok Pareek:

So you just have to be methodical about that and be aware that the state to which you can recover is not going to be the most recent state.

Alok Pareek:

You might have work that's lost since the last time you took that backup, which may or may not be okay for your specific workload.

Alok Pareek:

And also, like I mentioned, if it's an operational database, then if you bypass logging, you will find that your actual I/O is going to start interfering; you won't be able to get as much throughput. Like I mentioned earlier, if for every single transaction (let's say my transaction has four actions and it's touching different tables) I need to go write to four different specific locations on disk, then, depending on the disk type and whatnot, I may have to pay the cost of random access to each specific block on disk. That's where logging obviously is good, right? I have a sequential point; I've already sought to the specific sector, so seek time and rotational latency are hardly a factor. So it really does have impacts on the operational side of the database, not just recovery.

Alok Pareek:

Absolutely, absolutely. Yeah, I mean, just imagine, right: if you're doing work and I want to commit, and let's say, hypothetically speaking, I just updated every single record in my table. And let's say I'm a large financial application, let's say PayPal or Square; I might have, like, 100 million users. That means I need to go in and update 100 million locations on disk, right? So what I want to do is go in and, in my memory, dirty or change all these buffers. But then, at the time of hitting commit, instead of me as a client just waiting around for all those blocks to be flushed, I just have one write to the log, and I'm done. Now, in the database, the flushing can happen lazily, based on your LRU schemes and whatnot, right? So that's kind of the optimization there.

John Kutay:

Yeah, and your work obviously evolved from working on the log writer of the database to GoldenGate. And, you know, just for a bit of context for the listeners, GoldenGate is, well, I'll let you describe GoldenGate. But logging in general, for instance, has received a lot of attention from the leaders in the industry and from those who actually adopt and run these massive big data processing and distributed systems products. For instance, Jay Kreps, the co-founder of Confluent, was famously infatuated with your work at GoldenGate. He wrote about it even before he started Confluent. But that's just one example. I mean, there are many examples of how logging has been deployed as, like you said, a unifying concept for journaling the state of an application. So I touched on GoldenGate. Would love to hear about your experience there.

Alok Pareek:

Yeah, so, you know, John, Jim Gray famously used to say the log knows everything. In fact, I did meet Jim one time at VLDB, but that's a separate story. So let me tell you how I got interested in GoldenGate. If you take a look at the use of a log, it went from backup and recovery very quickly to certain replication techniques, and there's been a lot of research in this area over the years, over the decades. But very famously, Pat Helland and Jim Gray talked about some of the evils of replication very early on, and they used to talk about how one of the classic techniques, obviously, is lazy replication. In lazy replication, you can use the log to make changes available not just to the node that's generating the transaction but also to other nodes, and you can catch them up using the log. So, rather than do eager replication, where you're replaying the same action at multiple sites, you just say, hey, I'm going to harden the commit on one and then asynchronously propagate it lazily. That was a classic technique in replication, and at that time, when I was at Oracle, this was also used for data protection, for things like standby databases, and eventually led to products like Data Guard and whatnot. But the evolution at GoldenGate that was very interesting was the heterogeneous application of logging onto a completely separate system. The idea there was that I may be originating my transaction on a specific database vendor, MySQL, Postgres, Oracle, but I want to replay that transaction logically onto a completely separate vendor stack, and that's the problem that GoldenGate very successfully solved, and it became, obviously, sort of state of the art. And so the application of it, speaking of replication, was to two things. Number one, just availability: for ATM networks, there may be banking applications running against an ATM, and you want high availability for the ATM. Replication based on GoldenGate became a really interesting technique to solve that problem.

Alok Pareek:

And the second one was performance: to scale out a workload. With the dawn of the internet era there was a lot of work around, hey, how do we scale these systems? Because the number of subscribers or users coming onto the system was orders of magnitude higher, right? Earlier, for example, you might have maybe hundreds of travel agents doing something. Now, with you and me logging on and trying to search for fares, that number just became millions or higher.

Alok Pareek:

So the choices there for an airline reservation system are: okay, I've got to scale this thing. If you scale it vertically, that's a huge cost function. So at that time, technologies like GoldenGate solved it rather elegantly by saying, well, let's go ahead and create one master and N slaves in a replication architecture, from a logical perspective, to separate out our writes from our reads. And so they cleverly used that distribution for scale and performance. That was the idea behind GoldenGate. And obviously, as we made that successful, there were a number of interesting problems that emerged during GoldenGate, and that's what led us to Striim, trying to address some of the problems we were facing back in the 2000-to-2010 timeframe.
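A rough sketch of the lazy, log-based replication pattern just described, reusing the toy log format from the earlier write-ahead-log example: the primary hardens its commit locally, and a replica catches up by tailing the log from its last applied offset. This only illustrates the shape of the technique, not GoldenGate's actual architecture.

import json

LOG_PATH = "toy_redo.log"   # same invented format as the earlier sketch

# Primary side: a commit lands in the local log only; nothing is sent eagerly.
with open(LOG_PATH, "a", encoding="utf-8") as log:
    log.write(json.dumps({"txn": 3, "changes": {"fare:SFO-JFK": 199}}) + "\n")

def catch_up(replica: dict, offset: int) -> int:
    """Replay committed changes from a byte offset onto the replica; return the new offset."""
    with open(LOG_PATH, encoding="utf-8") as log:
        log.seek(offset)
        for line in log:
            replica.update(json.loads(line)["changes"])   # logical apply on the target
        return log.tell()

read_replica: dict = {}
applied_to = catch_up(read_replica, 0)    # a read-only copy for scale-out queries
print(read_replica, "caught up to byte", applied_to)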

John Kutay:

Yeah, like I said, it's really foundational software that inspired the big data processing products that have come out since then. And if you look at GoldenGate now, it's under Oracle, of course, and you rejoined Oracle after GoldenGate was acquired. Now, if you want to research GoldenGate, you can go look at the Magic Quadrant for data integration products, and it's at the top there, widely recognized as foundational software, like I said, for many replication and database processing workloads. And that's kind of a good segue into Striim. What inspired you to start Striim? Is it just another GoldenGate, or is it something else?

Alok Pareek:

Yeah, no, that's a great question, and I get that question quite a bit, John. Obviously it's not the same thing, right? So, if I go back, there were two distinct things that we were trying to address that GoldenGate did not address, and that converged with the advent of some of the work coming out of Hadoop and Spark, the Berkeley AMPLab guys at that time, with this whole approach to big data through parallel paradigms like MapReduce, et cetera. One was the focus on not just structured data but also on semi-structured or maybe unstructured data. Addressing the breadth of these sources was interesting and important, because people were beginning to ask for that. I'll give you an example: GoldenGate was focused more on database-to-database stuff.

Alok Pareek:

But when people would bring those problems up, like, hey, we also want to go apply some of these changes onto a non-database, some of our teams went in and did something kludgy to make it work, right? But there was a pattern there, and I don't think that pattern was solved very elegantly. It was more of a band-aid add-on, and some customers used it. But that was kind of one idea: that you could generally solve this problem to truly account for the heterogeneity, the structural mismatches, the syntactical mismatches between different systems, right? Databases, NoSQL systems, messaging systems, storage systems; you can see the gamut of these things. And, you know, it's interesting: one of the core drivers behind some of the things we are doing today, with the advent of the AI era, the streaming AI or generative AI era, if you will, is the fact that even to enable my AI agents, I need to go in and retrieve information from somewhere and then process it and maybe allow someone to take an automated action on it. So it was important to address the various sources. Okay, so that was one of the problems.

Alok Pareek:

The second was that a lot of databases were becoming very, very large, so the scale part of it was coming in. And when you have the scale come in, sometimes, between the different points of a distributed system (you can call it a mesh or a fabric or however you want, but fundamentally it's a distributed system with multiple nodes), if you push one of these nodes super hard in terms of how much traffic is coming through, you're going to build up latency as the data is going to one of the additional points in your distributed ecosystem. So there was an open challenge there, John, which was: there's this lag building up, so what is going on here?

Alok Pareek:

And that actually meant something to the businesses. They would say, you know, typically we see that the propagation latency (the message propagation latency or replication latency, or lag, however you want to call it) between two of my systems is maybe a few hundred milliseconds, but it's 35 minutes right now. So that means there's a spike going on. And there was this general problem of how you go in and literally peek into the spike, and observe it and analyze it, to figure out what this really means to the business. And I think that was where these two things, accounting for the newer sources of data, and then the scale part of it, as these latencies started going up, the ability to observe that and try to explain it, necessarily warranted some sort of an interesting engine.

Alok Pareek:

And that's where Striim is sort of a GoldenGate++ in that sense, right? It's not just about the data movement piece of it; it's also data movement applied generically. But then, on top of that, the ability to literally have transparency and observability on the moving data, to try and make sense out of it, to express declarative queries on it, to express any AI automation on it. So we've evolved that system now to the point where you can actually do a lot of smart intelligence as you are dealing with this distributed system, and that's one of the very powerful capabilities here.

John Kutay:

So GoldenGate was really change data capture built to solve the 1990s, early 2000s generation of problems, which was replication for heterogeneous database setups, for high availability, if you had multiple database nodes that you wanted to keep in sync, which is a great use case and still applied in a lot of areas. Now, when we look at Striim, change data capture, high-speed, low-impact change data capture that goes against the logs, is one of the core features. But it's also natively tied into a stream processor, and it's a horizontally scalable stream processing engine. So what made you look at that? You know, obviously GoldenGate didn't do stream processing. Actually, I'll ask you for clarification: was GoldenGate an in-memory stream processor? How did it actually replicate changes?

Alok Pareek:

Yeah, so GoldenGate, first of all, didn't have any streaming engine within its architecture, right? The idea there was more of a process- and component-based publish, distribute, and apply type of architecture. There was a CDC process that would take the changes and push them onto a queue, and there was another component that would read the queue and then serve as a client to a database and deliver the changes. In Striim, the architecture is very, very different. CDC is also a part of it, but not only CDC from databases: we're also getting changes from, for example, NoSQL systems like MongoDB, getting them from the oplog, or from Snowflake from their change streams, or using our delta-identification techniques on things like BigQuery and Redshift. So we are able to universally identify what's changing and move that, right? That's the extension part of it.

Alok Pareek:

But to your point, within the streaming engine that Striim has, we have very high-speed in-memory constructs that allow data to go over sockets. You don't have to do just that; you can also leverage Kafka as a persistence layer underneath the covers to free up the publisher and the subscriber so that they can move at their own speed. Both of those are available within the platform. But I think the big novelty there was the presence of an actual engine, which is the continuous query processor. And the continuous query processor is the part, John, that I was talking about, where you can observe the data and you can try to analyze the data. So oftentimes, early on, this may have been characterized as streaming analytics, right, where, when Twitter and Google and these guys came in, or when, let's say, the concept of likes was introduced, they would count up things super fast. So how do you do that? That means you've got to constantly take your metrics and aggregate them on the fly. You don't have the luxury of pushing them to disk first and then doing it. That piece of it is where this capability comes into Striim. And this was, by the way, patented: our ideas were patented in early 2014, and we got that granted. Flink was invented after that, by the way. And at that time, Twitter was using Apache Storm, which was just ridiculously slow; it was like an order of magnitude less than what Striim could do.

Alok Pareek:

So what we've done now is made sure that you can express these declarative queries on the moving data, and you can apply windowing functions there. You can apply interesting agents which allow you to do AI or machine learning on the data that's moving. In other words, we're coming into this era of hyper-automation, right, where earlier I needed to do things manually: okay, let me grab data from here, let me push data there.

Alok Pareek:

Now what we are doing is opening the world up to this class of applications where the data is moving and the system is actually reacting in an automated way to say, hmm, I think there's a pattern here, someone needs to take an action on it, and I've taken the human out of that. Now, these are very interesting ideas and advanced concepts that we are enabling, but they are going to be super interesting with some of the newer patterns we are seeing with agentic AI and hyper-automation. That's where you truly do need to observe the data that's moving, on the fly, in real time, because otherwise you're as stale as the last time you trained on some data. And that is kind of game-changing, in my view.
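For a feel of what a continuous query over moving data does, here is a minimal Python sketch of a sliding-window aggregate that updates as each event arrives, rather than landing the data on disk first and querying it later. The event shape, the 60-second window, and the replication-lag metric are arbitrary choices for illustration; Striim expresses this declaratively rather than in hand-written code.

from collections import deque

class SlidingWindowAverage:
    """Keep the last `window_seconds` of events and answer aggregates on every arrival."""

    def __init__(self, window_seconds: float):
        self.window = window_seconds
        self.events = deque()                 # (timestamp, value) pairs inside the window

    def on_event(self, ts: float, value: float):
        self.events.append((ts, value))
        while self.events and self.events[0][0] < ts - self.window:
            self.events.popleft()             # evict anything older than the window
        count = len(self.events)
        avg = sum(v for _, v in self.events) / count
        return count, avg                     # continuously updated result

win = SlidingWindowAverage(window_seconds=60)
for t, lag_ms in [(0, 120), (10, 140), (45, 2100), (70, 35000)]:
    count, avg = win.on_event(t, lag_ms)
    print(f"t={t:>3}s events_in_window={count} avg_lag_ms={avg:,.0f}")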

John Kutay:

Absolutely. And, you know, in terms of applications of this, ABC's Good Morning America did this great segment on how UPS is using AI to battle porch pirates, people who basically steal packages from your porch. It has huge business impact, because this is a very obvious issue to consumers who don't want their packages stolen and who are looking for ways to mitigate the risk. And UPS has deployed this really amazing next-generation cloud-based AI stack, which does include using Striim to get that operational data into the AI engine. So that's one real-world deployment of Striim for AI, and it can also be broadly used for analytics use cases; it can even be used for replication. I'll get into how we use Striim for our own internal workloads, but I wanted to talk to you about the larger use cases for tying Striim with AI.

Alok Pareek:

Yeah, great question. So I think we've just begun to get into that area. If you take a look at AI in general, and its evolution all the way to generative AI and the applications in natural language processing, text summarization, and so forth: we introduced AI in Striim from a machine learning perspective very early on, and I think we published some of this work at one of the VLDBs, I think 2019 or 2018, I forget the year, where we talked about an online machine learning model that we were keeping fresh and up to date for predicting what network traffic patterns would be, based on the historical and the real-time stream that we were seeing. So where I see streaming and AI converge is, number one: AI is only as good as the data. Everybody understands this, and you've probably heard it a hundred times. Now, it's remarkable that at Striim we tend to get a lot of this data from highly accurate, high-fidelity systems like databases, where we have bothered to curate the data and push it into a knowledge base. So, to a large degree, despite some minor cleansing issues and whatnot, it's serving as the operational tier and it's high-quality data. So, as that data is getting updated in real time, making it available to an AI model in real time, that is one area where there's a very nice convergence and, I would say, an interesting use case, because it still requires you to do change identification, right, if I'm a large retailer or a large logistics carrier trying to track my shipments and so forth.

Alok Pareek:

So, as newer things are happening, how do you make sure that an AI model is made aware that things are actually happening around the model, right? Intelligence is not about being stale and dumb. Intelligence is about being super aware and learning on the fly. An intelligent person is not somebody who can just answer your questions; an intelligent person is also somebody who takes real-time feedback and dynamically, constantly learns. And that's where I see these worlds converge: so far, the lion's share of the attention has been on, hey, I train the model and I do inference from that model.

Alok Pareek:

Now, slowly, we are beginning to see, through, say, a retrieval-augmented generation pattern, that the model can actually be enriched, perhaps with better context, which typically translates into the form of vector embeddings and whatnot. So how do you create these vector embeddings so that your fabulous new gen AI application can take advantage of them? Well, it needs to get real-time data from somewhere to create that vector. That's where I see the convergence of the streaming capabilities and the AI capabilities. And, in fact, we've introduced our own agents into the Striim platform, where we are doing some fascinating things which traditionally have been super hard to do, like identifying any leaks of personally identifiable information, for example.

Alok Pareek:

Or, is there any sensitive data going through my pipeline, and not just from databases? It could be a communication tool like Slack, right? If I cut and paste something into a Slack channel, who's monitoring whether or not I'm allowed to share this with whoever's on that channel? So you want to scrub these things. And I think, as AI unfolds, the ability to do real-time ingestion into the model, and the ability to make real-time data available to the model for the purposes of RAG, for the purposes of fidelity, for the purposes of trust: these are the kinds of things where I see Striim and AI converge.
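One way to picture the "keep the model's context fresh" pattern is the sketch below: a change event coming off CDC is turned into text, embedded, and upserted into a vector store keyed by primary key, so retrieval always sees the latest version of the record. The embed function and the dictionary standing in for a vector database are placeholders for the example, not any real API.

from typing import Callable, Dict, List

def sync_change_to_vector_store(
    change: dict,                                   # e.g. {"op": "UPDATE", "pk": "...", "row": {...}}
    embed: Callable[[str], List[float]],
    vector_store: Dict[str, dict],                  # stand-in for a real vector database
) -> None:
    """Apply one CDC event to the retrieval layer so RAG queries see current data."""
    if change["op"] == "DELETE":
        vector_store.pop(change["pk"], None)        # stale context has to disappear too
        return
    text = " ".join(f"{k}={v}" for k, v in change["row"].items())
    vector_store[change["pk"]] = {"embedding": embed(text), "text": text}

# A trivial fake embedding so the sketch runs without any model dependency.
def fake_embed(text: str) -> List[float]:
    return [float(len(text)), float(sum(map(ord, text)) % 997)]

store: Dict[str, dict] = {}
sync_change_to_vector_store(
    {"op": "UPDATE", "pk": "order-123", "row": {"status": "DELIVERED", "carrier": "UPS"}},
    fake_embed,
    store,
)
print(store["order-123"]["text"])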

John Kutay:

Yeah, and especially for these inference-time workloads. I'll just ask you a blunt question, which is: okay, I have all these AI agents. Why don't they just go talk to my database, or go into my Slack and hit the API there, or go fetch the data from the source at the time it's needed, versus replicating the data into the model?

Alok Pareek:

Yeah, that's a great question, and I think that in some workloads that is absolutely the right thing to do. However, what we really see in enterprises and in our customer base is that they're not leaning on one system, right? Typically these systems evolve, and that's what I was saying earlier: the pattern is that I have this interesting network of distributed nodes, and there are things happening in this distributed network at different points. It's like a time-space thing, almost, right?

Alok Pareek:

So if something happens, the interesting thing is: how do you get notified? In your example, how does the database of record get notified that something else, externally, has changed, but the model needs to be aware of it? That's where change identification is a problem. And the second part is that the agent is operating on some model that was trained as of a certain time. So how do you then make your inference smarter? How do you give it advantages to say, hey, if something has changed, and this agent is running some sort of vector search here, how do you get the most recent representation of that semantic information to the actual vector search so that you can further qualify your specific response? And now you're fine-tuned on the fly.

Alok Pareek:

So that, John, I think, is where I would draw the line: obviously we complement some of this fascinating work that others like OpenAI or Gemini and Cohere, all these guys, are doing, right? Their strength comes in where they're saying, hey, look, we have a zoo or a garden of these models, so depending on which multimodal aspect you're talking about, image or audio or video, select this model. What we're saying is that, hey, as events are moving, as real-time transactions are occurring in the real world, there are changes there that need to interact with the model itself. And this model, if it's only operating in isolation, is not going to have the most recent change information available, either in its raw form or in its vector form. That's what I think Striim is adding to that mix.

John Kutay:

Absolutely. And so many of the fundamentals and best practices of data management apply in AI as well. It's unpredictable how many AI agents and workloads can be running at the same time, and are you really going to let them go pull from your production operational systems? It's the same question as: why do we have OLAP? Earlier in the season we had Andy Pavlo from CMU, from Carnegie Mellon, and I asked him the same question. Basically, we wouldn't otherwise have engines like Snowflake and Databricks, which looked at the analytics platform and really centralized data for BI and data science purposes. The operational teams are running the database, and your analytics teams are generally running your data warehouse. Then you'll have AI teams, whether they fall under engineering or analytics. I think that's every company's own adventure that they need to figure out, but they'll still need their own data store.

John Kutay:

A data store that has fresh, curated, governed data specifically for AI workloads. And, just like the analytics team, which might have hundreds to thousands of analysts and data scientists who can be running workloads at any time and who don't want to hit the operational database, they want to work on their data lake. AI agents will very practically run into the same issues: you'll have any number of AI agents, and they'll need their own sort of layer of indirection, their own kind of buffer zone, to use as the context window for their data. So, absolutely, replicating and streaming data into the models makes sense there. And I want to bring this back to some of the work we did around data governance for AI, specifically in the latest release of Striim, what we call the fifth generation of Striim.

Alok Pareek:

Yeah, and, you know, I'm really excited about some of the work that we have done, because it is very novel. To my knowledge (we're in February 2025 now, right?), I don't think anybody else is doing this yet. So the idea was that, hey, when I'm setting up my data pipelines, could I make sure that I'm using some of these foundational models to identify things that are private and sensitive? And so we've actually applied the models. We introduced two new agents into Striim 5.0, which is our fifth generation, as you said. One of them is called the Sherlock agent and the other is the Sentinel agent, and we thought about it very thoroughly and from a holistic perspective. There's a design-time part of this thing where, before I set up my pipelines, I want to know: am I going to encounter any sensitive data? We designed one specific agent for that purpose, to go investigate, do a bunch of snooping around on the systems where my events are going to come from, and tell me whether I'm going to encounter any sensitive data. And again, we've optimized it; we made sure that we have the right frequency and sampling there so we don't disrupt the external world. That's one powerful capability, before you actually go in and start moving the data. Now the second part of it is Sentinel. Sentinel is this incredible agent that literally sits in the pipeline; it reminds me of some of these sci-fi movies where you can peek into a bus and look at the people and say, okay, that's John right there. So Sentinel actually runs a lot of the sensitive-data detection on the various attributes.

Alok Pareek:

And then, for governance reasons, identifying that something might be a social security number, or a Singaporean national identity number, or just an address somewhere in Tunisia. It's very accurate and very powerful, but we didn't stop there. Once you identify it, the question is, for governance, what do you do next with it? And that's where there's a gamut of policies and actions that we have added. So, as a policy, I can, in Striim, go in and say: deal with this sensitive information categorically, by either masking the entire thing, or masking a portion of it, or encrypting it, and/or tagging it. And I think tagging is a very powerful technique. I think that a few years from now this is going to become mainstream, because tagging has a lot to do with this.
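To make the policy side concrete, here is an illustrative Python approximation of the actions described above (full mask, partial mask, tag). Striim's Sherlock and Sentinel agents use models for the detection step; the single regex below is a deliberately crude stand-in so the policy application can be shown end to end.

import re

SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")   # crude stand-in for model-based detection

def apply_policy(event: dict, field: str, policy: str) -> dict:
    """Apply one governance policy to a field that may contain sensitive data."""
    value = event.get(field, "")
    if not SSN_PATTERN.search(value):
        return event
    if policy == "mask":
        event[field] = SSN_PATTERN.sub("***-**-****", value)
    elif policy == "partial_mask":
        event[field] = SSN_PATTERN.sub(lambda m: "***-**-" + m.group(0)[-4:], value)
    elif policy == "tag":
        event.setdefault("tags", []).append({"field": field, "category": "US_SSN"})
    return event

event = {"note": "customer SSN is 123-45-6789, please verify"}
print(apply_policy(dict(event), "note", "partial_mask"))
print(apply_policy(dict(event), "note", "tag"))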

Alok Pareek:

Trust in AI, right: the fact that I'm able to go in and tag a piece of data using my model, to give you more metadata around it.

Alok Pareek:

So that, right from understanding what the lineage of this data is (can I trust it? what did my observers, all these agents that are observing it, think about it at the time it was moving from one location or one node to another?), you can see that, as this evolves, we are putting in these breadcrumbs. Which brings me all the way back to the topic of recovery: there's a pattern here from a logging perspective. If I want to evolve something, I need to keep track of stuff in an efficient way. And the same concepts we learned all the way back, that there are these changes, these breadcrumbs, that I need to tie to events, let me go ahead and leverage those artifacts, whether for backup reasons, replication reasons, heterogeneous logical replication reasons, stream processing reasons, and now AI reasons. So I'm really excited about these capabilities in Striim 5.0.

John Kutay:

Yeah, absolutely, and this has been discussed even with other guests here on What's New in Data: the quality of the data influences the quality of the AI.

John Kutay:

So having inaccurate data, data that doesn't make sense, will all lead to negative outcomes with AI: hallucination, and things that just generally don't work and make people say, okay, this is another form of AI slop, no one's going to invest in it. Then the other side of it is making sure that it's governed right, making sure you're not sending customers' social security numbers and things like that into a workload that becomes a customer-facing chat experience, where, hey, me, John, I can look up Alok's social security number just by asking. I mean, these are the types of data leakage that can realistically happen.

Alok Pareek:

Right, and it happens all the time, John. We talk to customers all the time and they're like, oh yeah, such and such happened, and it's always post facto. I always tell people that it's surprising to me the cost that people pay in a reactive manner, rather than proactively thinking about this as a preventable problem. You may have, like, 100 security products out there, but you need a horizontal view across these things, to see the corners where they're glued to each other. Is there a correlation happening there? Is my attack surface being tracked in real time? And those are some of the interesting things where I do think that now, with these security agents and so forth on this moving data, we're going to venture into that area soon as well.

John Kutay:

Yeah, absolutely. And there are tons of use cases for Striim. For instance, we have our serverless Striim developer platform, where anyone can go in and try these features for retrieval-augmented generation, vector embeddings, and replication. We're basically offering a multi-node Striim cluster, and that's just in and of itself a use case of Striim: offering a Striim cluster as a service, data streaming as a service. And then, even on the backend, the way we ensure high availability in the case of regional or availability-zone outages, we're actually using Striim to replicate to a backup database and then restarting the cluster with the same metadata. Then, on top of that, we say, okay, well, what do people actually get out of these Striim pipelines? What are the numbers of weekly active users? How far do they get into the pipelines? Again, that's another use case of Striim, because we copy that data into our analytical warehouse, where we use it for reporting.

John Kutay:

A lot of people have this misconception that streaming is super complicated. And then I actually look at all the streaming guides and implementations: oh well, you've got to set up your open source CDC platform, you hook that up to your Kafka cluster, you've got to set up your ZooKeeper (now it's KRaft) and run this Kafka cluster, you have to think about topics and brokers and the replication factor and the cost implications of that. And then, finally, once you get into interfacing with analytical systems and AI systems, okay, now you have to think about what each system actually does with an event, what the payload of that event is, and how it's actionable. So that in itself is an engineering problem.

John Kutay:

But I think, especially with the next generation of work, and especially getting to the business value, the time to value, as fast as possible, just having a product that abstracts that for you end to end makes it a lot simpler, as a managed service or something you can deploy in your environment. So that's super exciting, super exciting for the work you're doing here. But I also want to back up: we were having a really fun conversation about databases. What's a general fun fact about databases that most people don't know?

Alok Pareek:

Good question. Well, I can talk about one specific database that I worked on, and I think it's still true, but maybe a lot of people may not know that you could configure, like, an odd block size in a database. Usually you think, okay, well, maybe 2K or 4K or 8K or 16K or 32K or 64K. But I think it's true that you could probably have it as any multiple of 512 bytes, so you could probably get away with a three-and-a-half-K block size. I don't know why you would do that, but I think it's a fun fact. So I think that used to be true, at least at the time I was dealing with it, but I'm not sure if that's still the case now or not.

John Kutay:

Yeah, and you know, there's a lot of interesting stuff, especially when you get into the bare-metal aspect of databases: how databases kind of compete with the operating system for resources, and even the way the database assumes that certain resources might be available. For instance, if you take an operating systems class, you'll know that operating systems have their own buffer cache, and then databases, on top of that, have their own buffer cache implementation, which is a bit of a different approach. I don't know if that's something you could comment on?

Alok Pareek:

Yeah, so you can manage the buffer cache directly in the database layer and bypass the OS cache completely. Even in the interfaces, let's say when I'm just doing I/O to a disk, I can just bypass the operating system. I can indicate to the operating system, I don't want this cached, right? So then it doesn't store it, and now you don't have to pay the penalty of a multi-layer caching scheme there; the database itself will directly say, I'll manage this thing. So at least in the advanced databases in the world, that should just be de facto, in my view.

Alok Pareek:

But I don't think everybody does that.
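For the curious, here is what bypassing the OS page cache looks like from user space on Linux: open the file with O_DIRECT and do the I/O from page-aligned buffers of aligned sizes. This is a bare Python sketch of the flag in isolation (it assumes a Linux filesystem that supports O_DIRECT); a real engine wraps this inside its own buffer cache.

import mmap
import os

BLOCK = 4096                                  # one aligned "database block"

buf = mmap.mmap(-1, BLOCK)                    # anonymous mmap gives page-aligned memory
buf.write(b"A" * BLOCK)

# O_DIRECT tells the kernel not to stage this I/O in its page cache.
fd = os.open("direct_io_demo.dat", os.O_RDWR | os.O_CREAT | os.O_DIRECT, 0o644)
try:
    os.pwrite(fd, buf, 0)                     # write the aligned block at offset 0
    readback = mmap.mmap(-1, BLOCK)
    os.preadv(fd, [readback], 0)              # read back into an aligned buffer as well
    print(readback[:8])                       # b'AAAAAAAA'
finally:
    os.close(fd)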

John Kutay:

Are there any open source databases you can kind of point to?

Alok Pareek:

I'm honestly not sure. I haven't caught up, John. I've been busy with Striim, so I haven't caught up with what some of the open source guys have been doing lately. But, you know, it's 2025; I think it should be there by now. Whether or not everybody leverages it is a separate question, but in code I think it's doable. In most of the popular ones, like Postgres for sure, I think you can do that.

John Kutay:

Yeah, absolutely. And you also had a fun project with running Oracle on the Mac.

Alok Pareek:

Oh yeah, that's probably one of my biggest regrets, by the way. Well, let me clarify that. So back in the day, we ended up doing this project which kind of allowed you to take some objects and plug them into a Windows- or a Mac-based, okay, let's not talk about Mac, just the Windows-based database. Prior to that, John, you had to go through and do this massive thing where you would unload all the data and reload it through the SQL engine.

Alok Pareek:

So the problem there was that it was kind of dependent on the size of the data. If you had 100 gigs versus a terabyte, all of a sudden it's 10 times the penalty that you're paying. But if you're at disk copy speeds, you could do this much faster, right? Okay, so we wanted to do that. This was when I was still at Oracle; I think it was Oracle 8 or 9, I forget exactly the version right now. It was called cross-platform transportable tablespaces, and as part of that we needed to figure out, okay, what are the various platforms you could take this to.

Alok Pareek:

And Oracle had a lot of ports at that time. It used to run on Siemens and HP-UX and Mac and all kinds of flavors like AT&T SVR5, plus a bunch of other operating systems — DEC VMS and so on. So we had to manage that and make sure we restricted the number of platforms we wanted to port the Oracle database to. As part of that we did some analysis, and I guess one of the regrets is that Oracle used to have a port on Mac OS and we nuked it at that time. Mac OS X didn't make it to that list because Apple was kind of going down then — this was the 90s, right before Steve Jobs rejoined.

Alok Pareek:

Yeah, so it was never considered — we were like, who would run an Oracle database on a Mac? But now, let me tell you, I love the Mac, I'm a developer on the Mac, and I wish I could just run the Oracle database on my own Mac. So that's one of the regrets that I have. That was a fun fact about the porting aspect of Mac OS X — now there's no Oracle port on macOS.

Speaker 3:

Oh wow, your team didn't want to think different at the time.

Alok Pareek:

I thought we were thinking differently. We were like, who's going to run the database on that? It's not a serious operating system that you'd want to run an Oracle database on. But ultimately I think we've all converged towards the Linux flavors anyway.

Speaker 3:

So in retrospect, yeah, Dockerized Oracle on a Mac is totally acceptable now, exactly. But Oracle famously has over 25 million lines of code, and porting it to every operating system, at the level Oracle had to support it, must have been a massive undertaking. So of course it made sense that you had to prioritize the platforms people were realistically deploying on at scale.

Alok Pareek:

Yeah, and then you've got to test all these combinations, and there were genuinely some ports that were — well, I don't want to get into anything that's not documented publicly, but let me just say that some of the platforms have this sort of special header, and not everybody had that header. Imagine you're the database manager — as in, the code — and someone says, go start up the database. You've got to go examine all these files and their headers to say: do these belong to me? Is this kosher? That's the dance between the headers and the metadata I was talking about earlier. Is this safe for me to open?

Alok Pareek:

And if you didn't have that header, it was tough to know: is this corrupted, or is it actually something coming in from some other platform that someone has shipped to me? In other words, if you're doing this trick of plugging data from one database into another, I have to ship it, right? So when this database manager opens that file, it needs to recognize what it's dealing with. And this non-standard piece — we used to call it block zero, or the OS block header —

Alok Pareek:

not everybody had it, so that made it complicated. That's why we had to trim down and make sure the platform list was manageable, so we didn't end up with this sort of N-by-N problem. We restricted it to just a small set of ports — I think we brought it down to maybe 18 or 20. I don't even know if there are 18 ports anymore.
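To make the idea concrete, here's a purely illustrative sketch of that kind of block-zero check. The magic number, header layout, and platform IDs are made up for the example; this is not Oracle's actual on-disk format.

```python
import struct

MAGIC = 0x00DBF11E                  # hypothetical magic constant for "my" files
KNOWN_PLATFORMS = {1: "linux_x86_64", 2: "solaris_sparc", 3: "aix_ppc"}

def check_block_zero(path: str) -> str:
    # Read the first 8 bytes of "block zero": a magic number plus a platform ID.
    with open(path, "rb") as f:
        header = f.read(8)
    magic, platform_id = struct.unpack("<II", header)
    if magic != MAGIC:
        raise ValueError("not one of my files, or corrupted")
    if platform_id not in KNOWN_PLATFORMS:
        raise ValueError("file was shipped from a platform I don't recognize")
    return KNOWN_PLATFORMS[platform_id]   # safe to open; caller can proceed
```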

Speaker 3:

It's probably a smaller number now. Yeah, definitely. Database development is a fascinating area, and it's becoming more popular than ever.

Speaker 3:

Like I said, we had Andy Pavlo earlier in the season, and Marc Brooker from AWS, and it's incredible to see that it has almost a cult following as an audience — thousands of people follow the trends in database development. And I think ultimately a lot of people ask: well, now that I know how a database is built, why don't I just build my own? But then you look at real-world implementations of databases and see, okay, Oracle is 25 million lines of code, and it's not slowing down.

Speaker 3:

I mean, Larry Ellison was just up on stage with Sam Altman and the president, talking about the $500 billion that's going to be invested there, right? And then you look at some of the popular open source databases like MySQL and Postgres, and even the growing NoSQL document stores like MongoDB and things along those lines. So there's a lot of work out there, and a lot of opportunities to stand on the shoulders of giants — especially for data engineers who want to do impactful work that relates to an operational database, whether they're scaling it or trying to get data out of it for analytics and AI use cases. It's really important to be aware of these kinds of fundamentals.

Alok Pareek:

Absolutely, and you're right, John. A lot of people have taken a stab at writing a database from scratch. It can be done, but the evolution of it — addressing a generic class of workloads — that's not an easy problem. I mean, you mentioned Oracle's lines of code, right? If you take a look at some of the more recent stuff — model extensions, vector representations, trying to do open neural network extensibility through PL/SQL functions — there's a lot of work in that.

Alok Pareek:

So when you start embarking on some of these things, in a narrow domain you could always build something that's super specialized, and we've seen that — a bunch of folks have done it. But trying to sustain it across hybrid workloads while still keeping up with the cutting-edge technology requires a lot of R&D budget, a proper investment in that area, and that's where you'll fall behind very rapidly, because you don't know where to focus at that point. So I do think it's not an easy problem. But if you want to restrict it to a specialized domain, yeah, that'd be fun — to build another database, I guess. Okay, that's our next project.

Speaker 3:

Let's do it. Okay, John?

Alok Pareek:

That was good.

Speaker 3:

I mean, a lot of people ask us: is Striim a database? Because we have a SQL engine, and you could technically store data. But we've always been pretty firm in our stance that it's not a database.

Alok Pareek:

Yeah, and that's deliberately so, right? The whole idea here is that there's data in motion and there's data at rest, and for combining those two things there have been a lot of architectures proposed, but I don't think anyone has cleanly solved that problem — in my view, at least. So we're definitely not a database. However, there is a storage tier in Striim, and in fact, by default, we back it with a distributed Elasticsearch cluster.

Speaker 3:

Yeah — that was my first project here at Striim, by the way. Just a fun fact.

Alok Pareek:

Yeah, I remember that. So you can actually build these fast aggregates over moving data. Let's say I have store sales being reported from, I don't know, every single Lululemon store or something — I don't know if I'm allowed to talk about specific entities, but that's an example. You could actually see: hey, what are the 15-minute sales per store? You can partition by the store ID, use the Striim store to push this into Elasticsearch, and now it gets indexed for you automatically, and you can build an actual application directly on Striim. So that's absolutely possible.
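As a rough illustration of that pattern, here's a minimal Python sketch of a 15-minute tumbling-window aggregate keyed by store ID. It shows just the idea, not Striim's actual query language, API, or storage tier; the function and variable names are hypothetical.

```python
from collections import defaultdict
from datetime import datetime

def window_start(ts: datetime) -> datetime:
    # Snap each event timestamp to the start of its 15-minute tumbling window.
    return ts.replace(minute=ts.minute - ts.minute % 15, second=0, microsecond=0)

# (window_start, store_id) -> running sales total for that window
totals = defaultdict(float)

def on_sale_event(store_id: str, amount: float, ts: datetime) -> None:
    # In a real pipeline, closed windows would be flushed to an index
    # (e.g., Elasticsearch) so a dashboard or app can query them.
    totals[(window_start(ts), store_id)] += amount

# Usage: two sales in the same window roll up together.
on_sale_event("store-17", 120.0, datetime(2025, 3, 1, 10, 3))
on_sale_event("store-17", 80.0,  datetime(2025, 3, 1, 10, 9))
# totals[(datetime(2025, 3, 1, 10, 0), "store-17")] == 200.0
```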

Alok Pareek:

In fact, because Elasticsearch also has capabilities like vector search, there's a possibility there from an AI workload aspect for us — though I don't want to speak about futures right now. But it's an active area of R&D for Striim: trying to get us into the realm where you could take a brand-new agent and look at it as an automation problem between different data pipelines. There are pieces external to Striim where agents can go in, grab the data, and process it, but then there's another agent working from an application perspective directly on the Striim platform itself, doing things like having the RAG pattern within that application directly. So that's an area we definitely want to get into in the future.

Speaker 3:

Yeah, absolutely. My take on that is that a lot of companies are doing CDC today, but the compromises they made in the process suggest to me that it's not a solved problem for analytics or AI. You have these teams that say, yeah, we have the CDC thing running, but we don't care about latency — we just ship reports out and we're okay with 24 hours, or even two hours, and that's fine. It's not our role to tell everybody that real time is always required; honestly, it isn't even required for every Striim customer. We have customers that run reports on an hourly basis.

Speaker 3:

And then you look at other compromises: okay, how do you know your reports are accurate, with transactionally consistent data? That's a dicey question, because most analytics teams don't even want to touch that area — once you raise the question, it opens up a can of worms they're not ready to surface with their own management. So usually we hear about it the other way around, from a data executive or an AI executive or a CIO who wants to be able to go to the CEO with confidence. And the first thing a CEO will say is: well, is this report real-time? Can I trust it? Can we actually draw insights from it?

Speaker 3:

So my message, just on your comment about the storage tier and using AI to collect automatic insights, is that this is the next leap teams will have to make with change data capture, replication, and Striim in general: making sure they have fast time to value and fast time to business insights, enabled by AI. Like we said, we have really awesome large-scale customers like UPS who have done great work here, and we're excited to work with hundreds of other customers doing this — on to the thousands in the future. It's a really exciting time.

Alok Pareek:

Yeah, absolutely, and I think those are great points. I hear the term CDC casually thrown around a lot, and there's an unusual hardship when you're trying to support change data capture. One of the very naive approaches I see people take is assuming that somehow the entire record will be available in my change stream. If I have a wide record with 318 attributes, for example, and I update only one of those attributes, a lot of frameworks assume that, sure, my change stream will have all 318 attributes available. That's the part where, as you said, people make compromises or assumptions. If you can make that assumption, okay, you can live with it. But real-world applications don't behave like that: it's a lot of overhead, it's inefficient, and it involves logging on the system that's generating the change — it makes assumptions about that system that aren't true in the real world. So how does Striim solve that? Striim supports compressed updates. For many of the sources, we can take an update with just the key, the before-and-after image, and basically a bitmap that tells me which columns have changed, and we carry that information along with the metadata and react to it.
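Here's a minimal sketch of what such a compressed change record might look like and how a consumer could merge it. The structure and names are hypothetical, for illustration only — not Striim's actual wire format or API.

```python
from dataclasses import dataclass, field

COLUMNS = ["id", "status", "amount", "updated_at"]  # imagine 318 of these

@dataclass
class CompressedUpdate:
    # A compressed update carries only the key, a bitmap of changed columns,
    # and before/after images for just those columns -- not the full row.
    key: dict
    changed_bitmap: int                            # bit i set => COLUMNS[i] changed
    before: dict = field(default_factory=dict)     # prior values, changed cols only
    after: dict = field(default_factory=dict)      # new values, changed cols only

def apply_update(row: dict, upd: CompressedUpdate) -> dict:
    # Merge only the changed columns into the target row.
    for i, col in enumerate(COLUMNS):
        if upd.changed_bitmap & (1 << i):
            row[col] = upd.after[col]
    return row

# Example: only "status" (bit 1) changed on the row with id 42.
upd = CompressedUpdate(key={"id": 42}, changed_bitmap=0b0010,
                       before={"status": "PENDING"}, after={"status": "SHIPPED"})
row = {"id": 42, "status": "PENDING", "amount": 99.0, "updated_at": "2025-03-01"}
apply_update(row, upd)   # row["status"] is now "SHIPPED"; other columns untouched
```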

Alok Pareek:

And that's where the problem becomes interesting. Because, to your point, John, look at a data engineer. In the first case, where they had the entire record, okay — as a client I can push the whole thing into my data lake. But if I have partial images coming in from a very mission-critical application, I'm lost; I have no idea what to do with this thing, because it doesn't conform. And even with the open formats now coming in to represent logical change records, it's not an easy problem. You've got to manage the schema, you have to do conversions.

Alok Pareek:

And within the decision-making pipeline, you want to say — like we talked about — maybe I want to mask something. These are the kinds of things systems have to pay attention to as we move towards event-driven, especially real-world event-driven, applications. So I always think that people who see this as just a CDC problem are looking at the world in a very narrow domain. That problem was solved probably 35 years ago, in my view — you could write a trigger and do change data capture, and now I have everything. But it begs the question: are the operators of that system going to allow me to write triggers? And if they say no, all of a sudden all bets are off.

Alok Pareek:

So I think those are the kinds of things that move this argument from a developer-oriented mindset to an enterprise-oriented mindset, and the choice is really up to you. Do you want to spend your time on the stuff you really know and excel at, or do you want to get into the nitty-gritty of the core logging concepts, the transactional concepts for consistency, the accuracy aspects you talked about? Who's going to deal with event processing and guarantees? That's where the fun actually starts, and I don't think a lot of teams are equipped for it, especially if they're not database people but application-level people. It's a tough problem to crack, and very few get it right, in my view.

Speaker 3:

Absolutely. And there's so much work to be done right now in deploying next-gen analytics and AI use cases — I think they go hand in hand. As we see more teams adopt AI in real-world use cases, it does seem like a function of a data team, right? Because it's not quite software engineering, and it's not quite a reporting or business intelligence role, but it is the team that brings together the operational and business systems — the marketing systems, CRMs, and ERPs that internal teams use — and the customer-facing systems, and figures out how to make production applications out of all of it. That really allows you to innovate and build IP for your company, which brings the value and shows the ROI. Like I said, we work with some incredible teams there, and there's no debating the results — we talk about UPS because it's public, and it was so successful.

Speaker 3:

They went on Good Morning America — it was a cool segment. So there's just so much fun work to do. I think it's an awesome time to be in data and to be working with these products. I can't remember this level of excitement since the iOS App Store first blew up and we saw the early days of Uber and all the various mobile applications — Instagram and whatnot. Now we have this new wave, a generational platform shift into AI.

Alok Pareek:

Absolutely — it feels like a brand new world, and I share your excitement and enthusiasm there. I think each year is going to show us something very different, and I'm pretty excited to see what's coming. We have some great things to really help the world with some of the cool, novel capabilities we're introducing in the Striim platform.

Speaker 3:

Yeah, absolutely, and we're super excited about that. Alok Pareek, co-founder of Striim — you used to run data integration products at Oracle, you were CTO of GoldenGate, and before that you worked on the Oracle logging system. What's your advice to folks getting into the industry?

Alok Pareek:

Oh, wow.

Alok Pareek:

Well, my sense is, A: make sure that you understand the value of abstraction.

Alok Pareek:

For whoever's coming in, I think we're going to observe a shift in how we write code, how we put together different types of tasks, how reasoning is morphing. So I would say: pay attention to the abstraction around this. What I mean is that these new AI-based workflows, AI-based automation, and AI-based agents are going to radically change how systems, humans, and processes evolve. Have a good understanding of that and be at ease with it — do as much as possible to get comfortable with it and adopt it in your daily life. I think that's really important. I don't have the magical answer myself, but on a day-to-day basis that's what I do. If somebody comes in and says, hey, here's a way in which AI is going to change this, I'm not afraid of it. It's another powerful way to incorporate that towards what I truly believe is the highest form of intelligence, and that's human intelligence.

Speaker 3:

Absolutely, and I wanted to ask you that because you're an expert in your area, but you're still — I mean, I see it every day — adopting the latest and greatest, what I want to call personal life hacks, ways to be more productive and efficient and keep moving things forward. So thank you for answering that question, and thanks for doing this episode of What's New in Data. It was super fun — I love talking about databases. It's always a good time.

Alok Pareek:

Absolutely, John, it was a pleasure — great questions, and I loved being on. I've been wanting to come here for a while, so this was fantastic. And say hi to Andy too if you're talking to him — I know Andy well. Hopefully we get to do one of these again in the future.

Speaker 3:

Yeah, absolutely. We'll have to do it again soon. And thank you to all the listeners for tuning in.