Real World Serverless with theburningmonk

#60: Building a serverless payment platform at Melio

March 23, 2022 Yan Cui Season 1 Episode 60

You can follow Omer Baki and Or Cohen on social media here:

Links from the episode:

For more stories about real-world use of serverless technologies, please follow us on Twitter as @RealWorldSls and subscribe to this podcast.

To learn how to build production-ready serverless applications, check out my upcoming workshops.

Opening theme song:
Cheery Monday by Kevin MacLeod

Yan Cui: 00:12  

Hi, welcome back to another episode of Real World Serverless, a podcast where I speak with real world practitioners and get their stories from the trenches. Today, I'm joined by the team from Melio Payments, Omer Baki and Or Cohen. Hey guys.

Omer Baki: 00:26  

Hi Yan, great to be here.

Or Cohen: 00:28

Hi, hi.

Yan Cui: 00:29  

Yeah. So Omer, I read a blog post of yours, I think it was on Medium, right? In it you talked about how you guys built this payment processing system using serverless. And I was quite interested to see some of the things you guys were able to do really quickly, quite impressive work to be honest. So I guess maybe let's start by just introducing yourselves: who are you, who is Melio, and what is your role inside Melio?

Omer Baki: 00:57  

My name is Omer, and I lead the infrastructure teams in R&D at Melio.

Or Cohen: 01:03  

My name is Or. I am a principal engineer at Melio. It doesn't mean much outside of Melio. I was one of the first employees in the company. Basically what the title means is that today, I have the privilege of taking part in the interesting projects in the company. And in the first few stages, the first iteration of Melio's infrastructure, I was really part of building it.

Yan Cui: 01:35  

Okay, so I guess maybe tell us a bit about who Melio is and what your business is. Because you're in the payment processing and banking area. Tell us a bit more about what you guys do.

Omer Baki: 01:46  

So Melio in general is a B2B payment company that exists to solve and digitize the payment flow between businesses, a flow that currently processes around $10 trillion in checks. So it's a big market. And the exponential growth at Melio in terms of payment volume is impressive, and was one of the challenges that drove us to do this refactoring and improve our system, in order to be able to process an increasing amount of payment volume over time.

Yan Cui: 02:25  

I guess being in the payment processing sector, that comes with lots of interesting regulatory requirements. Let's get into that in a minute, but you talk about $10 trillion worth of checks you have to process every year. So I guess, how long have you guys been around as a company?

Or Cohen: 02:45  

Yeah, so Melio's first production launch was in January '19. As Omer said, the idea was born exactly from small business payments in the US. They make a lot of their payments with checks. It's done manually: they write them, they send them out. So we started out by allowing a payor, a business, to pay their vendors with essentially any payment method that they want to use. They can use a bank transfer, they can use a credit card, they can use a debit card, and the vendor will receive the payment the same way that they've received it until then. So if they received it by check, we convert the bank transfer to a check. If they received it by bank transfer, we can convert the credit card payment, again, to a bank transfer. So that's what we started out with in January '19. Essentially, the product is basically the same. If you strip out all of the marketing talk and what we publish outside, technically what we have is a bank account in the middle. We pull funds into this bank account from our customers, the businesses that are registered with Melio, and then we send out money from that bank account to their vendors. So we started with this, and then to this functionality we added the ability to receive payments: a few months later, we added that you can register to receive payments. And then we partnered with QuickBooks. Today, if you log into QuickBooks, you can pay your bills through QuickBooks, which essentially launches the Melio experience and allows you to do the payment flow with all of Melio's features. There have been a lot of features added since then. There's a workflow where you can approve payments. We added more payment methods with virtual cards, and you can now make payments outside of the US. But essentially, with Melio you do business payments. And yep, it's very exciting. It's been a very exciting time.

Omer Baki: 04:59  

Just in the past year and a half, it has been really exponential. There's been significant growth, around 3,000%, in the volume that Melio processes, due to the ability to partner with different companies that use our payment platform as the payment service for their businesses.

Yan Cui: 05:20  

Okay, and I guess, do you mainly serve customers in Israel? Or, I guess, it sounds like you also look after customers in the US as well, because you talked about some of the payment networks in the US like ACH and things like that. I guess in that case, that puts you guys in a lot of different financial regulators' jurisdictions. How does that impact you guys in terms of the work you are doing day to day, in terms of the technology selection that you have to make? It sounds like you guys started with AWS from the start. Were you always doing serverless? Or did you start with EC2 and Kubernetes?

Or Cohen: 05:58  

Yeah, so just to kind of frame it, we only operate in the United States. This is where the big market is, this is where the problem with checks actually is a big problem. This is where we started. Outside the US, if you go to European countries, or Eastern countries like Singapore, then checks are really nonexistent. It's very rare to see a check. It's just the US. So that's where we started. In my background, I was a DevOps consultant. I was doing backend, I was doing front end, I was doing all sorts of things. And then when I joined Melio, I had a talk with the CTO, Ilan, and we sat together. And I kind of told him just my aspiration to not configure an NTP service, anytime, anyplace. I don't want to do it anymore. I don't want to configure services and hosts, EC2 instances, configuration management and all this complexity. With that mindset of not configuring anything, Lambda, or serverless essentially, was kind of the natural solution. Sometimes Omer laughs when I say this, but I normally don't call this serverless. To me, it sits more in the area of management-less. The idea of us using serverless is just to manage as little as possible by ourselves, and focus on the business logic, the minimal implementation that we need around it to make things work. And in AWS specifically, Lambda is a perfect solution for this. So our first implementations were actually around Lambda. The first backend of our front end system was running on AWS as serverless. Later, that was around the end of 2018, we had some issues with cold starts. Back then there was still no provisioned concurrency or similar features. So in keeping with the mindset of going serverless, we kind of dropped down a level and used Fargate.
So essentially, what our architecture looks like is that we have a lot of Lambda-based services, and we have a layer that is based on Fargate, which supports a lot of different tasks. So you can say that from day one, we had the benefit and privilege of actually using serverless, experimenting, and actually building a production environment around that.

Yan Cui: 08:47  

Okay, and I guess nowadays you guys are using both Fargate and Lambda. So when someone is working on a new workload, what are some of the decision points that you use to say, okay, this workload should run on Fargate, or that workload should run on Lambda?

Omer Baki: 09:03  

I could say that in my previous company, we worked with Kubernetes. It required a very strong DevOps team, and it created a dependency on that team. So the availability of that team sort of dictated the speed, the velocity, of our ability to develop. Now that we mainly use serverless, there were a few advantages, but deciding what the workload should run on really depends on the task. I think Or will maybe want to elaborate more on this later, specifically on the front end side, but on the payment platform side, it's completely serverless, and we only use Lambdas. We found an amazing advantage in this infrastructure. First of all, when you work with payment processing specifically, there are specific time windows in which processing takes place. The system doesn't have to be up and loaded all the time. It doesn't have to be servers that run continuously. It works in sliding windows, or in windows of burst operations, bursts of payments that need to be processed. So for example, at a specific time, there are a few thousand payments that are processed, and then the payment system can just shut down completely. So this was one of the advantages that we saw. In addition, it was easier to take the system apart. Where it was a bit monolithic, it was easier to take the money flow part, extract it into new services, and change the configuration accordingly. That was also possible with this infrastructure. But Or, do you want to talk about it?

Or Cohen: 10:49  

Yeah, so I can add to that. There's like a policy that the first thing you try to do is use Lambda, because it provides all of the benefits. And we also use Lumigo for tracing, so we kind of have everything out of the box. We have perfect visibility into what's going on. And if that doesn't work, then we adjust to make the solution work. So I have a few examples of that. Let's take WebSockets, for example. Lambda and WebSockets, it's not a natural solution for that. I mean, you can make it work, it works, and we have a few experiments with that. It's just that sometimes engineers are not used to the idea, and it looks a bit clunky compared to how you would naturally expect a WebSocket to work. If we go the Fargate route, and we have a load balancer connected to these instances, the servers, essentially, then it seems much more natural and works much more naturally. You write your code as you would expect to write it with WebSockets. Or if we want to take the latency down to the bare minimum, to have almost no overhead because it matters at certain points, then yeah, maybe we would also do Fargate for that. But sometimes I kind of insist: let's try to make Lambda more optimised, because of all the other benefits. So let's actually try to use the overhead to our advantage, instead of trying to avoid it.

Omer Baki: 12:27  

Just as an example, if we encounter the 15-minute timeout that Lambda has, it doesn't necessarily mean that it's the wrong tool. We might instead need to reconsider our architecture, or redesign it in a way that works, maybe break the task down into smaller tasks. And sometimes that was really beneficial, because it sped up the whole system and even reduced the cost in a way. So it was nice to have those limitations to make us reconsider. If we had, for example, a server that is up and running, we might have left it as is. But this limitation actually was beneficial and helped us rethink our design.
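
A minimal sketch of the idea Omer describes here, breaking a long job into pieces that each fit inside one invocation, might look like the following. Only the 15-minute limit comes from the conversation; the function name `planBatches`, the per-item time estimate, and the safety margin are hypothetical and are not taken from Melio's system.

```typescript
// Hypothetical sketch: plan batches so each one fits in a single invocation.
const LAMBDA_TIMEOUT_MS = 15 * 60 * 1000; // the 15-minute limit mentioned above

function planBatches<T>(
  items: T[],
  estimatedMsPerItem: number, // assumption: rough per-item processing time
  safetyMarginMs: number = 60000, // assumption: headroom kept free per invocation
  budgetMs: number = LAMBDA_TIMEOUT_MS
): T[][] {
  const usableMs = budgetMs - safetyMarginMs;
  if (estimatedMsPerItem <= 0 || usableMs < estimatedMsPerItem) {
    throw new Error("a single item does not fit in the time budget");
  }
  // How many items one invocation can be expected to finish in time.
  const perBatch = Math.floor(usableMs / estimatedMsPerItem);
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += perBatch) {
    batches.push(items.slice(i, i + perBatch));
  }
  return batches;
}

// Example: 5,000 payments at roughly 500 ms each.
const paymentIds: string[] = [];
for (let i = 0; i < 5000; i++) {
  paymentIds.push("payment-" + i);
}
const plannedBatches = planBatches(paymentIds, 500);
console.log(plannedBatches.map((batch) => batch.length));
```

With these assumed numbers the work is split into three batches, each of which could be handed to its own invocation (for example as one queue message per batch) instead of one long-running process.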

Or Cohen: 13:10  

Yeah, that's a very good point. I had another example, like MapReduce jobs running on Spark or something similar, which is, again, not natural to have on Lambda. But essentially, we try to go all the way, and if that doesn't work for the solution that we need, we don't do it. So to Omer's point about limitations, one of the things that you encounter with engineers who are onboarding in the company, who start using Lambda or serverless and come from the Kubernetes world, or just any server world, is around limitations. We have the time limitation, the Lambda timeout. We have the memory limitation. And we have a bunch more, like there's only a single isolated execution and similar stuff. Or, for example, sorry, I forgot, the concurrency, how many Lambdas you can run at once. So at first glance, these limitations are somewhat like: hey, I now have a limitation that I should consider all the time when writing the code. I have a timeframe which I can work in, and I can't cross that timeframe. Or I have a concurrency limit in my entire account, and I need to consider that concurrency limit, or I need to reserve some of it. For example, if we have a Lambda that listens on an SQS queue, and the queue kind of explodes for a few minutes, suddenly this Lambda can consume the entire concurrency limit of the entire account, which could be problematic if you're running other workloads. So you should be aware of all of these parameters coming into play. But if you flip the coin, these limitations are actually kind of liberating. This sounds weird when I say it, even to me now. But consider that you have a server on which you're running a workload, okay?
And you have a bunch of threads in a JVM, or you have a Node process or a bunch of them running on the server, and it starts getting requests upon requests upon requests, and the request load starts to go up. Where's the limit? How do you know if the limit is 10,000 concurrent requests, or 5,000, or even 100? You can't know exactly where the limit is. So initially you kind of eyeball it, and you say, okay, it should be around 5,000. And then you hit 2,000, and everything breaks. So you adjust for this parameter. And then you reach more and more limitations: context switching, I sometimes even hit a MAC address caching limitation in the kernel. You hit all these limitations without even knowing that they exist when you start. When you have predefined limitations, like the time, like the memory, like all of these, then you actually have very well-formed boundaries that you can work in, and you know what to expect once production is actually running hot. So instead of being limiting, these limitations actually provide a very well-defined frame to work in.
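
The queue scenario Or mentions, where one SQS-triggered function eats the whole account's concurrency, can be illustrated with a toy model. This is not an AWS API and not Melio's configuration: the function names, the account limit of 1,000, and the demand numbers are all assumptions. It only models the documented behaviour that reserved concurrency both guarantees and caps a function's concurrency and is taken out of the pool shared by everything else.

```typescript
// Toy model (not an AWS API): how reserved concurrency bounds one function
// so that a burst on its queue cannot starve the rest of the account.
interface FunctionConfig {
  name: string;
  reserved?: number; // reserved concurrency; undefined = draws from the shared pool
}

function grantConcurrency(
  accountLimit: number,
  fns: FunctionConfig[],
  demand: Record<string, number> // concurrent executions each function wants
): Record<string, number> {
  const totalReserved = fns.reduce((sum, f) => sum + (f.reserved ?? 0), 0);
  let sharedPool = accountLimit - totalReserved;
  const granted: Record<string, number> = {};
  for (const f of fns) {
    const wanted = demand[f.name] ?? 0;
    if (f.reserved !== undefined) {
      // A function with reserved concurrency is capped at its reservation.
      granted[f.name] = Math.min(wanted, f.reserved);
    } else {
      // Everyone else competes for what is left (first come, first served here).
      const share = Math.min(wanted, sharedPool);
      granted[f.name] = share;
      sharedPool -= share;
    }
  }
  return granted;
}

// Assumed burst: the queue worker wants 5,000 executions, the API wants 200.
const burstDemand = { "queue-worker": 5000, api: 200 };

const withoutReservation = grantConcurrency(
  1000,
  [{ name: "queue-worker" }, { name: "api" }],
  burstDemand
);
const withReservation = grantConcurrency(
  1000,
  [{ name: "queue-worker", reserved: 300 }, { name: "api" }],
  burstDemand
);
console.log(withoutReservation, withReservation);
```

In this model the unreserved worker takes all 1,000 slots and the API gets none, while a reservation of 300 on the worker leaves the API its full 200. The limit is known up front, which is the "well-defined frame" being described.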

Yan Cui: 16:29  

Yeah, I agree. And definitely, I think what you mentioned there, Omer, that you have to think about these limits, and oftentimes they help you come up with a better design, which is going to make your system better in the long run, as opposed to just assuming that you have an infinite amount of resources and you can run your code for as long as you like, until such time that, you know, your code runs for an entire weekend. And no one really thinks twice about, oh, maybe it shouldn't take that long to run this simple thing, right? I've worked in banks in the past where we've got this batch job, and it just runs for as long as it needs to, until one time it took so long that it couldn't finish over the weekend. And then come Monday, people go back to work to use the system, and they can't, because this thing is still running. Things like that you just don't really think about, because there's no constraint. You can just do whatever you do, rather than having some constraints that force you to be creative and really think about the problem and, you know, what are some of the more efficient ways of solving it. These limits are there until you really can't do anything else and you just have to go over them, in which case, okay, maybe like you said, Lambda is not a good fit, because we do have a batch job that's going to take more than 15 minutes, and there's no way for us to break it into smaller tasks that we can, you know, run in sequence instead. So maybe at that point, you bring in Fargate. But at least it's a decision that you thought about. You considered the limit, and then you reached the point of, okay, let's do something else, as opposed to just, yeah, let's throw a box at it and let it run for as long as it can. No one really thinks about, okay, is this the best way to do this thing?

Omer Baki: 18:07  

Exactly, exactly. Also, the cost can be an interesting parameter. In a way, the cost is like a monitoring tool that you can use to understand whether you're using this infrastructure in the best way. Maybe a task that takes a long time but doesn't require a lot of memory should be split from the other tasks that are short but require that memory consumption, or whatever. So in that sense it was easy, because you can just look at the code, take part of the logic, extract it into a different Lambda, and maybe do the orchestration yourself by placing an SQS queue between them, or whatever the design requires. But the next evolution is also relatively easy, because if you want to move to orchestration by Step Functions, you already have everything set up, and you just need to change the orchestrator to an external tool.
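
To make the cost argument concrete: Lambda compute is billed by configured memory multiplied by duration (GB-seconds), so a function sized for its most memory-hungry step pays that rate for its slow steps too. The arithmetic below is a sketch under assumed numbers. The durations, memory sizes, and the per-GB-second price are illustrative assumptions, not Melio's figures, and per-request and queue charges are ignored.

```typescript
// Assumption: an illustrative price per GB-second; check current AWS pricing for your region.
const PRICE_PER_GB_SECOND = 0.0000166667;

// Billed compute for one invocation: configured memory (in GB) times duration.
function gbSeconds(memoryMb: number, durationSeconds: number): number {
  return (memoryMb / 1024) * durationSeconds;
}

// One function does both steps, so it must be sized for the memory-hungry step
// and is billed at that size for the whole 62 seconds.
const combinedGbSeconds = gbSeconds(2048, 60 + 2);

// Split: the slow step runs small, the memory-hungry step runs briefly.
const splitGbSeconds = gbSeconds(256, 60) + gbSeconds(2048, 2);

console.log(combinedGbSeconds, splitGbSeconds); // 124 and 19 GB-seconds
console.log(
  (combinedGbSeconds * PRICE_PER_GB_SECOND).toFixed(6),
  (splitGbSeconds * PRICE_PER_GB_SECOND).toFixed(6)
);
```

One caveat on this sketch: Lambda allocates CPU in proportion to memory, so the slow step may take longer at 256 MB than it did at 2,048 MB. The split only pays off if that step is mostly waiting rather than computing.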

Yan Cui: 19:10  

Yeah, I'm so glad you mentioned Step Functions, because that's one of the things that I was gonna ask you about as well. Obviously, you know, with payment processing, there are a lot of different steps that you have to go through, a lot of different third party services you've got to talk to. You know, like you said, you've got to credit money into your account first and then take it out and pay it into someone else's account. So, how are you guys using Step Functions? Is it mostly to drive your batch processing of these checks? Or do you use them more widely for lots of different things?

Omer Baki: 19:41  

Well, I guess one of the issues we encounter is really how to define what the state machine is: where does it start, where does it end, and where does a new state machine begin? I see this as an ongoing transition. For example, if I look at a payment, there's the part of the processing on our side, and there's the part where the bank, for example, sends you back the reports of what was transferred yesterday, or stuff like that. So you could decide that the state of a payment is the whole flow, from sending it from our bank to the vendor's bank or whatever, to also getting back the report. And this could all be one state machine, or maybe you split them. And in a way, it's easier to split them at first, because we can, for example, have the entire flow orchestrated by us, by just placing SQS, SNS, whatever, between the parts, and then maybe take a few Lambdas that are part of this flow and externalize them into a Step Function, and then decide later if this specific Step Function should be longer or should include another section or whatever. So initially, it's easier to just decide on one of the flows, maybe the one that is most critical to change, change it first, and then consider the next step. So one of the advantages is the ability to move in small steps. It doesn't have to be a huge refactor of the code with a long rollout and stuff like that. You can really make minor changes each time and keep moving forward.

Yan Cui: 21:31  

So you mentioned that at some point you're going to move some of this orchestration logic out of Lambda functions into Step Functions. So in that case, why not start with Step Functions to begin with?

Or Cohen: 21:43  

When we first started the payments processor, the first iteration of payment processing was kind of a monolithic Lambda, one Lambda that contained everything and ran all the payments. Then Omer's team took it and split it into multiple Lambdas that each do one job and stream from one queue to another. So when we started, we saw the complexity of having code that runs everything, and we wanted something better in the form of Step Functions, a managed workflow. The problem that we faced is actually similar to what we mentioned earlier with limitations. When you start to describe your programme as a workflow, you suddenly realise how many things you're not handling. When you're just writing code, you have a bunch of edge cases where you say, okay, this is gonna bork at some point, and it's gonna throw an exception, and someone's gonna catch it, log it, and then we'll monitor. With a workflow, and specifically with Step Functions, if something fails, you must handle it specifically. You must be prepared for almost any error, so that the workflow will either complete or be recoverable from different points in the flow itself. So I can share that this was the first thing that kept us from using Step Functions initially. We just didn't have enough resources to invest. And to be honest, we didn't know enough about the payment processing that we were going to face to just start with a workflow that is well defined. Our process for payments processing wasn't well defined, so we couldn't define it as a workflow yet. Today, it's richer, and it's much better defined.
But that was initially. Maybe Omer can continue from there about what's happening now.
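
Or's point that a workflow forces you to spell out error handling can be seen in a small state machine definition written in Amazon States Language, shown here as a TypeScript object. This is a hypothetical sketch, not Melio's workflow: the state names, the placeholder function ARNs, and the retry numbers are assumptions. `Retry`, `Catch`, `States.ALL`, and `Lambda.TooManyRequestsException` are real Amazon States Language and Lambda error names. The small helper lists any task that has no catch-all handler, which is the kind of gap that stays invisible in ordinary code.

```typescript
// Minimal types for the subset of Amazon States Language used in this sketch.
interface Retrier {
  ErrorEquals: string[];
  IntervalSeconds?: number;
  MaxAttempts?: number;
  BackoffRate?: number;
}
interface Catcher {
  ErrorEquals: string[];
  Next: string;
}
interface TaskState {
  Type: "Task";
  Resource: string;
  Retry?: Retrier[];
  Catch?: Catcher[];
  Next?: string;
  End?: boolean;
}
interface TerminalState {
  Type: "Succeed" | "Fail";
  Error?: string;
  Cause?: string;
}
type State = TaskState | TerminalState;
interface StateMachine {
  StartAt: string;
  States: Record<string, State>;
}

// Hypothetical two-step payment flow; the ARNs are placeholders, not real resources.
const paymentFlow: StateMachine = {
  StartAt: "CollectFunds",
  States: {
    CollectFunds: {
      Type: "Task",
      Resource: "arn:aws:lambda:REGION:ACCOUNT_ID:function:collect-funds",
      // Throttling means the function never ran, so retrying is safe.
      Retry: [
        {
          ErrorEquals: ["Lambda.TooManyRequestsException"],
          IntervalSeconds: 5,
          MaxAttempts: 3,
          BackoffRate: 2,
        },
      ],
      // Anything else is routed to an explicit failure state.
      Catch: [{ ErrorEquals: ["States.ALL"], Next: "FlagForReview" }],
      Next: "SendToVendor",
    },
    SendToVendor: {
      Type: "Task",
      Resource: "arn:aws:lambda:REGION:ACCOUNT_ID:function:send-to-vendor",
      Catch: [{ ErrorEquals: ["States.ALL"], Next: "FlagForReview" }],
      Next: "Done",
    },
    FlagForReview: {
      Type: "Fail",
      Error: "PaymentNeedsReview",
      Cause: "A step failed and the payment needs manual review",
    },
    Done: { Type: "Succeed" },
  },
};

// Returns the names of Task states that do not catch States.ALL.
function tasksWithoutCatchAll(machine: StateMachine): string[] {
  return Object.keys(machine.States).filter((name) => {
    const state = machine.States[name];
    if (state.Type !== "Task") return false;
    const catchers = state.Catch ?? [];
    return !catchers.some((c) => c.ErrorEquals.indexOf("States.ALL") !== -1);
  });
}

console.log(tasksWithoutCatchAll(paymentFlow));
```

Every task in this sketch has to say where a failure goes, which is only possible once the payment process itself is well enough understood. That matches the reason given for not starting with Step Functions.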

Omer Baki: 23:49  

Yeah, I guess it's a matter of priority in a way. It's like the decision whether to do one giant leap or take three small steps. And I think our approach was to take those smaller steps, which, by the way, serverless really enables you to do more easily. But one of the reasons was that we had to keep up with the business. Because there was such huge growth in such a short time, the highest priority was to know that everything would work and the money would go through, rather than maybe reaching the ideal design. But I think the idea was that at least we're moving. We're not taking one step forward, two steps back. In a way, we're always moving forward. So taking the small steps wasn't contradicting the approach, because we had to split it first anyway in order to move to Step Functions. So the evolution was a bit natural. I mean, it made sense anyway to do this in small steps. And it also enabled us to test it and make sure that the implementation and the design are correct.

Yan Cui: 24:59  

Okay, makes sense. And earlier on, you mentioned that in your previous job you had Kubernetes, and the Kubernetes team or the DevOps team became a bottleneck for the whole company, because everyone had to go through them. And one of the benefits of serverless, or of not using Kubernetes, is that each team can then just move at its own speed. So in that case, how do you as a company still make sure that there's some consensus or standards in terms of how everyone is building their applications, so that different teams are all following best practices or conventions that you agreed on as a company?

Or Cohen: 25:38  

So it's a very good question, because it's a really tough problem to get everyone around the same ideas. What actually happened is that we started with some kind of standardised service structure. Then as the company evolved and more and more people joined, it became apparent that it doesn't hold, that we can't have a single structure for all services that everyone would work around. We needed to give some freedom back to the engineers to decide what their service is going to look like, what the structure of their service, Lambdas or Fargate, would be. So where the standard lives, if I can describe it, is like this: every team is their own island, and the standardisation lives in the international waters. So essentially, every team works with the mindset of using Lambda, of going with SQS if you need a queue and DynamoDB if you need some kind of key-value storage. We have consensus around ideas or stuff that we want to do. But when you step outside of the service, outside of the team's responsibility, into those international waters, then things should be speaking the same language. So for example, if I want to share my service with others, I must use API Gateway with AWS IAM authentication. That's the standard way. If you want to go another route, because you need to for different reasons, then it's okay, but you need a very good reason for that. If I want to share an event with a different team in the company, then I need to use the event bus that we have. It's a standard structure, SNS to SQS subscriptions, nothing special. And if I want to log stuff, I need to send it to CloudWatch Logs, whether I'm using Fargate or Lambda. With Lambda it's natural, because that's where it goes. But with Fargate, you can do a lot of things.
So we kind of matched everything to go to the same location. Essentially, everything that happens outside, the effects that your service has on the system, this is where the standard lives. We have a few standards for the company itself. We use almost the same language everywhere: Node.js, TypeScript, and a bunch of these. But this is more to get people oriented with what other teams are writing, and not necessarily to have the same structure. I hope that explains where we see the standard living.
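
For the "SNS to SQS subscriptions" event bus mentioned here, one detail a consuming team has to handle is that, unless raw message delivery is enabled on the subscription, the SQS message body is a JSON envelope written by SNS with the published payload as a string in its `Message` field. The sketch below unwraps that envelope. The envelope fields are the standard SNS notification format, while the `DomainEvent` shape, the event name, and the topic name are hypothetical and only stand in for whatever contract the teams agree on.

```typescript
// Standard fields of an SNS notification as delivered to an SQS queue
// (only the ones used here are listed).
interface SnsEnvelope {
  Type: string;
  MessageId: string;
  TopicArn: string;
  Message: string; // the published payload, as a string
  Timestamp: string;
}

// Hypothetical shared event contract between teams.
interface DomainEvent<T> {
  eventType: string;
  payload: T;
}

function unwrapSnsMessage<T>(sqsBody: string): DomainEvent<T> {
  const envelope = JSON.parse(sqsBody) as SnsEnvelope;
  if (envelope.Type !== "Notification" || typeof envelope.Message !== "string") {
    throw new Error("Not an SNS notification envelope");
  }
  return JSON.parse(envelope.Message) as DomainEvent<T>;
}

// Example body as a consumer would see it in an SQS record.
const exampleSqsBody = JSON.stringify({
  Type: "Notification",
  MessageId: "00000000-0000-0000-0000-000000000000",
  TopicArn: "arn:aws:sns:us-east-1:123456789012:company-events",
  Message: JSON.stringify({
    eventType: "payment.scheduled",
    payload: { paymentId: "p-123" },
  }),
  Timestamp: "2022-03-23T00:00:00.000Z",
});

const scheduledEvent = unwrapSnsMessage<{ paymentId: string }>(exampleSqsBody);
console.log(scheduledEvent.eventType, scheduledEvent.payload.paymentId);
```

Keeping this unwrapping in one shared helper is one way a convention like this can live "in the international waters" while each team structures its own service freely.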

Omer Baki: 28:25  

I think I can add to that, and I mentioned this in my blog post, and I also get asked about it by colleagues, by friends from previous jobs. I think it's fair to ask. People usually ask, should we use serverless? And the question is never, should we use Kubernetes? I mean, it's at least fair to ask the same question. Because if I ask myself this honestly, I would say serverless, if there's clearly no limitation. So you know, sometimes I think it's funny that the problem with serverless is that it's too easy. You know, engineers are getting bored, and they want Kubernetes because it's more complicated. But really, if you want to move fast, and it's easier, then it makes sense. And if it makes sense, at least this should be the dynamic of the culture, I mean, to have that debate and decide logically what makes sense.

Yan Cui: 29:22  

Yeah, yeah. I forgot who once said that nobody gets fired for using Kubernetes, but if you want to use something like Lambda, well, you're putting yourself on the hook. But yeah, I totally agree that Lambda is just so easy, so boring. All you do is just, oh, I need an API. Okay, there you go, push, push, got something done. There is no engineering challenge per se. There's no tinkering. You don't get the same satisfaction at the end, where, okay, I built this thing that's complicated, but I feel really good because obviously I'm quite smart to have figured this out, even if what you've got is a Frankenstein house of cards that's just gonna fall apart if someone looks at it wrong. But that said, you do get a lot of, I guess, excitement and enjoyment from tinkering, in the same way that people like working on their cars. It's not going to be as good a car as, say, a Tesla or one from any other car manufacturer, but, you know, it's the thing that you built, even if it's not as good as what the professionals would do. It feels good that you've done something, you've been tinkering with it. But obviously, as a company, you probably don't want that. You want people to just do their job and get the right outcome for the company.

Omer Baki: 30:30  

Sometimes we joke that we have the best DevOps team in the world, which is AWS. We have AWS at our service. They're investing so many resources in it. You know, with provisioned concurrency, which solves a lot of issues, and at every re:Invent conference, you see how much they invest in it. And working together with them really makes sense and helps us focus. It's not that we are missing challenges, you know, we have enough, so I can skip the challenge of writing to a log. I guess those challenges are less interesting than solving business challenges. And with serverless there are interesting challenges to solve as well, because many people talk about the difficulty of working in a local environment, and there are amazing solutions for it. I mean, Or can maybe elaborate on one of the efforts that they're making in building personal environments, which I never saw working with multiple servers. I mean, with microservices, I didn't see a company that really solved this issue properly. Working locally really makes sense. So this is one of the challenges that we want to invest time in resolving.

Yan Cui: 31:47  

Okay, Or, are you happy to talk about how you guys are doing local development?

Or Cohen: 31:53  

Yeah, yeah, sure. I think Omer is right. So, I spoke earlier about having kind of a shared structure for all services. There was a reason for that. The reason was that I wanted to be able to run a certain service either on a local computer or on Lambda, or if we need to, drop down to Fargate and put it there. That was the initial incentive to have all services look alike, with just the code being the part that differs between services. So what we did, what we started with, is we had an AWS account that was shared by all of the engineers, which was essentially a playground. Each engineer had a local Dev Tools repository. What this repository did is configure a bunch of resources for you in the cloud: SQS, SNS, DynamoDB, and a bunch more. Then it wrote a file to your computer with all the references to everything that lives in the cloud. And then when you started services on your computer, you just cloned the repository, typed in npm start, and the service would automatically connect to the cloud. None of the services used any local implementations. We didn't use SQS in the cloud and RabbitMQ when you ran it locally. We didn't use Cassandra or MongoDB locally and then DynamoDB in the cloud. We actually worked locally with resources that exist in the cloud. So that was the first iteration. But then it started to burst at the seams, because the services took on their own structure and every team did their own thing. So now, our approach is to have a sort of mini Melio production environment for each engineer in their own AWS account. So essentially, engineers would get their own personal accounts.
And then with a bunch of small conventions and a few short scripts on your computer, you would just be able to deploy your service, deploy Melio services, into your own personal account. So you can have your own version with your own branches, your own kind of experiments and things that you want to do for yourself in your personal account. And because this account lives in the cloud, it has its own domain. Communication works wherever you're reaching it from. And if you want to work on a specific service, you can just clone the service to your own computer and start it. You would need to configure it a bit to speak to the cloud; we have a few wrappers for that. But you start it in a specific context and essentially it's as if it's running in the cloud. So for example, if we were using SAM, AWS SAM, for our services, you could just do sam local start-api, and then the code that exists inside your computer, essentially the runtime, is speaking to the cloud resources. It gets messages, it sends messages to SQS in the cloud. It speaks to DynamoDB, speaks to an RDS database that you have yourself in your account. And the idea there is that, for a lot of the work that we do, because we move forward so fast, it's pretty hard to write tests for implementations when you have no idea how they're going to look. A lot of the things that we do are actually a prototype that we test out and launch. It's a new service. It's a new feature that we have no idea how it's going to behave. So it's very helpful to have kind of a working production environment, which does not actually have production data, but which you can just tinker with: start services, put your own branch in, move a service from here to there, change a configuration, redeploy, redeploy, do all of these rinse-and-repeat operations and have everything just working. So this is kind of a big challenge that we try to solve.
It's not 100% working yet, like there are a few gaps here and there, but we are trying to make it work, but it's there. It's very exciting. It's very, it feels like we're kind of trailblazing a lot of what's going on in the serverless world.
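
[Editor's sketch] The first iteration Or describes (a dev-tools step writes a local file of references to cloud resources, and each service reads that file on startup) could look roughly like the following. This is only an illustration under assumptions: the file layout, the function names `write_references` and `load_references`, the resource names, and the account ID are all hypothetical and are not taken from Melio's tooling. No AWS calls are made here.

```python
import json
import os
from pathlib import Path
from typing import MutableMapping, Optional


def write_references(path: Path, developer: str) -> dict:
    """Stand-in for the dev-tools step: record where a developer's cloud resources live.

    In a real setup these values would come from the provisioning output;
    here they are hard-coded, hypothetical examples.
    """
    refs = {
        "ORDERS_QUEUE_URL": f"https://sqs.us-east-1.amazonaws.com/123456789012/{developer}-orders",
        "EVENTS_TOPIC_ARN": f"arn:aws:sns:us-east-1:123456789012:{developer}-events",
        "PAYMENTS_TABLE": f"{developer}-payments",
    }
    path.write_text(json.dumps(refs, indent=2))
    return refs


def load_references(path: Path, environ: Optional[MutableMapping[str, str]] = None) -> dict:
    """Service startup step: expose the references as environment-style settings.

    Values already present in `environ` win, so a developer can override
    a single resource without editing the generated file.
    """
    environ = os.environ if environ is None else environ
    refs = json.loads(path.read_text())
    for key, value in refs.items():
        environ.setdefault(key, value)
    return refs
```

The point of the design is that the service code reads the same setting names whether it runs on a laptop or in Lambda; only the source of the values changes.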

Yan Cui: 36:16  

Yeah, I've seen this approach at quite a few companies now. And that's the approach that I took in my previous role at DAZN as well, where developers can just deploy to their own temporary environment, if you like, so that they've got a copy of their own service. But I guess normally we haven't tried to copy the whole company's infrastructure, because it requires a lot more coordination. Every time some other team that you don't normally have to deal with updates their service, you have to have some way to get the latest version of their code into your personal environment. So we haven't gone that far; we usually just limit it to the services that you own, or maybe the adjacent services that you have to talk to or depend on, either upstream or downstream. But I do see that becoming more of, I guess, an emergent practice in this space. More and more companies are trying to adopt this approach where developers, like you said, can rinse and repeat against their own personal environments. It is not always a separate account, although I have seen that as well. Oftentimes it's the same dev account for your team, but you can have a separate environment because, like you said, everyone's got their own subdomain based on the name of the environment that you gave to your stack. So that all plays really nicely, especially given that, you know, when you don't use it, you don't pay for it. You're not talking about having all these containers running around just because you've got all these spare environments that everyone has to run. So with Lambda, API Gateway and DynamoDB, you can have as many as you want; so long as no one is using them, you don't pay for these temporary environments. And that's one of the really nice things you get with this whole serverless approach to doing things as well.

Omer Baki: 38:04  

Usually, we don't really go beyond the free tier account of AWS, you know, we don't require anything more than that. So it's easy.

Yan Cui: 38:14  

Exactly. And with Lambda, the free tier doesn't expire after 12 months as well. So for a lot of other services the free tier expires after 12 months, but Lambda is just based on how many invocations you run. So I was gonna say that before we wrap up, I want to also just get a taste of some of the challenges that you still have in your serverless environment. You talked about using Lumigo as your observability platform and it gives you a lot of visibility. So obviously that takes care of one big common problem people have. What about any other problems that you're running into in terms of, you know, day to day, some of the recurring pain points?

Omer Baki: 38:54  

I could say that, Or, maybe you can elaborate also, but I can say that one of the flaws that we still have is, of course, what we discussed regarding the development process. And a bit of visibility, because we have microservices. It's a general problem that is not caused only by serverless. But this is one of the problems, and of course breaking things down adds complexity to the entire system, in terms of, for example, idempotency and stuff like that, especially in payments where you have to decide firmly whether you want to do stuff at most once or at least once. And usually we take the approach it was once and we need to recover from, you know, things that are not in our control. Sometimes maybe Step Functions can help resolve those issues, because it's easier to retry and stuff like that. But in the end it's not magic. I mean, there are always glitches and stuff like that which are beyond our control. But these are mainly the edge cases that we need to handle.
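
[Editor's sketch] To make the idempotency concern concrete: when a message can be delivered more than once, a common technique is to attach an idempotency key and have the handler return the stored result for a key it has already seen. The following is a minimal in-memory illustration, not Melio's implementation; the class name `IdempotentProcessor` and the `idempotency_key` field are hypothetical. A real system would keep the keys in durable storage and record them atomically (for example with a conditional write), which this sketch does not attempt.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class IdempotentProcessor:
    """Runs `handler` at most once per idempotency key (in-memory only)."""

    handler: Callable[[dict], str]
    _results: Dict[str, str] = field(default_factory=dict)

    def process(self, message: dict) -> str:
        key = message["idempotency_key"]
        # A redelivered message returns the stored result instead of paying again.
        if key in self._results:
            return self._results[key]
        result = self.handler(message)
        self._results[key] = result
        return result
```

Note that if the handler raises, nothing is stored, so a retry will run the handler again; whether that is safe depends on what the handler did before it failed, which is exactly the kind of edge case described above.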

Yan Cui: 39:59  

Yeah, with those kinds of challenges, idempotency is an interesting one, because depending on what you're doing, idempotency can be either really easy or really, really complicated. And in the past, I've used things like the saga pattern with Step Functions, so that you capture a transaction with the action and the rollback, and depending on which step you are at, you have to roll back in the same order that you committed. But if you do that for everything, then your entire architecture could become really, really complicated, with a lot of sagas in your Step Functions. But yeah, I totally get it.
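
[Editor's sketch] The saga idea (each step pairs an action with a rollback, and a failure triggers the rollbacks of the steps that already completed) can be illustrated in plain Python, without Step Functions. Everything here is hypothetical: the `run_saga` function, the step names, and the choice to unwind starting from the most recently completed step, which is an assumption of this sketch about the usual convention rather than something stated in the episode.

```python
from typing import Callable, List, Tuple

# (step name, action, compensating action)
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: List[Step]) -> List[str]:
    """Run steps in order; on failure, compensate the completed ones.

    Returns a log of what was done and undone. The failing step itself is
    not compensated, because its action never completed.
    """
    log: List[str] = []
    completed: List[Tuple[str, Callable[[], None]]] = []
    try:
        for name, action, compensate in steps:
            action()
            completed.append((name, compensate))
            log.append(f"done:{name}")
    except Exception:
        # Assumption: undo the most recent completed step first.
        for name, compensate in reversed(completed):
            compensate()
            log.append(f"undone:{name}")
    return log
```

Even in this toy form, every step needs a hand-written compensation, which hints at why applying the pattern everywhere can make an architecture complicated.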

Omer Baki: 40:34  

But our flow is to upload the file to the bank. That is, you know, their SFTP. And after they read the file they delete it, so we have no way to retrace what happened. So there are sometimes things that we need to reconsider and validate. Yeah.

Yan Cui: 40:51  

Yeah, I think that's just the challenge of modern software development. I think there are lots of different services, especially in your area, where you have to integrate with a lot of other third-party APIs and services, and it sounds like different kinds of protocols: REST APIs, FTP, file systems and whatnot.

Omer Baki: 41:12  

Especially with banks which are, you know, not the most technological institutes that you can think of. And there's a famous saying in Melio that, you know, building the payments platform was like building a Ferrari over bicycle wheels. Because really, the bank is down to APIs, I don't know, it's really limited in many ways in how you can, you know, really retrace and build the proper design pattern that you want.

Yan Cui: 41:40  

And I guess especially in the US, where you've got such a fragmented banking space, there are so many different local small banks. There are just a lot more people you need to integrate with in that case as well. Okay, so I think we are coming up to time now. I want to thank you guys very much for joining us today. Before we go, is there anything else that you guys would like to share with the listeners, maybe a blog post you want to promote, or YouTube videos, or other things that you want to share? Maybe you've got job openings as well?

Omer Baki: 42:08  

We're definitely hiring. And we have a nice blog post in Medium, a nice publication that you can visit. Or?

Or Cohen: 42:16  

Yeah. We have a very nice Melio engineering blog that just started about two months ago. We're starting to add posts, and there are a lot of posts coming now. We have a lot of openings in the US as well. There are offices in Denver, in New York, and in Tel Aviv for the Israeli listeners. I had one more thing I wanted to say: yeah, we are starting to release a few open source projects, so we will actually be publishing them in our Melio engineering blog as well; it will work in tandem. So expect a few surprises from us in the coming months.

Yan Cui: 42:54  

Okay, I will make sure that the link to your Medium publication is in the show notes, as well as the blog post that Omer wrote about how you built this whole payment platform on serverless and AWS. And yeah, once your open source projects are released, I'll make sure those are in the show notes as well. Again, thank you guys so much for taking the time today to talk to us and share your experience. It's been a pleasure having you. And yeah, stay safe. And I hope to see you guys at re:Invent perhaps.

Omer Baki and Or Cohen: 43:24  

Yeah, thank you very much for inviting us. Thank you very much.

Yan Cui: 43:28  

Take it easy. Bye, bye.

Or Cohen: 43:29

Take it easy. Bye, bye. 

Yan Cui: 43:44

So that's it for another episode of Real World Serverless. To access the show notes, please go to If you want to learn how to build production ready Serverless Applications, please check out my upcoming courses at And I'll see you guys next time.