On Rails
On Rails invites Rails developers to share real-world technical challenges and solutions, architectural decisions, and lessons learned while building with Rails. Through technical deep-dives and retrospectives with experienced engineers in the Rails community, we explore the strategies behind building and scaling Rails applications.
Hosted by Robby Russell of Planet Argon.
Miguel Conde & Peter Compernolle: Inside Gusto’s Rails Biolith
In this episode of On Rails, Robby is joined by Miguel Conde and Peter Compernolle from Gusto, where they work on a "biolith"—two distinct Rails monoliths serving 600+ engineers. Peter leads the HIPAA-compliant benefits domain, while Miguel is extracting the time product from the main monolith. They explore how Gusto identifies boundaries, manages temporal data, handles eventual consistency, and navigates the trade-offs of GraphQL federation.
🧰 Tools & Libraries Mentioned
ActiveAdmin: Admin UI for Rails.
after_commit_everywhere: Run code after commits.
Datadog: App and CI/CD observability.
FactoryBot: Build test data.
GraphQL: API layer for SPAs.
Kafka: Event streaming backbone.
Packwerk: Enforce boundaries in monoliths.
PaperTrail: Model change auditing.
React: Front-end framework.
Sidekiq: Background job processing.
Sorbet: Gradual Ruby types.
TypeScript: Typed JavaScript.
explicit_activerecord: Guard writes to models.
Ruby Koans: Learn Ruby by practice.
Cracking the Coding Interview: Interview prep book.
Gusto Engineering Blog: Posts from Gusto engineers.
On Rails is a podcast focused on real-world technical decision-making, exploring how teams are scaling, architecting, and solving complex challenges with Rails.
On Rails is brought to you by The Rails Foundation, and hosted by Robby Russell of Planet Argon, a consultancy that helps teams modernize their Ruby on Rails applications.
[00:00:04.840] - Robby Russell
Welcome to On Rails, the podcast where we dig into the technical decisions behind building and maintaining production Ruby on Rails apps. I'm your host, Robby Russell. On this episode, I'm joined by Miguel Conde and Peter Compernolle from Gusto, where both play pivotal roles in the company's architectural evolution. Miguel leads efforts to extract the time product, covering time off, scheduling, and time tracking, from Gusto's massive Rails monolith. Peter is the architectural lead for the company's benefits domain, a code base that's been separate from the main app for years due to HIPAA compliance. Together, they sit on Gusto's architectural council and bring unique perspectives from working on two very different, yet equally complex Rails applications. This conversation will explore what it takes to untangle large code bases, make Rails work at scale, and find the right balance between convention and customization. All right, check for your belongings. All aboard. Peter and Miguel, welcome to On Rails.
[00:00:59.080] - Peter Compernolle
Thank you very much.
[00:01:00.000] - Miguel Conde
Thank you for having us.
[00:01:01.420] - Robby Russell
I want to start with a simple question, I think. Let me start with you, Miguel. What keeps you on Rails?
[00:01:08.100] - Miguel Conde
I like the speed of Rails, especially coming from a Java world. Rails unlocks a lot of speed in development, especially in this day and age, where we have such new technologies and AI and things are evolving quicker than ever. The speed that Rails provides allows companies to unlock new capabilities faster than, say, a strongly typed, very methodical language like Java or C#.
[00:01:32.580] - Robby Russell
What about you, Peter?
[00:01:33.980] - Peter Compernolle
I think that's a good question. I really like how quick it is to prototype in Ruby on Rails. The way that I think is usually by putting my ideas in code and then seeing how it feels. With Rails, you can do that really quickly. Then with some of the more recent tooling, you can get some benefits that feel like a compiler, so you get the best of both.
[00:01:55.820] - Robby Russell
Interesting. We'll probably touch on a couple of these topics throughout the conversation. But one of the reasons I really wanted to have you both on this show is that you both work at Gusto. I mentioned in the intro approximately how large the engineering team is there. But how long have you both been at Gusto at this point?
[00:02:10.480] - Peter Compernolle
I've been at Gusto going on seven years, so a little more than six and a half. I've been on the benefits team since day one. At the time, there were 12 engineers on benefits. Now, there are over 60. So we've grown quite a bit.
[00:02:22.680] - Miguel Conde
I have been at Gusto... I guess I hit my four-year mark last month. I have been primarily on the time team. That team, when I joined, was about four engineers. We are up to 16 now.
[00:02:36.360] - Robby Russell
We'll definitely dig into what those different sides of the organization look like and what's involved there. Peter, can you help paint a picture of Gusto's, I'm going to say, biolith setup and how benefits ended up as its own Rails application?
[00:02:51.500] - Peter Compernolle
Like you said, we have a biolith.
[00:02:53.480] - Robby Russell
What does that mean?
[00:02:55.140] - Peter Compernolle
Instead of one monolith, we have two monoliths. Big picture: we have, let's say, 80% of our engineers, 80% of our code, 80% of our product in one monolith, and then we have just under 20% in a different monolith. That second one was split out before my time for HIPAA compliance reasons. So all of our PHI, or protected health information, is stored separately from our main application. At the time, it was really just for HIPAA compliance reasons. And over time, that's been one of the best or luckiest decisions that I've inherited, because we have a smaller application to work with, which is a lot easier to move quickly with than a very large application.
[00:03:42.040] - Robby Russell
Something that makes me curious: if I was going to work at a brand new startup and we said, we're building an HR platform... and maybe you'd describe it as more than an HR platform, but that's how I think about it as a customer of Gusto, for my company. I would just think, well, the whole thing has a lot of private information. Where does that boundary lie? I guess there's the HIPAA stuff when it comes to healthcare-related things. But where's that distinction when it's all very, very private information about employees and records about employees? Is there a lot of difference in compliance between the part of the application that's more HR-related employee stuff versus the HIPAA part, where you said, we need to segment that off? How is it not just a matter of applying the same rules to the whole thing?
[00:04:33.400] - Peter Compernolle
That's a great question. So I think this has evolved over time. Now we're very sensitive to compliance, but at the time it was simpler. The main reason this was built separately was partially so we could have a second data store, and at the time, Rails didn't handle multiple data stores. But also, the application is only accessible from the other side of the firewall. So all of our benefits-related admin functionality you can't even access from the outside world if you tried. And that way we could be very explicit and careful about our APIs and how that data was shared, because by default it was inaccessible to everybody, and we had to be explicit and go out of our way to make it accessible to anybody.
[00:05:19.060] - Robby Russell
Do they interact much with each other?
[00:05:21.840] - Peter Compernolle
So that's another thing that's changed over time. At the moment, the only interaction really that we have with the main monolith is when we are sending information to the front-end for our end users, our customers. Over the years, we've reduced that communication quite a bit, but it used to be a little more of a circular dependency. Our primary customers, the way that we think of them, are companies, or employers, and employees. Those are the two different ways most people interact with Gusto. And we used to sync that data between the two applications all the time. Now we've become much more intentional about what the benefits product needs, and we don't want any more than that. We lean heavily into eventual consistency, so we don't have runtime dependencies. I can go into a lot more detail, but suffice it to say, over the last five or so years, we've gone from a pretty tightly coupled pair of applications to a very loosely coupled pair of applications.
[00:06:25.540] - Robby Russell
Just as an end user of Gusto, is it a safe assumption that if I'm interacting with the web app... I'm an employer, but I'm also an employee, so I switch over to being an employee and I'm looking at some of my healthcare-related details. Is that then interacting directly with that back-end, the HIPAA-compliant platform, and it just happens to be displaying that information in the same place? It's rendering it out, but the other app doesn't really know what's being shared with me. Is that a dumb way of thinking about that?
[00:06:58.840] - Peter Compernolle
I think I can answer the question by explaining the interaction. Our older code, whether you're an employer or an employee, would ask the main monolith for some data, and that would authenticate and then basically proxy the request over to our behind-the-firewall app. So that was a really simple way for us to be confident that nobody was accessing this data unless they were already validated in some way. Over the years, we've moved much of that to GraphQL. So now the front-end doesn't need to think so much about where am I getting this data from. We can auth in pretty robust ways using GraphQL. But I think to answer your question, the front-end right now is hosted by the main application. It just gets its data for benefits-specific things from the benefits application.
[00:07:49.780] - Robby Russell
That's fascinating. Miguel, what's it like working on the larger monolith by comparison?
[00:07:55.140] - Miguel Conde
It's more fun, but it is definitely a bit slower. It's, I think, double the size of HI at this point. You have to be very careful when you're modifying something, because everything is so coupled with each other. This is an 11-, if not 12-year-old app that was just built on top of itself. We have a lot of coupling between what we would call services. When you make a change, you have to be extremely careful that your change doesn't have unknown propagations, what you'd call side effects, which could bring the monolith down. So there is that. Alongside that, there is also a slower build-to-deploy cycle, because our CI/CD pipeline essentially has to build this entire monolith, run all the suites, make sure that everything's okay, and then at that point, you go live. If there's an incident or someone's putting out a fire, deploys are locked. Even if your service is not affected in any way, shape, or form, you cannot deploy any code, because the entire pipeline is shut down for the monolith while the alert gets resolved.
[00:08:56.460] - Robby Russell
That sounds like a challenging space to be in. In preparing for this conversation, Miguel, you mentioned that you're in the process of extracting the time product from the main monolith. Maybe you can tell us a little bit about what you mean by the time product.
[00:09:11.860] - Miguel Conde
There is an idea that, okay, all these pain points are felt by every team. If we consider every system as a service, you have time tracking, scheduling, time off, you have payroll, you can name them. We decided that we're going to analyze what systems we can split up to be atomic units that work well. We're not going microservices, where every system is very precise in what it does. It's more like a smaller, robust ecosystem that can operate in a silo, almost. Scheduling, which is a pod within time, was chosen to lead this effort. The idea is that all of the pain points that I just described, in the build, in the worry that you might be coupled to something and have a propagating effect, get reduced, and you unlock the speed of Rails that we discussed earlier, instead of being bogged down by the maintenance and overhead of a huge monolith.
[00:10:15.840] - Robby Russell
How do you decide where the boundaries are when you see that there's a lot of overlap in the coupling?
[00:10:22.640] - Miguel Conde
That is a good question. I think it's something of an art that we're still perfecting. For scheduling, it was pretty simple. We were a newer feature in Gusto, so it was built in isolation. But one of the key things that I look for, and this is me personally, is essentially: do I have any piece of code that my system needs in order to operate? Meaning, if I can't go without it immediately, meaning it cannot support eventual consistency, then it's probably an integral part of the system and it should be brought into the fold. If these are just side effects and stuff like that, then that can be a separate system. When we carved out what the time systems would be, we carved out three distinct ones, scheduling, time tracking, and time off, because all of those can operate independent of each other. They have a lot of intertwined communications, but none of them are necessary for the system to operate as a whole.
[00:11:24.320] - Robby Russell
What role has Packwerk played in helping you set those boundaries?
[00:11:28.840] - Miguel Conde
Packwerk has been very useful in analyzing external dependencies. So Packwerk, one, helped us isolate the code better as we were working within the monolith, to make sure that we're not calling into a model that doesn't belong to us, we're not calling a private method that we shouldn't have access to. So it helped in the creation of scheduling, to be as isolated as possible. And now it's helping us with a list of public APIs, public Ruby APIs, that we now have to transform into either Kafka or GraphQL. By essentially not allowing any outbound calls from our pack, we can start winding down the list from there. It helped in the creation by isolating, and now with detection.
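For listeners who haven't used Packwerk: each pack declares its boundaries in a package.yml, and `bin/packwerk check` fails CI when code in one pack reaches into another pack's internals. A minimal sketch of the shape Miguel describes, with hypothetical pack and class names:

```ruby
# packs/scheduling/app/public/scheduling/schedule_api.rb
#
# Packwerk reads each pack's package.yml (enforce_dependencies: true, plus
# privacy enforcement via packwerk-extensions) and flags any reference from
# another pack that bypasses this public surface.
module Scheduling
  class ScheduleApi
    # The sanctioned entry point other packs may call.
    def self.cancel_shift(shift_id)
      ShiftService.cancel(shift_id)
    end
  end

  # packs/scheduling/app/services/scheduling/shift_service.rb
  # Calling Scheduling::ShiftService from outside the pack would show up
  # as a violation in `bin/packwerk check`.
  class ShiftService
    def self.cancel(shift_id)
      # ... mark the shift cancelled, emit an event, etc.
    end
  end
end
```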
[00:12:16.300] - Peter Compernolle
I think there are some good things and bad things about Packwerk when we're trying to identify these boundaries. I should start by saying my single most important goal for my time at Gusto, and while I'm still here, is to figure out what these boundaries are on benefits. Over the years, we've tried different things. The first thing that we tried was extracting leaf nodes to gems. Anything we identified as having really no dependencies, ideally no persistence, really clear logical APIs or boundaries, we moved into gems. That worked really well, because Ruby has, I think, one fatal flaw, and that is that by default, everything in Ruby is public, rather than a language where it defaults to private and you need to explicitly say this is part of my public API. So we had years of these unintentionally public APIs, and by moving things into a gem, it becomes a lot clearer what we are actually using and what we are actually depending on. Then Packwerk brings that to the surface so we can see how things are being used, unintentionally or intentionally. I should also say I was initially a big dissenter of using Packwerk, because at the time I was a little bit of a purist, and I thought that moving things to gems allows us to have technical constraints rather than tooling-level constraints.
[00:13:47.580] - Peter Compernolle
Rather than something at CI or build time saying you're using a constant you shouldn't, well, if it's in a gem, you have complete control. Your gem runs its own tests. It doesn't bootstrap the monolith. You have to be very intentional about that. But something changed that I think made Packwerk at least 10 times more valuable in figuring this stuff out, and that was when we introduced Sorbet. So Sorbet, if you've never used it, is a typing system. It's a way to take an untyped Ruby method and define a signature for it. We can talk about Sorbet's pros and cons, but what really enhances Packwerk with Sorbet is that you're required to say the type in the signature, whether for the arguments or the return value. If your method before had a variable, and the method contents didn't declare the constant or the type, you had no idea what that method was using or depending on. But when you introduce Sorbet, you're forced to say: this is taking an instance of such-and-such class, and this class is defined over there, and based on our rules, we've decided we don't want that relationship. So with that, I think Packwerk changed everything once we introduced Sorbet.
[00:15:09.880] - Peter Compernolle
A colleague of mine described it as: it's like we had a huge mountain of dirt, and we bulldozed big hills from the dirt. So we weren't exactly sure what the boundaries were, but we could see where there was a lot of signal. Where there's smoke, there's fire. So we could see this is a hotspot, and we can dig deeper into it. And I think that was really a combination of Packwerk and Sorbet helping us do that.
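For anyone who hasn't seen Sorbet: a signature names the exact classes a method takes and returns, which is what surfaces cross-pack dependencies the way Peter describes. A minimal runnable sketch with hypothetical class names:

```ruby
# typed: true
require "date"
require "sorbet-runtime"

module Payroll
  # Stand-in for a model owned by a different pack.
  class EmployeeRecord < T::Struct
    const :hired_on, Date
  end
end

module Benefits
  class Eligibility
    extend T::Sig

    # The signature forces the author to name Payroll::EmployeeRecord
    # explicitly, so Packwerk (or a reviewer) can see that Benefits now
    # depends on a constant owned by the Payroll pack.
    sig { params(employee: Payroll::EmployeeRecord).returns(T::Boolean) }
    def self.eligible?(employee)
      employee.hired_on <= Date.today - 30
    end
  end
end

Benefits::Eligibility.eligible?(
  Payroll::EmployeeRecord.new(hired_on: Date.new(2020, 1, 1))
) # => true
```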
[00:15:35.380] - Robby Russell
I'm really curious, when you talk to teams at your scale, where you're seeing the benefits of using something like Sorbet, because that's not the default way a lot of Rails developers would think about these things. So I think some people might be curious: well, what problems were you really having? If you're passing something to a method and it's the wrong type of thing, you would figure that out because something would break, right? Or your tests would help capture that. Or is it just a certain scale, a certain volume of engineers, where not everybody's going to understand things? Can you describe maybe some of the problems that you had with that mountain of a mess of dirt? What was that like?
[00:16:18.880] - Peter Compernolle
I think there are two reasons this was a big problem for us: the scale of the application, just the size, and also how old it is. So a lot of our job is code archeology. It's going back in time to understand why somebody built something the way they did. So that's the age of the application. There might be a variable name that in its original commit made perfect sense. And then four years later, somebody changed something such that that variable could be one of two things, I don't know, something like that. And that means that now, five years after that, I'm looking at a method with a variable name that is misleading, and you can dig up the stack or go down the stack and make sense of what it is, but that takes time, which now plays into the scale. So if we want to refactor anything, you need to know what is using that thing. And the more decoupled we get, the more we isolate things, the more we make things private, the easier this becomes. But in the meantime, we have to know all of the callers of this method or all of the consumers of this constant.
[00:17:34.920] - Peter Compernolle
And if it's just being passed around as variables, or if there's even a hint of metaprogramming, that can be very risky. And so the beauty of Sorbet in this case is you can remove a method and, in an instant, just like people working in a compiled language, you can see 17 different places that are calling it and tests that are doing this or that with it. And Sorbet makes that a whole lot easier.
[00:18:01.940] - Robby Russell
If someone listening were curious about implementing something like Sorbet, is that something that they would need to do all-encompassing? Does the whole application need to be "sorbeted," or whatever the verb might be, or can you introduce this in certain areas of your code? Like, hey, this is a really delicate part of the system, let's apply some of the signatures around this part of the application to help protect it. Or do you have to think about it on a macro level?
[00:18:31.520] - Peter Compernolle
You can do it gradually, which is the whole design. So you can declare per file how strongly typed you want it to be. You can say, I want Sorbet to ignore this code entirely. You can say, I don't want Sorbet to type it, but it knows about the code. You can have true, where it allows some flexibility, or strict, where there's no flexibility. And this is the only reason we were able to use Sorbet, because our application is just too big. So we could do this gradually, hotspots first. But I will say that in the beginning, this is useful as documentation, but little more, because documentation, we all know, can be out of date or inaccurate. And in the early days, with Sorbet signatures, we might be making a best guess. Like, something is coming from HTTP and it's an integer, so we type it as such. But then it turns out in some contexts it's a string. So we can learn that and log in production, and we get a little benefit. But it can be risky, because if we trust that it's an integer too much and build on top of it, then we can have issues.
[00:19:39.980] - Peter Compernolle
And so there's a tipping point where, after we get to a certain point, we can start to trust it. And we know that it's not 100%, it's not a compiler. But we can start to look at a signature and be pretty confident that signature will be right, so I don't need to look at the implementation of the method or validate the callers. And ultimately, that's what this is for. It's to allow us to move faster, so we don't need to understand the entire application to make any change. We just look at a signature.
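The gradual levels Peter walks through correspond to Sorbet's per-file sigils. A small sketch:

```ruby
# typed: true
#
# The sigil above is declared per file and can be tightened over time:
#   typed: ignore -- Sorbet skips the file entirely.
#   typed: false  -- Sorbet learns the file's constants but reports almost
#                    no errors in it.
#   typed: true   -- type errors are reported; untyped code is still allowed.
#   typed: strict -- every method needs a signature.
require "sorbet-runtime"

class FeeCalculator
  extend T::Sig

  # Under `typed: strict`, omitting this signature would itself be an error.
  sig { params(cents: Integer).returns(Integer) }
  def with_fee(cents)
    cents + 30
  end
end
```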
[00:20:11.460] - Robby Russell
Miguel, I'm curious. You both mentioned this already, I think. Can we go back to something you said: eventual consistency. What do you mean by that?
[00:20:20.660] - Miguel Conde
Within the monolith right now, we have a lot of asynchronous processes. So one of the things, and this is somewhere the app's age shows: in Rails land, if I make a change to a model or something, and I have 10 different services listening in on it, there are many ways to handle that. You could do it sequentially and make sure that every service updates, and that works at maybe a small or medium scale. But at a large scale, or when you have multiple data stores, you cannot guarantee transactions, so you don't have that atomicity. Eventually, the system will come into consistency across all of our services. We have an eventing structure that you can hook up to. Right now, it's being powered by an internal implementation, but it's slowly being replaced by Kafka. When something happens, a pack or service will emit an event that says, I have made some change. Then any consumers that are interested in that can update their tables as need be. The only thing is, there's no guarantee when that message arrives. It could be almost immediate, or there can be some lag. That is what eventual consistency means.
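A rough sketch of that publish-and-consume shape. The EventBus and consumer names here are hypothetical stand-ins, not Gusto's internal eventing API:

```ruby
# Producer side: a pack announces that its data changed. With Kafka, this
# would go through a producer client such as the rdkafka gem.
class SchedulePublisher
  def self.schedule_updated(schedule_id)
    EventBus.publish(
      topic: "scheduling.schedule_updated",
      payload: { schedule_id: schedule_id, occurred_at: Time.now.utc }
    )
  end
end

# Consumer side: another pack updates its own tables whenever the event
# arrives, which may be milliseconds or minutes later.
class ScheduleUpdatedConsumer
  def handle(payload)
    # Idempotent upsert: consumers must tolerate duplicates and lag.
    LocalScheduleCopy.upsert_by_schedule_id(payload.fetch(:schedule_id))
  end
end
```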
[00:21:37.160] - Robby Russell
That's interesting. I don't know that I've heard that term before, so I was curious about that. Diving back a little bit into Sorbet and Packwerk: are there any boundaries you've had to enforce that Rails didn't quite give you by default?
[00:21:49.140] - Miguel Conde
Yes. One of the major pain points in a monolith with Rails is that everything, as you pass it around, is inherently mutable. If I pass around a model, and five services down someone decides to update that model or update the object and I'm not aware of it, and then I hit save, that is problematic, because I lose control of my own data store. One of the things that Sorbet empowers is that we now have structs, which are inherently immutable. With Sorbet and Packwerk, what we did is we created a repository/service API layer, where the only interaction with our system or any of our models is through our services. We have the type signature, we return a struct that is immutable, and if you want to make any modifications to our systems or get any data from our systems, you must go through our service layer. We cannot enforce strict privacy, but this is where Sorbet and Packwerk come in, because we can check to see: is anyone not adhering to our pattern? Is someone bypassing us and going, you know what, I'm going to call your model directly and do what I want with it?
[00:23:04.180] - Miguel Conde
That's where I think the two systems come into play.
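A sketch of the repository/service pattern as Miguel describes it: an immutable Sorbet struct returned from a service, so callers never touch the underlying Active Record model. Names are hypothetical:

```ruby
# typed: true
require "sorbet-runtime"

module Scheduling
  # Immutable value object: T::Struct `const` fields have no setters, so a
  # caller five services away can't mutate this and hit save.
  class Schedule < T::Struct
    const :id, Integer
    const :starts_at, String
    const :ends_at, String
  end

  class ScheduleService
    extend T::Sig

    # The sanctioned read path: consumers get a copy of the data, never
    # the Active Record model, which stays private to this pack.
    sig { params(id: Integer).returns(Schedule) }
    def self.get_schedule(id)
      record = ScheduleRecord.find(id) # private model, defined in the pack
      Schedule.new(
        id: record.id,
        starts_at: record.starts_at.iso8601,
        ends_at: record.ends_at.iso8601
      )
    end
  end
end
```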
[00:23:07.500] - Robby Russell
Could you tell us a little bit about that service layer? You mentioned you have... It sounds like you might be using gems. Are you using Rails engines at all in this process as well?
[00:23:16.980] - Miguel Conde
No, it's just a pattern. It's a coding pattern. You see it a lot in other languages like Java and C#. The repository layer, in this case, would be the application record, and that's what interacts with the database. The service layer is essentially where all business logic lives. One of the methodologies that the time team has adopted is that we keep the application record layer as dumb as possible. All business logic and all interactions with any external dependencies live strictly in our service layer. If we're making an outbound call somewhere, that is in the service layer. If we're emitting an event, that is in the service layer. When you go to the model, there might be some helper functions or transformative logic, but those APIs are essentially our public APIs that we expose to everyone, that say: get schedule, modify schedule, do this, do that. But it's tightly controlled, so we know exactly what functionality we are exposing and what functionality we want to remain hidden.
[00:24:24.460] - Robby Russell
For myself and listeners, can you describe what that directory structure looks like at a high level? Where does your service layer sit in a Rails app?
[00:24:33.780] - Miguel Conde
It would be in the services directory. It looks very similar to any Rails app: you essentially have the models, you have the services, and that's where all of our service code lives. Our lib would be essentially Rake tasks or anything internal that no one's going to call into; our library folder, those are internal tools. Then we have our GraphQL layer. If you think of a flow once we're extracted, and this isn't how it works right now, essentially our GraphQL layer will be the interface for interactions with our service. That level is very dumb as well. It just does some basic authorization, as well as transforming the network call into the signature that the service expects. It goes into the service layer, and then at that point, all business logic occurs and stuff like that. If we need to make any save to the database or modify the database, we call the application record. Then the application record deals with the actual database interaction.
[00:25:37.060] - Peter Compernolle
Another thing I wanted to mention, and this comes back to what I was saying before about Ruby being public by default: by default, an Active Record model has something like 350 methods, and that's not including the specific accessors added per column. A big effort we've undertaken over the last few years is to reduce the size, the footprint, of a single model. So Miguel mentioned we use Sorbet structs, for example. So this can be a public API. We say: my return value for these APIs is a domain-driven struct, and it is immutable, and only I can instantiate it, I being the service. But another thing we've done is make Active Record models private. It is, unfortunately, a binary thing. You have to work to make sure nobody is using it, that there are no other consumers of the model, but for new models especially, we make it private by default. By convention, we usually suffix the model name with Record. So instead of a Policy, we have a PolicyRecord, and this is private to its owning namespace. And that means that any time some other system needs that data, well, they can't access the model.
[00:26:56.340] - Peter Compernolle
So that means we need to expose it via some API. And when we do that, we're confident that nobody's writing to it, whether intentionally or accidentally. And as a tool we've built to get us there gradually, and this is open sourced, by the way, at Rails at Scale, we call it explicit Active Record. So explicit Active Record is a concern, or a module, that you can include in your model, and it will raise an error if you try to write the record in any way. And the only way you can do that is by wrapping the write in with_explicit_persistence, then you pass in the instance, and then it just does some simple instance variable stuff under the hood to make sure you're allowed. But what's beautiful about this is, one, people don't accidentally write this, so that's good. And two, we can grep or search the code base for anything that is writing to this model. And this is particularly valuable for really old legacy models that are distributed across the application. So we can see all 12 places where we're modifying this. We can use that information to strangle it and bring it into a service class like Miguel is describing.
[00:28:10.580] - Peter Compernolle
And then suddenly, we have pretty high confidence that the only thing that is able to change this record is here, and therefore we can refactor it or evolve it in any way that we need to.
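A sketch of the Record-suffix convention Peter describes, using plain Ruby's private_constant to keep the model inside its owning namespace (hypothetical example, not Gusto's code):

```ruby
module Benefits
  # The Active Record model gets the Record suffix and stays private to
  # the Benefits namespace; Benefits::PolicyRecord raises a NameError if
  # referenced from outside.
  class PolicyRecord < ActiveRecord::Base
    self.table_name = "policies"
  end
  private_constant :PolicyRecord

  class PolicyApi
    # Public surface: exposes data, never the writable model itself.
    def self.premium_for(policy_id)
      PolicyRecord.find(policy_id).premium_cents
    end
  end
end
```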
[00:28:22.700] - Miguel Conde
I think a really concise way to put it is: Sorbet unlocks, or makes it much easier to do, object-oriented programming within Rails. If you want to use object-oriented programming and take advantage of encapsulation, inheritance, everything that that entails, Sorbet is your friend. If not, it's not as useful.
[00:28:45.080] - Robby Russell
So you mentioned the explicit AR, or explicit Active Record. That's just a gem that teams can install or something?
[00:28:52.440] - Peter Compernolle
Yeah, it's really simple. So the idea is you call a static method and you pass in instances that are allowed to change, and then it keeps track of them in a private internal store. And then when you call save, it checks to make sure that you're allowed to change it. So it's really simple. It's more a pattern that's useful. But over the years, I've become very, very dependent on searching for with_explicit_persistence, because then I can be confident nobody else is changing this.
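Pieced together from how Peter describes it, usage looks roughly like this. This is a sketch inferred from the episode, not the gem's documented API, so check the explicit_activerecord README for the real method names:

```ruby
class EmployeeRecord < ActiveRecord::Base
  # Raises if anything writes this record outside a sanctioned block.
  include ExplicitActiveRecord
end

employee = EmployeeRecord.find(42)

# Raises: writes require explicit persistence.
employee.update!(first_name: "Ada")

# The sanctioned write path, and a greppable marker of every call site
# that mutates this model:
EmployeeRecord.with_explicit_persistence(employee) do
  employee.update!(first_name: "Ada")
end
```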
[00:29:22.780] - Robby Russell
I'm just looking up the README on GitHub now. I'll definitely include links to that in the show notes for everybody as well. Are there other tools that your team has built to help optimize things around defining boundaries or making Rails more object-oriented?
[00:29:39.300] - Peter Compernolle
There's one that I love, and it's a little hard to explain. This, unfortunately, is not open source, but it's really just a pattern, again. In Hawaiian Ice, which is the application used for managing benefits, we use this pattern all the time. We call it Published Model Builders. Part of this is we have a few models we call God models. These are models that over the years have just become super bloated. They're just very, very big models and hard to work with. And so if you mix in the fact that they're very big, it's unclear what their purpose is, and we don't have control over the writes unless we're using something like explicit persistence. So we've defined a structure that takes an instance of an Active Record model, but cherry-picks specific attributes from it, and then builds a read-only struct. I don't know if I explained that well, but what's cool about this is consumers of this model can define which attributes they need by defining their own struct as a subclass of this thing. Then they include certain modules. And then the owner, the owning pack of the model, gets to decide which attributes to expose using this pattern.
[00:30:54.700] - Peter Compernolle
And so from that, we get two really cool things. One is we're guaranteed that these models aren't being written to by a consumer, because we aren't exposing those methods. And two, because you need to include modules to describe which attributes you want, we're beginning to understand how we might decompose the model into smaller models per concern. If you have 15 different attributes, you can maybe break them down into three groups of five. For example, take an employee; an employee is one of our bloated models. You can define a module called EmployeeDemographics. Maybe this includes first name, last name, birthday. Maybe it includes address. And then there's another module that's, let's say, Dependents, because in benefits, we deal with dependents. And that means that as a consumer, if I only need to know a first name, I include EmployeeDemographics, but I don't include Dependents. And so from that, because it's Sorbet-friendly and because we've defined this with Packwerk in mind, we can use those tools together: oh, this is what is being used across the board; this is only used by this one pack. Maybe that can become a different model, and we synchronize that data somehow.
[00:32:18.260] - Peter Compernolle
It works well at constraining everything so we don't make things worse while also teaching us a little bit about how the application is behaving.
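The Published Model Builder pattern isn't open source, but a hypothetical reconstruction of the shape Peter describes might look like this: consumers subclass a read-only struct and mix in only the attribute groups they need:

```ruby
# typed: true
require "sorbet-runtime"

# Attribute groups the owning pack has agreed to publish.
module EmployeeDemographics
  def self.included(struct)
    struct.const :first_name, String
    struct.const :last_name, String
  end
end

module EmployeeDependents
  def self.included(struct)
    struct.const :dependent_ids, T::Array[Integer]
  end
end

# A consumer declares exactly the attributes it needs, nothing more.
# Deliberately NOT including EmployeeDependents tells the owning team
# this consumer never reads dependent data.
class PayrollEmployeeView < T::Struct
  include EmployeeDemographics
end

view = PayrollEmployeeView.new(first_name: "Ada", last_name: "Lovelace")
view.first_name # => "Ada"; read-only, with no Active Record methods
```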
[00:32:27.020] - Miguel Conde
Now, one of the gems that has really helped us do this encapsulation, keep the logic in the service, and keep the Active Record model as lean as possible is after_commit_everywhere. This is a nifty little gem, because it ties into eventual consistency. Let me give you a scenario. I have a service where I modified three tables, and at the end of this transaction, I want to emit an event. This event goes out and says, I have changed my data store in some way. I can guarantee within my method that I complete my own transaction. What I cannot guarantee within a monolith is that someone is calling my method within a parent transaction. If they decide to roll back the parent transaction, my transactions are rolled back as well. This is where after_commit_everywhere shines, because instead of me having to put the logic in an after_commit and only being able to do that at the model layer, what after_commit_everywhere does is check whether it's in a transactional state. If it is, it queues whatever you put in the block to only execute after the full transaction commits.
[00:33:41.400] - Miguel Conde
If it rolls back, then none of the code within that block gets executed.
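A minimal sketch of that scenario using the gem's block helper (service and event names hypothetical):

```ruby
require "after_commit_everywhere"

class UpdateScheduleService
  include AfterCommitEverywhere

  def call(schedule_id)
    ActiveRecord::Base.transaction do
      # ... write to three tables here ...

      # Runs only after the outermost transaction commits, even if a
      # caller wrapped this service in its own transaction. On rollback,
      # the block is never executed.
      after_commit do
        EventBus.publish(
          topic: "scheduling.schedule_updated",
          payload: { schedule_id: schedule_id }
        )
      end
    end
  end
end
```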
[00:33:48.920] - Robby Russell
That's interesting. I was just pulling that up. It looks like that one might also be created by our good friends over at Evil Martians.
[00:33:55.140] - Peter Compernolle
Sidekiq uses this under the hood as well. A year or so ago, they introduced a new API called transactional push. If you enable transactional push, under the hood it uses after_commit_everywhere. One problem we've had with this is that in Hawaiian Ice, we have two different databases, and after_commit_everywhere has a little support for multiple databases, but Sidekiq doesn't make use of it. We have to think: which database do I want to commit, and which do I not care about?
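If memory serves, enabling that is a one-line opt-in, with after_commit_everywhere in the bundle:

```ruby
# config/initializers/sidekiq.rb
#
# With transactional push enabled, a perform_async call made inside a
# database transaction is only enqueued after that transaction commits,
# so a job can't run against data that was rolled back.
Sidekiq.transactional_push!
```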
[00:34:25.240] - Robby Russell
One of the things you were talking about there was those examples of God objects that get really big. You brought up that the employee model is getting really big. Maybe there are a lot of attributes in that, and I'm assuming in your database table for that one as well. How often does your team think about going back and trying to break those up into smaller pieces? Is it just the models, or also the models and the tables themselves? Where's that trade-off?
[00:34:47.380] - Peter Compernolle
All the time. I think that's something we're frequently thinking about. We talked earlier about identifying boundaries. One of our biggest challenges is we've all been trained to think about normalized databases and what is a reference and what isn't. But as we've gotten smarter about our boundaries, now we start thinking a little bit more about snapshots. So for this data, maybe the service only cares about what the data was when this method was called. I'll use an employee as an example. If an employee signs up for benefits and then we submit that information to the carrier, the benefits carrier, that process might take two or three months. In the beginning, if the employee signs up for something and then we submit, if we only store an employee ID to look up the demographic information, then it becomes ambiguous at best what demographic information we submitted to the carrier, because it might have changed over those three months. Demographics is maybe an extreme example. People don't change their name very often, though they might, but something like an address is actually very important. If someone changes their address before we've submitted it, great. We make sure they're still eligible for that coverage, because maybe they changed states or something.
[00:36:09.160] - Peter Compernolle
But if they change their address after we've submitted it, well, now we need to submit that new information to the carrier. So it's forced us to rethink a lot of our old best practices, and now we're working really hard at denormalizing things and being very mindful: does this fit in one table, or is this maybe multiple records that are effective-dated, or do we split it across tables or domains?
[00:36:35.380] - Robby Russell
I would imagine with that type of data, you're also having to track the changes of all that information, right? Even in that particular example, where you've submitted for healthcare coverage to a carrier, maybe that person gets married. All of a sudden, maybe they want to include their partner or not, but their status, even their tax details, might need to change at that point. They want to start filing differently for payroll. There's a lot of versioning, it sounds like, happening behind the scenes as well. It wasn't a topic I was planning on digging into, but while we're here: are there tools out there in the ecosystem that you're able to leverage to help you manage some of that data persistence and evolution, or have you just had to get really creative on your own there at Gusto?
[00:37:24.100] - Peter Compernolle
This might not be an answer that people want to hear, but I don't think it's a problem solved with tooling. So we use PaperTrail. PaperTrail doesn't work particularly well at scale to run the application. It's good for audit trails. It's not great for checking what the data was as of a given time. But mostly it's understanding what information this function needs. We've gotten much better about it as we've defined these boundaries. It's really, I think, a very fun problem to work on in benefits, because if you look at the schema, it can make perfect sense. We have a very well-defined schema. But then you add the dimension of time to it. That open enrollment window might take a month or more, and then submission takes time after that, and then you go a year; maybe nothing changes in that year, maybe it does. Suddenly, with the schema, you need to add that other dimension of what the data is at certain points in time when it's being used. I'm getting excited now thinking about it, because it's such a fun problem to solve, and we have a pretty solid foundation, so it requires rethinking with low stakes.
[00:38:34.910] - Peter Compernolle
It's fun.
[00:38:36.300] - Miguel Conde
From the time side, everything in our system is a timeline. All of our tables are actually temporal in nature, and some of our tables are bi-temporal. So not only do we answer the question, what was this at this given point? We also answer, what did the system think it was at this point? What that unlocks is you can retroactively change something, and we can tell you: before this date, it was this, but then someone went into the system and changed it, three days back, to now be this. We can answer that question. It's very important for auditing, especially when it comes to time sheets, because someone can argue, hey, my time sheet was not this three days ago. We can clearly answer that question in our own system and say: here you go, this is what occurred.
[00:39:34.080] - Peter Compernolle
How do you do that? Are you storing two timestamps on a row, where one is the date that it was supposed to be effective, and one is the date that it was actually recorded?
[00:39:44.260] - Miguel Conde
Exactly. We have an effective start and end, and then a version start and end, where effective says this is when this takes hold, and the version is essentially the system recognizing what the state of the world truly is.
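A sketch of what querying a bi-temporal row like that can look like in Active Record. The column names follow Miguel's description; the model is hypothetical:

```ruby
class TimeSheetEntry < ActiveRecord::Base
  # "What was in effect on `date`?" Open-ended rows keep a NULL end column.
  scope :effective_on, ->(date) {
    where("effective_start <= ? AND (effective_end IS NULL OR effective_end > ?)",
          date, date)
  }

  # "...according to what the system believed as of `known_at`?"
  scope :as_known_at, ->(known_at) {
    where("version_start <= ? AND (version_end IS NULL OR version_end > ?)",
          known_at, known_at)
  }
end

# The time sheet for June 1 as the system recorded it three days ago,
# versus what the system believes now:
TimeSheetEntry.effective_on(Date.new(2024, 6, 1)).as_known_at(3.days.ago)
TimeSheetEntry.effective_on(Date.new(2024, 6, 1)).as_known_at(Time.current)
```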
[00:40:00.100] - Robby Russell
It's fascinating, because as a user, as an employer, say you're approving PTO as an example. Someone might schedule time off for PTO, and they have enough PTO available, because we don't give unlimited PTO at our company. So it's just like, well, you have this much available. You planned that. And then maybe there's a change and they need to take some other time off before that happened. And all of a sudden that data ends up getting… I can understand how it could be a really complicated thing to wrap your head around. How do you display how much is available at all times, when at the point it was approved it was one amount, but now, since you had to take those couple of extra days off in between, you're going to have less PTO available than you thought you did at the end, even though that was what it was at the time? It sounds messy.
[00:40:45.560] - Peter Compernolle
What makes that a lot easier is if you're writing that data in one place, because then you can be really explicit: oh, I need to change some validation. So, to come full circle: if that writing is happening in multiple places, it becomes very difficult to prevent stuff. The tendency, I think, with Active Record is to add validations, to say, look up all the other records and make sure this is good. But if you've invalidated old records, you might not know about that. Having a service class, or one place that isolates all of the writes, means you at least can figure out how to move forward.
[00:41:23.680] - Robby Russell
When you're building apps like this, where you're having to think about these other dimensions, how complicated is the rendering, the views, to handle? If you're now displaying some data that you didn't have, or it's a different type of... or maybe you have data now that you didn't have in the past, are there weird edge cases in the views themselves to try to handle these types of scenarios? Or can you decouple that a little bit from the views and just render out what it is? Is there a lot of edge-case handling written into conditionals in your views?
[00:41:56.840] - Miguel Conde
There are edge cases, especially in time, where we are temporal, because not all of our data is denormalized. We only denormalize the data sets that we require to work. However, especially with GraphQL, the UI is the combinator of everything. So it may fetch data from scheduling, but let's say that you were scheduled with a job or a department. Department is a good one. We might have an ID reference to a department, but we cannot guarantee that the department, which lives outside of the time domain, is effectively dated. We might be effectively dated, but you fetch this data point using GraphQL, so now it's resolved by another subgraph, call it the Department subgraph, and it says, oh yeah, your department was always, I don't know, name your department, Rails Inc. When in actuality, what happened was that three days ago, someone changed the department name, but instead of that being a new record that's effectively dated, that system overrode the record. So now it has the same ID, and it seems from there that it was always the case. The UI exposes that, because while the service may be tolerant of eventual consistency or might be versioned, the UI that brings all of these pieces together may not be.
[00:43:28.690] - Miguel Conde
And that tends to be a very prominent point of failure, because it's bringing all the data together to serve a page.
[00:43:41.800] - Robby Russell
This episode of On Rails is brought to you by Unsafe Save, the fastest way to skip validations and write questionable data with confidence. Are you tired of pesky validations getting in the way of your momentum? Just add validate: false, and boom, your record is saved. Empty fields? No problem. Missing associations? Who cares? But wait, there's more. Act now and we'll throw in update_column, totally free. Skip validations and callbacks. That's right, bypass your business logic entirely. Update your database like no one's watching. Why wait for your code to behave when you can just skip the parts you don't care about? Unsafe Save: because sometimes you just need that record in the database. Consequences pending.
[00:44:23.200] - Robby Russell
Miguel and Peter, I really appreciate you diving into the weeds here for us, and I'm sure listeners are probably thinking, okay, that sounds complicated, and trying to wrap their heads around it. Some of them probably have very similar things that they're dealing with as well. Obviously, we'll help provide introductions or whatever if anybody wants to talk shop with you. But one of the other topics I really wanted to dig into with you: can you remind me approximately how many engineers are typically working on either of the bioliths on a regular basis?
[00:44:51.840] - Peter Compernolle
I think we're up to 600 something. So it's a lot.
[00:44:55.700] - Robby Russell
First-name basis?
[00:44:58.040] - Peter Compernolle
Not quite. I think leading up to this, I've been thinking about what is working well for us with Rails and what is not working well. And time and again, it comes back to how our organization operates. One of the beautiful things about Packwerk is we can be explicit about who owns certain packs. And then each owner, each owning team, can decide how strict they want to be about contributions. So for example, we can require an owner to approve a pull request, or allow non-owners to approve. Or we can say you can only make a change if an owner approves, for example. And so now, how many people work in Hawaiian Ice? For years, when I started on benefits, it was 10 or 15. So it was very first-name basis. I was almost aware of every pull request that was submitted and merged. But now we've had to really change how we think about that. What does it mean to own code, really? Do we want to slow people down if they're willing to help us in some cases? Yeah, I forgot your question a little bit, but I think org structure and software structure, Conway's law, really plays a big role at our size.
[00:46:15.120] - Robby Russell
Can you tell us a little bit about Gusto's, air quoting, Company Pool system for local development? I'm finding that I want to talk about this with a lot of larger organizations. As a developer showing up to work at Gusto, what's the experience like for getting something up and running enough that you can start making some valuable contributions? At the scale of this big of an application, how do you have enough seed data, or what have you, to be able to interact with something, presumably in your local laptop environment or something?
[00:46:45.580] - Peter Compernolle
We have two ways to help with development, not including factories; we use FactoryBot in tests. We have two ways, really. One we call demo scenarios. This is useful for a number of reasons, but generally speaking, we write a whole bunch of code to, say, build a universe to verify or demo some feature. So we can spin up an ephemeral environment internally, and that way our product partners or designers can see what it looks like before we ship it to production. And effectively, like I said, we just write tons of code to build up the environment. That's demo scenarios. The other one we call Company Pool. I mentioned before, the root node for most of our data is a company. So a company has many employees, they run payroll, they have benefits, et cetera. And a lot of times when we're developing, we need to know: will this work for companies in certain situations? Because the shape of all of our hundreds of thousands of customers is very, very different. Some companies have 100 employees, some companies have one. And so being able to test this out with real-world-ish data is very valuable.
[00:48:01.040] - Peter Compernolle
So Company Pool is a system we've built internally that scrubs our data of anything sensitive whatsoever, but keeps the shape. So if a company has three employees, you can identify the company, and this system, being compliant and everything, runs in production, so the data never leaves. It will scrub all sensitive data and give you a company with three employees. Those employees are all named something static, so it's very clear that this isn't real data. It's extremely powerful. Sometimes it's not perfect. If we're debugging a production issue with encoding, for example, since we're scrubbing data, we lose that encoding problem. Or if something is date-specific, we scrub birthdays, so we lose birthdays. But generally speaking, this is extremely powerful for giving us realistic data that we can poke around with.
[00:48:57.780] - Robby Russell
When you're scrubbing things like a birthday, are you swapping with other birthdays then, at least?
[00:49:02.260] - Peter Compernolle
Yeah. The way this all works is any time you add a column, or a table with columns, you're required by CI, our continuous integration will prevent moving forward unless you do this, to define: is this column sensitive or not? And if it is sensitive, you define how to sanitize it. By default, if it's a string column, it just uses a bogus string. If it's a date, it uses a bogus date. But you can write custom ones to say a social security number is always in this format, so it can be a little bit realistic. And any time you make a change to this, we run it by legal and compliance to make sure that we understand what is sensitive, particularly on benefits, because we care a lot about being HIPAA compliant. But yeah, it's custom. So you can look at realistic data. It's not always ACMECO for the company name, but it's close.
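Gusto's sanitizer is internal, but one way to picture the per-column declarations is a map from column to scrubber. This is a hypothetical sketch, not their actual tooling:

```ruby
require "date"

# Every column must appear here (CI-style enforcement); nil means the
# column is declared non-sensitive and passes through untouched.
EMPLOYEE_COLUMN_SANITIZERS = {
  first_name: -> { "Scrubbed" },                   # bogus string
  birthday: -> { Date.new(1990, 1, 1) },           # bogus date
  ssn: -> { format("900-00-%04d", rand(10_000)) }, # realistic format, fake value
  employee_count: nil                              # keeps the real value
}.freeze

def sanitize_employee_row(row)
  row.to_h do |column, value|
    scrubber = EMPLOYEE_COLUMN_SANITIZERS.fetch(column) # unknown column? fail loudly
    [column, scrubber ? scrubber.call : value]
  end
end

sanitize_employee_row(first_name: "Ada", birthday: Date.new(1815, 12, 10),
                      ssn: "123-45-6789", employee_count: 3)
```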
[00:49:58.240] - Robby Russell
I'm curious about the identification of a person, let's say me, by a social security number. Is it remotely possible for anyone with access to the database to see all the different places I've worked?
[00:50:12.260] - Peter Compernolle
No. And that's, again, really important for us. The best you can do with identification is something like the number of employees, which isn't very useful given our scale. We take extra safeguards. For example, in our data warehouse, anything that is subject to HIPAA has one way of identifying employees, and then anything else identifies the same employee differently. So we have three ways of identifying companies. We have three ways of identifying employees, and we choose those identifiers based on the sensitivity or the access controls, just because we really don't want to slip up on this, we go to the extreme.
[00:50:55.300] - Robby Russell
That makes a lot of sense. Yeah, I'm just always curious about things like that at that scale. I imagine Gusto has a lot of data that could be really valuable in the wrong hands, so I'm just trying to think of how you protect that type of information and make that type of thing impossible, to some extent. You mentioned your data scrubbers, being able to have these snapshots that people can get access to. Are you able to get that down to something small enough to run in your local environment? Or are your developers connecting to... do you have local databases sometimes, or are you needing to connect to some cloud-based database somewhere?
[00:51:37.160] - Peter Compernolle
For the most part, we do all development locally. Our local environments are not exactly production environments, so there are little differences. Over the years, we've gotten a lot better. But because they're big Rails monoliths with lots of gem dependencies, it requires lots of resources, and we found the best way to do that for now is just to run it locally. And so when we do this Company Pool process, we limit it to 25 companies, I think. But that's enough to make sure what we're building makes sense. And then if we're testing something specific, like pagination, or will this work for big companies, then we have demo scenarios that we can run locally to see a thousand-person company, just to make sure.
[00:52:22.280] - Robby Russell
Miguel, how do you see local development changing with, say, your extraction progress? When you're extracting things, does that end up impacting things like that very often?
[00:52:32.880] - Miguel Conde
It does. One of the main assumptions that Company Pool had when it was first created was that we would have a singular database, a singular data store. This is no longer true, and it will not be true. We're approaching this twofold. One, we have to solve the problem of how we pull data from multiple data stores. That's currently being worked on right now. Then there's a harder problem, and it's slightly different: the way it works is there's a lot of Ruby code which uses models. It maps the data to the models, and you tell it the associations. But how does this work in a separate service? Meaning, let's say I have an association to an employee in scheduling, but the service itself has no idea what an employee is, nor does it have a model that Company Pool can use to create employees locally. That second problem is still being solved right now, and it will probably fundamentally change how teams use Company Pool when they just want to boot up their service in isolation and say, hey, I need some data injected. Even demo scenarios would change as well, because demo scenarios right now use FactoryBot behind the scenes and call services.
[00:54:01.960] - Miguel Conde
Now it's going to have to change to almost network calls. One, the service that has the model has to be up locally if you want to do that. Two, you now have to change what used to be Ruby API calls into network requests and go from there. It'll change the landscape quite a bit.
[00:54:23.920] - Robby Russell
Miguel, I appreciate that. I'm curious: are there any approaches that you've already experimented with to try to help wrap your head around how you're going to make that happen? Or are there a couple of different patterns you're exploring at the moment to make that not so complicated to handle? I would imagine you're going to have to rethink onboarding quite a bit for your new engineers with that, because you have, what, 600 or 700 developers working on the application, and they're all going to have to rethink some stuff as well. That seems like a big rollout to wrap your head around.
[00:54:51.980] - Miguel Conde
Yeah. Honestly, we're flying the plane as we're building it and designing it at the same time. It is a problem, and we're starting to see that. There was an introduction of Kafka into one of our core systems, which all of a sudden introduced a hard dependency on Kafka to be able to run locally. Before, you didn't need to run Kafka because there was no core flow that used it. But now it is in a state where, when I make a change to a model, we try to produce a message. If you don't have a local broker set up, that will fail. So your local environment will fail now if you don't have that cluster. So our tooling is catching up to some of the things that we are doing. There are patterns that we have tried in the past that did not work out too well. We have a list of what to avoid, but we're still trying to define the best approaches going forward, if that makes sense.
[00:55:55.420] - Peter Compernolle
I think the simplest answer is: be extremely sensitive about dependencies. We now hold a really high bar for what is an acceptable dependency, whether runtime, development, or background. Anything that we depend on, we don't run by default locally. So that forces us to think: what happens if it's down? And then we think, well, I can't run this feature if it's down. So that makes us think maybe we should build the feature slightly differently so the dependency is softer. One example here is we have a super hardcore sensitive data store. We call it the Hardened... I don't remember what the acronym stands for. We call it Happy. And this stores a lot of our most sensitive data, like social security numbers. And so the benefit of doing this is we have really hardened access controls, we log access, and so forth. And then consumers of this just get an obfuscated key. This service didn't run for us locally for a long time in Hawaiian Ice, on benefits. So we built all of our stuff to access this data and handle errors. The way we did this, for example, in the front-end: say you're looking at a web UI and you want to see an employee's social security number.
[00:57:17.600] - Peter Compernolle
That would be a runtime dependency. So what we did instead is show a padlock, and you click the padlock, which makes a different HTTP request just to get the Social Security number. That gave us two advantages. One is we can authenticate separately: instead of the auth for non-sensitive employee data, we now auth differently for sensitive employee data. But mostly what it meant is that when the service goes down or is inaccessible, it doesn't really stop anything in the application. It just means that you can't see that data. Then all of the writes that we make to this system happen in background jobs. If those jobs fail because the service is down, they get queued up. A few minutes later, when we fix the service being down, we can just retry the jobs. That's maybe an obvious example, but that's the behavior we're looking for anytime we have a dependency. We don't want it. So what can we do to avoid it?
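A rough sketch of the write side of that pattern: the sensitive write happens in a background job, so if the hardened store is down the job simply fails, and Sidekiq's retry schedule picks it up once the service recovers. `HappyClient`, its error behavior, and the token column are hypothetical stand-ins:

```ruby
# Writes a Social Security number to the hardened store from a background job.
# If the store is unreachable, the client raises, the job fails, and Sidekiq
# retries it later; the rest of the app keeps working in the meantime.
class StoreSensitiveDataJob
  include Sidekiq::Job

  sidekiq_options retry: 10 # exponential backoff covers the outage window

  def perform(employee_id, ssn)
    token = HappyClient.store_ssn!(ssn) # raises while the store is down
    # Consumers only ever hold the obfuscated key, never the raw value.
    Employee.find(employee_id).update!(ssn_token: token)
  end
end
```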
[00:58:16.820] - Robby Russell
Interesting. I'm curious about that. Let's use that example of your Happy system, your hardened data store: when you have a background job, how are you approaching encryption of that data in the temporary window between a user submitting something, it being put into a background job, and it sitting in that state before it gets put into this long-term storage place? What does that process look like?
[00:58:43.860] - Peter Compernolle
You mean for writing, specifically? Yeah. It depends on the context. A lot of the sensitive data we get on Benefits is coming from the main monolith, so the source of truth is still there. We wouldn't be told to write a copy of it unless the data was already there, and so we can always fetch it later to rewrite. Or if it's coming from a user, it's the same as that padlock. If the user is uploading some sensitive data and Happy is down, we tell that user, You can't upload this sensitive data right now. Try again in a few minutes. But again, we've isolated it to just that one thing, so they can still use the overwhelming majority of the app; that's the one dependency we can't avoid.
[00:59:26.800] - Miguel Conde
I want to stress that that is something that the HI monolith, where Peter works, has done. Unfortunately, the main monolith was built Rails-first, assuming all the data will be there, accessible at all times, because we're a singular monolith. One of the biggest things we're going to have to evolve in the company is that we must build our UI with the same principles HI has followed: we cannot guarantee anymore that all data will be there. That is a culture shift that I think many engineers are going to have to get comfortable with, because our service might be tolerant to it, but the UI might not be. From a user's standpoint, it says this isn't working. But in actuality, it's just that one piece that's not working.
[01:00:16.800] - Peter Compernolle
The one sadness I have about Sorbet, and maybe Ruby more generally, is that you can't be explicit about what errors to expect. It really needs to be a cultural thing. If you're making an HTTP request to some other service, I think we're all pretty sensitive about the fact that you have to handle errors. But we're building these pseudo-services using Packwerk, and we still want to have that same idea. Any time you lob data over the fence to a different pack, you have to pretend there's latency, even though there isn't, because it's in one Ruby process. Pretend there's no guarantee of a database transaction, as if it were in a different data store. And pretend that it might fail. That's really hard to do when it's all in one Ruby process and you know that if I'm here, the database is just fine. So that's something we're still building a muscle for. One thing we've done on Benefits that I'm very proud of is we identified one of these pseudo-services, and it goes back to the timing aspect of this, where we need a snapshot. We built it using Packwerk inside of the monolith.
[01:01:32.120] - Peter Compernolle
We did that because we didn't quite want to be the guinea pigs of spinning up a new service, with everything that depends on. But we did use a new database, because a new database would force us to have this discipline. We can't depend on database transaction safety. We have to be sensitive about how we're writing data or making calls to write data. Then when we read it, it's the same thing: we have to build in the eventual consistency Miguel has talked about a few times. If we write and then immediately read, the write may not have happened. So we have to build for that. It's worked out really, really well at forcing us to think about how rigid this boundary is. And because it's all in the same code base, we can iterate really, really quickly rather than thinking about versioning or how we evolve the API. It's all there in the same code base. Then the next step is to move it to an engine, and the step after that is to move it to a different service and communicate over HTTP.
[01:02:34.920] - Peter Compernolle
But we're getting a lot of the benefits now all inside of the same code base.
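For readers unfamiliar with Packwerk, the boundary Peter describes is declared per package. A minimal, illustrative `package.yml` might look like the following (privacy enforcement now lives in the packwerk-extensions gem; this is not Gusto's actual configuration):

```yaml
# packs/snapshots/package.yml -- illustrative only
enforce_dependencies: true   # cross-pack calls must be declared below
enforce_privacy: true        # only this pack's public API is reachable
dependencies:
  - packs/employees
```

With this in place, a reference from another pack to anything outside `packs/snapshots`'s public folder fails the Packwerk check in CI, which is what makes the "pretend it's a remote service" discipline enforceable rather than aspirational.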
[01:02:41.120] - Robby Russell
Out of curiosity, you mentioned the front-end tooling. What does your front-end tooling look like at Gusto?
[01:02:48.640] - Miguel Conde
In the main application world, we have essentially two major front-ends. We have our administrator front-end, which is essentially pure Rails: it uses ActiveAdmin and follows the Rails paradigm, ERB files, HTML. Then we have our customer-facing one, which is React. Now, our front-end is a timeline of different approaches. Within it we have, you name it: plain HTTP requests, Redux, GraphQL. You can tell the age of a page by going into its code and asking, What's it doing here?
[01:03:29.000] - Robby Russell
I think that's pretty common for a lot of teams out there. I like that: it's like a timeline.
[01:03:35.880] - Miguel Conde
But primarily, the way new products and new pages are built right now is as a single-page application powered by React, with GraphQL as the interface to the back-end. We have TypeScript for type assurances within JavaScript, and it currently lives in the monorepo. However, that is also in the process of being split out to its own repo, which has its own set of problems as well.
[01:04:08.100] - Robby Russell
When you're evaluating those, maybe this is the purview of other departments, given that it's all in the monorepo, and like you said, splitting it out will have other challenges to navigate. But once they're in different repositories, are they able to work independently on the SPAs, or do you find that every time they're working on features, it's very tightly coupled to some, I'm air-quoting, back-end development that also needs to happen? Or does GraphQL usually provide enough already, so the SPA gets worked on and deployed and you just work that stuff out?
[01:04:41.200] - Miguel Conde
We have set some rules in our CI/CD pipeline. One of the main ones is that we cannot guarantee that when you're making a GraphQL schema change, it will be there when the UI change ships. So we have a rule that you cannot commit GraphQL schema changes at the same time as UI changes, because there could be a difference. But this is one of those cultural shifts we were talking about. Many engineers feel that they must test the full flow of things before they feel comfortable that, okay, this is working as intended. They make their back-end changes, and they use the front-end to validate that their back-end changes are working. In this new world, we have to get into a mindset where you must be able to validate your back-end changes in isolation, within the service, and then just test the boundaries, whether with a contract test or something of that nature, so that you don't have to have the UI up and running to say, Okay, let me test this. If we stick with the old mindset, I think we're going to be in a lot of pain, because local development will become very tedious. You're going to have to find every service that you depend on for every page that you're navigating to.
[01:05:56.840] - Miguel Conde
But if we can get away from that mindset and really see them as separate entities and say, The back-end is the back-end, the UI is just a caller, then we're in a good spot to succeed here.
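One lightweight way to "test the boundary" as Miguel describes is a schema contract check: a spec that fails whenever the GraphQL schema drifts from a checked-in dump, so schema changes are deliberate and decoupled from UI work. A sketch, assuming a graphql-ruby schema class named `AppSchema` and a dump at `schema.graphql`:

```ruby
# spec/graphql/schema_contract_spec.rb
require "rails_helper"

RSpec.describe "GraphQL schema contract" do
  it "matches the checked-in schema dump" do
    current = AppSchema.to_definition
    checked_in = File.read(Rails.root.join("schema.graphql"))

    # A mismatch means a schema change shipped without regenerating the dump.
    # Forcing that regeneration in its own commit is one way to keep schema
    # changes and UI changes from landing together.
    expect(current).to eq(checked_in)
  end
end
```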
[01:06:10.340] - Robby Russell
How does GraphQL affect the way you design those service boundaries?
[01:06:15.260] - Peter Compernolle
I have maybe a spicy take here. I think that GraphQL being federated makes it really easy for implementers to not think too much about where to get the data, which is a big reason for using it. I'm maybe an idealist, or maybe too old school, but if we rewind about five minutes, I was talking about forcing people to think about error handling and eventual consistency, and GraphQL makes it so you don't need to do any of those things. I think this is something that we're going to learn as an org over the next couple of years, because right now, most of our data is coming from the monolith, or we're building benefits-specific pages and they're coming from a different monolith. But as we split things up, we're going to have to really think about, well, this page right now depends on five different services. What if three of them are up and two are down? Which combinations can we tolerate? Right now, we don't need to think about that, which is good. It's helping us move really quickly, but slowly it's going to become hard, and we're not going to know, Okay, if this service is down, this is the experience that I want.
[01:07:32.540] - Peter Compernolle
This is the degraded experience. I don't think that opinion is super common at Gusto. I think GraphQL is helping us go so quickly that the pros outweigh the cons. But I really think hard about these boundaries all the time, and I don't like that GraphQL obfuscates them.
[01:07:49.340] - Miguel Conde
That is the benefit of GraphQL: the resolution is federated. As a caller, I don't need to know that I have to call services X, Y, and Z to get the data that I want. That is obscured from us. Where it becomes hard, and where we have to change our architecture, is that because you cannot guarantee which service resolves what, if one of those services is down, how do you handle that? Let's take that Social Security padlock scenario. Let's say I have a page that shows me the employee name, their Social Security number, their birthday, and a list of schedules. I know that my service can resolve the schedule, but let's say that in ZP, that employee was actually deleted, and we never got the message that it was deleted, so we still have a reference to it. The front-end passes that ID, and we resolve it. But then when it goes and tries to resolve anything else, like name or birthday, it'll bomb out, because that employee doesn't exist anymore, and it could have been hard-deleted instead of versioned out. So what is the degraded experience? That padlock strategy isn't going to work here; you can't padlock the name, padlock the birthday. That's not how it goes.
[01:09:08.060] - Miguel Conde
So we have to think about it: your back-end might be fully tolerant, but how does it all come together so that if something is down, the page still behaves?
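At the resolver level, one way to get a degraded-but-working page with graphql-ruby is to make downstream-backed fields nullable and convert upstream failures into partial errors. A sketch with hypothetical types and a hypothetical client:

```ruby
module Types
  class ScheduleType < GraphQL::Schema::Object
    field :id, ID, null: false
    # Nullable, so the rest of the schedule can render without the name.
    field :employee_name, String, null: true

    def employee_name
      EmployeeServiceClient.fetch_name(object.employee_id)
    rescue EmployeeServiceClient::Unavailable, EmployeeServiceClient::NotFound
      # Raising GraphQL::ExecutionError nulls just this field and attaches a
      # partial error to the response; the query as a whole still succeeds.
      # The UI then decides how to render the gap, its own version of the padlock.
      raise GraphQL::ExecutionError, "employee data unavailable"
    end
  end
end
```

This doesn't solve the hard-deleted-employee case Miguel describes (where the root object itself is gone), but it shows the general shape: decide field by field what the degraded experience is.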
[01:09:22.320] - Robby Russell
Those sound like some complicated things to wrap your head around, and to disseminate across 600 engineers. It's fun, though.
[01:09:29.320] - Peter Compernolle
No, it is. It's hard. I'll say it one more time: it all comes down to people problems, or organization problems. A lot of this is: we can either build something very simple, where we can afford to be thoughtful about all of these things, or we can build something complex, where we maybe can't afford to think about all of these things. I think as we're maturing as a company, we're getting much better at really simplifying things so we can be more fault-tolerant and thoughtful. We're making a lot of progress, and we're addressing those concerns, but it's hard.
[01:10:05.820] - Robby Russell
For someone working on, say, the React SPA, who's trying to fetch a bunch of data when one of the systems is down: do they then need to handle all of those edge cases in their code? Is that the reality? I'm just thinking high level, and I'm not pro or against GraphQL, or federated GraphQL in particular. Is the illusion that you don't have to think about it, because you just fetch stuff and it does all the work behind the scenes, actually bringing more complexity into the front-end code itself to handle all those scenarios and fallbacks?
[01:10:46.500] - Miguel Conde
It is a transfer of responsibilities, I feel. If you call my back-end service through a REST controller and I have dependencies that I need to fetch data from, I inherit that risk: I try to fetch it, and if one of those services is down, I have to be tolerant to that in some capacity. The burden lives on the service. With GraphQL, I cannot guarantee anything from my service other than what I can provide; the burden is now always on the caller to be as tolerant as possible when the thing it's calling is down. With REST, where I make the call, I am the caller, so I have to be tolerant when dependencies go down, and I can be. But now the UI is the caller, and some of these pages are expensive. I think we just have to rethink how much data we actually want to load on a page, because the more complex the page, with different data points, the more the experience can degrade if any of those things goes down. It just might require a rethinking of the UI.
[01:11:55.600] - Robby Russell
Sounds complicated to wrap your head around. Thanks for diving into that for us. I'm going to do some pondering myself after our conversation about how I'm thinking about things on my end as well. Miguel, one of the things you said in our prep call is you described Rails like a Formula One car: fast, but tricky with too much weight. How has Rails helped you succeed, and where has it shown its limits at scale?
[01:12:20.500] - Miguel Conde
The analogy I was using is that Rails is a Formula One car: there's no baggage to it, it's a lightweight product, so you can go really quickly, and it allows developers to just go. But there aren't a lot of safety features in Formula One cars, either. There are no airbags; you have a small cage and that's about it. It's really up to the driver, in this case the coder, to know how to control the car well and to follow the guidelines, so that if something does happen, they don't blow up and crash. When your app increases in size and complexity, consider that putting weights on the Formula One. Now the speed is being weighed down by all this complexity, and maneuverability becomes much more difficult, because when you turn, you might be shifting a bunch of stuff that you didn't care about before. You have to control how fast you're going, and you have to be much more cognizant: if you make a pivot somewhere, what are the consequences? That's why the extraction work is at play here. We're trying to get back to that Formula One, get rid of the weights, and have a very lean service, so that you can once again feel comfortable going at the speeds you were previously achieving.
[01:13:47.400] - Peter Compernolle
This isn't super related to the metaphor, but it was a real turning point for me when we were talking about breaking up the Rails monolith. For a long time, we thought about how a certain service owned certain things: data, functionality, and so forth. That went reasonably far, but the turning point was when I realized it's more about what the surface area is and the interface between services, how we're sharing those things. If you think from that perspective, Rails has a very large surface area for almost everything. That lets you go really fast, because there is an API for the thing that you want to do, I promise. As we said before, an Active Record model has several hundred APIs. But if all I want to expose is an attribute reader for one column, I have to do more work to lock the rest of it down. In a similar way, in the F1, all of the APIs are there. Everything that you could possibly need is there. But the safety Miguel was talking about is lacking: I have to support all several hundred of those APIs now. Whereas if we invert that a little bit and make everything private by default, now I'm exposing only what I need.
[01:15:13.860] - Peter Compernolle
And when we extract all these services, we get that boundary. It sounds obvious, but that boundary is there because the only things that I'm exposing are available over HTTP or GraphQL or whatever. All of those other APIs, you can't access them if you tried.
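A tiny sketch of that inversion in plain Ruby: keep the Active Record model private to its pack and expose only the narrow reader you actually want to support. Names are illustrative, not Gusto's code:

```ruby
module Scheduling
  class Schedule < ApplicationRecord # hundreds of inherited APIs...
  end
  private_constant :Schedule # ...none of them reachable from outside the module

  # The entire public surface area: one lookup returning one plain value object.
  ScheduleSummary = Struct.new(:id, :starts_at, keyword_init: true)

  def self.find_summary(id)
    record = Schedule.find(id)
    ScheduleSummary.new(id: record.id, starts_at: record.starts_at)
  end
end
```

Callers can use `Scheduling.find_summary(42)` but cannot touch `Scheduling::Schedule` at all, which is the in-process equivalent of the boundary an extracted service gets for free.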
[01:15:30.840] - Robby Russell
One of the things you mentioned earlier was around developers who are working on some features in the, air-quoting, back-end and feel like they need to go click around in the web views to make sure things are still working. How does your team think about testing? What tooling do you have in place? Do you have a lot of end-to-end tests, or is that more the SPA side's responsibility? Are the back-end developers running those tests on a regular basis? Where are those lines? Are they blurry? Is everybody able to run the whole test suite locally? Do you have a test suite?
[01:16:07.700] - Peter Compernolle
We have better test coverage than any other company I've worked at. But most of that coverage is at the unit level or at the very high level with integration tests, and the high-level integration tests generally test happy-path cases. So we get the edge cases through unit tests, and this works pretty well for us, but we could still improve. I think the best way to do this is to make the problem smaller. If we can define those boundaries better, then there are fewer cases we need to support, because we've identified what works and what doesn't. Then whether we call it a unit test or an integration test fades away. Miguel, you work in a much bigger application. I'm curious to hear what you think.
[01:16:54.400] - Miguel Conde
We have essentially three levels of testing: unit testing, integration testing, and then full end-to-end, where we actually boot up a headless browser and navigate from the front-end. Unit tests we use primarily for edge cases and happy paths, to make sure that the function works. Integration testing tests our GraphQL layer primarily, to make sure that when you hit one of our GraphQL endpoints, our services are called and everything resolves. Then end-to-end is just smoke tests, or some teams use it as extra assurance for an edge case that they found and want to cover end to end. But all of this is going to change in the extraction, because a lot of these tests are built with the assumption that you're in a monolith; some of them use FactoryBot to build up applications. Well, in my service, I won't have access to your factories. A lot of the integration tests have to be rethought. Unit tests remain the same; that's all mocked for the most part. But integration tests, and particularly end-to-end: how do you run end-to-end in a distributed system?
[01:18:13.120] - Miguel Conde
Do you bring all the services up and then run them? What does that mean for local development if I want to run an end-to-end test locally? That is being thought out right now, and we're working through these problems to figure out what the future state of our testing suite looks like.
[01:18:31.140] - Robby Russell
That's another fun challenge it sounds like you folks are wrapping your heads around; maybe we'll have to have a follow-up conversation in a few years and see what you learned. Are there any areas of the code base in the monolith that nobody wants to touch?
[01:18:46.900] - Peter Compernolle
Those are my favorite places. Sorry, those are the juiciest ones. There was a time a few years ago when we had to work on an integration with a third party, and the integration was built very quickly, and maybe not in the best way. That was one of the most fun projects for me to work on. I used to joke it was like walking around in an apple orchard with so many apples that you couldn't take a step without one whacking you in the face. There was so much low-hanging fruit, so much room for opportunity. But of course, there are parts of the code that are brittle, where if you look at it, you think, I can't believe this is working, but it's been working just fine for 10 years. You just let it be and respect the legacy code.
[01:19:34.080] - Miguel Conde
We've gotten much better at this now, but when I first joined, one of the things I found pretty hard as an engineer was that you had teams that owned code, but only on paper. What it actually was: you had that one engineer who worked on it who was the true owner of the code. If that person is gone, the team might say, Yeah, we're the owners, but that's about as much as we know about it. We've gotten better, and we've really invested in documentation over time, such that anyone can be an owner. Even if you just come into the team and know nothing, we can onboard you with a set of diagrams and explanations, so you don't have to spelunk through the code to get at least a high-level understanding of what it does. That way, the team truly owns a domain, instead of just the individual who happened to work on it.
[01:20:31.760] - Robby Russell
It's helpful to get a perspective on how teams think about that. It always does seem to come back down to: there's that one person; can we wait till they're back from vacation next week before we have to touch it, please? Do you use any other tooling to help you identify high-churn areas? Do you have a platform team, someone responsible for thinking about the developer experience and always optimizing it? I've talked to some teams at a larger scale that have that, say, luxury of people focused on those things, thinking about upgrades and tackling them. Or are those just shared responsibilities across the wider engineering organization?
[01:21:12.520] - Peter Compernolle
We have dedicated teams for a lot of that stuff. We also send out a survey, I think once a quarter, to get a feel for how engineers are feeling. Does CI feel fast? Could it be faster? Is code easy to understand if you don't work in it regularly? How's documentation? We can see how things trend over time, which is pretty valuable for identifying hotspots. Then we break it down by team. You can see, well, this team thinks that CI/CD is really slow, and that team thinks it's really fast. It brings a lot to the surface. But we do have teams dedicated to a lot of the foundational stuff. Then within the teams, we talk about that stuff a lot.
[01:21:55.480] - Miguel Conde
Overarching, we do have platform teams, and we have tooling that lets you run scripts against the code base. It gives you ownership information: it tells you who owns what and where. We have these surveys, as Peter mentioned. But there's also the responsibility of the teams to make sure their domain is well maintained and well documented. You can have wide differences across our repository, because that ground-level responsibility is still on a team-by-team basis.
[01:22:31.800] - Peter Compernolle
It's one of the challenges that we have, because as we've gotten better at identifying these pseudo-services (by pseudo-service, I mean a thing that would ideally be a remote service, but for now the boundary is drawn with Packwerk), some are really clear, and they've had the time and resources to invest in their APIs. Their experience of the application is very different from a team that's inherited very legacy code, or one that's more dependent on the other monolith. So the experience from team to team largely depends on what that team is responsible for. That's difficult, because we try to draw conclusions like, we need better documentation. Well, half the teams think that their code is better documentation than a document would be, because the APIs are well defined, we've got Sorbet strict signatures across the board, and everything else is private. But then some other team is looking at 10-year-old code that makes no sense to anybody, and they have to depend on documents explaining: this is what you don't touch; this is how this works at the big picture; these are the things that we haven't quite been able to fix but don't need to be fixed quite yet.
[01:23:49.260] - Peter Compernolle
It varies so much because it's such a big application.
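For teams on the "well-defined API" end of that spectrum, the combination Peter describes might look roughly like this: a strictly typed file where the signature itself documents the boundary. A hypothetical example, not Gusto's code:

```ruby
# typed: strict
require "sorbet-runtime"

class BenefitsEnrollment
  extend T::Sig

  # The explicit public API: typed input, typed (nilable) output. Everything
  # behind it stays private to the pack.
  sig { params(employee_id: Integer).returns(T.nilable(Date)) }
  def self.coverage_start_date(employee_id)
    enrollment = Enrollment.find_by(employee_id: employee_id)
    enrollment&.coverage_start_date
  end
end
```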
[01:23:52.880] - Robby Russell
I can appreciate that. You're mentioning those quarterly surveys, those developer-experience-type surveys. I don't know what types of tools you're using there; we've used those types of tools as well. And it always feels like, even at our scale, an engineering team of 15 people or so, it's very contextual to the types of projects the team is working on at that point in time. We're even working with different clients. At that scale, you're tracking sentiment, something subjective; it's a snapshot in time of how people are feeling about the code, their test coverage, documentation. And then there's a lot of, Well, we can't just draw a conclusion across the whole organization because of this; there's a lot of context there. Are you able to align that with any objective data points? Our test suite has actually gotten a lot slower over the last quarter, so this seems to be aligning. Or, actually, we can say the test suite is a lot faster than it was two quarters ago, but the team still feels like it's super slow.
[01:24:50.920] - Robby Russell
Is this a human communication process? Like, Hey, we're not celebrating the wins enough; we're not talking about the positive progress we're making. At 600 people or more, you probably remember the exact number, that sounds like a challenging thing, just to get momentum and energy and excitement going when we're all working in our little silos in different parts of the organization.
[01:25:12.120] - Peter Compernolle
Let's see. We do have a lot of objective data around code throughput, cyclomatic complexity of diffs, breakdowns even by engineer. We can see, again, big-picture trends. The problem with a lot of this data is it can be explained away in many cases. But another thing we've done in the last two or so years is lean very heavily into visibility through Datadog. Not only do we use Datadog for performance of the application, but our foundation teams will even use it for things like understanding CI/CD or deploy frequency and so forth. So we build service-level objectives, SLOs. A product team like the one I'm on is responsible for endpoints taking less than a few hundred milliseconds, error rates staying below some threshold, whatever. But our foundation teams will say, this process needs to take less than 10 minutes, or those sentiment surveys need to be trending up instead of down. We really believe in that, because big picture, it tells a much more detailed story over time.
[01:26:22.040] - Robby Russell
What's something that your team does differently from most Rails teams you've met? Miguel.
[01:26:28.480] - Miguel Conde
I think, at least within Gusto, the structure of our services is what many people would consider non-Rails-y, which is the service-repository pattern. One of the things we see when someone modifies our code is that they put a lot of logic on the model, because of that mentality of, I cannot guarantee that someone's not going to call my model out of the blue, so I need the guarantee that if you do something, we have validation at the model. Where our team does things differently is: no. The model may have some basic validations, something that just validates the presence of this or that, but nothing complex. We are very adamant that the only way you interact with it is through our services. Our models are actually protected classes: if anyone modifies them, they need a review from us. We take on that risk. Like Peter says, we cannot guarantee privacy, but we are very adamant that we will follow that service-repository structure and rely on Sorbet and Packwerk to analyze whether someone is doing something they're not supposed to in the monolith.
[01:27:48.760] - Miguel Conde
Once we get away from the monolith, that might change, because then anything we do is within the service itself, and privacy is inherited by nature of not being in the same namespace as everyone else.
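A condensed sketch of that service-over-model split: the model keeps only presence checks, while the service owns the domain validation and is the sanctioned write path. All names are hypothetical:

```ruby
module TimeTracking
  class TimeEntry < ApplicationRecord
    # Basic checks only; nothing domain-specific lives on the model.
    validates :employee_id, :started_at, :ended_at, presence: true
  end

  class TimeEntryService
    class OverlapError < StandardError; end

    # The only sanctioned way to create a time entry.
    def create!(employee_id:, started_at:, ended_at:)
      # Complex, domain-level validation lives in the service, not the model.
      overlapping = TimeEntry.where(employee_id: employee_id)
                             .where("started_at < ? AND ended_at > ?", ended_at, started_at)
      raise OverlapError, "entry overlaps an existing one" if overlapping.exists?

      TimeEntry.create!(employee_id: employee_id,
                        started_at: started_at,
                        ended_at: ended_at)
    end
  end
end
```

Tools like Packwerk (and guard gems such as explicit_activerecord) can then be used to flag writes that bypass the service layer.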
[01:28:04.480] - Robby Russell
Peter, anything on your end that comes to mind, maybe less technical?
[01:28:09.400] - Peter Compernolle
I don't know exactly, but I think there are a few things that Rails likes that we don't do. We pretty much forbid metaprogramming, at least metaprogramming that we write; the metaprogramming that there is, is tucked away in gems with really clear APIs. Like I mentioned before, we try to make all of our Active Record models private, so we harness the power, but we hide it and protect ourselves from it. And we lean a lot on Packwerk and Sorbet. One thing that is really powerful about Rails, and maybe this is just a different way to say what I said before, is that there's a lot of functionality tucked away and implicit in the things we do or can do. The number one thing that we try to include with every single thing we do is make it explicit. Anytime there's a dependency, we want to scream from the mountaintops: this is a dependency. We don't want to hide it. Again, I'll bring up the Active Record models being private. It's very, very powerful to have an API that says explicitly, this is what is happening.
[01:29:24.020] - Peter Compernolle
This is what we're using. This is what we're depending on. And then all of those other things you get, ideally, we lock them away. If we need to bring one to the surface, we can, very easily. But hidden functionality at our scale becomes very, very difficult to control. So we want to make it all explicit.
[01:29:44.140] - Robby Russell
I can appreciate that. I want to thank you both for coming OnRails today to talk shop with us. A couple of last questions for you. Is there a programming book that you find yourself recommending to teammates or peers again and again?
[01:29:57.980] - Peter Compernolle
No. The closest thing is: Ruby is not my first language, and I used Ruby Koans years ago, and that was probably the most valuable learning tool I've ever used for programming. But the way I learn is by messing around with stuff, seeing what happens, and thinking a lot. So I don't, unfortunately, have a book that I'd recommend. I think the best way to learn, for me at least, is to build stuff, see what breaks, fix it, learn from it, and accumulate all of those little details over the years.
[01:30:33.620] - Robby Russell
Sure. What about you, Miguel?
[01:30:36.200] - Miguel Conde
The only book I recommend when it comes to programming is Cracking the Coding Interview, because once you get in the door of a company, you learn much, much more being on the ground than I feel any book can give you. Books give you theory and ideas, but they cannot replicate the real-life scenarios and the day-to-day that working at a company does. So I recommend that book. Get yourself through the door, and then you really learn the intricacies of engineering.
[01:31:11.000] - Peter Compernolle
Another thing, too, is reading source code. I love digging through Rails. I work a lot in iOS, too, where you can't see the source code. But understanding how those things work and digging really deep into them has taught me patterns in a much more intuitive way than reading a book would, because I can see how it's working, and I can play with it, touch it, and feel it. I love reading source code.
[01:31:37.300] - Robby Russell
It's a fun hobby for some of us, right? Where can folks follow your... Does Gusto have an engineering blog that I can link people to?
[01:31:45.860] - Peter Compernolle
There's definitely an engineering blog. The URL escapes me right now, but check out Gusto.com. It's there.
[01:31:55.360] - Robby Russell
Well, I'll definitely include links to the engineering blog in the show notes for all of our listeners. With that, it's been such a delight, Miguel and Peter. Thank you so much for stopping by to talk shop with us OnRails. We really appreciate you taking the time.
[01:32:07.040] - Peter Compernolle
Thank you so much. This was a lot of fun. Yeah.
[01:32:08.890] - Miguel Conde
Likewise. Thank you very much for having us.
[01:32:13.000] - Robby Russell
That's it for this episode of OnRails. This podcast is produced by the Rails Foundation with support from its core and contributing members. If you enjoyed the ride, leave a quick review on Apple Podcasts, Spotify, or YouTube. It helps more folks find the show. Again, I'm Robby Russell. Thanks for riding along. See you next time.