Break Things on Purpose

Today Jason chats with Mauricio Galdieri, a staff engineer at Pismo. Mauricio describes his journey to becoming a developer, and then talks about his role at Pismo. Jason and Mauricio talk about reliability and the reluctance of financial institutions to adopt new tech, and Mauricio delves into some of the work he’s done with Chaos Engineering. The conversation concludes with a discussion about maximizing success and new technologies Mauricio’s team has been utilizing at Pismo.

Show Notes

In this episode, we cover:
  • Mauricio talks about his background and his role at Pismo (1:14)
  • Jason and Mauricio discuss tech and reliability with regard to financial institutions (5:59)
  • Mauricio talks about the work he has done in Chaos Engineering with reliability (10:36)
  • Mauricio discusses things he and his team have done to maximize success (19:44)
  • Mauricio talks about new technologies his team has been utilizing (22:59)


Transcript
Mauricio: That’s why the name Cockroach, I guess, if there’s a [laugh] a world nuclear war here, all that will survive would be cockroaches and our clients’ data. [laugh]. So, I guess that’s the gist of it.

Jason: Welcome to Break Things on Purpose, a podcast about Chaos Engineering and reliability. In this episode, we chat with Mauricio Galdieri, a staff engineer at Pismo about testing versus exploration, reliability and resiliency, and the challenges of bringing new technologies to the financial sector.

Jason: Welcome to the show.

Mauricio: Hey, thank you. Welcome. Thanks for having me here, Jason.

Jason: Yeah. So, Mauricio, you and I have chatted before in the past. We were at Chaos Conf, and you are part of a panel. So, I’m curious, I guess to kick things off, can you tell folks a little bit more about yourself and what you do at Pismo? And then we can maybe pick up from our conversations previously?

Mauricio: Okay, awesome. I work as a staff engineer here at Pismo. I work in a squad called staff engineering squad, so we’re a bunch of—five squad engineers there. And we’re mostly responsible for coming up with new ways of using the existing technology, new technologies for us to have, and also standardize things like how we use those technologies here? How does it fit the whole processes we have here? And how does it fit in the pipelines we have here, also?

And so, we do lots of documentation, lots of POCs, and try different things, and we talk to different people from different companies and see how they’re solving problems that we also have. So, this is basically our day-to-day activities here. Before that, well, I have a kind of a different story, I guess. Most people that work in this field, have a degree in something like a technical degree or something like that. But I actually graduated as an architect in urban planning, so I came from a completely different field.

But I’ve worked as a software developer for a long time, longer than I’m [laugh] willing to disclose. So, at the time when I started working with software development, I like to say that startups were called dotcoms back then, so [laugh] there were lots of job opportunities, and I worked as a software developer at that time. And things evolved. I grew less and less as an architect and more as an engineer, so after I graduated, I started to look for a second degree, but at a more technical college, so I went to an engineering college and graduated as a systems analyst.

So, from then on, I’ve always worked as a software developer and have never done any house planning or house project or anything like that. And I really doubt I could do that right now [laugh], so I may be a lousy architect in that sense. But anyway, I’ve worked at different companies in both the private and public sectors. And I’ve worked with consultancy firms and so on. But just before I came to Pismo, I went to work at a FinTech.

So, this is where I had my first contact with the world of finance in a software context. Since then, I’ve dug deep into this industry, and here I am now, working at Pismo for almost five years.

Jason: Wow. That’s quite a journey. And although it’s a unique journey, it’s also one that I feel like a lot of folks in tech share: they come from different backgrounds and maybe haven’t gone down the traditional computer science route. With that said, you know, one of the things you mentioned was FinTech. Can you give us a little bit of a description of Pismo, just so folks understand the company that you’re working at now?

Mauricio: Oh, yeah. Well, Pismo is a company that has been around for about six years now. And we provide infrastructure for financial services. So, we’re not a bank ourselves, but we provide the infrastructure for banks to build their financial projects with this. So basically, what we do is we manage accounts, we manage those accounts’ balances, we have connections with credit card networks, so we process—we’re also a credit card processor.

We issue cards, although we’re not the issuer in the strict sense, but we issue cards here and manage the whole lifecycle of those cards. And basically, that’s it. But we have a very broad offering of products, from account management to accounting management, and transactions management, and spending control limits and stuff. So, we have a very broad product portfolio. But basically, what we do is provide infrastructure for financial services.

Jason: That’s fascinating to me. So, if I were to sum that up, would it be accurate to say that you’re basically like Software as a Service for financial institutions? You do all the heavy lifting?

Mauricio: Yeah, yeah. I could say that, yeah.

Jason: It’s interesting to me because, you know, traditionally, we always think of banks because they need to be regulated and there needs to be a whole lot more security and reliability around finances, we always think of banks as being very slow when it comes to technology. And so, I think it’s interesting that, in essence, what you’ve said with trying the latest technology and getting to play around with new technology and how it applies, especially within your staff engineering group, it’s almost the exact opposite. You’re sort of this forefront, this leading edge within the world of finance and technology.

Mauricio: Yeah. And that is actually the most difficult part: getting banks to sign up with us, you know? Because they have those ancient systems running on-premises, most likely running on top of COBOL programs and so on. But at the same time, those systems are highly, highly reliable. They’ve been running them for, like, 40 years, even more than that, so they’re very highly reliable.

And as you said, it’s a very regulated industry, so it’s very hard to sell them this kind of new approach to banking. And actually, we consider this as almost an innovation for them. And it’s a little bit strange to talk about innovation in the sense that we’re proposing that other companies run in the cloud. This doesn’t sound innovative at all nowadays. So, every company runs their systems in the cloud nowadays, so it’s difficult to [laugh] realize that this is actually innovation in the banking system because they’re not used to running those things.

And as you said, they’re slow in adopting new technologies because of security concerns, and so on. So, we’re trying to bring these new things to the table and prove them. And we had to prove to banks and other financial institutions that it is possible to run a banking system a hundred percent in the cloud while maintaining security standards, security compliance, governance compliance, and all that stuff. It’s very hard to do so and we have a very stringent process to evaluate and assess new technologies because we have to make sure they comply with those standards and all those certifications that we need to have in order to operate in this industry. So, it’s very hard, but it doesn’t—at the same time, we have lots of new technologies and different ways we can provide the same services to those banks.

And then I think the most difficult part in this is to map what traditional banks were doing into this new way of doing things in the cloud. So, this mapping sometimes gets a little confusing and we have to be very patient and very clear with our clients about what they should expect from us and how we will provide the same services they already have now, but using different technologies and different ways. For instance, they are used to these communications with different services, they’re used to things like webhooks. But webhooks are not reliable; they can fail and if they fail, you lose that connection, you lose connectivity, and you may lose data and you may have things out of sync using webhooks. So, now we have things like event streaming, or queues and other stuff that you can use to replay things and not lose any data.

But at the same time, you have to process this offline, in an asynchronous manner. So, you have to map those synchronous things that they did before to this asynchronous world, this world where we have eventual consistency. It’s very difficult, but at the same time, it’s a very fascinating industry.
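
The contrast between webhooks and replayable event streams can be shown with a minimal in-memory sketch. Everything in it (`EventLog`, `Consumer`, `send_webhook`, the `amount` field) is a hypothetical stand-in chosen for illustration, not Pismo's system or any real streaming product's API. The assumption is that a webhook is delivered once and dropped if the receiver is down, while a log consumer keeps its own offset and catches up after an outage.

```python
from dataclasses import dataclass, field


@dataclass
class EventLog:
    """Append-only log (hypothetical, in-memory); events are never dropped."""
    events: list = field(default_factory=list)

    def append(self, event: dict) -> int:
        self.events.append(event)
        return len(self.events) - 1  # offset of the event just written

    def read_from(self, offset: int) -> list:
        return self.events[offset:]


class Consumer:
    def __init__(self, log: EventLog):
        self.log = log
        self.offset = 0      # next offset this consumer still has to process
        self.balance = 0
        self.online = True

    def poll(self) -> None:
        """Process every event not yet seen; after an outage this is the replay."""
        if not self.online:
            return
        for event in self.log.read_from(self.offset):
            self.balance += event["amount"]
            self.offset += 1


def send_webhook(receiver: Consumer, event: dict) -> bool:
    """One-shot delivery with no retry: if the receiver is down, the event is lost."""
    if not receiver.online:
        return False
    receiver.balance += event["amount"]
    return True


def demo() -> tuple:
    events = [{"amount": 10}, {"amount": 20}, {"amount": 30}]

    # Webhook style: the receiver is down while the second event is sent.
    hook_receiver = Consumer(EventLog())
    for i, event in enumerate(events):
        hook_receiver.online = (i != 1)
        send_webhook(hook_receiver, event)

    # Log style: same outage, but the consumer catches up on its next poll.
    log = EventLog()
    log_consumer = Consumer(log)
    for i, event in enumerate(events):
        log.append(event)
        log_consumer.online = (i != 1)
        log_consumer.poll()

    return hook_receiver.balance, log_consumer.balance
```

Calling `demo()` returns `(40, 60)`: the webhook receiver is out of sync by the one event it missed, while the log consumer ends up consistent, only later than a synchronous call would have been. That delay is the eventual consistency mentioned above.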

Jason: Yeah, that is fascinating. But I do love how you mentioned taking the idea of the new technology and what it does, and really trying to map that back to previously—you know, those previous practices that they had. And so, along with that, for folks who are listening again, Mauricio and I had a chat during Chaos Conf a while back, and he was sharing some of the practices that Pismo has done for Chaos Engineering. And I always liken that back to, you know, Chaos Engineering really is very similar to traditional disaster recovery testing, in many ways, other than oftentimes, your disaster recovery would never actually, you know, take things down. Mauricio, I’m curious, can you share a little bit more about what you’ve been doing with Chaos Engineering and in general, with reliability? Are there any new programs or processes that you’ve worked on within Pismo around Chaos Engineering and reliability?

Mauricio: Well, I think that the first thing to realize, and I think this is the most important point that you need to have very clear in your mind when we’re talking about Chaos Engineering, is that we’re not testing something when we’re doing Chaos Engineering; we’re experimenting with something. And there’s a subtle but very important distinction between those two concepts. When you test for something, you’re testing for something where you know what will happen; you have an idea of how it should behave. You’re asserting a certain behavior. You know how the system must behave and you assert that, and you make sure the system doesn’t deviate from that by having an automated test, for instance, a unit or integration test, or even functional tests and such.

But Chaos Engineering is more about experimenting. So, it’s designed for the unknowns. You don’t know what will happen. You’re basically experimenting. It’s like a lab: you’re working in a laboratory, you’re trying different stuff to see what happens. You have an idea of what should happen and we call this a hypothesis, but you’re not sure if that is how it will behave.

And actually, it doesn’t matter if it complies with your expectations. Even if it doesn’t behave the way you expect it to behave or the way you want it to behave, you’re still gaining knowledge about your system. So, it’s much more about experimenting with new things instead of actually testing for something that you know about. And our journey here into Chaos Engineering at Pismo, it all began about a year-and-a-half ago when we had a huge outage at one of our major cloud providers here. And we went down with them; they were out for almost an hour.

And not only were we affected by it, but so were other digital banks here in Brazil, and many other services like Slack, Datadog, and other observability tools that were running on that cloud provider at the time went down together with it. So, it was a major, major outage here. And then we were actually caught off guard on this because we have lots of different ways to make sure the system doesn’t go down if something bad happens. But that was so bad that we went down and we couldn’t do anything. We were desperate because we couldn’t do anything. And also we couldn’t even communicate properly because we use Slack as our communication hub, and Slack was down at that time also, so we could not communicate through our official channels.

Also, Datadog, which we were using at the time, also went down and we couldn’t even see what was happening in the system because we didn’t have any observability running at the time. So, that was a major, major outage we had there. So, we started thinking about ways we could experiment with those major outages and see how we could find ways of still operating at least partially and not go down entirely, or at least have ways to see what was happening even in the face of a major disaster. And those traditional disaster recovery measures that were valid at the time, even those couldn’t cope with the kind of outages we were facing at that time. So, we were trying to look for different ways that we could improve the reliability of our services as a whole.

So, I guess that’s when we started looking into Chaos Engineering and started looking for different tools to make that work, and different partnerships we could find, and even different ways we could experiment this with our existing technology and platform.

Jason: I really love how you characterized that difference between testing and Chaos Engineering. And I think the idea of being more experimental puts you into a mindset of having this concept of, you know, kind of blamelessness, right, around failure. The idea that, like, failure is going to happen and we want to be open to seeing that and to learning from it. More so than a test, right? When we test things, then there’s the notion of a pass-fail and fails are bad, whereas with an experiment, that learning is, if it didn’t happen the way you expect, there’s learning around that and that’s a good thing rather than a bad thing, such as failing a test.
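
The distinction between a test and an experiment can be sketched in a few lines. This is only an illustration under stated assumptions: `apply_payment`, `ExperimentResult`, `run_experiment`, and `simulated_cache_outage` are hypothetical names, and the "fault injection" is a stub function rather than a call to Gremlin or any real chaos tool.

```python
from dataclasses import dataclass
from typing import Callable


def apply_payment(balance: int, amount: int) -> int:
    """Toy function under test (hypothetical)."""
    if amount > balance:
        raise ValueError("insufficient funds")
    return balance - amount


def test_apply_payment() -> None:
    # A test asserts behavior that is known in advance: it passes or it fails.
    assert apply_payment(100, 30) == 70


@dataclass
class ExperimentResult:
    hypothesis: str
    observed: str
    hypothesis_held: bool
    learning: str


def run_experiment(
    hypothesis: str, expected: str, inject_and_observe: Callable[[], str]
) -> ExperimentResult:
    """An experiment records what happened; both outcomes produce knowledge."""
    observed = inject_and_observe()
    held = observed == expected
    prefix = "confirmed: " if held else "new knowledge: "
    learning = prefix + f"system responded with '{observed}'"
    return ExperimentResult(hypothesis, observed, held, learning)


def simulated_cache_outage() -> str:
    # Stand-in for a real fault injection; here the service simply times out.
    return "timeout"


result = run_experiment(
    hypothesis="If the cache is unreachable, the service falls back to the database",
    expected="fallback",
    inject_and_observe=simulated_cache_outage,
)
```

Here the hypothesis does not hold, yet nothing "fails": `result.learning` records what the system actually did, which is the outcome the experiment was run to obtain.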

Mauricio: Yeah, and that works in a higher framework, I guess, which is resilience itself. So, I guess, chaos experiment, chaos engineering, and all that stuff, it’s an important part of a bigger whole that we call resilience. And I guess a key to understanding resilience is exactly this point: systems never work in unexpected ways. They always behave the way they are expected to behave. They’re deterministic in nature. So, we’re talking about machines here, computers. We told them what we want them to do.

And even if we have complexity and randomness involved, say if a network connection goes down, they will still behave the way we programmed them to behave. So, every failure should be expected. What we have here is that sometimes they behave in ways we don’t want them to behave. And sometimes they behave in ways we want them to behave. So, it’s more of a matter of desire, you know? You want something, you want the system to behave a certain way.

So, in that sense, success should be measured as a performance variability, you know? So, sometimes it will work the way you want and sometimes it will work in ways that you don’t want it to behave. And I guess realizing that is also key to understanding another point, which is that, in that sense, success is the flip side of failure. So, either it works the way you want it or it works the way you don’t want it. And what we can do to move the scale towards a more successful operation, the ways you can do this, you must first realize also that—let’s go back a little bit then say, if you have a failure and you look at why it happened, almost never is it the result of one single thing.

Sometimes it is, but this is very rare. Most failures, and especially major failures, are most likely the result of a context of things that happened that led to this failure. And you can see that the same thing is valid for successes. When you have a success at one point, it’s almost never the result of one thing that you did that led to a successful scenario. Most of the time it is a context of different things you did that maximized your chances of success.

So, to turn this scale towards success, you should create an environment of several things, of a context of things. And this could be tooling, this could be your organizational culture and stuff, all of those things that you do in your company to maximize your chances of success. You cannot plan for success in that sense, because planning is one thing you can do, and planning doesn’t involve strategy, for instance. Because planning should be done thinking about things you can do, tasks you can perform, while strategy, you should be turning tables to [laugh] think in terms of strategy. So, you have to put all of this on the table and try to organize your company and your culture, your tools and your technology in ways that maximize your chances of success and minimize your chances of failure.

Jason: That’s such an interesting insight. So, I’m curious, can you dive into some of the things that you and your team have done to maximize your chances of success?

Mauricio: Okay. When we started working with Chaos Engineering, it was in this sense of trying to do one more thing to maximize our chances of success. And we partnered up with Gremlin and we saw that working with Chaos Engineering, using Gremlin mainly, it’s so easy—that is, it’s also easy to lose track of what you’re doing. It’s easy for you to go just for the fun of it and break things down and have fun with it and stuff. So, we had to come up with a way to bring structure to this process.

And by doing so, we should also not be too bureaucratic in the sense of creating a set of steps you should take in order to run a chaos session. So, one way we thought about was to come up with a document. That is the bureaucratic part, so this was a step you should take in order to plan for your chaos session, but there is one part of it—and I think it’s one of the most important parts of this chaos session planning—is that you should describe what you’re going to test, but more importantly, why you’re going to test this. And this is one of the most important questions because this is a fundamental question: why you’re doing this kind of experiment. And to answer that, you have to think about all the things in context.

What are the technologies you’re using? Why does it fail in the first place? Are the failures that I expect to see actually failures, or are they just different ways of behaving? And sometimes we consider a failure to be a business rule that was not complied with, that was not met. So, this is an opportunity to think about, are those business rules correct? Should we make them more flexible? Should we change that business logic?

So, when you start asking why you’re doing something, you’re asking fundamental questions, and I think that puts you in context. And this is one of the major starting points to maximize our chances of success because it makes every engineer involved in running a chaos session, think about their role in the whole process and the role of their services in the whole company. So, I think this is one powerful question to ask before starting any chaos session, and I think this contributes a lot to a successful outcome.
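
One way to picture the planning document described here is as a small record that is not considered ready until the "why" is filled in. The transcript does not describe the real template, so the class and every field name below (`ChaosSessionPlan`, `what`, `why`, `hypothesis`, and so on) are assumptions made purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ChaosSessionPlan:
    """Hypothetical planning record; field names are illustrative assumptions."""
    what: str            # what the session will exercise
    why: str             # the fundamental question: why run this experiment at all
    hypothesis: str      # what the team expects to happen
    technologies: list = field(default_factory=list)
    business_rules_in_question: list = field(default_factory=list)

    def missing_fields(self) -> list:
        required = {"what": self.what, "why": self.why, "hypothesis": self.hypothesis}
        return [name for name, value in required.items() if not value.strip()]

    def ready(self) -> bool:
        # A plan without a stated "why" is not ready to run.
        return not self.missing_fields()


draft = ChaosSessionPlan(
    what="Add latency between the API and the database",
    why="",
    hypothesis="Requests slow down but none are dropped",
)
```

With this draft, `draft.ready()` is `False` and `draft.missing_fields()` returns `["why"]`. The check is deliberately light, in keeping with the stated goal of adding structure without making the process bureaucratic.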

Jason: Yeah, I think that’s a really great perspective on how to approach Chaos Engineering. Beyond the Chaos Engineering, you mentioned that the staff engineering group that you’re part of at Pismo is really responsible for seeing new technologies and new trends and really trying to bring those in and see how they can be used and applied within the financial services sector. Are there any new technologies that you’ve used recently or that you’re looking at right now that have really been fruitful or really applied to finding more success as you’ve mentioned?

Mauricio: Yeah, there are some things we’re researching. One of those already went past research and we’re already using it in production, which is data—cloud-based, multi-region databases and multi-cloud—also—databases. And we’re working with CockroachDB as one of our new database technologies we use. And it’s a database built from the ground up to be ultra resilient. And that’s why the name Cockroach, I guess, if there’s a [laugh] a world nuclear war here, all that will survive would be cockroaches and our clients’ data. [laugh]. So, I guess that’s the gist of it.

And we have to think about that in different ways of how we approach this because we’re talking about multi-cloud data stores and multi-region and how we deal with data in different regions. And should we replicate all the data between regions, and how do we partition data? So, we have to think in different ways about how we approach data modeling with those new cloud-based, multi-region, and globally distributed databases. Another one that we’re—this is more of a research topic, is having sharded processing. And that is, how we can group different parts of the data to be processed separately but using the same logic.

And this is a way to scale processing in ways that horizontal scaling in a more traditional way doesn’t solve in some instances. Like, when we have—for instance, let me describe one scenario that we have that we’re exploring things along those lines. We have a system here called ‘The Ledger,’ which keeps track of all of the accounts’ balances. And for this system, if we have multiple requests or lots of requests for different accounts, there’s no problem because we’re updating balances for different accounts, and that works fine. And we can deal with lots and lots of requests. We have a very good performance on that.

But when we have lots of requests coming in for one particular account, and they’re all grouped for this particular account, then we cannot—there’s no way around locking at some place. So, you have to lock it either at the database level, or at a distributed locking mechanism level, or at the business logic layer. At some point, you have to lock the access to this account balance. So, this degrades performance because you have to wait for this processing to finish before starting another. And how can we deal with that without using locks?

And this was the challenge we put to ourselves. And we’re exploring different ways, lots of different ways, and different approaches to that. And we have lots of restrictions on that because this system has to respond quickly, has to respond online, and cannot be in an asynchronous process; it has to be synchronous. So, we have very little space for double-checking it and stuff. So, we’re exploring sharded processing for this one, in which we can have a small subset of accounts being routed to one specific consumer to process this transaction, and by doing so, we may have things like a queue of ordered transactions so we can give up locking at the database and maybe improve on performance. But we’re still on the POC on that, so let’s see what we come up with [laugh] in the next few months.
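
The routing idea can be sketched as follows. This is an illustration of the general technique only, not Pismo's proof of concept: `ShardedLedger`, its methods, and the choice of CRC32 as the hash are all assumptions. The point is that every transaction for a given account lands on the same shard and is applied in arrival order by that shard's single consumer, so no lock on the balance is needed. The sketch is single-threaded and queue-based, so it leaves out the synchronous-response requirement mentioned above.

```python
import zlib
from collections import deque


class ShardedLedger:
    """Hypothetical sketch: one queue and one balance map per shard."""

    def __init__(self, num_shards: int):
        self.queues = [deque() for _ in range(num_shards)]
        self.balances = [dict() for _ in range(num_shards)]

    def shard_for(self, account_id: str) -> int:
        # A stable hash, so an account always maps to the same shard.
        return zlib.crc32(account_id.encode("utf-8")) % len(self.queues)

    def submit(self, account_id: str, amount: int) -> int:
        shard = self.shard_for(account_id)
        self.queues[shard].append((account_id, amount))
        return shard

    def drain(self, shard: int) -> None:
        """Run by the shard's single consumer: applies transactions in arrival order."""
        queue, balances = self.queues[shard], self.balances[shard]
        while queue:
            account_id, amount = queue.popleft()
            balances[account_id] = balances.get(account_id, 0) + amount

    def balance(self, account_id: str) -> int:
        return self.balances[self.shard_for(account_id)].get(account_id, 0)


ledger = ShardedLedger(num_shards=4)
shards_used = {ledger.submit("acct-1", amount) for amount in (100, -30, -20)}
for shard in range(4):
    ledger.drain(shard)
```

After the drain, `ledger.balance("acct-1")` is `50` and `shards_used` contains exactly one shard. A burst of requests for one hot account becomes an ordered queue on a single consumer rather than contention on a shared lock.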

Jason: I think that’s really fascinating. Both from a, you know, having been there, having worked on systems where, you know, very transaction-driven, and having locks be an issue. And so, you know, back in my day of doing this, you know, was traditionally MySQL or Postgres, trying to figure out, like, how do you structure the database. So, I think it’s interesting that you’re sort of tackling this in two ways, right? You’ve got CockroachDB, which is more oriented towards reliability, but a lot of the things that you’re doing there around, you know, sharding and multi-cloud also have effects for this new work that you’re doing on how do you eliminate that locking and try to do sharded processes as well. So, that’s all super fascinating to me.

Mauricio: Exactly. Yeah, yeah. This is one of the things that makes you do better at the end of the day, you know? [laugh].

Jason: Yeah, definitely. As an engineer, you know, if anybody’s listening and you’re thinking of, “Wow, this all sounds fascinating and really cool stuff,” right, “Really cool technologies to be working with and really interesting challenges to solve,” I know, Mauricio, you said that Pismo is hiring. Do you want to share a little bit more about ways that folks can engage with you? Or maybe even join your team?

Mauricio: Yeah, sure. We’re hiring; we have lots of jobs open for application. You can go to pismo.io and we have a section for that. And also, you can find us on LinkedIn; just search for Pismo and then find us there.

And I think if you’re an engineer and looking for some cool challenges on that, be sure to check our open positions because we do have lots and lots of cool stuff going on here. And since we’re growing globally, you have a chance to work from wherever you are. And this also imposes some major challenges for [laugh] for new technologies and making our products, our existing products, work in a globally distributed banking system. So, be sure to check out our channels there.

Jason: Fantastic. Before we wrap up, is there anything else that you’d like to promote or share?

Mauricio: Oh no, I think those are the main channels. You can find us on LinkedIn and our own website, pismo.io. Also, you can find us at some GopherCon conferences, KubeCon, and other—Money20/20; we’re attending all of those conferences, be it in the software industry or in the financial industry. You can find us there with a booth or just visiting or participating in some conferences and so on. So, be sure to check that out also. I guess that’s it.

Jason: Very cool. Well, thanks, Mauricio, for joining us. It’s been a pleasure to chat with you again.

Mauricio: Thank you, Jason. And thanks for having me here.

Jason: For links to all the information mentioned, visit our website at gremlin.com/podcast. If you liked this episode, subscribe to the Break Things on Purpose podcast on Spotify, Apple Podcasts, or your favorite podcast platform. Our theme song is called “Battle of Pogs” by Komiku, and it’s available on loyaltyfreakmusic.com.

What is Break Things on Purpose?

A podcast about site reliability engineering (SRE); Chaos Engineering; and the people, processes, and tools used to build resilient systems. Sponsored by Gremlin. Find us on Twitter at @BTOPpod.