Technology Now

How do you keep a computer running non-stop? This week Technology Now explores the world of fault tolerant computing. We dive into how fault tolerance works, what industries use it, and why such a useful form of computing isn’t as ubiquitous as we might expect. Casey Taylor, Vice President and General Manager of HPE NonStop Compute, tells us more.

This is Technology Now, a weekly show from Hewlett Packard Enterprise. Every week, hosts Michael Bird and Aubrey Lovell look at a story that's been making headlines, take a look at the technology behind it, and explain why it matters to organizations.

About Casey Taylor: https://www.linkedin.com/in/getcaseytaylor
Our previous episode with Casey: https://hpe.lnk.to/missioncriticalfa

Sources:

https://edition.cnn.com/2024/07/24/tech/crowdstrike-outage-cost-cause
https://www.kovrr.com/reports/the-uk-cost-of-the-crowdstrike-incident
https://science.nasa.gov/mission/voyager/mission-overview/
https://science.nasa.gov/mission/voyager/where-are-voyager-1-and-voyager-2-now/
A. Avizienis, G. C. Gilley, F. P. Mathur, D. A. Rennels, J. A. Rohr and D. K. Rubin, "The STAR (Self-Testing And Repairing) Computer: An Investigation of the Theory and Practice of Fault-Tolerant Computer Design," in IEEE Transactions on Computers, vol. C-20, no. 11, pp. 1312-1321, Nov. 1971, doi: 10.1109/T-C.1971.223133. 
https://www.cs.unc.edu/~anderson/teach/comp790/papers/Siewiorek_Fault_Tol.pdf

Creators and Guests

Aubrey Lovell, Host
Michael Bird, Host

What is Technology Now?

HPE news. Tech insights. World-class innovations. We take you straight to the source — interviewing tech's foremost thought leaders and change-makers that are propelling businesses and industries forward.

MICHAEL BIRD
Right, Aubrey, I know you know Technology Now super well. Can you cast your mind back to Technology Now episode number 90? Sort of go through your …

AUBREY LOVELL
My index?

MICHAEL BIRD
Your index. You know it well, right?

AUBREY LOVELL
Well, by number, no, but I'm sure that it was wonderful.

MICHAEL BIRD
That was the episode we did about NonStop computing. We finished the interview with this.

CASEY TAYLOR
They should care because I think these days, the public have a high expectation for uptime. All we need to look at is, in the last six to 12 months, some of the really visible major IT outages that have brought down industries. And I think that they need to care about uptime to protect their brand reputation and to keep their commitments to their end customers.

AUBREY LOVELL
I think the saying goes, there's no time for downtime, right? So that's the world we live in. And I guess we're heading back to nonstop this episode, right?

MICHAEL BIRD
We absolutely are, and we’ve even got Casey back to tell us more.

I’m Michael Bird

AUBREY LOVELL
I'm Aubrey Lovell

And welcome to Technology Now from HPE.

MICHAEL BIRD
I don’t know about you, Aubrey – but in my day-to-day life, I don’t really think about IT systems going down... that is unless it affects me. If everything is up and running properly, it’s easy to forget how devastating an outage can be.

AUBREY LOVELL
Yeah, that's true. I mean, I feel like we've talked about this before, right, when something crazy happens. And I'm pretty sure there was a massive global outage last year that we talked about, where suddenly people couldn't pay for things using, you know, their cards. And I think over a thousand flights were cancelled. I mean, big things happen and they have big impacts.

MICHAEL BIRD
That’s right, and this outage was indeed global. According to CNN, insurers estimating the damage said the outage cost Fortune 500 companies over five billion dollars, and a report from cyber risk company Kovrr estimated that it could have cost the UK economy over two billion dollars. We have, of course, linked to these figures in the show notes.

You had it on your laptop, didn't you? I didn't have it on mine, but you had it on yours.

AUBREY LOVELL
Yeah, absolutely, we couldn't work. When you think about these things happening, it's almost like a cyber hurricane coming in and just absolutely destroying all of your processes and technology so that nothing is functioning, right? So it is pretty critical that we fix these things.

MICHAEL BIRD
Now, this wasn’t malicious or anything, it was a simple mistake, but even a mistake can be incredibly costly.

Later in the episode, I’ll be talking to Casey Taylor, the Vice President and General Manager of HPE NonStop, about the importance of fault tolerance in preventing unexpected downtime. But Aubrey, you have something insightful, always fascinating, to talk about first.

AUBREY LOVELL
You know I do. And you know what time it is because we are going to space in

…Technology Then.

AUBREY LOVELL
Okay, so it's 1977 and the human race are about to begin a mission to space which is still going on today. Michael, you're really good with this. Can you think what this might be?

MICHAEL BIRD
Ooh ooh ooh ooh! Yes, yes yes. These are the Voyager missions. The two probes that are still going to this day. I... I love the Voyager missions. I love hearing updates. So yeah, tell me more, tell me more, tell me more.

AUBREY LOVELL
So you were right, the Voyager missions. These two spacecraft were sent out into space to explore the distant solar system and have become the first human-made objects to enter interstellar space.

It is pretty cool, right? But this week, we don't care about them being in space or the scientific data they're collecting. What we're interested in are the computers on board and, more importantly, how they prevent faults which could make them fail. Because when your computer is well over 15 billion miles away, which is a very long way, and a command from you takes over 23 hours to reach it, solving any problems which arise becomes much more complicated.

MICHAEL BIRD
Talk about long distance relationships.

AUBREY LOVELL
I know, I know. And we were speaking recently about how that even works in space. It's pretty crazy that we even have communication that can go that far. But anyway, how do you avoid having to solve problems? Well, it's simple. You don't allow them to occur in the first place. And I'm secretly laughing because if only life were that easy. But anyway, the computers had to be able to not only self-diagnose any issues, but also self-repair them before the damage became an actual issue.

So the Voyager spacecraft use a technique called block redundancy for fault tolerance. Onboard Voyagers 1 and 2 are multiple redundant computers just waiting to be woken up and set to work if anything goes wrong with the main system. So sci-fi.

The list of errors is obviously far too long to read out. However, there is one I want to mention which is part of the command and control subsystem, and Michael, you're gonna geek out about this. Every two seconds it looks out for a message which basically translates to “I'm healthy”, as it constantly checks in on every other system before any messages are sent to make sure everything is working properly.
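As an illustration outside the conversation itself, the "I'm healthy" check and the dormant spare computers described above can be sketched as a small simulation. This is a minimal sketch, not Voyager's actual flight software: the names `Unit`, `RedundantBlock` and `simulate` are hypothetical, and the only detail taken from the episode is the two-second check interval.

```python
from dataclasses import dataclass, field

HEARTBEAT_INTERVAL_S = 2.0  # the episode mentions a check every two seconds


@dataclass
class Unit:
    """One computer in the block (hypothetical model)."""
    name: str
    healthy: bool = True

    def heartbeat(self) -> bool:
        """Return True if this unit reports "I'm healthy"."""
        return self.healthy


@dataclass
class RedundantBlock:
    """A primary unit plus dormant spares (assumed model of block redundancy)."""
    units: list
    active_index: int = 0
    log: list = field(default_factory=list)

    @property
    def active(self) -> Unit:
        return self.units[self.active_index]

    def check(self, now: float) -> None:
        """Run one heartbeat check; wake the next healthy spare on a miss."""
        if self.active.heartbeat():
            return
        self.log.append((now, f"{self.active.name} missed heartbeat"))
        for i, unit in enumerate(self.units):
            if unit.healthy:
                self.active_index = i
                self.log.append((now, f"switched to {unit.name}"))
                return
        raise RuntimeError("no healthy unit left")


def simulate(block: RedundantBlock, fail_at: float, ticks: int) -> str:
    """Step simulated time in heartbeat-sized ticks; the first unit fails at fail_at."""
    for tick in range(ticks):
        now = tick * HEARTBEAT_INTERVAL_S
        if now >= fail_at:
            block.units[0].healthy = False
        block.check(now)
    return block.active.name
```

Calling `simulate(RedundantBlock([Unit("primary"), Unit("backup")]), fail_at=4.0, ticks=5)` leaves the backup unit active, with the missed heartbeat and the switch recorded in `log`.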

MICHAEL BIRD
Man, I wish we could do an episode on the Voyager mission. I think it's one of the coolest achievements that we've done as humans. The fact that it's running technology from the 70s and is still going is, I think, a testament to how well built they are and, I guess, how much fault tolerance they have in the system.

AUBREY LOVELL
You're right, you know, especially when we talk about fault tolerance too, right? In modern computers, it's going to be a bit different to something that was built almost 50 years ago, right? We would assume that.

MICHAEL BIRD
Oh yeah, totally, and you know, the requirements are completely different. And of course there is the conversation around always-on, 100% uptime. So to find out what always-on, 100% uptime computing is used for today, I spoke to Casey Taylor. She's the Vice President and General Manager of HPE NonStop.

Now, we've already done an episode about HPE NonStop, which we will link to in the show notes. But for those of you who haven't had a chance to listen to that yet, I started off by asking Casey just to explain again exactly what NonStop is and what it actually means.

CASEY TAYLOR
NonStop ultimately is a platform for customers who are looking for extreme fault tolerance, and what it does really well is high-volume transaction processing with a built-in database and this kind of secret sauce of software that allows our customers to have an uninterrupted platform for their most mission-critical applications, right? So we think about this as a mission-critical platform with software and hardware that delivers ultimate uptime. And it has unlimited scalability, which means that they can add on additional nodes, you know, up to about 4,000 nodes, and that really gives our biggest customers, the ones looking to get the most performance out of their mission-critical workloads, a lot of flexibility and scale.

MICHAEL BIRD
With a lot of fault tolerance?

CASEY TAYLOR
Yes. So it’s hugely fault tolerant, and technically what we say is it's IDC level four in terms of fault tolerance. And we talk about five nines of availability. What five nines of availability means is that it's up 99.999% of the time, and that equates to about five minutes of downtime in a calendar year. But the reality for NonStop is that we're actually even better than that. We just don't promise that, right? We have had customers, including a major auto manufacturer, for example, who have been running our platform for more than 35 years, and they have not had any unplanned downtime in those 35-plus years. So, you know, it really is fault tolerant. Of course, there are always issues that can arise outside of our own platform, you know, natural disasters, etc. But basically, it's NonStop and it does what it says on the tin.
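As an aside to the conversation, the "about five minutes" figure follows directly from the availability percentage. The sketch below works through the arithmetic, assuming a 365-day year; the function name `downtime_minutes_per_year` is just an illustrative choice.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a non-leap year


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Convert an availability percentage into allowed downtime per year."""
    unavailable_fraction = 1 - availability_percent / 100
    return MINUTES_PER_YEAR * unavailable_fraction


if __name__ == "__main__":
    # Compare three, four and five nines side by side.
    for label, pct in [("three nines", 99.9), ("four nines", 99.99), ("five nines", 99.999)]:
        minutes = downtime_minutes_per_year(pct)
        print(f"{label} ({pct}%): {minutes:.2f} minutes of downtime per year")
```

Five nines works out to roughly 5.26 minutes a year, while three nines allows close to nine hours, which is why each extra nine matters so much for transaction systems.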

MICHAEL BIRD
How do you make a system that never goes down? What's the differentiator between this and a standard rack server that you can buy off the shelf?

CASEY TAYLOR
Well, the really cool thing is that this is our own IP in HPE, and it goes back 50 years to when a company called Tandem was incorporated and came up with this kind of visionary architecture design. It started out as being really in the hardware itself. They had physical pairs of nodes that were working in tandem, which is why the company was called Tandem. So rather than failing over to something if something went wrong, it was already doing everything twice, and therefore it could immediately move that transaction forward regardless of a component outage. That original architecture is really what started it all and made it different from the way that we architect other systems. But over time we have innovated, and what we've really done is abstract that fault tolerance away from the hardware and into the software.
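As an illustration outside the interview, the idea of a pair of nodes "doing it twice" can be sketched in a few lines. This is a toy model under stated assumptions, not how Tandem or HPE NonStop is actually implemented: `Node`, `Pair` and `NodeFailure` are hypothetical names, and a real system would also handle recovery and resynchronisation of the failed node.

```python
class NodeFailure(Exception):
    """Raised when a simulated node cannot complete its work."""


class Node:
    """One half of a pair; keeps its own copy of the account balances."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.up = True
        self.balances: dict[str, int] = {}

    def apply(self, account: str, amount: int) -> int:
        """Apply a transaction locally and return the new balance."""
        if not self.up:
            raise NodeFailure(self.name)
        self.balances[account] = self.balances.get(account, 0) + amount
        return self.balances[account]


class Pair:
    """Two nodes that both apply every transaction (hypothetical sketch)."""

    def __init__(self, a: Node, b: Node) -> None:
        self.nodes = [a, b]

    def apply(self, account: str, amount: int) -> int:
        results = []
        for node in self.nodes:
            try:
                results.append(node.apply(account, amount))
            except NodeFailure:
                continue  # the surviving node already holds the result
        if not results:
            raise NodeFailure("both nodes in the pair are down")
        return results[0]
```

Because both nodes have already applied every earlier transaction, losing one of them does not require replaying or restoring anything before the next transaction can proceed, which is the contrast with failing over to a cold standby.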

MICHAEL BIRD
So it's sort of software-defined fault tolerance.

CASEY TAYLOR
Exactly.

MICHAEL BIRD
So I've seen the term self-healing architecture used in reference to NonStop computing. How does that work?

CASEY TAYLOR
So it's interesting, really what this system is designed to do is constantly look for anomalies, errors and issues. And you can think of it maybe a little bit like the body's immune system, whereby if we're going to get sick, our immune system is looking for any issues or anomalies in our body. And what it's designed to do, if it's running properly, is to intercept that and heal it before we actually get sick. And that's exactly what the system is supposed to do. So it's going to look and see when there is an issue happening within the platform and it is going to correct it and in the meantime it's going to use the other non-impacted node to run the transactions while it heals itself on the other side.

MICHAEL BIRD
And does it use AI?

CASEY TAYLOR
We think about AI with NonStop as kind of AI adjacent. So at the moment, we are not planning to add a GPU or a DPU into our NonStop platform, which obviously are the processing units that give AI capabilities, right? But we do see a place for AI with NonStop, and the reason we see that as a really good match is that NonStop ultimately is the source of a lot of companies' mission-critical data. And what does AI need? It needs data, right? It needs a great data set without errors. And so we are partnering with some really innovative companies to create solutions that work in unison with the NonStop platform, but don't necessarily put the analytics of running that AI on the platform, right?

We would, in real time, intercept a transaction, for example, pull it off the NonStop, run the analytics on an adjacent machine, and then go straight back into the NonStop in real time. Fraud detection is a really great example of this. During a transaction, an adjacent machine working with the NonStop will be able to spot a fraudulent transaction in real time. It's using AI to do that, but we're not running the AI workload on the NonStop machine, because ultimately our customers really want to make sure that we guarantee that transaction. That's kind of the foundational idea of NonStop. And so, you know, we have to remain focused on what's important for our customers while introducing AI in a safe way.
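As an illustration outside the interview, the "AI adjacent" pattern described above can be sketched as a transaction path that asks a separate scorer for a fraud score within a deadline. Everything here is an assumption for the sake of the example: `process`, `score_with_deadline`, the 0.9 threshold and the policy of committing an unscored transaction when the scorer is slow or failing are hypothetical choices, not a description of any HPE or partner product.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def score_with_deadline(executor, scorer, txn, timeout_s):
    """Ask the adjacent scorer for a fraud score; None if it is slow or fails."""
    future = executor.submit(scorer, txn)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        return None  # scorer missed the deadline
    except Exception:
        return None  # scorer crashed; the transaction path keeps going


def process(ledger, executor, scorer, txn, timeout_s=0.2, threshold=0.9):
    """Commit a transaction unless the adjacent scorer flags it in time.

    Assumed policy: a missing score never blocks the transaction, so the
    core ledger does not depend on the availability of the AI component.
    """
    score = score_with_deadline(executor, scorer, txn, timeout_s)
    if score is not None and score >= threshold:
        return "declined"
    ledger.append(txn)
    return "committed" if score is not None else "committed_unscored"
```

The design point of the sketch is the separation: the scorer runs in its own worker (standing in for the adjacent machine), and the ledger update happens on the transaction path regardless of whether the scorer responds.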

MICHAEL BIRD
So, where do we go from here? Like, I assume you haven't worked out a physics-defying way of having more than 100% uptime.

CASEY TAYLOR
Yeah, correct, I don't think there's anything greater than 100%. But there are ways that we can continue to be better. We have to stay true to our roots of fault tolerance, and that will always be our guiding North Star, because it's our differentiator. But we have to look for ways to innovate to bring a better experience to our customers, to ensure that the NonStop systems are, as I say, playing nice with other enterprise tools, and that we are modernizing the NonStop platform in a hybrid cloud environment while remaining true to those core foundational aspects of availability, scalability, and security.

MICHAEL BIRD
So why is NonStop computing not built into the fabric of standard servers? Wouldn't it make sense for everyone to use a system which never goes down? Is it like a cost thing? Is it just the practicality of it?

CASEY TAYLOR
Cost is definitely part of it, for sure. And, you know, part of NonStop's secret sauce is its own operating system, and that's a good and a bad thing, right? I mean, ultimately the NonStop OS is Linux-like, but it is not Linux, and so it requires some special management, I would say. It's actually really easy to run by itself. So the number of administrators and operators that you need to run a NonStop system versus your standard Linux, you know, x86 environment is very few in comparison, which is great in terms of the total cost of ownership view, right? We keep that down because it kind of runs itself. But it is different and there is a cost involved. I mean, everything costs, and if you want ultimate uptime and availability, then you're paying for that value through the software that we offer. And I think that's why, you know, “good enough” is out there. There are high availability systems that are not fault tolerant, right? I guess that's the difference. You could think about high availability as trying to minimize downtime. Fault tolerance is preventing downtime altogether. That's the goal.

MICHAEL BIRD
And I suppose there are some particular workloads where you absolutely cannot have downtime, financial transactions on a ledger to some extent.

CASEY TAYLOR
Exactly, so that's why around 70% of our customer base is in the financial services industry. It is incredibly important, and sometimes, for sovereign nations, we are the backbone of their banking infrastructure. So yes, absolutely, any outage or any downtime can be devastating, right? These days we rely so heavily on being able to transact from our phones, online or at an ATM, and so things are absolutely mission critical, and that's why it's a real sweet spot for NonStop.

MICHAEL BIRD
What sort of numbers are we talking about? How many of the world's financial transactions go through NonStop? Do you know?

CASEY TAYLOR
Well, you know, we have to be careful about what we can say, but I would say 6 out of the top 10 corporate banks in the world run their corporate banking system on NonStop. In the US, around 90% of credit card transactions run through a NonStop system. Some of the major rental car companies use NonStop for their reservation systems as well as for transactions and payments. And so you can see that on any given day, the general public could be interacting with a NonStop system multiple times and not realize it.

MICHAEL BIRD
Wow, my goodness. Okay. So do you expect to see more NonStop computing being used as AI workloads increase?

CASEY TAYLOR
I think that we have to try to do what we can to help our customers embrace AI. Obviously, as a company, what we're trying to do is make AI accessible and available to enterprises, no matter the size. I think that NonStop plays a part in that. But as I said, we have specific business groups within HPE that are focused on how we optimize for AI workloads, and that's not what NonStop is known for or the value that we bring. But we absolutely need to make sure that we are, as I say, AI adjacent, and making sure that we're bringing the AI to NonStop rather than running AI on it.

MICHAEL BIRD
Because, as you said, AI is powered by data and fundamentally you're processing data.

CASEY TAYLOR
Exactly, yeah, and we are housing some of our customers' most critical data. Built into the NonStop platform is our SQL database, and that's the fault-tolerant database that we offer as part of that full software stack. And of course, yeah, that is where some of our customers' most mission-critical data resides.

AUBREY LOVELL
Well, all I can say to that is that I wish that my internet provider was fault tolerant.

But that's actually really interesting and kind of cool to understand not only the outputs of why that's so critical, but also the thinking and the architecture as well of how that works and how it's like a backup to a backup. That's pretty cool.

MICHAEL BIRD
Yeah, yeah, I thought, you know, the fact that on any given day it's very likely that you'll be interacting with a NonStop system if you're buying anything on the internet or, you know, interacting with your bank, I think that's a real testament to just how trusted they are.

What I thought was also interesting was when I asked Casey, you know, why isn't every workload running on a completely fault tolerant system? And actually I think she gave the correct answer, which boils down to: use the right tool for the right job. There's no need for some systems to be completely fault tolerant, and actually high availability is all that you need. And I think that is so reflective of everything in our industry and the way that we approach problems: use the right tool for the right job.

AUBREY LOVELL
Mm-hmm. Well said.

MICHAEL BIRD
Finally, like many of the people that we have on this podcast, Casey didn't always want to work in tech. Any thoughts about what you think Casey might have wanted to be before she did what she is doing now?

AUBREY LOVELL
My first guess is some sort of medical field, somebody important in the medical field, I feel like.

MICHAEL BIRD
Well, I'm not gonna give it away. I'll let Casey say.

CASEY TAYLOR
I wanted to be a paediatrician, a doctor for children and I actually started out my university career on that path and within about six months I decided I wanted to switch to business because you know the reality of dissecting things and being in a lab doing biology kind of hit home. But yeah that was my dream as a child.

MUSIC STING

AUBREY LOVELL
Okay that brings us to the end of Technology Now for this week.

Thank you to our guest, Casey Taylor,

And of course, to our listeners.

Thank you so much for joining us.

As always, all of our sources are linked in the show notes so make sure to check them out if you want to delve deeper into non-stop computing.

If you’ve enjoyed this episode, please do let us know – rate and review us wherever you listen to episodes and if you want to get in contact with us, send us an email to technology now AT hpe.com and don’t forget to subscribe so you can listen first every week.

MICHAEL BIRD
Technology Now is hosted by Aubrey Lovell and myself, Michael Bird.
This episode was produced by Harry Lampert and Izzie Clarke with production support from Alysha Kempson-Taylor, Beckie Bird, Allison Gaito, Alissa Mitry and Renee Edwards.

AUBREY LOVELL
Our social editorial team is Rebecca Wissinger, Judy-Anne Goldman and Jacqueline Green, and our social media designers are Alejandra Garcia and Ambar Maldonado.

MICHAEL BIRD
Technology Now is a Fresh Air Production for Hewlett Packard Enterprise.

And we’ll see you next week. Cheers!