The Space Industry

Episode 83 of The Space Industry podcast by satsearch is a conversation with Adrian Helwig, Analog Field Application Engineer, and Michael Seidl, Systems Engineer from Texas Instruments (TI), about designing space systems with integrated Fault Detection, Isolation, and Recovery (FDIR) strategies.

TI is a global electronics manufacturer with a wide portfolio of space-grade components to support space missions across the spectrum.

In the episode, Adrian, Michael and satsearch COO Narayan Prasad Nagendra discuss:
  • FDIR as a complex, critical sequence in space system design: Since equipment in space cannot be manually repaired, systems must quickly and reliably detect faults, isolate the damaged unit (e.g., by switching it off), and recover mission operations, often by engaging a redundant unit.
  • Trade-off between reliability, performance, and cost: Engineers face this trade-off particularly when selecting components that must withstand extreme environments (radiation, temperature cycles) and long missions (LEO vs. GEO/Deep Space). Using non-space-grade parts introduces significant risk and defeats the purpose of FDIR.
  • Effective fault containment based on integrated, smart strategies: Fault propagation is best prevented by strategies that avoid unnecessary complexity, using methods like galvanic isolation, fast load switches, and highly integrated space-grade components that incorporate diagnostics and can execute complex decision-making based on multiple sensor inputs (voltage, current, temperature).
You can find out more about TI on their satsearch supplier hub. And if you would like to learn more about the space industry and our work at satsearch building the global space supply chain, please take a look at our blog.

[Music from Uppbeat (free for Creators!): https://uppbeat.io/t/all-good-folks/when-we-get-there License code: Y4KZEAESHXDHNYRA]

What is The Space Industry?

The Space Industry by satsearch - sharing stories about the businesses taking us into orbit.

We delve into the opinions and expertise of the people behind the commercial space companies of today, who could become the household names of tomorrow. Find out more about the companies and technologies discussed on this show at satsearch.com.

Narayan (00:00)
Hi and welcome to The Space Industry podcast by satsearch. My name is Narayan, COO at satsearch, and I'll be your host as we journey through the space industry.

The space sector is going through some seismic changes, promising to generate significant impact for life on Earth and enable humans to sustain life elsewhere in the cosmos. At satsearch, we work with buyers and suppliers across the global marketplace, helping to accelerate missions through our online platform.

Based on our day-to-day work supporting commercial activity, my aim here during this podcast is to shed light on the boots-on-the-ground developments across the globe that are helping foster and drive technical and commercial innovation.

So come join me as we delve into a fascinating, challenging, and ultimately inspiring sector.

Narayan (00:58)
Hello and welcome back to The Space Industry podcast. Today we're going to be speaking with Michael Seidl and Adrian Helwig from Texas Instruments. In short, TI is a global semiconductor manufacturing company with expertise in analog and embedded processing chips. TI’s space portfolio includes over 270 active products.

In this episode, Michael and Adrian will be sharing insights on integrated Fault Detection, Isolation, and Recovery (FDIR) strategies, specifically addressing the design of space systems.

Michael and Adrian, welcome back to The Space Industry podcast. It's always terrific to have you guys here and look forward to hearing more about FDIR and the topic that we have here to discuss with you today.

Michael Seidl (01:48)
Good morning.

Adrian Helwig (01:49)
Good morning.

Narayan (01:50)
Great. So let's begin by talking about what fault detection, isolation, and recovery really means. For the audience's sake, I think it's important to set the tone on what the subject means, because obviously not everybody is an expert in this. So I would love to have you guys begin by talking about what FDIR really means and why it is important.

And then once you're done with that, there's always the question about balancing reliability and complexity, right? When you consider a system that is being built, there is always going to be a mean time to failure for that particular system. And the moment engineers start adding components to it, the reliability of the system obviously suffers, right? So there is this fine balance that engineers need to find between reliability and complexity of the system. If you can add some insights into that from the perspective of FDIR, it would be a great way to begin this episode.

Michael Seidl (02:59)
Yeah. Narayan, very happy to set the stage a little bit on this. FDIR, or if you spell it out: Fault Detection, Isolation, and Recovery. That is really the sequence that has to happen in case something goes wrong with the equipment in space. Remember, you cannot send up a technician to repair anything up there. So the system must be able to recover itself. Along that sequence, first, a fault has to be reliably detected, of course. Once it is detected, there has to be a strategy in place to isolate the fault from the system. For example, you turn the failing unit off, or even better, you separate the unit from the satellite, maybe with the help of a relay or something else that truly galvanically isolates it. And then, very important, the system has to recover. Only if it truly recovers are you able to save the mission's purpose. And that's what we're talking about here: how to do this.

Michael Seidl (04:12)
And in your question, you hinted at it already. FDIR is there to assure or improve the reliability of the system. It's a dilemma because you are most likely adding components, and added components come with their own unreliability. So they're adding to the budget of the risk assessment in the wrong way. That dilemma is why we need to find ways to optimize the systems and make sure we're really giving a positive contribution to the risk mitigation of the mission. So, as we say, adding parts is not desired because of the size, weight, and power and also the cost requirements, of course. But we also avoid adding parts because they contribute a FIT rate (Failures In Time) of their own.

Michael Seidl (05:07)
So that's one thing we need to address. The next thing is there is the topic of failure propagation. So if something fails, we need to make sure that failure does not propagate through and bring the whole satellite down.

So we need to find ways to stop it there. As an example to keep in mind: you have an FPGA that, because of a failing LDO or DC/DC converter, is all of a sudden connected directly to the 5-volt rail, but it's only able to handle 3.3 volts. Now on the other side, the output that should be at 3.3 volts may also come out at 5 volts, and that then destroys the next group downstream.

So that is what we need to make sure we avoid. So we need to quickly detect and we need to then quickly turn off things. And how to do those things is not entirely trivial. What people also put in there to recover the system is redundancy. But redundancy is not always the best approach.

First, it adds to the size, weight, and power and cost; essentially you're doubling the PCBs, right? And further, you need to add intelligence and actual switches for deactivating and isolating the broken unit and activating the redundant unit. Ideally, we make sure all the components we add come with a very strong FIT rate. And it's even better if we don't have to add components, meaning components already in the system provide some hooks, some capabilities like a current limiter or a fault output, or maybe we also have MCUs in the system, like the safety MCUs used in automotive, that help us prevent the worst from happening.
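
[Editor's note: the dilemma described above can be put into rough numbers. The sketch below assumes the common constant-failure-rate model, in which FIT means failures per 10^9 device-hours and the FIT rates of parts that must all work simply add. The part names, FIT values, and the simplified redundancy formula are hypothetical placeholders chosen for illustration, not TI data.]

```python
import math

HOURS_PER_YEAR = 8760.0

def reliability(total_fit: float, years: float) -> float:
    """Survival probability over `years` for a constant failure rate.

    FIT = failures per 1e9 device-hours, so the hourly rate is FIT / 1e9.
    """
    rate_per_hour = total_fit / 1e9
    return math.exp(-rate_per_hour * years * HOURS_PER_YEAR)

# Hypothetical FIT values, for illustration only.
base_unit_fit = {"fpga": 50.0, "dcdc": 20.0, "ldo": 10.0}
fdir_parts_fit = {"current_sense": 5.0, "comparator": 3.0, "load_switch": 7.0}

mission_years = 15.0

# Series model: every part must work, so the FIT rates add up.
fit_plain = sum(base_unit_fit.values())
fit_with_fdir = fit_plain + sum(fdir_parts_fit.values())

r_plain = reliability(fit_plain, mission_years)
# Added parts that contribute FIT but enable no recovery only hurt.
r_with_fdir_parts = reliability(fit_with_fdir, mission_years)

# Simplified hot redundancy (assumption): two identical units, either one
# is sufficient, with the switch-over circuitry in series with the pair.
r_switch = reliability(sum(fdir_parts_fit.values()), mission_years)
r_redundant = (1.0 - (1.0 - r_plain) ** 2) * r_switch

print(f"single unit:               {r_plain:.4f}")
print(f"single unit + added parts: {r_with_fdir_parts:.4f}")
print(f"redundant pair + switch:   {r_redundant:.4f}")
```

[With these made-up numbers, the added parts lower the reliability of a single string, and they only pay off once they enable recovery through the redundant unit. That is the balance the speakers refer to.]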

Narayan (06:58)
Right. And when you talk about the emergence of NewSpace, one of the things you have to realize is the trade-off that engineers need to make between achieving good performance, which in space is typically also driven by the radiation tolerance of the components, and balancing that with prices and lead times, of course.

Adrian, can you walk us through the practical considerations when choosing between radiation tolerant components, which in TI terminology you call Space Enhanced Plastic components, for Low Earth Orbit constellations, and radiation hardened components, typically called space grade components, for deep space or GEO missions? What are your insights on this?

Adrian Helwig (07:54)
Yes, sure. I can try to do this. I can give you a very short answer, and that would be: choose your parts and mitigation to match your radiation environment, mission duration, and component criticality. Because obviously one size does not fit all. But if you think about it, there are more details to it. Let me explain a little bit.

So for those high volume and cost sensitive LEO constellations, you will typically accept so-called radiation tolerant components, sometimes with additional system-level mitigation techniques that need to be implemented. That's one possibility. But for GEO or the deep space missions you mentioned, you very often need truly radiation hardened silicon and even more shielding, because for those missions the total dose requirements as well as the single event requirements are much higher, and the failure consequences are usually also greater.

Adrian Helwig (09:07)
So regarding our portfolio at Texas Instruments, we are offering parts in QML Class V as well as QML Class P qualification, which means 50 kilorad or more, typically 100 kilorad, and more than 60 MeV [MeV·cm²/mg, the unit of linear energy transfer] for single event performance. At the same time, we are also offering so-called Space Enhanced Products for LEO missions, and those are typically between 20 and 50 kilorad and 43 MeV.

So we need to keep in mind that the orbit as well as the mission duration are the key factors in deciding on the quality of a component. For example, a one-year LEO mission has very different TID requirements compared to a 15-year GEO mission, right, where your TID and single event requirements are pushed even higher. So that's another aspect of this topic.

Adrian Helwig (10:11)
I also wanted to comment on something else here. Very often someone could think: for this cost sensitive application, I can simply use an automotive qualified part. That may seem like a valid approach, but as always, we need to look into the details. And if we look into the details, you will find out that it's actually a very bad idea. Let me give you some examples. Material selection: when talking about automotive parts and space parts, we are using a different material set for those components. And think about this: those LEO satellites need to withstand extreme temperature cycles. You probably know, from minus 150 degrees C to plus 150 degrees C in less than 90 minutes. That's one example. Another example: vibration during launch. It also requires a different material set to build a device, right?

Adrian Helwig (11:23)
And last but not least, think about this. We are really permanently optimizing our automotive parts to save costs. This means for a customer, buying a part today compared to a part that they will buy in three to five years, for example, even if it's the same part number, the same electrical specification, the radiation performance could be very different. So please keep this in mind. And this can mean for a customer additional cost, maybe some delays in the mission schedule, right? So keep this in mind.

So to summarize, my recommendation for high volume LEO constellations with short to medium lifetimes would be to use radiation tolerant devices. We call them SEP, where the majority is qualified up to 50 kilorad and 43 MeV. And for GEO or deep space, telecom or science missions which require a longer lifetime, use rad-hard devices, which typically offer 100 kilorad and more than 60 MeV. So that would be my recommendation when selecting the right part for the right orbit.
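
[Editor's note: the selection rule summarized above can be sketched as a small lookup. The two rating pairs are the typical figures quoted in the episode (individual parts differ, so the datasheet remains the reference). The annual dose figures, the design margin of 2, and all function names are hypothetical assumptions for illustration; real dose estimates come from a radiation analysis of the specific orbit and shielding.]

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Grade:
    name: str
    tid_krad: float    # total ionizing dose rating in kilorad
    let_rating: float  # single-event rating as quoted in the episode

# Typical figures quoted in the episode, not guarantees for any given part.
SEP = Grade("radiation tolerant (SEP)", tid_krad=50.0, let_rating=43.0)
RAD_HARD = Grade("radiation hardened", tid_krad=100.0, let_rating=60.0)

def required_tid_krad(annual_dose_krad: float, mission_years: float,
                      margin: float = 2.0) -> float:
    """Accumulated dose over the mission times an assumed design margin."""
    return annual_dose_krad * mission_years * margin

def pick_grade(annual_dose_krad: float, mission_years: float,
               let_needed: float, margin: float = 2.0) -> Optional[Grade]:
    """Return the lowest grade that covers both requirements, else None."""
    needed = required_tid_krad(annual_dose_krad, mission_years, margin)
    for grade in (SEP, RAD_HARD):
        if grade.tid_krad >= needed and grade.let_rating >= let_needed:
            return grade
    return None  # neither class suffices: add shielding or other mitigation

# Hypothetical missions: dose rates are placeholders, not measured values.
leo = pick_grade(annual_dose_krad=2.0, mission_years=5.0, let_needed=43.0)
geo = pick_grade(annual_dose_krad=3.0, mission_years=15.0, let_needed=60.0)
print("LEO example:", leo.name if leo else "needs extra mitigation")
print("GEO example:", geo.name if geo else "needs extra mitigation")
```

[The point of the sketch is only that orbit and mission duration enter the requirement together, as Adrian notes; component criticality would be a further input in a real trade study.]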

Narayan (12:35)
Actually we do see this on our own platform, and your recommendations are in line with how we observe the market as well. So now let's dive a little deeper into the actual fault detection, isolation, and recovery decision-making architectures. From an engineering perspective, you could go from very simple logic-based fault detection systems on the left end of the scale to a full-fledged microcontroller unit that can do some really intelligent control on the right end. So from your experience, Michael, when is it better to go for the left end of the scale versus the right end?

Michael Seidl (13:16)
Yeah. I think what really proves to be a very good principle is to keep things as simple as possible. You may remember the famous quote from reliability engineering: "simplicity is a prerequisite for reliability". I think that holds very true and is a very strong guidance, I would say. The challenge is just that when we look at our satellites, they are very far from simple. They are actually very complex, and the complexity is even going up. So that is something we really need to keep in mind, and we need to see how we can make our lives easier.

So the first recommendation I want to give is, try to work as much as possible truly with space-qualified products to give yourself a strong foundation with a strong inherent risk mitigation. Otherwise, one would have to keep all those extra risks from all those non-qualified components in mind as you do your calculations. If something is space-qualified, you can most likely just put it in the lump sum of things that don't contribute much.

Michael Seidl (14:26)
The other angle is the complexity of the semiconductor products, and that can cover a very wide range, right? They can be a simple LDO, a simple op-amp, or as you said, a full-fledged MCU that of course has a lot of complexity, and it's very hard to get your hands around that if you really want to analyze it all yourself. So the more complex the device, the more important it is that you take something that is space-qualified, and ideally it's even designed by the vendor with full risk mitigation in mind, like a safety MCU as we have it in the automotive space. That way you can abstract the risk of the device as a whole, making it simple enough in total. You don't have to worry about every gate, every transistor, every output buffer in this device. You just take the total FIT rate of the device, and if that is low enough, you may even be able to, yeah, almost forget about it and concentrate on the other risks you still have to mitigate.

Narayan (15:36)
Right. Your quote is very interesting for me because you mentioned simplicity being a prerequisite for reliability, and it brings back memories of a professor I worked with, who said that the only way of ensuring 100% reliability in a system is to not have the component at all. So that way you have this dilemma as a designer, right, at the end. So now let's focus a bit on fault containment itself in such systems. What strategies do you think have proven really effective in preventing fault propagation in tightly integrated subsystems on a satellite?

Adrian Helwig (16:21)
Yeah, that's a very interesting question. And to be honest, when talking to our customers, that's really a challenging task, because engineers need to manage this fault propagation, particularly in those tightly integrated systems, where the failure of one component can propagate through the whole system. The risk is really at the electronics level, at the component level, where, for example, an overvoltage event from a Point of Load converter can spread to other subsystems of the satellite, causing a catastrophic failure. One effective strategy to prevent this fault propagation is to implement galvanic isolation. This is a method that ensures that electrical faults do not transfer through the whole system, which protects the downstream components from damage.

Adrian Helwig (17:23)
Now, while this is a gold standard, it cannot always be implemented, right? It's not feasible in every design. Another possibility is to use isolating switches such as relays or eFuses; this kind of device can also provide a robust solution. But you need to keep in mind that those switches need to be designed to switch very fast.

Another example: if we think about power isolation, we can work with isolated power supplies and isolated power topologies. Here I especially have in mind the reference designs we are offering on TI.com, for example a Flyback design, which is an isolated power topology. One example could be the PMP23546, which as I mentioned you can find on TI.com. This particular design uses a newly released PWM controller with integrated GaN driver, the TPS7H5020-SP. So that's one possibility, using isolated power supplies and power topologies.

Adrian Helwig (18:36)
When we talk about signal isolation, please look at a device like the ISOS141-SEP. It's a radiation tolerant digital isolator using our capacitive isolation technology instead of the well-known traditional optical isolators. This device supports data rates up to 100 megabits per second and can offer you really reliable signal transfer, which is essential when talking about fault propagation in those tightly integrated subsystems. Now, if designers need to deal with analog signals, there is also a very interesting part, the newly released ISOS510-SEP. This is pretty interesting because this device is a transistor output opto-emulator with analog behavior. It is a radiation tolerant device, and this can again be very important when maintaining fault isolation.

Narayan (19:39)
Great. And obviously we are here discussing FDIR, and the keyword at the beginning is fault detection. I would love to have you guys briefly discuss how we even detect the fault.

Michael Seidl (19:54)
Yeah. Very important question, actually, how to detect. And it's not just that we detect it; we need to detect it quickly, especially in the absence of any galvanic isolation, because then we really have to prevent the fault from propagating through the system. So we need to be very fast. And of course we also have to be very reliable, right? Because we don't want false alarms. If we have many false alarms, we may have issues with the availability of the overall system, right? Permanently rebooting or just trying to recover instead of doing the actual work. So looking at fault detection, we first of course need a sensor. Typically what we're monitoring are physical parameters like voltage, current, or temperature. I would say these are the most common parameters to monitor.

Michael Seidl (20:47)
For current sensing, TI offers a device called the INA901-SP and another one called the INA950-SEP as a rad tolerant option. Both of these devices are extremely well suited here because they offer a wide common mode input range of up to 65 volts for the INA901-SP and even 80 volts for the INA950-SEP, making them suitable for a wide range of systems and voltages. Further, these devices have an optimized bandwidth to assure fast detection of an overcurrent event, while at the same time having a high power supply rejection and a very fast settling time, so that we get very accurate sensing and avoid false alarms. That means we have high sensitivity and accuracy without compromising the availability of the system.

Michael Seidl (21:52)
For voltage comparison or any threshold detection, for example on the output of our current sensor, TI has just released the TLV1H103-SEP. That is a comparator with only 2.5 nanoseconds of delay. So it is very rapid and can very quickly make the actual decision whether we need to turn something off or not. I think that's a very important ingredient of the system.

Temperature, as we said, is another physical parameter we need to monitor. If something goes wrong, something often goes up in temperature. So for one thing it's useful for early fault detection, but you can also use it for thermal management, to limit the stress on the electronics as much as possible.

Michael Seidl (22:45)
So there are space-grade temperature sensors from TI. These ICs offer really high accuracy and very minimal design overhead. One example here is the TMP461-SEP and another one the TMP9R01-SEP, so here too we have rad-hard and rad tolerant options for our customers. They take advantage of the predictable temperature dependence of the silicon bandgap, and that way they achieve an accuracy of better than plus minus 0.1 degree Celsius.

As said, we always want to keep the component count very low for FDIR reasons, and these devices come with a lot of integrated features that I think are very exciting. There's the excitation current generation for the temperature sensor inside. There is also the analog to digital converter, the ADC, including the input driver. And there is the window comparator for the actual fault detection. The TMP9R01-SEP even supports up to eight external sensor inputs. So, a really high level of integration, assuring you're not adding too many components for your FDIR capability.

So the comparator alarm signals you're generating from the temperature sensor can be used, of course, to turn something off, or better, to switch it away totally to isolate it. This is where another device comes to mind: a load switch, the TPS7H2221-SP. That device again is not just a load switch; it also has capabilities that further add to the overall robustness and recoverability of the system. What's integrated here is short-circuit protection. There's an inrush current limiting capability to reduce the stress on the upstream components and upstream power supply. There's a thermal shutdown with an automatic restart. And there's also a very important feature called quick output discharge, or QOD for short.

That feature is really very interesting. Say you detect an overcurrent, and most likely that comes from a latch-up situation at a downstream component. All the energy still in the circuit keeps harming the device. This load switch offers an alternative path for the charge into ground, no longer through the latched-up device. With the quick output discharge capability, we really increase the chances that we can rescue the downstream components.
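
[Editor's note: the detection chain described above (sense voltage, current, and temperature, compare each against a window, and use the alarm to open a load switch) can be sketched in a few lines. In the hardware discussed, this decision is made by comparators within nanoseconds; the Python below only illustrates the logic. The limits, signal names, and function names are hypothetical and not taken from any datasheet.]

```python
from typing import NamedTuple

class Window(NamedTuple):
    low: float
    high: float

    def violated(self, value: float) -> bool:
        """True when the value lies outside the allowed window."""
        return not (self.low <= value <= self.high)

# Hypothetical limits for a 3.3 V rail, for illustration only.
LIMITS = {
    "voltage_v": Window(3.0, 3.6),
    "current_a": Window(0.0, 1.2),
    "temp_c": Window(-40.0, 85.0),
}

def detect_faults(sample: dict) -> list:
    """Return the names of all monitored parameters outside their window."""
    return [name for name, window in LIMITS.items()
            if window.violated(sample[name])]

def load_switch_enable(sample: dict) -> bool:
    """Keep the load connected only while no window is violated."""
    return not detect_faults(sample)

nominal = {"voltage_v": 3.3, "current_a": 0.4, "temp_c": 25.0}
overvoltage = {"voltage_v": 5.0, "current_a": 0.4, "temp_c": 25.0}
print(detect_faults(nominal), load_switch_enable(nominal))
print(detect_faults(overvoltage), load_switch_enable(overvoltage))
```

[The overvoltage sample mirrors Michael's earlier example of a 3.3-volt rail suddenly seeing 5 volts: the window check flags it and the enable signal for the load switch drops, isolating the downstream circuitry.]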

Adrian Helwig (24:26)
And there is also something, if I may add, Michael, because you have described how to detect all the failures, but there is something else. Sometimes it could be very shortsighted to simply assume, without any intelligence, that a unit has failed for good because of a single event. That's something we need to think about, especially because very often the redundant units we have in our subsystems are limited to only one, right? So if your redundant unit is engaged very early in the mission, it has to stay reliable for the remaining duration of the mission. I think it's very important to enhance those fault detection mechanisms a little bit further and maybe add a more complex decision-making process to them.

Adrian Helwig (25:21)
This can be done, for example, with microcontrollers. Why? Because a microcontroller can analyze various monitoring signals. Michael was talking about this; think about different voltages, temperatures, currents. Based on those, a microcontroller can, for example, control retries after a fault. And to collect all those external monitoring signals, you will need an ADC. I'm thinking here about the very well known ADC128S102, which TI offers in both a full space version and a Space Enhanced Plastic version. This is a 12-bit, 8-channel ADC which is well suited for this task.

Now to give you an example of a microcontroller, please check the TMS570LC4357-SEP. That's our dual-core lockstep microcontroller, which can take the capability I just described even further, because this device is designed with high integrity systems in mind. It uses a certified development process that aligns with safety standards like ISO 26262, ensuring minimal systematic faults. And the dual-core lockstep architecture allows for real-time fault detection, which could be really critical for several missions.

So to summarize it a little bit, a combination of galvanic isolation, signal and power isolation technologies, as well as those advanced microcontroller capabilities with real-time detection features, can allow designers to really effectively contain faults in those tightly integrated systems. And at the same time you can maintain the system, the mission safety and longevity. So that's something I wanted to add to that.
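
[Editor's note: the retry idea Adrian describes, where a microcontroller does not engage the single redundant unit on the first fault, can be sketched as a small decision policy. This is a minimal illustration under stated assumptions: the class, the action names, and the retry limit are hypothetical, and a real implementation would also weigh the sensor readings that triggered the fault.]

```python
from enum import Enum

class Action(Enum):
    RETRY = "power-cycle the active unit and retry"
    SWITCH_TO_REDUNDANT = "isolate primary, engage redundant unit"
    SAFE_MODE = "no healthy unit left, enter safe mode"

class RecoveryPolicy:
    """Decide what to do after a faulty unit has been isolated."""

    def __init__(self, max_retries: int = 3) -> None:
        self.max_retries = max_retries
        self.retries = 0
        self.on_redundant = False

    def on_fault(self) -> Action:
        # A transient single-event fault may clear after a power cycle,
        # so retry the active unit before giving up on it.
        if self.retries < self.max_retries:
            self.retries += 1
            return Action.RETRY
        # Retries exhausted: treat the fault as permanent.
        if not self.on_redundant:
            self.on_redundant = True
            self.retries = 0
            return Action.SWITCH_TO_REDUNDANT
        return Action.SAFE_MODE

    def on_healthy_period(self) -> None:
        """Reset the retry budget after the unit has run without faults."""
        self.retries = 0

policy = RecoveryPolicy(max_retries=2)
history = [policy.on_fault() for _ in range(6)]
for step, action in enumerate(history, start=1):
    print(step, action.name)
```

[The design choice illustrated here is that the one redundant unit is treated as a scarce resource: it is engaged only after repeated retries fail, which keeps it in reserve for as much of the mission as possible.]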

Narayan (27:12)
Great. That's terrific. So let's get into a specific topic where we can look at an example. You did mention redundancy in the previous answer. And when you talk about smart redundancy in power systems, for example, I'm sure that TI has had experience dealing with tens if not hundreds of teams building such systems for satellites. So can you give some real-world examples where redundancy in satellite power systems had to be carefully designed to avoid introducing any new single points of failure?

Michael Seidl (27:48)
Yeah. I think the most common example for redundancy are definitely power supplies. Since power is so fundamental for any circuit, it must be designed with highest reliability in mind. On top of this, designers do then even add a redundant unit to the system to further mitigate the risk. So this is where we see this quite a bit.

And to bring a practical example of optimized redundancy, as you asked: that is nicely presented in a joint white paper by Texas Instruments and STAR-Dundee, detailing a fault protected power architecture for the Xilinx KU060 FPGA. This application brief, called "Power Supply for the STAR-Tiger SpaceFibre Routing Switch", demonstrates redundant power input management, proper power sequencing, and comprehensive fault detection and isolation mechanisms. And you'll be surprised by the very low number of components added.

Michael Seidl (28:57)
This design, for example, uses the TPS7H2201-SP, a smart load switch that integrates overvoltage and undervoltage protection, overcurrent protection and current sensing, along with thermal protection and internally or externally controlled load switching. That device enables really strong integration and allows us to implement all that logic with very low overhead. Further, this design uses the SN54AC00-SP for the rad-hard logic devices. To extend this a little bit, meanwhile we have also brought rad tolerant logic devices to the market. Interestingly, these devices out of the SCXT logic family provide single supply level shifting across a broad voltage range, from 1.2 volt logic up to 5.5 volts. That is interesting because it eliminates any need for additional level translators in the system.

Michael Seidl (30:04)
And further, beside the standard parts like a NOR gate or AND gate, like the SN54AC02-SEP, that's the NOR gate, or the SN54AC08-SEP, a four channel AND gate, there is also a very interesting device called SN54SC3T97-SEP. This 97 device is a configurable logic device. One of the three inputs determines what type of logic gate it is, and the other two pins are the regular logic input pins. That allows you to buy just a single part number, so only one procurement item, but it gives you all the flexibility in the system to build what you want there. So that is, I think, a very strong example, and I want to recommend that people really take the time to read it and see how things can be done.

To summarize it all: implementing FDIR in electronic designs for space missions is definitely a very complex task. For sure you need to use components that can withstand the radiation and the temperature extremes, and also hold up over long mission durations. That is something standard commercial components are really not built for. And we need to keep in mind that if one uses non-space products for FDIR functions, one would most likely add to the risks rather than do the system a favour and reduce them.

To help designers as best we can, TI offers products with integrated diagnostics and fault-handling features that reduce the overhead of these FDIR efforts. Further, dedicated solutions enable effective isolation and avoidance of any fault propagation. And TI can also provide support for the full range of system recovery strategies, from a simple switch-over up to complex decision-making based on multiple sensor inputs, like the safety MCU that Adrian pointed out. So a lot of things are out there, and this is where we really want to invite customers to refer to ti.com/space for further information on the topic, where you will find whitepapers, including FDIR-focused whitepapers, application notes, reference designs, and many more things.

Narayan (31:07)
Great. Again, Adrian and Michael, very insightful as always. As a rundown, I think we discussed everything from having you guys explain what FDIR is to various aspects such as balancing reliability and complexity, radiation trade-offs, and architectures for FDIR, and the examples you mentioned were really great. So thank you again for being a part of this episode, and I look forward to hosting you guys again for one of our future topics.

Michael Seidl (31:39)
Thank you so much.

Adrian Helwig (31:40)
Thank you very much.

Narayan (31:42)
Thanks for joining me today for another exciting story from the space industry. If you have any comments, feedback or suggestions, please feel free to write to me at info@satsearch.com.

And if you're looking to either speed up your space mission development or showcase your capabilities to a global audience, check out our marketplace at satsearch.com.

In the meantime, go daringly into the cosmos, till the next time we meet.