Talking Autonomy

Fusing Vision & Radar for Resilient Perception

How Ghost uses late fusion to combine sensor modalities

By Ghost

August 22, 2023


Matt Kixmoeller
Welcome to Talking Autonomy. In previous episodes, you've heard about Ghost Vision (KineticFlow), Ghost Radar, and lane perception, and this is where it all comes together: Fusion. My friend Prannay here is gonna walk us through how we take all of those different signals and turn them into an actual view of the scene that we can drive with. So Prannay, welcome to Talking Autonomy.

Prannay Khosla
Great to be here, Matt.

Matt Kixmoeller
Prannay, why don't you introduce yourself real quick. What do you do for Ghost and what have you done in your career before this?

Prannay Khosla
So I'm a model engineer. I work across the stack, and I'm responsible for putting it all together and writing out the drive program. I have a computer science degree, and before this I spent my time in high-frequency trading, where I was mostly, again, looking at noisy data and trying to extract signals. So there are a lot of similarities there, focusing on problems with a very low signal-to-noise ratio.

Matt Kixmoeller
It's really interesting to hear about how the work in high-frequency trading translates to driving. And although they're very different domains, there's a lot of commonality between them.

Prannay Khosla
I think in the end, it's just simple regressions everywhere. Building pipelines and workflows that let you work with a lot of data consistently across a lot of use cases, finding ways to scale that out, finding how it all generalizes, and then testing it rigorously. All of those things translate immediately, and the rest is just applying different first principles.

Matt Kixmoeller
Let's get into some of the details here at least on the fusion dimension of it. So why don't we just start at a very high level. Can you explain to us what fusion is all about? And I think you brought some visuals that can kind of show us what happens under the hood?

Prannay Khosla
Fusion is basically the point where you have all your inputs coming in. You have multiple sensors from the same sensor modality, and you're extracting signals in different ways. You're trying to understand traffic flow, you're trying to understand obstacles. You have Vision, you have Radar, you have lanes, and you're trying to put it all together to construct the scene that you can plan off of. This is where a lot of things get interesting, because you don't just care about the calibration of one sensor; you care about the calibration of all the sensors and how they relate to each other. So in the scene here, in red you can see the information coming from Radar, and in green the information coming from Vision. We overlay them and find the most likely positions of obstacles. We also track how it all flows over time, and we put it in the context of lanes, which tells us: what is the velocity of our current lane? What's the velocity of the lanes around us? We have to control for everything in order to give the maximal amount of safety and comfort, and extracting all of that information into very simple-to-understand scene variables is basically the name of the game here.
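
To make the "overlay and find the most likely position" idea concrete, here is a minimal sketch of fusing one vision range estimate and one radar range estimate by inverse-variance weighting. The class names, fields, and numbers are illustrative assumptions, not Ghost's actual implementation.

```python
# Minimal sketch: combining a radar and a vision estimate of the same
# obstacle's position with inverse-variance (maximum-likelihood) weighting.
# Names and numbers are illustrative, not Ghost's code.
from dataclasses import dataclass

@dataclass
class Measurement:
    position_m: float   # longitudinal distance to the obstacle, meters
    variance: float     # measurement noise variance, meters^2

def fuse(vision: Measurement, radar: Measurement) -> Measurement:
    """Gaussian ML fusion: weight each estimate by its inverse variance."""
    w_v = 1.0 / vision.variance
    w_r = 1.0 / radar.variance
    fused_pos = (w_v * vision.position_m + w_r * radar.position_m) / (w_v + w_r)
    fused_var = 1.0 / (w_v + w_r)    # fused estimate is tighter than either input
    return Measurement(fused_pos, fused_var)

if __name__ == "__main__":
    # Vision is confident about this nearby position; radar is noisier on position here.
    print(fuse(Measurement(42.0, 1.0), Measurement(44.5, 4.0)))
```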

Matt Kixmoeller
I guess one of the opportunities comes from the fact that we have multiple sensors that are looking at the same scene, just in their own way. So you have Vision that is detecting obstacles, but also giving you distance and velocity. You have Radar that's detecting obstacles as well and giving you distance and velocity. And so fusion is bringing these two different modalities together into one view.

Prannay Khosla
That's right. And the interesting thing about fusion is that it helps you correct for errors in interesting ways, because the Radar unit sits in a different place in the car. It is exposed to different kinds of noise as the car pitches up and down. The cameras are in a more stable position, but cameras are limited by how far they can see. Radar is limited by a lot of false alarms close by. So our approach has been to use a given sensor in the environment where it shines. Vision will nearly never miss anything close by. Radar will nearly never be bad at estimating velocities. So fusion is where we go and correct for errors between different locations, but also between the different modalities in the data, because in the end it's all electromagnetic waves. But how you measure them really matters, and you're working in different frequency domains. That is why you're able to extract the signals in different ways, at different frequencies.

Matt Kixmoeller
So we'll get into a little more on where each of the sensor types shines and how to think through that process in a second. But when people talk about fusion, there's often a debate about whether you do early fusion or late fusion. I'll spoil the surprise: we do late fusion. But why don't you describe the two approaches and why we do late fusion?

Prannay Khosla
So the reason that picking where you do fusion is interesting is that it determines what your debugging and development cycle looks like for each of those signals, and the outcomes you derive from them. Early fusion versus late fusion determines where you build your models: the point at which you combine your information should be the point where you have the best signal-to-noise ratio to do it. For early fusion, that usually means combining as early as possible, because you want access to the rawest, most complete information, and you trust that you'll be able to extract the signal from it anyway. Late fusion approaches start from the view that when you have too much information, it is harder to extract the signal from it. So you want to compress each input down to essentially just its signal, and then you have these two or three sets of signals coming in and you combine them at that point. Our approach has been to stick to late fusion for multiple reasons, the biggest ones being the development productivity we get from it and our ability to switch between different sensors in different scenarios: different vehicles, partial hardware upgrades where we mix and match, or handling sensor failures. It's been very easy to think through those scenarios, while with early fusion some of these problems are a little more hairy, even though we are giving up some extra information in the process, information that is sometimes noise, but sometimes also signal.
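
As a rough illustration of why late fusion makes it easy to swap or drop sensors, here is a sketch in which each sensor pipeline hands the fusion stage only a compact list of detections. The names, fields, and values are assumptions made up for this example, not Ghost's code.

```python
# Illustrative late-fusion wiring: each sensor pipeline reduces its raw data
# to a compact list of detections, and the fusion stage only ever sees those
# lists. Swapping a sensor for another vehicle, mixing hardware generations,
# or dropping a failed sensor just changes which lists get passed in.
from dataclasses import dataclass

@dataclass
class Detection:
    x_m: float           # longitudinal position, meters
    velocity_mps: float  # closing velocity, meters per second
    confidence: float    # 0..1, assigned by the per-sensor model

def fuse_scene(detection_lists: list[list[Detection]]) -> list[Detection]:
    """Late fusion: combine already-extracted detections from each modality."""
    merged: list[Detection] = []
    for dets in detection_lists:     # one list per live sensor pipeline
        merged.extend(dets)
    # Placeholder for association/merging; a real system would cluster
    # detections of the same obstacle and weight them per modality.
    return sorted(merged, key=lambda d: d.x_m)

# A camera pipeline and a radar pipeline each hand over their own detections;
# a failed sensor simply contributes an empty list.
vision_dets = [Detection(38.0, -2.1, 0.95)]
radar_dets = [Detection(39.2, -2.4, 0.80)]
scene = fuse_scene([vision_dets, radar_dets])
```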

Matt Kixmoeller
My understanding is also that part of the reason a lot of the early approaches used early fusion is that the Radar was maybe not quite as intelligent as new-generation automotive radar. So there was a lot of noise, and it was pretty important to fuse it with a camera early on to get a sense of: okay, the camera found a car here, so I can listen to the radar in that specific spot and ignore the noise in the rest of the scene. But we're in a position where we have a very capable Vision stack and a very capable Radar stack, and we're trying to treat each independently.

Prannay Khosla
That's right. And a software-defined Radar gives us the ability to feed information back without having to do early fusion. It doesn't preclude our ability to handle all of the noise downstream, because a software-defined radar usually also gives us a much denser signal in the point clouds we're looking at.

Matt Kixmoeller
So gimme some practical examples. Because we're doing late fusion, we can, in different situations, listen more to Vision signals or more to Radar signals depending on the scene. So give me some situations and how we think about them differently.

Prannay Khosla
Let's say you are driving on an open road, a highway, and you're seeing obstacles very far away from you. Now, in Vision, these are gonna show up as really small obstacles, and that's true for humans as well. We usually see something very far away, but at that distance we don't have very accurate perception of its velocity. We just kind of have this idea of, okay, it's expanding toward us or not, and then we just keep driving that way. This is where Radar will shine, because it outputs very accurate velocity. And what that allows you to do is drive with both comfort and safety. For comfort, you want to appropriately start slowing down into traffic if you see density ahead of you, right? Figuring out density is much easier with Radar than it is with Vision, and so is velocity. But now, let's completely switch the situation: now you're in stop-and-go traffic. In stop-and-go traffic, the most important thing is to make sure that you see everything around you. You want a very low-latency system; you do not want any lag in the system. This is where Vision will shine, because you will see everything, even if it's very close up. When you're in really dense traffic, Radar often has all of these multipath problems and side lobes. You have trucks that have detections coming from different parts of the truck, from the wheels, from under the truck, from behind the truck, and all of that. Vision really is much simpler, you only see what is around you, and in this case Vision will completely shine. But are you just switching modalities when you go from one to the other? This is where life gets interesting: in every domain you have one modality which is, in some sense, superior, but each has its quirks as well. For example, even in stop-and-go, you want extremely accurate velocity. So once you know where everything is and what you care about from Vision, figuring out its velocity from Radar is really simple. On the other hand, if you go back to the open-road scenario and you see something really far away, you do not know its height. There is no way you can differentiate between a firetruck and a bridge, and that is where Vision comes in and allows you to do that. So each has its scenario where it shines, but also helps in the other scenarios as well.
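
Here is a minimal sketch of that range-dependent trade-off: lean on Radar for velocity at long range, and lean more on Vision up close while still letting Radar refine the speed. The crossover ranges and weights are made-up numbers, not Ghost's actual tuning.

```python
# Sketch: which modality you lean on depends on the scene. At long range,
# radar's Doppler velocity is trusted more; up close, vision contributes more.
# All thresholds below are illustrative assumptions.
def radar_velocity_weight(range_m: float) -> float:
    """Fraction of the fused velocity estimate taken from radar."""
    if range_m > 100.0:   # open road, distant traffic: radar velocity shines
        return 0.9
    if range_m < 20.0:    # stop-and-go: vision localizes, radar still refines speed
        return 0.5
    # smooth interpolation in between
    return 0.5 + 0.4 * (range_m - 20.0) / 80.0

def fused_velocity(range_m: float, v_radar: float, v_vision: float) -> float:
    w = radar_velocity_weight(range_m)
    return w * v_radar + (1.0 - w) * v_vision

# e.g. distant traffic on an open highway: the fused velocity is mostly radar's
v = fused_velocity(150.0, v_radar=-3.0, v_vision=-1.0)   # -> -2.8 m/s
```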

Matt Kixmoeller
Yeah, it's not about one or the other, it's about using the different signals you can get out of each modality in different ways. We've just been talking about near and far, but there are differences in weather situations, where Vision might be affected by the weather and Radar can see through some of that. There are differences in tunnels.

Prannay Khosla
Weather is really interesting because, yes, there are occlusions. Vision, as a modality, is really sensitive to even small occlusions like salt, and we handle all of that. But having Radar obviously helps, because at that point you know what your ground truth is, and it's really easy to disambiguate whether there's actually something there or just noise. Again, that's where fusion comes in. I think tunnels are also particularly interesting for fusion, because when you go inside a tunnel, the scene suddenly becomes really dark. There are often a lot of reflections in the tunnel for Radar, and that is where, suddenly, both of your signals start showing their noise and error modes. But when you fuse them together, they will only agree where there is actual information, and everything else one modality or the other is just gonna reject. Driving in tunnels is also interesting because GPS doesn't work there. As I said before, we combine everything we have from all our sensors, and from the lanes neural network as well. Sometimes the only way to drive in a tunnel is through obstacle information, controlling against the obstacles in the scene. That is something that can only happen if you combine your entire scene and put all the relevant information together to extract the right scene variables out.
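
One simple way to picture "they only agree where there is actual information" is an agreement gate that keeps a fused obstacle only when both modalities roughly agree on range. The tolerance here is a hypothetical number, not Ghost's, and a real system would reason over full tracks rather than single ranges.

```python
# Illustrative agreement gate: in a hard scene like a tunnel, keep a fused
# obstacle only where radar and vision roughly agree on range; a return seen
# by just one modality is treated as that modality's noise.
def modalities_agree(vision_range_m, radar_range_m, tol_m=3.0):
    if vision_range_m is None or radar_range_m is None:
        return False
    return abs(vision_range_m - radar_range_m) <= tol_m

# A real obstacle seen by both sensors passes; a radar multipath ghost with no
# matching vision detection is rejected.
assert modalities_agree(41.0, 42.5)
assert not modalities_agree(None, 12.0)
```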

Matt Kixmoeller
Fascinating, lots of different cases and a lot that goes into fusion. It's just not as simple as, you know, listen to this sensor here, listen to that sensor there. I know one final topic that you spent a lot of time thinking about is calibration. We do calibration for individual sensors, but there's a calibration that happens as all the sensors come together.

Prannay Khosla
Calibration for a single sensor is focused on figuring out how it is oriented with respect to some point in the car. As the car moves, there are going to be distortions, there's gonna be road noise, there's gonna be somebody touching something, and all of that affects every sensor. Fusion is the point where you have to put things together, and if you want to reject information, you need to make sure that you're not compounding noise with errors in your calibration. Nothing's gonna be perfectly calibrated, and that actually opens up the question: can you even do fusion if you do not have proper sensor-to-sensor calibration? Our solution has been that solving the problem of fusion and solving the problem of correcting for small errors in calibration happen at the same time: you simultaneously correct for the errors, find the most likely fusion, and then go back and ask, okay, with this fusion, how consistent are the errors I thought I had in the system? This is a feedback loop that you run with all your sensors and across time to converge to whatever error you have right now, and to improve your fusion estimates. That really goes back to the question you were asking earlier about late fusion versus early fusion and extracting the maximal amount of signal. The purpose of that was to be able to combine information to do two things, both calibration and fusion, not just one or the other while assuming the other one worked.
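
A toy sketch of that calibrate-while-fusing feedback loop, assuming the calibration error can be summarized as a single scalar radar-to-vision range offset (a big simplification; names and numbers are illustrative, not Ghost's method):

```python
# Alternate between (1) fusing matched vision/radar ranges given the current
# offset estimate and (2) nudging that offset toward the mean residual between
# the two modalities, smoothed across frames.
def update_offset(matched_pairs, offset_m, smoothing=0.1):
    """matched_pairs = [(vision_range_m, radar_range_m), ...] for one frame."""
    if not matched_pairs:
        return offset_m
    residual = sum(v - (r + offset_m) for v, r in matched_pairs) / len(matched_pairs)
    return offset_m + smoothing * residual  # move slowly, reject single-frame noise

offset = 0.0
frames = [[(40.2, 39.0), (75.5, 74.1)], [(12.3, 11.2)], [(60.0, 58.9)]]
for matched in frames:
    offset = update_offset(matched, offset)
# Run over enough frames, the offset converges toward the systematic ~1.1 m
# radar-vs-vision range bias in this toy data, and the fused estimates improve.
```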

Matt Kixmoeller
So it seems like it's inherent to the fusion process?

Prannay Khosla
That's right.

Matt Kixmoeller
All right, so there you have it: Fusion, where Ghost brings together visual perception, radar perception, and lane detection into a single fused view of the scene so that we can drive. Prannay, thank you very much.

Talking Autonomy
Technology