People & Music Industry

Aki Mäkivirta, R&D Director at Genelec, talks to Sam Inglis about the latest developments in immersive audio, from calibrating your system using their GLM software to personalising your headphone monitoring experience with Aural ID technology.

Chapters
00:00 - Introduction
00:34 - Immersive Audio
02:07 - Loudspeaker Density
05:44 - Automating Calibration With GLM
08:10 - Choosing Your Speaker System
11:27 - Mixing Immersive Music
13:29 - Aural ID For Audio Professionals
18:45 - Mixing For Different Surround Formats
20:57 - Improving How Audio Is Received
22:19 - The Future Of Immersive Audio

Genelec Biog
For over 40 years, Genelec studio monitoring solutions have delivered truthful, neutral sound reproduction — enabling engineers and creatives to make accurate and reliable mix decisions, even in challenging rooms.

Founded in Finland by childhood friends Ilpo Martikainen and Topi Partanen, the company’s first monitor, the S30, instantly became the blueprint for Genelec’s future direction. Its active design delivered consistent performance, total reliability, and the ability to adapt to the acoustic environment it was operating in.

Genelec’s growing range of Smart Active Monitors works closely with GLM calibration software, allowing each monitor to be completely optimised for the room, producing mixes that translate consistently to the outside world, from stereo to immersive.

For headphone users, Genelec’s latest Aural ID software delivers a more truthful, reliable and completely personalised listening experience, allowing the user to switch between monitors and headphones seamlessly.

Aki Mäkivirta Biog
Aki Mäkivirta joined Genelec in 1995. He originally worked for the Nokia Research Centre and teamed up with Ari Varla of Genelec during a joint venture between the two companies, where Mäkivirta demonstrated how to replace analogue filters with digital processing using the 1031A nearfield monitor. As a result, Mäkivirta joined Genelec to pioneer the creation of the original 8200 series of Smart Active Monitors, before becoming R&D Director in 2013.


https://www.genelec.com/

Sam Inglis Biog
Editor In Chief Sam Inglis has been with Sound On Sound for more than 20 years. He is a recording engineer, producer, songwriter and folk musician who studies the traditional songs of England and Scotland, and the author of Neil Young's Harvest (Bloomsbury, 2003) and Teach Yourself Songwriting (Hodder, 2006).

https://www.soundonsound.com

Catch more shows on our other podcast channels: https://www.soundonsound.com/sos-podcasts

What is People & Music Industry?

Welcome to the Sound On Sound People & Music Industry podcast channel. Listen to experts in the field, company founders, equipment designers, engineers, producers and educators.

More information and content can be found at https://www.soundonsound.com/podcasts | Facebook, Twitter and Instagram - @soundonsoundmag | YouTube - https://www.youtube.com/user/soundonsoundvideo

Hello, and welcome to the Sound On Sound People & Music Industry podcast with me, Sam Inglis. Today, I'm delighted to be joined by Aki Mäkivirta of Genelec in Finland. Welcome, Aki. Thank you.

I wonder if we could start by finding out a little bit about you. What's your role at Genelec? I'm head of R&D, so I'm responsible for all new product designs, running the projects and getting things done in the R&D section. Excellent. So in this podcast, we're going to be focusing on an area that's seen major development at Genelec and at loudspeaker companies generally in recent years, and that's immersive audio.

Now, I think a lot of our listeners will be familiar with older surround formats such as 5.1 and quad. Could you briefly explain what's different about modern immersive audio formats compared with those older formats? Yeah, the main difference is that the older formats basically give you audio at one level.

If you think of height, there's one layer, and that layer is usually located at ear height. What immersive audio gives you in addition to that is the sensation of the height dimension, much more than what you used to be able to get from the earlier formats. Maybe apart from quadraphonic, but quadraphonic had the challenge that it was mainly a single-person system.

So one person could experience the space, but the difference from that to the current-day immersive formats is that immersive formats can be enjoyed by more than one person at a time. And that is mainly determined by the loudspeaker density, or channel density, that you use for creating the recording.

In other words, in a loudspeaker-based immersive monitoring system, the more loudspeakers, the better. Yes, and this particularly can be done with object-based formats, more than channel-based formats. That's where the world is going these days: you see more and more object-based formats turn up.

Because those have the additional capacity to render audio at the time of presentation, and that is the key to being able to cover an arbitrary number of loudspeakers: you have the possibility of increasing the loudspeaker density at the time of presentation so that you can cater for a wider audience reasonably well. I guess one of the other key benefits of these modern immersive formats is that they are scalable across different loudspeaker arrays.

In the old channel-based formats like 5.1, you had to have five speakers and a subwoofer, otherwise you couldn't really play it back. Exactly, yes. And that's why I would like to talk about loudspeaker density instead of talking about the number of loudspeakers and number of channels, because now, with the capability of using rendering, you can actually decide how many loudspeakers per square metre you want to have, and that gives you certain benefits.

And is there a minimum loudspeaker density that you think is necessary in order to work effectively with these formats? That depends on how much positional distortion you allow at the time of presentation.

So if you have only one listener, you can pretty much have a fairly low density, because you know more or less where that listener is going to be located. But if you have more than one, if you have an audience (so apart from the house owner, you also have some relatives sitting on the same sofa), then you have an area that you want to be able to cover for the presentation.

And this means that you have to exploit the fact that people can locate the physical loudspeakers exactly where they are, irrespective of where people are seated in the room. Whereas, as in most audio presentations, you are actually creating virtual sound images between the loudspeakers, using two or three or more loudspeakers

together to create the virtual image in thin air. And this virtual image in thin air is going to move immediately when the listener is moving, but it's going to stop moving at the real loudspeaker. So by increasing the loudspeaker density, you can adjust the position of presentation for everybody in the room, even for large audiences.

So with a low loudspeaker density, you are in effect reliant on something akin to what we used to call the phantom centre in a stereo setup, whereas the more speakers you add, the more you're able to locate things directly to a speaker. In a sense, it would be fair to say that in all situations you are still relying on the phantom images.

But the question is how much the phantom image can move, given that the listeners are moving. If you have a higher loudspeaker density, then any phantom image can move less relative to the listener, so you have less distortion appearing. You will still have distortion for off-axis listeners, but it will be less noticeable.
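To put a textbook equation behind that idea (it isn't one Aki quotes here): the stereophonic "law of tangents" models how the perceived image angle $\theta$ of a phantom source depends on the gains $g_L$ and $g_R$ fed to a symmetric pair of loudspeakers at angles $\pm\theta_0$, for a listener at the central sweet spot:

$$\frac{\tan\theta}{\tan\theta_0} = \frac{g_L - g_R}{g_L + g_R}$$

Move off-centre and the differing path lengths to the two speakers add timing cues the model doesn't account for, pulling the image towards the nearer speaker. That pull is the positional distortion that a denser loudspeaker layout keeps small.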

I think one of the things that probably puts a lot of people off diving into setting up an immersive rig, besides the cost, is the sheer complexity, or apparent complexity, of the setup process, and the fact that a space that might be OK for two speakers could be problematic for more speakers.

Now, at Genelec, you've done a lot of work on addressing these problems, a lot of it through your GLM speaker management system. Can you explain a little bit about how that works? Yes, the main problem with immersive loudspeaker layouts is that it is very difficult to place all the different loudspeakers and channels that you want to have in the room in acoustically similar environments.

For example, some loudspeakers would be close to the ceiling, some would be far away from the ceiling, some would be close to the side wall, and some may be far away from the side wall, depending on where the main listening position is located in the room. And in that situation, what happens is that you will have spectral differences between loudspeakers.

And unless you take steps to equalise the effects that the room creates on the loudspeakers, you will have a very uneven presentation of the total sound. In order to solve this issue, we are providing an automatic tool that is able to take precise measurements of the frequency response for every single loudspeaker in the room, and then to compensate each one of those individually so that they become as flat as possible, as neutral as possible.

So irrespective of the location, you get better mapping of the sound after that process. So as the user, this is a process I can do myself? Once I've got the speakers fixed to their positions, it's not something I need to get a consultant in to come and do for me? Yeah, absolutely. And it takes you a couple of minutes.

We tried equalising a 7.1.4 system today, and it took us maybe five or 10 minutes to work through automatically. The whole system gets the frequency responses compensated, times of flight adjusted and levels aligned, so the whole thing is properly calibrated for accurate monitoring after that.
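GLM's actual algorithms aren't spelled out in this conversation, but the alignment steps Aki lists are easy to illustrate. Below is a minimal Python sketch of the arithmetic behind two of them, delay (time-of-flight) alignment and level matching, with every speaker name and measurement invented for the example:

```python
# Sketch of per-speaker delay and gain alignment (illustrative only;
# not Genelec's implementation). Distances and SPLs are made-up values
# standing in for what a measurement microphone would report.

SPEED_OF_SOUND = 343.0  # metres per second, at roughly 20 degrees C

speakers = {
    # name: (distance to listening position in m, measured SPL in dB)
    "left":     (2.10, 85.0),
    "right":    (2.10, 84.2),
    "centre":   (1.95, 86.1),
    "top_left": (2.60, 83.5),
}

farthest = max(d for d, _ in speakers.values())
quietest = min(spl for _, spl in speakers.values())

for name, (distance, spl) in speakers.items():
    # Delay nearer speakers so all arrivals coincide with the farthest one.
    delay_ms = (farthest - distance) / SPEED_OF_SOUND * 1000.0
    # Trim louder speakers down to match the quietest one.
    trim_db = quietest - spl
    print(f"{name:9s}  delay {delay_ms:5.2f} ms   trim {trim_db:+5.1f} dB")
```

The third step, compensating each speaker's in-room frequency response, would fit the same loop as a per-speaker equaliser designed from a measured response, but that filter design is beyond a short sketch.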

So which qualities are key to ensuring that the immersive effect comes across properly in a monitoring system? Well, there are a number of items, and they are all basically familiar to us. We obviously all know about the level difference between loudspeakers, which is able to move your virtual sound image.

We use this in the form of level panning in mixing consoles, so the level difference is one obvious source of distortion. The other one would be timing distortion: if the time it takes for audio to fly from one loudspeaker is different from the time it takes for audio to fly from another loudspeaker, and you use these as a stereo pair,

then again you would have a shifting, a moving, of the virtual sound images. So these are fairly straightforward and easy to understand, but you get a more complex effect if you have a frequency-dependent level variation that is different between the two loudspeakers. In other words, the frequency responses from your left and right would be different.

You would get a frequency-specific moving of the virtual sound image. That's pretty complex, and it can make your audio images fuzzy or not well focused. On top of that, you could even have time-domain-related inaccuracy. In other words, if you are not using the same loudspeaker type for the left and right loudspeakers, then you would get timing differences that are frequency-specific.

Again, you would have complex changes to how your virtual sound image is being created. It becomes less focused and you would lose accuracy in monitoring. So you have all of these different effects. Some of these are easier to compensate for, some are more difficult. And the easy formula to apply would be to always use the same loudspeaker make and model for the left and right loudspeakers in your stereo pair, or, if you have an immersive system and you can, it would be a good idea to have all the loudspeakers of the same type.

Most of the time this may not be possible, because you have some space constraints or other factors like this affecting your choice of loudspeakers. In that case, it would be a good idea to select loudspeakers that have similar performance (a neutral frequency response, a constant time delay through the loudspeaker) and use those to build your system.

Even after that, it's always a good idea to take care that you adjust the levels to be the same from all loudspeakers to the listening position, and that you adjust the times of flight to be the same from all loudspeakers. And if your room is giving you these kinds of spectral effects that are specific to each loudspeaker, then it would be a good idea to compensate for those as well.
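How would you actually obtain those times of flight? One standard measurement technique (again, not necessarily what GLM uses internally) is to play a known test signal through each speaker and find the lag at which the microphone capture best correlates with it. A minimal sketch with synthetic signals:

```python
# Measuring one speaker's time of flight by cross-correlating a known
# test signal with its recording at the listening position. Illustrates
# the general technique only; the signals here are synthetic.
import numpy as np

def time_of_flight(reference: np.ndarray, captured: np.ndarray, fs: int) -> float:
    """Return the delay, in seconds, of `captured` relative to `reference`."""
    corr = np.correlate(captured, reference, mode="full")
    lag = int(np.argmax(corr)) - (len(reference) - 1)  # peak = best alignment
    return lag / fs

fs = 48_000
rng = np.random.default_rng(0)
test = rng.standard_normal(fs)      # one second of noise as a stand-in sweep
true_delay = int(0.0062 * fs)       # ~6 ms, as if the speaker were ~2.1 m away
recording = np.concatenate([np.zeros(true_delay), test])

print(f"measured delay: {time_of_flight(test, recording, fs) * 1000:.2f} ms")
```

Repeating this per speaker gives the per-channel delays that the alignment step in the earlier sketch then equalises.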

So immersive audio is already pretty well established in cinema, in some areas of broadcast and in gaming. Immersive music mixing is perhaps a newer thing, although we're seeing a major push on it at the moment from companies like Apple. Are there any particular considerations that apply when you're thinking about immersive audio for music mixing, or is it exactly the same as those other use cases?

In principle, for immersive music, you want to understand the delivery channel to the end user, and you want to understand how your recording is going to be treated. You want to adhere to the standard that is being applied. So if, for example, there is a specific standard loudspeaker layout that you're supposed to use, then it's a very good idea to stick to that and make sure that you follow that standard.

And then, if it's possible (sometimes it is), it's always a good idea to check how your delivery channel is going to change the product that you're creating. If there are going to be some changes for whatever reason, there could be coding, there could be things like this that can change how your recording appears, then it's a good idea to try to understand that effect as much as possible.

So if there's any possibility, for instance, that your immersive mix will be folded down to stereo in some circumstances, it's important to check that it's going to sound OK when that happens. Absolutely. If you can do that, it would be a very good idea, because if there's anything you can do in your immersive mix to make sure that this fold-down happens successfully and you get the artistic impression that you want to have in the mixed stereo output, then it's a very good idea to try to understand how this is going to work for you.
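As a concrete illustration of what a fold-down can do (the specific coefficients aren't discussed in the interview): a widely used convention, in the spirit of ITU-R BS.775, mixes a 5.1 bed to stereo with roughly -3 dB contributions from the centre and surround channels, which is why centre- or surround-heavy elements can shift in balance in the stereo version:

```python
# Sketch of a conventional 5.1-to-stereo fold-down. The -3 dB weights
# follow a common ITU-R BS.775-style convention; real delivery chains
# may apply different gains, limiting, or LFE handling.
import numpy as np

MINUS_3DB = 10 ** (-3 / 20)  # about 0.707

def fold_down_5_1(L, R, C, LFE, Ls, Rs):
    """Return (Lo, Ro) stereo signals from 5.1 channel arrays."""
    Lo = L + MINUS_3DB * C + MINUS_3DB * Ls  # LFE is often simply dropped
    Ro = R + MINUS_3DB * C + MINUS_3DB * Rs
    return Lo, Ro
```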

And of course, the consumption of immersive music is predominantly taking place on headphones rather than on speaker systems, whereas I would imagine that most of the actual mixing is taking place on speaker-based immersive setups. That can seem like quite a big gap to bridge.

You've got this potential translation issue between the two formats. But at Genelec, you've developed a system called Aural ID, which is designed to help bridge that gap. Perhaps you could explain a little bit about what that is and how it works. Yes, Aural ID is basically a method of acquiring your own way of hearing audio.

What we do is capture your head-related transfer function: this is what happens to audio depending on its direction of arrival at your ears. We then offer this in a form where it can be used for processing audio for headphone-based monitoring, so that you gain accuracy and you gain important qualities in monitoring over headphones.

One of these is the fact that instead of hearing audio inside your head, you have the possibility of experiencing sound externalised, out of your head, and much more in a normal position where you would expect the sound to come from relative to you. One of the key elements of this system that you've developed is a new way of measuring head-related transfer functions.

For those of you who don't know what a head-related transfer function is, it's basically a set of impulse responses that describes the impact of the physical shape of your head, your ears and your torso. Until recently, that could only really be measured by actually putting microphones in your ears and recording test tones.

But you've come up with a new system that does this visually. Can you tell us a little bit about how that works? Yes, it's basically a two-step process. The first step is that we collect a video around you. This means that we get a sequence of images that are linked together, so that the direction of looking at you is slowly changing as we work our way around you.

Then we can use this set of images to create your shape. The method that we use for this is called photogrammetry, and basically photogrammetry can calculate the shape that gives rise to all these different images. Once we have your shape, we can use standard physics to understand what kind of sound-field effects your shape will have.

And that's what we do next; that's the second step. We actually calculate the sound field around you, and that results in us being able to extract these impulse responses for all the different directions around you, which are going to be delivered in the Aural ID. And then within the Aural ID plug-in, you can actually synthesise almost any theoretical monitoring environment.

Yes. Basically, what we can do with the Aural ID plug-in is use this information inside Aural ID, your personal head-related transfer function, to create virtual monitors. The way to use the virtual monitors, for example, would be that if you know that in your real monitoring setup you would need monitors in certain directions relative to you, then we do the same in the virtual environment: we create the virtual monitors at the same directions relative to you, and you can therefore experience the same presentation as you would over real monitoring loudspeakers.
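The interview doesn't go into how the plug-in renders those virtual monitors, but the general technique behind HRTF-based virtual monitoring is standard binaural rendering: convolve each speaker feed with the listener's left- and right-ear head-related impulse responses (HRIRs) for that speaker's direction, then sum into a two-channel headphone signal. A minimal sketch with made-up HRIR data:

```python
# General HRTF-based virtual monitoring (not Genelec's implementation).
# Each virtual speaker feed is convolved with the listener's left/right
# head-related impulse responses for that direction, then summed.
import numpy as np
from scipy.signal import fftconvolve

def render_binaural(feeds: dict, hrirs: dict) -> np.ndarray:
    """feeds: {direction: mono array}; hrirs: {direction: (left_ir, right_ir)}."""
    left, right = 0.0, 0.0
    for direction, signal in feeds.items():
        ir_left, ir_right = hrirs[direction]    # personalised per listener
        left = left + fftconvolve(signal, ir_left)
        right = right + fftconvolve(signal, ir_right)
    return np.stack([left, right])              # 2 x N headphone signal

# Made-up data: a stereo pair of virtual monitors at +/-30 degrees.
rng = np.random.default_rng(1)
hrirs = {deg: (0.1 * rng.standard_normal(256),
               0.1 * rng.standard_normal(256)) for deg in (-30, 30)}
feeds = {deg: rng.standard_normal(48_000) for deg in (-30, 30)}
headphone_signal = render_binaural(feeds, hrirs)
```

Because the HRIR set covers many directions, the same loop can place virtual monitors anywhere, which is what allows a layout you don't physically own to be auditioned.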

So for you, the primary use case is someone who has a speaker-based immersive rig, but also wants to be able to perhaps check mixes on their laptop or do some work on the road?

Yeah, first I would say that we are aiming at professionals, so people who are creating audio recordings and who are working with audio recordings. We are trying to create a tool that is able to deliver the accuracy that is needed for that level of working. There's a difference here between consuming audio and creating audio: we are not so much aiming at consuming audio, but aiming to create a tool that is actually going to increase the efficiency of the professional working to create audio presentations.

But this is not something you'd advocate as a complete substitute for a speaker-based system? Most likely not, because it's very difficult to completely replace the confidence of being able to monitor over a loudspeaker-based monitoring system. I would rather say that this can be a very good addition to the tool set that is available to professionals.

But if, say, I have my system configured for Atmos, and I need to produce a mix that's going out to another standard which has different speaker placement specifications, I could load up that standard in the Aural ID plug-in and just check that my mix still sounds OK, even though the speakers are in different positions in this case? Yes, this is possible. And you obviously have much more flexibility with this kind of virtual approach to creating a monitoring system than with a physical approach, because with a physical approach, you actually have to have the loudspeakers at the directions where you need them.

I've come across several other products that appear similar on the surface, and something that a lot of them do is actually model the acoustics of a control room. But you've chosen not to take that path with Aural ID. Why? Well, the main reason is that there's a fine line between providing a system that is able to externalise audio for the listener, and creating a space simulation that is going to interfere with the recorded space in the mix. We try to avoid overdoing this, if you like, so that we wouldn't create distracting cues for the listener, but would keep things very clean and neutral,

and just provide the essential function that we need here. The essential function is related to this individual's way of hearing audio, which is embedded in the information that we have in the HRTF contained in Aural ID. We are not trying to create a synthetic space, if you like; we are trying to create the acoustics of the person who is listening to the audio recording.

So with GLM and Aural ID, that's a pretty interesting and powerful response to the needs of people working in immersive audio. Has it impacted your research work in other ways too? Well, this naturally teaches us a lot about how people hear audio, so all the work that we are doing here is very educational for us.

And I think it's important for us to engage in this type of work, because it can enhance our understanding of how people process audio, how people experience audio. If we were just concentrating on the loudspeaker and the design of the loudspeaker, that would be like half of the job of understanding how loudspeakers work.

The other half is at the receiving end: that's with the listener. So it's actually very good that we are building some depth of understanding in how individuals work to decode the content in audio signals, because that will ultimately, I think, improve our understanding of how we should also design the loudspeakers.

In the end, anyway. Well, thank you, Aki, this has been absolutely fascinating. If I could just ask you one last question: where do you see all this going in the future? What direction will immersive audio take for people working in studios? I think one of the basic properties potentially available with immersive audio is a more perfect experience of a real sound space.

And that is something that we really haven't been able to deliver in stereophonic reproduction. I mean, there is a bit of space, a bit of depth that you experience, but I think we can all agree that this is not very close to what you experience in real life. Then we come to the single-layer surround systems.

They give us direction much more than stereophonic systems can, but they are still lacking in the presentation of the space. I think what the immersive formats are giving us, for the first time, is the chance to come relatively close to being able to represent something of an acoustic space. That's actually very natural to us, and something that we are looking for; ultimately, something that our reproduction systems should be able to do is to put us inside the acoustic space and give us the full experience. That's what we are working towards at the moment. So it's almost as though the end goal is to create an experience that's indistinguishable from experiencing that sound as it originally would have been heard, without any reproduction system?

Well, it could be that, but it could also be an artistic aim. However, in terms of the quality of presentation, we should be able to come as close to perfection as possible. Now, whether what you actually put into that kind of presentation in terms of content would be more or less synthetic, or actually recorded somewhere, that's another matter.

But in terms of technical presentation, we should get as close to perfection as possible. Well, I hope one day we'll be able to present our podcast in a format that will sound like we're in the room with the listener. Thank you so much, Aki, this has been absolutely fascinating. Thank you so much. Thank you for listening, and be sure to check out the show notes page for this episode, where you'll find further information along with web links and details of all the other episodes.

And just before you go, let me point you to the soundonsound.com/podcasts web page, where you can explore what's playing on our other channels.