Make an IIIMPACT - The User Inexperience Podcast

Welcome back to another exciting episode of "Make an IIIMPACT." In today's episode, IIIMPACT's lead product AI integrators, Brynley Evans, Makoto Kern, and Joe Kraft, discuss "The Worst Synthetic Data Mistakes You're Making Right Now."

Today, we're diving into the fascinating realm of synthetic data and how it's turning the world of AI on its head. Imagine leveraging massive datasets without compromising privacy—whether it's in the medical, legal, or retail sectors. Synthetic data can extract valuable trends and insights without personal identifiers, making it a game-changer for data-driven industries. Tune in as we unravel the future of data privacy and discuss how regulations will be pivotal in navigating this innovative landscape. Get ready to be amazed by the incredible potential of synthetic data!

IIIMPACT has been in business for 20+ years. Our growth has been recognized with a place on the Inc. 5000 for the past 3 years in a row as one of the fastest-growing private companies in the US. From product strategy to design to development, we reshape how you bring software visions to life. Our unique approach is designed to minimize risk and circumvent common challenges, ensuring our clients can confidently and efficiently bring innovative and impactful products to market.

We facilitate rapid strategic planning that leads to intuitive UX design and better collaboration between business, design, and development.

Bottom line. We help our clients launch better products, faster.

Support this channel by buying me a coffee: https://buymeacoffee.com/makotob


Timestamps:

00:00 Everything's online; AI models use vast data.

04:09 Synthetic data bypasses restrictions, fills data gaps.

09:25 Train model with new data for improvements.

12:17 Customer service calls reveal common pricing misunderstandings.

14:20 AI personality affects interaction across demographics.

19:18 Human oversight needed to verify AI-generated content.

20:26 3D model expedites generating varied hand images.


You can find us on Instagram here for more images and stories: / iiimpactdesign

You can find me on X here for thoughts, threads, and curated news: / theiiimpact


Bios:

Makoto Kern - Founder and UX Principal at IIIMPACT - a UX Product Design and Development Consulting agency. IIIMPACT has been on the Inc. 5000 for the past 3 consecutive years and is one of the fastest-growing privately-owned companies. His team has successfully launched 100s of digital products over the past 20+ years in almost every industry vertical. IIIMPACT helps clients get from 'Boardroom concept to Code' faster by reducing risk and prioritizing the best UX processes with their clients' teams.

Brynley Evans - Lead UX Strategist and Front End Developer - Having led large-scale enterprise software projects for the past 10+ years, he possesses a diverse skill set and is driven by a passion for user-centered design. He works on every phase of a project from concept to final deliverable, adding value at each stage. He's recently been part of IIIMPACT's leading AI integration team, which helps companies navigate AI, reduce their risk, and integrate AI into their enterprise applications more effectively.

Joe Kraft - Solutions Architect / Full Stack Developer - With over 10 years of experience across numerous domains, his expertise lies in designing, developing, and modernizing software solutions. He has recently focused on his role as our AI team lead, integrating AI technology into client software applications.


Follow along for more episodes of Make an IIIMPACT - The User Inexperience: / makeaniiimpac..

What is Make an IIIMPACT - The User Inexperience Podcast?

IIIMPACT is a Product UX Design and Development Strategy Consulting Agency.

We emphasize strategic planning, intuitive UX design, and better collaboration between business, design, and development. By integrating best practices with our clients, we not only speed up market entry but also enhance the overall quality of software products. We help our clients launch better products, faster.

We explore topics about product, strategy, design and development. Hear stories and learnings on how our experienced team has helped launch 100s of software products in almost every industry vertical.

Speaker 1:

There are predictions that most of the data collection refinements we can make now will be exhausted in the next few years. So I guess the question is, where do we go from here? And the answer could be synthetic data.

Speaker 2:

Yeah. Sorry. I was thinking, like, whenever you think of, like, AI sort of basically creating data to train other AIs, it's almost like you you kind of wonder, you know, is it just, like, you know, recording of a videotape, like, multiple times, and, you know, how good is that data eventually. And I guess if you're trying to create a whole huge dataset, are you still gonna have a human go through it all and and try and confirm it all looks accurate, or will you just use it anyway?

Speaker 3:

Who reads the privacy, you know, things as you're signing up for an app or anything like that? I mean, you just check it, and let's start using it. And I'm assuming those regulations are gonna start getting pushed and pushed as we want more and more of people's data. Everybody, welcome back to another episode of Make an Impact. I'm your host, Makoto Kern.

Speaker 3:

And today, I've got, again, our AI stars, Brynley and Joe.

Speaker 1:

Good to be back.

Speaker 2:

Nice nice intro. I like it. AI stars.

Speaker 3:

And yeah. So today, we'll be talking about synthetic data and its use in AI and training. So today, Brynley is going to kick us off on that topic, and I'm excited to jump right into it.

Speaker 1:

Yeah. That sounds good. So, yeah, I've always had kind of a grasp of synthetic data, but I just started researching it a bit more and found it pretty fascinating. So I thought it would make a great topic for us to talk about. So kind of looking at, I mean, when did we cross over into that age?

Speaker 1:

It was, like, the late nineties. Right? Mhmm. That was probably over 20 years ago, when the Internet kind of went mainstream. And since then, most industries have moved over to the Internet or gone online.

Speaker 1:

So we looked at, like, publishing for newspapers and magazines. They, you know, both became digital. Banking went online. Retail shifted to ecommerce. Look at TV stations moving into streaming platforms, or the new streaming platforms that started up, and education content.

Speaker 1:

You name it. Like, everything's shifted online. And then we looked at, you know, those old sort of chat rooms where you'd speak to people. Those sort of shifted to social media, and you have billions and billions of human interactions and thoughts and creative outlets, and that's already created a massive dataset. And, you know, we've entered the age now of these large language models, and, really, these have a colossal, unrestricted, kind of untapped bank of data to train on.

Speaker 1:

And this data was being acquired when, you know, most people and organizations were completely unaware that it was being collected and used for training AI. So looking at 2024 now, we're seeing AI models that have had access to almost everything that mankind has been creating, at least in publicly available areas. So the prediction is that most data collection refinements that we can make now will be exhausted in the next few years. So even if we can get additional data, we're pretty much exhausting all that accessible data in the next few years. And then, you've probably both heard it in the news more and more, there's a rise in the restrictions brought about by things like copyright infringement lawsuits and legislative changes, all things that are restricting the collection and use of data.

Speaker 1:

I mean, you look at some of the big ones, like the New York Times sort of, you know, suing, saying, look, you're generating a lot of content in our tone, and you're not paying for that. So you can't use our data.

Speaker 3:

So Adobe, I think. Right? Adobe, or any of the ones that are using art and imagery.

Speaker 1:

Exactly.

Speaker 3:

Has been a big one.

Speaker 2:

Yeah.

Speaker 1:

Exactly. So it's almost like the best time for collecting was back then, you know, because it was completely untapped. And now there are just more and more restrictions and, I don't know, stonewalling of data collection. So I guess the question is, where do we go from here? And the answer could be synthetic data.

Speaker 1:

So what synthetic data is is artificially generated data that mimics real-world data but is not directly obtained from real-world events or measurements. And I think the potential lies in the fact that it can circumvent these copyright issues, it can remove privacy issues, and it can also fill in the blanks in datasets that are not currently available. So you'd think there may be a lot of datasets that are private that, you know, these sort of data scraping or data collection processes can't actually obtain. And these synthetic datasets can be created using 2 different techniques. The one is generative models, models called GANs or VAEs.

Speaker 1:

So techniques like generative adversarial networks or variational autoencoders. And what these basically do is generate text that mimics the style, tone, and structure of the real-world content they learn from. So basically, without directly copying any copyrighted content, you can get very similar data. It sounds the same, matches fairly well, but it's not infringing on that copyright. And obviously, that sort of synthetic data can be used to then train the large language models without violating any of these intellectual property laws.
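
[Editor's note: a rough sketch of the fit-then-sample pattern described here. A real GAN or VAE learns a far richer model; this just estimates a single Gaussian field with the standard library, but the workflow is the same: model the real data, then sample brand-new records from the model.]

```python
import random
import statistics

random.seed(0)

# "Real" data we can't share: 1,000 customer ages drawn from some process.
real_ages = [random.gauss(40, 9) for _ in range(1000)]

# Fit a simple generative model: estimate the distribution's parameters.
# GANs and VAEs learn far richer structure, but the pattern is the same.
mu = statistics.mean(real_ages)
sigma = statistics.stdev(real_ages)

# Sample synthetic records: statistically similar, but no value here is
# any real person's record.
synthetic_ages = [random.gauss(mu, sigma) for _ in range(1000)]

print(round(statistics.mean(synthetic_ages)))  # close to 40
```

The synthetic sample preserves the trend (mean and spread) while containing none of the original rows, which is the property the hosts keep returning to.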

Speaker 2:

But just out of interest, sorry, Brynley. Is it generating, like, new data? You were mentioning earlier that we're exhausting what's currently publicly accessible to scrape from. So is it still scraping from something that humans have generated, or is it generating its own data, these models and so on?

Speaker 1:

So my understanding is these generative adversarial networks and variational autoencoders are taking a lot of data now and almost relabeling it. So if there are privacy issues with the data, it can mimic the same sort of patterns, but it can generate unique records. So if you have, let's say, a database...

Speaker 2:

It's like data laundering. That seems weird.

Speaker 1:

It is, in a way. So imagine you had 500,000 contacts, and they have a certain sort of demographic representation. So it may maintain things like that, but completely rename, you know, generate a whole lot of content to still be able to train systems on, but it's not violating any of those. Yeah.
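
[Editor's note: a minimal sketch of that "keep the demographics, regenerate the identifiers" idea. All names and fields below are invented for illustration; real synthetic-data services model many more attributes and their correlations.]

```python
import random
from collections import Counter

random.seed(1)

# Toy "private" contact list: the names are sensitive, but the age-band
# mix is the trend we want a model to be able to learn. (All made up.)
contacts = [
    {"name": "Alice Smith", "age_band": "18-29"},
    {"name": "Bob Jones", "age_band": "30-49"},
    {"name": "Carol White", "age_band": "30-49"},
    {"name": "Dan Brown", "age_band": "50+"},
]

FAKE_FIRST = ["Jordan", "Sam", "Riley", "Casey"]
FAKE_LAST = ["Lee", "Patel", "Garcia", "Kim"]

def synthesize(records):
    """Keep each record's demographic fields; replace the identifiers."""
    return [
        {
            "name": f"{random.choice(FAKE_FIRST)} {random.choice(FAKE_LAST)}",
            "age_band": rec["age_band"],  # distribution preserved exactly
        }
        for rec in records
    ]

synthetic = synthesize(contacts)
print(Counter(r["age_band"] for r in synthetic))  # same mix as the original
```

Real anonymization is much harder than renaming, of course; rare combinations of attributes can still re-identify someone, which is why the later point about validation and regulation matters.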


Speaker 1:

And from my understanding, it's also filling in some of the blanks through the second technique, which is called simulated scenarios. So basically, organizations can create synthetic datasets by simulating real-world scenarios. So something we'd look at is, like, customer service interactions. You may be able to actually simulate, with the data, like we've seen with the generative models. You know, you're able to sort of generate new content based on assumptions about, you know, what a user could be doing and, you know, what you may be seeing in other datasets.

Speaker 1:

So things like customer service interactions, medical consultations, legal discussions, those could all be private data that no one is willing to provide at the moment, but you can have these simulated scenarios that could say, well, if you had this court case and you did this and this, these could be the outcomes. Therefore, it could train it on what you'd want to do in those scenarios. Really, what you could do is tailor those datasets to specific use cases. So, again, ensuring that copyright infringements don't take place. I just picked a couple.

Speaker 1:

There are a lot of different examples that I found. I picked a couple of the interesting ones, so I've just got 3 that I wanted to chat through. The first was improving conversational AI with diverse interaction scenarios. So, you know, we've been working on sort of chatbot-related AI implementations, and you may often get sort of edge cases that a chatbot could struggle with. So there's no particular data about how you'd handle a certain case.

Speaker 1:

So often, you'd be able to, you know, have a solution where you could generate this sort of synthetic dialogue data that, you know, would cover these rare or unusual scenarios. And then the AI models can be trained to handle a wider variety of interactions, which is really useful.

Speaker 1:

So you may have a variety, and there's a subset that you simulate a whole lot more. So an example is creating synthetic customer support conversations that involve complex problem solving or cultural nuances not well represented in existing data. And that kind of goes into, you know, if we're thinking of all the data that's been collected to date, you have all those biases and, you know, some of the differences in culture that may make certain answers imbalanced. And what's nice is, if this could be controlled, you'd have datasets that circumvent those biases as well. So you're getting, you know, more useful, balanced data to actually train models on.
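
[Editor's note: one cheap way to sketch that kind of coverage expansion is a template sweep over scenario axes. The axes and wording below are invented; a production pipeline would use a generative model rather than fixed templates, but the idea of systematically covering underrepresented combinations is the same.]

```python
import itertools

# Hypothetical scenario axes for a support chatbot; real sets would be
# much larger and chosen from observed gaps in the conversation logs.
issues = ["refund request", "double charge", "plan downgrade"]
tones = ["frustrated", "confused", "polite"]
regions = ["US", "UK"]

def make_dialogue(issue, tone, region):
    # A rough template; a real pipeline would vary the wording far more.
    user = f"[{tone}, {region}] Hi, I have a problem with a {issue}."
    agent = f"Sorry to hear about the {issue}. Let me look into that for you."
    return {"user": user, "agent": agent, "labels": (issue, tone, region)}

# Every combination gets at least one synthetic training example.
corpus = [make_dialogue(*combo) for combo in itertools.product(issues, tones, regions)]
print(len(corpus))  # 3 issues x 3 tones x 2 regions = 18 dialogues
```

The point is controlled coverage: rare or culturally specific combinations that barely appear in real logs can be represented deliberately rather than by chance.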

Speaker 2:

Interesting. So yeah. Just to understand it: you've got a model. It can answer questions, like you're using it for customer service, you're saying, and you want to augment it because you've noticed that in certain situations it's responding in a way that you don't want it to for certain, you know, users that might be using it. So instead of, like, trying to fix it maybe with prompts or trying to, you know, add in a whole bunch of contextual data for each reply for it to, like, try and understand how to reply, you can actually just rather train the model itself with data that you didn't necessarily have in the beginning.

Speaker 2:

So you may have, like, you know, transcripts of your last 5 years of customer service interactions. But let's say now you're entering a new market, for instance, in a whole new region, and you want to match the tone of what users in that region expect. Instead of having to go, okay. Well, wow. How do we do this now?

Speaker 2:

You can, as you just say, sort of augment that whole dataset and sort of, yeah, modify it to match the tone that you want and then train your model with that instead. So you don't have to sort of, you know, recreate all that data from scratch. That's really interesting.

Speaker 1:

So it's pretty powerful as well when you think of, you know, all the data that's missing at the moment and how difficult it is to get. Yeah. I think there are sort of people on the fence as to, you know, it's either going to make models way better, or, if it's not done correctly, it's going to generate a lot of junk that is maybe not helpful.

Speaker 2:

Yeah. Sorry. I was thinking, like, whenever you think of, like, AI sort of basically creating data to train other AIs, it's almost like you kind of wonder, you know, is it just like re-recording a videotape multiple times, and, you know, how good is that data eventually?

Speaker 2:

And I guess if you're trying to create a whole huge dataset, are you still gonna have a human go through it all and try and confirm it all looks accurate, or will you just use it anyway? It's kind of interesting to see. And it's hard to tell, I guess, too, when you're training a model. I mean, you have your test cases against it to see if it's matching the output that you expect, but you can't really test against everything.

Speaker 1:

What I really like about it, and why I feel it's promising, is that from what we've seen working with datasets, you may have everything represented in some form, but it isn't explicitly, almost, written out. And what I see is, you know, you could take your data and say, well, we'll build scenarios around this, and it's really giving you an expansion of your data. Everything's there, but it's been said in possibly a briefer form. And the fact that you're getting more data there just means that you can map out those scenarios, and you could check. And even if it's, you know, for smaller datasets, definitely through some sort of review, you could actually get where you need to be a lot quicker, with potentially more accuracy than if you had someone actually writing that out and trying to capture it all.

Speaker 3:

Yes. What's interesting, and I'm just thinking of a real world scenario: with one of our IIIMPACT clients, actually, we listen to customer service calls. We do it, you know, every so often, and we listen to what they have to say. It's interesting: with the people that are on those calls, we listen to the interaction between the actual customer service person and the customer. And you hear the complaints, and you hear the responses of the management, where they say, oh, they've gotta follow this type of script. This is what they should say when this happens.

Speaker 3:

And right away, you can see you could create something that works along with the human being as they're interacting. Basically, almost everybody has similar issues where they don't understand why something happened, the confusion of what they thought was going to be the price of something all of a sudden being more expensive once they're a customer. That type of misunderstanding seems to happen quite a bit. And as you're approaching that, you see you could probably simulate, based on, like, the psychological side, the comments that are being made as a conversation is happening, Mhmm, to tell the AI, like, hey, this person is saying this.

Speaker 3:

You look at it, you know: somebody who's actually new doesn't know how to handle the response quite well just yet. Oh, I know I'm supposed to say this or that, but how you say it is so important. And so the AI can guide the person who's, you know, using it, who's in customer service, to say, hey, this is what we mean by this.

Speaker 3:

This is what you meant by that. Because there might be a misunderstanding, partly because a lot of customer service is now, what, offshore. So, you know, they're interacting with people who are in a different country, and they don't know the nuances between them. So I think there's a lot of potential there, just because you're gonna leverage the different ways in which people speak and how they pick up on that, and then just

Speaker 3:

prompting them. There's potential.

Speaker 2:

Yeah. That's good. Yeah. It's also interesting to think about, you know, personality with AI. It always makes a big difference if you're chatting to something that seems to have a personality instead of robotic responses. And that can become a very, you know, demographically appropriate thing to use.

Speaker 2:

You know, certain demographics, even younger people, will expect, you know, maybe a more vibey sort of interaction, but someone, you know, in their older age may want something a bit more formal and a bit more to the point. But, you know, try to come up with how you can have a model that has that sort of new terminology that, you know, maybe our kids are using or young people are using, which, you know, existing datasets don't have. You can't exactly create that. You have to sort of simulate that with synthetic data. Mhmm.

Speaker 2:

Sort of like another use case, yeah.

Speaker 1:

Yeah. Exactly.

Speaker 2:

Trying to get that certainty into it. Yeah. It'd be interesting to see how that would work.

Speaker 3:

I don't wanna jump the gun because I know you've got 2 more examples, but I'm assuming anything that comes with medical or legal, you know, or even retail, there are obviously privacy issues. But who reads the privacy, you know, things as you're signing up for an app or anything like that? I mean, you just check it, and let's start using it. And I'm assuming those regulations are gonna start getting pushed and pushed as we want more and more of people's data. Like, oh, we just won't put your name in the medical thing.

Speaker 3:

You know, there are HIPAA laws and things like that that are supposed to protect us, same with legal. But

Speaker 1:

And some of the synthetic data services are exactly that. It's pretty much scrubbing. If you have massive datasets, it's still getting those same trends out of them, but it's not using any private information. So you could feed your database into that, and they would say, right, we're going to run a simulation and convert it to, you know, have the same real-world trends in it.

Speaker 1:

No one is recognizable in that set, which is interesting.

Speaker 3:

It'll be interesting to see the regulation for that. Exactly. And making sure they're not taking too... oh, wait, you took too much of that one dataset.

Speaker 1:

Yeah. Exactly. So, yeah, you mentioned medical as well. That was sort of the second application: emerging technologies, and generating training data for those. So you look at fields like quantum computing or new medical treatments.

Speaker 1:

They often lack that sufficient sort of real world data, either because they're new or they have proprietary research that, you know, can't be shared. So, you know, the solution there is kind of generating this the synthetic data to simulate the outcomes, processes, or even interactions in these kind of emerging fields.

Speaker 3:

With biomedical companies, you know, I've got a friend that runs a biomedical company, and, you know, a big thing is to have scientists on staff to test whatever you're trying to make. Mhmm. It's almost like, you know, with UX: you want to create a wireframe, you wanna test out a simulation of what you wanna build. Is this an application of the same thing, where they can create fake data and say, oh, this cancer drug or this weight loss drug simulation, with fake data, or synthetic data, shows that it actually works?

Speaker 3:

There's no side effects or there's a, you know, certain percentage.

Speaker 3:

And that will at least get them to a point of getting investment: say, hey, look, these synthetic tests are showing promise. Yeah. And that helps them get to that point.

Speaker 1:

Yeah. Exactly. So this is pretty fascinating. And, I mean, you look at the benefits of, you know, solutions like that: to, you know, obviously allow AI models to be trained and developed in these cutting-edge fields and just accelerate the innovation in, you know, things like cancer research and all that. It's so, you know, vital. If we can just speed up the processes, we can get, you know, better treatments.

Speaker 1:

And, so, you know, it could be a great kind of industry for that as well. And then the last one I had was expanding training data for visual AI. We've chatted about Sora and things before, where they've mentioned, look, you'll see video generated with, you know, 6 fingers on the hand. That's because there's not a lot of valuable reference imagery for the AI to work from. Or a walking cycle: you may not have enough images.

Speaker 1:

And that's where synthetic data can also be used. And, actually, you know, teams often struggle to have a large amount of labeled data; it's really time consuming, you know, to put together manually and, as a result, pretty costly. The solution is then to have these images and videos generated to augment the existing visual datasets.

Speaker 1:

So an example would be, like, creating synthetic images of a product in various different environments or conditions that you maybe don't have well represented in the real world. And it obviously, you know, just improves the performance of the visual AI models, and they can recognize and process a much wider variety of objects, scenes, contexts, etcetera. And so

Speaker 2:

I'd imagine you still need a human to, like, do a pass through that content it generates. Right? So let's go with an example where you want your AI to draw hands. Right? And it keeps drawing 6 fingers.

Speaker 2:

You go, okay, let me use synthetic data. Well, let's say there's 2 options here. Either I could go and find some huge database out there of 10,000 images of human hands that have been, you know, photographed over the years and spend $10,000 buying those, or I could use an AI to generate all of these images for me and then pass them over to the model which is specifically tasked to generate hands. Right?

Speaker 2:

But if it's struggling with that to begin with, that means the other model probably also struggles with trying to, like, generate hands. Right? Because that just is, like, the fundamental issue. Otherwise, you'd just use the other model to do the task for you. But it can probably do 20,000 images, and maybe 15,000 are correct.

Speaker 2:

Right? So you wanna then move those 15,000 correct ones over to the model as, you know, synthetic data, which is now accurate. But you'd still need humans to go through all that, right, and probably vet all those images. I'm just trying to think, like, practically, how you would make sure this data is correct. That's what I'm kind of struggling with.

Speaker 1:

I mean, off the top of my head, you'd think, alright, if you're doing the hands, then you potentially want to, you know, have a 3D model of a hand, give it different kinds of poses, and then go and generate 50,000 images of all the different positions a hand can be in. Which, you know, you could still go through and improve, but it's way quicker than saying, right, now sit down and go take all those pictures. And once you've got the basics, as I mentioned, it would go pretty well. Pretty cool. Yeah.
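
[Editor's note: that pose-sweep idea can be sketched as enumerating render jobs. The parameters below are invented for illustration; in practice each combination would be sent to a renderer to produce a labeled image, and labels come for free because you control the scene.]

```python
import itertools

# Hypothetical pose parameters for a rigged 3D hand model. In a real
# pipeline each combination would be fed to a renderer to produce one
# labeled training image; here we just enumerate the jobs.
finger_curl = [0, 30, 60, 90]          # degrees per joint, simplified
wrist_rotation = range(0, 360, 45)     # 8 yaw angles
lighting = ["soft", "hard", "backlit"]

poses = [
    {"curl": c, "wrist": w, "light": l, "label": "hand"}
    for c, w, l in itertools.product(finger_curl, wrist_rotation, lighting)
]
print(len(poses))  # 4 * 8 * 3 = 96 labeled render jobs
```

Scaling the axes up (more joints, camera angles, skin tones, backgrounds) is how a few parameters become tens of thousands of images, which is exactly the speed advantage over photographing hands by hand.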

Speaker 1:

The applications are things like autonomous driving, retail analytics, surveillance, which is pretty cool. Yeah. So we need to ensure that synthetic data is really high quality, representative, and free from harmful biases. And I think if we're going to use synthetic data, the best balance would probably be to combine it with real world data and maybe some form of validation process, just to help us create more reliable and ethically sound large language models in the future.

Speaker 2:

Great. Yeah. Sounds good.

Speaker 3:

Well, thanks for joining us on today's episode of Make an Impact. That was a good wrap-up and conversation about synthetic data and AI. And tune in next time; we'll be talking about something very interesting on our next podcast. Everyone take care, and don't forget to hit that like and subscribe button.

Speaker 3:

Alright. See you soon.

Speaker 1:

Catch you soon.

Speaker 2:

Yep. Thanks. Bye.