Pondering AI

Fernando Lucini explains the potential applications, pitfalls, and work still to be done to make synthetic data ubiquitous.

Show Notes

Fernando Lucini is the Global Data Science & ML Engineering Lead (aka Chief Data Scientist) at Accenture.

Fernando Lucini outlines common uses for AI generated synthetic data. He emphasizes that synthetic data is a facsimile – close, but not quite real - and debunks the notion it is inherently private. Kimberly and Fernando discuss the potential pitfalls in synthetic data sets, the emergent need for standard controls, and why ensuring quality - much less fairness - is not simple. Fernando assesses the current state of the synthetic data market and the work still to be done to enable broad-scale adoption. Tipping his hat to fabulous achievements such as GPT-3 and Dall-E, Fernando identifies multiple ways synthetic data can be used for good works and creative endeavors.

A transcript of this episode can be found here.

Creators and Guests

Host

Kimberly Nevala

Strategic advisor at SAS

Guest

Fernando Lucini

Global Data Science & ML Engineering Lead Accenture

What is Pondering AI?

How is the use of artificial intelligence (AI) shaping our human experience?

Kimberly Nevala ponders the reality of AI with a diverse group of innovators, advocates and data scientists. Ethics and uncertainty. Automation and art. Work, politics and culture. In real life and online. Contemplate AI’s impact, for better and worse.

All presentations represent the opinions of the presenter and do not represent the position or the opinion of SAS.

KIMBERLY NEVALA: Welcome to "Pondering AI." My name is Kimberly Nevala, and I'm a strategic advisor at SAS. I'm so pleased to be hosting our third season in which we are joined by a diverse group of thinkers and doers to explore how we can create meaningful human experiences and make more mindful decisions in the age of AI.

Today, we are joined by Fernando Lucini. Fernando is the global data science and machine learning engineering lead - did I get that right? - at Accenture. And he joins us to discuss the role of synthetic data. So welcome, Fernando.

FERNANDO LUCINI: Lovely. Thank you. You used my long title. Chief data scientist works a lot better, but let's make it nice and long and descriptive, right?

KIMBERLY NEVALA: [Laughing] I'm going to use chief data scientist from now on. Nothing like a good host challenge. Anyway, we're going to be addressing some common misconceptions and overlooked considerations for synthetic data. And with that in mind, it probably behooves us to start with a definition.

So how do you define synthetic data, and are there some commonplace examples you could share?

FERNANDO LUCINI: Yes, and we can get very pedantic on this. So you'll reel me back, and I'll reel you back, and then we'll meet in the middle somewhere where everybody would understand us, right?

KIMBERLY NEVALA: Sounds good.

FERNANDO LUCINI: The way I tend to define synthetic data, as in AI generated synthetic data - because of course, we have the idea of what synthetic data is - which is data that has been created by a machine, right? That's one definition of what a synthetic data set is.

But in the discussion today, I think we're going to focus on AI-generated. So this is data that's been created by a machine, and it can be-- and let's give a couple of examples will help.

It can be data that is created randomly because it's for convenience. We need a little bit of random data to do something with. So hey, take examples of 50 people and give me random heights for people. Take a random height from 1 meter 60 to 1 meter 90. Just give me some randomness. And the machine goes and creates some random stuff.
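As a rough illustration of this first, purely random kind of synthetic data, the sketch below draws 50 heights uniformly between 1.60 m and 1.90 m. The function name, the fixed seed, and the rounding to two decimals are assumptions made for the example, not anything specified in the conversation.

```python
import random

def random_heights(n=50, low=1.60, high=1.90, seed=0):
    """Return n heights in metres drawn uniformly from [low, high]."""
    rng = random.Random(seed)  # fixed seed so the example is repeatable
    return [round(rng.uniform(low, high), 2) for _ in range(n)]

heights = random_heights()
print(len(heights), min(heights), max(heights))
```

Nothing here is learned from real people: the values are random within a range, which is why this kind of data is convenient but carries no real-world pattern.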

The more interesting stuff, which I think we'll have some fun with today, is where we have a data set. So we have, let's say-- the example I always use is we have a room full of people. And we have their heights. We have their skills. We have a bunch of things like this. But obviously, we can't walk out of the room and tell anybody about the fact that Fernando is Spanish and 1 meter 80 or whatever. So what we do is we ask the machine to look at that data, to create the signal. What is the signal that represents all of that? And out of that signal, we're asking you to recreate a new data set that will fit that signal, the pattern that is contained within the room full of people. It creates a set of data that has all of the same patterns but none of the characteristics of that original data. And we're going to get into a second about privacy and all these things that are quite complex.

So in simple terms, it is machine generated data that is created as a facsimile - not as a copy, but as a facsimile - of the signals and patterns within the original data.
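A minimal sketch of the "learn the signal, then sample from it" idea follows. It assumes a deliberately simple stand-in for the generative model: a two-column Gaussian fitted to made-up (height, years of experience) pairs. The data, the column choice, and the function names are hypothetical. Real generators are far more complex, and this sketch makes no attempt to be privacy preserving.

```python
import math
import random
import statistics

def fit_signal(rows):
    """Summarise (height_m, years_experience) rows as means, spreads and correlation."""
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    cov = sum((x - mx) * (y - my) for x, y in rows) / (len(rows) - 1)
    return {"mx": mx, "my": my, "sx": sx, "sy": sy, "rho": cov / (sx * sy)}

def sample_synthetic(signal, n, seed=0):
    """Draw n new rows from the fitted bivariate Gaussian, not from the original rows."""
    rng = random.Random(seed)
    rho = signal["rho"]
    residual = math.sqrt(max(0.0, 1.0 - rho ** 2))  # guard against rounding past 1
    out = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = signal["mx"] + signal["sx"] * z1
        y = signal["my"] + signal["sy"] * (rho * z1 + residual * z2)
        out.append((round(x, 2), round(y, 1)))
    return out

# Hypothetical "room full of people": (height in metres, years of experience)
room = [(1.80, 12.0), (1.65, 3.0), (1.72, 8.0), (1.90, 15.0), (1.60, 2.0), (1.75, 7.0)]
signal = fit_signal(room)
synthetic = sample_synthetic(signal, n=1000)
```

The synthetic rows follow the fitted averages and correlation, which is the "facsimile" idea: the pattern is kept, while individual rows are new draws rather than copies.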

KIMBERLY NEVALA: So is it fair, then, if I wanted to greatly oversimplify the matter, that we are using synthetic data in cases where we might want to fill in a sparse data set, or we want to randomize a data set or to ensure that the data set that we're looking at accurately represents or is representative of the population that we want to evaluate? Is that about right?

FERNANDO LUCINI: Good point. Let's think of a few examples to get everybody in the mood, I think, right? Because yeah, I think you've got a few different variants.

You got one set of use cases which are-- let's use the simplest set of more obvious use cases, which is where I'll have a data set that I cannot use as is. It has a set of, let's say, privacy concerns or IP concerns. We haven't talked a lot in the industry about IP concerns with data, but maybe I have a data set that I want to share so people can actually do some modeling on. But I don't particularly want to share the data itself just yet because there's value in it. I want to sell it or whatever.

But let's say that you have a data set that has either a strong IP that you don't particularly share, or it happens to have privacy concerns, and you want to use it and overcome those privacy concerns. So use case number one, it helps with that. And we'll touch later on the reality of privacy.

Secondly, it is indeed whether you actually want to complement the data set. Now let's be clear. Synthetic data is not a replacement for real data. So synthetic data, private data, it's more like a distorted version of the data, right? So modeling and inference on the data will come with risks.

So yes, you can take the original data, add some synthetic data on it, but I always joke with CEOs that what you're doing is you're stretching the rubber band of it. So if your data is a round rubber band, you're stretching one end of it. But it doesn't stop being the rubber band. So to some degree, you're stretching it a bit because it gives you some convenience in terms of modeling and stuff like that. But it comes with trade-offs. But for me, those are the two main veins of use cases that makes sense.

KIMBERLY NEVALA: And this is great because this leads us into two of the most common misconceptions about synthetic data. And the first prevalent theme is this idea of synthetic data as being inherently private or privacy preserving. So is synthetic data, in fact, the panacea for privacy concerns that it's sometimes purported to be? That was a lot of Ps.

FERNANDO LUCINI: So my view is that it can be, but it's not quite as easy as that, right? So firstly, as you say, synthetic data is not automatically private. It's simply not true.

Most of the off the shelf sort of models that we use, the generative models, are not privacy preserving to begin with. That's not what they're doing. It can leak information.

So you can end up having models that have leaked some of this privacy-- they're vulnerable to privacy attacks. Not very good implementations where you're focusing on the privacy can end up with vulnerabilities of all sorts, right? So at the end of the day, you can get great privacy if you're taking real care of how the whole model is being created from the pipeline to the implementation and where to apply privacy.

You have to think about the architectures, the hyperparameters, the variable ranges, all of these things you have to think about. And you have to think about it with privacy in mind so you don't end up effectively with some of this leakage where some of the stuff can be inferred or it can be subject to attack. So the answer is no, it's not immediately. And this is where I think you and I probably are going to have great fun with this.

The truth is, there's an evolution to this where what are the controls, what are the processes, what are the tools, what are the methods that would allow anybody in the world that's not an expert to say: I've got a synthetic data set that I've got from Joe or Jane, and I know that he actually does have his privacy because of whatever those controls are. So we're not there at all.

So the answer is no. But does it have the potential to create the most amazing synthetic data that is private? Yes.

KIMBERLY NEVALA: So is there an example you can think of that demonstrates that point about someone perhaps assuming that by virtue of the data being synthetic that it would, in fact, preserve privacy, but it really didn't?

FERNANDO LUCINI: Thankfully, I don't have an example of somebody doing it because then we'd have to stop them pretty quickly so they don't get into trouble, right? For me, it feels that it's a little bit of a humanization of AI. So you sit there and you think.

I mean, look how I've explained it. We take all this room full of people and all that data, and the superpower of some of these machine learning models is they can extract the signature of that. And out of that signature, we asked the machine to take that signature and statistically recreate me the data based on that signal and create me a copy of it.

And you think, well, intuitively, I think, well if it's not really looking at the original data at all, never looking at the original data, only at this probability distribution, and it's just sampling out of that distribution, there is no problem surely.

KIMBERLY NEVALA: Right. Surely.

FERNANDO LUCINI: There is no statistical probability that there is a leak. Oops, there can definitely be a leak. I think it's that humanization. You know what I mean Kimberly? Humanization is the problem.

Where this becomes quite difficult is when you have to then explain to these people that, hey, you know how I explain to you in simple terms how this works? It doesn't mean it's private. And there you start having to explain it in a much deeper level, and it becomes very complex to say I know what I explained to you.

But in the context of really large information, statistically, you can still have this problem if you're not effectively looking out for it.
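One simple way to "look out for" the problem described above is to check how close each synthetic record sits to its nearest real record. The sketch below is an assumption-laden illustration: the records, the threshold, and the function names are hypothetical, the columns are left unscaled, and passing such a check does not prove a data set is private. It only flags obvious coincidences and near-copies.

```python
import math

def closest_real_distance(synthetic_row, real_rows):
    """Euclidean distance from one synthetic row to its nearest real row."""
    return min(math.dist(synthetic_row, r) for r in real_rows)

def flag_possible_leaks(synthetic_rows, real_rows, threshold):
    """Return synthetic rows that land on, or very near, a real record."""
    return [s for s in synthetic_rows
            if closest_real_distance(s, real_rows) <= threshold]

# Hypothetical (height in metres, years of experience) records
real = [(1.80, 12.0), (1.65, 3.0), (1.72, 8.0)]
synthetic = [(1.80, 12.0), (1.70, 5.5), (1.66, 3.0)]

leaks = flag_possible_leaks(synthetic, real, threshold=0.05)
print(leaks)
```

Here the first synthetic row reproduces a real person exactly and the third is a near-copy, even though both were "only sampled"; the second is far from everyone and is not flagged.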

KIMBERLY NEVALA: Right. It's a lot like the conversation we've had in the past about anonymized data does not mean that you will not be able to sort of work back (to the original person), right? Health care is a great example of that. Where if you're looking at a data set for a cohort of patients for a disease that's not really prevalent. Even by virtue of not having any personally identifiable information it is, in fact, possible to go back and look at the area and the information and work your way back to identify those patients. And it sounds like you can actually do the same thing or replicate that same problem within your synthetic data set, essentially.
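The re-identification risk described here can be sketched as a small-group count: if a combination of seemingly harmless attributes is shared by very few records, those records are easy to trace back. The example below is a sketch in the spirit of a k-anonymity check; the patient records, column names, and the choice of k are all made up for illustration.

```python
from collections import Counter

def small_groups(records, quasi_identifiers, k):
    """Return attribute combinations shared by fewer than k records."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return {combo: n for combo, n in counts.items() if n < k}

# Hypothetical records with no names or IDs, only indirect attributes
patients = [
    {"region": "North", "age_band": "40-49", "condition": "common"},
    {"region": "North", "age_band": "40-49", "condition": "common"},
    {"region": "North", "age_band": "40-49", "condition": "common"},
    {"region": "South", "age_band": "70-79", "condition": "rare"},
]

risky = small_groups(patients, ["region", "age_band", "condition"], k=3)
print(risky)
```

The single patient with the rare condition stands alone in their group, so removing names did not make them hard to find. The same check can be run on a synthetic data set, since it can inherit the same small groups.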

FERNANDO LUCINI: Exactly, exactly. And we've been doing differential privacy for years in trying to figure out how to do this. And I do think it's a step forward, but it is sometimes difficult to explain it in a way that it sounds like-- and by the way, we do have a lot of enthusiastic people like me and others, probably like you as well, that really love the topic.
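For readers unfamiliar with differential privacy, the sketch below shows its most basic building block, the Laplace mechanism, applied to a single counting query. It is an illustration only: the function names, the value of epsilon, and the count are assumptions, and making a whole generative model differentially private is a much larger undertaking than this.

```python
import random

def laplace_noise(scale, rng):
    """Draw from Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_count(true_count, epsilon, rng):
    """Release a count with epsilon-differential privacy.

    Adding or removing one person changes a count by at most 1,
    so the sensitivity is 1 and the noise scale is 1 / epsilon.
    """
    return true_count + laplace_noise(1.0 / epsilon, rng)

rng = random.Random(42)
released = private_count(true_count=100, epsilon=0.5, rng=rng)
print(released)
```

A smaller epsilon means more noise and stronger privacy, and a larger epsilon means the reverse. That trade-off between privacy and usefulness is part of why this is hard to explain in simple terms.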

KIMBERLY NEVALA: Yeah.

FERNANDO LUCINI: It's fascinating that you can have these very complex and amazing machines that can help us generate this data for everybody's convenience.

And if you think about my situation, so I'm sitting in a team - a very, very, very large team of data scientists. So it's pretty normal to encounter situations where enhancing the data set with synthetic data is a useful thing. So we can apply techniques a bit better with all of the controls and all of these things notwithstanding.

So I'm enthusiastic because I see it as a utility in that respect. But if I'm a bank or if I'm a retailer or let's say in two or three years, where we all have these marketplaces of data that we all dreamed about that doesn't seem to have happened for the last 10 years or 20 years-- but let's say we do.

What role does synthetic play in that? And I think it's fabulous if done properly and if all of these controls… by the way, there's people around like Mihaela van der Schaar and all these gods and goddesses of AI that are sitting there thinking, what are the controls? What are the methods? What are the things that we can standardize as an industry to say that one of these things has been done sensibly, appropriately, and I can rely on it, right? And it's incredibly interesting, incredibly interesting.

KIMBERLY NEVALA: So I want to come back to this, the idea of controls, and how do we, in fact, try to validate that a synthetic data set just like a "real" data set will, in fact, provide the input that we're expecting for our model.

But you touched on another issue. And so before I forget the question, why should people not just think about synthetic data as a replacement for real data?

FERNANDO LUCINI: Depends on the use case. But in the use case of it's complementing data because it's effectively just my rubber band example, it's just a stretch of the original data. So you're not really-- the signal, the patterns or the patterns are not really-- to some degree, as you say, it's a utility. You're helping yourself to a little bit more data because it helps you in the method you want to use for analysis for that one.

In other use cases, where you're just removing some of the-- let's use the example of IP. Something which has got rich IP, you might still want to know what Fernando has bought. The fact that you no longer have Fernando and the things you have bought are equivalents in the synthetic data-- so now you have the patterns, but you don't have the person.

You still want to know so you can target Fernando because Fernando is one of your customers, and you want to do that. So in the synthetic data set, you may get the ability to do the analysis and get the patterns and the general answers that you want to your exam questions. But you still don't have the particulars. And if the particulars are needed, you still have to go back to the original.

You see where I'm going. Very rarely-- let's use another example. Health care data is a really good one. Health care data, there's use cases where through COVID, if we had had the ability to create synthetic data - and some of this was done at the end of COVID but not maybe at the beginning. If we had the ability to say, let's take the population of the UK create a synthetic copy, totally private-- let's just say for the sake of the thought experiment, 100% private, total differential privacy. Could we have sent that to the people we trust in other global economies and doctors and PhDs so we could have had a much bigger data set? Where I don't care about Fernando specifically, but I care about the trends within and what happens to certain-- and let's face it. In medicine, everything matters-- gender, age, lifestyle, whatever the case may be, right? So would we have been better off? And I think the answer would have been yes: without ever having to know Fernando and Fernando's personal circumstance.

Different from a bank trying to say OK, what can I give Kimberly or Fernando as a great offer that's going to help them with the way they're spending their money.

And of course, getting the synthetic data and being able to create products that can be used on real data-- this is an interesting one. So I've created all my analysis and all my algorithms that should work. They still need to turn them on in the real data. Otherwise, what's the objective here, right?

KIMBERLY NEVALA: Yeah, and that's an interesting problem. And maybe again something we can talk a little bit about where if, for instance, a deep fake, right? Synthesized data. Maybe it's a face. So perhaps we're trying to synthesize a data set of human faces that have enough of every possible combination of complexion or hair color or size and all of those bits? The right breakdowns of gender so that we're not overrepresented or underrepresented for that model to learn on, which has been an issue with facial recognition.

So then I start to wonder, if I synthesize this perfectly statistically appropriate data set so that every demographic is represented fully enough for us to actually apply machine learning to it, does the model then learn these synthetic features? Which, as you said, are going to be close but not quite, real people's faces. And does it then, if we try to apply that in the real world for facial recognition-- not that we would want to do this - but is it in fact still not going to perform well because it has learned synthetic features that, as you said, are the stretched rubber band? It’s sort of a slightly blurry picture as opposed to real features.

FERNANDO LUCINI: Well, you touched on a very important piece that we haven't touched yet, which is take your data set, whatever it may be. And as you say, you do it. You get your variational autoencoder or whatever it is you're doing. And you identify the probability function and all that stuff-- probability density functions.

You create your machinery to create your synthetic data, and you have it. It has all the patterns, all the problems, all the biases, all the issues. It has everything you can think of. So one of the things that academics and others postulate is that it can be used for fairness, which is where you're heading with this, right?

KIMBERLY NEVALA: Mm-hm.

FERNANDO LUCINI: You can say, OK, well, if we can look at the original data and/or the synthetic data and it's very clear that it's - let's make it up. 90% of people in this data set are males, for example. And it's a data set for upselling or cross-selling or whatever. Then could we take this and say, well, actually, could we take the patterns as they relate to the females and amplify that synthetically to reach parity? My view on this is that obviously this is very complex. Fairness is such a complex issue sociologically, not only from a science perspective. I think if we think of privacy and think about being a difficult thing that we're trying to deal with even in the context of this.

Fairness is even less mature, and I do think it requires immense amount of research and not research just from data scientists or mathematicians. From sociologists and others that can say, what is the reality of not mechanically but from the context of business and society that we're taking a data set about people that we know has problems that we're then using synthetic to remove the problems? And Kimberly, I think you mentioned one case, but we can open it up to both, right?

One is we've created an entire synthetic data set. And within that synthetic data set, we're going to solve the problems. But another one is I've got the original data. And using synthetic data, I'm going to solve the problem in the actual real data, right?

I think we should challenge ourselves to put a lot of investment into figuring out how to do this sensibly so we have a better world and a better society. But it's going to be much better people than me.

We have sociologists and ethics experts that are going to have to tell us how you do this in the most sensible, rational way because I think the mechanics of the science will be easy to do. We'll get that. It'll be fairly simple because we got a way now by the way-- you can actually just look at a result and say, well-- and I've given you a really lame example.

I always use the example of insurance products for vintage vehicles where everybody looks like me: bald with a beard in their late 40s. And the truth is, my wife likes them as much as anybody. But the data for the last 20 or 30 years is massively biased towards a particular type of buyer, right?

Whereas today, it's all changing. So we could easily just ignore that and change that and say, OK, well, don't care, and amplify the parts of the data that we need to. You see where I'm going with that, Kimberly?
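The mechanical side of "amplifying" an under-represented group can be sketched very simply. The example below uses plain random oversampling with replacement as a stand-in for synthetic amplification, which is a swapped-in, much cruder technique than a generative model. The customer records and function name are hypothetical. It reaches numerical parity, but it says nothing about whether the result is fair.

```python
import random

def oversample_to_parity(rows, group_key, seed=0):
    """Repeat rows from smaller groups until every group matches the largest."""
    rng = random.Random(seed)
    groups = {}
    for row in rows:
        groups.setdefault(row[group_key], []).append(row)
    target = max(len(members) for members in groups.values())
    balanced = []
    for members in groups.values():
        balanced.extend(members)
        # top up with random repeats of existing members (none if already at target)
        balanced.extend(rng.choices(members, k=target - len(members)))
    return balanced

# Hypothetical 90/10 split, echoing the made-up figures in the conversation
customers = ([{"gender": "male", "spend": 100 + i} for i in range(9)]
             + [{"gender": "female", "spend": 140}])
balanced = oversample_to_parity(customers, "gender")
print(len(balanced))
```

In this toy case the one female customer is simply repeated nine times, so the "balanced" set contains no new information about women. That is the stretched rubber band again, and part of why the hard questions sit with sociologists and ethicists rather than with the mechanics.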

It's a very tough one. I think you and I would want the world to be perfect and actually as fair as it can be because we're fair people and live hopefully in fair societies. But from that to being engineers that are saying, I'm going to from a synthetic perspective do this? OK, where to start?

KIMBERLY NEVALA: Well, back in my consulting days, I was working on data governance and analytics governance. And somebody asked me, hey, we're moving to the cloud. So how does that change our data governance issues and problems and concerns? And I said, well, it really doesn't. I mean, the data is the data. You moved it from one place to the other. It doesn't do anything for the rest of it. You might have new tools to help you secure it in certain ways.

FERNANDO LUCINI: Right.

KIMBERLY NEVALA: But the idea about policies and what's appropriate usage…moving the data hasn't changed those questions and those concerns. In the same way when we're thinking about fairness or equity or really just making sure that the data reflects the world, the real world, that we, in fact, want to operate in. It sounds like whether it's "real data" or data that we synthesized, those issues and considerations still need to be really thoughtfully and mindfully considered.

FERNANDO LUCINI: Yeah, yeah. And I really don't mind touching the topic because I think if we all have a part, then we all want to be fair, right? So I don't mind touching the topic.

But it's the consequences of measurement, of action, of-- it's those things that I worry about. But privacy is the same, by the way. With privacy, we want to be able to say this has inherent privacy because we've done x, y, and z, and we've measured xy this way, and we've got these standards that we've met. But privacy is binary, right? You either want to be private, or you don't want to be private.

KIMBERLY NEVALA: Right.

FERNANDO LUCINI: Fairness is not binary. Fairness is in the eye of the beholder in general terms, right? The ultimate fairness is in the eye of the beholder.

So dealing with that becomes very interesting. And let's not forget that these models that we're using to generate some of this data are really complex. I mean, some of them can be very opaque. We haven't talked about it, but they can be massively opaque.

You got these amazing generative models that create these enormous high dimensional-- and by dimensional, we mean pieces of it, right? Different aspects of it. You know, immensely high dimensional synthetic data which is magnificent from a science perspective. But they're also pretty much like a black box. So you can't explain it to a normal human being without massive mathematics.

And even then, there's areas of it you can't explain at all. It's a natural reaction to the design that you've created, right? So then that adds another layer of interesting things around this, right?

KIMBERLY NEVALA: So understanding that a lot of the common techniques that are being used to create synthetic data, as you've said, bring their own issues. Like opacity and not necessarily being able to exactly understand how and where and when different aspects of the data are being generated or why.

Say you've gone out and you've generated a whack of data -- technical term. What kinds of controls or validation criteria do you need to apply to this? I know this is actually an area of substantive research right now. What are we seeing and what are we learning about how to validate these types of data sets?

FERNANDO LUCINI: At this point, I think it's too early. It's way too early. So as I said, this is where we call people like Mihaela van der Schaar and other goddesses and gods of the academic world because this is where academia is needed.

They need to sit there, and they need to think hard and long and test and validate because what are you going to validate? You don't validate the techniques. Techniques are techniques, right? You validate how they're applied, how they're applied towards privacy.

And again, these are things that we talk about generative models like they're a simple thing or variational autoencoders. But underneath, these are mathematically super complex things that you can manipulate in endless ways.

So to some degree, the problem becomes within all this complexity, how do you create the minimum amount of controls that are the most sensible that will prove to somebody that the methodology used-- not only for the creation but also the testing - let's not forget that.

Maybe we simplify like that, Kimberly, in such a complex area. You're going to have to create a set of controls that say, I have created these using these controls, and depending on the model, the controls will be different. But it tells you that I know how to use this model in a way that is sensible and it's focused towards privacy, for example.

Then you're also going to have to prove that you've created the testing that actually goes back and says, I've tested this. So I've not only created in the way that I think is differentially private or has the right privacy as far as the controls. But I'm also looping back into it and testing it, or I have tested it and I can certify I have tested it. So it's not full of leaks or even arbitrary things.

We talked a little bit at the beginning that you can have these leaks. But you can also have these-- if the data is really big - you can have these simple statistical coincidences.

So how do you prove both sides, the controls of how you've created the thing so it's sensible and the testing that you can prove that you can literally say I built a thing, and then when I test it back, I can tell you it has. And it'll never be black and white.

It'll be-- because you're never going to be able to test a data set which has a billion columns or something. You're not going to sit there checking every single one. Oh, maybe you do. Maybe you do. So there it is. I don't know quite what we're going to land on that, and it's going to need cleverer people than me to figure it out unless you figured it out, Kimberly, and you're going to tell me now that it's all sorted out and I'm way behind the times.

KIMBERLY NEVALA: I think I'm talking to you because I have no idea whatsoever. And you know, again, it's an old problem made new. You know, we used to talk about data quality with very simplistic data. Do you have all the right sort of numbers for accounting or whatnot? And even there, if we look at how folks are managing to understand the completeness, the comprehensiveness, the accuracy and quality of their very basic data sets that we know: operational business data, finance, accounting, and basic personal information. We don't do a very good job. So it strikes me that this idea of doing data quality for this kind of data is a whole… it's just a whole new problem.

FERNANDO LUCINI: Yeah, and that's the problem. At the moment today, it's down to who you are talking to and whether you can be convinced that they have the right arguments, that they are doing this in that controlled way. There are startups that do - we haven't talked about this - but there are startups that will tell you that they do create private data.

So there's elements of these folks proving to you that they have done things in a way that convinces you that it is correct. So there's some of that. And that will evolve.

But today, I think for the average company that wants to use some synthetic data, they have to really, really be educated so they can be convinced that the method used and the method of test is appropriate. And there is your trouble.

You see what I mean, Kimberly? This is where it becomes troublesome. This is where suddenly, we've gone from synthetic data is wonderful and we should be creating AI-driven synthetic data. And then the reality is that it's quite hard to do. Quite hard to prove that you've done it right. Quite hard to test that you've done it right. And quite hard to explain to somebody that the very complex methods you've done to do it right are whatever they are and the listener is going to understand them - the person from risk in a bank or whoever it may be.

So I think we're going to have to work very hard on trust. We're going to have to work very hard on formalized measurement that we can all count on that we don't need to understand. I just know that there's a methodology created by blah, blah, blah, which is part of the IEEE (I make it up, wherever it may be) that as long as those controls are followed and they can be proven, then I don't need to understand how a variational autoencoder works. Never mind how that implementation works or how the test harness works, et cetera, et cetera, et cetera. And I think those things are still to come.

KIMBERLY NEVALA: Yeah, which is somewhat problematic because to some extent, I think the market, the appetite for synthetic data is growing. I've heard that spoken about even in the context of small or medium sized businesses. Or just businesses that aren't digitally native, that aren't able to generate the data sets and don't have the historical data sets that a digital native company would. And therefore not necessarily able to compete.

Now certainly, there's other ways for them to potentially use-- to buy models and things like that that have been pre-trained with all of the issues and caveats that come with that. But--

FERNANDO LUCINI: Absolutely.

KIMBERLY NEVALA: --it sounds like we're still-- the cart is still a little bit before the horse for you to be able to just go out and buy a synthetic data set and cross your fingers and hope for the best that it's actually representing what you've been told.

FERNANDO LUCINI: Oh, absolutely. And if you think about how the average company is seen now. So everybody is very, very focused on getting their data to the cloud for all the right reasons-- because they want the agility and all these things. So we've spent a lot of time and we continue to spend time doing that. And maybe we do more of that than we do data science or clever things like that because we need to get the data there. We're generalizing, but let's just say that that's a general trend.

At some point in the next year or two, that will end. The data will be there, more or less, and you're ready to have some fun, at which point you'll realize that if you have client data through the years of COVID, this is totally screwed up. All the patterns change. Whatever the normality was before is being totally wrangled up in COVID. And COVID is now kind of going away. So what do you do?

You now have a two-year blip in the data where you can't really use it for machine learning because frankly, it's going to tell you something that's not helpful. So you sit there, and you think, OK, right? So there's no point cutting out two years of our life because then there's no point going three or four years back to learn what I need today because sadly, our society moves too quickly for us.

And we're cutting out examples of long-term running turbines and machine data of that kind that may not have been impacted. But a lot of the world's data was impacted. So then we sit there, and we think, hm, I'm going to have to either look for techniques that don't use that data even though I spent so much time and money putting it in the cloud-- but that's fine. I still need it there to run the operation or because I need to know Kimberly's balance in her bank for the last five years or whatever.

But for machine learning and for insights, hm, OK. That's a problem. So I either start doing things like reinforcement learning and I start using mechanisms that don't necessarily need the data, right, that maybe I start learning right now. And by the way, three years ago, we talked about reinforcement learning in a company, and we might look, OK, well, maybe that's the domain of the Googles of this world.

KIMBERLY NEVALA: Right.

FERNANDO LUCINI: The truth is, these days, it's in the domain of most people if they kind of know what they're doing. It's either that or that plus a synthetic data to say, well, maybe there's elements of data that I can synthetically use. Maybe I can get data from other parties. Maybe I can. There's a maybe. When we can use synthetic data, that maybe becomes more possible.

And I think this, to your point on synthetic being used more, I think that's what we're seeing. We're seeing now people looking back and going there's been a real change in scenery of hearing people wanting to use reinforcement, for example. Real change: before COVID, after COVID.

And I know time has passed. But even then, there's reality, which is based on that data having problems. Secondly to this is there's been a real change in people talking about synthetic but not so much people implementing, Kimberly.

I don't see that many because the reality is that if you're going to do synthetic data, if you're a big firm, you'll probably look-- you have some data scientists that might be telling you the reality, which is, this is quite hard. So it's not for the faint hearted.

So maybe we're careful. Maybe you go to a software company that's advertising out there, and you think, well, I'll use one of these. And some of them don't use quite modern AI synthetic data. They're using maybe less modern and more traditional stuff, which may still serve your purpose.

But I think we're in that moment of we need another year or two for these things to mature a bit so the average company can use it without feeling massively exposed. And with the market having tooling and a lot more research that validates that what they're doing is sensible.

It kind of feels like that's where we are. I don't know. Did you observe the same? Is that something you see as well?

KIMBERLY NEVALA: Yeah, I think so. I've been really struck by-- I suppose folks might say we're early in the - what's the Gartner terminology? Hype cycle for synthetic data. And I get concerned when I hear some of these things being put out there without the associated buyer beware label attached to it. But it feels like there's a lot of potential if done correctly. But patience in this interim period may just be required.

FERNANDO LUCINI: Yeah and be ready to get a lot of hate on this because I get it all the time. We'll get people who'll tell us, we'll do it all the time. You guys are not sensible. It's been done all the time.

There's always dangers with generalizations. There are great people doing synthetic data and small amounts for specific use cases, and it's all fine.

I think what you and I are trying to discuss-- certainly, I am-- more the generalization of large companies using this as a norm. We're very far from that-- very far, I think. When I say very far, I mean three years, two or three years.

KIMBERLY NEVALA: It's amazing how short those "very far" time frames, these time horizons, are these days.

FERNANDO LUCINI: Oh, well, Kimberly, think about it. Six months ago, we didn't have Dall-E. Now we have Dall-E. So suddenly we've come from a world where nobody could even think about writing a text that says: draw me a picture of a banana smoking a cigar on the beach. And now you can just go to OpenAI, write that sentence, and you get that picture. And three years is a monstrous amount of time in the AI world these days.

KIMBERLY NEVALA: Yeah, and another great example of then the more philosophical conversation that comes as a result of these technologies, right, which, what is art? What is…

FERNANDO LUCINI: These are synthetic data, right? I mean, I keep on talking to customers about GPT-3 as being -- oh, sorry. For the audience, GPT-3 is an amazing large language model from OpenAI, right? And I keep on talking to clients about GPT-3 as a synthetic data generator.

KIMBERLY NEVALA: Right.

FERNANDO LUCINI: It's as simple as that. If you ask it-- the other day, I did a presentation where I used GPT-3 to explain why context was such a difficult thing for machines. So why is context such a difficult thing for machines? And GPT-3 explained exactly why it's difficult. And I think that's entirely synthetically generated. Yet we all accept it because it's sensible.

KIMBERLY NEVALA: Right.

FERNANDO LUCINI: Because it's one of the things we can control. We can read what this thing produces, and we can say, OK, that's entirely sensible. That's exactly what I would have said. I accept that synthetic generation.

It's a slightly different thing from what you and I are talking. Which is large swathes of data that might contain quite-- by the way, why wouldn't you use GPT-3 to generate synthetic data within that data that you and I have just described? And it still has the issues that it's still generated from something. So it still has a learning. Is that learning-- it's an endless loop of interesting use cases, right?

KIMBERLY NEVALA: Right, but again, that's another great example where you can start to try to validate the outputs. And you can see a whole span of things that look perfectly reasonable. But because the scope of what can be generated is so broad then you don't see those other cases - which may be just as plentiful - where it's just really nonsense. And so where this starts to get a little concerning is when we are generating - if we're using this to write opinion pieces or Politico. Or in general, where it's not necessarily that obvious that it might be not fact-based or not factual or someone's human opinion, if you will.

So there's a lot of philosophical questions and certainly issues with GPT-3. Again, because even though it's synthetic, issues of gender bias, for instance, which comes from the language in the corpus that it was initially trained on even though it's now generating new data. So all the way back to that initial conversation that says just because it's synthetic…

FERNANDO LUCINI: Going back to that, right?

KIMBERLY NEVALA: …doesn't mean it…yeah, it's not born from whole cloth.

FERNANDO LUCINI: And GPT-3 and other large language models, by the way, we should take our hats off. This is a magnificent thing. I mean, Dall-E and GPT-3 are magnificent. And we should understand that the people at OpenAI and other people are people that are thinking of society and have no desire for this to be used in the wrong way. These are very moral people.

At the same time, we need to recognize the magnificence of what they've done and that it will evolve, right? At the same time that we say, well, if we're asking you to explain something as lame as what I-- give me an explanation of why context is such a difficult thing for machines to understand. It gives you a magnificent answer.

But we could ask it lots of strange things where heavens knows where we could come up with. So thus is the trouble with synthetic data, and we started the talk with synthetic data.

And I think what you and I could have done is say that we were talking about big-sized data. That we are creating synthetic copies on all those big databases, big tables, big blah, blah, blah, right? And we're here now talking about synthetic data created by a large language model. And the truth is, it suffers from exactly the same… all of this suffers from the same concerns, right? It's as simple as that, right?

KIMBERLY NEVALA: That's a good point. Simply unsimple. So we've been talking a lot about the need to set aside this idea of synthetic data as a panacea for privacy. And we've been talking about all the applicable warnings about the state of affairs for, as we've said, broad scale operational generation and application of this data.

But that being said, there are really interesting and important use cases and applications for synthetic data today. Far beyond these issues - if we ignore privacy and we think of all these things. Like around gene analysis and computational creativity.

Why don't we end on that note? Can you share some areas where this technology and the use of synthetic data is, in fact, really helping us solve some tricky problems or think differently in ways that we would not have been able to?

FERNANDO LUCINI: Or at least think-- we can allow ourselves to dream a bit, right? And you've named a couple that I really like. So in health care and life sciences, where all of us get the value from sharing the data, we should all open the door and say governments should be creating synthetic data to help all of us evolve.

It should be a global movement. I don't know if you agree, but it's one of those where I don't want to say my private data about my health care. But that's very different from a synthetic data that… I'm not in there, not really. Why wouldn't that be shared to solve the problems of massive issues? And we all have in our lives our personal use cases of family, friends around us that we can probably share a little tear when we think maybe if we knew more and we invested more in this, would it help cancer? Would it help COVID? Would it help all these other things that affect us, right?

So I think that's an entire story of things. And in gene analysis and stuff, there's generation of synthetic data and this has happened for years because they're just generating variations. Which leads us to the very fun…you talked a little bit about it. But we're using things like computational creativity, right?

And Dall-E is a great example. So imagine a world where you're in a marketing organization, and you're constantly using clipart, and it's limited, and there's a speed thing. And there's all these things where you can literally just ask for what you want. Be as creative as you want. Ask for whatever thing you enjoy and you like using text. And you can get that instantly because the machine just creates and with that computational creativity creates you that thing.

Another variation of this which I quite enjoy is it's all of those companies where you're creating a product which requires variation in order for it to be consumed, right? Of course, there are endless examples, right? Toys are a great example-- variations of colors of toys. We've all had-- in Spain when I was growing up, we had several fads of these yo-yos. We had yo-yos, and everybody had to have the one which was a particular color. And oh my God, there's not many there. The truth is, it's the one they have the less pigment. They didn't do enough of them. And suddenly, we all want it.

But this possibility the machines can actually look at, for example in flavors or in things like that, they can analyze through the patterns what are the combinations of aspects of these products that make it more attractive to us and let the machine create brand new interesting combinations to this. And I always joke that it's a little bit like the Harry Potter multiple flavor gum. Do you not remember this from the movies?

KIMBERLY NEVALA: Yes.

FERNANDO LUCINI: They have these little gums, and they have different flavors. But some of them are absolutely horrific, are like bogey flavor and things like that, right? And I always joke. It's a little bit like that but actually without the ugly flavors. It's actually creating very interesting flavors that you will like, right, and mixing the ingredients together in ways that a human wouldn't necessarily do because it's not in the palette or limited palette. So there are elements like that that I think the machine can do.

Another example I really like is in design, there are elements here where we can teach a machine, say, the physics of a chair. What does comfort mean? And we can let the machine really help us with how to design a chair in a way that it actually is beyond the ergonomics that we already know, for example. Not putting down the fact that people work incredibly hard in the physical sciences to understand our body and our chairs, but I'm giving it as a bit of an example.

And I think there's an entire element where synthetic data can be-- and there's plenty of artists, by the way, now creating-- I don't know if you know, but you can go online and buy synthetic art that people are creating-- artists are creating synthetically.

So I think there's an entire area of that creative use of synthetic data that I think could be interesting and magnificent. And Dall-E, everybody go on and look at Dall-E. It's fabulous. They've got a lovely Instagram that you can have a look at all the fun variations. It's just the beginning of mainstreaming computational creativity through synthetic data.

KIMBERLY NEVALA: I'm intrigued. And still, because I'm always-- another guest said she's a skeptical optimist-- intrigued and still slightly skeptical. So this will be great.

FERNANDO LUCINI: I know what you mean.

KIMBERLY NEVALA: Well, thank you so much, Fernando. I really appreciate you coming on today and helping us set some appropriate expectations for synthetic data today and providing some fun glimpses into the possible future. So thank you.

FERNANDO LUCINI: Oh, it's my pleasure. Always lovely to talk to you, and we should do this again often.

KIMBERLY NEVALA: Yes, we should. I'm going to take you up on that. Speaking of buyer beware, huh? I'll be back. Now in the meantime, next up, Patrick Hall of bnh.ai is going to join us to discuss the legal ramifications of AI and how we keep the scientific method present in data science. You're not going to want to miss it. So subscribe now.
