Secure Talk - Simson Garfinkel
Justin Beals: Hello, everyone, and welcome to Secure Talk, where we explore the cutting edge of cybersecurity, data protection, and privacy with the industry's most innovative minds. I'm your host, Justin Beals. In today's data-driven world, organizations face an impossible challenge. How do we extract valuable insights from sensitive data while protecting individual privacy?
Many of us operate under dangerous misconceptions about data privacy that mathematical research has now definitively disproven. For instance, we once believed that simply removing obvious identifiers like names and addresses was sufficient to protect privacy. We thought that aggregating data into statistics would make it impossible to trace back to individuals.
We even assumed that the size of modern data sets would provide safety through obscurity. But the mathematics of differential privacy has proven all these assumptions wrong. Researchers have demonstrated that seemingly anonymized data sets can be reverse-engineered to reveal personal information with alarming accuracy. At the US Census Bureau, researchers discovered that they could reconstruct race, age, and location for millions of Americans from published census statistics.
Even large, complex data sets provide far less protection than we once thought, creating what privacy experts call a mosaic effect, where multiple data sources can be combined to reveal sensitive information. Today we're diving into one of the most promising mathematical solutions to this problem: differential privacy.
This revolutionary approach provides a formal mathematical framework for measuring and limiting privacy loss when analyzing sensitive data. Our guest is Simson Garfinkel, who writes about the intersection of security, privacy, society, and ethics for both popular and academic audiences.
In his most recent book, “Differential Privacy,” in the MIT Press Essential Knowledge series, he presents the underlying concepts of differential privacy, explaining why it is necessary in today's information-rich environment, how it was employed as the privacy protection mechanism for the 2020 census, and why it has sparked controversy in certain communities.
In addition to being a journalist, Simson is a noted computer scientist and a fellow of the American Association for the Advancement of Science, the Association for Computing Machinery, and the Institute of Electrical and Electronics Engineers. Garfinkel has authored or co-authored more than 70 peer-reviewed academic articles.
He was previously a tenured associate professor at the Naval Postgraduate School and has held technical leadership positions at the US Census Bureau and the US Department of Homeland Security. Join me in welcoming Simson as today's expert guest on the future of privacy and how we measure it.
---
Justin Beals: Simson, thanks for joining us today on Secure Talk. We're really excited to chat with you, learn more about your expertise and, especially, your book on differential privacy. So, thanks for spending some time with us today.
Simson Garfinkel: It's great to be here.
Justin Beals: Excellent. I did a little research, of course, on your background beyond the book, and it's quite a storied career, Simson.
A lot of times we ask people about, you know, their career and how they got to where they are. But you have so much going on, I could spend a whole episode on it. So, I thought I'd kick it off with a slightly different question for you. Are there any really critical experiences or projects that you look back on from your career that were important to the work you're doing today, your book or other things you're working on?
Simson Garfinkel: That's a very difficult question to answer. Again, it's really great to be here. My career is very difficult to parse because I have been a journalist and I've also been a university professor, and I've started five companies, and I have 15 years of experience in the federal government. So there are a lot of different pieces.
But I would tell people not to mirror what I've done because it was exceedingly high risk. And instead, I think that it's really important for people, especially in today's economy, to specialize, to go deep. There are a lot of people who say, “Oh, I do a lot of different things,” and that's really amazing. But the value that we have is in going deep, and having people who have deep expertise in particular areas is just incredibly valuable.
I will say that the most important skill I have is being able to do research, being able to find out new things. I treat computer science as an experimental science. I believe that research means being able to find things in libraries, but also being able to do systematic experiments, knowing how to challenge your core beliefs, and being willing to change your beliefs based on new data. And people have a real problem with changing their beliefs based on new data.
Justin Beals: Yeah, I love that statement. The ability to change beliefs is a difficult challenge. At one point, I was diving deep into some neuroscience work, and I found a study that measured blood flow activity, and it talked about how the brain is inclined to believe in its current pattern configuration, and how the energy expended in changing that pattern configuration was exponentially larger than making the same decision within it. And so it's resistant, you know, to changing how it thinks.
Simson Garfinkel: And that's relevant to the book on differential privacy because the book opens with me encountering one of the four inventors of differential privacy, Kobbi Nissim. And at that point, I really wasn't convinced that differential privacy was necessary. I thought it was an interesting theory. It was a theory I could barely understand.
And I thought that other approaches that we had for protecting privacy could work also. And he confronted me. He said, “What, you don't believe the math?” And I'm like, well, sure, I believe the math, but I just don't think the math is necessary. And during my first six months at the Census Bureau, I really learned that we had to take principled approaches to protecting privacy.
Because if we didn't use approaches based on first principles, based on mathematics, then they could fail in ways that weren't anticipated. And most of the big data privacy snafus that we've had have come when the assumptions that people made when they published their processed confidential data turned out not to be true.
And so with differential privacy, with any approach that's based on mathematics, it's possible to have a very tight bound on what those assumptions are and not to be surprised.
Justin Beals: Yeah. And I love this part that I learned in reading your book about what differential privacy is: it is a mathematical-style function. You know, it is not a methodology or a set of rules, but more like a formula.
Simson Garfinkel: There is a formula for the privacy loss that happens to people in the data set when confidential data are used to make statistics, and the mathematical formalism is comforting to people who understand it, but it's off-putting to others.
So, in part, it's not the privacy loss that happens, but the maximum privacy loss that can happen. And one of the things that doesn't make sense until you really get into it is that by doing better math proofs, you can show that your mechanism is actually doing a better job protecting privacy than you thought it did.
And that's because this is all based on worst-case analysis, not average-case analysis or, you know, what we hope things might be.
Justin Beals: Yeah, I could certainly see how this would be difficult for what might be the legacy privacy practitioners. And I think of law practitioners and certainly maybe advocates or public policy advocates.
Simson Garfinkel: So the lawyers have not been the problem. When I work with law professionals and policymakers, they're largely on board with the idea of moving to differential privacy. Those who are pushing back have been some of the data users who are frightened by the idea of adding noise to statistics to protect privacy.
They say, “How dare you add noise to statistics? People will die if you add noise.” No, really: at Harvard, at the Berkman Center, people there working on differential privacy have been told that differential privacy will kill babies, because it will impact our ability to understand medical data.
So there's been some pushback from people who are economists or people who are demographers. The other problem has been that differential privacy is very young, and we only can apply it to a few areas. And so people who've tried to apply it to things like protecting photos, or protecting geographic plans, or protecting graph data have had a really hard time.
And they view that as a problem with differential privacy, but it's really a problem with the youngness of the field. It's as if it were the mid-1990s and we were talking about the problems with public key cryptography: that it was slow, that it was hard to use, that it was error-prone, and that you didn't need it for all those things.
And now, 25 years later, we have a much better understanding of what we do and do not need public key cryptography for. So, I think that we're going to see the same sort of evolution with differential privacy and with other formal privacy methods. And that's another thing to point out, which is that differential privacy is one tool in the toolbox for protecting privacy in statistical databases, but it's just one tool, and there are other tools at this point.
We've probably lost everybody, because we haven't actually said what differential privacy is, why I wrote a book about it, why the US Census Bureau decided to use differential privacy for the 2020 census, why differential privacy is in Google Chrome and in the Microsoft operating system, and how it's actually going to make the world a better place.
Justin Beals: That's a big list of solutions, but let's start with your first one. I think that's great. Let's clarify: what is differential privacy?
Simson Garfinkel: So, differential privacy is an approach, invented in 2006, for protecting the privacy of individuals whose data are used to make statistics. Differential privacy allows you to put mathematical bounds on the amount of privacy loss that those individuals might suffer when their private data are used.
But I just did a sleight of hand, right? Because I moved from protecting privacy to limiting privacy loss. So, what differential privacy actually does is give us a mathematical definition for privacy, for the first time ever. Now, you can argue with that definition, but it's a workable definition, and we don't have others.
And differential privacy is designed, in theory, to work with any data set, but the implementations that we have work best with tabular data or with linked tables, and it does much, much better with a single plain table than with linked tables.
Justin Beals: Yeah, structured information, record style analysis.
I can see why that's so interesting to the Census Bureau. And you really do highlight in the book the growth of the problem from a U.S. Census Bureau perspective. Maybe you can highlight what challenges they were struggling with. They're trying to solve two different problems.
Simson Garfinkel: So, so let me answer your question, but take a little step back. And before we do that, let's talk about what the mathematical definition is.
Justin Beals: Yes.
Simson Garfinkel: So the idea of differential privacy is that if your data are in a data set and it's published, then something about that published information can be linked back to you.
And if your data are not in a data set and either that data set is published or there's some statistics that are published based on that data set, then it can't possibly impact your privacy because your data, they're not in that data set. Now, if that data set is like genetic information and there's 100 people in the data set and, you know, I'm not in it, then it's hard to see how it's going to impact me.
Unless those hundred people are all my relatives, right? Because there's linkage between people that might exist in the real world. So the first question that you have to think about is what do we really mean by my data, what do we really mean by records being independent, and how independent can things actually be?
But aside from that, the basic intuition is that your privacy can be impacted if your data are used to make a publication, and it can't be impacted if your data are not used. And so what differential privacy tries to do is to limit the difference between a publication that is based on the data set with your data and the same data set without your data.
The question is how big or how small the difference between those two can be. The bigger that difference can be, the bigger the impact on your privacy can be. But if the difference between a publication based on a data set with your data and one without your data is very small, then that publication isn't having much impact on your privacy.
And what differential privacy allows us to do is tune that difference. We can tune it so there's no difference, and then your data don't matter; of course, it's a completely private publication, but it's not useful for making statistics. Or we can make it so that the differences are easy to discern.
So, for example, take a data set that has my name in it, and suppose the publication mechanism is to release all the names. Then the difference between the data set that has my name in it and the one that doesn't is going to be very significant. So the definition works with any release mechanism, but it only makes sense with certain release mechanisms, such as when we're making public statistics.
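The intuition Garfinkel is describing here is usually written down as the standard textbook definition of epsilon-differential privacy (the formula below comes from the literature, not as a quotation from this conversation): for any two data sets D and D′ that differ in one person's record, and any set S of possible outputs, a randomized publication mechanism M must satisfy

\[
\Pr[M(D) \in S] \;\le\; e^{\varepsilon}\,\Pr[M(D') \in S],
\]

where epsilon is the tunable bound on how much one person's presence or absence in the data can change what gets published.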
Simson Garfinkel: Now, the Census Bureau is charged under the Constitution: Congress is required to conduct a decennial census every 10 years for the purpose of apportioning the House of Representatives. And that is just one of a hundred different data programs that the U.S. Census Bureau has.
Many of those Census Bureau programs are calibrated using the decennial census. So, if you are engaged in marketing physical objects and you're trying to figure out where to put stores, you might look at the American Community Survey, or you might look at a dataset based on the American Community Survey.
And that's a mandatory stratified probability sample of all the people who live in the United States, asking them all sorts of invasive questions. And then the Census Bureau takes those results, and each person is weighted, so that one person might be worth five, because they represent five people, and another person might be weighted at 47, because they represent 47 people, and those weights are determined by the decennial census.
Justin Beals: Yeah, I mean, we want, or, let me just put myself out here: I would like data-driven decisions around a lot of our policy work, for the greatest positive impact for the greatest share of the population.
Simson Garfinkel: So a really good example, for people who want to understand what you mean by data-driven decisions, might be a school system trying to decide where to build a new school, or which schools to close. And one way to do that would be by looking at census information to see where all the children are right now, aged between, say, one and five, because it might take five years to build that middle school.
And so you want to know which middle school we should start to build, or which middle schools we should think about closing, and that's information that only comes from the census, because that's the only reliable source that we have for information about children under five. The Census Bureau collects that information, it's generally not available commercially, and information about people's children is some of the most sensitive data that the Census Bureau collects.
That information can also be a source of privacy issues, as you can imagine. Suppose you were to fill out a mandatory form telling the government that you had three children under one year old, and then a few months later you got a mailing saying, you have three children, and we're interested in selling you these products and giving you these coupons.
That might be very invasive. Lots of people, when they have children, the hospital sells their information, and people find that invasive. You can request that that information not be sold, but lots and lots of mailings start coming, lots of email, lots of ads on the internet. That information, if it's coming from the government, is more of a problem, because you don't have a choice about whether the government has that data or not.
Justin Beals: Yeah, so we have this challenge, right? We want to collect the data to provide the best resources at the right time to the right need. But there is a fear, and I think we've, of course, seen some impact of large data collectors.
And how there have been privacy issues for us in these large data collections, so we have a fear of both. And my understanding is that in differential privacy we're trying to balance these two vectors: the ability to perform good analysis, and also understanding what privacy we might be giving up in that statistical analysis.
Simson Garfinkel: So it used to be the case that there were these privacy activists who had these articulated, but not very specific, fears about the dangers of having large amounts of data collected.
But now we have a better sense, because we've actually seen some incidents where the fact that people were using certain apps was revealed, and geolocation information was combined with app data, resulting in people losing jobs and facing the possibility of legal action.
This is, of course, usually around abortion issues. There was a case of a journal called The Pillar, which was finding priests who were using gay hookup apps and using that to out them, because the company that was making the hookup apps available was selling ads on the apps, and the ads were generating geolocation data, and The Pillar bought geolocation data showing who was using the app and where they were using it.
And there are some places where only one person lives, and that person happens to be a priest living in employer-provided housing. So, irrespective of what you think about the ethics of that, that is an example of such a case, and we only have a few of those kinds of cases.
But we also now have many other sorts of data provenance cases where people's data was used to create data products that those people found objectionable. The New York Times ran this great article about people whose photos had been uploaded to a photo-sharing platform, having those photos used to train face recognition systems that were being used by law enforcement.
Justin Beals: Yeah. And certainly, I've worked a fair bit on the data science side of machine learning, and there's a lot of danger in those types of systems. They're probabilistic, you know, they're rarely exact, and they also oftentimes represent the worst of society-at-large assumptions about certain people.
Simson Garfinkel: So one of the things about differential privacy that we talk about in the book is that the techniques it uses to protect privacy, the adding of noise, also help machine learning systems in other, unexpected ways.
It prevents overtraining, or overfitting the data on specific individuals. That's a way that it protects privacy, but it also means that the models tend to be a little bit more fair, tend to be a little bit less individually specified, because they don't have that overfitting.
So the hope is that we'll see better machine learning when the data are fuzzed out a bit like that. The danger is that the models won't be as specific and that we won't be able to get the very small signals, so that is one of those trade-offs. Differential privacy is very exciting for companies doing machine learning, because it prevents training data extraction from machine learning models.
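One common way this idea shows up in practice is differentially private stochastic gradient descent: clip each example's gradient so no individual can move the model too far, then add noise before updating. Here is a minimal NumPy sketch of that technique; the model (plain linear regression), the clipping bound, and the noise multiplier are illustrative assumptions, not values calibrated to a particular privacy budget.

```python
import numpy as np

rng = np.random.default_rng()

def dp_sgd_step(w, X, y, lr=0.1, clip=1.0, noise_multiplier=1.0):
    """One DP-SGD-style step for linear regression: per-example gradient
    clipping plus Gaussian noise added to the summed gradient."""
    grads = []
    for xi, yi in zip(X, y):
        g = 2 * (xi @ w - yi) * xi          # per-example squared-error gradient
        norm = np.linalg.norm(g)
        if norm > clip:                      # clip to bound any one person's influence
            g = g * (clip / norm)
        grads.append(g)
    total = np.sum(grads, axis=0)
    total += rng.normal(0, noise_multiplier * clip, size=total.shape)  # add noise
    return w - lr * total / len(X)

# Toy usage on synthetic data: the learned weights approximate the true ones,
# but only up to the precision the added noise allows.
X = rng.normal(size=(200, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + rng.normal(0, 0.1, size=200)
w = np.zeros(3)
for _ in range(300):
    w = dp_sgd_step(w, X, y)
print(w)
```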
Justin Beals: Yes. And actually, that was one of the big appeals of the book for me in reading it. And it's actually not a fear I have, because there are so many machine learning techniques that I've used in the past where we did add data to smooth the curve.
Or we were struggling with an overfitting outcome, and we knew it statistically and needed to go back and either supplement the data with more information or just generalize the data set in such a way that, you know, its probabilistic outcomes were more fair.
Simson Garfinkel: And this is a big concern for companies that want to deploy models on devices like cell phones. If you're giving the model to a user to use in the field, then that means that a malicious user has the ability to run many, many samples through the model, and eventually the ability to look at all the model weights and to do a training data extraction, to do membership inference tests.
As a result of that, the added security of differential privacy is useful, even ignoring the privacy-protecting features of it. But I said earlier that differential privacy was one tool in the toolbox. So two other tools that listeners may have heard of are homomorphic encryption and secure multiparty computation.
Now, both of those are approaches that allow multiple parties to pool their data in an encrypted form and to come up with answers to statistical questions. So, using a very simple mechanism, we could have 10 people in a room, and we could find their average salary in such a way that it was impossible for any of them to figure out any other person's salary.
And we could actually calculate any function based on their salaries using homomorphic encryption. Now, the problem is that if you just use homomorphic encryption, or if you just use secure multiparty computation, then you still have the problem that the data that are published might be reverse-engineered.
You could build a large constraint system. It's like playing Sudoku with the data, and you can get back to the source data. Even though no participant can figure out the computation from the publication, it still might be possible to learn people's individual values.
And an example we give in the book is, say you're in a math class, and the teacher says the class average is 98, there are 10 people in the class, and you get your score back and see that you got an 80. So now you know immediately that everybody else in the class got a perfect score, and that's just because that's the only possible solution that works.
And that shows that when you give exact answers to statistical questions, you create constraints that limit the possibilities for the raw data.
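A small sketch of that classroom example, in Python rather than words: the exact average pins down every other score, while a noisy average (Laplace noise with an illustrative epsilon chosen only for demonstration) no longer supports that exact back-solving, at a cost in accuracy.

```python
import numpy as np

rng = np.random.default_rng()

scores = np.array([100] * 9 + [80])      # the confidential class scores
n = len(scores)

# Exact publication: the class average is 98. Knowing your own 80, the other
# nine scores must sum to 98 * 10 - 80 = 900, and since no score can exceed
# 100, every one of them must be exactly 100.
exact_average = scores.mean()
print((exact_average * n - 80) / (n - 1))    # 100.0 -- everyone else is revealed

# Noisy publication: add Laplace noise to the average. One person can move an
# average of scores in [0, 100] by at most 100 / n, so that's the sensitivity.
epsilon = 1.0                                 # illustrative privacy-loss bound
scale = (100 / n) / epsilon
noisy_average = exact_average + rng.laplace(0, scale)
print(noisy_average)                          # near 98, but no longer exactly solvable
```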
Justin Beals: What I loved about this example in the book is you proved to me very early on that this is mathematics, right? Like the ability to look at something like, hey, we have X number of records, we can give an average for the grade, and then we can reverse engineer what grades people might have gotten. It was like, “Oh, this is math.”
Like, we're moving the equations back and forth, and we're able to learn a lot and reveal what could be private information. I mean, in this instance, this is a FERPA violation because we've revealed the score of one of the students.
Simson Garfinkel: Well, it could be a FERPA violation on a plain reading of FERPA. But, as a lawyer who read the book said, you'd actually have to go to court to find out if it truly is a FERPA violation.
Justin Beals: Good, Simson. Yeah.
Simson Garfinkel: But I do want to talk about this reverse engineering of data releases. This was what the U.S. Census Bureau used as the argument, the proof, that we had to move to differential privacy in 2016 and 2017. People at the Census Bureau took the publications from the 2010 census and reconstructed the raw data that could have generated those publications, and I was a part of that team.
What we found out was that for a large number of people, for tens of millions of Americans, there was only one possible value of their census return on the 2010 census that would have produced the tabulations for the block and the census tract that were published by the Census Bureau. And since the data collected included race and age and marital status and sex and ethnicity, and of course the address they were at, for many people it was possible to take the census publications and figure out what race they had reported, and many people consider self-reported race to be confidential information. It was possible to find the self-reported race of people's children, and in many of the situations the reconstructed data was geographically segregated, so it would have been possible for somebody who did this to actually link it up to actual individuals, based on the address or based on other commercial data that could be purchased, and to do a data linkage attack.
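To make the "playing Sudoku with the data" idea concrete, here is a toy reconstruction sketch with entirely invented numbers (nothing here is census data): a tiny "block" of three residents, a few exact published counts, and a brute-force search over every combination of records consistent with those counts. When only one combination fits, the confidential records have effectively been reconstructed.

```python
from itertools import product

# Each resident is described by (age_group, race); four possible record types.
record_types = list(product(["adult", "child"], ["A", "B"]))

# Invented exact tabulations published for this block:
#   3 residents, 2 adults, 1 person of race "A", 1 adult of race "A".
def matches_published_counts(block):
    return (len(block) == 3
            and sum(age == "adult" for age, _ in block) == 2
            and sum(race == "A" for _, race in block) == 1
            and sum(age == "adult" and race == "A" for age, race in block) == 1)

# Enumerate every possible assignment of record types to the three residents,
# keeping each consistent combination as an order-insensitive multiset.
solutions = {tuple(sorted(block))
             for block in product(record_types, repeat=3)
             if matches_published_counts(block)}

print(solutions)
# A single solution means the block's "confidential" microdata is fully
# determined by the published counts -- the reconstruction attack in miniature.
```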
Justin Beals: Okay, so I picture a briefing at the U.S. Census Bureau and a lot of jaws on the floor in this moment.
Simson Garfinkel: Well, it was a multi-year process getting the Bureau to that point, but we were successful, and the deputy director of the Census Bureau said that we had to use differential privacy for the 2020 census because we didn't want to go to jail. And he said that on the record. It was great.
Justin Beals: That's wonderful. Yeah. I mean, I have had a couple of those moments when working with large data sets, like, “Oh, no, we made some assumptions that weren't the best.”
Simson Garfinkel: So, there's a siren outside; I don't know if you can hear it. The Census Bureau operates under Title 13 of the U.S. Code, and Section 9 of that says that they have to assure the confidentiality of the data that they collect, such that anything that is published cannot be traced back to a particular individual or a particular establishment.
Justin Beals: Yeah, there's something about differential privacy that you highlight in the book that I think is just a really important thread for security broadly that we sometimes leave off, and I want to provide a quote here, where you say, “Security should come from the strength of the design rather than from its secrecy.”
And I know that I work with a lot of security experts who still grasp at obfuscation or things like that. Tell us a little bit about this aspect of differential privacy.
Simson Garfinkel: So I wouldn't have thought that in 2025 you would still be working with security professionals who don't believe that it's important to publicize the security mechanism that's being used as a way of ensuring the security of that mechanism. That's a principle that goes back in computer security to the early seventies, and it goes back in general security for hundreds of years, to early lock design. And the idea was that if you can't tell somebody how a lock works, if the thing that is providing the security is the design of the lock and not the key itself, then that's a problem.
Because people can get the design of the lock, they can steal a lock, they can take a system and reverse engineer it. You have to build the security into something that is easily changeable. In the case of, say, an encryption system, we don't believe in having secret algorithms in the computer security community; we believe that all algorithms should be published and that the full strength of the security should be in the key.
Likewise for, say, Amazon's security system for Amazon Web Services. They clearly document their security model, and it's up to you to properly implement it. But they don't just say, “It's secure, trust us.” So the beautiful thing about differential privacy is that there are no secret parameters.
The security comes from the use of truly random numbers that are added to the data before the data are published. So we add statistical noise, and getting the Census Bureau to that point, getting the developers at the Census Bureau to the idea that we would not have repeatable random numbers, that we would really trust the randomness, that we wouldn't have some sort of backdoor way of checking things, that was hard to do. But that is, in fact, part of the privacy guarantee.
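A minimal sketch of that "truly random, non-repeatable noise" point: draw Laplace noise from the operating system's entropy source (Python's secrets module) rather than from a seeded, reproducible generator, and add it to a count. The epsilon value and the count are placeholders for illustration; this is not the Census Bureau's implementation.

```python
import math
import secrets

def laplace_noise(scale: float) -> float:
    """One Laplace(0, scale) draw via inverse-CDF sampling, using OS entropy
    so the noise cannot be replayed from a stored seed."""
    u = (secrets.randbits(53) / (1 << 53)) - 0.5          # uniform in [-0.5, 0.5)
    u = min(max(u, -0.4999999999), 0.4999999999)           # avoid log(0) at the edge
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def publish_count(true_count: int, epsilon: float) -> float:
    """Publish a count with Laplace noise; a counting query has sensitivity 1."""
    return true_count + laplace_noise(1.0 / epsilon)

print(publish_count(1234, epsilon=0.5))   # different on every run, by design
```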
Justin Beals: Yeah, and I love that differential privacy works this way. It gives me a lot of confidence. And certainly, in my experience, things like the open source movement have been a lot about that, right? Trusting that a system is highly reviewed, that anyone can look at it and its relative strengths or weaknesses, and that we can talk about it openly.
Simson Garfinkel: Well, I will say that it's an open question whether open source guarantees that the security is better, because I believe it, but we also know that when you publish the software, that allows the attackers to look at it as well.
And not all attackers will reveal what they've learned. So in addition to making your security mechanism public, it's good to have it be mathematically validated. That's why I'm a big believer in the use of formal methods for building software, but it's also important to actually have people look at it.
So the Kerberos software security system that was developed at MIT and used for secure login to systems for 10 years had a very basic security vulnerability that was not publicly revealed, and that is that the random number generator wasn't being properly seeded. It's an obvious thing to look at, and just nobody looked at it.
So at the Census Bureau, we actually had multiple code reviews of our software that implemented differential privacy to make sure, you know, to add to the hope, that it was doing the right thing. Differential privacy itself is formally proven to limit privacy loss within bounds. We actually had the random number generator part of our differential privacy implementation looked at with a number of code analysis methods.
And at one point there was a problem found in the random number generator while I was there, and it was corrected, because we actually spent a lot of time looking at that and making it publicly available. This was the first time that the privacy mechanism used by the Census Bureau was published.
All previous times, the privacy mechanism was black art; we weren't telling people how it was done. And one of the things that the chief scientist did while I was there is that the Census Bureau authored a number of monographs on the confidentiality mechanisms that have been used to protect previous decennial censuses.
So that's been an important part of the adoption of differential privacy: to take the old techniques that were used and to reveal that they were not as good at protecting confidentiality as people hoped or thought they were.
Justin Beals: Yeah. One of the things that you mentioned earlier in our conversation, which I think your book does a great job of explaining, is the concept of tunability.
We see that in computer science systems where we're dealing with more than one vector of analysis, and we want to tune it. You know, if I were a practitioner developing a large data set, how should I approach differential privacy and getting the right tuning?
Simson Garfinkel: So tunability is actually one of the reasons that some policymakers don't like differential privacy, because it requires that they make a decision to balance how accurate they want the results to be against how damaging to people's privacy they want the results to be. If you don't have a system that's tunable, or if the system is tuned by experts who say “this is the correct setting,” then there's less risk for the policymakers that they will have made the wrong decision.
But differential privacy takes that question and puts it squarely in the hands of the people who are charged with setting policy, or with implementing policies set by, say, Congress or policymakers. Now, you mentioned earlier the problems with the laws. There is a problem with many laws, in that many laws state absolutes: you may not make any publication, as I just said about Title 13, where the person can be identified.
Differential privacy can't prevent that; it can just make it less likely. And how much less likely we want it to be is a tunable parameter. So what should the proper tuning of that parameter be? The lawyers can't tell you what it should be. That's a public policy question where you have to weigh the impact on people's re-identifiability against the purpose for which the data release is being made.
And that kind of weighing of different equities is a really difficult thing to do. It's something that for many years the policymakers or, you know, the leaders were just putting off on some technologists, and the technologists assured them that they were doing the right thing, and it really wasn't a big topic for discussion.
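As a rough numerical illustration of what that tuning knob does (the epsilon values and the counting-query setup are arbitrary examples, not recommendations): a smaller epsilon means more noise and stronger protection, a larger epsilon means less noise and a more accurate publication.

```python
import numpy as np

rng = np.random.default_rng()
true_count = 10_000                      # some hypothetical published count

for epsilon in (0.1, 0.5, 1.0, 5.0):
    scale = 1.0 / epsilon                # Laplace scale for a sensitivity-1 count
    noisy = true_count + rng.laplace(0, scale, size=100_000)
    typical_error = np.mean(np.abs(noisy - true_count))
    print(f"epsilon={epsilon:>4}: typical error in the published count ~ {typical_error:.2f}")
```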
Justin Beals: It's interesting to me, because I think this used to be the purview of policymakers, and it's becoming some of the purview of corporate technology systems. You know, we're in an era where I can put a lot of data on a fairly cheap system and generate a statistical analysis of the outcome. And in my role as an executive at my company, I want to be aware of the tuning and decisions we're making.
The impact of a bad decision on an analysis versus the privacy reveal could be really poor for my business. We could lose customers.
Simson Garfinkel: So for businesses, one of the big concerns has been not so much statistical privacy, but just preserving confidentiality: data are collected, and then they're put on systems that are connected to the Internet and not very well protected, and hackers are able to exfiltrate all the sensitive data. That's happened to me.
It's happened to everybody in the United States; we get these breach notifications. Differential privacy doesn't protect against that, because it's not a computer security technique. It's a statistical data processing technique. Another thing that people want to do with differential privacy that it's not well suited for is to make data set publications rather than statistics publications.
They want to take data and de-identify them and then make them available in a way that allows for additional statistical work but which protects the privacy of all the people whose identity was stripped out of that data, and de-identification really doesn't work. Now, what you can do with differential privacy is build a statistical model and use that to create synthetic data.
But one of the problems that we now know is that there are limits, with the current state of the art, on how accurate that synthetic data can be. It's relatively straightforward to generate synthetic data that will capture many of the interactions between three or four or five variables.
But if you have records that have 20 or 30 variables in them, and you want to make accurate synthetic data that capture all those interactions, that's beyond the state of the art right now.
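A toy sketch of that synthetic-data approach, limited to just two categorical variables (the counts and the epsilon are invented for illustration): add Laplace noise to the cells of a contingency table, then sample synthetic records from the noisy table. Doing the same thing across all the interactions of 20 or 30 variables blows up combinatorially, which is the limitation being described.

```python
import numpy as np

rng = np.random.default_rng()

ages = ["<18", "18-64", "65+"]
regions = ["north", "south"]
# Invented confidential counts of (age_group, region) pairs; rows: ages, cols: regions.
true_counts = np.array([[30.0, 20.0],
                        [120.0, 80.0],
                        [40.0, 10.0]])

# Step 1: a histogram has sensitivity 1, so add Laplace(1/epsilon) noise to each cell.
epsilon = 1.0                                         # illustrative privacy-loss bound
noisy = true_counts + rng.laplace(0, 1.0 / epsilon, size=true_counts.shape)
noisy = np.clip(noisy, 0, None)                       # counts can't be negative

# Step 2: sample synthetic records from the noisy table.
cells = [(a, r) for a in ages for r in regions]       # row-major, matching flatten()
probs = noisy.flatten() / noisy.sum()
synthetic = [cells[i] for i in rng.choice(len(cells), size=300, p=probs)]

print(synthetic[:5])   # synthetic records that roughly preserve the two-way counts
```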
Justin Beals: Yeah, I want to highlight something that you said here that I've learned over the years, which is that de-identification rarely works. You know, it's an easy thing to try and grasp on to.
And, you know, in my work in data science, I've been like, oh, well, we'll just anonymize that data, that would be a discussion we'd have, but it just takes a couple of data points and then a separate data set entirely to essentially reveal all the records.
Simson Garfinkel: Yes, and there's no way to mathematically bound that.
And there's also no way to know that there isn't a data set that doesn't exist today but which will be brought into existence in the future and reveal the data set that you currently have. I'm actually the co-author of two NIST publications from the National Institute of Standards and Technology, one basically on de-identification snafus, and one on de-identifying government datasets.
And so that has actual guidance for other government agencies. We still have to use de-identification in many situations, and there are ways of using it in a principled manner rather than just hoping for the best. And so we discuss some of those techniques, because for the kinds of data that differential privacy doesn't work for, we still have to use de-identification.
So, an example of that would be video. Differential privacy doesn't work very well for video. And so, when Google brought out Street View, it showed people's faces, it showed people's street addresses, and Google said, well, this is all public data. But Google got hit with a lot of privacy actions, because even though it was public data, even though they were photos taken outside, by aggregating that information and making it very easily accessible anywhere in the world, it was having a significant privacy impact. Even though all Google was doing was taking public data from one place and moving it to another place, that act alone was privacy-invasive.
And so Google adopted some statistical approaches. They look for faces, and they blur them in Street View. They blur street addresses. For those sorts of approaches, there's a methodology for discovering them, for implementing them, for getting feedback, and actually for having multiple people get together and make the decisions. Those sorts of methodologies are what we talk about in that document.
Justin Beals: I certainly thought about the audience for the book as I was reading it. And the people I hope will pick it up and read it are some of my colleagues, those of us who are perhaps working in the corporate space but building large data sets with a large number of sensitive records.
I can think of a number of health technology companies and education technology companies, and I just think that the senior architects that are developing some of these systems could really use your book as a primer on how to implement differential privacy in the data set analysis or outputs.
Simson Garfinkel: So people these days, especially architects, system architects, are more sensitive to aggregations of sensitive private information.
But I do think it is useful, even though not all those systems can be helped with differential privacy today. One of the things that we've learned is that deploying differential privacy is more than just adding some algorithms to the output stage of some databases. It frequently requires re-architecting the way data is processed in an organization.
And the act of doing that re-architecting has privacy-protecting benefits all by itself, because a lot of organizations have developed data pipelines that are not careful about the sensitive data they collect. And in the process of really documenting what those data pipelines do, a lot of data that might have been used and released before might not even need to be collected.
Or it might be possible to aggregate that data early on, and one of the things you can show mathematically with differential privacy is that if you aggregate data early in the pipeline, at that data aggregation step, you can add a very small amount of noise and get a very large privacy protection, whereas if you aggregate the data very late in the processing and then add noise, it's harder to protect privacy.
And the best example of that is when you just want to publish record-level data as anonymized data sets: you end up having to add noise to every single record, and you have to add a lot of noise to protect privacy.
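A quick numerical sketch of that early-versus-late point (the record values and epsilon are made up, and the two settings have somewhat different privacy semantics, so treat this only as an accuracy comparison): adding one Laplace draw to an aggregated count introduces far less total error than adding Laplace noise to every individual record and then summing.

```python
import numpy as np

rng = np.random.default_rng()

n = 10_000
records = rng.integers(0, 2, size=n)        # hypothetical 0/1 records
true_total = int(records.sum())
epsilon = 1.0

# Aggregate early: one Laplace draw on the total (a count has sensitivity 1).
noisy_aggregate = true_total + rng.laplace(0, 1.0 / epsilon)

# Aggregate late: give every record its own Laplace noise, then sum the noisy records.
noisy_records_total = (records + rng.laplace(0, 1.0 / epsilon, size=n)).sum()

print(f"true total: {true_total}")
print(f"noise added to the aggregate: error ~ {abs(noisy_aggregate - true_total):.1f}")
print(f"noise added to every record:  error ~ {abs(noisy_records_total - true_total):.1f}")
```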
Justin Beals: Excellent. Simson, I want to thank you for a really great book. I'm by no means an academic and didn't do very well at math, but I loved reading it and learned a lot.
Simson Garfinkel: You should tell people about the cartoons, right? They're in the book, and the cartoons are by an award-winning national cartoonist who does editorial cartoons. We had a grant for this book from the Alfred P. Sloan Foundation, and they've invested a lot of money in developing differential privacy.
They were very generous, and they gave us a grant for this book. And one of the things that we were able to do with that grant money was to pay for Ted Worrell as the cartoonist. I think those cartoons add a lot of context and background to the book, to why you would want to use differential privacy, and also to the sorts of things it can't protect against. And those cartoons are all going to be made available on the Internet so that people can incorporate them into their own presentations.
Justin Beals: Yeah, they were brilliant and certainly, you know, kind of illustrated a lot of the concepts in a really effective way.
And I thought the book was incredibly approachable, um, and really appreciated that.
Simson Garfinkel: And it's a small book, intentionally under 40,000 words. It's part of the MIT Press Essential Knowledge series, and it's designed to be readable in just a few hours.
Justin Beals: Well, we certainly have links in the show notes for folks to purchase the book or get access to it, and I'd love to include the NIST papers that you mentioned in the show notes as well. We're just super grateful for you sharing your expertise around this topic with our listeners. It means a lot.
Simson Garfinkel: All right. Well, thank you very much for having me on. I really appreciate it.