1
00:00:05,200 --> 00:00:09,540
[CLAIRE] Welcome to Talking Postgres. It's a monthly podcast for developers who love Postgres.

2
00:00:09,960 --> 00:00:15,320
And I'm your host, Claire Giordano. In this podcast, we explore the human side of Postgres

3
00:00:15,640 --> 00:00:21,020
and databases, and open source, which means why do people who work with Postgres do what they do

4
00:00:21,400 --> 00:00:26,420
and, sometimes, how did they get there? Thank you to the team at Microsoft for sponsoring this

5
00:00:26,440 --> 00:00:33,520
conversation. Today's guest is Andres Freund. Andres is a Postgres major contributor and a

6
00:00:33,740 --> 00:00:39,260
committer and a member of the seven-person Postgres core team, which is like a steering committee for

7
00:00:39,350 --> 00:00:46,040
the Postgres open source project. And he's been working on the Postgres database for more than 15

8
00:00:46,320 --> 00:00:52,460
years. He's employed by Microsoft, where he works full-time on Postgres open source, and he's lead

9
00:00:52,480 --> 00:00:59,660
of Microsoft's open source contributors team and has been since 2019. His fingerprints can be seen

10
00:01:00,040 --> 00:01:05,880
all over the Postgres source base, but in a good way, including things like logical decoding and

11
00:01:06,020 --> 00:01:16,760
scalability and most recently asynchronous I/O. Welcome Andres.

12
00:01:11,760 --> 00:01:12,600
[ANDRES] Hi, thanks for having me.

13
00:01:16,800 --> 00:01:23,520
[CLAIRE] I'm so glad you're here, and it's time to talk about today's topic which is what went wrong, and what went right, with AIO.

14
00:01:24,660 --> 00:01:30,600
Now, for regular listeners, they probably know that this is not your first time on this podcast.

15
00:01:31,350 --> 00:01:36,160
For people who are interested in your origin story, how you got started as an engineer,

16
00:01:36,620 --> 00:01:41,860
as well as in Postgres, you can just go ahead and listen to episode eight, and we'll be sure

17
00:01:41,860 --> 00:01:47,200
to drop that into the show notes. That was really about, well actually you weren't the only

18
00:01:47,360 --> 00:01:52,800
guest on that episode either, it was you and Heikki Linnakangas and you both shared your stories

19
00:01:53,050 --> 00:02:00,400
about how you got start, started, started, I can speak. But, before we dive in, I'm curious

20
00:02:00,520 --> 00:02:06,120
what do you work on mostly these days? Is it all AIO, or are there other things on your plate as well?

21
00:02:06,960 --> 00:02:15,360
[ANDRES] It is primarily AIO related, but it's not so much the AIO subsystem itself, but working

22
00:02:15,520 --> 00:02:21,100
on the infrastructure to be able to use AIO in more parts of PostgreSQL, which is not

23
00:02:22,600 --> 00:02:29,440
really directly touching AIO pieces, but redesigning other subsystems so that they actually can

24
00:02:29,460 --> 00:02:32,280
use AIO. There's also some other performance related work and similar things and trying

25
00:02:34,960 --> 00:02:35,980
to help out some others too.

26
00:02:39,080 --> 00:02:40,900
[CLAIRE] So some context for people who are listening,

27
00:02:42,480 --> 00:02:45,660
Postgres 18 is about to GA.

28
00:02:45,940 --> 00:02:48,160
It's about to release in General Availability.

29
00:02:49,160 --> 00:02:53,860
And Release Candidate 1 is available for anybody to download and check out the release

30
00:02:53,980 --> 00:02:55,740
notes for right now, correct?

31
00:02:57,280 --> 00:03:01,420
[ANDRES] Yes, that is correct. And please test and report back if you find any problems.

32
00:03:02,400 --> 00:03:05,500
[CLAIRE] And yes, because this is the chance,

33
00:03:05,720 --> 00:03:09,900
if there are any showstoppers, to catch them and get them fixed before the GA.

34
00:03:10,420 --> 00:03:17,540
But AIO, asynchronous I/O is part of the Postgres 18 release, but that doesn't mean it's over yet.

35
00:03:17,960 --> 00:03:23,980
And the goal of today's discussion is really to kind of explore your journey leading that project

36
00:03:24,260 --> 00:03:30,680
and what went wrong, what went right, what happened. So I guess we should start with

37
00:03:31,120 --> 00:03:35,260
why did we do it and when did it start and what's the beginning of the AIO project

38
00:03:35,840 --> 00:03:36,540
for Postgres?

39
00:03:38,820 --> 00:03:42,520
[ANDRES] The beginning is probably even further back than me starting to work on it.

40
00:03:44,290 --> 00:03:49,520
Personally, I've been interested in adding AIO support to Postgres basically shortly after

41
00:03:49,680 --> 00:03:52,960
I started using Postgres, early in the 2010s.

42
00:03:54,240 --> 00:03:59,080
But I only started working on it like around 2019.

43
00:04:01,180 --> 00:04:02,560
It might have been late 2018,

44
00:04:02,840 --> 00:04:03,400
I don't know.

45
00:04:03,880 --> 00:04:15,540
And the reason for why we did it was that Postgres until very recently basically relied on the operating system to do efficient reads from storage.

46
00:04:16,220 --> 00:04:17,959
For some things that works rather well.

47
00:04:18,220 --> 00:04:27,560
For example, if you have a sequential scan or something similar, the operating system, or at least most operating systems, can do reasonably efficient readahead.

48
00:04:28,860 --> 00:04:33,480
And that allows Postgres to not be blocked by storage.

49
00:04:33,970 --> 00:04:43,500
But if you have anything more complicated, like a bitmap index scan or a bitmap heap scan or a vacuum that skips blocks or something similar,

50
00:04:44,180 --> 00:04:49,180
then the storage, the operating system can't do this readahead for us because the operating system

51
00:04:49,360 --> 00:04:54,280
doesn't know as much. So it does not, can't look into the future and do a readahead because it

52
00:04:54,340 --> 00:04:59,060
just doesn't know the future, even though Postgres could know the future because we know what we're

53
00:04:59,060 --> 00:05:08,319
going to do in the future. So the goal of this is to basically give the operating system and the storage

54
00:05:08,340 --> 00:05:12,920
the information to do more efficient reading.

55
00:05:14,380 --> 00:05:22,340
And one motivating factor for why I started around that time was that Linux had a new

56
00:05:22,450 --> 00:05:28,580
feature called io_uring, which allowed us to do asynchronous I/O in more cases.

57
00:05:29,200 --> 00:05:35,060
Before that, Linux had asynchronous I/O support and had that for a long time, but it only worked

58
00:05:35,060 --> 00:05:36,080
with direct I/O.

59
00:05:36,500 --> 00:05:41,380
What that means is that it only worked if we did not use the kernel page cache.

60
00:05:41,880 --> 00:05:43,920
But it turns out that that's much harder to use.

61
00:05:44,340 --> 00:05:48,100
And there's also a lot of use cases where that's not really the right setup to use.

62
00:05:49,020 --> 00:05:56,160
So with the introduction of io_uring, it was suddenly possible to do native AIO in more cases.

63
00:05:56,940 --> 00:06:00,020
And that's kind of what made me start looking into it.

64
00:06:00,940 --> 00:06:06,060
At the same time, there were two important changes, I think, afoot.

65
00:06:06,760 --> 00:06:12,180
One was that we had much faster storage than we used to, due to NVMe storage.

66
00:06:12,500 --> 00:06:15,720
That's like fast local SSDs that can have very large bandwidth.

67
00:06:16,380 --> 00:06:18,880
And it turns out that the CPU overhead of doing I/O

68
00:06:19,300 --> 00:06:22,960
suddenly matters a lot more when having that fast storage.

69
00:06:24,640 --> 00:06:28,420
And at the same time, more and more workloads were migrating into cloud systems

70
00:06:29,180 --> 00:06:34,100
where you could have storage that has like a reasonably high number of IOPS

71
00:06:34,240 --> 00:06:35,400
or a reasonably high bandwidth,

72
00:06:35,480 --> 00:06:39,080
but the latency towards the storage is fairly high at the same time.

73
00:06:40,480 --> 00:06:46,700
And that means that to saturate the storage or to fully utilize the storage that we pay for,

74
00:06:47,060 --> 00:06:53,720
one actually needs to issue I/O more in parallel than we could do until recently.

75
00:06:54,080 --> 00:06:58,880
And that was the second motivation basically for investing time in working on I/O.

76
00:07:00,100 --> 00:07:02,260
[CLAIRE] So let me see if I follow that properly.

77
00:07:03,140 --> 00:07:13,040
For both of those changes, both with NVMe as well as with workloads in the clouds with these large IOPS,

78
00:07:14,060 --> 00:07:17,420
are you saying that those were opportunities to take advantage of?

79
00:07:17,580 --> 00:07:18,340
That's not what I heard.

80
00:07:18,340 --> 00:07:24,600
I heard you saying that, wow, we really needed to fix this problem because it was becoming a bigger problem.

81
00:07:24,880 --> 00:07:25,620
Did I get it right, or not?

82
00:07:27,260 --> 00:07:29,220
[ANDRES] I mean, that's kind of two sides of the same coin.

83
00:07:29,980 --> 00:07:35,880
Like either we would need to benefit to fully utilize the hardware or we're not performing

84
00:07:35,950 --> 00:07:39,680
as well because we are not utilizing it.

85
00:07:39,990 --> 00:07:41,920
But yeah, otherwise I think you're right.

86
00:07:43,580 --> 00:07:45,560
[CLAIRE] I'm just looking at it the more negative way.

87
00:07:45,840 --> 00:07:46,820
Like we had to do it.

88
00:07:47,680 --> 00:07:48,660
Like we had no choice.

89
00:07:50,080 --> 00:07:59,380
Well, maybe it's true that you had to do it because you were so motivated and had been thinking about it for, it sounds like, eight years.

90
00:07:59,780 --> 00:08:00,460
[ANDRES] Something like that, yeah.

91
00:08:01,820 --> 00:08:02,700
[CLAIRE] Okay,

92
00:08:03,320 --> 00:08:04,320
so that's why.

93
00:08:04,940 --> 00:08:06,700
Had you ever

94
00:08:07,120 --> 00:08:08,420
led an architectural

95
00:08:08,740 --> 00:08:10,280
change as big as this?

96
00:08:11,300 --> 00:08:12,720
[ANDRES] I don't think so.

97
00:08:14,020 --> 00:08:16,560
I've worked on reasonably big changes to Postgres,

98
00:08:17,230 --> 00:08:21,800
but they were all more narrow.

99
00:08:22,080 --> 00:08:24,540
They didn't need changes to as many parts of Postgres

100
00:08:24,940 --> 00:08:27,940
and they were more focused.

101
00:08:28,840 --> 00:08:35,680
And even though I think some of them might have actually been more lines of code or something like that,

102
00:08:37,099 --> 00:08:43,120
they were never quite as hard to integrate because they touched fewer places.

103
00:08:45,440 --> 00:08:46,940
So that's definitely the hardest project I've ever worked on.

104
00:08:50,790 --> 00:08:51,920
[CLAIRE] So how did you even begin?

105
00:08:52,270 --> 00:08:53,420
How did you get started?

106
00:08:56,600 --> 00:08:58,640
I mean, obviously with a prototype.

107
00:08:56,720 --> 00:09:01,560
[ANDRES] It's long enough ago that I am not 100% sure about all the details anymore.

108
00:09:02,170 --> 00:09:09,000
But I think I just started out doing some very minimal testing in the sense of like, I tried

109
00:09:09,010 --> 00:09:17,600
to do the minimal hacking on Postgres to use AIO in one very narrow place and then tried to

110
00:09:17,720 --> 00:09:20,880
see whether I could see any performance benefits from that.

111
00:09:22,000 --> 00:09:22,940
And initially I didn't.

112
00:09:23,380 --> 00:09:28,600
And then I just started trying to use it in other places to see whether there's bigger gains there.

113
00:09:29,400 --> 00:09:37,580
And eventually I learned more about what the problems were and where we can really gain performance.

114
00:09:38,140 --> 00:09:47,800
And I started to try to generalize what the AIO subsystem would look like in a somewhat understandable way.

115
00:09:48,500 --> 00:10:04,800
And over the next couple of years, I on and off worked on developing an AIO subsystem that was generic and tried to introduce users of it in more and more places.

116
00:10:05,660 --> 00:10:12,980
And as part of that, there were lots of subprojects that were somewhat independent and could be committed independently.

117
00:10:13,500 --> 00:10:20,340
And there were improvements to Postgres that could be committed, even though AIO was not merged.

118
00:10:20,440 --> 00:10:35,180
For example, I think Postgres 16 or 17, I don't fully remember, 16, I was able to commit a change to make relation extension, that's like making the table bigger, faster.

119
00:10:35,500 --> 00:10:37,760
And that actually was interesting to do

120
00:10:38,920 --> 00:10:42,840
because it allowed a fairly substantial part

121
00:10:42,840 --> 00:10:46,200
of the architectural changes that were necessary

122
00:10:46,490 --> 00:10:50,280
to be merged without the rest of the AIO changes.

123
00:10:50,960 --> 00:10:54,980
And generally that's something we tried to do

124
00:10:55,120 --> 00:10:58,360
was to find bits and pieces that we can merge earlier

125
00:10:58,550 --> 00:11:00,460
because trying to merge the whole thing at once

126
00:11:00,530 --> 00:11:03,620
was just going to be infeasible.

127
00:11:03,760 --> 00:11:05,220
That was clear pretty early on.

128
00:11:07,260 --> 00:11:13,760
[CLAIRE] I know that in Postgres 17 there was a feature or a collection of features under the umbrella

129
00:11:13,820 --> 00:11:18,580
name of streaming I/O. Was that considered part of the AIO project as well?

130
00:11:15,700 --> 00:11:15,900
[ANDRES] Yes.

131
00:11:18,900 --> 00:11:21,520
That was definitely one fairly crucial part.

132
00:11:23,900 --> 00:11:26,460
And as part of the prototype for AIO,

133
00:11:26,680 --> 00:11:29,020
I had written something that was called at the time,

134
00:11:29,020 --> 00:11:32,600
I think, streaming read.

135
00:11:32,960 --> 00:11:45,060
It was streaming read, and then Thomas Munro tried to make what I had prototyped into something more general and independently mergeable.

136
00:11:46,220 --> 00:11:49,940
And that is what got merged into Postgres 17.

137
00:11:50,400 --> 00:11:58,160
And that had substantial benefits on its own, because what it added was the ability to merge multiple I/Os for [...] blocks

138
00:11:59,020 --> 00:12:05,000
(in Postgres those blocks are typically eight kilobytes large) into one larger read of up to by default I

139
00:12:05,000 --> 00:12:11,340
think 128 kilobytes, if they were neighboring blocks, and that alone can reduce the CPU overhead

140
00:12:11,340 --> 00:12:15,940
of doing I/O very substantially because fewer system calls are required.

141
00:12:18,420 --> 00:12:18,520
[CLAIRE] Okay.

142
00:12:19,080 --> 00:12:21,979
So the title of today's...

143
00:12:19,600 --> 00:12:21,120
[ANDRES] The other

144
00:12:24,220 --> 00:12:26,140
part why that was interesting is that it allowed us to

145
00:12:28,920 --> 00:12:35,300
introduce uses of AIO without actually having AIO merged because the whole idea behind this

146
00:12:35,570 --> 00:12:42,980
interface that was added to 17 was that it allowed code to have the same interface

147
00:12:43,790 --> 00:12:51,760
in 17 as it would have once AIO was merged and then automatically get AIO and also already

148
00:12:51,840 --> 00:12:57,420
get some other benefits before that. And I think that was a pretty good path to take.

149
00:12:58,100 --> 00:12:59,380
I think that's one of the things that went well.

150
00:12:59,500 --> 00:13:05,060
[CLAIRE] Okay. So what we want to cover today are things that went wrong, and things that went right,

151
00:13:05,480 --> 00:13:09,840
which is a little bit different from what you did when you gave a talk a couple months ago in

152
00:13:10,060 --> 00:13:16,760
Montreal at PGConf.dev. And there you focused primarily, and the title of that talk was just

153
00:13:16,920 --> 00:13:23,520
what went wrong with AIO. So, but here we want to talk about both, you know, those big challenges,

154
00:13:23,720 --> 00:13:25,480
and things that went wrong, as well as what went right.

155
00:13:25,840 --> 00:13:27,920
You just talked about something that went right, right?

156
00:13:28,100 --> 00:13:33,980
That you seeded some of these changes into Postgres 16 and Postgres 17.

157
00:13:35,060 --> 00:13:39,840
So it wasn't a big bang code contribution, if you will, in Postgres 18.

158
00:13:41,700 --> 00:13:43,040
But I guess my question to you is,

159
00:13:43,040 --> 00:13:46,880
do you want to weave these two things that were good

160
00:13:46,960 --> 00:13:50,380
and things that were a problem together throughout today's conversation?

161
00:13:50,780 --> 00:13:53,380
Or should we actually start with the what went wrong part?

162
00:13:54,040 --> 00:13:54,940
How do you want to do it?

163
00:13:56,260 --> 00:13:58,180
[ANDRES] I think either works okay for me.

164
00:13:58,380 --> 00:14:03,119
I think maybe it's easier for the audience if you do separate the two, but I'm sure that

165
00:14:03,140 --> 00:14:08,700
even if you do that, there will be some back and forth, just because that's how our brains work.

166
00:14:03,500 --> 00:14:03,940
[CLAIRE] Okay,

167
00:14:08,680 --> 00:14:10,500
because you can't help it? [Yes.]

168
00:14:10,590 --> 00:14:10,720
Okay,

169
00:14:13,080 --> 00:14:14,720
So you talked about,

170
00:14:14,940 --> 00:14:16,000
in that talk at Montreal,

171
00:14:16,250 --> 00:14:17,180
which I listened to,

172
00:14:17,180 --> 00:14:18,420
I was there for part of it,

173
00:14:18,430 --> 00:14:20,019
but I also listened to it again this morning,

174
00:14:21,020 --> 00:14:23,320
you talked about a handful of mistakes.

175
00:14:24,610 --> 00:14:26,860
But you also talked about why it took so long,

176
00:14:27,380 --> 00:14:29,480
and are you happy or not happy

177
00:14:30,000 --> 00:14:30,780
with how long it took?

178
00:14:31,860 --> 00:14:33,680
Did it take the right amount of time as a project?

179
00:14:34,500 --> 00:14:35,640
[ANDRES] I think it definitely took too long.

180
00:14:39,680 --> 00:14:43,000
I think there's like several reasons why I think it's too long.

181
00:14:43,140 --> 00:14:50,280
One is just that it's extremely hard to maintain motivation over that long a time.

182
00:14:50,580 --> 00:15:01,340
And I think some of the slowdown was just related to needing to do something else just because I couldn't see the word AIO anymore.

183
00:15:02,640 --> 00:15:07,900
And if the whole project had taken a shorter amount of time, then that would have been less of a factor, I think.

184
00:15:09,480 --> 00:15:10,540
But I also think that

185
00:15:11,810 --> 00:15:12,880
it took too long

186
00:15:12,950 --> 00:15:13,360
just because,

187
00:15:14,410 --> 00:15:15,400
in the sense for,

188
00:15:15,420 --> 00:15:18,060
it would have been good for Postgres for it to have happened sooner

189
00:15:18,280 --> 00:15:19,940
and there were plenty of

190
00:15:20,110 --> 00:15:21,760
other projects that were kind of

191
00:15:24,060 --> 00:15:25,340
blocked by not having AIO

192
00:15:25,680 --> 00:15:27,640
and also several people

193
00:15:27,690 --> 00:15:29,580
in the community in our team

194
00:15:29,610 --> 00:15:31,720
at Microsoft were helping with AIO

195
00:15:31,830 --> 00:15:33,480
and they would probably have been happy

196
00:15:33,780 --> 00:15:36,000
if that happened more quickly

197
00:15:37,399 --> 00:15:37,920
because

198
00:15:38,080 --> 00:15:42,640
at the pace it was happening, they needed to switch back and forth between helping out with AIO or doing

199
00:15:42,920 --> 00:15:47,560
AIO related projects and doing other projects. And the more context switches one has,

200
00:15:48,920 --> 00:15:51,880
for a lot of us at least, the slower things go.

201
00:15:55,139 --> 00:16:00,620
[CLAIRE] I guess I wonder if you're being too hard on yourself saying that it took too long?

202
00:16:01,120 --> 00:16:17,180
I mean, by definition, don't big projects go down dead ends? Isn't that like a normal part of the design process? That there will be dead ends or there will be things that you wish you hadn't done the way you did in the prototype?

203
00:16:18,580 --> 00:16:28,940
[ANDRES] Yes, that's definitely part of it, and I think a good part of a project of this complexity

204
00:16:28,950 --> 00:16:32,680
I think it's impossible to do without exploring dead ends because, otherwise,

205
00:16:34,120 --> 00:16:39,600
one wouldn't have been ambitious enough to find the actual right design and would have just gotten

206
00:16:39,750 --> 00:16:51,240
stuck in some local minimum of okayish design. But I think there probably were cases where

207
00:16:52,040 --> 00:16:57,540
I could have done better. Like, I invested a lot of time in trying to make the prototype kind of work

208
00:16:58,680 --> 00:16:59,800
in some edge cases,

209
00:17:00,220 --> 00:17:01,720
even though there were just

210
00:17:02,400 --> 00:17:04,220
known fundamental architectural mistakes

211
00:17:04,360 --> 00:17:04,780
in the prototype.

212
00:17:06,260 --> 00:17:08,079
And that cost at least a year.

213
00:17:08,540 --> 00:17:09,900
And if I had

214
00:17:11,980 --> 00:17:13,980
focused more on giving up

215
00:17:13,980 --> 00:17:15,800
the prototype at some point and just starting

216
00:17:15,980 --> 00:17:18,040
with a real thing, if I had done that earlier,

217
00:17:18,140 --> 00:17:19,180
I think it would have been better.

218
00:17:19,720 --> 00:17:21,860
But I think there's also a second aspect

219
00:17:22,800 --> 00:17:23,760
of issues

220
00:17:24,079 --> 00:17:26,120
that were not related to me personally

221
00:17:26,420 --> 00:17:27,439
where I did something wrong.

222
00:17:27,880 --> 00:17:43,540
I think a lot of the time that it took for AIO was that we had some aspects of Postgres where we just hadn't invested the time necessary to allow for faster-paced development.

223
00:17:44,620 --> 00:17:52,880
One aspect is that before I started working on AIO, Postgres did not have a CI that could be run by everybody.

224
00:17:53,520 --> 00:17:59,860
And for something that has so many portability effects like AIO has, that is just not feasible.

225
00:18:00,140 --> 00:18:10,360
If we can't automatically test Postgres on different operating systems and so on, then it's not really feasible to develop something like AIO.

226
00:18:11,380 --> 00:18:23,700
And so one of the large time sinks that was on the path to getting AIO anywhere was to merge CI infrastructure into Postgres.

227
00:18:24,100 --> 00:18:25,000
And that took a lot of time,

228
00:18:25,180 --> 00:18:31,380
and if that hadn't been the case, then the AIO project would have gone faster.

229
00:18:31,380 --> 00:18:37,140
And I think it's an area of Postgres that we, just as a project, had underinvested in.

230
00:18:37,260 --> 00:18:48,040
And I think even though there was some initial skepticism about adding CI, I think that has generally borne out to be a very crucial enabler for lots of different projects.

231
00:18:49,560 --> 00:18:51,360
[CLAIRE] Now that's something that Bilal worked on, right,

232
00:18:51,620 --> 00:18:52,880
Nazir Bilal Yavuz?

233
00:18:53,520 --> 00:18:54,560
Probably other people as well.

234
00:18:53,620 --> 00:18:53,700
[ANDRES] Yes.

235
00:18:54,980 --> 00:18:55,780
Yeah, I think it was Bilal and me doing a lot of the work initially

236
00:18:59,960 --> 00:19:00,780
and then since then, plenty of other people have chimed in.

237
00:19:04,440 --> 00:19:07,640
I think in the last couple of years, Bilal has done most of the work.

238
00:19:09,760 --> 00:19:15,400
[CLAIRE] I had no idea that that was an enabler, if you will, for the AIO project.

239
00:19:15,990 --> 00:19:16,720
That's pretty cool.

240
00:19:17,270 --> 00:19:26,840
I thought it was a general improvement to the overall way the Postgres contributors and engineers tested the project,

241
00:19:26,970 --> 00:19:29,620
but I didn't realize there was an AIO connection.

242
00:19:30,780 --> 00:19:32,540
[ANDRES] I started doing CI purely because of AIO, that was directly the motivation.

243
00:19:37,340 --> 00:19:37,940
[CLAIRE] Okay then,

244
00:19:38,930 --> 00:19:40,860
that should go on Bilal's,

245
00:19:41,810 --> 00:19:44,120
I don't know, his next promotion justification

246
00:19:44,470 --> 00:19:45,340
or something like that.

247
00:19:45,040 --> 00:19:46,240
[ANDRES] I'm pretty sure that I did that.

248
00:19:48,280 --> 00:19:49,520
[CLAIRE] Okay, good, good, good.

249
00:19:51,399 --> 00:19:53,880
Do you want to walk us through some,

250
00:19:54,210 --> 00:19:58,240
since a lot of the people that listen to this show are engineers and developers,

251
00:19:58,480 --> 00:20:08,480
they probably are hungry to hear specific examples of things that, decisions you made, or dead ends you went down, that you wish you hadn't.

252
00:20:09,560 --> 00:20:10,940
Do you want to give us some of those examples?

253
00:20:13,440 --> 00:20:13,920
Do you remember?

254
00:20:13,500 --> 00:20:14,000
[ANDRES] I can try.

255
00:20:14,760 --> 00:20:18,140
Some of them will be a bit far down into the weeds.

256
00:20:19,080 --> 00:20:21,340
I don't know how easy they are to explain on a podcast.

257
00:20:23,580 --> 00:20:28,220
I think one of the dependencies that probably should not have been a dependency was that

258
00:20:30,480 --> 00:20:36,680
I got very frustrated with running tests in Postgres before working, like while working

259
00:20:36,780 --> 00:20:39,200
on AIO, but also while working on other features.

260
00:20:40,459 --> 00:20:45,700
And that indirectly led me to adding support for a new build system to Postgres.

261
00:20:46,540 --> 00:20:53,540
And I think that was a very good investment into Postgres, but I think it was not a very

262
00:20:53,560 --> 00:21:00,160
good investment in the sense of doing it before AIO was complete. And I think I knew that at the

263
00:21:00,240 --> 00:21:04,620
time, I just needed to do something other than AIO. So maybe it was the right thing to do,

264
00:21:04,740 --> 00:21:16,040
but it definitely did not help the timeline. Another example of like more technical--

265
00:21:09,820 --> 00:21:09,940
[CLAIRE] Okay,

266
00:21:11,940 --> 00:21:13,540
and when you talk about [Go ahead.]

267
00:21:16,340 --> 00:21:18,520
just how big of a distraction was that?

268
00:21:18,580 --> 00:21:22,360
Are we talking about two or three months on your part or a year?

269
00:21:24,980 --> 00:21:29,140
[ANDRES] It's hard to say because it was not like drop one thing and do only the other thing,

270
00:21:30,300 --> 00:21:36,000
but like the work definitely went on over like nine months to varying degrees or something

271
00:21:35,920 --> 00:21:42,200
so it was a substantial time investment.

272
00:21:42,920 --> 00:21:43,320
On the

273
00:21:43,980 --> 00:21:45,180
more technical front

274
00:21:45,880 --> 00:21:46,440
I think one

275
00:21:47,240 --> 00:21:48,880
very hard thing

276
00:21:49,420 --> 00:21:51,220
about adding support

277
00:21:51,480 --> 00:21:53,240
for AIO into something

278
00:21:53,640 --> 00:21:54,400
like Postgres which

279
00:21:55,500 --> 00:21:57,260
just was not written with anything

280
00:21:57,720 --> 00:21:59,400
like asynchronicity in mind

281
00:21:59,940 --> 00:22:01,000
is that

282
00:22:01,380 --> 00:22:03,520
one invariably needs something like callbacks

283
00:22:03,620 --> 00:22:05,180
or something to react to the

284
00:22:06,200 --> 00:22:07,100
completion of

285
00:22:07,540 --> 00:22:09,000
I/Os and

286
00:22:09,020 --> 00:22:15,620
I definitely went down many, many different dead ends in how to make that correct and

287
00:22:15,820 --> 00:22:16,860
not super failure prone.

288
00:22:17,260 --> 00:22:23,300
And initially, one of the biggest mistakes was that I allowed those callbacks to start

289
00:22:23,440 --> 00:22:27,580
more I/O after the completion.

290
00:22:28,160 --> 00:22:33,020
And that turned out to have very complicated nesting issues because it then meant that

291
00:22:33,040 --> 00:22:39,360
if an I/O completed while deep in some subsystem, then more I/O could be triggered and that

292
00:22:39,500 --> 00:22:44,540
could then recursively reenter the same subsystem. And it made everything very fragile and hard to

293
00:22:44,660 --> 00:22:51,960
understand. And I think I intuitively knew that that wasn't quite the right direction to go in,

294
00:22:52,140 --> 00:23:03,000
but like, it was hard to go back and redo everything to get rid of that decision. And

295
00:23:03,020 --> 00:23:07,220
I think that was the wrongest decision about all of this on a technical level.

296
00:23:09,280 --> 00:23:12,340
And what Postgres now has is a much more restricted

297
00:23:13,350 --> 00:23:13,960
set of callbacks.

298
00:23:14,460 --> 00:23:17,820
One is not allowed to start new I/O inside those callbacks.

299
00:23:18,500 --> 00:23:20,280
One is not allowed to allocate memory

300
00:23:20,450 --> 00:23:21,340
inside those callbacks.

301
00:23:21,370 --> 00:23:22,620
And like, it's very restrictive

302
00:23:22,960 --> 00:23:24,460
and that's good for some things,

303
00:23:25,360 --> 00:23:28,320
but it also makes it a lot more restricted.

304
00:23:28,800 --> 00:23:31,440
And that probably will make some other features harder,

305
00:23:31,680 --> 00:23:34,560
but it's the only way I could see to make the feature actually,

306
00:23:36,680 --> 00:23:40,100
understand it well enough to believe in its correctness to some degree.

307
00:23:43,420 --> 00:24:02,740
[CLAIRE] Obviously, if someone really wants to go deep on understanding some of the things you did that in hindsight, with 20/20 hindsight, you wish you hadn't done, they can go watch your talk, which is available on YouTube, from Montreal, the what went wrong with AIO.

308
00:24:02,740 --> 00:24:04,980
And I can't remember if that was a half hour long talk.

309
00:24:05,250 --> 00:24:05,940
I think it was,

310
00:24:05,950 --> 00:24:07,020
I think it was about a half hour, could have been longer.

311
00:24:07,760 --> 00:24:09,020
[ANDRES] I think it was 45 minutes or something,

312
00:24:09,160 --> 00:24:11,740
it was a full-length talk, but I'm not entirely sure.

313
00:24:11,200 --> 00:24:11,380
[CLAIRE] Okay.

314
00:24:12,140 --> 00:24:15,340
So you dive deep in that talk.

315
00:24:15,380 --> 00:24:21,120
But are there a few other examples that we can kind of consider?

316
00:24:22,480 --> 00:24:27,080
I mean, what I want to get to after you share the examples is, what's your takeaway?

317
00:24:27,500 --> 00:24:34,520
Are there learnings that other developers can steal from you,

318
00:24:34,900 --> 00:24:44,340
or, that if you embark on a similar architectural project in the future, things you will know to do better next time.

319
00:24:44,720 --> 00:24:49,140
But before we get to the learnings, I just feel like we need to go through a few more examples if we can.

320
00:24:50,300 --> 00:24:57,480
[ANDRES] Yeah. Another example of failures that were more like project failures rather than my personal

321
00:24:57,640 --> 00:25:06,140
failings is that there just are, and particularly were, significant parts of Postgres that just

322
00:25:06,160 --> 00:25:12,580
had no tests. And it turns out that if you then redesign parts of Postgres, it's very

323
00:25:12,660 --> 00:25:19,860
easy to break those other subsystems that had no tests. And like, for example, we found

324
00:25:19,960 --> 00:25:30,780
out very late in the development of, or merging of, AIO that it broke some stats that are emitted

325
00:25:30,800 --> 00:25:34,440
whenever there are checksum failures.

326
00:25:35,660 --> 00:25:36,740
But we just had no tests,

327
00:25:36,770 --> 00:25:39,560
so I just did not think about that until the last minute somehow.

328
00:25:40,030 --> 00:25:43,100
And I think as a project lesson,

329
00:25:43,170 --> 00:25:46,620
I think it's that we have to continue to invest more

330
00:25:46,940 --> 00:25:51,020
into testing infrastructure and different types of testing.

331
00:25:52,080 --> 00:25:57,060
And I think that's also, in a way, a personal lesson.

332
00:25:57,200 --> 00:26:02,920
Obviously I invested time in working on CI and stuff like that, but I should

333
00:26:02,980 --> 00:26:09,720
probably have done more testing infrastructure earlier on to find some of the gnarlier hard to

334
00:26:09,840 --> 00:26:22,300
find bugs and, yeah, that was not perfect. I think the development process was that I first

335
00:26:22,320 --> 00:26:29,340
wrote that prototype and then only about a year ago turned that prototype into some, like

336
00:26:29,540 --> 00:26:36,940
rewrote the prototype from scratch, to get something mergeable, and I think we added too many

337
00:26:37,180 --> 00:26:43,480
features to the prototype. We basically had already learned nearly all the lessons

338
00:26:45,080 --> 00:26:51,500
that you could have learned, but I tried to make it better and better. Like one big part of what you

339
00:26:51,540 --> 00:26:55,220
eventually want to use AIO for is to do WAL writes.

340
00:26:56,920 --> 00:27:05,420
And I invested at least a year and a half into trying to make asynchronous WAL writes work very

341
00:27:05,560 --> 00:27:12,480
well in all situations. Even though getting the performance exactly right of that was not all that

342
00:27:12,480 --> 00:27:19,980
important a decision. It wasn't that important for the design of AIO. It was important to prototype

343
00:27:20,040 --> 00:27:23,160
that we could do asynchronous WAL writes,

344
00:27:23,380 --> 00:27:24,860
but it was not important to get the performance

345
00:27:26,850 --> 00:27:30,800
to be on par in all situations with current Postgres

346
00:27:29,860 --> 00:27:32,740
because it was always to be a prototype, not the real thing.

347
00:27:33,370 --> 00:27:35,940
So I invested inordinate amounts of time in that,

348
00:27:36,160 --> 00:27:38,740
and I think knowing when to stop with a prototype

349
00:27:38,890 --> 00:27:41,920
is probably something that I learned a lot about as part of this project.

350
00:27:44,960 --> 00:27:46,200
[CLAIRE] Well, what is the answer to that?

351
00:27:49,480 --> 00:27:55,560
Knowing when to stop with a prototype, that's hard to give a rule of thumb around.

352
00:27:53,800 --> 00:27:58,100
[ANDRES] Yes, and I think, generally,

353
00:28:00,560 --> 00:28:06,460
the problems where things go wrong are not going to be hard and fast

354
00:28:07,960 --> 00:28:13,120
zero or one kind of things where like there's a right or a wrong. It's always a question of like

355
00:28:14,180 --> 00:28:19,160
a gradation where like at some point you definitely invested too much time in it, at some

356
00:28:19,160 --> 00:28:26,040
point you invested too little time, but like where exactly the right spot is is a large bandwidth

357
00:28:26,220 --> 00:28:31,680
between those and I think that's where most of the things that went wrong were of that nature,

358
00:28:32,480 --> 00:28:40,000
and I don't think I know the answer right now, I just know that the spot I picked in some cases was

359
00:28:40,020 --> 00:28:46,700
definitely wrong. I don't know where the right spot would have been.

360
00:28:46,820 --> 00:28:49,660
[CLAIRE] Okay, so more examples.

361
00:28:55,940 --> 00:28:57,020
I'm putting you on the spot.

362
00:28:56,020 --> 00:29:03,860
[ANDRES] One, the way that the AIO subsystem works is that one can get an I/O handle and then with that

363
00:29:04,000 --> 00:29:09,200
I/O handle one can associate like a read or write and some callbacks that are to be called when

364
00:29:09,260 --> 00:29:15,860
the AIO completes. Initially there was no hard limit in each backend on how many of those

365
00:29:16,360 --> 00:29:21,640
handles could be used, and it was actually somewhat expensive to get one of those handles

366
00:29:22,190 --> 00:29:27,920
and, because it was somewhat expensive, all the parts that

367
00:29:28,080 --> 00:29:37,279
used those handles cached them for reuse and it turns out that if you cache a lot of handles in

368
00:29:37,300 --> 00:29:42,460
a lot of places, the total number of those handles can get very large. But because of PostgreSQL's

369
00:29:42,500 --> 00:29:47,660
multi-process design, the state for all of those handles has to be in shared memory,

370
00:29:48,760 --> 00:29:51,860
which then means that we have to pre-allocate them at the start of the server.

371
00:29:53,780 --> 00:30:01,200
So this caused a problem that we could run out of handles and that then caused a lot of

372
00:30:01,220 --> 00:30:07,780
problems, because if we are in the place that wants to do, for example, a WAL write, which may not fail

373
00:30:08,060 --> 00:30:12,340
without taking down the server, and we ran out of handles, there was not really a good way forward.

374
00:30:14,059 --> 00:30:18,320
And that was like a multi-layered

375
00:30:20,620 --> 00:30:26,960
descent into a wronger and wronger design. And it turns out that the root cause basically was that

376
00:30:27,000 --> 00:30:30,440
it was expensive to get new handles.

377
00:30:30,560 --> 00:30:33,080
And because of that, we had to do the caching,

378
00:30:33,180 --> 00:30:35,620
and without all of that, once it was cheap to get handles,

379
00:30:36,060 --> 00:30:38,740
the whole set of problems related to this went away.

380
00:30:40,019 --> 00:30:43,000
And I think I could have recognized that earlier.

381
00:30:44,380 --> 00:30:50,280
But I think that issue I feel not as bad about as some others

382
00:30:50,460 --> 00:30:54,340
because that was just a new design space that we needed to explore,

383
00:30:54,880 --> 00:30:56,660
and in hindsight, everything is easier.

384
00:31:00,300 --> 00:31:00,820
[CLAIRE] Well, in hindsight, it's all obvious, right?

385
00:31:03,879 --> 00:31:07,380
[ANDRES] It's not obvious, but more obvious maybe.

386
00:31:09,300 --> 00:31:09,500
[CLAIRE] Okay,

387
00:31:11,240 --> 00:31:12,800
I like the phrase you just used.

388
00:31:13,070 --> 00:31:15,960
You said "a multi-layered descent into wronger and wronger design."

389
00:31:16,600 --> 00:31:26,160
I'll replace wronger with bad, but I like that quote.

390
00:31:26,060 --> 00:31:28,220
Is there any takeaway from that,

391
00:31:28,380 --> 00:31:36,460
or did you just have to go through that exploration to get to that result?

392
00:31:35,320 --> 00:31:38,840
[ANDRES] I think we needed to go through that exploration, but I think I should have, or we should have, stopped earlier

393
00:31:47,320 --> 00:31:52,920
and done the necessary redesign to get rid of those problems.

394
00:31:54,020 --> 00:31:58,160
And that was one of the things that made it really hard to work with a prototype

395
00:31:58,350 --> 00:32:05,980
because it would lead to these nested subsystems that had very complicated problems

396
00:32:06,230 --> 00:32:08,200
that were interacting with each other.

397
00:32:10,760 --> 00:32:14,700
And I tried to put more band-aids on more band-aids,

398
00:32:14,710 --> 00:32:16,260
and that just made it even harder.

399
00:32:16,650 --> 00:32:21,740
And I think it's related to the decision to stop earlier in the prototype

400
00:32:22,800 --> 00:32:25,080
and just rewrite in a cleaner way from scratch.

401
00:32:25,700 --> 00:32:27,440
And I think that's actually one of the positive lessons

402
00:32:27,720 --> 00:32:31,720
is that for complicated projects,

403
00:32:31,810 --> 00:32:33,840
it really often will be worth it

404
00:32:34,020 --> 00:32:35,600
to write a throwaway prototype

405
00:32:36,040 --> 00:32:37,680
where basically no code will survive

406
00:32:37,900 --> 00:32:39,440
from the prototype to the real thing

407
00:32:40,000 --> 00:32:42,140
just because by the time the right design will be clear,

408
00:32:45,700 --> 00:32:49,060
there will be so much garbage left in the prototype

409
00:32:49,080 --> 00:32:55,520
that it's not really worth trying to keep the code and going from there to something mergeable.

410
00:32:56,860 --> 00:33:02,440
[CLAIRE] That's something that you said in your talk in Montreal, that in hindsight you wish you hadn't

411
00:33:02,520 --> 00:33:08,780
spent the time you spent trying to get to production level quality in the prototype.

412
00:33:09,650 --> 00:33:13,020
Like if you knew upfront that this is going to be a throwaway prototype,

413
00:33:13,790 --> 00:33:18,660
you might have saved a little bit of time there. Is that the right takeaway?

414
00:33:19,740 --> 00:33:24,500
[ANDRES] Yeah, and the hard part of that is trying to know which of the problems in the prototype

415
00:33:24,780 --> 00:33:30,000
are architectural problems where it's not yet clear what the right solution looks

416
00:33:30,120 --> 00:33:35,500
like and which are architectural problems that we now can fix because we now know about

417
00:33:35,500 --> 00:33:39,080
them and so it's easy to avoid them while writing the real thing.

418
00:33:42,320 --> 00:33:46,140
Obviously, that's not an easily generally answerable question.

419
00:33:50,120 --> 00:33:53,260
[CLAIRE] So before we flip to what went right with AIO,

420
00:33:54,660 --> 00:33:56,340
is there anything else that went wrong

421
00:33:56,980 --> 00:34:00,520
that gives you one of those takeaways, those learnings,

422
00:34:00,600 --> 00:34:02,660
those "I'm not gonna make that mistake again?"

423
00:34:06,040 --> 00:34:13,220
[ANDRES] I don't know whether, like, I think one other big thing that didn't go right, but I don't know how wrong it went,

424
00:34:13,520 --> 00:34:26,260
and I don't know whether I really know how to do it better, is trying to tackle complicated architecture problems while collaborating with others.

425
00:34:27,740 --> 00:34:43,280
Because it is very hard while exploring something that is in an architectural void to share the problem space with somebody else and to try to delegate parts of the problem to them.

426
00:34:43,720 --> 00:34:55,820
Because it requires a fair amount of experience and a fair amount of tolerance for uncertainty, I would guess, is the best way of describing it, to work in that void.

427
00:34:56,159 --> 00:35:01,640
And I think that's something that didn't go right in all cases.

428
00:35:01,750 --> 00:35:08,700
I think I tried to delegate some projects that were too underspecified and perhaps were too early.

429
00:35:10,200 --> 00:35:26,040
And on the other side of the coin, I think there were projects where I made myself the bottleneck for far too long and did not delegate or did not, delegate is the wrong word, did not hand off subsets of the problem

430
00:35:26,140 --> 00:35:37,260
to others. But it's very hard ahead of time to know which side of

431
00:35:37,260 --> 00:35:43,640
the line some subset of some problem is going to be. And I hope I am getting better at it but

432
00:35:43,780 --> 00:35:49,160
like I've been hoping to get better at particularly this task for a long time, so I don't know whether

433
00:35:50,680 --> 00:35:53,500
I am now better at it or whether I know the right solution.

434
00:35:54,440 --> 00:35:56,960
But yeah, I found that to be a very hard problem.

435
00:35:55,660 --> 00:35:57,060
[CLAIRE] Well, I think that anybody listening,

436
00:35:59,360 --> 00:36:00,120
anyone listening who's a technical lead like you are,

437
00:36:02,960 --> 00:36:05,660
is probably identifying with what you're saying.

438
00:36:06,660 --> 00:36:07,720
Because like you said before,

439
00:36:08,240 --> 00:36:09,920
it's not like there's a right or wrong answer

440
00:36:10,030 --> 00:36:11,720
or a zero or one answer, right?

441
00:36:11,880 --> 00:36:16,040
It's figuring out what can be delegated

442
00:36:16,050 --> 00:36:18,820
and it's also who you're involving.

443
00:36:19,160 --> 00:36:21,560
Some people are very good at tolerating uncertainty

444
00:36:22,160 --> 00:36:25,060
and other people need things to be more clearly specified.

445
00:36:25,500 --> 00:36:28,320
And so kind of knowing that, right,

446
00:36:28,740 --> 00:36:31,700
knowing those people and figuring out what to carve up,

447
00:36:31,780 --> 00:36:37,020
that's just one of the big challenges of leading a project like this.

448
00:36:36,380 --> 00:36:38,180
[ANDRES] Yeah.

449
00:36:37,860 --> 00:36:40,640
And quite often the problem is it's not known whether something is a hard problem or an actually

450
00:36:43,460 --> 00:36:46,520
easy problem without first having spent the time to solve the problem.

451
00:36:47,220 --> 00:36:54,320
And that means that delegating the problem is kind of like a roll of a die, to see

452
00:36:54,440 --> 00:36:58,780
like, it might go well or it might not, but without you having ahead of time the information to

453
00:36:59,980 --> 00:37:01,440
decide whether it's a good match.

454
00:37:03,859 --> 00:37:09,140
[CLAIRE] Okay, so before we flip to what went right and to look at the things that you

455
00:37:09,220 --> 00:37:13,460
want to celebrate or you want to repeat or you hope others repeat, is there

456
00:37:13,720 --> 00:37:19,400
anything else that went wrong that leads to a lesson that you want to share with

457
00:37:19,540 --> 00:37:20,300
other engineers?

458
00:37:23,900 --> 00:37:30,840
[ANDRES] I think there's a lot more, but I don't know how much of those are worth investing time

459
00:37:30,920 --> 00:37:31,940
on this podcast.

460
00:37:33,240 --> 00:37:34,540
Maybe one interesting

461
00:37:37,080 --> 00:37:37,240
challenge

462
00:37:37,780 --> 00:37:38,840
around this was that

463
00:37:39,660 --> 00:37:41,300
it turns out that hardware is very

464
00:37:41,600 --> 00:37:43,140
diverse and has very many

465
00:37:43,360 --> 00:37:43,740
odd behaviors.

466
00:37:45,940 --> 00:37:47,260
I spent a fair bit of time

467
00:37:47,540 --> 00:37:48,400
trying to understand

468
00:37:49,240 --> 00:37:51,300
how different SSDs

469
00:37:52,180 --> 00:37:52,740
work

470
00:37:53,400 --> 00:37:54,440
across different workloads

471
00:37:55,240 --> 00:37:57,080
and it turns out there's very little

472
00:37:57,340 --> 00:37:58,660
information out there to

473
00:38:00,880 --> 00:38:10,640
understand that and some SSDs like much bigger writes but very little I/O concurrency, but other

474
00:38:10,800 --> 00:38:15,760
SSDs, even from the same manufacturer in some cases, want a lot of concurrent writes but

475
00:38:16,560 --> 00:38:21,720
not have them be very large because otherwise the latency increases dramatically, and that makes it

476
00:38:21,780 --> 00:38:28,280
very hard to have generally applicable auto-tuning systems.

477
00:38:28,560 --> 00:38:35,800
And I think we spent a fair bit of time trying to make subsets of the AIO project

478
00:38:36,140 --> 00:38:41,880
not have a lot of configuration knobs for every user, because like users are not

479
00:38:42,040 --> 00:38:43,740
going to know how to tune those.

480
00:38:44,520 --> 00:38:50,200
But I think we had a hard time finding good ways to do that.

481
00:38:50,340 --> 00:38:51,660
And that was definitely a challenge.

482
00:38:51,830 --> 00:38:56,540
And I think we went with very simple algorithms for now,

483
00:38:56,630 --> 00:38:59,640
but it's definitely not where it could be.

484
00:39:00,180 --> 00:39:02,040
And there's lots of challenges still

485
00:39:02,280 --> 00:39:04,280
remaining with dealing with different hardware,

486
00:39:05,560 --> 00:39:08,380
and particularly because no individual developer will ever

487
00:39:08,580 --> 00:39:10,660
have access to all kinds of different hardware.

488
00:39:14,900 --> 00:39:17,600
[CLAIRE] Okay, so you're suggesting there's more work to do in the future,

489
00:39:18,360 --> 00:39:20,720
especially around tuning. [A lot more work, yes.]

490
00:39:21,850 --> 00:39:24,600
So before we dive into what went right with the project,

491
00:39:25,160 --> 00:39:28,480
maybe let's tell people, like, where is this project now?

492
00:39:29,040 --> 00:39:36,120
And how much work, how much change is going to happen in the future

493
00:39:36,880 --> 00:39:39,180
in Postgres 19, in Postgres 20?

494
00:39:39,540 --> 00:39:43,920
Like, let's just state of the world, AIO and Postgres, Postgres 18.

495
00:39:45,740 --> 00:39:50,340
[ANDRES] In Postgres 18, there are quite a few uses of AIO.

496
00:39:51,930 --> 00:39:56,160
For example, sequential scans, bitmap-heap scans, vacuum,

497
00:39:57,799 --> 00:39:59,720
all use AIO.

498
00:40:00,819 --> 00:40:04,460
And in several of those, it can lead to substantial speedups.

499
00:40:05,580 --> 00:40:08,100
The reasons for the speedups actually differ somewhat

500
00:40:08,250 --> 00:40:11,440
between the different uses of AIO,

501
00:40:11,470 --> 00:40:14,080
but they are pretty decent speedups.

502
00:40:15,040 --> 00:40:28,320
However, there are very important, heavily I/O-dependent paths in Postgres that do not use AIO yet.

503
00:40:28,610 --> 00:40:39,120
And the most crucial one is probably that index scans, like not bitmap index scans, but plain index scans, do not yet use AIO.

504
00:40:39,680 --> 00:40:42,520
And that means that if you have a workload

505
00:40:42,660 --> 00:40:45,220
that does a lot of ordered index scans, for example,

506
00:40:46,440 --> 00:40:48,480
you're not going to want to,

507
00:40:48,720 --> 00:40:51,560
you're not going to benefit from AIO,

508
00:40:51,800 --> 00:40:54,500
even though it's a workload that can, in theory, very, very

509
00:40:54,600 --> 00:40:56,900
heavily benefit from AIO.

510
00:40:57,140 --> 00:40:59,840
There's a prototype that's being worked on,

511
00:40:59,840 --> 00:41:04,600
or a project to add readahead for index scans.

512
00:41:05,260 --> 00:41:09,640
And in some cases, the speedups are

513
00:41:10,120 --> 00:41:16,900
8x, 9x, compared to not using readahead.

514
00:41:17,500 --> 00:41:22,900
And that also means that, let me backtrack a tiny bit,

515
00:41:23,440 --> 00:41:29,720
one of the motivations for adding AIO to Postgres was to be able to use direct I/O,

516
00:41:30,220 --> 00:41:36,660
which means that we do not rely on the kernel caching, buffering, I/O for us,

517
00:41:36,920 --> 00:41:41,280
and the kernel also does not do any readahead.

518
00:41:42,880 --> 00:41:49,160
And that can be a lot faster than relying on the kernel page cache,

519
00:41:50,320 --> 00:41:52,040
and it can avoid a lot of double buffering,

520
00:41:52,180 --> 00:41:55,180
where the same data is cached in Postgres' buffer pool

521
00:41:56,480 --> 00:41:57,460
and in the kernel page cache.

522
00:42:00,400 --> 00:42:04,640
But without supporting AIO in a few more places,

523
00:42:04,740 --> 00:42:11,500
that's just not viable to use in any non-toy workload. Today, when turning on direct I/O in

524
00:42:11,620 --> 00:42:16,000
Postgres 18, it is going to be faster for sequential scans in a lot of cases. However,

525
00:42:16,320 --> 00:42:22,400
if you ever have an index scan, it will be a lot slower than before. That index scan can no longer

526
00:42:24,100 --> 00:42:31,060
utilize readahead by the operating system. So I think one big part that is remaining is to just

527
00:42:31,060 --> 00:42:37,300
use AIO in more places. Often that will not actually require a lot of work on the AIO

528
00:42:37,600 --> 00:42:45,480
subsystem itself, but it will just require work in the subsystem that wants to use AIO.

529
00:42:45,880 --> 00:42:54,480
For example, for the index scan, a big part of work is to, like the index interface,

530
00:42:55,100 --> 00:43:02,240
how to represent the ability to do more readahead or to present readaheads in there and how to

531
00:43:03,020 --> 00:43:09,500
handle the pinning of buffers for a longer time, and similar things, and I think there will be

532
00:43:09,640 --> 00:43:14,460
a lot of other areas like that.

533
00:43:14,120 --> 00:43:17,420
[CLAIRE] So, let's pause for a second.

534
00:43:17,500 --> 00:43:23,100
For users that are listening to this, the story is not yet written.

535
00:43:23,260 --> 00:43:28,320
Postgres 19 is likely going to have, so Postgres 19 is the release that will

536
00:43:28,360 --> 00:43:33,260
come out in the September-ish timeframe of 2026, a year from now.

537
00:43:34,080 --> 00:43:41,880
It's likely to have even more users of AIO, potentially such as index scans that will then

538
00:43:41,900 --> 00:43:44,560
reap the performance benefits for some workloads.

539
00:43:44,740 --> 00:43:46,380
That's what you're saying, right?

540
00:43:47,240 --> 00:43:52,680
[ANDRES] Yes, and I suspect that that will go on considerably longer than Postgres 19.

541
00:43:53,880 --> 00:43:58,540
Although I think if you add a few more...

542
00:43:54,640 --> 00:43:54,820
[CLAIRE] Okay,

543
00:43:55,240 --> 00:44:02,280
so also more things in Postgres 20, et cetera.

544
00:44:02,320 --> 00:44:04,080
And you were about to say, the second part...

545
00:44:04,860 --> 00:44:17,640
[ANDRES] Is that AIO in 18 is only used for reads. There are no writes that are utilizing AIO.

546
00:44:16,160 --> 00:44:21,880
[CLAIRE] What? Okay, but that's just because it hasn't been done yet, right, it's going to be used for writes, in the future?

547
00:44:20,660 --> 00:44:23,640
[ANDRES] Correct, yes, but in 18 we're not yet and the reason for that is that there are lots of

548
00:44:28,060 --> 00:44:32,140
architectural issues outside of the AIO subsystem that need to be tackled. That's actually what I'm

549
00:44:32,200 --> 00:44:38,420
currently working on is to make the buffer manager ready to do AIO writes and it turns out that there's

550
00:44:38,520 --> 00:44:44,180
just a bunch of larger AIO independent projects that need to be done to make that feasible.

551
00:44:46,620 --> 00:44:47,760
And then there are currently patches to do some of the preliminary work to

552
00:44:57,740 --> 00:45:04,440
make it easier to later do AIO writes. And some of them have substantial performance benefits on

553
00:45:04,750 --> 00:45:11,460
their own. Melanie posted a patch to do write combining for writes in checkpointer, for example,

554
00:45:11,960 --> 00:45:16,780
and that can speed up checkpoints rather substantially and I think it also does some of

555
00:45:16,780 --> 00:45:22,960
the work that we then later need to do to turn those into asynchronous I/O writes and

556
00:45:24,340 --> 00:45:30,300
that's another big, I think, set of improvements that we can do and then as mentioned

557
00:45:30,480 --> 00:45:39,460
earlier one thing that I really want to do with AIO eventually is AIO writes for WAL writes and

558
00:45:40,500 --> 00:45:46,600
that will be a pretty large project that requires like infrastructure changes that are not really

559
00:45:46,780 --> 00:45:52,780
related to AIO but that will hopefully have their own performance benefits. Yeah I think that's

560
00:45:52,900 --> 00:45:53,540
roughly the current state.

561
00:45:54,620 --> 00:46:02,140
[CLAIRE] And if somebody is listening to this and they are a contributor to Postgres and they're not

562
00:46:02,380 --> 00:46:11,180
already involved in helping drive all of this future work for AIO in Postgres 19 or Postgres

563
00:46:11,360 --> 00:46:13,440
20, like how do they get involved?

564
00:46:13,660 --> 00:46:19,260
It's just via the mailing list, or via reaching out to you, or just starting to do the work,

565
00:46:19,500 --> 00:46:21,800
like what is that process like? Maybe there's a PhD student somewhere who's listening to this.

566
00:46:23,940 --> 00:46:29,460
[ANDRES] I think all of those can work.

567
00:46:30,060 --> 00:46:36,620
You can just decide that you want to start using AIO in one more place, and some of those

568
00:46:36,680 --> 00:46:37,940
are not going to be very hard.

569
00:46:41,020 --> 00:46:46,960
And you can convert those to use a read stream to use AIO for reads.

570
00:46:47,320 --> 00:46:51,640
And that can be done fairly easily, I think, in some cases.

571
00:46:52,620 --> 00:46:57,420
And you can reach out to me or to the entire list to ask for suggestions

572
00:46:57,820 --> 00:47:02,740
or to get review for the idea or the actual patch.

573
00:47:03,280 --> 00:47:06,200
You can also go to the PostgreSQL Hackers Discord,

574
00:47:06,760 --> 00:47:09,520
it's linked on the community website,

575
00:47:11,100 --> 00:47:13,460
and ask for suggestions there.

576
00:47:14,360 --> 00:47:19,120
Another big area where I would definitely welcome help would be to

577
00:47:20,720 --> 00:47:26,380
review patches that are related to AIO.

578
00:47:26,460 --> 00:47:31,320
Like I, for example, posted patches for parts of the redesigns of the buffer manager.

579
00:47:32,200 --> 00:47:33,940
You would be more than welcome to review those.

580
00:47:35,800 --> 00:47:35,960
Yeah.

581
00:47:38,160 --> 00:47:38,260
[CLAIRE] Okay.

582
00:47:40,980 --> 00:47:46,260
We'll definitely include a link to the PostgreSQL Hackers Discord in the show notes for this episode.

583
00:47:47,000 --> 00:47:51,980
As well as a link to the mailing list, for anyone who's unfamiliar with that.

584
00:47:52,780 --> 00:47:56,140
I'm curious whether there's a list, like a punch list,

585
00:47:56,310 --> 00:47:59,440
you know how when a house is mostly built,

586
00:47:59,510 --> 00:48:03,400
but there's still this laundry list of some big, some small things

587
00:48:03,660 --> 00:48:05,380
that the builder still needs to finish?

588
00:48:06,040 --> 00:48:08,980
Is there a punch list for all of these pieces

589
00:48:09,340 --> 00:48:11,860
that still need to be built out to leverage AIO?

590
00:48:14,360 --> 00:48:18,580
[ANDRES] There's a wiki page, but it's not, I should probably go and update it.

591
00:48:19,530 --> 00:48:24,640
I did some work on updating it after AIO got merged, but it needs some more work.

592
00:48:24,780 --> 00:48:30,000
But that's probably a good place to look, but just with a caveat that it might not be

593
00:48:30,100 --> 00:48:30,780
perfectly up to date.

594
00:48:33,220 --> 00:48:38,560
[CLAIRE] Okay, so it's a work in progress, if you will, and it'll change over time, [Yes, definitely.] depending on when

595
00:48:38,640 --> 00:48:46,620
somebody listens to this. Okay, so let's pivot to things that went right in the project,

596
00:48:47,020 --> 00:48:53,660
things that you feel good about, you and other people who you collaborated with.

597
00:48:56,199 --> 00:49:02,120
[ANDRES] I think the thing I feel best about is that we actually managed to get it done at all.

598
00:49:03,000 --> 00:49:12,200
And when I started the project, I was not at all confident that this was a project that we could succeed in.

599
00:49:12,730 --> 00:49:19,060
I thought it was important to try to succeed in, but yeah, I was not confident that it would actually work out.

600
00:49:21,020 --> 00:49:25,260
And that's definitely something I'm very happy and proud of.

601
00:49:26,640 --> 00:49:33,640
I think another thing that went well, and I think those are the parts...

602
00:49:34,060 --> 00:49:35,220
Let me restart that.

603
00:49:35,690 --> 00:49:46,620
I think what went well was that we found sub-projects that could be merged independently, and that help independently, like the relation extension part that I mentioned earlier.

604
00:49:47,420 --> 00:49:51,680
Being able to upstream that first was pretty important.

605
00:49:51,690 --> 00:49:55,860
I think otherwise having to also carry all those changes at the same time would have been very hard.

606
00:49:57,220 --> 00:50:00,400
And getting the read stream stuff into Postgres 17

607
00:50:00,700 --> 00:50:04,200
and allowing various places to be converted

608
00:50:04,380 --> 00:50:05,440
to use the read streams

609
00:50:05,980 --> 00:50:12,340
was actually fairly crucial to merge the AIO in Postgres 18

610
00:50:12,520 --> 00:50:16,640
because that meant that with just merging the AIO subsystem

611
00:50:16,760 --> 00:50:21,500
and doing a few dozen lines of change in read stream,

612
00:50:22,900 --> 00:50:25,840
all of those places suddenly started to use AIO.

613
00:50:25,960 --> 00:50:46,520
And that made it a lot more reviewable than if, after merging the whole AIO subsystem, we had to also go into all these other places and change them to use the read stream interface, because that sometimes required non-trivial work in those places, in some cases just to get rid of other architectural debt and similar things.

614
00:50:49,200 --> 00:50:59,520
I think several people that worked on AIO gained a lot of experience, and I think that was a

615
00:51:01,480 --> 00:51:07,980
pretty good success. And even though, as I'm sure that some people that might be listening

616
00:51:08,560 --> 00:51:14,520
would confirm, it was not always pain-free. And I would like that to have been a more pleasant

617
00:51:14,520 --> 00:51:18,820
experience, but I think still a lot of knowledge was gained across all the people

618
00:51:19,060 --> 00:51:29,460
involved. And I think that's great. Yeah, I don't really have other thoughts.

619
00:51:33,060 --> 00:51:40,900
[CLAIRE] I mean obviously the fact that you got it done is something to feel good about but I'm struck

620
00:51:41,120 --> 00:51:45,880
by what you said after that, that you were not at all confident this was a project we could succeed

621
00:51:45,980 --> 00:51:53,920
in. And I almost wonder, I wonder if that's your nature. Is it fair to say that you are inherently

622
00:51:54,440 --> 00:51:59,520
skeptical of something in the beginning, that you're like picking up that idea and looking at

623
00:51:59,520 --> 00:52:05,020
it from different angles to figure out what could go wrong and obviously try to prevent those things

624
00:52:05,090 --> 00:52:08,580
from going wrong? Isn't that how you're wired or am I misreading you?

625
00:52:09,500 --> 00:52:14,480
[ANDRES] I think that's part of it, but I don't think that is all of it.

626
00:52:14,580 --> 00:52:20,080
I've definitely tackled projects where I was like 95% sure that I could succeed.

627
00:52:21,800 --> 00:52:26,640
Just because I've been working on Postgres for a long time by now, and I know the community,

628
00:52:27,660 --> 00:52:32,960
and I can roughly predict whether something has a chance or is going to be controversial or not.

629
00:52:33,460 --> 00:52:43,500
But with the AIO project I did not have confidence either in my own skills, from a technical point of view, that it would be doable, or in the community politics.

630
00:52:43,830 --> 00:52:56,880
I think that's perhaps like one angle I forgot to mention earlier, which is that a change that is of this size, getting that into Postgres requires convincing a lot of people.

631
00:52:57,780 --> 00:53:06,460
And historically, our to-do list had a point that said we do not want to use direct I/O.

632
00:53:11,600 --> 00:53:16,560
Political things are a lot harder to predict than purely technical things.

633
00:53:17,880 --> 00:53:22,940
So I think, yeah, that's why I think I was more skeptical about this project than about other projects.

634
00:53:26,759 --> 00:53:27,740
[CLAIRE] Okay, so maybe let's just tease that out as something else that went right. I mean

635
00:53:32,880 --> 00:53:37,740
you and the other people involved in the project were ultimately able to convince a lot of people.

636
00:53:38,240 --> 00:53:53,540
So it wasn't just a matter of doing the work, right, and getting it done correctly, but selling people and bringing the rest of the committer and contributor engineers along with you.

637
00:53:54,620 --> 00:53:58,080
Like, that's something to feel good about, too.

638
00:53:58,690 --> 00:54:00,000
Maybe that's what you meant before. [Yeah, that's true.]

639
00:54:00,110 --> 00:54:02,980
It was, like, implied. Are there, are there...

640
00:54:06,700 --> 00:54:11,320
[ANDRES] I think it was implicit in what I said earlier, but another aspect I think that went right

641
00:54:16,280 --> 00:54:19,760
was to actually develop a prototype first.

642
00:54:22,420 --> 00:54:29,340
Because without like being able to just explore crazy things and then roll back and not be too worried

643
00:54:29,410 --> 00:54:34,200
about getting everything right I would also not have been able to actually get to a design point

644
00:54:34,460 --> 00:54:43,940
where it was mergeable and that was, I think, one more of those things that were like, some, it was important

645
00:54:43,940 --> 00:54:48,740
to do but I did it too much, but where the exact right spot is hard to tell, but I don't think

646
00:54:48,760 --> 00:54:52,800
without the plan to write a prototype that would not be mergeable, I don't think it could

647
00:54:52,820 --> 00:54:53,560
have gone anywhere.

648
00:54:55,440 --> 00:54:56,840
I think one more aspect that I think went okay, could have gone better, could have gone a

649
00:55:03,060 --> 00:55:06,420
lot worse, is corporate politics.

650
00:55:07,160 --> 00:55:13,000
I worked on AIO, I think, while working at two different Postgres companies or three

651
00:55:13,140 --> 00:55:13,900
different Postgres companies.

652
00:55:15,820 --> 00:55:26,280
And you have to convince the companies to actually allow you to spend so much time on something that does not actually have very immediate benefit.

653
00:55:26,460 --> 00:55:35,020
Because it was always clear that it would take a while to get merged and that even then it would take more years for it to get adopted.

654
00:55:35,580 --> 00:55:42,040
And I think that's definitely also an angle where I've had to learn a lot about how to

655
00:55:42,190 --> 00:55:49,240
do that and how to get buy-in into investing this much into a project with unclear outcomes.

656
00:55:49,800 --> 00:55:50,800
I think that went okay.

657
00:55:52,570 --> 00:55:57,120
And I'm proud that it did not go horribly.

658
00:55:57,240 --> 00:55:58,740
[CLAIRE] When you gave the talk at Montreal

659
00:55:59,040 --> 00:56:01,060
you actually gave a shout out to your boss

660
00:56:01,360 --> 00:56:01,860
Affan Dar

661
00:56:02,440 --> 00:56:04,800
for supporting you in your

662
00:56:05,060 --> 00:56:06,420
years working on this project,

663
00:56:07,920 --> 00:56:08,480
but

664
00:56:09,120 --> 00:56:10,980
I guess I've not been a fly

665
00:56:11,000 --> 00:56:13,240
on the wall in your one-on-ones with your boss,

666
00:56:13,350 --> 00:56:18,700
but it feels to me that there's a ton of support for

667
00:56:20,100 --> 00:56:24,820
what you and the team are working on, and the knowledge that like many of the

668
00:56:25,060 --> 00:56:29,200
decisions about what gets worked on in a future release, or an upcoming release

669
00:56:29,250 --> 00:56:29,740
of Postgres,

670
00:56:30,160 --> 00:56:31,740
it's a very bottom-up process.

671
00:56:32,490 --> 00:56:32,860
Is that

672
00:56:33,030 --> 00:56:33,840
fair to say?

673
00:56:34,200 --> 00:56:36,340
And I feel like Affan is supportive of that.

674
00:56:37,180 --> 00:56:38,860
[ANDRES] Yes, I agree. It turns out there were several other managers over time. And I think they

675
00:56:46,000 --> 00:56:49,840
were all supportive, but in different ways. And I think managing expectations of the timelines

676
00:56:53,200 --> 00:56:58,360
and stuff like that is pretty important, to just not, otherwise you deceive your

677
00:56:59,580 --> 00:57:01,780
manager which is not necessarily a good idea.

678
00:57:05,040 --> 00:57:11,000
[CLAIRE] All right. So other things that went right that you feel good about. I have one to throw out there.

679
00:57:12,220 --> 00:57:18,620
And you're going to shoot me for bringing this up because your moment of fame is behind you.

680
00:57:18,700 --> 00:57:24,100
It happened in like whatever that was, March, April, 2024, something like that. It was over a

681
00:57:24,180 --> 00:57:30,760
year ago. But were you actually, I think Thomas Munro had sent you something and asked you to do

682
00:57:30,780 --> 00:57:38,000
some performance testing on it and that's when you discovered the XZ Utils security backdoor [That's true.]

683
00:57:38,780 --> 00:57:44,540
and reported that security issue, and that kind of blew up the internet for a little while, but

684
00:57:44,920 --> 00:57:48,780
wasn't what Thomas Munro sent to you to investigate, wasn't that AIO related?

685
00:57:49,520 --> 00:57:53,760
[ANDRES] That was the read stream interface.

686
00:57:54,160 --> 00:57:57,760
We were trying to figure out why it had some regression in some observed workloads,

687
00:57:58,580 --> 00:58:00,740
and as part of that I did all the tests where I then found that SSH was using too much CPU, and yeah, [So that's something that went right.]

688
00:58:07,300 --> 00:58:12,800
it turns out that it's good, very good, to learn about low-level benchmarking. It has

689
00:58:13,780 --> 00:58:14,740
unexpected benefits.

690
00:58:18,020 --> 00:58:21,920
[CLAIRE] Yeah, I remember seeing an email from someone, I won't name names, but they were like

691
00:58:22,700 --> 00:58:28,700
"a database engineer wasn't going to be running low-level performance benchmarks like that,

692
00:58:28,720 --> 00:58:35,300
you've got to be kidding me," but they clearly have never met you, and are unaware of your

693
00:58:35,860 --> 00:58:43,640
commitment to investigating performance problems and getting to, I don't know, figuring them

694
00:58:43,800 --> 00:58:43,920
out.

695
00:58:44,200 --> 00:58:46,360
I mean, you can be very stubborn, can't you?

696
00:58:47,140 --> 00:58:47,520
Is that fair?

697
00:58:49,060 --> 00:58:52,320
[ANDRES] I refuse to answer on the grounds that it might incriminate me.

698
00:58:55,440 --> 00:58:55,800
[CLAIRE] [LAUGHS] All right.

699
00:58:56,240 --> 00:59:00,740
I'm going to go look at the chat really quickly because there's a bunch of other Postgres developers

700
00:59:00,990 --> 00:59:07,480
who are on the live parallel chat that's happening while we're doing this recording live,

701
00:59:07,650 --> 00:59:14,220
just to see if there are any other highlights of things that went right that you're not thinking of right now.

702
00:59:15,680 --> 00:59:18,900
Because I'm fishing, fishing for anything else you want to call out.

703
00:59:19,020 --> 00:59:24,780
I mean, for developers listening to this, is there anything else you did that you were like,

704
00:59:25,000 --> 00:59:28,400
huh, people making these kinds of large-scale architectural changes

705
00:59:28,620 --> 00:59:30,980
should definitely do this.

706
00:59:31,680 --> 00:59:33,480
And we did it, and you feel good about it.

707
00:59:35,620 --> 00:59:35,960
Fishing...

708
00:59:38,320 --> 00:59:43,200
[ANDRES] I mean, I think one of the things that turned out to be very crucial for being able to merge

709
00:59:43,320 --> 00:59:50,060
AIO was that we got a lot of review by Noah Misch,

710
00:59:51,900 --> 00:59:53,760
and that was not something that actually

711
00:59:54,440 --> 00:59:55,820
I had planned upon and that

712
00:59:56,120 --> 00:59:58,020
planned for, and I think that

713
00:59:58,180 --> 00:59:59,800
went very well and I'm very very

714
01:00:00,100 --> 01:00:01,960
thankful to Noah that he invested so

715
01:00:02,040 --> 01:00:03,900
much time in it. And I think in

716
01:00:03,980 --> 01:00:05,160
hindsight I would have

717
01:00:07,060 --> 01:00:08,020
probably tried

718
01:00:08,160 --> 01:00:10,000
to do a bit more backroom dealing

719
01:00:10,300 --> 01:00:11,840
for like trading of reviews

720
01:00:12,380 --> 01:00:14,180
with other people to line

721
01:00:14,180 --> 01:00:15,920
them up ahead of time so that

722
01:00:15,940 --> 01:00:17,940
I could be more confident that it would be reviewed,

723
01:00:18,580 --> 01:00:18,980
because like

724
01:00:20,000 --> 01:00:21,860
I think it's,

725
01:00:22,460 --> 01:00:43,060
from a community politics perspective, and from a diversity of thought, maybe it sounds not quite right, but if you work closely on one project together, like the team at Microsoft on AIO, then you might not see problems that somebody who comes at the problem more freshly, from the outside, will see.

726
01:00:43,240 --> 01:00:47,260
And that was definitely the case with Noah. He found a lot of problems that I just did not think about.

727
01:00:48,140 --> 01:00:56,420
And I'm glad that that happened, but in hindsight, I should have invested more. That was luck,

728
01:00:56,520 --> 01:00:58,100
that was not skill, that that happened.

729
01:00:58,900 --> 01:01:07,360
And I think luck is a skill, but I would invest more in trying to line that up ahead of time

730
01:01:07,480 --> 01:01:07,880
next time.

731
01:01:07,680 --> 01:01:17,220
[CLAIRE] So to try to put a fine point on what you just said, it wasn't luck that Noah found the problems, because that's something that he's probably good at, [He's very good at that.]

732
01:01:17,490 --> 01:01:23,240
it was luck that you enlisted Noah to help do the reviews and find the problems.

733
01:01:23,270 --> 01:01:24,540
Is that correct?

734
01:01:25,500 --> 01:01:30,200
[ANDRES] I did not enlist Noah. He volunteered. That's the luck. He just did it.

735
01:01:29,040 --> 01:01:32,480
[CLAIRE] Oh, he volunteered, even better. [Yes.]

736
01:01:31,300 --> 01:01:33,420
So for those of you who don't know Noah Misch,

737
01:01:33,600 --> 01:01:36,200
he's a Postgres committer and contributor,

738
01:01:36,370 --> 01:01:37,180
he works at Google.

739
01:01:39,359 --> 01:01:42,920
And I think the first place I ever met Noah

740
01:01:43,160 --> 01:01:46,240
was at PGConf.dev in Vancouver last year,

741
01:01:47,340 --> 01:01:48,740
and he was there again this year too.

742
01:01:49,040 --> 01:01:50,660
And that's the annual conference

743
01:01:51,080 --> 01:01:52,860
where a lot of the Postgres contributors

744
01:01:53,820 --> 01:01:54,820
and engineers come together.

745
01:01:56,220 --> 01:01:59,060
And some users, but I would say mostly contributors.

746
01:02:01,420 --> 01:02:03,140
Okay, so you've given shout outs to

747
01:02:03,920 --> 01:02:04,420
Thomas Munro,

748
01:02:05,940 --> 01:02:06,420
Bilal

749
01:02:07,880 --> 01:02:09,020
Yavuz, Melanie

750
01:02:09,200 --> 01:02:10,620
Plageman and now Noah Misch.

751
01:02:10,920 --> 01:02:13,120
Is there anyone else that you need to be sure to give a shout

752
01:02:13,120 --> 01:02:14,980
out to, or are there too many people to

753
01:02:15,180 --> 01:02:15,960
possibly list?

754
01:02:17,060 --> 01:02:19,020
This is like the Academy Awards now where you're trying to fit everybody in.

755
01:02:17,860 --> 01:02:30,580
[ANDRES] Thomas Munro, I think, did a fair bit of work too, both on the actual AIO subsystem and upstreaming

756
01:02:31,260 --> 01:02:38,020
the read stream interface in a very different form than what I had prototyped.

757
01:02:39,880 --> 01:02:44,400
And then I think I had a lot of discussions with various people over the years about different

758
01:02:44,560 --> 01:02:49,080
aspects of it, but I think the people that were just mentioned are the most important

759
01:02:49,220 --> 01:02:49,380
ones.

760
01:02:53,880 --> 01:02:59,040
[CLAIRE] One of the things that Melanie just chimed in on the chat and said is that getting someone

761
01:02:59,060 --> 01:03:03,680
experienced to review these architecturally significant patches is hard because it's just

762
01:03:03,680 --> 01:03:12,200
so much work it takes forever to do, she said, and that is probably what makes Noah's contribution

763
01:03:12,680 --> 01:03:17,460
so, so good. That he not only did it, he volunteered for it, is what you're saying.

764
01:03:18,300 --> 01:03:18,900
Okay, so you've given a shout out to Thomas, to Bilal, to Melanie, to Noah, is there anybody else

765
01:03:28,640 --> 01:03:29,800
that you want to shout out to,

766
01:03:33,920 --> 01:03:35,880
or too many to list?

767
01:03:33,960 --> 01:03:35,760
[ANDRES] I would probably have to look in the commit messages.

768
01:03:35,960 --> 01:03:41,900
There were lots of other, smaller, projects that were done.

769
01:03:42,040 --> 01:03:47,540
I think David Rowley did some prerequisite work.

770
01:03:48,560 --> 01:03:52,620
I had a lot of discussions with Robert Haas about architectural aspects.

771
01:03:54,260 --> 01:03:55,400
I had lots of discussions about

772
01:03:57,540 --> 01:03:57,980
parts of it,

773
01:03:58,080 --> 01:03:59,580
and he did also do some review

774
01:04:00,080 --> 01:04:01,340
with Heikki Linnakangas,

775
01:04:03,420 --> 01:04:04,180
but I'm sure

776
01:04:04,280 --> 01:04:06,100
that there's many more that I'm just not

777
01:04:07,920 --> 01:04:08,480
thinking of

778
01:04:08,500 --> 01:04:08,820
right now. It's been, after all, like six or seven years.

779
01:04:09,540 --> 01:04:12,080
[CLAIRE] Well and that's one of the things that's nice,

780
01:04:14,420 --> 01:04:23,160
in the commit messages, the team does a really good job, and I'd say a better job

781
01:04:23,400 --> 01:04:30,340
this year than in the past, of including who reviewed this commit, who were the authors,

782
01:04:30,760 --> 01:04:36,700
who tested it, who was it reported by, like I think in a lot of open source projects

783
01:04:37,340 --> 01:04:44,040
giving credit where credit is due is an important part of the culture and I think that's certainly

784
01:04:44,180 --> 01:04:50,900
true in Postgres as well. So yeah, it's fair to say that there's a lot of other names that are

785
01:04:51,160 --> 01:04:57,080
listed in the plethora of commits that are associated with this project.

786
01:04:59,260 --> 01:05:09,280
Okay, so are there any other lessons you want to highlight to someone who's listening, who maybe is

787
01:05:09,920 --> 01:05:16,420
about to embark on their own architectural project, and is trying to make sure they

788
01:05:17,580 --> 01:05:18,900
only make original mistakes.

789
01:05:23,160 --> 01:05:27,060
[ANDRES] I think one other aspect that I only somewhat mentioned is to take care of yourself if you do something that takes this long.

790
01:05:33,370 --> 01:05:45,820
I found that development progress definitely was associated with how well I was doing in my personal life, and exercise, and all that kind of stuff,

791
01:05:46,260 --> 01:05:51,840
and I feel that more strongly with projects that take this long

792
01:05:51,950 --> 01:05:55,780
because I can look back and remember I was working on this part

793
01:05:55,900 --> 01:05:58,380
when I was sick or something like that.

794
01:05:59,940 --> 01:06:02,420
And I think that is, particularly for long projects,

795
01:06:02,580 --> 01:06:06,360
something to remember that it's important to take care of yourself

796
01:06:06,450 --> 01:06:09,740
and not just invest into the project and do more and more hacking.

797
01:06:13,279 --> 01:06:20,020
[CLAIRE] There was a woman that I used to spend a lot of time with at swim meets, both of my

798
01:06:20,180 --> 01:06:25,200
children were competitive swimmers growing up, and when they're at swim meets oftentimes you're

799
01:06:25,220 --> 01:06:30,920
literally standing by a pool for the entire day, like for hours and hours, like you're there for

800
01:06:31,000 --> 01:06:36,740
the whole day and they swim for, you know, three minutes or something like that. But she had

801
01:06:36,840 --> 01:06:43,540
just taken a job. It was a really big job as an executive. And what she realized in taking on

802
01:06:43,580 --> 01:06:49,140
all that additional responsibility is exactly what you said. She had to take care of her body

803
01:06:49,140 --> 01:06:51,660
in order to be successful in her job.

804
01:06:52,060 --> 01:06:54,060
So she had to change her diet.

805
01:06:54,600 --> 01:06:57,560
She had to find a way to exercise

806
01:06:57,830 --> 01:07:00,020
and have an exercise routine that she could do

807
01:07:00,170 --> 01:07:02,220
even when she was traveling and in hotel rooms.

808
01:07:04,600 --> 01:07:07,960
I think you're right,

809
01:07:08,010 --> 01:07:08,980
you got to take care of yourself

810
01:07:09,440 --> 01:07:11,120
or your brain isn't going to be able to do

811
01:07:11,400 --> 01:07:12,880
everything you need it to do.

812
01:07:14,460 --> 01:07:16,700
So that is worth shining a light on,

813
01:07:17,060 --> 01:07:18,020
I'm glad you brought it up.

814
01:07:20,860 --> 01:07:22,820
Anything else you would tell past Andres,

815
01:07:26,560 --> 01:07:27,640
if you could go back,

816
01:07:29,080 --> 01:07:30,520
and whisper in your own ear?

817
01:07:29,300 --> 01:07:37,680
[ANDRES] I mean, all the mistakes I mentioned. I would tell all the, if I knew ahead of time,

818
01:07:37,800 --> 01:07:43,820
which architectural decisions would be wrong and which prerequisites could be tackled independently

819
01:07:43,860 --> 01:07:46,560
earlier, then I would probably do that.

820
01:07:47,900 --> 01:07:51,980
But that feels like it's not really the answer to the question. [Oh, it's the answer to the question.]

821
01:07:54,700 --> 01:07:57,220
Yeah, I don't think I otherwise have anything

822
01:07:58,970 --> 01:08:01,520
very smart to say, unfortunately.

823
01:08:02,370 --> 01:08:02,620
[CLAIRE] All right,

824
01:08:02,680 --> 01:08:11,500
well, before we wrap, I guess I'm curious.

825
01:08:11,900 --> 01:08:12,800
You're here on a podcast.

826
01:08:13,120 --> 01:08:15,520
This is the second time you've been on the Talking Postgres podcast.

827
01:08:16,060 --> 01:08:16,580
Thank you for that,

828
01:08:17,180 --> 01:08:18,299
I really appreciate it.

829
01:08:18,940 --> 01:08:28,040
I didn't know if you would say yes, but I thought it was important for you to share your learnings from this project so that it will benefit other people.

830
01:08:28,319 --> 01:08:32,700
But I'm curious, now that you've been a guest, you've been a guest on this podcast twice.

831
01:08:33,440 --> 01:08:39,839
You've also been a guest on Oxide and Friends, and another security related podcast, maybe

832
01:08:40,080 --> 01:08:41,240
even more that I don't know about.

833
01:08:41,920 --> 01:08:44,279
But I'm curious whether you listen to podcasts.

834
01:08:46,640 --> 01:08:50,120
[ANDRES] I do listen to podcasts, but mostly non-technical ones.

835
01:08:51,500 --> 01:08:54,960
I think most of the time when I'm listening to podcasts

836
01:08:55,109 --> 01:09:04,040
I'm trying to let my brain do something else

837
01:09:02,930 --> 01:09:06,680
rather than focus on technical things

838
01:09:06,730 --> 01:09:08,839
because I already spent way too much time

839
01:09:09,240 --> 01:09:11,980
thinking about Postgres and stuff like that.

840
01:09:11,990 --> 01:09:13,920
I occasionally do listen to technical podcasts,

841
01:09:14,279 --> 01:09:18,400
but it's mostly when somebody mentions

842
01:09:18,640 --> 01:09:20,160
that something is particularly good

843
01:09:20,160 --> 01:09:22,779
when it's like very square in my interests.

844
01:09:23,770 --> 01:09:28,440
But most, yeah.

845
01:09:27,339 --> 01:09:29,180
I don't listen to too many technical ones.

846
01:09:31,000 --> 01:09:37,420
[CLAIRE] Okay, well I won't put you on the spot and ask you what they are. But I will say thank

847
01:09:37,620 --> 01:09:43,140
you, for the work that you do on Postgres. For those of you who don't know Andres's origin story,

848
00:09:43,259 --> 01:09:48,339
like I said, you should go back and listen to, I think it was episode 8 of Talking

849
01:09:48,540 --> 01:09:52,040
Postgres, where he and Heikki dove into how they got started.

850
01:09:52,569 --> 01:09:55,480
But you got started in kind of an unusual path,

851
01:09:55,740 --> 01:09:59,280
and I don't know, it's almost happenstance that you landed in Postgres.

852
01:09:59,560 --> 01:10:04,040
And I think it's fair to say that I, and a ton of other people, are really glad that you

853
01:10:04,160 --> 01:10:04,280
did.

854
01:10:04,980 --> 01:10:07,720
So I feel very lucky to work with you,

855
01:10:08,280 --> 01:10:11,880
and I guess that's a little bit of a fangirl type of thing to say.

856
01:10:12,320 --> 01:10:13,320
But I do,

857
01:10:13,880 --> 01:10:15,080
so I'm saying it.

858
01:10:15,600 --> 01:10:15,880
[ANDRES] Thank you.

859
01:10:17,380 --> 01:10:19,800
[CLAIRE] Yeah, and thank you so much for joining the show.

860
01:10:20,040 --> 01:10:23,680
I don't have any other topics or questions for us today.

861
01:10:24,680 --> 01:10:27,880
So unless you do, we will give it a wrap.

862
01:10:25,240 --> 01:10:25,280
[ANDRES] Cool.

863
01:10:27,310 --> 01:10:30,260
I don't think I do right now.

864
01:10:31,020 --> 01:10:33,480
[CLAIRE] I want to say thank you to Andres Freund for joining us.

865
01:10:33,720 --> 01:10:37,620
And if you're listening and you liked today's episode, and I hope you did,

866
01:10:38,040 --> 01:10:40,580
and you want to hear more of these Talking Postgres episodes,

867
01:10:41,060 --> 01:10:41,840
you should subscribe,

868
01:10:42,420 --> 01:10:47,940
on Apple, or Spotify, or YouTube, or wherever you get your podcasts. And please tell your friends.

869
01:10:48,900 --> 01:10:52,620
If you tell your friends or leave a review, that helps more people discover the show.

870
01:10:53,980 --> 01:11:00,100
Word of mouth is the best way to discover a new podcast. You can always get to past episodes

871
01:11:00,560 --> 01:11:06,880
and get the links to subscribe at TalkingPostgres.com. And transcripts are included

872
01:11:06,900 --> 01:11:09,500
on the episode pages on TalkingPostgres.com too.

873
01:11:10,060 --> 01:11:13,080
And a big thank you to everybody who joined the live recording

874
01:11:13,560 --> 01:12:42,480
and participated in the live text chat on Discord.