1
00:00:00,060 --> 00:00:01,760
Nikolay: Hello, hello, this is
Postgres.FM.

2
00:00:02,540 --> 00:00:07,260
I'm Nikolay, Postgres.AI, and my
co-host is Michael, pgMustard.

3
00:00:07,780 --> 00:00:08,540
Hi Michael.

4
00:00:09,140 --> 00:00:10,020
Michael: Hello Nikolay.

5
00:00:10,680 --> 00:00:13,160
Nikolay: And guess what we are
going to discuss today?

6
00:00:14,600 --> 00:00:15,360
Michael: I'll guess.

7
00:00:16,700 --> 00:00:18,380
Is it append-only tables?

8
00:00:18,820 --> 00:00:19,320
Nikolay: Exactly.

9
00:00:20,240 --> 00:00:21,440
Michael: Ah, got it right.

10
00:00:21,620 --> 00:00:25,120
Nikolay: I was surprised to hear
we haven't discussed it in the

11
00:00:25,120 --> 00:00:28,220
past. I'm sure we touched it many
times, right?

12
00:00:29,280 --> 00:00:33,040
Michael: Yeah, for sure, it's come
up in episodes. But me too, when

13
00:00:33,040 --> 00:00:38,120
I saw it in our listener-suggested
topics, I did have a few searches

14
00:00:38,120 --> 00:00:41,120
on our site just to make sure we
hadn't talked about it already

15
00:00:41,120 --> 00:00:45,860
as a whole episode. And yeah, agreed,
but it's not something I see

16
00:00:45,860 --> 00:00:49,160
all the time. Like, it is
relatively common to have

17
00:00:49,160 --> 00:00:55,620
an events table, but even then I
mostly see append-mostly tables

18
00:00:55,720 --> 00:00:56,580
rather than append-only ones.

19
00:00:59,160 --> 00:01:02,300
Nikolay: Oh yeah, I was thinking
we are going to discuss

20
00:01:02,640 --> 00:01:03,140
both.

21
00:01:03,760 --> 00:01:04,260
Yeah.

22
00:01:05,740 --> 00:01:09,440
Do you feel like an episode for
append-mostly tables?

23
00:01:10,440 --> 00:01:10,940
No?

24
00:01:11,260 --> 00:01:15,020
Michael: I don't, well, I actually,
I don't think this is super

25
00:01:15,020 --> 00:01:15,480
complex.

26
00:01:15,480 --> 00:01:19,940
I think it's quite nice because
almost by definition if we do

27
00:01:19,940 --> 00:01:23,260
accept that it's append-only, we've
got quite a narrow scope, and

28
00:01:23,260 --> 00:01:26,320
there are only a few things to consider.
Maybe it gets a bit more

29
00:01:26,320 --> 00:01:27,560
complicated with append-mostly.

30
00:01:27,560 --> 00:01:29,200
But yeah, we can definitely cover
that today.

31
00:01:29,200 --> 00:01:32,340
I think it's still not that complicated.

32
00:01:33,980 --> 00:01:39,560
Nikolay: Yeah. Where do we start?
With a definition?

33
00:01:40,680 --> 00:01:42,320
Michael: Go for it, yeah, why not?

34
00:01:43,200 --> 00:01:46,200
Nikolay: Well, if we just insert,
that's append-only. That's it.

35
00:01:46,980 --> 00:01:50,220
Michael: INSERT and SELECT only,
I would say. Yeah, no updates,

36
00:01:50,220 --> 00:01:50,960
no deletes.

37
00:01:51,200 --> 00:01:55,120
Nikolay: SELECT is allowed, right?
Yeah, INSERT and SELECT, these

38
00:01:55,120 --> 00:02:01,740
are the only 2 things we can allow
ourselves from DML, data manipulation

39
00:02:01,860 --> 00:02:07,200
language. And that's it. Yeah, we
just SELECT and INSERT, this is

40
00:02:07,200 --> 00:02:12,980
append-only. If we have occasional
deletes and updates, it's

41
00:02:12,980 --> 00:02:15,580
append-mostly or insert-mostly, I don't
know how to say it.

42
00:02:15,680 --> 00:02:17,000
Michael: Yeah, I like that.

43
00:02:17,020 --> 00:02:19,820
Nikolay: And why do we care about
this particular use case?

44
00:02:20,820 --> 00:02:24,240
Because it has characteristics,
right?

45
00:02:24,240 --> 00:02:25,460
It usually has...

46
00:02:25,760 --> 00:02:29,140
If it's append-only, for example,
we don't care about dead tuples

47
00:02:29,140 --> 00:02:29,640
anymore.

48
00:02:29,920 --> 00:02:30,980
No bloat, right?

49
00:02:31,500 --> 00:02:32,220
It's good.

50
00:02:33,460 --> 00:02:37,840
And usually we talk about huge
volumes of data.

51
00:02:38,100 --> 00:02:43,580
And at some point we think, oh, we need to compress it, we need

52
00:02:43,580 --> 00:02:48,960
to offload it maybe to cheaper storage, or just clean up, because

53
00:02:49,400 --> 00:02:53,300
all data is not needed anymore in raw form.

54
00:02:53,300 --> 00:02:57,500
Sometimes it's aggregated and in raw form we can just remove

55
00:02:57,500 --> 00:02:58,400
it from database.

56
00:03:00,240 --> 00:03:02,820
Or we just need bottomless storage.

57
00:03:03,460 --> 00:03:09,640
And usually we want inserts to happen very fast, because the

58
00:03:09,640 --> 00:03:10,840
volume is huge.

59
00:03:10,840 --> 00:03:14,760
So we need to make sure performance of inserts is good.

60
00:03:16,060 --> 00:03:17,180
Did I miss anything?

61
00:03:17,780 --> 00:03:20,720
Michael: No, I think those things aren't necessarily always true

62
00:03:20,720 --> 00:03:24,680
for append-only tables but they correlate like a lot of the use

63
00:03:24,680 --> 00:03:29,840
cases for very fast-growing data. And by definition, append-only

64
00:03:30,060 --> 00:03:32,080
means it's never going to decrease in size.

65
00:03:32,080 --> 00:03:33,480
It's only going to keep getting larger

66
00:03:33,480 --> 00:03:33,780
Nikolay: and larger.

67
00:03:33,780 --> 00:03:35,260
Unless you clean it up.

68
00:03:36,340 --> 00:03:38,600
Well, there are specific cases.

69
00:03:38,600 --> 00:03:45,200
For example, imagine we discussed many times the topic of slow

70
00:03:45,200 --> 00:03:45,700
count.

71
00:03:46,420 --> 00:03:52,740
And if you can allow asynchronous calculation of count, maybe

72
00:03:52,740 --> 00:03:54,180
it's like materialized or something.

73
00:03:54,180 --> 00:03:54,960
I don't know.

74
00:03:54,960 --> 00:03:59,060
So the idea is, instead of updating the count somewhere on each insert

75
00:03:59,060 --> 00:04:02,920
in the original table, you can aggregate operations in an intermediate

76
00:04:03,000 --> 00:04:05,140
table, and then it's append-only.

77
00:04:05,420 --> 00:04:10,520
So you register events in some table, and then you process chunks

78
00:04:10,520 --> 00:04:14,680
and reflect this in the count in final storage.

79
00:04:15,380 --> 00:04:20,980
And then you can delete it, or better, drop this partition, or truncate

80
00:04:20,980 --> 00:04:21,680
or something.

81
00:04:22,080 --> 00:04:25,580
Right, in this case, it's append-only, but it grows, grows, and

82
00:04:25,580 --> 00:04:26,880
then the size drops.

83
00:04:27,520 --> 00:04:29,280
It happens also, right?
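The asynchronous counting pattern described above can be sketched as a toy in-memory Python model (names like register_event and process_chunk are hypothetical illustrations, not anything from the episode): writers only ever append raw events, and a background job periodically folds a chunk into the final counter, then discards the processed rows the way you would truncate or drop a partition.

```python
from collections import deque

# Append-only "events" table: writers only ever append here,
# so inserts stay cheap and there is no update/delete bloat.
events = deque()

# Final storage: the aggregated counter, maintained asynchronously.
counters = {"likes": 0}

def register_event(name):
    events.append(name)  # the only write path: append

def process_chunk(max_rows=1000):
    # Fold a chunk of raw events into the counter, then discard
    # the processed rows (analogous to dropping a partition).
    processed = 0
    while events and processed < max_rows:
        counters[events.popleft()] += 1
        processed += 1
    return processed

for _ in range(5):
    register_event("likes")
process_chunk()
print(counters["likes"])  # 5, and the raw events are gone
```

The point of the design is that each insert touches only the append-only table; the expensive aggregation is batched and happens off the hot path.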

84
00:04:29,280 --> 00:04:35,920
Michael: Yeah, I think dropping partitions definitely pushes

85
00:04:35,920 --> 00:04:38,940
the definition of append-only, but it's the thing that makes

86
00:04:38,940 --> 00:04:42,660
sense to do in most, or a lot of cases, at huge scale.

87
00:04:44,040 --> 00:04:46,840
But yeah, is it still append-only if we're dropping partitions?

88
00:04:48,520 --> 00:04:50,800
Nikolay: Yes, this is how we should do

89
00:04:50,800 --> 00:04:54,020
Michael: it. No, I know, but do you see what I mean? You mentioned

90
00:04:54,120 --> 00:04:56,420
deleting data, but...

91
00:04:56,480 --> 00:05:01,120
Nikolay: Well, again: it's append-only, we drop the partition,

92
00:05:01,160 --> 00:05:05,100
we never delete, we never update, it's append-only. But if we

93
00:05:05,100 --> 00:05:11,260
don't need, like, last year's data, we already processed it somehow,

94
00:05:12,180 --> 00:05:15,340
made all calculations we need, we can get rid of raw data.

95
00:05:15,340 --> 00:05:21,220
We just drop the partition. It's the best we can do instead of

96
00:05:21,220 --> 00:05:23,000
cleaning up somehow using deletes.
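The retention idea above, dropping a whole partition instead of deleting rows, can be sketched with a toy Python model (the partition names and the drop_expired helper are hypothetical): expiring old data becomes one cheap operation per partition rather than a pass over every row.

```python
from datetime import date

# A table partitioned by month: each partition holds its own rows.
partitions = {
    "2023_01": [("order", 1), ("order", 2)],
    "2024_01": [("order", 3)],
}

def drop_expired(partitions, cutoff):
    # Retention via DROP PARTITION: remove whole partitions at once,
    # instead of deleting rows one by one (which creates dead tuples
    # and bloat that vacuum then has to clean up).
    expired = [k for k in partitions
               if date(int(k[:4]), int(k[5:]), 1) < cutoff]
    for key in expired:
        del partitions[key]  # one cheap drop per partition
    return expired

dropped = drop_expired(partitions, cutoff=date(2024, 1, 1))
print(dropped)           # ['2023_01']
print(list(partitions))  # ['2024_01']
```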

97
00:05:23,100 --> 00:05:26,820
I think we need to discuss it because I did it many times and

98
00:05:26,820 --> 00:05:29,660
participated in huge projects in very large companies.

99
00:05:30,720 --> 00:05:33,540
The idea, let's offload all archive data.

100
00:05:34,060 --> 00:05:35,040
It was e-commerce.

101
00:05:36,060 --> 00:05:39,600
Old orders, let's offload it to
cheaper storage for longer-term

102
00:05:39,640 --> 00:05:44,380
storage, and then we need to delete
it in original place, original

103
00:05:44,380 --> 00:05:46,200
database, and it was not partitioned.

104
00:05:47,020 --> 00:05:50,420
With deletes, it was a project for a
couple of months.

105
00:05:52,440 --> 00:05:57,900
Because, like, downtime is
not acceptable, it costs a lot

106
00:05:58,140 --> 00:05:58,840
of dollars.

107
00:05:59,720 --> 00:06:05,380
E-commerce guys know very well,
they can calculate it, each second

108
00:06:05,380 --> 00:06:08,340
of downtime, how much it costs
the company.

109
00:06:09,060 --> 00:06:13,640
So if I had a partitioned table there,
it would be magic.

110
00:06:15,060 --> 00:06:15,860
And it's append-only.

111
00:06:16,560 --> 00:06:18,340
That particular table was not append-only,
right?

112
00:06:18,340 --> 00:06:20,100
But it can happen with append-only.

113
00:06:20,460 --> 00:06:22,980
For example, we have audit log.

114
00:06:25,860 --> 00:06:29,380
Some actions are stored in some
append-only table, but we have

115
00:06:29,380 --> 00:06:32,360
a policy to store only 2 years
of data.

116
00:06:33,380 --> 00:06:36,520
Then I would prefer to drop partition
with all data, that's it.

117
00:06:36,660 --> 00:06:40,600
So cleaning up is a very important
topic for append-only tables,

118
00:06:40,600 --> 00:06:42,540
this is what I was trying to say.

119
00:06:43,040 --> 00:06:44,380
Michael: Yeah, I completely agree.

120
00:06:44,380 --> 00:06:47,140
I think there are other benefits
to partitioning with append-only,

121
00:06:47,700 --> 00:06:51,980
or append-mostly as well, due to,
like, if we do have the occasional

122
00:06:52,200 --> 00:06:54,860
update or delete by having

123
00:06:54,860 --> 00:06:57,180
Nikolay: partitions, yeah.

124
00:06:57,180 --> 00:07:00,460
Michael: Well, partitioning helps
with that as well, right? So let's

125
00:07:00,900 --> 00:07:06,040
zoom back out, maybe. We've
got inserts and SELECTs,

126
00:07:06,540 --> 00:07:10,160
So, if we're talking about a very

127
00:07:10,160 --> 00:07:13,800
very high volume, we might have
to worry about insert performance

128
00:07:13,800 --> 00:07:14,940
and SELECT performance.

129
00:07:15,780 --> 00:07:18,540
Nikolay: We can also have, sorry
for interrupting, we can also

130
00:07:18,540 --> 00:07:19,260
have COPY.

131
00:07:20,380 --> 00:07:21,180
Yeah, sure.

132
00:07:22,200 --> 00:07:23,160
In both ways.

133
00:07:23,800 --> 00:07:24,000
And

134
00:07:24,000 --> 00:07:25,900
Michael: I guess that's about to
come up if we're talking about

135
00:07:25,900 --> 00:07:26,400
optimizing.

136
00:07:26,660 --> 00:07:30,060
Nikolay: Also, COPY of the whole table,
you know, which reads

137
00:07:30,060 --> 00:07:30,560
everything.

138
00:07:31,860 --> 00:07:33,660
Yeah, but it's kind of SELECT.

139
00:07:34,600 --> 00:07:38,980
So yeah, inserts or SELECTs, no
UPDATEs, DELETEs and so on.

140
00:07:40,440 --> 00:07:41,580
So what use cases?

141
00:07:41,580 --> 00:07:43,400
You wanted to discuss use cases,
right?

142
00:07:44,540 --> 00:07:46,960
Michael: Or even, I was actually
thinking of diving straight

143
00:07:46,960 --> 00:07:53,300
to performance of, like, I think
there's a few things that we

144
00:07:53,300 --> 00:07:55,640
don't have to worry about, and
a few things that we can then

145
00:07:55,640 --> 00:07:56,360
optimize for.

146
00:07:56,360 --> 00:08:01,360
Like, if we're having to insert
at extremely high volumes, which

147
00:08:01,400 --> 00:08:04,540
sometimes these use cases do lend
themselves towards.

148
00:08:04,540 --> 00:08:08,160
You know, if we're, I think IoT
for example, Internet of Things,

149
00:08:08,160 --> 00:08:13,000
if sensors are sending information
and we're logging for each

150
00:08:13,000 --> 00:08:17,960
second amongst thousands or tens
of thousands of sensors that

151
00:08:17,960 --> 00:08:22,120
could be... that can end up being a lot of data. So inserting can

152
00:08:22,120 --> 00:08:25,440
be a bottleneck and you might make design decisions for those

153
00:08:25,440 --> 00:08:28,500
tables that you wouldn't make if you had a different type of

154
00:08:28,500 --> 00:08:31,880
table, a different type of data. So there's that side of things,

155
00:08:31,880 --> 00:08:35,380
but then there's also the read side of things, I think, you know.

156
00:08:35,380 --> 00:08:37,660
And I think those things maybe sometimes play off against each

157
00:08:37,660 --> 00:08:42,480
other. But the fact we've got append-only, we have some benefits

158
00:08:42,540 --> 00:08:46,010
to, like... index-only scans, for example, become even better. I know

159
00:08:46,010 --> 00:08:49,020
you often talk about always trying to get index-only scans,

160
00:08:49,020 --> 00:08:52,940
but in a table where the data is often changing, that can be

161
00:08:52,940 --> 00:08:54,000
a losing battle.

162
00:08:54,520 --> 00:08:57,085
It can be a battle that's not always worth fighting, or it's

163
00:08:57,085 --> 00:09:00,420
maybe not always worth including as many columns to the index,

164
00:09:00,420 --> 00:09:01,120
for example.

165
00:09:02,120 --> 00:09:05,560
There are different trade-offs for append-only versus...

166
00:09:06,500 --> 00:09:08,280
Nikolay: Let's unpack everything here.

167
00:09:08,740 --> 00:09:13,820
You mentioned so many things in just a minute, right?

168
00:09:13,840 --> 00:09:18,420
So first, let's talk about the performance of inserts.

169
00:09:19,660 --> 00:09:23,360
I would say the ideal situation is we don't have indexes and

170
00:09:23,360 --> 00:09:26,760
we don't have triggers, including foreign key triggers, because

171
00:09:26,760 --> 00:09:30,820
a foreign key in Postgres is internally implemented via a system

172
00:09:30,820 --> 00:09:31,320
trigger.

173
00:09:32,020 --> 00:09:35,460
This trigger is going to slow down inserts, especially if you

174
00:09:36,180 --> 00:09:38,040
need to insert a lot of rows.

175
00:09:38,880 --> 00:09:47,000
If you just have a few foreign keys, it can multiply the duration

176
00:09:47,080 --> 00:09:48,620
of this massive insert.

177
00:09:49,360 --> 00:09:57,380
So, ideally, we should get rid of foreign keys and keep as few

178
00:09:57,620 --> 00:10:00,400
indexes as possible for this particular case.

179
00:10:00,580 --> 00:10:04,540
I remember in some cases, I decided to go without primary keys,

180
00:10:04,700 --> 00:10:08,160
you know, breaking relational model and so on.

181
00:10:08,760 --> 00:10:11,960
There's no relational model in Postgres or in any relational database

182
00:10:11,960 --> 00:10:15,380
which implements the SQL data model, which has NULL, and it

183
00:10:15,380 --> 00:10:17,660
breaks the relational model completely anyway.

184
00:10:17,660 --> 00:10:19,580
But this is a different topic.

185
00:10:20,320 --> 00:10:21,060
Side note.

186
00:10:21,280 --> 00:10:25,360
Anyway, so it's not good to be without primary

187
00:10:25,360 --> 00:10:29,280
keys, but sometimes you think, oh, I just need to dump these

188
00:10:29,280 --> 00:10:31,580
to some table reliably.

189
00:10:31,980 --> 00:10:36,760
So we have ACID, so Postgres guarantees, it's stored, it's saved,

190
00:10:36,760 --> 00:10:38,680
it's replicated, it's backed up.

191
00:10:40,360 --> 00:10:44,320
But even 1 index, sometimes you think, oh, it slows me down.

192
00:10:46,100 --> 00:10:50,840
And I remember I decided to live without a primary key.

193
00:10:51,820 --> 00:10:55,600
It was a weird case, but it was some archive, maybe just for

194
00:10:55,600 --> 00:10:56,460
audit purposes.

195
00:10:57,340 --> 00:10:59,980
I decided to use BRIN at that time.

196
00:11:00,040 --> 00:11:03,480
BRIN is actually a good idea to consider if we have append-only

197
00:11:03,480 --> 00:11:07,180
because of the layout: physically, rows don't move.

198
00:11:07,940 --> 00:11:12,540
If we have a row, it's a tuple, it's saved in some block, it's

199
00:11:12,540 --> 00:11:13,400
there, right?

200
00:11:14,060 --> 00:11:17,120
So this is exactly when BRIN indexes work well.

201
00:11:17,120 --> 00:11:20,780
And we had an episode, 1 of our first episodes, I remember.

202
00:11:20,840 --> 00:11:21,520
It was.

203
00:11:21,980 --> 00:11:22,780
BRIN indexes.

204
00:11:22,900 --> 00:11:26,300
BRIN is a block range index, right?

205
00:11:27,340 --> 00:11:27,660
Yeah.

206
00:11:27,660 --> 00:11:29,240
So it's very lightweight.

207
00:11:30,020 --> 00:11:34,840
It speeds up select performance, not as good as other indexes,

208
00:11:34,920 --> 00:11:38,040
especially B-tree, but it's still good, right?
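The reason BRIN fits append-only data can be illustrated with a toy Python model (this is a sketch of the idea, not Postgres internals): the index keeps only a min/max summary per range of blocks, which stays accurate because appended rows never move, and a query can skip every range whose summary cannot match.

```python
# Toy model of a BRIN-style index: one (min, max) summary per
# range of blocks. Works well when values correlate with physical
# order, as with append-only timestamped rows.
ROWS_PER_RANGE = 4
data = list(range(100, 116))  # e.g. monotonically growing timestamps

# Build summaries: tiny compared to a per-row B-tree.
summaries = [
    (min(chunk), max(chunk))
    for chunk in (data[i:i + ROWS_PER_RANGE]
                  for i in range(0, len(data), ROWS_PER_RANGE))
]

def candidate_ranges(lo, hi):
    # Return block ranges that might contain values in [lo, hi];
    # all other ranges are skipped without reading their rows.
    return [i for i, (mn, mx) in enumerate(summaries)
            if mx >= lo and mn <= hi]

print(candidate_ranges(105, 106))  # [1]: only 1 of 4 ranges is scanned
```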

209
00:11:38,540 --> 00:11:42,420
Or we might consider hash indexes also, right, because they might

210
00:11:42,780 --> 00:11:46,384
be more lightweight than B-tree sometimes.

211
00:11:46,384 --> 00:11:47,032
They're smaller,

212
00:11:47,032 --> 00:11:48,180
Michael: for example, right?

213
00:11:48,280 --> 00:11:54,300
Well, I think, Yeah, but when it comes to append-only, I think

214
00:11:54,340 --> 00:11:55,760
you make a really good point.

215
00:11:55,760 --> 00:11:59,600
Each index we have slows down the inserts.

216
00:12:00,140 --> 00:12:04,900
So the fewer the better, possibly none. If, let's say,

217
00:12:04,900 --> 00:12:07,540
it's a table we're never reading from or it's an audit log that

218
00:12:07,540 --> 00:12:11,380
we only ever have to read from extremely rarely, we might consider

219
00:12:11,380 --> 00:12:16,320
1 or even 0 indexes on that. Maybe not an audit log, because maybe

220
00:12:16,320 --> 00:12:22,240
that's not 1 you would actually be writing an insane volume to

221
00:12:22,440 --> 00:12:25,600
But I've read, at Timescale, sometimes they have to worry about

222
00:12:25,600 --> 00:12:28,440
this kind of thing, as they've designed for

223
00:12:28,440 --> 00:12:32,200
these kinds of time-series workloads. They've written a good blog

224
00:12:32,200 --> 00:12:37,540
post on optimizing inserts and they list all the same things

225
00:12:37,540 --> 00:12:38,740
as you and go further.

226
00:12:38,740 --> 00:12:42,240
So they, as well as foreign key constraints, basically other

227
00:12:42,240 --> 00:12:44,160
constraints can add overhead as well.

228
00:12:44,160 --> 00:12:45,040
So for example...

229
00:12:45,340 --> 00:12:46,020
Nikolay: Of course. Checks

230
00:12:46,020 --> 00:12:49,400
Michael: A unique constraint... yeah, check constraints, but unique constraints...

231
00:12:50,280 --> 00:12:52,700
Nikolay: Yeah, an index plus an additional check, for sure.

232
00:12:53,480 --> 00:12:58,480
Michael: Yeah. So, basically, deciding for each constraint

233
00:12:58,480 --> 00:13:02,600
if you really need it, or what value it's adding by having it, and

234
00:13:02,600 --> 00:13:02,940
make

235
00:13:02,940 --> 00:13:06,820
Nikolay: Yes. That being said, I must say, in most cases, I

236
00:13:06,820 --> 00:13:08,700
prefer having primary key.

237
00:13:09,720 --> 00:13:14,880
Because it's like the center of consistency, of data consistency,

238
00:13:14,880 --> 00:13:15,060
right?

239
00:13:15,060 --> 00:13:17,580
So it's good to have.

240
00:13:17,620 --> 00:13:18,620
But it depends.

241
00:13:19,120 --> 00:13:24,660
It's good that you mentioned Timescale, but I think we will return

242
00:13:24,660 --> 00:13:25,340
to Timescale.

243
00:13:25,440 --> 00:13:29,220
My question to you is a tricky question, but I think you already

244
00:13:29,220 --> 00:13:33,960
know, and I must admit, when 2 years ago we started the podcast,

245
00:13:33,960 --> 00:13:35,120
I didn't realize it fully.

246
00:13:35,120 --> 00:13:36,580
Now I realize it much better.

247
00:13:36,900 --> 00:13:38,340
So we have an index.

248
00:13:38,760 --> 00:13:41,300
What operations does it slow down?

249
00:13:42,180 --> 00:13:43,940
You said it slows down inserts.

250
00:13:44,080 --> 00:13:45,120
This is for sure.

251
00:13:45,860 --> 00:13:47,580
Does it slow down updates?

252
00:13:47,720 --> 00:13:48,560
Well, yes.

253
00:13:50,140 --> 00:13:56,200
And there's a mechanism, HOT updates,
which deals with it in a

254
00:13:56,200 --> 00:13:57,540
limited number of cases.

255
00:13:58,220 --> 00:13:59,620
Does it slow down delete?

256
00:13:59,820 --> 00:14:03,260
Well, maybe no, because during
delete, index is not updated.

257
00:14:04,700 --> 00:14:08,100
Postgres only updates xmax, as
we discussed a couple of times.

258
00:14:08,300 --> 00:14:09,640
Does it slow down selects?

259
00:14:11,120 --> 00:14:11,980
What do you think?

260
00:14:13,520 --> 00:14:16,580
Michael: So we've talked about
how having a lot of them can.

261
00:14:17,080 --> 00:14:17,580
Nikolay: Yeah.

262
00:14:17,640 --> 00:14:17,980
Yeah.

263
00:14:17,980 --> 00:14:21,660
It slows down the selects, especially
if we have a lot of them

264
00:14:21,660 --> 00:14:25,280
and high frequency of selects,
and this is about planning time

265
00:14:25,680 --> 00:14:31,020
and the lock manager's locks: during
planning, all indexes are locked.

266
00:14:31,580 --> 00:14:35,180
It's some overhead in very heavily
loaded systems to keep in

267
00:14:35,180 --> 00:14:35,680
mind.

268
00:14:36,180 --> 00:14:42,320
But in general, I would minimize
the number of indexes and try

269
00:14:43,480 --> 00:14:45,260
not to use foreign keys.

270
00:14:45,700 --> 00:14:51,360
Foreign keys, in many cases, we
can imagine they exist, have

271
00:14:51,360 --> 00:14:58,240
maybe routine checks that referential
integrity is fine, but

272
00:14:58,420 --> 00:15:02,240
drop them intentionally because
in this case, we want, for example,

273
00:15:02,240 --> 00:15:03,540
good insert performance.

274
00:15:04,000 --> 00:15:08,900
And as usual, I would like to remind
that when I say all this,

275
00:15:09,280 --> 00:15:14,060
in many cases when I deal with
new system, I have some of these

276
00:15:14,060 --> 00:15:18,900
principles, but I never trust myself,
I always check. Again, checking

277
00:15:18,900 --> 00:15:23,480
should be... like, sometimes
you spend time there, right,

278
00:15:23,480 --> 00:15:26,460
but it's worth doing experiments.

279
00:15:27,380 --> 00:15:30,660
Michael: Yeah, well, and I would
say we're talking about extremely

280
00:15:30,740 --> 00:15:35,880
high volumes here. If you can,
I would much rather normally

281
00:15:35,920 --> 00:15:40,320
have a primary key, have some foreign
keys if they make sense, and

282
00:15:40,320 --> 00:15:46,320
have a unique key if I need it,
and then test: like, can I get

283
00:15:46,320 --> 00:15:47,720
better disks if I need to?

284
00:15:47,720 --> 00:15:51,960
Are there other ways I can improve,
like I can cope with higher

285
00:15:51,960 --> 00:15:53,860
write performance instead of...

286
00:15:54,340 --> 00:15:57,840
Nikolay: Perform checkpoint tuning
if you expect huge volumes

287
00:15:57,840 --> 00:15:59,560
of data to store.

288
00:16:00,060 --> 00:16:02,740
Michael: Yeah, so maybe pay for
it in other ways.

289
00:16:02,780 --> 00:16:03,480
It's only at

290
00:16:03,480 --> 00:16:04,240
Nikolay: the apps.

291
00:16:04,740 --> 00:16:06,060
Bigger buffer pool.

292
00:16:06,340 --> 00:16:06,840
Exactly.

293
00:16:06,900 --> 00:16:09,800
Make sure backends don't write
all the time.

294
00:16:11,240 --> 00:16:12,180
It depends, right?

295
00:16:12,180 --> 00:16:16,400
So checkpointer is not crazy, it's
not too frequent, and so on.

296
00:16:16,800 --> 00:16:17,480
Yeah, yeah.

297
00:16:17,680 --> 00:16:19,540
And there's a lot of stuff here.

298
00:16:19,960 --> 00:16:25,760
And if we think about SELECTs now,
what's the number 1 problem

299
00:16:25,760 --> 00:16:26,260
usually?

300
00:16:27,040 --> 00:16:34,060
I think... I'm still
wondering how come we lived

301
00:16:34,060 --> 00:16:38,560
so many years until, I think, Postgres
12 or so, when

302
00:16:38,560 --> 00:16:41,620
autovacuum_vacuum_insert_scale_factor was
added.

303
00:16:42,640 --> 00:16:44,840
I think Darafei initiated it.

304
00:16:44,840 --> 00:16:45,340
Darafei.

305
00:16:46,320 --> 00:16:49,020
Michael: Version 13, I looked it
up, yeah.

306
00:16:49,460 --> 00:16:54,560
Nikolay: Okay, it's very recent
compared to, like, my experience

307
00:16:54,600 --> 00:16:55,240
with Postgres.

308
00:16:55,240 --> 00:16:55,940
So strange.

309
00:16:56,680 --> 00:16:57,680
What does it add?

310
00:16:57,980 --> 00:17:03,960
Originally, Postgres vacuum, which
also maintains Postgres statistics,

311
00:17:04,640 --> 00:17:08,560
which is important for good query
performance, including selects,

312
00:17:08,560 --> 00:17:09,060
right?

313
00:17:09,400 --> 00:17:15,180
Originally, it was triggered only
after, say, like 10% by default,

314
00:17:15,180 --> 00:17:18,120
10 or 20% of rows are changed.

315
00:17:18,940 --> 00:17:21,900
There is some complex formula,
not very complex, but some formula.

316
00:17:21,900 --> 00:17:26,060
But roughly after 10 or 20% of
rows changed, where changed means deleted

317
00:17:26,060 --> 00:17:26,740
or updated.

318
00:17:27,780 --> 00:17:30,400
It triggers, but not after inserts.

319
00:17:31,600 --> 00:17:35,640
And only in Postgres 13, a specific
parameter was added.

320
00:17:35,980 --> 00:17:42,420
I think by default it's also 20%
or 10, which tells autovacuum

321
00:17:42,440 --> 00:17:48,960
to run and process a table after
10 or 20% of rows were added.
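The trigger formulas being described can be sketched as follows, assuming the documented defaults (autovacuum_vacuum_threshold = 50, autovacuum_vacuum_scale_factor = 0.2; autovacuum_vacuum_insert_threshold = 1000, autovacuum_vacuum_insert_scale_factor = 0.2, the latter pair added in Postgres 13):

```python
def vacuum_trigger(reltuples, threshold=50, scale_factor=0.2):
    # Classic trigger: dead tuples (from UPDATE/DELETE) needed
    # before autovacuum processes the table.
    return threshold + scale_factor * reltuples

def vacuum_insert_trigger(reltuples, threshold=1000, scale_factor=0.2):
    # Postgres 13+ trigger: inserted tuples needed before
    # autovacuum runs on an append-only table.
    return threshold + scale_factor * reltuples

rows = 100_000_000  # a 100M-row append-only table
print(vacuum_trigger(rows))         # 20000050.0 dead tuples
print(vacuum_insert_trigger(rows))  # 20001000.0 inserts
# Lowering the scale factor to 1% makes vacuum run far more often:
print(vacuum_insert_trigger(rows, scale_factor=0.01))  # 1001000.0
```

On a 100M-row table, the default 20% means roughly 20M new rows accumulate before autovacuum runs, which is why tuning the scale factors down to 1% or lower comes up later in the conversation.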

322
00:17:49,760 --> 00:17:53,480
Michael: Yeah, and I looked this up,
and it's because

323
00:17:53,480 --> 00:17:56,260
there are 3 jobs, right, of autovacuum.

324
00:17:56,600 --> 00:18:01,460
There's the removing, well, there's
roughly removing dead tuples.

325
00:18:01,500 --> 00:18:02,640
Nikolay: 4 jobs actually.

326
00:18:03,180 --> 00:18:06,320
Michael: Freezing and analyze statistics.

327
00:18:06,900 --> 00:18:09,740
Nikolay: Removing the tuples, maintaining
visibility maps.

328
00:18:09,860 --> 00:18:11,840
Michael: Maintaining visibility
maps, of course, yeah.

329
00:18:11,840 --> 00:18:12,600
4 goals.

330
00:18:12,900 --> 00:18:16,600
Nikolay: Maybe actually more, but
these 4 come to mind quickly.

331
00:18:17,100 --> 00:18:21,100
Michael: Yeah, and if you're only
doing inserts, you don't need

332
00:18:21,540 --> 00:18:25,080
the removing of dead tuples, yeah.
But that isn't the only thing

333
00:18:25,080 --> 00:18:25,760
vacuum's doing.

334
00:18:25,760 --> 00:18:30,080
So this then enables, though, the
visibility map and the freezing

335
00:18:30,140 --> 00:18:30,840
to happen.

336
00:18:31,520 --> 00:18:34,620
Nikolay: Well, freezing will happen
regardless of inserts.

337
00:18:34,740 --> 00:18:35,640
It will happen...

338
00:18:35,980 --> 00:18:38,940
Well, we can insert into a different
table.

339
00:18:38,940 --> 00:18:40,280
Michael: Yeah okay yeah good point.

340
00:18:40,280 --> 00:18:45,620
Nikolay: And autovacuum will
see that xmin, or xmax, or both,

341
00:18:45,620 --> 00:18:50,920
xmin, right, is very far in the
past, we have risk of wraparound,

342
00:18:51,020 --> 00:18:53,040
so it's time to freeze this table.

343
00:18:53,680 --> 00:18:58,520
We can have 0 operations in terms
of, like, table can be left

344
00:18:58,620 --> 00:19:02,400
unchanged for a long time, but at some
point, we can decide, okay, it's

345
00:19:02,400 --> 00:19:03,220
time to freeze.

346
00:19:03,580 --> 00:19:06,220
Michael: But you're right, visibility
map would never be...

347
00:19:07,060 --> 00:19:08,400
Nikolay: Visibility map is huge.

348
00:19:08,440 --> 00:19:11,540
You mentioned index-only scans,
the performance of aggregates,

349
00:19:11,660 --> 00:19:12,260
counts, right?

350
00:19:12,260 --> 00:19:15,480
So we do want to keep it up to
date.

351
00:19:15,480 --> 00:19:18,260
I think default is not enough as
usual with autovacuum.

352
00:19:18,680 --> 00:19:23,480
We must tune it and even cloud
providers, their defaults are

353
00:19:23,480 --> 00:19:24,220
not enough.

354
00:19:24,640 --> 00:19:31,200
We must tune it and go down to
1% or smaller and make sure autovacuum

355
00:19:31,620 --> 00:19:35,340
maintains statistics and visibility
maps more often so performance

356
00:19:35,380 --> 00:19:38,760
of SELECTs, including index-only
scans, is good, right?

357
00:19:39,580 --> 00:19:40,080
Yes,

358
00:19:40,360 --> 00:19:43,380
Michael: Another reason to partition
as well, so you can keep

359
00:19:43,380 --> 00:19:47,500
those, yeah. Yeah, that makes sense.
I was gonna say, it

360
00:19:47,500 --> 00:19:52,880
is 20%. So it is quite high
still, as you say. Would you ever

361
00:19:52,880 --> 00:19:53,260
switch

362
00:19:53,260 --> 00:19:54,060
Nikolay: to same?

363
00:19:54,280 --> 00:19:59,640
I cannot imagine any OLTP system
any website any mobile app which

364
00:19:59,640 --> 00:20:06,400
would be okay with Postgres
autovacuum defaults, like, ah...

365
00:20:07,740 --> 00:20:10,780
That's it. Like, I don't know why
they are like this.

366
00:20:11,020 --> 00:20:12,420
They are like this for what?

367
00:20:13,260 --> 00:20:18,700
We have so many beautiful websites
working, huge systems working

368
00:20:18,700 --> 00:20:19,400
with Postgres.

369
00:20:19,740 --> 00:20:24,360
It's so cool to see that
Postgres handles such big

370
00:20:24,360 --> 00:20:26,460
workloads, but these defaults

371
00:20:27,540 --> 00:20:31,220
Michael: Well, and the strange
thing is, this 1, for example: if

372
00:20:31,220 --> 00:20:36,180
we did reduce it to 1%, it would
add overhead on small systems.

373
00:20:36,180 --> 00:20:39,640
Sure, if you've only got 100 rows,
it runs vacuum every row for

374
00:20:39,640 --> 00:20:41,260
a while, you know.

375
00:20:41,540 --> 00:20:44,960
But who's running a small system
that can't handle a vacuum of

376
00:20:44,960 --> 00:20:46,940
a 100-row table every row?

377
00:20:46,940 --> 00:20:47,780
Like, that's fine.

378
00:20:47,780 --> 00:20:48,360
And also,

379
00:20:49,300 --> 00:20:53,540
Nikolay: with append-only specifically,
when some page is already

380
00:20:53,800 --> 00:20:58,160
processed, it's marked all visible,
all frozen, or whatever.

381
00:20:59,120 --> 00:21:00,660
Vacuum just skips it.

382
00:21:01,300 --> 00:21:02,320
Michael: Yeah, so it wouldn't even
be

383
00:21:02,320 --> 00:21:02,780
Nikolay: much overhead.

384
00:21:02,780 --> 00:21:07,860
There were many optimizations in
this area, so as not to do work

385
00:21:07,860 --> 00:21:09,020
which can be skipped.

386
00:21:09,580 --> 00:21:13,300
So it's doing a good job skipping,
and it has been for many years already.

387
00:21:13,900 --> 00:21:21,760
So I think like I never saw any
system and I saw maybe already

388
00:21:21,760 --> 00:21:27,460
hundreds of them, different sizes,
websites, like OLTP, right?

389
00:21:27,840 --> 00:21:31,120
I didn't see any time we decided,
oh, you know what, we need

390
00:21:31,120 --> 00:21:32,460
to increase scale factor.

391
00:21:33,700 --> 00:21:35,860
I don't remember this at all.

392
00:21:36,220 --> 00:21:40,680
We can throttle it if we like,
we can balance work among many

393
00:21:40,680 --> 00:21:47,340
workers and so on, but deciding
let's make work of autovacuum

394
00:21:47,360 --> 00:21:51,220
less frequent, 0 cases I had.

395
00:21:51,380 --> 00:21:54,720
Maybe my experience is not enough,
maybe one day I will see such

396
00:21:54,720 --> 00:21:55,380
a system.
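The kind of tuning being discussed can be sketched in a few statements. This is only a sketch: the `events` table name is hypothetical and the exact thresholds are workload-dependent.

```sql
-- Defaults: autovacuum triggers only after ~20% of a table's rows
-- are dead (autovacuum_vacuum_scale_factor = 0.2), which is very
-- high for large OLTP tables.

-- Tighten globally; picked up on config reload, no restart needed:
ALTER SYSTEM SET autovacuum_vacuum_scale_factor = 0.01;
ALTER SYSTEM SET autovacuum_analyze_scale_factor = 0.01;
SELECT pg_reload_conf();

-- Or per table; for append-only workloads the insert-driven
-- threshold (Postgres 13+) is the one that matters:
ALTER TABLE events SET (
    autovacuum_vacuum_scale_factor = 0.01,
    autovacuum_vacuum_insert_scale_factor = 0.05
);
```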

397
00:21:56,120 --> 00:21:57,600
Michael: I've not seen one either.

398
00:21:58,020 --> 00:22:01,320
Nikolay: Enough rage about defaults,
my usual fun I have with

399
00:22:01,320 --> 00:22:01,820
Postgres.

400
00:22:02,380 --> 00:22:04,340
Let's talk about partitioning,
maybe, right?

401
00:22:04,340 --> 00:22:05,740
Why do we want it?

402
00:22:06,220 --> 00:22:11,820
I see several ideas here, and TimescaleDB
is definitely built for

403
00:22:11,820 --> 00:22:15,140
append-only tables, so it's a good
thing to have in many senses.

404
00:22:15,800 --> 00:22:19,700
But unfortunately, it's not available
in managed offerings except

405
00:22:20,340 --> 00:22:22,520
their own Timescale cloud, right?

406
00:22:22,780 --> 00:22:26,300
And some others, but those some
others usually choose Apache

407
00:22:26,680 --> 00:22:29,360
2.0 version which doesn't have
compression.

408
00:22:30,200 --> 00:22:30,700
Right?

409
00:22:30,900 --> 00:22:31,220
Michael: So...

410
00:22:31,220 --> 00:22:33,580
Doesn't have a lot of their good
features, yeah.

411
00:22:33,940 --> 00:22:37,100
Nikolay: Yeah, so partitioning
is good.

412
00:22:37,740 --> 00:22:42,340
Again, there's some rule, empirical
rule, we say, like many people

413
00:22:42,340 --> 00:22:43,580
say, not only I.

414
00:22:44,240 --> 00:22:48,480
Let's consider partitioning if
table exceeds 100 gigabytes or

415
00:22:48,600 --> 00:22:50,580
has chances to exceed 100 GB.

416
00:22:51,500 --> 00:22:52,780
Partitioning adds complexity.

417
00:22:53,440 --> 00:22:59,920
It's not as well automated as in
Oracle, but it's a very important

418
00:22:59,920 --> 00:23:00,840
tool to consider.
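As a sketch of what this looks like with native declarative partitioning (table and partition names are hypothetical):

```sql
-- Time-series events table, range-partitioned by month.
CREATE TABLE events (
    id         bigserial,
    created_at timestamptz NOT NULL,
    payload    jsonb
) PARTITION BY RANGE (created_at);

-- "Hot" partition receiving inserts: keep indexes here minimal.
CREATE TABLE events_2024_12 PARTITION OF events
    FOR VALUES FROM ('2024-12-01') TO ('2025-01-01');

-- Archive partition: can afford extra indexes for reporting.
CREATE TABLE events_2024_11 PARTITION OF events
    FOR VALUES FROM ('2024-11-01') TO ('2024-12-01');
CREATE INDEX ON events_2024_11 USING gin (payload);
```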

419
00:23:01,860 --> 00:23:02,900
Many factors here.

420
00:23:02,900 --> 00:23:07,940
First, for example, you might say,
okay, I have a partition where

421
00:23:07,940 --> 00:23:12,220
I insert and then many partitions
where it's like my archive.

422
00:23:13,860 --> 00:23:18,740
And as we decided, we want a very
low number of indexes in the

423
00:23:20,060 --> 00:23:24,140
main partition, which is receiving
inserts, and constraints like

424
00:23:24,140 --> 00:23:25,580
foreign keys and so on.

425
00:23:25,580 --> 00:23:30,380
But there is no such problem in
all archive partitions, right?

426
00:23:30,380 --> 00:23:35,140
We might have more indexes there
and constraints and so on.

427
00:23:35,500 --> 00:23:36,460
This is 1 thing.

428
00:23:36,460 --> 00:23:37,580
The second thing is autovacuum.

429
00:23:38,100 --> 00:23:42,500
If occasional deletes or updates
are happening, the block which

430
00:23:42,500 --> 00:23:48,900
contains the row data basically
falls out of the visibility map.

431
00:23:49,440 --> 00:23:53,740
It's marked not all visible anymore
and not all frozen anymore.

432
00:23:54,140 --> 00:23:55,960
So a vacuum needs to process it.

433
00:23:55,960 --> 00:24:00,480
And it's good to have data locality,
where archive data is in some

434
00:24:00,480 --> 00:24:03,480
partitions and fresh data is in
particular partitions.

435
00:24:03,940 --> 00:24:09,900
So autovacuum is focusing on fresh
data in fresh partitions.

436
00:24:10,120 --> 00:24:14,020
It reduces the number of blocks
it needs to deal with, right?

437
00:24:14,160 --> 00:24:19,200
Because old data is rarely touched,
so we...

438
00:24:19,740 --> 00:24:22,080
autovacuum visits it very rarely,
right?

439
00:24:22,900 --> 00:24:24,020
This is another reason.

440
00:24:24,520 --> 00:24:26,780
Cleanup is another reason as well,
right?

441
00:24:27,180 --> 00:24:28,060
Michael: I think that's...

442
00:24:28,140 --> 00:24:30,320
I think cleanup's the biggest reason.

443
00:24:30,540 --> 00:24:31,160
I think...

444
00:24:32,780 --> 00:24:33,640
I think maintenance...

445
00:24:34,480 --> 00:24:36,880
Partitioning helps so much with
maintenance.

446
00:24:37,320 --> 00:24:41,580
It does have other benefits for
sure but it helps so much of

447
00:24:41,580 --> 00:24:44,060
maintenance that I can't help but
feel like that's the biggest

448
00:24:44,060 --> 00:24:48,260
1 and I actually I've started to
say I think I must have stolen

449
00:24:48,260 --> 00:24:51,800
this from somebody else because
it's too clever for me but partitioning

450
00:24:51,960 --> 00:24:55,640
based on how you want to eventually
delete data makes sense so

451
00:24:55,640 --> 00:24:58,840
if you want to eventually delete
old data partitioning based

452
00:24:58,840 --> 00:25:03,160
on time makes sense. But for example,
if you're, say, a B2B

453
00:25:03,160 --> 00:25:06,600
SaaS and you eventually want to
delete data based on a customer

454
00:25:07,660 --> 00:25:10,760
quitting the service you probably
want to partition based on

455
00:25:10,760 --> 00:25:13,100
Nikolay: or both, a second level
of partitioning.

456
00:25:13,100 --> 00:25:18,780
Michael: yeah, exactly. But that
being like the guiding principle

457
00:25:18,900 --> 00:25:22,240
for how you partition because it
makes that deletion or dropping

458
00:25:22,360 --> 00:25:23,140
so easy.
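A sketch of what that retention story looks like in practice, with a hypothetical partition name: dropping a partition is near-instant metadata work, while an equivalent DELETE would touch every row.

```sql
-- Detach first (CONCURRENTLY needs Postgres 14+ and cannot run
-- inside a transaction block), then drop; no dead tuples, no bloat.
ALTER TABLE events DETACH PARTITION events_2023_01 CONCURRENTLY;
DROP TABLE events_2023_01;
```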

459
00:25:23,820 --> 00:25:26,020
Nikolay: What will you do with
data?

460
00:25:26,460 --> 00:25:31,080
And as I said I participated in
projects where Delete was a big

461
00:25:31,080 --> 00:25:34,720
issue and of course with partitioning
it's very different and

462
00:25:34,720 --> 00:25:35,420
it's good.

463
00:25:35,740 --> 00:25:37,580
Deletes can be a problem.

464
00:25:37,920 --> 00:25:42,720
Postgres deletes, like if you have
a terabyte, 10 terabyte table

465
00:25:42,720 --> 00:25:46,260
and you need to delete 20% of it,
it's a big headache because

466
00:25:46,920 --> 00:25:51,540
you need to make sure vacuum will
be okay.

467
00:25:51,820 --> 00:25:55,140
It will, autovacuum will catch
up all the time.

468
00:25:55,160 --> 00:25:59,620
You need, again, to pay attention
to the checkpointer, and you

469
00:25:59,620 --> 00:26:03,980
need to find a way to delete
so that deletes don't degrade.

470
00:26:04,000 --> 00:26:05,140
This was my problem.

471
00:26:05,140 --> 00:26:10,680
So I created beautiful queries,
but they degraded over time because

472
00:26:10,680 --> 00:26:14,200
of dead tuple accumulation and
bloat accumulation as well.

473
00:26:14,280 --> 00:26:17,020
So I needed to adjust them and
so on.

474
00:26:17,020 --> 00:26:20,280
So there are many problems with
delete and it takes time to delete

475
00:26:21,100 --> 00:26:22,320
millions of rows.

476
00:26:23,940 --> 00:26:27,320
If you rush it, you can put the
system down or have degradation

477
00:26:27,440 --> 00:26:28,120
of performance.
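One common shape for such batched deletes, sketched here with hypothetical names and thresholds, is to delete in small chunks and commit between them so autovacuum and the checkpointer can keep up:

```sql
-- Delete one batch of old rows; the application (or a cron job)
-- repeats this until it reports 0 rows, sleeping between batches
-- to throttle WAL generation.
WITH victims AS (
    SELECT ctid
    FROM events
    WHERE created_at < now() - interval '2 years'
    LIMIT 5000
)
DELETE FROM events e
USING victims v
WHERE e.ctid = v.ctid;
```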

478
00:26:29,180 --> 00:26:33,280
Michael: Well yeah, and it can
really affect even your SELECT

479
00:26:33,280 --> 00:26:33,780
performance.

480
00:26:34,160 --> 00:26:39,020
So you mentioned BRIN; it's probably
the one where it gets, it used

481
00:26:39,020 --> 00:26:42,640
to at least get affected the most
with the default way of creating

482
00:26:42,640 --> 00:26:43,580
a BRIN index.

483
00:26:44,060 --> 00:26:48,360
If you have a row inserted way
back in an old, if you don't have

484
00:26:48,360 --> 00:26:52,580
partitioning, if it goes miles
away and you get some real scattered

485
00:26:52,960 --> 00:26:57,540
data, BRIN performance can end
up effectively looking like sequential

486
00:26:57,640 --> 00:26:58,140
scans.

487
00:26:59,440 --> 00:27:02,720
Nikolay: And all indexes degrade,
B-tree degrades very quickly

488
00:27:02,720 --> 00:27:07,400
if you perform deletes and updates
and you need to rebuild it.

489
00:27:07,720 --> 00:27:11,000
And rebuilding is better with partitioning
because the smaller

490
00:27:11,000 --> 00:27:15,040
partitions are, the faster rebuilding
is, and the xmin horizon is

491
00:27:15,040 --> 00:27:17,260
not held for long, right?

492
00:27:17,260 --> 00:27:22,080
So autovacuum is not affected
for the whole database, right?

493
00:27:22,080 --> 00:27:23,460
Michael: yeah so yeah

494
00:27:24,280 --> 00:27:27,680
Nikolay: building and rebuilding
indexes, vacuum itself, maintenance

495
00:27:27,740 --> 00:27:31,340
tasks are good if you have smaller
physical tables, so partitioning

496
00:27:31,400 --> 00:27:32,120
is great

497
00:27:32,860 --> 00:27:36,500
Michael: right yes on the BRIN
thing I'll link up the old episode

498
00:27:36,500 --> 00:27:39,640
we did, but the minmax_multi opclass I
think makes a big difference,

499
00:27:40,160 --> 00:27:43,180
especially if you don't have to,
like, well, it handles loads

500
00:27:43,180 --> 00:27:45,060
of outliers, so I do think that's
easier.
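A sketch of the opclass being referred to; names are hypothetical, and minmax_multi opclasses landed in Postgres 14:

```sql
-- Plain BRIN keeps one min/max summary per block range, which a few
-- outlier rows can make uselessly wide; minmax_multi keeps several
-- intervals per block range instead.
CREATE INDEX events_created_at_brin ON events
    USING brin (created_at timestamptz_minmax_multi_ops)
    WITH (pages_per_range = 32);
```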

501
00:27:45,060 --> 00:27:49,000
And if you are able to keep on
top of autovacuum, I guess the

502
00:27:49,000 --> 00:27:51,360
B-tree stuff doesn't degrade that
quickly.

503
00:27:51,660 --> 00:27:54,740
So I feel like these things aren't
as big a problem anymore.

504
00:27:55,240 --> 00:27:58,820
But yeah, often in these cases,
if you're dealing with high volume,

505
00:27:59,480 --> 00:28:03,540
like many, many thousands of queries
per second, like just extreme

506
00:28:03,540 --> 00:28:06,740
volume, anything you can do to
help fight on the performance

507
00:28:06,740 --> 00:28:07,780
front will be helpful.

508
00:28:08,400 --> 00:28:12,380
Nikolay: Yeah, and as usual when
we touch partitioning, the state

509
00:28:12,380 --> 00:28:17,580
of caches and the buffer pool matters. For
example, if you have archived

510
00:28:17,580 --> 00:28:20,740
data which is touched rarely, those
blocks are evicted from the

511
00:28:20,740 --> 00:28:24,840
buffer pool, and cache efficiency
might grow, hit rate might

512
00:28:24,840 --> 00:28:25,580
be better.

513
00:28:25,920 --> 00:28:29,020
But yeah, I agree with you.

514
00:28:29,600 --> 00:28:31,820
So partitioning is good in many
senses.

515
00:28:32,320 --> 00:28:36,220
It comes at the price of overhead
and maintenance as well, but

516
00:28:36,220 --> 00:28:37,760
it's worth having.

517
00:28:37,800 --> 00:28:42,080
But imagine, like all this said,
we moved slightly from append-only

518
00:28:42,260 --> 00:28:43,940
to append-mostly, right?

519
00:28:43,940 --> 00:28:45,460
But let's move back to append-only.

520
00:28:45,560 --> 00:28:49,460
Imagine we have many partitions
where data is not changed.

521
00:28:49,900 --> 00:28:50,400
Archive.

522
00:28:50,980 --> 00:28:51,920
Indexes created.

523
00:28:52,080 --> 00:28:53,440
All frozen, all visible.

524
00:28:53,600 --> 00:28:57,440
It's a beautiful state of data,
right?

525
00:28:57,440 --> 00:29:00,000
So all index-only scans are working
well.

526
00:29:01,840 --> 00:29:02,520
And that's it.

527
00:29:02,520 --> 00:29:03,380
Maintenance not needed.

528
00:29:03,380 --> 00:29:05,780
Autovacuum not needed there, and
so on.
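That "beautiful state" can be checked directly with the pg_visibility extension (the partition name is hypothetical):

```sql
CREATE EXTENSION IF NOT EXISTS pg_visibility;
-- Returns counts of all-visible and all-frozen pages; for a settled
-- archive partition both should be close to the total page count.
SELECT * FROM pg_visibility_map_summary('events_2023_01');
```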

529
00:29:06,100 --> 00:29:12,280
However, what if we have 100 terabytes
of data, and this is like

530
00:29:12,280 --> 00:29:14,840
heavily loaded cluster, we have
many replicas.

531
00:29:15,360 --> 00:29:17,140
The data is not changed, which is good.

532
00:29:17,400 --> 00:29:21,400
It's evicted from buffer pool,
but we still need to keep it on

533
00:29:21,400 --> 00:29:26,940
the main storage, right? At some
point we think, oh, like, we

534
00:29:26,940 --> 00:29:31,560
pay a big price because this data
is replicated it increases

535
00:29:32,280 --> 00:29:37,160
the volume of backups, full backups
if we consider, right?

536
00:29:37,360 --> 00:29:41,520
So this legacy data, it's
a lot.

537
00:29:41,680 --> 00:29:46,120
And if we have, for example, 5
replicas, 1 primary, we need 6

538
00:29:46,120 --> 00:29:49,160
times to store the same data and
nobody is using it.

539
00:29:49,160 --> 00:29:51,360
Like people read it occasionally.

540
00:29:52,800 --> 00:29:55,400
At some point you think it's not
efficient.

541
00:29:55,840 --> 00:29:59,220
And you think I would rather store
it somewhere else, not on

542
00:29:59,220 --> 00:30:03,840
the main disks, not on SSD, fast
SSDs or I don't know, NVMes

543
00:30:04,080 --> 00:30:09,300
I have, or cloud storage, which
is also expensive, right?

544
00:30:09,720 --> 00:30:12,720
So this leads to 2 ideas.

545
00:30:12,720 --> 00:30:15,980
First idea is it would be good
to compress it.

546
00:30:17,380 --> 00:30:20,860
Again, TimescaleDB, full version
of TimescaleDB is doing a great

547
00:30:20,860 --> 00:30:24,180
job, and their blog posts about
compression are great.

548
00:30:24,720 --> 00:30:28,800
I like especially 1, I remember,
first big 1 which explained

549
00:30:28,860 --> 00:30:35,380
algorithms, and it's basically a
kind of columnar compression, although

550
00:30:35,380 --> 00:30:37,380
we still have row storage, it's
great.

551
00:30:37,800 --> 00:30:41,700
And also, I think, second topic
here, which opens up naturally,

552
00:30:42,260 --> 00:30:47,080
is what I know Aurora now offers
it, right, and Neon and Timescale

553
00:30:47,080 --> 00:30:49,320
as well, in cloud only.

554
00:30:49,600 --> 00:30:54,800
Bottomless approach, where old
partitions are offloaded to S3

555
00:30:54,800 --> 00:30:58,140
or object storage, GCS on Google
Cloud, or blob storage on Azure,

556
00:30:58,140 --> 00:30:59,740
or how it's called, I don't remember.

557
00:30:59,820 --> 00:31:01,820
And now even Hetzner has it.

558
00:31:02,080 --> 00:31:05,340
They just recently released, which
is big news.

559
00:31:05,540 --> 00:31:08,940
I like it because I like their
prices and I worked with them

560
00:31:08,940 --> 00:31:13,220
since I think 2006 or so in a few
companies.

561
00:31:13,860 --> 00:31:17,680
When you bootstrap and you have
a small startup, Hetzner is like

562
00:31:17,680 --> 00:31:23,800
number 1 in terms of budgets and
the hardware they can offer.

563
00:31:24,140 --> 00:31:28,440
So they just recently released
S3-compatible object storage,

564
00:31:28,440 --> 00:31:28,840
right?

565
00:31:28,840 --> 00:31:31,500
So we can have normal backups and
so on.

566
00:31:31,500 --> 00:31:34,200
But what to do with old partitions?

567
00:31:34,540 --> 00:31:36,540
It's a natural way of thinking.

568
00:31:36,740 --> 00:31:42,740
We don't want to keep them on these
expensive disks we have,

569
00:31:43,480 --> 00:31:45,280
having multiple copies of that.

570
00:31:45,540 --> 00:31:50,600
So offloading it somehow, like
implicitly, like transparently

571
00:31:51,000 --> 00:31:55,280
in the background, to S3 or S3-compatible
MinIO or something,

572
00:31:55,280 --> 00:31:59,380
if you have self-managed Postgres,
it would be great.

573
00:32:01,060 --> 00:32:04,960
So we have it in Timescale Cloud,
and I think Neon also does it,

574
00:32:04,960 --> 00:32:05,460
right

575
00:32:05,840 --> 00:32:06,840
Michael: I don't know

576
00:32:06,900 --> 00:32:10,640
Nikolay: Bottomless, bottomless.
Like Neon, they store data on

577
00:32:10,640 --> 00:32:15,800
S3 originally anyway. So the idea is
we want to have a petabyte-size

578
00:32:15,800 --> 00:32:21,900
cluster, but don't pay for lots
of disks and the headache that comes

579
00:32:21,900 --> 00:32:22,360
with.

580
00:32:22,360 --> 00:32:27,660
And for append-only, it's very
natural to decide, okay, we want

581
00:32:27,660 --> 00:32:34,760
to store data forever, not to clean
up, but we want cheap storage

582
00:32:34,760 --> 00:32:35,260
here.

583
00:32:35,380 --> 00:32:40,220
So S3 is a good idea to consider,
and it has tiers also, right?

584
00:32:41,140 --> 00:32:44,540
It can be slow to retrieve, but
it's okay because it's rare,

585
00:32:44,540 --> 00:32:45,040
right?

586
00:32:45,620 --> 00:32:48,180
Michael: Well, and it depends what
you mean by slow like I think

587
00:32:48,180 --> 00:32:51,420
there can be performance
advantages when

588
00:32:51,440 --> 00:32:55,520
some of this data is fresh we might
want to retrieve it row by

589
00:32:55,520 --> 00:32:59,160
row like if you do if you're looking
at some audit logs you might

590
00:32:59,160 --> 00:33:03,380
want to look at some recent ones
that you might want all the

591
00:33:03,380 --> 00:33:06,600
information about them but if you're
looking at data from 2 years

592
00:33:06,600 --> 00:33:09,320
ago there's probably a higher chance
that you're looking at it

593
00:33:09,320 --> 00:33:13,540
in aggregate, you know, on average
how many audits of this type

594
00:33:13,540 --> 00:33:18,340
we were having in 2022 versus
2023. And I think actually the

595
00:33:18,340 --> 00:33:22,000
types of queries that happen on
older data tend to be these Aggregate

596
00:33:22,000 --> 00:33:26,360
ones that often perform better
once it's column-store compressed.

597
00:33:26,760 --> 00:33:31,160
you know these file formats often
suit that kind of query, so

598
00:33:31,260 --> 00:33:34,700
I could I don't even think I know
what you mean

599
00:33:34,760 --> 00:33:38,140
Nikolay: It likes compression, good
compression, yeah.

600
00:33:38,400 --> 00:33:42,340
and TimescaleDB compression I
have seen how good it is it's

601
00:33:42,340 --> 00:33:47,820
gonna be like 20 times smaller.
And indeed, they

602
00:33:47,820 --> 00:33:52,760
even support data changes for compressed
data, which is great.

603
00:33:53,940 --> 00:33:58,200
Michael: I have seen a project
or 2 come up about lately, I think,

604
00:33:58,200 --> 00:34:01,280
open sourcing some of this stuff,
or at least putting it under

605
00:34:01,280 --> 00:34:02,460
the Postgres license.

606
00:34:03,300 --> 00:34:05,780
Is it pg_parquet that allows you
to...

607
00:34:05,820 --> 00:34:06,780
Nikolay: Yeah, but it's different.

608
00:34:06,780 --> 00:34:07,660
It's for analytics.

609
00:34:07,720 --> 00:34:12,340
And actually for analytics, we
also might want to consider append-only

610
00:34:13,020 --> 00:34:14,080
tables, obviously.

611
00:34:14,540 --> 00:34:18,800
But there is a new wave of this,
and many, I know many people,

612
00:34:18,800 --> 00:34:21,100
companies look at it.

613
00:34:21,180 --> 00:34:23,440
PgDuckDB, not PgDuckDB.

614
00:34:23,560 --> 00:34:23,800
Michael: Yeah.

615
00:34:23,800 --> 00:34:24,300
Nikolay: DuckDB.

616
00:34:24,960 --> 00:34:25,440
Right.

617
00:34:25,440 --> 00:34:30,700
And the idea is, let's marry it with
Postgres.

618
00:34:31,240 --> 00:34:36,000
And there are a few projects I
looked at a few of them recently.

619
00:34:36,900 --> 00:34:40,820
And 1 of them was just released
maybe last week.

620
00:34:42,100 --> 00:34:46,980
I remember they use logical copy
from original tables, regular

621
00:34:46,980 --> 00:34:50,880
tables, to these tables, which
are stored on, I think, in Parquet

622
00:34:50,920 --> 00:34:55,540
format on object storage, and then
DuckDB is used as processing

623
00:34:55,680 --> 00:34:56,400
for analytics.

624
00:34:57,040 --> 00:35:02,340
But I remember, I think Álvaro
commented on Twitter that I'm

625
00:35:02,340 --> 00:35:07,620
not going to consider it until
it works with like basically CDC

626
00:35:07,640 --> 00:35:09,960
logical replication or something
because right now it's only

627
00:35:09,960 --> 00:35:13,340
full refresh, like, it's not serious,
but they will do it, I think.

628
00:35:13,580 --> 00:35:18,340
Also, I think the Neon guys looked at
DuckDB and I saw some activities

629
00:35:18,760 --> 00:35:20,140
and Hydra, right?

630
00:35:20,140 --> 00:35:21,160
They also looked at that.

631
00:35:21,160 --> 00:35:21,580
Michael: Yeah.

632
00:35:21,580 --> 00:35:24,920
But I understand that most of the
marketing at the moment is

633
00:35:24,920 --> 00:35:28,600
around analytics use cases, but
I don't see why it couldn't work

634
00:35:28,600 --> 00:35:30,320
for append-only

635
00:35:30,440 --> 00:35:31,200
Nikolay: data types.

636
00:35:31,460 --> 00:35:32,460
I'd be sure.

637
00:35:32,780 --> 00:35:35,980
I looked at a couple of extensions
because I have a couple of

638
00:35:35,980 --> 00:35:43,120
customers with such a need to offload
old data, and Parquet, this

639
00:35:43,120 --> 00:35:47,440
format, the mapping
of data types might be tricky

640
00:35:47,440 --> 00:35:49,780
if you have some complex data types,
as I remember.

641
00:35:50,320 --> 00:35:53,860
And when I looked at some extensions,
it didn't work well.

642
00:35:54,520 --> 00:35:58,760
And I think right now, I have plans
to look at Tembo's extension,

643
00:35:58,780 --> 00:36:02,120
which is called pg_tier, for tiered
storage.

644
00:36:02,980 --> 00:36:07,620
The idea is, with this extension,
we can have old partitions

645
00:36:07,780 --> 00:36:09,620
on object storage.

646
00:36:09,620 --> 00:36:10,660
It's a great idea.

647
00:36:11,160 --> 00:36:13,080
So if it works, it's great.

648
00:36:13,080 --> 00:36:14,940
I just haven't looked at it yet.

649
00:36:16,780 --> 00:36:23,760
If somebody from Tembo is listening
to us, please let us know

650
00:36:23,860 --> 00:36:25,880
how this project works already.

651
00:36:26,120 --> 00:36:29,780
Is it already in production or
beta stage?

652
00:36:30,040 --> 00:36:30,860
I'm very curious.

653
00:36:31,360 --> 00:36:33,540
And is it worth trying?

654
00:36:33,680 --> 00:36:35,880
And what are the limitations, for
example?

655
00:36:36,000 --> 00:36:40,020
Maybe we should actually have a
separate episode, because I think

656
00:36:40,680 --> 00:36:45,600
this extension might bring the bottomless
idea to the Postgres ecosystem

657
00:36:46,540 --> 00:36:47,240
for everyone.

658
00:36:47,320 --> 00:36:47,840
It's great.

659
00:36:47,840 --> 00:36:51,640
It's open source, unlike what other
companies do.

660
00:36:52,200 --> 00:36:56,520
So kudos to Tembo for making this
open source

661
00:36:56,740 --> 00:36:57,240
Michael: extension.

662
00:36:58,660 --> 00:36:59,560
And is it transparent?

663
00:37:00,180 --> 00:37:01,400
Do I have to...

664
00:37:01,600 --> 00:37:05,100
Because I think some of the DuckDB
stuff, I would have to be...

665
00:37:05,540 --> 00:37:06,960
I'm not sure if I have to write...

666
00:37:07,060 --> 00:37:08,040
Nikolay: I don't know.

667
00:37:08,680 --> 00:37:14,540
It might be semi-transparent if
you need to make some transitions

668
00:37:14,540 --> 00:37:16,800
with old partitions, I'm okay with
that.

669
00:37:17,660 --> 00:37:21,360
And I can even create new partitions
in the background and move

670
00:37:22,080 --> 00:37:26,460
all the data from 1 old partition
to this kind of new old partition,

671
00:37:26,460 --> 00:37:28,200
which already has different storage.

672
00:37:28,260 --> 00:37:29,120
This is doable.

673
00:37:30,020 --> 00:37:32,900
It's not a big deal, as I see it.

674
00:37:33,150 --> 00:37:38,900
I think what, for example, Timescale
Cloud has, it's transparent

675
00:37:38,940 --> 00:37:44,160
and they have some kind of interesting,
as I remember, very rough

676
00:37:44,160 --> 00:37:44,660
form.

677
00:37:44,680 --> 00:37:48,760
They have something also with Planner
and some algorithms to

678
00:37:48,760 --> 00:37:53,680
decide when to bring this data
to caches and so on right so it's

679
00:37:53,680 --> 00:37:59,640
interesting but idea is we want
partitioning to evict blocks

680
00:37:59,640 --> 00:38:04,120
with old data from memory, that's one thing,
but then we think we want to

681
00:38:04,120 --> 00:38:07,660
evict them from our disks, because
disks are also expensive,

682
00:38:07,760 --> 00:38:08,000
right?

683
00:38:08,000 --> 00:38:09,180
Let's evict it.

684
00:38:09,640 --> 00:38:13,040
And this is an alternative idea
to cleanup and to compression,

685
00:38:13,580 --> 00:38:17,320
or compression goes well with offloading
as well.

686
00:38:17,320 --> 00:38:18,180
I don't know.

687
00:38:18,580 --> 00:38:22,000
So the Parquet format is definitely
good with compressing data in

688
00:38:22,000 --> 00:38:24,020
terms of column storage, right?

689
00:38:24,280 --> 00:38:26,700
So if it's time series, it's good.

690
00:38:27,560 --> 00:38:31,860
So there are interesting new directions
of development of the Postgres

691
00:38:32,040 --> 00:38:33,140
ecosystem here.

692
00:38:33,340 --> 00:38:36,140
And I think we mentioned a few
projects, both commercial and

693
00:38:36,140 --> 00:38:37,700
open source, which is great.

694
00:38:38,100 --> 00:38:42,340
So if someone wants to store a petabyte
of data, is preparing for

695
00:38:42,340 --> 00:38:46,080
it, doesn't want to delete everything,
I think there are ideas

696
00:38:46,080 --> 00:38:46,760
to consider.

697
00:38:47,220 --> 00:38:50,680
Well, we didn't mention with partitioning,
you can also have

698
00:38:50,680 --> 00:38:55,780
some kind of foreign data wrapper
and store it somewhere very cheap,

699
00:38:56,200 --> 00:38:56,990
also Postgres.

700
00:38:56,990 --> 00:39:00,540
It can be Postgres, it can be not
Postgres as well, right?

701
00:39:00,640 --> 00:39:03,840
But for example, we can consider
a cheap Postgres cluster with

702
00:39:03,840 --> 00:39:08,800
very slow disks, HDD, for a good
price, and we can have foreign

703
00:39:08,800 --> 00:39:13,240
data wrappers and offload it from
the main cluster to that cluster

704
00:39:13,520 --> 00:39:14,600
and live with it.
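A hedged sketch of that setup, with hypothetical server, host, and table names; foreign tables can act as partitions of a partitioned table in modern Postgres:

```sql
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

CREATE SERVER archive_pg FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'archive-host', dbname 'archive');
CREATE USER MAPPING FOR CURRENT_USER SERVER archive_pg
    OPTIONS (user 'app', password 'secret');

-- An old, read-mostly partition that physically lives on the cheap
-- cluster but is still queryable through the parent table.
CREATE FOREIGN TABLE events_2022 PARTITION OF events
    FOR VALUES FROM ('2022-01-01') TO ('2023-01-01')
    SERVER archive_pg
    OPTIONS (table_name 'events_2022');
```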

705
00:39:14,600 --> 00:39:20,800
There will be some corner cases,
maybe, with transactions failing

706
00:39:20,800 --> 00:39:24,740
sometimes because of, I don't know,
because of...

707
00:39:25,760 --> 00:39:29,700
I inspected the foreign data wrapper,
postgres_fdw, code in terms

708
00:39:29,700 --> 00:39:31,120
of how it works with...

709
00:39:33,140 --> 00:39:35,380
Basically, you need 2PC, right?

710
00:39:35,380 --> 00:39:40,080
You need a two-phase commit to
have a reliable commit because

711
00:39:40,080 --> 00:39:43,760
it's a distributed system already,
but without 2PC, there are

712
00:39:43,780 --> 00:39:46,920
risks to have inconsistency, for
example.

713
00:39:47,980 --> 00:39:51,140
Michael: But not if you're, if
it's append-only, you're never

714
00:39:51,140 --> 00:39:55,440
going to change that data by the
time you're pushing it to a

715
00:39:55,440 --> 00:39:56,460
different server.

716
00:39:56,680 --> 00:40:00,060
Nikolay: I remember I inspected
code and found some kind of edge

717
00:40:00,060 --> 00:40:04,540
cases, maybe even corner cases,
but for inserts it's also possible.

718
00:40:04,540 --> 00:40:07,940
You wrote to one place, you're trying
to write to another place; this

719
00:40:07,940 --> 00:40:10,220
one is already committed, here it's
not committed.

720
00:40:10,940 --> 00:40:17,800
But I remember code was quite smart
to reduce the probability

721
00:40:17,900 --> 00:40:19,780
of some cases, but it's not 100%.

722
00:40:20,140 --> 00:40:22,260
As I remember, it was like 5 years
ago.

723
00:40:22,540 --> 00:40:25,360
So I haven't revisited it since.

724
00:40:26,000 --> 00:40:29,280
Michael: Tell me if I'm wrong,
but you're saying new partitions

725
00:40:29,540 --> 00:40:34,380
would be on the local Postgres,
and it would be old ones that

726
00:40:34,380 --> 00:40:36,380
we would move to the second.

727
00:40:36,380 --> 00:40:40,900
Nikolay: Yeah, if we don't have
inserting transactions which

728
00:40:40,900 --> 00:40:43,820
deal with multiple partitions,
there is no problem at all.

729
00:40:43,820 --> 00:40:48,840
Yeah, and old ones we can move
behind postgres_fdw to a different

730
00:40:48,840 --> 00:40:49,340
cluster.

731
00:40:49,700 --> 00:40:50,700
So you're right.

732
00:40:51,040 --> 00:40:52,900
No inserts happening, no problem.

733
00:40:53,800 --> 00:40:54,300
Cool.

734
00:40:54,520 --> 00:40:55,460
Select only.

735
00:40:55,640 --> 00:40:59,020
Michael: I never considered using
FDW for partitioning.

736
00:41:00,060 --> 00:41:01,640
Or like, you know, the old partition.

737
00:41:01,640 --> 00:41:02,320
Nikolay: It's natural.

738
00:41:02,320 --> 00:41:03,540
Yeah, it's natural.

739
00:41:03,600 --> 00:41:08,740
This is how many folks were going
to have clustered Postgres,

740
00:41:08,800 --> 00:41:09,300
right?

741
00:41:10,120 --> 00:41:13,220
I mean, I mean, or sharding, sharding,
sharding.

742
00:41:13,780 --> 00:41:14,540
Yeah, sharding.

743
00:41:14,540 --> 00:41:17,200
Yeah, but this path has issues.

744
00:41:18,140 --> 00:41:20,500
Yeah, it's a different, different
story.

745
00:41:20,500 --> 00:41:21,740
Michael: Here be dragons, is it?

746
00:41:21,740 --> 00:41:23,180
Or what's that old phrase?

747
00:41:24,280 --> 00:41:25,460
Nikolay: And I don't know.

748
00:41:25,600 --> 00:41:27,900
So, okay, we discussed many things.

749
00:41:28,660 --> 00:41:31,760
I feel we might be missing something,
right?

750
00:41:31,760 --> 00:41:35,100
But for a good overview, I think
it's enough.

751
00:41:35,540 --> 00:41:39,360
For a general overview of what's
happening with append-only tables.

752
00:41:41,400 --> 00:41:43,940
So I think it's an interesting
topic, actually.

753
00:41:45,600 --> 00:41:46,860
Many people need it.

754
00:41:47,080 --> 00:41:50,580
Many companies store logs and so
on in Postgres.

755
00:41:51,100 --> 00:41:55,040
And I'm looking forward to the
future where Postgres at some

756
00:41:55,040 --> 00:41:57,180
point will have better compression.

757
00:41:58,020 --> 00:42:00,120
Better compression and bottomless
feature.

758
00:42:01,270 --> 00:42:09,180
And as I like to say, imagine TimescaleDB
were Postgres-licensed.

759
00:42:12,040 --> 00:42:15,700
Okay, this I think should happen
at some point. I know this

760
00:42:15,700 --> 00:42:19,580
is there are people who don't like
this idea, but they developed

761
00:42:19,700 --> 00:42:23,960
such good stuff that Postgres would
benefit from it.

762
00:42:24,280 --> 00:42:27,840
I know some people would not be
happy with these words, I know.

763
00:42:28,680 --> 00:42:32,540
But it feels natural to have these
features in Postgres itself.

764
00:42:33,260 --> 00:42:37,280
Michael: Some of them for sure,
but I do think some of it's tricky,

765
00:42:37,900 --> 00:42:44,940
like the bottomless stuff for example,
where would it go?

766
00:42:46,760 --> 00:42:48,440
Nikolay: There is an extension already,
right?

767
00:42:49,280 --> 00:42:51,160
S3 compatibility is standard.

768
00:42:51,220 --> 00:42:53,940
Even Google Cloud's GCS is also
S3-compatible.

769
00:42:54,020 --> 00:42:55,220
Everyone does it.

770
00:42:55,640 --> 00:42:57,460
So I think, no, no, no.

771
00:42:58,100 --> 00:43:01,120
I don't see why not here.

772
00:43:01,120 --> 00:43:03,300
I mean, I see it, but it's manageable.

773
00:43:05,280 --> 00:43:07,660
The biggest question is business-wise,
I guess.

774
00:43:07,900 --> 00:43:11,140
So, TimescaleDB, license, yeah.

775
00:43:11,280 --> 00:43:16,460
I understand that business-wise
it's not going to happen in the

776
00:43:16,460 --> 00:43:20,460
near future, but it could be so
great to have good compression

777
00:43:20,740 --> 00:43:22,660
and bottomless in Postgres
itself.

778
00:43:22,840 --> 00:43:25,960
Even if it's extension, I'm okay
with extensions.

779
00:43:26,200 --> 00:43:29,940
I don't like extensions in general
because I like to...

780
00:43:31,980 --> 00:43:35,320
I'd like all people to have the features,
but in this case, it's okay to

781
00:43:35,320 --> 00:43:36,000
have extensions.

782
00:43:37,200 --> 00:43:37,700
Cool.

783
00:43:37,860 --> 00:43:39,820
Extensions are great sometimes,
yeah.

784
00:43:40,480 --> 00:43:41,360
Good, okay.

785
00:43:41,420 --> 00:43:42,680
Have a good week.

786
00:43:43,640 --> 00:43:47,040
I know we have more than 100 suggestions
in our doc.

787
00:43:47,320 --> 00:43:47,420
Yeah.

788
00:43:47,420 --> 00:43:48,280
We read them.

789
00:43:49,020 --> 00:43:49,700
Keep posting.

790
00:43:49,700 --> 00:43:51,660
This was one of the suggestions, right?

791
00:43:52,000 --> 00:43:52,500
Yeah.

792
00:43:52,780 --> 00:43:54,040
So you chose it.

793
00:43:54,280 --> 00:43:55,220
Yeah, good.

794
00:43:55,620 --> 00:44:00,520
Thank you to our audience for
the patience to reach this point,

795
00:44:00,580 --> 00:44:01,260
you know.

796
00:44:01,640 --> 00:44:03,260
Okay, see you next time

797
00:44:03,340 --> 00:44:05,500
Michael: Absolutely, take care, see
you next week, bye