1
00:00:00,180 --> 00:00:03,680
Nikolay: Hello hello, this is Postgres.FM,
episode number again,

2
00:00:03,680 --> 00:00:10,200
I don't remember 126, great,
more than 2 years and my

3
00:00:10,200 --> 00:00:15,520
name is Nikolay Postgres.AI as usual
My co-host is Michael pgMustard.

4
00:00:15,560 --> 00:00:16,320
Hi Michael

5
00:00:17,380 --> 00:00:18,220
Michael: Hello, Nikolay.

6
00:00:18,220 --> 00:00:19,520
Nikolay: How are you doing today?

7
00:00:20,320 --> 00:00:21,040
Michael: Yeah, good.

8
00:00:21,040 --> 00:00:21,820
How are you?

9
00:00:21,820 --> 00:00:26,260
Nikolay: Great as well So let's play some Tetris column Tetris.

10
00:00:26,460 --> 00:00:28,680
I mean talk about this game

11
00:00:29,440 --> 00:00:30,860
Michael: It's not a very fun game.

12
00:00:30,860 --> 00:00:34,200
Is it I like Tetris, but Column Tetris not so much.

13
00:00:34,200 --> 00:00:41,540
Nikolay: Yeah well I think you can know 0 about it then you usually

14
00:00:42,040 --> 00:00:47,360
get excited about it and then you kind of think okay it exists

15
00:00:47,980 --> 00:00:50,900
but at some point you stop playing too much.

16
00:00:51,340 --> 00:00:53,220
Let's unwrap it.

17
00:00:53,960 --> 00:00:54,880
What is it?

18
00:00:57,360 --> 00:01:02,040
So columns can have different sizes, Data types have different

19
00:01:02,040 --> 00:01:02,540
sizes.

20
00:01:02,980 --> 00:01:03,480
Michael: Yes.

21
00:01:03,580 --> 00:01:07,900
Nikolay: And most folks say, for example, integer, big int, small

22
00:01:07,900 --> 00:01:15,360
int, but I prefer saying int4, int8, int2, because you see exactly

23
00:01:15,360 --> 00:01:17,820
how many bytes it occupies.

24
00:01:18,840 --> 00:01:22,920
Timestamps, regardless of with time zone, without time zone,

25
00:01:22,920 --> 00:01:24,840
they all occupy 8 bytes.

26
00:01:25,640 --> 00:01:27,680
UUID, I think, 16, right?

27
00:01:27,740 --> 00:01:28,580
It's huge.

28
00:01:28,780 --> 00:01:29,720
Or 8 as well.

29
00:01:29,720 --> 00:01:30,540
I think 16.

30
00:01:30,860 --> 00:01:31,820
I always forget.

31
00:01:32,020 --> 00:01:33,180
Michael: I actually can't remember.

32
00:01:33,340 --> 00:01:33,840
Nikolay: Yeah.

33
00:01:34,320 --> 00:01:39,440
Boolean, which should take only 1 bit, takes whole 8 bits, so

34
00:01:39,440 --> 00:01:40,540
1 byte, right?

35
00:01:41,180 --> 00:01:46,660
It's a very wasteful way to store your flags, like true-false

36
00:01:47,480 --> 00:01:47,980
values.

37
00:01:48,560 --> 00:01:49,320
What else?

38
00:01:49,800 --> 00:01:51,080
Michael: There are the big ones, aren't there?

39
00:01:51,080 --> 00:01:55,600
I guess you've got varying length, all the numeric ones.

40
00:01:55,600 --> 00:01:58,780
In fact the documentation is great listing what they all are

41
00:01:58,780 --> 00:02:02,120
but the other varying length ones I think of the other big category

42
00:02:02,480 --> 00:02:07,060
Nikolay: yeah Marlena like text and watch our and dress on a

43
00:02:07,060 --> 00:02:11,080
race just some be of course all these guys they are variable

44
00:02:11,120 --> 00:02:12,900
length and it's different story.

45
00:02:15,660 --> 00:02:20,140
They are not fixed size data types right but these data types

46
00:02:20,440 --> 00:02:25,100
which have fixed size it's interesting to see that sometimes

47
00:02:25,520 --> 00:02:29,640
just reordering columns you can save disk space and get better

48
00:02:29,640 --> 00:02:30,140
performance

49
00:02:31,520 --> 00:02:34,780
Michael: yeah I would say most people including myself when they

50
00:02:34,780 --> 00:02:37,100
first learn this find it very unexpected

51
00:02:37,580 --> 00:02:42,420
Nikolay: Yeah, I agree Same it
was long ago, but yeah, same

52
00:02:44,280 --> 00:02:45,980
Michael: And mine was probably
more recent.

53
00:02:46,520 --> 00:02:51,680
Yeah But yeah, so I don't know
if you want to get into any of

54
00:02:51,680 --> 00:02:55,160
the details as to why I don't fully
understand the reasons why

55
00:02:55,160 --> 00:02:57,940
but my understanding was reading

56
00:02:58,080 --> 00:03:01,560
Nikolay: let's just say it's something
about efficiency of how

57
00:03:01,560 --> 00:03:10,160
CPU works and if it's 8 bytes worth,
it's better for performance

58
00:03:10,680 --> 00:03:15,980
to read and write this number of
bytes, 8 bytes in modern systems.

59
00:03:16,560 --> 00:03:21,680
But it was a surprise to me to
realize, you know, we talked before,

60
00:03:21,820 --> 00:03:26,740
the recording, the alignment happens
not only to 8 bytes, but

61
00:03:26,740 --> 00:03:29,600
also sometimes to 4 bytes and to
2 bytes.

62
00:03:30,120 --> 00:03:32,280
So, yeah, but let's have an example.

63
00:03:32,780 --> 00:03:38,160
For example, if you have, very
typical example, you have a column,

64
00:03:38,800 --> 00:03:43,480
4 byte column, which is primary
key, say ID, integer.

65
00:03:43,660 --> 00:03:44,760
Michael: Which we don't recommend.

66
00:03:45,560 --> 00:03:48,460
Nikolay: Yeah, we don't recommend,
and I always, like, you know,

67
00:03:48,460 --> 00:03:52,400
it's interesting, like you see
something, you learn something

68
00:03:52,600 --> 00:03:57,800
usually by doing some mistakes
and some going through some pain

69
00:03:57,800 --> 00:04:01,420
in production and so on and then
and then you start recommending

70
00:04:01,480 --> 00:04:06,500
something but still people use
integer for primary keys for example

71
00:04:06,500 --> 00:04:10,640
Supabase and OrioleDB they
just had great release it was

72
00:04:10,640 --> 00:04:14,960
on on top of Hacker News but by
the way congratulations it looks

73
00:04:14,960 --> 00:04:21,100
great benchmarks look great beta
7 right yeah and I admire Alexander

74
00:04:21,140 --> 00:04:25,320
Korotkov a lot like he helped me
personally many times.

75
00:04:26,840 --> 00:04:31,340
And for example, recently we had
a discussion about track planning

76
00:04:31,400 --> 00:04:36,880
and we didn't mention that there
is a good thread and discussion

77
00:04:36,980 --> 00:04:41,260
of sampling to mitigate overhead
from pg_stat_statements.

78
00:04:41,880 --> 00:04:44,060
I discussed it with Korotkov some
time ago.

79
00:04:44,060 --> 00:04:47,280
I dropped a line and he immediately
commented on that thread.

80
00:04:47,280 --> 00:04:47,640
So it's cool.

81
00:04:47,640 --> 00:04:48,540
It's cool.

82
00:04:48,740 --> 00:04:49,900
I mean, great.

83
00:04:49,900 --> 00:04:54,560
But he used integer for primary
key in examples of benchmarks.

84
00:04:54,940 --> 00:05:00,460
But you know, when table is named
like, I don't like users, can

85
00:05:00,460 --> 00:05:05,600
we have 2 billion, 2.1 billion
insert attempts to such table?

86
00:05:05,600 --> 00:05:06,120
Maybe yes.

87
00:05:06,120 --> 00:05:06,740
I don't know.

88
00:05:06,740 --> 00:05:08,640
It depends on the size of the project.

89
00:05:08,860 --> 00:05:11,920
But anyway, we don't recommend
using integer for primary keys

90
00:05:11,920 --> 00:05:16,840
as you mentioned I I stay at this
point again, but I'm less aggressive

91
00:05:16,960 --> 00:05:20,820
judging people when they use integer
4 primary keys.

92
00:05:21,220 --> 00:05:26,080
Michael: So a lot of the argument
for int4 primary key versus

93
00:05:26,100 --> 00:05:29,080
int8 would be saving space.

94
00:05:29,380 --> 00:05:32,860
But due to this alignment padding
issue, if you're going to follow

95
00:05:32,860 --> 00:05:34,140
it with an 8 by 8 type.

96
00:05:34,140 --> 00:05:36,140
Nikolay: Create a timestamp, for
example.

97
00:05:36,340 --> 00:05:37,320
Michael: Yes, exactly.

98
00:05:37,640 --> 00:05:38,600
Super common.

99
00:05:39,160 --> 00:05:43,220
If that's the order when you create
your table, you're not saving

100
00:05:43,220 --> 00:05:46,400
space even, other than, well, and
we'll get to this, other than

101
00:05:46,400 --> 00:05:50,380
maybe in some indexes, you're not saving space by using the smaller

102
00:05:50,380 --> 00:05:50,860
data type.

103
00:05:50,860 --> 00:05:52,580
So I think it's really fascinating.

104
00:05:53,040 --> 00:05:54,880
And as I said, for me, it was unexpected.

105
00:05:55,580 --> 00:05:59,180
Not because I knew better, but because I think in like, like

106
00:05:59,180 --> 00:06:03,740
when you create things in any other, in most other formats, like

107
00:06:03,820 --> 00:06:06,500
you see their length grow as you add things together and you

108
00:06:06,500 --> 00:06:06,900
don't

109
00:06:06,900 --> 00:06:07,120
Nikolay: you

110
00:06:07,120 --> 00:06:09,920
Michael: don't consider that the lower level structure is going

111
00:06:09,920 --> 00:06:13,160
to be like grouped into little pay I guess I guess the 1 the

112
00:06:13,160 --> 00:06:19,620
1 exception is when you're creating maybe maybe some PDFs or

113
00:06:19,620 --> 00:06:21,840
something like let's say you wanted to print out the documentation

114
00:06:21,960 --> 00:06:26,040
into PDFs, you probably would want to start new chapters on new

115
00:06:26,040 --> 00:06:27,240
pages, for example.

116
00:06:27,840 --> 00:06:30,360
There's some formatting when you do books and things.

117
00:06:30,360 --> 00:06:33,420
You don't want the chapter to be at the end of a page.

118
00:06:33,480 --> 00:06:35,800
So it's that kind of thing, right?

119
00:06:35,800 --> 00:06:40,900
It's kind of shifting data to start at a new point to make things

120
00:06:40,900 --> 00:06:42,660
easier for the thing.

121
00:06:42,660 --> 00:06:46,020
Nikolay: Yeah, when you use Google Docs, for example, you insert

122
00:06:46,020 --> 00:06:46,920
a page break.

123
00:06:47,280 --> 00:06:47,780
Yeah.

124
00:06:47,920 --> 00:06:48,420
Yeah.

125
00:06:48,520 --> 00:06:49,020
Similar.

126
00:06:49,120 --> 00:06:50,780
A good analogy, I agree.

127
00:06:51,140 --> 00:06:52,340
Back to example again.

128
00:06:52,340 --> 00:06:58,000
If it's int4 primary key and then create it at timestamp,

129
00:06:58,480 --> 00:07:01,600
and then for example, I don't know, like org ID or something,

130
00:07:01,620 --> 00:07:05,680
group ID, also int4, or int8, doesn't matter.

131
00:07:05,820 --> 00:07:09,340
At storage level, what happens between Id and created, between

132
00:07:10,120 --> 00:07:14,620
the first and second columns, Postgres will just fill it up with

133
00:07:14,620 --> 00:07:15,360
4 zeros.

134
00:07:16,400 --> 00:07:22,840
And we can see it using pg_hexedit on Linux if you have Ubuntu

135
00:07:22,840 --> 00:07:23,500
or something.

136
00:07:23,500 --> 00:07:28,180
You can have this graphical interface which works on top of Page

137
00:07:28,180 --> 00:07:29,280
Inspect extension.

138
00:07:30,040 --> 00:07:33,140
You can see zeros with your own eyes, right?

139
00:07:33,640 --> 00:07:34,940
It's from Peter Geoghegan.

140
00:07:34,940 --> 00:07:35,920
Michael: Is that the Peter Geoghegan?

141
00:07:35,920 --> 00:07:36,280
Yeah.

142
00:07:36,280 --> 00:07:36,760
Yeah.

143
00:07:36,760 --> 00:07:37,240
Yeah.

144
00:07:37,240 --> 00:07:38,800
I've seen him using it.

145
00:07:39,100 --> 00:07:42,460
Nikolay: Yeah, I used it a few times, but I remember it required

146
00:07:43,040 --> 00:07:43,540
Ubuntu.

147
00:07:43,580 --> 00:07:46,800
I'm not sure maybe it's possible to run on macOS right now,

148
00:07:46,800 --> 00:07:50,280
so I use it in virtual machines like it would be overhead to

149
00:07:50,280 --> 00:07:50,780
run.

150
00:07:50,920 --> 00:07:53,400
Michael: But these zeros are that padding yes?

151
00:07:53,400 --> 00:07:58,380
Nikolay: Yes since the second column
is 8 bytes for efficiency

152
00:07:59,280 --> 00:08:03,660
Postgres adds 4 zeros so the first
column also basically takes

153
00:08:04,600 --> 00:08:06,260
8 bytes instead of 4.

154
00:08:06,820 --> 00:08:13,120
So 2 first columns, they both take
16 bytes instead of 12.

155
00:08:16,420 --> 00:08:24,180
And it means if you used int8 or
bigint, primary key, it would

156
00:08:24,180 --> 00:08:27,340
be the same in terms of storage
here, as you said.

157
00:08:27,980 --> 00:08:32,680
And we do recommend using integer
4 because who knows, maybe

158
00:08:32,680 --> 00:08:39,820
you will achieve 2.1 billion for
your sequence which is generating

159
00:08:39,940 --> 00:08:43,580
the numbers and once you achieve
that, sequence itself is 8 bytes

160
00:08:43,580 --> 00:08:44,840
always, it's okay.

161
00:08:45,140 --> 00:08:50,920
But if the column where insert
is 4 bytes, with zeros always,

162
00:08:51,700 --> 00:08:55,680
you will end up having problem
because you cannot insert into

163
00:08:55,680 --> 00:08:56,380
it anymore.

164
00:08:57,700 --> 00:09:01,420
Because larger values, larger than
2.1 billion won't be possible

165
00:09:01,420 --> 00:09:07,000
to insert, because the capacity
of 4 bytes is 4.2 billion and

166
00:09:07,000 --> 00:09:08,420
we have signed int.

167
00:09:09,060 --> 00:09:13,940
It's a bad situation because a
big surgery will be needed on

168
00:09:13,940 --> 00:09:14,680
this table.

169
00:09:15,860 --> 00:09:17,860
It's really not an easy topic.

170
00:09:17,860 --> 00:09:20,100
And we touched it a few times in
the past.

171
00:09:21,760 --> 00:09:25,400
But what I didn't realize until
very recently is that padding

172
00:09:25,400 --> 00:09:29,560
can happen also to 4 bytes and
to 2 bytes.

173
00:09:29,600 --> 00:09:36,200
So If you have, for example, int2,
int2, int4, 3 columns, they

174
00:09:36,200 --> 00:09:37,320
take 8 bytes.

175
00:09:37,660 --> 00:09:43,260
But if you have int2, int4, int2,
moving 1 of the int2s to the

176
00:09:43,260 --> 00:09:49,140
end, the first int2 is going to
be padded to 4 bytes, then 4

177
00:09:49,140 --> 00:09:50,460
bytes, then 2 bytes.

178
00:09:51,220 --> 00:09:53,040
So overall it will be 10 bytes.

179
00:09:53,760 --> 00:09:58,080
This was a surprise to me because
I thought they will take only

180
00:09:58,080 --> 00:09:58,780
8 bytes.

181
00:09:59,100 --> 00:10:00,040
It's not so.

182
00:10:00,460 --> 00:10:02,520
Padding happens at different levels.

183
00:10:03,900 --> 00:10:06,640
8 bytes, 4 bytes, and also 2 bytes,
as you said.

184
00:10:06,800 --> 00:10:11,980
Same thing if you use Boolean,
and you can see similarly that

185
00:10:13,180 --> 00:10:17,820
it's going to be padded either
to 2 or 4 or 8 bytes depending

186
00:10:17,840 --> 00:10:20,960
on the subsequent column's

187
00:10:21,040 --> 00:10:21,900
Michael: data type.

188
00:10:22,600 --> 00:10:24,120
Exactly, always the next 1.

189
00:10:24,380 --> 00:10:27,980
Nikolay: And also, to make the
picture complete, if we have,

190
00:10:27,980 --> 00:10:30,700
for example, just...

191
00:10:31,620 --> 00:10:33,980
The First example we used.

192
00:10:34,640 --> 00:10:39,160
Int4, primary key, createdAt, timestamp
taking 8 bytes, and then,

193
00:10:39,160 --> 00:10:43,740
for example, I don't know, like
orgID or something, also 4 bytes.

194
00:10:45,060 --> 00:10:49,840
So padding for first column is
going to be up to 8 bytes, then

195
00:10:49,840 --> 00:10:51,000
we have 8 bytes.

196
00:10:51,380 --> 00:10:56,940
And in the end, it also will be
padding, but not for the tuple.

197
00:10:56,940 --> 00:10:59,620
It will be padding for before next
tuple.

198
00:11:00,540 --> 00:11:05,920
So overall, we will end up having
3 8-byte wards, so it's 24

199
00:11:06,040 --> 00:11:06,880
bytes, right?

200
00:11:07,640 --> 00:11:12,940
And just moving the third column
to the second position, we'll

201
00:11:12,980 --> 00:11:15,980
go down from 24 bytes to 16 bytes.

202
00:11:18,640 --> 00:11:23,040
Also there are 24 bytes for tuple
header, which is actually 23,

203
00:11:23,100 --> 00:11:25,260
but 1 byte is not used, as I remember.

204
00:11:25,860 --> 00:11:28,880
But it's also padded always to
24 bytes.

205
00:11:29,340 --> 00:11:36,400
So overall, we will have 48 bytes
per tuple instead of 40.

206
00:11:37,360 --> 00:11:40,760
And this is amount of saving per
tuple we can achieve if we just

207
00:11:40,760 --> 00:11:47,880
move, we just back 2 int4 data
types together into 1 8 byte position.

208
00:11:48,640 --> 00:11:53,500
And this will contradict with our
recommendation to use int8

209
00:11:53,760 --> 00:11:57,180
primary keys, but again the reason
for int8 primary keys is to

210
00:11:57,180 --> 00:12:01,320
avoid the risks of surgery for
the table.

211
00:12:02,020 --> 00:12:05,740
If you do know you will never achieve
for your sequence value,

212
00:12:05,740 --> 00:12:08,980
you will never achieve 2.1 billion
value.

213
00:12:09,620 --> 00:12:10,660
Never ever.

214
00:12:10,760 --> 00:12:11,500
It's okay.

215
00:12:11,640 --> 00:12:14,760
I could even use 2 byte integer,
small int.

216
00:12:15,060 --> 00:12:21,300
If I know I want to achieve 32,
000, right?

217
00:12:21,680 --> 00:12:22,440
I don't know.

218
00:12:22,440 --> 00:12:23,500
Something like this.

219
00:12:23,660 --> 00:12:24,780
Michael: I'm not good at factors.

220
00:12:25,240 --> 00:12:27,640
Nikolay: It requires careful planning,
so you need to think about

221
00:12:27,640 --> 00:12:29,120
the future and all possibilities.

222
00:12:30,300 --> 00:12:34,900
My rule of thumb is just using
for int8 and that's it but back

223
00:12:34,900 --> 00:12:39,440
to this additional example this
is kind of significant saving

224
00:12:39,440 --> 00:12:46,560
right from from 48 to 40 bytes
It's kind of 20 percent, right?

225
00:12:47,520 --> 00:12:50,860
Michael: Yeah as a percentage,
but then I think I think where

226
00:12:50,860 --> 00:12:55,480
this becomes meaningful is when
people have huge tables so we're

227
00:12:55,480 --> 00:13:00,660
talking about only a small number
like bytes and this day and

228
00:13:00,660 --> 00:13:05,280
age even RAM's not that expensive
like it's expensive at scale

229
00:13:05,280 --> 00:13:09,140
but we can get quite powerful machines
for not very much money

230
00:13:09,140 --> 00:13:12,620
these days and loads of storage
for not very much money so I

231
00:13:12,620 --> 00:13:13,780
think this program

232
00:13:13,780 --> 00:13:16,500
Nikolay: because the storage is
not only about storage it's spamming

233
00:13:16,500 --> 00:13:17,820
RAM not with zeros

234
00:13:18,640 --> 00:13:21,660
Michael: I am a huge fan of doing
things efficiently and I but

235
00:13:21,660 --> 00:13:25,200
I think at scale it's where it
starts the investment in this

236
00:13:25,200 --> 00:13:29,440
kind of thing starts to pay off
and I remember a blog that 1

237
00:13:29,440 --> 00:13:32,700
of my favorite blog posts on this
topic is by the team at Braintree

238
00:13:33,080 --> 00:13:37,020
and they mentioned by going back,
like once they discovered this

239
00:13:37,020 --> 00:13:42,040
phenomenon, by going back through
their largest tables they calculated

240
00:13:42,080 --> 00:13:46,500
it was about a 10% saving on disk
space from where they were

241
00:13:46,500 --> 00:13:52,020
before to to a more optimal setup
so I believe that I don't see

242
00:13:52,020 --> 00:13:55,520
any reason why that that wouldn't
be true for a lot of organizations

243
00:13:55,560 --> 00:13:59,700
that hadn't done this deliberately
in the past or people hadn't

244
00:13:59,700 --> 00:14:04,080
been conscious about this but it's
a 10% a lot in the by some

245
00:14:04,080 --> 00:14:04,580
benchmark.

246
00:14:04,900 --> 00:14:05,400
Nikolay: Noticeable.

247
00:14:05,680 --> 00:14:06,420
It's noticeable.

248
00:14:07,200 --> 00:14:10,120
Michael: Yeah, but it's also not
like sometimes when we talk

249
00:14:10,120 --> 00:14:12,500
about performance, we often talk
about orders of magnitude and

250
00:14:12,500 --> 00:14:14,960
when I don't think we're talking
about that in most cases.

251
00:14:14,960 --> 00:14:18,140
Nikolay: Still, you know, like
again, like there are stages of

252
00:14:18,140 --> 00:14:20,200
impression here and minor counterpression.

253
00:14:20,360 --> 00:14:24,820
Well, it's important problem, but
I don't know, like it's It's

254
00:14:24,820 --> 00:14:27,900
worth thinking about it when you
first design a table, which

255
00:14:27,900 --> 00:14:30,340
will obviously be large in the
future.

256
00:14:31,000 --> 00:14:35,020
I know, for example, GitLab, they
have public documentation and

257
00:14:35,020 --> 00:14:37,060
there is a Column Tetris page there.

258
00:14:38,420 --> 00:14:40,020
I know it's a rule to check it.

259
00:14:40,020 --> 00:14:42,340
If it's a new table, to always
check it.

260
00:14:42,340 --> 00:14:44,940
Maybe there is also some automation,
I'm not sure.

261
00:14:45,420 --> 00:14:48,960
But we have bloat as well in different
dimensions.

262
00:14:49,280 --> 00:14:56,000
This dimension is kind of like
programmed bloat in columns level,

263
00:14:56,000 --> 00:14:58,880
but we have in rows level we have
bloat there.

264
00:14:58,940 --> 00:15:04,420
And 10% is probably… for example,
if I see extra 10% of load

265
00:15:04,820 --> 00:15:08,820
in both indexes and tables, it's
not triggering action from me.

266
00:15:09,180 --> 00:15:09,680
Exactly.

267
00:15:10,080 --> 00:15:10,580
Right?

268
00:15:10,680 --> 00:15:13,660
But if you design a table from
scratch, of course, it's worth

269
00:15:13,660 --> 00:15:14,840
thinking about it.

270
00:15:15,040 --> 00:15:15,540
Yes.

271
00:15:16,320 --> 00:15:19,280
Michael: Mostly because it's so
cheap to think about.

272
00:15:19,280 --> 00:15:21,100
And I guess we'll talk about this
in a moment.

273
00:15:21,100 --> 00:15:25,640
It's not a difficult equation to
work out how you should order

274
00:15:25,640 --> 00:15:26,140
things.

275
00:15:27,180 --> 00:15:28,880
Nikolay: And it can be automated.

276
00:15:28,980 --> 00:15:33,040
We discussed also before recording
and I have report, I have

277
00:15:33,040 --> 00:15:39,440
analysis script in my postgres_dba
old toolkit, which will suggest

278
00:15:39,440 --> 00:15:39,800
to you.

279
00:15:39,800 --> 00:15:40,460
It's easy.

280
00:15:40,640 --> 00:15:44,940
Just start from the largest or
from smallest, doesn't matter.

281
00:15:45,360 --> 00:15:47,140
Maybe it matters a little bit.

282
00:15:47,370 --> 00:15:51,600
Actually start from 1 byte columns,
then put 2 bytes columns,

283
00:15:51,600 --> 00:15:52,360
and so on.

284
00:15:52,360 --> 00:15:55,580
Or vice versa, start from the biggest
size bytes and then smallest.

285
00:15:56,760 --> 00:16:03,140
It will provide you, it will reorganize
columns so you have less

286
00:16:03,820 --> 00:16:07,440
cases when you see zeros in physical
storage.

287
00:16:07,740 --> 00:16:11,280
But don't you think it makes Column
Order weird?

288
00:16:11,280 --> 00:16:15,320
For example, ID can be not at the
first position and it's so

289
00:16:15,320 --> 00:16:17,860
strange to deal with such tables.

290
00:16:18,000 --> 00:16:21,700
Michael: Yeah, I haven't had to
do this optimization myself.

291
00:16:22,760 --> 00:16:23,900
Nikolay: I did it a lot.

292
00:16:24,520 --> 00:16:26,120
Michael: I think I would start
large.

293
00:16:26,120 --> 00:16:30,220
I think I would want to start at
least, maybe not largest to

294
00:16:30,220 --> 00:16:33,760
smallest, but at least the eights
first, or at least a group

295
00:16:33,760 --> 00:16:37,200
of eights first, because I'd want
my ID in the first place.

296
00:16:37,200 --> 00:16:41,540
If you're doing SELECT * [from a] table,
you're just doing some testing.

297
00:16:41,580 --> 00:16:46,060
Just having that ID in those first
few useful columns first is

298
00:16:46,060 --> 00:16:48,020
beneficial for me as a user.

299
00:16:48,020 --> 00:16:51,540
But also, I was reading the Braintree
post, and they've done

300
00:16:51,540 --> 00:16:55,580
some optimizations around, they've
suggested putting the ID first

301
00:16:55,580 --> 00:16:59,160
because it's going to be read the
most so in terms of unpacking

302
00:16:59,240 --> 00:17:02,780
the row Apparently there's some
optimization around that as well.

303
00:17:03,080 --> 00:17:03,580
Okay.

304
00:17:04,540 --> 00:17:08,620
So that, like, so they also, they
also recommended some other

305
00:17:08,720 --> 00:17:11,100
super minor optimizations because
they've built tooling around

306
00:17:11,100 --> 00:17:11,600
this.

307
00:17:11,680 --> 00:17:15,580
Like for example, within your,
let's say you've got a bunch of

308
00:17:15,580 --> 00:17:19,600
8 byte columns, within those you're
probably better off putting

309
00:17:19,600 --> 00:17:23,100
your not null columns first because
they're more likely to have

310
00:17:23,100 --> 00:17:24,060
data in them.

311
00:17:24,060 --> 00:17:27,720
And then putting your default columns
a bit after that because

312
00:17:27,720 --> 00:17:30,140
they will almost always have data
in them.

313
00:17:30,140 --> 00:17:33,420
So like there's some of these,
like probably right at the edge

314
00:17:33,420 --> 00:17:36,260
in terms of optimization, but now
we're not talking about storage

315
00:17:36,260 --> 00:17:40,440
They don't we're talking about
Like we're talking about read

316
00:17:40,440 --> 00:17:44,180
performance So it's and and actually
I owes a good point because

317
00:17:44,180 --> 00:17:47,360
if your data is taken up less space
some of your queries are

318
00:17:47,360 --> 00:17:52,200
going to also require less IO and
that's faster cheaper all the

319
00:17:52,200 --> 00:17:54,940
all good things right well it's

320
00:17:55,200 --> 00:17:59,700
Nikolay: this is I didn't see that
did they show some benchmarks

321
00:17:59,800 --> 00:18:04,960
to prove that there's a reason
I mean, foundation for these decisions

322
00:18:06,260 --> 00:18:07,360
to reorder columns.

323
00:18:08,200 --> 00:18:09,900
It's interesting, I need to check
that.

324
00:18:09,960 --> 00:18:12,260
I didn't know, I didn't see it.

325
00:18:13,080 --> 00:18:16,900
But yeah, I think probably you're
right and I need to adjust

326
00:18:16,920 --> 00:18:23,180
my recommendation logic and put
8 bytes before 16 byte columns

327
00:18:23,400 --> 00:18:26,880
because ID should be there, and
the created add should be there,

328
00:18:26,880 --> 00:18:27,760
and so on.

329
00:18:28,140 --> 00:18:33,780
Although if it's UUID type, Again,
it's going to be 16 bytes.

330
00:18:35,460 --> 00:18:36,300
I don't know.

331
00:18:36,300 --> 00:18:37,940
Anyway, this is the topic.

332
00:18:38,200 --> 00:18:43,040
I think it's super easy not to
know about it, obviously, but

333
00:18:43,040 --> 00:18:46,960
it's also super easy to overestimate
the benefit of paying too

334
00:18:46,960 --> 00:18:48,300
much attention to it.

335
00:18:48,820 --> 00:18:52,240
But interestingly, what happened
recently with Postgres versions

336
00:18:53,840 --> 00:19:00,740
17.1, 16.4, 5, a couple of weeks
ago they were released and caused,

337
00:19:00,800 --> 00:19:03,880
for example, TimescaleDB to not
work with them.

338
00:19:04,700 --> 00:19:08,940
What happened in some structures,
there was additional Boolean

339
00:19:08,940 --> 00:19:13,940
type which was added, and physically,
like physical layout changed

340
00:19:13,940 --> 00:19:19,740
in minor version of structures
and causing some extensions like

341
00:19:19,740 --> 00:19:21,760
TimescaleDB to stop working.

342
00:19:22,660 --> 00:19:26,620
And yeah, it was not a good situation
and I think lack of testing,

343
00:19:26,980 --> 00:19:34,240
but more interesting like next
week the deploy new releases so

344
00:19:34,240 --> 00:19:40,020
17.2, 16.6 and others, they fixed
this problem.

345
00:19:41,460 --> 00:19:44,900
And actually, originally, Boolean
value was added to fix some

346
00:19:44,900 --> 00:19:45,400
problem.

347
00:19:46,220 --> 00:19:51,300
It was a fix, because in minor
versions of Postgres, unlike some

348
00:19:51,300 --> 00:19:55,460
other systems, Postgres doesn't
release new features.

349
00:19:55,680 --> 00:19:56,820
It's only about fixes.

350
00:19:58,140 --> 00:20:03,600
But this fix caused incompatibility
in terms of physical layout.

351
00:20:03,600 --> 00:20:04,400
It's bad.

352
00:20:04,540 --> 00:20:07,860
And how next releases fix it?

353
00:20:08,680 --> 00:20:12,540
This Boolean was moved to different
location in terms of order,

354
00:20:13,260 --> 00:20:18,900
where some padding happened, where
some zeros were present, right?

355
00:20:18,900 --> 00:20:20,340
Feeling 1 of those zeros.

356
00:20:21,020 --> 00:20:25,620
So Column Tetris, as I've tweeted,
Column Tetris well played,

357
00:20:26,260 --> 00:20:26,760
right?

358
00:20:27,180 --> 00:20:31,080
So, I mean, Postgres hackers themselves
played Column Tetris

359
00:20:31,080 --> 00:20:38,040
a few weeks ago, fixing some problem,
fixing a bug which was

360
00:20:38,200 --> 00:20:39,900
another fix of another bug.

361
00:20:41,400 --> 00:20:43,040
That was interesting to observe.

362
00:20:43,780 --> 00:20:46,360
Michael: I think it's worth discussing,
We've discussed a bit

363
00:20:46,360 --> 00:20:51,960
why you would do this, like save
on disk space, less IO, better

364
00:20:51,960 --> 00:20:54,820
caching, like better use of RAM.

365
00:20:55,320 --> 00:20:58,820
But I think when is also useful
to discuss.

366
00:20:59,540 --> 00:21:02,420
Obviously, we can do this when
we're creating a new table.

367
00:21:02,640 --> 00:21:04,240
We can think about column order.

368
00:21:04,240 --> 00:21:06,860
That's normally when we're adding
the most number of columns.

369
00:21:07,660 --> 00:21:12,980
But I think also sometimes when
you're adding new feature or

370
00:21:12,980 --> 00:21:18,160
you're doing refactoring, you might
be adding a few columns at

371
00:21:18,160 --> 00:21:18,940
the same time.

372
00:21:18,940 --> 00:21:22,120
And I think it's worth pointing
out that that's another time

373
00:21:22,120 --> 00:21:24,940
where you can make use of this.

374
00:21:25,840 --> 00:21:30,260
But equally, I think we're moving
to a world where there's more

375
00:21:30,260 --> 00:21:33,120
and more emphasis on incremental
improvements.

376
00:21:33,340 --> 00:21:37,080
And I know for my own projects,
I tend to add 1 column at a time.

377
00:21:37,080 --> 00:21:39,440
Like, I tend to be doing a migration.

378
00:21:39,640 --> 00:21:42,020
Nikolay: Oh, yeah, like 5 releases
per day.

379
00:21:42,040 --> 00:21:42,540
Michael: Yeah.

380
00:21:43,500 --> 00:21:46,300
I'm not doing 5 per day, but, you
know, like, they keep things

381
00:21:46,300 --> 00:21:49,140
small, and they tend to not be
multicolored.

382
00:21:49,160 --> 00:21:52,580
Nikolay: Well, in software engineering,
this makes sense completely,

383
00:21:52,600 --> 00:21:59,340
because if you mix many things
in 1 big change, making them depending

384
00:21:59,440 --> 00:22:00,860
on each other, it's bad, right?

385
00:22:00,860 --> 00:22:05,500
So we try to split it to small
steps and release small steps

386
00:22:05,500 --> 00:22:08,860
sooner, because if we have a problem
with 1 of the steps, it

387
00:22:08,860 --> 00:22:11,260
doesn't block others from deployment,
right?

388
00:22:11,440 --> 00:22:17,220
This means that we probably indeed
add columns 1 by 1 or sometimes

389
00:22:17,220 --> 00:22:17,280
in pairs.

390
00:22:17,280 --> 00:22:20,820
Michael: Not normally, or maybe
other people have different experience.

391
00:22:20,820 --> 00:22:24,300
But yeah, so even in pairs, I guess
the order does.

392
00:22:24,720 --> 00:22:27,100
Nikolay: But in this context, in
previous context, what do you

393
00:22:27,100 --> 00:22:27,600
think?

394
00:22:27,600 --> 00:22:31,560
First of all, when Postgres creates,
When we run create a table

395
00:22:31,560 --> 00:22:34,240
with many, many columns, what?

396
00:22:34,300 --> 00:22:35,280
Why Postgres doesn't do it?

397
00:22:35,280 --> 00:22:38,100
Michael: You're going to ask why
doesn't it do it by default?

398
00:22:38,480 --> 00:22:39,640
Itself, yeah.

399
00:22:39,920 --> 00:22:42,040
Then it's unexpected in a different
way.

400
00:22:42,040 --> 00:22:45,140
Like we were saying, I want my
ID to come first.

401
00:22:45,540 --> 00:22:46,660
Nikolay: No, No, no, no, no.

402
00:22:46,960 --> 00:22:51,360
Imagine we had a logical order
and physical order.

403
00:22:51,420 --> 00:22:56,040
Instead of at num, column in
pg_attribute, we would have

404
00:22:56,040 --> 00:22:56,740
Michael: to num.

405
00:22:57,740 --> 00:23:00,520
Nikolay: And we know what to present
to user, like logical.

406
00:23:00,580 --> 00:23:03,880
And we know physical because at
creation time Postgres would

407
00:23:04,240 --> 00:23:08,880
Have some algorithm to reorganize
it right and I think it's possible

408
00:23:08,880 --> 00:23:12,160
I don't I'm not sure if it was
discussed like it should should

409
00:23:12,160 --> 00:23:15,460
be discussed a few times But obviously
it's not happened.

410
00:23:15,620 --> 00:23:20,380
It didn't happen yet but father
let's let's think like let's

411
00:23:20,380 --> 00:23:25,120
dream a little bit additionally
and Based on this case with new

412
00:23:25,120 --> 00:23:30,140
releases, minor releases, which
played Column Tetris, when you

413
00:23:30,140 --> 00:23:33,760
have an existing table, imagine
we have an existing table and

414
00:23:33,760 --> 00:23:37,620
we have some padding happening
there and we add a boolean column

415
00:23:37,880 --> 00:23:42,740
and table is huge we could add
it and Postgres could add it probably

416
00:23:44,340 --> 00:23:47,380
to some different position not
increasing size at all

417
00:23:47,620 --> 00:23:51,340
Michael: right well I don't know
what needs to then happen to

418
00:23:51,340 --> 00:23:56,260
what happens then to the data files
yeah

419
00:23:56,380 --> 00:24:00,340
Nikolay: well existing was is different
if we if we need fully

420
00:24:00,340 --> 00:24:05,580
right we do it right but since
post this 11 if you if you define

421
00:24:05,580 --> 00:24:09,340
default for a column, it will be
virtual, so without physical

422
00:24:09,360 --> 00:24:09,860
rewrite.

423
00:24:10,080 --> 00:24:14,840
It's just, it will say, okay, until
this moment, I think in xmin

424
00:24:16,180 --> 00:24:21,140
or something, until this transaction
ID, all old rows, They virtually

425
00:24:21,300 --> 00:24:25,320
have this value even we don't write
it physically But it was

426
00:24:25,320 --> 00:24:28,440
great optimization and actually
what I discuss here.

427
00:24:28,440 --> 00:24:31,520
It's also in the same area of possible
optimization of Postgres

428
00:24:31,520 --> 00:24:36,380
could have Right, So only future
rows will have it.

429
00:24:36,880 --> 00:24:39,220
They had zeros already, padding.

430
00:24:40,320 --> 00:24:41,940
Michael: Would the padding look
different?

431
00:24:42,660 --> 00:24:46,380
Nikolay: Well, there is a specific
additional place in Postgres,

432
00:24:46,380 --> 00:24:50,440
I don't remember, I forgot, but
in Postgres 11 it was created

433
00:24:51,220 --> 00:24:52,480
for defaults, right?

434
00:24:52,480 --> 00:24:56,840
Saying all rows kind of have this
value for this column which

435
00:24:56,840 --> 00:24:57,340
is new.

436
00:24:57,340 --> 00:25:02,680
All future values will be read,
All future rows will have it

437
00:25:02,900 --> 00:25:03,960
in normal way.

438
00:25:04,300 --> 00:25:05,100
Same here.

439
00:25:05,820 --> 00:25:08,660
We have padding zeros, right?

440
00:25:09,400 --> 00:25:15,540
In future, like future this column,
say it like, fired, right?

441
00:25:15,540 --> 00:25:16,040
Bullying.

442
00:25:16,560 --> 00:25:17,340
True or false.

443
00:25:17,780 --> 00:25:19,400
Or null, 3 value logic.

444
00:25:19,740 --> 00:25:25,120
We can have it not in, like, in
some position where we had zeros

445
00:25:25,160 --> 00:25:26,240
for old rows.

446
00:25:26,640 --> 00:25:28,440
All new rows will have not 0.

447
00:25:28,520 --> 00:25:29,280
That's it.

448
00:25:30,120 --> 00:25:30,720
It's good, right?

449
00:25:30,720 --> 00:25:32,860
Michael: Yeah, I mean, it would
be beautiful, it would be lovely.

450
00:25:33,400 --> 00:25:39,480
You know, in real Tetris, if it's
getting quite fast and you're

451
00:25:39,480 --> 00:25:42,380
starting to get some of these gaps,
every now and again you get

452
00:25:42,380 --> 00:25:45,600
a shape that looks exactly like
a gap you've got right down near

453
00:25:45,600 --> 00:25:48,460
the bottom and you want to just
pick it up and pop it in.

454
00:25:48,480 --> 00:25:50,400
It would feel like that, wouldn't
it?

455
00:25:50,540 --> 00:25:51,440
Nikolay: Similar, yeah.

456
00:25:51,600 --> 00:25:53,340
Another good analogy, right?

457
00:25:53,940 --> 00:25:56,740
So I think there is some sense
in this optimization.

458
00:25:57,660 --> 00:26:01,720
I don't know, maybe I should discuss
it with Andrey and Kirk.

459
00:26:01,720 --> 00:26:06,380
We're probably slowly returning
to our normal hiking sessions

460
00:26:06,380 --> 00:26:07,620
on Postgres TV.

461
00:26:07,920 --> 00:26:11,140
We have actually already quite
a long list of ideas.

462
00:26:12,040 --> 00:26:12,540
Michael: Yeah.

463
00:26:12,740 --> 00:26:15,800
There's 2 other things I wanted
to discuss with you on this.

464
00:26:16,200 --> 00:26:19,980
1 was doing it with existing data.

465
00:26:20,020 --> 00:26:24,360
So we find out that this is a phenomenon
or we've inherited a

466
00:26:24,360 --> 00:26:28,660
system that wasn't set up in optimal
or it's just evolved over

467
00:26:28,660 --> 00:26:31,500
time we've been adding columns
as we've been developing new features

468
00:26:31,500 --> 00:26:35,080
as a system evolved And we decide
it is worth it for us to, or

469
00:26:35,080 --> 00:26:39,640
like, we want to look into how
would we reorganize the table

470
00:26:40,120 --> 00:26:41,660
into a different column ordering.

471
00:26:42,040 --> 00:26:43,480
How would you go about doing that?

472
00:26:43,480 --> 00:26:45,040
Or how do you go about doing that?

473
00:26:45,040 --> 00:26:50,780
Nikolay: It's the same complexity
for surgery as for int4 to

474
00:26:50,860 --> 00:26:52,460
int8 primary key conversion.

475
00:26:53,400 --> 00:26:58,760
I wish pg_repack would do it, but
we had a discussion and I think

476
00:26:58,840 --> 00:27:03,300
I had some very quick and dirty
prototype to organize cold motor

477
00:27:03,740 --> 00:27:06,500
when you are repacking table using
pg_repack.

478
00:27:06,880 --> 00:27:09,720
It will be a good moment because
it rewrites everything, but

479
00:27:09,720 --> 00:27:14,440
there are some doubts and fears
and I think we just had lack

480
00:27:14,440 --> 00:27:18,400
of attention from very experienced
hackers there, so it was not...

481
00:27:18,820 --> 00:27:22,640
I think somebody in that pull request,
let's check it, and have

482
00:27:22,640 --> 00:27:25,020
it shown on somebody, try that
again.

483
00:27:25,240 --> 00:27:26,520
Because the idea is simple.

484
00:27:26,580 --> 00:27:28,660
pg_repack is rewriting the whole
table.

485
00:27:28,980 --> 00:27:33,060
If we need to physically reorganize
the order of columns, we

486
00:27:33,060 --> 00:27:35,140
also need to rewrite the table.

487
00:27:36,100 --> 00:27:39,060
Let's just change the column order
at this moment.

488
00:27:39,960 --> 00:27:43,840
But this is again, this is something
which is not ready.

489
00:27:44,240 --> 00:27:45,640
You cannot use it in production.

490
00:27:45,720 --> 00:27:47,000
I don't recommend it.

491
00:27:47,780 --> 00:27:51,840
For this, although there is a cooking
dirty prototype for this.

492
00:27:52,300 --> 00:27:56,160
How I would do it, I would just
apply this new table approach,

493
00:27:56,820 --> 00:28:01,560
same as in 8, But it requires some
additional effort.

494
00:28:01,560 --> 00:28:02,780
It's similar to pg_repack

495
00:28:02,800 --> 00:28:06,660
You need some kind of delta table,
you need trigger, and you

496
00:28:06,660 --> 00:28:09,900
need to write changes to that delta
table.

497
00:28:11,120 --> 00:28:14,600
Then you create a copy.

498
00:28:15,480 --> 00:28:19,200
While you create a copy, this trigger
should write all changes

499
00:28:19,200 --> 00:28:20,040
in this delta.

500
00:28:20,500 --> 00:28:21,660
Then copy is created.

501
00:28:22,280 --> 00:28:25,200
There are interesting tricks there
to minimize the duration of

502
00:28:25,200 --> 00:28:26,620
transactions, as I remember.

503
00:28:26,760 --> 00:28:29,880
It was long ago, last time I used
this and developed, I think,

504
00:28:30,060 --> 00:28:31,360
maybe 5 years past.

505
00:28:31,400 --> 00:28:36,680
We had a very serious workflow
developed for a big company originally,

506
00:28:37,300 --> 00:28:42,280
and we used it several times, this
approach, under very heavy

507
00:28:42,280 --> 00:28:46,280
loads and very mission-critical
systems where Postgres was in

508
00:28:46,280 --> 00:28:46,940
the center.

509
00:28:46,960 --> 00:28:49,540
And then you just have some transaction
to switch.

510
00:28:50,160 --> 00:28:51,880
You should be very careful.

511
00:28:52,580 --> 00:28:56,420
Foreign keys will be the biggest
problem actually in terms of

512
00:28:57,100 --> 00:28:57,600
Michael: switch.

513
00:28:57,620 --> 00:28:58,960
Well, I had an alternative.

514
00:28:59,640 --> 00:29:00,600
Nikolay: Logical replication.

515
00:29:01,740 --> 00:29:02,240
Michael: Yeah.

516
00:29:03,140 --> 00:29:06,760
And I think if you've only just
discovered this and you have

517
00:29:06,760 --> 00:29:08,200
several large tables that

518
00:29:08,200 --> 00:29:08,480
Nikolay: you want

519
00:29:08,480 --> 00:29:12,320
Michael: to do it on all at the
same time like I realize it's

520
00:29:12,320 --> 00:29:15,460
heavy-handed if it's not your largest
table or if there's You've

521
00:29:15,460 --> 00:29:18,100
got tons of data that is already
pretty optimal.

522
00:29:18,640 --> 00:29:21,600
But if you're doing 1 big retroactive...

523
00:29:22,740 --> 00:29:23,240
Nikolay: Maybe.

524
00:29:23,420 --> 00:29:24,580
Michael: Well, what's the downside?

525
00:29:25,440 --> 00:29:26,100
Nikolay: Of logical?

526
00:29:26,920 --> 00:29:32,500
As always, downsides of using logical
Under heavy loads, it's

527
00:29:32,500 --> 00:29:33,420
tricky sometimes.

528
00:29:33,940 --> 00:29:37,360
In this case, we need publication
for 1 table, so we need to

529
00:29:37,360 --> 00:29:40,840
carefully work with retries and
so on, because some locks will

530
00:29:40,840 --> 00:29:41,280
be needed.

531
00:29:41,280 --> 00:29:45,060
They won't block other sessions,
but you can just fail.

532
00:29:45,060 --> 00:29:48,980
For example, if you have a long-running,
as usual, long-running

533
00:29:49,020 --> 00:29:53,400
autovacuum process, which is processing
your table to prevent

534
00:29:53,400 --> 00:29:57,900
transaction ID wraparound, you
won't be able to get a lock for

535
00:29:57,900 --> 00:29:58,260
long.

536
00:29:58,260 --> 00:30:00,640
I mean, well, it's running, so
you'll fail it.

537
00:30:00,700 --> 00:30:01,460
You need to...

538
00:30:01,460 --> 00:30:03,260
There are some nuances here, right?

539
00:30:03,260 --> 00:30:06,840
But I think it's obviously a valid
approach.

540
00:30:07,740 --> 00:30:11,100
And combining all these thoughts,
I think PjSqueeze from CYBERTEC,

541
00:30:11,100 --> 00:30:15,020
which is an alternative to PjRepack,
probably is a tool where

542
00:30:15,020 --> 00:30:16,200
this should be maybe...

543
00:30:16,340 --> 00:30:18,020
Maybe it's already supported there.

544
00:30:18,620 --> 00:30:19,240
Michael: I checked.

545
00:30:19,240 --> 00:30:20,840
I couldn't see it in the documentation

546
00:30:20,900 --> 00:30:23,800
Nikolay: But it makes sense getting
rid of bloat let's also reorganize

547
00:30:23,800 --> 00:30:27,980
table a little bit and get rid
of padding which can be considered

548
00:30:27,980 --> 00:30:29,440
also kind of bloat right

549
00:30:30,800 --> 00:30:31,640
Michael: Yeah I think so.

550
00:30:31,640 --> 00:30:33,140
I think depending on your definition...

551
00:30:33,240 --> 00:30:34,060
Nikolay: Vertical bloat.

552
00:30:35,900 --> 00:30:42,620
Michael: Like, if you rebuilt your
table now, how much smaller

553
00:30:42,620 --> 00:30:43,480
would it be?

554
00:30:44,180 --> 00:30:47,320
Is kind of how I think of bloat,
the delta between.

555
00:30:47,320 --> 00:30:48,080
How big is it now?

556
00:30:48,080 --> 00:30:49,080
How big would it be if

557
00:30:49,080 --> 00:30:50,240
Nikolay: you think you could...

558
00:30:50,240 --> 00:30:51,340
Well, you're talking about dump
restore, you can check it like

559
00:30:51,340 --> 00:30:56,020
that, or vacuum full, but it won't
remove padding zeros.

560
00:30:56,540 --> 00:30:59,580
Michael: No, so therefore maybe
it doesn't count in that strict

561
00:30:59,580 --> 00:31:03,800
definition, but if you allow for
column reordering in between,

562
00:31:04,140 --> 00:31:05,840
then it suddenly, it counts.

563
00:31:05,840 --> 00:31:08,300
Nikolay: This would be a good feature
for pg_squeeze and unlike

564
00:31:08,300 --> 00:31:11,400
pg_repack I think I think it should
people should not have fears

565
00:31:11,400 --> 00:31:16,320
because if you use logical and
you organize your table like cold

566
00:31:16,320 --> 00:31:16,820
mother.

567
00:31:16,880 --> 00:31:19,220
I don't see problems with same.

568
00:31:19,260 --> 00:31:23,840
Yeah, because it is a pocket works
with substituting real file

569
00:31:25,240 --> 00:31:29,440
nodes and it's like basically substituting
files, it's kind of

570
00:31:29,440 --> 00:31:31,040
a hack, right?

571
00:31:31,240 --> 00:31:36,600
And it sounds scary, while pg_squeeze
is using official API basically,

572
00:31:36,600 --> 00:31:38,040
logical decoding, right?

573
00:31:38,140 --> 00:31:40,760
Which is good because sometimes
pg_repack is not available.

574
00:31:41,260 --> 00:31:44,440
If it's a managed Postgres offering
and they forgot to add pg_repack,

575
00:31:45,060 --> 00:31:49,340
I don't know why they could do
it, but actually I think Supabase

576
00:31:49,340 --> 00:31:50,320
doesn't have pg_repack.

577
00:31:50,900 --> 00:31:51,400
Michael: Interesting.

578
00:31:51,500 --> 00:31:53,200
Nikolay: Yeah, yeah, I think so.

579
00:31:53,400 --> 00:31:57,040
I think I checked last week, we
had a new client come, they are

580
00:31:57,040 --> 00:32:01,560
on Supabase, and I think I checked
and didn't see pg_repack among

581
00:32:01,560 --> 00:32:02,940
supported extensions.

582
00:32:03,320 --> 00:32:08,460
I wonder how Supabase clients
deal with bloat, or maybe they

583
00:32:08,720 --> 00:32:09,740
don't care yet.

584
00:32:09,960 --> 00:32:13,640
But by the way, again, congratulations
with OrioleDB.

585
00:32:14,860 --> 00:32:15,640
It's great.

586
00:32:15,700 --> 00:32:20,340
OrioleDB is a super innovative
thing, which has high chances

587
00:32:20,340 --> 00:32:25,680
to be widely used because it's,
I think it's an Apache license

588
00:32:25,680 --> 00:32:28,940
and the plan is to have it as an
extension for regular Postgres,

589
00:32:29,700 --> 00:32:30,560
which is great.

590
00:32:31,220 --> 00:32:36,180
But it's not yet there because
some changes need to be done in

591
00:32:36,180 --> 00:32:37,180
Postgres core.

592
00:32:38,520 --> 00:32:40,760
But there are chances it will be
done.

593
00:32:40,760 --> 00:32:43,940
Michael: And it's not recommended
for production used yet?

594
00:32:43,940 --> 00:32:47,220
Nikolay: Well yeah, of course,
because basically it requires

595
00:32:47,220 --> 00:32:49,580
patching the original Postgres.

596
00:32:50,140 --> 00:32:51,420
Michael: I think it's more than
that.

597
00:32:51,420 --> 00:32:52,820
I think there's still others.

598
00:32:52,820 --> 00:32:55,940
Nikolay: Yeah, it requires many,
many years to build a reliable

599
00:32:56,160 --> 00:32:56,660
database.

600
00:32:56,980 --> 00:33:00,540
It can be considered a kind of
new database because it's heavily

601
00:33:00,540 --> 00:33:02,580
modified Postgres, right?

602
00:33:02,580 --> 00:33:04,620
Storage layer is heavily modified there.

603
00:33:05,580 --> 00:33:10,220
I can't wait to see, like you know, we discussed it with Korotkov

604
00:33:10,680 --> 00:33:13,480
that branching can be native in Postgres.

605
00:33:14,020 --> 00:33:15,360
This could be huge.

606
00:33:15,860 --> 00:33:18,680
Michael: Yeah, there's a lot of promise there I think, yeah.

607
00:33:18,900 --> 00:33:22,360
Nikolay: But anyway, back to pg_squeeze, I think it's a

608
00:33:22,360 --> 00:33:24,520
good idea if it's supported.

609
00:33:24,520 --> 00:33:27,260
You have many ideas today, but back to practical.

610
00:33:27,260 --> 00:33:30,320
Michael: These are more episodes, these are separate episodes.

611
00:33:31,100 --> 00:33:31,300
Nikolay: Yeah.

612
00:33:31,300 --> 00:33:32,780
Michael: Yeah, back to practical stuff.

613
00:33:32,860 --> 00:33:36,940
There's 1 more thing, So yeah, so I think those are 2 like ways

614
00:33:36,940 --> 00:33:37,580
of doing it.

615
00:33:37,580 --> 00:33:41,820
And I also wondered at what kind of volume are you generally

616
00:33:41,840 --> 00:33:42,180
seeing?

617
00:33:42,180 --> 00:33:46,060
Like what kind of size tables are we talking about before this

618
00:33:46,060 --> 00:33:47,780
starts to make any sense normally?

619
00:33:48,620 --> 00:33:52,560
Nikolay: You know, it's hard to say, because again, like I told

620
00:33:52,560 --> 00:33:57,760
you, even if I have a 1TB table, which is already above the threshold

621
00:33:57,800 --> 00:34:01,640
where partitioning should be used, and I see additional 10% of

622
00:34:01,640 --> 00:34:02,860
bloat, I don't care.

623
00:34:03,120 --> 00:34:06,600
So if I see additional 10% of padding, I don't care.

624
00:34:07,300 --> 00:34:10,960
10% because of engineering time is very expensive, right?

625
00:34:10,960 --> 00:34:11,460
Yeah.

626
00:34:11,740 --> 00:34:13,500
Yeah, I know it's not only about storage.

627
00:34:13,500 --> 00:34:17,580
Like people say it's a storage source, but it's I'm mostly concerned

628
00:34:17,580 --> 00:34:21,060
not about storage, although storage also matters because it affects

629
00:34:21,160 --> 00:34:24,100
backup sizes and replication size and so on.

630
00:34:24,100 --> 00:34:26,540
What matters more for me is state of memory.

631
00:34:27,660 --> 00:34:32,180
And this is like additional spam in the buffer pool and page

632
00:34:32,180 --> 00:34:32,680
cache.

633
00:34:32,980 --> 00:34:33,960
It's not good.

634
00:34:34,460 --> 00:34:39,040
So 10% I wouldn't be bothered, but it's 20% I already kind of

635
00:34:39,140 --> 00:34:40,580
in warning mode, right?

636
00:34:40,760 --> 00:34:44,980
It's 30, 40, it's already, it's worth doing it, right?

637
00:34:45,320 --> 00:34:49,120
And by the way, it's interesting, you know, like I always tell

638
00:34:49,120 --> 00:34:53,580
people these bloat, let's connect topics in very weird way.

639
00:34:54,440 --> 00:34:59,340
So, bloat, regular bloat, when we have dead tuples and they were

640
00:34:59,340 --> 00:35:03,900
cleaned by autovacuum or vacuum, and we have gaps in terms of

641
00:35:03,900 --> 00:35:10,640
tuple, like slots for tuples are not filled and page has only

642
00:35:10,640 --> 00:35:12,740
few tuples and many empty spaces.

643
00:35:12,800 --> 00:35:14,240
This is called blow, right?

644
00:35:14,660 --> 00:35:19,340
So over time Postgres will probably put new tuples there, but

645
00:35:19,340 --> 00:35:23,200
maybe no, because new tuples are coming in different pace and

646
00:35:23,300 --> 00:35:28,260
we end up having too much space used than it could be.

647
00:35:28,260 --> 00:35:32,640
And dump restore would help, or repack, or pg_squeeze, or VACUUM FULL,

648
00:35:32,780 --> 00:35:35,420
which is not good in production, and so on.

649
00:35:35,500 --> 00:35:38,680
So usually how do we understand the bloat?

650
00:35:38,680 --> 00:35:41,280
How do we understand the bloat level?

651
00:35:41,280 --> 00:35:47,780
We use some scripts, some queries
or tools use some queries or

652
00:35:47,780 --> 00:35:55,220
monitoring systems use some queries
but these queries they are

653
00:35:56,000 --> 00:36:00,100
like they sometimes are heavy but
not so heavy as full table

654
00:36:00,100 --> 00:36:04,320
scan would be right still they
are light but they are estimate

655
00:36:04,700 --> 00:36:10,900
estimated yes And you see a bloat
level 40, but it's not 40.

656
00:36:11,760 --> 00:36:14,360
You dump restore and you still
see 40.

657
00:36:14,640 --> 00:36:15,320
How come?

658
00:36:15,960 --> 00:36:19,040
And Interesting, if you take example,
1 of those we discussed,

659
00:36:19,040 --> 00:36:24,300
for example, Boolean int8, Boolean
int8, Boolean int8, repeated

660
00:36:24,300 --> 00:36:30,520
like 10 times, and rinse, and check
bloat level using Estimated

661
00:36:30,520 --> 00:36:34,620
bloat using 1 of those queries
which Actually all originated

662
00:36:34,680 --> 00:36:38,600
from 1 Work, so you will see some
bloat.

663
00:36:38,860 --> 00:36:43,780
I can demonstrate bloat like 50%
It's insane right and you dump

664
00:36:43,780 --> 00:36:46,860
restore you are come for you Pj
repack and this bloat cannot

665
00:36:46,860 --> 00:36:48,620
be eliminated because it's not
bloat.

666
00:36:48,680 --> 00:36:53,720
I think it's very related to mistakes
the errors these scripts

667
00:36:53,720 --> 00:36:56,760
like error related to padding

668
00:36:58,860 --> 00:37:00,880
Michael: You could that could be
fixed right that

669
00:37:00,880 --> 00:37:04,300
Nikolay: could be fixed I tried
I didn't have enough time a few

670
00:37:04,300 --> 00:37:07,480
years ago, and the complexity was
high.

671
00:37:07,740 --> 00:37:12,500
I think, first I need to make sure
my analysis script in postgres_dba

672
00:37:12,500 --> 00:37:15,740
works really right.

673
00:37:16,220 --> 00:37:20,360
I need to check these 4 by 2 byte
padding cases.

674
00:37:20,900 --> 00:37:24,280
And if it's right, I think jumping
from there to those estimate,

675
00:37:24,280 --> 00:37:28,540
bloat estimate queries, I think
it's possible, should be, right?

676
00:37:28,860 --> 00:37:30,560
Michael: You need to know the column
order, but that's...

677
00:37:30,560 --> 00:37:31,740
Nikolay: I know it from practice.

678
00:37:32,080 --> 00:37:35,080
This is how I can easily demonstrate
that you should not trust

679
00:37:35,080 --> 00:37:38,940
blindly those estimated bloat queries.

680
00:37:40,400 --> 00:37:40,900
Yeah.

681
00:37:42,040 --> 00:37:42,540
Michael: Cool.

682
00:37:42,800 --> 00:37:48,220
Last thing, I think Posts on this
normally don't mention indexes,

683
00:37:48,900 --> 00:37:52,860
but 1 good 1 recently did by Renato Massaro.

684
00:37:53,620 --> 00:37:59,320
And I don't think it's as important
as heap for perhaps obvious

685
00:37:59,320 --> 00:38:04,540
reasons, But it did somewhat surprise
me that There are some

686
00:38:04,540 --> 00:38:08,700
somewhat surprising Alignment padding
issues related to indexes

687
00:38:08,720 --> 00:38:10,460
as well, and I thought that was
fascinating

688
00:38:11,260 --> 00:38:17,380
Nikolay: Yeah, but we usually we
cannot change Most cases Changing

689
00:38:17,380 --> 00:38:21,360
column order and yeses just because
of this is not a good idea,

690
00:38:21,360 --> 00:38:24,700
because column order matters in
terms of query performance.

691
00:38:26,000 --> 00:38:29,280
Michael: Yes, so I completely agree.

692
00:38:29,280 --> 00:38:33,460
I think there are some rare cases
where, if you're only doing

693
00:38:33,460 --> 00:38:38,240
equality type, there's some rare
cases where you can you do have

694
00:38:38,240 --> 00:38:42,100
some flexibility on the on at least
a couple of the columns if

695
00:38:42,100 --> 00:38:44,760
you're always including them in
all queries like that kind of

696
00:38:44,760 --> 00:38:49,740
thing but the reason I the reason
I brought it up was I've seen

697
00:38:49,740 --> 00:38:53,860
twice now, including the person
who wrote this article, twice

698
00:38:53,860 --> 00:38:57,740
now people have spent time on something
that then turned out

699
00:38:57,740 --> 00:38:58,680
to be not worth it.

700
00:38:58,680 --> 00:39:04,120
So they had spent time reducing,
like they didn't need a timestamp,

701
00:39:04,180 --> 00:39:06,220
they really only needed the date,
for example.

702
00:39:06,220 --> 00:39:10,580
So they thought, I'm going to, I've got this index on, they only

703
00:39:10,580 --> 00:39:13,940
had a single column index on this timestamp, and I'm going to

704
00:39:13,940 --> 00:39:16,960
create a B-tree, I'm going to create the index again, but only

705
00:39:16,960 --> 00:39:19,120
on the date, like the truncated part of it.

706
00:39:19,120 --> 00:39:22,220
And they spent development time thinking about this, doing this.

707
00:39:22,220 --> 00:39:23,600
Nikolay: And then they stopped.

708
00:39:23,600 --> 00:39:26,380
Well, date is maybe, how much does

709
00:39:26,380 --> 00:39:26,820
it take?

710
00:39:26,820 --> 00:39:27,560
4 bytes?

711
00:39:28,260 --> 00:39:30,640
Michael: It's padded to 8, actually, in BG.

712
00:39:30,900 --> 00:39:32,280
Nikolay: Yeah, you say, like, having

713
00:39:32,280 --> 00:39:38,620
two-column index, 4 and 4 bytes, and just single-column 4 bytes

714
00:39:38,620 --> 00:39:39,740
is kind of same?

715
00:39:40,080 --> 00:39:40,740
Or what?

716
00:39:40,760 --> 00:39:43,520
Michael: Single column 4 bytes is the same as single column 8

717
00:39:43,520 --> 00:39:44,020
bytes.

718
00:39:44,160 --> 00:39:45,520
And that's really surprising.

719
00:39:45,720 --> 00:39:46,920
Nikolay: Yeah, cool.

720
00:39:47,840 --> 00:39:50,840
Michael: So that time was wasted for them, even though it wasn't

721
00:39:50,840 --> 00:39:52,000
a multi-column index.

722
00:39:52,000 --> 00:39:54,840
The padding meant that they didn't gain anything.

723
00:39:54,840 --> 00:39:58,740
Nikolay: So guys have more two-column indexes if it's about 4

724
00:39:58,740 --> 00:39:59,620
byte columns.

725
00:40:01,240 --> 00:40:02,640
Michael: Yeah, maybe you would, yeah.

726
00:40:02,640 --> 00:40:04,620
Nikolay: Or maybe covering indexes as well.

727
00:40:04,700 --> 00:40:06,540
Oh, covering indexes may be different.

728
00:40:07,060 --> 00:40:08,220
You know about including?

729
00:40:09,800 --> 00:40:14,140
If it's a single column 4 byte index and we have another like

730
00:40:14,140 --> 00:40:15,180
passenger, right?

731
00:40:16,300 --> 00:40:16,400
Passenger.

732
00:40:16,400 --> 00:40:18,620
Michael: Yeah, but it only gets stored in the leaf pages.

733
00:40:18,620 --> 00:40:19,800
Nikolay: Yeah, so it's different.

734
00:40:20,580 --> 00:40:22,220
Okay, yeah, that's interesting.

735
00:40:22,540 --> 00:40:24,840
And I didn't go there ever.

736
00:40:25,120 --> 00:40:27,620
This is good that this article raises this topic.

737
00:40:28,480 --> 00:40:28,780
So

738
00:40:28,780 --> 00:40:31,020
Michael: Yeah, I think it's a good thing to be aware of.

739
00:40:31,020 --> 00:40:32,820
Probably not something you should be.

740
00:40:34,360 --> 00:40:37,540
You shouldn't, don't, reorder your indexes based on performance

741
00:40:37,580 --> 00:40:41,880
reasons not based on the storage size it would be my recommendation

742
00:40:42,720 --> 00:40:43,860
Nikolay: I agree yeah

743
00:40:43,860 --> 00:40:47,580
Michael: oh yeah but yeah a really good point that I hadn't realized

744
00:40:47,580 --> 00:40:48,740
until I read this article.

745
00:40:49,540 --> 00:40:50,640
Nikolay: Okay, good stuff.

746
00:40:51,220 --> 00:40:52,780
Thank you for choosing this topic.

747
00:40:54,320 --> 00:40:55,460
Michael: Oh, you're welcome.

748
00:40:55,520 --> 00:40:56,580
We finally got to it.

749
00:40:56,580 --> 00:40:58,180
We've brought it up a few times.

750
00:40:59,340 --> 00:40:59,560
And

751
00:40:59,560 --> 00:41:00,060
yeah.

752
00:41:00,240 --> 00:41:00,740
Nikolay: Okay.

753
00:41:01,220 --> 00:41:02,160
Michael: Have a good week.

754
00:41:02,160 --> 00:41:02,500
Take care.

755
00:41:02,500 --> 00:41:02,860
Bye.

756
00:41:02,860 --> 00:41:04,140
Nikolay: Have a great week.