A weekly podcast about all things PostgreSQL
034 JSON
===
Michael: [00:00:00] Hello and welcome to Postgres fm, a weekly show about all things Postgres, girl. I'm Michael, founder of PG Mustard, and this is my co-host Nikolai, founder of Postgres ai. Hey, Nikolai, what are we talking about?
Nikolay: Hi Michael. Jason B
Michael: Ooh, we're going straight for B. Yeah,
Nikolay: B is better.
Michael: Yeah, I found that funny. Is it, is that semi-official or is that just a joke?
Nikolay: Well, I remember it was, uh, when it was presented, was discussed and, uh, binary better. Nobody knows
Michael: maybe B, like number two, A and B. But no, I've never saw that discussed.
Nikolay: this, explanation I never heard of. So I doubt.
Michael: Yeah. So I mean, specifically I guess we're talking about storing Jason in Postgres and the options for doing so and how you can then use it. we've had a couple of requests for similar things in this area, so thank you to the people that have, requested those. Where did you wanna start?
Nikolay: Well, I will start with discussion of first normal form and, uh, atomic values versus non atomic values, and do consider [00:01:00] arrays, for example, a atomic and, uh, Yeah, we definitely should touch history and h Store. Well, arrays are supported for quite long and pocus, both numeric and other types of arrays, like text arrays and so on.
And, uh, multi-dimensional arrays actually. Then H Store, I, I think it appeared first in 2004, very long ago, almost 20 years ago. And, uh, H Store is like key value, and it's, it doesn't support. But it's still very already like flexible so erase and edge store, it already breaks for some understanding already breaks first normal form because, uh, these values are not atomic.
But, uh, actually we can argue that, for example, timestamp is not atomic as well. So let's leave this discussion for, uh, for theory part and focus on practice part, right, for practice. Definitely, I would say these values are very convenient and they are not, [00:02:00] uh, as evil as Nu right , and I mean, they, they can save your time not bringing too much, uh, issues, uh, in terms of unexpected behavior.
And so there are some cases when you have some unexpected behavior and the most interesting. . I remember like 15 years ago, I had a talk in Maryland. It was my first visit to the us and the talk was about how POGS is good for modern web web applications. At that time, web 2.0 was a thing, so my talk called was called like, pogs is good for Web 2.0 I did some benchmarks, uh, for EAV versus Edge.
or erases doesn't matter actually, in this case. Jason, uh, didn't exist that time yet. And actually I was also working on xml. It's yet another, uh, data type, which, which you can say is breaking first normal form, but also provides some flexibility in some cases. I think currently it's [00:03:00] not the most popular format, but still there is some development ongoing.
And, the thing is that EAV is very flexible EV's entity attribute value. When you have three tables with foreign keys between them, optional foreign keys, four keys is not the biggest problem there. The biggest problem there is that you need join to, to join extensively. So some entities, some attribute and, and additional like connection.
Table between them, which contains value, which stores value, so basically it's like your document, if you had originally XML or J Sound document, you shrink to three tables. It's very flexible, supports a lot of possible, documents. To construct document back, you need to join all of them, right?
And, uh, one thing we have, we have storage overhead issue. Usually a lot of more bites are need to be wasted, uh, [00:04:00] to store compared to, for example, text representation or this representation. And second thing, uh, you need to join everything together all the time. And. My very simple explanation to new POCUS users, you cannot have, you cannot have index on two tables.
This means you usually, for best performance, you usually want to have everything in one table. Joints, like joints are good. Pocus, can be very performant in joints, but not always, uh, nothing can beat, uh, index. Only scan, for example.
Michael: Yeah. And when, wait, so just to clarify, when you're saying you can't have an index on two tables, we mean you, if you want a multi column index, you can't pick two columns that are from separate tables.
Nikolay: Right. So eav approach is, uh, like I would say, radical normalization approach. You normalize everything. You have ideas everywhere, and then, you need to index. and then [00:05:00] you need to, to rely on join, algorithms. We have three, right? And, uh, compared to the normalized state, when you probably repeat some keys and so on, you have worse performance usually.
And so both, uh, storage overhead and worst performance to extract whole document. And, it usually means you would, you, you need to avoid eav. in most cases,
Michael: Yeah. I think it's worth even saying developer experience is worse. Right? If even to get the simplest. Data out of, if you're having to do multiple joins, it's just more complex as well. But yeah, I agree that those two are the
Nikolay: well develop developers. Uh, specifically I, I've observed several teams who loved the flexibility. It's like with star scheme of foreign databases, they, they love flexibility. They can quite quickly develop obstruction in their language of choice.
Michael: Yep.
Nikolay: uh, be very flexible, very good. But then at some point when we, we have billion of rows, we [00:06:00] already can, like, feel trouble with this approach.
So, uh, my, my talk 15 years ago, I, I was trying to prove that H Store or XML or arrays is better. So you need to go generalized way and rely on, on, uh, flexible indexing, uh, capabilities. POGS provides, for example, G Gin is the. Very flexible, uh, way to index complex data types. and then JSON appeared I think in 2014, 13, first
Michael: I've I've written it down. Yeah. 9.2 is 2012, but that was regular J that was Jason. without b, the,
Nikolay: Right? Syn syntax, sugar, j o let's call it. So,
Michael: Yeah, well actually if we're sticking to history, let, should we give a quick rundown of what that included so we, we could store Jason. It did some validation, but it was like a, it was just a blob, it was like a, a string basically.
Nikolay: yes. But uh, Was it so like it was [00:07:00] success even before jsb because, uh, by that time already JSON became a standard defacto for web application, mobile application, most pieces of systems started to communicate using JS O and uh, it was obvious that it's great to support it in database and also.
gave a birth to this movement. Let's eliminate middleware and communi, allow some, maybe some very thin middleware like Postgres. We had episode about it. And so, so front end can communicate with data stores, Postgres data database, uh, through some very thin, middleware, which doesn't have a lot of logic and a lot of logic. For example, react app or it goes to client code or mobile app. A lot of logic is there and the data logic goes to database closure. And it also gave another, wave of interest to, database site programming using PL PPGs scale or PL python and [00:08:00] so on, right? So this is interesting. Indeed. It started roughly 10 years.
Michael: Yeah. Wow. That's, longer than I would've guessed,
Nikolay: Yeah. Yeah. And then a few years later, Jason B was implemented and it has, indexing support, right? So it's binary format, which stores keys in ordered form internally. and, uh, it supports flexible, uh, indexing capabilities. Well, for JS O you also can index using B3 and path, so you can use B3 indexing for some path expression.
So you can create an index expressional index, but it expression, should return. Values, right? It cannot return whole value or you need to convert it to text, but it will be not very good. Basically, you can, this index can, uh, be meaningful only if your query has the same expression as well, right?
And sometimes it's good [00:09:00] because this beatus is very efficient and you can use it in some cases. But for Jason B, you can use gene index with, uh, some fires like, uh, Jason. Pattern ops, uh, and I don't remember, but there is a way to path ups. Yes. Yes. So you can, uh, say, I don't need to remember anything.
I just need keys and values and, uh, in this case, only limited, expressions will be supported, but in the size will be smaller. So you can, you have. You can choose, uh, how to index it. Right? And Gene is great in many cases, not often. There are also issues with Gene sometimes, especially with his fast update option, but it's another story.
So Jason B probably, what do you think, how much of popularity to pauses, can be associated with appearance of Jason or Jason B
Michael: Yeah, I think I mentioned in our, I think we did an episode roughly on this topic of, you know, why is Postgres [00:10:00] so popular? And I, I actually assigned quite a lot of value to this, but not necessarily for. Fully technical reasons. I think there are really good technical reasons for it, which we'll go into soon.
but I think for marketing reasons, I think Mongo DB was doing really well around this time. Not so much for technical reasons, but for marketing reasons and this. as I think we're gonna discuss, this didn't put an end to that debate, but it, it made meant people didn't have to make a decision either, or they could, store some of their, you know, key value data in Postgres and get nearly as good performance, uh, or, or better in some cases.
So, I personally think it was really important for p for the, the trend we're seeing now of people. seeing relational databases as called again, or the de facto again. I think we, we were at risk of losing that or losing traction there, and people growing more towards databases like Mongo.
But yeah, I, I don't know what, [00:11:00] I think it was very important, but I, I'm not sure.
Nikolay: Right, likewise. Uh, I think, uh, it's, uh, important and, uh, it, helped to join two worlds, right? So relational and non-relational world. And, uh, now everyone who wants to stay in Jason World can just have a table with like surrogate. And, and Jason like data, right?
Michael: no, don't give people bad ideas.
Nikolay: it's, uh, but, uh, of course I don't encourage, usually I say let's, let's understand which, columns we can say we know these columns and we know data types and all constraints and, uh, we, we should extract them and store separately and have Jason only for some.
Flexible, not yet understood parts of our data. But, uh, worth mentioning that, uh, some people, for example, elect Baranov mentions that usually in his, [00:12:00] slides on conferences, he shows some graphs that showing that, uh, when Jason B appeared, popularity of pogo started to increase. But if you look closer, actually popularity started to increase slightly earlier.
It's like, it's the same, it was the same year, but Jason was not yet released. It was released in the end of year, but Lar started to go up, earlier this, that year, if was 2013, maybe 12, 13. I don't remember. And as we discussed in the past, we suspect it was related to, uh, RRGs probably, support of POGS in.
Michael: back to Jason, I think before we move on from the comparison of Jason and Jason B I have seen some very smart people making a case for their, there are some use cases for Jason over Jason B. There's a few things it like it. You mentioned ordering. If you really want to store what you were sent, for example, for some.
Jason Suppo, it preserves the order. It preserves white space. It [00:13:00] even preserves duplicate keys. and there's one more advantage I saw that somebody share, which is on just purely on ingest rate. It can, it is faster. Yeah. so if for any reason those are, those are important to you, it can be useful, but it comes at a big cost. In terms of if you need to read the data. So I think for most people they want to access this data in some way in the future, you're probably gonna want Jason B.
Nikolay: Yeah, I think default choice should be Jason B, but in some cases, uh, indeed when we need to faster insert and we don't care about, better check of, uh, better validation in terms of like, uh, duplicate keys and so on. Also regular Jason is to, to takes, uh, roughly two more space, right? ah, no, no, no.
Vice versa, sorry. Yes, yes, yes. Vice versa.
Michael: Unless you count workspace.
Nikolay: Yes. Yes. It takes less space than Jason Viso if like the default choice Jason be. But if you want to optimize on, insert [00:14:00] speed and on storage space, probably like older Jason data that is.
Michael: basically, there are some differences. If, if these kinds of things are important to you, look into it, test it. But, that's why in if you read up on these topics, often people are just defaulting to Jason B and not really even explaining why.
Nikolay: Yeah, well, because right before, because for example, the, speed of accessing particular keys will be slower, especially if you have a lot of keys, nested structure and so on. And Jason b it's easy because it's, it's already passed, but for Jason, it'll be, par pocus will be passing on the fly trying to extract your part of your, of your value.
So it depends on workload. , but, uh, I would say go with Jason, be, and only in some narrow cases. Consider Jason. Jason, regular Jason.
Michael: Cool. Continuing with history. Should we, talk a little bit about, the Jason Path stuff added in it? Was it 20? Uh, it's [00:15:00] only a few years ago. 2019, I think. Version 12.
Nikolay: Right. Not everything was added. Uh, there, there is, first of all, there is a, whole part in SQL Standard right now, similar to Square xml. Now there is aqua JSON and JSON Path. It's a lot of stuff appeared there. Well, before that, I would say, POS already had a lot of manipulation capabilities, like a lot.
And, it's hard to remember all of. For example, you can switch from relational data back and forth, aggregation, accessing particular parts, also changing particular parts. But, it, it was done in, in native pogs, uh, way because no standard existed. I think POGS pioneered a lot of things in this area before even commercial databases.
Right.
Michael: Yep.
Nikolay: when standard appeared, A lot of, efforts were done to develop, uh, support of standard, but, uh, unfortunately not everything was, was [00:16:00] committed to recent positive versions because, uh, it's whole, it's a huge work and, uh, it needs more time for testing and so on. So only parts of that, uh, went to Pogo.
Michael: Yeah, fingers crossed for future versions. I know a lot of people are excited about that. Um, and on the functions front, I think you're right. I think my, I, I don't even try and remember them. I don't use it that much, but my general rule of thumb is, if you want to manipulate your data in some way and transform it in some way, there's probably a function for that.
the docs are great. I'll share a link to them in the show notes. but there's a serious list, so if, if you think something might be possible, check because it, it's probably a function that's already there.
Nikolay: Right. I, I, I worked with Jason a lot in, in Postgres context and not only, and, uh, I always, have, reference, uh, Postgre documentation open because it has a lot of stuff, uh, uh, to, to cover. Also, uh, speaking of, uh, skill, Jason, of course, we need to recommend, uh, lake [00:17:00] Barton of, and others Alexander work, they had a lot of.
Presentations, a lot of slides. I think we can link them and, uh, there is a lot of work there done over the last few years.
Michael: Yeah, a lot in terms of performance and indexing, especially, I believe.
Nikolay: Well, not only Lind Dixon, I mean, uh, the support of SQL J Standard, even if it's not, uh, fully finished. Yes. It started several years ago. It's like, A lot of work go check out. But indexing, yes. The same people also involved into gene support for j b. Indeed,
Michael: Rum or vodka? What? I forgot what it's called.
Nikolay: no, no. RAM vodka, it's different. It's extension of gene, uh, and RAM is, yeah. So the problem with Gene usually is that you have Gene, okay. You, you index your document. For example, Jason, doesn't matter. Maybe something. gin supports various things. For example, arrays as well, right? Or full tech search.
So also gin. [00:18:00] So we index some document and we have huge table with, uh, for example, ID created add something else and this data value indexed by. And then we, we want to have an like, Covered by gin. Supported by gin. But we want to also to have some ation in order by order by. We usually want to have not like traditional approach with Fulk search.
Fulk search. Why? I mentioned full search, because Gene originally was created for it, for full search only. Then it was propagated to complex data types. Uh, this like. R D three, Russian, all three for a race and so on. it's a historical, uh, knowledge, right? So, , usually we ordered by some relevance, , metric.
For example, this document is most relevant for this church and goes to the first place and so on. And this data is like an index, but in social media, uh, use [00:19:00] cases we usually order by time, right? We want.
Fresh data. Go, go. First. All, all data goes last. And ation how to do that. We need to order by either ID or created at timestamp.
And this data is not present in Gene. And this is a huge problem because Postgres now needs to choose what to use either B3 for that historical ordering, right, or uh, June for full search or Jason search. Three grams short. It can be anything. And, uh, how to combine it. The idea, let's combine it is a RAM index.
So, integer or timestamp failure goes to gene, it extends it. And we have RAM index, but RAM index in my test, honestly, I didn't do it in the past two or three years, so I'm, there might be improvements, but in my test in the past I had so big index. [00:20:00] And performance issue. So I decide not to go with it in production every time I tried it.
Maybe today it's better. I don't know. If our listeners know, have fresh data about RAM behavior, please comment, let us know. It's interesting.
Michael: Yeah, that'd be cool. Should we move back to, I feel like we haven't really even discussed when, when it is and isn't a good idea. There's some really good articles about this. my personal favorite is by, team at Cyber Tech. I think Lawrence, goes into great detail on a lot of, actually focuses mostly on mistakes.
He sees people. doing so, as we said, very popular people like the idea of the flexibility, but, uh, he, he listed a bunch of, uh, things that they often do that he wouldn't advise, and I think a very sensible suggestions. like for example, the, uh, first thing you mentioned, which is ID and then Jason, and that was it.
Uh, instead of pulling out, , some of the columns that you might.
Nikolay: Right. Classic [00:21:00] example here, if it's some, um, account data, for example, which has email and so on, we, I would definitely put it as normal columns to, for better control, uh, to have, , all constraints in place and to have better data quality. To ensure that everything is like filled, not now, uniqueness, uniqueness and so on.
It can be achieved in JSON because we can have constraint, like check constraints, , using some, uh, expressions over json. But it would be, , much, uh, harder to maintain.
Michael: Yeah, I think that might be an understatement. but yeah, I would not like to maintain a system that was doing that. But yeah, so, uh, constraints are a really good point. Uh, good, like foreign keys or an example, uniqueness. Yeah, so there's so many things that would be, that you're kind of losing out on nice support for,
Nikolay: You know what I, I remember from eczema mal practice, uh, 15 years ago when I worked on it for pocus. Uh, [00:22:00] of course there was a, a ways, several ways to, ensure some like semi-structured scheme of for, for EXL values. And for Jason, we don't have it. Of, of course we can use some check constraints as I mentioned.
But there is interesting project, I haven't, I only know about it. I haven't looked at it, from supervisor. It's called p. So there is something to ensure to, to have in place some schema for JS O and ensure that, uh, our values are matched against this, uh, the rules defined in this schema. So some, something interesting also.
But for account data, for the data like billing data, I would, I would never put the most important fields to Jason. I would, I would keep them in relational part of.
Michael: Yeah, I've got a few more kind of reasons here as to why. Like, uh, I think, I think it's worth saying that for, for example, if we, want to update some of the data in there, it's a hugely more expensive for, [00:23:00] for some technical reasons. Like, especially if it's, if it's been toasted,
Nikolay: Exactly the Toasting. It's not only about updating, but upfronting, it's uh, Question, but also for reading as well and all benchmarks which deal with like, which check, Jason values of different size. We can observe that when we achieve, thrust point. Two kilobytes per, per, or it's, it's, it can be controlled, right?
In this case, uh, degradation happens immediately. And, uh, then toasting adds huge overhead for both reading and, uh, modifications. But there is some work. I, I don't think it was finished and, uh, already committed. I think, no, there is some work, to improve the update speed, touching only those chunks of toast, which really need to be.
so this would be interesting improvement, but I think it's not yet finished.
Michael: Yeah, that'd be cool. But I've, I'm not sure how that would work, but that'd be very cool. the other, th a couple of other kind of more, basic points I think are quite [00:24:00] important are, Well, actually Lawrence mentions joining. If you want to join on one of the columns, for example, that is in the Jason, that's obviously gonna be a, a challenge.
but the, I think the more important part for me coming from the performance side is there's no statistics on the data in within
Nikolay: Yeah. That's why joining can be not working well because of lack of statistics for internal data. Right.
Michael: Yeah. So it's really difficult for the planner to make good
decision.
Nikolay: suspect we can co collect it No. Using create statistic and expression. No, I,
Michael: Can you, you can do that.
Nikolay: Why not
Michael: That would be cool,
Nikolay: Well, well,
Michael: but, but it's not automatic. Right? You'd have to remember, you'd have to do that on a, on a manual basis.
Nikolay: Right, but why not? I, it's just some fresh, uh, question came to my mind. I, I, I actually never needed it because I, I was very careful usually with Jason, Jason B values, uh, keeping only really flexible data there. So, and also worth mentioning, you [00:25:00] might, need to work with Jason even not storing any of Jason, right.
He just, provided as, uh, for example, in case if you use Postgres, you do, you do it all the time in implicitly. You might, uh, do it explicitly if you have some queries. So relational data in stored, but, Jason B or Jason as output,
Michael: Oh, like casting to it or like, or using a function, I guess Creating it as I never thought of it as. Yeah.
Nikolay: You take it as input and then you pars and store it. As, as a relation, actually there's some, uh, direction of research, , called updateable views, predictable XML views over relational data type. The relational data, I guess, uh, it could be done with for Jason as well. I'm not sure.
It's very PR practical. Interesting. Practically interesting. But, uh, yeah, so you can have, uh, like layers, so relational data stored, and, uh, Jason has communication language.[00:26:00]
Michael: Yeah, true. And I think what you said just then is that it is where a lot of people land with this is you're very sensible and stored a lot of it in re regular columns, but then, The kind of, the more you said the flexible part, the, the other way I've heard that described is fields where the vast majority of your roads would be.
No. Like if it's, uh, I think, a common example given is like a, an e-commerce store where you might want to store some data for some types of products that most products don't have that data for. You want it structured, but it's mostly for most of your objects, it's gonna be null. So you could store maybe five attributes for each item, but they could be different on for different items.
So that, I think that would be a really good
Nikolay: Good. And speaking of nails, uh, you know how a lab nails, of course, it's very important concept in all relational databases, but if you use Jason B set to modify the a part of your Jason value, you should be careful with nails [00:27:00] because if you use, uh, relational now, sequel now. Not relational sql. Now in this case, it'll nullify whole value.
You need to use Jason Bal. So it's like, string now converted to Jason B. In this case you'll be fine and it'll nullify only. A part of your Jason value, which you target it. So, it's, it is a lot of small things to learn when you work with it, but I think, modern development is impossible without Jason.
Right. Any other pieces of advice you have?
Michael: No, you mentioned that the that toast, val, that toast, value that it starts toting things over a threshold. Yes, it can be configurable. That's if anybody wants to look that up, that's the toast Tual target. And there's a really good blog post by a hacky Benita, uh, where I learn about that, about me storing medium size texts
Nikolay: But by the default would if we have Jason values much less in size than uh, two kilobytes, we should be fine.[00:28:00]
Michael: Well, much less, yes, if they're tiny, great. , if they're really big is there's, there can be performance issues, but it's that he actually makes a really good point that it's. , there's really weird, performance penalties for medium size. So things that are just slightly under that two kilobyte limit. so yeah, it's a really good, blog post.
quite long, but I highly recommend reading it cuz it's
Nikolay: Well, and as usual, I recommend, experiment oriented approach. Take your schema, take your workload. If you don't have workload yet, try to invent it. trying to predict the future. It's sometimes possible, and just benchmark, uh, all aspects and pay attention to toasting and, and so on.
Michael: absolutely.
Nikolay: I'm constantly learning, I don't know all video details. This reference of all functions related to JS and jsb, data types and interesting thing that I use usually quite a lot. I use roto Jason. And, Jason ags for aggregation and the interesting [00:29:00] small advice is, uh, that you can order buy inside aggregation.
So you can, uh, have group buy for example, and have Jason Ag and there inside parenthesis you write order buy-in trucks. For
Michael: That's so
cool. I didn't
Nikolay: it's independently working independently from main order by cloud in your select, statement. So it's interesting
Michael: Yeah. What.
Nikolay: Yeah. Well, it's not, sorry. It's, it's about arrays. Then you can order by then for arrays, it's not fully related, but then you can also have indexing, for example, I want, only first. members of this array. ah, a very important thing. new people, uh, sometimes make mistake here.
If you use arrow operator, regular arrow operator for Jason and Jason b values, you'll be getting Jason or Jason B. Values as a result. But sometimes you won't text and, uh, if you. [00:30:00] You will see double quotes if you later converted to text. These double quotes will will be preserved there and it's not good.
In this case. One of the things I wish I knew earlier, right? Double arrow operator will direct the converter to text eliminating double quotes. It's very handy.
Michael: Yeah, that's a great tip. Wow. Some awesome little tips at the end as well. Bonus,
Nikolay: We can write an article. Jason and
Jason be tips.
Michael: yeah, that'd be great. Oh, actually there was one last blog post that I wanted to, so there's a, a couple of good ones. One by scale grid that I'll share and one by the HEAP team, that I'll share as well. that were good and a bit of further reading
Nikolay: Yeah, and, and we should definitely share benchmark results from our Barton of stock talks. , last was maybe a year ago or something. . there is a comparison with MongoDB there also. Interesting. and speaking of MongoDB, there is a new project called, fer db Fer db. Right.
Uh, which attempts to build, Mongo, Debi like database system on top of [00:31:00] pogs, which is interesting.
Michael: Yeah. Is it compatible? Is with Mongo?
Nikolay: I don't know details. I only know that one of the founders is Peter Zeer from Perona. So it's a lot of experience should be brought there and I think it's an interesting project. I checked it yesterday that it looks very live, so a lot of development happening.
So Warf keeping an eye on it for those who have preference in terms of work with Jason documents. I'm curious about checking benchmarks for this new system, it's like open source MongoDB, because, you know, MongoDB is not really open source anymore, right? They push everyone to use Atlas. So this project has good chances to, convert some users to pauses, basically, right?
So it's a.
Michael: yep. wonderful. Well, thank you very much, Nikolai. Thank you everybody for joining us and see you next week.
Nikolay: Thanks, Michael. Bye bye.