Data in the Wild

Today we talk to Pierre de Wulf, the technical co-founder of ScrapingBee. He talks to us about how they arrived at 36 different pricing data types, the difference between plans and subscriptions, and more.
Chapters
  • (00:44) - What is ScrapingBee?
  • (03:56) - The evolution of ScrapingBee’s data model
  • (06:40) - Subscription versus plan
  • (10:43) - The importance of reading documentation
  • (20:16) - Pierre’s tall tale about int
  • (22:59) - Benedicte’s tall tale about a 106-year-old getting enrolled into school

Sponsor
This show is brought to you by Xata, the only serverless data platform for PostgreSQL. Develop applications faster knowing your data layer is ready to evolve and scale with your needs.

About the Hosts
Queen Raae wrote her first HTML in 1997 after her Norwegian teachers encouraged her to take the new elective class. 

Around the same time, Captain Ola bought a Macintosh SE for his high school with the proceeds from the school newspaper he started.

These days you’ll find them building web apps live on stream, and doing developer marketing work for clients. They are both passionate about the web as a platform and the joy of creating your own thing.

Visit queen.raae.codes to learn more.

Creators & Guests

Host
Benedicte (Queen) Raae 👑
🏴‍☠️ Dev building web apps in public for fun and profit 👑 Helping you get the most out of @GatsbyJS📺 Streams every Thursday: https://t.co/xaLy43cqMI
Host
Ola Vea
A piraty dev who also help devs stop wrecking their skill-builder-ship ⛵. Dev at (https://t.co/8m50kyT981) & POW! w/👑 @raae & Pirate Princess Lillian (8) 🥳🏴‍☠️
Editor
Krista Melgarejo
Marketing & Podcasts at @userlist | Originally trained in science but happily doing other stuff in SaaS & tech now
Guest
Pierre de Wulf
Bootstrapped @ScrapingBee to $millions ARR. Sharing what I'm learning about growth, SEO & tech. And sometimes dumb jokes.

What is Data in the Wild?

Learn from your favorite indie hackers as they share hard-earned lessons and tall tales from their data model journeys!

Brought to you by Xata — the only serverless data platform for PostgreSQL.

[00:00:00] Ola: Welcome to Data in the Wild! Discover data model tips and tricks used by fab indie hacker devs. With Queen Raae, I'm your co-host, Captain Ola Vea, and this podcast is brought to you by Xata, the serverless data platform for modern web apps!

[00:00:22] Benedicte: And today's guest is the great and powerful Pierre de Wulf, founder of ScrapingBee. Welcome to the show, Pierre.

[00:00:29] Pierre: Thank you for having me. So glad to be here.

[00:00:31] Benedicte: We are super excited to be learning from your journey.

[00:00:35] So, you know, the listener doesn't have to make the same mistakes, or maybe can just learn from what you've done right.

[00:00:44] But before we begin, could you just quickly tell us what problem ScrapingBee solves?

[00:00:49] Pierre: ScrapingBee quickly, it's a web scraping API. So you send us a URL, we give you the HTML, no matter what. At least we try.

[00:01:00] So we solve two big problems. One is proxy management. You know, if you're about to do web scraping, you will get blocked because too many requests will come from your own IP address. So you will need to buy some proxies, and it's very hard to do it at scale because there are so many providers, so many different qualities.

[00:01:21] So we do this. We also do the headless Chrome management. So, let's say you want to scrape a single page application, for example. If you do it on your side, you'll only get maybe some simple HTML and lots of JavaScript that will never get executed. So going through ScrapingBee, we will run your request on a real Chrome browser.

[00:01:46] So the whole HTML is correctly loaded, and then we'll send you back the result.

[00:01:51] Benedicte: Yeah. I've tested that. I was trying to scrape the Crowdcast website, and that is a single page application. And then I had to wait for some elements to render before I could get kind of the rest of the information.

[00:02:06] Pierre: Exactly. Exactly.

[00:02:07] For this use case, ScrapingBee works very well. Yeah.

[00:02:12] Benedicte: Absolutely.

[00:02:13] Ola: Before we get into your experience with data modeling for ScrapingBee, could you quickly run through your tech stack?

[00:02:20] Pierre: Yeah, sure. So our tech stack is very classic. So we run on Python, our web app.

[00:02:28] So the dashboards or subscription management, like when you log in, it's all Flask, the Python web framework. For the web scraping part, we have many different layers. We have some parts running on Puppeteer, which is the tool that allows you to, it's an API to control Chrome, basically.

[00:02:52] So we can say in a script, okay, open that page, click on that element, use that proxy, and so on. So we use Puppeteer. We also use Docker, where we manage all our Chrome instances. So Chrome instances managed in Docker and controlled through Puppeteer. As a database, we use Redis for the cache, for everything cache-related. For example, in our case, everything related to your usage metrics. You know, since you can have quite a big throughput using ScrapingBee, we cache all your usage data on Redis.

[00:03:36] And from time to time, we save it to Postgres. And so Postgres, Redis, Python, Puppeteer, AWS Lambda, Docker, and some bare metal servers here and there. So it's definitely not in the right order, but this is our tech stack.
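The pattern Pierre describes, counting usage in a fast store and periodically saving it to the database, can be sketched roughly as follows. This is only an illustration, not ScrapingBee's code: the `UsageBuffer` class, the user IDs, and the credit numbers are hypothetical, and plain in-memory Python objects stand in for Redis and Postgres.

```python
from collections import Counter


class UsageBuffer:
    """In-memory stand-in for a fast counter store such as Redis (assumption)."""

    def __init__(self):
        self._pending = Counter()

    def record(self, user_id, credits=1):
        # Hot path: one cheap increment per API call.
        self._pending[user_id] += credits

    def flush(self, durable_totals):
        """Move buffered counts into the durable store; return rows written."""
        written = 0
        for user_id, credits in self._pending.items():
            durable_totals[user_id] = durable_totals.get(user_id, 0) + credits
            written += 1
        self._pending.clear()
        return written


durable_totals = {}  # stand-in for a Postgres usage table (assumption)
buffer = UsageBuffer()
for _ in range(1000):
    buffer.record("user-1")
buffer.record("user-2", credits=5)

# Run "from time to time": 1001 increments become 2 durable writes.
rows_written = buffer.flush(durable_totals)
```

The point of the design is that the durable store sees one write per user per flush instead of one write per request.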

[00:03:54] Benedicte: Yeah, that was great. That was a great run through.

[00:03:56] So since the focus of this show is, you know, data models, what data model has changed the least for you since you launched ScrapingBee?

[00:04:05] Pierre: I'd say the one that changed the least was the user model, because we're a SaaS, and a user for us is quite simple: it's an email address, a password, a hashed password, of course. Some address field, ID, whatever. Very boring, very classical. So yeah, that's the one that changed the least.

[00:04:29] Benedicte: Yeah. Have you had any one that's changed a lot?

[00:04:32] Pierre: I think there are two things we had to update quite a bit. The first one was the subscription model. So we're a SaaS, so every user has a subscription linked to a plan, you know? And at first, because we wanted to go live very quickly, we didn't put a lot of care into it. So basically everything was on the user table, you know, like the plans, the subscription and all, and it was a mess.

[00:04:59] So one thing we couldn't do, for example, was to say, okay, let's change the number of credits of this plan for all users. What should have been a very simple update on the plan table was a very complicated Postgres query on the user table. So we did two or three versions where we cleanly split the data.

[00:05:26] User, subscription, plan, allowing lots of flexibility. For example, it allows us to very easily create custom plans, you know, for some users. So some users will say, "Okay, I want lots of credits, but not that much concurrency, and this feature, but not this one," you know, and we can do this for customers ready to pay the price.
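A minimal sketch of that three-way split, assuming hypothetical table and column names (`users`, `plans`, `subscriptions`, `monthly_credits`, `max_concurrency`) and made-up values. SQLite is used here only so the example runs on its own; the episode's database is Postgres.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL UNIQUE);
    CREATE TABLE plans (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        monthly_credits INTEGER NOT NULL,
        max_concurrency INTEGER NOT NULL
    );
    CREATE TABLE subscriptions (
        id INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL REFERENCES users(id),
        plan_id INTEGER NOT NULL REFERENCES plans(id)
    );
    INSERT INTO users (id, email) VALUES (1, 'a@example.com'), (2, 'b@example.com');
    INSERT INTO plans (id, name, monthly_credits, max_concurrency)
        VALUES (1, 'hobby', 1000, 5);
    INSERT INTO subscriptions (user_id, plan_id) VALUES (1, 1), (2, 1);
    """
)

# Changing the credits for everyone on a plan is now a one-row update
# on the plans table instead of a rewrite of many user rows.
conn.execute("UPDATE plans SET monthly_credits = ? WHERE name = ?", (2000, "hobby"))

credits_by_user = dict(
    conn.execute(
        """
        SELECT u.email, p.monthly_credits
        FROM users u
        JOIN subscriptions s ON s.user_id = u.id
        JOIN plans p ON p.id = s.plan_id
        ORDER BY u.email
        """
    ).fetchall()
)
```

A custom plan for a single customer is then just one more row in `plans` with its own credit and concurrency numbers.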

[00:05:50] That's thanks to this data model, which we tried to spend a lot of time on. And we've also spent a lot of time on the analytics data model. So while some people would use things like ClickHouse, you know, or a time series database,

[00:06:12] we've made the choice to do everything in Postgres for, you know, simplicity. But we've had to create some table structures that are efficient both in insertion and querying, because we do have, I don't know, millions of rows per hour.

[00:06:32] So that's a data model, data performance issue we've had to spend a bit of time on.

[00:06:40] Benedicte: So going back to the subscription and plan thing, what is the difference between a subscription and a plan?

[00:06:47] Pierre: So a plan would be, I don't know, if you go to, you know, whatever SaaS, you will have like a hobby plan, a free plan, a pro plan, you know, so it's like 0, 49, 100.

[00:07:01] A subscription would be something that links you and the plan. And you can add some data to the subscription such as, okay, what is a renewal date? Is it a monthly subscription or an annual subscription? Do you want me to cancel after a few days? Do you want me to auto renew it if you run out of credit?

[00:07:23] And so those are things that need to be clearly split because, for example, if you do some pricing update, which you will do a lot if you create a SaaS, you want to be able to say, "Okay, people who subscribed to this plan can keep their subscription, and we're creating a new plan. So the old plan will be unbuyable for a new customer, but old customers having a subscription to this plan will keep it as is. But now let's replace this plan by a new one."

[00:07:57] So very simply, let's say you have a pricing page. You know, you will list all your plans, you know, but you can have a flag. Let's call it, you know, buyable.

[00:08:08] And so for the plans you want to remove from your product, you can set buyable to false and no one will be able to buy them when they go on the pricing page. And it allows you to let all users still have their subscription linked to this one.
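The plan versus subscription distinction and the buyable flag can be sketched like this. The class names, field names, plan names, and prices are hypothetical; only the idea of a `buyable` flag comes from the conversation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    name: str
    monthly_price: int
    buyable: bool  # False = retired, hidden from the pricing page


@dataclass(frozen=True)
class Subscription:
    user_email: str
    plan: Plan
    billing_period: str  # "monthly" or "annual"


def pricing_page(plans):
    """Only buyable plans are offered to new customers."""
    return [plan.name for plan in plans if plan.buyable]


hobby = Plan("hobby", 0, buyable=True)
old_pro = Plan("pro-old", 49, buyable=False)
new_pro = Plan("pro", 59, buyable=True)
plans = [hobby, old_pro, new_pro]

# An existing customer keeps the retired plan through their subscription.
legacy = Subscription("early@example.com", old_pro, "monthly")

visible = pricing_page(plans)
```

Retiring a plan never touches existing subscriptions; it only changes what the pricing page lists.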

[00:08:23] Benedicte: And you said, like if you're making a SaaS, you do want to be able to experiment with pricing and plans. Yeah.

[00:08:31] Pierre: And it's hell to migrate, because you need to deal with some Stripe ID, invoice ID, subscription ID, plan ID. You will have users telling you, "Okay, I want to get back my invoice," or, "I want to switch from this plan to this one." It's basically edge cases everywhere. So yeah, try to have a rock solid data model.

[00:08:59] And if I had to do it all over again today, I'd definitely spend more time on it. Like maybe read a full tutorial or full documentation about this.

[00:09:12] Ola: So if you could travel back in time, what would you undo about this data model? What part would you undo?

[00:09:21] Pierre: Yeah. So today we've finally managed to clean everything, but what I would do would be to add an invoice model, you know, to manage invoices on our end, which is something I added last week and which is, yeah, very important.

[00:09:40] Like, you know, when someone signs in to your product, you want to be able to say, "Okay, this person has 12 unpaid invoices, so maybe we should do something about it," you know?
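A small sketch of what keeping invoices on your own side makes possible. The `Invoice` fields and the sample data are hypothetical, not ScrapingBee's or Stripe's actual schema.

```python
from dataclasses import dataclass


@dataclass
class Invoice:
    subscription_id: int
    amount_cents: int
    paid: bool


def unpaid_count(invoices, subscription_id):
    """Number of unpaid invoices for one subscription."""
    return sum(
        1
        for invoice in invoices
        if invoice.subscription_id == subscription_id and not invoice.paid
    )


invoices = [
    Invoice(subscription_id=1, amount_cents=4900, paid=True),
    Invoice(subscription_id=1, amount_cents=4900, paid=False),
    Invoice(subscription_id=1, amount_cents=4900, paid=False),
    Invoice(subscription_id=2, amount_cents=9900, paid=True),
]

# A check like this could run at sign-in to flag accounts that need attention.
overdue_for_first = unpaid_count(invoices, subscription_id=1)
overdue_for_second = unpaid_count(invoices, subscription_id=2)
```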

[00:09:54] Benedicte: Maybe they should be cut off.

[00:09:56] Pierre: Yeah, maybe. No, no, just an idea. So yeah.

[00:10:01] Ola: So what time would you travel back to then to undo that?

[00:10:04] Pierre: Four years ago, because ScrapingBee was our second SaaS. So we built a first SaaS, which failed. We reused the SaaS codebase to build the second one, so we scrapped all the feature code, but we reused the whole subscription management code. And yeah, definitely, we did that out of sheer opportunity, but in hindsight, it's always easy to say.

[00:10:31] So at that time we had 600 euros left on the bank account. The company bank account. So yeah, clean data model was definitely not our priority.

[00:10:43] Benedicte: But you said something interesting about, you would have maybe looked at a proper tutorial or documentation. Is that what you, yeah.

[00:10:50] Do you know of any? Like, do you know of any good ones?

[00:10:55] Pierre: So no. Because, and that's my mistake, I haven't read one. But we use Chargebee, which is a subscription management platform. So basically you have Stripe, which manages the payment, and Chargebee manages all the subscriptions, invoices, refunds, and all. Lots of Chargebee features are now part of Stripe.

[00:11:18] But yeah, we haven't had the courage to migrate. And they have great documentation, so I'd probably, you know, look there. You know, when you're in a hurry, sometimes you go straight to the documentation or the API specification. Maybe this time is a good time to read the whole 10-page getting started guide.

[00:11:41] Benedicte: Yeah. The guide.

[00:11:42] Pierre: Yeah. The guide, you know. The 10-page guide you never want to read. Well, maybe for this one time you should read it.

[00:11:51] Benedicte: I think that's a cool insight, because every SaaS and almost every product getting paid should have, like, an almost identical data model. And it's been done so many times before. And I think everyone I've ever talked to has had the same pain points where it's like, "we, you know, we want to change plans."

[00:12:11] And that's one of the things that you maybe don't know when you get started, because you said, "Oh, we also offer custom plans." I think people, when they create a SaaS, think, "Oh, I will only have the plans that I offer."

[00:12:26] But then you get maybe enterprise customers and they want something specific and you want to make a plan for them.

[00:12:32] So just to, like, look at the scale of that, how many plans do you have now, after, is it four years you've been live? Like how many plans do you think you have?

[00:12:44] Pierre: So we have four plans you can buy today, but in total we have 40 plans in the whole table.

[00:12:52] Benedicte: And how many of those are custom?

[00:12:54] Pierre: Yeah, custom. And we used to create custom plans for, you know, 10 customers who didn't have a lot of money. We don't do that anymore, but we, yeah, we used to create a lot of custom things.

[00:13:08] And this just stays in the database, you know, as a proof of your past mistakes.

[00:13:15] Benedicte: Or your past experimentation, you know, that's probably what helped you get to the plans you have today without all of those 36 plans that are now non buyable.

[00:13:27] Pierre: Yeah, but it's cool because, you know, it's like your first commit messages. You know, the names of those plans, they're getting more and more ridiculous. It's like hobby old, V2 old, real old, trash, new enterprise, new big one, business plus plus.

[00:13:47] And then you have to be careful so that these technical names don't appear on the invoice of the customer, because then, yeah, it doesn't look very serious. You get a PDF invoice for "hobby old." You shouldn't have done that, you know.

[00:14:04] Benedicte: That's some good insights right there.

[00:14:06] Ola: But so would you say that your tip to somebody who's starting their SaaS today is to read the documentation of the subscription thing?

[00:14:18] Was that what you meant? Yeah.

[00:14:20] Pierre: Yeah. Maybe I could have just said that. But yeah, basically that's what I'm saying. And maybe, you know, with Stripe, for example, you can try to have a data model on your end that matches Stripe's as closely as possible. Then it will make everything easier. So it's quite easy to reverse engineer Stripe's data model because it's all, you know, a subscription has an ID, has a user ID, an invoice has a subscription ID, you know. And so if your data model should match something, it should be your PSP's or subscription management software's data model.

[00:15:01] Ola: Instead of like making your own.

[00:15:03] Pierre: Yeah, exactly. Instead of making your own.

[00:15:06] Benedicte: And I see in some forums that people are like, "Oh, the Stripe one is so complex. Like, why do we need all of this? We're just gonna, you know, we're just going to do this or do X." And you know, you don't always have time to jump in, but I want to jump in and be like, there's a reason they've done this, they've done this quite a lot.

[00:15:25] There is a reason to the madness.

[00:15:28] Pierre: You have to wonder like, will my product pricing change a lot? And will my product feature for some user change a lot? So maybe if you're selling an e book, you don't have to have a particular, I don't think you'll have any data model at all, but for the sake of the example, If you're selling just a PDF, yeah, so the feature will probably always be the same for everyone.

[00:15:54] And if you want to update the price, it's probably going to be very easy. So maybe in this case, you won't have to, but if you're selling a SaaS with subscriptions, yeah, definitely. Lots of edge cases, lots of different offerings. I don't think a simple SaaS subscription exists.

[00:16:13] Benedicte: That is our quote of the day: there is no simple SaaS subscription model.

[00:16:19] Was that the, I'll go back and listen to exactly what you said.

[00:16:22] But I want to go back to, you were talking about analytics. And you worked a lot on the analytics model.

[00:16:29] Like who uses the analytics? Is that for the user? Is that for you internally?

[00:16:33] Pierre: So we have a small part shown to the user.

[00:16:37] So basically whatever happened in the last 30 days, but because our tables are so slow, we have to trim some requests. And so basically the full tables are only available for admin purposes.

[00:16:53] So as a user, you might not see everything you did using ScrapingBee, but on our side, we can see everything you scraped down to the domain name granularity. So not the URL granularity. And down to the day granularity.

[00:17:10] So we have, yeah, two years of history. I would have loved to have more, and this is why, with Etienne, we're currently trying to see if we can migrate to something more appropriate, a time series database, to have something down to the hour or minute.

[00:17:29] Benedicte: Yeah. 'Cause, so you're logging every request that I would make to ScrapingBee and all the information that comes with that.

[00:17:37] Do you also store the result that you send me back or just the request?

[00:17:41] Pierre: No. So we use Datadog, you know, to monitor all the logs. On Datadog, we store everything except the body of the result, but we only have 15 days of history.

[00:17:57] On our side, on our Postgres table, we store basically: domain name, status code, credit cost, and of course, user name, well, user, for, yeah, forever and ever.

[00:18:13] So, it's like we have two things: one short-term, very exhaustive database, Datadog, and one long-term but less exhaustive. And the end game would be to have everything, everywhere, all at once.
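The long-term table Pierre describes keeps only user, domain name, status code, and credit cost, at day granularity. A rough sketch of that kind of rollup follows; the field names, URLs, and numbers are hypothetical sample data, and a real pipeline would write these rows to the database rather than keep them in a dict.

```python
from collections import defaultdict
from datetime import datetime
from urllib.parse import urlsplit


def rollup(requests):
    """Aggregate raw request logs into (user, domain, day, status) rows."""
    rows = defaultdict(lambda: {"requests": 0, "credits": 0})
    for req in requests:
        domain = urlsplit(req["url"]).hostname  # drop path and query
        day = req["at"].date().isoformat()      # drop time of day
        key = (req["user"], domain, day, req["status"])
        rows[key]["requests"] += 1
        rows[key]["credits"] += req["credits"]
    return dict(rows)


raw_requests = [
    {"user": "u1", "url": "https://example.com/a", "status": 200,
     "credits": 5, "at": datetime(2024, 1, 1, 9, 30)},
    {"user": "u1", "url": "https://example.com/b?page=2", "status": 200,
     "credits": 5, "at": datetime(2024, 1, 1, 17, 45)},
    {"user": "u1", "url": "https://example.org/", "status": 500,
     "credits": 0, "at": datetime(2024, 1, 2, 8, 0)},
]

daily_rows = rollup(raw_requests)
```

Dropping the full URL and the time of day is what keeps the long-term table small, at the cost of the hour or minute detail mentioned above.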

[00:18:28] Ola: Yeah.

[00:18:31] Benedicte: But the limitations, I guess, is because Postgres is really good for relational models, but not so good for kind of searching and aggregating, right?

[00:18:41] Pierre: So, well, it's good up to a certain point, you know. And if you need time series, I mean, I know some PostgreSQL purists are like, "No, you can do everything with PostgreSQL, you just don't know how to use it. It's incredibly powerful. You don't need it right now," which I can understand, because I've seen some PostgreSQL databases scale to hundreds of millions of logs.

[00:19:10] But we don't know how to do it. And so we'd rather use some, you know, some other tool, made for that.

[00:19:18] Benedicte: Yeah, cool. And I guess that's where the Xata plug comes in, because we'd say that you get the Postgres database, but then you also get the kind of Elasticsearch version of your Postgres database, so that you can do aggregations and summations and all of those, not on the Postgres database, but on the kind of engine.

[00:19:37] Yeah. Don't correct me if I don't have a hundred percent correct terminology there. But that's what makes it fun for me, because I do not have that expertise in, you know, Postgres and how you create a perfectly searchable database.

[00:19:54] Pierre: Me neither. Me neither.

[00:19:56] To be honest, I learned everything on the fly, you know. And since Postgres is one of the most well-known databases, there are lots of resources, and it's also very old.

[00:20:10] So it's very, very stable. But yeah, definitely an interesting topic.

[00:20:16] Ola: So do you have a tall tale for us

[00:20:19] Benedicte: about data in the wild?

[00:20:21] Pierre: A tall tale? Ah, yeah.

[00:20:22] I remember at my last job, we used to have, you know, simple stuff. A very big table with one ID column. So the ID was an int. So, you know, you have to type the column: int, string, whatever.

[00:20:41] So int, so I think it's a 64 bytes integer, and it can go up to 2,500,000,000. Which was supposed to be enough because the company had, I don't know, 200 million, you know. But then one developer had an issue where he had a conflicting index, you know. Conflicting ID, unique ID. So in Postgres, you can say, "okay, this ID will be a sequence of integers."

[00:21:16] He messed something up and then two tables used the same sequence. And then Postgres said, "oh no, you're trying to create two things with the same ID." He was like, "okay, let's bump the ID count by one billion. I'll be, you know, I'll be less annoyed. It will fix everything."

[00:21:36] So he bumped the index count by one billion, but then the number was too big for an integer, you know. So what happened is it rolled back and, yeah, it was a mess. The whole production database was blocked for six hours because we had to understand why we no longer had free ints to use as a unique ID. And everyone had to migrate the entire table to BigInt during the night.

[00:22:05] So yeah, conclusion, use BigInt. No matter what, use BigInt.
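For reference, PostgreSQL's `integer` type is a 4-byte signed integer with a maximum of 2,147,483,647, while `bigint` is 8 bytes with a maximum of 9,223,372,036,854,775,807. The arithmetic behind the story can be sketched as follows; the starting ID of 1.5 billion is a hypothetical value chosen for illustration, not a number from the episode.

```python
INT4_MAX = 2**31 - 1  # PostgreSQL integer: 2,147,483,647
INT8_MAX = 2**63 - 1  # PostgreSQL bigint: 9,223,372,036,854,775,807


def fits(value, max_value):
    """True if value can be stored in a column with this maximum."""
    return value <= max_value


current_id = 1_500_000_000              # hypothetical sequence position
bumped_id = current_id + 1_000_000_000  # the "bump by one billion" fix

fits_in_integer = fits(bumped_id, INT4_MAX)
fits_in_bigint = fits(bumped_id, INT8_MAX)
headroom_factor = INT8_MAX // INT4_MAX  # how many times larger bigint's range is
```

With these numbers the bumped ID no longer fits in an `integer` column but fits comfortably in a `bigint` one.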

[00:22:09] Ola: So bigInt, what is that?

[00:22:11] Pierre: So BigInt is like Integer, but way bigger. So it's Integer is big. Yeah, it's BigInt, it's like you see an int, it's a BigInt.

[00:22:21] Ola: Okay.

[00:22:23] Pierre: Instead of being able to go up to 2,500,000,000, I think you can go up to a hundred trillion, whatever.

[00:22:32] So no limit at all.

[00:22:35] Ola: That is kind of similar to Monica's tip: don't use a string, use an array of strings when you start out.

[00:22:43] Benedicte: It's very little downside and very much upside. And it's the same thing with int and big int. There's like no downside in using big int and you're going to need it at some point.

[00:22:52] Pierre: Exactly. Especially if someone tried to manually mess up the sequence of the ID.

[00:22:59] Benedicte: Oh, I have a tall tale now. Can I tell a tall tale, Ola?

[00:23:02] Ola: Oh man. Yes. Yeah.

[00:23:04] Benedicte: Yeah. So this is not anything I was a part of, but this was in the news in Norway, because they were storing the year that people were born with just two digits for the enrollment, for when you are enrolled in school.

[00:23:22] So when you turn six, the parents get a letter that you are now enrolled in school, and you have to, like, say yes or let them know that you're going to go to another school. And this 106-year-old got enrolled in her local school. And it's like, what did they save by not storing four digits instead of two? Like, very little.

[00:23:47] But I guess their assumption was that people don't live that long. And it rarely happens. I mean, it's an edge case, so it's fine, but it was a great news story, this, like, 106-year-old with her letter of school enrollment.
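The two-digit year problem can be reproduced in a few lines. The expansion rule below is an assumption about how such a system might guess the century, and the years 1907 and 2013 are hypothetical values picked to match the ages in the story.

```python
def age_from_two_digit_year(yy, current_year):
    """Naive expansion: assume the most recent year ending in yy."""
    century = current_year - current_year % 100
    birth_year = century + yy
    if birth_year > current_year:
        birth_year -= 100
    return current_year - birth_year


# Hypothetical person born in 1907, processed in 2013 with only "07" stored.
true_age = 2013 - 1907
computed_age = age_from_two_digit_year(7, 2013)
```

With only two digits stored, 1907 and 2007 are indistinguishable, so the 106-year-old looks exactly like a six-year-old due for enrollment.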

[00:24:01] Pierre: Yeah, maybe at 160, this person will get twice their retirement, you know?

[00:24:07] Ola: Yeah!

[00:24:08] Benedicte: I mean, that would be awesome. She deserves it. If you live to 160, you need to get double your retirement. That would be great.

[00:24:18] I have some friends working on that kind of pension system here in Norway. And we keep joking at parties, like, "Could you not just make, like, a tiny backdoor so all of your friends get a little bit higher?" But they're like, "No, we can't do that."

[00:24:36] I'm like, "really though? Just a little bit?"

[00:24:38] Pierre: Yeah. Are you sure? Yeah, a small one.

[00:24:41] Benedicte: So where can folks find out more about you and ScrapingBee?

[00:24:46] Pierre: So ScrapingBee: www.scrapingbee.com, which should tell you everything there is to know about the product. And for me, I'm quite active on Twitter, @PierreDeWulf, and yeah, I try to share daily the bootstrapper life journey.

[00:25:06] Benedicte: Absolutely. And thank you so much for sharing your data model stories with us today. The listener can check out the podcast description for the links. And I especially enjoyed our conversations around pricing and how you'll most definitely be experimenting with pricing when you're creating a SaaS.

[00:25:25] So make sure that you have a data model to support that.

[00:25:30] Pierre: Thank you for having me. It was awesome.

[00:25:32] Ola: Yeah. Yeah. Thank you.

[00:25:33] And to you listeners, welcome back next week to Data in the Wild, where we discover data model tips and tricks used by our fav indie hacker devs. Okay. Bye now.

[00:25:46] Benedicte: Bye bye.

[00:25:48] Pierre: Bye.