The Data Engineering Show

Substack is an amazing — if not the most amazing — content publishing platform out there. Essentially, it allows anyone to become a journalist or to start their own newsletters and charge subscriptions for them. So how did they build a data stack that can support all of their 500K paying subscribers?

Guest: Mike Cohen, Data Engineer at Substack
Hosts: The Data Bros, Eldad and Boaz Farkash, CEO and CPO at Firebolt

What is The Data Engineering Show?

The Data Engineering Show is a podcast for data engineering and BI practitioners to go beyond theory. Learn from the biggest influencers in tech about their practical day-to-day data challenges and solutions in a casual and fun setting.

SEASON 1 DATA BROS
Eldad and Boaz Farkash shared the same stuffed toys growing up as well as a big passion for data. After founding Sisense and building it to become a high-growth analytics unicorn, they moved on to their next venture, Firebolt, a leading high-performance cloud data warehouse.

SEASON 2 DATA BROS
In season 2 Eldad adopted a brilliant new little brother, and with their shared love for query processing, the connection was immediate. After excelling in his M.S. in Computer Science, Benjamin Wagner joined Firebolt to lead its query processing team and is a rising star in the data space.

For inquiries contact tamar@firebolt.io

Boaz: We are very proud and lucky to have Mike Cohen with us today. Mike is a super talented data engineer at Substack. Now, for those of you who don't know Substack, you should! Substack is an amazing — if not the most amazing — content publishing platform out there. Essentially, it allows people like you and me to become journalists or to start our own newsletters and charge subscriptions for them. And Substack has been growing. In February, they reported that they have 500,000 paying subscribers.
Eldad: That's old news.
Boaz: Yeah, that's probably old news by now. Every month that number grows like crazy. They've been getting a lot of media attention. I think that we should consider stopping this podcast and moving to Substack or something. That's the place to be right now. So, Mike has been working in the data space for quite some time prior to Substack, which we'll hear all about. He also spent time at companies like Capax, Venmo, and a lot of other exciting places.
Boaz: At Substack, how much data, in terms of data volumes, do you guys deal with?
Mike: I don't know what big data means when people say “big,” but I think we're in the small to medium data camp still. We're in the tens of terabytes, not in the petabytes or anything like that quite yet. But the data volumes are growing quickly as more and more people come on. We have a lot of event data that we're logging and that's where we're at today.
Boaz: I think the rule of thumb with “big data” is that everybody starts by apologizing that their data could be bigger. “Yeah, we only have hundreds of terabytes. We're not at petabyte scale yet.”
Mike: Yeah.
Boaz: I think that is considered big data. And what's the headcount at Substack these days?
Mike: We just surpassed 40. In the last couple of weeks, we broke through the 40 mark. We're on a big hiring spree. Come check out our jobs page!
We're trying to scale up the team. Fingers crossed, we'd love to be somewhere in the seventies by the end of the year.
Boaz: And how many people deal with data?
Mike: We are a very small team in an already small company. We're a two-person team at the moment. So, it was just me for the first 15 months or so and then I recently brought on someone so we have two of us now since March.
Boaz: So, you’ve grown really fast. In terms of subscribers and visitors, you've also grown dramatically in a little over a year. What does that look like from the retention perspective?
Mike: It's been exhilarating. It's been super fun to watch. And the problems to tackle have grown with the growth of the company in general. Exhilarating is the best way to describe it. Constantly thinking about “this thing that we were doing a week ago — now how do we do it at a much bigger rate and faster pace? How do we design our systems for a couple of weeks and months from now?” Stuff like that. So, it's been super fun.
Boaz: So, let's talk about your data world. Tell us what it looks like and what kinds of things do you do with it.
Mike: We have a Postgres production database and what we call our events pipeline, which is effectively a Kinesis stream of data that gets processed in parts into S3 and subsequently then dumped into Snowflake. Separately, we also have a process that will mirror our data from production into the data warehouse. Other data sources are getting piped in there too. And so, everything ultimately lands there. Then we have our BI tooling set up on top of that. From there we do transformations internal to Snowflake, and then we pipe that back out to places. I think the phrase I've seen a lot of these days is reverse ETL. But we send our derived or transformed data back out to a separate Postgres database such that the data can be accessed in the product with indexes and be super-fast. So that's our high-level structure today.
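[Editor's note: the exact implementation isn't public; the following is a minimal sketch of the date-partitioning step such an events pipeline typically uses when landing records in S3 before the Snowflake load. All names (`s3_key_for_event`, the event fields) are hypothetical.]

```python
from datetime import datetime, timezone

def s3_key_for_event(event: dict, prefix: str = "events") -> str:
    """Build a date-partitioned S3 key for a raw event record.

    Partitioning by event date keeps objects grouped so a downstream
    warehouse load can pick up only new partitions.
    """
    ts = datetime.fromtimestamp(event["timestamp"], tz=timezone.utc)
    return (
        f"{prefix}/year={ts.year}/month={ts.month:02d}/day={ts.day:02d}/"
        f"{event['event_id']}.json"
    )

# Example: an event logged at 2021-05-01 12:00:00 UTC
event = {"event_id": "abc123", "timestamp": 1619870400, "type": "post_view"}
print(s3_key_for_event(event))  # events/year=2021/month=05/day=01/abc123.json
```

A key layout like this also plays well with partition pruning when the files are queried in place.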
Eldad: What about BI? Which BI tools are you using there?
Mike: We use Periscope Data, which was acquired by Sisense.
Boaz: So how much of the data stack at Substack do you consider legacy versus modern? How much has it changed since you joined or is it something that's already built to scale into the future?
Mike: That's a good question. A lot has changed since when I joined. When I first joined, we had very little in terms of BI tooling or any data warehousing. So, all of that was in the last 12 to 15 months and that's gotten us to where we are today. As a company, today we are thinking about how we start to really put the data to work now that it's much easier to work with and accessible, and we have the systems in place to put it back into the product or to do BI and analytics. And then I think after that, there'll be the next chapter of asking “What do we do next? How do we build towards more real-time? How do we build towards faster insights?” And unfortunately, as a small team, we have to take things in little chapters. That's how I kind of think of it. So, I think we're on chapter two now and then chapter three will be about figuring out how to ramp this up even faster and do even more.
Boaz: Who is driving the requirements for BI? With 40 people, is it specific departments or cross-departmental?
Mike: It's a mixture of data exploration work, which can be driven by either the product or data team. We also have other internal teams, whether it's our support team or what we call our partnerships team, and they have data questions. As a data team, we will help them answer those questions by giving them reports that they can monitor in Periscope.
Boaz: So how much of a bottleneck do you end up with? It sounds like you have a lot of supporting to do.
Mike: Yeah. That's one of the reasons for doubling and why we're seeking to double again, hopefully this year. I'd love to end the year with around four people on the team. So it's definitely a factor. And I would say that until somewhat recently, we had more technical users than non-technical users. In other words, more SQL users than non-SQL users. Only very recently has our growth shifted that equation to where we now have more non-SQL users, and so that bottleneck has become more apparent than it once was.
Boaz: What does your morning routine look like? Which tool do you open the most to check in on things every day?
Mike: That's a good question. I check Slack to make sure there's nothing in the data channel that someone has reported or asked about. Then I have about six Periscope dashboards that I look at in order every single day that are basically pinned in one of my Chrome browsers. They're high-level company stuff. And then I have a spam publication detection dashboard where I look for any bad actors and try to handle that, too.
Boaz: That’s an interesting use case. How do you find the bad actors behind the spam publications?
Mike: I can’t answer that. That'll give away my fraud rules and everyone will know how to beat them.
Boaz: Good point, good point. I was just testing you. Okay, let's do what we call a lightning round...
Mike: Okay.
Boaz: So don't overthink. Shoot straight. Let's see what you come up with. Are you ready?
Mike: Yes.
Boaz: Commercial or open source?
Mike: Commercial.
Boaz: Batch or streaming?
Mike: Streaming.
Boaz: Write your own SQL or use a drag-and-drop visualization tool?
Mike: Write my own.
Boaz: Work from home or from the office?
Mike: That one is hard.
Eldad: Both. You can have both.
Mike: Yeah, I think three days at home, two days in the office.
Eldad: Yeah, exactly.
Boaz: There are hybrid models now, so it's legit to say both. AWS, GCP or Azure?
Mike: AWS.
Eldad: So, you can pick one. To DBT or not to DBT?
Mike: Controversial, not to DBT.
Boaz: To delt delete or not to delt delete.
Mike: Not to delt delete.
Boaz: Not to DBT. I think this is the first time we've had somebody say no.
Eldad: This is the first time.
Mike: I know.
Boaz: Let's talk about that.
Eldad: You're probably mistaken. You probably got that answer wrong.
Mike: All your listeners just turn this episode off.
Eldad: We put it in the trailer.
Boaz: So, elaborate on that a little bit. So, what's your take on DBT and why not?
Mike: No.
Eldad: Big no. It was a big no, that's why.
Mike: It's not, it's not. I don't have that strong of an opinion. I've used it, but I have another system that I put together that does similar things and, I think, allows a little more control. Ultimately, with DBT, my understanding is you still need something that schedules and orchestrates the jobs, so I've just created a system that does a lot of that. I don't want to say it's anywhere near as comprehensive as DBT, but it does a lot of that. And it's all based in Airflow, so it's Python-based.
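[Editor's note: Mike's homegrown orchestrator isn't public. As a hedged illustration of the pattern he describes, scheduling SQL transformations so each one runs after its dependencies, here is a minimal dependency-ordered runner; the model names and the `execute` callback are hypothetical, and in practice each task would submit its SQL to Snowflake from an Airflow DAG.]

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Each "model" is a SQL transformation plus the models it depends on,
# roughly what DBT infers from ref() calls.
MODELS = {
    "stg_events": [],
    "stg_subscriptions": [],
    "fct_daily_reads": ["stg_events"],
    "fct_revenue": ["stg_subscriptions", "fct_daily_reads"],
}

def run_order(models):
    """Return an execution order that respects dependencies."""
    return list(TopologicalSorter(models).static_order())

def run_all(models, execute):
    # execute(name) would submit that model's SQL to the warehouse;
    # it's injected here so the orchestration logic stays testable.
    for name in run_order(models):
        execute(name)

ran = []
run_all(MODELS, ran.append)  # dependencies always run before dependents
```

The same topological ordering is what Airflow computes from task dependencies, which is why a thin layer on top of it can cover much of what DBT's DAG does.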
Boaz: You said you guys are on Snowflake. How much processing? Did you guys do ELT exclusively in Snowflake? Do you do a lot of processing also outside of Snowflake?
Mike: No. A hundred percent of the processing is happening in Snowflake.
Eldad: Have you ever considered using Spark for that? Or did you just start with a clean sheet with Snowflake, with no need to migrate anything?
Mike: That was my thought. Yeah, it was a clean sheet. Start from scratch, and we'll see when we need to go bigger than Snowflake “can handle.” I'm sure they don't want to hear that, but I'm sure there's a point where using something like Spark in a more distributed fashion, where you can have a lot more control, might make a lot of sense. But we're not there yet, at least.
Boaz: Looking at your pie chart of time spent on which activities, how much time do you spend supporting the BI users and the BI tools versus supporting the warehouse or supporting the pipeline, and so forth?
Mike: Yeah, not to cop out on the question, but I'm pretty evenly split at the moment. And there's like another administrative chunk, which is hiring and building the team out so that I can be more places all the time.
Eldad: So basically, like most high-growth startups, 70% of your time goes to hiring and the 30% that's left goes to real stuff, which is great.
Mike: Yeah, I would say I'm at 35% hiring, 35% support of different people or functions and meetings, and then the remaining 30% is split evenly between either data engineering work or just my own data analytics work.
Boaz: Hiring is also always a good excuse, because you can tell people, “I'm sorry, it will be fixed the moment we hire another person, and I'm actively hiring.” Then it's essentially not your fault.
Eldad: Boaz loves hiring. He discovered hiring a few months ago.
Boaz: Eldad is always complaining, “Why didn’t you do this? Why didn’t you do that?” And I tell him I'm hiring for it.
Okay. So, tell us about an awesome win at Substack.
Mike: I mentioned before we want to get to a place where there’s more real-time analytics and more real-time insights in the product. But right now, we're living in a batch world where our definition of “real-time” is really every 20 minutes. We're kind of updating data in place. But I'm pretty proud of the system that we have. We're running a bunch of interesting, complex queries that create meaningful tables that are de-normalized and great for analytics, but also great for serving up things in the product. And then the way we're piping that back to Postgres with indexes in a way that is efficient and scalable is pretty neat. So I'm happy with that system that we have in place.
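[Editor's note: "updating data in place" in Postgres is commonly done with an upsert. As a hedged sketch of that reverse-ETL step, here is a helper that builds an `INSERT ... ON CONFLICT DO UPDATE` statement; the table and column names are hypothetical, and the real pipeline's SQL generation is not public.]

```python
def build_upsert(table: str, key_cols: list, value_cols: list) -> str:
    """Build a Postgres upsert so a serving table can be updated in
    place on each batch run instead of being rewritten wholesale."""
    cols = key_cols + value_cols
    placeholders = ", ".join(f"%({c})s" for c in cols)  # psycopg2-style params
    updates = ", ".join(f"{c} = EXCLUDED.{c}" for c in value_cols)
    return (
        f"INSERT INTO {table} ({', '.join(cols)}) "
        f"VALUES ({placeholders}) "
        f"ON CONFLICT ({', '.join(key_cols)}) DO UPDATE SET {updates}"
    )

sql = build_upsert("post_stats", ["post_id"], ["opens", "clicks"])
print(sql)
```

Because `ON CONFLICT` targets the unique index on the key columns, the same statement handles both first-time inserts and 20-minute refreshes of existing rows.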
Eldad: Connecting data back to the product and feeding the product experience with data is huge. And you're right, it is super satisfying to get there.
Mike: Yeah. And when you send a newsletter a lot, some people want to just refresh, refresh, refresh, refresh, and watch the numbers tick. We're not there yet and I want to get there. There'll be other exciting things between now and then, but that will be a really exciting day for me.
Boaz: Now enough with this self-gratification, and then the win stories, tell us about an epic failure.
Mike: Okay. That's a bigger list.
Eldad: Everything that happened before we managed to get data back to the product.
Mike: I guess I should say, thankfully, there haven't been catastrophic errors that we can attribute to the data team, but there have been things we've done poorly. For example, we were writing data too aggressively to this Postgres thing I keep talking about and we ended up filling up the write ahead log and knocking over the database. All queries started to time out and then the site went down. So, we've had a number of learning experiences about how to do things that keep the site running and how to test things a little bit better. We also use some of our metrics and monitoring tools like Honeycomb to have a sense of when things might be going wrong and then we try to prevent that from happening in the first place. So, there's been a lot of small disasters, nothing too catastrophic yet. The keyword is “yet” because I'm sure that it's coming.
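[Editor's note: one common mitigation for the incident Mike describes, overwhelming Postgres with writes until the WAL fills the disk, is to batch and pace the writes. This is a hedged, generic sketch, not Substack's actual fix; the function names and batch parameters are hypothetical.]

```python
import time

def chunked(rows, size):
    """Yield rows in fixed-size batches."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch

def write_paced(rows, write_batch, batch_size=500, pause_s=0.05):
    """Write in small batches with a pause between them, giving the
    database time to checkpoint and recycle WAL segments instead of
    letting the write-ahead log fill the disk."""
    written = 0
    for batch in chunked(rows, batch_size):
        write_batch(batch)  # e.g. one multi-row INSERT per batch
        written += len(batch)
        time.sleep(pause_s)
    return written

# tiny demo with an in-memory sink standing in for Postgres
sink = []
n = write_paced(range(7), sink.append, batch_size=3, pause_s=0)
print(n, sink)  # 7 [[0, 1, 2], [3, 4, 5], [6]]
```

Pairing pacing like this with monitoring (e.g. alerting on WAL disk usage in a tool like Honeycomb) catches the failure mode before it takes queries down.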
Boaz: What's the top challenge for data engineers or the data team in general at Substack?
Mike: I prefer to keep the surface areas small. So, in many organizations, there might be a data warehouse or something like that and then data is sent back out to a lot of different other services. And I am trying to not send it out to too many places because my fear is that ends up leading to a situation where you have one person looking at something in Google sheets, or Excel, or Air Table and saying, “Oh, I see this number here, but over in the BI tool, the number is different.” There's a lot of ways to try to control for that, but one way I think to control for that is to try to centralize and keep things in one location.
And so, the thing I'm thinking about a lot recently is how to make our data and our analytics more self-service, and self-service for not-necessarily-technical users. Whether it's building canonical data sets that are easy to query and giving everyone a little SQL lesson, or getting some sort of tooling that doesn't require SQL knowledge at all: how do we democratize data access a bit more? I think about that a lot. That's maybe more on the analytics side than the engineering side, but I think they're two sides of the same coin, really.
Boaz: How much of your responsibilities are on the analytics side as well?
Mike: All of it.
Boaz: Building out company dashboards and stuff like that. All of it? Wow. What gets on your nerves the most in your daily work with data?
Mike: Reconciling data from different data sources. For example, yesterday, I spent a long time trying to reconcile data that we're seeing from a test that we're running in Optimizely with our own data events logging and it's hard. It's an example of what I was talking about just a moment ago where Optimizely is a bit of a black box. You want to be able to put trust in the tool but sometimes you verify it yourself and in verifying it yourself, you end up in a rabbit hole. And so that can be kind of frustrating but it's important.
Boaz: Yeah. I feel you on that one. It's important, it's frustrating.
Mike: Yeah.
Boaz: For sure. Okay, so we're close to reaching the end. We want to get your advice on which companies, leaders or people to follow that inspire you or that you find interesting online.
Mike: Data Engineering Weekly is a cool Substack newsletter. Tristan Handy has a great newsletter, which is not Substack based but is still a big newsletter.
I'm also interested in what you guys are doing at Firebolt. I love Snowflake, but one of the reasons we're sending data back out to Postgres is that you lose the ability to index or to have functional aggregations that are very snappy. So I think getting data warehouses, or OLAP databases, to look more like databases with indexes is going to be very, very interesting and compelling in the near future. That's very interesting to me.
Boaz: Thank you so much. I think this is it. Eldad, what do you think?
Eldad: I think his interests are really spot on.
Boaz: Imagine Mike is not here with us. What would you tell me about him behind his back?
Eldad: So, I think the stack is perfect, and the frustration is hard, but it's a daily frustration we all deal with. And really, I wish you all the best. I wish you can scale fast. I wish you can hire a team that you love and can work with and accomplish meaningful things together. Always great to see you.
Boaz: And congrats for being a part of the success of Substack. So, keep in touch and thank you everybody for hopping on this episode of the Data Engineering Show. See you soon.
Eldad: See you soon. Bye bye.
Mike: Thank you.