The Data Engineering Show

Gong manages hundreds of thousands of videoconferences and millions of emails PER DAY, which add up to hundreds of TBs.

The Data Bros met Yarin Benado, Gong’s engineering manager, to understand what it takes to move to a modern data stack that supports all this, what that stack looks like, and why it all comes down to data quality at the end of the day.

What is The Data Engineering Show?

The Data Engineering Show is a podcast for data engineering and BI practitioners to go beyond theory and learn from the biggest influencers in tech about their practical day-to-day data challenges and solutions in a casual and fun setting.

WHO ARE THE DATA BROS?

Eldad and Boaz Farkash shared the same stuffed toys growing up as well as a big passion for data. After founding Sisense and building it to become a high growth analytics unicorn, they moved on to their next venture, Firebolt, a high performance cloud data warehouse serving some of the world’s most advanced tech companies. Their guilty pleasures include analyzing data pipelines and beating each other in endless query performance battles.

TRANSCRIPT

Boaz: Hello Everybody! Welcome to another episode of the Data Engineering Show. How are you Eldad?

Eldad: I am good. Thanks.

Boaz: Not tested positive for any variant yet?

Eldad: Dodged it. Had it at home.

Boaz: Talk to the mic.

Eldad: Managed to dodge it.

Boaz: Do not dodge the mic. I dodged it too. Maybe you think it runs in the family.

Eldad: On this version, at least.

Boaz: Yeah. Thanks everyone who joined us. We are here with Yarin Benado. Hi Yarin! How are you?

Yarin: Hey guys! How are you? I am great.

Boaz: Did you dodge the virus too?

Yarin: So far so good.

Boaz: So far. Yarin Benado - Engineering Manager from Gong. Gong, for those of you who do not know, is a really, really interesting and great product, a revenue intelligence product. It started with a product that helps you record and analyze all your conversations, typically for sales teams, but not only, and you can go back and analyze who talked, for how long, what words were mentioned, and get insights into how to better close deals if you manage sales teams, etc. Yarin joined around a year and a half ago after an acquisition. So, Yarin, tell us what Vayo was, which was sort of your home before joining Gong.

Yarin: I founded Vayo around 2017 with a good friend and partner, and our goal at Vayo was basically to take all the customer data that a company is constantly creating and answer a few simple questions about these customers. Is the customer a happy customer? Are they going to leave, to churn? Are there any upsell opportunities? The idea was basically to automate all the data collection, aggregation, and reporting around these specific customer-related questions. That used to be done by internal BI and data teams, so we offered a simple solution where Vayo does all the hard stuff.

Boaz: How big was the team at Vayo at that time?

Yarin: So we were a relatively small, very small startup. We ran for two and a half years. We were like 10 employees, mostly engineering, here in Tel Aviv and a bit offshore.

Boaz: Tell us a little bit about your background. What did you do up until that and how did you enter the data and engineering world?

Yarin: Before Vayo, I was leading the engineering team for a somewhat bigger startup here in Tel Aviv. Before that, I held the role of principal engineer at several startups. I think Gong is the biggest company I have worked for.

Boaz: How many employees are at Gong nowadays?

Yarin: Almost 900.

Boaz: Wow! You are an engineering manager. Tell us a little bit about your current role and how much of it is dedicated to data.

Yarin: I am an engineering manager in what is called the insights group within Gong. The group is basically in charge of all the data that is customer-facing. So it is either in-app analytics or BI for customer use. Basically, everything we do is around data. There are many, many teams and groups within Gong that create a lot of data, and we have to build insights on top of this data.

Boaz: Yeah, Eldad and I know, we talked about it in past episodes, I think as well. We always loved the intersection of software engineering and deep data projects. So, let us try to untangle that. Tell us a bit about the current data stack and sort of the current active projects, and let us see what is going on under the hood at Gong with data.

Yarin: Cool. So in terms of the stack, basically, we like to keep things very simple, but at Gong's large scale. In our data stack, we have the operational databases, which serve the entire product and the pipelines within the product. In terms of data, we started simple. We just used PostgreSQL, then moved to Redshift. Now, we have a bigger project, basically to take data from multiple sources for this in-app analytics use case and move everything to a single data warehouse, with a lot of pipelines to aggregate and pre-aggregate data for basically sub-second query times for dashboards, graphs, etc. For operational databases, we still use a lot of Elasticsearch and PostgreSQL. We have some MongoDB, and as mentioned, it is also for the in-app analytics, and we are now also moving a lot of parts from PostgreSQL to Redshift and to Snowflake.

Boaz: These are a lot of things. Walk us through the different teams that are involved. What do the teams that deal with data look like at Gong? What do you have between data engineering and software engineering, and which teams are dedicated to in-app insights in particular? What does that look like? How do all of these interact with each other?

Yarin: It is a very good question because it is pretty complex at the pace that Gong is growing. We have within the insights group, right now, we have 2 tracks, 2 pods, with products and engineering and data engineering. One of them is to create the in-app analytics track. Another team is focused on the infrastructure of basically building the lakehouse if we can call it like that, both for internal use.

We also have the classic data analysis team and internal BI team. They all utilize the same data generated from Gong and other sources, and, as mentioned, we have product analysts who do a lot of internal work with the data. Overall, I would say there are about 30 different people working around these areas of data that is being generated on top of Gong's platform.

Boaz: What data volumes are you guys dealing with?

Yarin: A lot. So, most of the data that is not being analyzed, or is being analyzed but not for data analysis, is basically the videos and calls, what we call the media. The last time I checked, recently, it was almost 5 petabytes in terms of data ingestion. So we are talking about hundreds of thousands of videoconference calls per day. I would say many millions of emails per day. Overall, all the data that is not the media takes up, I think, a few hundred terabytes.

Boaz: Wow! You mentioned a variety of technologies. Walk us through, maybe, you know, if I am a user, from a user experience perspective, what kinds of analytics am I exposed to as a user? And then what parts of the stack are delivering it to me?

Yarin: The largest part of the data, which we call team stats, analyzes how the team is performing in terms of sales calls. Gong has identified some very interesting metrics that help salespeople do better, things like the talk ratio and patience. So we have a very specific part of the product that serves this data, and then we have some more advanced things where we can basically track what is being said in calls and track it over time. It is a relatively new area that we are focusing on at Gong, what we call tracking the strategy, or the sales organization's strategy. Basically, all the data right now is being served either from PostgreSQL, in which all the data is pre-aggregated, or from Snowflake, where all the data is basically being modeled using DBT and almost no aggregation is done up front. So it is aggregated at query time.
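The two serving paths described here, pre-aggregated rows versus aggregation at query time, can be sketched roughly as follows. This is a minimal illustration, not Gong's implementation: the row layout, the function names, and the definition of talk ratio as rep talk time divided by total talk time are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical raw rows: (team, day, rep_talk_seconds, total_talk_seconds).
CALLS = [
    ("emea", "2022-01-10", 300, 600),
    ("emea", "2022-01-10", 100, 400),
    ("emea", "2022-01-11", 200, 1000),
    ("amer", "2022-01-10", 500, 1000),
]


def talk_ratio_at_query_time(calls, team):
    """Scan every raw call row for the team when the dashboard asks."""
    rep = sum(r for t, _, r, _ in calls if t == team)
    total = sum(tot for t, _, _, tot in calls if t == team)
    return rep / total if total else 0.0


def pre_aggregate(calls):
    """Roll raw rows up to one (rep, total) pair per team and day."""
    totals = defaultdict(lambda: [0, 0])
    for team, day, rep, total in calls:
        totals[(team, day)][0] += rep
        totals[(team, day)][1] += total
    return {key: (rep, total) for key, (rep, total) in totals.items()}


def talk_ratio_from_rollup(rollup, team):
    """Answer the same question from the much smaller rollup table."""
    rep = sum(r for (t, _), (r, _) in rollup.items() if t == team)
    total = sum(tot for (t, _), (_, tot) in rollup.items() if t == team)
    return rep / total if total else 0.0


ROLLUP = pre_aggregate(CALLS)
```

Both paths return the same ratio. The rollup keeps sums rather than ratios, so days can still be combined correctly, and the dashboard reads one row per team and day instead of one row per call.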

Boaz: How did the data stack evolve in the year and a half that you have been there? I mean, you mentioned you are looking into a lakehouse architecture right now. Is that a new initiative or was that always the case? And when did Snowflake and DBT come in, and what was the driver for that?

Yarin: So the entire lakehouse solution is in progress these days. We are trying to make it more of an evolution than a revolution, but still, a lot of the data is being served from the classic, simple one big table on top of PostgreSQL, and now we are forming the new architecture, basically looking at scale, and building this new lakehouse solution.

Boaz: You also mentioned Redshift, right? Where does the Redshift come in?

Yarin: Mostly around product analytics. A lot of the data that is being collected from third-party tools ends up in Redshift, anonymized. We do not have any customer data at all there. So, it is just IDs and raw metrics.

Boaz: What are the typical frustrations that you guys run into in the data stack? What do the internal users like to complain about, or what might the new evolution resolve?

Yarin: Users always complain, obviously, about performance, why this query takes so long, etc. A lot of internal users find it a bit difficult to get the real picture of the data, because we are using so many disparate databases. Also, when we try to bring everything under the same roof, data modeling still takes some effort, because Gong is a company that moves really fast in engineering, and we do not always have a cohesive data model in terms of data engineering and analytics. So, these are two pretty interesting challenges that we are facing.

Boaz: Who is driving or who is involved in the evolution into the new architecture? How from a process perspective or a human perspective are you managing that project?

Yarin: Everyone, not kidding. The interesting part here is that it is not a classic internal data project, because we are also exposing some of this data, obviously modeled differently, to customers. So, they will end up having their own access to a warehouse with all their data. So we have two areas pulling the same string, both the product and the internal analytics use case. It is almost everywhere in the organization, from customer success to product to engineering, all the way to pretty complex projects.

Boaz: I think you mentioned sharing data with customers. Can you elaborate more a bit about that? So what's the business need or requirement there and how are we going about it?

Yarin: I look at it as the next evolution of APIs. Gong has a pretty robust API, which companies can utilize to build their own products, but the biggest benefit of Gong is the data that it holds, and obviously, our customers want to build data products around Gong's data. So it only makes sense that, instead of having them consume APIs and build the pipelines and data transformations, we just hand them all the data, cleanly modeled, so that they can either just plug in Tableau or Looker and start using the data, or build products on top of the data and not the API.

Boaz: They are essentially exporting the data, copying it into their own environment through the APIs.

Yarin: Yes.

Boaz: Got it. What are your impressions so far? You have been using both Snowflake and Redshift. How is that looking for you guys? What's the conclusion?

Yarin: We are relatively new with Snowflake, but overall I think both have their strengths and weaknesses. Obviously, Snowflake is a bit of an issue in terms of cost, and Redshift is not as performant, and ingestion is a bit more challenging. Overall, I think they are both okay for what they do. Some use cases are better here. Some are better there.

Boaz: Is DBT used with both or only with Snowflake?

Yarin: Only with Snowflake.

Boaz: And how long has it been since you have adopted DBT? How has that been going?

Yarin: Same. Pretty, pretty new for us as well.

Boaz: Any tips so far for other newbies with DBT? Things to avoid, best practices from the rather short journey for you guys there, or not yet?

Yarin: Yeah, not yet. We are still at the stage where we are getting more comfortable with these things, and we are still learning as we go.

Boaz: Awesome! I wonder, with you being a veteran in software engineering and essentially now delivering data experiences into products, at the intersection of software engineering and data, what kind of practices are important for you that are different from the traditional internal analytics world?

Yarin: Mostly it is around the quality of the data that we end up serving our customers. So we treat data projects almost as a software development lifecycle. What is the SLA for support and monitoring? How can we make sure that the data that customers see, which went through so many pipelines and transformations, is accurate? Pipelines cannot break, so we cannot have data where it is, "yeah, it's been broken for the past 24 hours," so customers see data that is from the last couple of days. We treat it just like any other large-scale customer-facing software project.

Boaz: Let us maybe use this as an opportunity to pick on you and ask, was there any sort of failure that you remember that we can look back on and maybe help our listeners avoid? What do you remember as a horrible day, whether it is data pipelines or production issues, that we can all learn from?

Yarin: So far, nothing major that was customer-facing. We did have some mega failures from being very naive about how we were going to approach this lakehouse project. We said, yeah, let's take something like a CDC solution that consumes all the possible data, throw it in S3, and then just have Athena or Presto or something like that on top, and it is just a one-month project and we go all in. But this one month turned into a battle for, I think, about 6 months, and then we decided to move on to something more robust.

Boaz: What is that more robust thing? Let us just untangle that a little bit. Let us spend some time there.

Yarin: It is: let us not cut corners, and just build a streaming service from within the product, where we can also take care of data lineage. So we know when the data was modified and what kind of data was modified. For example, if we are talking about recorded calls, or even scheduled calls, then before the call is processed and analyzed by Gong, we can know exactly when the call was changed by the host, whether it was rescheduled, etc., and then we write everything. We almost created our own CDC mechanism. We dump everything to S3, where we have a staging area, I would say, that is rapidly ingested into something like Snowflake, which holds the journal for each entity. From then on, we can create history tables or daily snapshots. We can go back in time and say, yeah, let us see what the actual data looked like a week ago. So, no shortcuts.
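The idea of a per-entity journal that can be replayed into a snapshot "as of" some point in time can be sketched in a few lines. This is only an illustration under assumptions: the `JournalEntry` fields, the `snapshot_as_of` helper, and the sample call data are hypothetical and are not taken from Gong's system, where the journal lives in a warehouse rather than in memory.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Any


@dataclass(frozen=True)
class JournalEntry:
    """One change event for one entity (hypothetical layout)."""
    entity_id: str
    modified_at: datetime
    modified_by: str
    changes: dict[str, Any]  # only the fields that changed


def snapshot_as_of(journal: list[JournalEntry], as_of: datetime) -> dict[str, dict[str, Any]]:
    """Replay the journal in time order and stop at the requested moment."""
    state: dict[str, dict[str, Any]] = {}
    for entry in sorted(journal, key=lambda e: e.modified_at):
        if entry.modified_at > as_of:
            break
        state.setdefault(entry.entity_id, {}).update(entry.changes)
    return state


# A scheduled call that the host later reschedules.
JOURNAL = [
    JournalEntry("call-1", datetime(2022, 1, 3, 9), "host", {"starts_at": "2022-01-10T15:00"}),
    JournalEntry("call-1", datetime(2022, 1, 7, 12), "host", {"starts_at": "2022-01-12T10:00"}),
]

before_reschedule = snapshot_as_of(JOURNAL, datetime(2022, 1, 5))
after_reschedule = snapshot_as_of(JOURNAL, datetime(2022, 1, 8))
```

Because the journal is append-only, a daily snapshot is just this replay evaluated at each day's cutoff, and nothing is lost when an entity is modified again later.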

Boaz: How did you go about the lineage challenge? How did you implement that?

Yarin: We still have some challenges there, but we keep track of every transaction and data pipeline that is internal to the product, and keep a record that says when an entity was modified and by whom. From then on, we try to keep a sequence or batch ID, and each transformation has its own records: when it was transformed, by which pipeline, etc. Eventually, we should be able to unroll aggregated data and find the actual raw data that was taken into consideration for that aggregation.
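One way to picture "unrolling" an aggregate back to its raw inputs is to have every transformation write a small lineage record and then walk those records backwards. The sketch below is an assumption-laden illustration, not the mechanism Gong built: `LineageRecord`, its fields, and the `unroll` function are hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LineageRecord:
    """What one transformation produced, and from which inputs (hypothetical)."""
    output_id: str
    pipeline: str
    batch_id: int
    input_ids: tuple[str, ...]


def unroll(output_id: str, lineage: list[LineageRecord]) -> set[str]:
    """Walk lineage records backwards until only raw record IDs remain."""
    by_output = {record.output_id: record for record in lineage}
    raw: set[str] = set()
    seen: set[str] = set()
    stack = [output_id]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        record = by_output.get(current)
        if record is None:
            raw.add(current)  # nothing produced it, so treat it as raw data
        else:
            stack.extend(record.input_ids)
    return raw


LINEAGE = [
    LineageRecord("daily-mon", "daily_rollup", 101, ("raw-a", "raw-b")),
    LineageRecord("daily-tue", "daily_rollup", 102, ("raw-c",)),
    LineageRecord("weekly-1", "weekly_rollup", 201, ("daily-mon", "daily-tue")),
]

weekly_sources = unroll("weekly-1", LINEAGE)
```

The batch ID and pipeline name are not needed for the walk itself, but they are what lets someone answer "which run produced this number" once the raw inputs have been found.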

Boaz: Tell us about internal analytics. What does that look like? What tools are being used, and which of the different databases and warehouses are they hitting?

Yarin: For internal use?

Boaz: Yes.

Yarin: We have Sisense. We now have Tableau. We have our homegrown analytics, though I am not sure where exactly it is being used. Most of the data is being ingested into Redshift, either from third parties or streamed from within the product. Pretty straightforward. We are using Amplitude for analytics and FullStory for UI analytics, like the product user journey. Pretty straightforward.

Boaz: You know, we have picked on you for data tragedy. Now, let us talk about something happy. What are you proud of or projects that went very well or something that we can learn from if you can share?

Yarin: One example that is top of mind: we were looking to replace PostgreSQL for one part of the product, where we served some stats. Before moving off PostgreSQL, we mapped out where the problem was. Why is it not working? Is moving to a different datastore the solution? At the end of the day, we identified that a very, very minor optimization to this PostgreSQL would do, using RDS or Aurora so we can scale PostgreSQL more easily. Not that it is an easy task in itself, and we had a pretty big table in terms of what PostgreSQL can deal with. It was about 2 weeks of research, and then the solution took a couple of days, and we went from something like 10 seconds of loading time back to sub-second queries, just by using the right column types and the right indexes. It was pretty interesting how far you can go with modest tools.

Boaz: Yeah. Sometimes the simplest things work like a charm, but we oftentimes do not spend enough time figuring that out and go for a more robust solution. Awesome! Great story! Maybe an interesting angle to talk about would be hiring. How do you go about hiring engineers from a data angle? How do you make sure they will be able to deliver on those big data challenges? What are you looking for?

Yarin: First, I would say it is a very difficult task nowadays, but yeah, we can all agree on that. In terms of data, we are looking for people who have dealt with data, not necessarily at a data engineering level where they know all the tools and how to build infrastructure, but who have some experience around data: what it means to query 2 terabytes of data, how databases and data stores are modeled, and a technical understanding of the strengths and weaknesses of each. It is not necessarily the best SQL experience, but more an understanding of performance and good data modeling. This is also something that we put some emphasis on: having good foundations, 3NF, Kimball, things like that.

Boaz: Awesome! Thanks! Are there any parts left in your stack that you consider very legacy and are the ones that are in line to be replaced?

Yarin: Yes. Although Gong is a relatively young company, things that were implemented 2 years ago are considered legacy. I would say the fact that we still store data that is not for operational use in PostgreSQL is something we definitely want to move on from, either to the lakehouse approach or maybe to some other data store that allows fast analytics at a smaller scope, but with better scale and performance.

Boaz: Thank you. I think Yarin I am running out of questions. I mean, that has been super, super interesting.

Eldad: Yes.

Boaz: We love Gong ourselves at Firebolt, and it is definitely interesting to hear how such an insight-driven and data-centric product has data flowing behind the scenes. Thank you for joining our episode. Any final words you want to spread to the data engineering world?

Yarin: Yeah, I would say what we always do, one of our operating principles at Gong is first enjoy the ride and the second one is just, do not be so naive and say, yeah, put everything on S3, and then it would work. It takes a lot of effort and a lot of time and a lot of people along the way to create a robust data solution.

Boaz: But the first few days of that naive, false optimism, they are so fun. You think all your worries are over before it explodes in your face.

Eldad: Yes

Yarin: You feel so super-powered. "Yeah! It will take us just 2 weeks."

Boaz: Within your head, it is all done already. Yarin, thank you very, very much and see you around.

Eldad: Thank you so much.

Yarin: Thank you. Bye.

Eldad: Bye-bye.

Boaz: Bye. Thank you for joining us.