Data in the Wild

Today, we’re joined by Mathias Hansen, co-founder of Geocodio. He talks to us about the challenges of dealing with geodata, how opening up data helps improve accuracy for everyone, and more.
Chapters
  • (00:52) - What is Geocodio?
  • (04:01) - Geocodio’s tech stack
  • (06:46) - Data models and flexible pricing
  • (10:13) - Challenges in geocoding
  • (19:05) - Long lat or lat long?
  • (20:39) - Challenges of having locations near bodies of water
  • (23:25) - How opening up data could help with accuracy
  • (25:00) - Mathias’ tall tale about why they’re only available in the US and Canada
  • (39:26) - Why you should automate your processes from day one


Sponsor
This show is brought to you by Xata, the only serverless data platform for PostgreSQL. Develop applications faster, knowing your data layer is ready to evolve and scale with your needs.

About the Hosts
Queen Raae wrote her first HTML in 1997 after her Norwegian teachers encouraged her to take the new elective class. 

Around the same time, Captain Ola bought a Macintosh SE for his high school with the proceeds from the school newspaper he started.

These days, you’ll find them building web apps live on stream and doing developer marketing work for clients. They are both passionate about the web as a platform and the joy of creating your own thing.

Visit queen.raae.codes to learn more.

Creators & Guests

Host
Benedicte (Queen) Raae 👑
🏴‍☠️ Dev building web apps in public for fun and profit 👑 Helping you get the most out of @GatsbyJS📺 Streams every Thursday: https://t.co/xaLy43cqMI
Host
Ola Vea
A piraty dev who also help devs stop wrecking their skill-builder-ship ⛵. Dev at (https://t.co/8m50kyT981) & POW! w/👑 @raae & Pirate Princess Lillian (8) 🥳🏴‍☠️
Editor
Krista Melgarejo
Marketing & Podcasts at @userlist | Originally trained in science but happily doing other stuff in SaaS & tech now
Guest
Mathias Hansen
Software Developer wearing many hats. Married to and @Geocodio Co-Founder with @mjwhansen 🇩🇰🇺🇸

What is Data in the Wild?

Learn from your favorite indie hackers as they share hard-earned lessons and tall tales from their data model journeys!

Brought to you by Xata — the only serverless data platform for PostgreSQL.

[00:00:00] Ola: Welcome to Data in the Wild! Discover data model tips and tricks used by our favorite indie hacker devs, with Queen Raae! I'm your co-host, Captain Ola Vea, and this podcast is brought to you by Xata, the serverless data platform for modern web apps.

[00:00:26] Benedicte: And today's guest is the great and powerful Mathias Hansen, co-founder of Geocodio.

[00:00:32] Welcome to the show, Mathias.

[00:00:34] Mathias: Thank you for having me.

[00:00:35] Benedicte: Do you like being introduced as the great and powerful?

[00:00:39] Mathias: I've never been introduced like that before. That's something new. I'm not sure if I can live up to it.

[00:00:46] Benedicte: I think you can. I think you can. Cause you know, you're one of the OG indie hackers.

[00:00:52] So before we get into it, what problem does Geocodio solve?

[00:00:58] Mathias: So Geocodio is a geocoding API and platform. So we convert street addresses into coordinates, and coordinates into street addresses. And then we also allow you to add a bunch of additional data on top of this because coordinates are great, but often you need additional information, such as the time zone for an address or the school district or you can add census data and a bunch of other things.

[00:01:26] Benedicte: Is this worldwide or is it focused on the US?

[00:01:29] Mathias: It is just in the US and Canada for now. And we'll probably get into a little bit later why that is.

[00:01:36] Benedicte: Yeah. So when did you go into production? Like, when did you have your first customers? To get a sense of the timeline.

[00:01:45] Mathias: We launched in the end of January of 2013.

[00:01:50] And basically we just built a small proof of concept prototype for our own use. We had tons of projects at a time at work and random side projects where we needed to geocode lots of addresses and it was just really expensive and prohibitive. So we tried to build a small proof of concept geocoding engine that would just work for us.

[00:02:13] It was not very good. It was slow. It was not very accurate either. But it was just good enough that we could actually use it for some of our data. So Michelle, my co-founder and wife, and I talked about it: why don't we just publish this and try to charge for it? Just a simple paywall, essentially, in front of the API to sort of cover our costs and the development time we spent on it.

[00:02:44] And late January 2013, we launched everything. And at the time, the cool thing was to post on Hacker News. So we posted it to the Y Combinator Hacker News site. And lo and behold, we actually got to spend about 24 hours on the front page and we were ecstatic. It was an amazing experience.

[00:03:06] People actually were interested in this stuff. It didn't mean that we became millionaires that day. We got lots of traffic, lots of comments, some nice, some less nice.

[00:03:24] It's a typical thing where people will post, "Oh, it's much cheaper to use this other provider," or, "Oh, I could build that myself on a weekend," and those kinds of comments they are in there as well, right?

[00:03:36] Ola: Yeah.

[00:03:36] Mathias: We got lots of comments. We got lots of feedback. We got lots of visitors, not a whole lot of signups necessarily. But it was a good boost to get the ball rolling. And it kind of made us realize that we probably hit a nerve and we've solved a problem that not only we were facing, but others might be facing as well.

[00:04:00] Ola: Yeah.

[00:04:01] So before we get into your experience with data modeling, could you quickly run through your tech stack?

[00:04:07] Mathias: Sure. So we've always been using the Laravel PHP framework for our API and dashboard. So lots of PHP code in there.

[00:04:22] Today, we use a whole bunch of different technologies. So on the database side, we use MariaDB. We use a whole lot of SQLite as well, the file-based database. And then we use Go, we use Python, and, you know, on the frontend we use React, and we use Node.js for various things too. We use Ansible, we use Terraform. Trying to use the right tool for the job. Each tool has a, you know, specific purpose. But we also do a whole lot of things.

[00:04:58] We run our entire infrastructure ourselves so there's a lot of tooling that needs to be in place to automate things as much as possible.

[00:05:06] Benedicte: But so on the database side, you said it was MariaDB and then some SQLite?

[00:05:10] Mathias: Yep, exactly.

[00:05:13] Benedicte: So what would you say are the database models that have changed the least for you since you launched? And it's been 10 years!

[00:05:23] Mathias: Yeah, it has. It's crazy, right?

[00:05:26] Benedicte: So which one has changed the least?

[00:05:28] Mathias: I mean, if you look back, the codebase has been alive for 10 years, with, in the beginning, a lot of development just nights and weekends. So a lot of hacking around. Let's be honest here, it wasn't planned to be a big thing.

[00:05:45] You know, there's a lot of, you know, corners of the codebase that haven't been touched for a long time. And a lot of the data modeling was done a long, long time ago. So some of the things that really haven't changed much are the way we manage users, API keys, basic permissioning, and we have the concept of spreadsheet uploads.

[00:06:10] The way that's stored in a database has changed very little. We might've added some columns to our tables over time, but we try to keep everything exactly the same. And for the most part, that's actually perfectly fine. The concept of a user is very generic, right? And hasn't really changed much over the years.

[00:06:35] On the geocoding side, however, lots of things have changed over the years.

[00:06:39] Benedicte: Yeah, we're going to get into it. We're going to get into that.

[00:06:44] Ola: Maybe you should get into that. What has changed?

[00:06:46] Benedicte: I want to ask that because, like, a lot of people who have been on the show so far have had a lot of changes to, maybe not to the user model, but to how the user is connected to plans and subscriptions and organizations.

[00:06:59] Ola: Yeah.

[00:07:00] Benedicte: And is that also true for you? We don't have to go a lot into that, but is that true for you or did all of those things also not change as much?

[00:07:10] Mathias: So there's kind of two parts there. One thing is the plan and pricing stuff. And I was actually working at another software as a service company at the time when we started Geocodio.

[00:07:24] Geocodio was just a side project for a long time. And we had a challenge at that company where we had a big marketing department that kept coming up with new plans, new pricing, various sales and ways to, you know, onboard customers using different offers, right? So I knew how much of a pain it was to work with at the time.

[00:07:49] So I actually designed Geocodio's data model for how we deal with plans and pricing to be somewhat flexible from the get-go, because I knew that, you know, pricing isn't a constant. It's bound to change at some point. So I tried to not make it too fixed and actually made it database-driven pretty much from the get-go.

[00:08:11] So that means that today, we can actually use the same data model as we started out with. We've also been pushing pretty hard to not make too many major changes around the way we charge for things, for our own sake, but also for our customers' sakes. It gets really confusing to explain if you change your pricing model entirely, for example. And that's really complicated.

[00:08:37] On the other hand, one thing that I hadn't planned for in the beginning was thinking about granular permissions, thinking about teams, and team structures, and, you know, roles in a team and things like that. And that's kind of biting us in the heels right now because that's something I wish our data structure or data model already supported.

[00:09:04] And it's going to be a project probably later this year, where we have to redesign some of this stuff and migrate it over to a data model that lends itself better to teams. Right now, we just have a simple structure where you have a parent user and that parent user can have child users. And that has been great for the last three or four years, but at this point we need to be able to have like a billing admin versus a technical admin and things like that.

[00:09:37] So that's something we have to redesign.

[00:09:41] Benedicte: Cool! But I like that you got to take some of that experience into the new project and that actually was helpful. And that is hopefully a little bit what we do with the show: that people can hear people struggling with some of the same things and be like, "Oh hey, maybe I will also struggle with that," and take some time to think about that part to make it a little bit flexible from the start.

[00:10:06] But what were you going to say, Ola?

[00:10:08] Ola: No, no, let's, yeah.

[00:10:09] Benedicte: Yeah, yeah. But then let's get into it.

[00:10:12] Ola: Yes.

[00:10:13] So what data model has changed the most?

[00:10:18] Mathias: So the way we geocode an address, if you go really high level, it's a two-step process. If you give us an address, say 1 Main Street, Springfield, Virginia.

[00:10:33] We first parse the address into bits. We need to figure out what's the house number, what's the city. If that's a zip code, what part is that and whatnot. So that's the parsing part.

[00:10:42] The other part is the actual geocoding, where we take these address components and we look them up in a large database of cities and zip codes and actual addresses as well.

[00:10:56] So when we first started out with Geocodio, we got all of our address data from the US Census Bureau. They have a nationwide dataset that basically, it covers most of the US and they have what's called address ranges. So they say between 100 and 500 Main Street, there's these coordinates on a line and then you can do some math to figure out, "okay, if you want 200 Main Street, it's going to be roughly in the beginning of that set of coordinates."

[00:11:33] So our entire data structure for geocoding was based off of these address ranges. So ranges of house numbers. Now, that's not super accurate, because if you just estimate based on a little line on a map, you might end up at, best case, the neighbor's house. Worst case, a little bit further up, especially if the distances between the houses are not equal or even.
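
That range math can be sketched in a few lines. This is an illustrative toy, not Geocodio's actual implementation, and the coordinates are made up:

```python
# Illustrative sketch of geocoding by address-range interpolation:
# place a house number proportionally along a street segment.

def interpolate_address(house_number, range_start, range_end,
                        coord_start, coord_end):
    """Estimate a (lat, lon) for house_number between the segment's
    start and end coordinates, proportional to its place in the range."""
    if range_end == range_start:
        return coord_start
    fraction = (house_number - range_start) / (range_end - range_start)
    lat = coord_start[0] + fraction * (coord_end[0] - coord_start[0])
    lon = coord_start[1] + fraction * (coord_end[1] - coord_start[1])
    return (lat, lon)

# 200 Main Street in the 100-500 block lands a quarter of the way along.
point = interpolate_address(200, 100, 500,
                            (38.7800, -77.1800), (38.7840, -77.1760))
```

The inaccuracy he describes falls straight out of the `fraction` line: the estimate assumes houses are evenly spaced along a straight segment, which real streets rarely are.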

[00:12:03] So one of the first major changes that we had to do in geocoding side was being able to support what's basically called rooftop geocoding.

[00:12:11] So having points on the rooftops of houses and that's individual coordinates rather than lines and then we still need to be able to fall back to the previous solution because we don't have nationwide coverage of rooftop points.

[00:12:34] We're getting closer and closer to being at a hundred percent, but we always need to have a fallback. So we basically had to sit down and rethink how we make these lookups. It has to be fast and efficient, but also as accurate as possible. So having these kinds of fallbacks as needed. So that's definitely been one of the biggest projects, to put that in place.
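
The rooftop-first lookup with a fallback might look roughly like this. The structure and all names are assumptions for illustration, not the real engine:

```python
# Toy rooftop-first geocoder with a coarser fallback; all names and
# data here are made up for illustration.

def geocode(address, rooftop_index, range_fallback):
    """Prefer an exact rooftop point; fall back to an estimated one."""
    point = rooftop_index.get(address)
    if point is not None:
        return point, "rooftop"
    return range_fallback(address), "range_interpolation"

rooftop_index = {"200 Main St": (38.7810, -77.1790)}

def approximate(address):
    # Stand-in for an address-range interpolation lookup.
    return (38.7800, -77.1800)

hit = geocode("200 Main St", rooftop_index, approximate)
miss = geocode("300 Main St", rooftop_index, approximate)
```

Returning the accuracy label alongside the point mirrors what he describes: callers can see whether they got a rooftop match or only an estimate.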

[00:13:01] Benedicte: But do you get these rooftop coordinates also from the US the same place? The same source?

[00:13:08] Mathias: So that would be awesome if you can just get it all from one place.

[00:13:13] So at this point, we have about 2,500 different data sources where we get rooftop points from.

[00:13:25] Benedicte: 2,500?

[00:13:26] Mathias: Yes. So it's usually a local county or a local city that publishes address data for their local area.

[00:13:36] We're really fortunate that there are some great open source projects that have really helped us save a lot of time. There's a big open source project called OpenAddresses; the goal of that project is basically to collect a lot of these data sources and make a big index of them so you can sort of see them in one place.

[00:13:59] So we piggyback on a lot of these sources that they have collected. We also have our own set of sources that aren't in OpenAddresses for various reasons. And we also, of course, contribute back to the OpenAddresses project: whenever we find any issues, broken links to sources, we push it back to the project.

[00:14:25] But a big part of dealing with so many different distinct data sources is the whole address, data normalization, essentially. It's not just addresses, it's basically just data at this point, right? So a big part of what we do across the board now is data normalization, trying to get all these different sources to fit into one box.

[00:14:49] Benedicte: And how does that look for you now? Oh, sorry.

[00:14:53] Mathias: Yeah, it's looking pretty good at this point. We actually had a major project that we worked on for, I think almost a year and we launched earlier this year to basically have our own full in house platform for doing all this ingestion and normalization.

[00:15:15] So what that means now is that we can go out to every single individual source and decide exactly how often we want to refresh the data for it. So in many cases, we go to each county, each city, each local area every single week and ask them for a new data dump or download it from their site and run it through our normalization process.

[00:15:36] There's a lot of cleaning involved as well. As you can imagine, some of these sources are great, and some of them have a lot of messy data in them that we have to discard. You might have address points that are lacking a house number or a street, and that's not really useful. We can't really do much with that, especially without a house number, because then we don't know what to associate it with.

[00:16:02] There are spelling errors in street names from the local cities, for example. That's not uncommon. Lots of other issues. You might have addresses that they actually don't have a coordinate for yet. And there is this concept in the GIS world.

[00:16:22] By the way, I'm no GIS expert. Everything I've learned, I learned on the job.

[00:16:28] Benedicte: So you said GIS, which is a geo?

[00:16:31] Mathias: Geographic information systems. So just working with geographic data. So there's this concept called null island. So if you have a point and you don't know where to put it on the map yet, but you know you need a point for this house number on this street in this city and whatnot.

[00:16:50] You might just pick a random location on the map and say, "we're going to put all those addresses over there that we don't know where to put yet."

[00:16:57] Benedicte: Because the system requires a coordinate. So you're dumping it in some.

[00:17:01] Mathias: Exactly. You can have a null.

[00:17:02] Benedicte: Yeah.

[00:17:03] Mathias: Yep. Yep. Exactly. There's no null type for a coordinate, right?

[00:17:07] But instead of picking, say, 0, 0 or something that's almost like null, you often just pick a random point and say, "there." And then you can imagine that we have 2,500 different data sources, and each of them has a different random point that they decided is that null point, right? So it's problems like that that make every day fun.

[00:17:28] Benedicte: So do you then go in and check to see if every coordinate is like, in the vicinity of that city to make sure that it's a true coordinate or like, how do you go about?

[00:17:39] Mathias: Exactly. Yeah. So we try to do different things. Like sanity checks, we call them. Does this look remotely accurate? Or is anything concerning?

[00:17:48] If anything is concerning, we remove it. So it's an automatic process, of course, because we're dealing with hundreds of millions of records. So there's not much space for manual work here.

[00:18:00] But sometimes a coordinate like that might also be within city limits. So then we can do things like testing if lots of coordinates, sorry, lots of locations, addresses, have the same coordinate, right?

[00:18:14] Then something is up and we discard those because there shouldn't be a thousand addresses with the same coordinates, unless you're dealing with a big apartment building and you have lots of units, then maybe it's okay. But it's things like that.
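
That duplicate-coordinate check could be sketched like this. It is a simplified illustration, not Geocodio's pipeline, and a real threshold would account for apartment buildings as he notes:

```python
from collections import Counter

# Simplified sanity check: if suspiciously many distinct addresses share
# the exact same coordinate, treat it as a likely placeholder ("null
# island") point and discard those records.

def drop_suspicious_points(records, max_shared=2):
    """records: list of (address, (lat, lon)) tuples. Drop records whose
    coordinate is shared by more than max_shared addresses."""
    counts = Counter(coord for _, coord in records)
    return [(addr, coord) for addr, coord in records
            if counts[coord] <= max_shared]

records = [
    ("1 A St", (43.0, -89.0)),
    ("2 A St", (43.0, -89.0)),
    ("3 A St", (43.0, -89.0)),   # three records on one point: suspicious
    ("4 B St", (43.1, -89.1)),
]
cleaned = drop_suspicious_points(records)
```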

[00:18:29] Benedicte: Have you ever come across, cause this is like early in my career, I did this project and I was like mapping something on the map, mapping something on the map, funny, but like, and it was Norway, and it was like in some ocean and I was like, "what is up with this?"

[00:18:47] And then I realized that, and I no longer know what the standard is, but instead of doing lat long, they had done long lat or the other way around.

[00:18:55] Mathias: Oh yes.

[00:18:56] Benedicte: And it took me forever to figure out what was wrong because it just kept showing up in this ocean outside of Asia. And then finally I got it.

[00:19:05] But is that, are we not in agreement on what that standard is?

[00:19:09] Mathias: Not at all. Not at all.

[00:19:12] It's a big confusion, because some people strongly believe that it should be lat long, and some people strongly believe it should be long lat. There are different formats, like file formats or ways to store geographic data, that dictate that it has to be in a certain order.

[00:19:30] You might've heard of the GeoJSON format, which is basically just a JSON document that is formatted in a specific way so it can have geographic data. And that I believe is standardized to be long lat.

[00:19:42] So there are some rules there, but sometimes it can get really confusing. But at least, you know, as in your case, when you've done it a couple of times, you quickly realize, "Oh, right. That's that weird issue. I just have to fix that," and then it works the next time.

[00:20:00] But if you haven't seen it before, it's mind-blowing. Why would it end up in the middle of nowhere, right?
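
For reference, GeoJSON (RFC 7946) does standardize longitude-first ordering. A minimal example (the Oslo coordinates are approximate); note that swapping the pair puts the point in the ocean off Asia, exactly the symptom described above:

```python
import json

# GeoJSON (RFC 7946) fixes coordinate order as [longitude, latitude].
# Oslo sits at roughly 59.91 N, 10.75 E, so its Point is lon-first.

oslo = {
    "type": "Feature",
    "geometry": {
        "type": "Point",
        "coordinates": [10.75, 59.91],  # [lon, lat] -- NOT [lat, lon]
    },
    "properties": {"name": "Oslo"},
}

# Round-trip through JSON and unpack in the standardized order.
lon, lat = json.loads(json.dumps(oslo))["geometry"]["coordinates"]
```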

[00:20:05] Benedicte: Yeah. And then also, I guess you get a sense after a while of how a flipped one looks, like where the US kind of ends up. You just start recognizing where they will end up, and then you can see, "oh, that's the reason."

[00:20:20] But yeah, it was, there was some other fun things, but we're here to talk about you.

[00:20:25] Mathias: Well I've definitely been there as well, that's why I'm talking about learning on the job. Those kinds of things, right? I just learned the hard way. Oh sorry, go ahead.

[00:20:36] Benedicte: No, go ahead. You keep on talking about.

[00:20:39] Mathias: You just mentioned oceans, and that's actually another thing there that I had not expected to have to deal with in the very beginning of starting this company.

[00:20:50] So take, for example, Florida. We have lots of coordinates, addresses, that are really close to the ocean, like right on the beach or whatnot. And sometimes you have coordinates that, for whatever reason, are placed in the ocean. They're probably misplaced or inaccurate by the local county or city.

[00:21:11] So we filter out coordinates that are in bodies of water, because we don't consider those valid addresses. Well, in order to figure out whether a coordinate is on land or in the water, we use geographic maps of shorelines. And it turns out that, you know, shorelines change over time because of erosion and whatnot.

[00:21:39] So these maps are not always highly accurate. So we sometimes have issues with houses really close to the ocean where we actually don't know if it's in the ocean or not based on the data we have. Sometimes you have to go and look at like satellite photos and things like that to like figure out, is this a real data point or is something wrong with the coordinate here?

[00:22:00] Benedicte: Yeah, and I guess it's getting more normal to just like fill in water as well where you're just like.

[00:22:05] Mathias: That as well yeah.

[00:22:06] Benedicte: Let's just build some more houses. We're just going to dump a lot of rocks and stuff. And then just build yeah, on top of it. And I guess that doesn't get accurately reflected.

[00:22:16] What I was going to ask was like, do you want to give a shout out to like a specific city or county who's like doing a really good job with their data source?

[00:22:24] Is there one who's been like consistently great?

[00:22:29] Mathias: Yeah, I mean, our hometown. Hometown in quotes, Arlington, Virginia, has actually done a fantastic job. Like a lot of cities now, they have these open data portals where they just publish a whole bunch of things for the city. It's things like snowplow zones, like which zones they're split up into for plowing.

[00:22:54] And, I don't know, trash collection schedules. There's just tons of random stuff in there. You know, booking public parks for playing games and stuff like that. Tons of awesome stuff.

[00:23:08] Another city that's doing a really good job, which is probably no surprise being a major tech hub is San Francisco. San Francisco also has fantastic open data. And they're also really good with their geographic data being accessible and up to date. And one of the things.

[00:23:25] Benedicte: It makes it more up to date when it's open, so that the public can kind of go in and do changes. If you see a city opening their data, do you also see that their data is more accurate?

[00:23:39] Is there a correlation?

[00:23:41] Mathias: Yeah, very often you have some small cities that hold their geographic data close to them and they say, "Oh if you want access to our data, it's going to be tens of thousands of dollars a year for this random small town in the middle of nowhere," right? And their data is definitely much more likely to be less accurate.

[00:24:02] I guess the only thing that's good with the small towns is that there's less data, so it's easier to manage, right? A big city like San Francisco has lots of data, so more can go wrong, right? Or slip through the cracks.

[00:24:15] But I really think that having all this data be open and transparent really also helps with improving it.

[00:24:25] And we've worked with, for example, the GIS office in San Francisco on some issues where we were able to provide feedback. So, you know, we often have customers reach out to us who have issues with certain addresses, and often we're able to work with the city or local town to correct those issues.

[00:24:46] So it benefits more than just us, but also everybody else.

[00:24:51] Benedicte: Yeah. Oh, this is so. I'm getting so into it. I just want to talk about geodata. We're supposed to talk about data models so let's bring it back to that.

[00:25:00] You said that there is a reason for why you're just in the US or just why you're just, I feel like that was the wrong way to say it, but you said there is a reason you're in Canada and the US and not the rest of the world.

[00:25:13] And I feel like you alluded to there being, like, a data model, data topic in there.

[00:25:20] Mathias: Yeah. I think that's where we kind of get into having like a tall tale because.

[00:25:25] Ola: Yes!

[00:25:28] Mathias: That's all right.

[00:25:29] Benedicte: Yes. Go for it!

[00:25:31] Mathias: So we kind of started off with, you know, poked a little bit of a hole into it already.

[00:25:38] So one of the issues with working with geographic and address data is there's a lot of edge cases. So we've basically decided that instead of trying to cover the entire world with a really mediocre product that probably isn't that great, we want to try to just be the best possible in the US and Canada.

[00:26:03] And by having a smaller scope, there's a higher chance of us succeeding in that goal, right? So I have a couple of examples of how crazy address formats can be and why this is so tricky, because there might be, you know, a million other different formats outside of the US and Canada.

[00:26:32] One thing is language. Right now we do support Canada, so we're working with French addresses and things like that as well, which is okay. But if you start expanding even further to the conventions of other countries and whatnot, it gets really, really tricky to get things right, and we just don't believe we can do it well.

[00:26:51] In the US in particular, there are actually a lot of funny rules. So large companies, or companies that own buildings, they love if they can get a vanity address for the building. It's typical in New York City, for example. There's an address that's actually on 1325 6th Avenue, New York, which is a cool address, but it's not cool enough.

[00:27:17] So they applied to get a vanity address, which is basically an alias of an address. So their public address is 1325 Avenue of the Americas even though that building is like two and a half miles from Avenue of the Americas.

[00:27:38] Makes my job really hard when the addresses don't make sense. But there's lots of examples of this.

[00:27:46] There's also, One Kendall Square in Cambridge, Massachusetts, which is known for all the like big biotech companies, they're all located in One Kendall Square. But there's no Two Kendall Square. That's just the One Kendall Square.

[00:28:03] You have things like, again in New York, we have your avenues and you have your streets. So it's a little bit of a grid system. So you can kind of easily find an address if the address makes sense.

[00:28:18] So you have a Sixth Avenue, for example, the example from before. But you can actually spell sixth spelled out, S I X T H, or you can do 6 T H, right? You can also just drop the T H and it's just the number, right?
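
Folding those spelling variants together is a normalization problem. A toy sketch of ordinal street-name normalization (the function and mapping table are assumptions, not Geocodio's parser):

```python
import re

# Toy normalizer so "Sixth Avenue", "6th Avenue", and "6 Avenue"
# all compare equal after normalization.

ORDINAL_WORDS = {
    "first": "1", "second": "2", "third": "3", "fourth": "4",
    "fifth": "5", "sixth": "6", "seventh": "7", "eighth": "8",
    "ninth": "9", "tenth": "10",
}

def normalize_street(name):
    out = []
    for token in name.lower().split():
        token = ORDINAL_WORDS.get(token, token)                # sixth -> 6
        token = re.sub(r"^(\d+)(st|nd|rd|th)$", r"\1", token)  # 6th -> 6
        out.append(token)
    return " ".join(out)
```

Comparing normalized forms, instead of raw strings, is what lets a parser treat all three spellings as the same street.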

[00:28:35] So there are lots of different variations there we have to support. And then elsewhere in the country, we have Utah, which likes to have grid-style addresses. So you have an address, for example, 842 East 1700 South in Utah. That's an address, but it doesn't look like any other addresses you normally see.

[00:29:02] Wisconsin has a similar system, and apparently it predates the GPS system, but the idea is, again, using some kind of coordinate so you can better tell where a certain building is located. So that's West 156, North 8480 Pilgrim Road. That's an address. Why are there multiple house numbers? Or are they street numbers?

[00:29:29] Oh well, it's kind of both, right? So, to try to go back to data modeling: when you have a nice little table, you have your house number here in this column, you have your street here and whatnot. It doesn't really fit anymore in this data model. So you really have to be quite flexible, and sometimes you have to do a lot of research to figure out the designation of things.

[00:29:55] So in the US, at least, the USPS, the US Postal Service, they've done a lot of documentation on how addresses are allowed to be formatted. So that's actually a good reference for how to build your data model for storing things. But there's just so many edge cases. And again, that's what makes my job fun. All these edge cases.
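
A hypothetical parser for that kind of grid address shows why a single house-number column stops working. Everything here is illustrative, not Geocodio's code, and a real parser would need many more cases:

```python
import re

# Hypothetical parser for Wisconsin-style grid addresses such as
# "W156 N8480 Pilgrim Road", which carry two grid coordinates
# instead of a single house number.

GRID_ADDRESS = re.compile(
    r"^(?P<ew>[EW]\d+)\s+(?P<ns>[NS]\d+)\s+(?P<street>.+)$"
)

def parse_grid_address(text):
    """Return {'ew', 'ns', 'street'} for a grid address, else None."""
    match = GRID_ADDRESS.match(text.strip())
    if match is None:
        return None
    return match.groupdict()

parsed = parse_grid_address("W156 N8480 Pilgrim Road")
```

The point of the sketch: the schema needs two coordinate fields where a conventional address has one house number, which is exactly the data-modeling wrinkle described above.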

[00:30:19] Benedicte: But if I'm making a SaaS and I need to ask for the address, would just a free text field be the best, instead of trying to force it into different boxes? Or should I try to force them into different boxes?

[00:30:35] Mathias: So in our case, we actually prefer the free text form, because we see input data, so like addresses submitted to us, in lots of different formats. Some of them are super cleanly formatted and super trustworthy: this is actually the street part, this is actually the house number and whatnot. But people also use our systems to process really messy data that isn't super trustworthy in terms of how it's been parsed previously.

[00:31:11] You might also have errors, or things out of order, or something like that. So we've actually optimized our geocoding engine to just work on free-form text, and then our geocoder can make the decision on what's what. Things change, right? You have an address from 10 years ago and the street might've been renamed, for example, and suddenly it doesn't make sense anymore.

[00:31:39] Benedicte: So do you store historical data as well then in your systems? So if you've detected a street name change, do you keep the old street in your systems to be able to check for that?

[00:31:53] Mathias: So our current data pipeline, the way we process data, we primarily just look at the most recent data and use that. We have not really gotten into truly doing historical addresses or like, you know, saving previous street names and things like that.

[00:32:18] But we have different things in our systems, like some of our fallback stuff that allows us still to figure out if an address, a street name has changed in many cases and match it to the correct one. But it's not something that is front and center.

[00:32:36] And it's not something we like have fantastic support for. It gets really complicated, as you can imagine.

[00:32:43] Benedicte: Yeah.

[00:32:44] It sounds like it. And now I just, I really want to go and get a snapshot of your database model. But I guess that's proprietary information. But it sounds like, yeah, it's quite challenging, especially when you have aliases, because then you're, it's basically.

[00:33:02] It's basically like the issues you're coming up with when you're tagging and people use different tags for the same thing.

[00:33:09] Mathias: Yes.

[00:33:09] Benedicte: And then you change the one tag, but not the other tag, and you kind of have to keep these things in sync. So it's more of that kind of problem than just like an address, which I kind of, how I imagine in my head. Like you said, like you have your number and your street and your, and it's one row in one table.

[00:33:27] Mathias: Right? Yeah.

[00:33:29] Benedicte: But it's not, apparently. I don't know.

[00:33:32] Mathias: So the data model itself is actually pretty stable at this point. Where we've spent the most time the last couple of years is the journey from raw data into this data model: having this pipeline with a lot of steps along the way to clean and filter the data. That's what we've been focusing on the most. There are probably going to be some changes to our data model in the future based on all that work.

[00:34:04] I alluded earlier to unit numbers, so apartment numbers and things like that. Right now we don't support geocoding on that level, which is not really a big deal if you have an apartment building; you kind of want the same lat and long for the front door, basically. But you might have row houses where each row house is a separate unit rather than a separate house number, and those should be geocoded to their own buildings.

[00:34:31] And we don't support that just yet, but we're slowly starting to change our data model so we can have that extra level of granularity.

[00:34:46] Ola: Yeah, so if you could travel back in time, what would you undo, or do differently, with any of your data models? It could be just one thing.

[00:35:02] Mathias: I think the biggest thing that's bugging me right now is what we were talking about earlier, the team support. I haven't really gotten around to adding team support yet because I've been dreading going in, splitting that apart, and migrating it over.

[00:35:18] So one of the hardest things at this point in time is, of course, that the product itself is a fast-moving freight train. And you have to change pieces on it, swap the cars out or whatever it's called, while it's moving. That's really tricky.

[00:35:41] Anything we do has to be zero downtime. We can't just take the product down for maintenance for six hours while we migrate something. We really want to avoid that. So changing things like the core of how we deal with users and permissions is going to be a little bit tricky.

[00:36:02] We also want to make sure that we're not messing up permissions along the way, so users don't get access to something they shouldn't have access to. I feel like I've seen it so many times the last couple of months: a company changes some things in their data model or their infrastructure, and accidentally users get assigned to the wrong session.

[00:36:34] And suddenly, you log into a site and you have access to someone else's account. That sounds terrifying to me. That's like my nightmare, if I wake up one day and a customer says they have access to something they shouldn't have access to. Trust is a really, really big deal with what we do.

[00:36:49] You can imagine that for some of the people uploading addresses, the data is not very sensitive, but we have other customers who have very sensitive data. We actually have a separate platform now that is designed to work with healthcare data, patient data and things like that. And that sort of stuff has to be super, super tight.

[00:37:11] There's no room for mistakes.

[00:37:13] Benedicte: Because you're certified. What's that certification called again?

[00:37:18] Mathias: Yeah, we recently went through SOC 2 Type II, which is similar to an ISO certification. It's just an American standard to sort of prove that we're doing everything we need to do: all the controls in place, all the product features in place, encryption and whatnot to keep things locked down.

[00:37:46] Benedicte: I don't know if I've told this on this show before, but when you mentioned that: in Norway, a couple of years back, I think we ended up with the slogan "we're all Kim." Because when tax time came around, people were logging in to check their taxes, and everybody got logged in as Kim.

[00:38:03] Mathias: Poor Kim!

[00:38:05] Benedicte: So luckily for the Norwegian tax authorities, you know, I mean, all tax information is open.

[00:38:14] Mathias: Yeah.

[00:38:14] Benedicte: Because otherwise, you know... well, I guess we got access to more details, but at least in Norway, your final tax returns are public knowledge.

[00:38:29] So they kind of got away with it. I mean, it was a big deal, but I don't think they got heavily fined. But the newspapers had a field day.

[00:38:39] Mathias: Wow, yeah. I could imagine, yeah.

[00:38:40] Benedicte: Yeah, because we were like, yeah, so for a while there, we were like, "we're all Kim."

[00:38:48] So yeah, you can mess up at any level, I guess, at any corporation size. Or not corporation, government size.

[00:38:56] Mathias: Yeah.

[00:38:57] Benedicte: But I do understand that you don't want that to happen when you are a private company that can lose customers. The Norwegian tax authorities don't lose any customers.

[00:39:08] Mathias: Right. There's no choice.

[00:39:10] Benedicte: There's no choice. Yeah. Do we have any more questions, Ola?

[00:39:18] Ola: I was going to check about this tip and trick thing.

[00:39:22] Benedicte: Yeah. Let's do that as a last one.

[00:39:26] Ola: So do you have like a trick to offer our audience from your real world data experience?

[00:39:35] Mathias: Yeah.

[00:39:35] I mean, I guess one thing I kind of live by. It's a more general thing.

[00:39:42] But we're a very small company. It's just me and my co-founder, and we just hired our first employee last year, Corey. He's super, super awesome, and we are so happy to have some more hands on deck.

[00:39:56] And one thing that's really important for us, having so few people on board, is automating like crazy.

[00:40:09] So what I do a lot when working with data is write bash scripts all over the place to automate steps. Most often it's because when I import some data, it's very likely that I'll need to import it again in the future. In many cases, we automate that because we need constant data updates. But in some cases, for example the shoreline dataset, we don't update it every week, because the upstream source probably doesn't update it every week either.

[00:40:43] So just write a small bash script that downloads the data, transforms it as we need it, and outputs it again. Instead of writing some notes, it's just as easy to copy-paste the commands you ran in the first place into a bash script. And next time, you sort of have a self-descriptive way of processing the data.

[00:41:06] Yeah, so I just do that with a lot of things, just trying to automate things from day one if it's not a huge undertaking. Rather than writing a readme file, or in addition to one, sometimes.
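A minimal sketch of that kind of self-descriptive import script. The URL, file names, and column layout here are placeholders for illustration, not Geocodio's actual pipeline:

```shell
#!/usr/bin/env sh
# Sketch of a self-descriptive data import script, in the spirit of
# "copy-paste the commands you ran into a bash script". The source URL,
# file names, and column layout are hypothetical.
set -eu

# Download once and cache locally, so re-running the script is cheap.
fetch() {
  [ -f "$2" ] || curl -fsSL -o "$2" "$1"
}

# Transform: drop the CSV header row and keep only the first two
# (lat/long) columns.
transform() {
  tail -n +2 | cut -d, -f1,2
}

main() {
  fetch "https://example.com/shoreline/latest.csv" shoreline_raw.csv
  transform < shoreline_raw.csv > shoreline_latlong.csv
}

# Only run when invoked with "run", so the functions can also be sourced.
if [ "${1:-}" = "run" ]; then main; fi
```

The next time the dataset needs refreshing, the script itself documents every step, which is the advantage Mathias describes over keeping notes in a readme.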

[00:41:22] Benedicte: That's a great tip. For some of the sources, do you have to download them from a website, where part of the script is opening a website and then clicking a button? Or can you access most of these via some kind of API?

[00:41:41] Mathias: So for the most part, thankfully, we can find a direct URL and download the file. In some cases, you have to log in first or something like that. At this point, we usually don't automate that, because it's a bit more effort, and it's probably going to break in the future if they change the login form or something like that.

[00:42:04] We download it manually and cache the file on our end. And then we write instructions on how to download it the next time, especially for these things that we only update once a year or so.

[00:42:15] That's just how it is. But the biggest issue we have is actually that our company's based out of the US, but recently my co-founder and I moved to the Danish countryside. And it is surprising how many US and Canadian government sites block traffic from IP addresses that are not from the US or Canada.

[00:42:39] So very often we have to process data from servers in the US, or basically proxy into the US using a US IP address, just to access basic government websites. And that's just a bit frustrating because it adds a little bit of friction. It's often slower to work with.

[00:43:03] So I use VPNs and proxies so much, and it sounds super nefarious, right? It shouldn't be necessary, but it's just to browse basic US government websites.
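In script form, that kind of proxied download might look like the sketch below. The wrapper name, proxy address, and data URL are all hypothetical; the `curl` flags are standard:

```shell
# Hypothetical wrapper for fetching US-only data through a US exit node.
# US_PROXY must point at a proxy you control (placeholder shown in the
# usage example below). CURL is overridable so the command can be
# dry-run with "echo" instead of performing a real download.
us_fetch() {
  # $1 = URL, $2 = output file
  ${CURL:-curl} -fsSL --proxy "${US_PROXY:?set US_PROXY to a US proxy}" -o "$2" "$1"
}
```

Usage would then be something like `US_PROXY="socks5h://us1.example.com:1080" us_fetch "https://gis.example-county.gov/parcels.zip" parcels.zip`, keeping the proxy detail out of the per-dataset scripts.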

[00:43:17] Benedicte: Did you know this before the move or did it take you by surprise when you moved?

[00:43:23] Mathias: I knew that it was an issue for some sites.

[00:43:27] We've basically always had a lot of infrastructure with a hosting provider in Germany, so we already had a lot of traffic from Germany for various things. So I knew it was sometimes a little bit of an issue, but day to day, I'm really surprised to see the smallest local towns in the US blocking all foreign traffic.

[00:43:48] And it's just a little bit of an extra hurdle to jump through, right?

[00:43:55] Ola: So why do they do it? Do they think it's a security kind of thing, or?

[00:44:01] Mathias: I think it's definitely a security thing. And I think there's a bit of a standardization within the US government when it comes to cybersecurity in general.

[00:44:12] And I think a lot of these local governments probably contract with some large company that has what's called a WAF, a web application firewall, in front of the website. These WAFs are just really aggressive, and I think it's just a lot of local governments contracting with a few major companies that provide the service.

[00:44:36] And it's just a big fly swatter. I think their point is also: why would someone in Denmark want to access our small-town website? What do they need that for? Right.

[00:44:51] Ola: Yeah. Yeah.

[00:44:52] Benedicte: People don't travel.

[00:44:54] Mathias: No.

[00:44:55] Benedicte: No.

[00:44:56] So where can folks find out more about you and Geocodio?

[00:45:01] Mathias: Yeah. So we are at Geocod.io. That's G-E-O-C-O-D dot I-O, Geocod.io. You can use our spreadsheet upload tool to process addresses, and we also have a very well-used API so you can integrate everything we do into your application.
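For the curious, a free-form query against that API might be sketched like this. The endpoint version reflects Geocodio's public docs at the time of writing, so check the current documentation; the percent-encoding here is deliberately minimal for illustration:

```shell
# Minimal sketch of a free-form geocoding request to the Geocodio API,
# using the "q" query parameter. API version per Geocodio's public docs
# at the time of writing; verify against the current documentation.
# CURL is overridable for dry runs.
geocode() {
  # Percent-encode just spaces and commas for this simple example;
  # a real client should fully URL-encode the query string.
  q=$(printf '%s' "$1" | sed -e 's/ /%20/g' -e 's/,/%2C/g')
  ${CURL:-curl} -fsSL "https://api.geocod.io/v1.7/geocode?q=${q}&api_key=${GEOCODIO_API_KEY:?set your API key}"
}
```

The free-form `q` parameter is exactly the "just work on free-form text" behavior Mathias described earlier: the engine decides what's the house number, street, and so on.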

[00:45:23] Benedicte: Great. Thank you so much for sharing your data model stories with us.

[00:45:28] I knew geodata was a little complex, but it's even more complex than I thought. So kudos to you for tackling that problem and helping other developers not have to tackle it themselves, Mathias.

[00:45:46] Mathias: Well, thank you for having me.

[00:45:48] Ola: Yeah, it was great. Okay. Come back to Data in the Wild next week to discover more data model tips and tricks!

[00:45:59] Ahoy!

[00:46:01] Benedicte: Ahoy!

[00:46:02] Mathias: Ahoy!