How AI Is Built

This episode of "How AI Is Built" is all about data processing for AI. Abhishek Choudhary and Nicolay discuss Spark and alternatives to process data so it is AI-ready.

Spark is a distributed system that speeds up data processing by keeping data in memory. Its core abstraction is the RDD (Resilient Distributed Dataset), with DataFrames layered on top to simplify data processing.

When should you use Spark to process your data for your AI Systems?
→ Use Spark when:
  • Your data exceeds terabytes in volume
  • You expect unpredictable data growth
  • Your pipeline involves multiple complex operations
  • You already have a Spark cluster (e.g., Databricks)
  • Your team has strong Spark expertise
  • You need distributed computing for performance
  • Budget allows for Spark infrastructure costs
→ Consider alternatives when:
  • Dealing with datasets under 1TB
  • In early stages of AI development
  • Budget constraints limit infrastructure spending
  • Simpler tools like Pandas or DuckDB suffice
Spark isn't always necessary. Evaluate your specific needs and resources before committing to a Spark-based solution for AI data processing.
In today’s episode of How AI Is Built, Abhishek and I discuss data processing:
  • When to use Spark vs. alternatives for data processing
  • Key components of Spark: RDDs, DataFrames, and SQL
  • Integrating AI into data pipelines
  • Challenges with LLM latency and consistency
  • Data storage strategies for AI workloads
  • Orchestration tools for data pipelines
  • Tips for making LLMs more reliable in production

What is How AI Is Built?

How AI is Built dives into the different building blocks necessary to develop AI applications: how they work, how you can get started, and how you can master them. Build on the breakthroughs of others. Follow along, as Nicolay learns from the best data engineers, ML engineers, solution architects, and tech founders.

Nicolay Gerold (00:00.654)
Hey everybody, welcome back to another episode of How AI Is Built. Today we have a very special guest. Abhishek Choudhary is a principal data engineer at Bayer, building scalable data pipelines and ML models for pharma. Today we are trying to understand complex data pipelines and specifically Spark: when should you use it, and when should you opt for something else? Okay, so if I manage to explain it to non-tech people, I think for tech people it is too easy. So...

Spark, in this sense, if we really go into Spark, we first need to understand what data is. I mean, I still get this question: what is big data? Why data? Why do we need Spark? I would say anything in a company can be turned into data, which can be used for some other purpose rather than just sitting on a disk. If you can store this data, then you can do something with it.

So any data which has a digital footprint, or is in a digital shape, and that you're saving to storage, means that data can be used in the future. And that data grows over time. And when it grows too big, one computer cannot load it. You will realize your RAM is crashing, or your hard disk doesn't even have the capacity to hold that kind of information. And that's where you introduce

this technology called Spark. I would say it is not the foundational technology in this space. It came from the era of Hadoop. The first foundational stack was Hadoop, or MapReduce, whatever we call it, which actually opened this gate, and Spark, I would say, made it better. Spark made things much easier. So it lets you load this data

wherever it is sitting, in a distributed fashion. It means behind the scenes, multiple computers or nodes or commodity hardware are working together, but as the end user you feel like you are accessing it from a single computer. Behind the scenes, a big mesh of networked machines is actually loading this data. And then it lets you do anything you want to do with that data. It can be

Nicolay Gerold (02:24.398)
dashboarding, it can be extracting some particular information, anything, it lets you do it. And Spark in this particular sense is very fast because it does most of the work in memory, so RAM. Even on the computer side, we know that if you have more RAM, things are faster. Spark took exactly that basic fundamental idea and made things faster: it does most of the work in memory.

And that's why most of the processing is very, very fast in Spark. And that's Spark in a nutshell, at a very basic level.
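As a minimal sketch of that in-memory idea in PySpark (the file name and column here are illustrative, not from the episode):

```python
from pyspark.sql import SparkSession

# Start a local session; in production this would point at a cluster.
spark = SparkSession.builder.appName("in-memory-demo").getOrCreate()

# Hypothetical Parquet file of events.
df = spark.read.parquet("events.parquet")

# cache() asks Spark to keep the data in memory after the first action,
# so the second computation below avoids re-reading from storage.
df.cache()

print(df.count())                               # first action: loads and caches
print(df.groupBy("user_id").count().count())    # reuses the cached, in-memory data

spark.stop()
```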

And if you are looking a little bit under the hood, how would you say, what are the fundamental components of Spark, and how are they actually working together?

Okay, so Spark, when I... The basic fundamental, if you go back to that era, is the core element of Spark, which is the RDD, the Resilient Distributed Dataset; I forget the full name actually, I just call it RDD. But even before that, it is actually a DAG, a data structure known as a Directed Acyclic Graph. Where the process...

The flow of input in this particular graph is one way. It doesn't go circular, it just goes one way. That's the principle of the RDD, and it is distributed. So just imagine you have a directed graph, and you create a distributed view, so you can connect bigger graphs; it's like a forest. That's the main principle of this part. Everything, as a base principle, is the RDD. That's the core element.
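A small sketch of that one-way, lazy graph with the RDD API (assuming a local Spark session): transformations only describe the DAG, and nothing runs until an action is called.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-dag-demo").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(1_000_000))          # source of the graph
squares = rdd.map(lambda x: x * x)              # transformation: lazy
evens = squares.filter(lambda x: x % 2 == 0)    # transformation: lazy

# Only this action triggers execution of the acyclic graph above,
# distributed across the RDD's partitions.
print(evens.count())

spark.stop()
```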

Nicolay Gerold (04:15.31)
On top of that, the next version came up, which is known as the DataFrame. It is nothing but, if you compare it with something like a Pandas DataFrame, a tabular representation of the data, which creates an abstraction on top of the RDD. The RDD gives you a lot of control over the data, but the DataFrame makes things easier for you. So that's the DataFrame layer they created, and on top of the DataFrame, since it is tabular, you already introduce the concepts of a database:

SQL queries and so on. So that's where the SQL component comes in, the query planning, how to read that data, what kind of optimizer sits on top of the RDDs, so that you can load this data faster and whatever operations you want to do on the data can be done much faster. So if you look at the evolution: first it was the RDD, you could program everything against the RDD, then eventually it came to a point with the DataFrame where the recommendation is that you don't even need to use RDDs anymore.
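A minimal sketch of those two layers, the DataFrame API and SQL on top of the same optimizer (the tiny example data is made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dataframe-sql-demo").getOrCreate()

df = spark.createDataFrame(
    [("alice", 34), ("bob", 28), ("carol", 45)],
    ["name", "age"],
)

# DataFrame API: the query planner/optimizer decides how to execute this.
df.filter(df.age > 30).show()

# Or register the same data as a view and use plain SQL on the same engine.
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show()

spark.stop()
```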

Just do everything with DataFrames, because the most optimized way to iterate over the data has already been built for you. All the APIs are already written for you. So with DataFrames you can do almost everything, and if you drop down to the RDD layer, you may do something unoptimized, or you may need to know much more to make it better. So that's the main principle. Another layer that came on top is, I would say, Structured Streaming.

It's a streaming component on top of Spark. Then there was MLlib, the machine learning library. There was GraphFrames, which didn't take off much; I don't think it's really still there. And right now it is all Python, Pandas, and all those stacks on top. So in short, this is what Spark as an ecosystem is. And Spark, where do you see it actually

being necessary, especially if you're looking at data processing and ML applications? Where is actually the switching point where I have to move from something like Pandas to a distributed framework?

Nicolay Gerold (06:32.614)
To be honest, it really, really depends whether you really need Spark now, at this moment in 2024. If you ask me, I will say it really depends. First, you need gigantic data. I would say gigabytes are not even close to that; you don't need Spark for 100 GB of data. It's not required. Even for a terabyte you don't need it; there is a way, with a

little bit of vertical scaling, where things are easier. But when your data goes beyond terabytes, and your processing layers, your data pipeline, are doing many things, for example it loads CSV data, converts it to something like Parquet, and then you want to build a dashboard, you want to run streaming on it, or you want to do many manipulations on this data to get from A to Z, and this data volume is very high, or

you could say uncertain, it can grow from 10 GB to 200 GB to 1 TB, when there is uncertainty, then Spark is pretty good, because scalability is done very well in Spark. If you are allowed to do wild, wild scaling, like whenever anything comes just scale up, don't even think, then Spark just works brilliantly, very efficiently.

Efficiently is maybe not the right word. It works; it just makes things happen. So that's where I would say: when there is uncertainty and your data load is very high. And the DataFrame is close to Pandas, so you can do the transformations and get things done. Since you asked when you should use it, that's the point where I would currently recommend using it.
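A hedged sketch of the kind of pipeline described above, CSV in, Parquet out, in PySpark; file names, columns, and the partitioning column are assumptions for illustration:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

# Load the raw CSV (illustrative file and column names).
raw = spark.read.csv("raw_events.csv", header=True, inferSchema=True)

# A couple of transformations on the way from A to Z.
cleaned = (
    raw.dropna(subset=["event_id"])
       .withColumn("event_date", F.to_date("event_timestamp"))
)

# Write partitioned Parquet; the same code runs on 10 GB or on terabytes,
# you scale the cluster rather than rewriting the pipeline.
cleaned.write.mode("overwrite").partitionBy("event_date").parquet("events_parquet/")

spark.stop()
```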

And you already teased it a little bit, and I think that's something I want to go into a little bit more. There has been so much development in the space since Spark came out. Do you think, if you had the opportunity to rethink your tech stack and basically rebuild everything you have right now from scratch, would you be using Spark?

Nicolay Gerold (08:50.35)
question.

Nicolay Gerold (08:54.382)
Probably in the production pipelines where the data load is very high, in some of the applications where we really have a very high load of data, there, yes. Otherwise, I won't. So I will say this: if I have a Spark cluster set up, it's already there, it is integrated with your object storage, the cluster is there, and you have a very good

notebook environment like Databricks, then why not? Simple, it's very fast, everything works very fast. But if you ask me, now you want to do everything from scratch, you have nothing, then Spark is complex. Setting up a cluster on your own and managing it is a very complex and expensive process. Even processing data in Spark is sometimes expensive; it takes too long compared to some other technologies out there. So for now, I would start with

Python DataFrame libraries like Polars, or I will use DuckDB, because it supports streaming, so even if the data size goes beyond memory I can use it. I can even use Ibis, the Voltron Data interface, because it can connect to multiple different backends, which is much easier for me. Spark won't be my first choice if I'm currently exploring things, definitely not going to be my first choice, because I

want to focus on the problem, and these technologies are much faster and more efficient. And if it comes to production and I have, you could say, the luxury of Spark cluster management or Databricks or something, then maybe yes. Yes. And I think it's very similar to machine learning, where it's always the question: do you deploy it yourself, or do you actually use a managed service for it?
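To make the lighter-weight alternatives mentioned above concrete, a small DuckDB sketch (the Parquet path and column are illustrative); DuckDB can scan Parquet files that are larger than memory without a cluster:

```python
import duckdb

con = duckdb.connect()

# Query Parquet files directly; DuckDB streams over them,
# so the dataset does not have to fit in RAM.
top_users = con.execute(
    """
    SELECT user_id, count(*) AS events
    FROM 'events_parquet/*/*.parquet'
    GROUP BY user_id
    ORDER BY events DESC
    LIMIT 10
    """
).df()

print(top_users)
```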

Nicolay Gerold (10:54.382)
And I always recommend to people: unless you actually have an MLOps team, you should probably go with a managed service most of the time, because it's just way easier. Do you think there is an in-between step where you actually start to get into Spark, or into a managed service, which exists on most of the different cloud providers, like AWS or GCP? When would that be a good choice? Okay, so

I am pretty much biased towards Kubernetes. So you can imagine everything is Helm for me. Since my stack grew up there, I know how to deploy Spark, I know how to scale things up. So if the choice is mine, I go with Helm, like any pre-built charts on Kubernetes; it's like a package manager in Kubernetes which gives you Spark, or any other tech. And I can deploy it, or I can even

use something pre-built, because it is mostly pre-built anyway; somebody has already built it for me, I just need to install it in the cluster. If it comes to managed services specifically, then AWS EMR used to be pretty good for Spark, but not anymore. So if you want a Spark managed service, I will still say that Databricks is way better. It can be a question of cost, but I will say it is far better on the experience side. If it comes to the

machine learning space, as you mentioned, a managed service in between is always good. So for example, if I want to use MLOps on one technology, let's consider Feast or something similar. If you want to do it, first, there are so many of them that you cannot lock into one without even knowing how the others work. So when you have your own setup, or at least, you could say, a staging environment or

whatever you call it, you can first install it yourself and run your stack for some time. And when you reach the maturity level where you understand this technology is going to stay, and it is taking too much load, and whoever is maintaining it, your team, is not set up to handle the infrastructure, then the best bet is: whatever you actually worked with, get the paid service for it and move that workload over there.

Nicolay Gerold (13:20.398)
If you have a team, you have a very hard optimization problem and a cost problem, and your team is very much on top of this infrastructure setup, then maybe yes, you manage your own on the open source stack.

If you are just doing exploratory analysis, definitely just do everything yourself. Don't even try managed services; you won't even understand it, and it will be expensive. In between, where most teams are, technically if you think about it, it is much easier to go with managed services once you are well aligned on what you actually want.

Yeah. And nowadays, I think what we're seeing, especially because we are working so much with unstructured data, is that AI is moving closer to the data, because you have to do some pre-processing or some processing with AI already. How is actually the integration of AI into Spark and into other data processing frameworks?

Something like Daft, for example, actually natively supports AI being part of the data processing. So let's first try to understand this problem: we have unstructured data. That's one thing, and we are saying AI is a part of it. When it comes to data processing, there's no AI. AI is not going to do anything there. It is just...

To be honest, it is more of a sales pitch in that way. You have unstructured data, like text or paragraphs of data, but you are going to engineer that. It needs to be stored somewhere, and then you're going to process it. So to be honest, if your data size is small, you can do this in Polars or you can do this in DuckDB or Daft. All these DataFrame libraries make it way, way easier to explore. So my point would be that

Nicolay Gerold (15:25.134)
at least with Polars and DuckDB, the problem is scalability. If your unstructured data is very huge, you start with a Daft kind of technology, or any other Apache Arrow-based integration which can be distributed. Why? Because then you start with this technology, write the code, and when you think all the transformations are right, you slowly

move to a distributed fashion. So you are already part of that stack, and you convert your unstructured data into something AI-ready. As soon as it comes to an API call, where you have some data and you want to call some AI service, then you don't need any of this. It is so slow. If you try to just create an embedding, for example, with OpenAI, it is terribly slow. So I don't think you need anything distributed there. You need more of a reliable retry engine, where

you retry the things that are failing on the OpenAI end, and you usually go with an asynchronous model. So you need a very solid backend rather than anything to do with distributed computing. Because not everyone is using Groq, I mean, that is one of the fastest ones, but any other endpoint is so slow that I don't think we have a distributed systems problem there. We have more of an infrastructure maturity problem there. How would you actually best set it up?

If you had the choice of actually designing the infrastructure and the architecture completely, how would you actually go about the integration of AI into the data pre-processing, like embedding calculations, tagging, classifications, what you typically have to do?

Nicolay Gerold (17:10.798)
So let's consider, of course I won't be rebuilding a new infrastructure for that. I would like to extend whatever stack we are in. Since I said Kubernetes, forget Kubernetes, just imagine it's all containers. I want to put everything in a container, in a Dockerized fashion. So for the unstructured processing part, if my pipeline is in Spark for my data processing, I would say:

for the AI-specific part, I will prefer not using Pandas-on-Spark; I will prefer to use a Pandas-like frame, it can be Polars, it can be Pandas 2, it can be Daft, anything that is closer to the Python ecosystem, so that plotting, managing, and using libraries like tiktoken is easier, rather than using a distributed engine and then struggling: come on, why is it not working, why is it not prompting?
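A small sketch of that "stay close to Python" idea, using Polars plus tiktoken to token-count a text column before anything is sent to an LLM (column name and encoding are assumptions; map_elements assumes a recent Polars version):

```python
import polars as pl
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

df = pl.DataFrame({"text": ["first document", "a much longer second document ..."]})

# Count tokens per row so batches can be sized before calling an embedding API.
df = df.with_columns(
    pl.col("text")
      .map_elements(lambda t: len(enc.encode(t)), return_dtype=pl.Int64)
      .alias("n_tokens")
)

print(df)
```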

So that will be my first principle: it may take a bit of time, but I really want to be able to use all the possible AI toolkits which are available. So the data processing will completely depend on whatever DataFrame library the team prefers. And then, when this data is transformed and tagged, I would like to store it; it depends, if

any of it needs an analytical endpoint, then we need to think about a data warehouse, and if not, then it should sit cheaply in blob storage. Because in the end, it is mostly predictive annotations, or it will be model training, or it will be sent to an LLM and we get something back. So then I will keep it there; it will be more coding, an application. So again, you just create a container,

deploy it, and just wait. So I will keep it there. Everything will be tagged, more on the metrics side. I think on the AI side, as soon as it enters, the data metrics become more important; I would say observability becomes more important. But to be honest, I don't see any real tool which actually solves that. I like Datadog, it has a lot of information, but it's expensive.

Nicolay Gerold (19:35.662)
Grafana and Prometheus can do magic if you have a very good infrastructure or DevOps team who can do this. But you need to observe a lot of points in this AI stack, because it is so new and evolving so fast that you just don't know what kind of signal you need to find the problem: the data drift problem, or why it is slow, is it on your end or the API endpoint,

and then these local LLM models, how to do inference on them. So the problem surface just keeps on growing. Your infrastructure is going to be always in a staging environment. There's nothing called production for me in this stack. Nothing is production. I mean, we use production for the stack, but there's nothing production. Then maybe MLflow or that kind of stack, or Feast for a feature store if you really need features. And then SDKs...

As a library, I like LangChain. As a user, I hate LangChain. It's just too heavy and too complicated. So if I am very much tied to one particular LLM provider, then I will stick to that particular package, like OpenAI, then just OpenAI, plain and simple. Then the vector database comes in. And the vector database is becoming, I would say, a giant.

It's becoming a real database, because just imagine: most of the expectation is, I will upload my text, I will do semantic search, I will get a lot of information fast. So you keep on growing the index of the database. So it is becoming a critical problem, because we are thinking we just dump the data in and things work. But from an infrastructure point of view, your data size is increasing, sharding is happening behind the scenes, your indexes are getting crashed,

and it becomes a distributed systems problem. So I will consider a vector database as a database problem. It becomes an integral part of the database or infrastructure layer, and I keep everything else lightweight. The applications will run with their packages the way they want. Whatever necessary libraries and tools we think can be used by everyone, like metrics and observability, those will gradually move to

Nicolay Gerold (22:02.478)
a stage of becoming part of the infrastructure. So we get an eventual growth of the infrastructure. Yeah. And looking at the storage component, it sounds to me like you should have these stages, especially in whatever data storage you use, that you actually use multiple different buckets, and you basically have transitions, like you have

an event which triggers a function. So you take the object, do some classification, do your embeddings, and then insert it into the next storage. And the same for the warehouse environment, where you have staging tables: the data is inserted, which triggers the classification, the structuring, and whatever else, and is then inserted into the real warehouse where you actually do more of the processing. Yes, that's true. I mean, I think it makes things more complicated, to be honest.
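A hedged sketch of that staged-bucket pattern: an object lands in a raw bucket, a triggered function enriches it, and the result goes to the next stage. The storage client interface and the classify/embed helpers are placeholders, not a specific cloud API:

```python
import json

def classify(text: str) -> str:
    # Placeholder: in practice a model or LLM call.
    return "unlabeled"

def embed(text: str) -> list:
    # Placeholder: in practice an embedding model call.
    return [0.0] * 8

def handle_new_object(storage, key: str) -> None:
    """Triggered when an object lands in the raw bucket."""
    record = json.loads(storage.read("raw-bucket", key))

    # Enrichment step: classification and embeddings.
    record["label"] = classify(record["text"])
    record["embedding"] = embed(record["text"])

    # Write to the next stage; a warehouse staging table would play the same role.
    storage.write("processed-bucket", key, json.dumps(record))
```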

It's very hard to design. Just as an example, I was recently working on embeddings. I was trying sentence transformer embeddings, which are 768 dimensions. Then we thought, what about trying OpenAI embeddings, which are almost double that, 1536, and seeing the performance of the search. So we just created a replica of the embeddings in the database. Now, what should we do, how should we plan this thing?

The same thing goes with the models, as you mentioned. I was testing the performance on OpenAI, now I'm using Cohere or something else. So the biggest problem is we get to a stage of infrastructure pollution where we have everything. It's like, yeah, you asked for that, yeah, that is there, that environment may have that. So maybe you will end up having multiple infrastructures with different, different stacks.
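One hedged way to keep that pollution manageable is to version embeddings explicitly: store the model name and dimension next to every vector so the 768-dimensional and 1536-dimensional variants can coexist instead of overwriting each other. The field names here are illustrative:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class EmbeddingRecord:
    doc_id: str
    vector: list
    model: str       # which embedding model produced the vector
    dim: int         # 768, 1536, ...
    created_at: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

# The same document embedded twice, side by side, filterable by model at query time.
records = [
    EmbeddingRecord("doc-1", [0.1] * 768, "sentence-transformers/all-mpnet-base-v2", 768),
    EmbeddingRecord("doc-1", [0.2] * 1536, "openai/text-embedding-ada-002", 1536),
]
```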

And that gives you two things. First, you let your team, the data science team, and different teams try different permutations and combinations of things, because it is too early to lock this environment down and say this is what you will do. You cannot. This is the time to explore many things. So it means you will have many different infrastructures. That's why I said one big governing infrastructure where we definitely know you need these components,

Nicolay Gerold (24:24.27)
and after that, all these other components will be kind of a subset of the infrastructure, small, small, tiny, and they communicate with the larger infrastructure. And that's my point of migration: when we think one technology has become stable, let's make it part of the bigger infrastructure rather than keeping it offside. And that is where, as I said, there is pollution, and even in production,

the infrastructure of AI is not really a very solid infrastructure. It's more like, we just want to run it, it's just a business, we just want to see. So yeah, that's why it is hard to actually blueprint. I cannot create a blueprint and say this is exactly what I will do. No, I cannot. At least I can't. It's too hard. I think in AI, you especially have a problem with

forward compatibility, especially if you're looking at LLMs and embeddings, because you have to plan for there being a new embedding model in the foreseeable future that you have to swap in. And this is something that is hard for most people to fathom: are you recomputing it for all of your data, or are you already preparing for it? How can you swap it out?

Are you introducing versioning from the get-go?

I think in AI, it was not the case before. I think with this blast of the new wave of AI, this generative AI, ad hoc work and breaking things became the norm. So even the libraries we are using, they change; the deprecation is not

Nicolay Gerold (26:20.238)
that well maintained. Things just change. When people install it, they say, what is happening, why is my entire thing failing? So things are more hacks than a product line. So apart from building this infrastructure, the product line of generative AI is still evolving; we really don't know what the real products are. So I think this is a pain, but it is a natural transition.

If you think about the big data time, I still remember when I was using Spark in version 1.2. When I started using Spark, my team was curious: what the hell is this? We have MapReduce, which can do this thing, so what is this Spark, and why is it so complex? Why do we need to write Scala? Because it introduced Scala, and before that it was all Java, and we were writing Java. So they were like, what is this new thing we need to learn? I remember we hacked a lot of things, and at that time we got Hive,

we got Spark and Oozie, all those things, and as that evolution happened, we were hacking; we were trying and hacking. Generative AI is in a very similar phase. It has a lot more going on, because right now there are many people contributing, but the concept is the same. We are in the phase where we just don't know what is there. So everybody is bringing something on board, and as engineers we just say, okay,

let's try this, and if something breaks, either we fix it or, as I usually see the pattern, people just rewrite it. There's no point in being attached, because the codebases are not that big. It's not like you have invested months of time; usually it is small, so it is easy to change. If you have something which is very solid production code, running for months and on, then I wouldn't be expecting generative AI there. I would be using traditional machine learning there, where

it is like training a boosted model, then putting metrics on it, then the data. So predictions, those things, I think, are still safe. They are evolving, and the processes are very clean and neat. It's specifically generative AI where things are more hacky. Yeah. And looking at using AI in the data processing, something you don't really like:

Nicolay Gerold (28:44.654)
do you have anything you've picked up on what to do in the idle time when you are waiting for the AI endpoint to return something? Have you come up with something to do with the rest of the data? Oh, many, many times. So for example, I had a hundred thousand records, just as an example, it happened this week, I think. And I wanted to create embeddings with OpenAI. You have to understand:

first I used Spark, because my framework was in Spark. So I thought, okay, let's do a mapPartitions in Spark, so I will get a batch of the records, push them through, and get them back fast. And I was seeing it was taking too long. Very, very long. So I was not sure; I hadn't put a print statement in, thinking it was not optimized. And it took hours to return. It returned, but it took so long that the 100,000 records, I think, took

1.5 hours. They are long records, not small ones, but the embedding took 1.5 hours, and processing 100,000 records in Spark should take how much, 30 seconds, 10 seconds, 20 seconds, depending on the cluster size. And then I realized, okay, it's not me, it is that endpoint which is taking so long to return. Same thing here: we have an in-house chat, an in-house, I would say,

LLM stack, where our internal employees interact with the system. But just imagine you have multiple people trying to interact, and they're building applications which are getting some data, going to OpenAI, getting a summary of the data, or asking OpenAI to extract some information from it. Then you come back, do something, then again go to OpenAI and say, okay, now do this with that data. So you have this back and forth of

an iteration before finally getting output, and you run this in parallel ten times. You'll wait for ten minutes, like, what is happening? What happens is your end user will ask, what is this application? Yeah, it works, it is fine, it is working, but ten minutes just to get this particular ID out of my query, why? The why is that the latency of the responses of

Nicolay Gerold (31:12.302)
these LLM providers is not fast enough. You just cannot do anything about it. And the larger the context grows, the slower you get. So we started evolving. We started creating a cache on our end. So if somebody asks a question that has already been answered, you cache it, and if you have a warm cache, you've got the answer. Don't even

talk to OpenAI. So we started recording every event, every interaction. It means you start building your own stack to not interact with the LLM endpoint until it's really required. That way we are solving it. But then your cache grows infinitely, because even a simple change in your query means your cache detects it is not an exact match. So then, again, you create a cache with semantic search on top, so that close matches hit the same cache.
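A minimal sketch of that two-level cache, exact match first, then a semantic "close enough" match; the embedding function and the similarity threshold are placeholders:

```python
import hashlib

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

class LLMCache:
    def __init__(self, embed_fn, threshold=0.95):
        self.exact = {}        # query hash -> cached answer
        self.semantic = []     # (query embedding, cached answer)
        self.embed_fn = embed_fn
        self.threshold = threshold

    def get(self, query):
        key = hashlib.sha256(query.encode()).hexdigest()
        if key in self.exact:                 # warm cache: skip the LLM entirely
            return self.exact[key]
        q_emb = self.embed_fn(query)
        for emb, answer in self.semantic:     # close-enough match on meaning
            if cosine(q_emb, emb) >= self.threshold:
                return answer
        return None                           # miss: the caller talks to the LLM

    def put(self, query, answer):
        key = hashlib.sha256(query.encode()).hexdigest()
        self.exact[key] = answer
        self.semantic.append((self.embed_fn(query), answer))
```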

So it becomes like a Frankenstein's monster of infrastructure, where just to make things faster you created cache, cache, cache, cache, cache. But for now, that's what we can do. We want fast interaction. And that's why the data iteration is tricky for now. It's hard to build, but you can't not build it if you want it. Yeah. And that's a massive pain point. I think most people probably aren't operating at your scale,

but it's already a pain, especially handling all the fallbacks and the errors from OpenAI; it has been down so many times in the last month. Yes. How are you adjusting for that? Are you building fallbacks for everything? That's what I'm doing in production: I have like three different services which are backing each other up.

Trust me, I'm not making this up, this happened today with my colleague. He was getting the 429 exception, the favorite one, the token limit reached, and he asked what we can do about it. It's an asynchronous model, a lot of things in parallel, and this happened. So in those cases, I actually use tenacity or a backoff kind of library. To be honest, I just don't want to write that code myself. I just want

Nicolay Gerold (33:29.902)
the retry to happen, say, five times, in a way that it waits for 10 seconds, fails again, waits 20 seconds, in that kind of randomized order. So I use those libraries at all the essential points where we realize this particular piece keeps failing. So maybe we do that. Another thing we are currently doing: when we ask something, usually it has already been asked, so that again goes back to the problem of caching.
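A hedged sketch of that retry pattern with the tenacity library (assuming a reasonably recent version for wait_exponential_jitter); the exception class and the call itself are placeholders:

```python
import random
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential_jitter

class RateLimitError(Exception):
    """Stand-in for a 429 'token limit reached' style failure."""

@retry(
    retry=retry_if_exception_type(RateLimitError),
    wait=wait_exponential_jitter(initial=10, max=120),  # ~10s, ~20s, ... plus jitter
    stop=stop_after_attempt(5),                         # give up after five tries
)
def call_llm(prompt: str) -> str:
    # Placeholder endpoint that fails most of the time to exercise the retries.
    if random.random() < 0.7:
        raise RateLimitError("429: token limit reached")
    return f"response for: {prompt}"

print(call_llm("summarize this record"))
```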

But currently, if you think about it simply, we just try to retry. Retry, retry. If in the end it is still failing, sometimes it really fails even if you retry ten times, it just fails, then we flag it in a log as an alert, usually in Teams, so that we realize what happened. So that becomes kind of a PagerDuty for us. If it happened five times, ten times: is it really the OpenAI endpoint, or is it something

with us? It can be something on our side, we may actually have exhausted a limit or something like that. So we then try to find out, and if it is really OpenAI or any of the providers, then the answer is: it sucks, we just can't do anything, unfortunately. If the problem we find is on our end, that's very good, then we know what to fix. But if it is really not on us, and fixing it

is impossible, then that's where we come to the point where we think, let's try a local LLM. And trust me, it does not solve anything, because then you need to be GPU rich, because the inference cost hits, and that again is expensive. It's not cheap. Yeah, and when we look back at actually building pipelines, what

tools do you actually use for the orchestration, together with Spark or even with the other frameworks we've mentioned before? To be honest, for a very long time I have used Airflow for orchestration. I'm pretty confident, I'm still very confident, in doing everything with Airflow. There are many, many complaints about Airflow, but I feel very comfortable because I have been using Airflow, I think, since the beginning, when it was very, very immature.

Nicolay Gerold (35:48.59)
But if I honestly take the feedback of my team, I realize Airflow is not a very easy tool to use for new, upcoming data scientists. It's not like they tried once; they tried many times and it just didn't work, they just didn't like it. It's not only the UI; there are many crucial things which just don't work in Airflow, like you need to write quite a lot of code to orchestrate things.

It is not like you put a decorator on your code and you have already created a DAG. You really, really need to write Airflow code to schedule things. So that became an eventual pain. Then for the Kubernetes stack, we have another option with Argo, in a YAML format, Argo Workflows, which is easier.

Since it is YAML, if you know YAML, it is pretty easy. But again, the complaint is that not everybody is comfortable writing YAML for workflows. I have tried Flyte. It's very good, but a very software-engineering-driven orchestration framework. It's very mature on Kubernetes, but it becomes a prerequisite for a data scientist to also know software engineering, like understanding all of it.

I would say my team was not comfortable with that. And very recently we started orchestrating with cron jobs, and trust me, it is going super well. It is so good, it is so amazing. If you have a very simple setup, you really don't need much. If you can simply orchestrate with some small cron jobs, or even step functions, something small, then small is better.

If you have a thousand jobs but all those thousand jobs are monotonous, and there is nothing new or eventful about them, there's no point in having a big blast of orchestration. If you have too much complexity and too many people contributing to the code, of course, I understand, pipelining and visual orchestration are good. So it really depends. If we are going for the big and giant, I still say

Nicolay Gerold (38:08.782)
either go with Airflow, or if you have Kubernetes, stick with Argo; there's no point in over-engineering. Or if you want something more data-science-y, Dagster is pretty good. But for my choice, no orchestration, just deploy very simply; maybe a GitHub workflow can do the job, then just let it run and get it over with, don't over-engineer. Argo basically comes with our Kubernetes setup, so I just pick it.

So I have no preference. I just want to run it. What would you say is actually missing from the data and AI space to make it more pleasant to work with, and to actually bring things into production and out of the endless prototyping mode?

Nicolay Gerold (38:58.018)
Use cases... I think it is nothing to do with... First, okay, sorry, I will go back to the zero point: accuracy. This hallucination is a very, very risky business. At least in our case, it is very, very risky. Initially, when we started, it was amazing, we all felt so good, it answers everything. Over time, when we started doing something serious, we realized: no, this is not a serious tool.

It's not at all a serious tool that we can rely on as-is. It can do maths, yeah, but not always. If you think about computer science, the most important thing about computer science is consistency. That is missing. You can be consistently wrong, I'm fine with it, because then I know you are wrong about this. But that is the biggest problem with the LLM ecosystem, the language models.

I'm hoping that over time we will have better and more mature agents, which will be a firewall between the actual interaction, your question, and the response, which we are building in the end to make the user response more consistent. I think that ecosystem is very hacky, because we are just doing something. To be honest, fine-tuning doesn't fix that.

I have done too many fine-tunes; it doesn't work. It sounds logical that fine-tuning makes things perfect, but no, it doesn't. That consistency, I would say, must be there to make a serious application. Even for chat, some can argue you don't need 100% accuracy. I would say not in all businesses: in our business, our chat engine works for doctors. So just imagine the doctor asks, can I use this medicine of your company with this conception, and it just says yes, and it is wrong this time.

That is not good enough, right? So even for some chats, it's not okay to have a wrong response. So consistency, and then speed: latency is a huge problem. Imagine a data product where you are targeting the lowest-latency output, and then comes the AI component, which just breaks everything; they don't go well together. And then finally, cheap:

Nicolay Gerold (41:20.558)
it needs to be cheaper. For now it's not cheap, it's not cheap at all. Yeah. And I think data and AI is mostly a game of knowledge. What are the little tricks and hacks you've picked up in your projects that actually help you make LLMs more consistent nowadays?

Nicolay Gerold (41:49.678)
I will say code in the middle helps me a lot. Even for the response I get, for example: first you structure your response, you say what it is. It will be much better if you can say, give me the response in JSON or in some format, so that you can validate the response, and then maybe you can reshape it and present it as text.
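A small sketch of that "ask for JSON, then validate" trick, here with pydantic; the schema fields are made up for illustration:

```python
import json
from typing import Optional
from pydantic import BaseModel, ValidationError

class Answer(BaseModel):
    summary: str
    confidence: float

def parse_llm_reply(raw: str) -> Optional[Answer]:
    try:
        return Answer(**json.loads(raw))
    except (json.JSONDecodeError, ValidationError):
        return None   # signal the caller to retry or send a reshaping prompt

reply = '{"summary": "Drug A interacts with Drug B.", "confidence": 0.82}'
print(parse_llm_reply(reply))
```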

Somehow, in my experience, these structured responses work better; it is very crystal clear with those messages. The second hack is the prompt. The prompts somehow, in my experience... one of the prompts I wrote worked pretty well for a month, then I don't know what changed, the prompt started reacting badly, and again I needed to change the prompt with some

keywords, some more information, blah blah, and then it started working again. So working consistently with the prompt, refining it, tuning it against the responses, giving more and more examples, but keeping the context length reasonably small, is essential.

And finally, in this whole context, it really depends what you're asking. I'm investing quite a lot of time in it, so I started understanding how I should ask so that it will give me a better response. So the system prompt customization we did needs to be done very well, because with your users, you don't know how they will ask: "I know something, get me that."

So I get asked something like, get me the sensor. So somehow you need to filter those questions and reshape them in a way that the LLM reacts better to. You need to identify those signals and change the query. These are the current hacks I personally use to make it better. And what is something that you would love to see built in the data processing or in the LLM field?

Nicolay Gerold (44:03.95)
Data processing, I would say, could be much better.

This unstructured data we recently discussed: it needs quite a lot of work, it's nobody's fault, but if there were engines, or I would say, the right word would be a set of functions or libraries, which could do generic processing of your data for all the well-known data types, it would be very helpful; then the starting point wouldn't be zero.

It would keep evolving. I'm not saying those libraries could solve all the problems right away, but like Spark provides you a set of libraries or integrated functions, which they recommend you use, with which most of the shaping can be done. So something like that. Also integration that talks directly to the distributed side: so if you are using any distributed technology,

and somehow you have a direct integration with OpenAI, and you get the responses in a distributed way, rather than you engineering all of this yourself, that would be very good. Then you could have a distributed DataFrame and it just works. And finally, metrics. Metrics are the wild, wild west. I think everyone develops their own, so there is no standard on what metrics to watch. What are the signals? What are the bad signals? What are the good signals?

And if something breaks, what to look at, and what it exactly maps to. If somebody asked me, in this LLM stack, what are your metrics, what are you actually storing, I would say I don't have that answer. So I just want maybe a pre-built stack, where somebody says upload this into Grafana and everything is ready, something like that. On your own, it is too hard to understand which things you could be missing, so I couldn't find anything

Nicolay Gerold (46:00.462)
which is really reliable, where yeah, this open source library is good, let's load it into Grafana, and everything is sorted. Great. And where can people actually follow along with you and your work? Or how can people be helpful to you? You can tag me on LinkedIn or on Twitter. Yeah, those are the main places I'm in.