Dmitry Petrov: DataOps & ML automation with DVC (ENG)

Published: Sep 03, 2024 Duration: 00:23:16 Category: Entertainment

Hello everyone, my name is Dmitry. I'm going to talk about DataOps and ML automation with DVC. The first part of the talk is about the principles behind building AI platforms and ML automation, so it is going to be a little bit abstract; I'm very sorry about that. The second part is about DVC, the tool that we have built, and I'll show you which of those principles we implemented in it. So the second part will connect the dots between the abstract concepts and the actual experience, the actual toolset.

A little bit about myself. I'm originally from Russia, from a rural place in Siberia. My entire life I have been moving from north to south, and now I live in California. I am the author of DVC, Data Version Control. If you are not familiar with DVC, it's an open source tool that implements some principles of DataOps and data versioning; I will talk about this in the second part of the talk. Right now we are building a startup in San Francisco around DVC. We built DVC, and we built some other tools for automating ML processes. Before that I was a data scientist at Microsoft, and I have seen how large companies automate ML processes, what kinds of best practices they use, and what kinds of tools they have built. I worked with internal AI platforms, and my big belief is that those platforms are going to change and will be replaced by ecosystems of tools. This is what we are going to talk about in the first part of the talk. I started my career as a software engineer, so I'm familiar with the tools and best practices of the software engineering industry, and I think that in ML projects we need to learn from those best practices and that toolset to improve our ML workflows and ML processes.

So what lessons can we take from the software engineering experience? First of all, it's version control. Version control is the foundation behind automation and behind software development. All the automation in software development is based on version control tools; today that is Git in most cases. We are talking not only about versioning the code. We are talking about code review, about automating tests and the CI/CD process, and even team collaboration is based on version control tools: we use GitHub or GitLab for collaboration, and Git works under the hood of those systems.

But AI systems, ML systems, are quite different from software engineering, and there are a few major differences we can distinguish. First of all, ML is metrics driven. We need to have metrics in the process. A source code diff between your versions doesn't make sense on its own anymore; you need a diff between your metrics. You need to understand how your changes, whether code changes, data changes or hyperparameter changes, affect your metrics. The second difference is computational resources. For modeling you need a lot of memory, GPUs, or hard drive space to store datasets, and it takes a lot of time and effort to move your computations, to move your training, to different parts of your infrastructure. It might be cloud infrastructure, your local laptop, or your internal tools. In this environment, where computation is always moving around, data transfer and data versioning become a huge pain point. People spend a lot of time and resources to transfer data, to version data, and to make sure that the right version of the data was used to train the latest model. These differences mean that software tools are not the best fit for ML automation, so we need to extend the existing toolset or invent a new toolset specifically for machine learning processes.

Many teams and many companies, especially big ones, have implemented internal toolsets for machine learning to support these best practices. I worked with one of those inside Microsoft, and you can find a lot of different tools from huge companies, from Netflix, from Lyft and many others. But my big belief is that the future of ML systems and ML platforms is not huge monolithic platforms but an ecosystem of open tools, especially open source tools, where a team can pick and choose the right tools, chain them together, and build their own workflow, which might be very different from the workflow in some other company. That is the same way software engineering works today. Today we don't have software engineering platforms; we have tools that we connect together, like Git, CI/CD, code quality control, linters, loggers and so on. I believe ML systems and ML automation need to follow exactly the same principles.

One of the reasons why we need this ecosystem and not a single monolithic platform is that it's better to have one technology stack for AI, ML and software engineering instead of two different stacks, where one part of the team works with the software engineering toolset and another part works with the ML toolset. This is pretty much what is happening right now, and it is not okay to have two different DevOps teams, one for data projects and one for regular software projects. This should disappear from the market. We need to reuse in ML projects the great tools that we already have. I'm talking about Git first of all, GitHub and other version control providers, and continuous integration systems in the first place, and there are many other tools related to development.

So how do I see the infrastructure of the future? I believe we need to extend the existing toolset. I mean, Git needs to be extended and improved by supporting data versioning and data transfer. We need to include data and metrics diffs in Git; otherwise, how can you understand which change is better, and how can you understand what changed between your current commit and the previous one? We need to extend CI/CD systems, because we need to bring data into the CI system for data validation, we need to be able to bring ML models into the CI system to run model tests, and we also need to train in the cloud through the CI system, automatically, without any plumbing in the infrastructure. People-level collaboration also needs to change. I'm talking about different kinds of UI where a team can come together and discuss the results of their work, this week's work or today's work. There are many other pieces of the ecosystem, but I believe those three are the major pain points that need to be solved in the first place.

Among those three, DataOps is probably the biggest pain point, and it needs to be solved at the very beginning, because the other tools can benefit from having good data version control, metrics-driven diffs and experiment diffs. Any ML platform has some DataOps capabilities under the hood. You might see those capabilities from outside or you might not, but this is an essential, absolutely needed part. This is how you move data to your cloud, how you bring the results back, and how you make sure you have a consistent connection between code and data. We will talk about how to build this DataOps functionality in an open way, in a way that can be reused by other platforms and by other parts of ML automation.

So we know that DataOps can help to build your ML automation, and that DataOps is an essential part of any ML platform or AI platform. In this part I will show you how those DataOps principles are implemented in the tool that we have built, DVC, or Data Version Control.

First of all, let's start with the ideas behind DVC. DVC implements DataOps principles on top of an existing tool, Git. Why did we make this decision? We made it because we'd like to build ML infrastructure on top of software engineering infrastructure. We don't want to invent two separate stacks. As we discussed, we don't want to have some people who understand the ML stack and other people who understand the software engineering stack; we believe that the same people, the same DevOps team, can work in both directions. This is one reason. The second reason is that we'd like to reuse great software engineering tools. We'd like to reuse Git itself, GitHub, CI/CD systems and many other software tools that we already have.

So how do you implement DataOps principles on top of Git? We use a design pattern that is pretty common these days, which is called data as code. You probably know the infrastructure as code pattern. People who are familiar with DevOps principles know tools like Terraform, Puppet or Ansible, where you codify your infrastructure in metafiles, like JSON files or YAML files, and put these files into your Git repository. The Git repository becomes the source of truth for the configuration of your software, usually of your distributed system. We use the same paradigm. We create metafiles about your data, data artifacts, data pipelines, data metrics and so on, and Git versions those files. Git does a good portion of the work by itself; we just extend the functionality of Git.

Let's take a look at how the basic principles of DataOps are implemented, at what we need to extend in the existing Git tool to make it DataOps compatible, if you wish. First of all, data versioning. Data versioning should help us go back and forth between versions of our datasets or models. We should be able to get the model from the previous month, or roll back our production system to the release from last week. We also need to connect our data, code and models, all the pieces together, in one particular commit. This is how it looks in DVC. You just run a simple command, dvc add, with some data artifact, an image directory for example. DVC puts the data into an internal structure and creates a metafile that describes exactly what this data is, with all the checksums and all the addresses of the actual data. Then a git commit connects the dataset with the Git history, with your code, and probably with a model if you commit the model at the same time.

The next question would be: what about transferring data? We know how to version data on my local machine, but how can I transfer this data to the server where my training needs to happen? How can I move the data to my teammate's machine? For this functionality we first of all implemented the concept of data remotes. In addition to the regular remotes in Git, we introduced data remotes. You add a data remote to your repository by running the dvc remote add command and specifying the address of your storage. It can be a bucket in S3, a directory in Azure storage, or just an SSH server with some directory. Then dvc push moves the data to this storage. When the data is there and your code is committed and pushed to your Git repository, your teammate can just clone the repository and pull the data back. This is how you are supposed to move data between the different parts of your infrastructure.

The last question would be: what about metrics? We know that DataOps is not only about data, not only about files. We also need to know which metrics correspond to a particular model, because otherwise it's really hard to evaluate what exactly is in this file. To cover this problem we implemented the DVC metrics functionality. A DVC metrics file is just a JSON file in a metric name to value format, and dvc metrics diff can give you the difference between your metrics for any given commits in the entire repo. You can compare your current branch with, for example, master, or the current commit with the previous one. In this way you can understand, from the Git repository, where your best model lives and how much improvement you made during the last week. It helps you to quantify everything around the models.

So, just to summarize: data versioning is implemented by dvc add, transferring by dvc push, metrics by dvc metrics diff, and the actual connection between data, code, models and metrics happens through Git itself. You just commit all these metafiles inside the repository, and this is how you do DataOps with DVC.

Let me show you some examples, some applications where DVC can be used for ML automation. This use case shows you how to transfer models in your CI/CD pipelines. You are probably familiar with the concept of CI/CD. This is how you check your code when you make a pull request. When you make a pull request, GitHub starts a runner, but this runner doesn't have the data needed to make a decision about the quality of the model that was built. You need to transfer the model to the runner, and dvc pull can help you do that. Then you run your specific code, or you just show two outputs of your inference, before the change and after the change, and the team can make a decision about how good the model is. In this particular case we run style transfer inference with the current change and with the master branch, and by eye you can look at the difference and make an informed decision about how good the change is.

In other cases metrics might be a better way to make a decision, to quantify your decision, so the team has more information about the change itself. DVC metrics can bring this functionality into your CI/CD pipeline: you run dvc metrics diff and see the difference between your current change and master, which usually corresponds to your production environment. From the metrics diff you can decide how good the current change is, and whether it makes sense to merge to master and move to the next version of your code in the production system, for example.

So that's how this ecosystem of the future can look, and this is what we are working on. We believe that this ecosystem needs to be open, needs to be open source, and we work on different pieces, different building blocks, of the ecosystem. We extend Git with DVC functionality, by adding data, metrics and data transfer functionality. We extend CI/CD systems by implementing the CML project, which helps you bring metrics, graphs and plots into CI/CD reports. We are also working on something similar to GitHub. We'd like to extend GitHub with metrics and data, and to bring the collaboration experience around metrics to a service where the team can collaborate and see the results, the history and the trends of their ML projects.

So thank you, thank you very much for your attention, and I am open to your questions. Thanks.
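Editor's sketch: the two ideas at the core of the talk, a small metafile with a checksum that stands in for a large data artifact, and a diff between two sets of metrics, can be illustrated in a few lines of Python. This is a conceptual sketch only and not DVC's implementation: the function names make_pointer and metrics_diff, the metafile fields, the choice of SHA-256, and the sample metric values are all assumptions made for illustration.

```python
import hashlib
import json


def make_pointer(path: str, content: bytes) -> dict:
    """Build a small metafile that describes a data artifact by checksum.

    The field names and the hash function are hypothetical, chosen for
    illustration. A file like this is small enough to commit to Git while
    the data itself lives elsewhere.
    """
    return {
        "path": path,
        "size": len(content),
        "sha256": hashlib.sha256(content).hexdigest(),
    }


def metrics_diff(old: dict, new: dict) -> dict:
    """Compare two flat metric-name -> value mappings.

    Metrics present on only one side get a change of None.
    """
    diff = {}
    for name in sorted(set(old) | set(new)):
        before, after = old.get(name), new.get(name)
        if before is None or after is None:
            change = None
        else:
            change = after - before
        diff[name] = {"old": before, "new": after, "change": change}
    return diff


if __name__ == "__main__":
    # The bytes below stand in for a real dataset.
    pointer = make_pointer("images.tar", b"example bytes standing in for a dataset")
    print(json.dumps(pointer, indent=2))

    # Made-up metric values for a baseline branch and a candidate change.
    baseline = {"accuracy": 0.90, "loss": 0.35}
    candidate = {"accuracy": 0.92, "loss": 0.30}
    for name, row in metrics_diff(baseline, candidate).items():
        print(name, row)
```

Because the metafile changes whenever the data changes, committing it alongside the code ties a specific data version to a specific commit, and comparing the metrics stored at two commits gives a reviewer a number to look at in a pull request.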
