Marcel Ribeiro-Dantas: DataOps/MLOps with DVC (ENG)

Published: Sep 03, 2024 Duration: 00:23:03 Category: Entertainment

Hi there, my name is Marcel, and I'm going to talk about DVC with you today. Currently I'm an early-stage researcher at the Curie Institute in France, but before that I worked at LAIS, a laboratory for healthcare innovation in Brazil. Even though it's located inside a university, it works in a very industry-oriented way: we developed devices and software and did data science projects and analyses for the Ministry of Health in Brazil. I worked there for almost eight years, with machine learning, artificial intelligence, software development, many things, but always related to healthcare. I have a computer engineering degree, a graduate degree in big data, and a Master of Science in bioinformatics from the Federal University of Rio Grande do Norte in Brazil, and I'm doing a PhD in bioinformatics at the Sorbonne in France. I'm also a DVC ambassador now, so I'm very happy to talk about DVC with you today.

The point is: life can be hard. If you work with data analysis, at some point you have probably caught yourself in a situation where you don't know which version of the software, or the dataset, or the picture you used. In the picture on the right we have "last week", and last week you can usually remember, but last month, last semester, last year, things get more complicated. If you have several documents or datasets called something like budget_estimation_1, 2, 3, okay, that's fine; at least you have some ordering, you know which one is the latest. But then you generated a picture, a plot from your dataset, and you don't remember whether it came from your raw dataset or your preprocessed dataset, or from the raw dataset but with the script that uses normalization technique X, or Y, or Z. So many things can happen between the dataset and your final picture, plot, document, metrics, or model, and it can be very tricky, especially if you're trying to reproduce something you did six months or a year ago. And if it's a friend of yours trying to reproduce your work, it can really be hell.

But at least we have the data, right? Sometimes we lose it. Here, the person accidentally deleted a page of their manuscript, but it could just as well be a dataset: "I thought I wouldn't need the raw version of my dataset anymore; now I do, because I have to try a new normalization, and I don't have it anymore." If you are not tracking your dataset, you have an issue. And even if you have all the versions of your dataset, all the versions of your source code, and all the output files, you can still feel lost, because, as in the previous example, maybe you do have your plot, your script, and your dataset, but you don't know where the plot came from. Was it from the raw version of the dataset? The preprocessed one? Which normalization? So many things can happen between the plot and what you did before. Even if you have everything, if your pipeline, your experiment, is not being managed, you have an issue.

So, does it have to be like this? Clearly not. But what are our choices? Right now we have DevOps, MLOps, DataOps, and all these things, so we have knowledge from other fields that we can bring to machine learning and data science.
The content is there, the knowledge is there, but it's not so easy. Sometimes what we are looking for are tools that can do that for us, or maybe not do everything, but create an environment in which it's easier to add the new things we're looking for. You can think, okay, Git does that; maybe Git is enough for you, maybe you think Git solves your issues. And for code I would say yes: for tracking code, Git is pretty good. But it's not only about code. You might say, "but then I can also version my data, I can use some extensions to Git to version my data." As you probably know, Git was not developed to version large objects or datasets; it gets really, really slow if you keep adding big data files to your Git repository. You can try an extension, but in the end, data files are getting bigger and bigger and changing all the time. You have your dataset, then a preprocessed version, then a featurized version, so you have a lot of copies that are very, very similar, and if you don't have garbage collection, for example, to make sure you're not keeping exact copies of parts of your datasets all over the place, things can get really dirty.

Even so, it's not only about data versioning. As I've been saying in all these examples, data files are connected to scripts, which are in turn connected to output files, plots, and all these things, so you also need to manage that. It's not only about source code versioning or data versioning: you also need pipeline management, and even pipeline versioning. If you don't have experiment or pipeline management, you can be in real trouble. In the same way that your data and your code change over time, your pipeline also changes over time, so you need to version your pipeline too.

One example that I have: during this COVID-19 pandemic, a lot of people have been trying to analyze data. Basically everyone wants to analyze COVID-19 data; if you go to Facebook, LinkedIn, Medium, any social media, you're going to see a lot of people trying to analyze COVID-19 data. They think they can help, they want to be helpful, so they do that. But sometimes they are not really analyzing anything, not really seeing anything, because they usually focus too much on epidemiological data, and that can be misleading: in the end, you can be comparing apples and oranges. If you compare the epidemiological numbers of San Marino and China, they are very different in area, in number of inhabitants, in number of medical doctors per hundred thousand inhabitants; you really have to take a lot of other things into consideration. When you focus only on epidemiological data, you're going to have a lot of confounding, a lot of selection bias, and so on.

I was thinking about this issue and thought maybe we could do a bit better, so I discussed it with a few friends, and the result of that discussion was two data papers that we published in Data in Brief, which is a data journal by Elsevier. The first paper was a dataset for country profiles: we wanted to get as much data as possible about countries, put it in one dataset, and then give it to people to analyze. In this one we had
mobility data, epidemiological data, socioeconomic data, political data, ecological data, government expenditure on healthcare and on research, the number of medical doctors per 100,000 inhabitants, all these things. We did some feature selection and feature engineering, put everything together with some preprocessing, and then shipped it so that anyone can use it. The second paper was to analyze the papers, the scientific studies, that had been published so far about COVID-19.

The questions we had, and that it was clear to us people would have if they saw our work, are the questions we usually have when we look at other people's work, and they are really about open science: how transparent is this study? How reproducible is it? Can I get more updated data? For COVID, for example, we published this paper a few months ago and we have a lot more data now; even the data from the past has been updated, some errors have been corrected, some misreporting fixed, and so on. So we wanted to be able to easily update the dataset, and also, in case we didn't have time to, for the user to be able to update the dataset themselves. We wanted to go from the common black box to a really white box, where the user and the reader have full knowledge of everything we did. We found DVC an amazing tool to do that: we used DVC to analyze our data, to do all the pipeline management and everything else, and it worked pretty well.

And it's not only about DVC; it's an ecosystem. Usually you have many tools that you like and use to do your data analysis, your DataOps. What people usually do is use Git to track the source code and DVC to track the datasets, the output files, and the pipeline itself. But you're not limited to that; you can add other tools. Maybe you don't like this part of Git or that part of DVC; you can just pick other tools for that. You could add CML, for example, to do continuous machine learning, CI/CD for machine learning, and then use some other tool because you think it does something better than Git or DVC. You are free to choose your tools. This is not code that you write inside your software; it's agnostic to that. It could be R, or Python, or Ruby, or anything; you can use DVC with it, the same way you can use Git with it. And you are not limited to GitHub or GitLab or anything like that; it's really agnostic.

So let's do some hands-on for data tracking. At the beginning I'm not doing any data tracking; at first I'm just creating a folder. I'm going to use git init to initiate this folder as a Git repository, and in the same way I'm going to do a dvc init to initiate this Git repository also as a DVC repository. I'm going to run a few commands in R just to do a few things, and I'm going to commit that; so far, no data tracking. One thing I think is nice to point out here is that, in the same way you do git init, you do dvc init. This is something the developers of DVC thought about, and I think it's a great idea: Git is pretty famous nowadays, everyone knows at least the basic Git commands, and if you know those, you're good to go with DVC, because a lot of the commands are just the same. git init, dvc init; git add, dvc add; git pull, git push; dvc pull, dvc push. You're going to see it's very similar. Now we're going to start doing our data tracking.
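To make that concrete, here is a minimal sketch of that initialization sequence; the folder name is just a placeholder, since the talk doesn't give one:

```sh
# Create a folder and make it both a Git and a DVC repository
mkdir dvc-demo && cd dvc-demo    # hypothetical project name
git init
dvc init    # creates .dvc/ plus a few config files for Git to track

# Commit the DVC scaffolding; no data is being tracked yet
git add .
git commit -m "Initialize Git and DVC"
```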
I created a text file, a README, but it could be a script or whatever, and I'm going to use git add because I want Git to track it. But then I have this raw dataset file, which is huge, and I don't want Git to track it; I want DVC to track it. So I'm going to use dvc add and the file name. When you do that, DVC creates a .dvc file for your dataset, a metadata file. It's very tiny, but it holds metadata such as a checksum of your dataset. That's how DVC can know, for example, whether the file changed, and also tie it to the Git commit, so that if you travel in time, using git checkout to go to another version of your repository, the datasets also change accordingly. And then you do a git commit, just as you normally would. For these commands that you keep repeating, you can also use dvc install, and then whenever you do a git pull or git push or the like, DVC will automatically be run in the background as well. For example, we can use git checkout to check out a different version of this repository, maybe because I want to know what it looked like a year ago: if you type that git checkout, the source code files, the scripts, will change to look the way they did a year ago, and if you then type dvc checkout, all the stored datasets and output files change to the way they looked a year ago as well.

As I said, the same way you do git push, there's an equivalent dvc push; git pull, dvc pull. Even though it's not mandatory, you can have a Git remote, like GitHub, a remote copy of your repository; in the same way, and also not mandatory, you can have a DVC remote. It doesn't have to be anything like GitHub: it can be an Amazon S3 bucket, Microsoft Azure, Google Cloud, Dropbox, SSH, your external hard drive, anything you want. When you do a git push, Git pushes your code and some DVC metadata to your Git remote, and then a dvc push sends the datasets, the output files, the big files, the object files, to your DVC repository, your DVC remote.

Here you have the workflow of DVC, or of pretty much any machine learning project. It's not always like this, but this is a fairly common example of a machine learning workflow. You have a script to download your data, and then the data is there. You have a splitter script, which splits your dataset into training data and validation data. Then you have another script which runs some algorithm, here a decision tree, but it could be any algorithm, and that script generates a model. Then you can use another script, called evaluation, to evaluate this model and generate metrics. If you don't like the metrics, you run everything again: you tune your algorithm, change some hyperparameters, look again, again and again. At some point you're going to like the metrics, or not, but if you do, you save that as your model and you're done, let's say.
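Putting the data-tracking commands from this part of the talk together, here is a rough sketch; the file names and the S3 bucket are placeholders, not from the talk, and a Git remote is assumed to be configured already:

```sh
# Let Git track the small text file, and DVC track the huge data file
git add README.md
dvc add raw_dataset.csv    # writes raw_dataset.csv.dvc and gitignores the data
git add raw_dataset.csv.dvc .gitignore
git commit -m "Track raw dataset with DVC"

# Optional: install Git hooks so checkout/push/pull trigger DVC automatically
dvc install

# Configure a DVC remote; could equally be Azure, Google Cloud, SSH, or a local path
dvc remote add -d storage s3://my-bucket/dvc-store    # hypothetical bucket
git push    # code and DVC metadata go to the (already configured) Git remote
dvc push    # the actual data goes to the DVC remote

# Time travel: code and data move together
git checkout HEAD~1    # or any older commit
dvc checkout
```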
So let's do some hands-on again, but now with pipelines. As I was showing, you do dvc add to add a dataset: it creates the .dvc file, and it also creates an entry in the .gitignore file with the name of your data file. That's interesting because if you have a file in your Git repository that Git is not tracking, Git will often warn you that the file is untracked. But you know it's untracked; you don't want Git to track it, because DVC is tracking it. By automatically putting the file in .gitignore, DVC makes Git ignore it, so it won't complain or warn you.

Then you start creating stages for your pipeline. Here I'm going to create a preprocessing stage, using dvc run, and the most common parameters are -d for dependencies and -o for output files. In this stage, what we tell DVC is that we have two dependencies, the dataset (simulation.csv) and the preprocess.R script that does the preprocessing; if either of these two changes, DVC knows that this stage must be run again. The stage generates an output file, a preprocessed version of the dataset. And what is the stage supposed to do? Run the preprocess.R script. When you do that, you git add your .gitignore file, your preprocess.R file, and, actually, here instead of preprocess.dvc it should be dvc.yaml; that's a mistake in the slide. In the past it used to be like this, but from DVC 1.0 on you have the dvc.yaml file. Then you do a git commit, just like you would in any other project.

Then I want to create a new stage, which calls another algorithm to work on this preprocessed dataset, and maybe another one, the final one; you can keep creating stages of your pipeline. When you create a stage, DVC also runs it for the first time. Now let's say you have run everything, but at some point you change your dataset, or you change some parameter in your algorithm. Then you can just ask DVC to reproduce the pipeline, with the repro command, calling the final stage or any other stage, and DVC will run everything up to that stage. You might ask: "But Marcel, I just changed the last stage, I don't want it to run everything again." And it never does: if DVC realizes a stage didn't change, it won't run it; it skips that stage.

In the end, that's what I did with the first paper I mentioned, the one with the datasets. Here, in green, are the script files; in light orange, the datasets; in strong orange, the stages. On the top right we have 35 datasets from the United Nations, and we call two scripts. One script aggregates the UN data, in the "generate single UN dataset" stage, merging all 35 UN datasets into one single dataset in a specific format. Another stage, on the right, generates a raw data dictionary. This merged UN dataset, along with three other datasets, the Johns Hopkins University COVID-19 dataset, the Google mobility dataset, and the ECDC COVID-19 dataset (the ECDC is the European Centre for Disease Control), these four datasets plus the preprocess.R script, are the dependencies of our preprocessing stage. We run that, and we get the final dataset, the one that was published in the data paper I mentioned to you earlier.
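Here is a rough sketch of how a stage like that preprocessing one could be declared, using the DVC 1.x dvc run syntax the talk describes (newer DVC versions call this dvc stage add); the file names follow the slide:

```sh
# Declare a stage: -n names it, -d marks dependencies, -o marks outputs.
# If simulation.csv or preprocess.R changes, DVC knows to rerun this stage.
dvc run -n preprocess \
        -d simulation.csv \
        -d preprocess.R \
        -o preprocessed.csv \
        Rscript preprocess.R

# The stage definition lands in dvc.yaml; commit it like anything else
git add dvc.yaml dvc.lock .gitignore
git commit -m "Add preprocessing stage"

# Later, after changing data, code, or parameters, rerun only what is stale
dvc repro    # or `dvc repro <stage>` to run everything up to that stage
```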
Then, later, I did what we thought at the very beginning would be very interesting: letting people update the datasets themselves. I created, on the top left, a new stage called "update datasets", with a script called check_update.R. This script checks for new versions of the data and, if there are new ones, downloads them and then calls the preprocessing stage, so everything is rerun and we have the most up-to-date version of our Data in Brief dataset.

One interesting thing about machine learning, or data science, is that from a certain perspective it's metrics-driven. Differently from the usual software development environment, where you're always changing a lot of code, here you have your code, and what you actually keep changing, several times, are the parameters, the hyperparameters; you're tuning your algorithm. You just change a few things here and there, run everything again, see how the metrics changed, and keep doing that. So you can see that, at least at some point, it's metrics-driven, and you can handle that with DVC. Here we create a new stage called "evaluate"; there are some dependencies, and there's this -m parameter. The -m parameter in dvc run tells DVC which file it should consider a metrics file, and that's what makes the metrics commands of DVC useful. After you run the stage, you can run dvc metrics show, and it shows the information inside this metrics file; here it's the AUC, the error, and the true positive rate, the number of true positives. Then you tune your algorithm again, change a few things, run it again, and type dvc metrics diff to see the difference. This is very interesting: you see the difference between now and the last run of this experiment.

And just like you do with Git branches, you can do experiments with branches too. You could have a few friends working with you on the same experiment. You think the best way to improve the metrics of your algorithm is to tune the lambda, so you create a branch for that and keep testing. But one friend thinks you are wrong, that increasing the beta is a better choice, and another friend says no, no, let's change the alpha. So you all work on that, and at some point you will indeed find the best model, and you merge it back into master. And just like Git branches, you can leave the branches there or delete them. I think it's nice to leave them there, even the failed experiments, because maybe a year later someone may ask you, "okay, but did you try increasing the beta?" and you won't really remember, or "how much better was tuning the lambda compared to increasing the beta?" So I think keeping the failed experiments is very nice for transparency and for the history of your project.

So I think this whole talk is really about open science, about doing open-box, white-box machine learning and data science projects. Reproducibility, transparency, all these things are very important, and I think DVC can help you with that. Thanks for your attention, I hope you enjoyed the talk. If you want to get in touch with me, there's a contact page on my website, or you can reach me on Twitter; feel free to send any questions. Thank you very much for your attention.
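As a final sketch, here is what that metrics-driven loop could look like; evaluate.R, model.rds, and metrics.json are hypothetical names, and the branch workflow follows the example from the talk:

```sh
# Declare an evaluation stage; -m tells DVC that metrics.json is a metrics file
dvc run -n evaluate \
        -d model.rds \
        -d evaluate.R \
        -m metrics.json \
        Rscript evaluate.R

# Show the metrics of the current workspace (e.g. AUC, error, true positive rate)
dvc metrics show

# Tune hyperparameters, rerun, then compare against the last committed run
dvc repro
dvc metrics diff

# Experiments as branches: each collaborator tunes one hyperparameter
git checkout -b tune-lambda
# ...edit the parameter, dvc repro, git commit...
dvc metrics diff master    # how do this branch's metrics compare to master?
```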
