Is there a way to update an existing Azure ML Dataset from a pandas DataFrame and register the result as a new version? The current Dataset is stored in Blob Storage as a CSV file. How can we approach this?
Also, let's say we want to make a different version the latest one.
Above we see that version 2 is the latest, but I want version 1 to be the latest, so that anyone who reads the Dataset gets version 1. I don't want to pass a version explicitly when retrieving it.
CodePudding user response:
Regarding your first question, here are two methods to update your Azure ML dataset with a new version using a CSV file stored in Blob Storage:
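Method 1 (SDK v2, azure-ai-ml):
from azure.ai.ml import MLClient
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes
from azure.identity import DefaultAzureCredential
# Connect to the workspace; from_config() reads the workspace details from a local config.json
ml_client = MLClient.from_config(credential=DefaultAzureCredential())
blob_url = 'https://sampleazurestorage.blob.core.windows.net/data/my-sample-data.csv'
my_dataset = Data(
    path=blob_url,
    # A single CSV file is a file asset; MLTABLE expects a folder that contains an MLTable file
    type=AssetTypes.URI_FILE,
    description="a description for your dataset",
    name="dataset_name",
    version='<new_version>'
)
# Registers the asset under the given name and version
ml_client.data.create_or_update(my_dataset)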
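The `<new_version>` placeholder in Method 1 has to be a version that is not registered yet. A minimal sketch of one way to pick it, assuming your versions are plain integers stored as strings; `next_integer_version` is a hypothetical helper, not part of the SDK. The list of existing versions would come from the workspace, for example by iterating `ml_client.data.list(name="dataset_name")` and collecting each item's `version`; that call is left out here so the snippet runs offline.

```python
from typing import Iterable


def next_integer_version(existing_versions: Iterable[str]) -> str:
    """Return the next integer version as a string.

    Non-numeric versions (for example "v1-beta") are ignored.
    With no numeric versions at all, the result is "1".
    """
    numeric = [int(v) for v in existing_versions if str(v).isdecimal()]
    return str(max(numeric, default=0) + 1)


# Example with the two versions from the question
new_version = next_integer_version(["1", "2"])
print(new_version)  # prints 3
```

The returned string can then be passed as `version=` to `Data(...)`.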
Method 2 (SDK v1, azureml-core):
from azureml.core import Dataset, Workspace
ws = Workspace.from_config()
datastore = ws.get_default_datastore()
blob_url = 'https://sampleazurestorage.blob.core.windows.net/data/my-sample-data.csv'
# from_delimited_files is defined on Dataset.Tabular, not on Dataset.File
my_dataset = Dataset.Tabular.from_delimited_files(path=blob_url)
my_dataset.register(
    workspace=ws,
    name="dataset_name",
    description="a description for your dataset",
    create_new_version=True  # add a new version under the existing name
)
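If you want to update the dataset using a pandas DataFrame, use Dataset.Tabular.register_pandas_dataframe, which uploads the DataFrame to a datastore and registers it in one step. As far as I know it stores the data as Parquet rather than CSV, and registering under an existing name adds a new version:
my_df = ... # the variable that contains the new dataset in a DataFrame
my_dataset = Dataset.Tabular.register_pandas_dataframe(
    dataframe=my_df,
    target=datastore,  # datastore the DataFrame is uploaded to
    name="dataset_name",
    description="a description for your dataset"
)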
Regarding your second question:
Above we see that version 2 is the latest, but I want to change the latest to version 1
That is not possible, because 'latest' always points to the most recently registered version of the dataset with the given name. So, if you want a specific version rather than the latest one, set the version parameter of the Data class in the "Method 1" code snippet.
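Since 'latest' cannot be redirected, one possible workaround is to keep the preferred version in a single place and have every reader go through a small helper, so that no caller passes a version itself. The following is a minimal sketch under that assumption: `PINNED_VERSIONS` and `load_dataset` are hypothetical names, and `get_by_name` is expected to behave like `Dataset.get_by_name(workspace, name, version)` from azureml-core (SDK v1).

```python
from typing import Any, Callable, Dict

# Hypothetical central mapping: dataset name -> version every reader should get.
# Datasets that are not listed here fall back to 'latest'.
PINNED_VERSIONS: Dict[str, Any] = {"dataset_name": 1}


def load_dataset(get_by_name: Callable[..., Any], workspace: Any, name: str) -> Any:
    """Fetch `name` at its pinned version, or at 'latest' if it is not pinned.

    `get_by_name` is injected so the helper can be exercised without a
    workspace; in real use it would be azureml.core.Dataset.get_by_name.
    """
    version = PINNED_VERSIONS.get(name, "latest")
    return get_by_name(workspace, name=name, version=version)
```

With the SDK installed, a reader would call `load_dataset(Dataset.get_by_name, ws, "dataset_name")` and get version 1 until the mapping is changed.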