Change a value in a dict from Hugging Face datasets

I want to train a model on the Wikitext-2-v1 training corpus (https://huggingface.co/datasets/wikitext). I tried to remove the punctuation from each line. From what I can tell, each line is a dictionary, so I tried to update its value, but when I checked after the loop, nothing had changed.

%pip install datasets
import string
from datasets import load_dataset
dataset = load_dataset('wikitext', 'wikitext-2-v1', split=['train', 'validation', 'test'])

for i in range(len(dataset[0])):
    for k, v in dataset[0][i].items():
        v = v.translate(str.maketrans('', '', string.punctuation))
    dataset[0][i]['text'] = v

Here is a one-line example: {'text': ' = Valkyria Chronicles III = \n'}. After the loop it stays the same, but I want it to be {'text': ' Valkyria Chronicles III'}.

CodePudding user response：

Your translate code is good and doing what it's supposed to.

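I believe the Dataset type does not support item assignment, so you can't change a row in place. Assigning directly to a row, as in dataset[0][i] = ..., raises a TypeError, but your code avoids that error: dataset[0][i] returns a new dict built from the underlying data, so dataset[0][i]['text'] = v only modifies that temporary dict, which is then discarded. You may need to build up a new data structure with your translated data as you go.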
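
One way to build that new structure is to let the library do it: Dataset.map applies a function to every row and returns a new Dataset. The following is a minimal sketch, not a definitive implementation. The helper names strip_punctuation and clean_dataset are hypothetical, and calling .strip() is an assumption on my part: it removes the trailing newline but also the leading space shown in your desired output, so drop it if you want to keep that space.

```python
import string

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def strip_punctuation(example):
    # Hypothetical helper: receives one row as a dict and returns a new dict.
    # Dataset.map uses the returned keys to overwrite the matching columns.
    # .strip() is an assumption: it also trims surrounding whitespace.
    return {'text': example['text'].translate(PUNCT_TABLE).strip()}

def clean_dataset(dataset):
    # Dataset.map does not modify the dataset in place; it returns a new one,
    # so the result has to be kept by the caller.
    return dataset.map(strip_punctuation)

# Usage sketch (needs the datasets package and a network connection):
#   from datasets import load_dataset
#   train = load_dataset('wikitext', 'wikitext-2-v1', split='train')
#   train = clean_dataset(train)
```

Note the reassignment in the usage sketch: without train = ..., the cleaned copy would be thrown away just like the temporary dict in your loop.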

Note also that the way you iterate through items() but only use the text value is rather strange. It seems to me you should either explicitly use only 'text' or explicitly handle every key.
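
If you do want the second option, handling every key explicitly, a per-row function could look roughly like this. It is only a sketch: strip_punctuation_all_fields is a hypothetical name, and leaving non-string values untouched is my assumption about what "handle every one" should mean here.

```python
import string

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def strip_punctuation_all_fields(example):
    # Hypothetical helper: clean every string-valued field of one row and
    # pass any other value through unchanged (an assumption, see above).
    return {
        key: value.translate(PUNCT_TABLE) if isinstance(value, str) else value
        for key, value in example.items()
    }
```

A function of this shape takes a plain dict and returns a plain dict, so it can be checked on a single row before it is applied to a whole dataset.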