I want to train a model on the WikiText-2 (wikitext-2-v1) training corpus (https://huggingface.co/datasets/wikitext). I tried to remove the punctuation from each line. As far as I can tell, each line is a dictionary, so I tried to update its value, but when I check the values after the loop, nothing has changed.
%pip install datasets
import string
from datasets import load_dataset
dataset = load_dataset('wikitext', 'wikitext-2-v1', split=['train', 'validation', 'test'])
# Try to strip punctuation from every training row in place
for i in range(len(dataset[0])):
    for k, v in dataset[0][i].items():
        v = v.translate(str.maketrans('', '', string.punctuation))
        dataset[0][i]['text'] = v
Here is a one-line example: {'text': ' = Valkyria Chronicles III = \n'}. After the loop it stays the same, but I want it to be {'text': ' Valkyria Chronicles III'}.
CodePudding user response:
Your translate code is fine and does what it's supposed to. The problem is that the Dataset type does not support item assignment, so you can't change it in place. Normally this raises a TypeError, but your code avoids that: dataset[0][i] returns a new dict each time it is evaluated, so dataset[0][i]['text'] = v modifies a temporary dict that is then thrown away. You will need to build up a new data structure with your translated data as you go.
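To see why the assignment fails silently, here is a minimal sketch using a stand-in class. FreshRowTable is hypothetical and is not part of the datasets library; it only mimics the relevant behaviour, namely that every index access builds a fresh dict and that there is no item assignment on the container itself.

```python
class FreshRowTable:
    """Hypothetical stand-in for a read-only table of rows (not the real Dataset)."""

    def __init__(self, texts):
        self._texts = tuple(texts)

    def __len__(self):
        return len(self._texts)

    def __getitem__(self, i):
        # A new dict is built on every access, so callers never see shared state.
        return {'text': self._texts[i]}


table = FreshRowTable([' = Valkyria Chronicles III = \n'])

# This runs without error, but it only changes a temporary dict.
table[0]['text'] = 'changed'
unchanged = table[0]['text']

# Assigning to the container itself is what raises TypeError.
try:
    table[0] = {'text': 'changed'}
    assignment_error = None
except TypeError as exc:
    assignment_error = exc

print(repr(unchanged))
print(type(assignment_error).__name__)
```

The first assignment goes through the row dict, not the table, which is why no exception appears and nothing is stored.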
Note also that the way you iterate through items() but only assign to the text key is rather strange. It seems to me you should either explicitly use only text or explicitly handle every key.
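One way to build the new data structure is Dataset.map, which calls a function on each row and returns a new Dataset with the returned columns. The sketch below is only a suggestion under a couple of assumptions: the helper names strip_punctuation and clean_split are made up for this example, and the .strip() call is added on the guess that you also want the surrounding whitespace and trailing newline gone.

```python
import string

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans('', '', string.punctuation)


def strip_punctuation(example):
    """Return a new 'text' value with punctuation removed and whitespace trimmed."""
    return {'text': example['text'].translate(PUNCT_TABLE).strip()}


def clean_split(split):
    """Apply strip_punctuation to every row; `split` is expected to be a Dataset."""
    # Dataset.map does not modify `split`; it returns a new Dataset.
    return split.map(strip_punctuation)


# Usage with the data from the question (needs the datasets package and a download):
#   from datasets import load_dataset
#   train = load_dataset('wikitext', 'wikitext-2-v1', split='train')
#   train = clean_split(train)

print(strip_punctuation({'text': ' = Valkyria Chronicles III = \n'}))
```

Because map returns a new object, the result has to be assigned back to a variable, as in the commented usage lines. Drop the .strip() call if you want to keep the leading space shown in your expected output.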