Hugging Face load_dataset() function throws "ValueError: Couldn't cast"

My goal is to train a classifier for sentiment analysis in Slovak, using the pretrained SlovakBERT model and the Hugging Face libraries. The code runs on Google Colaboratory.

My test dataset is read from this CSV file: https://raw.githubusercontent.com/kinit-sk/slovakbert-auxiliary/main/sentiment_reviews/kinit_golden_games.csv

and my training dataset from this one: https://raw.githubusercontent.com/kinit-sk/slovakbert-auxiliary/main/sentiment_reviews/kinit_golden_accomodation.csv

The data has two columns: the first contains Slovak sentences and the second contains labels indicating the sentiment of each sentence. The labels take the values -1, 0, or 1.

The load_dataset() function throws this error:

ValueError: Couldn't cast Vrtuľník je veľmi zraniteľný pri dobre mierenej streľbe zo zeme. Brániť sa, unikať, alebo vedieť zneškodniť nepriateľa je vecou sekúnd, ak nie stotín, kedy ide život. : string -1: int64 -- schema metadata -- pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' 954 to {'Priestorovo a vybavenim OK.': Value(dtype='string', id=None), '1': Value(dtype='int64', id=None)} because column names don't match

Code:

!pip install transformers==4.10.0 -qqq
!pip install datasets -qqq

from re import M
import numpy as np
from datasets import load_metric, load_dataset, Dataset
from transformers import TrainingArguments, Trainer, AutoModelForSequenceClassification, AutoTokenizer, DataCollatorWithPadding
import pandas as pd
from textblob import TextBlob
from textblob.sentiments import NaiveBayesAnalyzer

#links to dataset
test = 'https://raw.githubusercontent.com/kinit-sk/slovakbert-auxiliary/main/sentiment_reviews/kinit_golden_games.csv'
train = 'https://raw.githubusercontent.com/kinit-sk/slovakbert-auxiliary/main/sentiment_reviews/kinit_golden_accomodation.csv'


model_name = 'gerulata/slovakbert'


#Load data
dataset = load_dataset('csv', data_files={'train': train, 'test': test})

What am I doing wrong when loading the dataset?

CodePudding user response：

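The reason is that the CSV files have no header row, so load_dataset() treats the first line of each file as the column names. The two files start with different sentences, so the inferred column names differ between the two splits. That is what the error message reports: the columns named after the first row of one file ('Vrtuľník je ...' and '-1') cannot be cast to the columns named after the first row of the other ('Priestorovo a vybavenim OK.' and '1'). The commas inside the sentences are not the problem; each sentence is still parsed as a single column.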

The solution is simple: pass the column names explicitly.

dataset = load_dataset('csv', data_files={'train': train, 'test': test}, column_names=['sentence', 'label'])

Output:

DatasetDict({
    train: Dataset({
        features: ['sentence', 'label'],
        num_rows: 89
    })
    test: Dataset({
        features: ['sentence', 'label'],
        num_rows: 91
    })
})
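
To see why explicit column names help, here is a minimal sketch that reproduces the mechanism offline. It uses Python's standard csv module as a stand-in for the CSV loader (an assumption made for illustration; the datasets library does its own parsing), and the two small strings are hypothetical samples, not the real files. Without explicit names the first row of each file becomes the header, so the two "schemas" disagree; with explicit names every row is kept as data.

```python
import csv
import io

# Hypothetical header-less samples standing in for the two CSV files.
train_csv = '"Priestorovo a vybavenim OK.",1\n"Second review, with a comma.",0\n'
test_csv = '"Vrtuľník je veľmi zraniteľný.",-1\n"Another review.",1\n'


def inferred_columns(text):
    """Return the column names obtained when the first row is used as a header."""
    return csv.DictReader(io.StringIO(text)).fieldnames


def read_rows(text, names):
    """Read every row as data, labelling the columns with the given names."""
    return list(csv.DictReader(io.StringIO(text), fieldnames=names))


# Header inferred from the first row: the column names differ per file.
train_cols = inferred_columns(train_csv)
test_cols = inferred_columns(test_csv)
print(train_cols)
print(test_cols)
print(train_cols == test_cols)  # False: the column names don't match

# Explicit names: both files share one schema and no row is lost.
names = ['sentence', 'label']
train_rows = read_rows(train_csv, names)
test_rows = read_rows(test_csv, names)
print(len(train_rows), len(test_rows))  # 2 2
```

Note that the quoted comma in the second sample row stays inside the sentence column, which suggests that the delimiter is not what causes the mismatch.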