I'm using existing code from Towards Data Science for fine-tuning a BERT model.
The problem I'm facing is in the part of the code where we try to format our data into a PyTorch data.Dataset
object:
class MeditationsDataset(torch.utils.data.Dataset):
    def _init_(self, encodings, *args, **kwargs):
        self.encodings = encodings
    def _getitem_(self, idx):
        return {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
    def _len_(self):
        return len(self.encodings.input_ids)

dataset = MeditationsDataset(inputs)
When I run the code, I face this error:
TypeError Traceback (most recent call last)
<ipython-input-144-41fc3213bc25> in <module>()
----> 1 dataset = MeditationsDataset(inputs)
/usr/lib/python3.7/typing.py in __new__(cls, *args, **kwds)
819 obj = super().__new__(cls)
820 else:
--> 821 obj = super().__new__(cls, *args, **kwds)
822 return obj
823
TypeError: object.__new__() takes exactly one argument (the type to instantiate)
I already searched for this error, but unfortunately I'm not familiar with either PyTorch or OOP, so I couldn't fix the problem. Could you please let me know what I should add to or remove from this code so I can run it? Thanks a lot in advance.
Also, if needed, our data looks like this:
{'input_ids': tensor([[ 2, 1021, 1005, ..., 0, 0, 0],
[ 2, 1021, 1005, ..., 0, 0, 0],
[ 2, 1021, 1005, ..., 0, 0, 0],
...,
[ 2, 1021, 1005, ..., 0, 0, 0],
[ 2, 103, 1005, ..., 0, 0, 0],
[ 2, 4, 0, ..., 0, 0, 0]]),
'token_type_ids': tensor([[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
...,
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0]]),
'attention_mask': tensor([[1, 1, 1, ..., 0, 0, 0],
[1, 1, 1, ..., 0, 0, 0],
[1, 1, 1, ..., 0, 0, 0],
...,
[1, 1, 1, ..., 0, 0, 0],
[1, 1, 1, ..., 0, 0, 0],
[1, 1, 0, ..., 0, 0, 0]]),
'labels': tensor([[ 2, 1021, 1005, ..., 0, 0, 0],
[ 2, 1021, 1005, ..., 0, 0, 0],
[ 2, 1021, 1005, ..., 0, 0, 0],
...,
[ 2, 1021, 1005, ..., 0, 0, 0],
[ 2, 1021, 1005, ..., 0, 0, 0],
[ 2, 4, 0, ..., 0, 0, 0]])}
CodePudding user response:
Special methods in Python use a double-underscore prefix and suffix. A name like `_init_` (single underscores) is just an ordinary method, so your class effectively has no constructor; the argument you pass then falls through to `object.__new__`, which accepts no extra arguments and raises the TypeError you see. To implement a data.Dataset, you must define `__init__`, `__getitem__`, and `__len__`:
class MeditationsDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, *args, **kwargs):
        self.encodings = encodings
    def __getitem__(self, idx):
        return {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
    def __len__(self):
        return len(self.encodings.input_ids)
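You can see the same naming rule at work without PyTorch at all. This is a minimal plain-Python sketch (the class names `Broken` and `Fixed` are made up for illustration): a method spelled `_init_` is never called on instantiation, so passing a constructor argument fails, while `__init__` works as expected.

```python
class Broken:
    def _init_(self, data):  # ordinary method -- Python does NOT treat this as the constructor
        self.data = data

class Fixed:
    def __init__(self, data):  # special method -- runs automatically on instantiation
        self.data = data

try:
    Broken([1, 2, 3])        # no __init__ defined, so the extra argument is rejected
except TypeError as e:
    print("Broken raised:", e)

f = Fixed([1, 2, 3])         # __init__ receives and stores the argument
print("Fixed stores:", f.data)
```

The same applies to `_getitem_` and `_len_` in your class: PyTorch's DataLoader looks up `__getitem__` and `__len__` specifically, so the single-underscore versions would simply never be used.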