Formatting our data into PyTorch Dataset object for fine-tuning BERT


I'm using existing code from a Towards Data Science tutorial for fine-tuning a BERT model. The problem I'm facing is in the part of the code that formats our data into a PyTorch torch.utils.data.Dataset object:

class MeditationsDataset(torch.utils.data.Dataset):
    def _init_(self, encodings, *args, **kwargs):
        self.encodings = encodings
    def _getitem_(self, idx):
        return {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
    def _len_(self):
        return len(self.encodings.input_ids)


dataset = MeditationsDataset(inputs)

When I run the code, I face this error:

TypeError                                 Traceback (most recent call last)
<ipython-input-144-41fc3213bc25> in <module>()
----> 1 dataset = MeditationsDataset(inputs)

/usr/lib/python3.7/typing.py in __new__(cls, *args, **kwds)
    819             obj = super().__new__(cls)
    820         else:
--> 821             obj = super().__new__(cls, *args, **kwds)
    822         return obj
    823 

TypeError: object.__new__() takes exactly one argument (the type to instantiate)

I've already searched for this error, but sadly I'm not familiar with either PyTorch or OOP, so I couldn't fix the problem. Could you please let me know what I should add to or remove from this code so I can run it? Thanks a lot in advance.

In case it's needed, our data looks like this:

{'input_ids': tensor([[   2, 1021, 1005,  ...,    0,    0,    0],
                      [   2, 1021, 1005,  ...,    0,    0,    0],
                      [   2, 1021, 1005,  ...,    0,    0,    0],
                      ...,
                      [   2, 1021, 1005,  ...,    0,    0,    0],
                      [   2,  103, 1005,  ...,    0,    0,    0],
                      [   2,    4,    0,  ...,    0,    0,    0]]), 
 'token_type_ids': tensor([[0, 0, 0,  ..., 0, 0, 0],
                           [0, 0, 0,  ..., 0, 0, 0],
                           [0, 0, 0,  ..., 0, 0, 0],
                           ...,
                           [0, 0, 0,  ..., 0, 0, 0],
                           [0, 0, 0,  ..., 0, 0, 0],
                           [0, 0, 0,  ..., 0, 0, 0]]), 
 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
                           [1, 1, 1,  ..., 0, 0, 0],
                           [1, 1, 1,  ..., 0, 0, 0],
                           ...,
                           [1, 1, 1,  ..., 0, 0, 0],
                           [1, 1, 1,  ..., 0, 0, 0],
                           [1, 1, 0,  ..., 0, 0, 0]]), 
 'labels': tensor([[   2, 1021, 1005,  ...,    0,    0,    0],
                   [   2, 1021, 1005,  ...,    0,    0,    0],
                   [   2, 1021, 1005,  ...,    0,    0,    0],
                   ...,
                   [   2, 1021, 1005,  ...,    0,    0,    0],
                   [   2, 1021, 1005,  ...,    0,    0,    0],
                   [   2,    4,    0,  ...,    0,    0,    0]])}

CodePudding user response:

Special ("dunder") methods in Python are written with a double-underscore prefix and suffix. Your class uses single underscores (_init_, _getitem_, _len_), so Python treats them as ordinary methods: __init__ is never defined, and the inputs argument is passed through to the default constructor, which accepts no extra arguments, producing the TypeError in your traceback. To implement a torch.utils.data.Dataset, you must define __init__, __getitem__, and __len__:

class MeditationsDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, *args, **kwargs):
        # Store the tokenizer output (input_ids, attention_mask, labels, ...)
        self.encodings = encodings
    def __getitem__(self, idx):
        # Return one sample as a dict of tensors, one entry per encoding key
        return {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
    def __len__(self):
        # One sample per row of input_ids
        return len(self.encodings.input_ids)
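
To sanity-check the fix, here is a minimal usage sketch. It assumes inputs is the tokenizer output shown in the question (a dict-like object of equal-length tensors); the batch_size of 16 is just an illustrative value, not from the original post:

from torch.utils.data import DataLoader

# Build the dataset from the tokenizer output and wrap it in a DataLoader;
# the default collate function stacks the per-sample dicts into batched tensors.
dataset = MeditationsDataset(inputs)
loader = DataLoader(dataset, batch_size=16, shuffle=True)

batch = next(iter(loader))
print(batch["input_ids"].shape)  # e.g. torch.Size([16, sequence_length])

With the dunder methods spelled correctly, the constructor call MeditationsDataset(inputs) no longer raises, and each batch is a dict with the same keys as the encodings (input_ids, token_type_ids, attention_mask, labels).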