I have a training dataset with 4,000 examples, and I want to split it randomly into two equal sub-datasets of 2,000 examples each. As suggested here, I used the sample and drop methods like so:
I1 = train_df.sample(frac=0.5, random_state=opts.seed)
I2 = train_df.drop(index=I1.index)
However, it seems to drop more rows than expected, for no apparent reason:
print(len(train_df))
print(len(I1))
print(len(I2))
4000
2000
1010
I would appreciate any insight as to why it happens.
CodePudding user response:
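The most likely cause is that train_df has duplicate index labels. drop(index=...) removes every row whose label appears in I1.index, so any unsampled row that shares a label with a sampled row is dropped as well. Resetting the index first should work as expected: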
train_df = train_df.reset_index(drop=True)
I1 = train_df.sample(frac=0.5, random_state=opts.seed)
I2 = train_df.drop(index=I1.index)
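To check whether duplicate labels are really the cause, the effect can be reproduced on a small frame. The sketch below is only an illustration: the toy data, the names df, half, rest, first and second, and the seed 0 are assumptions, not taken from the question. It also shows a positional split with iloc, which ignores labels and so keeps the original index intact.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: 8 rows whose index labels repeat (0..3 twice),
# as can happen after concatenating frames without ignore_index=True.
df = pd.DataFrame({"x": range(8)}, index=[0, 1, 2, 3, 0, 1, 2, 3])
print(df.index.is_unique)  # False

# Same pattern as in the question: sample half, then drop by label.
half = df.sample(frac=0.5, random_state=0)
rest = df.drop(index=half.index)

# drop() removes every row carrying a sampled label, so rest holds
# len(df) minus all rows whose label occurs in half.
print(len(half), len(rest))

# Alternative: split by row position, which does not depend on labels.
rng = np.random.default_rng(0)
perm = rng.permutation(len(df))
first = df.iloc[perm[: len(df) // 2]]
second = df.iloc[perm[len(df) // 2 :]]
print(len(first), len(second))
```

If train_df.index.is_unique prints False on the real data, either reset_index(drop=True) or the positional split should give two halves of 2,000 rows each.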