Dataset splitting with pandas sample and drop does not work as expected

I have a training dataset with 4,000 examples, and I want to split it randomly into two equal sub-datasets of 2,000 examples each. As suggested here, I used the sample and drop methods like so:

I1 = train_df.sample(frac=0.5, random_state=opts.seed)
I2 = train_df.drop(index=I1.index)

However, it seems to drop more rows than it should, for no apparent reason:

print(len(train_df))
print(len(I1))
print(len(I2))

4000
2000
1010

I would appreciate any insight as to why it happens.

CodePudding user response:

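The most likely cause is that train_df contains duplicate index labels. drop(index=I1.index) removes every row whose label appears in I1.index, so rows that were not sampled but share a label with a sampled row are dropped as well. Resetting the index to unique values first should work as expected: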

train_df = train_df.reset_index(drop=True)
I1 = train_df.sample(frac=0.5, random_state=opts.seed)
I2 = train_df.drop(index=I1.index)
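
To check whether duplicate index labels really are the cause, the behaviour can be reproduced on a small stand-in frame. The sketch below is only an illustration: the way train_df is built (concatenating two frames without ignore_index=True), the column name x, and the seed value standing in for opts.seed are all assumptions, not details from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for train_df: 4,000 rows whose index labels repeat,
# as can happen after concatenating frames without ignore_index=True.
parts = [pd.DataFrame({"x": np.arange(2000)}) for _ in range(2)]
train_df = pd.concat(parts)  # labels 0..1999 each appear twice

# Quick diagnostic: a non-unique index makes label-based drop() unsafe here.
print(train_df.index.is_unique)  # False

seed = 0  # placeholder for opts.seed

# With duplicate labels, drop() removes every row sharing a sampled label,
# so I2 ends up with fewer rows than the 2,000 that were not sampled.
I1 = train_df.sample(frac=0.5, random_state=seed)
I2 = train_df.drop(index=I1.index)
print(len(I1), len(I2))

# After resetting to a unique RangeIndex, the two halves partition the data.
fixed_df = train_df.reset_index(drop=True)
J1 = fixed_df.sample(frac=0.5, random_state=seed)
J2 = fixed_df.drop(index=J1.index)
print(len(J1), len(J2))  # 2000 2000
```

Running train_df.index.is_unique on the real data is the quickest way to confirm the diagnosis. If every label happened to appear exactly twice, roughly a quarter of the rows (about 1,000) would be expected to survive the drop, which is close to the 1010 reported in the question, though that is only a guess about how train_df was built.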