Excuse my lack of understanding - I am very new to Python programming.
Imagine I have the following code:
df_filtered.drop_duplicates(subset=['date'], keep='first', inplace=True)
How can I randomise which duplicate is kept, instead of always keeping the first? Something like:
df_filtered.drop_duplicates(subset=['date'], keep='random', inplace=True)
CodePudding user response:
Example
import pandas as pd
data = {'col1': {0: 'A', 1: 'B', 2: 'B', 3: 'B', 4: 'A'},
        'col2': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5}}
df = pd.DataFrame(data)
df
  col1  col2
0    A     1
1    B     2
2    B     3
3    B     4
4    A     5
Code
shuffle -> drop duplicates -> sort by index
out = df.sample(frac=1).drop_duplicates('col1').sort_index()
out
The result is random: which row is kept for each col1 value changes from run to run.
One possible output:
  col1  col2
0    A     1
2    B     3
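If you need the random pick to be repeatable, a minimal sketch is to pass random_state to sample. The helper name drop_duplicates_random and the seed value 0 are my own choices for illustration; they are not part of pandas.

```python
import pandas as pd

def drop_duplicates_random(df, subset, random_state=None):
    """Keep one randomly chosen row per duplicate group (hypothetical helper)."""
    # Shuffle all rows, keep the first row seen for each key, restore index order.
    shuffled = df.sample(frac=1, random_state=random_state)
    return shuffled.drop_duplicates(subset=subset).sort_index()

df = pd.DataFrame({'col1': ['A', 'B', 'B', 'B', 'A'],
                   'col2': [1, 2, 3, 4, 5]})
out = drop_duplicates_random(df, 'col1', random_state=0)
print(out)
```

With a fixed seed the same rows come back on every run; leave random_state as None to get a fresh pick each time.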
CodePudding user response:
According to the documentation, the only available options for keep are 'first', 'last' and False.
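For reference, here is a small sketch of what the three documented keep values do. The frame and its column names are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'col1': ['A', 'B', 'B', 'B', 'A', 'C'],
                   'col2': [1, 2, 3, 4, 5, 6]})

# keep='first': keep the first row of each group of duplicates.
first = df.drop_duplicates(subset=['col1'], keep='first')
# keep='last': keep the last row of each group of duplicates.
last = df.drop_duplicates(subset=['col1'], keep='last')
# keep=False: drop every row whose key appears more than once.
unique_only = df.drop_duplicates(subset=['col1'], keep=False)

print(first.index.tolist(), last.index.tolist(), unique_only.index.tolist())
```

None of these picks a row at random, which is why a workaround is needed.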
However, you can adopt a multi-staged approach.
First, select every row that has a duplicate:
dups = df[df.duplicated(subset=['column'], keep=False)]
Then select the rows that have no duplicate:
nodups = df[~df.duplicated(subset=['column'], keep=False)]
Randomly sample the duplicates (fill in frac and choose replace=True or replace=False):
dups = dups.sample(frac=, replace=True/False, random_state=1)
Finally, combine the sampled duplicates and the non-duplicates with concat:
pd.concat([dups, nodups])
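The steps above can be sketched end to end as follows. Note that sampling dups as a whole with frac does not by itself guarantee exactly one row per key, so this sketch swaps in groupby(...).sample(n=1), which assumes pandas 1.1 or later. The example frame and the 'value' column are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'column': ['A', 'B', 'B', 'B', 'A', 'C'],
                   'value': [1, 2, 3, 4, 5, 6]})

# Split into rows whose key occurs more than once and rows whose key is unique.
mask = df.duplicated(subset=['column'], keep=False)
dups, nodups = df[mask], df[~mask]

# Assumption: we want exactly one randomly chosen row per duplicated key.
picked = dups.groupby('column').sample(n=1, random_state=1)

# Recombine and restore the original row order.
out = pd.concat([picked, nodups]).sort_index()
print(out)
```

Drop random_state=1 if you want a different pick on each run.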