Python: how to randomise drop_duplicates using datetime?

Time:12-16

Excuse my lack of understanding - I am very new to Python programming.

Imagine I have the following code:

df_filtered.drop_duplicates(subset=['date'], keep='first', inplace=True)

How can I randomise which duplicate is kept, instead of always keeping the first? Something like:

df_filtered.drop_duplicates(subset=['date'], keep='random', inplace=True)

CodePudding user response:

Example

import pandas as pd

data = {'col1': {0: 'A', 1: 'B', 2: 'B', 3: 'B', 4: 'A'},
        'col2': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5}}
df = pd.DataFrame(data)

df

  col1  col2
0    A     1
1    B     2
2    B     3
3    B     4
4    A     5

Code

shuffle -> drop duplicates -> sort by index

out = df.sample(frac=1).drop_duplicates('col1').sort_index()

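The rows kept in out differ from run to run, because sample(frac=1) shuffles the rows before drop_duplicates keeps the first row of each group.

One possible result:

  col1  col2
0    A     1
2    B     3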
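Applied to the question's date column, the same shuffle-then-drop idea might look like the sketch below. The df_filtered data, the value column, and the drop_duplicates_random helper are made up for illustration; passing a seed through random_state is optional and only makes the random pick repeatable.

```python
import pandas as pd

# Hypothetical stand-in for the asker's df_filtered, with repeated dates.
df_filtered = pd.DataFrame({
    "date": pd.to_datetime(["2022-12-14", "2022-12-14", "2022-12-15",
                            "2022-12-16", "2022-12-16", "2022-12-16"]),
    "value": [10, 20, 30, 40, 50, 60],
})

def drop_duplicates_random(df, subset, seed=None):
    """Keep one randomly chosen row per duplicate group (hypothetical helper)."""
    shuffled = df.sample(frac=1, random_state=seed)     # shuffle all rows
    deduped = shuffled.drop_duplicates(subset=subset)   # 'first' is now a random row
    return deduped.sort_index()                         # restore the original order

result = drop_duplicates_random(df_filtered, subset=["date"], seed=42)
print(result)
```

Leaving seed as None gives a different selection on each run, which is closest to what keep='random' would have meant.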

CodePudding user response:

According to the documentation, the only options available for keep are 'first', 'last', and False.

However, you can adopt a multi-staged approach.

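Select every row whose value is duplicated:

dups = df[df.duplicated(subset=['column'], keep=False)]

Select every row whose value is not duplicated:

nodups = df[~df.duplicated(subset=['column'], keep=False)]

Shuffle the duplicated rows and keep one row per value (random_state=1 makes the pick repeatable; omit it to get a different pick on each run):

dups = dups.sample(frac=1, random_state=1).drop_duplicates(subset=['column'])

Combine the sampled duplicates and the non-duplicates with concat:

pd.concat([dups, nodups])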
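The staged approach can be sketched end to end as follows. This version swaps in groupby(...).sample(n=1), which should be available in pandas 1.1 and later, to pick exactly one row per duplicated value. The toy frame and the column names col1 and col2 are assumptions for illustration.

```python
import pandas as pd

# Toy data (assumed): 'A' and 'B' are duplicated, 'C' appears once.
df = pd.DataFrame({"col1": ["A", "B", "B", "B", "A", "C"],
                   "col2": [1, 2, 3, 4, 5, 6]})

# Stage 1: split rows into duplicated and non-duplicated values.
mask = df.duplicated(subset=["col1"], keep=False)
dups = df[mask]
nodups = df[~mask]

# Stage 2: pick one random row from each group of duplicates.
# random_state is only here to make the sketch repeatable.
picked = dups.groupby("col1").sample(n=1, random_state=1)

# Stage 3: recombine and restore the original row order.
out = pd.concat([picked, nodups]).sort_index()
print(out)
```

Splitting first means only the duplicated rows are shuffled, which may matter on a large frame where most values are already unique.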