How to randomly drop duplicate values in a dataframe

Let's say I have a dataframe like this:

    A   B   C   D   E
0   5   0   16  32  48
1   5   1   17  33  49
2   5   2   18  34  50
3   5   3   19  35  51
4   5   4   20  36  52
5   4   5   21  37  53
6   4   6   22  38  54
7   3   7   23  39  55
8   3   8   24  40  56
9   3   9   25  41  57
10  3   10  26  42  58
11  2   11  27  43  59
12  2   12  28  44  60
13  2   13  29  45  61
14  2   14  30  46  62
15  2   15  31  47  63

As you can see, column A has a lot of duplicate values. How can I randomly delete some of them? Let's say I want to delete only 50% of the duplicate values and end up with a df like this:

    A   B   C   D   E
0   5   0   16  32  48
1   5   1   17  33  49
2   5   2   18  34  50
3   4   5   21  37  53
4   3   7   23  39  55
5   3   8   24  40  56
6   2   11  27  43  59
7   2   12  28  44  60
8   2   13  29  45  61

I tried this, but it does not work:

df = df.sample(frac=0.5).drop_duplicates(subset=['A']).reset_index(drop=True)

Also, is there a way to make this percentage a random parameter? For example, column A in my dataframe has 4 unique values, so let's say:

import random

a = df['A'].nunique()  # let's say 4
percentage_values = [random.randint(30, 50) for i in range(a)]  # let's say [31, 47, 34, 42]

For my dataframe, I want to remove 31% of the rows with 5, 47% of the rows with 4, 34% of the rows with 3, and 42% of the rows with 2. Is this possible? It should not be hard-coded like this, but driven by the values in the list and by the duplicate values in column A of the dataframe.

CodePudding user response:

As an answer to your first (sub)question: you were close with sample. However, try using groupby instead of drop_duplicates:

df.groupby('A').apply(
    pd.DataFrame.sample, frac=0.5
).reset_index(level=0, drop=True).sort_index()
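
A shorter variant of the same idea, as a sketch: recent pandas versions (1.1 and later) also provide a sample method directly on the groupby object, so the apply and reset_index steps can be dropped. The dataframe below is rebuilt from the question; random_state=0 is only an assumption to make the run repeatable.

```python
import pandas as pd

# Rebuild the dataframe from the question.
df = pd.DataFrame({
    "A": [5] * 5 + [4] * 2 + [3] * 4 + [2] * 5,
    "B": range(0, 16),
    "C": range(16, 32),
    "D": range(32, 48),
    "E": range(48, 64),
})

# Sample half of the rows within each value of A, then restore row order.
sampled = df.groupby("A").sample(frac=0.5, random_state=0).sort_index()
print(sampled.reset_index(drop=True))
```

The number of rows kept per value is rounded to a whole number, so small groups only approximate 50%.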

I'm not sure how you want to calculate the percentages in your second (sub)question, but perhaps you can work it out with groupby in mind.
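
One possible way to do the second part with groupby, as a rough sketch rather than a definitive answer: map each unique value of A to its own removal percentage, then sample the complementary fraction from each group. The names drop_random_share and drop_pct and the seed are hypothetical, chosen only for this illustration; the 30 to 50 range comes from the question.

```python
import random
import pandas as pd

def drop_random_share(df, column, drop_pct, rng):
    """Randomly drop drop_pct[value] percent of the rows for each value in column."""
    kept = []
    for value, group in df.groupby(column, sort=False):
        keep_frac = 1 - drop_pct[value] / 100
        # Draw a fresh integer seed per group so the run stays reproducible.
        kept.append(group.sample(frac=keep_frac, random_state=rng.randrange(2**32)))
    return pd.concat(kept).sort_index()

# Rebuild the dataframe from the question.
df = pd.DataFrame({
    "A": [5] * 5 + [4] * 2 + [3] * 4 + [2] * 5,
    "B": range(0, 16),
    "C": range(16, 32),
    "D": range(32, 48),
    "E": range(48, 64),
})

rng = random.Random(0)  # assumed seed, only for repeatability
values = df["A"].unique()
# One random removal percentage per unique value, as in the question.
drop_pct = {value: rng.randint(30, 50) for value in values}

result = drop_random_share(df, "A", drop_pct, rng).reset_index(drop=True)
print(drop_pct)
print(result)
```

Because a group can only keep a whole number of rows, the share actually removed from a small group will differ somewhat from the requested percentage.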

CodePudding user response:

Apply drop_duplicates twice to the main dataframe, once keeping the first row and once keeping the last row for each value, then concatenate the results:

df1 = df.drop_duplicates(subset='A', keep="first")
df2 = df.drop_duplicates(subset='A', keep="last")
df = pd.concat([df1, df2], axis=0).drop_duplicates()
df.head()

# output
    A   B   C   D   E
0   5   0   16  32  48
5   4   5   21  37  53
7   3   7   23  39  55
11  2   11  27  43  59
4   5   4   20  36  52
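
A quick check of what this approach keeps, assuming the question's dataframe: every value of A ends up with at most two rows (its first and its last occurrence), so the selection is deterministic rather than random and does not depend on a percentage. The variable names below are only for this illustration.

```python
import pandas as pd

# Rebuild the dataframe from the question.
df = pd.DataFrame({
    "A": [5] * 5 + [4] * 2 + [3] * 4 + [2] * 5,
    "B": range(0, 16),
    "C": range(16, 32),
    "D": range(32, 48),
    "E": range(48, 64),
})

df1 = df.drop_duplicates(subset="A", keep="first")
df2 = df.drop_duplicates(subset="A", keep="last")
result = pd.concat([df1, df2], axis=0).drop_duplicates()

# Count how many rows survive for each value of A.
rows_per_value = result.groupby("A").size()
print(rows_per_value)          # two rows for every value of A in this data
print(sorted(result.index))    # original row labels that were kept
```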