Replacing duplicate values in a dataframe


I have the following dataframe:

import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 2, 5, 5, 6, 1, 1, 2, 2, 5, 9, 1, 2, 3, 3, 3, 5]})
print(df)

Which gives:

    ID
0    1
1    1
2    2
3    5
4    5
5    6
6    1
7    1
8    2
9    2
10   5
11   9
12   1
13   2
14   3
15   3
16   3
17   5

I want to replace duplicate values in the 'ID' column with the lowest value that has not been used yet. However, consecutive identical values should be treated as a group, and their values should be changed in the same way. For example: the first two values are both 1. These are consecutive, so they form a group, and the second '1' should therefore not be replaced with a '2'. Rows 14-16 are three consecutive threes. The value 3 has already been used to replace values above, so these threes need to be replaced; but they are consecutive, thus a group, and should all get the same replacement value. The expected outcome below should make this clearer:

    ID
0    1
1    1
2    2
3    5
4    5
5    6
6    3
7    3
8    4
9    4
10   7
11   9
12   8
13  10
14  11
15  11
16  11
17  12
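
For checking candidate solutions, the same expected outcome can also be written directly as a DataFrame (the name expected is just a convenience):

expected = pd.DataFrame({'ID': [1, 1, 2, 5, 5, 6, 3, 3, 4, 4, 7, 9, 8, 10, 11, 11, 11, 12]})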

CodePudding user response:

I came up with a way to get your outcome using for loops and dictionaries. It was more difficult than I expected, to be fair; the code can seem a bit complex at first, but it isn't. There is probably a way to do it with vectorized logical operations, but I don't know one.

import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 2, 5, 5, 6, 1, 1, 2, 2, 5, 9, 1, 2, 3, 3, 3, 5]})
print(df)
####################
diffs = np.diff(df.ID) # differences ID(k) - ID(k-1)
uniq = sorted(pd.unique(df.ID)) # unique values in ID column

# dict with range of numbers from min to max in ID col
d = {} # Empty dict
a = range(uniq[0],uniq[-1]*int(df.shape[0]/len(uniq))) # range values
d = {a[k]:False for k in range(len(a))} # Fill dict
d[df.ID[0]] = True # Set first value in col as True
     
for m in range(1,df.shape[0]):
    # Find a value different from previous one
    # therefore, beginning of new subgroup
    if diffs[m-1] != 0:
        # Check if value was before in the ID column
        if d[df.ID[m]] == True:
            # Get the lowest value which wasn't used
            lowest = [k for k, v in d.items() if v == False][0]
            # loop over the subgroup (which differences are 0)
            for n in range(m+1,df.shape[0]):
                if diffs[n-1] != 0: # If we find a new subgroup
                    break           # then stop looping
            # Replace the subgroup with the lowest value
            df.ID[m:n] = lowest # n is the final index of the subgroup
            # *Exception: in case the last number is a subgroup by itself,
            # the previous for loop doesn't work
            if m == df.shape[0]-1:
                df.ID[m] = lowest
    # Update the dictionary for values retrieved from the ID column
    d[df.ID[m]] = True
    
print(df)

So what you want to do is think of your ID column as subgroups, or separate arrays, check different conditions on them, and then perform different operations. You can picture the column as a set of multiple arrays:

[1, 1 | 2 | 5, 5 | 6 | 1, 1 | 2, 2 | 5 | 9 | 1 | 2 | 3, 3, 3 | 5]

What you need to do is find the limits of those subgroups and check whether they meet certain conditions (1. the value has not appeared before; 2. if it has, the subgroup gets the lowest number that has not been used yet). We can find the subgroups by calculating the difference between each value and the previous one:

diffs = np.diff(df.ID) # differences ID(k) - ID(k-1)
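
On the example data those differences look like this; every nonzero entry marks the start of a new subgroup:

print(diffs)
# [ 0  1  3  0  1 -5  0  1  0  3  4 -8  1  1  0  0  2]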

We can track the conditions using a dictionary whose keys are the integers in the array (plus any larger values we might need) and whose values record whether we have already used them (True or False).

To build it, we need the maximum value of the ID column. However, the dictionary has to contain more numbers than actually appear in the column (in your example max(input) = 9 but max(output) = 12). You could pick the upper bound arbitrarily; I chose to scale it by the ratio of the number of rows to the number of unique values in the column (the second argument of a = range(...)).

uniq = sorted(pd.unique(df.ID)) # unique values in ID column
# dict with range of numbers from min to max in ID col
d = {}
a = range(uniq[0],uniq[-1]*int(df.shape[0]/len(uniq)))
d = {a[k]:False for k in range(len(a))}
d[df.ID[0]] = True # Set first value in col as True
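
For this particular input, that sizing works out as follows (just the arithmetic behind the a = range(...) line; it is a heuristic, not a guarantee):

# uniq = [1, 2, 3, 5, 6, 9]  -> uniq[0] = 1, uniq[-1] = 9
# df.shape[0] = 18 rows, len(uniq) = 6 -> int(18 / 6) = 3
# a = range(1, 9 * 3) = range(1, 27), which comfortably covers max(output) = 12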

The last part of the code is a main for loop with a few ifs and another for loop inside; it works as follows:

# 1. Loop over the ID column.
# 2. Check if the ID[m] value is a different number from the previous one (diff != 0).
# 3. Check if the ID[m] value has already appeared in the ID column.
# 4. Calculate the lowest unused value (first key == False in the dict) and replace
#    the whole subgroup in the ID column with it.
# 5. Because of how step 4 is done, it fails when the last value is a subgroup by
#    itself, so there is a small extra condition to handle that case.
# 6. Update the dict every time a new value shows up.

I am sure there are many ways to shorten this code, but this works, and it should also work with larger dataframes under the same conditions.
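
For instance, one shorter route, offered only as a sketch of the same idea rather than a drop-in replacement (the names used, out and run are just for illustration), is to let itertools.groupby form the consecutive subgroups and keep a set of already used values:

from itertools import count, groupby

import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 2, 5, 5, 6, 1, 1, 2, 2, 5, 9, 1, 2, 3, 3, 3, 5]})

used = set()
out = []
for value, group in groupby(df.ID):   # consecutive identical values form one group
    run = sum(1 for _ in group)       # length of the group
    if value in used:                 # value seen before -> take the lowest unused one
        value = next(v for v in count(1) if v not in used)
    used.add(value)
    out.extend([value] * run)

df['ID'] = out
print(df)

On the example above this reproduces the expected outcome.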

CodePudding user response:

import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 2, 5, 5, 6, 1, 1, 2, 2, 5, 9, 1, 2, 3, 3, 3, 5]})


def fun():
    # Generator that receives one consecutive group at a time and remembers
    # which values have already been used.
    v, dub = 1, set()
    d = yield
    while True:
        num = d.iloc[0]['ID']
        if num in dub:
            # Value already in use -> advance to the lowest value not used yet
            while v in dub:
                v += 1
            d.ID = num = v
        dub.add(num)
        d = yield d


f = fun()
next(f)  # prime the generator

# Group consecutive identical IDs together, then send each group through the generator
df = df.groupby([df['ID'].diff().ne(0).cumsum(), 'ID'], as_index=False).apply(lambda x: f.send(x))
print(df)

Output:

    ID
0    1
1    1
2    2
3    5
4    5
5    6
6    3
7    3
8    4
9    4
10   7
11   9
12   8
13  10
14  11
15  11
16  11
17  12
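
The key piece is df['ID'].diff().ne(0).cumsum(), which labels each run of consecutive identical values with its own group key. Evaluated on the original column (re-created here as ids, since df has already been overwritten above) it gives:

import pandas as pd

ids = pd.Series([1, 1, 2, 5, 5, 6, 1, 1, 2, 2, 5, 9, 1, 2, 3, 3, 3, 5])
print(ids.diff().ne(0).cumsum().tolist())
# [1, 1, 2, 3, 3, 4, 5, 5, 6, 6, 7, 8, 9, 10, 11, 11, 11, 12]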

CodePudding user response:

import pandas as pd
import pdrle

df = pd.DataFrame({'ID': [1, 1, 2, 5, 5, 6, 1, 1, 2, 2, 5, 9, 1, 2, 3, 3, 3, 5]})

def foo(arr):
    def find(d, prev):
        # Return the lowest value greater than prev that has not been used yet
        while True:
            prev += 1
            if d.get(prev) is None:
                return prev

    ans = []
    d = {}
    prev = 1
    for el in arr:
        if d.get(el) is None:
            ans.append(el)
            d[el] = True
        else:
            curr = find(d, prev)
            d[curr] = True
            ans.append(curr)
            prev = curr
        
    return ans

rle = pdrle.encode(df.ID)          # run-length encode: one row per consecutive group
rle["new"] = foo(rle.vals)         # replace values of groups that appeared before
rle.new.repeat(rle.runs).reset_index(drop=True)  # expand back to the original length
#new
#0  1
#1  1
#2  2
#3  5
#4  5
#5  6
#6  3
#7  3
#8  4
#9  4
#10 7
#11 9
#12 8
#13 10
#14 11
#15 11
#16 11
#17 12
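
For reference, judging from how it is used above, pdrle.encode(df.ID) should give one row per consecutive run, with the run values in vals and their lengths in runs, roughly like this:

#    vals  runs
# 0     1     2
# 1     2     1
# 2     5     2
# 3     6     1
# 4     1     2
# 5     2     2
# 6     5     1
# 7     9     1
# 8     1     1
# 9     2     1
# 10    3     3
# 11    5     1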