I've the following dataframe:
import numpy as np
import pandas as pd
df = pd.DataFrame({'ID': [1, 1, 2, 5, 5, 6, 1, 1, 2, 2, 5, 9, 1, 2, 3, 3, 3, 5]})
print(df)
Which gives:
ID
0 1
1 1
2 2
3 5
4 5
5 6
6 1
7 1
8 2
9 2
10 5
11 9
12 1
13 2
14 3
15 3
16 3
17 5
I want to replace duplicate values in the 'ID' column with the lowest, not yet used, value. However, consequtive identical values should be seen as a group and their values should be changed in the same way. For example: the first two values are both 1. These are consequtive so they are a group and the second '1' should therefore not be replaced with a '2'. Row 14-16 are three consequtive threes. The value 3 has already been used to replace above values, so these threes need to be replaced. But they're consequtive, thus a group, and should get the same replacemnt value. The expected outcome is as follows and will make it more clear:
ID
0 1
1 1
2 2
3 5
4 5
5 6
6 3
7 3
8 4
9 4
10 7
11 9
12 8
13 10
14 11
15 11
16 11
17 12
CodePudding user response:
I made up a way to get your outcome using for loops and dictionaries. More difficult I expected to be fair, the code can seem a bit complex at first but it isnt. Probably, there's a way to do it with multiple logical vectors, but I don't know.
import numpy as np
import pandas as pd
df = pd.DataFrame({'ID': [1, 1, 2, 5, 5, 6, 1, 1, 2, 2, 5, 9, 1, 2, 3, 3, 3, 5]})
print(df)
####################
diffs = np.diff(df.ID) # differences ID(k) - ID(k-1)
uniq = sorted(pd.unique(df.ID)) # unique values in ID colums
# dict with range of numbers from min to max in ID col
d = {} # Empty dict
a = range(uniq[0],uniq[-1]*int(df.shape[0]/len(uniq))) # range values
d = {a[k]:False for k in range(len(a))} # Fill dict
d[df.ID[0]] = True # Set first value in col as True
for m in range(1,df.shape[0]):
# Find a value different from previous one
# therefore, beginning of new subgroup
if diffs[m-1] != 0:
# Check if value was before in the ID column
if d[df.ID[m]] == True:
# Get the lowest value which wasn't used
lowest = [k for k, v in d.items() if v == False][0]
# loop over the subgroup (which differences are 0)
for n in range(m 1,df.shape[0]):
if diffs[n-1] > 0: # If find a new subgroup
break # then stop looping
# Replace the subgroup with the lowest value
df.ID[m:n] = lowest # n is the final index of the subgroup
# *Exception in case last number is a subgroup itself
# then previous for loop doesnt work
if m == df.shape[0]-1:
df.ID[m] = lowest
# Update dictionaries for values retrieved from ID column
d[df.ID[m]] = True
print(df)
Therefore, what you want is to think about your column ID as subgroups or different arrays, checking different conditions and making different operations then. You can think about your column as a set of multiple arrays:
[1, 1 | 2 | 5, 5 | 6 | 1, 1 | 2, 2 | 5 | 9 | 1 | 2 | 3, 3, 3 | 5]
What you need to do is find the limits of that subgroups and check whether they meet certain conditions (1. not a previous number, 2. the lowest number which we didn't use). We can know the subgroups if we calculate the differences between a value and the previous one
diffs = np.diff(df.ID) # differences ID(k) - ID(k-1)
We can know the conditions using a dictionary which keys are the integers in the array or longer values we could need and values are whether we have used them or not (True or False).
To do so, we need to do the max value of the ID column. However, we need to build the dictionary with more numbers as there are in the column (in your example the max(input) = 9 and max(output) = 12). You could do randomly, I chose to calculate the possible proportion we could need attending to the number of rows and number of unique values in the column (the last input in a = range... ).
uniq = sorted(pd.unique(df.ID)) # unique values in ID colums
# dict with range of numbers from min to max in ID col
d = {}
a = range(uniq[0],uniq[-1]*int(df.shape[0]/len(uniq)))
d = {a[k]:False for k in range(len(a))}
d[df.ID[0]] = True # Set first value in col as True
Last part of the code is a main for loop with some If and another for inside, it works as:
# 1. Loop over ID column
# 2. Check if ID[m] value is different number from previous one (diff != 0)
# 3. Check if the ID[m] value is already in the ID column.
# 4. Calculate lowest value (first key == False in dict) and change the subset of
# in the ID
# 5. How is made step 4, if last value is a subset itself, it doesn't work, so
# there's a little condition to check it will work.
# 6. Update the dict every time a new value shows up.
I am sure there are many ways to shorten this code. But this work and should work with larger dataframes and the same conditions.
CodePudding user response:
df = pd.DataFrame({'ID': [1, 1, 2, 5, 5, 6, 1, 1, 2, 2, 5, 9, 1, 2, 3, 3, 3, 5]})
def fun():
v, dub = 1, set()
d = yield
while True:
num = d.iloc[0]['ID']
if num in dub:
while v in dub:
v = 1
d.ID = num = v
dub.add(num)
d = yield d
f = fun()
next(f)
df = df.groupby([df['ID'].diff().ne(0).cumsum(), 'ID'], as_index=False).apply(lambda x: f.send(x))
print(df)
Output:
ID
0 1
1 1
2 2
3 5
4 5
5 6
6 3
7 3
8 4
9 4
10 7
11 9
12 8
13 10
14 11
15 11
16 11
17 12
CodePudding user response:
import pdrle
def foo(arr):
def find(d, prev):
while True:
prev = 1
if d.get(prev) is None:
return prev
ans = []
d = {}
prev = 1
for el in arr:
if d.get(el) is None:
ans.append(el)
d[el] = True
else:
curr = find(d, prev)
d[curr] = True
ans.append(curr)
prev = curr
return ans
rle = pdrle.encode(df.ID)
rle["new"] = foo(rle.vals)
rle.new.repeat(rle.runs).reset_index(drop=True)
#new
#0 1
#1 1
#2 2
#3 5
#4 5
#5 6
#6 3
#7 3
#8 4
#9 4
#10 7
#11 9
#12 8
#13 10
#14 11
#15 11
#16 11
#17 12