I have a df like this:
id_1 box reference id_2 id_order
0 19588.0 100.0 300025886 1.0 100.0
1 16009.0 NaN 300025886 NaN NaN
2 19148.0 NaN 300025886 NaN NaN
3 16936.0 101.0 300025886 3.0 101.0
4 13127.0 102.0 300025886 4.0 102.0
5 15064.0 103.0 300025886 6.0 103.0
6 10630.0 104.0 300025886 7.0 104.0
7 12401.0 105.0 300025886 8.0 105.0
8 13163.0 106.0 300025886 9.0 106.0
24 NaN NaN 300025886 5.0 108.0
23 NaN NaN 300025886 2.0 107.0
25 NaN NaN 300025887 16.0 207.0
16 17697.0 206.0 300025887 15.0 206.0
15 19527.0 205.0 300025887 14.0 205.0
14 10538.0 204.0 300025887 13.0 204.0
13 11808.0 203.0 300025887 12.0 203.0
12 17577.0 202.0 300025887 11.0 202.0
11 17964.0 201.0 300025887 10.0 201.0
10 13118.0 NaN 300025887 NaN NaN
9 17891.0 200.0 300025887 NaN NaN
17 18426.0 300.0 300025888 17.0 300.0
18 17632.0 301.0 300025888 18.0 301.0
19 14726.0 302.0 300025888 19.0 302.0
20 18361.0 303.0 300025888 20.0 303.0
21 12908.0 NaN 300025888 NaN NaN
22 10279.0 NaN 300025888 NaN NaN
26 NaN NaN 300025888 21.0 304.0
27 NaN NaN 300025888 22.0 305.0
28 NaN NaN 300025888 23.0 306.0
29 NaN NaN 300025888 24.0 307.0
There are 3 different situations:
box = id_order
- reference 300025886 has the same number of id_1 and id_2 values: the 2 NaN in box should be replaced with the 2 values left over in id_order
- reference 300025887: id_1 has 8 values and id_2 has 7 values, so we do nothing, because id_1 has more values than id_2
- reference 300025888: id_1 has 6 values and id_2 has 8 values, so we fill box (NaN) with values left over from id_order
so the result should look like:
id_1 box reference id_2 id_order new
0 19588 100 300025886 1 100 100
1 16009 NaN 300025886 NaN NaN 108
2 19148 NaN 300025886 NaN NaN 107
3 16936 101 300025886 3 101 101
4 13127 102 300025886 4 102 102
5 15064 103 300025886 6 103 103
6 10630 104 300025886 7 104 104
7 12401 105 300025886 8 105 105
8 13163 106 300025886 9 106 106
24 NaN NaN 300025886 5 108 NaN
23 NaN NaN 300025886 2 107 NaN
25 NaN NaN 300025887 16 207 NaN
16 17697 206 300025887 15 206 206
15 19527 205 300025887 14 205 205
14 10538 204 300025887 13 204 204
13 11808 203 300025887 12 203 203
12 17577 202 300025887 11 202 202
11 17964 201 300025887 10 201 201
10 13118 NaN 300025887 NaN NaN NaN
9 17891 200 300025887 NaN NaN 200
17 18426 300 300025888 17 300 300
18 17632 301 300025888 18 301 301
19 14726 302 300025888 19 302 302
20 18361 303 300025888 20 303 303
21 12908 NaN 300025888 NaN NaN 306
22 10279 NaN 300025888 NaN NaN 304
26 NaN NaN 300025888 21 304 NaN
27 NaN NaN 300025888 22 305 NaN
28 NaN NaN 300025888 23 306 NaN
29 NaN NaN 300025888 24 307 NaN
So every id_1 gets a value in the 'new' column: taken from box if box is not NaN, and from id_order if it is NaN, with these conditions:
- if count(id_1) = count(id_2), just fill the missing values
- if count(id_1) > count(id_2), do nothing, so the NaN stays
- if count(id_1) < count(id_2), randomly take 'free' id_order values and replace the NaN with them, without duplicates
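The three cases can be told apart by counting the non-NaN values per reference. This is only a minimal sketch on a small made-up frame (the values and the 'situation' labels are illustrative, not taken from the data above):

```python
import numpy as np
import pandas as pd

# Small made-up frame with the same column names as above (values are illustrative)
df = pd.DataFrame({
    "reference": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "id_1": [10.0, 11.0, np.nan, 12.0, 13.0, 14.0, 15.0, np.nan, np.nan],
    "id_2": [1.0, np.nan, 2.0, 3.0, np.nan, np.nan, 4.0, 5.0, 6.0],
})

# count() ignores NaN, so this gives the number of real values per reference
counts = df.groupby("reference")[["id_1", "id_2"]].count()

# Label each reference with the situation it falls into
counts["situation"] = np.select(
    [counts["id_1"] == counts["id_2"], counts["id_1"] > counts["id_2"]],
    ["equal", "do nothing"],
    default="fill from id_order",
)
print(counts)
```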
I tried fillna, merge, groupby, some loops (but couldn't work out all the conditions), etc., but I can't figure out how to combine these methods to get the desired result.
I need to fill the NaN in the box column, for rows that have an id_1, with the values left over (if there are any) in the id_order column. The reference column is shared by id_1, box and by id_2, id_order.
CodePudding user response:
Filling the NaN values in 'box' with 'id_order' is simple:
df['new'] = df['box'].fillna(df['id_order'])
But I don't know of a built-in function that fills the remaining values by incrementing downward with the id. You may have to iterate over the rows to accomplish this.
Here is a good place to start:
# Visit only the rows where 'new' is still NaN
for ix, row in df[df['new'].isna()].iterrows():
    if ix == 0:
        continue
    # More 'elif' logic here
    df.loc[ix, 'new'] = df.loc[ix - 1, 'new'] + 1
You may want to look into setting the 'new' column as the index and using various functions to make the index unique, as in this question: create Pandas Dataframe with unique index
CodePudding user response:
From what I understand, the conditions count(id_1) = count(id_2) and count(id_1) < count(id_2) can be handled together, since the only difference between them is the random pick without duplicates. NumPy has a function for that:
np.random.choice(a, size=None, replace=False, p=None)
With replace=False (the default is replace=True) no value is picked twice. We pass the list of values that can go into the empty places as a, and the number of empty places as size.
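To see the sampling step on its own, here is a small sketch. The names free_values and n_gaps are made up for illustration and the values are hypothetical leftovers:

```python
import numpy as np

# Hypothetical leftover id_order values and the number of gaps to fill
free_values = [304.0, 305.0, 306.0, 307.0]
n_gaps = 2

# replace=False samples without replacement, so no value is picked twice
picked = np.random.choice(free_values, size=n_gaps, replace=False)
print(picked)
```

Note that with replace=False the call raises a ValueError if size is larger than the number of available values, so it only works when there are at least as many free values as gaps.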
import numpy as np

# 'new' starts out as a copy of 'box'
data['new'] = data['box']
for ref in data['reference'].unique():
    # Work on the rows of one reference at a time.
    # The slice keeps the original row indices, which we use later to write back into data
    df = data.loc[data['reference'] == ref]
    # Get the indices of the rows that are null in id_1, id_2 and new (same as box)
    id_1_na_ind = df[df['id_1'].isnull()].index
    id_2_na_ind = df[df['id_2'].isnull()].index
    new_na_ind = df[df['new'].isnull()].index
    # Check the condition count(id_1) = count(id_2) or count(id_1) < count(id_2).
    # Instead of the non-null count, the null count is used here,
    # so the comparison flips: null_count(id_1) >= null_count(id_2)
    if len(id_1_na_ind) >= len(id_2_na_ind):
        # Rows to fill: new is null but id_1 is not
        # (the null rows of new minus the null rows of id_1)
        place_to_replace = list(set(new_na_ind) - set(id_1_na_ind))
        # Values to fill with: id_order values that are not already in box, excluding NaN
        not_dup = [x for x in set(df['id_order']) - set(df['box']) if not np.isnan(x)]
        # Pick n values from not_dup without duplicates,
        # where n = len(place_to_replace)
        picked = np.random.choice(not_dup, len(place_to_replace), replace=False)
        data.loc[place_to_replace, 'new'] = picked
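The same idea can also be wrapped in a function and tried on a tiny frame. This is only a sketch under a few assumptions: fill_new and demo are names made up here, the demo values are illustrative, and when there are fewer free id_order values than gaps it fills only as many gaps as it can.

```python
import numpy as np
import pandas as pd

def fill_new(data, seed=None):
    """Return a copy of data with a 'new' column filled per reference."""
    rng = np.random.default_rng(seed)
    out = data.copy()
    out["new"] = out["box"]
    for _, grp in data.groupby("reference"):
        # Do nothing when id_1 has more values than id_2
        if grp["id_1"].count() > grp["id_2"].count():
            continue
        # Rows that have an id_1 but no box yet
        gaps = grp.index[grp["id_1"].notna() & grp["box"].isna()]
        # id_order values not already used as a box in this reference
        used = grp["box"].dropna()
        is_free = grp["id_order"].notna() & ~grp["id_order"].isin(used)
        free = grp.loc[is_free, "id_order"].to_numpy()
        n = min(len(gaps), len(free))
        if n == 0:
            continue
        # Random pick without duplicates
        out.loc[gaps[:n], "new"] = rng.choice(free, size=n, replace=False)
    return out

# Illustrative data: reference 1 has equal counts, reference 2 has more id_1 than id_2
demo = pd.DataFrame({
    "id_1": [1.0, 2.0, np.nan, 3.0, 4.0],
    "box": [100.0, np.nan, np.nan, 200.0, np.nan],
    "reference": [1, 1, 1, 2, 2],
    "id_2": [1.0, np.nan, 2.0, 1.0, np.nan],
    "id_order": [100.0, np.nan, 101.0, 200.0, np.nan],
})
result = fill_new(demo, seed=0)
print(result)
```

Passing a seed makes the random pick repeatable, and working on a copy leaves the input frame unchanged.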
Hope this gives you the desired result.