Python pandas replace NaN from another column in table with different conditions

Time:07-28

I have a df like this:

    id_1     box    reference   id_2    id_order
0   19588.0  100.0  300025886   1.0     100.0
1   16009.0  NaN    300025886   NaN     NaN
2   19148.0  NaN    300025886   NaN     NaN
3   16936.0  101.0  300025886   3.0     101.0
4   13127.0  102.0  300025886   4.0     102.0
5   15064.0  103.0  300025886   6.0     103.0
6   10630.0  104.0  300025886   7.0     104.0
7   12401.0  105.0  300025886   8.0     105.0
8   13163.0  106.0  300025886   9.0     106.0
24  NaN      NaN    300025886   5.0     108.0
23  NaN      NaN    300025886   2.0     107.0
25  NaN      NaN    300025887   16.0    207.0
16  17697.0  206.0  300025887   15.0    206.0
15  19527.0  205.0  300025887   14.0    205.0
14  10538.0  204.0  300025887   13.0    204.0
13  11808.0  203.0  300025887   12.0    203.0
12  17577.0  202.0  300025887   11.0    202.0
11  17964.0  201.0  300025887   10.0    201.0
10  13118.0  NaN    300025887   NaN     NaN
9   17891.0  200.0  300025887   NaN     NaN
17  18426.0  300.0  300025888   17.0    300.0
18  17632.0  301.0  300025888   18.0    301.0
19  14726.0  302.0  300025888   19.0    302.0
20  18361.0  303.0  300025888   20.0    303.0
21  12908.0  NaN    300025888   NaN     NaN
22  10279.0  NaN    300025888   NaN     NaN
26  NaN      NaN    300025888   21.0    304.0
27  NaN      NaN    300025888   22.0    305.0
28  NaN      NaN    300025888   23.0    306.0
29  NaN      NaN    300025888   24.0    307.0

There are 3 different situations:

box = id_order
  1. Reference 300025886 has the same number of id_1 and id_2 values, so the 2 NaN in box should be replaced with the 2 values left over in id_order.
  2. Reference 300025887: id_1 has 8 values and id_2 has 7 values, so we do nothing because id_1 has more values than id_2.
  3. Reference 300025888: id_1 has 6 values and id_2 has 8 values, so we fill the NaN in box with values left over in id_order.
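
The three situations can be told apart by counting the non-null id_1 and id_2 values per reference. A minimal sketch, assuming a small made-up frame in place of the full table (the reference numbers 1, 2, 3 and the "case" labels are only for illustration):

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the table above (illustrative values, not the real data).
df = pd.DataFrame({
    "reference": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "id_1": [10.0, 11.0, np.nan, 12.0, 13.0, 14.0, 15.0, np.nan, np.nan],
    "id_2": [1.0, np.nan, 2.0, 3.0, np.nan, np.nan, 4.0, 5.0, 6.0],
})

# count() ignores NaN, so this is the number of real values per reference.
counts = df.groupby("reference")[["id_1", "id_2"]].count()

# Label each reference with the situation it falls into.
counts["case"] = np.select(
    [counts["id_1"] == counts["id_2"], counts["id_1"] > counts["id_2"]],
    ["equal", "more id_1"],
    default="more id_2",
)
print(counts)
```

Only the "equal" and "more id_2" references would get their NaN filled; "more id_1" is left alone.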

so the result should look like:

    id_1    box reference id_2 id_order new
0   19588   100 300025886   1   100     100
1   16009   NaN 300025886   NaN NaN     108
2   19148   NaN 300025886   NaN NaN     107
3   16936   101 300025886   3   101     101
4   13127   102 300025886   4   102     102
5   15064   103 300025886   6   103     103
6   10630   104 300025886   7   104     104
7   12401   105 300025886   8   105     105
8   13163   106 300025886   9   106     106
24  NaN     NaN 300025886   5   108     NaN 
23  NaN     NaN 300025886   2   107     NaN 
25  NaN     NaN 300025887   16  207     NaN 
16  17697   206 300025887   15  206     206
15  19527   205 300025887   14  205     205
14  10538   204 300025887   13  204     204
13  11808   203 300025887   12  203     203
12  17577   202 300025887   11  202     202
11  17964   201 300025887   10  201     201
10  13118   NaN 300025887   NaN NaN     NaN      
9   17891   200 300025887   NaN NaN     200
17  18426   300 300025888   17  300     300
18  17632   301 300025888   18  301     301
19  14726   302 300025888   19  302     302
20  18361   303 300025888   20  303     303
21  12908   NaN 300025888   NaN NaN     306
22  10279   NaN 300025888   NaN NaN     304
26   NaN    NaN 300025888   21  304     NaN 
27   NaN    NaN 300025888   22  305     NaN 
28   NaN    NaN 300025888   23  306     NaN 
29   NaN    NaN 300025888   24  307     NaN 

So every row with an id_1 gets a value in the 'new' column: from box if it is not NaN, otherwise from id_order, under these conditions. If count(id_1) = count(id_2), just fill the missing values. If count(id_1) > count(id_2), do nothing, so the NaN stays. If count(id_1) < count(id_2), randomly take a 'free' id_order and replace the NaN with it, without duplicates.

I tried

fillna
merge
group by
some loops (but can't figure all conditions)
etc.

but I can't work out how to combine these methods to get the desired result.

I need to fill the NaN in the box column, for rows that have an id_1, with the values left over (if there are any) in the id_order column. The reference column is shared by id_1, box and id_2, id_order.

CodePudding user response:

Filling the NaN values in 'box' with 'id_order' is simple:

df['new'] = df['box'].fillna(df['id_order'])

But filling the values by incrementing downward from the previous row has no built-in function that I know of. You may have to iterate through each row to accomplish this.

Here is a good place to start:

# Visit only the rows where 'new' is still NaN
for ix, row in df[df['new'].isna()].iterrows():
    if ix == 0:
        continue
    # More 'elif' logic here
    # Take the value from the row labelled ix - 1 and add 1
    df.loc[ix, 'new'] = df.loc[(ix-1), 'new'] + 1

You may want to look into setting the 'new' column as the index and using functions that make the index unique, as in this question: create Pandas Dataframe with unique index

CodePudding user response:

From what I understand, your conditions can be merged into one check, count(id_1) = count(id_2) or count(id_1) < count(id_2), since the only difference between the two cases is the random pick without duplicates.

numpy.random.choice(a, size=None, replace=False, p=None)

With replace=False no value is picked twice, so we pass the list of values that are allowed in these places together with the number of missing places.
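
As a small illustration of that call, here is a sketch with made-up names (free_values and n_missing are not from the question), drawing two distinct values from four candidates:

```python
import numpy as np

# Hypothetical inputs: unused id_order values for one reference,
# and the number of NaN places that need a value.
free_values = [304.0, 305.0, 306.0, 307.0]
n_missing = 2

# replace=False means a value cannot be drawn more than once.
picked = np.random.choice(free_values, size=n_missing, replace=False)
print(picked)
```

Note that with replace=False, asking for more values than the list contains raises a ValueError, so the number of places must not exceed the number of free values.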

import numpy as np

# Assumption: 'new' starts as a copy of box; the loop fills the NaN it leaves
data['new'] = data['box']
for ref in data['reference'].unique():
    # Work on one reference at a time; filtering keeps the original row
    # indices, which are used later to write back into data
    df = data.loc[data['reference'] == ref]
    # Indices of the rows where id_1, id_2 and new (same as box) are null
    id_1_na_ind = df[df['id_1'].isnull()].index
    id_2_na_ind = df[df['id_2'].isnull()].index
    new_na_ind = df[df['new'].isnull()].index
    # Check count(id_1) = count(id_2) or count(id_1) < count(id_2).
    # Null counts are used instead of non-null counts, so the comparison
    # is flipped: null_count(id_1) >= null_count(id_2)
    if len(id_1_na_ind) >= len(id_2_na_ind):
        # Rows to fill: new is null but id_1 is not (a set difference)
        place_to_replace = sorted(set(new_na_ind) - set(id_1_na_ind))
        # Values to use: id_order values that are not in box, nulls excluded
        not_dup = [x for x in set(df['id_order']) - set(df['box']) if not np.isnan(x)]
        # Pick n values from not_dup without duplicates,
        # where n = len(place_to_replace)
        not_dup = np.random.choice(not_dup, len(place_to_replace), replace=False)
        # Index with a list; pandas does not accept a set as an indexer
        data.loc[place_to_replace, 'new'] = not_dup
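
The same idea can also be wrapped in a function and tried on a small frame. This is only a sketch under a few assumptions: the name fill_new and its seed parameter are made up here, the index is assumed to be unique, and the number of filled rows is capped at the number of free id_order values so the random pick cannot fail.

```python
import numpy as np
import pandas as pd

def fill_new(data, seed=None):
    """Return a copy of data with a 'new' column filled per reference."""
    rng = np.random.default_rng(seed)
    out = data.copy()
    out["new"] = out["box"]
    for _, grp in data.groupby("reference"):
        # Skip references that have more id_1 values than id_2 values.
        if grp["id_1"].count() > grp["id_2"].count():
            continue
        # Rows that have an id_1 but no box value yet.
        targets = grp.index[grp["id_1"].notna() & grp["box"].isna()]
        # id_order values that do not already appear in box.
        is_free = grp["id_order"].notna() & ~grp["id_order"].isin(grp["box"])
        free = grp.loc[is_free, "id_order"].to_numpy()
        n = min(len(targets), len(free))
        if n == 0:
            continue
        # Draw n distinct free values and write them into the target rows.
        out.loc[targets[:n], "new"] = rng.choice(free, size=n, replace=False)
    return out

# Small made-up sample: reference 1 has equal counts, reference 2 has more id_1.
sample = pd.DataFrame({
    "id_1": [1.0, 2.0, np.nan, 3.0, 4.0],
    "box": [100.0, np.nan, np.nan, 200.0, np.nan],
    "reference": [1, 1, 1, 2, 2],
    "id_2": [1.0, np.nan, 2.0, 5.0, np.nan],
    "id_order": [100.0, np.nan, 101.0, 200.0, np.nan],
})
result = fill_new(sample, seed=0)
print(result)
```

In this sample only row 1 is filled, because 101 is the single free id_order in reference 1, and reference 2 is skipped entirely.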

Hope this answers your desired result
