Home > database >  Fill missing value with with value from another row and if not take the next similar value
Fill missing value with with value from another row and if not take the next similar value

Time:06-14

I have a problem. I have missing numbers inside the column materialNumber. But if a similar price is it should take the exactly same materialNumber. If more than two materialNumber occur with the same price it should take the first. If no materialNumber is found with the same price it should take the next nearst materialnumber depending on the price.

Dataframe

   customerId  materialNumber  price
0           1          1234.0    100
1           1          4562.0     20
2           2             NaN    100
3           2          4562.0     30
4           3          1547.0     40
5           3             NaN     37

Code

import pandas as pd
d = {
    "customerId": [1, 1, 2, 2, 3, 3],
    "materialNumber": [
        1234,
        4562,
        None,
        4562,
        1547,
        None, 
    ],
    "price": [100, 20, 100, 30, 40, 37],
}
df = pd.DataFrame(data=d)
print(df)

import numpy as np
def find_next(x):
  if(x['materialNumber'] == None):
    #if price occurs only once it should finde the next nearst price
    if(x['price'].value_counts().shape[0] == 1): 
      return x.drop_duplicates(subset=['price'], keep="first")
    else:
      return x.iloc[(x['price']-input).abs().argsort()[:2]]

df['materialNumber'] = df.apply(lambda x: find_next(x), axis=1)

What I want

   customerId  materialNumber  price
0           1          1234.0    100
1           1          4562.0     20
2           2            1234    100 # of index 0: 1234.0, 100 (same value)
3           2          4562.0     30
4           3          1547.0     40
5           3            1547     37 # of index 4: 1547.0, 40 (next similar value)

CodePudding user response:

Use merge_asof with match rows with missing values per materialNumber by rows without missing values and assign values in DataFrame.loc:

m = df['materialNumber'].isna()
new = pd.merge_asof(df[m].reset_index().sort_values('price'),
                    df[~m].sort_values('price'), on='price', direction='nearest')

df.loc[m, 'materialNumber'] = new.set_index('index')['materialNumber_y']
print(df)
   customerId  materialNumber  price
0           1          1234.0    100
1           1          4562.0     20
2           2          1234.0    100
3           2          4562.0     30
4           3          1547.0     40
5           3          1547.0     37

CodePudding user response:

IIUC, you can use a merge_asof to find the equal or closest price value, then update your dataframe:

# mask to split the DataFrame in NaN/non-NaN for materialNumber
m = df['materialNumber'].isna()

# sort by price (required for merge_asof)
df2 = df.sort_values(by='price')

# fill missing values
missing = pd.merge_asof(df2.reset_index().loc[m, ['index', 'price']],
                        df2.loc[~m, ['price', 'materialNumber']],
                        on='price',
                        direction='nearest') # direction='forward' for next only

# update in place
df.update(missing.set_index('index')['materialNumber'])

output:

   customerId  materialNumber  price
0           1          1234.0    100
1           1          4562.0     20
2           2          1234.0    100
3           2          4562.0     30
4           3          1547.0     40
5           3          1547.0     37
  • Related