I have a problem. I have missing numbers inside the column materialNumber
. But if a similar price
is it should take the exactly same materialNumber
. If more than two materialNumber
occur with the same price
it should take the first. If no materialNumber
is found with the same price
it should take the next nearst materialnumber
depending on the price.
Dataframe
customerId materialNumber price
0 1 1234.0 100
1 1 4562.0 20
2 2 NaN 100
3 2 4562.0 30
4 3 1547.0 40
5 3 NaN 37
Code
import pandas as pd
d = {
"customerId": [1, 1, 2, 2, 3, 3],
"materialNumber": [
1234,
4562,
None,
4562,
1547,
None,
],
"price": [100, 20, 100, 30, 40, 37],
}
df = pd.DataFrame(data=d)
print(df)
import numpy as np
def find_next(x):
if(x['materialNumber'] == None):
#if price occurs only once it should finde the next nearst price
if(x['price'].value_counts().shape[0] == 1):
return x.drop_duplicates(subset=['price'], keep="first")
else:
return x.iloc[(x['price']-input).abs().argsort()[:2]]
df['materialNumber'] = df.apply(lambda x: find_next(x), axis=1)
What I want
customerId materialNumber price
0 1 1234.0 100
1 1 4562.0 20
2 2 1234 100 # of index 0: 1234.0, 100 (same value)
3 2 4562.0 30
4 3 1547.0 40
5 3 1547 37 # of index 4: 1547.0, 40 (next similar value)
CodePudding user response:
Use merge_asof
with match rows with missing values per materialNumber
by rows without missing values and assign values in DataFrame.loc
:
m = df['materialNumber'].isna()
new = pd.merge_asof(df[m].reset_index().sort_values('price'),
df[~m].sort_values('price'), on='price', direction='nearest')
df.loc[m, 'materialNumber'] = new.set_index('index')['materialNumber_y']
print(df)
customerId materialNumber price
0 1 1234.0 100
1 1 4562.0 20
2 2 1234.0 100
3 2 4562.0 30
4 3 1547.0 40
5 3 1547.0 37
CodePudding user response:
IIUC, you can use a merge_asof
to find the equal or closest price value, then update
your dataframe:
# mask to split the DataFrame in NaN/non-NaN for materialNumber
m = df['materialNumber'].isna()
# sort by price (required for merge_asof)
df2 = df.sort_values(by='price')
# fill missing values
missing = pd.merge_asof(df2.reset_index().loc[m, ['index', 'price']],
df2.loc[~m, ['price', 'materialNumber']],
on='price',
direction='nearest') # direction='forward' for next only
# update in place
df.update(missing.set_index('index')['materialNumber'])
output:
customerId materialNumber price
0 1 1234.0 100
1 1 4562.0 20
2 2 1234.0 100
3 2 4562.0 30
4 3 1547.0 40
5 3 1547.0 37