I have a dataframe (really big) what have some null values and I can replaces them because there are two columns: Name and Weight, Name appears many times, sometimes with weight, sometimes not. This is a little example and what I tried for solve it.
First I created the dataframe:
import pandas as pd
import numpy as np
lst1 = ["AA","BB","CC","AA","BB","CC","AA","BB","CC"]
lst2 = [12,np.nan,14,12,15,14,np.nan,np.nan,14]
df = pd.DataFrame(list(zip(lst1,lst2)), columns = ['Name','Weight'])
Next I found the weight values for every name with a dictionary:
df_2 = df.groupby('Name')['Weight'].max()
dic = df_2.to_dict()
Finally I created a loop for replaces it all:
for k,v in dic:
for i in range(len(df)):
if df['Weight'][i] == None and k == df['Name'][i]:
df['Weight'][i] = v
else:
print(df)
But it returns the same dataframe and I don't know what more doing. I'll really thanks you if you could help me or give an advice. :)
CodePudding user response:
Maybe you can fill the NaNs in .groupby
:
df["Weight"] = df.groupby("Name", group_keys=False)["Weight"].apply(
lambda x: x.fillna(x.max())
)
print(df)
Prints:
Name Weight
0 AA 12.0
1 BB 15.0
2 CC 14.0
3 AA 12.0
4 BB 15.0
5 CC 14.0
6 AA 12.0
7 BB 15.0
8 CC 14.0
OR:
x = df.groupby("Name")["Weight"].max()
df = df.set_index("Name")
df["Weight"] = df["Weight"].fillna(x)
df = df.reset_index()
print(df)
Prints:
Name Weight
0 AA 12.0
1 BB 15.0
2 CC 14.0
3 AA 12.0
4 BB 15.0
5 CC 14.0
6 AA 12.0
7 BB 15.0
8 CC 14.0