Home > Blockchain >  Fill NA and update columns from another dataframe
Fill NA and update columns from another dataframe

Time:06-28

I want to conditionally fill the missing and update the value from another dataframe.

I want to fill missing and update the data on column values in dataframe smalldf.

The condition is, if the value in B column (large df) is in the range of columns Range_FROM and Range_TO in (small df). Always choose the minimum records in (largedf) to fill or update the values in smalldf.

  • For example, in the second record of smalldf values (3), as there is a smaller value (1.3) in the largedf, it would end up being updated by a smaller value (1.3).
  • Another example, the third record of smalldf (Nan) is being filled by 1.6, as it is the smallest value within the range of Range_FROM (1.5) and Range_TO (1.6)

This is the first dataframe (smalldf):

RoadNo Range_FROM Range_TO values
A001 1.15 1.2 0.1
A001 1.35 1.4 3
A001 1.55 1.6 Nan
A001 1.75 1.8 0.1
A001 1.9 2 Nan

This is the second dataframe (largedf):

RoadNo B values
A001 1.1 0.2
A001 1.2 0.1
A001 1.3 1.9
A001 1.4 1.3
A001 1.5 1.6
A001 1.6 1.9
A001 1.7 0.2
A001 1.8 0.1
A001 1.9 1.9
A001 2 1.3

Below is the expected dataframe:

RoadNo Range_FROM Range_TO values
A001 1.15 1.2 0.1
A001 1.35 1.4 1.3
A001 1.55 1.6 1.6
A001 1.75 1.8 0.1
A001 1.9 2 1.3

Below is the code to create the two dataframes:

smalldf = pd.DataFrame(data={"RoadNo":["A001", "A001", "A001", "A001", "A001"],   
                             "Range_FROM": [1.15, 1.35, 1.55, 1.75, 1.9], 
                             "Range_TO":[1.2, 1.4, 1.6, 1.8, 2],
                             "values": [0.1, 0.25, "Nan", 0.1, "Nan" ]})
                             

largedf = pd.DataFrame(data={"RoadNo":["A001", "A001", "A001", "A001", "A001", "A001", "A001", "A001", "A001", "A001"],   
                             "B": [1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2],                             
                             "values": [0.2, 0.1, 1.9, 1.3, 1.6, 1.9, 0.2, 0.1, 1.9, 1.3]})

Please keep in mind that there are other RoadNo (e.g. A002, A003).

Both left join and fillna(df) in pandas did not produce my desired result. Is there a function in Pandas or SQL that can help me with this operation?

Thank you very much!

CodePudding user response:

here is one way to do it Merge the two DF based on the RoadNo and then filter out the ones that meet the criteria. finally update the values where they are null with the new value

value of Nan in the df is assumed to be NaN, but had a typo in question. if it is indeed a Nan, you can convert that to NaN as follows after the merge that created df3 dataframe

df3['values'] = df3['values'].replace('Nan',np.nan ).astype(float)

df3=df.merge(df2, on='RoadNo', how='left', suffixes=('','_y')).query('(B >=Range_FROM) & (B<= Range_TO)')

df3['values'] = np.where((df3['values'].isnull()) | (df3['values'] > df3['values_y']) , df3['values_y'], df3['values'])
df3.drop(columns=['B','values_y'])

    RoadNo  Range_FROM  Range_TO    values
1   A001    1.15    1.2     0.1
13  A001    1.35    1.4     1.3
25  A001    1.55    1.6     1.9
37  A001    1.75    1.8     0.1
48  A001    1.90    2.0     1.9
49  A001    1.90    2.0     1.3

CodePudding user response:

In SQL your problem would look like:

SELECT df1.RoadNo, df1.Range_FROM, df1.Range_TO, MIN(df2.values)
FROM df1
LEFT JOIN df2 
    ON df1.RoadNo = df2.RoadNo
    AND df2.B >= df1.Range_FROM
    AND df2.B <= df1.Range_TO
GROUP BY df1.RoadNo, df1.Range_FROM, df1.Range_TO

So we can do:

from pandasql import sqldf
pysqldf = lambda q: sqldf(q, globals())

q = '''SELECT df1.RoadNo, df1.Range_FROM, df1.Range_TO, MIN(df2."values") AS "values"
FROM df1
LEFT JOIN df2 
    ON df1.RoadNo = df2.RoadNo
    AND df2.B > df1.Range_LAG
    AND df2.B <= df1.Range_TO
GROUP BY df1.RoadNo, df1.Range_FROM, df1.Range_TO'''
pysqldf(q)
df = pysqldf(q)
print(df)

Output:

  RoadNo  Range_FROM  Range_TO  values
0   A001        1.15       1.2     0.1
1   A001        1.35       1.4     1.3
2   A001        1.55       1.6     1.9
3   A001        1.75       1.8     0.1
4   A001        1.90       2.0     1.3

But since you actually want something a little different, let's try:

df1['Range_LAG'] = df1.Range_TO.shift(fill_value=0)

q = '''SELECT df1.RoadNo, df1.Range_FROM, df1.Range_TO, MIN(df2."values") AS "values"
FROM df1
LEFT JOIN df2 
    ON df1.RoadNo = df2.RoadNo
    AND df2.B > df1.Range_LAG
    AND df2.B <= df1.Range_TO
GROUP BY df1.RoadNo, df1.Range_FROM, df1.Range_TO'''

df = pysqldf(q)
print(df)

Output:

  RoadNo  Range_FROM  Range_TO  values
0   A001        1.15       1.2     0.1
1   A001        1.35       1.4     1.3
2   A001        1.55       1.6     1.6
3   A001        1.75       1.8     0.1
4   A001        1.90       2.0     1.3
  • Related