Floating point arithmetic and comparison causing undesirable output-CodePudding

Consider the Pandas DataFrame df1:

df1 = pd.DataFrame({"Name":["Kevin","Peter","James","Jose","Matthew","Pattrick","Alexander"],"Number":[1,2,3,4,5,6,7],"Total":[495.2,432.5,'-',395.5,485.8,415,418.7],"Average_old":[86.57,83.97,'-',96.59,84.67,83.10,83.84],"Grade_old":['A','A','A','A ','A','A','A'],"Total_old":[432.8,419.8,'-',482.9,423.3,415,418.7]})

I calculated the Average and Grade with the following formula

df1["Average"] = df1["Total"].apply(lambda x: x/5   0.1 if x != "-" else "-")
df1["Grade"] = df1["Average"].apply((lambda x:'A ' if x>90 else 'A'))

So df1 becomes

df1

        Name  Number  Total  Average_old Grade_old  Total_old  Average Grade
0      Kevin       1  495.2        86.57         A      432.8    99.14    A 
1      Peter       2  432.5        83.97         A      419.8    86.60     A
2      James       3      -            -         A          -        -     A
3       Jose       4  395.5        96.59        A       482.9    79.20     A
4    Matthew       5  485.8        84.67         A      423.3    97.26    A 
5   Pattrick       6  415.0        83.10         A      415.0    83.10     A
6  Alexander       7  418.7        83.84         A      418.7    83.84     A

df1 has Total, Total_old, Grade, Grade_old, Average, Average_old. I am trying to check if any value of Total is modified with respect to Total_old, any value of Grade is modified with respect to Grade_old or any value of Average is modified with respect to Average_old. I am trying to create a new Dataframe dfmod that would give all the modified values of df1 with the following code

dfmod = pd.DataFrame()
columns =["Total","Average","Grade"]   
for col in columns:      
   dfmod = pd.concat([dfmod,df1[["Name","Number",col   '_old']][df1[col].ne(df1[col  '_old'])].dropna()],sort=False)

dfmod.rename(columns={col   '_old':col},inplace=True)           
dfmod = dfmod.groupby('Name',as_index = False,sort = False).first()

And got the output as

dfmod

        Name  Number  Total  Average Grade
0      Kevin       1  432.8    86.57     A
1      Peter       2  419.8    83.97  None
2       Jose       4  482.9    96.59    A 
3    Matthew       5  423.3    84.67     A
4  Alexander       7    NaN    83.84  None

Here no values of Pattrick was modified when comparing Total with Total_old, Average with Average_old, and Grade with Grade_old so Pattrick's entry was correctly dropped.

However if you observe Alexander's Average even though the Total, Average and Grade are same as Total_old,Average_old,Grade_old respectively , the modified value dataframe dfmod has the Average value wrongly added as a modified value. The reason why it happened is because floating point arithmetic wont work like integer arithmetic in programming languages as mentioned in the link below. https://www.geeksforgeeks.org/floating-point-error-in-python/

so I tried to implement np.isclose function as:

for col in columns:  
     if col is 'Grade':
          dfmod = pd.concat([dfmod,df1[["Name","Number",col   '_old']][df1[col].ne(df1[col  '_old'])].dropna()],sort=False)
       continue

    dfmod = pd.concat([dfmod,df1[["Name","Number",col   '_old']][~np.isclose(df1[col],df1[col '_old'])].dropna()],sort=False)

But it throws the error message as

  `Exception has occurred: TypeError ufunc 'isfinite' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''`

The error seems to be because of the '-' character in the data how can I fix this issue, please do help, I am stuck at this issue for a while and tried all the resources I could get.

Expected output:

        Name  Number  Total  Average Grade
0      Kevin       1  432.8    86.57     A
1      Peter       2  419.8    83.97  None
3       Jose       4  482.9    96.59    A 
4    Matthew       5  423.3    84.67     A

It should omit the values for James,Pattrick and Alexander as they don't have any change for Total - Total_old, Average - Average_old, Grade - Grade_old

CodePudding user response：

Please have a look if this is what you are looking for.

import pandas as pd
import numpy as np


def compute_grade(new_average, old_grade):
    try:
        grade = 'A ' if float(new_average) > 90 else 'A'
    except:
        grade = old_grade
    return grade

df1 = pd.DataFrame({"Name":["Kevin","Peter","James","Jose","Matthew","Pattrick","Alexander"],"Number":[1,2,3,4,5,6,7],"Total":[495.2,432.5,'-',395.5,485.8,415,418.7],"Average_old":[86.57,83.97,'-',96.59,84.67,83.10,83.84],"Grade_old":['A','A','A','A ','A','A','A'],"Total_old":[432.8,419.8,'-',482.9,423.3,415,418.7]})
df1["Average"] = df1["Total"].apply(lambda x: round((x/5)   0.1, 2) if x != "-" else "-")
df1["Grade"] = df1.apply((lambda x: compute_grade(x['Average'], x['Grade_old'])), axis=1)

print(df1)

# import pdb; pdb.set_trace()

dfmod = df1[(df1['Total'] != df1['Total_old']) | (df1['Average'] != df1['Average_old']) | (df1['Grade'] != df1['Grade_old'])]

print(dfmod)

Output:

        Name  Number  Total Average_old Grade_old Total_old Average Grade
0      Kevin       1  495.2       86.57         A     432.8   99.14    A 
1      Peter       2  432.5       83.97         A     419.8    86.6     A
2      James       3      -           -         A         -       -     A
3       Jose       4  395.5       96.59        A      482.9    79.2     A
4    Matthew       5  485.8       84.67         A     423.3   97.26    A 
5   Pattrick       6    415        83.1         A       415    83.1     A
6  Alexander       7  418.7       83.84         A     418.7   83.84     A
      Name  Number  Total Average_old Grade_old Total_old Average Grade
0    Kevin       1  495.2       86.57         A     432.8   99.14    A 
1    Peter       2  432.5       83.97         A     419.8    86.6     A
3     Jose       4  395.5       96.59        A      482.9    79.2     A
4  Matthew       5  485.8       84.67         A     423.3   97.26    A

Rounding off the average we compute to 2 decimals was the key here.

Also when computing the Grade if we encounter non numeric value like '-', i was returning Grade_old.

CodePudding user response：

As far as I can tell, the "-" character is unnecessary – you can replace it with None in the columns where they show up, then make those columns numerical. This will make your preprocessing steps much cleaner, and avoid unnecessary conditional statements where we need to check whether certain values are "-".

df1 = df1.replace("-",None).astype({"Total":float, "Average_old":float, "Total_old":float})

Then you don't have to use .apply or check for whether a particular element is "-" when creating the new columns Average and Grade:

df1["Average"] = df1["Total"]/5   0.1
df1["Grade"] = ["A " if x>90 else "A" for x in df1["Average"]]

Then you can use np.isclose in your condition, drop any rows containing null, select the columns with "_old", and rename the columns:

condition = ~(np.isclose(df1['Average'], df1['Average_old']) & np.isclose(df1['Total'], df1['Total_old']) & (df1['Grade_old'] == df1['Grade']))
cols = ['Name','Number']   [col for col in df1.columns if "_old" in col]
df1.loc[condition, cols].dropna().rename(columns={'Total_old':'Total','Average_old':'Average','Grade_old':'Grade'})

Result:

      Name  Number  Total  Average Grade
0    Kevin       1  432.8    86.57     A
1    Peter       2  419.8    83.97     A
3     Jose       4  482.9    96.59    A 
4  Matthew       5  423.3    84.67     A