Consider the Pandas DataFrame df1
:
df1 = pd.DataFrame({"Name":["Kevin","Peter","James","Jose","Matthew","Pattrick","Alexander"],"Number":[1,2,3,4,5,6,7],"Total":[495.2,432.5,'-',395.5,485.8,415,418.7],"Average_old":[86.57,83.97,'-',96.59,84.67,83.10,83.84],"Grade_old":['A','A','A','A ','A','A','A'],"Total_old":[432.8,419.8,'-',482.9,423.3,415,418.7]})
I calculated the Average
and Grade
with the following formula
df1["Average"] = df1["Total"].apply(lambda x: x/5 0.1 if x != "-" else "-")
df1["Grade"] = df1["Average"].apply((lambda x:'A ' if x>90 else 'A'))
So df1 becomes
df1
Name Number Total Average_old Grade_old Total_old Average Grade
0 Kevin 1 495.2 86.57 A 432.8 99.14 A
1 Peter 2 432.5 83.97 A 419.8 86.60 A
2 James 3 - - A - - A
3 Jose 4 395.5 96.59 A 482.9 79.20 A
4 Matthew 5 485.8 84.67 A 423.3 97.26 A
5 Pattrick 6 415.0 83.10 A 415.0 83.10 A
6 Alexander 7 418.7 83.84 A 418.7 83.84 A
df1
has Total, Total_old, Grade, Grade_old, Average, Average_old
. I am trying to check if any value of Total is modified with respect to Total_old
, any value of Grade
is modified with respect to Grade_old
or any value of Average
is modified with respect to Average_old
. I am trying to create a new Dataframe
dfmod
that would give all the modified values of df1 with the following code
dfmod = pd.DataFrame()
columns =["Total","Average","Grade"]
for col in columns:
dfmod = pd.concat([dfmod,df1[["Name","Number",col '_old']][df1[col].ne(df1[col '_old'])].dropna()],sort=False)
dfmod.rename(columns={col '_old':col},inplace=True)
dfmod = dfmod.groupby('Name',as_index = False,sort = False).first()
And got the output as
dfmod
Name Number Total Average Grade
0 Kevin 1 432.8 86.57 A
1 Peter 2 419.8 83.97 None
2 Jose 4 482.9 96.59 A
3 Matthew 5 423.3 84.67 A
4 Alexander 7 NaN 83.84 None
Here no values of Pattrick was modified when comparing Total with Total_old, Average with Average_old, and Grade with Grade_old so Pattrick's entry was correctly dropped.
However if you observe Alexander's
Average
even though the Total
, Average
and Grade
are same as Total_old,Average_old,Grade_old
respectively , the modified value dataframe dfmod
has the Average
value wrongly added as a modified value. The reason why it happened is because floating point arithmetic wont work like integer arithmetic in programming languages as mentioned in the link below. https://www.geeksforgeeks.org/floating-point-error-in-python/
so I tried to implement np.isclose
function as:
for col in columns:
if col is 'Grade':
dfmod = pd.concat([dfmod,df1[["Name","Number",col '_old']][df1[col].ne(df1[col '_old'])].dropna()],sort=False)
continue
dfmod = pd.concat([dfmod,df1[["Name","Number",col '_old']][~np.isclose(df1[col],df1[col '_old'])].dropna()],sort=False)
But it throws the error message as
`Exception has occurred: TypeError ufunc 'isfinite' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''`
The error seems to be because of the '-' character in the data how can I fix this issue, please do help, I am stuck at this issue for a while and tried all the resources I could get.
Expected output:
Name Number Total Average Grade
0 Kevin 1 432.8 86.57 A
1 Peter 2 419.8 83.97 None
3 Jose 4 482.9 96.59 A
4 Matthew 5 423.3 84.67 A
It should omit the values for James,Pattrick and Alexander as they don't have any change for Total - Total_old, Average - Average_old, Grade - Grade_old
CodePudding user response:
Please have a look if this is what you are looking for.
import pandas as pd
import numpy as np
def compute_grade(new_average, old_grade):
try:
grade = 'A ' if float(new_average) > 90 else 'A'
except:
grade = old_grade
return grade
df1 = pd.DataFrame({"Name":["Kevin","Peter","James","Jose","Matthew","Pattrick","Alexander"],"Number":[1,2,3,4,5,6,7],"Total":[495.2,432.5,'-',395.5,485.8,415,418.7],"Average_old":[86.57,83.97,'-',96.59,84.67,83.10,83.84],"Grade_old":['A','A','A','A ','A','A','A'],"Total_old":[432.8,419.8,'-',482.9,423.3,415,418.7]})
df1["Average"] = df1["Total"].apply(lambda x: round((x/5) 0.1, 2) if x != "-" else "-")
df1["Grade"] = df1.apply((lambda x: compute_grade(x['Average'], x['Grade_old'])), axis=1)
print(df1)
# import pdb; pdb.set_trace()
dfmod = df1[(df1['Total'] != df1['Total_old']) | (df1['Average'] != df1['Average_old']) | (df1['Grade'] != df1['Grade_old'])]
print(dfmod)
Output:
Name Number Total Average_old Grade_old Total_old Average Grade
0 Kevin 1 495.2 86.57 A 432.8 99.14 A
1 Peter 2 432.5 83.97 A 419.8 86.6 A
2 James 3 - - A - - A
3 Jose 4 395.5 96.59 A 482.9 79.2 A
4 Matthew 5 485.8 84.67 A 423.3 97.26 A
5 Pattrick 6 415 83.1 A 415 83.1 A
6 Alexander 7 418.7 83.84 A 418.7 83.84 A
Name Number Total Average_old Grade_old Total_old Average Grade
0 Kevin 1 495.2 86.57 A 432.8 99.14 A
1 Peter 2 432.5 83.97 A 419.8 86.6 A
3 Jose 4 395.5 96.59 A 482.9 79.2 A
4 Matthew 5 485.8 84.67 A 423.3 97.26 A
Rounding off the average we compute to 2 decimals was the key here.
Also when computing the Grade if we encounter non numeric value like '-', i was returning Grade_old.
CodePudding user response:
As far as I can tell, the "-"
character is unnecessary – you can replace it with None
in the columns where they show up, then make those columns numerical. This will make your preprocessing steps much cleaner, and avoid unnecessary conditional statements where we need to check whether certain values are "-"
.
df1 = df1.replace("-",None).astype({"Total":float, "Average_old":float, "Total_old":float})
Then you don't have to use .apply
or check for whether a particular element is "-"
when creating the new columns Average
and Grade
:
df1["Average"] = df1["Total"]/5 0.1
df1["Grade"] = ["A " if x>90 else "A" for x in df1["Average"]]
Then you can use np.isclose
in your condition, drop any rows containing null, select the columns with "_old"
, and rename the columns:
condition = ~(np.isclose(df1['Average'], df1['Average_old']) & np.isclose(df1['Total'], df1['Total_old']) & (df1['Grade_old'] == df1['Grade']))
cols = ['Name','Number'] [col for col in df1.columns if "_old" in col]
df1.loc[condition, cols].dropna().rename(columns={'Total_old':'Total','Average_old':'Average','Grade_old':'Grade'})
Result:
Name Number Total Average Grade
0 Kevin 1 432.8 86.57 A
1 Peter 2 419.8 83.97 A
3 Jose 4 482.9 96.59 A
4 Matthew 5 423.3 84.67 A