Python Pandas Calculate Two Columns Containing String-CodePudding

Hi,

I am trying to calculate the difference between column C (unnamed: 2) and column D (unnamed: 3) and put the result in column E. I have attached an Excel screenshot and a screenshot of what pandas reads from the Excel file.

Here is my code:

import pandas

# load excel file
df1 = pandas.read_excel("Trial.xlsx")

# column C minus column D, and store result in column E
df1["total"] = df1["unnamed: 2"] - df1["unnamed: 3"]

print(df1)

The code does not work because columns C and D contain strings in cells C4 and D4. I am wondering if there is any way to bypass row 4, or to do the calculation starting from row 5 (index 4).

CodePudding user response：

First, I would use the skiprows parameter of the read_excel method (see the documentation):

df1 = pandas.read_excel("Trial.xlsx", skiprows=4)

You can also use pandas.to_numeric with the parameter errors='coerce' to turn all non-numeric values into NaN (see the documentation). If you use the skiprows code above, pandas should pick up the header names, and you can use the following:

df1["total"] = (
    pandas.to_numeric(df1["Debit"], errors='coerce') 
    - pandas.to_numeric(df1["Credit"], errors='coerce')
)
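
To illustrate what errors='coerce' does without needing the original workbook, here is a minimal sketch on made-up data. The column labels and values are assumptions standing in for the sheet; the stray "Debit"/"Credit" text in the first row plays the role of cells C4 and D4.

```python
import pandas as pd

# Hypothetical stand-in for the sheet: header text mixed into the data columns.
df = pd.DataFrame({
    "Unnamed: 2": ["Debit", 100.0, 250.5, 40],
    "Unnamed: 3": ["Credit", 30.0, 50.5, 15],
})

# Anything that cannot be parsed as a number becomes NaN instead of raising.
debit = pd.to_numeric(df["Unnamed: 2"], errors="coerce")
credit = pd.to_numeric(df["Unnamed: 3"], errors="coerce")

# NaN propagates through the subtraction, so the text row gets NaN in "total".
df["total"] = debit - credit
print(df)
```

The row holding the text ends up with NaN in total, and the remaining rows are subtracted as usual. If that row should disappear entirely, df.dropna(subset=["total"]) would remove it afterwards.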

CodePudding user response：

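One way to accomplish this is to strip everything except digits and decimal points from both columns and convert the result to float:

import numpy as np

# astype(str) first: the .str methods return NaN for cells that already hold numbers.
# pandas usually labels headerless columns "Unnamed: 2" (capital U); check df1.columns.
df1["unnamed: 2"] = df1["unnamed: 2"].astype(str).str.replace(r'[^\d.]', '', regex=True).replace('', np.nan).astype(float)
df1["unnamed: 3"] = df1["unnamed: 3"].astype(str).str.replace(r'[^\d.]', '', regex=True).replace('', np.nan).astype(float)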


df1["total"] = df1["unnamed: 2"] - df1["unnamed: 3"]
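
As a rough, self-contained illustration of this string-cleaning approach, the sketch below runs it on invented data. The column labels, the sample values, and the helper name to_float are all assumptions made for the example, not part of the original workbook.

```python
import numpy as np
import pandas as pd

# Hypothetical data: header text, formatted number strings, and plain numbers.
df = pd.DataFrame({
    "Unnamed: 2": ["Debit", "1,200.50", 300],
    "Unnamed: 3": ["Credit", "$200.50", 100],
})

def to_float(col):
    """Keep only digits and dots, then convert; cells left empty become NaN."""
    # astype(str) makes every cell a string so .str.replace applies to all rows.
    cleaned = col.astype(str).str.replace(r"[^\d.]", "", regex=True)
    return cleaned.replace("", np.nan).astype(float)

df["total"] = to_float(df["Unnamed: 2"]) - to_float(df["Unnamed: 3"])
print(df)
```

Note that this pattern also removes minus signs, so it is only suitable when the columns never hold negative amounts.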

If you post the Excel file as a download, I can run the solution and provide the result.