How can I clean text in an Excel file with Python?-CodePudding

I have an Excel file with numbers (integers) in some rows of the first column (A) and text in all rows of the second column (B):

I want to clean this text, that is I want to remove tags like < b r > (without spaces). My current approach doesn't seem to work:

file_name = "F:\Project\comments_all_sorted.xlsx"
import pandas as pd 
df = pd.read_excel(file_name, header=None, index_col=None, usecols='B') # specify that there's no header and no column for row labels, use only column B (which includes the text)
clean_df = df.replace('<br>', '')
clean_df.to_excel('output.xlsx')

What this code does (which I don't want it to do) is it adds running numbers in the first column (A), replacing also the few numbers that were already there, and it adds a first row with '1' in second column of this row (cell 1B):

I'm sure there's an easy way to solve my problem and I'm just not trained enough to see it. Thanks!

CodePudding user response：

Try this:

df['column_name'] = df['column_name'].str.replace(r'<br>', '')

CodePudding user response：

The index in the output file can be turned off with index=False in the df.to_excel function, i.e,

clean_df.to_excel('output.xlsx', index=False)

CodePudding user response：

As far as I'm aware, you can't use .replace on an entire dataframe. You need to explicitly call out the column. In this case, I just iterate through all columns in case there are more than just the one column.

To get rid of the first column with the sequential numbers (that's the index of the dataframe), add the parameter index=False. The number 1 on the top is the column name. To get rid of that, use header=False

import pandas as pd 

file_name = "F:\Project\comments_all_sorted.xlsx"
df = pd.read_excel(file_name, header=None, index_col=None, usecols='B') # specify that there's no header and no column for row labels, use only column B (which includes the text)

clean_df = df.copy()
for col in clean_df.columns:
    clean_df[col] = df[col].str.replace('<br>', '')
clean_df.to_excel('output.xlsx', index=False, header=False)