Resetting, fixing the column values using Python Pandas

Time:03-02

I'm trying to understand the last 4 lines of this code, which reads a data file into a pandas dataframe. I have written a comment above each of those 4 lines, but I'm still a bit confused. It seems the code is trying to clean up whitespace, null values, etc. Question: Can someone please explain what exactly the code is doing? It also seems it could be simplified a bit.

import pandas as pd
import numpy as np

pf = pd.read_csv('MyDataFile.txt', low_memory=False, sep=',', encoding='ISO-8859-1')
#trim the column names. Trimmed column will be the new column name
pf = pf.rename(columns=lambda x: x.strip())
#select all the columns that have undefined datatypes
pf_names = pf.select_dtypes(['object'])
#Select all columns with undefined datatypes and trim them
pf[pf_names.columns] = pf_names.apply(lambda x: x.str.strip())
#Replace the null values of dataframe columns with none
pf.replace([np.nan], [None], inplace=True)

CodePudding user response:

This strips leading and trailing whitespace from the name of every column. So ' Column 1 ' becomes 'Column 1'; spaces inside a name are kept.

pf = pf.rename(columns=lambda x: x.strip())
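As a minimal sketch of that behavior (the frame and its column names are made up for illustration), the rename call only trims the ends of each name:

```python
import pandas as pd

# Hypothetical frame whose headers carry stray whitespace
df = pd.DataFrame({" Column 1 ": [1, 2], "Column2  ": [3, 4]})

# str.strip() trims both ends of each name; the inner space in "Column 1" stays
df = df.rename(columns=lambda x: x.strip())

stripped_columns = list(df.columns)
print(stripped_columns)
```

If every column name is a string, df.columns = df.columns.str.strip() should give the same result.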

This selects all columns that are dtype object (in practice, the columns that hold strings).

pf_names = pf.select_dtypes(include='object')

If you know that Column1 and Column2 are the only columns of dtype object, it would be the same as:

pf_names = pf[['Column1', 'Column2']]
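A small sketch of that equivalence, using a made-up frame. The text columns are created with dtype=object explicitly, which is an assumption made so the example does not depend on how a particular pandas version infers string columns:

```python
import pandas as pd

# Hypothetical data: two text columns (stored as object) and one numeric column
df = pd.DataFrame({
    "Column1": pd.Series([" a ", "b"], dtype=object),
    "Column2": pd.Series(["x", " y"], dtype=object),
    "Column3": [1.0, 2.0],
})

# Select by dtype ...
by_dtype = df.select_dtypes(include="object")

# ... or by naming the columns directly
by_name = df[["Column1", "Column2"]]

print(list(by_dtype.columns))
print(by_dtype.equals(by_name))
```

Selecting by dtype has the advantage that the column names do not need to be known in advance.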

This does the same as the first strip (removing leading and trailing whitespace), but for the values in the columns that are dtype object rather than for the column names.

pf[pf_names.columns] = pf_names.apply(lambda x: x.str.strip())
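To see the value-level strip in isolation, here is a sketch on made-up data (the column names are hypothetical, and dtype=object is set explicitly as an assumption about how the text columns are stored). It also shows one thing to watch for: in an object column, entries that are not strings come back as NaN from the .str accessor.

```python
import pandas as pd

# Hypothetical frame: one text column with padded values, one numeric column
df = pd.DataFrame({
    "name": pd.Series(["  Alice ", "Bob  "], dtype=object),
    "age": [30, 41],
})

# Strip only the object columns; numeric columns are left untouched
text_cols = df.select_dtypes(include="object").columns
df[text_cols] = df[text_cols].apply(lambda col: col.str.strip())
names = df["name"].tolist()
print(names)

# An object column can mix strings and other types;
# .str.strip() yields NaN for the non-string entry
mixed = pd.Series([" x ", 5], dtype=object).str.strip()
print(mixed.tolist())
```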

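This replaces every NaN in the dataframe with None; no rows or columns are removed. Both arguments are passed as lists, as in your code, because on some older pandas releases a scalar to_replace combined with value=None was treated as a request to fill from neighboring values instead.

pf.replace([np.nan], [None], inplace=True)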
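A short sketch of the NaN-to-None step on made-up data, using the list form from the question. The column names are hypothetical, and the exact resulting dtypes may differ between pandas versions; a float column cannot hold None, so it has to be converted to a more general dtype:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a missing value in a text column and in a float column
df = pd.DataFrame({
    "name": pd.Series(["Alice", np.nan], dtype=object),
    "score": [1.5, np.nan],
})

# Replace NaN with None, returning a new frame instead of modifying in place
cleaned = df.replace([np.nan], [None])

name_values = cleaned["name"].tolist()
score_values = cleaned["score"].tolist()
print(name_values, score_values)
```

This step is mostly useful when the data is handed to something that expects Python None for missing values, such as a database driver; within pandas itself, NaN is usually fine to keep.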