I'm trying to understand the last 4 lines of this code, which reads a data file into a pandas dataframe. I have written a comment above each of those 4 lines, but I'm still a bit confused. It seems the code is trying to do some cleaning of whitespace, null values, etc.
Question: Can someone please explain exactly what the code is doing? It also seems that it could be simplified a bit.
import pandas as pd
import numpy as np
pf = pd.read_csv('MyDataFile.txt', low_memory=False, sep=',', encoding='ISO-8859-1')
#trim the column names. Trimmed column will be the new column name
pf = pf.rename(columns=lambda x: x.strip())
#select all the columns that have undefined datatypes
pf_names = pf.select_dtypes(['object'])
#Select all columns with undefined datatypes and trim them
pf[pf_names.columns] = pf_names.apply(lambda x: x.str.strip())
#Replace the null values of dataframe columns with none
pf.replace([np.nan], [None], inplace=True)
Answer:
This removes leading and trailing whitespace from the name of every column. So ' Column 1 ' becomes 'Column 1' and so forth. Spaces inside a name are left alone.
pf = pf.rename(columns=lambda x: x.strip())
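A minimal sketch of that rename step, using a made-up dataframe (the column names here are invented for illustration). It shows that strip() only trims the ends of each name and keeps any space in the middle:

```python
import pandas as pd

# Hypothetical frame whose headers carry stray whitespace,
# as can happen when a CSV header row has padded fields.
df = pd.DataFrame({' Column 1 ': [1, 2], 'Column2  ': [3, 4]})

# str.strip() trims whitespace at both ends of each column name only.
df = df.rename(columns=lambda x: x.strip())

cleaned_names = list(df.columns)
print(cleaned_names)  # ['Column 1', 'Column2']
```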
This selects all columns whose dtype is object (in a frame read from CSV, these are typically the text columns).
pf_names = pf.select_dtypes(include='object')
If you know that Column1 and Column2 are the only object-dtype columns, it would be the same as:
pf_names = pf[['Column1', 'Column2']]
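As a rough illustration of that equivalence, the sketch below builds a small made-up frame (the 'Count' column and all values are assumptions for the example) and compares selecting by dtype with selecting by name:

```python
import pandas as pd

# Hypothetical data: two text columns stored as object dtype and one integer column.
df = pd.DataFrame({
    'Column1': pd.Series(['a', 'b'], dtype=object),
    'Column2': pd.Series(['x', 'y'], dtype=object),
    'Count': [1, 2],
})

# Select by dtype: keeps only the object columns, drops the integer one.
by_dtype = df.select_dtypes(include='object')

# Select by name: the same two columns, listed explicitly.
by_name = df[['Column1', 'Column2']]

selected = list(by_dtype.columns)
same_selection = by_dtype.equals(by_name)
```

Selecting by dtype has the advantage that you do not need to know the column names in advance.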
This does the same as the first strip (removing leading and trailing whitespace), but on the values of the object-dtype columns instead of on the column names.
pf[pf_names.columns] = pf_names.apply(lambda x: x.str.strip())
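Here is a small sketch of the value-stripping step on invented data (the 'name' and 'age' columns are assumptions for the example). Only the object column is touched; the numeric column is left as it was:

```python
import pandas as pd

# Hypothetical data: one padded text column (object dtype) and one numeric column.
df = pd.DataFrame({
    'name': pd.Series(['  Ann ', 'Bob  '], dtype=object),
    'age': [31, 45],
})

# Find the object-dtype columns, then strip whitespace from each value in them.
text_cols = df.select_dtypes(include='object').columns
df[text_cols] = df[text_cols].apply(lambda col: col.str.strip())

names = df['name'].tolist()
ages = df['age'].tolist()
```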
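This replaces every NaN in the dataframe with None. Note that on older pandas versions a scalar None passed as the value could be treated as "no value given", which is presumably why the question passes lists ([np.nan], [None]) instead.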
pf.replace(np.nan, None, inplace=True)
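A hedged sketch of the NaN-to-None step, using the list form from the question on made-up data. The behaviour of replace() with None has changed between pandas releases, so this reflects what recent versions are expected to do rather than a guarantee for every version:

```python
import numpy as np
import pandas as pd

# Hypothetical data with a missing value in a text column and in a float column.
df = pd.DataFrame({
    'name': pd.Series(['Ann', np.nan], dtype=object),
    'score': [1.5, np.nan],
})

# Replace NaN with None. A float column cannot hold None,
# so it is converted to object dtype in the result.
out = df.replace([np.nan], [None])

names = out['name'].tolist()
scores = out['score'].tolist()
```

This kind of conversion is mostly useful right before handing the rows to something that expects None for missing values, such as a database driver. The cost is that numeric columns containing NaN stop being numeric.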