Remove NaN Cells without dropping entire rows (Pandas, Python)

Time:09-12

I am parsing data from a number of PDF documents and storing it in a DataFrame for analysis. When I write to a pandas DataFrame, the data points from each PDF page do not all line up under the columns they belong to.

One way I can fix this is to remove cells that contain NaNs and shift the non-null values left.

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Word':['Text1', np.nan, np.nan, 'Text1', 'Text1'],
    'Word2':['Text2', 'Text1', np.nan, 'Text2', np.nan],
    'Word3':['Text3', 'Text2', 'Text1', 'Text3', np.nan]
})
df

Output of sample df:

    Word    Word2   Word3
0   Text1   Text2   Text3
1   NaN     Text1   Text2
2   NaN     NaN     Text1
3   Text1   Text2   Text3
4   Text1   NaN     NaN

Desired output:

    Word    Word2   Word3
0   Text1   Text2   Text3
1   Text1   Text2   
2   Text1   
3   Text1   Text2   Text3
4   Text1   

In this example, only the rows at index 1 and 2 have values that actually shift; in row 4 the NaNs are simply replaced with blanks.

Any assistance would be much appreciated.

Alan

CodePudding user response:

One option, by shifting the columns and filling the NaNs:

out = (pd.DataFrame(df.apply(sorted, key=pd.isna, axis=1).to_list(),
                    index=df.index, columns=df.columns)
         .fillna('')
       )

Or:

out = (df.apply(lambda x: pd.Series(x.dropna().values), axis=1)
         .fillna('')
         .set_axis(df.columns, axis=1)
       )

output:

    Word  Word2  Word3
0  Text1  Text2  Text3
1  Text1  Text2       
2  Text1              
3  Text1  Text2  Text3
4  Text1              
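The first option above can be wrapped in a small helper. This is only a sketch under assumptions: the name `shift_left` and its `fill` parameter are made up for illustration and are not part of pandas. It relies on Python's `sorted` being stable, so non-null values keep their left-to-right order while NaNs move to the end of each row.

```python
import numpy as np
import pandas as pd


def shift_left(frame: pd.DataFrame, fill="") -> pd.DataFrame:
    """Hypothetical helper: push non-null cells to the left of each row."""
    # pd.isna gives False for real values and True for NaN; because sorted()
    # is stable, real values keep their relative order and NaNs go last.
    rows = [sorted(row, key=pd.isna) for row in frame.to_numpy().tolist()]
    # Rebuild the frame with the original labels, then blank out the NaNs.
    return pd.DataFrame(rows, index=frame.index, columns=frame.columns).fillna(fill)


# Sample data from the question.
df = pd.DataFrame({
    'Word': ['Text1', np.nan, np.nan, 'Text1', 'Text1'],
    'Word2': ['Text2', 'Text1', np.nan, 'Text2', np.nan],
    'Word3': ['Text3', 'Text2', 'Text1', 'Text3', np.nan],
})
out = shift_left(df)
```

Unlike the `dropna` variant, this always returns as many columns as the input, even if every row contains at least one NaN.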

CodePudding user response:

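One way to do it, slightly different from mozway's solution. Note that sort_values also orders the non-null values themselves, so their original left-to-right order is kept only when they are already sorted, as in this sample: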

(df.apply(
    lambda x: x.sort_values(ignore_index=True), axis=1)
 .fillna('')
 .set_axis(df.columns, axis=1))

output:

    Word    Word2   Word3
0   Text1   Text2   Text3
1   Text1   Text2   
2   Text1       
3   Text1   Text2   Text3
4   Text1       
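A plain `sort_values` works on the sample because each row's values already happen to be in alphabetical order. The sketch below uses a made-up row (not from the question) to show the difference, and one possible way to keep the original order: sort on "is this cell null?" with a stable algorithm. Whether that matters depends on your real data.

```python
import numpy as np
import pandas as pd

# Hypothetical row whose non-null values are NOT in alphabetical order.
row = pd.Series(['b', np.nan, 'a'])

# Plain sort_values orders the values themselves; NaN goes last by default.
by_value = row.sort_values(ignore_index=True).tolist()

# Sorting on the null mask with a stable sort only moves the NaNs,
# so 'b' stays ahead of 'a'.
by_null = row.sort_values(key=pd.isna, kind='stable', ignore_index=True).tolist()

print(by_value)
print(by_null)
```

The same `key=pd.isna, kind='stable'` arguments could be passed inside the `df.apply(..., axis=1)` call above if the order of the surviving values needs to be preserved.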