Home > Blockchain >  Squeeze dataframe rows with missing values
Squeeze dataframe rows with missing values

Time:03-10

I'd like to squeeze a dataframe like this:

import pandas as pd
import numpy as np

df1 = pd.DataFrame([[1,pd.NA,100],[2,20,np.nan],[np.nan,np.nan,300],[pd.NA,"bla",400]], columns=["A","B","C"])
df1
      A     B      C
0     1  <NA>  100.0
1     2    20    NaN
2   NaN   NaN  300.0
3  <NA>   bla  400.0

into a left-side compact way:

    df_out
      A     B      C
0     1   100    NaN
1     2    20    NaN
2   300   NaN    NaN
3   bla   400    NaN

Also, it will be nice to convert all missing values from pd.NA to np.nan. How can do it? Thanks!

CodePudding user response:

You can apply a function with dropna to remove the NaN and reset_index to push (squeeze) the data to the "left" by removing the aligment.

Then rename to the original A/B/C... and reindex to reinstate the eventually lost levels (here C would be missing as there is no non-NaN values):

df2 = (df1
 .apply(lambda x: x.dropna().reset_index(drop=True), axis=1)
 .rename(columns=dict(enumerate(df1.columns)))
 .reindex(columns=df1.columns)
)

Or another approach working as the Series level by only dropping the NaNs and adding missing ones in the end:

df2 = df1.apply(lambda x: pd.Series((y:=x.dropna().to_list())
                                     [float('nan')]*(len(df1.columns)-len(y)),
                                    index=df1.columns), axis=1)

output:

       A      B   C
0      1  100.0 NaN
1      2   20.0 NaN
2  300.0    NaN NaN
3    bla  400.0 NaN

Or also possible but without changing the NA to NaN:

df2 = df1.T.apply(sorted, key=pd.isna).T

output:

       A      B     C
0      1  100.0  <NA>
1    2.0   20.0   NaN
2  300.0    NaN   NaN
3    bla  400.0  <NA>

CodePudding user response:

For each row remove missing values in Series.dropna, rename columns by dictionary and last add missing columns in DataFrame.reindex:

df = (df1.apply(lambda x: pd.Series(x.dropna().to_numpy()), axis=1)
         .rename(columns=dict(enumerate(df1.columns)))
         .reindex(df1.columns, axis=1))

print (df)
       A      B   C
0      1  100.0 NaN
1      2   20.0 NaN
2  300.0    NaN NaN
3    bla  400.0 NaN

Another idea:

df = (df1.apply(lambda x: x.sort_values(key=lambda x: x.isna()).to_numpy(), 
                axis=1, 
                result_type='expand')
        .set_axis(df1.columns, axis=1)
        .mask(lambda x: x.isna())
        )
print (df)
       A      B    C
0      1  100.0  NaN
1      2   20.0  NaN
2  300.0    NaN  NaN
3    bla  400.0  NaN

df = (df1.apply(lambda x: x.sort_values(key=lambda x: x.isna()).to_numpy(), 
                axis=1, 
                result_type='expand')
        .set_axis(df1.columns, axis=1)
        )
print (df)
       A      B     C
0      1  100.0  <NA>
1      2   20.0   NaN
2  300.0    NaN   NaN
3    bla  400.0  <NA>
  • Related