I'd like to squeeze a dataframe like this:
import pandas as pd
import numpy as np
df1 = pd.DataFrame([[1,pd.NA,100],[2,20,np.nan],[np.nan,np.nan,300],[pd.NA,"bla",400]], columns=["A","B","C"])
df1
A B C
0 1 <NA> 100.0
1 2 20 NaN
2 NaN NaN 300.0
3 <NA> bla 400.0
into a left-side compact way:
df_out
A B C
0 1 100 NaN
1 2 20 NaN
2 300 NaN NaN
3 bla 400 NaN
Also, it will be nice to convert all missing values from pd.NA to np.nan. How can do it? Thanks!
CodePudding user response:
You can apply
a function with dropna
to remove the NaN and reset_index
to push (squeeze) the data to the "left" by removing the aligment.
Then rename
to the original A/B/C... and reindex
to reinstate the eventually lost levels (here C would be missing as there is no non-NaN values):
df2 = (df1
.apply(lambda x: x.dropna().reset_index(drop=True), axis=1)
.rename(columns=dict(enumerate(df1.columns)))
.reindex(columns=df1.columns)
)
Or another approach working as the Series level by only dropping the NaNs and adding missing ones in the end:
df2 = df1.apply(lambda x: pd.Series((y:=x.dropna().to_list())
[float('nan')]*(len(df1.columns)-len(y)),
index=df1.columns), axis=1)
output:
A B C
0 1 100.0 NaN
1 2 20.0 NaN
2 300.0 NaN NaN
3 bla 400.0 NaN
Or also possible but without changing the NA to NaN:
df2 = df1.T.apply(sorted, key=pd.isna).T
output:
A B C
0 1 100.0 <NA>
1 2.0 20.0 NaN
2 300.0 NaN NaN
3 bla 400.0 <NA>
CodePudding user response:
For each row remove missing values in Series.dropna
, rename
columns by dictionary and last add missing columns in DataFrame.reindex
:
df = (df1.apply(lambda x: pd.Series(x.dropna().to_numpy()), axis=1)
.rename(columns=dict(enumerate(df1.columns)))
.reindex(df1.columns, axis=1))
print (df)
A B C
0 1 100.0 NaN
1 2 20.0 NaN
2 300.0 NaN NaN
3 bla 400.0 NaN
Another idea:
df = (df1.apply(lambda x: x.sort_values(key=lambda x: x.isna()).to_numpy(),
axis=1,
result_type='expand')
.set_axis(df1.columns, axis=1)
.mask(lambda x: x.isna())
)
print (df)
A B C
0 1 100.0 NaN
1 2 20.0 NaN
2 300.0 NaN NaN
3 bla 400.0 NaN
df = (df1.apply(lambda x: x.sort_values(key=lambda x: x.isna()).to_numpy(),
axis=1,
result_type='expand')
.set_axis(df1.columns, axis=1)
)
print (df)
A B C
0 1 100.0 <NA>
1 2 20.0 NaN
2 300.0 NaN NaN
3 bla 400.0 <NA>