I am parsing data from a number of PDF documents and storing it in a pandas DataFrame for analysis. When each page of a PDF is written to the DataFrame, the data points do not all line up under the columns they belong to.
One way I can fix this is to remove cells that contain NaNs and shift the non-null values left.
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'Word': ['Text1', np.nan, np.nan, 'Text1', 'Text1'],
    'Word2': ['Text2', 'Text1', np.nan, 'Text2', np.nan],
    'Word3': ['Text3', 'Text2', 'Text1', 'Text3', np.nan]
})
df
Output of sample df:
    Word  Word2  Word3
0  Text1  Text2  Text3
1    NaN  Text1  Text2
2    NaN    NaN  Text1
3  Text1  Text2  Text3
4  Text1    NaN    NaN
Desired output needed:
    Word  Word2  Word3
0  Text1  Text2  Text3
1  Text1  Text2
2  Text1
3  Text1  Text2  Text3
4  Text1
In this example, only rows with index 1 and 2 actually change.
Any assistance would be much appreciated.
Alan
CodePudding user response:
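One option: sort each row so that the non-null values come first (sorted is stable, so their original order is kept), then fill the remaining NaNs with empty strings: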
out = (pd.DataFrame(df.apply(sorted, key=pd.isna, axis=1).to_list(),
                    index=df.index, columns=df.columns)
         .fillna('')
      )
Or:
out = (df.apply(lambda x: pd.Series(x.dropna().values), axis=1)
         .fillna('')
         .set_axis(df.columns, axis=1)
      )
output:
    Word  Word2  Word3
0  Text1  Text2  Text3
1  Text1  Text2
2  Text1
3  Text1  Text2  Text3
4  Text1
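If this needs to run over many parsed pages, the first option can be wrapped in a small helper. This is only a sketch: the name shift_non_null_left and the fill parameter are made up here for illustration, and the letters in the usage frame are placeholder data chosen to show that the left-to-right order of the non-null values is kept.

```python
import numpy as np
import pandas as pd


def shift_non_null_left(df, fill=''):
    """Move the non-null values of each row to the left, keeping their order.

    Hypothetical helper, not part of pandas. Python's sorted() is stable,
    so sorting on pd.isna only pushes nulls to the right.
    """
    rows = [sorted(row, key=pd.isna) for row in df.to_numpy().tolist()]
    return pd.DataFrame(rows, index=df.index, columns=df.columns).fillna(fill)


# Placeholder data whose values are deliberately not in alphabetical order.
demo = pd.DataFrame({
    'Word':  ['b', np.nan, np.nan],
    'Word2': ['a', 'z', np.nan],
    'Word3': ['c', 'y', 'x'],
})
shifted = shift_non_null_left(demo)
print(shifted)
```

The helper keeps the original index and column labels, so the result can be concatenated with frames from other pages.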
CodePudding user response:
One way to do it, slightly different from mozway's solution:
(df.apply(
    lambda x: x.sort_values(ignore_index=True), axis=1)
   .fillna('')
   .set_axis(df.columns, axis=1))
    Word  Word2  Word3
0  Text1  Text2  Text3
1  Text1  Text2
2  Text1
3  Text1  Text2  Text3
4  Text1
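One caveat worth checking on real data: sort_values orders by the values themselves, so a row whose non-null values are not already ascending comes out reordered. In the sample it works because Text1 < Text2 < Text3. The sketch below uses a made-up frame (df2) to show the difference, and one possible variant that sorts on the null mask with a stable sort to keep the original order.

```python
import numpy as np
import pandas as pd

# Made-up frame whose non-null values are NOT in ascending order.
df2 = pd.DataFrame({
    'Word':  ['b', np.nan],
    'Word2': [np.nan, 'z'],
    'Word3': ['a', 'y'],
})

# Plain sort_values sorts by value: row 0 (b, NaN, a) becomes a, b.
by_value = (df2.apply(lambda x: x.sort_values(ignore_index=True), axis=1)
               .fillna('')
               .set_axis(df2.columns, axis=1))

# Sorting on the null mask with a stable sort only moves the NaNs right,
# so row 0 stays b, a.
by_mask = (df2.apply(lambda x: x.sort_values(key=pd.isna, kind='stable',
                                             ignore_index=True), axis=1)
              .fillna('')
              .set_axis(df2.columns, axis=1))

print(by_value)
print(by_mask)
```

If the order of the parsed values matters, the mask-based variant (or the sorted(..., key=pd.isna) approach) is probably the safer choice.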