I have a data set formatted as follows:
sha | 0_x | 1_x | N_x |
---|---|---|---|
Sha1 | rm | rm | |
Sha2 | rw | rw | |
Sha3 | rw | | |
Sha4 | tr | | |
In particular, the dataset currently contains about 2000 columns.
I want to reduce the number of columns by shifting each row's values to the left and dropping the columns that end up entirely empty, as follows:
sha | 0_x | 1_x |
---|---|---|
Sha1 | rm | rm |
Sha2 | rw | rw |
Sha3 | rw | |
Sha4 | tr | |
I don't care about the names of the columns.
CodePudding user response:
Assuming the empty cells are NaN; if not, first replace('', np.nan).
cols = df.columns[1:]
# ['0_x', '1_x', 'N_x']
(df.set_index('sha')
.stack()
.reset_index()
.assign(cols=lambda d: d.groupby('sha')
.cumcount()
.map(dict(enumerate(cols)))
)
.pivot(index='sha', columns='cols', values=0)
.reset_index()
)
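As a minimal runnable sketch, here is the pipeline above applied to a small frame mirroring the question's table (the sample data and the trailing dropna guard are mine; the guard drops any column left entirely empty, which also keeps the sketch working on pandas versions where stack() retains NaN rows):

```python
import numpy as np
import pandas as pd

# Hypothetical sample mirroring the question's table; N_x stands in
# for the last of the ~2000 sparse columns.
df = pd.DataFrame({
    'sha': ['Sha1', 'Sha2', 'Sha3', 'Sha4'],
    '0_x': ['rm', 'rw', 'rw', 'tr'],
    '1_x': ['rm', 'rw', np.nan, np.nan],
    'N_x': [np.nan, np.nan, np.nan, np.nan],
})

cols = df.columns[1:]
out = (df.set_index('sha')
         .stack()                          # one row per non-null value
         .reset_index()                    # columns: sha, level_1, 0
         .assign(cols=lambda d: d.groupby('sha')
                                 .cumcount()                   # position within each sha
                                 .map(dict(enumerate(cols))))  # position -> column name
         .pivot(index='sha', columns='cols', values=0)
         .dropna(how='all', axis=1)        # guard: drop columns that stayed empty
         .reset_index())
```

The key idea is that stack() discards the empty cells, and cumcount() then renumbers the surviving values per row, which left-shifts them before the pivot back to wide form.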
Another option, with apply:
cols = list(df.columns[1:])
# ['0_x', '1_x', 'N_x']
(df.set_index('sha')
.apply(lambda s: s.dropna().reset_index(drop=True), axis=1)
.pipe(lambda d: d.set_axis(cols[:len(d.columns)], axis=1))
.reset_index()
)
Output:
cols sha 0_x 1_x
0 Sha1 rm rm
1 Sha2 rw rw
2 Sha3 rw NaN
3 Sha4 tr NaN
CodePudding user response:
Another possible solution:
(df.set_index('sha')
.replace(r'$', '_', regex=True)
.replace(np.nan, '')
.sum(numeric_only=False, axis=1)
.str.split('_', expand=True)
.replace('', np.nan)
.dropna(how='all', axis=1)
.pipe(lambda d: d.set_axis(d.columns.astype('str') + '_x', axis=1))
.reset_index())
Output:
sha 0_x 1_x
0 Sha1 rm rm
1 Sha2 rw rw
2 Sha3 rw NaN
3 Sha4 tr NaN
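The same string-concatenation trick as a runnable sketch (the sample frame is mine, mirroring the question's table). Note the approach assumes the cell values themselves never contain the '_' separator:

```python
import numpy as np
import pandas as pd

# Hypothetical sample mirroring the question's table.
df = pd.DataFrame({
    'sha': ['Sha1', 'Sha2', 'Sha3', 'Sha4'],
    '0_x': ['rm', 'rw', 'rw', 'tr'],
    '1_x': ['rm', 'rw', np.nan, np.nan],
    'N_x': [np.nan] * 4,
})

out = (df.set_index('sha')
         .replace(r'$', '_', regex=True)   # append '_' to every non-null value
         .replace(np.nan, '')
         .sum(numeric_only=False, axis=1)  # row-wise string concat: 'rm_rm_', 'tr_', ...
         .str.split('_', expand=True)      # split back into one column per value
         .replace('', np.nan)
         .dropna(how='all', axis=1)        # drop the all-empty trailing columns
         .pipe(lambda d: d.set_axis(d.columns.astype('str') + '_x', axis=1))
         .reset_index())
```

Because no pivot is involved, the original row order is preserved; the split pads shorter rows with empty strings, which become NaN and are dropped column-wise.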