Home > Enterprise >  How to compact a dataset with empty rows in Python?
How to compact a dataset with empty rows in Python?

Time:11-24

I have a data set formatted as follows:

sha 0_x 1_x N_x
Sha1 rm rm
Sha2 rw rw
Sha3 rw
Sha4 tr

In particular, the dataset currently contains about 2000 columns.

I want to reduce the number of columns removing as many as possible the empty rows, as follows:

sha 0_x 1_x
Sha1 rm rm
Sha2 rw rw
Sha3 rw
Sha4 tr

I don't care about the names of the columns.

CodePudding user response:

Assuming empty cells are NaN, if not, first replace('', np.nan).

You can stack and pivot:

cols = df.columns[1:]
# ['0_x', '1_x', 'N_x']

(df.set_index('sha')
   .stack()
   .reset_index()
   .assign(cols=lambda d: d.groupby('sha')
                           .cumcount()
                           .map(dict(enumerate(cols)))
          )
   .pivot(index='sha', columns='cols', values=0)
   .reset_index()
)

Other option, with apply:

cols = list(df.columns[1:])
# ['0_x', '1_x', 'N_x']

(df.set_index('sha')
   .apply(lambda s: s.dropna().reset_index(drop=True), axis=1)
   .pipe(lambda d: d.set_axis(cols[:len(d.columns)], axis=1))
   .reset_index()
)

Output:

cols   sha 0_x  1_x
0     Sha1  rm   rm
1     Sha2  rw   rw
2     Sha3  rw  NaN
3     Sha4  tr  NaN

CodePudding user response:

Another possible solution:

(df.set_index('sha')
 .replace(r'$', '_', regex=True)
 .replace(np.nan, '')
 .sum(numeric_only=False, axis=1)
 .str.split('_ ', regex=True, expand=True)
 .replace('', np.nan)
 .dropna(how='all', axis=1)
 .pipe(lambda d: d.set_axis(d.columns.astype('str')   '_x', axis=1))
 .reset_index())

Output:

    sha 0_x  1_x
0  Sha1  rm   rm
1  Sha2  rw   rw
2  Sha3  rw  NaN
3  Sha4  tr  NaN
  • Related