Splitting a dataframe when NaN rows are found-CodePudding

I'm trying to split a dataframe when NaN rows are found using grps = dfs.isnull().all(axis=1).cumsum().

But this is not working when some of the rows have NaN entry in a single column.

import pandas as pd
from pprint import pprint
import numpy as np
d = {
    't': [0, 1, 2, 0, 2, 0, 1],
    'input': [2, 2, 2, 2, 2, 2, 4],
    'type': ['A', 'A', 'A', 'B', 'B', 'B', 'A'],
    'value': [0.1, 0.2, 0.3, np.nan, 2, 3, 1],
}
df = pd.DataFrame(d)
dup = df['t'].diff().lt(0).cumsum()

dfs = (
    df.groupby(dup, as_index=False, group_keys=False)
        .apply(lambda x: pd.concat([x, pd.Series(index=x.columns, name='').to_frame().T]))
)

pprint(dfs)

grps = dfs.isnull().all(axis=1).cumsum()
temp = [dfs.dropna() for _, dfs in dfs.groupby(grps)]
i = 0
dfm = pd.DataFrame()
for df in temp:
    df["name"] = f'name{i}'
    i=i 1
    df = df.append(pd.Series(dtype='object'), ignore_index=True)
    dfm = dfm.append(df, ignore_index=True)
print(dfm)

Input df:

     t  input type  value
0  0.0    2.0    A    0.1
1  1.0    2.0    A    0.2
2  2.0    2.0    A    0.3
   NaN    NaN  NaN    NaN
3  0.0    2.0    B    NaN
4  2.0    2.0    B    2.0
   NaN    NaN  NaN    NaN
5  0.0    2.0    B    3.0
6  1.0    4.0    A    1.0

Output obtained:

     t  input type  value   name
0  0.0    2.0    A    0.1  name0
1  1.0    2.0    A    0.2  name0
2  2.0    2.0    A    0.3  name0
3  NaN    NaN  NaN    NaN    NaN
4  2.0    2.0    B    2.0  name1
5  NaN    NaN  NaN    NaN    NaN
6  0.0    2.0    B    3.0  name2
7  1.0    4.0    A    1.0  name2
8  NaN    NaN  NaN    NaN    NaN
9  NaN    NaN  NaN    NaN    NaN

Expected:

     t  input type  value   name
0  0.0    2.0    A    0.1  name0
1  1.0    2.0    A    0.2  name0
2  2.0    2.0    A    0.3  name0
3  NaN    NaN  NaN    NaN    NaN
4  0.0    2.0    B    NaN  name1
5  2.0    2.0    B    2.0  name1
6  NaN    NaN  NaN    NaN    NaN
7  0.0    2.0    B    3.0  name2
8  1.0    4.0    A    1.0  name2
9  NaN    NaN  NaN    NaN    NaN

I am basically doing this to append names to the last column of the dataframe after splitting df using

dfs = (
        df.groupby(dup, as_index=False, group_keys=False)
            .apply(lambda x: pd.concat([x, pd.Series(index=x.columns, name='').to_frame().T]))
    )

and appending NaN rows.

Again, I use the NaN rows to split the df into a list and add new column. But dfs.isnull().all(axis=1).cumsum() isn't working for me. And I also get an additional NaN row in the last row fo the output obtained.

Suggestions on how to get the expected output will be really helpful.

CodePudding user response：

Setup

df = pd.DataFrame(d)
print(df)

   t  input type  value
0  0      2    A    0.1
1  1      2    A    0.2
2  2      2    A    0.3
3  0      2    B    NaN
4  2      2    B    2.0
5  0      2    B    3.0
6  1      4    A    1.0

Simplify your approach

# assign name column before splitting
m = df['t'].diff().lt(0)
df['name'] = 'name'   m.cumsum().astype(str)

# Create null dataframes to concat
nan_rows = pd.DataFrame(index=m[m].index) 
last_nan_row = pd.DataFrame(index=df.index[[-1]])

# Concat and sort index
df_out = pd.concat([nan_rows, df, last_nan_row]).sort_index(ignore_index=True)

Result

     t  input type  value   name
0  0.0    2.0    A    0.1  name0
1  1.0    2.0    A    0.2  name0
2  2.0    2.0    A    0.3  name0
3  NaN    NaN  NaN    NaN    NaN
4  0.0    2.0    B    NaN  name1
5  2.0    2.0    B    2.0  name1
6  NaN    NaN  NaN    NaN    NaN
7  0.0    2.0    B    3.0  name2
8  1.0    4.0    A    1.0  name2
9  NaN    NaN  NaN    NaN    NaN

Alternatively if you still want to start with the initial input as dfs, here is another approach:

dfs = dfs.reset_index(drop=True)
m = dfs.isna().all(1)
dfs.loc[~m, 'name'] = 'name'   m.cumsum().astype(str)