Splitting a dataframe when NaN rows are found

I'm trying to split a dataframe when NaN rows are found using grps = dfs.isnull().all(axis=1).cumsum().

But this does not work when some of the rows have a NaN entry in only a single column.

import pandas as pd
from pprint import pprint
import numpy as np
d = {
    't': [0, 1, 2, 0, 2, 0, 1],
    'input': [2, 2, 2, 2, 2, 2, 4],
    'type': ['A', 'A', 'A', 'B', 'B', 'B', 'A'],
    'value': [0.1, 0.2, 0.3, np.nan, 2, 3, 1],
}
df = pd.DataFrame(d)
dup = df['t'].diff().lt(0).cumsum()

dfs = (
    df.groupby(dup, as_index=False, group_keys=False)
        .apply(lambda x: pd.concat([x, pd.Series(index=x.columns, name='').to_frame().T]))
)

pprint(dfs)

grps = dfs.isnull().all(axis=1).cumsum()
temp = [dfs.dropna() for _, dfs in dfs.groupby(grps)]
i = 0
dfm = pd.DataFrame()
for df in temp:
    df["name"] = f'name{i}'
    i = i + 1
    # re-append an empty separator row, then collect the piece into dfm
    # (DataFrame.append is deprecated; newer pandas would need pd.concat)
    df = df.append(pd.Series(dtype='object'), ignore_index=True)
    dfm = dfm.append(df, ignore_index=True)
print(dfm)

Input df:

     t  input type  value
0  0.0    2.0    A    0.1
1  1.0    2.0    A    0.2
2  2.0    2.0    A    0.3
   NaN    NaN  NaN    NaN
3  0.0    2.0    B    NaN
4  2.0    2.0    B    2.0
   NaN    NaN  NaN    NaN
5  0.0    2.0    B    3.0
6  1.0    4.0    A    1.0

Output obtained:

     t  input type  value   name
0  0.0    2.0    A    0.1  name0
1  1.0    2.0    A    0.2  name0
2  2.0    2.0    A    0.3  name0
3  NaN    NaN  NaN    NaN    NaN
4  2.0    2.0    B    2.0  name1
5  NaN    NaN  NaN    NaN    NaN
6  0.0    2.0    B    3.0  name2
7  1.0    4.0    A    1.0  name2
8  NaN    NaN  NaN    NaN    NaN
9  NaN    NaN  NaN    NaN    NaN

Expected:

     t  input type  value   name
0  0.0    2.0    A    0.1  name0
1  1.0    2.0    A    0.2  name0
2  2.0    2.0    A    0.3  name0
3  NaN    NaN  NaN    NaN    NaN
4  0.0    2.0    B    NaN  name1
5  2.0    2.0    B    2.0  name1
6  NaN    NaN  NaN    NaN    NaN
7  0.0    2.0    B    3.0  name2
8  1.0    4.0    A    1.0  name2
9  NaN    NaN  NaN    NaN    NaN

I am basically doing this to add a name as the last column of the dataframe after splitting df using

dfs = (
    df.groupby(dup, as_index=False, group_keys=False)
        .apply(lambda x: pd.concat([x, pd.Series(index=x.columns, name='').to_frame().T]))
)

and appending NaN rows.

Again, I use the NaN rows to split the df into a list and add the new column. But dfs.isnull().all(axis=1).cumsum() isn't working for me, and I also get an additional NaN row at the end of the output obtained.
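
I suspect part of the problem is that dropna() defaults to how='any', so the B row whose only NaN is in value gets dropped along with the separator rows, and the trailing separator row ends up as a group of its own, which is where the extra NaN row at the end comes from. Something like the following keeps the partially-NaN rows and skips the empty piece, but I'd still like a cleaner way to do the whole thing:

grps = dfs.isnull().all(axis=1).cumsum()
# how='all' drops only the fully-NaN separator rows, keeping the row that is NaN in value alone
temp = [g.dropna(how='all') for _, g in dfs.groupby(grps)]
# the trailing separator forms an empty group of its own, so skip empty pieces
temp = [g for g in temp if not g.empty]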

Suggestions on how to get the expected output will be really helpful.

CodePudding user response:

Setup

df = pd.DataFrame(d)
print(df)

   t  input type  value
0  0      2    A    0.1
1  1      2    A    0.2
2  2      2    A    0.3
3  0      2    B    NaN
4  2      2    B    2.0
5  0      2    B    3.0
6  1      4    A    1.0

Simplify your approach

# assign name column before splitting
m = df['t'].diff().lt(0)
df['name'] = 'name' + m.cumsum().astype(str)

# Create null dataframes to concat
nan_rows = pd.DataFrame(index=m[m].index) 
last_nan_row = pd.DataFrame(index=df.index[[-1]])

# Concat and sort index
df_out = pd.concat([nan_rows, df, last_nan_row]).sort_index(ignore_index=True)

Result

     t  input type  value   name
0  0.0    2.0    A    0.1  name0
1  1.0    2.0    A    0.2  name0
2  2.0    2.0    A    0.3  name0
3  NaN    NaN  NaN    NaN    NaN
4  0.0    2.0    B    NaN  name1
5  2.0    2.0    B    2.0  name1
6  NaN    NaN  NaN    NaN    NaN
7  0.0    2.0    B    3.0  name2
8  1.0    4.0    A    1.0  name2
9  NaN    NaN  NaN    NaN    NaN
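
If you also need the list of sub-frames described in the question, one way (assuming df_out as built above) is to group on the name column; by default groupby drops the NaN keys, so the separator rows are excluded automatically:

# three frames: name0, name1, name2; the all-NaN separator rows have no name and are skipped
pieces = [g for _, g in df_out.groupby('name')]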

Alternatively, if you still want to start from the intermediate dfs built in the question, here is another approach:

dfs = dfs.reset_index(drop=True)
m = dfs.isna().all(1)
dfs.loc[~m, 'name'] = 'name' + m.cumsum().astype(str)
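
Here m marks the all-NaN separator rows, so m.cumsum() numbers the blocks of data rows that follow them, and assigning only to the ~m rows leaves the separators' name as NaN. A quick check of the numbering (with dfs rebuilt as above):

# should give [0, 0, 0, 1, 1, 1, 2, 2, 2, 3] for this input
print(m.cumsum().tolist())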