I'm trying to split a dataframe when NaN rows are found using grps = dfs.isnull().all(axis=1).cumsum()
.
But this is not working when some of the rows have NaN entry in a single column.
import pandas as pd
from pprint import pprint
import numpy as np
d = {
't': [0, 1, 2, 0, 2, 0, 1],
'input': [2, 2, 2, 2, 2, 2, 4],
'type': ['A', 'A', 'A', 'B', 'B', 'B', 'A'],
'value': [0.1, 0.2, 0.3, np.nan, 2, 3, 1],
}
df = pd.DataFrame(d)
dup = df['t'].diff().lt(0).cumsum()
dfs = (
df.groupby(dup, as_index=False, group_keys=False)
.apply(lambda x: pd.concat([x, pd.Series(index=x.columns, name='').to_frame().T]))
)
pprint(dfs)
grps = dfs.isnull().all(axis=1).cumsum()
temp = [dfs.dropna() for _, dfs in dfs.groupby(grps)]
i = 0
dfm = pd.DataFrame()
for df in temp:
df["name"] = f'name{i}'
i=i 1
df = df.append(pd.Series(dtype='object'), ignore_index=True)
dfm = dfm.append(df, ignore_index=True)
print(dfm)
Input df:
t input type value
0 0.0 2.0 A 0.1
1 1.0 2.0 A 0.2
2 2.0 2.0 A 0.3
NaN NaN NaN NaN
3 0.0 2.0 B NaN
4 2.0 2.0 B 2.0
NaN NaN NaN NaN
5 0.0 2.0 B 3.0
6 1.0 4.0 A 1.0
Output obtained:
t input type value name
0 0.0 2.0 A 0.1 name0
1 1.0 2.0 A 0.2 name0
2 2.0 2.0 A 0.3 name0
3 NaN NaN NaN NaN NaN
4 2.0 2.0 B 2.0 name1
5 NaN NaN NaN NaN NaN
6 0.0 2.0 B 3.0 name2
7 1.0 4.0 A 1.0 name2
8 NaN NaN NaN NaN NaN
9 NaN NaN NaN NaN NaN
Expected:
t input type value name
0 0.0 2.0 A 0.1 name0
1 1.0 2.0 A 0.2 name0
2 2.0 2.0 A 0.3 name0
3 NaN NaN NaN NaN NaN
4 0.0 2.0 B NaN name1
5 2.0 2.0 B 2.0 name1
6 NaN NaN NaN NaN NaN
7 0.0 2.0 B 3.0 name2
8 1.0 4.0 A 1.0 name2
9 NaN NaN NaN NaN NaN
I am basically doing this to append names to the last column of the dataframe after splitting df
using
dfs = (
df.groupby(dup, as_index=False, group_keys=False)
.apply(lambda x: pd.concat([x, pd.Series(index=x.columns, name='').to_frame().T]))
)
and appending NaN rows.
Again, I use the NaN rows to split the df
into a list and add new column. But dfs.isnull().all(axis=1).cumsum()
isn't working for me. And I also get an additional NaN row in the last row fo the output obtained.
Suggestions on how to get the expected output will be really helpful.
CodePudding user response:
Setup
df = pd.DataFrame(d)
print(df)
t input type value
0 0 2 A 0.1
1 1 2 A 0.2
2 2 2 A 0.3
3 0 2 B NaN
4 2 2 B 2.0
5 0 2 B 3.0
6 1 4 A 1.0
Simplify your approach
# assign name column before splitting
m = df['t'].diff().lt(0)
df['name'] = 'name' m.cumsum().astype(str)
# Create null dataframes to concat
nan_rows = pd.DataFrame(index=m[m].index)
last_nan_row = pd.DataFrame(index=df.index[[-1]])
# Concat and sort index
df_out = pd.concat([nan_rows, df, last_nan_row]).sort_index(ignore_index=True)
Result
t input type value name
0 0.0 2.0 A 0.1 name0
1 1.0 2.0 A 0.2 name0
2 2.0 2.0 A 0.3 name0
3 NaN NaN NaN NaN NaN
4 0.0 2.0 B NaN name1
5 2.0 2.0 B 2.0 name1
6 NaN NaN NaN NaN NaN
7 0.0 2.0 B 3.0 name2
8 1.0 4.0 A 1.0 name2
9 NaN NaN NaN NaN NaN
Alternatively if you still want to start with the initial input as dfs
, here is another approach:
dfs = dfs.reset_index(drop=True)
m = dfs.isna().all(1)
dfs.loc[~m, 'name'] = 'name' m.cumsum().astype(str)