I have a DataFrame like this:
ROW_A ROW_B
1 tata toto
2 tata toto
3 tata toto
4 ti tu te
5 ti tu te
6 ti tu te
7 ti tu te
I want to split the ROW_B values into a new column. I know that the number of split values does not match the number of rows in each group, so I just want to spread the split values over the rows and fill the remaining rows with NaN, like this:
ROW_A ROW_B ROW_C
1 tata toto tata
2 tata toto toto
3 tata toto NaN
4 ti tu te ti
5 ti tu te tu
6 ti tu te te
7 ti tu te NaN
I tried this code:
df_columns = df.columns
row_b = df_columns[1]
df['ROW_C'] = df.groupby('ROW_A')[row_b].transform(lambda x: x.head(1).str.split(' ').explode().values)
Here is the error message:
ValueError: Length of values (2) does not match length of index (3)
CodePudding user response:
You could group by the ROW_B column and then create the new column on each of the groups:
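from itertools import zip_longest

import numpy as np
import pandas as pd

recons_df = []
for k, g in df.groupby('ROW_B'):
    g = g.copy()  # work on a copy so the assignment does not trigger SettingWithCopyWarning
    # pair each word with a NaN filler; assumes the group has at least as many rows as words
    g['ROW_C'] = [x if x else y for x, y in zip_longest(k.split(' '), [np.nan] * len(g))]
    recons_df.append(g)
recons_df = pd.concat(recons_df)
print(recons_df)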
# ROW_A ROW_B ROW_C
#0 1 tata toto tata
#1 2 tata toto toto
#2 3 tata toto NaN
#3 4 ti tu te ti
#4 5 ti tu te tu
#5 6 ti tu te te
#6 7 ti tu te NaN
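The same per-group padding can probably also be done with groupby(...).transform, which is what the question attempted. The original attempt fails because transform expects each group's result to have the same length as the group. A minimal sketch, assuming the sample frame from the question with a default index; pad_words is a hypothetical helper, not part of pandas:

```python
import numpy as np
import pandas as pd

# Sample frame from the question (default 0-based index assumed)
df = pd.DataFrame({
    'ROW_A': range(1, 8),
    'ROW_B': ['tata toto'] * 3 + ['ti tu te'] * 4,
})

def pad_words(s):
    # Hypothetical helper: split the group's shared string and pad with NaN
    # so the result is exactly as long as the group.
    words = s.iloc[0].split(' ')
    padded = words[:len(s)] + [np.nan] * max(len(s) - len(words), 0)
    return pd.Series(padded, index=s.index, dtype=object)

df['ROW_C'] = df.groupby('ROW_B')['ROW_B'].transform(pad_words)
print(df)
```

If a group has fewer rows than words, this sketch silently drops the extra words instead of raising.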
CodePudding user response:
One option is to chain drop_duplicates, str.split and explode to create a temporary Series, then reindex it with df.index to get the NaNs:
tmp = df['ROW_B'].drop_duplicates().str.split(' ').explode()
df['ROW_C'] = tmp.set_axis(tmp.groupby(level=0).cumcount().pipe(lambda x: x + x.index), axis=0).reindex(df.index)
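To make the relabelling step easier to follow, here is the same idea unrolled into intermediate variables. This is only a sketch: it assumes the sample frame with a default RangeIndex, that equal ROW_B values sit in consecutive rows, and that each group has at least as many rows as words. The names offset and new_index are illustrative.

```python
import pandas as pd

# Sample frame from the question (default 0-based index assumed)
df = pd.DataFrame({
    'ROW_A': range(1, 8),
    'ROW_B': ['tata toto'] * 3 + ['ti tu te'] * 4,
})

# One entry per word, labelled with the first row of its group: 0, 0, 3, 3, 3
tmp = df['ROW_B'].drop_duplicates().str.split(' ').explode()

# Position of each word within its group: 0, 1, 0, 1, 2
offset = tmp.groupby(level=0).cumcount()

# First row of the group + position within the group: 0, 1, 3, 4, 5
new_index = tmp.index.to_numpy() + offset.to_numpy()

# Relabel the words, then align to the full index; unmatched rows become NaN
tmp = tmp.set_axis(new_index)
df['ROW_C'] = tmp.reindex(df.index)
print(df)
```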
Another option is to use groupby with cumcount to number the rows within each group, then index the split list in each row using that number. Since the number can exceed the list length (a group may have more rows than words), wrap the lookup in try-except:
out = []
for i, lst in zip(df.groupby('ROW_B').cumcount(), df['ROW_B'].str.split(' ')):
    try:
        out.append(lst[i])
    except IndexError:
        # more rows than words in this group: fill with NaN
        out.append(float('nan'))
df['ROW_C'] = out
Output:
ROW_A ROW_B ROW_C
0 1 tata toto tata
1 2 tata toto toto
2 3 tata toto NaN
3 4 ti tu te ti
4 5 ti tu te tu
5 6 ti tu te te
6 7 ti tu te NaN
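The cumcount approach can also be written with an explicit length check instead of try-except. A compact sketch, assuming the sample frame from the question; pos and words are illustrative names:

```python
import numpy as np
import pandas as pd

# Sample frame from the question (default 0-based index assumed)
df = pd.DataFrame({
    'ROW_A': range(1, 8),
    'ROW_B': ['tata toto'] * 3 + ['ti tu te'] * 4,
})

pos = df.groupby('ROW_B').cumcount()   # position of each row within its group
words = df['ROW_B'].str.split(' ')     # list of words for each row

# Take the word at the row's position, or NaN once the words run out
df['ROW_C'] = [w[i] if i < len(w) else np.nan for i, w in zip(pos, words)]
print(df)
```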
CodePudding user response:
If you don't need the NaN placeholders and are fine with one row per split value, use:
df.merge(df['ROW_B'].str.split(' ', expand=True).stack().reset_index(), left_on=[df.index], right_on=['level_0']).drop(['level_0', 'level_1'], axis=1).rename({0: 'ROW_C'}, axis=1)
Output
ROW_A ROW_B ROW_C
0 1 tata toto tata
1 1 tata toto toto
2 2 tata toto tata
3 2 tata toto toto
4 3 tata toto tata
5 3 tata toto toto
6 4 ti tu te ti
7 4 ti tu te tu
8 4 ti tu te te
9 5 ti tu te ti
10 5 ti tu te tu
11 5 ti tu te te
12 6 ti tu te ti
13 6 ti tu te tu
14 6 ti tu te te
15 7 ti tu te ti
16 7 ti tu te tu
17 7 ti tu te te
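A shorter way to get this one-row-per-word layout is probably DataFrame.explode. A sketch, assuming the sample frame from the question and pandas 1.1 or later (for the ignore_index argument):

```python
import pandas as pd

# Sample frame from the question (default 0-based index assumed)
df = pd.DataFrame({
    'ROW_A': range(1, 8),
    'ROW_B': ['tata toto'] * 3 + ['ti tu te'] * 4,
})

# Put the list of words in ROW_C, then emit one row per list element
out = df.assign(ROW_C=df['ROW_B'].str.split(' ')).explode('ROW_C', ignore_index=True)
print(out)
```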