Splitting the string values of a column across rows and filling the remainder with NaN in a Pandas DataFrame

I have a DataFrame like this:

ROW_A    ROW_B
1        tata toto
2        tata toto
3        tata toto
4        ti tu te
5        ti tu te
6        ti tu te
7        ti tu te

I want to split the ROW_B values into a new column. I know that the number of split values does not match the length of the index; I just want to spread the split values over the rows that share the same string and fill the remaining rows with NaN, like this:

ROW_A    ROW_B       ROW_C
1        tata toto   tata
2        tata toto   toto
3        tata toto   NaN
4        ti tu te    ti
5        ti tu te    tu
6        ti tu te    te
7        ti tu te    NaN

I tried this code:

df_columns = df.columns
row_b = df_columns[1]

df['ROW_C'] = df.groupby('ROW_A')[row_b].transform(lambda x: x.head(1).str.split(' ').explode().values)

Here is the error message:

ValueError: Length of values (2) does not match length of index (3)

CodePudding user response：

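You could group by column ROW_B and then create the new column on each of the groups:

from itertools import zip_longest

import numpy as np
import pandas as pd

recons_df = []
for k, g in df.groupby('ROW_B'):
    g = g.copy()  # work on a copy so the assignment does not target a slice of df
    # pair each token with a NaN placeholder; zip_longest yields None once the tokens run out
    g['ROW_C'] = [x if x is not None else y for x, y in zip_longest(k.split(' '), [np.nan] * len(g))]
    recons_df.append(g)

recons_df = pd.concat(recons_df)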
print(recons_df)
#   ROW_A      ROW_B ROW_C
#0      1  tata toto  tata
#1      2  tata toto  toto
#2      3  tata toto   NaN
#3      4   ti tu te    ti
#4      5   ti tu te    tu
#5      6   ti tu te    te
#6      7   ti tu te   NaN

CodePudding user response：

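One option is to use drop_duplicates + str.split + explode to create a temporary Series, relabel each token with the row it belongs to, and then reindex with df.index to get the NaNs. This relies on df having the default integer index, with identical strings on consecutive rows:

tmp = df['ROW_B'].drop_duplicates().str.split(' ').explode()
df['ROW_C'] = tmp.set_axis(tmp.groupby(level=0).cumcount().pipe(lambda x: x + x.index), axis=0).reindex(df.index)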
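The one-liner above is dense, so here is a minimal sketch that spells out the same idea step by step. It assumes the sample frame from the question, a default RangeIndex, and identical ROW_B strings on consecutive rows; the names `offset` and `target` are only illustrative.

```python
import pandas as pd

# Sample frame from the question (default RangeIndex 0..6 is assumed).
df = pd.DataFrame({
    "ROW_A": [1, 2, 3, 4, 5, 6, 7],
    "ROW_B": ["tata toto"] * 3 + ["ti tu te"] * 4,
})

# One row per token, labelled with the first row of each distinct string.
tmp = df["ROW_B"].drop_duplicates().str.split(" ").explode()

# Position of each token inside its own string.
offset = tmp.groupby(level=0).cumcount()

# Target row label = first row of the group + token position.
target = offset.index + offset.to_numpy()

# Relabel the tokens, then let reindex insert NaN for rows without a token.
tmp.index = target
df["ROW_C"] = tmp.reindex(df.index)
print(df)
```

If the index is not a plain 0..n-1 range, or equal strings are scattered through the frame, the arithmetic on the labels no longer points at the right rows, so treat this as a sketch for the layout shown in the question.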

Another option is to use groupby + cumcount to number the rows within each group, then index the split list in each row with that number. Since the number can exceed the list length, wrap the lookup in try-except:

out = []
for i, lst in zip(df.groupby('ROW_B').cumcount(), df['ROW_B'].str.split(' ')):
    try:
        out.append(lst[i])
    except IndexError:
        # more rows in the group than tokens in the string
        out.append(float('nan'))
df['ROW_C'] = out

Output:

   ROW_A      ROW_B ROW_C
0      1  tata toto  tata
1      2  tata toto  toto
2      3  tata toto   NaN
3      4   ti tu te    ti
4      5   ti tu te    tu
5      6   ti tu te    te
6      7   ti tu te   NaN
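The cumcount approach can also be wrapped in a small function with a bounds check instead of try-except. This is only a sketch: `nth_token_per_group` and its parameters are hypothetical names, not part of pandas. Because it pairs values by position and reattaches `df.index` at the end, it should not depend on a default integer index.

```python
import numpy as np
import pandas as pd


def nth_token_per_group(df, col="ROW_B", sep=" "):
    """Hypothetical helper: the k-th row of each group gets the k-th token, else NaN."""
    position = df.groupby(col).cumcount()   # 0, 1, 2, ... within each group
    tokens = df[col].str.split(sep)         # list of tokens per row
    values = [
        lst[i] if i < len(lst) else np.nan  # NaN once the tokens run out
        for i, lst in zip(position, tokens)
    ]
    return pd.Series(values, index=df.index, dtype=object)


# Sample frame from the question.
df = pd.DataFrame({
    "ROW_A": [1, 2, 3, 4, 5, 6, 7],
    "ROW_B": ["tata toto"] * 3 + ["ti tu te"] * 4,
})
df["ROW_C"] = nth_token_per_group(df)
print(df)
```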

CodePudding user response：

If you don't need the NaN padding and instead want one row for every split value of every original row, use:

df.merge(df['ROW_B'].str.split(' ', expand=True).stack().reset_index(), left_on=[df.index], right_on=['level_0']).drop(['level_0', 'level_1'], axis=1).rename({0: 'ROW_C'}, axis=1)

Output:

    ROW_A      ROW_B ROW_C
0       1  tata toto  tata
1       1  tata toto  toto
2       2  tata toto  tata
3       2  tata toto  toto
4       3  tata toto  tata
5       3  tata toto  toto
6       4   ti tu te    ti
7       4   ti tu te    tu
8       4   ti tu te    te
9       5   ti tu te    ti
10      5   ti tu te    tu
11      5   ti tu te    te
12      6   ti tu te    ti
13      6   ti tu te    tu
14      6   ti tu te    te
15      7   ti tu te    ti
16      7   ti tu te    tu
17      7   ti tu te    te
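The same long-format result can probably be reached more directly with `DataFrame.explode`. This is a swapped-in alternative to the merge above, not the answerer's method; it assumes pandas 1.1 or later for `ignore_index`, and `long_df` is just an illustrative name.

```python
import pandas as pd

# Sample frame from the question.
df = pd.DataFrame({
    "ROW_A": [1, 2, 3, 4, 5, 6, 7],
    "ROW_B": ["tata toto"] * 3 + ["ti tu te"] * 4,
})

# Put the token list in ROW_C, then emit one row per token.
long_df = (
    df.assign(ROW_C=df["ROW_B"].str.split(" "))
      .explode("ROW_C", ignore_index=True)
)
print(long_df)
```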