Home > Software design >  Split column to multiple columns by another column value (complicated separator)
Split column to multiple columns by another column value (complicated separator)

Time:04-29

  1. I have dataframe like:
df = pd.DataFrame(np.array([['abc 33 aaa 9g98f 333', 'aaa'],
                            ['cde aaa 95fwf', 'aaa'],
                            ['12 faf bbb 92gcs', 'bbb'],
                            ['faf bbb 7t87f', 'bbb']]),
                            columns=['column1', 'column2'])

len of column1 value may be different - from 2 to 5 words, so split with space not an option.

            column1     column2
0  abc 33 aaa 9g98f 333   aaa
1     cde aaa 95fwf       aaa
2  12 faf bbb 92gcs       bbb
3     faf bbb 7t87f       bbb

  1. Output should be like:
            column1 new_column1 new_column2      column2
0  abc 33 aaa 9g98f      abc 33       9g98f 333     aaa
1     cde aaa 95fwf         cde       95fwf         aaa
2     faf bbb 92gcs         faf       92gcs         bbb
3  12 faf bbb 7t87f      12 faf       7t87f         bbb

That topic - How to split a dataframe string column into two columns? - didn't help coz of separator

UPD. Left "side" may have 2-5 words - and right side too.

CodePudding user response:

option 1

Splitting on spaces is an option, if you have a single word for the last two columns. Use rsplit:

df['column1'].str.rsplit(n=2, expand=True)

output:

        0    1      2
0  abc 33  aaa  9g98f
1     cde  aaa  95fwf
2  12 faf  bbb  92gcs
3     faf  bbb  7t87f

NB. this doesn't work with the updated example

option 2

Alternatively, to split on the provided delimiter:

df[['new_column1', 'new_column2']] = [a.split(f' {b} ') for a,b in
                                      zip(df['column1'], df['column2'])]

output:

                column1 column2 new_column1 new_column2
0  abc 33 aaa 9g98f 333     aaa      abc 33   9g98f 333
1         cde aaa 95fwf     aaa         cde       95fwf
2      12 faf bbb 92gcs     bbb      12 faf       92gcs
3         faf bbb 7t87f     bbb         faf       7t87f

option 3

Finally, if you have many time the same delimiters and many rows, it might be worth using vectorial splitting per group:

(df
 .groupby('column2')
 .apply(lambda g: g['column1'].str.split(f'\s*{g.name}\s*', expand=True)) 
)

output:

        0          1
0  abc 33  9g98f 333
1     cde      95fwf
2  12 faf      92gcs
3     faf      7t87f

CodePudding user response:

Not sure what you mean with split by space is not an option, can you give an example for that? As I see it the logic you want to achieve is the following: the new column should contain the last part after the last space and the other column should contain the rest, right? If that is the assumption what reproduces your output would be:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.array([['abc 33 aaa 9g98f', 'aaa'], ['cde aaa 95fwf', 'aaa'], ['12 faf bbb 92gcs', 'bbb'], ['faf bbb 7t87f', 'bbb']]), columns=['column1', 'column2'])


df["new_column1"] = [ ' '.join(s.split(' ')[:-2]) for s in df["column1"].values ]
df["new_column2"] = [ s.split(' ')[-1] for s in df["column1"].values ]

OK: Based on the updated question, here is how I understand it now:

df[["new_column1","new_column2"]] =  [ a.split(b) for a,b in df[["column1","column2"]].values ]
  • Related