- I have dataframe like:
df = pd.DataFrame(np.array([['abc 33 aaa 9g98f 333', 'aaa'],
['cde aaa 95fwf', 'aaa'],
['12 faf bbb 92gcs', 'bbb'],
['faf bbb 7t87f', 'bbb']]),
columns=['column1', 'column2'])
len of column1 value may be different - from 2 to 5 words, so split with space not an option.
column1 column2
0 abc 33 aaa 9g98f 333 aaa
1 cde aaa 95fwf aaa
2 12 faf bbb 92gcs bbb
3 faf bbb 7t87f bbb
- Output should be like:
column1 new_column1 new_column2 column2
0 abc 33 aaa 9g98f abc 33 9g98f 333 aaa
1 cde aaa 95fwf cde 95fwf aaa
2 faf bbb 92gcs faf 92gcs bbb
3 12 faf bbb 7t87f 12 faf 7t87f bbb
That topic - How to split a dataframe string column into two columns? - didn't help coz of separator
UPD. Left "side" may have 2-5 words - and right side too.
CodePudding user response:
option 1
Splitting on spaces is an option, if you have a single word for the last two columns. Use rsplit
:
df['column1'].str.rsplit(n=2, expand=True)
output:
0 1 2
0 abc 33 aaa 9g98f
1 cde aaa 95fwf
2 12 faf bbb 92gcs
3 faf bbb 7t87f
NB. this doesn't work with the updated example
option 2
Alternatively, to split on the provided delimiter:
df[['new_column1', 'new_column2']] = [a.split(f' {b} ') for a,b in
zip(df['column1'], df['column2'])]
output:
column1 column2 new_column1 new_column2
0 abc 33 aaa 9g98f 333 aaa abc 33 9g98f 333
1 cde aaa 95fwf aaa cde 95fwf
2 12 faf bbb 92gcs bbb 12 faf 92gcs
3 faf bbb 7t87f bbb faf 7t87f
option 3
Finally, if you have many time the same delimiters and many rows, it might be worth using vectorial splitting per group:
(df
.groupby('column2')
.apply(lambda g: g['column1'].str.split(f'\s*{g.name}\s*', expand=True))
)
output:
0 1
0 abc 33 9g98f 333
1 cde 95fwf
2 12 faf 92gcs
3 faf 7t87f
CodePudding user response:
Not sure what you mean with split by space is not an option, can you give an example for that? As I see it the logic you want to achieve is the following: the new column should contain the last part after the last space and the other column should contain the rest, right? If that is the assumption what reproduces your output would be:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.array([['abc 33 aaa 9g98f', 'aaa'], ['cde aaa 95fwf', 'aaa'], ['12 faf bbb 92gcs', 'bbb'], ['faf bbb 7t87f', 'bbb']]), columns=['column1', 'column2'])
df["new_column1"] = [ ' '.join(s.split(' ')[:-2]) for s in df["column1"].values ]
df["new_column2"] = [ s.split(' ')[-1] for s in df["column1"].values ]
OK: Based on the updated question, here is how I understand it now:
df[["new_column1","new_column2"]] = [ a.split(b) for a,b in df[["column1","column2"]].values ]