Extract elements within a string into a nested list

I am working on a routing project. A route looks like this: "CNSHG(B)-PAMIT(R)-COCTG(B)-USHOU(R)-COCTG(B)-USMSY", and I want to break it into a nested list. A route contains multiple segments: for example, CNSHG-PAMIT is one segment transported using B, PAMIT-COCTG is the next segment transported using R (i.e., Rail), and so on.

Input:

"CNSHG(B)-PAMIT(R)-COCTG(B)-USHOU(R)-COCTG(B)-USMSY"

The output should be like this:

[[CNSHG, PAMIT, B],[PAMIT, COCTG, R],[COCTG, USHOU, B],[USHOU, COCTG, R],[COCTG, USMSY, B]]

I have tried using regex with the code below, but it didn't work.

route.str.extract('(.)\s\((.\d+)')

Thanks a lot.

CodePudding user response:

You can use

import pandas as pd
df = pd.DataFrame({'col':["CNSHG(B)-PAMIT(R)-COCTG(B)-USHOU(R)-COCTG(B)-USMSY"]})
df['result'] = df['col'].str.findall(r'(\w+)\((?=[^()]*\)-(\w+))([^()]*)\)')

Output of df['result']:

[('CNSHG', 'PAMIT', 'B'), ('PAMIT', 'COCTG', 'R'), ('COCTG', 'USHOU', 'B'), ('USHOU', 'COCTG', 'R'), ('COCTG', 'USMSY', 'B')]
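The question asks for nested lists rather than a list of tuples. A minimal sketch of that last step, assuming plain Python `re` instead of pandas and the answer's pattern written with `+` quantifiers; the helper name `route_to_segments` is hypothetical and only used for illustration:

```python
import re

# The answer's pattern, with + quantifiers: origin, then a lookahead that
# captures the destination without consuming it, then the transport mode.
PATTERN = re.compile(r'(\w+)\((?=[^()]*\)-(\w+))([^()]*)\)')

def route_to_segments(route):
    """Return [[origin, destination, mode], ...] for one route string."""
    # re.findall yields one tuple per match; convert each tuple to a list.
    return [list(match) for match in PATTERN.findall(route)]

route = "CNSHG(B)-PAMIT(R)-COCTG(B)-USHOU(R)-COCTG(B)-USMSY"
segments = route_to_segments(route)
print(segments)
```

With a DataFrame, the same helper could likely be applied per row, e.g. df['col'].apply(route_to_segments).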

Regex details:

(\w+) - Group 1: one or more word chars
\( - a ( char
(?=[^()]*\)-(\w+)) - a positive lookahead that requires (immediately to the right of the current location):
- [^()]* - zero or more chars other than ( and )
- \)- - a )- string
- (\w+) - Group 2: one or more word chars
([^()]*) - Group 3: zero or more chars other than ( and )
\) - a ) char.
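To see why the destination is captured inside a lookahead, it may help to compare against a variant that consumes it. The sketch below uses plain `re`; the consuming pattern is a hypothetical alternative written only for comparison and is not part of the answer:

```python
import re

route = "CNSHG(B)-PAMIT(R)-COCTG(B)-USHOU(R)-COCTG(B)-USMSY"

# Hypothetical variant without a lookahead: the destination code is consumed
# by the match, so it cannot also serve as the origin of the next segment.
consuming = re.findall(r'(\w+)\(([^()]*)\)-(\w+)', route)

# Lookahead version: the destination is captured but not consumed, so the
# next match can start at that same code.
peeking = re.findall(r'(\w+)\((?=[^()]*\)-(\w+))([^()]*)\)', route)

print(len(consuming))  # 3: every other segment is skipped
print(len(peeking))    # 5: all segments are found
```

Note that the group order differs between the two: the consuming variant returns (origin, mode, destination), while the lookahead version returns (origin, destination, mode).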