str.split by regex (complex pattern)

How do I split the ID from the annotation using a regex in the DataFrame below?

df=pd.DataFrame({"header":["SS50377_28860 All-trans-retinol 13,14-reductase"]})

The columns are supposed to look like this:

df_new=pd.DataFrame({"id":"SS50377_28860","header":["All-trans-retinol 13,14-reductase"]})

The following code doesn't work properly.

df.join(df["header"].str.split(r'\d ', 0, expand=True))

Thanks in advance!!

CodePudding user response：

You can split on one or more whitespace characters that sit between a digit and an uppercase letter:

df[['id', 'header']] = df['header'].str.split(r'(?<=\d)\s+(?=[A-Z])', n=1, expand=True)

Or, you may capture the ID pattern into one group and the rest into another:

df[['id', 'header']] = df['header'].str.extract(r'^([A-Z0-9]+_[A-Z0-9]+)\s+(.*)', expand=True)

Or, you may simply use Series.str.split on the first whitespace chunk:

df[['id', 'header']] = df['header'].str.split(r"\s+", n=1, expand=True)

Output:

>>> df
                              header             id
0  All-trans-retinol 13,14-reductase  SS50377_28860

Details:

(?<=\d)\s+(?=[A-Z]) - matches one or more whitespace characters (\s+) that are immediately preceded by a digit ((?<=\d)) and immediately followed by an uppercase ASCII letter ((?=[A-Z])).
^([A-Z0-9]+_[A-Z0-9]+)\s+(.*) - matches the start of the string (^), then captures one or more uppercase ASCII letters or digits, an underscore, and again one or more uppercase ASCII letters or digits into Group 1 (column "id"), then matches one or more whitespace characters (\s+), and finally captures the rest of the line into Group 2 (with (.*)).
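
To check the two patterns described above outside of pandas, here is a minimal sketch using Python's built-in re module on the sample string from the question. The variable names are only illustrative.

```python
import re

sample = "SS50377_28860 All-trans-retinol 13,14-reductase"

# Lookaround split: whitespace preceded by a digit and followed by an uppercase letter.
lookaround = re.compile(r"(?<=\d)\s+(?=[A-Z])")
split_parts = lookaround.split(sample, maxsplit=1)

# Capture groups: the ID goes into group 1, the rest of the line into group 2.
capture = re.compile(r"^([A-Z0-9]+_[A-Z0-9]+)\s+(.*)")
match = capture.match(sample)
groups = match.groups() if match else None

print(split_parts)  # ['SS50377_28860', 'All-trans-retinol 13,14-reductase']
print(groups)       # ('SS50377_28860', 'All-trans-retinol 13,14-reductase')
```

Both approaches give the same two pieces for this input. They only start to differ when the data deviates from the ID-then-uppercase-annotation shape.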

Which solution you choose depends on how varied your input is and how much validation you want to apply here.
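
To make that trade-off concrete, the sketch below runs the three patterns on two made-up header strings (they are assumptions for illustration, not rows from the original data): one whose annotation starts with a lowercase letter, and one with no ID at all. It uses the plain re module and a hypothetical helper named compare.

```python
import re

LOOKAROUND = re.compile(r"(?<=\d)\s+(?=[A-Z])")
CAPTURE = re.compile(r"^([A-Z0-9]+_[A-Z0-9]+)\s+(.*)")
WHITESPACE = re.compile(r"\s+")


def compare(header):
    """Return what each of the three patterns produces for one header string."""
    m = CAPTURE.match(header)
    return {
        "lookaround": LOOKAROUND.split(header, maxsplit=1),
        "capture": m.groups() if m else None,
        "whitespace": WHITESPACE.split(header, maxsplit=1),
    }


# Hypothetical inputs, not taken from the question:
lowercase_start = compare("SS50377_00001 hypothetical protein")
no_id = compare("hypothetical protein")

print(lowercase_start)
print(no_id)
```

For the first string the lookaround split finds nothing to split on, because the annotation does not start with an uppercase letter, while the other two still separate the ID. For the second string the capture pattern rejects the row (in pandas, str.extract yields NaN for rows that do not match), whereas the plain whitespace split silently treats "hypothetical" as the ID. So the capture pattern is the strictest of the three and the whitespace split the most permissive.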