How do I split the ID from the annotation using a regex in the data frame below?
df=pd.DataFrame({"header":["SS50377_28860 All-trans-retinol 13,14-reductase"]})
The columns are supposed to look like this:
df_new=pd.DataFrame({"id":"SS50377_28860","header":["All-trans-retinol 13,14-reductase"]})
The following code doesn't work properly.
df.join(df["header"].str.split(r'\d ', 0, expand=True))
Thanks in advance!!
CodePudding user response:
You can split with one or more whitespaces between a digit and a letter:
df[['id', 'header']] = df['header'].str.split(r'(?<=\d)\s+(?=[A-Z])', n=1, expand=True)
Or, you may capture the ID pattern into one group and the rest into another:
df[['id', 'header']] = df['header'].str.extract(r'^([A-Z0-9]+_[A-Z0-9]+)\s+(.*)', expand=True)
Or, you may simply use Series.str.split on the first whitespace chunk:
df[['id', 'header']] = df['header'].str.split(r"\s+", n=1, expand=True)
Output:
>>> df
                              header             id
0  All-trans-retinol 13,14-reductase  SS50377_28860
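To check that the three approaches agree on the sample row, here is a minimal self-contained sketch. The helper make_df and the names df1, df2, df3 are only for illustration and are not part of the original question.

```python
import pandas as pd

def make_df():
    # Sample frame from the question (helper name is illustrative)
    return pd.DataFrame({"header": ["SS50377_28860 All-trans-retinol 13,14-reductase"]})

# 1) Split on whitespace sitting between a digit and an uppercase letter
df1 = make_df()
df1[["id", "header"]] = df1["header"].str.split(r"(?<=\d)\s+(?=[A-Z])", n=1, expand=True)

# 2) Capture the ID and the rest of the line into two groups
df2 = make_df()
df2[["id", "header"]] = df2["header"].str.extract(r"^([A-Z0-9]+_[A-Z0-9]+)\s+(.*)", expand=True)

# 3) Split once on the first run of whitespace
df3 = make_df()
df3[["id", "header"]] = df3["header"].str.split(r"\s+", n=1, expand=True)

print(df1)
print(df2)
print(df3)
```

Note that the column order is header, id in all three cases, because "header" already exists and "id" is appended.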
Details:
(?<=\d)\s+(?=[A-Z]) - matches one or more whitespaces (\s+) that are immediately preceded by a digit ((?<=\d)) and immediately followed by an uppercase ASCII letter ((?=[A-Z])).
^([A-Z0-9]+_[A-Z0-9]+)\s+(.*) - matches the start of the string (^), then captures one or more uppercase ASCII letters or digits, an underscore (_), and again one or more uppercase ASCII letters or digits into Group 1 (column "id"), then matches one or more whitespaces (\s+), and finally captures the rest of the line into Group 2 (with (.*)).
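If you want to see what the two patterns do outside pandas, a small sketch with the standard re module can help. The variable names here are illustrative only.

```python
import re

text = "SS50377_28860 All-trans-retinol 13,14-reductase"

# Lookaround split: only the whitespace is consumed;
# the digit before it and the letter after it stay in the pieces
parts = re.split(r"(?<=\d)\s+(?=[A-Z])", text, maxsplit=1)

# Capture-group match: Group 1 is the ID, Group 2 is the rest of the line
match = re.match(r"^([A-Z0-9]+_[A-Z0-9]+)\s+(.*)", text)
id_part, annotation = match.groups()

print(parts)
print(id_part, "|", annotation)
```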
Which solution you choose depends on how varied your input is and how much validation you want to apply.
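As a rough illustration of the validation difference, the sketch below adds a second, hypothetical row that has no ID (it is made up for this example and is not from the question). The plain whitespace split accepts it anyway, while the anchored extract leaves NaN for rows that do not match the ID pattern.

```python
import pandas as pd

# The second row is a hypothetical malformed entry, added only for illustration
df = pd.DataFrame({"header": [
    "SS50377_28860 All-trans-retinol 13,14-reductase",
    "unknown protein",
]})

# Plain whitespace split: no validation, so "unknown" lands in the first column
loose = df["header"].str.split(r"\s+", n=1, expand=True)

# Anchored extract: rows that do not match the ID pattern become NaN
strict = df["header"].str.extract(r"^([A-Z0-9]+_[A-Z0-9]+)\s+(.*)", expand=True)

print(loose)
print(strict)
```

So if some rows may lack a well-formed ID, the extract variant lets you spot them with isna() instead of silently mislabeling them.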