Home > OS >  Regex to extract substring from pandas DataFrame column
Regex to extract substring from pandas DataFrame column

Time:01-19

I have following column in a DataFrame.

col1
['SNOMEDCT_US:32113001', 'UMLS:C0265660']
['UMLS:C2674738', 'UMLS:C2674739']
['UMLS:C1290857', 'SNOMEDCT_US:118930001', 'UMLS:C123455']

I would like extract the value after UMLS: and store it in another column. I am trying following lines of code but I am not getting the expected output.

df['col1'].str.extract(r'\['.*UMLS:(.*)]')

The expected output is:

col1                                                            col2
['SNOMEDCT_US:32113001', 'UMLS:C0265660']                       C0265660
['UMLS:C2674738', 'UMLS:C2674739']                              C2674738, C2674739
['UMLS:C1290857', 'SNOMEDCT_US:118930001', 'UMLS:C123455']      C1290857, C123455

CodePudding user response:

You can use DataFrame.explode to turn the rows of lists into rows of individual strings. Then, you can use Series.str.extract to match the desired regular expression. Finally, you can use DataFrame.groupby and DataFrame.agg to turn col1 back into its original form with col2 as desired:

df = df.explode("col1")
df["col2"] = df["col1"].str.extract(r"UMLS:(. )")
df = df.groupby(level=0).agg({
    "col1": list,
    "col2": lambda x: ", ".join(item for item in x if item == item)
})

This outputs:

                                                col1                col2
0              [SNOMEDCT_US:32113001, UMLS:C0265660]            C0265660
1                     [UMLS:C2674738, UMLS:C2674739]  C2674738, C2674739
2  [UMLS:C1290857, SNOMEDCT_US:118930001, UMLS:C1...   C1290857, C123455

CodePudding user response:

I used a different re that I tested at https://regex101.com/

UMLS:(\w*)

With the following command, I got a new column with the data formatted as you desired:

df['new'] = df['input'].apply(lambda x: re.findall(r"UMLS:(\w*)",x)).apply(lambda x: ','.join(map(str,x)))

The first .apply() function is based on this answer. The findall function returns a list ([C2674738, C2674739]).

Since you want a string with as many matches as are found, the second apply() function (based on this answer) will convert the list into a comma delimited string.

I hope there is a more elegant answer :-)

  • Related