I have following column in a DataFrame.
col1
['SNOMEDCT_US:32113001', 'UMLS:C0265660']
['UMLS:C2674738', 'UMLS:C2674739']
['UMLS:C1290857', 'SNOMEDCT_US:118930001', 'UMLS:C123455']
I would like extract the value after UMLS: and store it in another column. I am trying following lines of code but I am not getting the expected output.
df['col1'].str.extract(r'\['.*UMLS:(.*)]')
The expected output is:
col1 col2
['SNOMEDCT_US:32113001', 'UMLS:C0265660'] C0265660
['UMLS:C2674738', 'UMLS:C2674739'] C2674738, C2674739
['UMLS:C1290857', 'SNOMEDCT_US:118930001', 'UMLS:C123455'] C1290857, C123455
CodePudding user response:
You can use DataFrame.explode
to turn the rows of lists into rows of individual strings. Then, you can use Series.str.extract
to match the desired regular expression. Finally, you can use DataFrame.groupby
and DataFrame.agg
to turn col1
back into its original form with col2
as desired:
df = df.explode("col1")
df["col2"] = df["col1"].str.extract(r"UMLS:(. )")
df = df.groupby(level=0).agg({
"col1": list,
"col2": lambda x: ", ".join(item for item in x if item == item)
})
This outputs:
col1 col2
0 [SNOMEDCT_US:32113001, UMLS:C0265660] C0265660
1 [UMLS:C2674738, UMLS:C2674739] C2674738, C2674739
2 [UMLS:C1290857, SNOMEDCT_US:118930001, UMLS:C1... C1290857, C123455
CodePudding user response:
I used a different re that I tested at https://regex101.com/
UMLS:(\w*)
With the following command, I got a new column with the data formatted as you desired:
df['new'] = df['input'].apply(lambda x: re.findall(r"UMLS:(\w*)",x)).apply(lambda x: ','.join(map(str,x)))
The first .apply() function is based on this answer. The findall function returns a list ([C2674738, C2674739]).
Since you want a string with as many matches as are found, the second apply() function (based on this answer) will convert the list into a comma delimited string.
I hope there is a more elegant answer :-)