I have a dataframe (atc_df) with a column called 'atc' made of fixed length strings with a fixed structure, and can be split in 5 levels of subcoding. Here is an example:
prin atc
0 Acarbosio A10BF01
1 Aceclofenac M01AB16
2 Aciclovir J05AB01
3 Acido acetilsalicilico B01AC06
4 Acido alendronico M05BA04
... ... ...
324 Voriconazolo J02AC03
325 Zofenopril C09AA15
326 Zofenopril idroclorotiazide C09BA15
327 Zolmitriptan N02CC03
328 Zonisamide N03AX15
I have a function that given an atc code would return a list of the 5 subcodes:
def atc_split(atc_str):
atc1 = atc_str[0]
atc2 = atc_str[1:3]
atc3 = atc_str[3]
atc4 = atc_str[4]
atc5 = atc_str[5:7]
return(atc1,atc2,atc3,atc4,atc5)
Two questions:
Is there a more efficient/elegant way of splitting the atc code into its five subcodes?
How can I best apply this function to the atc_df dataframe to add five new columns (atc1..atc5) to each row?
Thank you
CodePudding user response:
Use str.extract
:
df_atc[[f"atc{i 1}" for i in range(5)]] = df_atc["atc"].str.extract("(\w)(\d{2})(\w)(\w)(\d{2})")
>>> df_atc
prin atc atc1 atc2 atc3 atc4 atc5
0 Acarbosio A10BF01 A 10 B F 01
1 Aceclofenac M01AB16 M 01 A B 16
2 Aciclovir J05AB01 J 05 A B 01
3 Acido acetilsalicilico B01AC06 B 01 A C 06
4 Acido alendronico M05BA04 M 05 B A 04
If you want to have every code including the previous codes, you can slice using .str
:
df_atc["atc1"] = df_atc["atc"].str[0]
df_atc["atc2"] = df_atc["atc"].str[:3]
df_atc["atc3"] = df_atc["atc"].str[:4]
df_atc["atc4"] = df_atc["atc"].str[:5]
df_atc["atc5"] = df_atc["atc"]
>>> df_atc
prin atc atc1 atc2 atc3 atc4 atc5
0 Acarbosio A10BF01 A A10 A10B A10BF A10BF01
1 Aceclofenac M01AB16 M M01 M01A M01AB M01AB16
2 Aciclovir J05AB01 J J05 J05A J05AB J05AB01
3 Acido acetilsalicilico B01AC06 B B01 B01A B01AC B01AC06
4 Acido alendronico M05BA04 M M05 M05B M05BA M05BA04