Pandas regex: extract url information from column

Time:04-15

import pandas as pd    

d = {"Device_Type" : ["AXO145","TRU151","ZOD231","YRT326","LWR245"],
 "Stat_Access_Link" : ["<url>https://xcd32112.smart_meter.com</url>",
                       "<url>http://tXh67.dia_meter.com</url>",
                       "<url>https://yT5495.smart_meter.com</url>",
                       "<url>https://ret323_TRu.crown.com</url>",
                       "<url>https://luwr3243.celcius.com</url>"]}

df = pd.DataFrame(data = d)

I have a dataframe like this, and I need to extract the URL from inside the <url> tags using a regex, without the http:// or https:// prefix. The output should look like this:

Device_Type Stat_Access_Link
AXO145 xcd32112.smart_meter.com
TRU151 tXh67.dia_meter.com
ZOD231 yT5495.smart_meter.com
YRT326 ret323_TRu.crown.com
LWR245 luwr3243.celcius.com

Any help is appreciated.

CodePudding user response:

Do you really need a regex?

If every value is wrapped in exactly <url>...</url>, you can slice the tags off by position (5 leading and 6 trailing characters):

df['Stat_Access_Link'].str[5:-6]

Otherwise, you could use a regex:

df['Stat_Access_Link'].str.extract(r'<url>(.*)</url>', expand=False)

# OR

df['Stat_Access_Link'].str.extract(r'<url>([^<>]*)</url>', expand=False)

Output (note that the scheme is kept):

0    https://xcd32112.smart_meter.com
1          http://tXh67.dia_meter.com
2      https://yT5495.smart_meter.com
3        https://ret323_TRu.crown.com
4        https://luwr3243.celcius.com
Name: Stat_Access_Link, dtype: object
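
The output above still contains the http:// or https:// prefix, which the question wants removed. One possible adjustment, shown here as a minimal sketch on a two-row subset of the question's data, is to skip the scheme with an optional non-capturing group. The variable name `hosts` is only illustrative.

```python
import pandas as pd

# Two rows from the question's data, enough to cover both http and https.
df = pd.DataFrame({
    "Device_Type": ["AXO145", "TRU151"],
    "Stat_Access_Link": [
        "<url>https://xcd32112.smart_meter.com</url>",
        "<url>http://tXh67.dia_meter.com</url>",
    ],
})

# (?:https?://)? optionally matches the scheme without capturing it;
# ([^<>]*) captures everything up to the closing tag.
hosts = df["Stat_Access_Link"].str.extract(
    r"<url>(?:https?://)?([^<>]*)</url>", expand=False
)
print(hosts.tolist())
# ['xcd32112.smart_meter.com', 'tXh67.dia_meter.com']
```

Because the scheme group is optional, a value without any scheme would still be extracted unchanged.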

CodePudding user response:

str.extract is what you need:

import re
import pandas as pd

d = {"Device_Type" : ["AXO145","TRU151","ZOD231","YRT326","LWR245"],
 "Stat_Access_Link" : ["<url>https://xcd32112.smart_meter.com</url>",
                       "<url>http://tXh67.dia_meter.com</url>",
                       "<url>https://yT5495.smart_meter.com</url>",
                       "<url>https://ret323_TRu.crown.com</url>",
                       "<url>https://luwr3243.celcius.com</url>"]}

df = pd.DataFrame(d)
pattern = re.compile(r"(?<=://)(.*)(?=</url)")
df['Stat_Access_Link'] = df['Stat_Access_Link'].str.extract(pattern, expand=False)
print(df)

Output:

  Device_Type          Stat_Access_Link
0      AXO145  xcd32112.smart_meter.com
1      TRU151       tXh67.dia_meter.com
2      ZOD231    yT5495.smart_meter.com
3      YRT326      ret323_TRu.crown.com
4      LWR245      luwr3243.celcius.com
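
One caveat with `(.*)`: it is greedy, so it only behaves as intended when each cell holds a single tag, as in the question's data. The sketch below uses a hypothetical cell with two tags (not something the question's data contains) to show the difference between the greedy pattern and a variant bounded by `[^<>]*`.

```python
import pandas as pd

# Hypothetical cell with two tags; the example.com hosts are made up.
s = pd.Series(
    ["<url>https://a.example.com</url><url>https://b.example.com</url>"]
)

# Greedy: .* runs to the LAST "</url", swallowing the markup in between.
greedy = s.str.extract(r"(?<=://)(.*)(?=</url)", expand=False)

# Bounded: [^<>]* cannot cross a tag, so the match stops at the first "</url".
bounded = s.str.extract(r"(?<=://)([^<>]*)(?=</url)", expand=False)

print(greedy[0])   # a.example.com</url><url>https://b.example.com
print(bounded[0])  # a.example.com
```

Note that str.extract only returns the first match per cell; if every URL in such a cell were needed, str.extractall would be the method to look at.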