Extract substring from urls stored in a pandas column

Time:09-16

A pandas column contains a series of URLs. I'd like to extract a substring from each URL. MRE code below.

s = pd.Series(['https://url-location/img/xxxyyy_image1.png'])

s.apply(lambda x: x[x.find("/") + 1:x.find("_")])

I'd like to extract xxxyyy from each URL and store the result in a new column.

CodePudding user response:

You can use

>>> s.str.extract(r'.*/([^_]+)')
        0
0  xxxyyy

See the regex demo. Details:

  • .* - zero or more chars other than line break chars as many as possible
  • / - a slash
  • ([^_]+) - Capturing group 1 (the value captured into this group will be the actual return value of Series.str.extract): one or more chars other than the _ char.
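To store the result in a new column, as the question asks, something like the following sketch might work. The DataFrame `df`, the column names `url` and `prefix`, and the second URL are hypothetical and only here for illustration. With a single capturing group, passing expand=False makes Series.str.extract return a Series instead of a one-column DataFrame, which is convenient for column assignment.

```python
import pandas as pd

# Hypothetical frame; the second URL is made up for illustration only.
df = pd.DataFrame({
    "url": [
        "https://url-location/img/xxxyyy_image1.png",
        "https://url-location/img/abc_image2.png",
    ]
})

# Greedy .*/ runs to the last slash; group 1 then captures
# everything up to (but not including) the first underscore.
df["prefix"] = df["url"].str.extract(r".*/([^_]+)", expand=False)

print(df["prefix"].tolist())
```

Rows that do not match the pattern at all would get NaN in the new column.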

CodePudding user response:

Also possible:

s.str.split('/').str[-1].str.split('_').str[0]
# Out[224]: xxxyyy

This works because the .str accessor supports indexing and slice notation on each element. For example, .str[-1] returns the last element of each list produced by the split.
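The chain can be easier to follow when broken into steps. This is only a sketch; the intermediate names `parts`, `filename`, and `prefix` are made up for illustration.

```python
import pandas as pd

s = pd.Series(["https://url-location/img/xxxyyy_image1.png"])

# Step 1: split every URL on "/"; each element becomes a list of pieces.
parts = s.str.split("/")

# Step 2: .str[-1] indexes into each list and keeps the last piece.
filename = parts.str[-1]

# Step 3: split that piece on "_" and keep the first part.
prefix = filename.str.split("_").str[0]

print(filename.tolist())
print(prefix.tolist())
```

Unlike the regex approach, a file name with no underscore is returned whole here, since splitting on a missing separator yields a one-element list.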
