How to extract a specific number of characters from a substring in Python with the same suffix

Time:09-21

I have this Python code to extract the image src from an HTML page:

listingid=[img['src'] for img in soup.select('[src]')]

Now I would like to extract the values from the following output and store them in a dictionary:

img/katalog/honda-crv-4x2-2.0-at-2001_30082022103745.jpg 
img/katalog/dujual--xpander-1.5-gls-2018-manual_26072022120227.jpg 
img/katalog/nissan-juke-1.5-cvt-2011-matic_19072022105636.jpg 
img/katalog/mitsubishi-xpander-1.5-exceed-manual-2018_08072022134628.jpg

I need the values below:

30082022103745
26072022120227
19072022105636
08072022134628

Is there any approach I can take to achieve this?

I'm wondering whether there is any syntax in Python to take the 14 characters before a specific suffix (like .jpg).

CodePudding user response:

You can use negative indices in Python slices to count from the end. Since you say in the question that you want the 14 characters before a 4-character suffix, a simple s[-18:-4] would do. This only holds while both the ID and the suffix keep those exact lengths.

With your code:

listingid = [img['src'] for img in soup.select('[src]')]
listingid = [s[-18:-4] for s in listingid]

or, in one statement:

listingid = [img['src'][-18:-4] for img in soup.select('[src]')]
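
As a quick check of the slice on the paths from the question, here is a minimal sketch that uses plain strings instead of a parsed page. The `paths` list and the `id_before_extension` helper are illustrative names, not part of the original code. The helper relies on `os.path.splitext`, on the assumption that you may also meet extensions that are not exactly four characters long.

```python
import os.path

# Sample values copied from the question (normally these come from soup.select).
paths = [
    "img/katalog/honda-crv-4x2-2.0-at-2001_30082022103745.jpg",
    "img/katalog/nissan-juke-1.5-cvt-2011-matic_19072022105636.jpg",
]

# Fixed-width slice: drop the 4-character suffix ".jpg", keep the 14 characters before it.
ids = [p[-18:-4] for p in paths]
print(ids)  # ['30082022103745', '19072022105636']


def id_before_extension(path, width=14):
    """Hypothetical helper: last `width` characters before the file extension."""
    stem, _ext = os.path.splitext(path)  # split off ".jpg", ".jpeg", ...
    return stem[-width:]


print(id_before_extension("img/katalog/example_30082022103745.jpeg"))  # 30082022103745
```

The plain slice is the shortest option when every path ends in ".jpg"; the helper trades a few lines for tolerance of other extension lengths.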

CodePudding user response:

If the number of characters is always exactly the same, slicing is the shortest option. If it differs, I would recommend splitting on the surrounding pattern with split():

[i.get('src').split('_')[-1].split('.')[0] for i in soup.select('[src]')]

or using regex:

import re
[re.search(r'.*?([0-9]+)\.[a-zA-Z]+$', i.get('src')).group(1) for i in soup.select('[src]')]
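
To see the regex approach in isolation, the following sketch runs it on plain strings. The `ID_PATTERN` and `srcs` names are made up for illustration, and the second path is a hypothetical example of a src with no numeric ID. It also guards against `re.search` returning `None`, which the one-liner above would turn into an `AttributeError`.

```python
import re

# One or more digits immediately before the final ".<letters>" extension.
ID_PATTERN = re.compile(r"([0-9]+)\.[a-zA-Z]+$")

srcs = [
    "img/katalog/dujual--xpander-1.5-gls-2018-manual_26072022120227.jpg",
    "img/logo.png",  # hypothetical path with no numeric ID
]

found = []
for src in srcs:
    match = ID_PATTERN.search(src)
    # search() returns None when nothing matches, so check before calling group().
    found.append(match.group(1) if match else None)

print(found)  # ['26072022120227', None]
```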

Example

from bs4 import BeautifulSoup

html = '''
<img src="img/katalog/honda-crv-4x2-2.0-at-2001_30082022103745.jpg">
<img src="img/katalog/mitsubishi-xpander-1.5-exceed-manual-2018_08072022134628.jpg">
'''
soup = BeautifulSoup(html, 'html.parser')

[i.get('src').split('_')[-1].split('.')[0] for i in soup.select('[src]')]

Output

['30082022103745', '08072022134628']
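
The question also asks for the values to be stored in a dictionary. One possible sketch, assuming the full src path is an acceptable key (the question does not say which key is wanted), is a dict comprehension over the same split logic. The `srcs` and `id_by_src` names are illustrative.

```python
# Sample src values; in the real code these would come from soup.select('[src]').
srcs = [
    "img/katalog/honda-crv-4x2-2.0-at-2001_30082022103745.jpg",
    "img/katalog/mitsubishi-xpander-1.5-exceed-manual-2018_08072022134628.jpg",
]

# Map each src to its ID: the text after the last "_" and before the next ".".
id_by_src = {src: src.rsplit("_", 1)[-1].split(".")[0] for src in srcs}

for src, listing_id in id_by_src.items():
    print(listing_id, "<-", src)
```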