How to extract numbers from a multi-line string in Python 3

I have been struggling to extract all the numbers from a multi-line string called price (the product name is fine).

I'm using Python to scrape a website for product names and prices. The scrape produces the output below, which is written to the file exactly like this:

Master C141,"

6

                    999

                        .
                        -
                "
Master 220,"

6

                    499

                        .
                        -
                "
Master C170,"

12

                    499
                        .
                        -
                "

I have tried many different code examples from Stack Overflow and several other sites, but none have worked. What I would like to get is output like this:

Master C141, 6999

Master 220, 6499

Master C170, 12499

Here is the code:

content = driver.page_source

products=[] #List to store name of the product
prices=[] #List to store price of the product

soup = BeautifulSoup(content,"html.parser")
for a in soup.findAll('div', attrs={'class':'c-product-listing__col'}):
    name=a.find('h2', attrs={'class':'c-product-card__heading'})
    price=a.find('div', attrs={'class':'c-price-tag__price'})

    print(re.findall(r"\d+", price.text))
    
    products.append(name.text)
    prices.append(price.text)

df = pd.DataFrame({'Product Name':products,'Price':prices}) 
df.to_csv('products.txt', index=False, encoding='utf-8')

CodePudding user response：

This answer assumes we are starting with the text in your question:

import re

output = re.sub(r'\b(Master \w+,).*?(\d+).*?(\d+).*?(?=\bMaster|$)', r'\1 \2\3\n', text, flags=re.S).strip()
print(output)

This prints:

Master C141, 6999
Master 220, 6499
Master C170, 12499

Here we capture the Master term along with the two groups of digits that follow it, then combine them to produce the output you want. Note that we use the re.S (dot-all) flag so that . can match across line breaks.
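To make the substitution above easier to try out, here is a self-contained sketch. It assumes the raw file content from the question is held in a variable named text (that name, and the way the pattern is split across commented pieces, are choices made here for illustration only).

```python
import re

# Sample text copied from the question; the variable name `text` is an assumption.
text = '''Master C141,"

6

                    999

                        .
                        -
                "
Master 220,"

6

                    499

                        .
                        -
                "
Master C170,"

12

                    499
                        .
                        -
                "
'''

pattern = re.compile(
    r'\b(Master \w+,)'     # group 1: product name up to and including the comma
    r'.*?(\d+)'            # group 2: first run of digits
    r'.*?(\d+)'            # group 3: second run of digits
    r'.*?(?=\bMaster|$)',  # skip the rest, stopping before the next product or the end
    flags=re.S,            # let . match newlines as well
)

# Rebuild each record as "<name>, <digits><digits>" on its own line.
output = pattern.sub(r'\1 \2\3\n', text).strip()
print(output)
```

This only works when every product name starts with the literal word Master and every price is split into exactly two digit groups, as in the sample.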

CodePudding user response：

What happens?

You already have a working solution in your code; the only problem is that you are not appending its result to your list.

How to fix?

Append the result of your regex to your list:

prices.append(re.findall(r"\d+", price.text))

Added to your example:

...
products=[] #List to store name of the product
prices=[] #List to store price of the product

soup = BeautifulSoup(content,"html.parser")
for a in soup.find_all('div', attrs={'class':'c-product-listing__col'}):
    name=a.find('h2', attrs={'class':'c-product-card__heading'})
    price=a.find('div', attrs={'class':'c-price-tag__price'})

    products.append(name.text)
    prices.append(re.findall(r"\d+", price.text))
...
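One thing to keep in mind is that re.findall returns a list of the matched digit groups, not a single string. A minimal sketch of joining them into one value follows; raw_price is a made-up stand-in for price.text, shaped like the scraped text in the question.

```python
import re

# Hypothetical stand-in for price.text, shaped like the scraped value in the question.
raw_price = "\n\n6\n\n                    999\n\n                        .\n                        -\n                "

parts = re.findall(r"\d+", raw_price)  # every run of digits, as a list of strings
price = "".join(parts)                 # glue the runs together into one string

print(parts)
print(price)
```

Appending price instead of parts stores a single value per product in the prices list.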

CodePudding user response：

OK, problem solved. Thanks to those who pointed me in the right direction. The code might not be optimal, but it works :)

.....
.....
content = driver.page_source

products=[] #List to store name of the product
prices=[] #List to store price of the product
ratings=[] #List to store rating of the product

soup = BeautifulSoup(content,"html.parser")
for a in soup.findAll('div', attrs={'class':'c-product-listing__col'}):
    name=a.find('h2', attrs={'class':'c-product-card__heading'})
    price=a.find('div', attrs={'class':'c-price-tag__price'})
    strProduct = name.text
    strPrice = price.text
    
    strProduct = re.match(r'[^,]+', strProduct)[0]
    strPrice = re.sub(r'\D', '', strPrice)
    
    products.append(strProduct)
    prices.append(strPrice)

df = pd.DataFrame({'Product Name':products,'Price':prices}) 
df.to_csv('products.csv', index=False, encoding='utf-8')
driver.quit()
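The two cleaning steps in the loop can be tried in isolation without a browser or a page. The helper name clean_row and the sample inputs below are hypothetical, and the extra strip() and the fallback for a name with no match are additions that are not in the code above.

```python
import re


def clean_row(name_text, price_text):
    """Return (name, price): name cut at the first comma, price reduced to its digits."""
    # [^,]+ matches everything before the first comma.
    name_match = re.match(r"[^,]+", name_text)
    # Fall back to the raw text if the name starts with a comma (no match).
    name = name_match[0].strip() if name_match else name_text.strip()
    # \D matches any non-digit character, so this keeps only the digits.
    price = re.sub(r"\D", "", price_text)
    return name, price


print(clean_row("Master C141", "\n\n6\n\n      999\n\n      .\n      -\n   "))
```

Because re.sub(r'\D', '', ...) removes every non-digit, it would also merge the digits of a decimal part into the number, so this suits prices like the ones shown in the question, which end in ".-".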