I have been struggling to get all the numbers out of a multiline string called price (product is fine).
I'm using Python to scrape a website for product names and prices, which produces the output below, written to the file exactly like this:
Master C141,"
6
999
.
-
"
Master 220,"
6
499
.
-
"
Master C170,"
12
499
.
-
"
I have tried many different code examples from Stack Overflow and several other sites, but none have worked. What I would like to get is output like this:
Master C141, 6999
Master 220, 6499
Master C170, 12499
Here is the code:
content = driver.page_source
products = []  # List to store name of the product
prices = []  # List to store price of the product
soup = BeautifulSoup(content, "html.parser")
for a in soup.findAll('div', attrs={'class': 'c-product-listing__col'}):
    name = a.find('h2', attrs={'class': 'c-product-card__heading'})
    price = a.find('div', attrs={'class': 'c-price-tag__price'})
    print(re.findall(r"\d+", price.text))
    products.append(name.text)
    prices.append(price.text)
df = pd.DataFrame({'Product Name': products, 'Price': prices})
df.to_csv('products.txt', index=False, encoding='utf-8')
CodePudding user response:
This answer assumes we are starting with the text in your question:
output = re.sub(r'\b(Master \w+,).*?(\d+).*?(\d+).*?(?=\bMaster|$)', r'\1 \2\3\n', text, flags=re.S).strip()
print(output)
This prints:
Master C141, 6999
Master 220, 6499
Master C170, 12499
Here we capture the Master term along with the two groups of digits that follow it, then combine them to produce the output you want. Note that we use the dot-all flag (re.S) so that the pattern can match content across lines.
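To try the substitution on its own, here is a minimal, self-contained sketch. The variable `text` is an assumption: it holds the file content exactly as shown in the question, and the quantifiers are written out as `\w+` and `\d+`.

```python
import re

# Assumed input: the file content exactly as shown in the question.
text = '''Master C141,"
6
999
.
-
"
Master 220,"
6
499
.
-
"
Master C170,"
12
499
.
-
"
'''

# Group 1: product name with its comma; groups 2 and 3: the two digit runs.
# The lookahead stops each match right before the next "Master" or at the end.
pattern = r'\b(Master \w+,).*?(\d+).*?(\d+).*?(?=\bMaster|$)'
output = re.sub(pattern, r'\1 \2\3\n', text, flags=re.S).strip()
print(output)
```

This only works as long as every product name starts with "Master" and every price is split into exactly two digit groups, as in the sample.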
CodePudding user response:
What happens?
You already have a solution built in; the only problem is that you are not appending it to your list.
How to fix?
Append the result of your regex to your list (re.findall returns a list of digit groups, so join them into one string first):
prices.append("".join(re.findall(r"\d+", price.text)))
Added to your example:
...
products = []  # List to store name of the product
prices = []  # List to store price of the product
soup = BeautifulSoup(content, "html.parser")
for a in soup.find_all('div', attrs={'class': 'c-product-listing__col'}):
    name = a.find('h2', attrs={'class': 'c-product-card__heading'})
    price = a.find('div', attrs={'class': 'c-price-tag__price'})
    products.append(name.text)
    prices.append("".join(re.findall(r"\d+", price.text)))
...
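Keep in mind that `re.findall` returns a list of strings, so the digit groups have to be joined to get a single price such as `6999`. A minimal sketch, where `price_text` is a hypothetical stand-in for `price.text` as it appears in the question's output:

```python
import re

# Hypothetical stand-in for price.text as scraped from the page.
price_text = '\n6\n999\n.\n-\n'

digit_groups = re.findall(r"\d+", price_text)  # every run of digits, in order
joined_price = "".join(digit_groups)           # glue the runs into one string

print(digit_groups)
print(joined_price)
```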
CodePudding user response:
Ok, problem solved. Thanks to those who helped me in the right direction. The code might not be optimal, but it works :)
.....
.....
content = driver.page_source
products = []  # List to store name of the product
prices = []  # List to store price of the product
ratings = []  # List to store rating of the product
soup = BeautifulSoup(content, "html.parser")
for a in soup.findAll('div', attrs={'class': 'c-product-listing__col'}):
    name = a.find('h2', attrs={'class': 'c-product-card__heading'})
    price = a.find('div', attrs={'class': 'c-price-tag__price'})
    strProduct = name.text
    strPrice = price.text
    strProduct = re.match(r'[^,]+', strProduct)[0]  # keep the text before the first comma
    strPrice = re.sub(r'\D', '', strPrice)  # drop every non-digit character
    products.append(strProduct)
    prices.append(strPrice)
df = pd.DataFrame({'Product Name': products, 'Price': prices})
df.to_csv('products.csv', index=False, encoding='utf-8')
driver.quit()
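The two cleanup steps can be tried without a browser or the scraped page. This is only a sketch: the helper name `clean_row` and the sample strings are assumptions made for illustration, not part of the original script.

```python
import re

def clean_row(raw_name, raw_price):
    """Cut the name at the first comma and reduce the price to its digits."""
    # Raises TypeError if raw_name is empty or starts with a comma,
    # because re.match then returns None.
    name = re.match(r'[^,]+', raw_name)[0]
    price = re.sub(r'\D', '', raw_price)  # remove every non-digit character
    return name, price

# Hypothetical sample values shaped like the output in the question.
rows = [
    clean_row('Master C141', '\n6\n999\n.\n-\n'),
    clean_row('Master C170, extra text', '\n12\n499\n.\n-\n'),
]
print(rows)
```

Removing every non-digit with `\D` also drops decimal separators, which is fine for prices like `6 999.-` but would need rethinking if a price carried cents.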