Hello lovely people! I'm totally new to Python. I tried to scrape several URLs and ran into a problem with print.
I want to print and write the "shipment status" for each URL. I have two URLs, so ideally I should get two results.
This is my code:
from bs4 import BeautifulSoup
import re
import urllib.request
import urllib.error
import urllib

# read urls of websites from text file
list_open = open("c:/Users/***/Downloads/web list.txt")
read_list = list_open.read()
line_in_list = read_list.split("\n")

for url in line_in_list:
    soup = BeautifulSoup(urllib.request.urlopen(url).read(), 'html')
    # parse something special in the file
    shipment = soup.find_all('span')
    Preparation = shipment[0]
    Sent = shipment[1]
    InTransit = shipment[2]
    Delivered = shipment[3]
    for p in shipment:
        # extract information
        print(url,';',"Preparation",Preparation.getText(),";","Sent",Sent.getText(),";","InTransit",InTransit.getText(),";","Delivered",Delivered.getText())

import sys

file_path = 'randomfile.txt'
sys.stdout = open(file_path, "w")
print(url,';',"Preparation",Preparation.getText(),";","Sent",Sent.getText(),";","InTransit",InTransit.getText(),";","Delivered",Delivered.getText())
I have two problems here:
- Problem one: I have only two URLs, but when I print the results, every line is repeated four times (as there are four "span"s). The output looks like this:
http://carmoov.fr/CfQd ; Preparation on 06/01/2022 at 17:45 ; Sent on 06/01/2022 at 18:14 ; InTransit ; Delivered on 07/01/2022 at 10:31
http://carmoov.fr/CfQd ; Preparation on 06/01/2022 at 17:45 ; Sent on 06/01/2022 at 18:14 ; InTransit ; Delivered on 07/01/2022 at 10:31
http://carmoov.fr/CfQd ; Preparation on 06/01/2022 at 17:45 ; Sent on 06/01/2022 at 18:14 ; InTransit ; Delivered on 07/01/2022 at 10:31
http://carmoov.fr/CfQd ; Preparation on 06/01/2022 at 17:45 ; Sent on 06/01/2022 at 18:14 ; InTransit ; Delivered on 07/01/2022 at 10:31
http://carmoov.fr/CfQh ; Preparation on 06/01/2022 at 11:00 ; Sent on 06/01/2022 at 18:14 ; InTransit ; Delivered on 07/01/2022 at 13:54
http://carmoov.fr/CfQh ; Preparation on 06/01/2022 at 11:00 ; Sent on 06/01/2022 at 18:14 ; InTransit ; Delivered on 07/01/2022 at 13:54
http://carmoov.fr/CfQh ; Preparation on 06/01/2022 at 11:00 ; Sent on 06/01/2022 at 18:14 ; InTransit ; Delivered on 07/01/2022 at 13:54
http://carmoov.fr/CfQh ; Preparation on 06/01/2022 at 11:00 ; Sent on 06/01/2022 at 18:14 ; InTransit ; Delivered on 07/01/2022 at 13:54
http://carmoov.fr/CfQh ; Preparation on 06/01/2022 at 11:00 ; Sent on 06/01/2022 at 18:14 ; InTransit ; Delivered on 07/01/2022 at 13:54
- Problem two: I tried to write the print output to a text file, but only one line appeared in the file:
http://carmoov.fr/CfQh ; Preparation on 06/01/2022 at 11:00 ; Sent on 06/01/2022 at 18:14 ; InTransit ; Delivered on 07/01/2022 at 13:54
I want to know what is wrong with the code; I only want to get two results, one per URL.
Your help is really appreciated! Thank you in advance!
CodePudding user response:
First question
You have two nested loops:
for url in line_in_list:
    for p in shipment:
        print(...)
The print is nested in the second loop. If you have 4 shipments per URL, that will lead to 4 prints per URL. Since you don't use p from for p in shipment, you can completely get rid of the second loop and move the print one indentation level to the left, like this:
for url in line_in_list:
    soup = BeautifulSoup(urllib.request.urlopen(url).read(), 'html')
    # parse something special in the file
    shipment = soup.find_all('span')
    Preparation = shipment[0]
    Sent = shipment[1]
    InTransit = shipment[2]
    Delivered = shipment[3]
    print(url,';',"Preparation",Preparation.getText(),";","Sent",Sent.getText(),";","InTransit",InTransit.getText(),";","Delivered",Delivered.getText())
Second question
sys.stdout = open(file_path, "w")
print(url,';',"Preparation",Preparation.getText(),";","Sent",Sent.getText(),";","InTransit",InTransit.getText(),";","Delivered",Delivered.getText())
Without the file keyword argument, print writes to sys.stdout, which is by default your terminal output. There is only one print after sys.stdout = ..., so only one line will be written to the file.
There's another way to print to a file:
with open('demo.txt', 'a') as f:
    print('Hello world', file=f)
The with keyword will ensure the file is closed even if an exception is raised.
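As a side note, here is a minimal sketch (demo_w.txt and demo_a.txt are made-up file names) of why reopening a file in "w" mode inside a loop keeps only the last line, while "a" keeps them all:

# "w" truncates the file every time it is opened, so each new write wipes
# out whatever was there before; only the last line survives.
for line in ["first", "second"]:
    with open("demo_w.txt", "w") as f:
        print(line, file=f)
# demo_w.txt now contains only: second

# "a" appends instead, so both lines are kept.
for line in ["first", "second"]:
    with open("demo_a.txt", "a") as f:
        print(line, file=f)
# demo_a.txt now contains: first and second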
Both combined
From what I understood, you want to print two lines to the file. Here's a solution:
from bs4 import BeautifulSoup
import urllib.request
import urllib.error
import urllib

# read urls of websites from text file
list_open = open("c:/Users/***/Downloads/web list.txt")
read_list = list_open.read()
line_in_list = read_list.split("\n")

file_path = "randomfile.txt"

for url in line_in_list:
    soup = BeautifulSoup(urllib.request.urlopen(url).read(), "html")
    # parse something special in the file
    shipment = soup.find_all("span")
    Preparation = shipment[0]
    Sent = shipment[1]
    InTransit = shipment[2]
    Delivered = shipment[3]
    # open in append mode so each url adds a new line instead of overwriting
    with open(file_path, "a") as f:
        f.write(
            f"{url} ; Preparation {Preparation.getText()}; Sent {Sent.getText()}; InTransit {InTransit.getText()}; Delivered {Delivered.getText()}\n"
        )
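One more small detail, not strictly part of the question: list_open is never closed. A minimal sketch of reading the URL file with a with block as well (same redacted path as in the question), which also skips any blank line left by a trailing newline:

# read the url list with a context manager so the file is closed automatically;
# skipping empty lines avoids calling urlopen("") if the file ends with a newline
with open("c:/Users/***/Downloads/web list.txt") as list_open:
    line_in_list = [line.strip() for line in list_open if line.strip()]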
CodePudding user response:
The first point is caused by iterating over shipment - just delete the for loop and correct the indentation of the print():
for url in line_in_list:
    soup = BeautifulSoup(urllib.request.urlopen(url).read(), 'html')
    # parse something special in the file
    shipment = soup.find_all('span')
    Preparation = shipment[0]
    Sent = shipment[1]
    InTransit = shipment[2]
    Delivered = shipment[3]
    print(url,';',"Preparation",Preparation.getText(),";","Sent",Sent.getText(),";","InTransit",InTransit.getText(),";","Delivered",Delivered.getText())
The second issue is caused because you call the writing outside the loop and not in append mode - you will end up with this as your loop:
# open file in append mode
with open('somefile.txt', 'a') as f:
    # start iterating your urls
    for url in line_in_list:
        soup = BeautifulSoup(urllib.request.urlopen(url).read(), 'html')
        # parse something special in the file
        shipment = soup.find_all('span')
        Preparation = shipment[0]
        Sent = shipment[1]
        InTransit = shipment[2]
        Delivered = shipment[3]
        # create output text
        line = f'{url};Preparation{Preparation.getText()};Sent{Sent.getText()};InTransit{InTransit.getText()};Delivered{Delivered.getText()}'
        # print output text
        print(line)
        # append output text to file
        f.write(line + '\n')
And you can delete:
import sys
file_path = 'randomfile.txt'
sys.stdout = open(file_path, "w")
print(url,';',"Preparation",Preparation.getText(),";","Sent",Sent.getText(),";","InTransit",InTransit.getText(),";","Delivered",Delivered.getText())
Example of a slightly more optimized version of the code:
from bs4 import BeautifulSoup
import urllib.request
import urllib.error
import urllib

# read urls of websites from text file
list_open = open("c:/Users/***/Downloads/web list.txt")
read_list = list_open.read()
line_in_list = read_list.split("\n")

file_path = "randomfile.txt"

with open(file_path, 'a', encoding='utf-8') as f:
    for url in line_in_list:
        soup = BeautifulSoup(urllib.request.urlopen(url).read(), 'html')
        # parse something special in the file
        shipment = list(soup.select_one('#progress').stripped_strings)
        line = f"{url},{';'.join([':'.join(x) for x in list(zip(shipment[::2], shipment[1::2]))])}"
        print(line)
        f.write(line + '\n')
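For reference, the shipment[::2] / shipment[1::2] slicing in the line above pairs each label with the text that follows it; a small illustration using strings taken from the question's output:

# even-indexed items are the labels, odd-indexed items are the values
shipment = ["Preparation", "on 06/01/2022 at 17:45",
            "Sent", "on 06/01/2022 at 18:14"]
pairs = list(zip(shipment[::2], shipment[1::2]))
# -> [('Preparation', 'on 06/01/2022 at 17:45'), ('Sent', 'on 06/01/2022 at 18:14')]
print(";".join(":".join(x) for x in pairs))
# -> Preparation:on 06/01/2022 at 17:45;Sent:on 06/01/2022 at 18:14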
CodePudding user response:
import sys

list_open = open("c:/Users/***/Downloads/web list.txt")
read_list = list_open.read()
line_in_list = read_list.split("\n")

file_path = 'randomfile.txt'
sys.stdout = open(file_path, "w")
There are actually four spans; try this:
for url in line_in_list:
    soup = BeautifulSoup(urllib.request.urlopen(url).read(), 'html')
    # parse something special in the file
    shipments = soup.find_all("span")  # there are four spans actually
    sys.stdout.write('Url ' + url + '; Preparation ' + shipments[0].getText() + '; Sent ' + shipments[1].getText() + '; InTransit ' + shipments[2].getText() + '; Delivered ' + shipments[3].getText())
    # change line
    sys.stdout.write("\n")
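A small follow-up, not in the original answer: if you redirect sys.stdout like this, it is worth restoring it and closing the file afterwards, otherwise later print() calls keep going into the file instead of the terminal:

import sys

out = open(file_path, "w")   # same file_path as above
sys.stdout = out

# ... the loop above writes through sys.stdout ...

sys.stdout = sys.__stdout__  # restore printing to the terminal
out.close()                  # flush and close the output file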