Hello lovely people! I'm totally new to Python. I tried to scrape several URLs and ran into a problem with print.
I want to print and write the "shipment status" for each URL. I have two URLs, so ideally I should get two results.
This is my code:
from bs4 import BeautifulSoup
import re
import urllib.request
import urllib.error
import urllib

# read urls of websites from text file
list_open = open("c:/Users/***/Downloads/web list.txt")
read_list = list_open.read()
line_in_list = read_list.split("\n")

for url in line_in_list:
    soup = BeautifulSoup(urllib.request.urlopen(url).read(), 'html')
    # parse something special in the file
    shipment = soup.find_all('span')
    Preparation = shipment[0]
    Sent = shipment[1]
    InTransit = shipment[2]
    Delivered = shipment[3]
    for p in shipment:
        # extract information
        print(url,';',"Preparation",Preparation.getText(),";","Sent",Sent.getText(),";","InTransit",InTransit.getText(),";","Delivered",Delivered.getText())

import sys

file_path = 'randomfile.txt'
sys.stdout = open(file_path, "w")
print(url,';',"Preparation",Preparation.getText(),";","Sent",Sent.getText(),";","InTransit",InTransit.getText(),";","Delivered",Delivered.getText())
I have two problems here:
- Problem one: I have only two URLs, but when I print the results, every line is repeated four times (as there are four "span"s). The output looks like this:
http://carmoov.fr/CfQd ; Preparation on 06/01/2022 at 17:45 ; Sent on 06/01/2022 at 18:14 ; InTransit ; Delivered on 07/01/2022 at 10:31
http://carmoov.fr/CfQd ; Preparation on 06/01/2022 at 17:45 ; Sent on 06/01/2022 at 18:14 ; InTransit ; Delivered on 07/01/2022 at 10:31
http://carmoov.fr/CfQd ; Preparation on 06/01/2022 at 17:45 ; Sent on 06/01/2022 at 18:14 ; InTransit ; Delivered on 07/01/2022 at 10:31
http://carmoov.fr/CfQd ; Preparation on 06/01/2022 at 17:45 ; Sent on 06/01/2022 at 18:14 ; InTransit ; Delivered on 07/01/2022 at 10:31
http://carmoov.fr/CfQh ; Preparation on 06/01/2022 at 11:00 ; Sent on 06/01/2022 at 18:14 ; InTransit ; Delivered on 07/01/2022 at 13:54
http://carmoov.fr/CfQh ; Preparation on 06/01/2022 at 11:00 ; Sent on 06/01/2022 at 18:14 ; InTransit ; Delivered on 07/01/2022 at 13:54
http://carmoov.fr/CfQh ; Preparation on 06/01/2022 at 11:00 ; Sent on 06/01/2022 at 18:14 ; InTransit ; Delivered on 07/01/2022 at 13:54
http://carmoov.fr/CfQh ; Preparation on 06/01/2022 at 11:00 ; Sent on 06/01/2022 at 18:14 ; InTransit ; Delivered on 07/01/2022 at 13:54
http://carmoov.fr/CfQh ; Preparation on 06/01/2022 at 11:00 ; Sent on 06/01/2022 at 18:14 ; InTransit ; Delivered on 07/01/2022 at 13:54
- Problem two: I tried to write the print output to a text file, but only one line appeared in the file:
http://carmoov.fr/CfQh ; Preparation on 06/01/2022 at 11:00 ; Sent on 06/01/2022 at 18:14 ; InTransit ; Delivered on 07/01/2022 at 13:54
I want to know what is wrong with the code; I only want to get two results, one per URL.
Your help is really appreciated! Thank you in advance!
CodePudding user response:
First question
You have two nested loops:
for url in line_in_list:
    for p in shipment:
        print(...)
The print is nested in the second loop. If you have 4 shipments per URL, that will lead to 4 prints per URL. Since you don't use p from for p in shipment, you can completely get rid of the second loop and move the print one indentation level to the left, like this:
for url in line_in_list:
    soup = BeautifulSoup(urllib.request.urlopen(url).read(), 'html')
    # parse something special in the file
    shipment = soup.find_all('span')
    Preparation = shipment[0]
    Sent = shipment[1]
    InTransit = shipment[2]
    Delivered = shipment[3]
    print(url,';',"Preparation",Preparation.getText(),";","Sent",Sent.getText(),";","InTransit",InTransit.getText(),";","Delivered",Delivered.getText())
Second question
sys.stdout = open(file_path, "w")
print(url,';',"Preparation",Preparation.getText(),";","Sent",Sent.getText(),";","InTransit",InTransit.getText(),";","Delivered",Delivered.getText())
Without the file keyword argument, print writes to sys.stdout, which is by default your terminal output. There is only one print after sys.stdout = ..., so only one line will be written to the file.
There's another way to print to a file:
with open('demo.txt', 'a') as f:
    print('Hello world', file=f)
The with keyword will ensure the file is closed even if an exception is raised.
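As a side note, here is a minimal sketch (demo_w.txt and demo_a.txt are made-up file names) of why reopening a file in "w" mode inside a loop keeps only the last line, while "a" keeps them all:

# "w" truncates the file every time it is opened, so each new write wipes
# out whatever was there before; only the last line survives.
for line in ["first", "second"]:
    with open("demo_w.txt", "w") as f:
        print(line, file=f)
# demo_w.txt now contains only: second

# "a" appends instead, so both lines are kept.
for line in ["first", "second"]:
    with open("demo_a.txt", "a") as f:
        print(line, file=f)
# demo_a.txt now contains: first and second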
Both combined
From what I understood, you want to print two lines to the file. Here's a solution:
from bs4 import BeautifulSoup
import urllib.request
import urllib.error
import urllib

# read urls of websites from text file
list_open = open("c:/Users/***/Downloads/web list.txt")
read_list = list_open.read()
line_in_list = read_list.split("\n")

file_path = "randomfile.txt"

for url in line_in_list:
    soup = BeautifulSoup(urllib.request.urlopen(url).read(), "html")
    # parse something special in the file
    shipment = soup.find_all("span")
    Preparation = shipment[0]
    Sent = shipment[1]
    InTransit = shipment[2]
    Delivered = shipment[3]
    # open in append mode so each url adds a new line instead of overwriting
    with open(file_path, "a") as f:
        f.write(
            f"{url} ; Preparation {Preparation.getText()}; Sent {Sent.getText()}; InTransit {InTransit.getText()}; Delivered {Delivered.getText()}\n"
        )
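One more small detail, not strictly part of the question: list_open is never closed. A minimal sketch of reading the URL file with a with block as well (same redacted path as in the question), which also skips any blank line left by a trailing newline:

# read the url list with a context manager so the file is closed automatically;
# skipping empty lines avoids calling urlopen("") if the file ends with a newline
with open("c:/Users/***/Downloads/web list.txt") as list_open:
    line_in_list = [line.strip() for line in list_open if line.strip()]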
CodePudding user response:
The first point is caused by iterating over shipment - just delete the for loop and correct the indentation of the print():
for url in line_in_list:
    soup = BeautifulSoup(urllib.request.urlopen(url).read(), 'html')
    # parse something special in the file
    shipment = soup.find_all('span')
    Preparation = shipment[0]
    Sent = shipment[1]
    InTransit = shipment[2]
    Delivered = shipment[3]
    print(url,';',"Preparation",Preparation.getText(),";","Sent",Sent.getText(),";","InTransit",InTransit.getText(),";","Delivered",Delivered.getText())
The second issue is caused because you call the writing outside the loop and not in append mode - you will end up with this as your loop:
# open file in append mode
with open('somefile.txt', 'a') as f:
    # start iterating your urls
    for url in line_in_list:
        soup = BeautifulSoup(urllib.request.urlopen(url).read(), 'html')
        # parse something special in the file
        shipment = soup.find_all('span')
        Preparation = shipment[0]
        Sent = shipment[1]
        InTransit = shipment[2]
        Delivered = shipment[3]
        # create output text
        line = f'{url};Preparation{Preparation.getText()};Sent{Sent.getText()};InTransit{InTransit.getText()};Delivered{Delivered.getText()}'
        # print output text
        print(line)
        # append output text to file
        f.write(line + '\n')
And you can delete:
import sys
file_path = 'randomfile.txt'
sys.stdout = open(file_path, "w")
print(url,';',"Preparation",Preparation.getText(),";","Sent",Sent.getText(),";","InTransit",InTransit.getText(),";","Delivered",Delivered.getText())
Example of a slightly more optimized version of the code:
from bs4 import BeautifulSoup
import urllib.request
import urllib.error
import urllib

# read urls of websites from text file
list_open = open("c:/Users/***/Downloads/web list.txt")
read_list = list_open.read()
line_in_list = read_list.split("\n")

file_path = "randomfile.txt"

with open(file_path, 'a', encoding='utf-8') as f:
    for url in line_in_list:
        soup = BeautifulSoup(urllib.request.urlopen(url).read(), 'html')
        # parse something special in the file
        shipment = list(soup.select_one('#progress').stripped_strings)
        line = f"{url},{';'.join([':'.join(x) for x in list(zip(shipment[::2], shipment[1::2]))])}"
        print(line)
        f.write(line + '\n')
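For reference, the shipment[::2] / shipment[1::2] slicing in the line above pairs each label with the text that follows it; a small illustration using strings taken from the question's output:

# even-indexed items are the labels, odd-indexed items are the values
shipment = ["Preparation", "on 06/01/2022 at 17:45",
            "Sent", "on 06/01/2022 at 18:14"]
pairs = list(zip(shipment[::2], shipment[1::2]))
# -> [('Preparation', 'on 06/01/2022 at 17:45'), ('Sent', 'on 06/01/2022 at 18:14')]
print(";".join(":".join(x) for x in pairs))
# -> Preparation:on 06/01/2022 at 17:45;Sent:on 06/01/2022 at 18:14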
CodePudding user response:
import sys

list_open = open("c:/Users/***/Downloads/web list.txt")
read_list = list_open.read()
line_in_list = read_list.split("\n")

file_path = 'randomfile.txt'
sys.stdout = open(file_path, "w")
There are actually four spans; try this:
for url in line_in_list:
    soup = BeautifulSoup(urllib.request.urlopen(url).read(), 'html')
    # parse something special in the file
    shipments = soup.find_all("span")  # there are four spans actually
    sys.stdout.write('Url ' + url + '; Preparation ' + shipments[0].getText() + '; Sent ' + shipments[1].getText() + '; InTransit ' + shipments[2].getText() + '; Delivered ' + shipments[3].getText())
    # change line
    sys.stdout.write("\n")
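A small follow-up, not in the original answer: if you redirect sys.stdout like this, it is worth restoring it and closing the file afterwards, otherwise later print() calls keep going into the file instead of the terminal:

import sys

out = open(file_path, "w")   # same file_path as above
sys.stdout = out

# ... the loop above writes through sys.stdout ...

sys.stdout = sys.__stdout__  # restore printing to the terminal
out.close()                  # flush and close the output file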