Looping through multiple URLs, overwriting data


I have successfully managed to pull all the information I need from a single URL, but I'm struggling to get the code to loop through the various pages and pull the information from each one. Currently my code runs through all the different page iterations but just rewrites the data from the first page when I call it to print.

The URL shows 20 results per page, so page 1 ends in 0, page 2 ends in 20, page 3 ends in 40, and so on. This is why I add 20 to x each time.

When I print the URL inside the loop it returns each URL ending in 0, 20, 40, 60 and 80 twice, so the list shows:

https://xxxxx.com/0
https://xxxxx.com/0
https://xxxxx.com/20
https://xxxxx.com/20
https://xxxxx.com/40
https://xxxxx.com/40

I could even accept that, but when I print(info) or write info to the CSV, it just overwrites itself multiple times and only ever returns the results for https://xxxxx.com/0

x = 0
while x < 100:
    x += 20


    for url in str(x):
        url = "https://xxxxx.com/0" + str(x)

        page = requests.get(url)

        soup = BeautifulSoup(page.content, "html.parser")
        lists = soup.find_all('li', class_="SearchPage__Result-gg133s-2 djuMQD")


        with open('C:\\Users\hay\Houses.csv', 'w', encoding='UTF8', newline="") as f:


            thewriter = writer(f)
            header = ['URL', 'address', 'price', 'beds', 'baths', 'ber']
            thewriter.writerow(header)


            for list in lists:

                url = list.find('a').attrs['href']
                address = list.find('p', class_="TitleBlock__Address-sc-1avkvav-8 dzihyY")
                price = list.find('div', class_="TitleBlock__Price-sc-1avkvav-4 hiFkJc")
                beds = list.find_all('p', class_="TitleBlock__CardInfoItem-sc-1avkvav-9 iLMdur")
                baths = list.find('p data-testid="baths"', class_="TitleBlock__CardInfoItem-sc-1avkvav-9 iLMdur")
                energyrating = list.find('div', class_="TitleBlock__BerContainer-sc-1avkvav-11 iXTpuT")

                info = [url, address, price, beds, baths, energyrating]
                thewriter.writerow(info)

CodePudding user response:

You have several problems here...

Overwriting

The first problem is that you are opening the file in w mode every time you iterate through the loop:

with open('C:\\Users\hay\Houses.csv', 'w', encoding='UTF8', newline="") as f:

The mode w creates a new file or truncates it if it already exists. If you want the information obtained in every iteration to be added to the end of the file, you should use the a mode, which creates a new file if it doesn't exist or appends the new data to it if it already does.

with open('C:\\Users\hay\Houses.csv', 'a', encoding='UTF8', newline="") as f:

A possible issue with this is that the information will be added to a single file every time you run the code, and it will never be cleared unless you manually delete the file. A solution to this is to open the file in w mode, but outside the loop:

with open('C:\\Users\hay\Houses.csv', 'w', encoding='UTF8', newline="") as f:
    thewriter = writer(f)
    header = ['URL', 'address', 'price', 'beds', 'baths', 'ber']
    thewriter.writerow(header)
    x = 0
    while x < 100:
        x += 20
        for url in str(x):
            # ... more code here
            thewriter.writerow(info)

This way, the file will be truncated every time you run the code, but not on every iteration of the loop.
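
For contrast, the caveat with the a mode is easy to reproduce: run the snippet below twice and the file ends up with two header rows, because nothing ever truncates it (Houses.csv here is just a stand-in file name):

from csv import writer

# run this twice: the file keeps growing and the header row is duplicated,
# since 'a' never truncates the existing contents
with open('Houses.csv', 'a', encoding='UTF8', newline='') as f:
    writer(f).writerow(['URL', 'address', 'price', 'beds', 'baths', 'ber'])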

For loop

If I understood right, you have to check these URLs:

https://xxxxx.com/0
https://xxxxx.com/20
https://xxxxx.com/40
https://xxxxx.com/60
[...]

With the loops you have, you are obtaining this list:

https://xxxxx.com/020
https://xxxxx.com/020
https://xxxxx.com/040
https://xxxxx.com/040
https://xxxxx.com/060
https://xxxxx.com/060
https://xxxxx.com/080
https://xxxxx.com/080
https://xxxxx.com/0100
https://xxxxx.com/0100
https://xxxxx.com/0100
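
The duplicates come from for url in str(x):, which iterates over the characters of a string, so the loop body runs once per digit of x; twice for "20" through "80", three times for "100":

# str(20) is the two-character string "20", so the body runs twice
for url in str(20):
    print(url)  # prints "2", then "0"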

You have to modify (and simplify) your loops to this:

x = 0
while x < 100:
    url = "https://xxxxx.com/"   str(x)
    # ... more code here
    x  = 20
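
Equivalently, since the offsets simply step from 0 to 80 in increments of 20, a for loop over range avoids managing the counter by hand (same sequence of pages, same behaviour):

for x in range(0, 100, 20):  # yields 0, 20, 40, 60, 80
    url = "https://xxxxx.com/" + str(x)
    # ... more code here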

Shadowing a Python built-in

You are naming a variable after the Python built-in list. That is not actually a reserved keyword, so it doesn't raise an error, but it shadows the built-in type for the rest of the scope; you should avoid it.
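
For example, after the loop below finishes, the name list is bound to a string, and the built-in type is no longer reachable under that name:

lists = ['a', 'b']
for list in lists:  # rebinds the name "list" to each element in turn
    pass

list('abc')  # TypeError: 'str' object is not callable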

Complete solution

With these problems solved (plus the imports, and the baths lookup corrected so BeautifulSoup can actually match it), your code should look like this:

import requests
from bs4 import BeautifulSoup
from csv import writer

with open(r'C:\Users\hay\Houses.csv', 'w', encoding='UTF8', newline="") as f:
    thewriter = writer(f)
    header = ['URL', 'address', 'price', 'beds', 'baths', 'ber']
    thewriter.writerow(header)

    x = 0
    while x < 100:
        url = "https://xxxxx.com/" + str(x)
        page = requests.get(url)
        soup = BeautifulSoup(page.content, "html.parser")
        lists = soup.find_all('li', class_="SearchPage__Result-gg133s-2 djuMQD")

        for l in lists:
            url = l.find('a').attrs['href']
            address = l.find('p', class_="TitleBlock__Address-sc-1avkvav-8 dzihyY")
            price = l.find('div', class_="TitleBlock__Price-sc-1avkvav-4 hiFkJc")
            beds = l.find_all('p', class_="TitleBlock__CardInfoItem-sc-1avkvav-9 iLMdur")
            # the tag name and the data-testid filter have to be passed
            # separately for BeautifulSoup to match <p data-testid="baths">
            baths = l.find('p', attrs={'data-testid': 'baths'},
                           class_="TitleBlock__CardInfoItem-sc-1avkvav-9 iLMdur")
            energyrating = l.find('div', class_="TitleBlock__BerContainer-sc-1avkvav-11 iXTpuT")

            info = [url, address, price, beds, baths, energyrating]
            thewriter.writerow(info)

        x += 20
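
One further refinement, not strictly needed for the fix: find() returns Tag objects (or None when nothing matches), so the CSV cells will contain raw HTML. Extracting the text, with a guard for missing elements, gives cleaner output. A sketch, where text_or_blank is a hypothetical helper:

def text_or_blank(tag):
    # tag is a BeautifulSoup Tag, or None if find() matched nothing
    return tag.get_text(strip=True) if tag else ''

info = [url,
        text_or_blank(address),
        text_or_blank(price),
        # beds comes from find_all(), so it is a list of tags
        ' / '.join(b.get_text(strip=True) for b in beds),
        text_or_blank(baths),
        text_or_blank(energyrating)]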