Concat multiple CSVs with the same column names


I'm having trouble concatenating these pandas dataframes: I keep getting an error saying pandas.errors.InvalidIndexError: Reindexing only valid with uniquely valued Index objects. I am also trying to make my code less clunky and run more smoothly, and I was wondering if there is a way to get multiple pages into one CSV using Python. Any help would be great.

import requests
from bs4 import BeautifulSoup
import pandas as pd

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'}

URL = "https://www.collincad.org/propertysearch?situs_street=Willowgate&situs_street_suffix" \
      "=&isd[]=any&city[]=any&prop_type[]=R&prop_type[]=P&prop_type[]=MH&active[]=1&year=2021&sort=G&page_number=1"

t = URL + "&page_number="
URL2 = t + "2"
URL3 = t + "3"

s = requests.Session()

data = []

page = s.get(URL,headers=headers)
page2 = s.get(URL2, headers=headers)
page3 = s.get(URL3, headers=headers)

soup = BeautifulSoup(page.content, "lxml")
soup2 = BeautifulSoup(page2.content, "lxml")
soup3 = BeautifulSoup(page3.content, "lxml")


for row in soup.select('#propertysearchresults tr'):
    data.append([c.get_text(' ',strip=True) for c in row.select('td')])
for row in soup2.select('#propertysearchresults tr'):
    data.append([c.get_text(' ',strip=True) for c in row.select('td')])
for row in soup3.select('#propertysearchresults tr'):
    data.append([c.get_text(' ',strip=True) for c in row.select('td')])


df1 = pd.DataFrame(data[1:], columns=data[0])
df2 = pd.DataFrame(data[2:], columns=data[1])
df3 = pd.DataFrame(data[3:], columns=data[2])

final = pd.concat([df1, df2, df3], axis=0)

final.to_csv('Street.csv', encoding='utf-8')

CodePudding user response:

What happens?

As @Zach Young mentioned, data already holds all the rows you want to turn into one dataframe. So it is not a pandas issue; it is an issue with how the information is collected.
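A minimal illustration of the failure (hypothetical values): once a data row ends up being used as column labels, any repeated value in that row produces duplicate labels, and pd.concat cannot align frames whose columns contain duplicates:

import pandas as pd

df_a = pd.DataFrame([[1, 2]], columns=['a', 'b'])
df_b = pd.DataFrame([[3, 4]], columns=['c', 'c'])  # duplicate column labels

# Raises pandas.errors.InvalidIndexError:
# Reindexing only valid with uniquely valued Index objects
pd.concat([df_a, df_b], axis=0)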

How to fix?

An approach based on the code in your question is to select the table data more specifically. Note the tbody in the selector; this excludes the header row:

for row in soup.select('#propertysearchresults tbody tr'):
    data.append([c.get_text(' ',strip=True) for c in row.select('td')])

While creating your dataframe, you can additionally set the column headers from the table's thead:

pd.DataFrame(data, columns=[c.get_text(' ',strip=True) for c in soup.select('#propertysearchresults thead td')])

Example

This shows how to iterate over the different pages of the website that contain your tables:

import requests
from bs4 import BeautifulSoup
import pandas as pd

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'}

URL = "https://www.collincad.org/propertysearch?situs_street=Willowgate&situs_street_suffix" \
      "=&isd[]=any&city[]=any&prop_type[]=R&prop_type[]=P&prop_type[]=MH&active[]=1&year=2021&sort=G&page_number=1"

s = requests.Session()

data = []
while True:

    page = s.get(URL,headers=headers)
    soup = BeautifulSoup(page.content, "lxml")

    for row in soup.select('#propertysearchresults tbody tr'):
        data.append([c.get_text(' ',strip=True) for c in row.select('td')])

    if (a := soup.select_one('#page_selector strong + a')):
        URL = "https://www.collincad.org" + a['href']
    else:
        break


pd.DataFrame(data, columns=[c.get_text(' ',strip=True) for c in soup.select('#propertysearchresults thead td')])
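Since the question also asks how to get multiple pages into one CSV: assign the dataframe to a name and write it out (a minimal sketch; the file name Street.csv is taken from the question):

df = pd.DataFrame(data, columns=[c.get_text(' ', strip=True) for c in soup.select('#propertysearchresults thead td')])
df.to_csv('Street.csv', index=False, encoding='utf-8')  # all pages end up in one CSV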

Output

|     | Property ID ↓ Geographic ID ↓ | Owner Name                                 | Property Address                     | Legal Description                                | 2021 Market Value |
|-----|-------------------------------|--------------------------------------------|--------------------------------------|--------------------------------------------------|-------------------|
| 1   | 2709013 R-10644-00H-0010-1    | PARTHASARATHY SURESH & ANITHA HARIKRISHNAN | 12209 Willowgate Dr Frisco, TX 75035 | Ridgeview At Panther Creek Phase 2, Blk H, Lot 1 | $513,019          |
| ... | ...                           | ...                                        | ...                                  | ...                                              | ...               |
| 61  | 2129238 R-4734-00C-0110-1     | HEPFER ARRON                               | 990 Willowgate Dr Prosper, TX 75078  | Willow Ridge Phase One, Blk C, Lot 11            | $509,795          |

CodePudding user response:

Usually one would loop through the page numbers and concatenate a list of dataframes (a sketch follows below), but since you only have three pages, your code is OK.

Because every for row in ... loop appends to the same data list, your final dataframe is effectively df1; you just need to drop the rows that repeat the column names:

final = df1[df1['Property ID ↓ Geographic ID ↓']!='Property ID ↓ Geographic ID ↓']
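For completeness, the loop-and-concat version mentioned above could look like this (a sketch assuming three pages and the same selectors as in the first answer):

import requests
from bs4 import BeautifulSoup
import pandas as pd

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'}
base = "https://www.collincad.org/propertysearch?situs_street=Willowgate&situs_street_suffix" \
       "=&isd[]=any&city[]=any&prop_type[]=R&prop_type[]=P&prop_type[]=MH&active[]=1&year=2021&sort=G&page_number="

s = requests.Session()
frames = []
for page_number in range(1, 4):  # pages 1 to 3
    soup = BeautifulSoup(s.get(base + str(page_number), headers=headers).content, "lxml")
    cols = [c.get_text(' ', strip=True) for c in soup.select('#propertysearchresults thead td')]
    rows = [[c.get_text(' ', strip=True) for c in r.select('td')]
            for r in soup.select('#propertysearchresults tbody tr')]
    frames.append(pd.DataFrame(rows, columns=cols))

final = pd.concat(frames, ignore_index=True)  # works because every frame has the same header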

CodePudding user response:

Rather than your last several lines of code:

df1 = pd.DataFrame(data[1:], columns=data[0])
df2 = pd.DataFrame(data[2:], columns=data[1])
df3 = pd.DataFrame(data[3:], columns=data[2])

final = pd.concat([df1, df2, df3], axis=0)

final.to_csv('Street.csv', encoding='utf-8')

You can use this instead, which avoids slicing into separate dataframes and concatenating them:

final = pd.DataFrame(data[1:], columns=data[0])   # Sets the first row as the column names
final = final.iloc[:,1:]   # Gets rid of the additional index column
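
If you still want the CSV from the question, write final out afterwards; index=False is an assumption that you do not want pandas' own row numbers in the file:

final.to_csv('Street.csv', index=False, encoding='utf-8')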