Scrape multiple pages through pandas

Time:06-18

I want to scrape multiple pages, but I only get the result of the last page. This is the page link: https://www.baroul-cluj.ro/tabloul-avocatilor/avocati-definitivi/

import pandas as pd

for page in range(1,26):
    df=pd.read_html('https://www.baroul-cluj.ro/tabloul-avocatilor/avocati-definitivi/?wpv_view_count=9662&wpv_post_search=&wpv_paged={page}'.format(page=page))
    df[0].to_csv('tab.csv',index=False)

CodePudding user response:

That's because every iteration writes to the same file in the default write mode, which overwrites it, so you only keep the data from the last page scraped.

A solution to your problem is to create a new file every time like this:

import pandas as pd

for page in range(1,26):
    df = pd.read_html('https://www.baroul-cluj.ro/tabloul-avocatilor/avocati-definitivi/?wpv_view_count=9662&wpv_post_search=&wpv_paged={page}'.format(page=page))
    df[0].to_csv(f"tab-{page}.csv",index=False)

Or if you want a single file, you can use append mode when writing the CSV file.

import pandas as pd

for page in range(1,26):
    df = pd.read_html('https://www.baroul-cluj.ro/tabloul-avocatilor/avocati-definitivi/?wpv_view_count=9662&wpv_post_search=&wpv_paged={page}'.format(page=page))
    df[0].to_csv('tab.csv', mode='a', index=False, header=False)
  • mode="a": Use the append mode as opposed to w – the default write mode.
  • index=False: Do not include an index column when appending the new data.
  • header=False: Do not include a header when appending the new data.
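
A variation on the append approach, as a minimal sketch: write the header only for the first page and let the first write overwrite any existing file. The helper names append_pages and read_pages are hypothetical, not part of pandas, and read_pages assumes the table you want is the first one on each page.

```python
import pandas as pd


def append_pages(frames, path):
    """Write each DataFrame to `path`, keeping the header from the first one only."""
    for i, frame in enumerate(frames):
        frame.to_csv(
            path,
            mode="w" if i == 0 else "a",  # first page overwrites, later pages append
            header=(i == 0),              # header row only once
            index=False,
        )


def read_pages(url_template, pages):
    """Yield the first table found on each page (assumption: that is the wanted table)."""
    for page in pages:
        yield pd.read_html(url_template.format(page=page))[0]
```

With the URL from the question stored in a string named url (containing the {page} placeholder), the call would be append_pages(read_pages(url, range(1, 26)), 'tab.csv').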

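NOTE: Append mode creates tab.csv if it does not exist. If a tab.csv from an earlier run is still there, delete it first; otherwise the new rows are added after the old ones. Because header=False is passed for every page, the resulting file has no header row.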
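
Another option, not part of the answer above, is to collect every page in memory and write the file once with pd.concat. This is a sketch under assumptions: combine_pages and fetch_first_table are hypothetical helper names, and the wanted table is assumed to be the first one pd.read_html returns for each page.

```python
import pandas as pd

# URL from the question, with a {page} placeholder for the page number.
URL = ('https://www.baroul-cluj.ro/tabloul-avocatilor/avocati-definitivi/'
       '?wpv_view_count=9662&wpv_post_search=&wpv_paged={page}')


def fetch_first_table(page):
    """Download one page and return its first table (assumed to be the wanted one)."""
    return pd.read_html(URL.format(page=page))[0]


def combine_pages(fetch_page, pages):
    """Fetch every page and stack the resulting tables into one DataFrame."""
    frames = [fetch_page(page) for page in pages]
    # ignore_index=True renumbers the rows 0..n-1 across all pages.
    return pd.concat(frames, ignore_index=True)


# Usage (performs network requests, so it is left commented out):
# combine_pages(fetch_first_table, range(1, 26)).to_csv('tab.csv', index=False)
```

Writing once avoids both the overwrite problem and the missing header, at the cost of holding all pages in memory, which is fine for 25 small tables.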
