Cannot scrape website any longer with httr and rvest

Time:11-30

I have been scraping a table from this website successfully ever since the user hrbrmstr answered this question of mine five years ago. Recently something about the website changed, and I can no longer fetch the data.

URL <- "http://www.fiskistofa.is/veidar/aflaupplysingar/landanir-eftir-hofnum/"
library(httr)
library(rvest)
res <- POST(url = URL,
        query = list(lang="is"),
        body = list(magn = "Sundurlidun",
                    hofn = "87",
                    dagurFra = format(lubridate::today()-4, "%d.%m.%Y"),
                    dagurTil = format(lubridate::today(), "%d.%m.%Y"),
                    hnappur = "Sækja"),
        encode = "form")
doc <- content(res, as="parsed")

This is how I used to find and extract the table, but now the result is empty:

  html_nodes(doc, xpath=".//table[contains(., 'Magn')]") %>%
    html_table(header=TRUE) 

Nothing in the appearance of the site has changed, but they recently opened up this Power BI report for the same database (the table is on page 3), so they may have changed something along the way that I don't know about.

Any suggestions?

Answer:

Try formatting the dates as '%d.%m.%Y' and changing http:// to https:// in the URL:

URL <- "https://www.fiskistofa.is/veidar/aflaupplysingar/landanir-eftir-hofnum/"
library(httr)
library(rvest)
res <- POST(url = URL,
        query = list(lang="is"),
        body = list(magn = "Sundurlidun",
                    hofn = "87",
                    dagurFra = format(lubridate::today()-4, '%d.%m.%Y') ,
                    dagurTil = format(lubridate::today(), '%d.%m.%Y'),
                    hnappur = "Sækja"),
        encode = "form")
doc <- content(res, as="parsed")

In Python:

import requests
import pandas as pd
from datetime import datetime, timedelta

url = "https://www.fiskistofa.is/veidar/aflaupplysingar/landanir-eftir-hofnum/"
today = datetime.now()    
payload = {
    'magn' : "Sundurlidun",
    'hofn' : "87",
    'dagurFra' : (today - timedelta(days=4)).strftime("%d.%m.%Y"),
    'dagurTil' : today.strftime("%d.%m.%Y"),
    'hnappur' : "Sækja"}

df = pd.read_html(requests.post(url, data=payload).text)[-1]
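
When the table silently comes back empty, it can help to look at exactly what is being sent. The sketch below rebuilds the same form fields using only the standard library and shows the URL-encoded body. `build_payload` is a hypothetical helper name, the field names are copied from the code above, and nothing here contacts the server.

```python
from datetime import date, timedelta
from urllib.parse import urlencode


def build_payload(today, days_back=4, port="87"):
    """Return the form fields used above for a date window ending at `today`.

    Hypothetical helper: field names and values are copied from the request
    above; nothing here is confirmed against the site's documentation.
    """
    fmt = "%d.%m.%Y"  # day.month.year, e.g. 29.11.2021
    return {
        "magn": "Sundurlidun",
        "hofn": port,
        "dagurFra": (today - timedelta(days=days_back)).strftime(fmt),
        "dagurTil": today.strftime(fmt),
        "hnappur": "Sækja",
    }


# A fixed date keeps the example reproducible; use date.today() in practice.
payload = build_payload(date(2021, 11, 29))

# URL-encoded form body; urlencode percent-encodes non-ASCII text as UTF-8.
encoded_body = urlencode(payload)
print(encoded_body)
```

The returned dict could be passed as `data=` to `requests.post` in place of the hand-built `payload`.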

Output of the `pd.read_html` call:

print(df)
              0        1  ...                        4      5
0   Löndun dags  Skipnr.  ...               Vörutegund   Magn
1    25.11.2021     2999  ...      Steinbítur /slægður      5
2    25.11.2021     2999  ...      ÝSA/ÓSL./VS (HAFRO)    690
3    25.11.2021     2999  ...              Ýsa /óslægð    415
4    25.11.2021     2999  ...  ÞORSKUR/ÓSL./VS (HAFRO)    861
5    25.11.2021     2999  ...       Þorskur / óslægður  4.870
6    26.11.2021     2615  ...      ÝSA/ÓSL./VS (HAFRO)     14
7    26.11.2021     2615  ...              Ýsa /óslægð  1.005
8    26.11.2021     2615  ...  ÞORSKUR/ÓSL./VS (HAFRO)    164
9    26.11.2021     2615  ...       Þorskur / óslægður  1.507
10   27.11.2021     2842  ...  ÞORSKUR/ÓSL./VS (HAFRO)    271
11   27.11.2021     2842  ...       Þorskur / óslægður  5.703
12   27.11.2021     2842  ...     Þorskur-undirmál/ósl    151
13   27.11.2021     2842  ...          Hlýri /óslægður     13
14   27.11.2021     2842  ...                Gullkarfi     27
15   27.11.2021     2842  ...           Ufsi /óslægður     29
16   27.11.2021     2842  ...            Keila /óslægð     11
17   27.11.2021     2842  ...             Lýsa /óslægð      2
18   27.11.2021     2842  ...              Ýsa /óslægð  3.072
19   27.11.2021     2842  ...      Ýsa-undirmál/óslægð      8
20   28.11.2021     2256  ...              Ýsa /óslægð  1.888
21   28.11.2021     2256  ...     Þorskur-undirmál/ósl    551
22   28.11.2021     2256  ...       Þorskur / óslægður  4.212
23   28.11.2021     2256  ...      Steinbítur /slægður      4
24   28.11.2021     2256  ...      ÝSA/ÓSL./VS (HAFRO)    243
25   28.11.2021     2615  ...              Ýsa /óslægð    829
26   28.11.2021     2615  ...       Þorskur / óslægður  2.659
27   28.11.2021     2615  ...                Gullkarfi     34
28   28.11.2021     2842  ...            Keila /óslægð     11
29   28.11.2021     2842  ...                Gullkarfi     18
30   28.11.2021     2842  ...  ÞORSKUR/ÓSL./VS (HAFRO)     95
31   28.11.2021     2842  ...          Hlýri /óslægður     17
32   28.11.2021     2842  ...     Þorskur-undirmál/ósl     79
33   28.11.2021     2842  ...            Langa /óslægð     18
34   29.11.2021     1136  ...              Tindabikkja    599
35   29.11.2021     1136  ...               Þorsklifur  1.787

[36 rows x 6 columns]
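
In the printed output, the header labels sit in row 0 and values such as 4.870 keep their dot, which suggests the cells were read as text. Below is a minimal sketch of cleaning this up, assuming the dot is a thousands separator (the source does not confirm this). `parse_magn` is a hypothetical helper, and the sample rows are copied from the printed output.

```python
def parse_magn(text):
    """Convert a quantity string such as '4.870' to the integer 4870.

    Assumption: '.' is a thousands separator, not a decimal point.
    """
    return int(str(text).replace(".", "").strip())


# A few (Vörutegund, Magn) pairs copied from the printed output;
# the first pair is the header row that ended up as data.
rows = [
    ("Vörutegund", "Magn"),
    ("Steinbítur /slægður", "5"),
    ("Þorskur / óslægður", "4.870"),
    ("Ýsa /óslægð", "1.005"),
]

# Split the header off, then convert the quantity in each remaining row.
header, data = rows[0], rows[1:]
cleaned = [(species, parse_magn(amount)) for species, amount in data]
print(header, cleaned)
```

With the DataFrame itself, the same function could be applied to column 5 after dropping row 0, for example via `Series.map`. `pd.read_html` also accepts `header=0` and `thousands="."`, which may make this step unnecessary, though that has not been tested against this page.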