Can't gather data from Website using Selenium

Time:09-15

Good morning.

I am quite new to scraping data with Selenium and am having trouble gathering the data from this website:

https://www.puertodeveracruz.com.mx/datosBuques/principal.php?jmlp=1

What I would like to retrieve are all rows for these three columns:

Viaje,Nombre Buque,Fecha ETA

I tried to get them with driver.find_elements, but I am not sure which element I should target. I tried id="gridSimpleFiltering_footer_container", but it does not seem to work.

What would be the solution?

Thank you in advance!

CodePudding user response:

# requires bs4 (pip install beautifulsoup4); assumes `driver` has already loaded the page
from bs4 import BeautifulSoup

pageSource = driver.page_source
soup = BeautifulSoup(pageSource, 'html.parser')
# the first of the two tbody elements contains all the data you seek
tbody = soup.find('tbody')
tr = tbody.find_all('tr')

Viaje = []
NombreBuque = []
FechaETA = []
for t in tr:
    # for each row, find all the td cells
    td = t.find_all('td')
    if len(td) < 3:
        continue  # skip rows that do not have the three data cells
    # Viaje is the first column, so it is in td[0], and so on;
    # get_text() stores the cell text rather than the Tag object
    Viaje.append(td[0].get_text(strip=True))
    NombreBuque.append(td[1].get_text(strip=True))
    FechaETA.append(td[2].get_text(strip=True))

CodePudding user response:

Or a non-Selenium solution:

import requests
import pandas as pd


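# .json() parses the response without eval(), which should never be run on data
# fetched from the network; this assumes the endpoint returns valid JSON
df = pd.DataFrame(requests.get('https://www.puertodeveracruz.com.mx/ws/BuquesProgramados').json())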
print(df.to_string(columns=['VID', 'NOM_BUQUE', 'F_ETA']))

OUTPUT:

        VID            NOM_BUQUE       F_ETA
0    221680            VEGA VELA  30/09/2022
1    221704               HALLEY  29/09/2022
2    221666          NORDIC MASA  26/09/2022
3    221553              LUTETIA  25/09/2022
4    221709          MOUNT ATHOS  25/09/2022
5    221536              ORINOCO  24/09/2022
6    221703    MARGARETE SCHULTE  24/09/2022
7    221712     COLUMBIA HIGHWAY  23/09/2022
8    221622     MSC DON GIOVANNI  23/09/2022
9    221662         CONTSHIP LEO  23/09/2022
10   221676        MONTE PASCOAL  23/09/2022
11   221665           GINGA PUMA  22/09/2022
12   221674          AS PETRONIA  22/09/2022
13   221691      BBC SCANDINAVIA  22/09/2022
14   221715      BROOKLYN BRIDGE  22/09/2022
15   221630        MSC EMDEN III  22/09/2022
16   221711   ATLANTIC MONTERREY  21/09/2022
17   221694     VICTORIA HIGHWAY  21/09/2022
18   221708             CORONA J  21/09/2022
19   221636             PRESIDIO  21/09/2022
20   221629         MSC AQUARIUS  21/09/2022
21   221673         MONTE TAMARO  20/09/2022
22   221710       ATLANTIC DREAM  20/09/2022
23   221542          BRAVERY ACE  20/09/2022
24   221541            ADRIA ACE  20/09/2022
25   221702          SEAFRONTIER  20/09/2022
26   221701              VOKARIA  20/09/2022
27   221618       ATLANTIK PRIDE  20/09/2022
28   221627           MSC DARIEN  20/09/2022
29   221684         STOLT HALCON  20/09/2022
30   221714           CERRO AZUL  20/09/2022
31   221713             JMC 3080  20/09/2022
32   221628               GENOVA  19/09/2022
33   221705         STANLEY PARK  19/09/2022
34   221667  ORIENTAL MARGUERITE  19/09/2022
35   221707                MAIRA  19/09/2022
36   221716             DEE4 FIG  19/09/2022
37   221692      LONGVIEW LOGGER  18/09/2022
38   221698        ATLANTIC STAR  18/09/2022
39   221479        GRANDE TORINO  18/09/2022
40   221693          PIS PARAGON  18/09/2022
...
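
The F_ETA values in the output above are plain strings in DD/MM/YYYY form, so sorting them as text would not give chronological order. Below is a minimal sketch of parsing and sorting them with the standard library only. The `records` list is sample data copied from the output above, and the helper names `parse_eta` and `sort_by_eta` are made up for illustration. It also assumes VID arrives as a string, which the output alone does not confirm.

```python
from datetime import datetime

# Sample records copied from the output above; in practice they would come
# from the DataFrame or from the parsed response. VID as a string is an assumption.
records = [
    {"VID": "221680", "NOM_BUQUE": "VEGA VELA", "F_ETA": "30/09/2022"},
    {"VID": "221628", "NOM_BUQUE": "GENOVA", "F_ETA": "19/09/2022"},
    {"VID": "221536", "NOM_BUQUE": "ORINOCO", "F_ETA": "24/09/2022"},
]


def parse_eta(value):
    """Convert a 'DD/MM/YYYY' string into a datetime.date (hypothetical helper)."""
    return datetime.strptime(value, "%d/%m/%Y").date()


def sort_by_eta(rows):
    """Return the rows ordered from earliest to latest ETA (hypothetical helper)."""
    return sorted(rows, key=lambda row: parse_eta(row["F_ETA"]))


for row in sort_by_eta(records):
    print(row["VID"], row["NOM_BUQUE"], parse_eta(row["F_ETA"]).isoformat())
```

If you stay in pandas, pd.to_datetime with format='%d/%m/%Y' on the F_ETA column should serve the same purpose.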

CodePudding user response:

You need to find the rows first and then the columns to fetch the values. There is also a page synchronization issue, which you need to handle with WebDriverWait.

driver.get("https://www.puertodeveracruz.com.mx/datosBuques/principal.php?jmlp=1")
WebDriverWait(driver,10).until(EC.visibility_of_element_located((By.CSS_SELECTOR,"table#gridSimpleFiltering>tbody>tr")))
tableRows=driver.find_elements(By.CSS_SELECTOR, "table#gridSimpleFiltering>tbody>tr")
for row in tableRows:
    print(row.find_element(By.XPATH, ".//td[1]").text)
    print(row.find_element(By.XPATH, ".//td[2]").text)
    print(row.find_element(By.XPATH, ".//td[3]").text)
    print("====================================")

You need the following imports:

from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By

Output:

221680
VEGA VELA
30/09/2022
====================================
221704
HALLEY
29/09/2022
====================================
221666
NORDIC MASA
26/09/2022
====================================
221553
LUTETIA
25/09/2022
====================================
221709
MOUNT ATHOS
25/09/2022
====================================
221536
ORINOCO
24/09/2022
====================================
221703
MARGARETE SCHULTE
24/09/2022
====================================
221712
COLUMBIA HIGHWAY
23/09/2022
====================================
221622
MSC DON GIOVANNI
23/09/2022
====================================
221662
CONTSHIP LEO
23/09/2022
====================================
221676
MONTE PASCOAL
23/09/2022
====================================
221665
GINGA PUMA
22/09/2022
====================================
221674
AS PETRONIA
22/09/2022
====================================
221691
BBC SCANDINAVIA
22/09/2022
====================================
221715
BROOKLYN BRIDGE
22/09/2022
====================================
221630
MSC EMDEN III
22/09/2022
====================================
221711
ATLANTIC MONTERREY
21/09/2022
====================================
221694
VICTORIA HIGHWAY
21/09/2022
====================================
221708
CORONA J
21/09/2022
====================================
221636
PRESIDIO
21/09/2022
====================================
221629
MSC AQUARIUS
21/09/2022
====================================
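
If you would rather keep the values than print them, the cell texts of each row can be collected into records keyed by the three column names from the question. This is only a sketch: `cells_to_records` is a hypothetical helper, and the sample rows are copied from the output above so that it runs without a browser.

```python
def cells_to_records(rows_of_text):
    """Map each row's first three cell texts to the three wanted columns."""
    records = []
    for cells in rows_of_text:
        if len(cells) < 3:
            continue  # skip empty or incomplete rows
        records.append({
            "Viaje": cells[0].strip(),
            "Nombre Buque": cells[1].strip(),
            "Fecha ETA": cells[2].strip(),
        })
    return records


# With Selenium, rows_of_text could be built from the rows found above (not run here):
#   rows_of_text = [[td.text for td in row.find_elements(By.TAG_NAME, "td")]
#                   for row in tableRows]
sample = [
    ["221680", "VEGA VELA", "30/09/2022", "extra"],
    [],
    ["221704", " HALLEY ", "29/09/2022"],
]
for record in cells_to_records(sample):
    print(record)
```

A list of dictionaries like this can be passed directly to pandas.DataFrame or written out with csv.DictWriter.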