BS4 Get all elements of anchor tag and between anchor tag together

Time:07-22

I have the following HTML, which renders as shown below.

      Name                                  Last modified      Size  Description
      Parent Directory                                           -
      LICENSE                               2022-05-25 10:00  384
      README                                2022-05-25 10:00  3.8K
      RELEASE.metalink                      2022-05-25 10:00  7.9K

The actual HTML is:

 <pre>      
<a href="?C=N;O=D">Name</a>                                  <a href="?C=M;O=A">Last modified</a>      <a href="?C=S;O=A">Size</a>  <a href="?C=D;O=A">Description</a><hr/>      <a href="/pub/databases/uniprot/knowledgebase/">Parent Directory</a>                                           -   
      <a href="LICENSE">LICENSE</a>                               2022-05-25 10:00  384   
      <a href="README">README</a>                                2022-05-25 10:00  3.8K  
      <a href="RELEASE.metalink">RELEASE.metalink</a>                      2022-05-25 10:00  7.9K  
<hr/></pre>

What I would like is to get each link and the text that follows it together in the same list element, i.e.:

[[Parent Directory, , -, ],
 [LICENSE, 2022-05-25, 10:00, 384],
 [README, 2022-05-25, 10:00, 3.8K],
 [RELEASE.metalink, 2022-05-25, 10:00, 7.9K]]

For further clarification, I am trying to grab this data from FTP sites (I want to use the site's web page rather than log in to the FTP server directly).

If possible, a generic solution that works on multiple FTP sites would be preferred, so something that also works with the format below.

 Name                          Last modified      Size  
 Parent Directory                                   -
 FAM_000060.cif                2013-10-10 13:47   37K
 FAM_000110.cif                2013-10-10 13:47  5.2K
<pre>
<img alt="Icon " height="16" src="/icons/blank.gif" width="16"/> <a href="?C=N;O=D">Name</a>                          <a href="?C=M;O=A">Last modified</a>      <a href="?C=S;O=A">Size</a>  <hr/><a href="/pub/pdb/refdata/bird/family/"><img alt="[PARENTDIR]" height="16" src="/foh/myIcons/back.png" width="16"/></a> <a href="/pub/pdb/refdata/bird/family/">Parent Directory</a>                                   -   
<a href="FAM_000060.cif"><img alt="[   ]" height="16" src="/foh/myIcons/text.png" width="16"/></a> <a href="FAM_000060.cif">FAM_000060.cif</a>                2013-10-10 13:47   37K  
<a href="FAM_000110.cif"><img alt="[   ]" height="16" src="/foh/myIcons/text.png" width="16"/></a> <a href="FAM_000110.cif">FAM_000110.cif</a>                2013-10-10 13:47  5.2K  
<hr/>
</pre>

I have a solution that works for the first format but not for the second, and I suspect there is an easier way to do this. My current attempt:

    from bs4 import BeautifulSoup

    # html holds the page source of the directory listing
    soup = BeautifulSoup(html, 'lxml')

    tag = soup.find('pre')
    print(soup.prettify())
    refs = []
    count = 0
    # Group the children of <pre> two at a time
    for z in tag.children:
        if count == 0:
            element = list()
            element.append(z.getText().strip())
            count = 1
        else:
            element.append(z.getText().strip())
            refs.append(element)
            count = 0

    print(refs)

Any help would be greatly appreciated.

Answer:

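This seemed to work:

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, 'lxml')

    tag = soup.find('pre')

    l = list()
    for a_tag in soup.find_all('a'):
        # Join the link text and the node right after it, then split on whitespace
        l.append((a_tag.text + ' ' + a_tag.next_sibling.text).strip().split())

    # Header and icon links yield fewer than three fields, so filter them out
    print([x for x in l if len(x) > 2])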
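
For what it's worth, here is a slightly more defensive sketch of the same idea that might cope with both formats in the question. It is only an illustration, not a tested general solution: the function name `parse_listing` and the `sample` string are made up for this example, it uses the built-in `html.parser` instead of `lxml`, and it assumes that sort links always have an `href` starting with `?` and that icon links contain no text.

```python
from bs4 import BeautifulSoup, NavigableString


def parse_listing(html):
    """Return one [name, *fields] row per named link inside the first <pre>."""
    soup = BeautifulSoup(html, "html.parser")
    pre = soup.find("pre")
    if pre is None:
        return []
    rows = []
    for a in pre.find_all("a", href=True):
        name = a.get_text(strip=True)
        # Assumption: column-sort links look like "?C=N;O=D" and
        # icon-only links have no text, so both are skipped.
        if not name or a["href"].startswith("?"):
            continue
        # Date, time and size sit in the text node right after the link.
        trailing = a.next_sibling
        if isinstance(trailing, NavigableString):
            fields = str(trailing).split()
        else:
            fields = []
        rows.append([name] + fields)
    return rows


# Hypothetical, shortened sample that mixes features of both formats.
sample = (
    '<pre><a href="?C=N;O=D">Name</a> <a href="?C=M;O=A">Last modified</a><hr/>'
    '<a href="/pub/">Parent Directory</a>   -   \n'
    '<a href="README"><img alt="[   ]" src="/icons/text.png"/></a> '
    '<a href="README">README</a>   2022-05-25 10:00  3.8K  \n'
    '<hr/></pre>'
)
print(parse_listing(sample))
```

Unlike the snippet above, this keeps "Parent Directory" as a single name instead of splitting it on the space, and it does not rely on a length filter to drop the header row. Rows can still have different lengths (the parent entry only has a `-`), so pad them if you need a fixed number of columns.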