I have HTML that renders as the directory listing shown below.
Name              Last modified     Size  Description
Parent Directory                    -
LICENSE           2022-05-25 10:00  384
README            2022-05-25 10:00  3.8K
RELEASE.metalink  2022-05-25 10:00  7.9K
The actual HTML is:
<pre>
<a href="?C=N;O=D">Name</a> <a href="?C=M;O=A">Last modified</a> <a href="?C=S;O=A">Size</a> <a href="?C=D;O=A">Description</a><hr/> <a href="/pub/databases/uniprot/knowledgebase/">Parent Directory</a> -
<a href="LICENSE">LICENSE</a> 2022-05-25 10:00 384
<a href="README">README</a> 2022-05-25 10:00 3.8K
<a href="RELEASE.metalink">RELEASE.metalink</a> 2022-05-25 10:00 7.9K
<hr/></pre>
What I would like is to collect the fields of each row into a single list element, i.e.:
[[Parent Directory, , -, ],
[LICENSE, 2022-05-25, 10:00, 384],
[README, 2022-05-25, 10:00, 3.8K],
[RELEASE.metalink, 2022-05-25, 10:00, 7.9K]]
For further clarification, I am trying to grab this data from FTP sites (I want to scrape the web page rather than log in to the FTP server directly).
If possible, I would prefer a generic solution that works across multiple FTP sites, so it should also handle the format below.
Name              Last modified     Size
Parent Directory                    -
FAM_000060.cif    2013-10-10 13:47  37K
FAM_000110.cif    2013-10-10 13:47  5.2K
<pre>
<img alt="Icon " height="16" src="/icons/blank.gif" width="16"/> <a href="?C=N;O=D">Name</a> <a href="?C=M;O=A">Last modified</a> <a href="?C=S;O=A">Size</a> <hr/><a href="/pub/pdb/refdata/bird/family/"><img alt="[PARENTDIR]" height="16" src="/foh/myIcons/back.png" width="16"/></a> <a href="/pub/pdb/refdata/bird/family/">Parent Directory</a> -
<a href="FAM_000060.cif"><img alt="[ ]" height="16" src="/foh/myIcons/text.png" width="16"/></a> <a href="FAM_000060.cif">FAM_000060.cif</a> 2013-10-10 13:47 37K
<a href="FAM_000110.cif"><img alt="[ ]" height="16" src="/foh/myIcons/text.png" width="16"/></a> <a href="FAM_000110.cif">FAM_000110.cif</a> 2013-10-10 13:47 5.2K
<hr/>
</pre>
I have a solution that works for the first format but not for the second, and I suspect there is an easier way to do this. My current attempt:
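from bs4 import BeautifulSoup

# `html` holds the page source shown above.
soup = BeautifulSoup(html, 'lxml')
tag = soup.find('pre')
print(soup.prettify())
refs = []
count = 0
# Pair up consecutive children of <pre>: a link, then the text that follows it.
for z in tag.children:
    if count == 0:
        element = list()
        element.append(z.getText().strip())
        count = 1
    else:
        element.append(z.getText().strip())
        refs.append(element)
        count = 0
print(refs)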
Any help would be greatly appreciated.
CodePudding user response:
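This seemed to work for both formats:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
tag = soup.find('pre')
l = list()
for a_tag in tag.find_all('a'):
    # Join the link text with the text that follows it, then split on whitespace.
    l.append((a_tag.text + ' ' + a_tag.next_sibling.text).strip().split())
# Header and icon-only links produce fewer than three fields, so drop them.
print([x for x in l if len(x) > 2])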
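Splitting on whitespace also breaks up names that contain spaces, such as "Parent Directory". Here is a minimal sketch of an alternative that keeps each name whole. It swaps BeautifulSoup for the standard library's html.parser, so it is a different technique from the one above. The names ListingParser and parse_listing are made up for illustration. It assumes an Apache-style listing inside a single <pre> block, where the sortable header links have an href starting with "?".

```python
from html.parser import HTMLParser


class ListingParser(HTMLParser):
    """Collect one row per named link inside a <pre> directory listing."""

    def __init__(self):
        super().__init__()
        self.rows = []          # one list per file or directory entry
        self.in_pre = False     # True while inside <pre>...</pre>
        self.in_link = False    # True while inside a link we want to keep
        self.name_parts = []    # text pieces of the current link
        self.current = None     # row that trailing text is appended to

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self.in_pre = True
        if tag != "a" or not self.in_pre:
            return
        self.current = None  # a new link ends the previous row
        href = dict(attrs).get("href") or ""
        # Assumption: hrefs like "?C=N;O=D" are sortable column headers.
        if href.startswith("?"):
            return
        self.in_link = True
        self.name_parts = []

    def handle_endtag(self, tag):
        if tag == "pre":
            self.in_pre = False
            self.current = None
        if tag != "a" or not self.in_link:
            return
        self.in_link = False
        name = "".join(self.name_parts).strip()
        if name:  # icon-only links have no text and are skipped
            self.current = [name]
            self.rows.append(self.current)

    def handle_data(self, data):
        if self.in_link:
            self.name_parts.append(data)
        elif self.current is not None:
            # Text after a link holds the date, time and size columns.
            self.current.extend(data.split())


def parse_listing(html):
    """Return a list of [name, field, ...] rows for the given page source."""
    parser = ListingParser()
    parser.feed(html)
    parser.close()
    return parser.rows


sample = (
    '<pre><a href="?C=N;O=D">Name</a> <a href="?C=M;O=A">Last modified</a><hr/>'
    '<a href="/pub/">Parent Directory</a> -\n'
    '<a href="README"><img alt="[ ]" src="/icons/text.png"/></a> '
    '<a href="README">README</a> 2022-05-25 10:00 3.8K\n<hr/></pre>'
)
print(parse_listing(sample))
# [['Parent Directory', '-'], ['README', '2022-05-25', '10:00', '3.8K']]
```

Rows are not padded to a fixed width, so the "Parent Directory" row has only two fields. Listings rendered as an HTML table rather than a <pre> block would need a different approach.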