I am trying to extract data from https://www.lipidmaps.org/databases/lmsd/LMSL01010001. I usually use BeautifulSoup or pandas to extract table data, but the tables on this website don't seem to be built with <table> elements. For example, the Calculated Physicochemical Properties table is built with the class "flex-grow flex-shrink p-3 px-5".
How can I extract the data from these tables (specifically the Calculated Physicochemical Properties table and the SMILES value)?
I tried the following code, but it returns almost the whole website's text: 'soup.find("div")'.
I usually use pandas.read_table(link).
CodePudding user response:
Here is one way of getting that information and displaying it in a DataFrame:
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

# Show every column and the full (long) SMILES string when printing
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

# Send a browser-like User-Agent with the request
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36'
}
big_list = []

r = requests.get('https://www.lipidmaps.org/databases/lmsd/LMSL01010001', headers=headers)
soup = bs(r.text, 'html.parser')

# SMILES: find the "(Click to copy)" span under the SMILES label, then take the next div's text
smiles = soup.select_one('div:-soup-contains("SMILES") > span:-soup-contains("(Click to copy)")').find_next('div').text.strip()

# Each property is a <strong> label followed by its value as a plain text node
heavy_atoms = soup.select_one('strong:-soup-contains("Heavy Atoms")').find_next_sibling(string=True).strip()
rings = soup.select_one('strong:-soup-contains("Rings")').find_next_sibling(string=True).strip()

# Collect one row per compound and build the DataFrame
big_list.append((smiles, heavy_atoms, rings))
df = pd.DataFrame(big_list, columns=['SMILES', 'Heavy Atoms', 'Rings'])
print(df)
Result in terminal:
SMILES Heavy Atoms Rings
0 O(P(O)(=O)OC[C@@H]1[C@@H](O)[C@@H](O)[C@H](N2C(=O)NC(=O)C=C2)O1)P(O[C@H]1O[C@@H]([C@H]([C@@H]([C@H]1N)OC(C[C@@H](CCCCCCCCCCC)O)=O)O)CO)(=O)O 52 3
You can get the other data points the same way, using the logic above. Also, make sure your packages are up to date. See the BeautifulSoup documentation for details on the selectors used.
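To show how the same label-then-value logic could be reused for several properties at once, here is a minimal sketch. The helper name extract_labelled_values and the SAMPLE_HTML fragment are hypothetical, made up for illustration so the example runs without a network request. The real page's markup may differ, and the "Aromatic Rings" row is an assumed label, not one confirmed from the site. The sketch matches each label exactly rather than by substring, because a substring selector such as :-soup-contains("Rings") would also match any longer label that contains the word "Rings".

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating a "<strong>Label</strong> value" layout.
# It is NOT copied from the real page; it only makes the sketch runnable offline.
SAMPLE_HTML = """
<div class="flex-grow flex-shrink p-3 px-5">
  <div><strong>Heavy Atoms</strong> 52</div>
  <div><strong>Rings</strong> 3</div>
  <div><strong>Aromatic Rings</strong> 0</div>
</div>
"""


def extract_labelled_values(soup, labels):
    """Return {label: value-or-None} for each requested label.

    Hypothetical helper: assumes each value is the text node that
    directly follows a <strong> tag holding the label.
    """
    # Index every <strong> tag by its exact, whitespace-trimmed text,
    # keeping the first occurrence of each label.
    by_label = {}
    for strong in soup.find_all("strong"):
        by_label.setdefault(strong.get_text(strip=True), strong)

    values = {}
    for label in labels:
        strong = by_label.get(label)
        text = None
        if strong is not None:
            # Same call as in the answer: the next sibling that is a string.
            text = strong.find_next_sibling(string=True)
        values[label] = text.strip() if text is not None else None
    return values


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
properties = extract_labelled_values(soup, ["Heavy Atoms", "Rings", "Aromatic Rings"])
print(properties)
```

With the soup from the real request, the resulting dictionary could be turned into a one-row DataFrame with pd.DataFrame([properties]). Labels that are not found come back as None instead of raising an AttributeError.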