I'm having trouble with web scraping a table from ClinicalTrials.gov.
I'm trying to extract the CSS selector of the words in the first column of the first row, labeled "breast cancer", under the Terms and Synonyms Searched table. Here is the link to the table:
The CSS selector, .w3-padding-8:nth-child(1)
gets me all the terms in the first column. This works if the search term is a single word, like "pembrolizumab", but if the search term is two words, like "breast cancer", the table contains multiple rows ("chunks") and the above CSS selector returns all the terms from these rows.
Does anyone know the CSS selector to extract out the terms in the first column and first row ("chunk") only?
CodePudding user response:
The td
cells of class w3-padding-8
include the synonyms listed in the column you want and the (unwanted) number of studies for the search and the data base.
Because there are two cells containing study numbers following each synonym entry, the following strategy may help to isolate just the synonym column.
First make an html collection of all the td
elements of class w3-padding-8
:
const cells = document.querySelectorAll('td.w3-padding-8');
then, log the innerText
of the first, third, sixth and so on cells (so skipping those containing study numbers):
for (let i=0; i<cells.length; i =3) {
console.log(cells[i].innerText);
}
Note the use of i =3
for the loop incrementor - allowing just cells 0,3,6... etc., containing the synonyms to be listed.
I ran this on the browser console with the link you provided loaded and it returned the list of synonyms. The only part you may have to tinker with is that the loaded table contained three sections: 'breast cancer', 'cancer' and 'breast', and the list contained the synonyms for all three sections. You should be able to isolate the 'breast cancer' block and apply the same idea to retrieve its synonym column.
The key appears to be skipping two cells after each synonym using i =3
.
CodePudding user response:
Does it solve your problem to get the table instead?
library(tidyverse)
library(rvest)
"https://clinicaltrials.gov/ct2/results/details?cond=breast cancer" %>%
read_html() %>%
html_table() %>%
.[[1]]
# A tibble: 30 × 3
Terms `Search Results*` `Entire Database**`
<chr> <chr> <chr>
1 Synonyms Synonyms Synonyms
2 breast cancer 12,002 studies 12,002 studies
3 Breast Neoplasms 9,539 studies 9,539 studies
4 breast carcinomas 917 studies 917 studies
5 Breast tumor 159 studies 159 studies
6 cancer of the breast 66 studies 66 studies
7 Neoplasm of breast 61 studies 61 studies
8 cancer of breast 40 studies 40 studies
9 Carcinoma of the Breast 33 studies 33 studies
10 CARCINOMA OF BREAST 32 studies 32 studies
# … with 20 more rows
# ℹ Use `print(n = ...)` to see more rows