CSS selector for the first row of table on ClinicalTrials.gov-CodePudding

I'm having trouble with web scraping a table from ClinicalTrials.gov.

I'm trying to extract the CSS selector of the words in the first column of the first row, labeled "breast cancer", under the Terms and Synonyms Searched table. Here is the link to the table:

The CSS selector, .w3-padding-8:nth-child(1) gets me all the terms in the first column. This works if the search term is a single word, like "pembrolizumab", but if the search term is two words, like "breast cancer", the table contains multiple rows ("chunks") and the above CSS selector returns all the terms from these rows.

Does anyone know the CSS selector to extract out the terms in the first column and first row ("chunk") only?

CodePudding user response：

The td cells of class w3-padding-8 include the synonyms listed in the column you want and the (unwanted) number of studies for the search and the data base.

Because there are two cells containing study numbers following each synonym entry, the following strategy may help to isolate just the synonym column.

First make an html collection of all the td elements of class w3-padding-8:

const cells = document.querySelectorAll('td.w3-padding-8');

then, log the innerText of the first, third, sixth and so on cells (so skipping those containing study numbers):

for (let i=0; i<cells.length; i =3) {
  console.log(cells[i].innerText);
}

Note the use of i =3 for the loop incrementor - allowing just cells 0,3,6... etc., containing the synonyms to be listed.

I ran this on the browser console with the link you provided loaded and it returned the list of synonyms. The only part you may have to tinker with is that the loaded table contained three sections: 'breast cancer', 'cancer' and 'breast', and the list contained the synonyms for all three sections. You should be able to isolate the 'breast cancer' block and apply the same idea to retrieve its synonym column.

The key appears to be skipping two cells after each synonym using i =3.

CodePudding user response：

Does it solve your problem to get the table instead?

library(tidyverse)
library(rvest)

"https://clinicaltrials.gov/ct2/results/details?cond=breast cancer" %>% 
  read_html() %>% 
  html_table() %>% 
  .[[1]] 

# A tibble: 30 × 3
   Terms                   `Search Results*` `Entire Database**`
   <chr>                   <chr>             <chr>              
 1 Synonyms                Synonyms          Synonyms           
 2 breast cancer           12,002 studies    12,002 studies     
 3 Breast Neoplasms        9,539 studies     9,539 studies      
 4 breast carcinomas       917 studies       917 studies        
 5 Breast tumor            159 studies       159 studies        
 6 cancer of the breast    66 studies        66 studies         
 7 Neoplasm of breast      61 studies        61 studies         
 8 cancer of breast        40 studies        40 studies         
 9 Carcinoma of the Breast 33 studies        33 studies         
10 CARCINOMA OF BREAST     32 studies        32 studies         
# … with 20 more rows
# ℹ Use `print(n = ...)` to see more rows