I would like to scrape the content of this website https://web.archive.org/web/20130318062052/http://internet.csr.nih.gov/Roster_proto1/member_roster.asp?srg=ACE&SRGDISPLAY=ACE&CID=102283 and create a table with the columns NAME, TITLE, LOCATION. I know some individuals have more or less "lines", but I am just trying to understand how I could even classify the first 3 lines for each person given that the text is in between "fonts" for all.
So far I have:
url="https://web.archive.org/web/20130318062052/http://internet.csr.nih.gov/Roster_proto1/member_roster.asp?srg=ACE&SRGDISPLAY=ACE&CID=102283"
driver.maximize_window()
driver.implicitly_wait(5) # wait up to 5 seconds before calls to find elements time out
driver.get(url)
content = driver.page_source.encode('utf-8').strip()
soup = BeautifulSoup(content,"html.parser")
column = soup.find_all("font")
But once I am there and I have all the text within "font" in my "column" variable, I don't know how to proceed to differentiate between each person and build a loop where I would retrieve name, title, location etc. for each.
Any help would be highly appreciated!
CodePudding user response:
Note: instead of using selenium, I simply fetched and parsed with soup = BeautifulSoup(requests.get(url).content, "html.parser"); as far as I can tell, the required section is not dynamic, so it shouldn't cause any issues.
"would you have any idea about how to look for pairs of <br>"
Since they represent empty lines, you could try simply splitting the text in that cell by \n\n\n
blockText = soup.select_one('td:has(font)').get_text(' ')
blockText = blockText.replace('-'*10, '\n\n\n') # pad "underlined" lines
blockSections = [sect.strip() for sect in '\n'.join([
    l.strip('-').strip() for l in blockText.splitlines()
]).split('\n\n\n') if sect.strip()]
Although, if you looked at blockSections, you might notice that some headers [ROSTER and MEMBERS] get stuck to the end of the previous section - probably because their formatting means that an extra <br> is not needed to distinguish them from their adjacent sections. [I added the .replace('-'*10, '\n\n\n') line so that at least they're separated from the next section.]
Another risk is that I don't know if all versions and parsers will parse <br><br> to text as 3 line breaks - some omit <br> space entirely from text, and others might add extra space based on spaces between tags in the source html.
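One way to sidestep that uncertainty is to replace every <br> with an explicit newline before extracting text, so the result no longer depends on how a parser renders the tag. The sketch below runs on a small hypothetical snippet (the names and markup are made up for illustration, not copied from the roster page) and assumes a blank line separates people:

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the roster layout; not taken from the real page.
html = (
    "<td><font>DOE, JANE</font><br><font>PROFESSOR</font><br><br>"
    "<font>ROE, JOHN</font><br><font>LECTURER</font></td>"
)

soup = BeautifulSoup(html, "html.parser")
cell = soup.td

# Turn each <br> into a literal newline so line breaks are explicit in the text.
for br in cell.find_all("br"):
    br.replace_with("\n")

# A pair of <br> tags is now exactly one blank line, i.e. "\n\n".
text = cell.get_text()
sections = [s.strip().split("\n") for s in text.split("\n\n") if s.strip()]
print(sections)
```

With this approach a double <br> always becomes two newlines, so splitting on "\n\n" gives one list of lines per person.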
It's easier to split if you loop through the <br>s and pad them with something more distinctive to split by; the .insert... methods are useful here. (This method also has the advantage of being able to target bolded lines as well.)
blockSoup = soup.select_one('td:has(font)')
# pad every second <br> of a pair, and every font containing a <br>, with a marker line
for br2 in blockSoup.select('br+br, font:has(br)'):
    br2.insert_after(BeautifulSoup(f'<p>{"="*80}</p>', 'html.parser').p)
    br2.insert_before(BeautifulSoup(f'<p>{"="*80}</p>', 'html.parser').p)
blockSections = [
    sect.strip().strip('-').strip() for sect in
    blockSoup.get_text(' ').split("="*80) if sect.strip()
]
This time, blockSections looks something like
[
 'Membership Roster - ACE\n AIDS CLINICAL STUDIES AND EPIDEMIOLOGY STUDY SECTION\n Center For Scientific Review\n (Terms end 6/30 of the designated year)\n ROSTER',
 'CHAIRPERSON',
 'SCHACKER, TIMOTHY\n W\n , MD,\n (14)\n PROFESSOR\n DEPARTMENT OF MEDICINE\n UNIVERSITY OF MINNESOTA\n MINNEAPOLIS,\n MN\n 55455',
 'MEMBERS',
 'ANDERSON, JEAN\n R\n , MD,\n (14)\n PROFESSOR\n DEPARTMENT OF GYNECOLOGY AND OBSTETRICS\n JOHNS HOPKINS UNIVERSITY\n BALTIMORE,\n MD 21287',
 'BALASUBRAMANYAM, ASHOK\n , MD,\n (14)\n PROFESSOR\n DEPARTMENT OF MEDICINE AND\n MOLECULAR AND CELLULAR BIOLOGY\n DIVISION OF DIABETES, ENDOCRINOLOGY AND METABOLISM\n BAYLOR COLLEGE OF MEDICINE\n HOUSTON,\n TX 77030',
 'BLATTNER, WILLIAM\n ALBERT\n , MD,\n (15)\n PROFESSOR AND ASSOCIATE DIRECTOR\n DEPARTMENT OF MEDICNE\n INSTITUTE OF HUMAN VIROLOGY\n UNIVERSITY OF MARYLAND, BALTIMORE\n BALTIMORE,\n MD 21201',
 'CHEN, YING\n QING\n , PHD,\n (15)\n PROFESSOR\n PROGRAM IN BIOSTATISTICS AND BIOMATHEMATICS\n FRED HUTCHINSON CANCER RESEARCH CENTER\n SEATTLE,\n WA 981091024',
 'COTTON, DEBORAH\n , MD,\n (13)\n PROFESSOR\n SECTION OF INFECTIOUS DISEASES\n DEPARTMENT OF MEDICINE\n BOSTON UNIVERSITY\n BOSTON,\n MA 02118',
 'DANIELS, MICHAEL\n J\n , SCD,\n (16)\n PROFESSOR\n DEPARTMENT OF BIOSTATISTICS\n UNIVERSITY OF TEXAS AT AUSTIN\n AUSTIN,\n TX 78712',
 'FOULKES, ANDREA\n SARAH\n , SCD,\n (14)\n ASSOCIATE PROFESSOR\n DEPARTMENT OF BIOSTATISTICS\n UNIVERSITY OF MASSACHUSETTS\n AMHERST,\n MA 01003',
 'HEROLD, BETSY\n C\n , MD,\n (16)\n PROFESSOR\n DEPARTMENT OF PEDIATRICS\n ALBERT EINSTEIN COLLEGE OF MEDICINE\n BRONX,\n NY 10461',
 'JUSTICE, AMY\n CAROLINE\n , MD, PHD,\n (16)\n PROFESSOR\n DEPARTMENT OF PEDIATRICS\n YALE UNIVERSITY\n NEW HAVEN,\n CT 06520',
 'KATZENSTEIN, DAVID\n ALLENBERG\n , MD,\n (13)\n PROFESSOR\n DIVISION OF INFECTIOUS DISEASES\n STANFORD UNIVERSITY SCHOOL OF MEDICINE\n STANFORD,\n CA 94305',
 'MARGOLIS, DAVID\n M\n , MD,\n (14)\n PROFESSOR\n DEPARTMENT OF MEDICINE\n UNIVERSITY OF NORTH CAROLINA, CHAPEL HILL\n CHAPEL HILL,\n NC 27599',
 'MONTANER, LUIS\n J\n , DVM, PHD,\n (13)\n PROFESSOR\n DEPARTMENT OF IMMUNOLOGY\n THE WISTAR INSTITUTE\n PHILADELPHIA,\n PA 19104',
 'MONTANO, MONTY\n A\n , PHD,\n (15)\n RESEARCH SCIENTIST\n DEPARTMENT OF IMMUNOLOGY AND\n INFECTIOUS DISEASES\n BOSTON UNIVERSITY\n BOSTON,\n MA 02115',
 'PAGE, KIMBERLY\n , PHD, MPH,\n (16)\n PROFESSOR\n DIVISION OF PREVENTIVE MEDICINE AND PUBLIC HEALTH\n AND GLOBAL HEALTH SCIENCES\n DEPARTMENT OF EPIDEMIOLOGY AND BIOSTATICTICS\n UNIVERSITY OF CALIFORNIA, SAN FRANCISCO\n UNIVERSITY OF CALIFORNIA, SAN FRANCISCO\n SAN FRANCISCO,\n CA 94105',
 'SHIKUMA, CECILIA\n M\n , MD,\n (15)\n PROFESSOR\n DEPARTMENT OF MEDICINE\n HAWAII AIDS CLINICAL RESEARCH PROGRAM\n UNIVERSITY OF HAWAII\n HONOLULU,\n HI 96816',
 'WOOD, CHARLES\n , PHD,\n (13)\n PROFESSOR\n UNIVERSITY OF NEBRASKA\n LINCOLN,\n NE 68588'
]
create a table with the columns NAME, TITLE, LOCATION
There may be a more elegant solution, but I feel like the simplest way would be to just loop the siblings of the headers and keep count of consecutive <br>s.
doubleBr = soup.select('br')[:2] # [ so the last person also gets added ]
personsList = []
for f in soup.select('td>font>font:has(b+br)'): # section headers
    role, lCur, pCur, brCt = f.get_text(' ').strip('-').strip(), [], [], 0
    for lf in f.find_next_siblings(['font','br'])+doubleBr:
        brCt = brCt+1 if lf.name == 'br' else 0 # consecutive <br> count
        if pCur and (brCt>1 or lf.b): # blank line or next header: person is complete
            pDets = {'role': role, 'name': '?'} # initiate
            if len(pCur)>1: pDets['title'] = pCur[1]
            pDets['name'], pCur = pCur[0], pCur[2:]
            dList = pCur[:-2]
            pDets['departments'] = dList[0] if len(dList)==1 else dList
            if len(pCur)>1: pDets['institute'] = pCur[-2]
            if pCur: pDets['location'] = pCur[-1]
            personsList.append(pDets)
            pCur, lCur, brCt = [], [], 0 # clear
        if lf.b: break # reached next section
        if lf.name == 'font': # [split and join to minimize whitespace]
            lCur.append(' '.join(lf.get_text(' ').split())) # add to line
        if brCt and lCur: pCur, lCur = pCur+[' '.join(lCur)], [] # newline
Since personsList is a list of dictionaries, it can be tabulated as simply as pandas.DataFrame(personsList) to get a DataFrame that looks like:
role | name | title | departments | institute | location |
---|---|---|---|---|---|
CHAIRPERSON | SCHACKER, TIMOTHY W , MD | PROFESSOR | DEPARTMENT OF MEDICINE | UNIVERSITY OF MINNESOTA | MINNEAPOLIS, MN 55455 |
MEMBERS | ANDERSON, JEAN R , MD | PROFESSOR | DEPARTMENT OF GYNECOLOGY AND OBSTETRICS | JOHNS HOPKINS UNIVERSITY | BALTIMORE, MD 21287 |
MEMBERS | BALASUBRAMANYAM, ASHOK , MD | PROFESSOR | ['DEPARTMENT OF MEDICINE AND', 'MOLECULAR AND CELLULAR BIOLOGY', 'DIVISION OF DIABETES, ENDOCRINOLOGY AND METABOLISM'] | BAYLOR COLLEGE OF MEDICINE | HOUSTON, TX 77030 |
MEMBERS | BLATTNER, WILLIAM ALBERT , MD | PROFESSOR AND ASSOCIATE DIRECTOR | ['DEPARTMENT OF MEDICNE', 'INSTITUTE OF HUMAN VIROLOGY'] | UNIVERSITY OF MARYLAND, BALTIMORE | BALTIMORE, MD 21201 |
MEMBERS | CHEN, YING QING , PHD | PROFESSOR | PROGRAM IN BIOSTATISTICS AND BIOMATHEMATICS | FRED HUTCHINSON CANCER RESEARCH CENTER | SEATTLE, WA 981091024 |
MEMBERS | COTTON, DEBORAH , MD | PROFESSOR | ['SECTION OF INFECTIOUS DISEASES', 'DEPARTMENT OF MEDICINE'] | BOSTON UNIVERSITY | BOSTON, MA 02118 |
MEMBERS | DANIELS, MICHAEL J , SCD | PROFESSOR | DEPARTMENT OF BIOSTATISTICS | UNIVERSITY OF TEXAS AT AUSTIN | AUSTIN, TX 78712 |
MEMBERS | FOULKES, ANDREA SARAH , SCD | ASSOCIATE PROFESSOR | DEPARTMENT OF BIOSTATISTICS | UNIVERSITY OF MASSACHUSETTS | AMHERST, MA 01003 |
MEMBERS | HEROLD, BETSY C , MD | PROFESSOR | DEPARTMENT OF PEDIATRICS | ALBERT EINSTEIN COLLEGE OF MEDICINE | BRONX, NY 10461 |
MEMBERS | JUSTICE, AMY CAROLINE , MD, PHD | PROFESSOR | DEPARTMENT OF PEDIATRICS | YALE UNIVERSITY | NEW HAVEN, CT 06520 |
MEMBERS | KATZENSTEIN, DAVID ALLENBERG , MD | PROFESSOR | DIVISION OF INFECTIOUS DISEASES | STANFORD UNIVERSITY SCHOOL OF MEDICINE | STANFORD, CA 94305 |
MEMBERS | MARGOLIS, DAVID M , MD | PROFESSOR | DEPARTMENT OF MEDICINE | UNIVERSITY OF NORTH CAROLINA, CHAPEL HILL | CHAPEL HILL, NC 27599 |
MEMBERS | MONTANER, LUIS J , DVM, PHD | PROFESSOR | DEPARTMENT OF IMMUNOLOGY | THE WISTAR INSTITUTE | PHILADELPHIA, PA 19104 |
MEMBERS | MONTANO, MONTY A , PHD | RESEARCH SCIENTIST | ['DEPARTMENT OF IMMUNOLOGY AND', 'INFECTIOUS DISEASES'] | BOSTON UNIVERSITY | BOSTON, MA 02115 |
MEMBERS | PAGE, KIMBERLY , PHD, MPH | PROFESSOR | ['DIVISION OF PREVENTIVE MEDICINE AND PUBLIC HEALTH', 'AND GLOBAL HEALTH SCIENCES', 'DEPARTMENT OF EPIDEMIOLOGY AND BIOSTATICTICS', 'UNIVERSITY OF CALIFORNIA, SAN FRANCISCO'] | UNIVERSITY OF CALIFORNIA, SAN FRANCISCO | SAN FRANCISCO, CA 94105 |
MEMBERS | SHIKUMA, CECILIA M , MD | PROFESSOR | ['DEPARTMENT OF MEDICINE', 'HAWAII AIDS CLINICAL RESEARCH PROGRAM'] | UNIVERSITY OF HAWAII | HONOLULU, HI 96816 |
MEMBERS | WOOD, CHARLES , PHD | PROFESSOR | [] | UNIVERSITY OF NEBRASKA | LINCOLN, NE 68588 |
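Note that the departments column mixes plain strings and lists. If a flat table with just NAME, TITLE, LOCATION is the goal, a small post-processing step could look like the sketch below. It uses three rows copied from the table above; the join_departments helper and the "; " separator are assumptions made for illustration, not part of the answer's code:

```python
import pandas as pd

# Three records shaped like the personsList entries shown in the table above.
persons = [
    {"role": "CHAIRPERSON", "name": "SCHACKER, TIMOTHY W , MD", "title": "PROFESSOR",
     "departments": "DEPARTMENT OF MEDICINE",
     "institute": "UNIVERSITY OF MINNESOTA", "location": "MINNEAPOLIS, MN 55455"},
    {"role": "MEMBERS", "name": "COTTON, DEBORAH , MD", "title": "PROFESSOR",
     "departments": ["SECTION OF INFECTIOUS DISEASES", "DEPARTMENT OF MEDICINE"],
     "institute": "BOSTON UNIVERSITY", "location": "BOSTON, MA 02118"},
    {"role": "MEMBERS", "name": "WOOD, CHARLES , PHD", "title": "PROFESSOR",
     "departments": [],
     "institute": "UNIVERSITY OF NEBRASKA", "location": "LINCOLN, NE 68588"},
]


def join_departments(value):
    """Hypothetical helper: collapse a list of departments into one string."""
    if isinstance(value, list):
        return "; ".join(value)
    return value


df = pd.DataFrame(persons)
df["departments"] = df["departments"].apply(join_departments)

# Keep only the three columns the question asked for.
table = df[["name", "title", "location"]]
print(table.to_string(index=False))
```

An empty list becomes an empty string here, which keeps the column a uniform type and makes it easier to export with to_csv.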
[ Btw, if the .select('br+br, font:has(br)') and .select('td>font>font:has(b+br)') parts are unfamiliar to you, you can look up .select and CSS selectors. Combinators [like > / + / ,] and pseudo-classes [like :has] allow us to get very specific with our targets. ]
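As a quick illustration of those selector pieces, here is a minimal sketch on a hypothetical snippet (the markup is invented to resemble the roster's nesting, not copied from the page). It shows the adjacent-sibling combinator, the child combinator, and :has used together:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: an outer font holding a bold header and two people.
html = (
    "<td><font>"
    "<font><b>MEMBERS</b><br></font>"
    "<font>DOE, JANE</font><br>"
    "<font>PROFESSOR</font><br><br>"
    "<font>ROE, JOHN</font><br>"
    "</font></td>"
)
soup = BeautifulSoup(html, "html.parser")

# "br+br": a <br> whose immediately preceding sibling element is also a <br>,
# i.e. the second tag of each blank-line pair.
second_brs = soup.select("br+br")

# "td>font>font:has(b+br)": a font that is a child of a font that is a child
# of a td, and that contains a <b> directly followed by a <br>.
headers = soup.select("td>font>font:has(b+br)")

print(len(second_brs))
print([h.get_text() for h in headers])
```

Only the header font matches the second selector, because the other inner fonts contain no b element and the outer font is not itself nested inside another font.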