I'm working with PubMed Central articles and need to create columns with the 'pmc' id, 'title', 'abstract', 'full-text' and 'authors'.
I have looked at other similar questions but was unable to apply them to my case. I would highly appreciate it if you could help me.
Here's my code and a link to the sample.
CodePudding user response:
This script shows how to parse the XML file with BeautifulSoup
and create a DataFrame with one row:
import pandas as pd
from bs4 import BeautifulSoup

# The "xml" parser requires lxml to be installed.
with open("your_file.xml", "r") as f_in:
    soup = BeautifulSoup(f_in.read(), "xml")

data = []
pmid = soup.select_one('[pub-id-type="pmid"]').text.strip()
title = soup.select_one("article-title").text.strip()
# Extracted here, but not added to the row below.
abstract = soup.select_one("abstract").text.strip()
# Join the text of every <sec> found inside <body>.
full_text = "\n".join(
    sec.get_text(strip=True, separator=" ") for sec in soup.select("body sec")
)
authors = ", ".join(
    a.get_text(strip=True, separator=" ")
    for a in soup.select('[contrib-type="author"]')
)
data.append(
    {"pmid": pmid, "title": title, "full_text": full_text, "authors": authors}
)
df = pd.DataFrame(data)
print(df)
Prints:
pmid title full_text authors
0 35409008 Repurposing Multiple-Molecule Drugs for COVID-19-Associated Acute Respiratory Distress Syndrome and Non-Viral Acute Respiratory Distress Syndrome via a Systems Biology Approach and a DNN-DTI Model Based on Five Drug Design Specifications 1. Introduction The coronavirus disease 2019 (COVID-19) is a novel pandemic caused by the new coronavirus severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). Since mid-July 2021, there have been more than 183 million cases and 3.9 million deaths around the world due to the rapid spread of COVID-19 [ 1 ]. SARS-CoV-2-infected patients have demonstrated a wide spectrum of clinical manifestations. Although the majority (81%) of COVID-19 patients experienced mild symptoms (e.g., asymptomatic, flu-like symptoms, or mild pneumonia), 14% of cases experienced severe symptoms (e.g., dyspnea or hypoxemia), around 5% of COVID-19 patients were critically ill (e.g., multiple organ failure or septic shock), and about 20% of COVID-19 patients required hospitalization [ 2 , 3 , 4 , 5 ]. Acute respiratory distress syndrome (ARDS), the severe form of acute lung injury (ALI), is an acute respiratory failure syndrome resulting from noncardiogenic lung edema and hypoxemia [ 6 ]. Common causes of ARDS developments can be infective (viral or bacterial pneumonia) or non-infective (e.g., pancreatitis and trauma). ARDS is also a frequent complication in COVID-19. Among hospitalized COVID-19 patients, about 30~40% of patients develop ARDS, 26% require intensive care unit (ICU) facilities, and 16% receive intermittent mandatory ventilation (IMV). Furthermore, for the ICU COVID-19 patients, 75% have ARDS. The mortality rate of COVID-19-associated ARDS patients approximately ranges from 26% to 61.5% [ 7 , 8 , 9 , 10 ]. The high incidence and mortality ratio observed among COVID-19-associated ARDS cases indicate that there is an urgent need to develop relative pharmaceutical therapies. 
Comparisons of clinical characteristics and pathophysiology between COVID-19-associated ARDS and classical ARDS (not associated with SARS-CoV-2) are still under debate. Most of the recent evidence suggest that there is no significant difference regarding respiratory compliance, lung morphology, and myocardial injury [ 11 ]. Some studies have also indicated that COVID-19-associated ARDS has higher coagulation potential and thromboembolic complications risk [ 12 , 13 ]. However, their corresponding molecular pathogenetic mechanisms and the role of epigenetics and genetic factors between COVID-19-associated ARDS and classical ARDS (not associated with SARS-CoV-2) are not fully understood. The microRNAs (miRNA) are short, non-protein-coding, and single-stranded RNA with 18–25 nucleotides in length. After binding to the 3′-untranslated region (3′UTR) or 5′-untranslated region (5′UTR) of mRNA transcripts, microRNAs can post-transcriptionally control gene expression either by mRNA degradation or directly inhibiting the translation process [ 14 , 15 ]. Given that miRNAs can control some biological activities in multi-levels such as cell proliferation, apoptosis, and even immune responses during virus infection, several studies have been dedicated to elucidating the complicated pathogenesis and epigenetic interplay between SARS-CoV-2 and humans. Several dysregulated miRNAs observed in differential gene analysis results have also been identified as biomarkers and proposed as therapeutic targets for COVID-19. In addition, the discovery of SARS-CoV-2 encoded miRNAs that can target human genes has also been investigated, although it is controversial because RNA viruses are mainly replicated in the cytoplasm and miRNA production may interfere with the replication of the viral genome. Several machine-learning-based bioinformatics tools and databases have been developed to predict virus-encoded miRNA and possible targets of human genes [ 16 , 17 , 18 ]. 
Long noncoding RNAs (lncRNAs) are another type of functional, non-protein-coding RNA longer than 200 nucleotides. By interacting with mRNA, DNA, or transcription factors, lncRNAs engage in versatile biological events such as modulating gene expression, epigenetic modification [ 19 , 20 ]. Increasing evidence has shown that lncRNAs play important roles during SARS-CoV-2 infection. For example, recent studies indicated that lncRNAs NEAT1 and MALAT1 are associated with immune responses in SARS-CoV-2 infected cells [ 21 , 22 ]. In traditional drug discovery, the average period of new drug development pipelines takes at least 12 years from the initial discovery to the marketplace [ 23 ]. Although the pharmaceutical industry invested 83 billion USD worldwide on research and development (R&D) expenditures in 2019 [ 24 ], the success rate of a drug candidate starting from clinical trial to marketing approval was approximately 10~20%, which has not changed for the past few decades [ 25 ]. On the contrary, drug repurposing (also known as drug repositioning), which aims to identify new therapeutic uses of approved or investigational drugs, is a feasible and advantageous strategy with a lower development risk and time cost. To this end, numerous approaches for drug repurposing have been developed, including experimental models, retrospective clinical analysis, virtual screening, signature-based methods, pathway mapping, etc. [ 26 ]. Additionally, combination therapies deployed with repurposed drugs have also been considered as therapeutic interventions for COVID-19. At present, thousands of repurposed clinical trials are being tested for COVID-19 [ 27 , 28 , 29 ]. Although most of them are monotherapy, the importance of accelerating the evaluation efficacy should not be neglected. In t.\n2. Results 2.1. 
Overview of Core HPI-GWGEN Construction and Drug Discovery Design for COVID-19-Associated ARDS and Non-Viral ARDS by Systems Biology Approach The research flowchart, as shown in Figure 1 , is used to summarize how to construct candidate HPI-GWGEN, real HPI-GWGEN, core HPI-GWGEN, and core signaling pathways of COVID-19-associated ARDS and non-viral ARDS. Sample groups and statistics of the node of COVID-19-associated ARDS and non-viral ARDS are described in Table 1 . Essentiallhe top 4000 nodes in core HPI-GWGENs of COVID-19-associated ARDS and non-viral ARDS, we also utilized DAVID Bioinformatics Resources (2021 update) [ 31 ] to obtain the enrichment analysis of Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways annotation and correlative cellular functions, as shown in Tables S2 and S3 , respectively. On the basis of referencing literature surveys and the KEGG signaling pathways annotation, we obtained core signaling pathways of COVID-19-associated ARDS and non-viral ARDS. Then, through investigating the common and specific core signaling pathways between COVID-19-associated ARDS and non-viral ARDS in Figure 4 , we identified common specific biomarkers of infection pathogenesis as drug targets, which were TNF, NFκB, HIF1A, GRP78, FTO, and BECN1 (in Table 6) for COVID-19-associated ARDS and TNF, NFκB, HIF1A, and FOXA1 (in Table 7) for non-viral ARDS. Afterward, we trained a DTI model of DNN by drug–target interaction data in advance. By the use of the DNN-DTI model, we obtained a binary classifier, with a high probability to predict potential candidate drugs for these drug targets of 007" ref-type="table">Table 7 , respectively. Detailed discussions of the above results are described in the following subsections.\n2.1. 
Overview of Core HPI-GWGEN Construction and Drug Discovery Design for COVID-19-Associated ARDS and Non-Viral ARDS by Systems Biology Approach The research flowchart, as shown in Figure 1 , is used to summarize how to construct candidate HPI-GWGEN, real HPI-GWGEN, core HPI-GWGEN, and core signaling pathways of COVID-19-associated ARDS and non-viral ARDS. Sample groups and statistics of the node of COVID-19-associated ARDS and non-viral ARDS are described in Table 1 . Essentiallhe top 4000 nodes in core HPI-GWGENs of COVID-19-associated ARDS and non-viral ARDS, we also utilized DAVID Bioinformatics Resources (2021 update) [ 31 ] to obtain the enrichment analysis of Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways annotation and correlative cellular functions, as shown in Tables S2 and S3 , respectively. On the basis of referencing literature surveys and the KEGG signaling pathways annotation, we obtained core signaling pathways of COVID-19-associated ARDS and non-viral ARDS. Then, through investigating the common and specific core signaling pathways between COVID-19-associated ARDS and non-viral ARDS in Figure 4 , we identified common specific biomarkers of infection pathogenesis as drug targets, which were TNF, NFκB, HIF1A, GRP78, FTO, and BECN1 (in Table 6) for COVID-19-associated ARDS and TNF, NFκB, HIF1A, and FOXA1 (in Table 7) for non-viral ARDS. Afterward, we trained a DTI model of DNN by drug–target interaction data in advance. By the use of the DNN-DTI model, we obtained a binary classifier, with a high probability to predict potential candidate drugs for these drug targets of 007" ref-type="table">Table 7 , respectively. Detailed discussions of the above results are described in the following subsections.\n2.2. 
The Common Pathogenic Molecular Mechanism between COVID-19-Associated ARDS and Non-Viral ARDS From the first common signaling pathway related to inflammation, as shown in Figure 4 , after interacting with microenvironment factor TNFa, receptor TNFR1 can activat possible strategy for preventing the aggravation of inflammation in ARDS [ 46 , 47 , 48 , 49 ]. Additionally, TAK1 can also stimulate the MAPK signaling pathway comprised of MKK6/MAPK13. Typically, androgen receptor (AR) belongs to the nuclear receptor family that has the dual role of functioning as transcription factors. Apart from being activated through steroids-mediated induction, transcription factor AR can also be phosphorylated by kinases involved in the signaling transduction pathway and provoke the expression of cytokine-related target genes TNF and IL6 , such behavior has been commonly described in several cancer researches [ 50 , 51 ]. In this study, transcription factor AR links with MAPK13 (p38 delta) and contributes to inflammation. Lack of negative regulator of immune response may also contribute to the hyperinflammation of cytokine. From the core common signaling pathways, as shown in Figure 4 , we demonstrated that TNF alpha induced protein 8 like 2 (TIPE2), a negative regulator considered to modulate the NFKB and MAPK signaling pathways, can inhibit Ras signaling effector Ras2 to downregulate PI3KCB. One study indicated that PRKCD could be phosphorylated by PI3KCB, confirming this downstream interactor of PI3KCB [ 52 ]. PRKCD can further interact with transcription factor FLI1 to induce the target genes CCL5 and IL6 [ 53 , 54 , 55 ]. CCL5(RANTES), encoded by gene CCL5 , is a chemokine contributing to leukocyte recruitment in innate immune responses [ 56 ]. It is noticed that there is a relatively lower expression of TIPE2, whereas relative higher expressions of its downregulated proteins were observed, signifying that the inhibitory effect of TIPE2 may be attenuated. 
Since there also exists an upstream interaction between TAK1 and TIPE2 in this study, it is reasonable to suppose that TIPE2 ubiquitination may contribute to the loss-of-control cytokine production [ 57 ]. Collectively, the common molecular mechanisms in COVID-19-associated ARDS and non-viral ARDS are leukocyte recruitments, inflammation, innate immune responses, apoptosis, and T cell inhibition. Based on the results of core signaling analyses and considering relative protein/gene expression levels as compared with normal nasopharyngeal tissues [ 58 ], we choose TNF, NFkB, and HIF1A as common biomarkers (drug targets) of infections pathogenesis in both COVID-19-associated ARDS and non-viral ARDS.\n2.3. The Specific Pathogenic Molecular Mechanism of COVID-19-Associated ARDS The early stage of the SARS-CoV-2 life cycle begins from the attachment of the host cellular receptor and the membrane fusion between virus and host cell. Accomplishments of both events are required for releasing viral RNA into the cytoplasm for the subsequent replication and translation. Although, currently, it has been effectively established that angiotensin-converting enzyme 2 (ACE2) is the main receptor for SARS-CoV-2 cell entry [ 59 ], there is no stop to identifying novel receptors that may potentiate the SARS-CoV-2 infectivity. Several cell receptors are identified to interact with the Spike protein of SARS-CoV-2 in Figure 4 . Firstly, ITGB3, an integrin protein thought to contain an LC3-interacting region (LIR), can bind to LC3 and contribute to autophagy upon activation [ 60 ]. In agreement with the previous studies that the toll-like receptor (TLR) signaling pathway can be triggered by structural proteins of SARS-CoV-2 [ 61 , 62 , 63 ]. After recognizing the Spike protein of SARS-CoV-2, receptor TLR4 could transmit the signal to TRAF6 by recruitment of adaptor proteins either IRAK4 or TRAM/TRIF. TRAF6 could promot73 ]. 
The positive feedback loop of GRP78 production established by virus infection may eventually lead to the sustained UPR and subsequent apoptosis. Moreover, IRE1α also contributes to inflammation by transmitting the signal through MKK7 and MAPK10. Ting Ching-Tse, Chen Bor-Sen *
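One thing to watch in the script above: the selector "body sec" matches every <sec> at any depth, so a nested subsection contributes its text once on its own and once more as part of its parent section. Below is a minimal sketch of one way around that. It uses the standard library's xml.etree.ElementTree instead of BeautifulSoup so that it runs without extra packages; the sample XML and the function name top_level_sections are made up for illustration and are not taken from the real article file.

```python
import xml.etree.ElementTree as ET

# Tiny made-up stand-in for an article body (not the real PMC file).
SAMPLE = """<article><body>
<sec><title>1. Introduction</title><p>Intro text.</p></sec>
<sec><title>2. Results</title>
  <sec><title>2.1. Overview</title><p>Overview text.</p></sec>
</sec>
</body></article>"""


def top_level_sections(xml_text):
    """Return the text of each direct <sec> child of <body>."""
    root = ET.fromstring(xml_text)
    body = root.find(".//body")  # first <body> anywhere below the root
    if body is None:
        return []
    sections = []
    # findall("sec") only looks at direct children, so a nested <sec>
    # is included once, as part of its parent's text.
    for sec in body.findall("sec"):
        # Join text fragments with a space, then collapse whitespace runs.
        text = " ".join(" ".join(sec.itertext()).split())
        sections.append(text)
    return sections


sections = top_level_sections(SAMPLE)
full_text = "\n".join(sections)
print(full_text)
```

If you prefer to stay with BeautifulSoup, the analogous change would be the child selector "body > sec" in place of "body sec".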
CodePudding user response:
import pandas as pd
from bs4 import BeautifulSoup

# Parse the article XML (the "xml" parser requires lxml).
with open("covid_19", "r") as f_in:
    soup = BeautifulSoup(f_in.read(), "xml")

data = []
pmid = soup.select_one('[pub-id-type="pmid"]').text.strip()
title = soup.select_one("article-title").text.strip()
abstract = soup.select_one("abstract").text.strip()
# Take the whole <body> text in one go instead of section by section.
full_text = soup.select_one("body").text.strip()
data.append(
    {"pmid": pmid, "title": title, "abstract": abstract, "full_text": full_text}
)
df = pd.DataFrame(data)
# Write the one-row DataFrame to a CSV file.
df.to_csv('covid_19.csv')
Thanks to @Andrej Kesely
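The question also asked for the 'pmc' id and the authors, and both scripts fail with an AttributeError if an element such as the abstract is missing. Here is a rough sketch of how that could be handled. It uses the standard library's xml.etree.ElementTree rather than BeautifulSoup so it runs on its own. The sample XML, the placeholder PMC id, and the helper names text_of and article_row are all made up for illustration, and the attribute value pub-id-type="pmc" is an assumption you should check against your own file.

```python
import xml.etree.ElementTree as ET

# Made-up miniature article; the "pmc" id value is a placeholder.
SAMPLE = """<article><front><article-meta>
<article-id pub-id-type="pmid">35409008</article-id>
<article-id pub-id-type="pmc">0000000</article-id>
<title-group><article-title>Example title</article-title></title-group>
<contrib-group>
  <contrib contrib-type="author"><name><surname>Ting</surname><given-names>Ching-Tse</given-names></name></contrib>
  <contrib contrib-type="author"><name><surname>Chen</surname><given-names>Bor-Sen</given-names></name></contrib>
</contrib-group>
</article-meta></front></article>"""


def text_of(node, path, default=""):
    """Whitespace-normalised text of the first match, or default if absent."""
    found = node.find(path)
    if found is None:
        return default
    return " ".join(" ".join(found.itertext()).split())


def article_row(xml_text):
    """Build one row (a dict) with the columns asked for in the question."""
    root = ET.fromstring(xml_text)
    authors = []
    for contrib in root.findall(".//contrib[@contrib-type='author']"):
        parts = (
            text_of(contrib, "name/surname"),
            text_of(contrib, "name/given-names"),
        )
        name = " ".join(p for p in parts if p)
        if name:
            authors.append(name)
    return {
        "pmc": text_of(root, ".//article-id[@pub-id-type='pmc']"),
        "title": text_of(root, ".//article-title"),
        "abstract": text_of(root, ".//abstract"),  # "" when there is no abstract
        "authors": ", ".join(authors),
    }


row = article_row(SAMPLE)
print(row)
```

Each such row can be appended to the data list and passed to pd.DataFrame in the same way as in the scripts above.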