I am trying to extract the table element inside the content:encoded tag while reading the content of an XML file with Python's Beautiful Soup.
I get the error below:
An error occurred ':encoded' pseudo-class is not implemented at this time
My code is below:
import bs4

content_html_list = []

def main():
    try:
        # First get the XML file content and store it in a string variable called "res_text"
        soup = bs4.BeautifulSoup(res_text, features="xml")
        content_html_tag_list = soup.select('content:encoded')
        for content_htmls in content_html_tag_list:
            content_html = content_htmls.text
            content_html_list.append(content_html)
        print(f"content_html_list 0 is, {len(content_html_list)}")
        print(f"content_html_list = , {content_html_list}")
    except Exception as e:
        print(f'An error occurred {str(e)}')

main()
The XML is below:
<item>
<content:encoded>
<![CDATA[ <TABLE BORDER=0 WIDTH='100%'><TR><TD><table><tr><td>Funding Opportunity ID: </td><td>335905</td></tr><tr><td>Opportunity Number: </td><td>HHS-2022-ACF-OPRE-YE-0106</td></tr><tr><td>Opportunity Title:</td><td>Child Care Policy Research Partnership Grants</td></tr><tr><td>Opportunity Category:</td><td>Discretionary</td></tr><tr><td>Opportunity Category Explanation:</td><td></td></tr><tr><td valign='top'>Funding Instrument Type: </td><td>Cooperative Agreement</td></tr><tr><td valign='top'>Category of Funding Activity: </td><td>Income Security and Social Services</td></tr><tr><td valign='top'>Category Explanation: </td><td></td></tr><tr><td valign='top'>CFDA Number(s): </td><td>93.575</td></tr><tr><td valign='top'>Eligible Applicants:</td><td>State governments<br>County governments<br>City or township governments<br>Special district governments<br>Independent school districts<br>Public and State controlled institutions of higher education<br>Native American tribal governments (Federally recognized)<br>Public housing authorities/Indian housing authorities<br>Native American tribal organizations (other than Federally recognized tribal governments)<br>Nonprofits having a 501(c)(3) status with the IRS, other than institutions of higher education<br>Nonprofits that do not have a 501(c)(3) status with the IRS, other than institutions of higher education<br>Private institutions of higher education<br>For profit organizations other than small businesses<br>Small businesses<br>Others (see text field entitled "Additional Information on Eligibility" for clarification)</td></tr><tr><td valign='top'>Additional Information on Eligibility:</td><td>The applicant eligibility is unrestricted. Applications from individuals (including sole proprietorships) and foreign entities are not eligible and will be disqualified from competitive review and from funding under this funding opportunity announcement. 
Faith-based and community organizations that meet the eligibility requirements are eligible to receive awards under this funding opportunity. Faith-based organizations may apply for this award on the same basis as any other organization, as set forth at and, subject to the protections and requirements of 45 CFR Part 87 and 42 U.S.C. § 2000bb et seq., ACF will not, in the selection of recipients, discriminate against an organization on the basis of the organization's religious character, affiliation, or exercise.</td></tr><tr><td valign='top'>Agency Code:</td><td>HHS-ACF-OPRE</td></tr><tr><td valign='top'>Agency Name:</td><td>Department of Health and Human Services<br>Administration for Children and Families - OPRE</td></tr><tr><td>Posted Date:</td><td>Mar 01, 2022</td></tr><tr><td>Close Date:</td><td>Jun 10, 2022 Electronically submitted applications must be submitted no later than 11:59 pm Eastern Standard Time on the listed application due date.</td></tr><tr><td>Last Updated Date:</td><td>Mar 01, 2022</td></tr><tr><td>Award Ceiling:</td><td>$400,000</td></tr><tr><td>Award Floor:</td><td>$100,000</td></tr><tr><td>Estimated Total Program Funding:</td><td>$3,200,000</td></tr><tr><td>Expected Number of Awards:</td><td>8</td></tr><tr><td>Description:</td><td>The Administration for Children and Families (ACF) plans to solicit applications for Child Care Policy Research Partnership (CCPRP) Grants. These four-year cooperative agreements will support partnerships between Child Care and Development Fund (CCDF) Lead Agencies in states, territories, or tribes and institutions with demonstrated research capacity to develop rigorous investigations of child care subsidy policies and practices. Sponsored projects will inform local and federal understanding about the efficacy of child care subsidy policies and practices to increase low-income families’ access to quality child care. 
To ensure that the funded work is timely and relevant to the current child care context, projects are expected to be collaborative from start to finish. The CCDF Lead Agency and their research partners must work together throughout all phases of the project and are encouraged to engage other interested parties, as appropriate. This iteration of the CCPRP Grants Program will prioritize research projects exploring (1) evidence-informed approaches to measuring quality across different provider types and (2) approaches to building the supply of high-quality child care through targeted investments in the early childhood workforce. Sponsored projects will be expected to participate in a consortium that will meet and communicate regularly to identify opportunities for coordination, such as common data elements and research methods, and to develop collective expertise and resources for the field. The consortium’s collaboration will support research capacity and learning within individual projects and across recipients. For further information about prior awards made for CCPRP Grants, see https://www.acf.hhs.gov/opre/project/child-care-policy-research-partnerships-1995-2023.</td></tr><tr><td>Version:</td><td>1</td></tr></table></TD></TR></TABLE> ]]>
</content:encoded>
<dc:date>2022-04-20T17:15:42Z</dc:date>
</item>
Please advise on the best way to extract the text.
CodePudding user response:
Escape the : so that it is not treated as a pseudo-class, and use the 'lxml' parser. A raw string keeps Python from interpreting the backslash itself:
soup.select(r'content\:encoded')
Example:
from bs4 import BeautifulSoup as bs
s = '''
<item>
<content:encoded>
<![CDATA[ <TABLE BORDER=0 WIDTH='100%'><TR><TD><table><tr><td>Funding Opportunity ID: </td><td>335905</td></tr><tr><td>Opportunity Number: </td><td>HHS-2022-ACF-OPRE-YE-0106</td></tr><tr><td>Opportunity Title:</td><td>Child Care
Policy Research Partnership Grants</td></tr><tr><td>Opportunity Category:</td><td>Discretionary</td></tr><tr><td>Opportunity Category Explanation:</td><td></td></tr><tr><td valign='top'>Funding Instrument Type: </td><td>Cooperative
Agreement</td></tr><tr><td valign='top'>Category of Funding Activity: </td><td>Income Security and Social Services</td></tr><tr><td valign='top'>Category Explanation: </td><td></td></tr><tr><td valign='top'>CFDA Number(s):
</td><td>93.575</td></tr><tr><td valign='top'>Eligible Applicants:</td><td>State governments<br>County governments<br>City or township governments<br>Special district governments<br>Independent school districts<br>Public and State
controlled institutions of higher education<br>Native American tribal governments (Federally recognized)<br>Public housing authorities/Indian housing authorities<br>Native American tribal organizations (other than Federally
recognized tribal governments)<br>Nonprofits having a 501(c)(3) status with the IRS, other than institutions of higher education<br>Nonprofits that do not have a 501(c)(3) status with the IRS, other than institutions of higher
education<br>Private institutions of higher education<br>For profit organizations other than small businesses<br>Small businesses<br>Others (see text field entitled "Additional Information on Eligibility" for
clarification)</td></tr><tr><td valign='top'>Additional Information on Eligibility:</td><td>The applicant eligibility is unrestricted. Applications from individuals (including sole proprietorships) and foreign entities are not
eligible and will be disqualified from competitive review and from funding under this funding opportunity announcement. Faith-based and community organizations that meet the eligibility requirements are eligible to receive awards
under this funding opportunity. Faith-based organizations may apply for this award on the same basis as any other organization, as set forth at and, subject to the protections and requirements of 45 CFR Part 87 and 42 U.S.C. §
2000bb et seq., ACF will not, in the selection of recipients, discriminate against an organization on the basis of the organization's religious character, affiliation, or exercise.</td></tr><tr><td valign='top'>Agency
Code:</td><td>HHS-ACF-OPRE</td></tr><tr><td valign='top'>Agency Name:</td><td>Department of Health and Human Services<br>Administration for Children and Families - OPRE</td></tr><tr><td>Posted Date:</td><td>Mar 01,
2022</td></tr><tr><td>Close Date:</td><td>Jun 10, 2022 Electronically submitted applications must be submitted no later than 11:59 pm Eastern Standard Time on the listed application due date.</td></tr><tr><td>Last Updated
Date:</td><td>Mar 01, 2022</td></tr><tr><td>Award Ceiling:</td><td>$400,000</td></tr><tr><td>Award Floor:</td><td>$100,000</td></tr><tr><td>Estimated Total Program Funding:</td><td>$3,200,000</td></tr><tr><td>Expected Number of
Awards:</td><td>8</td></tr><tr><td>Description:</td><td>The Administration for Children and Families (ACF) plans to solicit applications for Child Care Policy Research Partnership (CCPRP) Grants. These four-year cooperative
agreements will support partnerships between Child Care and Development Fund (CCDF) Lead Agencies in states, territories, or tribes and institutions with demonstrated research capacity to develop rigorous investigations of child
care subsidy policies and practices. Sponsored projects will inform local and federal understanding about the efficacy of child care subsidy policies and practices to increase low-income families’ access to quality child care.
To ensure that the funded work is timely and relevant to the current child care context, projects are expected to be collaborative from start to finish. The CCDF Lead Agency and their research partners must work together throughout
all phases of the project and are encouraged to engage other interested parties, as appropriate. This iteration of the CCPRP Grants Program will prioritize research projects exploring (1) evidence-informed approaches to measuring
quality across different provider types and (2) approaches to building the supply of high-quality child care through targeted investments in the early childhood workforce. Sponsored projects will be expected to participate in a
consortium that will meet and communicate regularly to identify opportunities for coordination, such as common data elements and research methods, and to develop collective expertise and resources for the field. The
consortium’s collaboration will support research capacity and learning within individual projects and across recipients. For further information about prior awards made for CCPRP Grants, see
https://www.acf.hhs.gov/opre/project/child-care-policy-research-partnerships-1995-2023.</td></tr><tr><td>Version:</td><td>1</td></tr></table></TD></TR></TABLE> ]]>
</content:encoded>
<dc:date>2022-04-20T17:15:42Z</dc:date>
</item>
'''
soup = bs(s, 'lxml')
# Raw string so the backslash reaches the CSS selector unchanged
soup.select(r'content\:encoded')
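Once the content:encoded element is found, its text is still an HTML string (the CDATA payload), so it needs a second parse before the table cells can be read. Here is a minimal sketch of that two-step idea using only the standard library (xml.etree.ElementTree plus html.parser) rather than Beautiful Soup. The rss wrapper with its xmlns declarations, the shortened table, and the names CellCollector and extract_fields are assumptions made for illustration; they are not part of the original feed or code.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Assumption: the real feed declares the "content" prefix on its root element.
# This wrapper and the shortened table are illustrative stand-ins.
FEED = """<rss xmlns:content="http://purl.org/rss/1.0/modules/content/"
     xmlns:dc="http://purl.org/dc/elements/1.1/">
<item>
<content:encoded>
<![CDATA[ <TABLE BORDER=0 WIDTH='100%'><TR><TD><table><tr><td>Funding Opportunity ID: </td><td>335905</td></tr><tr><td>Opportunity Category Explanation:</td><td></td></tr><tr><td valign='top'>Agency Name:</td><td>Department of Health and Human Services<br>Administration for Children and Families - OPRE</td></tr></table></TD></TR></TABLE> ]]>
</content:encoded>
<dc:date>2022-04-20T17:15:42Z</dc:date>
</item>
</rss>"""

NAMESPACES = {"content": "http://purl.org/rss/1.0/modules/content/"}


class CellCollector(HTMLParser):
    """Collect the text of every <td>, grouped by the <tr> that holds it."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._stack = []   # rows currently open (the tables are nested)
        self._cell = None  # text fragments of the cell being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._stack.append([])
        elif tag == "td":
            self._cell = []
        elif tag == "br" and self._cell is not None:
            self._cell.append("\n")  # keep <br> as a line break

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            if self._stack:
                self._stack[-1].append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._stack:
            row = self._stack.pop()
            if row:  # the outer wrapper row ends up empty and is skipped
                self.rows.append(row)


def extract_fields(feed_xml):
    """Return one {label: value} dict per content:encoded element."""
    root = ET.fromstring(feed_xml)
    items = []
    for encoded in root.iterfind(".//content:encoded", NAMESPACES):
        collector = CellCollector()
        collector.feed(encoded.text or "")  # CDATA arrives as plain text
        collector.close()
        items.append({row[0].rstrip(": "): row[1]
                      for row in collector.rows if len(row) == 2})
    return items


fields = extract_fields(FEED)
for label, value in fields[0].items():
    print(f"{label} -> {value!r}")
```

The same second parse can also be done with Beautiful Soup by passing the selected element's text to a new BeautifulSoup object and searching that for td tags.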