How to scrape text from different classes within 'td' in BS4?

So I want to scrape the table from this website: NCLT

I want to take this table's data in the same format, including the hyperlinks, and put it into an Excel worksheet. I have already tried copying and pasting the table, but the formatting gets messed up. I want the data in a DataFrame that I can save in CSV or Excel format.

Here's what I've tried:

import requests
from bs4 import BeautifulSoup

url = "https://archive.nclt.gov.in/judgement-date-wise?field_bench_target_id=5372&field_search_date_value[min][date]=01/01/19&field_search_date_value[max][date]=01/01/21&page=0"
response = requests.get(url).text
soup = BeautifulSoup(response, 'html.parser')

This is the part where I am facing a problem:

t_r = soup.find_all('tr')
t_r[1]

gives following output:

<tr >
<td >
            1          </td>
<td >
            CPNo.464/BB/2018          </td>
<td >
            M/s Shankar Subramanya bhat Vs M/s Star Cable Infomet Pvt Ltd
<br/>
A Murali,  Advocate 
<br/>
SPJ Legal R1-4          </td>
<td >
<span  content="2020-01-16T00:00:00 05:30" datatype="xsd:dateTime" property="dc:date">16-01-2020</span> </td>
<td >
<a href="https://archive.nclt.gov.in/sites/default/files/January2021/final-orders-pdf/Star Cable.pdf" target="_blank"><img alt="PDF icon"  src="modules/file/icons/application-pdf.png" title="application/pdf"/></a> Size :- 6.5 MB, Language:- English          </td>
</tr>

Since I am new to BS4, I am trying to figure out how to extract each field from every t_r[i].

So far, my code can only get the S. No. using the following snippet:

t_r[1].td.string.replace('\n', '').strip()

I also want similar code to obtain: CPNo.464/BB/2018, M/s Shankar Subramanya bhat Vs M/s Star Cable Infomet Pvt Ltd, 16-01-2020, and https://archive.nclt.gov.in/sites/default/files/January2021/final-orders-pdf/Star Cable.pdf

For some reason, I am not able to do the same for the rest of the fields in t_r[i], and I don't understand how to progress further. I want to extract the rest of the data as well, but using 't_r.contents' isn't working. Any help would be greatly appreciated.

I want the output to look like the table on the site, but with just the href link in the last column instead of the size and language text.

CodePudding user response：

You can get the desired table data using only pandas.

import pandas as pd

url = 'https://archive.nclt.gov.in/judgement-date-wise?field_bench_target_id=5372&field_search_date_value[min][date]=01/01/19&field_search_date_value[max][date]=01/01/21&page=0'
table_data = pd.read_html(url)[0]
print(table_data)

Output:

S. No  ...                             PDF File
0      1  ...   Size :- 6.5 MB, Language:- English
1      2  ...  Size :- 2.02 MB, Language:- English
2      3  ...  Size :- 1.95 MB, Language:- English
3      4  ...  Size :- 2.26 MB, Language:- English
4      5  ...  Size :- 2.72 MB, Language:- English
5      6  ...  Size :- 3.35 MB, Language:- English
6      7  ...  Size :- 3.56 MB, Language:- English
7      8  ...  Size :- 1.22 MB, Language:- English
8      9  ...  Size :- 2.88 MB, Language:- English
9     10  ...  Size :- 2.18 MB, Language:- English

[10 rows x 5 columns]
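
Note that the output above has the size and language text in the last column, not the PDF links. If your pandas is version 1.5 or later, the `extract_links` argument of `read_html` may be enough to get the links without BeautifulSoup. The following is only a sketch under that assumption; it runs on a small inline sample (a shortened, hand-written stand-in for the real page) rather than the live site:

```python
from io import StringIO

import pandas as pd

# Shortened stand-in for the real table (assumption: the live page has the
# same structure, with one <a href=...> inside the last cell of each row).
SAMPLE_HTML = """<table>
<tr><th>S. No</th><th>Judgement date</th><th>PDF File</th></tr>
<tr><td>1</td><td><span>16-01-2020</span></td>
<td><a href="https://archive.nclt.gov.in/sites/default/files/January2021/final-orders-pdf/Star Cable.pdf"><img alt="PDF icon" src="icon.png"/></a> Size :- 6.5 MB, Language:- English</td></tr>
</table>"""

# With extract_links="body", every body cell becomes a (text, href) tuple;
# href is None for cells that contain no link.
raw = pd.read_html(StringIO(SAMPLE_HTML), extract_links="body")[0]

# Keep the text part of every cell ...
df = raw.apply(lambda col: col.map(lambda cell: cell[0]))
# ... except the last column, where we keep the link instead.
df["PDF File"] = raw["PDF File"].map(lambda cell: cell[1])

print(df)
```

Because every cell is unpacked from a tuple, the values come back as strings, so you may want to convert columns such as the serial number yourself.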

CodePudding user response：

To get the table with the PDF links, you can use the following example:

from io import StringIO

import requests
import pandas as pd
from bs4 import BeautifulSoup


url = "https://archive.nclt.gov.in/judgement-date-wise?field_bench_target_id=5372&field_search_date_value[min][date]=01/01/19&field_search_date_value[max][date]=01/01/21&page=0"

soup = BeautifulSoup(requests.get(url).content, "html.parser")
table = soup.select_one("table.views-table")
# wrap the HTML in StringIO: newer pandas versions no longer accept a literal HTML string
df = pd.read_html(StringIO(str(table)))[0]
# replace the "Size / Language" text with the PDF links
# (assumes every row has exactly one link in its PDF cell):
df["PDF File"] = [
    a["href"] for a in soup.select("td.views-field-field-final-orders-pdf a")
]

print(df)

Prints:

	S. No	Diary No. / Case No.[STATUS]	Name of Petitioner	Judgement date	PDF File
0	1	CPNo.464/BB/2018	M/s Shankar Subramanya bhat Vs M/s Star Cable Infomet Pvt Ltd A Murali, Advocate SPJ Legal R1-4	16-01-2020	https://archive.nclt.gov.in/sites/default/files/January2021/final-orders-pdf/Star Cable.pdf
1	2	CA(CAA)No.50/BB/2020	E2open Software India Pvt Ltd Vs Shyam Sundar H V, Adv Respondent Advocate : --	16-12-2020	https://archive.nclt.gov.in/sites/default/files/January2021/final-orders-pdf/E2OPen.pdf
2	3	CA(CAA)No.51/BB/2020	Amber Road Software Pvt Ltd Vs Shyam Sundar H V, Adv Respondent Advocate : --	16-12-2020	https://archive.nclt.gov.in/sites/default/files/January2021/final-orders-pdf/Amber Road.pdf
3	4	CA(CAA)No.48/BB/2020	Steelwedge Technologies Pvt Ltd Vs Shyam Sundar H V, Adv Respondent Advocate : --	16-12-2020	https://archive.nclt.gov.in/sites/default/files/January2021/final-orders-pdf/Steelwedge.pdf
4	5	CPNo.158/BB/2020	Marble Industry [Mangalore] Pvt Ltd & Others Vs ROC Chethan Jeevandas Nayak, PCS Respondent Advocate : --	16-10-2020	https://archive.nclt.gov.in/sites/default/files/January2021/final-orders-pdf/cp no 158 of 2020.pdf
5	6	CP(IB)No.156/BB/2017	Triumph India Software Services Pvt Ltd Vs Corporation Bank Girish Kumar M.S Shri Venkata Subbarao Kalva Liquidator, Vivekananda for Liquidator	04-12-2020	https://archive.nclt.gov.in/sites/default/files/January2021/final-orders-pdf/Triumph India Software Services Pvt Ltd Vs Corporation Bank.pdf
6	7	CPNo.129/BB/2020	M/s Shamel Projects India Pvt Ltd Vs Arjun Amanchi, Advocate Respondent Advocate : --	11-12-2020	https://archive.nclt.gov.in/sites/default/files/January2021/final-orders-pdf/NCLT20210119111719.pdf
7	8	CP(IB)No.263/BB/2019	M/s RDC Concrete (India) Pvt Ltd Vs M/s Sukritha Buildmann Pvt Ltd Ricab Chad, Advocate Abhijit Atur, Advocate	25-10-2019	https://archive.nclt.gov.in/sites/default/files/January2021/final-orders-pdf/Ms RDC Concrete (India) Pvt Ltd VS Ms Sukritha Buildmann Pvt Ltd _0.pdf
8	9	CP(IB)No.214/BB/2020	Shapoorji Pallonji and Company Pvt Ltd Vs Shore Dwellings Pvt Ltd[formerly known as Mantri Dwellings Pvt Ltd] Keystone Partners Respondent Advocate : --	18-12-2020	https://archive.nclt.gov.in/sites/default/files/January2021/final-orders-pdf/CP IB 214 of 2020.pdf
9	10	CPNo.193/BB/2020	Chiteta Mining Company Pvt Ltd Vs Jose Thomas, PCS Respondent Advocate : --	30-12-2020	https://archive.nclt.gov.in/sites/default/files/January2021/final-orders-pdf/NCLT20210111165206.pdf
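
If you would rather stay with plain BS4, as in the question, here is a rough sketch of pulling each field out of a row by hand. The reason `t_r[1].td.string` only works for the first cell is that `.string` returns None whenever a tag has more than one child, which is the case for the cells containing `<br/>`, `<span>` or `<a>`; `get_text()` and `stripped_strings` do not have that limitation. The sample below reuses the row printed in the question; the header row is reconstructed from the column names in the output above, and `parse_rows` is just a hypothetical helper name:

```python
from bs4 import BeautifulSoup

# Row copied from the question; the <th> header row is reconstructed
# from the printed column names and is an assumption about the page.
SAMPLE_HTML = """
<table>
<tr><th>S. No</th><th>Diary No. / Case No.[STATUS]</th><th>Name of Petitioner</th><th>Judgement date</th><th>PDF File</th></tr>
<tr>
<td>
            1          </td>
<td>
            CPNo.464/BB/2018          </td>
<td>
            M/s Shankar Subramanya bhat Vs M/s Star Cable Infomet Pvt Ltd
<br/>
A Murali,  Advocate 
<br/>
SPJ Legal R1-4          </td>
<td>
<span content="2020-01-16T00:00:00 05:30" datatype="xsd:dateTime" property="dc:date">16-01-2020</span> </td>
<td>
<a href="https://archive.nclt.gov.in/sites/default/files/January2021/final-orders-pdf/Star Cable.pdf" target="_blank"><img alt="PDF icon" src="modules/file/icons/application-pdf.png" title="application/pdf"/></a> Size :- 6.5 MB, Language:- English          </td>
</tr>
</table>
"""


def parse_rows(html):
    """Return one dict per data row of the judgement table."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find_all("tr"):
        cells = tr.find_all("td")
        if len(cells) != 5:
            continue  # skips the header row (<th> cells) and malformed rows
        link = cells[4].find("a", href=True)
        # stripped_strings yields each text fragment between the <br/> tags;
        # split()/join collapses repeated spaces inside a fragment.
        name = " ".join(" ".join(s.split()) for s in cells[2].stripped_strings)
        rows.append({
            "S. No": cells[0].get_text(strip=True),
            "Diary No. / Case No.[STATUS]": cells[1].get_text(strip=True),
            "Name of Petitioner": name,
            "Judgement date": cells[3].get_text(strip=True),
            "PDF File": link["href"] if link else None,
        })
    return rows


rows = parse_rows(SAMPLE_HTML)
print(rows[0])
```

For the live page you would pass `requests.get(url).text` instead of the sample. The resulting list of dicts can be handed to `pd.DataFrame(rows)` and saved with `to_csv` or `to_excel` (the latter needs an Excel engine such as openpyxl installed).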