How to scrape text from different classes within 'td' in BS4?

Time:04-18

I want to scrape the table from this website: NCLT

I want to take this table's data in the same format, including the hyperlinks, and put it into an Excel worksheet. I have already tried copy-pasting the table, but the formatting gets messed up. I want the data in a DataFrame that I can save in CSV or Excel format.

Here's what I've tried:

import requests
from bs4 import BeautifulSoup

url = "https://archive.nclt.gov.in/judgement-date-wise?field_bench_target_id=5372&field_search_date_value[min][date]=01/01/19&field_search_date_value[max][date]=01/01/21&page=0"
response = requests.get(url).text
soup = BeautifulSoup(response, 'html.parser')

This is the part where I am facing a problem:

t_r = soup.find_all('tr')
t_r[1]

which gives the following output:

<tr >
<td >
            1          </td>
<td >
            CPNo.464/BB/2018          </td>
<td >
            M/s Shankar Subramanya bhat Vs M/s Star Cable Infomet Pvt Ltd
<br/>
A Murali,  Advocate 
<br/>
SPJ Legal R1-4          </td>
<td >
<span  content="2020-01-16T00:00:00 05:30" datatype="xsd:dateTime" property="dc:date">16-01-2020</span> </td>
<td >
<a href="https://archive.nclt.gov.in/sites/default/files/January2021/final-orders-pdf/Star Cable.pdf" target="_blank"><img alt="PDF icon"  src="modules/file/icons/application-pdf.png" title="application/pdf"/></a> Size :- 6.5 MB, Language:- English          </td>
</tr> 

Since I am new to BS4, I am trying to figure out how to extract the individual fields from each t_r[i].

So far, my code is only able to get the serial number (S. No), using the following snippet:

t_r[1].td.string.replace('\n', '').strip()

I also want similar code to obtain the remaining fields: CPNo.464/BB/2018, M/s Shankar Subramanya bhat Vs M/s Star Cable Infomet Pvt Ltd, 16-01-2020, and https://archive.nclt.gov.in/sites/default/files/January2021/final-orders-pdf/Star Cable.pdf

For some reason, I am not able to do the same for the rest of the fields in t_r[i], and I don't understand how to progress further. I want to extract the rest of the data as well, but using t_r.contents isn't working. Any help would be greatly appreciated.

The output I want is the same table, but with just the href link in the last column instead of the size and language.

CodePudding user response:

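You can get the table data using only pandas. Note that read_html returns the text of each cell, so the last column contains the size and language rather than the link.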

import pandas as pd

url = 'https://archive.nclt.gov.in/judgement-date-wise?field_bench_target_id=5372&field_search_date_value[min][date]=01/01/19&field_search_date_value[max][date]=01/01/21&page=0'
table_data = pd.read_html(url)[0]
print(table_data)

Output:

S. No  ...                             PDF File
0      1  ...   Size :- 6.5 MB, Language:- English
1      2  ...  Size :- 2.02 MB, Language:- English
2      3  ...  Size :- 1.95 MB, Language:- English
3      4  ...  Size :- 2.26 MB, Language:- English
4      5  ...  Size :- 2.72 MB, Language:- English
5      6  ...  Size :- 3.35 MB, Language:- English
6      7  ...  Size :- 3.56 MB, Language:- English
7      8  ...  Size :- 1.22 MB, Language:- English
8      9  ...  Size :- 2.88 MB, Language:- English
9     10  ...  Size :- 2.18 MB, Language:- English

[10 rows x 5 columns]
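
The output above shows that the last column holds the cell text rather than the link. If a pandas-only approach is still preferred, read_html has an extract_links parameter (available since pandas 1.5) that may help. The following is a minimal sketch, not a tested solution for the live site: it assumes pandas 1.5 or newer with an HTML parser such as lxml installed, and it runs against a hypothetical, shortened copy of one table row (SAMPLE_HTML and the shortened "Case No." header are made up for illustration) instead of fetching the page.

```python
from io import StringIO

import pandas as pd

# Hypothetical, trimmed-down copy of one table row so the sketch runs offline.
SAMPLE_HTML = """
<table class="views-table">
  <thead>
    <tr><th>S. No</th><th>Case No.</th><th>Judgement date</th><th>PDF File</th></tr>
  </thead>
  <tbody>
    <tr>
      <td>1</td>
      <td>CPNo.464/BB/2018</td>
      <td>16-01-2020</td>
      <td><a href="https://archive.nclt.gov.in/sites/default/files/January2021/final-orders-pdf/Star Cable.pdf">PDF</a> Size :- 6.5 MB, Language:- English</td>
    </tr>
  </tbody>
</table>
"""

# extract_links="body" turns every body cell into a (text, href) tuple;
# href is None for cells that contain no link.
df = pd.read_html(StringIO(SAMPLE_HTML), extract_links="body")[0]

for column in df.columns:
    if column == "PDF File":
        df[column] = df[column].map(lambda cell: cell[1])  # keep only the link
    else:
        df[column] = df[column].map(lambda cell: cell[0])  # keep only the text

print(df)
```

Because every body cell becomes a tuple, the other columns have to be unpacked as well, which is what the loop does.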

CodePudding user response:

To get the PDF links from the table, you can use the following example:

import requests
import pandas as pd
from bs4 import BeautifulSoup
from io import StringIO


url = "https://archive.nclt.gov.in/judgement-date-wise?field_bench_target_id=5372&field_search_date_value[min][date]=01/01/19&field_search_date_value[max][date]=01/01/21&page=0"

soup = BeautifulSoup(requests.get(url).content, "html.parser")
table = soup.select_one("table.views-table")
# wrap the HTML in StringIO: newer pandas versions deprecate passing a literal HTML string
df = pd.read_html(StringIO(str(table)))[0]
# get pdf links:
df["PDF File"] = [
    a["href"] for a in soup.select("td.views-field-field-final-orders-pdf a")
]

print(df)

Prints:

S. No Diary No. / Case No.[STATUS] Name of Petitioner Judgement date PDF File
0 1 CPNo.464/BB/2018 M/s Shankar Subramanya bhat Vs M/s Star Cable Infomet Pvt Ltd A Murali, Advocate SPJ Legal R1-4 16-01-2020 https://archive.nclt.gov.in/sites/default/files/January2021/final-orders-pdf/Star Cable.pdf
1 2 CA(CAA)No.50/BB/2020 E2open Software India Pvt Ltd Vs Shyam Sundar H V, Adv Respondent Advocate : -- 16-12-2020 https://archive.nclt.gov.in/sites/default/files/January2021/final-orders-pdf/E2OPen.pdf
2 3 CA(CAA)No.51/BB/2020 Amber Road Software Pvt Ltd Vs Shyam Sundar H V, Adv Respondent Advocate : -- 16-12-2020 https://archive.nclt.gov.in/sites/default/files/January2021/final-orders-pdf/Amber Road.pdf
3 4 CA(CAA)No.48/BB/2020 Steelwedge Technologies Pvt Ltd Vs Shyam Sundar H V, Adv Respondent Advocate : -- 16-12-2020 https://archive.nclt.gov.in/sites/default/files/January2021/final-orders-pdf/Steelwedge.pdf
4 5 CPNo.158/BB/2020 Marble Industry [Mangalore] Pvt Ltd & Others Vs ROC Chethan Jeevandas Nayak, PCS Respondent Advocate : -- 16-10-2020 https://archive.nclt.gov.in/sites/default/files/January2021/final-orders-pdf/cp no 158 of 2020.pdf
5 6 CP(IB)No.156/BB/2017 Triumph India Software Services Pvt Ltd Vs Corporation Bank Girish Kumar M.S Shri Venkata Subbarao Kalva Liquidator, Vivekananda for Liquidator 04-12-2020 https://archive.nclt.gov.in/sites/default/files/January2021/final-orders-pdf/Triumph India Software Services Pvt Ltd Vs Corporation Bank.pdf
6 7 CPNo.129/BB/2020 M/s Shamel Projects India Pvt Ltd Vs Arjun Amanchi, Advocate Respondent Advocate : -- 11-12-2020 https://archive.nclt.gov.in/sites/default/files/January2021/final-orders-pdf/NCLT20210119111719.pdf
7 8 CP(IB)No.263/BB/2019 M/s RDC Concrete (India) Pvt Ltd Vs M/s Sukritha Buildmann Pvt Ltd Ricab Chad, Advocate Abhijit Atur, Advocate 25-10-2019 https://archive.nclt.gov.in/sites/default/files/January2021/final-orders-pdf/Ms RDC Concrete (India) Pvt Ltd VS Ms Sukritha Buildmann Pvt Ltd _0.pdf
8 9 CP(IB)No.214/BB/2020 Shapoorji Pallonji and Company Pvt Ltd Vs Shore Dwellings Pvt Ltd[formerly known as Mantri Dwellings Pvt Ltd] Keystone Partners Respondent Advocate : -- 18-12-2020 https://archive.nclt.gov.in/sites/default/files/January2021/final-orders-pdf/CP IB 214 of 2020.pdf
9 10 CPNo.193/BB/2020 Chiteta Mining Company Pvt Ltd Vs Jose Thomas, PCS Respondent Advocate : -- 30-12-2020 https://archive.nclt.gov.in/sites/default/files/January2021/final-orders-pdf/NCLT20210111165206.pdf
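
As for why the snippet in the question works only for the first cell: .string returns None when a tag has more than one child, which is the case for the cells containing <br/>, <span>, or <a>. A BeautifulSoup-only variant could use get_text(strip=True) and stripped_strings instead. The following is a rough sketch under assumptions: SAMPLE_HTML is a hypothetical copy of the row shown in the question (with a made-up header row), parse_row is an illustrative helper name, and the column order is assumed to match the question's output.

```python
from bs4 import BeautifulSoup

# Hypothetical sample mirroring the row from the question, so the sketch runs offline.
SAMPLE_HTML = """
<table>
<tr><th>S. No</th><th>Case No.</th><th>Name of Petitioner</th><th>Judgement date</th><th>PDF File</th></tr>
<tr>
<td>
            1          </td>
<td>
            CPNo.464/BB/2018          </td>
<td>
            M/s Shankar Subramanya bhat Vs M/s Star Cable Infomet Pvt Ltd
<br/>
A Murali,  Advocate
<br/>
SPJ Legal R1-4          </td>
<td>
<span>16-01-2020</span> </td>
<td>
<a href="https://archive.nclt.gov.in/sites/default/files/January2021/final-orders-pdf/Star Cable.pdf" target="_blank"><img alt="PDF icon"/></a> Size :- 6.5 MB, Language:- English          </td>
</tr>
</table>
"""


def parse_row(tr):
    """Return one table row as a dict, or None for rows without five <td> cells."""
    cells = tr.find_all("td")
    if len(cells) < 5:
        return None  # header row (<th> only) or an unexpected layout
    link = cells[4].find("a", href=True)
    return {
        "S. No": cells[0].get_text(strip=True),
        "Case No.": cells[1].get_text(strip=True),
        # stripped_strings yields each text fragment between the <br/> tags;
        # the first fragment holds the party names.
        "Name of Petitioner": next(cells[2].stripped_strings, ""),
        "Judgement date": cells[3].get_text(strip=True),
        "PDF File": link["href"] if link else None,
    }


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
rows = [row for tr in soup.find_all("tr") if (row := parse_row(tr)) is not None]
print(rows)
```

The resulting list of dicts can be passed to pd.DataFrame(rows) and then saved with to_csv or to_excel; to_excel needs an Excel writer engine such as openpyxl to be installed.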