I want to scrape the table from this website: NCLT
I want to capture this table's data in the same layout, including the hyperlinks, and put it into an Excel worksheet. I have already tried copy-pasting the table, but the formatting gets messed up. I want the data in a DataFrame that I can save in CSV or Excel format.
Here's what I've tried:
import requests
from bs4 import BeautifulSoup
url = "https://archive.nclt.gov.in/judgement-date-wise?field_bench_target_id=5372&field_search_date_value[min][date]=01/01/19&field_search_date_value[max][date]=01/01/21&page=0"
response = requests.get(url).text
soup = BeautifulSoup(response, 'html.parser')
This is the part where I am facing a problem:
t_r = soup.find_all('tr')
t_r[1]
gives the following output:
<tr >
<td >
1 </td>
<td >
CPNo.464/BB/2018 </td>
<td >
M/s Shankar Subramanya bhat Vs M/s Star Cable Infomet Pvt Ltd
<br/>
A Murali, Advocate
<br/>
SPJ Legal R1-4 </td>
<td >
<span content="2020-01-16T00:00:00 05:30" datatype="xsd:dateTime" property="dc:date">16-01-2020</span> </td>
<td >
<a href="https://archive.nclt.gov.in/sites/default/files/January2021/final-orders-pdf/Star Cable.pdf" target="_blank"><img alt="PDF icon" src="modules/file/icons/application-pdf.png" title="application/pdf"/></a> Size :- 6.5 MB, Language:- English </td>
</tr>
Since I am new to BS4, I am trying to figure out how to extract the individual fields from each t_r[i].
So far, my code is only able to get the S. No. using the following snippet:
t_r[1].td.string.replace('\n', '').strip()
I also want similar code to obtain: CPNo.464/BB/2018, M/s Shankar Subramanya bhat Vs M/s Star Cable Infomet Pvt Ltd, 16-01-2020, and https://archive.nclt.gov.in/sites/default/files/January2021/final-orders-pdf/Star Cable.pdf
For some reason, I am not able to do the same for the rest of the fields in t_r[i]. I don't understand how to progress further than this. I want to extract the rest of the data as well, but using t_r.contents isn't working. Any help would be greatly appreciated.
I want the output to keep the table's layout, but with just the href link in the last column instead of the size and language text.
CodePudding user response:
You can get the desired table data using only pandas.
import pandas as pd
url = 'https://archive.nclt.gov.in/judgement-date-wise?field_bench_target_id=5372&field_search_date_value[min][date]=01/01/19&field_search_date_value[max][date]=01/01/21&page=0'
table_data = pd.read_html(url)[0]
print(table_data)
Output:
S. No ... PDF File
0 1 ... Size :- 6.5 MB, Language:- English
1 2 ... Size :- 2.02 MB, Language:- English
2 3 ... Size :- 1.95 MB, Language:- English
3 4 ... Size :- 2.26 MB, Language:- English
4 5 ... Size :- 2.72 MB, Language:- English
5 6 ... Size :- 3.35 MB, Language:- English
6 7 ... Size :- 3.56 MB, Language:- English
7 8 ... Size :- 1.22 MB, Language:- English
8 9 ... Size :- 2.88 MB, Language:- English
9 10 ... Size :- 2.18 MB, Language:- English
[10 rows x 5 columns]
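Note that the output above still has the size and language text in the last column rather than the link. As a minimal sketch, assuming pandas 1.5 or newer (where read_html accepts the extract_links parameter), the hrefs can be pulled out in the same call. The inline SAMPLE_HTML and the helper name read_table_with_links are made up for illustration so the snippet runs offline; the real page may differ.

```python
from io import StringIO

import pandas as pd

# Hypothetical, simplified copy of one table row, used instead of the live page.
SAMPLE_HTML = """
<table>
  <thead>
    <tr><th>S. No</th><th>Diary No. / Case No.</th><th>PDF File</th></tr>
  </thead>
  <tbody>
    <tr>
      <td>1</td>
      <td>CPNo.464/BB/2018</td>
      <td><a href="https://archive.nclt.gov.in/sites/default/files/January2021/final-orders-pdf/Star Cable.pdf">
          <img alt="PDF icon"/></a> Size :- 6.5 MB, Language:- English</td>
    </tr>
  </tbody>
</table>
"""


def read_table_with_links(html, link_column):
    """Read the first table in `html`, keeping hrefs for `link_column`."""
    # With extract_links="body", every body cell becomes a (text, href) tuple;
    # href is None for cells that contain no link.
    df = pd.read_html(StringIO(html), extract_links="body")[0]
    links = df[link_column].map(lambda cell: cell[1])
    # Keep only the text part for all columns, then put the links back.
    df = df.apply(lambda col: col.map(lambda cell: cell[0]))
    df[link_column] = links
    return df


df_links = read_table_with_links(SAMPLE_HTML, "PDF File")
print(df_links)
```

Because every cell arrives as a tuple, the text values come back as strings, so numeric columns such as S. No may need converting afterwards.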
CodePudding user response:
To get the PDF links from the table, you can use the following example:
from io import StringIO
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = "https://archive.nclt.gov.in/judgement-date-wise?field_bench_target_id=5372&field_search_date_value[min][date]=01/01/19&field_search_date_value[max][date]=01/01/21&page=0"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
table = soup.select_one("table.views-table")
# wrap the HTML in StringIO: recent pandas versions deprecate passing a literal HTML string
df = pd.read_html(StringIO(str(table)))[0]
# get pdf links:
df["PDF File"] = [
    a["href"] for a in soup.select("td.views-field-field-final-orders-pdf a")
]
print(df)
Prints:
| | S. No | Diary No. / Case No.[STATUS] | Name of Petitioner | Judgement date | PDF File |
---|---|---|---|---|---|
0 | 1 | CPNo.464/BB/2018 | M/s Shankar Subramanya bhat Vs M/s Star Cable Infomet Pvt Ltd A Murali, Advocate SPJ Legal R1-4 | 16-01-2020 | https://archive.nclt.gov.in/sites/default/files/January2021/final-orders-pdf/Star Cable.pdf |
1 | 2 | CA(CAA)No.50/BB/2020 | E2open Software India Pvt Ltd Vs Shyam Sundar H V, Adv Respondent Advocate : -- | 16-12-2020 | https://archive.nclt.gov.in/sites/default/files/January2021/final-orders-pdf/E2OPen.pdf |
2 | 3 | CA(CAA)No.51/BB/2020 | Amber Road Software Pvt Ltd Vs Shyam Sundar H V, Adv Respondent Advocate : -- | 16-12-2020 | https://archive.nclt.gov.in/sites/default/files/January2021/final-orders-pdf/Amber Road.pdf |
3 | 4 | CA(CAA)No.48/BB/2020 | Steelwedge Technologies Pvt Ltd Vs Shyam Sundar H V, Adv Respondent Advocate : -- | 16-12-2020 | https://archive.nclt.gov.in/sites/default/files/January2021/final-orders-pdf/Steelwedge.pdf |
4 | 5 | CPNo.158/BB/2020 | Marble Industry [Mangalore] Pvt Ltd & Others Vs ROC Chethan Jeevandas Nayak, PCS Respondent Advocate : -- | 16-10-2020 | https://archive.nclt.gov.in/sites/default/files/January2021/final-orders-pdf/cp no 158 of 2020.pdf |
5 | 6 | CP(IB)No.156/BB/2017 | Triumph India Software Services Pvt Ltd Vs Corporation Bank Girish Kumar M.S Shri Venkata Subbarao Kalva Liquidator, Vivekananda for Liquidator | 04-12-2020 | https://archive.nclt.gov.in/sites/default/files/January2021/final-orders-pdf/Triumph India Software Services Pvt Ltd Vs Corporation Bank.pdf |
6 | 7 | CPNo.129/BB/2020 | M/s Shamel Projects India Pvt Ltd Vs Arjun Amanchi, Advocate Respondent Advocate : -- | 11-12-2020 | https://archive.nclt.gov.in/sites/default/files/January2021/final-orders-pdf/NCLT20210119111719.pdf |
7 | 8 | CP(IB)No.263/BB/2019 | M/s RDC Concrete (India) Pvt Ltd Vs M/s Sukritha Buildmann Pvt Ltd Ricab Chad, Advocate Abhijit Atur, Advocate | 25-10-2019 | https://archive.nclt.gov.in/sites/default/files/January2021/final-orders-pdf/Ms RDC Concrete (India) Pvt Ltd VS Ms Sukritha Buildmann Pvt Ltd _0.pdf |
8 | 9 | CP(IB)No.214/BB/2020 | Shapoorji Pallonji and Company Pvt Ltd Vs Shore Dwellings Pvt Ltd[formerly known as Mantri Dwellings Pvt Ltd] Keystone Partners Respondent Advocate : -- | 18-12-2020 | https://archive.nclt.gov.in/sites/default/files/January2021/final-orders-pdf/CP IB 214 of 2020.pdf |
9 | 10 | CPNo.193/BB/2020 | Chiteta Mining Company Pvt Ltd Vs Jose Thomas, PCS Respondent Advocate : -- | 30-12-2020 | https://archive.nclt.gov.in/sites/default/files/January2021/final-orders-pdf/NCLT20210111165206.pdf |
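If you would rather stay with the row-by-row BeautifulSoup approach from the question, here is a minimal sketch. The reason t_r[1].td.string only works for the first cell is that .string returns None when a tag has more than one child, which is the case for the cells containing br, span, or a tags; get_text() and stripped_strings do not have that limitation. ROW_HTML is the row from the question pasted inline so the snippet runs offline, and parse_row and the dictionary keys are names chosen for illustration only.

```python
from bs4 import BeautifulSoup

# The sample row from the question, wrapped in a table so it parses on its own.
ROW_HTML = """<table><tr>
<td> 1 </td>
<td> CPNo.464/BB/2018 </td>
<td> M/s Shankar Subramanya bhat Vs M/s Star Cable Infomet Pvt Ltd
<br/> A Murali, Advocate
<br/> SPJ Legal R1-4 </td>
<td> <span content="2020-01-16T00:00:00 05:30" datatype="xsd:dateTime" property="dc:date">16-01-2020</span> </td>
<td> <a href="https://archive.nclt.gov.in/sites/default/files/January2021/final-orders-pdf/Star Cable.pdf" target="_blank"><img alt="PDF icon" src="modules/file/icons/application-pdf.png" title="application/pdf"/></a> Size :- 6.5 MB, Language:- English </td>
</tr></table>"""


def parse_row(tr):
    """Turn one <tr> with five <td> cells into a dict of cleaned values."""
    cells = tr.find_all("td")
    link = cells[4].find("a", href=True)
    return {
        "S. No": cells[0].get_text(strip=True),
        "Case No.": cells[1].get_text(strip=True),
        # The first text fragment (before the first <br/>) holds the party names.
        "Parties": next(cells[2].stripped_strings),
        "Judgement date": cells[3].get_text(strip=True),
        "PDF File": link["href"] if link else None,
    }


soup = BeautifulSoup(ROW_HTML, "html.parser")
record = parse_row(soup.find("tr"))
print(record)
```

On the real page you would skip the header row and any row without td cells, collect the dictionaries in a list, and pass that list to pandas.DataFrame; the resulting frame can then be written out with to_csv or to_excel (the latter needs an Excel engine such as openpyxl installed).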