I am scraping a website for data on a company, and so far the final result is a bunch of string items that were converted into lists.
Code snippet:
for tr in tables.find_all("tr"):
    for td in tr.find_all("td"):
        lists = td.text.split('\n')
Now, if I print these lists with their index and value using enumerate, I get 16 items, which matches the table scraped from the website.
Result of print(lists) using enumerate:
Index Data
0 ['XYZ']
1 ['100DL20C201961']
2 ['Capital']
3 ['12345']
4 ['Age']
5 ['16 Years']
6 ['Text']
7 ['56789']
8 ['Company Status']
9 ['Active']
10 ['Last Date']
11 ['27-11-2021']
12 ['Class']
13 ['Public Company']
14 ['Date']
15 ['31-12-2021']
However, what I want to achieve is to save these list items as a CSV or Excel file, so that every even-indexed item becomes a column header and every odd-indexed item becomes the data in the row below it.
Questions:
- Is a pandas DataFrame needed for this?
- How do I convert a bunch of lists (or strings) like the above into a '.csv' or '.xlsx' table?
Summary of goal: a table of 2 rows x 8 columns in .csv or .xlsx format.
CodePudding user response:
Try:
import pandas as pd
import requests
from bs4 import BeautifulSoup

URL = "https://www.instafinancials.com/company/mahan-energen-limited/U40100DL2005PLC201961"
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')

# First 8 rows of the company highlights table
rows = soup.select('#companyContentHolder_companyHighlightsContainer>table >tbody tr')[:8]

# Column 1 holds the label, column 2 holds the value
d = {row.select_one('td:nth-child(1)').get_text(): row.select_one('td:nth-child(2)').get_text() for row in rows}
#print(d)

# A list with one dict gives a one-row DataFrame; the dict keys become the column headers
df = pd.DataFrame([d])
# to_csv returns None when given a path, so keep df and the write as separate steps
df.to_csv('out.csv', index=False)
#print(df)
Complete result set:
import pandas as pd
import requests
from bs4 import BeautifulSoup

URL = "https://www.instafinancials.com/company/mahan-energen-limited/U40100DL2005PLC201961"
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')

# First 8 rows of the company highlights table
rows = soup.select('#companyContentHolder_companyHighlightsContainer>table >tbody tr')[:8]

# Columns 1 and 2 hold the first set of label/value pairs
d = {row.select_one('td:nth-child(1)').get_text(): row.select_one('td:nth-child(2)').get_text() for row in rows}
#print(d)
# Columns 3 and 4 hold the second set of label/value pairs
d.update({row.select_one('td:nth-child(3)').get_text(): row.select_one('td:nth-child(4)').get_text() for row in rows})

# A list with one dict gives a one-row DataFrame; the dict keys become the column headers
df = pd.DataFrame([d])
# to_csv returns None when given a path, so keep df and the write as separate steps
df.to_csv('out2.csv', index=False)
#print(df)
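On the first question: pandas is convenient, but not strictly needed for the CSV case. If the 16 cells are already collected in order, as in the enumerate output in the question, pairing the even-indexed items (headers) with the odd-indexed items (values) is enough. The following is only a minimal sketch under that assumption; the function name pairs_to_csv and the file name out_pairs.csv are made up for illustration, and the sample data is copied from the question rather than scraped.

```python
import csv

# Sample data copied from the question: one single-item list per table cell,
# in the order the cells were scraped.
cells = [
    ['XYZ'], ['100DL20C201961'],
    ['Capital'], ['12345'],
    ['Age'], ['16 Years'],
    ['Text'], ['56789'],
    ['Company Status'], ['Active'],
    ['Last Date'], ['27-11-2021'],
    ['Class'], ['Public Company'],
    ['Date'], ['31-12-2021'],
]


def pairs_to_csv(cells, path):
    """Write even-indexed cells as the header row and odd-indexed cells as the data row.

    Hypothetical helper: assumes every cell is a non-empty list whose first
    item is the text, and that headers and values strictly alternate.
    """
    flat = [cell[0].strip() for cell in cells]  # unwrap the single-item lists
    headers = flat[0::2]  # indices 0, 2, 4, ...
    values = flat[1::2]   # indices 1, 3, 5, ...
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(headers)
        writer.writerow(values)
    return headers, values


headers, values = pairs_to_csv(cells, "out_pairs.csv")
```

For an .xlsx file, pandas is the easier route: build the frame with pd.DataFrame([dict(zip(headers, values))]) and call to_excel on it, which needs an Excel writer engine such as openpyxl to be installed.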