Convert a bunch of list items (scraped from a vertical table) into a pandas DataFrame with equal headers

Time:09-27

I am scraping a website for data on a company. So far, the final result is a bunch of strings, each of which was converted into a list.

Code snippet:

for tr in tables.find_all("tr"):
    for td in tr.find_all("td"):
        lists = td.text.split('\n')

Now, if I print these lists with their index and value using enumerate, I get 16 items, which matches the table on the website.

Result of print(lists) using enumerate:

Index   Data
0   ['XYZ']
1   ['100DL20C201961']
2   ['Capital']
3   ['12345']
4   ['Age']
5   ['16 Years']
6   ['Text']
7   ['56789']
8   ['Company Status']
9   ['Active']
10  ['Last Date']
11  ['27-11-2021']
12  ['Class']
13  ['Public Company']
14  ['Date']
15  ['31-12-2021']

However, what I want to achieve is to save these list items as a CSV or Excel file, so that every even-indexed item is a column header and every odd-indexed item is the data for that column.

Question:

  1. Is a pandas DataFrame needed for this?
  2. How do I convert a bunch of lists like the above (or strings) into a '.csv' or '.xlsx' table?

Summary of goal: a table of 2 rows x 8 columns (one header row and one data row) in .csv or .xlsx format.


CodePudding user response:

Try:

import pandas as pd
import requests
from bs4 import BeautifulSoup

URL = "https://www.instafinancials.com/company/mahan-energen-limited/U40100DL2005PLC201961"
page = requests.get(URL)

soup = BeautifulSoup(page.content, 'html.parser')

data = []
d = dict((row.select_one('td:nth-child(1)').get_text(),row.select_one('td:nth-child(2)').get_text()) for row in soup.select('#companyContentHolder_companyHighlightsContainer>table >tbody tr')[:8])
#print(d)
    
data.append(d)
# to_csv() returns None when given a path, so keep the DataFrame in its own variable
df = pd.DataFrame(data)
df.to_csv('out.csv', index=False)
#print(df)
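
If the 16 values are already in hand as in the question, the same one-row table can probably be built without scraping again. A minimal sketch, assuming the cell texts have been collected into one flat list of strings (the name items and the literal values below are just a stand-in copied from the question's printout):

```python
import pandas as pd

# Hypothetical flat list standing in for the scraped cell texts from the question.
items = [
    "XYZ", "100DL20C201961",
    "Capital", "12345",
    "Age", "16 Years",
    "Text", "56789",
    "Company Status", "Active",
    "Last Date", "27-11-2021",
    "Class", "Public Company",
    "Date", "31-12-2021",
]

headers = items[0::2]  # even positions: column names
values = items[1::2]   # odd positions: the single data row

df = pd.DataFrame([values], columns=headers)

# With no path, to_csv() returns the CSV text; pass a file name to write a file instead.
csv_text = df.to_csv(index=False)
```

Calling df.to_csv('out.csv', index=False) would write the file, and df.to_excel('out.xlsx', index=False) should work for .xlsx as long as an Excel engine such as openpyxl is installed.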

             


Complete result set:

import pandas as pd
import requests
from bs4 import BeautifulSoup

URL = "https://www.instafinancials.com/company/mahan-energen-limited/U40100DL2005PLC201961"
page = requests.get(URL)

soup = BeautifulSoup(page.content, 'html.parser')

data = []
d = dict((row.select_one('td:nth-child(1)').get_text(),row.select_one('td:nth-child(2)').get_text()) for row in soup.select('#companyContentHolder_companyHighlightsContainer>table >tbody tr')[:8])
#print(d)

d.update(dict((row.select_one('td:nth-child(3)').get_text(),row.select_one('td:nth-child(4)').get_text()) for row in soup.select('#companyContentHolder_companyHighlightsContainer>table >tbody tr')[:8])) 
data.append(d)
# to_csv() returns None when given a path, so keep the DataFrame in its own variable
df = pd.DataFrame(data)
df.to_csv('out2.csv', index=False)
#print(df)
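
As for whether pandas is needed at all: for a plain .csv, probably not. A rough sketch using only the standard library, assuming the scraped values arrive as one-element lists like in the question's printout (the name cells and its shortened contents are made up for illustration):

```python
import csv
import io

# Hypothetical, shortened sample in the shape of the question's printout:
# each scraped cell is a one-element list.
cells = [["XYZ"], ["100DL20C201961"], ["Capital"], ["12345"]]

# Unwrap each one-element list and trim stray whitespace.
flat = [cell[0].strip() for cell in cells]

headers = flat[0::2]  # even positions: column names
values = flat[1::2]   # odd positions: the single data row

# Write to an in-memory buffer here; for a real file, use
# open('out.csv', 'w', newline='') in place of io.StringIO().
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(headers)
writer.writerow(values)
csv_text = buffer.getvalue()
```

pandas becomes more useful once an .xlsx file is wanted or several companies need to be stacked into one table.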