How to select one column of a Wikipedia table with a pandas DataFrame?

Time:11-21

I've been practicing web scraping, and this time I'm trying to get only the first column of data (the stock symbols) all the way down the table, but my code keeps pulling all the data from the table. I'm not sure what I'm doing wrong. Any assistance would be appreciated, thank you.

from bs4 import BeautifulSoup
import requests
import pandas as pd
url = "https://en.wikipedia.org/wiki/List_of_S&P_500_companies"
r = requests.get(url, headers=headers)

tables= pd.read_html(url, attrs={'id': 'constituents'})
df = df.iloc[1:]
print (df)
#df.to_csv('Stock_List.txt', index=False, encoding='utf-8') 

Answer:

pd.read_html returns a list of DataFrames, one for each matching table, so first you have to take the single table out of that list.

Then you can select the Symbol column by name.

df = tables[0]

df = df['Symbol']

Full working code:

import pandas as pd

url = "https://en.wikipedia.org/wiki/List_of_S&P_500_companies"

tables = pd.read_html(url, attrs={'id': 'constituents'})

df = tables[0]

print(df['Symbol'])
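
For comparison with the question: `df.iloc[1:]` slices rows (it drops the first row and keeps every column), which is why all the data was still printed. Below is a minimal sketch of selecting a column by name or by position. The small DataFrame is a hypothetical stand-in for `tables[0]`, built by hand so the example runs without a network connection.

```python
import pandas as pd

# Hypothetical stand-in for tables[0]; the real table has more rows and columns.
df = pd.DataFrame(
    {
        "Symbol": ["MMM", "ABT", "ABBV"],
        "Security": ["3M", "Abbott", "AbbVie"],
    }
)

by_name = df["Symbol"]        # Series, selected by column label
by_position = df.iloc[:, 0]   # Series, first column whatever its label is
as_frame = df[["Symbol"]]     # one-column DataFrame (note the double brackets)

# Convert to a plain Python list if only the values are needed.
symbols = by_name.tolist()
print(symbols)
```

Selecting by name is usually more robust than selecting by position if the column order on the page changes, while `iloc[:, 0]` matches the "first column" wording of the question.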

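If you also want the links attached to the symbols, you will have to use requests and BeautifulSoup, because read_html returns only the cell text and drops the links (pandas 1.5 and later add an extract_links option to read_html, but the approach below works with any version).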

import bs4 as bs
import requests

url = 'https://en.wikipedia.org/wiki/List_of_S&P_500_companies'

r = requests.get(url)
    
soup = bs.BeautifulSoup(r.text, "html.parser")
table = soup.find('table', {'class': 'wikitable sortable'})
    
symbols = []
    
for row in table.find_all('tr')[1:]:  # [1:] to skip header
    items = row.find_all('td')
    symbol = items[0].text.strip()
    link = items[0].find('a')['href']
    symbols.append([symbol, link])
    print(f"{symbol:5} | {link}")

#print(symbols)

Result:

MMM   | https://www.nyse.com/quote/XNYS:MMM
ABT   | https://www.nyse.com/quote/XNYS:ABT
ABBV  | https://www.nyse.com/quote/XNYS:ABBV
ABMD  | http://www.nasdaq.com/symbol/abmd
ACN   | https://www.nyse.com/quote/XNYS:ACN
ATVI  | http://www.nasdaq.com/symbol/atvi
ADBE  | http://www.nasdaq.com/symbol/adbe
AMD   | http://www.nasdaq.com/symbol/amd
AAP   | https://www.nyse.com/quote/XNYS:AAP
AES   | https://www.nyse.com/quote/XNYS:AES
AFL   | https://www.nyse.com/quote/XNYS:AFL
A     | https://www.nyse.com/quote/XNYS:A
APD   | https://www.nyse.com/quote/XNYS:APD
AKAM  | http://www.nasdaq.com/symbol/akam
ALK   | https://www.nyse.com/quote/XNYS:ALK
ALB   | https://www.nyse.com/quote/XNYS:ALB
ARE   | https://www.nyse.com/quote/XNYS:ARE
ALGN  | http://www.nasdaq.com/symbol/algn
ALLE  | https://www.nyse.com/quote/XNYS:ALLE
LNT   | https://www.nyse.com/quote/XNYS:LNT

# ... etc ...

The BeautifulSoup code is based on my code from an answer to: Running for-loop and skipping stocks with 'KeyError' : Date

The same code is also on GitHub in my repo:

python-examples/__scraping__/wikipedia.org - SP500 - requests, BS
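
The BeautifulSoup loop above assumes every data row has a `td` cell containing a link. As a hedged sketch, the variant below skips rows without `td` cells and stores `None` when a cell has no link. The HTML string is a hypothetical miniature of the Wikipedia table (the `XYZ` row is invented to show the missing-link case), and the table is looked up by the `constituents` id used in the read_html call.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the Wikipedia table, so the example runs offline.
html = """
<table class="wikitable sortable" id="constituents">
  <tr><th>Symbol</th><th>Security</th></tr>
  <tr><td><a href="https://www.nyse.com/quote/XNYS:MMM">MMM</a></td><td>3M</td></tr>
  <tr><td>XYZ</td><td>No link here</td></tr>
  <tr><td><a href="http://www.nasdaq.com/symbol/adbe">ADBE</a></td><td>Adobe</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {"id": "constituents"})

symbols = []
for row in table.find_all("tr"):
    cells = row.find_all("td")
    if not cells:  # header rows contain only <th> cells
        continue
    anchor = cells[0].find("a")
    link = anchor.get("href") if anchor else None  # None when there is no link
    symbols.append([cells[0].text.strip(), link])

for symbol, link in symbols:
    print(f"{symbol:5} | {link}")
```

Checking for missing cells and links keeps the loop from raising an error if the page layout gains a row that does not follow the usual pattern.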
