Extracting a scraped list into new columns

Time:03-04

I have this code (borrowed from an old question posted on this site):

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver

# Open the page in Chrome and parse the rendered HTML.
driver = webdriver.Chrome()
driver.get("https://www.baseball-reference.com/leagues/MLB/2013-finalyear.shtml")
doc = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()  # the HTML is already captured, so the browser can be closed

# The table has an id, which makes it simpler to target.
batting = doc.find(id='misc_batting')

careers = []
for row in batting.find_all('tr')[1:]:  # [1:] skips the header row
    dictionary = {}
    dictionary['names'] = row.find(attrs={"data-stat": "player"}).text.strip()
    dictionary['experience'] = row.find(attrs={"data-stat": "experience"}).text.strip()
    careers.append(dictionary)

Which generates a result like this:

[{'names': 'David Adams', 'experience': '1'}, {'names': 'Steve Ames', 'experience': '1'}, {'names': 'Rick Ankiel', 'experience': '11'}, {'names': 'Jairo Asencio', 'experience': '4'}, {'names': 'Luis Ayala', 'experience': '9'}, {'names': 'Brandon Bantz', 'experience': '1'}, {'names': 'Scott Barnes', 'experience': '2'}, {'names':

How do I turn this into a DataFrame with one column per key, like this?

Names       Experience
David Adams   1

CodePudding user response:

Simply pass your list of dicts (careers) to pandas.DataFrame() to get your expected result.

Example

import pandas as pd

careers = [{'names': 'David Adams', 'experience': '1'}, {'names': 'Steve Ames', 'experience': '1'}, {'names': 'Rick Ankiel', 'experience': '11'}, {'names': 'Jairo Asencio', 'experience': '4'}, {'names': 'Luis Ayala', 'experience': '9'}, {'names': 'Brandon Bantz', 'experience': '1'}, {'names': 'Scott Barnes', 'experience': '2'}]

pd.DataFrame(careers)

Output

           names  experience
0    David Adams           1
1     Steve Ames           1
2    Rick Ankiel          11
3  Jairo Asencio           4
4     Luis Ayala           9
5  Brandon Bantz           1
6   Scott Barnes           2
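
Note that the scraped values are text, so the experience column holds strings such as '1' and '11' rather than numbers. A minimal sketch, assuming you want integer years and the capitalised headers from the question (the variable name df is only illustrative):

```python
import pandas as pd

# Sample rows copied from the scraped output in the question.
careers = [
    {'names': 'David Adams', 'experience': '1'},
    {'names': 'Steve Ames', 'experience': '1'},
    {'names': 'Rick Ankiel', 'experience': '11'},
]

df = pd.DataFrame(careers)

# The scraper returns text, so convert the years to integers.
df['experience'] = df['experience'].astype(int)

# Optional: rename to the headers shown in the question.
df = df.rename(columns={'names': 'Names', 'experience': 'Experience'})

print(df)
```

With integer values, sorting and arithmetic on Experience behave numerically instead of alphabetically.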

CodePudding user response:

You can simplify this quite a bit with pandas. Have read_html pull the table directly by its id, then select just the Name and Yrs columns.

import pandas as pd

url = "https://www.baseball-reference.com/leagues/MLB/2013-finalyear.shtml"
df = pd.read_html(url, attrs = {'id': 'misc_batting'})[0]

df_filter = df[['Name','Yrs']]

If you need to rename those columns, add:

df_filter = df_filter.rename(columns={'Name':'names','Yrs':'experience'})

Output:

print(df_filter)
              names  experience
0       David Adams           1
1        Steve Ames           1
2       Rick Ankiel          11
3     Jairo Asencio           4
4        Luis Ayala           9
..              ...         ...
209    Dewayne Wise          11
210       Ross Wolf           3
211  Kevin Youkilis          10
212   Michael Young          14
213          Totals        1357

[214 rows x 2 columns]
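
The last row of that output is the table's Totals line rather than a player. If you do not want it, one option is to filter it out by name. A small sketch, using a hand-built frame with a few rows from the output above in place of the live page (players is a hypothetical name):

```python
import pandas as pd

# Stand-in for df_filter, built from rows shown in the output above.
df_filter = pd.DataFrame({
    'names': ['David Adams', 'Steve Ames', 'Michael Young', 'Totals'],
    'experience': [1, 1, 14, 1357],
})

# Keep every row except the summary line, then renumber the index.
players = df_filter[df_filter['names'] != 'Totals'].reset_index(drop=True)

print(players)
```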