How can I turn this into a DataFrame?

Time:11-14

I am new to Python and was trying to run a basic web scraper. My code looks like this:

import requests
import pandas as pd

x = requests.get('https://www.baseball-reference.com/players/p/penaje02.shtml')
dfs = pd.read_html(x.content)
print(dfs)

df = pd.DataFrame(dfs)

When I print dfs, it looks like this. I only want the second table.

[            Year            Age             Tm             Lg    G   PA   AB  \
0           2018             20        HOU-min             A-   36  156  136   
1           2019             21        HOU-min           A,A+  109  473  409   
2           2021             23        HOU-min         AAA,Rk   37  160  145   
3           2022             24            HOU             AL  136  558  521   
4           1 Yr           1 Yr           1 Yr           1 Yr  136  558  521   
5  162 Game Avg.  162 Game Avg.  162 Game Avg.  162 Game Avg.  162  665  621   

    R    H  2B  ...    OPS   OPS+   TB  GDP  HBP  SH  SF  IBB   Pos  \
0  22   34   5  ...  0.649    NaN   42    0    1   0   1    0   NaN   
1  72  124  21  ...  0.825    NaN  180    4   11   0   6    0   NaN   
2  25   43   5  ...  0.942    NaN   84    0    7   0   0    0   NaN   
3  72  132  20  ...  0.715  101.0  222    5    6   1   5    0  *6/H   
4  72  132  20  ...  0.715  101.0  222    5    6   1   5    0   NaN   
5  86  157  24  ...  0.715  101.0  264    6    7   1   6    0   NaN   

                Awards  
0           TRC · NYPL  
1  DAV,FAY · MIDW,CARL  
2   SKT,AST · AAAW,FCL  
3                   GG  
4                  NaN  
5                  NaN  

[6 rows x 30 columns]]

However, I end up with the error "Must pass 2-d input. shape=(1, 6, 30)" after my last line. I have tried using df = dfs[1], but got the error "list index out of range". Is there any way I can turn dfs from a list into a DataFrame?

Answer:

What do you mean by only wanting the second table? There's only one table, with 6 rows and 30 columns. The backslashes show up when the terminal or output area you're printing to isn't wide enough to show the DataFrame without wrapping, so pandas splits the columns into blocks and marks each continuation with a backslash. Printed in a wider terminal, every row fits on a single line.

[Image: the same DataFrame printed in a wider terminal]
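If widening the terminal is not convenient, the display limits can also be raised from inside pandas. The following is only a minimal sketch: the DataFrame is a made-up stand-in with the same 6 x 30 shape as the scraped table, and the width of 1000 characters is an arbitrary assumption.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the scraped table: 6 rows, 30 columns.
wide = pd.DataFrame(
    np.arange(180).reshape(6, 30),
    columns=[f"c{i}" for i in range(30)],
)

# Temporarily remove the column limit and allow very long lines,
# so the frame is rendered without "..." or backslash continuations.
with pd.option_context("display.max_columns", None, "display.width", 1000):
    wide_text = repr(wide)

print(wide_text)
```

The settings only apply inside the with block, so the rest of the script keeps the default display behaviour.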

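The pd.read_html() function returns a list of DataFrames (List[DataFrame]), one per table it finds. Here the list holds a single table, which is why dfs[1] fails with "list index out of range" and why pd.DataFrame(dfs) complains about the 3-d shape (1, 6, 30). So first grab your DataFrame from the list, and then subset it to get the columns you care about: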

# The list has one element, so the table is at index 0
df = dfs[0]

columns = ['R', 'H', '2B', '3B', 'HR', 'RBI', 'SB', 'CS', 'BB', 'SO', 'BA', 'OBP', 'SLG', 'OPS', 'OPS+', 'TB', 'GDP', 'HBP', 'SH', 'SF', 'IBB', 'Pos']
print(df[columns])

Output:

    R    H  2B  3B  HR  RBI  SB  CS  BB   SO     BA    OBP    SLG    OPS   OPS+   TB  GDP  HBP  SH  SF  IBB   Pos
0  22   34   5   0   1   10   3   0  18   19  0.250  0.340  0.309  0.649    NaN   42    0    1   0   1    0   NaN
1  72  124  21   7   7   54  20  10  47   90  0.303  0.385  0.440  0.825    NaN  180    4   11   0   6    0   NaN
2  25   43   5   3  10   21   6   1   8   41  0.297  0.363  0.579  0.942    NaN   84    0    7   0   0    0   NaN
3  72  132  20   2  22   63  11   2  22  135  0.253  0.289  0.426  0.715  101.0  222    5    6   1   5    0  *6/H
4  72  132  20   2  22   63  11   2  22  135  0.253  0.289  0.426  0.715  101.0  222    5    6   1   5    0   NaN
5  86  157  24   2  26   75  13   2  26  161  0.253  0.289  0.426  0.715  101.0  264    6    7   1   6    0   NaN
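The same pattern can be tried without network access. This is a rough sketch in which a hand-built list stands in for the return value of pd.read_html(); the variable names table, wanted and present are illustrative assumptions, and only a few values from the output above are reused.

```python
import pandas as pd

# Hypothetical stand-in for what pd.read_html() returns:
# a plain Python list holding one DataFrame per table found.
table = pd.DataFrame({
    "Year": ["2021", "2022"],
    "Tm": ["HOU-min", "HOU"],
    "R": [25, 72],
    "H": [43, 132],
})
dfs = [table]

# Check how many tables were found before indexing into the list.
print(len(dfs))

# The first (and here only) table lives at index 0;
# dfs[1] would raise IndexError because there is no second element.
df = dfs[0]

# Keep only the requested columns that actually exist in this table.
wanted = ["R", "H", "HR"]
present = [c for c in wanted if c in df.columns]
subset = df[present]
print(subset)
```

Filtering the wanted names against df.columns avoids a KeyError when a column label differs slightly from what was expected.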