Assistance with "read_html" result to a proper DataFrame

Time:08-27

I'm looking to grab the stats table from this example link:

https://www.fangraphs.com/leaders.aspx?pos=all&stats=bat&lg=all&qual=0&type=8&season=2022&month=1000&season1=2022&ind=0&team=0,ts&rost=0&age=0&filter=&players=0&startdate=2022-04-07&enddate=2022-04-07&page=1_50

However, when I grab it, there's an extra column header level that I can't seem to get rid of, and the bottom row contains a useless "Page size" string that I would also like to remove.

I've provided example code below for testing, along with two attempts to fix the issue, neither of which worked.

from pandas import read_html, set_option
#set_option('display.max_rows', 20)
#set_option('display.max_columns', None)

# Extract the table from the provided link
url = "https://www.fangraphs.com/leaders.aspx?pos=all&stats=bat&lg=all&qual=0&type=8&season=2022&month=1000&season1=2022&ind=0&team=0,ts&rost=0&age=0&filter=&players=0&startdate=2022-04-07&enddate=2022-04-07&page=1_50"
table_of_interest = read_html(url)[-2]
print(table_of_interest)

# Attempt 1 - https://stackoverflow.com/questions/68385659/in-pandas-python-how-do-i-get-rid-of-the-extra-column-header-with-index-numbers
df = table_of_interest.iloc[1:,:-1]
print(df)

# Attempt 2 - https://stackoverflow.com/questions/71379513/remove-extra-column-level-from-dataframe
df = table_of_interest.rename_axis(columns=None)
print(df)

Both attempts still leave the extra header in the output.

I want to get rid of the top-level column header that reads "1 Page size: select 14 items in 1 pages". How?

CodePudding user response:

You could try as follows:

from pandas import read_html

url = "https://www.fangraphs.com/leaders.aspx?pos=all&stats=bat&lg=all&qual=0&type=8&season=2022&month=1000&season1=2022&ind=0&team=0,ts&rost=0&age=0&filter=&players=0&startdate=2022-04-07&enddate=2022-04-07&page=1_50"
table_of_interest = read_html(url)[-2]

# keep only level 1 from the original MultiIndex cols
table_of_interest.columns = [col[1] for col in table_of_interest.columns]

# drop the last row, which holds the "Page size" text rather than data
table_of_interest = table_of_interest.iloc[:-1]

print(table_of_interest)

Output:

     # Team   G  PA HR  R RBI SB  ...   SLG  wOBA xwOBA wRC    BsR   Off   Def   WAR
0    1  STL  13  41  3  9   9  1  ...  .613  .413   NaN  173   0.1   3.6   0.4   0.6
1    2  NYM  16  42  0  5   5  0  ...  .400  .388   NaN  157  -0.6   2.2  -0.3   0.3
2    3  MIL  15  40  0  4   4  1  ...  .424  .346   NaN  122   0.2   1.2  -0.2   0.2
3    4  CHC  16  35  1  5   5  0  ...  .483  .368   NaN  140  -0.5   1.0  -0.2   0.2
4    5  HOU  13  38  2  3   3  1  ...  .514  .334   NaN  119   0.1   0.7  -0.1   0.2
5    6  CIN  14  38  1  6   6  0  ...  .371  .301   NaN   85   0.0  -0.7  -0.1   0.1
6    7  CLE  14  37  0  1   1  1  ...  .242  .252   NaN   66   0.2  -1.4   0.3   0.0
7    8  ARI  18  34  1  4   3  0  ...  .231  .276   NaN   77   0.0  -1.1  -0.2   0.0
8    9  SDP  15  36  0  2   2  1  ...  .172  .243   NaN   57   0.2  -1.6  -0.1  -0.1
9   10  WSN  15  35  1  1   1  0  ...  .313  .243   NaN   51  -0.1  -2.3  -0.3  -0.1
10  11  ATL  14  36  1  3   2  0  ...  .226  .227   NaN   44   0.0  -2.6  -0.1  -0.2
11  12  KCR  13  31  0  3   3  0  ...  .214  .189   NaN   30  -0.4  -3.1   0.0  -0.2
12  13  LAA  15  31  0  1   1  0  ...  .207  .183   NaN   21   0.0  -3.1   0.0  -0.2
13  14  PIT  18  32  0  0   0  0  ...  .200  .209   NaN   34  -0.4  -3.0  -0.6  -0.3
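The same cleanup can also be written with `MultiIndex.droplevel`, which avoids the list comprehension. The sketch below does not hit the network: it builds a small synthetic frame (the `pager` string, the three columns, and the sample rows are assumptions made for illustration) that mimics the shape described above, namely a two-level header whose top level repeats the pager text, plus a trailing non-data row.

```python
import pandas as pd

# Synthetic stand-in for the scraped table (assumption: the real table has a
# two-level column header whose top level repeats the pager text, and a final
# row that holds pager text instead of data).
pager = "1 Page size: select 14 items in 1 pages"
columns = pd.MultiIndex.from_tuples([(pager, "#"), (pager, "Team"), (pager, "G")])
raw = pd.DataFrame(
    [["1", "STL", "13"], ["2", "NYM", "16"], [pager, pager, pager]],
    columns=columns,
)

# Drop the top header level, keeping only the real column names.
clean = raw.copy()
clean.columns = clean.columns.droplevel(0)

# Drop the trailing pager row.
clean = clean.iloc[:-1]

# The pager row forces string columns, so convert the numeric ones explicitly.
clean = clean.astype({"#": int, "G": int})

print(clean)
```

Checking `raw.columns.nlevels` before and after is a quick way to confirm that the extra header was a second column level rather than a data row, which would explain why slicing rows with `iloc[1:]` in Attempt 1 had no effect on it.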