For this project I built a web crawler. The crawler visits a league home page on Baseball Reference, collects all of the team links found on that page into a list, then scrapes each link in that list, collecting data on every player who played on that team for a given season. When it is done, it returns a list of lists of dataframes. In this example there are 22 teams, so it returns a list containing 22 lists of player dataframes, totaling about 640 dataframes (players).
Here is an example of one of those player dataframes (nwl_21 is the list; the first index of 0 gets the first of the 22 teams, and the second index of 0 gets the first player on that team):
nwl_21[0][0]
index 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 19 -2.2 Grand Canyon WAC NCAA None 4.0 11.0 10.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 4.0 0.0
2 20 -1.8 Grand Canyon WAC NCAA None 32.0 82.0 70.0 3.0 12.0 0.0 0.0 1.0 5.0 1.0 0.0 3.0 24.0 0.171
3 20 -0.5 Kokomo NWDS Smr None 35.0 145.0 118.0 18.0 20.0 1.0 1.0 0.0 11.0 0.0 0.0 22.0 27.0 0.17
4 None College None 36 93 80 3.0 12.0 0.0 0.0 1.0 5.0 2.0 0.0 3.0 28.0 0.15 0.207 0.188 0.394
5 None Other None 35 145 118 18.0 20.0 1.0 1.0 0.0 11.0 0.0 0.0 22.0 27.0 0.17 0.317 0.195 0.512
6 None None None 71 238 198 21.0 32.0 1.0 1.0 1.0 16.0 2.0 0.0 25.0 55.0 0.162 0.276 0.192 0.468
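To illustrate the nested layout, here is a quick sketch of how it can be traversed (nwl_21 being the list returned by the crawler):
total_players = sum(len(team) for team in nwl_21)   # roughly 640 player dataframes
first_team = nwl_21[0]          # list of player dataframes for the first team
first_player = first_team[0]    # the same dataframe as nwl_21[0][0] above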
Now my goal is to access the rows that contain 'NCAA' and 'NWDS' and store them somewhere. The ultimate goal is to have a large sample of this data so I can build a multiple regression model that can predict performance in the 'NWDS' league based on prior 'NCAA' statistics.
Any and all help is greatly appreciated.
CodePudding user response:
You can use a regular expression to find the rows that contain both NCAA and NWDS. Here we have a find_strs() function that we can run on each row in the df with .apply() and axis=1. In this case, each row is converted to a list, and then to a string, which is passed as the find_strs() argument.
import pandas as pd
import re

df = pd.DataFrame({
    1: {0: '20', 1: '20', 2: 'None', 3: 'None', 4: 'None'},
    2: {0: '-1.8', 1: '-0.5', 2: 'College', 3: 'NCAA', 4: 'None'},
    3: {0: 'Grand', 1: 'Kokomo', 2: 'None', 3: 'NWDS', 4: 'None'},
    4: {0: 'NWDS', 1: 'NWDS', 2: '36', 3: '35', 4: '71'},
    5: {0: 'WAC', 1: 'Smr', 2: '93', 3: '145', 4: '238'},
    6: {0: 'NCAA', 1: 'None', 2: '80', 3: '118', 4: '198'},
    7: {0: 'None', 1: '35.0', 2: '3.0', 3: '18.0', 4: '21.0'}})

def find_strs(row_string):
    # The two lookaheads require both 'NWDS' and 'NCAA' to appear somewhere in the string
    pattern = re.compile(r'^(?=.*NWDS)(?=.*NCAA).*$')
    if pattern.match(row_string):
        return True
    else:
        return False

# Convert each row to a list, then to a string, and test it against the pattern
df['str_match'] = df.apply(lambda row: find_strs(str(row.tolist())), axis=1)
Output df:
1 2 3 4 5 6 7 str_match
0 20 -1.8 Grand NWDS WAC NCAA None True
1 20 -0.5 Kokomo NWDS Smr None 35.0 False
2 None College None 36 93 80 3.0 False
3 None NCAA NWDS 35 145 118 18.0 True
4 None None None 71 238 198 21.0 False
Next we can take a slice of the matching rows with this.
df = df.loc[df['str_match'] == True]
Output df:
1 2 3 4 5 6 7 str_match
0 20 -1.8 Grand NWDS WAC NCAA None True
3 None NCAA NWDS 35 145 118 18.0 True
So basically what I would do is run this on each of your dfs with a for loop and append each df of matching rows to a list. Then you can use pd.concat() on your new list of df slices, as sketched below. This part should be pretty simple, but post a comment if you need a hand and I will add to this answer.
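A minimal sketch of that loop, assuming nwl_21 is the nested list of player dataframes from the question and find_strs() is defined as above:
matched_slices = []
for team in nwl_21:                       # each team is a list of player dataframes
    for player_df in team:
        # Flag rows that mention both leagues, same as with the toy df above
        mask = player_df.apply(lambda row: find_strs(str(row.tolist())), axis=1)
        if mask.any():
            matched_slices.append(player_df.loc[mask])

# Combine every matching row into one big dataframe
all_matches = pd.concat(matched_slices, ignore_index=True)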
CodePudding user response:
If we call df the dataframe in the question, and we want to put into a dataframe df2 the rows of df that contain a value equal to 'NCAA' or 'NWDS', we can use the following Python line:
df2 = df[((df == 'NCAA')|(df == 'NWDS')).any(axis=1)]
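To run this across every player dataframe, one option (a sketch, assuming nwl_21 is the nested list of dataframes from the question) is a comprehension followed by pd.concat():
slices = [player_df[((player_df == 'NCAA') | (player_df == 'NWDS')).any(axis=1)]
          for team in nwl_21
          for player_df in team]
combined = pd.concat(slices, ignore_index=True)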