For this project I built a web crawler. The crawler visits a league home page on Baseball Reference, collects all of the team links found on that page into a list, then scrapes each link in that list, collecting data on every player who played on that team for a given season. When it is done, it returns a list of lists of dataframes. In this example there are 22 teams, so it returns a list containing 22 lists of player dataframes, totaling about 640 dataframes (players).
Here is an example of one of those player dataframes (nwl_21 is the list; the first index of 0 gets the first of the 22 teams, and the second index of 0 gets the first player on that team):
nwl_21[0][0]
index 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 19 -2.2 Grand Canyon WAC NCAA None 4.0 11.0 10.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 4.0 0.0
2 20 -1.8 Grand Canyon WAC NCAA None 32.0 82.0 70.0 3.0 12.0 0.0 0.0 1.0 5.0 1.0 0.0 3.0 24.0 0.171
3 20 -0.5 Kokomo NWDS Smr None 35.0 145.0 118.0 18.0 20.0 1.0 1.0 0.0 11.0 0.0 0.0 22.0 27.0 0.17
4 None College None 36 93 80 3.0 12.0 0.0 0.0 1.0 5.0 2.0 0.0 3.0 28.0 0.15 0.207 0.188 0.394
5 None Other None 35 145 118 18.0 20.0 1.0 1.0 0.0 11.0 0.0 0.0 22.0 27.0 0.17 0.317 0.195 0.512
6 None None None 71 238 198 21.0 32.0 1.0 1.0 1.0 16.0 2.0 0.0 25.0 55.0 0.162 0.276 0.192 0.468
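To illustrate the nested layout, here is a quick sketch of how it can be traversed (nwl_21 being the list returned by the crawler):
total_players = sum(len(team) for team in nwl_21)   # roughly 640 player dataframes
first_team = nwl_21[0]          # list of player dataframes for the first team
first_player = first_team[0]    # the same dataframe as nwl_21[0][0] above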
Now my goal is to access the rows that contain 'NCAA' and 'NWDS' and store them somewhere. The ultimate goal is to have a large sample of this data so I can build a multiple regression model that can predict performance in the 'NWDS' league based on prior 'NCAA' statistics.
Any and all help is greatly appreciated.
CodePudding user response:
You can use a regular expression to find the rows that contain both NCAA and NWDS. Here we have a find_strs() function that we can run on each row in the df with .apply() and axis=1. In this case, each row is converted to a list, and then to a string, which is passed as the find_strs() argument.
import pandas as pd
import re

df = pd.DataFrame({
    1: {0: '20', 1: '20', 2: 'None', 3: 'None', 4: 'None'},
    2: {0: '-1.8', 1: '-0.5', 2: 'College', 3: 'NCAA', 4: 'None'},
    3: {0: 'Grand', 1: 'Kokomo', 2: 'None', 3: 'NWDS', 4: 'None'},
    4: {0: 'NWDS', 1: 'NWDS', 2: '36', 3: '35', 4: '71'},
    5: {0: 'WAC', 1: 'Smr', 2: '93', 3: '145', 4: '238'},
    6: {0: 'NCAA', 1: 'None', 2: '80', 3: '118', 4: '198'},
    7: {0: 'None', 1: '35.0', 2: '3.0', 3: '18.0', 4: '21.0'}})

def find_strs(row_string):
    # The two lookaheads require both 'NWDS' and 'NCAA' to appear somewhere in the string
    pattern = re.compile(r'^(?=.*NWDS)(?=.*NCAA).*$')
    if pattern.match(row_string):
        return True
    else:
        return False

# Convert each row to a list, then to a string, and test it against the pattern
df['str_match'] = df.apply(lambda row: find_strs(str(row.tolist())), axis=1)
Output df:
1 2 3 4 5 6 7 str_match
0 20 -1.8 Grand NWDS WAC NCAA None True
1 20 -0.5 Kokomo NWDS Smr None 35.0 False
2 None College None 36 93 80 3.0 False
3 None NCAA NWDS 35 145 118 18.0 True
4 None None None 71 238 198 21.0 False
Next we can take a slice of the matching rows with this.
df = df.loc[df['str_match'] == True]
Output df:
1 2 3 4 5 6 7 str_match
0 20 -1.8 Grand NWDS WAC NCAA None True
3 None NCAA NWDS 35 145 118 18.0 True
So basically what I would do is run this on each of your dfs with a for loop and append each df of matching rows to a list. Then you can use pd.concat() on your new list of df slices, as sketched below. This part should be pretty simple, but post a comment if you need a hand and I will add to this answer.
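A minimal sketch of that loop, assuming nwl_21 is the nested list of player dataframes from the question and find_strs() is defined as above:
matched_slices = []
for team in nwl_21:                       # each team is a list of player dataframes
    for player_df in team:
        # Flag rows that mention both leagues, same as with the toy df above
        mask = player_df.apply(lambda row: find_strs(str(row.tolist())), axis=1)
        if mask.any():
            matched_slices.append(player_df.loc[mask])

# Combine every matching row into one big dataframe
all_matches = pd.concat(matched_slices, ignore_index=True)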
CodePudding user response:
If we call df the dataframe in the question, and we want to put into a dataframe df2 the rows of df that contain a value equal to 'NCAA' or 'NWDS', we can use the following Python line:
df2 = df[((df == 'NCAA')|(df == 'NWDS')).any(axis=1)]
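To run this across every player dataframe, one option (a sketch, assuming nwl_21 is the nested list of dataframes from the question) is a comprehension followed by pd.concat():
slices = [player_df[((player_df == 'NCAA') | (player_df == 'NWDS')).any(axis=1)]
          for team in nwl_21
          for player_df in team]
combined = pd.concat(slices, ignore_index=True)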