Home > Net >  How do I iterate through the column of a pandas dataframe and delete rows accordingly? (Jupyter Note
How do I iterate through the column of a pandas dataframe and delete rows accordingly? (Jupyter Note

Time:10-04

This is the dataframe I'm working with right now:

Season Team W L W/L% Coaches
1 2020-21 Atlanta Hawks* 41 31 0.569 L. Pierce (14-20) N. McMillan (27-11)
2 2019-20 Atlanta Hawks 20 47 0.299 L. Pierce (20-47)
3 2018-19 Atlanta Hawks 29 53 0.354 L. Pierce (29-53)
4 2017-18 Atlanta Hawks 24 58 0.293 M. Budenholzer (24-58)
5 2016-17 Atlanta Hawks* 43 39 0.524 M. Budenholzer (43-39)
6 2015-16 Atlanta Hawks* 48 34 0.585 M. Budenholzer (48-34)
7 2014-15 Atlanta Hawks* 60 22 0.732 M. Budenholzer (60-22)
8 2013-14 Atlanta Hawks* 38 44 0.463 M. Budenholzer (38-44)
9 2012-13 Atlanta Hawks* 44 38 0.537 L. Drew (44-38)
10 2011-12 Atlanta Hawks* 40 26 0.606 L. Drew (40-26)
11 2010-11 Atlanta Hawks* 44 38 0.537 L. Drew (44-38)
12 2009-10 Atlanta Hawks* 53 29 0.646 M. Woodson (53-29)

I basically want to only have rows where the Coach's name is different from one year to the next. So for example, I would keep Row 4 since its immediately adjacent row Row 3 has a different name under 'Coaches' column, but I would delete Row 5, 6, and 7 because the name in the 'Coaches' column is the same for all three rows. But, I would want to keep Row 8, because Row 9 (adjacent Row) has a different name in 'Coaches.'

I have gotten this dataframe from reading a csv file df = pd.read_csv("hawks.csv") and I think I am supposed to call df.iloc but I don't know how to iterate through each row and compare the values in the column. So far I have only managed to print the string values in the 'Coaches' column like this:

coaches = df.iloc[0:, 7]
for name in coaches:
    print(name)

But I want to know how to obtain the value stored in the 'Coaches' column when I iterate through each row (and then delete rows that do not fit the criteria I am looking for). Much appreciated!

CodePudding user response:

To iterate through dataframe:

it = df.iterrows()  #get iterator
for index, row in it:

    # to delete row
    if cond:
        df.drop([row], inplace=True)

# reset indices if needed
df.reset_index(drop=True, inplace=True)

CodePudding user response:

You can filter the column by comparing consecutive rows with shift:

import pandas as pd

data = [ { "idx": 1, "Season": "2020-21", "Team": "Atlanta Hawks*", "W": 41, "L": 31, "Coaches": "L. Pierce (14-20) N. McMillan (27-11)" }, { "idx": 2, "Season": "2019-20", "Team": "Atlanta Hawks", "W": 20, "L": 47, "Coaches": "L. Pierce (20-47)" }, { "idx": 3, "Season": "2018-19", "Team": "Atlanta Hawks", "W": 29, "L": 53, "Coaches": "L. Pierce (29-53)" }, { "idx": 4, "Season": "2017-18", "Team": "Atlanta Hawks", "W": 24, "L": 58, "Coaches": "M. Budenholzer (24-58)" }, { "idx": 5, "Season": "2016-17", "Team": "Atlanta Hawks*", "W": 43, "L": 39, "Coaches": "M. Budenholzer (43-39)" }, { "idx": 6, "Season": "2015-16", "Team": "Atlanta Hawks*", "W": 48, "L": 34, "Coaches": "M. Budenholzer (48-34)" }, { "idx": 7, "Season": "2014-15", "Team": "Atlanta Hawks*", "W": 60, "L": 22, "Coaches": "M. Budenholzer (60-22)" }, { "idx": 8, "Season": "2013-14", "Team": "Atlanta Hawks*", "W": 38, "L": 44, "Coaches": "M. Budenholzer (38-44)" }, { "idx": 9, "Season": "2012-13", "Team": "Atlanta Hawks*", "W": 44, "L": 38, "Coaches": "L. Drew (44-38)" }, { "idx": 10, "Season": "2011-12", "Team": "Atlanta Hawks*", "W": 40, "L": 26, "Coaches": "L. Drew (40-26)" }, { "idx": 11, "Season": "2010-11", "Team": "Atlanta Hawks*", "W": 44, "L": 38, "Coaches": "L. Drew (44-38)" }, { "idx": 12, "Season": "2009-10", "Team": "Atlanta Hawks*", "W": 53, "L": 29, "Coaches": "M. Woodson (53-29)" } ]
df = pd.DataFrame(data)
df[(df['Coaches'].str.split('(').str[0] != df['Coaches'].shift(1).str.split('(').str[0]) | (df['Coaches'].str.split('(').str[0] != df['Coaches'].shift(-1).str.split('(').str[0])]

Output:

idx Season Team W L Coaches
0 1 2020-21 Atlanta Hawks* 41 31 L. Pierce (14-20) N. McMillan (27-11)
2 3 2018-19 Atlanta Hawks 29 53 L. Pierce (29-53)
3 4 2017-18 Atlanta Hawks 24 58 M. Budenholzer (24-58)
7 8 2013-14 Atlanta Hawks* 38 44 M. Budenholzer (38-44)
8 9 2012-13 Atlanta Hawks* 44 38 L. Drew (44-38)
10 11 2010-11 Atlanta Hawks* 44 38 L. Drew (44-38)
11 12 2009-10 Atlanta Hawks* 53 29 M. Woodson (53-29)

I wasn't sure what you whether you'd want to count Pierce/McMillan as the first mention of Pierce, my answer assumes so. If you want to count this as a separate hit, just replace split('(') with rsplit('(', 1).

CodePudding user response:

You can use drop_duplicates method.

comp = re.compile(r'[A-Z].\s\w ')
df['Coaches_Names'] = df['Coaches'].apply(lambda x: ' - '.join(comp.findall(x)))  # New column which includes only coache's names 
df['Team'] = df['Team'].apply(lambda x: ''.join(re.sub('\*', '', x)))   # Dropping '*' characters from name of teams
    
df.drop_duplicates(subset=['Team', 'Coaches_Names'], inplace=True)

print(df)
     Season           Team   W   L   W/L%                                Coaches            Coaches_Names
0   2020-21  Atlanta Hawks  41  31  0.569  L. Pierce (14-20) N. McMillan (27-11)  L. Pierce - N. McMillan
1   2019-20  Atlanta Hawks  20  47  0.299                      L. Pierce (20-47)                L. Pierce
3   2017-18  Atlanta Hawks  24  58  0.293                 M. Budenholzer (24-58)           M. Budenholzer
8   2012-13  Atlanta Hawks  44  38  0.537                        L. Drew (44-38)                  L. Drew
11  2009-10  Atlanta Hawks  53  29  0.646                     M. Woodson (53-29)               M. Woodson

It is up to you to drop or not Coaches_Names Column.

Additionally, you can prefer to use keep parameter. It allows you to make selection on first or last year. For more information you can look at documents. https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html

EDIT

Code is editted to format columns, @RJ, Adriaansen

  • Related