Home > Software design >  Find consecutive indices of common values in a group
Find consecutive indices of common values in a group

Time:01-31

For the DataFrame below:

df = pd.DataFrame({'Name': ['Tesla','Tesla','Tesla','Toyota','Ford','Ford','Ford','BMW','BMW','BMW','Mercedes','Mercedes','Mercedes'],
                   'Type': ['Model X','Model X','Model X','Corolla','Bronco','Bronco','Mustang','3 Series','-','-','Alpha','Alpha','S-Class'],
                   'Year': [2015, 2015, 2015, 2017, 2018, 2018, 2020, 2015, 2015, 2017, 2018, 2018, 2020],
                   'Price': [85000, 90000, 95000, 20000, 35000, 35000, 45000, 40000, 40000, 65000, 50000, 50000, 75000],
                   'Color': ['White','White','White','Red','Blue','Blue','Yellow','Silver','Silver','Black','White','White','Black']
                  })

df

    Name    Type     Year   Price   Color
0   Tesla   Model X  2015   85000   White
1   Tesla   Model X  2015   90000   White
2   Tesla   Model X  2015   95000   White
3   Toyota  Corolla  2017   20000   Red
4   Ford    Bronco   2018   35000   Blue
5   Ford    Bronco   2018   35000   Blue
6   Ford    Mustang  2020   45000   Yellow
7   BMW     3 Series 2015   40000   Silver
8   BMW  -   2015   40000   Silver
9   BMW  -   2017   65000   Black
10  Mercedes Alpha   2018   50000   White
11  Mercedes Alpha   2018   50000   White
12  Mercedes S-Class 2020   75000   Black

For each Name column groups, I am trying to find the corresponding start index and end index of consecutive common values in other columns. if the value repeat consecutively it should be considered as common and store it to a dictionary with key as start index and value as end index.

For example Tesla has Model X common in the Type column, so in the Type column I should get the start index of Model X and end index of Model X as the first value in a dictionary with start index : `end index as the key value pair of the dictionary.

Similarly, for the Color column Blue should have start index as 4 and end index as 5 and for the Year column 2015 should have start index as 0 and end index as 2.

Expected Output:

Name :

{0: 2, 3: 3, 4: 6, 7: 9, 10: 12}

Type :

{0: 2, 3: 3, 4: 5, 6: 6, 7: 7, 8: 9, 10: 11, 12: 12}

Year :

{0: 2, 3: 3, 4: 5, 6: 6, 7: 8, 9: 9, 10: 11, 12: 12] }

Price:

{0: 0, 1: 1, 2: 2, 3: 3, 4: 5, 6: 6, 7: 8, 9: 9, 10: 11, 12: 12} 

Color:

{0: 2, 3: 3, 4: 5, 6: 6, 7: 8, 9: 9, 10: 11, 12:12}

What I tried

def find_indices(df, column):
    df1 = df.index.to_series().groupby(df[column]).agg(['first', 'last']).reset_index()
    df1 = df1.sort_values("first").reset_index()
    first_last_rows = df1.set_index('first')['last'].to_dict()
    print(column   ":")
    print(first_last_rows)

Output I got

Name:
{0: 2, 3: 3, 4: 6, 7: 9, 10: 12}
Type:
{0: 2, 3: 3, 4: 5, 6: 6, 7: 7, 8: 9, 10: 11, 12: 12}
Year:
{0: 8, 3: 9, 4: 11, 6: 12}
Price:
{0: 0, 1: 1, 2: 2, 3: 3, 4: 5, 6: 6, 7: 8, 9: 9, 10: 11, 12: 12}
Color:
{0: 11, 3: 3, 4: 5, 6: 6, 7: 8, 9: 12}

In the output I got Year and Color columns end values are from disjoint group so it is not finding the consecutive common values(or the values are for different Name)

CodePudding user response:

It seems you want a group of consecutive values. For that a well known method is to use shift and cumsum and then groupby, and then for your problem you take first index and last index of each group and then build a dict as below:

col = "Year"
col_dict = dict(
    df.groupby(df[col].ne(df[col].shift()).cumsum())[col]
    .agg(lambda x: (x.index[0], x.index[-1]))
    .values
)
print(col_dict)

{0: 2, 3: 3, 4: 5, 6: 6, 7: 8, 9: 9, 10: 11, 12: 12}
  • Related