For the DataFrame below:
import pandas as pd

df = pd.DataFrame({
    'Name': ['Tesla','Tesla','Tesla','Toyota','Ford','Ford','Ford','BMW','BMW','BMW','Mercedes','Mercedes','Mercedes'],
    'Type': ['Model X','Model X','Model X','Corolla','Bronco','Bronco','Mustang','3 Series','-','-','Alpha','Alpha','S-Class'],
    'Year': [2015, 2015, 2015, 2017, 2018, 2018, 2020, 2015, 2015, 2017, 2018, 2018, 2020],
    'Price': [85000, 90000, 95000, 20000, 35000, 35000, 45000, 40000, 40000, 65000, 50000, 50000, 75000],
    'Color': ['White','White','White','Red','Blue','Blue','Yellow','Silver','Silver','Black','White','White','Black']
})
df
        Name      Type  Year  Price   Color
0      Tesla   Model X  2015  85000   White
1      Tesla   Model X  2015  90000   White
2      Tesla   Model X  2015  95000   White
3     Toyota   Corolla  2017  20000     Red
4       Ford    Bronco  2018  35000    Blue
5       Ford    Bronco  2018  35000    Blue
6       Ford   Mustang  2020  45000  Yellow
7        BMW  3 Series  2015  40000  Silver
8        BMW         -  2015  40000  Silver
9        BMW         -  2017  65000   Black
10  Mercedes     Alpha  2018  50000   White
11  Mercedes     Alpha  2018  50000   White
12  Mercedes   S-Class  2020  75000   Black
For each Name group, I am trying to find the start index and end index of every run of consecutive equal values in the other columns. If a value repeats consecutively, it should count as common, and the run should be stored in a dictionary with the start index as the key and the end index as the value.
For example, Tesla has Model X in common in the Type column, so for the Type column the first entry in the dictionary should be the start index and end index of Model X, stored as a start index: end index key-value pair.
Similarly, for the Color column, Blue should have start index 4 and end index 5, and for the Year column, 2015 should have start index 0 and end index 2.
Expected Output:
Name:
{0: 2, 3: 3, 4: 6, 7: 9, 10: 12}
Type:
{0: 2, 3: 3, 4: 5, 6: 6, 7: 7, 8: 9, 10: 11, 12: 12}
Year:
{0: 2, 3: 3, 4: 5, 6: 6, 7: 8, 9: 9, 10: 11, 12: 12}
Price:
{0: 0, 1: 1, 2: 2, 3: 3, 4: 5, 6: 6, 7: 8, 9: 9, 10: 11, 12: 12}
Color:
{0: 2, 3: 3, 4: 5, 6: 6, 7: 8, 9: 9, 10: 11, 12: 12}
What I tried
def find_indices(df, column):
    # First and last row index for each distinct value in the column
    df1 = df.index.to_series().groupby(df[column]).agg(['first', 'last']).reset_index()
    df1 = df1.sort_values("first").reset_index()
    first_last_rows = df1.set_index('first')['last'].to_dict()
    print(column + ":")
    print(first_last_rows)
Output I got
Name:
{0: 2, 3: 3, 4: 6, 7: 9, 10: 12}
Type:
{0: 2, 3: 3, 4: 5, 6: 6, 7: 7, 8: 9, 10: 11, 12: 12}
Year:
{0: 8, 3: 9, 4: 11, 6: 12}
Price:
{0: 0, 1: 1, 2: 2, 3: 3, 4: 5, 6: 6, 7: 8, 9: 9, 10: 11, 12: 12}
Color:
{0: 11, 3: 3, 4: 5, 6: 6, 7: 8, 9: 12}
In the output I got, the end values for the Year and Color columns come from disjoint groups, so the function is not finding the consecutive common values (the matched values belong to different Name groups).
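To make the problem concrete, here is a minimal sketch using only the Year values from the DataFrame above. The variable names `year` and `positions` are just for illustration. It shows that grouping by value collects every row sharing that value, whether or not the rows are adjacent:

```python
import pandas as pd

# Only the Year column from the question is needed to show the problem.
year = pd.Series([2015, 2015, 2015, 2017, 2018, 2018, 2020,
                  2015, 2015, 2017, 2018, 2018, 2020], name="Year")

# Grouping by value pools every row that shares a value, wherever it sits.
positions = year.index.to_series().groupby(year).apply(list)

# Rows 0-2 (Tesla) and rows 7-8 (BMW) land in the same 2015 group,
# so 'first' is 0 and 'last' is 8 for that value.
print(positions.loc[2015])
```

That is where the 0: 8 entry for Year comes from.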
Answer:
It seems you want groups of consecutive values. A well-known method is to compare each value with the previous one using shift, take the cumsum of the resulting change flags so that each run gets its own label, and then groupby that label. For your problem, take the first and last index of each group and build a dict, as below:
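col = "Year"
col_dict = dict(
    # ne(shift()) flags each change of value; cumsum turns the flags into run labels
    df.groupby(df[col].ne(df[col].shift()).cumsum())[col]
    # (first index, last index) of each run
    .agg(lambda x: (x.index[0], x.index[-1]))
    .values
)
print(col_dict)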
{0: 2, 3: 3, 4: 5, 6: 6, 7: 8, 9: 9, 10: 11, 12: 12}
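The same idea can be wrapped in a helper and applied to every column. The sketch below is one possible way to do it, not the only one. The function name `consecutive_runs` and its `boundary` parameter are made up for this example. Passing `boundary='Name'` also starts a new run whenever Name changes, which keeps a run from spanning two Name groups if adjacent groups happen to share a value.

```python
import pandas as pd

def consecutive_runs(df, column, boundary=None):
    """Return {start_index: end_index} for each run of equal consecutive values."""
    # True wherever the value differs from the previous row (always True on row 0).
    changed = df[column].ne(df[column].shift())
    if boundary is not None:
        # Optionally force a new run whenever the boundary column changes too.
        changed = changed | df[boundary].ne(df[boundary].shift())
    # Cumulative sum of the change flags gives each run its own label.
    run_id = changed.cumsum()
    bounds = df.index.to_series().groupby(run_id).agg(["first", "last"])
    # tolist() converts NumPy integers to plain Python ints for clean printing.
    return dict(zip(bounds["first"].tolist(), bounds["last"].tolist()))

df = pd.DataFrame({
    'Name': ['Tesla','Tesla','Tesla','Toyota','Ford','Ford','Ford','BMW','BMW','BMW','Mercedes','Mercedes','Mercedes'],
    'Type': ['Model X','Model X','Model X','Corolla','Bronco','Bronco','Mustang','3 Series','-','-','Alpha','Alpha','S-Class'],
    'Year': [2015, 2015, 2015, 2017, 2018, 2018, 2020, 2015, 2015, 2017, 2018, 2018, 2020],
    'Price': [85000, 90000, 95000, 20000, 35000, 35000, 45000, 40000, 40000, 65000, 50000, 50000, 75000],
    'Color': ['White','White','White','Red','Blue','Blue','Yellow','Silver','Silver','Black','White','White','Black']
})

results = {col: consecutive_runs(df, col, boundary='Name') for col in df.columns}
for col, runs in results.items():
    print(col + ":")
    print(runs)
```

For this particular DataFrame the boundary makes no difference, because no value carries over from the last row of one Name group to the first row of the next. Note that missing values never compare equal, so each NaN would form its own run.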