How to find the change of text based on a unique value on another column in an excel file using Pyth-CodePudding

I have a excel file containing three columns as shown below,

ID	Name	Date
117	Laspringe	2019-04-08
117	Laspringe (FT)	2020-06-16
117	Laspringe (Ftp)	2020-07-24
999	Angelo	2020-04-15
999	Angelo(FT)	2021-03-05
999	Angelo(Ftp)	2021-09-13
999	Angelo	2022-02-20

I wanted to find out that based on each ID which has the name changed from original name and changed back to the same original name. For example Angelo is changed to Angelo(FT), Angelo(Ftp) and changed back to original Angelo.

Whereas Laspringe is not changed back to the original name.

Is it possible to find out which of the ID's have changed the name back to original using python ??

Expecting the result to be like,

ID
999

CodePudding user response：

A simple way might be to check if the Name has any duplicate per group:

s = df.duplicated(['ID', 'Name']).groupby(df['ID']).any()
out = s[s].index.tolist()

Output: [999]

If you can have duplicates on successive dates (A -> A -> B shouldn't be a match):

s = (df
  .sort_values(by='Date')
  .groupby('ID')['Name']
  .agg(lambda s: s[s.ne(s.shift())].duplicated().any())
)
out = s[s].index.tolist()

The two code will behave differently on this input:

    ID             Name        Date
0  117        Laspringe  2019-04-08
1  117        Laspringe  2019-04-09 # duplicated but no intermediate name
2  117   Laspringe (FT)  2020-06-16
3  117  Laspringe (Ftp)  2020-07-24
4  999           Angelo  2020-04-15
5  999       Angelo(FT)  2021-03-05
6  999      Angelo(Ftp)  2021-09-13
7  999           Angelo  2022-02-29

CodePudding user response：

You can iterate over the columns in an excel sheet using openpyxl. In this case I've used a defaultdict to build a list of names for each id, and then the final check is that the first and last item in each list are the same.

import openpyxl, collections

ws = openpyxl.load_workbook('Book1.xlsx').active
name_dict = collections.defaultdict(list)
ids, names = ([cell.value for cell in col] for col in ws.iter_cols(1,2))
for id_, name in zip(ids[1:],names[1:]):  # [1:] to ignore the header row
    name_dict[id_].append(name)
print(*[k for k,v in name_dict.items() if v[0]==v[-1]])