I have a excel file containing three columns as shown below,
ID | Name | Date |
---|---|---|
117 | Laspringe | 2019-04-08 |
117 | Laspringe (FT) | 2020-06-16 |
117 | Laspringe (Ftp) | 2020-07-24 |
999 | Angelo | 2020-04-15 |
999 | Angelo(FT) | 2021-03-05 |
999 | Angelo(Ftp) | 2021-09-13 |
999 | Angelo | 2022-02-20 |
I wanted to find out that based on each ID which has the name changed from original name and changed back to the same original name. For example Angelo is changed to Angelo(FT), Angelo(Ftp) and changed back to original Angelo.
Whereas Laspringe is not changed back to the original name.
Is it possible to find out which of the ID's have changed the name back to original using python ??
Expecting the result to be like,
ID |
---|
999 |
CodePudding user response:
A simple way might be to check if the Name has any duplicate per group:
s = df.duplicated(['ID', 'Name']).groupby(df['ID']).any()
out = s[s].index.tolist()
Output: [999]
If you can have duplicates on successive dates (A -> A -> B shouldn't be a match):
s = (df
.sort_values(by='Date')
.groupby('ID')['Name']
.agg(lambda s: s[s.ne(s.shift())].duplicated().any())
)
out = s[s].index.tolist()
The two code will behave differently on this input:
ID Name Date
0 117 Laspringe 2019-04-08
1 117 Laspringe 2019-04-09 # duplicated but no intermediate name
2 117 Laspringe (FT) 2020-06-16
3 117 Laspringe (Ftp) 2020-07-24
4 999 Angelo 2020-04-15
5 999 Angelo(FT) 2021-03-05
6 999 Angelo(Ftp) 2021-09-13
7 999 Angelo 2022-02-29
CodePudding user response:
You can iterate over the columns in an excel sheet using openpyxl. In this case I've used a defaultdict to build a list of names for each id, and then the final check is that the first and last item in each list are the same.
import openpyxl, collections
ws = openpyxl.load_workbook('Book1.xlsx').active
name_dict = collections.defaultdict(list)
ids, names = ([cell.value for cell in col] for col in ws.iter_cols(1,2))
for id_, name in zip(ids[1:],names[1:]): # [1:] to ignore the header row
name_dict[id_].append(name)
print(*[k for k,v in name_dict.items() if v[0]==v[-1]])