How to find the name change based on a specific TEXT and display its id and date using python-CodePudding

I have a excel file containing three columns as shown below,

id	name	Date
436	Minster	2020-04-15
436	Minster (HTTP gg AG)	2021-12-07
145	Denskin (HTTP geplan)	2020-07-24
145	Denskinf HTTP DTAG	2020-08-15
555	Garben	2021-03-05
555	Wabern (HttP)	2021-09-13
555	Wabern Garben HTTP	2022-04-18
737	oyehausen	2020-06-26
737	WerrePark HTTP ag	2020-07-14
737	Werre Park (http ssd)	2020-08-25
737	Werre Park (HTTP)	2021-03-15
884	klintern	2021-03-23
884	kitern http	2021-04-08
884	Lausen (http los)	2021-06-16
884	kitener (http geplan)	2021-07-24
584	Lausern	2020-08-15
584	Lausern (HTTP DTAG gg)	2021-03-05

Is it possible to filter out the id, name and the date when there is a change in name if HTTP in any form like HttP, (HTTP), http is included in the name at first event of occurance. For Example id:436 doesn't have any form http text included in its first row but in the second row with the same id:436 HTTP is included, but for the id:145 the first row itself has the HTTP. But I wanted to filter out the change in name which includes HTTP in any form of text either small or captial in the first event of occurence, with its id, name and date.

Expecting the result to be like,

id	name	Date
436	Minster (HTTP gg AG)	2021-12-07
555	Wabern (HttP)	2021-09-13
737	WerrePark HTTP ag	2020-07-14
884	kitern http	2021-04-08
584	Lausern (HTTP DTAG gg)	2021-03-05

CodePudding user response：

Create a boolean mask has_http to identify the rows containing http then group this mask by id and use shift to create another boolean mask to identify whether previous row contains http. Then combine the masks using & to identify the changes

has_http = df['name'].str.contains(r'(?i)\bhttp\b')
mask = has_http & ~has_http.groupby(df['id']).shift(fill_value=True)

Now use the resulting mask to filter the rows

df[mask]


     id                    name        Date
1   436    Minster (HTTP gg AG)  2021-12-07
5   555           Wabern (HttP)  2021-09-13
8   737       WerrePark HTTP ag  2020-07-14
12  884             kitern http  2021-04-08
16  584  Lausern (HTTP DTAG gg)  2021-03-05

CodePudding user response：

Here is another approach using regex for any possible combination of uppercase/lowercase HTTP:

import re
import pandas as pd

data = [
["436", "Minster", "2020-04-15"],
["436", "Minster (HTTP gg AG)", "2021-12-07"],
["145", "Denskin (HTTP geplan)", "2020-07-24"],
["145", "Denskinf HTTP DTAG", "2020-08-15"],
["884", "klintern", "2021-03-23"],
["884", "klintern http", "2021-04-08"],
["884", "kitener (http geplan)", "2021-07-24"]
]

df = pd.DataFrame(data, columns=['id', 'name', 'Date'])


filtered_df = df.loc[df['name'].str.contains("http", flags=re.IGNORECASE)]
filtered_df = filtered_df.groupby("id").first()

Output:

id                 name        Date
145  Denskin (HTTP geplan)  2020-07-24
436   Minster (HTTP gg AG)  2021-12-07
884          klintern http  2021-04-08