How to find the name change based on a specific TEXT and display its id and date using python

Time:12-19

I have an Excel file containing three columns, as shown below:

id   name                    Date
436  Minster                 2020-04-15
436  Minster (HTTP gg AG)    2021-12-07
145  Denskin (HTTP geplan)   2020-07-24
145  Denskinf HTTP DTAG      2020-08-15
555  Garben                  2021-03-05
555  Wabern (HttP)           2021-09-13
555  Wabern Garben HTTP      2022-04-18
737  oyehausen               2020-06-26
737  WerrePark HTTP ag       2020-07-14
737  Werre Park (http ssd)   2020-08-25
737  Werre Park (HTTP)       2021-03-15
884  klintern                2021-03-23
884  kitern http             2021-04-08
884  Lausen (http los)       2021-06-16
884  kitener (http geplan)   2021-07-24
584  Lausern                 2020-08-15
584  Lausern (HTTP DTAG gg)  2021-03-05

Is it possible to extract the id, name and date of the row where a name first changes to include HTTP in any form, such as HttP, (HTTP) or http? For example, id 436 has no form of HTTP in its first row, but its second row does, so that second row should be returned. For id 145, the first row already contains HTTP, so nothing should be returned for that id. In other words, I want the first occurrence of a name change that introduces HTTP, in upper or lower case, together with its id, name and date.

The expected result is:

id   name                    Date
436  Minster (HTTP gg AG)    2021-12-07
555  Wabern (HttP)           2021-09-13
737  WerrePark HTTP ag       2020-07-14
884  kitern http             2021-04-08
584  Lausern (HTTP DTAG gg)  2021-03-05

CodePudding user response:

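Create a boolean mask, has_http, that identifies the rows whose name contains http. Then group this mask by id and use shift to build a second mask that tells whether the previous row of the same id contains http. Passing fill_value=True treats the first row of each id as if it were preceded by an http row, so a first row is never reported as a change. Finally, combine the two masks with & and ~ to find the rows where http appears and the previous row did not have it: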

has_http = df['name'].str.contains(r'(?i)\bhttp\b')
mask = has_http & ~has_http.groupby(df['id']).shift(fill_value=True)

Now use the resulting mask to filter the rows:

df[mask]


     id                    name        Date
1   436    Minster (HTTP gg AG)  2021-12-07
5   555           Wabern (HttP)  2021-09-13
8   737       WerrePark HTTP ag  2020-07-14
12  884             kitern http  2021-04-08
16  584  Lausern (HTTP DTAG gg)  2021-03-05
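
To try the approach above end to end, here is a minimal self-contained sketch. It assumes the table is rebuilt in memory rather than read from the Excel file, and that rows are already in chronological order within each id. The name prev_has_http and the defensive astype(bool) cast are additions for illustration and are not part of the answer above.

```python
import pandas as pd

# The question's table, rebuilt in memory (assumption: rows are already
# in date order within each id).
rows = [
    (436, "Minster", "2020-04-15"),
    (436, "Minster (HTTP gg AG)", "2021-12-07"),
    (145, "Denskin (HTTP geplan)", "2020-07-24"),
    (145, "Denskinf HTTP DTAG", "2020-08-15"),
    (555, "Garben", "2021-03-05"),
    (555, "Wabern (HttP)", "2021-09-13"),
    (555, "Wabern Garben HTTP", "2022-04-18"),
    (737, "oyehausen", "2020-06-26"),
    (737, "WerrePark HTTP ag", "2020-07-14"),
    (737, "Werre Park (http ssd)", "2020-08-25"),
    (737, "Werre Park (HTTP)", "2021-03-15"),
    (884, "klintern", "2021-03-23"),
    (884, "kitern http", "2021-04-08"),
    (884, "Lausen (http los)", "2021-06-16"),
    (884, "kitener (http geplan)", "2021-07-24"),
    (584, "Lausern", "2020-08-15"),
    (584, "Lausern (HTTP DTAG gg)", "2021-03-05"),
]
df = pd.DataFrame(rows, columns=["id", "name", "Date"])

# True where the name contains "http" as a whole word, in any letter case.
has_http = df["name"].str.contains(r"(?i)\bhttp\b")

# Flag of the previous row within the same id. The first row of each id has
# no predecessor, so fill_value=True makes it count as "already had http".
# astype(bool) is a defensive cast so that ~ behaves as a logical NOT.
prev_has_http = has_http.groupby(df["id"]).shift(fill_value=True).astype(bool)

# Rows where http is present now but was absent in the previous row.
result = df[has_http & ~prev_has_http]
print(result)
```

Because the mask compares each row only with the row directly before it, an id whose name gains http, loses it, and gains it again would be reported twice. The sample data contains no such case.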

CodePudding user response:

Here is another approach that uses a case-insensitive regex match to catch any uppercase/lowercase combination of HTTP, then keeps the first matching row for each id:

import re
import pandas as pd

# Sample rows (ids stored as strings)
data = [
    ["436", "Minster", "2020-04-15"],
    ["436", "Minster (HTTP gg AG)", "2021-12-07"],
    ["145", "Denskin (HTTP geplan)", "2020-07-24"],
    ["145", "Denskinf HTTP DTAG", "2020-08-15"],
    ["884", "klintern", "2021-03-23"],
    ["884", "klintern http", "2021-04-08"],
    ["884", "kitener (http geplan)", "2021-07-24"],
]

df = pd.DataFrame(data, columns=['id', 'name', 'Date'])

# Keep only the rows whose name contains "http" in any letter case
filtered_df = df.loc[df['name'].str.contains("http", flags=re.IGNORECASE)]
# For each id, keep the first remaining row
filtered_df = filtered_df.groupby("id").first()

Output:

id                 name        Date
145  Denskin (HTTP geplan)  2020-07-24
436   Minster (HTTP gg AG)  2021-12-07
884          klintern http  2021-04-08
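
Note that this output keeps id 145, which the question's expected result leaves out because its first row already contains HTTP. One possible adjustment is sketched below, assuming rows are in chronological order within each id. The name first_has_http is introduced here for illustration and is not part of the answer above.

```python
import re
import pandas as pd

# Same sample rows as in the answer above (ids stored as strings).
data = [
    ["436", "Minster", "2020-04-15"],
    ["436", "Minster (HTTP gg AG)", "2021-12-07"],
    ["145", "Denskin (HTTP geplan)", "2020-07-24"],
    ["145", "Denskinf HTTP DTAG", "2020-08-15"],
    ["884", "klintern", "2021-03-23"],
    ["884", "klintern http", "2021-04-08"],
    ["884", "kitener (http geplan)", "2021-07-24"],
]
df = pd.DataFrame(data, columns=["id", "name", "Date"])

# True where the name contains "http" in any letter case.
has_http = df["name"].str.contains("http", flags=re.IGNORECASE)

# For every row, whether the first row of its id already contains "http".
# astype(bool) is a defensive cast so that ~ behaves as a logical NOT.
first_has_http = has_http.groupby(df["id"]).transform("first").astype(bool)

# Keep http rows only for ids that started without http, then keep the
# earliest such row per id.
filtered_df = df[has_http & ~first_has_http].drop_duplicates("id")
print(filtered_df)
```

Unlike groupby("id").first(), drop_duplicates keeps id as a regular column and preserves the original row order.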