I have a excel file containing three columns as shown below,
id | name | Date |
---|---|---|
436 | Minster | 2020-04-15 |
436 | Minster (HTTP gg AG) | 2021-12-07 |
145 | Denskin (HTTP geplan) | 2020-07-24 |
145 | Denskinf HTTP DTAG | 2020-08-15 |
555 | Garben | 2021-03-05 |
555 | Wabern (HttP) | 2021-09-13 |
555 | Wabern Garben HTTP | 2022-04-18 |
737 | oyehausen | 2020-06-26 |
737 | WerrePark HTTP ag | 2020-07-14 |
737 | Werre Park (http ssd) | 2020-08-25 |
737 | Werre Park (HTTP) | 2021-03-15 |
884 | klintern | 2021-03-23 |
884 | kitern http | 2021-04-08 |
884 | Lausen (http los) | 2021-06-16 |
884 | kitener (http geplan) | 2021-07-24 |
584 | Lausern | 2020-08-15 |
584 | Lausern (HTTP DTAG gg) | 2021-03-05 |
Is it possible to filter out the id, name and the date when there is a change in name if HTTP in any form like HttP, (HTTP), http is included in the name at first event of occurance. For Example id:436 doesn't have any form http text included in its first row but in the second row with the same id:436 HTTP is included, but for the id:145 the first row itself has the HTTP. But I wanted to filter out the change in name which includes HTTP in any form of text either small or captial in the first event of occurence, with its id, name and date.
Expecting the result to be like,
id | name | Date |
---|---|---|
436 | Minster (HTTP gg AG) | 2021-12-07 |
555 | Wabern (HttP) | 2021-09-13 |
737 | WerrePark HTTP ag | 2020-07-14 |
884 | kitern http | 2021-04-08 |
584 | Lausern (HTTP DTAG gg) | 2021-03-05 |
CodePudding user response:
Create a boolean mask has_http
to identify the rows containing http
then group this mask by id
and use shift
to create another boolean mask to identify whether previous row contains http. Then combine the masks using &
to identify the changes
has_http = df['name'].str.contains(r'(?i)\bhttp\b')
mask = has_http & ~has_http.groupby(df['id']).shift(fill_value=True)
Now use the resulting mask to filter the rows
df[mask]
id name Date
1 436 Minster (HTTP gg AG) 2021-12-07
5 555 Wabern (HttP) 2021-09-13
8 737 WerrePark HTTP ag 2020-07-14
12 884 kitern http 2021-04-08
16 584 Lausern (HTTP DTAG gg) 2021-03-05
CodePudding user response:
Here is another approach using regex for any possible combination of uppercase/lowercase HTTP:
import re
import pandas as pd
data = [
["436", "Minster", "2020-04-15"],
["436", "Minster (HTTP gg AG)", "2021-12-07"],
["145", "Denskin (HTTP geplan)", "2020-07-24"],
["145", "Denskinf HTTP DTAG", "2020-08-15"],
["884", "klintern", "2021-03-23"],
["884", "klintern http", "2021-04-08"],
["884", "kitener (http geplan)", "2021-07-24"]
]
df = pd.DataFrame(data, columns=['id', 'name', 'Date'])
filtered_df = df.loc[df['name'].str.contains("http", flags=re.IGNORECASE)]
filtered_df = filtered_df.groupby("id").first()
Output:
id name Date
145 Denskin (HTTP geplan) 2020-07-24
436 Minster (HTTP gg AG) 2021-12-07
884 klintern http 2021-04-08