I have a dataframe that looks like the following:
Date/time | File |
---|---|
2021-01-01 09:00:00 | File A |
2021-02-01 09:00:00 | File A |
2022-01-01 09:00:00 | File A |
2021-01-01 09:00:00 | File B |
2022-01-01 09:00:00 | File B |
2022-01-01 09:00:00 | File C |
2022-02-01 09:00:00 | File C |
2022-02-01 10:00:00 | File C |
What I am trying to do, is add a column for 'First occurrence' where each File is marked with a 1 for the first time it appears in each year
So the desired output would be:
Date/time | File | First occurrence |
---|---|---|
2021-01-01 09:00:00 | File A | 1 |
2021-02-01 09:00:00 | File A | |
2022-01-01 09:00:00 | File A | 1 |
2021-01-01 09:00:00 | File B | 1 |
2022-01-01 09:00:00 | File B | 1 |
2022-01-01 09:00:00 | File C | 1 |
2022-02-01 09:00:00 | File C | |
2022-02-01 10:00:00 | File C |
What I've tried so far which works:
df= df.sort_values(["File", "Date/time"])
df.reset_index(drop=True, inplace=True)
df["First occurrence"] = df.groupby(["File"]).cumcount()
df["First occurrence"] = np.where(df["First scan"] == 0, 1, 0)
This works at the first occurrence for each file, but doesn't work for each year. Building on that, I tried groupby methods and couldn't get it working so I went with a brute force approach which is:
# Get list of years
year_list = df["Date/time"].dt.year.unique().tolist()
year_frames = []
for year in year_list:
yr_df = df[df["Date/time"].dt.year==year]
yr_df = yr_df.sort_values(["File", "Date/time"])
yr_df = yr_df.reset_index(drop=True)
yr_df["First scan"] = yr_df.groupby(["File"]).cumcount()
yr_df["First scan"] = np.where(yr_df["First scan"] == 0, 1, 0)
year_frames.append(yr_df)
final_df = pd.concat(year_frames, axis=0, ignore_index=True)
This works for each year df but the concat overwrites values of 1 and 0 as the two frames are joined.
I'd like to get better with Pandas so would rather not brute force it in this way if there's a suitable way to do it but so far hours worth of Googling hasn't give me much to go on!
CodePudding user response:
You can create a year
column and use DataFrame.duplicated
to check the duplicated item in year
and File
columns
df['Date/time'] = pd.to_datetime(df['Date/time'])
df['year'] = df['Date/time'].dt.year
df['First occurrence'] = (~df.duplicated(['year', 'File'])).astype(int)
print(df)
Date/time File year First occurrence
0 2021-01-01 09:00:00 File A 2021 1
1 2021-02-01 09:00:00 File A 2021 0
2 2022-01-01 09:00:00 File A 2022 1
3 2021-01-01 09:00:00 File B 2021 1
4 2022-01-01 09:00:00 File B 2022 1
5 2022-01-01 09:00:00 File C 2022 1
6 2022-02-01 09:00:00 File C 2022 0
7 2022-02-01 10:00:00 File C 2022 0