Newbee here.
Given the below files, I am trying to count how many times a distinct value occurs. The data is for multiple facilities for each day so stuck how to get the correct totals.
Tried using nunique in combination with groupby but not able to get the logic.
df1 = df.groupby(['Date', 'Facility'], as_index=False)['By_Name'].nunique()
Sample Input (Sample.xlsx) of a given day [actual data is for every day of the month]:
Date | Facility | Begin Time | By Name | Preceptor | Preceptee | Circulator |
---|---|---|---|---|---|---|
2021-09-10 | Cape Fear | 10:01 | DI | KL | ||
2021-09-10 | Cape Fear | 10:31 | DI | KL | ||
2021-09-10 | Cape Fear | 10:36 | DI | KL | ||
2021-09-10 | Cape Fear | 11:58 | DI | KL | ||
2021-09-10 | Cape Fear | 12:11 | DI | KL | ||
2021-09-10 | Cape Fear | 12:56 | DI | KL | ||
2021-09-10 | Cape Fear | 13:35 | DI | KL | ||
2021-09-10 | Cape Fear | 17:30 | DI | KL | ||
2021-09-10 | Cape Fear | 09:50 | KL | DI | ||
2021-09-10 | Cape Fear | 10:47 | KL | DI | ||
2021-09-10 | Cape Fear | 11:14 | KL | DI | ||
2021-09-10 | Cape Fear | 16:18 | KL | DI | ||
2021-09-10 | Cape Fear | 16:34 | KL | DI | ||
2021-09-10 | Cape Fear | 18:09 | KL | DI | ||
2021-09-10 | Cape Fear | 18:20 | KL | DI | ||
2021-09-10 | Cape Fear | 09:30 | LOA | WH | ||
2021-09-10 | Cape Fear | 09:48 | SR | JR | ||
2021-09-10 | Cape Fear | 11:03 | SR | JR | ||
2021-09-10 | Cape Fear | 12:10 | SR | JR | ||
2021-09-10 | Cape Fear | 13:10 | SR | JR | ||
2021-09-10 | Cape Fear | 13:34 | SR | JR | ||
2021-09-10 | Cape Fear | 13:55 | SR | JR | ||
2021-09-10 | Cape Fear | 16:19 | SR | JR | ||
2021-09-10 | Cape Fear | 16:19 | SR | JR | ||
2021-09-10 | Cape Fear | 16:43 | SR | JR | ||
2021-09-10 | Cape Fear | 16:43 | SR | JR | ||
2021-09-10 | Cape Fear | 17:09 | SR | JR | ||
2021-09-10 | Cape Fear | 11:52 | WH |
Desired Output (By Facility)
Date | Facility | Entries | Begin Time | By Name | Preceptor | Preceptee | Circulator |
---|---|---|---|---|---|---|---|
2021-09-10 | Cape Fear | 6 | 09:30 | LOA | WH | ||
2021-09-10 | Cape Fear | 09:48 | SR | JR | |||
2021-09-10 | Cape Fear | 09:50 | KL | DI | |||
2021-09-10 | Cape Fear | 10:01 | DI | KL |
Any and all help will be appreciated. TIA
CodePudding user response:
IIUC, given a dataframe df
similar to the one OP shared in the question. For the sake of simplicity I didn't consider the columns Preceptor
and Preceptee
as they are empty.
import pandas as pd
df = pd.DataFrame( {'Date': [ "2021-09-10", "2021-09-10", "2021-09-10", "2021-09-10", "2021-09-10", "2021-09-10", "2021-09-10", "2021-09-10", "2021-09-10", "2021-09-10", "2021-09-10", "2021-09-10", "2021-09-10", "2021-09-10"],
'Facility': [ "Cape Fear", "Cape Fear", "Cape Fear", "Cape Fear", "Cape Fear", "Cape Fear", "Cape Fear", "Cape Fear", "Cape Fear", "Cape Fear", "Cape Fear", "Cape Fear", "Cape Fear", "Cape Fear" ],
'Begin Time': [ "10:01", "10:31", "10:36", "11:58", "12:11", "12:56", "13:35", "17:30", "09:50", "10:47", "11:14", "16:18", "16:34", "18:09" ],
'By Name': [ "DI", "DI", "DI", "DI", "DI", "DI", "DI", "DI", "KL", "KL", "KL", "KL", "KL", "KL" ],
'Circulator': [ "KL", "KL", "KL", "KL", "KL", "KL", "KL", "KL", "DI", "DI", "DI", "DI", "DI", "DI"] })
Now, and to answer your question (as the current data types on the df columns wont be a problem), let's create a new column called Entries
, that counts the number of times a By Name
occurs in each Facility
for a given Date
df['Entries'] = df.groupby(['Date', 'Facility', 'By Name'])['By Name'].transform('count')
However, as OP wants to hide the rows where the string in By Name
appears more than once
df = df.drop_duplicates(subset=['Date', 'Facility', 'By Name'], keep='first')
[Out]:
Date Facility Begin Time By Name Circulator Entries
0 2021-09-10 Cape Fear 10:01 DI KL 8
8 2021-09-10 Cape Fear 09:50 KL DI 6
Note that with the previous line the index goes from 0 to 8. If, instead of that, one wants the index to start in 0 and then go to 1, one needs to pass .reset_index()
as follows
df = df.drop_duplicates(subset=['Date', 'Facility', 'By Name'], keep='first').reset_index()
[Out]:
index Date Facility Begin Time By Name Circulator Entries
0 0 2021-09-10 Cape Fear 10:01 DI KL 8
1 8 2021-09-10 Cape Fear 09:50 KL DI 6
CodePudding user response:
UPDATED ANSWER:
Based on clarifications by OP in comments, here is a new strategy to get what is required:
df = df.set_index(['Date', 'Facility', 'Begin Time']).stack()
df.index=df.index.droplevel(3)
df = df.to_frame().rename(columns={0:'Name'}).reset_index()
df2 = df.groupby(['Date', 'Facility'])['Name'].nunique().to_frame().rename(columns={'Name': 'Entries'})
indexCols = ['Date', 'Facility', 'Entries']
df = df.join(df2, on=['Date', 'Facility']).reindex(columns=indexCols list(set(df.columns) - set(indexCols)))
df = df.assign(TEMP_COL=0).groupby(indexCols ['Name']).nth(0).drop(columns='TEMP_COL')
Explanation:
- Using
Date, Facility, Begin Time
as the index, callstack()
to rearrange the four columns containing names into a single column with the old column name indicating "Kind of Name" (By Name, Preceptor, Preceptee
orCirculator
) added as level of the row multiindex. - Drop the "Kind of Name" level (level 3) from the row multiindex.
- Rename column
0
to beName
, callreset_index()
to moveDate, Facility, Begin Time
out of the row multiindex and back into column labels. - Use
groupby()
andnunique()
to calculateEntries
: the number of unique names (across all four original columns) associated with eachDate, Facility
. - Use
join()
to add anEntries
column to the original dataframe. - Use
groupby()
onDate, Facility, Name
, then usenth(0)
to get the first row of each group (including itsBegin Time
), also using a dummy columnTEMP_COL
to make sure we don't run out of columns when performing thegroupby()
.
Example 1 (using original input from the question):
Output:
Begin Time
Date Facility Entries Name
2021-09-10 Cape Fear 6 DI 10:01
JR 09:48
KL 10:01
LOA 09:30
SR 09:48
WH 09:30
Example 2 (adds a second Facility
):
Date Facility Begin Time By Name Preceptor Preceptee Circulator
0 2021-09-11 Facility Two 10:01 AA <NA> <NA> ZZ
1 2021-09-11 Facility Two 10:31 AA <NA> <NA> ZZ
2 2021-09-11 Facility Two 10:36 AA <NA> <NA> ZZ
3 2021-09-11 Facility Two 11:58 DI <NA> <NA> KL
4 2021-09-11 Facility Two 12:11 DI <NA> <NA> KL
5 2021-09-11 Facility Two 12:56 DI <NA> <NA> KL
6 2021-09-11 Facility Two 13:35 DI <NA> <NA> KL
7 2021-09-10 Cape Fear 10:01 DI <NA> <NA> KL
8 2021-09-10 Cape Fear 10:31 DI <NA> <NA> KL
9 2021-09-10 Cape Fear 10:36 DI <NA> <NA> KL
10 2021-09-10 Cape Fear 11:58 DI <NA> <NA> KL
11 2021-09-10 Cape Fear 12:11 DI <NA> <NA> KL
12 2021-09-10 Cape Fear 12:56 DI <NA> <NA> KL
13 2021-09-10 Cape Fear 13:35 DI <NA> <NA> KL
14 2021-09-10 Cape Fear 17:30 DI <NA> <NA> KL
15 2021-09-10 Cape Fear 09:50 KL <NA> <NA> DI
16 2021-09-10 Cape Fear 10:47 KL <NA> <NA> DI
17 2021-09-10 Cape Fear 11:14 KL <NA> <NA> DI
18 2021-09-10 Cape Fear 16:18 KL <NA> <NA> DI
19 2021-09-10 Cape Fear 16:34 KL <NA> <NA> DI
20 2021-09-10 Cape Fear 18:09 KL <NA> <NA> DI
21 2021-09-10 Cape Fear 18:20 KL <NA> <NA> DI
22 2021-09-10 Cape Fear 09:30 LOA <NA> <NA> WH
23 2021-09-10 Cape Fear 09:48 SR <NA> <NA> JR
24 2021-09-10 Cape Fear 11:03 SR <NA> <NA> JR
25 2021-09-10 Cape Fear 12:10 SR <NA> <NA> JR
26 2021-09-10 Cape Fear 13:10 SR <NA> <NA> JR
27 2021-09-10 Cape Fear 13:34 SR <NA> <NA> JR
28 2021-09-10 Cape Fear 13:55 SR <NA> <NA> JR
29 2021-09-10 Cape Fear 16:19 SR <NA> <NA> JR
30 2021-09-10 Cape Fear 16:19 SR <NA> <NA> JR
31 2021-09-10 Cape Fear 16:43 SR <NA> <NA> JR
32 2021-09-10 Cape Fear 16:43 SR <NA> <NA> JR
33 2021-09-10 Cape Fear 17:09 SR <NA> <NA> JR
34 2021-09-10 Cape Fear 11:52 WH <NA> <NA> <NA>
Output:
2021-09-10 Cape Fear 6 DI 10:01
JR 09:48
KL 10:01
LOA 09:30
SR 09:48
WH 09:30
2021-09-11 Facility Two 4 AA 10:01
DI 11:58
KL 11:58
ZZ 10:01
ORIGINAL ANSWER:
Here's a way to get something close to what your question shows as desired output:
df1 = df.groupby(['Date', 'Facility'], as_index=False)['By Name'].nunique().rename(columns={'By Name': 'Entries'}).set_index(['Date', 'Facility'])
df2 = df[['Date', 'Facility', 'By Name']].join(df1, on=['Date', 'Facility'])
df = df.assign(Entries=df2['Entries']).reindex(columns=['Date', 'Facility', 'Entries'] [col for col in df.columns if col not in ['Date', 'Facility']])
df = df.groupby(['Date', 'Facility', 'Entries', 'By Name']).nth(0)
Example 1 (uses the original input in the question):
Date Facility Begin Time By Name Preceptor Preceptee Circulator
0 2021-09-10 Cape Fear 10:01 DI <NA> <NA> KL
1 2021-09-10 Cape Fear 10:31 DI <NA> <NA> KL
2 2021-09-10 Cape Fear 10:36 DI <NA> <NA> KL
3 2021-09-10 Cape Fear 11:58 DI <NA> <NA> KL
4 2021-09-10 Cape Fear 12:11 DI <NA> <NA> KL
5 2021-09-10 Cape Fear 12:56 DI <NA> <NA> KL
6 2021-09-10 Cape Fear 13:35 DI <NA> <NA> KL
7 2021-09-10 Cape Fear 17:30 DI <NA> <NA> KL
8 2021-09-10 Cape Fear 09:50 KL <NA> <NA> DI
9 2021-09-10 Cape Fear 10:47 KL <NA> <NA> DI
10 2021-09-10 Cape Fear 11:14 KL <NA> <NA> DI
11 2021-09-10 Cape Fear 16:18 KL <NA> <NA> DI
12 2021-09-10 Cape Fear 16:34 KL <NA> <NA> DI
13 2021-09-10 Cape Fear 18:09 KL <NA> <NA> DI
14 2021-09-10 Cape Fear 18:20 KL <NA> <NA> DI
15 2021-09-10 Cape Fear 09:30 LOA <NA> <NA> WH
16 2021-09-10 Cape Fear 09:48 SR <NA> <NA> JR
17 2021-09-10 Cape Fear 11:03 SR <NA> <NA> JR
18 2021-09-10 Cape Fear 12:10 SR <NA> <NA> JR
19 2021-09-10 Cape Fear 13:10 SR <NA> <NA> JR
20 2021-09-10 Cape Fear 13:34 SR <NA> <NA> JR
21 2021-09-10 Cape Fear 13:55 SR <NA> <NA> JR
22 2021-09-10 Cape Fear 16:19 SR <NA> <NA> JR
23 2021-09-10 Cape Fear 16:19 SR <NA> <NA> JR
24 2021-09-10 Cape Fear 16:43 SR <NA> <NA> JR
25 2021-09-10 Cape Fear 16:43 SR <NA> <NA> JR
26 2021-09-10 Cape Fear 17:09 SR <NA> <NA> JR
27 2021-09-10 Cape Fear 11:52 WH <NA> <NA> <NA>
Output:
Begin Time Preceptor Preceptee Circulator
Date Facility Entries By Name
2021-09-10 Cape Fear 5 DI 10:01 <NA> <NA> KL
KL 09:50 <NA> <NA> DI
LOA 09:30 <NA> <NA> WH
SR 09:48 <NA> <NA> JR
WH 11:52 <NA> <NA> <NA>
Example 2 (adds a second Facility
):
(see UPDATED ANSWER above for sample input.)
Output:
Begin Time Preceptor Preceptee Circulator
Date Facility Entries By Name
2021-09-10 Cape Fear 5 DI 10:01 <NA> <NA> KL
KL 09:50 <NA> <NA> DI
LOA 09:30 <NA> <NA> WH
SR 09:48 <NA> <NA> JR
WH 11:52 <NA> <NA> <NA>
2021-09-11 Facility Two 2 AA 10:01 <NA> <NA> ZZ
DI 11:58 <NA> <NA> KL