Home > Blockchain >  Pandas: get first occurrence and ignore if in any other column on same day
Pandas: get first occurrence and ignore if in any other column on same day

Time:05-18

Newbee here.

Given the below files, I am trying to count how many times a distinct value occurs. The data is for multiple facilities for each day so stuck how to get the correct totals.

Tried using nunique in combination with groupby but not able to get the logic.

df1 = df.groupby(['Date', 'Facility'], as_index=False)['By_Name'].nunique()

Sample Input (Sample.xlsx) of a given day [actual data is for every day of the month]:

Date Facility Begin Time By Name Preceptor Preceptee Circulator
2021-09-10 Cape Fear 10:01 DI KL
2021-09-10 Cape Fear 10:31 DI KL
2021-09-10 Cape Fear 10:36 DI KL
2021-09-10 Cape Fear 11:58 DI KL
2021-09-10 Cape Fear 12:11 DI KL
2021-09-10 Cape Fear 12:56 DI KL
2021-09-10 Cape Fear 13:35 DI KL
2021-09-10 Cape Fear 17:30 DI KL
2021-09-10 Cape Fear 09:50 KL DI
2021-09-10 Cape Fear 10:47 KL DI
2021-09-10 Cape Fear 11:14 KL DI
2021-09-10 Cape Fear 16:18 KL DI
2021-09-10 Cape Fear 16:34 KL DI
2021-09-10 Cape Fear 18:09 KL DI
2021-09-10 Cape Fear 18:20 KL DI
2021-09-10 Cape Fear 09:30 LOA WH
2021-09-10 Cape Fear 09:48 SR JR
2021-09-10 Cape Fear 11:03 SR JR
2021-09-10 Cape Fear 12:10 SR JR
2021-09-10 Cape Fear 13:10 SR JR
2021-09-10 Cape Fear 13:34 SR JR
2021-09-10 Cape Fear 13:55 SR JR
2021-09-10 Cape Fear 16:19 SR JR
2021-09-10 Cape Fear 16:19 SR JR
2021-09-10 Cape Fear 16:43 SR JR
2021-09-10 Cape Fear 16:43 SR JR
2021-09-10 Cape Fear 17:09 SR JR
2021-09-10 Cape Fear 11:52 WH

Desired Output (By Facility)

Date Facility Entries Begin Time By Name Preceptor Preceptee Circulator
2021-09-10 Cape Fear 6 09:30 LOA WH
2021-09-10 Cape Fear 09:48 SR JR
2021-09-10 Cape Fear 09:50 KL DI
2021-09-10 Cape Fear 10:01 DI KL

Any and all help will be appreciated. TIA

CodePudding user response:

IIUC, given a dataframe df similar to the one OP shared in the question. For the sake of simplicity I didn't consider the columns Preceptor and Preceptee as they are empty.

import pandas as pd

df = pd.DataFrame( {'Date': [ "2021-09-10", "2021-09-10", "2021-09-10", "2021-09-10", "2021-09-10", "2021-09-10", "2021-09-10", "2021-09-10", "2021-09-10", "2021-09-10", "2021-09-10", "2021-09-10", "2021-09-10", "2021-09-10"],
'Facility': [ "Cape Fear", "Cape Fear", "Cape Fear", "Cape Fear", "Cape Fear", "Cape Fear", "Cape Fear", "Cape Fear", "Cape Fear", "Cape Fear", "Cape Fear", "Cape Fear", "Cape Fear", "Cape Fear" ],
'Begin Time': [ "10:01", "10:31", "10:36", "11:58", "12:11", "12:56", "13:35", "17:30", "09:50", "10:47", "11:14", "16:18", "16:34", "18:09" ],
'By Name': [ "DI", "DI", "DI", "DI", "DI", "DI", "DI", "DI", "KL", "KL", "KL", "KL", "KL", "KL" ],
'Circulator': [ "KL", "KL", "KL", "KL", "KL", "KL", "KL", "KL", "DI", "DI", "DI", "DI", "DI", "DI"] })

Now, and to answer your question (as the current data types on the df columns wont be a problem), let's create a new column called Entries, that counts the number of times a By Name occurs in each Facility for a given Date

df['Entries'] = df.groupby(['Date', 'Facility', 'By Name'])['By Name'].transform('count')

However, as OP wants to hide the rows where the string in By Name appears more than once

df = df.drop_duplicates(subset=['Date', 'Facility', 'By Name'], keep='first')

[Out]:
         Date   Facility Begin Time By Name Circulator  Entries
0  2021-09-10  Cape Fear      10:01      DI         KL        8
8  2021-09-10  Cape Fear      09:50      KL         DI        6

Note that with the previous line the index goes from 0 to 8. If, instead of that, one wants the index to start in 0 and then go to 1, one needs to pass .reset_index() as follows

df = df.drop_duplicates(subset=['Date', 'Facility', 'By Name'], keep='first').reset_index()

[Out]:
   index        Date   Facility Begin Time By Name Circulator  Entries
0      0  2021-09-10  Cape Fear      10:01      DI         KL        8
1      8  2021-09-10  Cape Fear      09:50      KL         DI        6

CodePudding user response:

UPDATED ANSWER:

Based on clarifications by OP in comments, here is a new strategy to get what is required:

df = df.set_index(['Date', 'Facility', 'Begin Time']).stack()
df.index=df.index.droplevel(3)
df = df.to_frame().rename(columns={0:'Name'}).reset_index()
df2 = df.groupby(['Date', 'Facility'])['Name'].nunique().to_frame().rename(columns={'Name': 'Entries'})
indexCols = ['Date', 'Facility', 'Entries']
df = df.join(df2, on=['Date', 'Facility']).reindex(columns=indexCols   list(set(df.columns) - set(indexCols)))
df = df.assign(TEMP_COL=0).groupby(indexCols   ['Name']).nth(0).drop(columns='TEMP_COL')

Explanation:

  • Using Date, Facility, Begin Time as the index, call stack() to rearrange the four columns containing names into a single column with the old column name indicating "Kind of Name" (By Name, Preceptor, Preceptee or Circulator) added as level of the row multiindex.
  • Drop the "Kind of Name" level (level 3) from the row multiindex.
  • Rename column 0 to be Name, call reset_index() to move Date, Facility, Begin Time out of the row multiindex and back into column labels.
  • Use groupby() and nunique() to calculate Entries: the number of unique names (across all four original columns) associated with each Date, Facility.
  • Use join() to add an Entries column to the original dataframe.
  • Use groupby() on Date, Facility, Name, then use nth(0) to get the first row of each group (including its Begin Time), also using a dummy column TEMP_COL to make sure we don't run out of columns when performing the groupby().

Example 1 (using original input from the question):

Output:

                                  Begin Time
Date       Facility  Entries Name
2021-09-10 Cape Fear 6       DI        10:01
                             JR        09:48
                             KL        10:01
                             LOA       09:30
                             SR        09:48
                             WH        09:30

Example 2 (adds a second Facility):

          Date      Facility Begin Time By Name  Preceptor  Preceptee Circulator
0   2021-09-11  Facility Two      10:01      AA       <NA>       <NA>         ZZ
1   2021-09-11  Facility Two      10:31      AA       <NA>       <NA>         ZZ
2   2021-09-11  Facility Two      10:36      AA       <NA>       <NA>         ZZ
3   2021-09-11  Facility Two      11:58      DI       <NA>       <NA>         KL
4   2021-09-11  Facility Two      12:11      DI       <NA>       <NA>         KL
5   2021-09-11  Facility Two      12:56      DI       <NA>       <NA>         KL
6   2021-09-11  Facility Two      13:35      DI       <NA>       <NA>         KL
7   2021-09-10     Cape Fear      10:01      DI       <NA>       <NA>         KL
8   2021-09-10     Cape Fear      10:31      DI       <NA>       <NA>         KL
9   2021-09-10     Cape Fear      10:36      DI       <NA>       <NA>         KL
10  2021-09-10     Cape Fear      11:58      DI       <NA>       <NA>         KL
11  2021-09-10     Cape Fear      12:11      DI       <NA>       <NA>         KL
12  2021-09-10     Cape Fear      12:56      DI       <NA>       <NA>         KL
13  2021-09-10     Cape Fear      13:35      DI       <NA>       <NA>         KL
14  2021-09-10     Cape Fear      17:30      DI       <NA>       <NA>         KL
15  2021-09-10     Cape Fear      09:50      KL       <NA>       <NA>         DI
16  2021-09-10     Cape Fear      10:47      KL       <NA>       <NA>         DI
17  2021-09-10     Cape Fear      11:14      KL       <NA>       <NA>         DI
18  2021-09-10     Cape Fear      16:18      KL       <NA>       <NA>         DI
19  2021-09-10     Cape Fear      16:34      KL       <NA>       <NA>         DI
20  2021-09-10     Cape Fear      18:09      KL       <NA>       <NA>         DI
21  2021-09-10     Cape Fear      18:20      KL       <NA>       <NA>         DI
22  2021-09-10     Cape Fear      09:30     LOA       <NA>       <NA>         WH
23  2021-09-10     Cape Fear      09:48      SR       <NA>       <NA>         JR
24  2021-09-10     Cape Fear      11:03      SR       <NA>       <NA>         JR
25  2021-09-10     Cape Fear      12:10      SR       <NA>       <NA>         JR
26  2021-09-10     Cape Fear      13:10      SR       <NA>       <NA>         JR
27  2021-09-10     Cape Fear      13:34      SR       <NA>       <NA>         JR
28  2021-09-10     Cape Fear      13:55      SR       <NA>       <NA>         JR
29  2021-09-10     Cape Fear      16:19      SR       <NA>       <NA>         JR
30  2021-09-10     Cape Fear      16:19      SR       <NA>       <NA>         JR
31  2021-09-10     Cape Fear      16:43      SR       <NA>       <NA>         JR
32  2021-09-10     Cape Fear      16:43      SR       <NA>       <NA>         JR
33  2021-09-10     Cape Fear      17:09      SR       <NA>       <NA>         JR
34  2021-09-10     Cape Fear      11:52      WH       <NA>       <NA>       <NA>

Output:

2021-09-10 Cape Fear    6       DI        10:01
                                JR        09:48
                                KL        10:01
                                LOA       09:30
                                SR        09:48
                                WH        09:30
2021-09-11 Facility Two 4       AA        10:01
                                DI        11:58
                                KL        11:58
                                ZZ        10:01

ORIGINAL ANSWER:

Here's a way to get something close to what your question shows as desired output:

df1 = df.groupby(['Date', 'Facility'], as_index=False)['By Name'].nunique().rename(columns={'By Name': 'Entries'}).set_index(['Date', 'Facility'])
df2 = df[['Date', 'Facility', 'By Name']].join(df1, on=['Date', 'Facility'])

df = df.assign(Entries=df2['Entries']).reindex(columns=['Date', 'Facility', 'Entries']   [col for col in df.columns if col not in ['Date', 'Facility']])
df = df.groupby(['Date', 'Facility', 'Entries', 'By Name']).nth(0)

Example 1 (uses the original input in the question):

          Date   Facility Begin Time By Name  Preceptor  Preceptee Circulator
0   2021-09-10  Cape Fear      10:01      DI       <NA>       <NA>         KL
1   2021-09-10  Cape Fear      10:31      DI       <NA>       <NA>         KL
2   2021-09-10  Cape Fear      10:36      DI       <NA>       <NA>         KL
3   2021-09-10  Cape Fear      11:58      DI       <NA>       <NA>         KL
4   2021-09-10  Cape Fear      12:11      DI       <NA>       <NA>         KL
5   2021-09-10  Cape Fear      12:56      DI       <NA>       <NA>         KL
6   2021-09-10  Cape Fear      13:35      DI       <NA>       <NA>         KL
7   2021-09-10  Cape Fear      17:30      DI       <NA>       <NA>         KL
8   2021-09-10  Cape Fear      09:50      KL       <NA>       <NA>         DI
9   2021-09-10  Cape Fear      10:47      KL       <NA>       <NA>         DI
10  2021-09-10  Cape Fear      11:14      KL       <NA>       <NA>         DI
11  2021-09-10  Cape Fear      16:18      KL       <NA>       <NA>         DI
12  2021-09-10  Cape Fear      16:34      KL       <NA>       <NA>         DI
13  2021-09-10  Cape Fear      18:09      KL       <NA>       <NA>         DI
14  2021-09-10  Cape Fear      18:20      KL       <NA>       <NA>         DI
15  2021-09-10  Cape Fear      09:30     LOA       <NA>       <NA>         WH
16  2021-09-10  Cape Fear      09:48      SR       <NA>       <NA>         JR
17  2021-09-10  Cape Fear      11:03      SR       <NA>       <NA>         JR
18  2021-09-10  Cape Fear      12:10      SR       <NA>       <NA>         JR
19  2021-09-10  Cape Fear      13:10      SR       <NA>       <NA>         JR
20  2021-09-10  Cape Fear      13:34      SR       <NA>       <NA>         JR
21  2021-09-10  Cape Fear      13:55      SR       <NA>       <NA>         JR
22  2021-09-10  Cape Fear      16:19      SR       <NA>       <NA>         JR
23  2021-09-10  Cape Fear      16:19      SR       <NA>       <NA>         JR
24  2021-09-10  Cape Fear      16:43      SR       <NA>       <NA>         JR
25  2021-09-10  Cape Fear      16:43      SR       <NA>       <NA>         JR
26  2021-09-10  Cape Fear      17:09      SR       <NA>       <NA>         JR
27  2021-09-10  Cape Fear      11:52      WH       <NA>       <NA>       <NA>

Output:

                                     Begin Time  Preceptor  Preceptee Circulator
Date       Facility  Entries By Name
2021-09-10 Cape Fear 5       DI           10:01       <NA>       <NA>         KL
                             KL           09:50       <NA>       <NA>         DI
                             LOA          09:30       <NA>       <NA>         WH
                             SR           09:48       <NA>       <NA>         JR
                             WH           11:52       <NA>       <NA>       <NA>

Example 2 (adds a second Facility): (see UPDATED ANSWER above for sample input.) Output:

                                        Begin Time  Preceptor  Preceptee Circulator
Date       Facility     Entries By Name
2021-09-10 Cape Fear    5       DI           10:01       <NA>       <NA>         KL
                                KL           09:50       <NA>       <NA>         DI
                                LOA          09:30       <NA>       <NA>         WH
                                SR           09:48       <NA>       <NA>         JR
                                WH           11:52       <NA>       <NA>       <NA>
2021-09-11 Facility Two 2       AA           10:01       <NA>       <NA>         ZZ
                                DI           11:58       <NA>       <NA>         KL
  • Related