Home > database >  What is the most efficient way to number the instance of a value in a Series
What is the most efficient way to number the instance of a value in a Series

Time:08-13

I have a dataframe of visits that includes a person ID column, and a given person may have more than one visit. I want to number the visits for a given person.

I can sort the dataframe by visit date, then iterate and count the number of times a person ID has occurred, or its occurrences between the first index and current index. Is there a more efficient (ideally vectorized) way? (Note that I don't want the total counts as from value_counts(), I want a new series that is a running count of visits per person.)

Here is a dummy dataframe where person 'a' has two visits:

In[1]: df = pd.DataFrame({'Visit': [0, 1, 2, 3, 4, 5], 'Person_ID': ['a', 'b', 'c', 'a', 'd', 'e']})
In[2]: df['Person_ID'].value_counts()
Out[2]:
a    2
b    1
c    1
d    1
e    1
Name: Person_ID, dtype: int64

Example 1: incrementing dictionary counter

In[3]: person_visit_count = {}
       for index, row in df.iterrows():
           if row['Person_ID'] not in person_visit_count:
               person_visit_count[row['Person_ID']] = 1
           else:
               person_visit_count[row['Person_ID']]  = 1
           df.loc[index, 'Person_Visit'] = person_visit_count[row['Person_ID']]
In[4]: df
Out[4]:
   Visit Person_ID  Person_Visit
0      0         a           1.0
1      1         b           1.0
2      2         c           1.0
3      3         a           2.0
4      4         d           1.0
5      5         e           1.0

Example 2: index-based counter

In[5]: for index, row in df.iterrows():
           df.loc[index, 'Person_Visit2'] = (df.loc[0:index, 'Person_ID'] == row['Person_ID']).sum()
In[6]: df
Out[6]: 
   Visit Person_ID  Person_Visit  Person_Visit2
0      0         a           1.0            1.0
1      1         b           1.0            1.0
2      2         c           1.0            1.0
3      3         a           2.0            2.0
4      4         d           1.0            1.0
5      5         e           1.0            1.0

CodePudding user response:

Iterating is rarely the best option, you can group and do a cumulative count of the rows

df['Person_Visit']=df.groupby('Person_ID').transform('cumcount') 1

    Visit   Person_ID   Person_Visit
0   0       a           1
1   1       b           1
2   2       c           1
3   3       a           2
4   4       d           1
5   5       e           1

Just for comparison, here are the timings for the solutions you gave and this one (given by the timeit module)

groupby with cumcount

1.37 ms ± 177 µs per loop

incrementing dictionary counter

2.02 ms ± 321 µs per loop

index-based counter

3.61 ms ± 389 µs per loop

CodePudding user response:

Here's one idea. I thought of it looking at value_counts output you created

d = df['Person_ID'].value_counts().apply(lambda n: iter(range(1, n 1)))
df['PersonVisit'] = df['Person_ID'].apply(lambda id: next(d[id]))

In this solution, d acts as a key-value data structure where key is the person id and the value is an iterable, where each call to next() increments the value by 1.

  • Related