I have a dataframe of visits that includes a person ID column, and a given person may have more than one visit. I want to number the visits for a given person.
I can sort the dataframe by visit date, then iterate and count the number of times a person ID has occurred, or its occurrences between the first index and current index. Is there a more efficient (ideally vectorized) way? (Note that I don't want the total counts as from value_counts()
, I want a new series that is a running count of visits per person.)
Here is a dummy dataframe where person 'a' has two visits:
In[1]: df = pd.DataFrame({'Visit': [0, 1, 2, 3, 4, 5], 'Person_ID': ['a', 'b', 'c', 'a', 'd', 'e']})
In[2]: df['Person_ID'].value_counts()
Out[2]:
a 2
b 1
c 1
d 1
e 1
Name: Person_ID, dtype: int64
Example 1: incrementing dictionary counter
In[3]: person_visit_count = {}
for index, row in df.iterrows():
if row['Person_ID'] not in person_visit_count:
person_visit_count[row['Person_ID']] = 1
else:
person_visit_count[row['Person_ID']] = 1
df.loc[index, 'Person_Visit'] = person_visit_count[row['Person_ID']]
In[4]: df
Out[4]:
Visit Person_ID Person_Visit
0 0 a 1.0
1 1 b 1.0
2 2 c 1.0
3 3 a 2.0
4 4 d 1.0
5 5 e 1.0
Example 2: index-based counter
In[5]: for index, row in df.iterrows():
df.loc[index, 'Person_Visit2'] = (df.loc[0:index, 'Person_ID'] == row['Person_ID']).sum()
In[6]: df
Out[6]:
Visit Person_ID Person_Visit Person_Visit2
0 0 a 1.0 1.0
1 1 b 1.0 1.0
2 2 c 1.0 1.0
3 3 a 2.0 2.0
4 4 d 1.0 1.0
5 5 e 1.0 1.0
CodePudding user response:
Iterating is rarely the best option, you can group and do a cumulative count of the rows
df['Person_Visit']=df.groupby('Person_ID').transform('cumcount') 1
Visit Person_ID Person_Visit
0 0 a 1
1 1 b 1
2 2 c 1
3 3 a 2
4 4 d 1
5 5 e 1
Just for comparison, here are the timings for the solutions you gave and this one (given by the timeit module)
groupby with cumcount
1.37 ms ± 177 µs per loop
incrementing dictionary counter
2.02 ms ± 321 µs per loop
index-based counter
3.61 ms ± 389 µs per loop
CodePudding user response:
Here's one idea. I thought of it looking at value_counts
output you created
d = df['Person_ID'].value_counts().apply(lambda n: iter(range(1, n 1)))
df['PersonVisit'] = df['Person_ID'].apply(lambda id: next(d[id]))
In this solution, d
acts as a key-value data structure where key is the person id and the value is an iterable, where each call to next() increments the value by 1.