Here is a small sample of the data I'm working on.
I'm trying to calculate how many times the same ID appears in the data using
df['Total Occurences'] = df['ID'].value_counts()
but nothing appears in the new column.
Thanks in advance :)
CodePudding user response:
Using groupby with transform, or value_counts with map, should be the preferred way of doing it.
df['Total Occurences'] = df.groupby('ID')['ID'].transform('count')
or
df['Total Occurences'] = df['ID'].map(df.value_counts('ID'))
Both ways are much faster than the other answer for large DataFrames.
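To see why the original assignment leaves the column empty, and how the two approaches above fix it, here is a minimal sketch. The sample IDs are made up for illustration, and the column names 'Direct', 'Via transform' and 'Via map' are hypothetical. value_counts() returns a Series indexed by the ID values, so assigning it directly aligns on the DataFrame's row labels rather than on the IDs, and rows with no matching label get NaN.

```python
import pandas as pd

# Hypothetical sample data; these IDs are made up for illustration.
df = pd.DataFrame({'ID': [101, 102, 101, 103, 101, 102]})

# value_counts() is indexed by the ID values (101, 102, 103).
# Direct assignment aligns on the row labels 0..5, which never match,
# so every row ends up as NaN.
df['Direct'] = df['ID'].value_counts()

# transform('count') returns one value per original row, so it aligns correctly.
df['Via transform'] = df.groupby('ID')['ID'].transform('count')

# map() looks each ID up in the counts Series, again giving one value per row.
df['Via map'] = df['ID'].map(df['ID'].value_counts())

print(df)
```

Both working columns should hold 3 for ID 101, 2 for ID 102 and 1 for ID 103, while 'Direct' stays empty.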
Tests
n = 10_000
# DataFrame with 'n' random IDs (50 possible values)
df = pd.DataFrame({'ID': np.random.randint(50, size=n)})
# using groupby transform
>>> %timeit df.groupby('ID')['ID'].transform('count')
1.03 ms ± 43.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# using map value_counts
>>> %timeit df['ID'].map(df['ID'].value_counts())
1.49 ms ± 286 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# using apply (Pedro's solution)
>>> %timeit df['ID'].apply(lambda x: df['ID'].value_counts()[x])
8.96 s ± 742 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# computing value_counts only once outside apply
>>> %%timeit
... counts = df['ID'].value_counts()
... df['ID'].apply(lambda x: counts[x])
57.6 ms ± 246 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
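The %timeit lines above only run inside IPython or Jupyter. As a rough sketch, a similar comparison can be run as a plain script with the standard timeit module. The function names, the fixed seed and the repeat count below are my own assumptions, not part of the original test, and the timings will differ from machine to machine.

```python
import timeit

import numpy as np
import pandas as pd

n = 10_000
# Fixed seed (an assumption, for repeatability): 'n' random IDs, 50 possible values.
rng = np.random.default_rng(0)
df = pd.DataFrame({'ID': rng.integers(50, size=n)})


def with_transform():
    return df.groupby('ID')['ID'].transform('count')


def with_map():
    return df['ID'].map(df['ID'].value_counts())


def with_apply_precomputed():
    # value_counts is computed once per call, not once per row.
    counts = df['ID'].value_counts()
    return df['ID'].apply(lambda x: counts[x])


repeats = 5  # hypothetical repeat count; raise it for more stable numbers
for fn in (with_transform, with_map, with_apply_precomputed):
    seconds = timeit.timeit(fn, number=repeats) / repeats
    print(f"{fn.__name__}: {seconds * 1e3:.2f} ms per call")
```

The apply variant that recomputes value_counts for every row is left out here because it takes several seconds per run at this size.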
CodePudding user response:
Try this:
df['Total Occurences'] = df['ID'].apply(lambda x: df['ID'].value_counts()[x])
For performance, store df['ID'].value_counts() in a variable first:
count = df['ID'].value_counts()
df['Total Occurences'] = df['ID'].apply(lambda x: count[x])
Real test:
# using groupby transform
>>> %timeit df.groupby('ID')['ID'].transform('count')
1.03 ms ± 43.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# using map value_counts
>>> %timeit df['ID'].map(df['ID'].value_counts())
1.49 ms ± 286 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# using apply (Pedro's solution)
>>> %timeit df['ID'].apply(lambda x: df['ID'].value_counts()[x])
5.11 ms ± 62.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# computing value_counts only once outside apply
>>> counts = df['ID'].value_counts()
>>> %timeit df['ID'].apply(lambda x: counts[x])
522 µs ± 6.97 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
The other answerer either doesn't know how to run tests or is simply not being honest: counts = df['ID'].value_counts() should be outside the loop. I also re-checked my first answer: his test says it took 8.96 s ± 742 ms per loop, but when I reran the test it gave 5.11 ms ± 62.4 µs per loop.