df.value_counts() doesn't show number of occurrences in dataset

Time:10-31

Here is a small sample of the data I'm working on.

[image: sample of the data]

I'm trying to calculate how many times the same ID appears in the data using

df['Total Occurences'] = df['ID'].value_counts()

but nothing appears in the new column.

Thanks in advance :)

CodePudding user response:

Using groupby + transform or value_counts + map should be the preferred way of doing this.

df['Total Occurences'] = df.groupby('ID')['ID'].transform('count')

or

df['Total Occurences'] = df['ID'].map(df.value_counts('ID'))

Both ways are much faster than the other answer for large DataFrames.
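For context, here is a minimal sketch of why the original assignment likely leaves the column empty. The sample IDs below are made up for illustration; the real data is not shown. value_counts() returns a Series indexed by the ID values, while column assignment aligns on the DataFrame's row index, so when the two indexes share no labels every row ends up as NaN. map looks each ID up in the counts instead, and groupby + transform keeps the original row index.

```python
import pandas as pd

# Hypothetical sample data (the IDs in the question are not shown)
df = pd.DataFrame({"ID": ["a", "b", "a", "c", "a", "b"]})

# value_counts() is indexed by the ID values ('a', 'b', 'c'), not by row number
counts = df["ID"].value_counts()

# Direct assignment aligns counts on the row index 0..5; no labels match, so all NaN
misaligned = df.assign(total=counts)

# map looks up each row's ID in counts, so every row gets its own count
df["Total Occurences"] = df["ID"].map(counts)

# groupby + transform returns a result aligned with the original rows
via_transform = df.groupby("ID")["ID"].transform("count")

print(misaligned)
print(df)
```

If the IDs happened to be small integers that overlap with the row numbers, the direct assignment would fill some rows with counts that belong to other IDs, which is arguably worse than NaN.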

Tests

import numpy as np
import pandas as pd

n = 10_000
# DataFrame with 'n' random IDs (50 possible values)
df = pd.DataFrame({'ID': np.random.randint(50, size=n)})
# using groupby + transform
>>> %timeit df.groupby('ID')['ID'].transform('count')
1.03 ms ± 43.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

# using map + value_counts
>>> %timeit df['ID'].map(df['ID'].value_counts())
1.49 ms ± 286 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# using apply (Pedro's solution)
>>> %timeit df['ID'].apply(lambda x: df['ID'].value_counts()[x])
8.96 s ± 742 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# computing value_counts only once outside apply
>>> %%timeit 
... counts = df['ID'].value_counts()
... df['ID'].apply(lambda x: counts[x])

57.6 ms ± 246 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
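Before comparing timings, it can help to confirm that the approaches return the same counts. This is a rough sketch that runs outside IPython; the seed and the smaller n are arbitrary choices made here for a quick check, not part of the tests above.

```python
import numpy as np
import pandas as pd

n = 1_000  # smaller than the benchmark size, only to keep the check fast
rng = np.random.default_rng(0)  # fixed seed so the check is repeatable
df = pd.DataFrame({"ID": rng.integers(50, size=n)})

# groupby + transform
via_transform = df.groupby("ID")["ID"].transform("count")

# map + value_counts
counts = df["ID"].value_counts()
via_map = df["ID"].map(counts)

# apply with value_counts computed once, outside the lambda
via_apply = df["ID"].apply(lambda x: counts[x])

# All three should give the same per-row counts
same = via_transform.tolist() == via_map.tolist() == via_apply.tolist()
print("all approaches agree:", same)
```

Once the results agree, any difference between the methods is purely about speed.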

CodePudding user response:

Try this:

df['Total Occurences'] = df['ID'].apply(lambda x: df['ID'].value_counts()[x])

For better performance, compute df['ID'].value_counts() once and store it in a variable:

count = df['ID'].value_counts()
df['Total Occurences'] = df['ID'].apply(lambda x: count[x])

Real test:

# using groupby + transform
>>> %timeit df.groupby('ID')['ID'].transform('count')
1.03 ms ± 43.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

# using map + value_counts
>>> %timeit df['ID'].map(df['ID'].value_counts())
1.49 ms ± 286 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# using apply (Pedro's solution)
>>> %timeit df['ID'].apply(lambda x: df['ID'].value_counts()[x])
5.11 ms ± 62.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# computing value_counts only once outside apply
>>> counts = df['ID'].value_counts()
>>> %timeit df['ID'].apply(lambda x: counts[x])

522 µs ± 6.97 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

The other answerer either doesn't know how to run tests or is just not being honest: counts = df['ID'].value_counts() should be computed outside the loop. I also rechecked my first answer. His test reports 8.96 s ± 742 ms per loop for it, but when I reran the test I got 5.11 ms ± 62.4 µs per loop.
