Increasing performance for search pandas df, Count occurrences of starting string grouped by unique


Current dataframe is as follows:

import pandas as pd

df = pd.read_csv('filename.csv', delimiter=',')

print(df)
idx   uniqueID   String
0        1       'hello'
1        1       'goodbye'
2        1       'happy'
3        2       'hello'
4        2       'happy'
5        3       'goodbye'
6        3       'hello'
7        3       'hello'
8        4       'goodbye'
9        5       'goodbye'
Expected Output:

{'hello': 2, 'goodbye': 3}

Here 'hello' is counted at idx 0 and 3, and 'goodbye' at idx 5, 8, and 9.
In the actual dataset there are more than two starting strings.



I'm thinking of potentially using pandas .groupby() and .where() to filter
for the first time a uniqueID occurs, and then grouping by the string?
Not entirely sure.
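
Something along these lines is what I have in mind (an untested sketch, using the df loaded above): keep only the first row per uniqueID, then count the strings.

# hypothetical sketch of the groupby idea, not tested on the real data:
# .first() keeps the first row of each uniqueID group
first_rows = df.groupby('uniqueID', sort=False).first()
print(first_rows['String'].value_counts().to_dict())
# expected on the sample above: {'goodbye': 3, 'hello': 2}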

Question: How do I get the counts of the 'starting string', counting each uniqueID only at its first occurrence, with better performance?

Thus far, I'm doing this with a simple for loop over the data and if/else checks. But this is incredibly slow with a large dataframe.
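
For reference, the loop is roughly of this shape (a simplified sketch, not my exact code): track which uniqueIDs have already been seen and only count the String of each uniqueID's first row.

# simplified sketch of the current slow approach (not the exact code):
# pure-Python iteration over every row, counting only first occurrences
seen = set()
counts = {}
for row in df.itertuples(index=False):
    if row.uniqueID not in seen:
        seen.add(row.uniqueID)
        counts[row.String] = counts.get(row.String, 0) + 1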

I'm curious if there are any functions built into pandas, or another library out there, that would reduce the overall time it takes.

CodePudding user response:

Achieving better than O(N) is not possible, since every row has to be inspected at least once; what you can do is cut the constant factor by staying in vectorized pandas operations instead of a Python loop.

You can use drop_duplicates (which keeps the first row per uniqueID by default), then value_counts:

out = df.drop_duplicates('uniqueID')['String'].value_counts()

output:

goodbye    3
hello      2
Name: String, dtype: int64

As a dictionary:

df.drop_duplicates('uniqueID')['String'].value_counts().to_dict()

output: {'goodbye': 3, 'hello': 2}
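
For a quick self-contained check, here is the same one-liner run against the sample data from the question (built inline here instead of being read from filename.csv):

import pandas as pd

# sample data from the question
df = pd.DataFrame({
    'uniqueID': [1, 1, 1, 2, 2, 3, 3, 3, 4, 5],
    'String': ['hello', 'goodbye', 'happy', 'hello', 'happy',
               'goodbye', 'hello', 'hello', 'goodbye', 'goodbye'],
})

# keep the first row per uniqueID, then count the strings
print(df.drop_duplicates('uniqueID')['String'].value_counts().to_dict())
# {'goodbye': 3, 'hello': 2}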
