The current dataframe is as follows:
import pandas as pd

df = pd.read_csv('filename.csv', delimiter=',')
print(df)
idx  uniqueID  String
0    1         'hello'
1    1         'goodbye'
2    1         'happy'
3    2         'hello'
4    2         'happy'
5    3         'goodbye'
6    3         'hello'
7    3         'hello'
8    4         'goodbye'
9    5         'goodbye'
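For reproducibility, the sample above can also be constructed directly (values copied from the printout):

import pandas as pd

df = pd.DataFrame({
    'idx': range(10),
    'uniqueID': [1, 1, 1, 2, 2, 3, 3, 3, 4, 5],
    'String': ['hello', 'goodbye', 'happy', 'hello', 'happy',
               'goodbye', 'hello', 'hello', 'goodbye', 'goodbye'],
})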
Expected Output:
{ 'hello': 2, 'goodbye' : 3}
Here 'hello' is counted from idx 0 and 3, and 'goodbye' from idx 5, 8, and 9 (the rows where each uniqueID appears for the first time).
In the actual dataset there are more than two starting strings.
I'm thinking of potentially using pandas .groupby() and .where() to filter for the first row where each uniqueID occurs, then grouping by the string? Not entirely sure.
Question: How do I get the counts of the 'starting string' (the String on the row where each uniqueID first appears), with better performance than my current approach?
Thus far, I'm doing this with a simple for loop over the data, checking with if/else statements, but this is incredibly slow on a large dataframe.
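For reference, here is roughly what that loop looks like (a minimal sketch; variable names are illustrative, not my exact code):

seen = set()
counts = {}
for _, row in df.iterrows():
    uid = row['uniqueID']
    if uid not in seen:  # first time this uniqueID appears
        seen.add(uid)
        s = row['String']
        counts[s] = counts.get(s, 0) + 1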
I'm curious whether there are any functions built into pandas, or another library, that would reduce the overall runtime.
CodePudding user response:
Achieving better than O(N) is not possible, since every row has to be examined at least once.
You can drop_duplicates (which keeps the first row per uniqueID by default), then value_counts:
out = df.drop_duplicates('uniqueID')['String'].value_counts()
output:
goodbye    3
hello      2
Name: String, dtype: int64
As a dictionary:
df.drop_duplicates('uniqueID')['String'].value_counts().to_dict()
output: {'goodbye': 3, 'hello': 2}
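If you want to follow your groupby idea instead, an equivalent one-liner is below (a sketch relying on the fact that groupby preserves the original row order within each group, so .first() picks each uniqueID's starting string):

out = df.groupby('uniqueID')['String'].first().value_counts().to_dict()
# {'goodbye': 3, 'hello': 2}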