I have an array with around 160k entries which I get from a CSV-file and it looks like this:
data_arr = np.array(['ID0524', 1.0]
['ID0965', 2.5]
.
.
['ID0524', 6.7]
['ID0324', 3.0])
I now get around 3k unique ID's from some database and what I have to do is look up each of these IDs in the array and sum the corresponding numbers. So if I would need to look up "ID0524", the sum would be 7.7.
My current working code looks something like this (I'm sorry that it's pretty ugly, I'm very new to numpy):
def sumValues(self, id)
sub_arr = data_arr[data_arr[0:data_arr.size, 0] == id]
sum_arr = sub_arr[0:sub_arr.size, 1]
return sum_arr.sum()
And it takes around ~18s to do this for all 3k IDs.
I wondered if there is probably any faster way to this as the current runtime seems a bit too long for me. I would appreciate any guidance and hints on this. Thank you!
CodePudding user response:
You could try the using builtin numpy methods.
numpy.intersect1d to find the unique IDs
numpy.sum to sum them up
CodePudding user response:
A convenient tool to do your task is Pandas, with its grouping mechanism.
Start from the necessary import:
import pandas as pd
Then convert data_arr to a pandasonic DataFrame:
df = pd.DataFrame({'Id': data_arr[:, 0], 'Amount': data_arr[:, 1].astype(float)})
The reason for some complication in the above code is that:
- elements of your input array are of a single type (in this case object),
- so there is necessary to convert the second column to float.
Then you can get the expected result in a single instruction:
result = df.groupby('Id').sum()
The result, for your data sample, is:
Amount
Id
ID0324 3.0
ID0524 7.7
ID0965 2.5
Another approach is that you could read your CSV file directly into a DataFrame (see read_csv method), so there is no need to use any Numpy array.
The advantage is that read_csv is clever enough to recognize the data type of each column separately, at least it is able to tell apart numbers from strings.