Quickly search through a numpy array and sum the corresponding values

I have an array with around 160k entries which I get from a CSV-file and it looks like this:

data_arr = np.array([['ID0524', 1.0],
                     ['ID0965', 2.5],
                     # ...
                     # ...
                     ['ID0524', 6.7],
                     ['ID0324', 3.0]])

I now get around 3k unique IDs from a database, and I have to look up each of these IDs in the array and sum the corresponding numbers. So if I needed to look up "ID0524", the sum would be 7.7.

My current working code looks something like this (I'm sorry that it's pretty ugly, I'm very new to numpy):

def sumValues(self, id):
    # Boolean mask keeps the rows whose first column equals the requested ID
    sub_arr = data_arr[data_arr[:, 0] == id]

    # The second column holds the values to add up
    sum_arr = sub_arr[:, 1]
    return sum_arr.sum()

It takes around 18 s to do this for all 3k IDs.

I wonder whether there is a faster way to do this, as the current runtime seems too long to me. I would appreciate any guidance and hints. Thank you!

CodePudding user response：

You could try using built-in numpy methods:

numpy.intersect1d to find the IDs common to the array and your lookup list
numpy.sum to sum them up
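
One way this could look is sketched below. It is not necessarily what the answer had in mind: besides numpy.intersect1d it also uses numpy.unique and numpy.bincount to compute all group sums in a single pass instead of scanning the array once per ID. The sample rows and the wanted_ids list are hypothetical stand-ins for the real CSV data and the IDs from the database.

```python
import numpy as np

# Hypothetical sample standing in for the ~160k-row array from the question.
data_arr = np.array([['ID0524', 1.0],
                     ['ID0965', 2.5],
                     ['ID0524', 6.7],
                     ['ID0324', 3.0]])
# Hypothetical lookup list (the question has about 3k of these).
wanted_ids = np.array(['ID0524', 'ID0324', 'ID9999'])

ids = data_arr[:, 0]
values = data_arr[:, 1].astype(float)  # the mixed array stores numbers as text

# Group once: unique_ids is sorted, inverse maps each row to its group index.
unique_ids, inverse = np.unique(ids, return_inverse=True)
group_sums = np.bincount(inverse, weights=values)

# Keep only the IDs that occur both in the data and in the lookup list.
common, idx_unique, _ = np.intersect1d(unique_ids, wanted_ids,
                                       return_indices=True)
sums_by_id = dict(zip(common.tolist(), group_sums[idx_unique].tolist()))
print(sums_by_id)
```

IDs from the lookup list that never occur in the data (here 'ID9999') are simply absent from sums_by_id.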

CodePudding user response：

A convenient tool for your task is pandas, with its grouping mechanism.

Start from the necessary import:

import pandas as pd

Then convert data_arr to a pandas DataFrame:

df = pd.DataFrame({'Id': data_arr[:, 0], 'Amount': data_arr[:, 1].astype(float)})

The conversion in the above code is needed because:

all elements of your input array share a single dtype (a string or object dtype, because the IDs are text),
so the second column has to be converted to float.

Then you can get the expected result in a single instruction:

result = df.groupby('Id').sum()

The result, for your data sample, is:

        Amount
Id            
ID0324     3.0
ID0524     7.7
ID0965     2.5
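
If only the IDs from the database are of interest, the grouped result can be narrowed down afterwards. The following is a minimal sketch under assumptions: the sample rows and the wanted_ids list are hypothetical, and filling missing IDs with 0.0 is just one possible choice.

```python
import numpy as np
import pandas as pd

# Hypothetical sample standing in for the array from the question.
data_arr = np.array([['ID0524', 1.0],
                     ['ID0965', 2.5],
                     ['ID0524', 6.7],
                     ['ID0324', 3.0]])
df = pd.DataFrame({'Id': data_arr[:, 0],
                   'Amount': data_arr[:, 1].astype(float)})

# One pass over the data: a Series of sums indexed by Id.
totals = df.groupby('Id')['Amount'].sum()

# Hypothetical lookup list; IDs missing from the data get 0.0 here.
wanted_ids = ['ID0524', 'ID0324', 'ID9999']
selected = totals.reindex(wanted_ids, fill_value=0.0)
print(selected)
```

Because the grouping runs once over all rows, each of the 3k lookups becomes an index lookup rather than a full scan of the array.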

Alternatively, you could read your CSV file directly into a DataFrame (see the read_csv method), so there is no need for a NumPy array at all.

The advantage is that read_csv infers the data type of each column separately; at the very least, it can tell numbers apart from strings.
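
A rough sketch of that route is shown below. The layout of the real CSV file is not known, so this assumes two comma-separated columns without a header row; the column names 'Id' and 'Amount' and the in-memory text used in place of a file path are illustrative only.

```python
import io
import pandas as pd

# Hypothetical CSV content; with a real file, pass its path to read_csv instead.
csv_text = "ID0524,1.0\nID0965,2.5\nID0524,6.7\nID0324,3.0\n"

# header=None and names=... are assumptions about the file layout.
df = pd.read_csv(io.StringIO(csv_text), header=None, names=['Id', 'Amount'])

print(df.dtypes)  # 'Amount' is parsed as a float column, no astype needed
totals = df.groupby('Id')['Amount'].sum()
print(totals)
```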