Quickly search through a numpy array and sum the corresponding values

Time:10-28

I have an array with around 160k entries which I get from a CSV-file and it looks like this:

data_arr = np.array([['ID0524', 1.0],
                     ['ID0965', 2.5],
                     .
                     .
                     ['ID0524', 6.7],
                     ['ID0324', 3.0]])

I now get around 3k unique IDs from a database, and I have to look up each of these IDs in the array and sum the corresponding numbers. So if I needed to look up "ID0524", the sum would be 7.7.

My current working code looks something like this (I'm sorry that it's pretty ugly, I'm very new to numpy):

def sumValues(self, id):
    sub_arr = data_arr[data_arr[0:data_arr.size, 0] == id]

    sum_arr = sub_arr[0:sub_arr.size, 1]
    return sum_arr.sum()

It takes around 18 s to do this for all 3k IDs.

Is there a faster way to do this? The current runtime seems too long to me. I would appreciate any guidance and hints on this. Thank you!

CodePudding user response:

You could try using built-in NumPy methods:

numpy.intersect1d to find the IDs that appear both in your array and in your list of unique IDs
numpy.sum to sum them up
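
As a rough sketch of a NumPy-only approach, the per-ID sums can be computed in a single pass instead of scanning the array once per ID. Note that this swaps in numpy.unique and numpy.bincount rather than the two functions named above; the sample data and the wanted_ids list are assumptions standing in for the CSV contents and the database IDs.

```python
import numpy as np

# Small stand-in for the 160k-row array from the question (assumed sample data).
data_arr = np.array([['ID0524', 1.0],
                     ['ID0965', 2.5],
                     ['ID0524', 6.7],
                     ['ID0324', 3.0]])

ids = data_arr[:, 0]
values = data_arr[:, 1].astype(float)  # numbers in the mixed array need converting

# unique_ids is sorted; inverse maps every row to its position in unique_ids.
unique_ids, inverse = np.unique(ids, return_inverse=True)

# Sum the values of every group in one vectorized pass.
sums = np.bincount(inverse, weights=values)

# Plain dict for constant-time lookups of individual IDs.
totals = dict(zip(unique_ids.tolist(), sums.tolist()))

# Hypothetical IDs coming from the database; unknown IDs fall back to 0.0.
wanted_ids = ['ID0524', 'ID0324', 'ID9999']
wanted_sums = [totals.get(i, 0.0) for i in wanted_ids]
print(wanted_sums)
```

The grouping work is done once for the whole array, so each of the 3k lookups afterwards is just a dictionary access.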

CodePudding user response:

A convenient tool for this task is pandas, with its grouping mechanism.

Start from the necessary import:

import pandas as pd

Then convert data_arr to a pandas DataFrame:

df = pd.DataFrame({'Id': data_arr[:, 0], 'Amount': data_arr[:, 1].astype(float)})

The reason for the extra conversion in the above code is that:

  • all elements of a NumPy array share a single type (here a string or object type, because the columns are mixed),
  • so it is necessary to convert the second column to float.

Then you can get the expected result in a single instruction:

result = df.groupby('Id').sum()

The result, for your data sample, is:

        Amount
Id            
ID0324     3.0
ID0524     7.7
ID0965     2.5
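
To pick out only the roughly 3k IDs from the database, one option is to reindex the grouped sums. The following is a minimal sketch, assuming the IDs arrive as a plain list; wanted_ids is a hypothetical name, and the sample rows mirror the data above.

```python
import pandas as pd

# Sample rows matching the data shown in the question.
df = pd.DataFrame({'Id': ['ID0524', 'ID0965', 'ID0524', 'ID0324'],
                   'Amount': [1.0, 2.5, 6.7, 3.0]})

# One sum per ID, as a Series indexed by Id.
sums = df.groupby('Id')['Amount'].sum()

# Hypothetical list of IDs from the database; 'ID9999' is not in the data.
wanted_ids = ['ID0524', 'ID0324', 'ID9999']

# reindex keeps the requested order and fills missing IDs with 0.0.
selected = sums.reindex(wanted_ids, fill_value=0.0)
print(selected)
```

Whether a missing ID should count as 0.0 or be reported as an error depends on your data, so treat fill_value here as an assumption.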

Alternatively, you could read your CSV file directly into a DataFrame (see the read_csv method), so there is no need for a NumPy array at all.

The advantage is that read_csv infers the data type of each column separately, so it can at least tell numbers apart from strings.
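
A short sketch of that route follows. The layout of the file is an assumption (two comma-separated columns, no header row), as are the column names; an in-memory string stands in for the real file so the example runs on its own.

```python
import io
import pandas as pd

# Stand-in for the CSV file; the two-column, header-less layout is assumed.
csv_text = "ID0524,1.0\nID0965,2.5\nID0524,6.7\nID0324,3.0\n"

# In real use, pass the path of the CSV file instead of the StringIO object.
df = pd.read_csv(io.StringIO(csv_text), header=None, names=['Id', 'Amount'])

# Amount is parsed as float64 automatically, so no astype(float) is needed.
print(df.dtypes)

result = df.groupby('Id')['Amount'].sum()
print(result)
```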
