I've written a custom number parsing function. Basically, I want to convert app size information as it is given in the Google Play store (5.6M, 3M, 112K) to a standard float number.
To apply this function to a column of data in my data frame, I want to vectorize it using numpy.vectorize. However, when I'm testing it, I'm getting an error.
This is the function:
import numpy as np
import re

def parse_numbers(x, homo=False):
    if homo == False:
        if bool(re.match("^[0-9.]+[Mm]{1}$", x)):
            new_number = float(re.sub("[^0-9.]", "", x))
            return new_number * 1000000
        elif bool(re.match("^[0-9.]+[Kk]{1}$", x)):
            new_number = float(re.sub("[^0-9.]", "", x))
            return new_number * 1000
        else:
            return x
    elif homo == True:
        if bool(re.match("^[0-9.]+[MmKk]{1}$", x)):
            return "parsed_number"
        else:
            return x
    else:
        return "invalid setting for homo attribute"
As you can see, if it receives any input that cannot be parsed as a number, it returns the original input.
When I test this manually, it works fine: parse_numbers("3.1M") returns 3100000.0, and parse_numbers("not a number") returns 'not a number'.
Now I try to vectorize the function, and I test it like this:
vparse_numbers = np.vectorize(parse_numbers)
vparse_numbers(["3.1M", "2k", "not a number"])
I get the following error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_27076/2263610390.py in <module>
21
22 vparse_numbers = np.vectorize(parse_numbers)
---> 23 vparse_numbers(["3.0M", "2k", "not a number"])
c:\programdata\miniconda3\lib\site-packages\numpy\lib\function_base.py in __call__(self, *args, **kwargs)
2161 vargs.extend([kwargs[_n] for _n in names])
2162
-> 2163 return self._vectorize_call(func=func, args=vargs)
2164
2165 def _get_ufunc_and_otypes(self, func, args):
c:\programdata\miniconda3\lib\site-packages\numpy\lib\function_base.py in _vectorize_call(self, func, args)
2247
2248 if ufunc.nout == 1:
-> 2249 res = asanyarray(outputs, dtype=otypes[0])
2250 else:
2251 res = tuple([asanyarray(x, dtype=t)
ValueError: could not convert string to float: 'not a number'
When I test it using only the parseable numbers, vparse_numbers(["3.1M", "2k"]), it does return an array of the correct numbers.
What am I missing here? Am I not using the numpy.vectorize function correctly?
CodePudding user response:
np.vectorize builds a numpy.array, which cannot mix different types of data (string and float), so it tries to convert every result to float. You would have to return np.NaN instead of the original strings.
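A minimal illustration of why this fails (mirroring the asanyarray call in the traceback): np.vectorize infers float from the first result and then tries to pack all of the outputs into one float array:

import numpy as np

# the same cast that np.vectorize performs on the collected outputs
np.asanyarray([3100000.0, 2000.0, "not a number"], dtype=float)
# ValueError: could not convert string to float: 'not a number'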
import numpy as np
import re
import pandas as pd

def parse_numbers(text, homo=False):
    if homo == False:
        if bool(re.match("^[0-9.]+[Mm]{1}$", text)):
            new_number = float(re.sub("[^0-9.]", "", text))
            return new_number * 1_000_000
        elif bool(re.match("^[0-9.]+[Kk]{1}$", text)):
            new_number = float(re.sub("[^0-9.]", "", text))
            return new_number * 1_000
        else:
            return np.NaN
    elif homo == True:
        if bool(re.match("^[0-9.]+[MmKk]{1}$", text)):
            return "parsed_number"
        else:
            return np.NaN
    else:
        raise Exception(f"invalid setting for homo attribute: {homo}")

vparse_numbers = np.vectorize(parse_numbers)
vparse_numbers(["3.1M", "2k", "not a number"])
But if you want to use a DataFrame, then you should use .apply(), which can return different types of data.
import numpy as np
import re
import pandas as pd

def parse_numbers(text, homo=False):
    if homo == False:
        if bool(re.match("^[0-9.]+[Mm]{1}$", text)):
            new_number = float(re.sub("[^0-9.]", "", text))
            return new_number * 1_000_000
        elif bool(re.match("^[0-9.]+[Kk]{1}$", text)):
            new_number = float(re.sub("[^0-9.]", "", text))
            return new_number * 1_000
        else:
            return text  # np.NaN
    elif homo == True:
        if bool(re.match("^[0-9.]+[MmKk]{1}$", text)):
            return "parsed_number"
        else:
            return text  # np.NaN
    else:
        raise Exception(f"invalid setting for homo attribute: {homo}")

df = pd.DataFrame({"test": ["3.1M", "2k", "not a number"]})
df['result'] = df['test'].apply(parse_numbers)
print(df)
Result:

           test        result
0          3.1M     3100000.0
1            2k        2000.0
2  not a number  not a number
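Note that because the result column mixes floats with the unchanged string, pandas stores it with dtype object rather than float64 (a quick check on the df built above):

print(df['result'].dtype)  # object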
EDIT:

I would write parse_numbers a little differently:

- with re in only one line
- using {,1} so that strings like "123" (with no suffix) are also converted to numbers
- converting to integer at the end
- using the name parse_number without the s, because it parses only one number
import re
import pandas as pd

def parse_number(text, homo=False):
    if not isinstance(homo, bool):
        raise Exception(f"invalid setting for homo attribute: {homo}")

    results = re.findall("^([0-9.]+)([MmKk]{,1})$", text)
    #print(results)

    if results:
        if homo:
            return "parsed_number"
        else:
            number, name = results[0]
            new_number = float(number)
            if name in ('M', 'm'):
                new_number *= 1_000_000
            elif name in ('K', 'k'):
                new_number *= 1_000
            return int(new_number)
    else:
        return text  # np.NaN

df = pd.DataFrame({
    "test": ["3.1M", "2k", "123", "not a number"]
})

df['result'] = df['test'].apply(parse_number)
df['homo'] = df['test'].apply(lambda x: parse_number(x, True))
print(df)
Result:

           test        result           homo
0          3.1M       3100000  parsed_number
1            2k          2000  parsed_number
2           123           123  parsed_number
3  not a number  not a number   not a number
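The {,1} quantifier means "zero or one", which is why a bare number with no suffix still matches; a quick demonstration of the pattern on its own:

import re

re.findall("^([0-9.]+)([MmKk]{,1})$", "123")
# [('123', '')]  -> the number group matches, the suffix group is empty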
CodePudding user response:
In [475]: alist = ["3.1M", "2k", "not a number"]
The straightforward application of your function to elements of the list:
In [476]: [parse_numbers(s) for s in alist]
Out[476]: [3100000.0, 2000.0, 'not a number']
np.vectorize will work if you specify the right otypes:
In [478]: fn = np.vectorize(parse_numbers, otypes=[object])
In [479]: fn(alist)
Out[479]: array([3100000.0, 2000.0, 'not a number'], dtype=object)
In [480]: timeit [parse_numbers(s) for s in alist]
12.9 µs ± 51.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [481]: timeit fn(alist)
24.1 µs ± 33.4 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
For returning objects, frompyfunc is a bit faster, without some of the np.vectorize overhead:
In [482]: fn = np.frompyfunc(parse_numbers, 1,1)
In [483]: fn(alist)
Out[483]: array([3100000.0, 2000.0, 'not a number'], dtype=object)
In [484]: timeit fn(alist)
21.1 µs ± 68.5 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
My observation from other SO questions is that pandas apply is much slower; pandas does a lot of extra work in tracking indices. For a task that's basically string- and list-oriented, numpy offers few, if any, advantages.
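If you want to measure the pandas overhead yourself, a sketch along these lines works in an IPython session (not timings from the original session; your numbers will vary):

import pandas as pd

s = pd.Series(alist)
%timeit s.apply(parse_numbers)   # compare against the loop and vectorize timings above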
For a much larger list, the timing differences disappear:
In [485]: alist = alist*1000
In [486]: len(alist)
Out[486]: 3000
In [487]: timeit fn(alist)
12.5 ms ± 70.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [488]: timeit [parse_numbers(s) for s in alist]
12.3 ms ± 319 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [489]: fn = np.vectorize(parse_numbers, otypes=[object])
In [490]: timeit fn(alist)
12.1 ms ± 42.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)