Covert a column of integers and interger strings into all integers using multiplication based on w-CodePudding

How do I convert this column of values, mostly integers, and some strings to all integers.

The column looks like this,

x1
___
128455551
92571902
123125
985166
np.NaN
2241
1.50000MMM
2.5255MMM
1.2255MMMM
np.NaN
...

And I want it to look like this, where the rows with MMM, the characters are dropped and the number is multiplied by a billion (10**9) and converted to integers.

The rows where there are MMMM, the characters are dropped and the number is multiplied by a trillion (10**12) and converted to integers.

Basically each M means 1,000. There are other columns so I cannot drop the np.NaN.

x1
___
128455551
92571902
123125
985166
np.NaN
2241
1500000000
2525500000
1225500000000
np.NaN
...

I tried this,

df['x1'] =np.where(df.x1.astype(str).str.contains('MMM'), (df.x1.str.replace('MMM', '').astype(float) * 10**9).astype(int), df.x1)

When I do it with just the 2 rows it works fine, but when I do it with the whole dataframe I get this error, IntCastingNaNError: Cannot convert non-finite values (NA or inf) to integer.

How do i fix it?

CodePudding user response：

A possible solution:

def f(x):
    if isinstance(x, str):
        ms = x.count('M')
        return float(x.replace('M' * ms, '')) * 10**(3 * ms)
    else:
        return x

df['x1'] = df['x1'].map(f).astype('Int64')

Output:

              x1
0      128455551
1       92571902
2         123125
3         985166
4           <NA>
5           2241
6     1500000000
7     2525500000
8  1225500000000
9           <NA>

CodePudding user response：

You can also try the following solution:

import numpy as np

(df.x1.str.extract('([^M] )(M )?').replace({np.NaN : None})
 .assign(power = lambda x: 10 ** (3 * x.loc[:, 1].str.count('M').fillna(0)))
 .pipe(lambda d: d.loc[:, 0].replace({'np.NaN' : None}).astype(float).mul(d.power)))


0    1.284556e 08
1    9.257190e 07
2    1.231250e 05
3    9.851660e 05
4             NaN
5    2.241000e 03
6    1.500000e 09
7    2.525500e 09
8    1.225500e 12
9             NaN
dtype: float64

CodePudding user response：

When considering string values containing M, sanitized value can be multiplied on 1000 as much as M occurrences (according to your condition "Basically each M means 1,000"):

df['x1'] = np.where(df.x1.str.contains('M'),
                    (df.x1.str.replace('M', '').astype(float) \
                     * pow(1000, df.x1.str.count('M'))).astype('Int64'), df.x1)

print(df)

              x1
0      128455551
1       92571902
2         123125
3         985166
4            NaN
5           2241
6     1500000000
7     2525500000
8  1225500000000
9            NaN