How do I convert this column of values, mostly integers, and some strings to all integers.
The column looks like this,
x1
___
128455551
92571902
123125
985166
np.NaN
2241
1.50000MMM
2.5255MMM
1.2255MMMM
np.NaN
...
And I want it to look like this, where the rows with MMM, the characters are dropped and the number is multiplied by a billion (10**9) and converted to integers.
The rows where there are MMMM, the characters are dropped and the number is multiplied by a trillion (10**12) and converted to integers.
Basically each M means 1,000. There are other columns so I cannot drop the np.NaN
.
x1
___
128455551
92571902
123125
985166
np.NaN
2241
1500000000
2525500000
1225500000000
np.NaN
...
I tried this,
df['x1'] =np.where(df.x1.astype(str).str.contains('MMM'), (df.x1.str.replace('MMM', '').astype(float) * 10**9).astype(int), df.x1)
When I do it with just the 2 rows it works fine, but when I do it with the whole dataframe I get this error, IntCastingNaNError: Cannot convert non-finite values (NA or inf) to integer
.
How do i fix it?
CodePudding user response:
A possible solution:
def f(x):
if isinstance(x, str):
ms = x.count('M')
return float(x.replace('M' * ms, '')) * 10**(3 * ms)
else:
return x
df['x1'] = df['x1'].map(f).astype('Int64')
Output:
x1
0 128455551
1 92571902
2 123125
3 985166
4 <NA>
5 2241
6 1500000000
7 2525500000
8 1225500000000
9 <NA>
CodePudding user response:
You can also try the following solution:
import numpy as np
(df.x1.str.extract('([^M] )(M )?').replace({np.NaN : None})
.assign(power = lambda x: 10 ** (3 * x.loc[:, 1].str.count('M').fillna(0)))
.pipe(lambda d: d.loc[:, 0].replace({'np.NaN' : None}).astype(float).mul(d.power)))
0 1.284556e 08
1 9.257190e 07
2 1.231250e 05
3 9.851660e 05
4 NaN
5 2.241000e 03
6 1.500000e 09
7 2.525500e 09
8 1.225500e 12
9 NaN
dtype: float64
CodePudding user response:
When considering string values containing M
, sanitized value can be multiplied on 1000
as much as M
occurrences (according to your condition "Basically each M means 1,000"):
df['x1'] = np.where(df.x1.str.contains('M'),
(df.x1.str.replace('M', '').astype(float) \
* pow(1000, df.x1.str.count('M'))).astype('Int64'), df.x1)
print(df)
x1
0 128455551
1 92571902
2 123125
3 985166
4 NaN
5 2241
6 1500000000
7 2525500000
8 1225500000000
9 NaN