I have a dataset with a column like this:
0 534K
1 482.2K
2 491.3K
3 247.6K
4 815.3K
...
144 403.4K
145 1.3M
146 3.1M
147 311.5K
148 3.7M
I'm trying to clean the column values by writing a function like this (to remove the 'K's and '.' as a test first):
def fix_column(val):
i = 0
if val[i][-1] == 'K' and len(val[i]) == 4:
val.replace("K","")
val = val "000"
elif val[i][-1] == 'K'and len(val[i]) == 5:
val.replace("K","").replace(".","")
val = val "00"
i = 1
return val
But when I try to apply the function on the column like df.col = df.col.apply(fix_column)
nothing happens and the column values remain unchanged. If I do something like df.col = df.col.apply(lambda x:x.replace("K","").replace(".",""))
I do get close to a desirable output like so:
0 534
1 4822
2 4913
3 2476
4 8153
...
144 4034
145 13M
146 31M
147 3115
148 37M
I may be missing something basic but would appreciate any pointers in the right direction. Thanks.
CodePudding user response:
Your first mistake is to use a loop at all. Your second mistake is to use a loop incorrectly. (i
is a local variable, its value is always 0.) Here's a way to go:
Separate the number and the decimal suffix (if any):
number_strings = df.col.str.extract('([\d\.] )([MK]?)')
Convert the number strings into proper numbers:
numbers = number_strings[0].astype(float)
Multiply the numbers, where necessary:
numbers[number_strings[1] == 'K'] *= 1,000
numbers[number_strings[1] == 'M'] *= 1,000,000
Paste the results back into the dataframe:
df.col = numbers