problem cleaning characters/str from a column and make it an int

Time:10-22

I wrote this function to strip spaces and commas out of a column:

def data_clean_installs(x):
    if ' ' in x:
        return float(x.replace(' ',''))
    elif ',' in x:
        return float(x.replace(',',''))
    else:
        return float(x)

I want to use the function to make a new int column called 'Clean_Installs' and ran this:

apr['Clean_Installs'] = apr.Installs.astype('str').apply(data_clean_installs).apply(int)

but I get this error: ValueError: could not convert string to float: '10,000'

I tried everything I can think of, too much to put here and will take any inputs please... Oh, I am new and this is my first question ever. Sorry if I violated any rules... Really hope someone can help. Thanks!

CodePudding user response:

No need for a custom function here since you seem to already be using Pandas:

apr.Installs.str.replace("[, ]", "", regex=True).apply(int)

My only concern with using .apply(int) is that it'll fail in the case you have values in the column that won't translate to integers, like "1,000.53".
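One way to sidestep that failure, sketched here with made-up sample data, is to let pd.to_numeric turn anything unparseable into NaN and then keep only whole numbers in a nullable integer column. The Series contents and the variable names are assumptions for illustration, not taken from the question's data.

```python
import pandas as pd

# Hypothetical sample values, including ones that are not clean integers.
installs = pd.Series(["10,000", "1 000", "1,000.53", "n/a"])

# Strip commas and spaces, as in the answer above.
stripped = installs.str.replace("[, ]", "", regex=True)

# errors="coerce" yields NaN instead of raising on values like "n/a".
numeric = pd.to_numeric(stripped, errors="coerce")

# Keep only whole numbers; everything else becomes missing.
whole = numeric.notna() & (numeric % 1 == 0)

# "Int64" (capital I) is pandas' nullable integer dtype, so missing values are allowed.
clean_installs = numeric.where(whole).astype("Int64")
print(clean_installs)
```

Whether a value like "1,000.53" should become missing, be rounded, or raise an error depends on the data, so treat the masking step as one option among several.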

For a little bit of an explanation, regex=True is telling Pandas that the pattern (the first argument in Series.str.replace) should be treated as a regular expression.

The square brackets in the pattern [, ] form what's known as a character class. The pattern is basically telling Pandas, "use regex to match every character that is either a "," or a " ", and replace each one with the empty string."
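To see the character class on its own, outside of Pandas, here is a small sketch using the standard-library re module. The sample strings are invented for illustration.

```python
import re

# Character class: matches a single character that is either a comma or a space.
pattern = re.compile(r"[, ]")

# Hypothetical inputs with commas, spaces, both, or neither.
samples = ["10,000", "1 000 000", " 5,000 ", "500"]

# Every matching character is replaced with the empty string.
cleaned = [pattern.sub("", s) for s in samples]
print(cleaned)  # ['10000', '1000000', '5000', '500']
```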

Regex is super powerful, but there's a time and a place for it. This is one of those times!

CodePudding user response:

Probably the 10,000 used to be 10,000 with a space somewhere in it, or something like that. In the function data_clean_installs, the first branch runs as soon as a space is found and removes only the space, so the , is still there when float is called. Your function should look like this:

def data_clean_installs(x):
    return float(x.replace(' ', '').replace(',', ''))

You don't need to check whether x has a space or a , in it. replace will already do that for you and will convert any it finds to '' automatically; if there is nothing to replace, the string comes back unchanged.

Also, if you are converting it to int afterwards, you can replace the float call with an int call in data_clean_installs, provided all values are guaranteed to be integers.
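The following sketch reproduces the error with the question's function and shows the chained version succeeding. The input "10,000 " (with a trailing space) is a guess at what the real data looked like, and the names original_clean and fixed_clean are used here only to tell the two versions apart.

```python
def original_clean(x):
    # Same logic as the question: only ONE branch runs per value.
    if ' ' in x:
        return float(x.replace(' ', ''))
    elif ',' in x:
        return float(x.replace(',', ''))
    else:
        return float(x)


def fixed_clean(x):
    # Chained replaces remove both characters regardless of which are present.
    return float(x.replace(' ', '').replace(',', ''))


value = "10,000 "  # hypothetical input: a comma plus a trailing space

try:
    original_clean(value)
except ValueError as exc:
    # The space branch ran, so the comma was never removed.
    error_message = str(exc)
    print(error_message)

result = fixed_clean(value)
print(result)  # 10000.0
```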

CodePudding user response:

You can do the following:

import re

apr['Clean_Installs'] = apr.Installs.apply(lambda x: int(re.sub('[ ,]', '', x)))