My dataset has two columns, 'A' and 'B,' both of which have percentage values but are of the object datatype. For example,
A% | B% |
---|---|
1.x% | 3.x% |
2.x% | 4.x% |
Goal: I'm mostly interested in using this for machine learning clustering, hence my goal is to convert it to decimal form. For example, convert the '1.2%' object value to a float value of 0.012.
I tried two methods. The first was successful, but it took a long time.
I stripped the '%' from a value such as '34%' using pandas.Series.str.strip, which left the string '34', then converted it to a number with pd.to_numeric() to get 34. Finally, I divided that by 100 to get 0.34.
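For reference, the first method looks roughly like the sketch below. The sample values are made up for illustration, and the column name 'A%' follows the table above.

```python
import pandas as pd

# Hypothetical sample data; the real dataset is not shown in the question.
df = pd.DataFrame({"A%": ["1.2%", "2.5%"], "B%": ["3.4%", "4.8%"]})

stripped = df["A%"].str.strip("%")   # '1.2%' -> '1.2' (still a string)
as_number = pd.to_numeric(stripped)  # '1.2'  -> 1.2
a_decimal = as_number / 100          # 1.2    -> 0.012
print(a_decimal)
```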
In the second method, I tried the following function:
def Tab_to_float(z):
    return float(z.strip('%'))/100
Now when I pass the column (which is an object) as below:
Tab_to_float(df['A'])
I get error:
AttributeError: 'Series' object has no attribute 'strip'
I tried feeding this function an int, a float, a numpy array, and even a dataframe, but I got the same kind of error each time: that object has no attribute 'strip'. I'm not sure where I'm going wrong. Is there a better way to deal with such requirements? Any help is much appreciated!
CodePudding user response:
To make it a bit interesting, here is a snippet to convert all columns ending in '%' from text percentage format to float:
for col in df.filter(regex='%$'):  # columns whose name ends in '%'
    df[col] = df[col].str.rstrip('%').astype(float).div(100)  # remove %, convert to float, divide by 100
    df.rename(columns={col: col.rstrip('%')}, inplace=True)  # remove the '%' in the column name
output:
A B
0 0.011 0.033
1 0.022 0.044
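A self-contained variant of the same idea, as a sketch: the helper name percent_columns_to_float and the sample frame (including the extra 'label' column) are assumptions made for illustration, and the column names are assumed to be strings.

```python
import pandas as pd

def percent_columns_to_float(frame):
    """Return a copy where every column named like 'X%' holds fractions."""
    out = frame.copy()
    # Only columns whose name ends in '%' are touched.
    pct_cols = [c for c in out.columns if c.endswith("%")]
    for col in pct_cols:
        # '1.1%' -> '1.1' -> 1.1 -> 0.011
        out[col] = out[col].str.rstrip("%").astype(float) / 100
    # Drop the trailing '%' from the converted column names.
    return out.rename(columns={c: c.rstrip("%") for c in pct_cols})

# Hypothetical sample data with one non-percentage column.
df = pd.DataFrame({"A%": ["1.1%", "2.2%"], "B%": ["3.3%", "4.4%"], "label": ["x", "y"]})
result = percent_columns_to_float(df)
print(result)
```

Working on a copy leaves the original frame untouched, which makes it easy to compare before and after.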
CodePudding user response:
df['A'] = df.apply(lambda row : Tab_to_float(row['A']), axis = 1)
You can do this for each of the two columns, keeping your function as it is.
Here df.apply with axis=1 applies a function along an axis of the DataFrame: each row is passed to the lambda, which calls Tab_to_float on that row's value, so every element of the column gets converted. Tab_to_float itself is unchanged in this solution.
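import pandas as pd

# Tab_to_float exactly as defined in the question
def Tab_to_float(z):
    return float(z.strip('%'))/100

data = {
    'A': ['34.3%', '24%'],
    'B': ['32%', '33%'] }
df = pd.DataFrame(data)
df['A'] = df.apply(lambda row : Tab_to_float(row['A']), axis = 1)
df['B'] = df.apply(lambda row : Tab_to_float(row['B']), axis = 1)
print(df)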
Outputs:
A B
0 0.343 0.32
1 0.240 0.33
CodePudding user response:
You can use a lambda to apply your function to a pandas DataFrame or a Series. For example, you can convert each element of a column to a floating point number and divide by 100, like this:
(df['A']).apply(lambda x: float(x.strip('%'))/100)
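Note that apply returns a new Series, so the result has to be assigned back to the column. A minimal runnable sketch, reusing Tab_to_float from the question with made-up sample data:

```python
import pandas as pd

def Tab_to_float(z):
    # Expects a single string such as '34%', not a whole Series.
    return float(z.strip('%')) / 100

# Hypothetical sample data.
df = pd.DataFrame({'A': ['34.3%', '24%'], 'B': ['32%', '33%']})

# Series.apply calls the function once per element.
df['A'] = df['A'].apply(Tab_to_float)
# Series.map does the same element-wise work here.
df['B'] = df['B'].map(Tab_to_float)
print(df.dtypes)
```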