Converting Set(float) into float/int

I have this DataFrame and am trying to clean it. How can I convert irs_pop, latitude, longitude and fips into real floats and ints?

The code below raises TypeError: float() argument must be a string or a real number, not 'set'

mask['latitude'] = mask['latitude'].astype('float64')
mask['longitude'] = mask['longitude'].astype('float64')
mask['irs_pop'] = mask['irs_pop'].astype('int64')
mask['fips'] = mask['fips'].astype('int64')

The code below raises TypeError: sequence item 0: expected str instance, float found

mask['fips'] = mask['fips'].apply(lambda x: ','.join(x))

And the code below raises TypeError: int() argument must be a string, a bytes-like object or a real number, not 'set'

mask = mask.astype({'fips': 'int64'})

CodePudding user response：

str.join only accepts strings, so every element of the set has to be converted to str first. You can do that with map and str:

mask['fips'] = mask['fips'].apply(lambda x: ','.join(map(str, x)))

This stores your floats as a comma-delimited string, which has to be parsed back into whatever format you want when you read it again.
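As a rough sketch of that round trip, assuming the sets hold plain floats: the sample values and the helper name parse_floats below are made up for illustration, and sorted() is only there to make the element order predictable.

```python
import pandas as pd

# Hypothetical sample data: each cell is a set of floats
mask = pd.DataFrame({"fips": [{36103.0}, {36059.0, 36103.0}]})

# Join each set into a comma-delimited string (sorted for a stable order)
mask["fips"] = mask["fips"].apply(lambda x: ",".join(map(str, sorted(x))))

def parse_floats(text):
    """Split a comma-delimited string back into a list of floats."""
    return [float(part) for part in text.split(",") if part]

# Parse the strings back when the numeric values are needed again
parsed = mask["fips"].apply(parse_floats)
print(parsed.tolist())
```

An empty string parses to an empty list here, which may or may not be what you want for rows that had no values.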

CodePudding user response：

Try this:

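for col in ['irs_pop', 'latitude', 'longitude']:
    # str({40.81}) is '{40.81}': strip the braces, then parse as float
    # (astype(int) would fail on a string like '40.81')
    mask[col] = mask[col].astype(str).str[1:-1].astype(float)
mask['irs_pop'] = mask['irs_pop'].astype('int64')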

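This only works when each set holds a single value. It looks like you have multiple FIPS codes in some rows of your fips column, so you won't be able to convert those to a single FIPS code. More importantly, FIPS codes can have leading zeros, so they should be stored as strings rather than ints.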
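A minimal sketch of keeping FIPS codes as zero-padded strings, assuming they arrive as sets of floats and are 5-digit county codes. The sample values, the helper name fips_to_str and the width of 5 are assumptions for illustration, not something taken from your data.

```python
import pandas as pd

# Hypothetical sample data: FIPS codes stored as sets of floats
mask = pd.DataFrame({"fips": [{1001.0}, {36059.0, 36103.0}]})

def fips_to_str(codes, width=5):
    """Render each code as a zero-padded string, joined by commas.

    width=5 assumes county-level FIPS codes; adjust it for other levels.
    """
    return ",".join(str(int(code)).zfill(width) for code in sorted(codes))

mask["fips"] = mask["fips"].apply(fips_to_str)
print(mask["fips"].tolist())  # ['01001', '36059,36103']
```

Rows with several codes end up as one comma-separated string, so the leading zero of a code like 01001 survives.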

CodePudding user response：

You would need to convert each set to a tuple (or list) and then take the first element with the .str accessor:

df['col'] = df['col'].agg(tuple).str[0]

Example:

import pandas as pd

# note: {} is an empty dict, not an empty set, but it behaves the same here
df = pd.DataFrame({'col': [{1},{2,3},{}]})
df['col2'] = df['col'].agg(tuple).str[0]

Output:

      col  col2
0     {1}   1.0
1  {2, 3}   2.0 # this doesn't seem to be the case in your data
2      {}   NaN

If you want a string as output, keeping all values when there are several:

df['col'] = df['col'].astype(str).str[1:-1]

Output (as new column for clarity):

      col  col2
0     {1}     1
1  {2, 3}  2, 3
2      {}

CodePudding user response：

It looks like you have sets with a single value in these columns. The problem may be upstream where these values were filled in the first place. But you could clean it up by applying a function that pops a value from the set and converts it to a float.

import pandas as pd

mask = pd.DataFrame({"latitude":[{40.81}, {40.81}], 
    "longitude":[{-73.04}, {-73.04}]})
print(mask)
columns = ["latitude", "longitude"]
for col in columns:
    mask[col] = mask[col].apply(lambda s: float(s.pop()))
print(mask)

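Alternatively, you could have pandas handle the for loop by doing a double apply. Run this instead of the loop above, on the original set-valued columns, since after the loop the cells are already floats and no longer have a pop method.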

mask[columns] = mask[columns].apply(
        lambda series: series.apply(lambda s: float(s.pop())))
print(mask)
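
One thing to keep in mind is that set.pop() removes the element, so the original set objects are emptied as a side effect. Here is a sketch of a variant that leaves the sets untouched and fails loudly if a set does not hold exactly one value. The helper name only_value is made up for illustration.

```python
import pandas as pd

mask = pd.DataFrame({"latitude": [{40.81}, {40.81}],
                     "longitude": [{-73.04}, {-73.04}]})

def only_value(s):
    """Return the single element of s as a float, leaving s unchanged."""
    if len(s) != 1:
        raise ValueError(f"expected exactly one value, got {s!r}")
    return float(next(iter(s)))

columns = ["latitude", "longitude"]
for col in columns:
    mask[col] = mask[col].apply(only_value)
print(mask.dtypes)
```

The length check is a design choice: it surfaces rows with zero or several values instead of silently picking one.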