Why is DataFrame int column value sometimes returned as float?

Time:01-20

I add a calculated column c, which contains only integers, to a DataFrame.

import numpy as np
import pandas as pd

df = pd.DataFrame(data=list(zip(*[np.random.randint(1,3,5), np.random.random(5)])), columns=['a', 'b'])
df['c'] = np.ceil(df.a/df.b).astype(int)
df.dtypes

The DataFrame reports that the column type of c is indeed int:

a      int64
b    float64
c      int32
dtype: object

If I access a value from c like this then I get an int:

df.c.values[0]        # Returns "3"
type(df.c.values[0])  # Returns "numpy.int32"

But if I access the same value using iloc I get a float:

df.iloc[0].c        # Returns "3.0"
type(df.iloc[0].c)  # Returns "numpy.float64"

Why is this?

I would like to be able to access the value using indexes without having to cast it (again) to an int.

CodePudding user response:

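What's happening is that df.iloc[0].c first evaluates df.iloc[0], which returns the whole first row as a Series covering all three columns. A Series has a single dtype, so pandas upcasts the row to the one type that can hold all three values, which here is numpy.float64. The value of c has already been converted to float by the time you read .c from that row.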
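To make the upcast visible, here is a minimal sketch. It uses fixed values instead of the random data from the question (an assumption made only so the result is reproducible), and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Fixed values stand in for the random data in the question.
df = pd.DataFrame({"a": [1, 2, 1], "b": [0.5, 0.25, 0.4]})
df["c"] = np.ceil(df["a"] / df["b"]).astype(int)

row = df.iloc[0]                 # one Series holding a, b and c for row 0
row_dtype = row.dtype            # the single dtype shared by the whole row
c_from_row = row["c"]            # read after the row was upcast
c_from_column = df["c"].iloc[0]  # read from the integer column itself

print(df.dtypes)
print(row_dtype, type(c_from_row), type(c_from_column))
```

The column keeps its integer dtype. Only the intermediate row Series is float.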

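Interestingly enough, adding a string column avoids this: with a string in the row, the row can only be stored with dtype object, so the values are no longer converted to float and c comes back as an integer.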

df = pd.DataFrame(data=list(zip(*[np.random.randint(1,3,5), np.random.random(5)])), columns=['a', 'b'])
df['c'] = np.ceil(df.a/df.b).astype(int)
df['d'] = ['hi', 'bye', 'hello', 'cya', 'sup']


print(df.iloc[0].c)
print(type(df.iloc[0].c))

print(df.dtypes)
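As a rough check of why the string column changes the result, the sketch below inspects the dtype of the row itself. It again uses fixed values rather than the random data above, which is an assumption for reproducibility.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 1], "b": [0.5, 0.25, 0.4]})
df["c"] = np.ceil(df["a"] / df["b"]).astype(int)
df["d"] = ["hi", "bye", "hello"]

row = df.iloc[0]
c_value = row["c"]

print(row.dtype)      # the row now mixes numbers and a string
print(type(c_value))  # an integer type, no longer a float
```

This explains the behaviour but is not a recommended fix: adding an unrelated column just to change the row dtype is fragile.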

As for your actual question, you can avoid this whole mess by using df.loc[0, 'c'] instead of df.iloc[0].c: it reads the value straight from column c without building a row Series first.
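A short sketch of scalar access that addresses the row and the column in one call, by label and by position. The data is fixed for reproducibility and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 1], "b": [0.5, 0.25, 0.4]})
df["c"] = np.ceil(df["a"] / df["b"]).astype(int)

by_loc = df.loc[0, "c"]                         # label-based
by_at = df.at[0, "c"]                           # label-based, scalar only
by_iloc = df.iloc[0, df.columns.get_loc("c")]   # position-based
by_iat = df.iat[0, df.columns.get_loc("c")]     # position-based, scalar only

for value in (by_loc, by_at, by_iloc, by_iat):
    print(value, type(value))
```

All four read directly from column c, so the integer dtype of the column is preserved.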

CodePudding user response:

  • When I execute your code, I get this DataFrame:
df
   a         b   c
0  1  0.315388   4
1  1  0.111275   9
2  1  0.251253   4
3  2  0.043162  47
4  1  0.047985  21
  • When I type df['c'].values in the interpreter, I get this: array([ 4, 9, 4, 47, 21])

That is, all the values of column c.

  • When I type df.iloc[0] in the interpreter, I get the following values:
a    1.000000
b    0.315388
c    4.000000
Name: 0, dtype: float64

That is, the values of the first row of df.

What we can notice

All values in column c are integers, but the values in the first row are not all of the same type: there are two integers and one float. This fact is very important.

Indeed, by definition an array is a collection of elements of the same type.

So to store a float in the same collection as integers, every element has to be converted to float to respect this rule, because a float can represent an integer value but an integer cannot represent a fractional value.

Conclusion

The type of a collection of integers is int.

The type of a collection of floats is float.

A collection of integers that contains at least one float is converted to float as a whole.
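The three cases can be checked directly with NumPy. This is a small sketch with made-up values. The last lines add one caveat that is an addition to the answer: float64 represents integers exactly only up to 2**53, so the conversion is not always lossless for very large values.

```python
import numpy as np

ints = np.array([1, 4, 9])
floats = np.array([0.5, 1.5])
mixed = np.array([1, 4, 0.315388])   # two integers and one float

print(ints.dtype, floats.dtype, mixed.dtype)

# The common type NumPy picks when int64 and float64 must share one array.
common = np.result_type(np.int64, np.float64)
print(common)

# Caveat: beyond 2**53 a float64 can no longer hold every integer exactly.
big = 2**53 + 1
survives_round_trip = int(np.float64(big)) == big
print(survives_round_trip)
```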

Quote

"An array is a concept that stores different items of the same type together as one and makes calculating the stance of each element easier by adding an offset to the base number." (codeinstitute.net)
