I am currently refactoring some code where I see both of these lines being used:
foo = df['bar'].values[0]
foo = df['bar'].iloc[0]
From my current understanding, both lines do the same thing: retrieving the first value of the pandas series.
Are they really the same? If so, is one recommended over the other (because of internal subtleties, speed, behavior when setting a value instead of getting it, etc.)?
CodePudding user response:
Most of the time the output is the same. Datetimes are one exception, because .values and Series.to_numpy return a NumPy array, so indexing it gives a numpy.datetime64, while .iloc returns a pandas Timestamp:
import pandas as pd
# pandas 2.2+ deprecates freq='Q' in favor of freq='QE'
df = pd.DataFrame({'bar':pd.date_range('2001', freq='Q', periods=5)})
print (df)
         bar
0 2001-03-31
1 2001-06-30
2 2001-09-30
3 2001-12-31
4 2002-03-31
foo = df['bar'].to_numpy()[0]
print(foo)
2001-03-31T00:00:00.000000000
print(type(foo))
<class 'numpy.datetime64'>
foo = df['bar'].values[0]
print(foo)
2001-03-31T00:00:00.000000000
print(type(foo))
<class 'numpy.datetime64'>
foo = df['bar'].iloc[0]
print(foo)
2001-03-31 00:00:00
print(type(foo))
<class 'pandas._libs.tslibs.timestamps.Timestamp'>
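To make the contrast concrete, here is a minimal sketch comparing a datetime column with an integer column. The series names (dates, ints) are made up for illustration, and the behavior shown is what I would expect from recent pandas versions rather than a guarantee for every release.

```python
import numpy as np
import pandas as pd

# Hypothetical example data, not taken from the question.
dates = pd.Series(pd.to_datetime(["2001-03-31", "2001-06-30"]))
ints = pd.Series([10, 20, 30])

# Datetime column: the two accessors give different scalar types.
ts_iloc = dates.iloc[0]       # pandas Timestamp
ts_values = dates.values[0]   # numpy.datetime64

# Both represent the same instant; the conversion just has to be explicit.
same_instant = pd.Timestamp(ts_values) == ts_iloc

# Integer column: both accessors give a NumPy integer scalar.
int_iloc = ints.iloc[0]
int_values = ints.values[0]

print(type(ts_iloc).__name__, type(ts_values).__name__, same_instant)
print(type(int_iloc).__name__, type(int_values).__name__)
```

If downstream code relies on Timestamp methods such as strftime, the iloc form is the safer choice, since numpy.datetime64 does not provide them.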
CodePudding user response:
The code df.values actually returns a numpy.ndarray. It is an attribute, so it can be used without square brackets; the brackets only index into the resulting array.
df[col].values # whole column as a numpy array
df[col].values[0] # 1st element of numpy array
df[col].values[1:3] # 2nd and 3rd elements of numpy array
Meanwhile df.iloc is a position-based indexer for getting elements from a DataFrame. On its own, iloc only gives you an indexer object rather than data, so it must be followed by square brackets.
df.iloc # Returns an indexer object, not data
df.iloc[row, col] # Returns a scalar, a `Series`, or a `DataFrame` depending on the input
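As a rough illustration of how the return type follows the input, here is a small sketch. The frame and its column names (a, b) are hypothetical and only serve the example.

```python
import numpy as np
import pandas as pd

# Hypothetical two-column frame for illustration.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

cell = df.iloc[0, 0]        # single row and column position -> scalar
row = df.iloc[0]            # single row position -> Series
block = df.iloc[0:2, 0:2]   # slices on both axes -> DataFrame
col_array = df["a"].values  # one column via .values -> numpy.ndarray

for name, obj in [("cell", cell), ("row", row), ("block", block), ("col_array", col_array)]:
    print(name, type(obj).__name__)
```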
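The subtle difference lies in the object being returned, and also in the implementation behind the scenes.
iloc goes through pandas' positional indexing and returns pandas objects where they exist (for example a Timestamp for datetimes).
values exposes the data as a numpy.ndarray, and the element is then read from that array. For a single-dtype Series this usually needs no copy, but for a DataFrame with mixed dtypes the columns first have to be converted into one array of a common dtype. Which of the two is faster therefore depends on the data, so it is worth timing both rather than assuming iloc wins.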
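Speed claims like this are easy to check locally. The sketch below times both accessors with timeit and also shows what dtype a mixed DataFrame ends up with when converted via values. The data, names (s, mixed) and repetition count are arbitrary assumptions, and the timings will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical data: a large single-dtype Series and a small mixed-dtype frame.
s = pd.Series(np.arange(1_000_000, dtype="float64"))
mixed = pd.DataFrame({"num": [1, 2], "txt": ["x", "y"]})

# Time scalar access through each route; no winner is assumed here.
t_iloc = timeit.timeit(lambda: s.iloc[0], number=10_000)
t_values = timeit.timeit(lambda: s.values[0], number=10_000)
print(f"iloc: {t_iloc:.4f}s  values: {t_values:.4f}s")

# A frame with mixed column types is converted to one common dtype by .values.
mixed_dtype = mixed.values.dtype
print(mixed_dtype)
```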