I frequently go back and forth between using DataFrame attribute access and the bracket method to refer to columns.
I am wondering which format is considered "best practice," and if there are any performance differences between the two (or could this potentially vary based upon the circumstance?). I am not finding many resources on this subject.
Here's a simple example of what I mean: creating the column "green," with rows being True if columns "blue" and "yellow" are both True, and False otherwise.
# using brackets.
df['green'] = np.where((df['blue']==True) & (df['yellow']==True), True, False)
vs.
# using periods.
df['green'] = np.where((df.blue==True) & (df.yellow == True), True, False)
I often find myself using the latter as it looks cleaner, is shorter, and is easier to type. However, I see pandas examples here and in other sources using both methods.
- Is there a performance difference in using either format?
- Which format is considered best practice?
CodePudding user response:
There is no meaningful performance difference between the two notations:
- df.blue uses __getattr__ to look up the right column
- df['blue'] uses __getitem__ to look up the right column (or index)
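As a quick check of the point above, here is a minimal sketch showing that both notations return the same column. The tiny DataFrame is made up for illustration and only reuses the column names from the question.

```python
import pandas as pd

# Hypothetical two-column frame, reusing the question's column names.
df = pd.DataFrame({"blue": [True, False, True], "yellow": [True, True, False]})

by_attr = df.blue      # falls back to DataFrame.__getattr__
by_item = df["blue"]   # goes through DataFrame.__getitem__

# Both forms hand back the same column data.
same = by_attr.equals(by_item)
print(same)
```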
The first form only works if the column name is a valid Python identifier, and it cannot be used for column names that clash with existing DataFrame attributes or methods such as shape, size, values and so on.
The second form is more explicit, and it is the notation used by the LocIndexer. It also lets you use column names like 2022 or Energy (KWH). I clearly prefer this notation.
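To make those limitations concrete, here is a small sketch. The DataFrame and its column names are invented for the example (they echo the names mentioned above). It also shows one more pitfall that is easy to hit: assigning to a new attribute does not create a column.

```python
import warnings
import pandas as pd

# Hypothetical frame whose column names are awkward for attribute access.
df = pd.DataFrame({
    "shape": ["round", "square"],   # clashes with DataFrame.shape
    "Energy (KWH)": [1.5, 2.5],     # not a valid Python identifier
    2022: [10, 20],                 # not even a string
})

attr_shape = df.shape       # the DataFrame's own (rows, columns) tuple
col_shape = df["shape"]     # the actual column

energy = df["Energy (KWH)"]  # only reachable with brackets
year = df[2022]              # only reachable with brackets

# Assigning to a new attribute sets a plain Python attribute, not a column
# (pandas emits a UserWarning about this, silenced here for brevity).
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    df.green = [True, False]
has_green = "green" in df.columns

# Brackets do create the column.
df["green"] = [True, False]
```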
CodePudding user response:
If performance matters, don't use where or a similarly costly function. A plain boolean mask will do the job. Using timeit can give you an idea of how long each version takes:
import pandas as pd
import numpy as np
n = 100
# np.bool8 was removed in NumPy 2.0; the built-in bool works everywhere.
df = pd.DataFrame({'yellow': np.random.randint(0, 2, n),
                   'blue': np.random.randint(0, 2, n)}, dtype=bool)
%timeit np.where((df['blue']==True) & (df['yellow']==True), True, False)
252 µs ± 17.3 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%timeit np.where((df.blue==True) & (df.yellow == True), True, False)
245 µs ± 3.06 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%timeit df['blue'] & df['yellow']
72.1 µs ± 4.6 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
%timeit df.blue & df.yellow
77.1 µs ± 1.75 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
In terms of performance the two are essentially equivalent, and statistically you can't tell the two approaches apart. In fact, in a costly implementation (such as where), the real bottleneck is not how you access the column.
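The %timeit magic only works in IPython or Jupyter. For a plain Python script, a rough equivalent using the standard-library timeit module might look like the sketch below. The repeat counts and the seeded random generator are arbitrary choices for this example, and the absolute numbers will differ from machine to machine. It also checks that the plain mask gives the same result as the np.where version.

```python
import timeit

import numpy as np
import pandas as pd

n = 100
rng = np.random.default_rng(0)  # seeded only so the example is repeatable
df = pd.DataFrame({
    "yellow": rng.integers(0, 2, n).astype(bool),
    "blue": rng.integers(0, 2, n).astype(bool),
})

# The plain mask and the np.where version produce the same values.
mask = df["blue"] & df["yellow"]
via_where = np.where((df["blue"] == True) & (df["yellow"] == True), True, False)
same_result = np.array_equal(mask.to_numpy(), via_where)

candidates = {
    "brackets": lambda: df["blue"] & df["yellow"],
    "attributes": lambda: df.blue & df.yellow,
}

calls = 200
timings = {
    # Best of 3 runs, converted to seconds per call.
    name: min(timeit.repeat(fn, number=calls, repeat=3)) / calls
    for name, fn in candidates.items()
}

for name, seconds in timings.items():
    print(f"{name}: {seconds * 1e6:.1f} µs per call")
```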
Regarding the syntax, I prefer using .loc or .iloc to access elements since I find it more "pandas-ic", but that's totally up to you.
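For comparison, here is a minimal sketch of the same column creation written with .loc, plus positional access with .iloc. The DataFrame is made up for the example.

```python
import pandas as pd

# Hypothetical frame, reusing the question's column names.
df = pd.DataFrame({"blue": [True, False, True], "yellow": [True, True, False]})

# .loc selects by label: all rows (:) of the named column.
df.loc[:, "green"] = df.loc[:, "blue"] & df.loc[:, "yellow"]

# .iloc selects by position: all rows of the first column.
first_col = df.iloc[:, 0]

print(df)
```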