Are there differences in performance / is it best practice to use pandas dataframe attributes when referring to columns?

Time:03-23

I frequently go back and forth between using dataframe attribute access to refer to columns and using bracket notation.

I am wondering which format is considered "best practice," and if there are any performance differences between the two (or could this potentially vary based upon the circumstance?). I am not finding many resources on this subject.

Here's a simple example of what I mean: creating a column "green" whose rows are True if columns "blue" and "yellow" are both True, and False otherwise.

# using brackets.
df['green'] = np.where((df['blue']==True) & (df['yellow']==True), True, False)

vs.

# using periods.
df['green'] = np.where((df.blue==True) & (df.yellow == True), True, False)

I often find myself using the latter as it looks cleaner, is shorter, and is easier to type. However, I often see pandas examples here and other sources using both methods.

  • Is there a performance difference in using either format?
  • Which format is considered best practice?

CodePudding user response:

There is no meaningful performance difference between the two notations:

  • df.blue uses __getattr__ to look up the column
  • df['blue'] uses __getitem__ to look up the column (or index)
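The equivalence of the two access paths is easy to check directly (a minimal sketch with made-up data):

```python
import pandas as pd

df = pd.DataFrame({'blue': [True, False], 'yellow': [True, True]})

# Attribute access goes through __getattr__, bracket access through
# __getitem__, but both resolve to the same column data.
assert df.blue.equals(df['blue'])
```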

The column name must be a valid Python identifier to use the first form, and you can't use column names that collide with existing DataFrame attributes, such as shape, size, values, and so on.

The second form is more explicit and is what the .loc indexer uses. It allows column names like 2022 or Energy (KWH). I clearly prefer this notation.
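To illustrate the naming limits above, here is a small sketch (the column names are made up): a column called shape is shadowed by the DataFrame.shape attribute, and names with spaces or non-identifier characters are only reachable with brackets.

```python
import pandas as pd

# Columns whose names are not valid identifiers, or that collide with
# DataFrame attributes, require bracket notation.
df = pd.DataFrame({'shape': [1, 2], 'Energy (KWH)': [3.0, 4.0], 2022: [5, 6]})

print(df.shape)           # the (rows, cols) attribute: (2, 3)
print(df['shape'].sum())  # the actual column: 3
print(df['Energy (KWH)'].mean())
print(df[2022].tolist())
```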

CodePudding user response:

If performance matters, don't use where or similarly costly functions. A plain boolean mask will do the job. Using timeit can give you an idea of the time consumed:

import pandas as pd
import numpy as np
n = 100
df = pd.DataFrame({'yellow': np.random.randint(0, 2, n),
                   'blue': np.random.randint(0, 2, n)}, dtype=bool)  # np.bool8 was removed in NumPy 2.0

%timeit np.where((df['blue']==True) & (df['yellow']==True), True, False)
252 µs ± 17.3 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

%timeit np.where((df.blue==True) & (df.yellow == True), True, False)
245 µs ± 3.06 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

%timeit df['blue'] & df['yellow']
72.1 µs ± 4.6 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

%timeit df.blue & df.yellow
77.1 µs ± 1.75 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In terms of performance the two notations are roughly equivalent, and statistically you can't tell the two approaches apart. With a costly operation (such as where), the real bottleneck is not how elements are accessed.
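Applied to the original example, the mask from the benchmark can be assigned directly, sidestepping np.where entirely (a sketch using the same made-up data as above):

```python
import numpy as np
import pandas as pd

n = 100
df = pd.DataFrame({'yellow': np.random.randint(0, 2, n),
                   'blue': np.random.randint(0, 2, n)}, dtype=bool)

# Direct boolean assignment replaces the np.where round-trip.
df['green'] = df['blue'] & df['yellow']

# Same result as the np.where version, minus the overhead.
assert (df['green'] == np.where(df['blue'] & df['yellow'], True, False)).all()
```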

Regarding the syntax, I prefer using .loc or .iloc to access elements since I find it more "pandas-ic", but that's totally up to you.
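For completeness, the same assignment written in the .loc style mentioned above (a sketch with made-up data, nothing specific to the benchmark):

```python
import pandas as pd

df = pd.DataFrame({'blue': [True, True, False],
                   'yellow': [True, False, False]})

# .loc[row_indexer, column_indexer] makes the selection explicit.
df.loc[:, 'green'] = df.loc[:, 'blue'] & df.loc[:, 'yellow']
print(df['green'].tolist())  # [True, False, False]
```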
