pandas DataFrame.apply() doesn't work as intended

Time:07-08

The task that I am trying to accomplish is to define a function that adds 1 to the elements of the 'grade' column of a DataFrame if the corresponding element in the 'sqft_living' column is less than 400 and adds 2 to the elements of the 'grade' column if the corresponding element in the 'sqft_living' column is greater than 400. This function is then applied to the DataFrame using DataFrame.apply() method.

The dataset that I am working with is called 'House Sales in King County, USA'. Link to the dataset: https://www.kaggle.com/datasets/harlfoxem/housesalesprediction

The 'id', 'sqft_living', and 'grade' columns of the dataset look like this:

            id     sqft_living grade
0       7129300520     1180      7
1       6414100192     2570      7
2       5631500400     770       6
3       2487200875     1960      7
4       1954400510     1680      8
...        ...         ...      ...
21608   263000018      1530      8
21609   6600060120     2310      8
21610   1523300141     1020      7
21611   291310100      1600      8
21612   1523300157     1020      7

The code that I am using is:

def myfunc(x):
    if x<400 and x>0:
        housing['grade'] = housing['grade'].add(1)
    elif x>400:
        housing['grade'] = housing['grade'].add(2)
housing['sqft_living'].apply(myfunc)

Here, 'housing' is the dataset.

This gives me the output as:

            id      sqft_living grade
0       7129300520      1180    86447
1       6414100192      2570    86447
2       5631500400      770     86446
3       2487200875      1960    86447
4       1954400510      1680    86448
...         ...          ...    ...
21608   263000018       1530    86448
21609   6600060120      2310    86448
21610   1523300141      1020    86447
21611   291310100       1600    86448
21612   1523300157      1020    86447

I notice here that the last digit of each element of the 'grade' column is the same as in its original value.

However, when I do something like:

def myfunc(x):
    if x<400 and x>0:
        housing['grade'] = ' '
    elif x>400:
        housing['grade'] = '-'
housing['sqft_living'].apply(myfunc)

The code works as intended, and gives the output

            id     sqft_living  grade
0       7129300520     1180      -
1       6414100192     2570      -
2       5631500400      770      -
3       2487200875     1960      -
4       1954400510     1680      -
...        ...          ...     ...
21608   263000018      1530      -
21609   6600060120     2310      -
21610   1523300141     1020      -
21611   291310100      1600      -
21612   1523300157     1020      -

I am unable to understand why the code gives the output shown in the first case, and I'd also like to know how I could accomplish the task.

CodePudding user response:

the last digits of the elements of the 'grade' column are the same as their original value

It's just a coincidence: the repeated add(1) and add(2) calls happen to total a multiple of ten (86440 in your example), so the last digit of each grade is unchanged.

housing['grade'] is the whole column, so every call changes every row. You may want to work on one row at a time instead:

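def myfunc(row):
    # row is a single row of the DataFrame, passed in as a Series
    if 0 < row['sqft_living'] < 400:
        row['grade'] += 1
    elif row['sqft_living'] > 400:
        row['grade'] += 2
    return row  # apply builds the new DataFrame from the returned rows
housing = housing.apply(myfunc, axis=1)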

Or with np.select

import numpy as np

housing['grade'] = np.select(
    [housing['sqft_living'].between(0, 400, inclusive='neither'),
     housing['sqft_living'] > 400],
    [housing['grade'].add(1), housing['grade'].add(2)],
    default=housing['grade']  # keep the grade where neither condition holds
)
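To see why the default argument of np.select matters, here is a minimal sketch on a made-up five-row frame. The values 390 and 400 are hypothetical, added only to exercise the "below 400" and "exactly 400" cases, and inclusive='neither' assumes pandas 1.3 or later.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: 390 and 400 are invented to hit the edge cases.
housing = pd.DataFrame({
    "sqft_living": [1180, 2570, 770, 390, 400],
    "grade": [7, 7, 6, 7, 8],
})

conditions = [
    housing["sqft_living"].between(0, 400, inclusive="neither"),
    housing["sqft_living"] > 400,
]
choices = [housing["grade"] + 1, housing["grade"] + 2]

# Without default, rows matching no condition are filled with 0.
without_default = np.select(conditions, choices)
# With default, those rows keep their original grade.
with_default = np.select(conditions, choices, default=housing["grade"])

print(without_default.tolist())  # [9, 9, 8, 8, 0]
print(with_default.tolist())     # [9, 9, 8, 8, 8]
```

The row with sqft_living equal to 400 matches neither condition, so it is the only one whose result differs between the two calls.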

CodePudding user response:

Given:

           id  sqft_living  grade
0  7129300520         1180      7
1  6414100192         2570      7
2  5631500400          770      6
3  2487200875         1960      7
4  1954400510         1680      8

Doing:

# Note, I use 1500 instead of 400 here so we can see differing output.
# inclusive='neither' keeps the two conditions from overlapping at exactly 1500.
df['grade'] = df['grade'].mask(df['sqft_living'].between(0, 1500, inclusive='neither'), df['grade'].add(1))
df['grade'] = df['grade'].mask(df['sqft_living'].ge(1500), df['grade'].add(2))

Output:

           id  sqft_living  grade
0  7129300520         1180      8
1  6414100192         2570      9
2  5631500400          770      7
3  2487200875         1960      9
4  1954400510         1680     10

Applying this same logic to an apply:

def stuff(row):
    if 0 < row['sqft_living'] < 1500:
        row['grade'] += 1
    elif row['sqft_living'] >= 1500:
        row['grade'] += 2
    return row  # return the row so apply can assemble the result

df = df.apply(stuff, axis=1)

Please note, apply is essentially just a simplified for-loop and will be significantly slower than vectorized methods. Modifying the object passed to the function is also against best practice.
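If you do want to stay with apply, one way to avoid modifying anything inside the function is to have it return the new grade and assign the result to the column. This is only a sketch: new_grade and its threshold parameter are names I made up for illustration, and it uses the same five sample rows and the 1500 cutoff from above.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [7129300520, 6414100192, 5631500400, 2487200875, 1954400510],
    "sqft_living": [1180, 2570, 770, 1960, 1680],
    "grade": [7, 7, 6, 7, 8],
})

def new_grade(row, threshold=1500):
    """Return the adjusted grade for one row without modifying the row."""
    if 0 < row["sqft_living"] < threshold:
        return row["grade"] + 1
    if row["sqft_living"] >= threshold:
        return row["grade"] + 2
    return row["grade"]  # non-positive sqft_living: leave the grade as-is

# apply returns one value per row; assign them back to the column.
df["grade"] = df.apply(new_grade, axis=1)
print(df["grade"].tolist())  # [8, 9, 7, 9, 10]
```

It is still a row-by-row loop, so the vectorized versions remain the better choice for the full dataset.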


Your Function's Explanation:

def myfunc(x):
    print(x, 0<x<1500)
    if x<1500 and x>0:
        print(df['grade'])
        df['grade'] = df['grade'].add(1)
        print(df['grade'])

df['sqft_living'].apply(myfunc)

# Output:

1180 True
0    7
1    7
2    6
3    7
4    8
Name: grade, dtype: int64
0    8
1    8
2    7
3    8
4    9
Name: grade, dtype: int64
2570 False
770 True
0    8
1    8
2    7
3    8
4    9
Name: grade, dtype: int64
0     9
1     9
2     8
3     9
4    10
Name: grade, dtype: int64
1960 False
1680 False
0    None
1    None
2    None
3    None
4    None
Name: sqft_living, dtype: object

Looking at the simplified function's output with the added print statements, we can see what you were doing: checking whether a single integer was between 0 and 400. If that was True, you added one to every value in the whole DataFrame column. If it was greater than 400, you added two to every value in the whole column. This is repeated for every single value in 'sqft_living', so each grade ends up increased by the sum of all those additions.
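The amount that ends up added to every grade can be worked out directly by counting how many values fall into each branch. A rough sketch on the five sample rows, using your original 400 cutoff (total_added is a hypothetical helper name, not a pandas function):

```python
import pandas as pd

sqft = pd.Series([1180, 2570, 770, 1960, 1680])
grade = pd.Series([7, 7, 6, 7, 8])

def total_added(sqft, threshold):
    """Total that the original function adds to EVERY grade."""
    # Each value strictly between 0 and threshold triggers one add(1) call.
    n_plus_one = sqft.between(0, threshold, inclusive="neither").sum()
    # Each value above threshold triggers one add(2) call.
    n_plus_two = (sqft > threshold).sum()
    return int(n_plus_one + 2 * n_plus_two)

added = total_added(sqft, threshold=400)
print(added)                     # 10
print((grade + added).tolist())  # [17, 17, 16, 17, 18]
```

All five values are above 400, so every grade grows by 5 * 2 = 10. Because 10 is a multiple of ten, the last digit of each grade is unchanged, which is the same effect you saw on the full dataset.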
