I am trying to define a function that adds 1 to an element of the 'grade' column of a DataFrame if the corresponding element of the 'sqft_living' column is less than 400, and adds 2 if the corresponding element of 'sqft_living' is greater than 400. The function is then applied to the DataFrame using the DataFrame.apply() method.
The dataset I am working with is 'House Sales in King County, USA'. Link to the dataset: https://www.kaggle.com/datasets/harlfoxem/housesalesprediction
The 'grade' and 'sqft_living' columns of the dataset look like this:
               id  sqft_living  grade
0      7129300520         1180      7
1      6414100192         2570      7
2      5631500400          770      6
3      2487200875         1960      7
4      1954400510         1680      8
...           ...          ...    ...
21608   263000018         1530      8
21609  6600060120         2310      8
21610  1523300141         1020      7
21611   291310100         1600      8
21612  1523300157         1020      7
The code that I am using is:
def myfunc(x):
    if x<400 and x>0:
        housing['grade'] = housing['grade'].add(1)
    elif x>400:
        housing['grade'] = housing['grade'].add(2)

housing['sqft_living'].apply(myfunc)
Here, 'housing' is the DataFrame holding the dataset.
This gives me the following output:
               id  sqft_living  grade
0      7129300520         1180  86447
1      6414100192         2570  86447
2      5631500400          770  86446
3      2487200875         1960  86447
4      1954400510         1680  86448
...           ...          ...    ...
21608   263000018         1530  86448
21609  6600060120         2310  86448
21610  1523300141         1020  86447
21611   291310100         1600  86448
21612  1523300157         1020  86447
I notice that the last digit of each element of the 'grade' column is the same as its original value.
However, when I do something like:
def myfunc(x):
    if x<400 and x>0:
        housing['grade'] = ' '
    elif x>400:
        housing['grade'] = '-'

housing['sqft_living'].apply(myfunc)
The code works as intended and gives the output:
               id  sqft_living  grade
0      7129300520         1180      -
1      6414100192         2570      -
2      5631500400          770      -
3      2487200875         1960      -
4      1954400510         1680      -
...           ...          ...    ...
21608   263000018         1530      -
21609  6600060120         2310      -
21610  1523300141         1020      -
21611   291310100         1600      -
21612  1523300157         1020      -
I don't understand why the code gives this output in the first case, and I'd also like to know how I could accomplish the task.
CodePudding user response:
"the last digits of the elements of the 'grade' column are the same as their original value"
It's just a coincidence that the repeated add(1) and add(2) calls sum to a multiple of ten, which is 86440 in your example.
housing['grade'] is the whole column, so each call changes every row. You may want to work on one row at a time instead:
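def myfunc(row):
    # Return the new grade for this row; the result is assigned back below.
    if 0 < row['sqft_living'] < 400:
        return row['grade'] + 1
    elif row['sqft_living'] > 400:
        return row['grade'] + 2
    return row['grade']

housing['grade'] = housing.apply(myfunc, axis=1)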
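Or with np.select, passing the original column as the default so that rows matching neither condition keep their grade:
import numpy as np

housing['grade'] = np.select(
    [housing['sqft_living'].between(0, 400, inclusive='neither'),
     housing['sqft_living'] > 400],
    [housing['grade'].add(1), housing['grade'].add(2)],
    default=housing['grade']
)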
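To illustrate why the default argument matters with np.select, here is a minimal sketch on a hypothetical three-row frame (the values are made up for illustration and include a row at exactly 400). Rows that match no condition receive the default, which is 0 unless one is given.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; the middle row matches neither condition.
df = pd.DataFrame({"sqft_living": [390, 400, 1180], "grade": [5, 6, 7]})

sqft = df["sqft_living"]
conditions = [(sqft > 0) & (sqft < 400), sqft > 400]
choices = [df["grade"] + 1, df["grade"] + 2]

# Without a default, unmatched rows fall back to 0.
without_default = np.select(conditions, choices)
# With the original column as default, unmatched rows keep their grade.
with_default = np.select(conditions, choices, default=df["grade"])

print(without_default.tolist())  # [6, 0, 9]
print(with_default.tolist())     # [6, 6, 9]
```

Whether a row at exactly 400 should be left unchanged is an assumption here; the question does not say what should happen in that case.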
CodePudding user response:
Given:
           id  sqft_living  grade
0  7129300520         1180      7
1  6414100192         2570      7
2  5631500400          770      6
3  2487200875         1960      7
4  1954400510         1680      8
Doing:
# Note: I use 1500 instead of 400 here so we can see differing output.
# inclusive='neither' keeps a row at exactly 1500 from matching both conditions.
df['grade'] = df['grade'].mask(df['sqft_living'].between(0, 1500, inclusive='neither'), df['grade'].add(1))
df['grade'] = df['grade'].mask(df['sqft_living'].ge(1500), df['grade'].add(2))
Output:
           id  sqft_living  grade
0  7129300520         1180      8
1  6414100192         2570      9
2  5631500400          770      7
3  2487200875         1960      9
4  1954400510         1680     10
Applying this same logic with apply:
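def stuff(row):
    # Return the new grade for this row instead of assigning inside the function.
    if 0 < row['sqft_living'] < 1500:
        return row['grade'] + 1
    elif row['sqft_living'] >= 1500:
        return row['grade'] + 2
    return row['grade']

df['grade'] = df.apply(stuff, axis=1)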
Please note, apply is essentially just a simplified for-loop and will be significantly slower than vectorized methods. Modifying the DataFrame in place from inside the applied function, as the original code does, is also against best practice.
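As a rough way to check the speed claim on your own machine, the sketch below times a row-wise apply against a vectorized np.select on hypothetical random data. The column ranges, the row count, and the 1500 threshold are assumptions chosen for illustration; timings vary by machine, so none are quoted here.

```python
import time

import numpy as np
import pandas as pd

# Hypothetical random data standing in for the real dataset.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "sqft_living": rng.integers(300, 5000, size=20_000),
    "grade": rng.integers(1, 13, size=20_000),
})

def stuff(row):
    # Row-wise version: returns the new grade for a single row.
    if 0 < row["sqft_living"] < 1500:
        return row["grade"] + 1
    elif row["sqft_living"] >= 1500:
        return row["grade"] + 2
    return row["grade"]

start = time.perf_counter()
by_apply = df.apply(stuff, axis=1)
apply_seconds = time.perf_counter() - start

start = time.perf_counter()
sqft = df["sqft_living"]
# Vectorized version: compute the increment for every row at once.
increment = np.select([(sqft > 0) & (sqft < 1500), sqft >= 1500], [1, 2], default=0)
by_vector = df["grade"] + increment
vector_seconds = time.perf_counter() - start

same = bool((by_apply.to_numpy() == by_vector.to_numpy()).all())
print(f"apply: {apply_seconds:.4f}s, vectorized: {vector_seconds:.4f}s")
print("same result:", same)
```

Both versions should produce identical grades; only the time taken differs.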
Your Function's Explanation:
def myfunc(x):
    print(x, 0<x<1500)
    if x<1500 and x>0:
        print(df['grade'])
        df['grade'] = df['grade'].add(1)
        print(df['grade'])

df['sqft_living'].apply(myfunc)
# Output:
1180 True
0 7
1 7
2 6
3 7
4 8
Name: grade, dtype: int64
0 8
1 8
2 7
3 8
4 9
Name: grade, dtype: int64
2570 False
770 True
0 8
1 8
2 7
3 8
4 9
Name: grade, dtype: int64
0 9
1 9
2 8
3 9
4 10
Name: grade, dtype: int64
1960 False
1680 False
0 None
1 None
2 None
3 None
4 None
Name: sqft_living, dtype: object
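Looking at the output of the simplified function with added print statements (it uses 1500 as the threshold and keeps only the first branch), we can see what your code was doing. For each value in 'sqft_living', it checked whether that single integer was between 0 and 400. If so, it added one to every value in the whole 'grade' column. If the integer was greater than 400, it added two to every value in the whole column. This is repeated for every single value, so the whole column is incremented once per row of the DataFrame rather than each row being incremented once.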
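The arithmetic can be checked with a small sketch. It assumes a hypothetical five-row frame that mirrors the sample rows from the question and uses the original threshold of 400. Every grade should shift by the same total: 1 for each 'sqft_living' value in (0, 400) plus 2 for each value above 400.

```python
import pandas as pd

# Hypothetical five-row sample mirroring the rows shown in the question.
housing = pd.DataFrame({
    "sqft_living": [1180, 2570, 770, 1960, 1680],
    "grade": [7, 7, 6, 7, 8],
})
original = housing["grade"].copy()

def myfunc(x):
    # Same logic as the question: each call updates the WHOLE column.
    if 0 < x < 400:
        housing["grade"] = housing["grade"].add(1)
    elif x > 400:
        housing["grade"] = housing["grade"].add(2)

housing["sqft_living"].apply(myfunc)

# Predicted shift: +1 per value in (0, 400), +2 per value above 400.
sqft = housing["sqft_living"]
small = int(((sqft > 0) & (sqft < 400)).sum())
large = int((sqft > 400).sum())
expected_shift = small * 1 + large * 2

actual_shift = (housing["grade"] - original).tolist()
print(expected_shift)  # 10
print(actual_shift)    # [10, 10, 10, 10, 10]
```

All five sample values exceed 400, so each grade grows by 2 × 5 = 10. The same reasoning explains why, on the full dataset, every row ends up shifted by one identical large amount.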