When creating a new column, why does it work with a function but not without it?

Time:10-02

I'm using the Titanic dataset to learn how to clean data. I'm trying to create a new column and add values to it. The dataset contains two columns for each passenger, 'SibSp' (Sibling & Spouse) and 'Parch' (Parent & Children). I created a new column, 'Family Size', to hold their sum in one place.

import pandas as pd
import os

filename = os.path.join(os.path.dirname(__file__),'train.csv')
data = pd.read_csv(filename)

#This is without a function
data['Family Size'] = data['SibSp'] + data['Parch']
print(data)

#This is with a Function
def create_fam_size(data):
    return data['SibSp'] + data['Parch']
data['Family Size'] = create_fam_size(data)
print(data)

So far everything is fine. Now, I want to create another column 'Is Alone' and populate it with a 1 for Alone and a 0 for Not Alone. I tried creating a function to do it, and it works fine.

def create_is_alone_column(data, colname):
    def is_alone(a):
        if a == 0:
            return 1
        else:
            return 0
    return data[colname].apply(is_alone)
data['Is Alone'] = create_is_alone_column(data, 'Family Size')
print(data)

But when I try to do it without a function, I get a ValueError and am stumped as to why.

data['Is Alone'] = data['Family Size']
if data['Family Size'] == 0:
    data['Is Alone'].apply(0)
else:
    data['Is Alone'].apply(1)
print(data)

And here is the error:

Traceback (most recent call last):
  File "c:/Users/Desktop/tiny.py", line 119, in <module>
    if data['Family Size'] == 0:
  File "C:\Users\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\pandas\core\generic.py", line 1439, in __nonzero__
    raise ValueError(
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

I know it's something ridiculous I'm overlooking, so if anyone can offer me a glimmer of hope, I'll be eternally grateful.

The dataset can be downloaded from here if you want: https://www.kaggle.com/hesh97/titanicdataset-traincsv

Answer:

You can use:

[...] # "Family Size" is calculated

data["Alone"] = 0
data.loc[data["Family Size"] == 0, "Alone"] = 1

This creates a boolean mask from the condition data["Family Size"] == 0 and sets column "Alone" to 1 in every row where that condition is True. All other rows keep the default of 0.
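To make the mask concrete, here is a minimal sketch on a made-up four-row DataFrame. The name toy and its values are hypothetical stand-ins for the Titanic data; only the column names come from the thread.

```python
import pandas as pd

# Hypothetical stand-in for the Titanic data; the values are made up.
toy = pd.DataFrame({"Family Size": [0, 1, 0, 3]})

# The comparison runs element-wise and yields a boolean Series (the mask).
mask = toy["Family Size"] == 0

toy["Alone"] = 0              # default: not alone
toy.loc[mask, "Alone"] = 1    # overwrite only the rows where the mask is True

print(mask.tolist())          # [True, False, True, False]
print(toy["Alone"].tolist())  # [1, 0, 1, 0]
```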

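However, apply does not work the way you used it (see the docs). Both Series.apply(func) and DataFrame.apply(func, ...) take a function as input, not a value like 0 or 1. Series.apply calls the function on each element; DataFrame.apply calls it on each column (axis=0) or each row (axis=1). The ValueError itself comes from the if statement: data['Family Size'] == 0 is evaluated for every row and returns a Series of booleans, and if cannot reduce a whole Series to a single True or False.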
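The following sketch reproduces the error in isolation and shows apply being given a function, as it expects. The Series family_size and its values are hypothetical; the lambda is just a compact version of the is_alone helper from the question.

```python
import pandas as pd

# Hypothetical values standing in for data['Family Size'].
family_size = pd.Series([0, 1, 0, 3], name="Family Size")

# `if` needs a single True/False, but the comparison yields one boolean per row,
# so pandas raises ValueError instead of guessing.
try:
    if family_size == 0:
        pass
    raised = False
except ValueError:
    raised = True

# Series.apply expects a callable and calls it once per element.
is_alone = family_size.apply(lambda size: 1 if size == 0 else 0)

print(raised)             # True
print(is_alone.tolist())  # [1, 0, 1, 0]
```

This is why the function-based version in the question works: inside is_alone, the argument a is a single number, so a == 0 is an ordinary boolean.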

You could do this in one line using numpy:

import numpy as np

data["Alone"] = np.where(data["Family Size"] == 0, 1, 0)

See the numpy.where docs: the signature is (condition, x, y), which reads as "where the condition is True, yield x, otherwise yield y".
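As a rough check, the np.where form can be tried on a small made-up frame. The name toy, its values, and the extra column "Alone (cast)" are hypothetical; the cast of the boolean mask to int is shown only as an equivalent alternative, not something proposed in the thread.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Titanic data; the values are made up.
toy = pd.DataFrame({"Family Size": [0, 1, 0, 3]})

# Where the condition is True take 1, otherwise take 0.
toy["Alone"] = np.where(toy["Family Size"] == 0, 1, 0)

# Equivalent for the 1/0 case: cast the boolean mask to integers.
toy["Alone (cast)"] = (toy["Family Size"] == 0).astype(int)

print(toy["Alone"].tolist())         # [1, 0, 1, 0]
print(toy["Alone (cast)"].tolist())  # [1, 0, 1, 0]
```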
