I'm using the Titanic dataset to learn how to clean data. I'm trying to create a new column and add values to it. The dataset contains two columns for each passenger, 'SibSp' (Sibling & Spouse) and 'Parch' (Parent & Children). I created a new column, 'Family Size', to keep it all in one place.
import pandas as pd
import os
filename = os.path.join(os.path.dirname(__file__),'train.csv')
data = pd.read_csv(filename)
#This is without a function
data['Family Size'] = data['SibSp'] + data['Parch']
print(data)
#This is with a Function
def create_fam_size(data):
    # Family size = siblings/spouses + parents/children
    return data['SibSp'] + data['Parch']
data['Family Size'] = create_fam_size(data)
print(data)
So far everything is fine. Now I want to create another column, 'Is Alone', and populate it with a 1 for Alone and a 0 for Not Alone. I tried creating a function to do it, and it works fine.
def create_is_alone_column(data, colname):
    def is_alone(a):
        if a == 0:
            return 1
        else:
            return 0
    return data[colname].apply(is_alone)
data['Is Alone'] = create_is_alone_column(data, 'Family Size')
print(data)
But when I try to do it without a function, I get a ValueError and am stumped as to why.
data['Is Alone'] = data['Family Size']
if data['Family Size'] == 0:
    data['Is Alone'].apply(0)
else:
    data['Is Alone'].apply(1)
print(data)
And here is the error:
Traceback (most recent call last):
  File "c:/Users/Desktop/tiny.py", line 119, in <module>
    if data['Family Size'] == 0:
  File "C:\Users\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\pandas\core\generic.py", line 1439, in __nonzero__
    raise ValueError(
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
I know it's something ridiculous I'm overlooking, so if anyone can offer me a glimmer of hope, I'll be eternally grateful.
The dataset can be downloaded from here if you want: https://www.kaggle.com/hesh97/titanicdataset-traincsv
CodePudding user response:
You can use:
[...] # "Family Size" is calculated
data["Alone"] = 0
data.loc[data["Family Size"] == 0, "Alone"] = 1
This creates a mask based on the condition data["Family Size"] == 0 and sets the values in column "Alone" to 1 wherever this condition is True.
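As a minimal sketch of the mask approach, assuming a tiny hand-made DataFrame in place of train.csv (the values below are made up for illustration and are not taken from the real dataset):

```python
import pandas as pd

# Hypothetical stand-in for the Titanic data; the values are made up.
data = pd.DataFrame({"SibSp": [1, 0, 0, 3], "Parch": [0, 0, 2, 1]})
data["Family Size"] = data["SibSp"] + data["Parch"]

# Comparing a Series with a scalar gives a boolean Series (the mask).
mask = data["Family Size"] == 0

data["Alone"] = 0            # default: not alone
data.loc[mask, "Alone"] = 1  # overwrite only the rows where the mask is True

print(data)
```

Only the second row has a family size of 0 here, so it should be the only row with "Alone" set to 1.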
apply, however, does not work the way you used it (see the docs): it takes a function as its argument, not a value such as 0 or 1. DataFrame.apply(func, ...) calls func on each column (axis=0) or on each row (axis=1), and Series.apply(func) calls it on each element.
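To illustrate, here is a small sketch using a made-up Series rather than the real dataset. It shows that an if statement on a whole Series raises the ValueError from the question, and that Series.apply expects a callable which it runs once per element:

```python
import pandas as pd

# Made-up family sizes for illustration.
family_size = pd.Series([1, 0, 2, 0], name="Family Size")

# `family_size == 0` is a Series of booleans, not a single True/False,
# so Python cannot decide which branch of the `if` to take.
try:
    if family_size == 0:
        pass
    raised = False
except ValueError:
    raised = True

# apply() takes a function and calls it once per element.
is_alone = family_size.apply(lambda size: 1 if size == 0 else 0)

print(raised)
print(is_alone.tolist())
```

This is essentially what the working function in the question already does; the version without a function fails at the if line, before apply is ever reached.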
You could also do this in one line using numpy:
import numpy as np
data["Alone"] = np.where(data["Family Size"] == 0, 1, 0)
See the np.where docs: the arguments are (condition, x, y), described as "Where True, yield x, otherwise yield y".
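A short sketch of np.where on made-up values (not the real dataset) is below. The astype(int) line is an alternative not mentioned above, included only to show that casting the boolean mask gives the same result:

```python
import numpy as np
import pandas as pd

# Made-up family sizes for illustration.
family_size = pd.Series([1, 0, 2, 0])

# Where the condition is True take 1, otherwise take 0.
alone = np.where(family_size == 0, 1, 0)

# Casting the boolean mask to int produces the same values.
alone_via_cast = (family_size == 0).astype(int)

print(alone.tolist())
print(alone_via_cast.tolist())
```

Note that np.where returns a NumPy array rather than a Series; assigning it to a DataFrame column works as long as its length matches the number of rows.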