Situation: I'm trying to split one column of a pandas DataFrame into two separate columns without changing the original data, if possible using the .assign() method.
Below produces the expected result but each column requires its own assignment expression and feels like the wrong way of doing it.
import pandas as pd

pets = pd.DataFrame({'observation': ['black,cat', 'brown,dog']})
(
    pets
    .assign(colour=pets['observation'].str.split(',', expand=True)[0],
            animal=pets['observation'].str.split(',', expand=True)[1])
    .drop(columns='observation')
)
Below feels more like the right way: .str.split(..., expand=True) returns several columns, so a list of column names feels like what I should provide.
# throws error
(
    pets
    .assign(colour, animal = pets['observation'].str.split(',', expand=True))
    .drop(columns='observation')
)
# throws error
(
    pets
    .assign([colour, animal] = pets['observation'].str.split(',', expand=True))
    .drop(columns='observation')
)
NameError: name 'colour' is not defined
.assign([colour, animal] = pets['observation'].str.split(',', expand=True))
^
SyntaxError: expression cannot contain assignment, perhaps you meant "=="?
I'm still getting used to working in pandas so any help is appreciated.
CodePudding user response:
You can try:
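def app(s):
    # s is one row; 'observation' already holds the split list
    s['colour'] = s['observation'][0]
    s['animal'] = s['observation'][1]
    return s

# note: this overwrites 'observation' in the original DataFrame
pets['observation'] = pets['observation'].str.split(',')
pets = pets.apply(app, axis=1)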
CodePudding user response:
If you want to create a new DataFrame, you can rename the columns using a simple dictionary:
cols = ['colour', 'animal']
new_df = (
    pets['observation']
    .str.split(',', expand=True)
    .rename(columns=dict(enumerate(cols)))
)
output:
  colour animal
0  black    cat
1  brown    dog
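To make the renaming step concrete, here is a minimal sketch of what dict(enumerate(cols)) produces and why it lines up with the split result. The variable names parts and mapping are only illustrative and are not part of the answer above.

```python
import pandas as pd

pets = pd.DataFrame({'observation': ['black,cat', 'brown,dog']})
cols = ['colour', 'animal']

# With expand=True the result is a DataFrame whose columns are labelled 0, 1, ...
parts = pets['observation'].str.split(',', expand=True)

# enumerate pairs each position with a name: {0: 'colour', 1: 'animal'}
mapping = dict(enumerate(cols))

# rename returns a new DataFrame; pets and parts are left untouched
new_df = parts.rename(columns=mapping)

print(list(parts.columns))   # [0, 1]
print(list(new_df.columns))  # ['colour', 'animal']
```

Because the split result is a DataFrame with integer column labels rather than a list, a position-to-name dictionary is all that rename needs.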
Assuming you want to use a pipeline, you can write a custom function and use pipe:
def split(df):
    df = df.copy()
    cols = ['colour', 'animal']
    df[cols] = df['observation'].str.split(',', expand=True)
    return df

(
    pets
    .pipe(split)
    .drop(columns='observation')
)
NB. This is only a simple pipeline example; of course you can write a more general function with parameters:
def split(df, col_to_split, cols):
    df = df.copy()
    df[cols] = df[col_to_split].str.split(',', expand=True)
    return df

(
    pets
    .pipe(split, col_to_split='observation', cols=['colour', 'animal'])
    .drop(columns='observation')
)
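Since the question asks for the original data to stay unchanged, here is a small self-contained check of the parameterised version. It is only a sketch: the result variable and the printed checks are added for illustration.

```python
import pandas as pd

def split(df, col_to_split, cols):
    # Work on a copy so the caller's DataFrame is not modified
    df = df.copy()
    df[cols] = df[col_to_split].str.split(',', expand=True)
    return df

pets = pd.DataFrame({'observation': ['black,cat', 'brown,dog']})

result = (
    pets
    .pipe(split, col_to_split='observation', cols=['colour', 'animal'])
    .drop(columns='observation')
)

print(list(pets.columns))    # ['observation']
print(list(result.columns))  # ['colour', 'animal']
```

The df.copy() call is what keeps pets intact; without it the column assignment inside the function would modify the DataFrame passed in.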
CodePudding user response:
You can create a dictionary of Series with DataFrame.set_axis and DataFrame.to_dict, then unpack it into assign:
cols = ['colour', 'animal']
d = pets['observation'].str.split(',', expand=True).set_axis(cols, axis=1).to_dict(orient='series')
df1 = pets.assign(**d)
print(df1)
  observation colour animal
0   black,cat  black    cat
1   brown,dog  brown    dog
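The reason this works while the attempts in the question do not is that assign only accepts keyword arguments, and each keyword has to be a single valid name. Unpacking a dictionary with ** supplies one keyword per new column. A minimal sketch of the same idea using a plain dict comprehension (the names parts and new_columns are illustrative):

```python
import pandas as pd

pets = pd.DataFrame({'observation': ['black,cat', 'brown,dog']})
cols = ['colour', 'animal']

# Split once; the resulting columns are labelled 0 and 1
parts = pets['observation'].str.split(',', expand=True)

# Build {'colour': parts[0], 'animal': parts[1]}
new_columns = {name: parts[i] for i, name in enumerate(cols)}

# ** turns the dict into keyword arguments: assign(colour=..., animal=...)
result = pets.assign(**new_columns).drop(columns='observation')

print(list(result.columns))  # ['colour', 'animal']
print(list(pets.columns))    # ['observation']
```

assign returns a new DataFrame, so pets keeps only its original column.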
If modifying the original DataFrame is acceptable, use:
cols = ['colour', 'animal']
pets[cols] = pets['observation'].str.split(',', expand=True)
print(pets)
  observation colour animal
0   black,cat  black    cat
1   brown,dog  brown    dog
If you need a new DataFrame:
cols = ['colour', 'animal']
df = pets['observation'].str.split(',', expand=True).set_axis(cols, axis=1)
print(df)
  colour animal
0  black    cat
1  brown    dog