I have a dataset of ~700 columns. I'd like to join all columns into a single column.
Input:
id | A | B | C | D | E | F | ... | Z
0 | yes | no | yes | no | yes| no | ... | no
1 | no | no | yes | no | no | no | ... | no
2 | yes | yes| yes | yes| yes| no | ... | no
Output:
id | A | B | C | D | E | F | ... | Z | joined_column
0 | yes | no | yes | no | yes| no | ... | no | yes no yes no yes no ... no
1 | no | no | yes | no | no | no | ... | no | no no yes no no no ... no
2 | yes | yes| yes | yes| yes| no | ... | no | yes yes yes yes yes no ... no
I've used the following in the past. However, I'm looking for a way to scale this when you're working with a large amount of columns.
def join(df):
list = []
for i in range(0, df.shape[0]):
list.append( str(df['A'][i]) ' ' str(df['B'][i]) ' ' str(df['C'][i]))
return list
CodePudding user response:
Given df
:
A B C D E F
0 yes no yes no yes no
1 no no yes no no no
2 yes yes yes yes yes no
Doing:
cols = df.columns
# As a string:
df['joined_column_str'] = df[cols].agg(' '.join, axis=1)
# As a list:
df['joined_column_list'] = df[cols].agg(list, axis=1)
Output:
A B C D E F joined_column_str joined_column_list
0 yes no yes no yes no yes no yes no yes no [yes, no, yes, no, yes, no]
1 no no yes no no no no no yes no no no [no, no, yes, no, no, no]
2 yes yes yes yes yes no yes yes yes yes yes no [yes, yes, yes, yes, yes, no]
CodePudding user response:
Perhaps you can use a list comprehension:
df['joined_column'] = [' '.join(item) for item in df.iloc[:,1:].values]
id A B C D E F Z joined_column
0 0 yes no yes no yes no no yes no yes no yes no no
1 1 no no yes no no no no no no yes no no no no
2 2 yes yes yes yes yes no no yes yes yes yes yes no no
CodePudding user response:
You can use agg
to aggregate data over a specific axis using a specific operation:
df['joined_column'] = df.agg(' '.join, axis=1)
Potentially use
df['joined'] = df.iloc[:,1:].agg(' '.join, axis=1)
if you don't want to join the first (or any other column).
I used ' '.join
to join the values with a space as separator.
axis
is set to 1
because you want to join columns and not rows.
CodePudding user response:
Considering the dataframe df
id A B ... Y Z joined_column
0 0 yes no ... yes no yes no yes no yes no yes no yes no yes no yes ...
1 1 no no ... no no no no yes no no no no no no no no no no no no ...
2 2 yes yes ... yes yes yes yes yes yes yes no yes yes yes yes yes yes...
One can use a custom lambda function inside pandas.DataFrame.apply
as follows
df['joined_column'] = df.apply(lambda x: ' '.join(x.astype(str)), axis=1)
[Out]:
id A B ... Y Z joined_column
0 0 yes no ... yes no 0 yes no yes no yes no yes no yes no yes no ye...
1 1 no no ... no no 1 no no yes no no no no no no no no no no no n...
2 2 yes yes ... yes yes 2 yes yes yes yes yes no yes yes yes yes yes y...
However, as one doesn't want the id
to appear in the joined_column
, one can add x.drop('id')
as
df['joined_column'] = df.apply(lambda x: ' '.join(x.drop('id').astype(str)), axis=1)
[Out]:
id A B ... Y Z joined_column
0 0 yes no ... yes no yes no yes no yes no yes no yes no yes no yes ...
1 1 no no ... no no no no yes no no no no no no no no no no no no ...
2 2 yes yes ... yes yes yes yes yes yes yes no yes yes yes yes yes yes...
Alternatively, for this last case, one can also leverage Numpy, more specifically numpy.array
as follows
df['joined_column'] = np.array([' '.join(x.astype(str)) for x in df.drop('id', axis=1).values])
[Out]:
id A B ... Y Z joined_column
0 0 yes no ... yes no yes no yes no yes no yes no yes no yes no yes ...
1 1 no no ... no no no no yes no no no no no no no no no no no no ...
2 2 yes yes ... yes yes yes yes yes yes yes no yes yes yes yes yes yes...