how to combine multiple columns with similar names?-CodePudding

I have multiple columns with the same name mixed with other columns. Some of these columns that I want to combine have null in the rows. For instance,

  apple_0 apple_1
0     abc    None
1     abc     efg
2     hig    None
3     dsf    None

and I want:

    apple 
0     abc
1     abc, efg
2     hig
3     dsf

I have like 85 of these columns. The actual names are: scheduleSettings_nodes_0_name, scheduleSettings_nodes_1_name and so on How can I combine these?

CodePudding user response：

In addition to the other answers, you could also try using agg something like this.

df = pd.DataFrame({'apple_0': ['abc', 'abc', 'hig', 'dsf'], 'apple_1': [None, 'efg', None, None]})


selected_cols = [col for col in df.columns if col.startswith('apple')]
df['apple'] = df[selected_cols].agg(lambda x: ', '.join(map(str, filter(None, x))), axis=1)

CodePudding user response：

Option 1: using stacking to drop the null, then aggregation per group.

(df.filter(like='apple_')
   .replace('None', pd.NA)
   .stack()
   .groupby(level=0).agg(','.join)
   .reindex(df.index)
   .to_frame('apple')
 )

Option 2: using an internal loop with agg.

(df.filter(like='apple_')
   .replace('None', pd.NA)
   .agg(lambda r: ','.join(x for x in r if pd.notna(x)), axis=1)
   .to_frame('apple')
)

Output:

     apple
0      abc
1  abc,efg
2      hig
3      dsf

CodePudding user response：

You can use df.apply with axis=1 to apply a function to each row. This function can use str.join to join the elements of all columns that match apple_* if the elements are not NA

def join_cols(row, sep, col_names):
    return sep.join(row[c] 
                    for c in col_names 
                    if not pd.isna(row[c]))

cols_to_combine = ["apple_0", "apple_1"]
df["apple"] = df.apply(join_cols, axis=1, args=(",", cols_to_combine))
df.drop(columns=cols_to_combine, inplace=True)

Which gives your desired output:

     apple
0      abc
1  abc,efg
2      hig
3      dsf

To figure out which columns match your pattern, you could do:

cols_to_combine = [c for c in df.columns if c.startswith("apple_")]

There's probably a way to vectorize this, which would be preferred over apply() in terms of speed