How to split dataframe into multiple dataframes based on column-name?


I have a dataframe with columns like this:

['id', 't_dur0', 't_dur1', 't_dur2', 't_dance0', 't_dance1', 't_dance2', 't_energy0', 
 't_energy1', 't_energy2']
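
For reference, a minimal frame with this layout can be built as follows (the id, t_dur and t_dance values are copied from the outputs further down; the t_energy values are made up for illustration):

import pandas as pd

df = pd.DataFrame({
    'id':        [22352, 106534],
    't_dur0':    [292720, 213760],
    't_dur1':    [293760.0, 181000.0],
    't_dur2':    [292733.0, 245973.0],
    't_dance0':  [0.549, 0.871],
    't_dance1':  [0.623, 0.702],
    't_dance2':  [0.5190, 0.4160],
    't_energy0': [0.60, 0.70],   # made-up values
    't_energy1': [0.61, 0.71],   # made-up values
    't_energy2': [0.62, 0.72],   # made-up values
})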

And I have code that returns the average of three columns sharing the same prefix:

import re

# Takes in a dataframe with three columns and returns a dataframe with one column of their means as floats
def average_column(dataframe):

    dataframe = dataframe.copy()                  # To avoid SettingWithCopyWarning

    # Create the new column name by stripping the 't_' prefix and the trailing digit, e.g. 't_dance0' -> 'dance'
    temp = dataframe.columns.tolist()[0]
    col_name = re.sub(r'^t_|\d+$', '', temp)

    dataframe[col_name] = dataframe.mean(axis=1)  # Add column to the dataframe (axis=1 means the mean() is applied row-wise)
    mean_df = dataframe.iloc[:, -1:]              # Isolate the mean column by selecting all rows (:) of the last column (-1:)

    print("Original:\n{}\nAverage columns:\n{}".format(dataframe, mean_df))

    return mean_df.astype(float)

This function gives me this output:

Original:
      t_dance0  t_dance1  t_dance2     dance
0        0.549     0.623    0.5190  0.563667
1        0.871     0.702    0.4160  0.663000
2        0.289     0.328    0.2340  0.283667
3        0.886     0.947    0.8260  0.886333
4        0.724     0.791    0.7840  0.766333
...        ...       ...       ...       ...

Average columns:
         dance
0     0.563667
1     0.663000
2     0.283667
3     0.886333
4     0.766333
...        ...
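
For reference, the output above comes from a call along these lines, passing a slice of the three t_dance columns:

dance_df = average_column(df[['t_dance0', 't_dance1', 't_dance2']])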

I asked this question about how to split the frame into unique and duplicate columns, which led me to this code:

# Function that splits a dataframe into two separate dataframes: one with all
# unique columns and one with all duplicates
def sub_dataframes(dataframe):

    # Extract common prefixes -> remove trailing digits, then count each prefix
    cols = dataframe.columns.str.replace(r'\d*$', '', regex=True).to_series().value_counts()

    # Split columns
    unq_cols = cols[cols == 1].index
    dup_cols = dataframe.columns[~dataframe.columns.isin(unq_cols)]  # All columns of dataframe that are not in unq_cols

    return dataframe[unq_cols], dataframe[dup_cols]

unq_df, dup_df = sub_dataframes(df)  # Unpack both frames from a single call
print("Unique columns:\n\n{}\n\nDuplicate columns:\n\n{}".format(unq_df, dup_df))

This gives me the following output:

Unique columns:

          id
0      22352
1     106534
2      23608
3       8655
4      49670
...      ...

Duplicate columns:

      t_dur0    t_dur1    t_dur2  t_dance0  t_dance1  t_dance2
0     292720  293760.0  292733.0     0.549     0.623    0.5190
1     213760  181000.0  245973.0     0.871     0.702    0.4160
2     157124  130446.0  152450.0     0.289     0.328    0.2340 
3     127896  176351.0  166968.0     0.886     0.947    0.8260
4     210320  226253.0  211880.0     0.724     0.791    0.7840
...      ...       ...       ...       ...       ...       ...
2828   70740  262400.0  220680.0     0.224     0.609    0.7110 
2829  252226  222400.0  214973.0     0.526     0.623    0.4820
2830  269146  251560.0  172760.0     0.551     0.756    0.7820
2831  344764  425613.0  249652.0     0.473     0.572    0.8230
2832  210955  339869.0  304124.0     0.112     0.523    0.0679

I have tried to combine these functions in another function that takes in a dataframe and returns it with all duplicate columns replaced by their mean, but I am having trouble splitting dup_df into smaller dataframes. Is there a simpler way to do this?
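
Roughly, what I am aiming for is something like this sketch, built on the two helpers above (the per-prefix grouping is the part I am unsure about):

def replace_duplicates_with_mean(dataframe):
    unq_df, dup_df = sub_dataframes(dataframe)

    # Group the duplicate columns by their shared prefix, e.g. 't_dur0/1/2' -> 't_dur'
    prefixes = dup_df.columns.str.replace(r'\d+$', '', regex=True)
    for prefix in prefixes.unique():
        group = dup_df.loc[:, prefixes == prefix]    # The columns sharing this prefix
        unq_df = unq_df.join(average_column(group))  # Append their row-wise mean

    return unq_df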

An example of the desired output:

Original: 
      total_tracks  t_dur0    t_dur1    t_dur2  t_dance0  t_dance1  t_dance2  \
0                4  292720  293760.0  292733.0     0.549     0.623    0.5190   
1               12  213760  181000.0  245973.0     0.871     0.702    0.4160   
2               59  157124  130446.0  152450.0     0.289     0.328    0.2340   
3                8  127896  176351.0  166968.0     0.886     0.947    0.8260   
4               17  210320  226253.0  211880.0     0.724     0.791    0.7840   
...            ...     ...       ...       ...       ...       ...       ...   

After function: 
   total_tracks  popularity       duration     dance
0             4          21  293071.000000  0.563667
1            12          14  213577.666667  0.663000
2            59          41  146673.333333  0.283667
3             8           1  157071.666667  0.886333
4            17          47  216151.000000  0.766333
...         ...         ...            ...       ...

CodePudding user response:

Use wide_to_long to reshape the original DataFrame first, then aggregate the mean:

import pandas as pd

cols = ['total_tracks']
df1 = (pd.wide_to_long(df,
                       stubnames=['t_dur', 't_dance'],
                       i=cols,
                       j='tmp')
         .reset_index()
         .drop(columns='tmp')                 # The helper index is no longer needed
         .groupby(cols, as_index=False)
         .mean())

print(df1)
   total_tracks          t_dur   t_dance
0             4  293071.000000  0.563667
1             8  157071.666667  0.886333
2            12  213577.666667  0.663000
3            17  216151.000000  0.766333
4            59  146673.333333  0.283667
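
To get the column names shown in the desired output, the aggregated columns can be renamed afterwards (the mapping of t_dur to duration and t_dance to dance is taken from the example above):

df1 = df1.rename(columns={'t_dur': 'duration', 't_dance': 'dance'})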

Details:

cols = ['total_tracks']
print(pd.wide_to_long(df,
                      stubnames=['t_dur', 't_dance'],
                      i=cols,
                      j='tmp'))

                     t_dur  t_dance
total_tracks tmp                   
4            0    292720.0    0.549
12           0    213760.0    0.871
59           0    157124.0    0.289
8            0    127896.0    0.886
17           0    210320.0    0.724
4            1    293760.0    0.623
12           1    181000.0    0.702
59           1    130446.0    0.328
8            1    176351.0    0.947
17           1    226253.0    0.791
4            2    292733.0    0.519
12           2    245973.0    0.416
59           2    152450.0    0.234
8            2    166968.0    0.826
17           2    211880.0    0.784
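
The groupby then averages the three tmp rows (0, 1, 2) that share each total_tracks value, which collapses the reshaped frame above into the five-row result. In isolation, the second step looks like this (long_df is just the reshaped frame after reset_index and the drop of tmp):

long_df = (pd.wide_to_long(df,
                           stubnames=['t_dur', 't_dance'],
                           i=cols,
                           j='tmp')
             .reset_index()
             .drop(columns='tmp'))

print(long_df.groupby(cols, as_index=False).mean())  # One mean row per total_tracks value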