Home > OS >  Separate dataframe into multiple ones and add them as columns
Separate dataframe into multiple ones and add them as columns

Time:10-10

my dataset look similiar to this (but with a couple of more rows):

Raw data

The aim is to get this:

Aim

What I tried to do is:

# Identify names that are in the dataset
names = df['name'].unique().tolist()

# Define dataframe with first name
df1 = pd.DataFrame()
df1 = df[(df == names[0]).any(axis=1)]
df1 = df1.drop(['name'], axis=1)
df1 = df1.rename({'color':'color_' str(names[0]), 'number':'number_' str(names[0])}, axis=1)

# Make dataframes with other names and their corresponding color and number, add them to df1
df_merged = pd.DataFrame()
for i in range(1, len(names)):
    df2 = pd.DataFrame()
    df2 = df[(df == names[i]).any(axis=1)]
    df2 = df2.drop(['name'], axis=1)
    df2 = df2.rename({'color':'color_' str(names[i]), 'number':'number_' str(names[i])}, axis=1)
    df_merged = df1.join(df2, lsuffix="_left", rsuffix="_right", how='left')

In the end I get this result for df_merged:

Result

As you can see the columns color_Donald and number_Donald are missing. Does anyone know why and how to improve the code? It seems as if the loop somehow skips or overwrites Donald.

Thanks in advance!

CodePudding user response:

sample df

import pandas as pd


data = {'name': {'2020-01-01 00:00:00': 'Justin', '2020-01-02 00:00:00': 'Justin', '2020-01-03 00:00:00': 'Donald'}, 'color': {'2020-01-01 00:00:00': 'blue', '2020-01-02 00:00:00': 'red', '2020-01-03 00:00:00': 'green'}, 'number': {'2020-01-01 00:00:00': 1, '2020-01-02 00:00:00': 2, '2020-01-03 00:00:00': 9}}
df = pd.DataFrame(data)
print(f"{df}\n")

                       name  color  number
2020-01-01 00:00:00  Justin   blue       1
2020-01-02 00:00:00  Justin    red       2
2020-01-03 00:00:00  Donald  green       9

final df

df = (
    df
    .reset_index(names="date")
    .pivot(index="date", columns="name", values=["color", "number"])
    .fillna("")
)
df.columns = ["_".join(x) for x in df.columns.values]
print(df)

                    color_Donald color_Justin number_Donald number_Justin
date                                                                     
2020-01-01 00:00:00                      blue                           1
2020-01-02 00:00:00                       red                           2
2020-01-03 00:00:00        green                          9              

CodePudding user response:

The problem is the line:

df_merged = df1.join(df2, lsuffix="_left", rsuffix="_right", how='left')

where df_merged will be set in the loop always to the join of df1 with current df2.

The result after the loop is therefore a join of df1 with the last df2 and the Donald gets lost in this process.

To fix this problem first join empty df_merged with df1 and then in the loop join df_merged with df2.

Here the full code with the changes (not tested):

# Identify names that are in the dataset
names = df['name'].unique().tolist()

# Define dataframe with first name
df1 = pd.DataFrame()
df1 = df[(df == names[0]).any(axis=1)]
df1 = df1.drop(['name'], axis=1)
df1 = df1.rename({'color':'color_' str(names[0]), 'number':'number_' str(names[0])}, axis=1)

# Make dataframes with other names and their corresponding color and number, add them to df1
df_merged = pd.DataFrame()
df_merged = df_merged.join(df1) # <- add options if necessary
for i in range(1, len(names)):
    df2 = pd.DataFrame()
    df2 = df[(df == names[i]).any(axis=1)]
    df2 = df2.drop(['name'], axis=1)
    df2 = df2.rename({'color':'color_' str(names[i]), 'number':'number_' str(names[i])}, axis=1)
    # join the current df2 to df_merged: 
    df_merged = df_merged.join(df2, lsuffix="_left", rsuffix="_right", how='left')
  • Related