Home > Enterprise >  How does one combine a dataframe of different sizes....?
How does one combine a dataframe of different sizes....?

Time:03-06

I am trying to combine a list of projects into a master dataframe and I can't seem to figure out how to merge them together? The frame I generated are of different sizes, but most of the colum names will be the same, with the exception of one or two....

So basically, I am taking a list of project stages like so... (Some of the projects will only have 2 or 3 stages, where others will have 8 or 9 stages..) example:

Stage 1 SUCCESS
stage 2 SUCCESS
stage 3 SUCCESS
stage 4 DELAYED
stage 5 PENDING

and, I generate a dataframe like that below in a python loop...

df

       project_name    Stage 1    Stage 2     
0      project 1       SUCCESS    DELAYED

df

       project_name    Stage 1    Stage 2    Stage 3    Stage 4   Stage 5 
0      project-2       NaN        NaN        NaN        NaN       NaN

df

       project_name    Stage 1    Stage 2    Stage 3    Stage 4   Stage 5   Stage 6    Stage 7   Stage 8
0      project-3       NaN        NaN        STARTED    ABANDONED NaN       NaN        NaN       
    NaN 

However, I can't seem to figure out how to generate a master dataframe containing all the other frames...

# items passed in from other function...
project_data = [('Stage 1','SUCCESS'),('Stage 2','DELAYED')]
project_name = 'project-x' 
project_headers = ['Stage 1','Stage 2','Stage 3','Stage 4','Stage 5','Stage 6']
project_displayname = ''

# Create the pandas DataFrame
try:
    df
except NameError:
    print("Well, 'df' WASN'T defined after all!")
    df = pd.DataFrame( columns = project_headers, index=['0'])
else:
    df = df.reindex(list(range(0, 1))).reset_index(drop=True)
    df['project_name'] = project_name
    df.loc[df.project_name == project_name, "project"] = project_displayname


combined_frame = pd.DataFrame(columns = ['project_name']) # empty frame with one colum for merge
for details in project_data:
    (item, item_status) = details
    if item not in df:
        df[item] = np.nan
    df.loc[df.project_name == project_name, item] = item_status
    print('')
    print('')
    print(df)  
    print('')
# Which gives us a generated dataframe.... like so... 
#project_name    Stage 1    Stage 2    Stage 3    Stage 4   Stage 5   Stage 6    Stage 7   Stage 8
#project-3       NaN        NaN        STARTED    ABANDONED NaN       NaN        NaN       NaN

    #final_frame = combined_frame.merge(df, how='left')
    try:
        final_frame = pd.merge(df, combined_frame, how='outer', left_index=True, right_on=combined_frame.iloc[: , -1])
    except IndexError:
        final_frame = df.reindex_axis(df.columns.union(combined_frame.columns), axis=1)

print(final_frame)

When I run the code I get the error: Empty DataFrame

Or, I get...

Columns: [project, project_name, Stage 1, Stage 2, Stage 3, Stage 4, Stage 5, Stage 6, Stage 7, Stage 8, Stage 9]
Index: []

Or I get...

Columns: [project, project_name, Stage 1, Stage 2, Stage 3, Stage 4, Stage 5, Stage 6, Stage 7, Stage 8, Stage 9, project_x, project_name_x, Stage 1_x, Stage 2_x, Stage 3_x, Stage 4_x]
Index: []

Can someone point out the erros in my ways? Clearly I am missing something?

I would like to try and get an output like this:

   project_name    Stage 1    Stage 2    Stage 3    Stage 4   Stage 5   Stage 6    Stage 7   Stage 8
0  project-1       STARTED    NaN        NaN        NaN       NaN       NaN        NaN       NaN
1  project-2       STARTED    STARTED    STARTED    DELAYED   NaN       NaN        NaN       NaN
2  project-3       NaN        NaN        STARTED    ABANDONED NaN       NaN        NaN       NaN
3  project-4       NaN        NaN        STARTED    ABANDONED NaN       STARTED    NaN       NaN
4  project-5       CANCELED   NaN        NaN        NaN       NaN       NaN        NaN       NaN
5  project-6       DELAYED    DELAYED    STARTED    ABANDONED NaN       NaN        STARTED    NaN 

Thanks in advance,

E

CodePudding user response:

You can easily build an individual frame from your input data:

# items passed in from other function...
project_data = [('Stage 1','SUCCESS'),('Stage 2','DELAYED')]
project_name = 'project-x' 
project_headers = ['Stage 1','Stage 2','Stage 3','Stage 4','Stage 5','Stage 6']
project_displayname = ''

df = pd.DataFrame([dict(project_data)], columns = ['project','project_name']
                    project_headers)
df.loc[:, ['project', 'project_name']] = [[project_name, project_displayname]]

It will give for df:

     project project_name  Stage 1  Stage 2  Stage 3  Stage 4  Stage 5  Stage 6
0  project-x               SUCCESS  DELAYED      NaN      NaN      NaN      NaN

You can then use pd.concat to concatenate all the individual dataframes. The only limitation is that you must know in advance the names for all the columns (or here the maximum number of stages...)

  • Related