I am trying to combine a list of projects into a master dataframe and I can't seem to figure out how to merge them together? The frame I generated are of different sizes, but most of the colum names will be the same, with the exception of one or two....
So basically, I am taking a list of project stages like so... (Some of the projects will only have 2 or 3 stages, where others will have 8 or 9 stages..) example:
Stage 1 SUCCESS
stage 2 SUCCESS
stage 3 SUCCESS
stage 4 DELAYED
stage 5 PENDING
and, I generate a dataframe like that below in a python loop...
df
project_name Stage 1 Stage 2
0 project 1 SUCCESS DELAYED
df
project_name Stage 1 Stage 2 Stage 3 Stage 4 Stage 5
0 project-2 NaN NaN NaN NaN NaN
df
project_name Stage 1 Stage 2 Stage 3 Stage 4 Stage 5 Stage 6 Stage 7 Stage 8
0 project-3 NaN NaN STARTED ABANDONED NaN NaN NaN
NaN
However, I can't seem to figure out how to generate a master dataframe containing all the other frames...
# items passed in from other function...
project_data = [('Stage 1','SUCCESS'),('Stage 2','DELAYED')]
project_name = 'project-x'
project_headers = ['Stage 1','Stage 2','Stage 3','Stage 4','Stage 5','Stage 6']
project_displayname = ''
# Create the pandas DataFrame
try:
df
except NameError:
print("Well, 'df' WASN'T defined after all!")
df = pd.DataFrame( columns = project_headers, index=['0'])
else:
df = df.reindex(list(range(0, 1))).reset_index(drop=True)
df['project_name'] = project_name
df.loc[df.project_name == project_name, "project"] = project_displayname
combined_frame = pd.DataFrame(columns = ['project_name']) # empty frame with one colum for merge
for details in project_data:
(item, item_status) = details
if item not in df:
df[item] = np.nan
df.loc[df.project_name == project_name, item] = item_status
print('')
print('')
print(df)
print('')
# Which gives us a generated dataframe.... like so...
#project_name Stage 1 Stage 2 Stage 3 Stage 4 Stage 5 Stage 6 Stage 7 Stage 8
#project-3 NaN NaN STARTED ABANDONED NaN NaN NaN NaN
#final_frame = combined_frame.merge(df, how='left')
try:
final_frame = pd.merge(df, combined_frame, how='outer', left_index=True, right_on=combined_frame.iloc[: , -1])
except IndexError:
final_frame = df.reindex_axis(df.columns.union(combined_frame.columns), axis=1)
print(final_frame)
When I run the code I get the error: Empty DataFrame
Or, I get...
Columns: [project, project_name, Stage 1, Stage 2, Stage 3, Stage 4, Stage 5, Stage 6, Stage 7, Stage 8, Stage 9]
Index: []
Or I get...
Columns: [project, project_name, Stage 1, Stage 2, Stage 3, Stage 4, Stage 5, Stage 6, Stage 7, Stage 8, Stage 9, project_x, project_name_x, Stage 1_x, Stage 2_x, Stage 3_x, Stage 4_x]
Index: []
Can someone point out the erros in my ways? Clearly I am missing something?
I would like to try and get an output like this:
project_name Stage 1 Stage 2 Stage 3 Stage 4 Stage 5 Stage 6 Stage 7 Stage 8
0 project-1 STARTED NaN NaN NaN NaN NaN NaN NaN
1 project-2 STARTED STARTED STARTED DELAYED NaN NaN NaN NaN
2 project-3 NaN NaN STARTED ABANDONED NaN NaN NaN NaN
3 project-4 NaN NaN STARTED ABANDONED NaN STARTED NaN NaN
4 project-5 CANCELED NaN NaN NaN NaN NaN NaN NaN
5 project-6 DELAYED DELAYED STARTED ABANDONED NaN NaN STARTED NaN
Thanks in advance,
E
CodePudding user response:
You can easily build an individual frame from your input data:
# items passed in from other function...
project_data = [('Stage 1','SUCCESS'),('Stage 2','DELAYED')]
project_name = 'project-x'
project_headers = ['Stage 1','Stage 2','Stage 3','Stage 4','Stage 5','Stage 6']
project_displayname = ''
df = pd.DataFrame([dict(project_data)], columns = ['project','project_name']
project_headers)
df.loc[:, ['project', 'project_name']] = [[project_name, project_displayname]]
It will give for df
:
project project_name Stage 1 Stage 2 Stage 3 Stage 4 Stage 5 Stage 6
0 project-x SUCCESS DELAYED NaN NaN NaN NaN
You can then use pd.concat
to concatenate all the individual dataframes. The only limitation is that you must know in advance the names for all the columns (or here the maximum number of stages...)