Transform DataFrame column of repeated attributes to rows of individual entries and account for sub--CodePudding

I read data from a CSV file and put it into a DataFrame. The data contains rows of attributes that are related to an element and need to be transformed to columns instead. For each element, there is also a sub-table that I need to account for.

Example DataFrame:

df = pd.DataFrame([["Drop-Down Field Name:", "Test List"],
                   ["Drop-Down List Name:", "Test List"],
                   ["Drop-Down List Type:", "Specific"],
                   ["Drop-Down List Status:", "Active"],
                   ["Drop-Down List Values:", "Text", "Active", "Default", "Weight", "Image"],
                   [None, "DDL V1", "Yes", "Yes", "1.00", None],
                   [None, "DDL V2", "Yes", None, "2.00", None],
                   [None, "DDL V3", "Yes", None, "3.00", None],
                   [None, "DDL V4", "Yes", None, "4.00", None],
                   [None, "DDL V5", "Yes", None, "5.00", None],
                   ["Drop-Down Field Name:", "Test Empty List"],
                   ["Drop-Down List Name:", "Test Empty List"],
                   ["Drop-Down List Type:", "Specific"],
                   ["Drop-Down List Status:", "Active"],
                   ["Drop-Down List Values:", "Text", "Active", "Default", "Weight", "Image"],
                   ["Drop-Down Field Name:", "Email verified?"],
                   ["Drop-Down List Name:", "Yes/No"],
                   ["Drop-Down List Type:", "Reusable"],
                   ["Drop-Down List Status:", "Active"],
                   ["Drop-Down List Values:", "Text", "Active", "Default", "Weight", "Image"],
                   [None, "Yes", "Yes", None, "1.00", "green_checkmark.jpg"],
                   [None, "No", "Yes", None, "0", "red_cross.jpg"],
                   ["Drop-Down Field Name:", "Payment verified?"],
                   ["Drop-Down List Name:", "Yes/No"],
                   ["Drop-Down List Type:", "Reusable"],
                   ["Drop-Down List Status:", "Active"],
                   ["Drop-Down List Values:", "Text", "Active", "Default", "Weight", "Image"],
                   [None, "Yes", "Yes", None, "1.00", "green_checkmark.jpg"],
                   [None, "No", "Yes", None, "0", "red_cross.jpg"],
                   ["Drop-Down Field Name:", "How many years expected from now?"],
                   ["Drop-Down List Name:", "How many years expected from now?"],
                   ["Drop-Down List Type:", "Specific"],
                   ["Drop-Down List Status:", "Active"],
                   ["Drop-Down List Values:", "Text", "Active", "Default", "Weight", "Image"],
                   [None, "One", None, None, None, None],
                   [None, "Two", None, None, None, None],
                   [None, "Three", None, None, None, None],
                   [None, "Four", None, None, None, None],
                   [None, "Five", None, None, None, None],
                   [None, "1", "Yes", None, None, None],
                   [None, "2", "Yes", None, None, None],
                   [None, "3", "Yes", None, None, None],
                   [None, "4", "Yes", None, None, None],
                   [None, "5", "Yes", None, None, None],
                   [None, "6-10", "Yes", None, None, None],
                   [None, "11-15", "Yes", None, None, None],
                   [None, "16-20", "Yes", None, None, None],
                   [None, "20 ", "Yes", None, None, None]])

The sub-tables in the table have their cells in bold. They are not equal in number of entries and can have no entries at all.

	0	1	2	3	4	5
0	Drop-Down Field Name:	Test List	None	None	None	None
1	Drop-Down List Name:	Test List	None	None	None	None
2	Drop-Down List Type:	Specific	None	None	None	None
3	Drop-Down List Status:	Active	None	None	None	None
4	Drop-Down List Values:	Text	Active	Default	Weight	Image
5	None	DDL V1	Yes	Yes	1.00	None
6	None	DDL V2	Yes	None	2.00	None
7	None	DDL V3	Yes	None	3.00	None
8	None	DDL V4	Yes	None	4.00	None
9	None	DDL V5	Yes	None	5.00	None
10	Drop-Down Field Name:	Test Empty List	None	None	None	None
11	Drop-Down List Name:	Test Empty List	None	None	None	None
12	Drop-Down List Type:	Specific	None	None	None	None
13	Drop-Down List Status:	Active	None	None	None	None
14	Drop-Down List Values:	Text	Active	Default	Weight	Image
15	Drop-Down Field Name:	Email verified?	None	None	None	None
16	Drop-Down List Name:	Yes/No	None	None	None	None
17	Drop-Down List Type:	Reusable	None	None	None	None
18	Drop-Down List Status:	Active	None	None	None	None
19	Drop-Down List Values:	Text	Active	Default	Weight	Image
20	None	Yes	Yes	None	1.00	green_checkmark.jpg
21	None	No	Yes	None	0	red_cross.jpg
22	Drop-Down Field Name:	Payment verified?	None	None	None	None
23	Drop-Down List Name:	Yes/No	None	None	None	None
24	Drop-Down List Type:	Reusable	None	None	None	None
25	Drop-Down List Status:	Active	None	None	None	None
26	Drop-Down List Values:	Text	Active	Default	Weight	Image
27	None	Yes	Yes	None	1.00	green_checkmark.jpg
28	None	No	Yes	None	0	red_cross.jpg
29	Drop-Down Field Name:	How many years expected from now?	None	None	None	None
30	Drop-Down List Name:	How many years expected from now?	None	None	None	None
31	Drop-Down List Type:	Specific	None	None	None	None
32	Drop-Down List Status:	Active	None	None	None	None
33	Drop-Down List Values:	Text	Active	Default	Weight	Image
34	None	One	None	None	None	None
35	None	Two	None	None	None	None
36	None	Three	None	None	None	None
37	None	Four	None	None	None	None
38	None	Five	None	None	None	None
39	None	1	Yes	None	None	None
40	None	2	Yes	None	None	None
41	None	3	Yes	None	None	None
42	None	4	Yes	None	None	None
43	None	5	Yes	None	None	None
44	None	6-10	Yes	None	None	None
45	None	11-15	Yes	None	None	None
46	None	16-20	Yes	None	None	None
47	None	20	Yes	None	None	None

My Approach:

For each sub-table:
1. Find start and end locations
2. Transpose
3. Replace with the transpose in main table
Insert a column that performs groupby() and cumcount() on element attributes
Pivot the table

I am relatively new to pandas, so I'm more than open to hear better approaches :) But here is my code so far:

ddlv_locations = df.where(df.eq("Drop-Down List Values:")).stack().index.tolist()
result = df.copy()

# Get sub-table, transpose it, and concatenate with main table
for ddlv_location in ddlv_locations:
    x_start = list(ddlv_location)[0]
    x_end = list(ddlv_location)[0]
    y_start = list(ddlv_location)[1]
    
    # Capture the row index where the sub-table ends
    while(x_end   1 < df.shape[0] and pd.isna(df.loc[x_end   1, y_start])):
        x_end = x_end   1
        
    ddlv = df.loc[x_start:x_end, y_start   1:df.shape[1]].T     # Transpose the sub-table
    ddlv.columns = range(ddlv.shape[1])     # Reset column index
    
    # Concatenate sub-table into main table whilst removing the original rows for the sub-table
    result = pd.concat([result.loc[:x_start - 1], ddlv]) if x_end == df.shape[0] - 1 else pd.concat([result.loc[:x_start - 1], ddlv, result.loc[x_end   1:]])


result.insert(0, "count", result.groupby(0).cumcount())
result = result.reset_index(drop=True)

At this point, I can't wrap my head around how I would pivot the table and ensure the columns for the sub-tables are captured by individual rows in the final result.

Desired Result:

	0	1	2	3	4	5	6	7	8
0	Drop-Down Field Name:	Drop-Down List Name:	Drop-Down List Type:	Drop-Down List Status:	Text	Active	Default	Weight	Image
1	Test List	Test List	Specific	Active	DDL V1	Yes	Yes	1.00	None
2	Test List	Test List	Specific	Active	DDL V2	Yes	None	2.00	None
3	Test List	Test List	Specific	Active	DDL V3	Yes	None	3.00	None
4	Test List	Test List	Specific	Active	DDL V4	Yes	None	4.00	None
5	Test List	Test List	Specific	Active	DDL V5	Yes	None	5.00	None
6	Test Empty List	Test Empty List	Specific	Active	None	None	None	None	None
7	Email verified?	Yes/No	Reusable	Active	Yes	Yes	None	1.00	green_checkmark.jpg
8	Email verified?	Yes/No	Reusable	Active	No	None	None	0	red_cross.jpg
9	Payment verified?	Yes/No	Reusable	Active	Yes	Yes	None	1.00	green_checkmark.jpg
10	Payment verified?	Yes/No	Reusable	Active	No	None	None	0	red_cross.jpg
11	How many years expected from now?	How many years expected from now?	Specific	Active	One	None	None	None	None
12	How many years expected from now?	How many years expected from now?	Specific	Active	Two	None	None	None	None
13	How many years expected from now?	How many years expected from now?	Specific	Active	Three	None	None	None	None
14	How many years expected from now?	How many years expected from now?	Specific	Active	Four	None	None	None	None
15	How many years expected from now?	How many years expected from now?	Specific	Active	Five	None	None	None	None
16	How many years expected from now?	How many years expected from now?	Specific	Active	1	Yes	None	None	None
17	How many years expected from now?	How many years expected from now?	Specific	Active	2	Yes	None	None	None
18	How many years expected from now?	How many years expected from now?	Specific	Active	3	Yes	None	None	None
19	How many years expected from now?	How many years expected from now?	Specific	Active	4	Yes	None	None	None
20	How many years expected from now?	How many years expected from now?	Specific	Active	5	Yes	None	None	None
21	How many years expected from now?	How many years expected from now?	Specific	Active	6-10	Yes	None	None	None
22	How many years expected from now?	How many years expected from now?	Specific	Active	11-15	Yes	None	None	None
23	How many years expected from now?	How many years expected from now?	Specific	Active	15-20	Yes	None	None	None
24	How many years expected from now?	How many years expected from now?	Specific	Active	20	Yes	None	None	None

CodePudding user response：

def reorganize(gr):
    left = gr.iloc[:4, :2].set_index(0).T
    right = (
        gr.iloc[4:, 1:]
        .reset_index(drop=True)
        .T.set_index(0).T
    )
    # note that indexing in both frames starts from 1
    return left.join(right, how='outer')

start_marker = "Drop-Down Field Name:"
grouper = (df.iloc[:, 0] == start_marker).cumsum()

result = df.groupby(grouper).apply(reorganize).reset_index(drop=True)
result.iloc[:, :4] = result.iloc[:, :4].fillna(method='pad')

p.s. The code works on the assumption that columns of df have no specific names, i.e. they are indexed by default, starting from zero (sort of 0, 1, 2, 3, ...)