Home > Blockchain >  Transform DataFrame column of repeated attributes to rows of individual entries and account for sub-
Transform DataFrame column of repeated attributes to rows of individual entries and account for sub-

Time:08-21

I read data from a CSV file and put it into a DataFrame. The data contains rows of attributes that are related to an element and need to be transformed to columns instead. For each element, there is also a sub-table that I need to account for.

Example DataFrame:

df = pd.DataFrame([["Drop-Down Field Name:", "Test List"],
                   ["Drop-Down List Name:", "Test List"],
                   ["Drop-Down List Type:", "Specific"],
                   ["Drop-Down List Status:", "Active"],
                   ["Drop-Down List Values:", "Text", "Active", "Default", "Weight", "Image"],
                   [None, "DDL V1", "Yes", "Yes", "1.00", None],
                   [None, "DDL V2", "Yes", None, "2.00", None],
                   [None, "DDL V3", "Yes", None, "3.00", None],
                   [None, "DDL V4", "Yes", None, "4.00", None],
                   [None, "DDL V5", "Yes", None, "5.00", None],
                   ["Drop-Down Field Name:", "Test Empty List"],
                   ["Drop-Down List Name:", "Test Empty List"],
                   ["Drop-Down List Type:", "Specific"],
                   ["Drop-Down List Status:", "Active"],
                   ["Drop-Down List Values:", "Text", "Active", "Default", "Weight", "Image"],
                   ["Drop-Down Field Name:", "Email verified?"],
                   ["Drop-Down List Name:", "Yes/No"],
                   ["Drop-Down List Type:", "Reusable"],
                   ["Drop-Down List Status:", "Active"],
                   ["Drop-Down List Values:", "Text", "Active", "Default", "Weight", "Image"],
                   [None, "Yes", "Yes", None, "1.00", "green_checkmark.jpg"],
                   [None, "No", "Yes", None, "0", "red_cross.jpg"],
                   ["Drop-Down Field Name:", "Payment verified?"],
                   ["Drop-Down List Name:", "Yes/No"],
                   ["Drop-Down List Type:", "Reusable"],
                   ["Drop-Down List Status:", "Active"],
                   ["Drop-Down List Values:", "Text", "Active", "Default", "Weight", "Image"],
                   [None, "Yes", "Yes", None, "1.00", "green_checkmark.jpg"],
                   [None, "No", "Yes", None, "0", "red_cross.jpg"],
                   ["Drop-Down Field Name:", "How many years expected from now?"],
                   ["Drop-Down List Name:", "How many years expected from now?"],
                   ["Drop-Down List Type:", "Specific"],
                   ["Drop-Down List Status:", "Active"],
                   ["Drop-Down List Values:", "Text", "Active", "Default", "Weight", "Image"],
                   [None, "One", None, None, None, None],
                   [None, "Two", None, None, None, None],
                   [None, "Three", None, None, None, None],
                   [None, "Four", None, None, None, None],
                   [None, "Five", None, None, None, None],
                   [None, "1", "Yes", None, None, None],
                   [None, "2", "Yes", None, None, None],
                   [None, "3", "Yes", None, None, None],
                   [None, "4", "Yes", None, None, None],
                   [None, "5", "Yes", None, None, None],
                   [None, "6-10", "Yes", None, None, None],
                   [None, "11-15", "Yes", None, None, None],
                   [None, "16-20", "Yes", None, None, None],
                   [None, "20 ", "Yes", None, None, None]])

The sub-tables in the table have their cells in bold. They are not equal in number of entries and can have no entries at all.

0 1 2 3 4 5
0 Drop-Down Field Name: Test List None None None None
1 Drop-Down List Name: Test List None None None None
2 Drop-Down List Type: Specific None None None None
3 Drop-Down List Status: Active None None None None
4 Drop-Down List Values: Text Active Default Weight Image
5 None DDL V1 Yes Yes 1.00 None
6 None DDL V2 Yes None 2.00 None
7 None DDL V3 Yes None 3.00 None
8 None DDL V4 Yes None 4.00 None
9 None DDL V5 Yes None 5.00 None
10 Drop-Down Field Name: Test Empty List None None None None
11 Drop-Down List Name: Test Empty List None None None None
12 Drop-Down List Type: Specific None None None None
13 Drop-Down List Status: Active None None None None
14 Drop-Down List Values: Text Active Default Weight Image
15 Drop-Down Field Name: Email verified? None None None None
16 Drop-Down List Name: Yes/No None None None None
17 Drop-Down List Type: Reusable None None None None
18 Drop-Down List Status: Active None None None None
19 Drop-Down List Values: Text Active Default Weight Image
20 None Yes Yes None 1.00 green_checkmark.jpg
21 None No Yes None 0 red_cross.jpg
22 Drop-Down Field Name: Payment verified? None None None None
23 Drop-Down List Name: Yes/No None None None None
24 Drop-Down List Type: Reusable None None None None
25 Drop-Down List Status: Active None None None None
26 Drop-Down List Values: Text Active Default Weight Image
27 None Yes Yes None 1.00 green_checkmark.jpg
28 None No Yes None 0 red_cross.jpg
29 Drop-Down Field Name: How many years expected from now? None None None None
30 Drop-Down List Name: How many years expected from now? None None None None
31 Drop-Down List Type: Specific None None None None
32 Drop-Down List Status: Active None None None None
33 Drop-Down List Values: Text Active Default Weight Image
34 None One None None None None
35 None Two None None None None
36 None Three None None None None
37 None Four None None None None
38 None Five None None None None
39 None 1 Yes None None None
40 None 2 Yes None None None
41 None 3 Yes None None None
42 None 4 Yes None None None
43 None 5 Yes None None None
44 None 6-10 Yes None None None
45 None 11-15 Yes None None None
46 None 16-20 Yes None None None
47 None 20 Yes None None None

My Approach:

  1. For each sub-table:
    1. Find start and end locations
    2. Transpose
    3. Replace with the transpose in main table
  2. Insert a column that performs groupby() and cumcount() on element attributes
  3. Pivot the table

I am relatively new to pandas, so I'm more than open to hear better approaches :) But here is my code so far:

ddlv_locations = df.where(df.eq("Drop-Down List Values:")).stack().index.tolist()
result = df.copy()

# Get sub-table, transpose it, and concatenate with main table
for ddlv_location in ddlv_locations:
    x_start = list(ddlv_location)[0]
    x_end = list(ddlv_location)[0]
    y_start = list(ddlv_location)[1]
    
    # Capture the row index where the sub-table ends
    while(x_end   1 < df.shape[0] and pd.isna(df.loc[x_end   1, y_start])):
        x_end = x_end   1
        
    ddlv = df.loc[x_start:x_end, y_start   1:df.shape[1]].T     # Transpose the sub-table
    ddlv.columns = range(ddlv.shape[1])     # Reset column index
    
    # Concatenate sub-table into main table whilst removing the original rows for the sub-table
    result = pd.concat([result.loc[:x_start - 1], ddlv]) if x_end == df.shape[0] - 1 else pd.concat([result.loc[:x_start - 1], ddlv, result.loc[x_end   1:]])


result.insert(0, "count", result.groupby(0).cumcount())
result = result.reset_index(drop=True)

At this point, I can't wrap my head around how I would pivot the table and ensure the columns for the sub-tables are captured by individual rows in the final result.

Desired Result:

0 1 2 3 4 5 6 7 8
0 Drop-Down Field Name: Drop-Down List Name: Drop-Down List Type: Drop-Down List Status: Text Active Default Weight Image
1 Test List Test List Specific Active DDL V1 Yes Yes 1.00 None
2 Test List Test List Specific Active DDL V2 Yes None 2.00 None
3 Test List Test List Specific Active DDL V3 Yes None 3.00 None
4 Test List Test List Specific Active DDL V4 Yes None 4.00 None
5 Test List Test List Specific Active DDL V5 Yes None 5.00 None
6 Test Empty List Test Empty List Specific Active None None None None None
7 Email verified? Yes/No Reusable Active Yes Yes None 1.00 green_checkmark.jpg
8 Email verified? Yes/No Reusable Active No None None 0 red_cross.jpg
9 Payment verified? Yes/No Reusable Active Yes Yes None 1.00 green_checkmark.jpg
10 Payment verified? Yes/No Reusable Active No None None 0 red_cross.jpg
11 How many years expected from now? How many years expected from now? Specific Active One None None None None
12 How many years expected from now? How many years expected from now? Specific Active Two None None None None
13 How many years expected from now? How many years expected from now? Specific Active Three None None None None
14 How many years expected from now? How many years expected from now? Specific Active Four None None None None
15 How many years expected from now? How many years expected from now? Specific Active Five None None None None
16 How many years expected from now? How many years expected from now? Specific Active 1 Yes None None None
17 How many years expected from now? How many years expected from now? Specific Active 2 Yes None None None
18 How many years expected from now? How many years expected from now? Specific Active 3 Yes None None None
19 How many years expected from now? How many years expected from now? Specific Active 4 Yes None None None
20 How many years expected from now? How many years expected from now? Specific Active 5 Yes None None None
21 How many years expected from now? How many years expected from now? Specific Active 6-10 Yes None None None
22 How many years expected from now? How many years expected from now? Specific Active 11-15 Yes None None None
23 How many years expected from now? How many years expected from now? Specific Active 15-20 Yes None None None
24 How many years expected from now? How many years expected from now? Specific Active 20 Yes None None None

CodePudding user response:

def reorganize(gr):
    left = gr.iloc[:4, :2].set_index(0).T
    right = (
        gr.iloc[4:, 1:]
        .reset_index(drop=True)
        .T.set_index(0).T
    )
    # note that indexing in both frames starts from 1
    return left.join(right, how='outer')

start_marker = "Drop-Down Field Name:"
grouper = (df.iloc[:, 0] == start_marker).cumsum()

result = df.groupby(grouper).apply(reorganize).reset_index(drop=True)
result.iloc[:, :4] = result.iloc[:, :4].fillna(method='pad')

p.s. The code works on the assumption that columns of df have no specific names, i.e. they are indexed by default, starting from zero (sort of 0, 1, 2, 3, ...)

  • Related