I read data from a CSV file and put it into a DataFrame. The data contains rows of attributes that are related to an element and need to be transformed to columns instead. For each element, there is also a sub-table that I need to account for.
Example DataFrame:
df = pd.DataFrame([["Drop-Down Field Name:", "Test List"],
["Drop-Down List Name:", "Test List"],
["Drop-Down List Type:", "Specific"],
["Drop-Down List Status:", "Active"],
["Drop-Down List Values:", "Text", "Active", "Default", "Weight", "Image"],
[None, "DDL V1", "Yes", "Yes", "1.00", None],
[None, "DDL V2", "Yes", None, "2.00", None],
[None, "DDL V3", "Yes", None, "3.00", None],
[None, "DDL V4", "Yes", None, "4.00", None],
[None, "DDL V5", "Yes", None, "5.00", None],
["Drop-Down Field Name:", "Test Empty List"],
["Drop-Down List Name:", "Test Empty List"],
["Drop-Down List Type:", "Specific"],
["Drop-Down List Status:", "Active"],
["Drop-Down List Values:", "Text", "Active", "Default", "Weight", "Image"],
["Drop-Down Field Name:", "Email verified?"],
["Drop-Down List Name:", "Yes/No"],
["Drop-Down List Type:", "Reusable"],
["Drop-Down List Status:", "Active"],
["Drop-Down List Values:", "Text", "Active", "Default", "Weight", "Image"],
[None, "Yes", "Yes", None, "1.00", "green_checkmark.jpg"],
[None, "No", "Yes", None, "0", "red_cross.jpg"],
["Drop-Down Field Name:", "Payment verified?"],
["Drop-Down List Name:", "Yes/No"],
["Drop-Down List Type:", "Reusable"],
["Drop-Down List Status:", "Active"],
["Drop-Down List Values:", "Text", "Active", "Default", "Weight", "Image"],
[None, "Yes", "Yes", None, "1.00", "green_checkmark.jpg"],
[None, "No", "Yes", None, "0", "red_cross.jpg"],
["Drop-Down Field Name:", "How many years expected from now?"],
["Drop-Down List Name:", "How many years expected from now?"],
["Drop-Down List Type:", "Specific"],
["Drop-Down List Status:", "Active"],
["Drop-Down List Values:", "Text", "Active", "Default", "Weight", "Image"],
[None, "One", None, None, None, None],
[None, "Two", None, None, None, None],
[None, "Three", None, None, None, None],
[None, "Four", None, None, None, None],
[None, "Five", None, None, None, None],
[None, "1", "Yes", None, None, None],
[None, "2", "Yes", None, None, None],
[None, "3", "Yes", None, None, None],
[None, "4", "Yes", None, None, None],
[None, "5", "Yes", None, None, None],
[None, "6-10", "Yes", None, None, None],
[None, "11-15", "Yes", None, None, None],
[None, "16-20", "Yes", None, None, None],
[None, "20 ", "Yes", None, None, None]])
The sub-tables in the table have their cells in bold. They are not equal in number of entries and can have no entries at all.
0 | 1 | 2 | 3 | 4 | 5 | |
---|---|---|---|---|---|---|
0 | Drop-Down Field Name: | Test List | None | None | None | None |
1 | Drop-Down List Name: | Test List | None | None | None | None |
2 | Drop-Down List Type: | Specific | None | None | None | None |
3 | Drop-Down List Status: | Active | None | None | None | None |
4 | Drop-Down List Values: | Text | Active | Default | Weight | Image |
5 | None | DDL V1 | Yes | Yes | 1.00 | None |
6 | None | DDL V2 | Yes | None | 2.00 | None |
7 | None | DDL V3 | Yes | None | 3.00 | None |
8 | None | DDL V4 | Yes | None | 4.00 | None |
9 | None | DDL V5 | Yes | None | 5.00 | None |
10 | Drop-Down Field Name: | Test Empty List | None | None | None | None |
11 | Drop-Down List Name: | Test Empty List | None | None | None | None |
12 | Drop-Down List Type: | Specific | None | None | None | None |
13 | Drop-Down List Status: | Active | None | None | None | None |
14 | Drop-Down List Values: | Text | Active | Default | Weight | Image |
15 | Drop-Down Field Name: | Email verified? | None | None | None | None |
16 | Drop-Down List Name: | Yes/No | None | None | None | None |
17 | Drop-Down List Type: | Reusable | None | None | None | None |
18 | Drop-Down List Status: | Active | None | None | None | None |
19 | Drop-Down List Values: | Text | Active | Default | Weight | Image |
20 | None | Yes | Yes | None | 1.00 | green_checkmark.jpg |
21 | None | No | Yes | None | 0 | red_cross.jpg |
22 | Drop-Down Field Name: | Payment verified? | None | None | None | None |
23 | Drop-Down List Name: | Yes/No | None | None | None | None |
24 | Drop-Down List Type: | Reusable | None | None | None | None |
25 | Drop-Down List Status: | Active | None | None | None | None |
26 | Drop-Down List Values: | Text | Active | Default | Weight | Image |
27 | None | Yes | Yes | None | 1.00 | green_checkmark.jpg |
28 | None | No | Yes | None | 0 | red_cross.jpg |
29 | Drop-Down Field Name: | How many years expected from now? | None | None | None | None |
30 | Drop-Down List Name: | How many years expected from now? | None | None | None | None |
31 | Drop-Down List Type: | Specific | None | None | None | None |
32 | Drop-Down List Status: | Active | None | None | None | None |
33 | Drop-Down List Values: | Text | Active | Default | Weight | Image |
34 | None | One | None | None | None | None |
35 | None | Two | None | None | None | None |
36 | None | Three | None | None | None | None |
37 | None | Four | None | None | None | None |
38 | None | Five | None | None | None | None |
39 | None | 1 | Yes | None | None | None |
40 | None | 2 | Yes | None | None | None |
41 | None | 3 | Yes | None | None | None |
42 | None | 4 | Yes | None | None | None |
43 | None | 5 | Yes | None | None | None |
44 | None | 6-10 | Yes | None | None | None |
45 | None | 11-15 | Yes | None | None | None |
46 | None | 16-20 | Yes | None | None | None |
47 | None | 20 | Yes | None | None | None |
My Approach:
- For each sub-table:
- Find start and end locations
- Transpose
- Replace with the transpose in main table
- Insert a column that performs
groupby()
andcumcount()
on element attributes - Pivot the table
I am relatively new to pandas, so I'm more than open to hear better approaches :) But here is my code so far:
ddlv_locations = df.where(df.eq("Drop-Down List Values:")).stack().index.tolist()
result = df.copy()
# Get sub-table, transpose it, and concatenate with main table
for ddlv_location in ddlv_locations:
x_start = list(ddlv_location)[0]
x_end = list(ddlv_location)[0]
y_start = list(ddlv_location)[1]
# Capture the row index where the sub-table ends
while(x_end 1 < df.shape[0] and pd.isna(df.loc[x_end 1, y_start])):
x_end = x_end 1
ddlv = df.loc[x_start:x_end, y_start 1:df.shape[1]].T # Transpose the sub-table
ddlv.columns = range(ddlv.shape[1]) # Reset column index
# Concatenate sub-table into main table whilst removing the original rows for the sub-table
result = pd.concat([result.loc[:x_start - 1], ddlv]) if x_end == df.shape[0] - 1 else pd.concat([result.loc[:x_start - 1], ddlv, result.loc[x_end 1:]])
result.insert(0, "count", result.groupby(0).cumcount())
result = result.reset_index(drop=True)
At this point, I can't wrap my head around how I would pivot the table and ensure the columns for the sub-tables are captured by individual rows in the final result.
Desired Result:
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | |
---|---|---|---|---|---|---|---|---|---|
0 | Drop-Down Field Name: | Drop-Down List Name: | Drop-Down List Type: | Drop-Down List Status: | Text | Active | Default | Weight | Image |
1 | Test List | Test List | Specific | Active | DDL V1 | Yes | Yes | 1.00 | None |
2 | Test List | Test List | Specific | Active | DDL V2 | Yes | None | 2.00 | None |
3 | Test List | Test List | Specific | Active | DDL V3 | Yes | None | 3.00 | None |
4 | Test List | Test List | Specific | Active | DDL V4 | Yes | None | 4.00 | None |
5 | Test List | Test List | Specific | Active | DDL V5 | Yes | None | 5.00 | None |
6 | Test Empty List | Test Empty List | Specific | Active | None | None | None | None | None |
7 | Email verified? | Yes/No | Reusable | Active | Yes | Yes | None | 1.00 | green_checkmark.jpg |
8 | Email verified? | Yes/No | Reusable | Active | No | None | None | 0 | red_cross.jpg |
9 | Payment verified? | Yes/No | Reusable | Active | Yes | Yes | None | 1.00 | green_checkmark.jpg |
10 | Payment verified? | Yes/No | Reusable | Active | No | None | None | 0 | red_cross.jpg |
11 | How many years expected from now? | How many years expected from now? | Specific | Active | One | None | None | None | None |
12 | How many years expected from now? | How many years expected from now? | Specific | Active | Two | None | None | None | None |
13 | How many years expected from now? | How many years expected from now? | Specific | Active | Three | None | None | None | None |
14 | How many years expected from now? | How many years expected from now? | Specific | Active | Four | None | None | None | None |
15 | How many years expected from now? | How many years expected from now? | Specific | Active | Five | None | None | None | None |
16 | How many years expected from now? | How many years expected from now? | Specific | Active | 1 | Yes | None | None | None |
17 | How many years expected from now? | How many years expected from now? | Specific | Active | 2 | Yes | None | None | None |
18 | How many years expected from now? | How many years expected from now? | Specific | Active | 3 | Yes | None | None | None |
19 | How many years expected from now? | How many years expected from now? | Specific | Active | 4 | Yes | None | None | None |
20 | How many years expected from now? | How many years expected from now? | Specific | Active | 5 | Yes | None | None | None |
21 | How many years expected from now? | How many years expected from now? | Specific | Active | 6-10 | Yes | None | None | None |
22 | How many years expected from now? | How many years expected from now? | Specific | Active | 11-15 | Yes | None | None | None |
23 | How many years expected from now? | How many years expected from now? | Specific | Active | 15-20 | Yes | None | None | None |
24 | How many years expected from now? | How many years expected from now? | Specific | Active | 20 | Yes | None | None | None |
CodePudding user response:
def reorganize(gr):
left = gr.iloc[:4, :2].set_index(0).T
right = (
gr.iloc[4:, 1:]
.reset_index(drop=True)
.T.set_index(0).T
)
# note that indexing in both frames starts from 1
return left.join(right, how='outer')
start_marker = "Drop-Down Field Name:"
grouper = (df.iloc[:, 0] == start_marker).cumsum()
result = df.groupby(grouper).apply(reorganize).reset_index(drop=True)
result.iloc[:, :4] = result.iloc[:, :4].fillna(method='pad')
p.s. The code works on the assumption that columns of df
have no specific names, i.e. they are indexed by default, starting from zero (sort of 0, 1, 2, 3, ...
)