I have seen many relatable posts for my question but couldn't find the proper solutions. read excel sheet containing multiple tables, tables that have headers with a non white background cell color
here is the link of that excel data: https://docs.google.com/spreadsheets/d/1m4v_wbIJCGWBigJx53BRnBpMDHZ9CwKf/edit?usp=sharing&ouid=107579116880049687042&rtpof=true&sd=true
So far what i have tried:
import pandas as pd
df = pd.read_excel("dell.xlsx")
df =df.dropna()
The above code deletes the wanted data because it has nan.
df.iloc[1,2:5]=['Description','Qty','Price']
print(df)
nul_rows = list(df[df.isnull().all(axis=1)].index)
list_of_dataframes = []
for i in range(len(nul_rows) - 1):
list_of_dataframes.append(df.iloc[nul_rows[i] 1:nul_rows[i 1],:])
cleaned_tables = []
for _df in list_of_dataframes:
cleaned_tables.append(_df.dropna(axis=1, how='all'))
for p in cleaned_tables:
print(p.dropna())
couldn't get the data I want because those were not in the header format, its in unnamed.
I want to extract these data "Sku Description" "Qty" "Price" "Total" from the excel in the link.
Hope I get some response.
NOTE! the file content and format differs always, so based on one file solution cant be used for next file but the header's name won't change for example qty, description,total.
CodePudding user response:
Here is one way of creating dataframes out of that excel file. See comments in the code itself.
You can uncomment some print() to see how this is developed.
Code
import pandas as pd
import numpy as np
def get_dataframes(fn):
"""
Returns 3 dataframes from the 3 tables given inside excel file fn.
"""
# (1) Read file
df_start = pd.read_excel(fn)
df = df_start.copy() # preserve df_start, just in case we need it later
# print(df.to_string()) # study layout
# (2) Clean df
df = pd.DataFrame(np.vstack([df.columns, df])) # reset column index, so we can see 0, 1, 2 ...
df = df.drop([0], axis=1) # drop index column 0, it is not needed
df = df.dropna(how='all') # drop rows if all cells are empty
df = df.reset_index(drop=True) # reset row index so we can see the correct row no. 0, 1, ...
df = pd.DataFrame(np.vstack([df.columns, df])) # reset col index so we can see the correct col no. 0, 1, ...
# (3) Tasks
# We have 3 dataframes to create:
# 1) df1 is for the first table the one with "Description".
# 2) df2 is the one with the first "Sku Description:".
# 3) df3 is the one with the second "Sku Description:".
# (4) Get all the indexes (row and col) of "Description" and "Sku Description:" texts,
# we will use it to extract data in order to create dataframes df1, df2 and df3.
# Get the index of "Description" we will get row index and col index from it, by knowing
# the row index, we will be able to extract at correct row index on this particular table.
i_desc = df.where(df=='Description').dropna(how='all').dropna(how='all', axis=1)
# Assume that there is only one "Description" and this is true according to the given sample excel files.
i_desc_row = i_desc.index[0]
# print(f'row index of "Description": {i_desc_row}')
# Get the index of "Sku Description:".
i_sku = df.where(df=='Sku Description:').dropna(how='all').dropna(how='all', axis=1)
# There are 2 "Sku Description:", get the row indexes of each.
i_sku_row = i_sku.index
# print(f'row indexes of "Sku Description:": {i_sku_row}')
i_sku_row_1 = i_sku_row[0]
i_sku_row_2 = i_sku_row[1]
# print(f'i_sku_row_1: {i_sku_row_1}, i_sku_row_2: {i_sku_row_2}')
# There are 2 "Sku Description:", get the col indexes of each.
i_sku_col = i_sku.columns
# print(f'col indexes of "Sku Description:": {i_sku_col}')
if len(i_sku_col) == 2:
i_sku_col_1 = i_sku_col[0]
i_sku_col_2 = i_sku_col[1]
else:
i_sku_col_1 = i_sku_col[0]
i_sku_col_2 = i_sku_col_1
# print(f'i_sku_col_1: {i_sku_col_1}, i_sku_col_2: {i_sku_col_2}')
# (5) Create df1
cols = ['Description', 'Qty', 'Price', 'Total']
df1 = df.iloc[i_desc_row 1 : i_sku_row_1-2, 0:4] # [start_row:end_row, start_col:end_col]
df1.columns = cols
# print(df1)
df1 = df1.reset_index(drop=True) # reset the row index so we can see 0, 1, ...
# print(df1)
# (6) Create df2
cols = ['Sku Description:', 'Qty:']
df2 = df.iloc[i_sku_row_1 1 : i_sku_row_2, i_sku_col_1:i_sku_col_1 2]
df2.columns = cols
df2 = df2.reset_index(drop=True)
# print(df2)
# (7) Create df3
cols = ['Sku Description:', 'Qty:']
df3 = df.iloc[i_sku_row_2 1:, i_sku_col_2:i_sku_col_2 2]
df3.columns = cols
df3 = df3.reset_index(drop=True)
# print(df3)
return df1, df2, df3
def process_file():
# fn = 'F:\\Downloads\\dell.xlsx'
fn = 'F:\\Downloads\\dell2.xlsx'
desc_df, sku1_df, sku2_df = get_dataframes(fn)
print(f'file: {fn}')
print(f'desc_df:\n{desc_df}\n')
print(f'sku1_df:\n{sku1_df}\n')
print(f'sku2_df:\n{sku2_df}\n')
# Start
process_file()
Output
file: F:\Downloads\dell2.xlsx
desc_df:
Description Qty Price Total
0 PowerEdge R640 Server 2.0 6390.0 12780
1 PowerEdge R640 Server 8.0 4360.0 34880
sku1_df:
Sku Description: Qty:
0 PowerEdge R640 Server 1.0
1 PowerEdge R640 MLK Motherboard 1.0
2 Intel Xeon Silver 4216 2.1G, 16C/32T, 9.6GT/s,... 2.0
3 Intel Xeon Silver 4216 2.1G, 16C/32T, 9.6GT/s,... 2.0
4 iDRAC Group Manager, Enabled 1.0
5 iDRAC,Factory Generated Password 1.0
6 Additional Processor Selected 1.0
7 2.5 Chassis with up to 10 Hard Drives and 3PCI... 1.0
8 Standard Bezel 1.0
9 Riser Config 2, 3x16 LP 1.0
10 PowerEdge R640 Shipping Material for 4 and 10 ... 1.0
11 PowerEdge R640 Shipping(ICC), for 1300W below, V2 1.0
12 No Quick Sync 1.0
13 Dell EMC Luggage Tag for x10 1.0
14 Performance Optimized 1.0
15 3200MT/s RDIMMs 1.0
16 DIMM Blanks for System with 2 Processors 1.0
17 32GB RDIMM, 3200MT/s, Dual Rank 16Gb BASE x8 4.0
18 iDRAC9,Enterprise 1.0
19 2.4TB 10K RPM SAS 12Gbps 512e 2.5in Hot-plug H... 8.0
20 3.84TB SSD SATA Read Intensive 6Gbps 512 2.5in... 2.0
21 BOSS controller card with 2 M.2 Sticks 480GB... 1.0
22 PERC H750 Adapter, Low Profile 1.0
23 8 Standard Fans for R640 1.0
24 Performance BIOS Settings 1.0
25 Standard 1U Heatsink 2.0
26 No Internal Optical Drive 1.0
27 Dual, Hot-plug, Redundant Power Supply (1 1), ... 1.0
28 Jumper Cord - C13/C14, 2M, 250V, 10A 2.0
29 Power Cord - C13, 1.8M, 250V, 10A 2.0
30 Trusted Platform Module 2.0 V3 1.0
31 PowerEdge R640 No CE Marking, ICC, for 1300W P... 1.0
32 Broadcom 5720 Quad Port 1GbE BASE-T, rNDC 1.0
33 No Operating System 1.0
34 No Systems Documentation, No OpenManage DVD Kit 1.0
35 Basic Deployment Dell Server R Series 1U/2U 1.0
36 PowerEdge-SE02 Handling n Insurance Charges(In... 1.0
37 ReadyRails Sliding Rails With Cable Management... 1.0
38 Unconfigured RAID 1.0
39 UEFI BIOS Boot Mode with GPT Partition 1.0
40 OpenManage Enterprise Advanced 1.0
41 Basic Next Business Day 36 Months 1.0
42 ProSupport and Next Business Day Onsite Servic... 1.0
43 ProSupport and Next Business Day Onsite Servic... 1.0
44 INFO: Thank you for choosing Dell 1.0
45 Mod Specs Info 1.0
sku2_df:
Sku Description: Qty:
0 PowerEdge R640 Server 1.0
1 PowerEdge R640 MLK Motherboard 1.0
2 Intel Xeon Silver 4216 2.1G, 16C/32T, 9.6GT/s,... 1.0
3 iDRAC,Factory Generated Password 1.0
4 iDRAC Group Manager, Enabled 1.0
5 2.5 Chassis with up to 10 Hard Drives and 3PCI... 1.0
6 Standard Bezel 1.0
7 Riser Config 4, 2x16 LP 1.0
8 PowerEdge R640 Shipping(ICC), for 1300W below, V2 1.0
9 PowerEdge R640 Shipping Material for 4 and 10 ... 1.0
10 No Quick Sync 1.0
11 Dell EMC Luggage Tag for x10 1.0
12 Performance Optimized 1.0
13 32GB RDIMM, 3200MT/s, Dual Rank 16Gb BASE x8 2.0
14 Blank for 1CPU Configuration 1.0
15 3200MT/s RDIMMs 1.0
16 Blank for 1CPU Configuration 1.0
17 No Additional Processor 1.0
18 iDRAC9,Enterprise 1.0
19 2.4TB 10K RPM SAS 12Gbps 512e 2.5in Hot-plug H... 3.0
20 BOSS controller card with 2 M.2 Sticks 480GB... 1.0
21 PERC H750 Adapter, Low Profile 1.0
22 Performance BIOS Settings 1.0
23 5 Standard Fans for R640 1.0
24 Standard 1U Heatsink 1.0
25 1.92TB SSD vSAS Read Intensive 12Gbps 512e 2.5... 2.0
26 No Internal Optical Drive 1.0
27 Dual, Hot-plug, Redundant Power Supply (1 1), ... 1.0
28 Power Cord - C13, 1.8M, 250V, 10A 2.0
29 Jumper Cord - C13/C14, 2M, 250V, 10A 2.0
30 Trusted Platform Module 2.0 V3 1.0
31 PowerEdge R640 No CE Marking, ICC, for 1300W P... 1.0
32 Broadcom 5720 Quad Port 1GbE BASE-T, rNDC 1.0
33 No Operating System 1.0
34 No Systems Documentation, No OpenManage DVD Kit 1.0
35 Basic Deployment Dell Server R Series 1U/2U 1.0
36 PowerEdge-SE02 Handling n Insurance Charges 1.0
37 ReadyRails Sliding Rails With Cable Management... 1.0
38 Unconfigured RAID 1.0
39 UEFI BIOS Boot Mode with GPT Partition 1.0
40 OpenManage Enterprise Advanced 1.0
41 Basic Next Business Day 36 Months 1.0
42 ProSupport and Next Business Day Onsite Servic... 1.0
43 ProSupport and Next Business Day Onsite Servic... 1.0
44 INFO: Thank you for choosing Dell 1.0
45 Mod Specs Info 1.0