Home > Blockchain >  Storing multi level data from Excel workbooks for comparison. Not sure if a collection or Array is b
Storing multi level data from Excel workbooks for comparison. Not sure if a collection or Array is b

Time:05-18

Struggling to find the best path forward. I need to take data from two separate Excel, multi sheet, workbooks and store some data from each sheet, based on what type of data it is, so that I can later compare that data on each sheet from each workbook to another.

Then I will be creating another workbook to summarize the findings. It looks like this:

workbook1->sheetA->uniqueIdentifier->value associated with that identifier
workbook1->sheetB->uniqueIdentifier->value associated with that identifier 

(and so on for several sheets)

This is the same layout for workbook2 as well. Once I have the data from all the worksheets in each workbook, I then compare the data. The layout of the sheets is the same in both books, and the "uniqueIdentifier" is unique for that workbook but also exists in the other workbook. At some point this will grow to 3 then 4 then 5 workbooks in the future.

This feels like a messy and inefficient array, so I thought maybe a collection would be better. Anyway, this part of the code project I am just starting on, so I was hoping to get some insight before I went down the rabbit hole. Right now I am leaning on a collection of collections......

CodePudding user response:

Not sure if this is suitable for stackoverflow though? More like a design question. But here's my 2 cents.

My use case

I've been working on similar case, and in my use case it's always comparing workbooks - say, wbA and wbB -, and each one has one sheet - sheetA and sheetB.

My users have to compare new data sheets based on same configurations monthly, so I also ask user to write a configuration json and tell them what variable can they use in that json.

Input

For the input, I ask the user to give me three files - wbA's excel, 'wbB's excel' and a configuration json. Drag them from OS, and drop onto my program, then let my program handles from then on. I also support folders so users can categorize their files easily.

The configuration json tells me, for either workbook: which sheet to use, which columns to be Primary Key for the rows; and which column in wbA for which column in wbB (I called it wiring), if needed.

There're also a few extra settings like string replacements. Sometimes the workbooks might contain strings I don't need - I support both regexp and literal replacements for those.

Or, column names, though I also let user decide if the program should just take the first row in sheetA as column names (which datasheets usually do), for showing the column names in the result file later.

I also make sure that whenever user need to input columns (for example, setting PK columns), I let them input "A","B","C" instead of column number like 1,2,3. I wrote the conversion by myself which takes a little math calculating to generalize as NPOI takes number to get column from sheet, but IIRC I later found that NPOI already has such functions built-in. You'd want to check if those exist first.

Json supports arrays, so you if you want multiple sheets be compared between two workbooks, you can try to use array to express what you need. The basic idea is the same - two sheets are comparing each time, so you need to configure for the two sheets. If there're multiple pairs, change the configuration format and let your program know how to loop through those.

If you find it difficult to imagine, a great strategy of programing is to start simple. You start by making a program can compare two books, 1 sheet in each book first, then gradually add more functions into it until it eventually meets your actual needs.

Comparing

I use NPOI to read excels, and when getting cell contents I use XSSFFormulaEvaluator.FormatCellValue() - if I can't get the book of the cell (cell.Row?.Sheet?.Workbook returns null) since FormulaEvaluator needs reference to the cell's origin Workbook, I use simply cell.ToString().

Now for comparing it's like the process you wrote in the post.

  • Build primary keys for sheetB, make a Dictionary<PK, row>. I simply joined the PK columns into a string like pkA-pkB-pkC....
  • Create a List<string> from Dictionary.Keys. The list contains sheetB keys.
  • For each row in sheetA, build its primary key first, then search the same key in the dictionary. If not found, record it as an error. If found, compare and do what you need to do, and also remove the key from the List<string>. You don't have to build dictionaries for both sheets.
  • Lastly, iterate through the List<string> and record those as errors - sheetB row not found in sheetA.., etc.

Result output

I output comparison result as a single excel file, and copy the sheets I compared into it. The output has three sheets: sheetA, sheetB and result, so users can have a quick look upon original sheets when they're checking the results.

... good luck!

  • Related