Home > Enterprise >  Parsing a field and list of people from Excel files
Parsing a field and list of people from Excel files

Time:10-03

I'm trying to parse Excel (.xls, .xlsx) files. The structure of files is the same except for the amount of the records.

  • I need to parse the industry. In this case it is "FinTech". Due to the fact that it is in one cell, I guess I have to use a regex expression such as ^Industry: (.*)$?

  • It has to find which row/column the list of the people starts and put it into a IEnumerable<Person>. It could use the following regex expressions.

    • Number always consists of 6 digits. ^[0-9]{6}$
    • Name consists of at least two words where each one of them starts with a capital letter. ^([a-zA-Z] \s?\b){2,}$

A test .xlsx file can be found here https://docs.google.com/spreadsheets/d/15SR04cHXgGLWe0cuOOuuB5vUZigebh96/edit?usp=sharing&ouid=112418126731411268789&rtpof=true&sd=true.

List of people      
Normal condition        
Industry: FinTech       

#   Number  Name
1   226250  Zain Griffiths
2   226256  Michael Houghton
3   226259  Hugo Willis Johnson
4   226264  Anna-Maria Rose

The actual question

First of all, I'm not completely sure if my regex expressions are correct. I was only able to display the rows and the columns but I'm not sure how to actually parse the industry and the list of the people into a IEnumerable<Person>. So how do I do that?

Snippet

// Program.cs
var excel = new ExcelParser();
var sheet1 = excel.Import(@"a.xlsx");

Console.OutputEncoding = Encoding.UTF8;
for (var i = 0; i < sheet1.Rows.Count; i  )
{
    for (var j = 0; j < sheet1.Columns.Count; j  )
    {
        var cell = sheet1.Rows[i][j].ToString()?.Trim();

        Console.Write($"Column: {cell} | ");
    }

    Console.WriteLine();
}

Console.ReadLine();

// ExcelParser.cs
public sealed class ExcelParser
{
    public ExcelParser()
    {
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    }

    public DataTable Import(string filePath)
    {
        // does file exist?
        if (!File.Exists(filePath))
        {
            throw new FileNotFoundException();
        }

        // .xls or .xlsx allowed
        var extension = new FileInfo(filePath).Extension.ToLowerInvariant();
        if (extension is not (".xls" or ".xlsx"))
        {
            throw new NotSupportedException();
        }

        // read .xls or .xlsx
        using var stream = File.Open(filePath, FileMode.Open, FileAccess.Read);
        using var reader = ExcelReaderFactory.CreateReader(stream);

        var dataSet = reader.AsDataSet(new ExcelDataSetConfiguration
        {
            ConfigureDataTable = _ => new ExcelDataTableConfiguration
            {
                UseHeaderRow = false
            }
        });

        // Sheet1
        return dataSet.Tables[0];
    }
}

CodePudding user response:

The structure of files is the same except for the amount of the records

As long as the table is structured (or semi-structured), you can state one/two simple assumptions and parse the tables based on these assumptions, and in case the structure is not following the assumptions, you will return false (throw exception, etc..).

Actually, designing regexs to parse the table is kind of assumptions encoding.. I just want to Keep it simple, So, Based on the problem statement, here are my assumptions:

  1. There will be a "industry" (or "industry:", call .ToLower()) string in a separate cell (regex will do nothing more than finding such a string), and industry's name will be in the same cell.[1]
  2. First person's name will be next to the first 6-digits-number cell.[2]

Here is the code

public (string industryName, List<string> peopleNames) ParseSheet(DataTable sheet1)
{
    // 1. Get Indices of industry cell and first Name in people names..

    var industryCellIndex = (-1, -1, false);
    var peopleFirstCellIndex = (-1, -1, false);
    
    for (var i = 0; i < sheet1.Rows.Count; i  )
    {
        for (var j = 0; j < sheet1.Columns.Count; j  )
        {
            // .ToLower() added
            var cell = sheet1.Rows[i][j].ToString()?.Trim().ToLower();
            if (cell.StartsWith("industry"))
            {
                industryCellIndex = (i, j, true);
                break;
            }

            // the name after the first 6-digits number cell will be the first name in people records
            if (cell.Length == 6 && int.TryParse(cell, out _))
            {
                peopleFirstCellIndex = (i, j   1, true);
                break;
            }
        }

        if (industryCellIndex.Item3 && peopleFirstCellIndex.Item3)
            break;
    }

    if (!industryCellIndex.Item3 || !peopleFirstCellIndex.Item3)
    {
        // throw new Exception("Excel file is not normalized!");
        return (null, null);
    }

    // 2. retrieve the desired data

    var industryName = sheet1.Rows[industryCellIndex.Item1][industryCellIndex.Item2]
                       .Replace(":", ""); // will do nothing if there were no ":"

    industryName = industryName.Substring(industryName.IndexOf("indusrty")   "indusrty".Length);
    var peopleNames = new List<string>();
    var colIndex = peopleFirstCellIndex.Item2;
    for (var rowIndex = peopleFirstCellIndex.Item1;
         rowIndex < sheet1.Rows.Count;
         rowIndex  )
    {
        peopleNames.Add(sheet1.Rows[rowIndex][colIndex].ToString()?.Trim());
    }

    return (industryName, peopleNames);
}

[1] If this assumption needs some editing (like: the indusrty name might be the next cell that has "industry" string), the idea still the same.. you can consider this in parsing.

[2] And, for example, after the "#" cell by 2 columns and 1 row.

  • Related