C# ExcelDataReader Invalid File Signature-CodePudding

I've already looked at this and it did not solve my issue https://stackoverflow.com/questions/51079664/c-sharp-error-with-exceldatareader

I've tried building a method that reads an XLS file and converts it to string[] But I get an error when trying to run it: ExcelDataReader.Exceptions.HeaderException: Invalid file signature.

I have tried running it with XLSX and it works fine

The files that I am using have worked before

Note. I have run the same method that worked with XLS before, so I'm confused as to why this error is occurring.(using ExcelDataReader Version 3.6.0)

Here is the code:


private static List<string[]> GetExcelRecords(string path, bool hasHeaders)
        {
            var records = new List<string[]>();
            using (var stream = File.Open(path, FileMode.Open, FileAccess.Read))
            {
                using (var reader = ExcelReaderFactory.CreateReader(stream))
                {
                    var sheetFile = reader.AsDataSet().Tables[0];

                    for (int i = 0; i < sheetFile.Rows.Count; i  )
                    {
                        var record = sheetFile.Rows[i];
                        if (hasHeaders)
                        {
                            hasHeaders = false;
                            continue;
                        }
                        var row = record.ItemArray.Select(o => o.ToString()).ToArray();
                        records.Add(row);
                    }
                }
            }
            return records;
        }

The exception occurs on line 4

I have tried using ExcelReaderFactory.CreateBinaryReader and ExcelReaderFactory.CreateOpenXlmReader

CodePudding user response：

The problem appears that my laptop is corrupting XLS files (I'm unsure why) The error isn't due to the file type, but instead the fact that my files are corrupted.

CodePudding user response：

This is more for fun (so community wiki), but here's how I'd write it:

private static IEnumerable<string[]> GetExcelRecords(string path, int headerRowsToSkip = 0)
{
    using (var stream = File.Open(path, FileMode.Open, FileAccess.Read))
    using (var reader = ExcelReaderFactory.CreateReader(stream))
    {
        while(headerRowsToSkip-- > 0 && reader.Read()) {} //intentionally empty
        while(reader.Read())
        {
             object[] temp = new object[reader.FieldCount];
             reader.GetValues(temp);
             yield return temp.Select(o => o.ToString()).ToArray();
        }
    }
}

I'm not happy with it yet, specifically this part:

object[] temp = new object[reader.FieldCount];
reader.GetValues(temp);
yield return temp.Select(o => o.ToString()).ToArray();

The problem is we end up with three copies of the data in each iteration: the copy included in the reader, the copy in the object[], and the copy in the string[].

It's also worth mentioning that, thanks to cultural/internationalization issues, converting back and forth between strings and date/numeric values is among the slowest things we do in a computer on a regular basis. If the ExcelDataReader is respecting type information in the sheet and giving you numeric and date values, it is usually a mistake that costs a LOT more performance than you realize to convert everything to string.

We can't avoid copying the data once out of the reader, so we're gonna need two copies no matter what. With that in mind, I think this might be better:

object[] temp = new object[reader.FieldCount];
reader.GetValues(temp);
yield return temp;

Then we also need to change the method signature. This leaves us returning a set ofobject[]s, which is still less than idea.

The other option is even more transformational: use generics and functional programming concepts to create an actual typed object, instead of just an array:

private static IEnumerable<T> GetExcelRecords<T>(string path, Func<IDataRecord, T> transform, int headerRowsToSkip = 0)
{
    using (var stream = File.Open(path, FileMode.Open, FileAccess.Read))
    using (var reader = ExcelReaderFactory.CreateReader(stream))
    {
        while(headerRowsToSkip-- > 0 && reader.Read()) {} //intentionally empty
        while(reader.Read())
        {
             yield return transform(reader);
        }
    }
}

This punts the hard work to a lambda expression. To explain how to use it, let's say you have a sheet with one header row and columns (in order) for an integer ID, a Date, a decimal price, a double quantity, and a string description. You might call the method like this:

var rows = GetExcelRecords("file path", row => 
    {
       return (row.GetInt32(0), row.GetDateTime(1), row.GetDecimal(2), row.GetDouble(3), row.GetString(4) );
    }, 1);

That used a ValueTuple to avoid declaring a class, while still preserving the original data types and avoiding two (2!) array allocations for every single row. Since we're using IEnumerable<> instead of List<> it also means we only need to keep one row in memory at a time.

I think we've ended up in a good place. The trick I'm using to read past the header rows is maybe a little too clever — relying on boolean short circuit ordering, an operator precedence subtlety, and an empty while loop all in the same line — but I like symmetry here.