Parse specifically structured .txt file (C#)

I have a file with a table-like structure. I need to parse it and then map the data to my POCO classes. The file looks like this:

 Financial Institution   : LOREMIPSOM      - 019223
 FX Settlement Date      : 10.02.2021
 Reconciliation File ID  : 801-288881-0005543759-00001
 Transaction Currency    : AZN
 Reconciliation Currency : USD



     -------------------------------------- -------------------- --------------------- --------------------- --------------------- --------------------- --------------------- --------- --------------------- 
    !         Settlement Category          ! Transaction Amount ! Reconciliation Amnt !          Fee Amount !  Transaction Amount ! Reconciliation Amnt !          Fee Amount !   Count !           Net Value !
    !                                      !             Credit !              Credit !              Credit !               Debit !               Debit !               Debit !   Total !                     !
     -------------------------------------- -------------------- --------------------- --------------------- --------------------- --------------------- --------------------- --------- --------------------- 

    ! MC Acq Fin Detail ATM Out                           5.00                   3.57                 49.75                  0.00                  0.00                  0.00        31                  3.32 !
    ! MC Acq Fin Detail Retail Out                        5.40                 262.01                  0.00                  0.00                  0.00                 -3.96        10                258.05 !

    ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
                      Totals :                           10.40                 265.58                 49.75                  0.00                  0.00                 -3.96        41                261.37



 Financial Institution   : LOREMIPSOM      - 019223
 FX Settlement Date      : 10.02.2021
 Reconciliation File ID  : 801-288881-0005543759-00002
 Transaction Currency    : EUR
 Reconciliation Currency : USD



     -------------------------------------- -------------------- --------------------- --------------------- --------------------- --------------------- --------------------- --------- --------------------- 
    !         Settlement Category          ! Transaction Amount ! Reconciliation Amnt !          Fee Amount !  Transaction Amount ! Reconciliation Amnt !          Fee Amount !   Count !           Net Value !
    !                                      !             Credit !              Credit !              Credit !               Debit !               Debit !               Debit !   Total !                     !
     -------------------------------------- -------------------- --------------------- --------------------- --------------------- --------------------- --------------------- --------- --------------------- 

    ! Fee Collection Inc                                  0.00                   0.00                  0.00                  0.00                  0.00                  0.00         0                  0.00 !
    ! Fee Collection Inc                                  0.00                   0.00                  0.00                  8.00                  0.00                  0.00         0                  0.00 !
    ! Fee Collection Inc                                  0.00                   0.00                  0.00                  0.00                  0.00                  0.00         0                  0.00 !
    ! Fee Collection Inc                                  0.00                   0.00                  0.00                 -1.00                  0.00                  0.00         0                  0.00 !

    ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
                      Totals :                            0.00                   0.00                  0.00                  7.00                  0.00                  0.00         0                  0.00

I was thinking of parsing it manually, but I thought maybe there is a better way. As for what to parse: I need every meaningful piece of data in the file, that is, everything except the - separator symbols. The file structure doesn't change, so the columns are always there (fixed). As you can see, it is a bank-related file (transactions). There is a "Financial Institution", for example, that I map along with the other header data, and the "Settlement Category" of this "Financial Institution" is "MC Acq Fin Detail ATM Out", for example. What would be the best way to parse the file?

CodePudding user response：

You can probably do it by parsing one line at a time with regular expressions. With a known pattern to look for, you can pass the current line to a Regex.Match() call, and the result exposes every part that was captured by a parenthesized group. This saves you from chaining complex IndexOf() searches and similar calls along the way.

If the match succeeds and returns the expected groups (rather than nothing), you can pull the pieces out quickly. Defining several patterns lets you cycle through them until one matches the kind of line you are looking at.

One way to test what you are planning to parse is an online regular expression tester: you can try expressions inline to see how they work, and it lets you debug and step through a match while describing what each part is looking for. You can paste in the patterns I have in the code to see how they are described, and debug them against some sample text from your file.

A StackOverflow answer also helped with capturing possibly multiple words before the next "marker" that identifies the break to the next part.

Here is a quick something I threw together for you. Hopefully, for you and others, it shows a parsing mechanism that avoids the complexity of repeated IndexOf searches, substring extraction, re-parsing, and so on. Learning how to write patterns can take time, but hopefully I have done enough in-line documentation to show that it is not as difficult as one might think.

Good luck.

    // Needs: using System.IO; using System.Text.RegularExpressions; using System.Windows.Forms;
    private void TryRegParse()
    {

        if (!File.Exists("TestingRegex.txt"))
            return;

        // read the text content into already parsed individual lines
        var txtLines = File.ReadAllLines("TestingRegex.txt");

        // "\s*" matches zero or more white-space characters before whatever follows it.
        var patFinancial = @"^\s*Financial Institution\s*:\s*(?<FinInst>.+?)\s*-\s*(?<FinAccnt>.*)";
        // Explanation of what I have here for the pattern
        // ^ = start of the string
        // \s* = zero OR more possible white space/tab characters
        // Financial Institution = find this exact string
        // \s*:  = there may be zero or more white-space/tab before coming up to the ":" character
        // \s* = an additional check for zero or more white spaces
        // (?<FinInst>.+?) = 
        //  using the outer (parens) allows Regular expression to pull the extracted portion into a group results
        //      the ?<FinInst> allows this "group" to be recognized by the name "FinInst" see shortly
        //      . indicates a single character 
        //      the +? means one or more of them, but as few as possible (lazy), so the
        //      capture stops at the first hyphen that follows
        //      This allows you to get multiple possible word(s) / names up to the actual hyphen
        //      \s*-\s* = the hyphen itself, with optional white-space/tab on either side
        //      (?<FinAccnt>.*) = parens indicate another group, similarly named like ?<FinInst> above 

        // create a regular expression object of just this specific pattern
        var RegExFinInst = new Regex( patFinancial );


        // Now, prepare another string line to parse and its regular expression object to match against.
        // for Dates, https://regexland.com/regex-dates/ had a good clarification, but since your dates
        // appear in month.day.year format, I had to alter the pattern (the sample 10.02.2021 would also
        // fit day.month.year; reorder the groups if that is the case).
        // The separators are written as \. so that they match only a literal dot.
        var patFXSettlement = @"^\s*FX Settlement Date\s*:\s*(?<sMonth>(0[1-9]|1[0-2]))\.(?<sDay>(0[1-9]|[12][0-9]|3[01]))\.(?<sYear>\d{4})";
        // each pattern, just creating a regular expression of its corresponding pattern to match
        var RegSettle = new Regex(patFXSettlement);

        // same here on last 2 samples; the \s* after the ":" keeps the leading
        // white space out of the captured value
        var patReconFile = @"^\s*Reconciliation File ID\s*:\s*(?<FileId>.*)";
        var RegRecon = new Regex(patReconFile);

        var patTxnCurr = @"^\s*Transaction Currency\s*:\s*(?<Currency>[A-Z]{3}).*";
        var RegTxnCurr = new Regex(patTxnCurr);

        // go through each line
        foreach ( var s in txtLines )
        {
            // see if the current line "matches" the Financial Institution pattern.
            // Because the groups are "named", you can read each one by its name
            // without having to know its ordinal number within the expression.
            var hasMatch = RegExFinInst.Match(s);
            if( hasMatch.Success )
            {
                MessageBox.Show("Financial Institution Group: " + hasMatch.Groups["FinInst"] + "\r\n"
                              + "Account: " + hasMatch.Groups["FinAccnt"]);
                // done with this line
                continue;
            }

            // if not, try the next, and next and next
            hasMatch = RegSettle.Match(s);
            if( hasMatch.Success )
            {
                MessageBox.Show("FX Settlement Month: " + hasMatch.Groups["sMonth"]
                          + "  Day: " + hasMatch.Groups["sDay"]
                          + " Year: " + hasMatch.Groups["sYear"] );
                // done with this line
                continue;

            }

            hasMatch = RegRecon.Match(s);
            if (hasMatch.Success)
            {
                MessageBox.Show("Reconciliation File: " + hasMatch.Groups["FileId"] );
                // done with this line
                continue;

            }

            hasMatch = RegTxnCurr.Match(s);
            if (hasMatch.Success)
            {
                MessageBox.Show("Transaction Currency: " + hasMatch.Groups["Currency"]);
                // done with this line
                continue;

            }
        }

    }
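
The code above handles only the header lines of each block. The table rows can be matched the same way. Here is a minimal sketch, assuming each data row starts and ends with "!" and carries the category followed by exactly eight numeric columns, as in the sample file. `SettlementRow`, `SettlementRowParser`, and the property names are hypothetical, chosen only for illustration.

```csharp
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical POCO for one table row; the property names are assumptions.
public sealed class SettlementRow
{
    public string Category { get; set; } = "";
    public decimal TransactionCredit { get; set; }
    public decimal ReconciliationCredit { get; set; }
    public decimal FeeCredit { get; set; }
    public decimal TransactionDebit { get; set; }
    public decimal ReconciliationDebit { get; set; }
    public decimal FeeDebit { get; set; }
    public int Count { get; set; }
    public decimal NetValue { get; set; }
}

public static class SettlementRowParser
{
    // "!" + category text + exactly eight numeric columns + closing "!".
    // The header rows contain no digits and the "Totals" line has no "!",
    // so neither of them matches.
    private static readonly Regex RowPattern = new Regex(
        @"^\s*!\s*(?<Category>\S.*?)\s+(?:(?<Num>-?\d+(?:\.\d+)?)\s+){8}!\s*$");

    // Returns the parsed row, or null when the line is not a data row.
    public static SettlementRow Parse(string line)
    {
        var match = RowPattern.Match(line);
        if (!match.Success)
            return null;

        // A group inside a repetition keeps every value it captured, in order.
        var nums = match.Groups["Num"].Captures;
        decimal Dec(int i) =>
            decimal.Parse(nums[i].Value, NumberStyles.Number, CultureInfo.InvariantCulture);

        return new SettlementRow
        {
            Category = match.Groups["Category"].Value,
            TransactionCredit = Dec(0),
            ReconciliationCredit = Dec(1),
            FeeCredit = Dec(2),
            TransactionDebit = Dec(3),
            ReconciliationDebit = Dec(4),
            FeeDebit = Dec(5),
            Count = int.Parse(nums[6].Value, CultureInfo.InvariantCulture),
            NetValue = Dec(7)
        };
    }
}
```

Inside the foreach loop you would call SettlementRowParser.Parse(s) after the header patterns and attach any non-null result to the block that was read most recently. Parsing with CultureInfo.InvariantCulture keeps "3.57" from being misread on machines whose regional settings use a decimal comma.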

CodePudding user response：

I ended up parsing it manually. As I said, the structure is always the same, and I use a fail-fast approach: if anything is wrong, I just throw an exception.

    IRuntimeServices runtimeServices = new RuntimeServices();

    List<string> transactionTitles = new();
    List<string> transactionDetails = new();
    string constText = "Financial Institution";
    bool isTitleFinished = false;
    int counterTable = 0;
    int counterTitle = 0;

    for (int i = 0, j = i; i < text.Length; i++)
    {
        if (text[i] == ' ' && !isTitleFinished)
        {
            Helper.AddItem(transactionTitles, text, j, counterTitle);
            isTitleFinished = true;
            j = i;
            counterTitle = 0;
        }
        else if(!isTitleFinished && text[i] != ' ')
        {
            counterTitle++;
        }

        if (isTitleFinished)
        {
            if (text.Length >= i + constText.Length || text.IsLastIndex(i))
            {
                if(text.IsLastIndex(i))
                {
                    Helper.AddItem(transactionDetails, text, j,null);
                }
                else if (text.IsSubStrEqualToSpecificStr(i,constText))
                {
                    Helper.AddItem(transactionDetails, text, j, counterTable);
                    isTitleFinished = false;
                    counterTable = 0;
                    j = i;
                }
                else
                {
                    counterTable++;
                }
            }
        }
    }

    ICollection<Transaction> transactions = new List<Transaction>();

    for (int i = 0; i < transactionTitles.Count; i++)
    {
        string[] titlePairs = transactionTitles[i]
            .Trim()
            .Split(new char[] { '\n', '\r' }, 
            StringSplitOptions.RemoveEmptyEntries); 
        
        Dictionary<string, string> transactionTitlesDict = new ();
        for (int j = 0; j < titlePairs.Length; j++)
        {
            string[] nameAndValue = titlePairs[j].Split(":");
            transactionTitlesDict.Add(nameAndValue[0].Trim(), nameAndValue[1].Trim());
        }
         Transaction transaction = runtimeServices
            .CreateCustomObject<Transaction>(transactionTitlesDict);


        string[] detailPairs = transactionDetails[i]
            .Trim()
            .Split(new char[] { '\n', '\r' },
            StringSplitOptions.RemoveEmptyEntries);

        string[] detailTitlesPart1 = detailPairs[1]
            .Trim()
            .Split(new char[] { '\n', '\r','!' },
            StringSplitOptions.RemoveEmptyEntries);

        string[] detailTitlesPart2 = detailPairs[2]
            .Trim()
            .Split(new char[] { '\n', '\r', '!' },
            StringSplitOptions.RemoveEmptyEntries);

        IList<string> transactionDetailsTitles = new List<string>();

        if(detailTitlesPart1.Length != detailTitlesPart2.Length)
        {
            throw new Exception("Invalid format");
        }

        for (int p = 0; p < detailTitlesPart1.Length; p++)
        {
            transactionDetailsTitles
                .Add($"{detailTitlesPart1[p].Trim()} {detailTitlesPart2[p].Trim()}");
        }

        IList<string[]> transactionDetailsData = new List<string[]>();

        // Rows 0-3 are the table header; the last two rows are the separator and the totals.
        for (int k = 4; k < detailPairs.Length - 2; k++)
        {
            string[] data = detailPairs[k]
                .Trim()
                .Split(new[] { "  ","!" },
                StringSplitOptions.RemoveEmptyEntries);
            transactionDetailsData.Add(data);
        }

        Dictionary<string, string> transactionDetailsDict = new();

        foreach (string[] transactionDetailDataRow in transactionDetailsData)
        {
            // Fail fast: every data row must have exactly one value per column title.
            if (transactionDetailDataRow.Length != transactionDetailsTitles.Count)
            {
                throw new Exception("Invalid format");
            }
            for (int l = 0; l < transactionDetailsTitles.Count; l++)
            {
                transactionDetailsDict
                    .Add(transactionDetailsTitles[l].Trim(), transactionDetailDataRow[l].Trim());
            }
            // Don't pay attention to this part
            SettlementDetail settlementDetail = runtimeServices
                .CreateCustomObject<SettlementDetail>(transactionDetailsDict);
            transaction.SettlementDetails.Add(settlementDetail);
            transactionDetailsDict.Clear();
        }
        transactions.Add(transaction);
    }
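
The first loop above walks the text character by character to cut it into one block per "Financial Institution" header. As a shorter alternative, here is a minimal sketch that does the same kind of split with a regular expression and then reads each block's "name : value" header lines into a dictionary. `ReportSplitter`, `SplitBlocks`, and `ReadHeader` are hypothetical names for illustration, not part of the code above. The sketch assumes that every block starts with a "Financial Institution" line and that the header ends at the first non-blank line without a colon (the dashed line that opens the table).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class ReportSplitter
{
    // Zero-width lookahead: split right before each line that starts a new block,
    // so the marker line itself stays inside its block.
    private static readonly Regex BlockStart = new Regex(
        @"(?=^[ \t]*Financial Institution\s*:)", RegexOptions.Multiline);

    // "name : value", with the surrounding white space kept out of both groups.
    private static readonly Regex HeaderLine = new Regex(
        @"^\s*(?<Name>[^:]+?)\s*:\s*(?<Value>.*?)\s*$");

    public static List<string> SplitBlocks(string text)
    {
        return BlockStart.Split(text)
            .Where(block => block.Trim().Length > 0) // drop the empty piece before the first marker
            .ToList();
    }

    public static Dictionary<string, string> ReadHeader(string block)
    {
        var header = new Dictionary<string, string>();
        var lines = block.Split(new[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);
        foreach (var line in lines)
        {
            if (line.Trim().Length == 0)
                continue;                 // skip lines that contain only spaces

            var match = HeaderLine.Match(line);
            if (!match.Success)
                break;                    // first table line: the header is finished

            header[match.Groups["Name"].Value] = match.Groups["Value"].Value;
        }
        return header;
    }
}
```

Each dictionary returned by ReadHeader has the same shape as transactionTitlesDict above, so it could be passed to CreateCustomObject in the same way.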