How to regex convert text input file with "++++", "+++" as group separators to C

I have a file formatted like this:

03.12.2020  ++++baz;bar;++bik;
04.12.2020  +++bar;
05.12.2020  ++++baz;bar;bur,jojo;+++bik;buch;+pac;

It follows the format:

date, tab, "++++", list of keywords with magnitude 4 with ";" after each keyword, "+++", list of keywords with magnitude 3 with ";" after each keyword, etc.

If a specific magnitude does not exist, its section is omitted, for example (+++keyword_with_magnitude_3;+keyword_with_magnitude_1;another_keyword_with_magnitude_1;)

I need to convert it to: date, keyword, magnitude

For example from:

03.12.2020  ++++baz;bar;++bik;

to:

03.12.2020, baz, 4
03.12.2020, bar, 4
03.12.2020, bik, 2

Regex ^(\d\d\.\d\d\.\d\d\d\d\t)\+\+\+\+([^\+\r]+) finds only the lines with four pluses and nothing else.

EDIT1: I could drop it into NodeJS if that is easier. I don't know how to split the lines while keeping the date as the first thing in each line.

CodePudding user response：

If you don't mind running a few steps, this might work:

Run each of the following steps repeatedly until no more changes are made (assuming that global replace is not available).
Distribute the magnitudes:

Find: (\++)([a-zA-Z0-9]+)[;,]([a-zA-Z0-9]+)

Replace: \1\2;\1\3

Move each keyword to a new line, copying the date:

Find: ^(\d\d\.\d\d\.\d\d\d\d)(\s+)(\+*)([a-zA-Z0-9]+)[;,](\+*)([a-zA-Z0-9]+)(.*)$

Replace: \1 \3\4;\n\1 \5\6\7

Reformat the lines so that the number of pluses becomes a digit after the keyword (1 plus):

Find: ^(\d\d\.\d\d\.\d\d\d\d)(\s+)(\+{1})([a-zA-Z0-9]+)[;,]

Replace: \1,\4,1

reformat the 2 pluses

Find: ^(\d\d\.\d\d\.\d\d\d\d)(\s+)(\+{2})([a-zA-Z0-9]+)[;,]

Replace: \1,\4,2

reformat the 3 pluses

Find: ^(\d\d\.\d\d\.\d\d\d\d)(\s+)(\+{3})([a-zA-Z0-9]+)[;,]

Replace: \1,\4,3

reformat the 4 pluses

Find: ^(\d\d\.\d\d\.\d\d\d\d)(\s+)(\+{4})([a-zA-Z0-9]+)[;,]

Replace: \1,\4,4

Remove empty lines with: (\r?\n)(\r?\n) => \1

Remove lines that contain only a date with: ^(\d\d\.\d\d\.\d\d\d\d)(\s*)(\r?\n) => nothing
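Since the question mentions Node.js as an option, the steps above can be sketched as repeated replace calls. This is only an illustration under some assumptions: the magnitude markers are `+` characters (as the step descriptions say), the helper names `replaceUntilStable` and `convertWithSteps` are made up for this example, and JavaScript replacement strings use `$1` where the editor syntax above uses `\1`.

```javascript
// Hypothetical helper: apply one find/replace repeatedly until the text stops changing.
function replaceUntilStable(text, pattern, replacement) {
  let previous;
  do {
    previous = text;
    text = text.replace(pattern, replacement);
  } while (text !== previous);
  return text;
}

// Hypothetical function that chains the find/replace steps listed above.
function convertWithSteps(input) {
  let text = input;
  // Distribute the magnitudes so that every keyword carries its own pluses.
  text = replaceUntilStable(
    text,
    /(\++)([a-zA-Z0-9]+)[;,]([a-zA-Z0-9]+)/g,
    "$1$2;$1$3"
  );
  // Move each keyword to its own line, copying the date.
  text = replaceUntilStable(
    text,
    /^(\d\d\.\d\d\.\d{4})(\s+)(\+*)([a-zA-Z0-9]+)[;,](\+*)([a-zA-Z0-9]+)(.*)$/gm,
    "$1 $3$4;\n$1 $5$6$7"
  );
  // Turn a run of exactly n pluses into the digit n after the keyword.
  for (let n = 1; n <= 4; n++) {
    const pattern = new RegExp(
      `^(\\d\\d\\.\\d\\d\\.\\d{4})(\\s+)(\\+{${n}})([a-zA-Z0-9]+)[;,]`,
      "gm"
    );
    text = text.replace(pattern, `$1,$4,${n}`);
  }
  return text;
}

const sample = "03.12.2020\t++++baz;bar;++bik;\n04.12.2020\t+++bar;";
console.log(convertWithSteps(sample));
```

For the two sample lines this should print four rows: baz and bar with magnitude 4, bik with magnitude 2, and bar with magnitude 3. The cleanup steps for empty and date-only lines are left out because this input does not produce any.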

CodePudding user response：

You can use

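const text = `03.12.2020  ++++baz;bar;++bik;
04.12.2020  +++bar;
05.12.2020  ++++baz;bar;bur,jojo;+++bik;buch;+pac;`;
for (const line of text.split(/[\r\n]+/)) {
  console.log(`=== Processing '${line}' ===`);
  // Group 1: the date; Group 2: everything after the separating spaces
  const [_, date, data] = line.match(/^(\d{2}\.\d{2}\.\d{4}) +(.+)/);
  // Each match: an optional run of pluses (Group 1) followed by a keyword (Group 2)
  const matches = data.matchAll(/(\+*)([^;,]+)/g);
  let magnitude = 0;
  for (const m of matches) {
    // A non-empty run of pluses starts a new magnitude section
    if (m[1].length > 0) { magnitude = m[1].length; }
    const val = m[2];
    console.log(`${date}, ${val}, ${magnitude}`);
  }
}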

Notes:

.split(/[\r\n]+/) splits the text into lines (you will probably handle this differently, I just assumed you have the input as a single string)
[_, date, data] = line.match(/^(\d{2}\.\d{2}\.\d{4}) +(.+)/); fills date and data with the Group 1 and Group 2 values: the first group captures the date and the second group matches any text after the first run of spaces (it can even be written as line.match(/^(\S+)\s+(\S.*)/);)
Since data contains the elements we need to split, it is matched with (\+*)([^;,]+): Group 1 contains zero or more pluses, and Group 2 contains one or more chars other than a comma and a semicolon.
Iterating over the above matches, magnitude is re-assigned once Group 1 (the pluses) is not empty. The final result is a concatenation of date, value and magnitude.
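As a possible variation on the loop above, the same matching logic can be wrapped in a function that returns the rows instead of logging them. This is a sketch, not part of the original answer: the name `toRows` is hypothetical, the date is separated from the data with `\s+` so that a tab (as described in the question) also works, and lines that do not start with a date are skipped rather than causing `line.match(...)` to return `null` and throw during destructuring.

```javascript
// Hypothetical helper: convert the whole input into "date, keyword, magnitude" rows.
function toRows(text) {
  const rows = [];
  for (const line of text.split(/\r?\n/)) {
    // Date, then whitespace (tab or spaces), then the keyword data.
    const match = line.match(/^(\d{2}\.\d{2}\.\d{4})\s+(\S.*)$/);
    if (!match) continue; // skip blank or malformed lines instead of throwing
    const [, date, data] = match;
    let magnitude = 0;
    // Optional run of pluses, then a keyword without separators, pluses, or whitespace.
    for (const m of data.matchAll(/(\+*)([^;,+\s]+)/g)) {
      if (m[1].length > 0) magnitude = m[1].length;
      rows.push(`${date}, ${m[2]}, ${magnitude}`);
    }
  }
  return rows;
}

const input = "03.12.2020\t++++baz;bar;++bik;\n\n04.12.2020\t+++bar;";
console.log(toRows(input).join("\n"));
```

Returning an array keeps the conversion separate from the output, so the rows can be written to a file or joined with a different line ending as needed.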

CodePudding user response：

Find: ^(\d\d\.\d\d\.\d{4})(.*)(\t|;)(\+.*)

Replace: \1\2\3\n\1\t\4

Run this 3 times.

Remove empty lines and lines that have only a date and no data after it.

Find: ^(\d\d\.\d\d\.\d{4})\t(\+\+\+\+)(.+)

Replace: \1,\3,4

Find: ^(\d\d\.\d\d\.\d{4})\t(\+\+\+)(.+)

Replace: \1,\3,3

Find: ^(\d\d\.\d\d\.\d{4})\t(\+\+)(.+)

Replace: \1,\3,2

Find: ^(\d\d\.\d\d\.\d{4})\t(\+)(.+)

Replace: \1,\3,1