I have a file formatted like this:
03.12.2020 ++++baz;bar;++bik;
04.12.2020 bar;
05.12.2020 baz;bar;bur,jojo; bik;buch; pac;
It follows the format:
date, tab, "++++", list of keywords with magnitude 4 with ";" after each keyword, "+++", list of keywords with magnitude 3 with ";" after each keyword, etc.
If a specific magnitude does not exist, its section is omitted, for example (+++keyword_with_magnitude_3;+keyword_with_magnitude_1;another_keyword_with_magnitude_1;)
I need to convert it to: date, keyword, magnitude
For example from:
03.12.2020 ++++baz;bar;++bik;
to
03.12.2020, baz, 4
03.12.2020, bar, 4
03.12.2020, bik, 2
The regex ^(\d\d\.\d\d\.\d\d\d\d\t)\+\+\+\+([^\+\r]+)
finds only the lines with four "+"s and nothing else.
EDIT1: I could drop it into NodeJS if that is easier. I don't know how to split the lines while keeping the date as the first thing in each line.
CodePudding user response:
If you don't mind running a few steps, this might work:
- run each of the following steps repeatedly until no more changes are made (assuming that global replace is not available)
- distribute the magnitudes:
Find: (\++)([a-zA-Z0-9]+)[;,]([a-zA-Z0-9]+)
Replace: \1\2;\1\3
- move each keyword to a new line, copying the date
Find: ^(\d\d\.\d\d\.\d\d\d\d)(\s+)(\+*)([a-zA-Z0-9]+)[;,](\+*)([a-zA-Z0-9]+)(.*)$
Replace: \1 \3\4;\n\1 \5\6\7
- reformat the lines to have the pluses as a digit after the keyword (1 plus)
Find: ^(\d\d\.\d\d\.\d\d\d\d)(\s+)(\+{1})([a-zA-Z0-9]+)[;,]
Replace: \1,\4,1
- reformat the 2 pluses
Find: ^(\d\d\.\d\d\.\d\d\d\d)(\s+)(\+{2})([a-zA-Z0-9]+)[;,]
Replace: \1,\4,2
- reformat the 3 pluses
Find: ^(\d\d\.\d\d\.\d\d\d\d)(\s+)(\+{3})([a-zA-Z0-9]+)[;,]
Replace: \1,\4,3
- reformat the 4 pluses
Find: ^(\d\d\.\d\d\.\d\d\d\d)(\s+)(\+{4})([a-zA-Z0-9]+)[;,]
Replace: \1,\4,4
Remove empty lines with: (\r?\n)(\r?\n)
=> \1
Remove lines that contain only a date with: ^(\d\d\.\d\d\.\d\d\d\d)(\s*)(\r?\n)
=> nothing
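Since the question mentions NodeJS as an option, the find-and-replace steps above can be sketched as a small script. This is only an illustration under assumptions: the helper names replaceUntilStable and convert are made up here, the sample input (including its plus counts) is hypothetical, and the four "reformat" steps are collapsed into one replace with a callback that counts the pluses. The output uses ", " as the separator to match the format requested in the question.

```javascript
// Hypothetical sample input; the plus counts are assumed for illustration.
const input = '03.12.2020\t++++baz;bar;++bik;\n04.12.2020\t+++bar;';

// Apply one replacement over and over until the text stops changing.
function replaceUntilStable(text, pattern, replacement) {
  let previous;
  do {
    previous = text;
    text = text.replace(pattern, replacement);
  } while (text !== previous);
  return text;
}

function convert(text) {
  // Step 1: distribute the magnitudes, so "++++baz;bar" becomes "++++baz;++++bar".
  text = replaceUntilStable(
    text,
    /(\++)([a-zA-Z0-9]+)[;,]([a-zA-Z0-9]+)/g,
    '$1$2;$1$3'
  );
  // Step 2: move every keyword to its own line, copying the date.
  text = replaceUntilStable(
    text,
    /^(\d\d\.\d\d\.\d{4})(\s+)(\+*)([a-zA-Z0-9]+)[;,](\+*)([a-zA-Z0-9]+)(.*)$/gm,
    '$1\t$3$4;\n$1\t$5$6$7'
  );
  // Step 3: replace the run of pluses with its length, placed after the keyword.
  return text.replace(
    /^(\d\d\.\d\d\.\d{4})\s+(\++)([a-zA-Z0-9]+)[;,]$/gm,
    (_, date, pluses, keyword) => `${date}, ${keyword}, ${pluses.length}`
  );
}

console.log(convert(input));
```

Because each keyword ends up on its own line after step 2, no separate cleanup of empty or date-only lines is needed in this variant.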
CodePudding user response:
You can use
const text = `03.12.2020 ++++baz;bar;++bik;
04.12.2020 bar;
05.12.2020 baz;bar;bur,jojo; bik;buch; pac;`
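for (const line of text.split(/[\r\n]+/)) {
  console.log(`=== Processing '${line}' ===`);
  // Group 1 is the date, Group 2 is everything after the whitespace that follows it
  const [, date, data] = line.match(/^(\d{2}\.\d{2}\.\d{4})\s+(.+)/);
  // Each match is an optional run of pluses followed by one keyword
  const matches = data.matchAll(/(\+*)([^;,]+)/g);
  let magnitude = 0;
  for (const m of matches) {
    // A non-empty run of pluses starts a new magnitude section
    if (m[1].length > 0) { magnitude = m[1].length; }
    const val = m[2];
    console.log(`${date}, ${val}, ${magnitude}`);
  }
}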
Notes:
- .split(/[\r\n]+/) splits the text into lines (you might deal with it in a different way; I just assumed you have the input as a single string).
- The line.match(/^(\d{2}\.\d{2}\.\d{4})\s+(.+)/) call fills out date and data with the Group 1 and Group 2 values: the first group captures the date and the second group matches any text after the first whitespace (it can even be written as line.match(/^(\S+)\s+(\S.*)/)).
- Since data contains the elements we need to split, it is matched with (\+*)([^;,]+). Group 1 now contains zero or more pluses, and the second group contains one or more chars other than comma and semi-colon.
- Iterating over the above matches, magnitude is re-assigned once Group 1 (the pluses) is not empty. The final result is a concatenation of date, value and magnitude.
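If the rows are needed as data rather than console output, the same idea can be wrapped in a function. The sketch below is an assumption-laden variant, not part of the answer: the name toRows is hypothetical, the sample input and its plus counts are made up, and lines that do not match the date pattern (blank lines, date-only lines) are skipped instead of throwing.

```javascript
// Convert "date<whitespace>++++kw;kw;++kw;" lines into "date, keyword, magnitude" rows.
function toRows(text) {
  const rows = [];
  for (const line of text.split(/[\r\n]+/)) {
    const match = line.match(/^(\d{2}\.\d{2}\.\d{4})\s+(.+)/);
    if (!match) continue; // blank line or a date without any data
    const [, date, data] = match;
    let magnitude = 0;
    for (const m of data.matchAll(/(\+*)([^;,]+)/g)) {
      // A non-empty run of pluses sets the magnitude for the keywords that follow.
      if (m[1].length > 0) magnitude = m[1].length;
      rows.push(`${date}, ${m[2]}, ${magnitude}`);
    }
  }
  return rows;
}

// Hypothetical input, including a blank line and a date-only line.
const sample = '03.12.2020\t++++baz;bar;++bik;\n\n04.12.2020\t+++bar;\n06.12.2020';
console.log(toRows(sample).join('\n'));
```

Returning an array keeps the parsing separate from the output step, so the rows can be joined and written to a file or processed further.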
CodePudding user response:
Find:
^(\d\d\.\d\d\.\d{4})(.*)(\t|;)(\+.*)
Replace:
\1\2\3\n\1\t\4
run this 3 times
- Remove empty lines and lines that have only date and no data after.
Find: ^(\d\d\.\d\d\.\d{4})\t(\+\+\+\+)(.+)
Replace: \1,\3,4
Find: ^(\d\d\.\d\d\.\d{4})\t(\+\+\+)(.+)
Replace: \1,\3,3
Find: ^(\d\d\.\d\d\.\d{4})\t(\+\+)(.+)
Replace: \1,\3,2
Find: ^(\d\d\.\d\d\.\d{4})\t(\+)(.+)
Replace: \1,\3,1
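For comparison, here is a rough NodeJS sketch of the same section-splitting approach. It rests on a few assumptions: the function name convertBySections and the sample input (with its plus counts) are hypothetical, the date is followed by a tab, and the four magnitude replaces are merged into one match that counts the pluses. It also goes one step further than the replaces above by splitting each section into individual keywords, since the replaces alone leave all keywords of one magnitude on the same line.

```javascript
function convertBySections(text) {
  // Split the last "+..." section of each line onto a new line, copying the date.
  // Three passes are enough for up to four magnitude sections per line.
  const splitSection = /^(\d\d\.\d\d\.\d{4})(.*)(\t|;)(\+.*)/gm;
  for (let i = 0; i < 3; i++) {
    text = text.replace(splitSection, '$1$2$3\n$1\t$4');
  }

  const rows = [];
  for (const line of text.split(/\r?\n/)) {
    const match = line.match(/^(\d\d\.\d\d\.\d{4})\t(\++)(.+)/);
    if (!match) continue; // skips empty lines and lines with only a date
    const [, date, pluses, keywords] = match;
    // Extra step (not in the replaces above): one row per keyword.
    for (const keyword of keywords.split(/[;,]/)) {
      if (keyword) rows.push(`${date},${keyword},${pluses.length}`);
    }
  }
  return rows;
}

// Hypothetical sample input; the plus counts are assumed for illustration.
const lines = '03.12.2020\t++++baz;bar;++bik;\n04.12.2020\t+++bar;';
console.log(convertBySections(lines).join('\n'));
```

Skipping non-matching lines in the loop takes the place of the manual "remove empty lines and date-only lines" step.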