Can anyone help me convert the dates in this text file from YYYYMMDD to YYYY-MM-DD using a bash script or the Linux command line? I am not sure how to start editing 23 million lines!
I have a text file with dates in YYYYMMDD format:
3515034013|50008|20140601|20240730
and I want the dates in YYYY-MM-DD format (only the 3rd and 4th fields need to change, across all 23 million lines):
3515034013|50008|2014-06-01|2024-07-30
After converting the 3rd and 4th fields from YYYYMMDD to YYYY-MM-DD, I want to select specific lines from the formatted file by date; that is the end goal. The 3rd field is the start date and the 4th field is the end date. For example, I need:
(01) Lines whose end date (4th field) is before today, i.e. 2022-08-06: all the old lines.
(02) Lines whose end date (4th field) falls within 2 years from now, i.e. between 2022-08-06 and 2024-08-06.
Please note: there are more than 23 million lines to edit and analyze by date.
How should I approach this problem? Which method is the most time-efficient: awk, sed, or line-by-line editing in Bash?
CodePudding user response:
$ awk '
BEGIN { FS=OFS="|" }
{
for ( i=3; i<=4; i++ ) {
$i = substr($i,1,4) "-" substr($i,5,2) "-" substr($i,7)
}
print
}
' file
3515034013|50008|2014-06-01|2024-07-30
CodePudding user response:
Here is a way to do it with sed. It has the same restrictions as steffens answer: | is the field separator, and all dates have the same format, i.e. leading zeros in the month and day parts.
sed -E 's/^(.*[|])([0-9]{4})([0-9]{2})([0-9]{2})[|]([0-9]{4})([0-9]{2})([0-9]{2})$/\1\2-\3-\4|\5-\6-\7/g'
Here is what the regular expression does:
- ^(.*[|]) captures the first part of the string, from line start (^) to a |, into \1. This captures the first two columns, because the remaining part of the regular expression has to match the remaining part of the line up to the line end.
- ([0-9]{4})([0-9]{2})([0-9]{2})[|] captures the parts of the first date field into \2 to \4; notice the [|].
- ([0-9]{4})([0-9]{2})([0-9]{2})$ does the same for the second date column, anchored at the line end ($), and captures the parts into \5 to \7; notice the $.
- The replacement part \1\2-\3-\4|\5-\6-\7 inserts - at the different places.
- The capturing into \n happens because of the use of (...) parens in the regular expression.
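To see the capture groups at work, here is a small sketch that wraps the command in a shell function and runs it on the sample line from the question. The function name reformat_dates is made up for this illustration, and the trailing g flag is dropped because the anchored pattern can match at most once per line.

```shell
#!/bin/sh
# reformat_dates is a hypothetical helper name, not part of sed.
# It rewrites the last two |-separated fields from YYYYMMDD to YYYY-MM-DD.
reformat_dates() {
    sed -E 's/^(.*[|])([0-9]{4})([0-9]{2})([0-9]{2})[|]([0-9]{4})([0-9]{2})([0-9]{2})$/\1\2-\3-\4|\5-\6-\7/'
}

# Sample line from the question.
printf '%s\n' '3515034013|50008|20140601|20240730' | reformat_dates
# prints: 3515034013|50008|2014-06-01|2024-07-30

# A line whose date field is not exactly 8 digits is passed through unchanged.
printf '%s\n' 'abc|def|2014061|20240730' | reformat_dates
# prints: abc|def|2014061|20240730
```

Lines that do not match the pattern are left as they are, so malformed records are not silently mangled.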
CodePudding user response:
Here's one way to change the format with awk:
awk '{$3=substr($3,1,4) "-" substr($3,5,2) "-" substr($3,7,2); $4=substr($4,1,4) "-" substr($4,5,2) "-" substr($4,7,2); print}' FS='|' OFS='|'
It should work given that
- | is only used for field separation
- all dates have the same format
You can write the transformed lines to a new file or change the file in place. Of course, you can do the same with sed or ed. I'd go for awk because you'd be able to extract your specific lines to an extra file in the same run.
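As a rough sketch of that single-pass idea, the script below reformats both date fields and, in the same run, writes the lines for the two cases from the question to separate files. The file names (input.txt, formatted.txt, expired.txt, next_two_years.txt), the variable names today and limit, and the two extra sample lines are all assumptions made for this illustration. Because YYYY-MM-DD strings sort in the same order as the dates they represent, a plain string comparison is enough.

```shell
#!/bin/sh
# Hypothetical sample data; the first line is from the question,
# the other two are invented to exercise both filters.
printf '%s\n' \
    '3515034013|50008|20140601|20240730' \
    '3515034014|50009|20100101|20200315' \
    '3515034015|50010|20150101|20300101' > input.txt

# today and limit are passed in as strings; both values are the
# example dates from the question, not computed from the clock.
awk -v today='2022-08-06' -v limit='2024-08-06' '
BEGIN { FS = OFS = "|" }
{
    # Rewrite fields 3 and 4 from YYYYMMDD to YYYY-MM-DD.
    for (i = 3; i <= 4; i++)
        $i = substr($i, 1, 4) "-" substr($i, 5, 2) "-" substr($i, 7, 2)

    print > "formatted.txt"                     # every reformatted line

    if ($4 < today)
        print > "expired.txt"                   # end date before today
    else if ($4 <= limit)
        print > "next_two_years.txt"            # end date within two years
}' input.txt
```

With this sample, expired.txt receives the line ending in 2020-03-15 and next_two_years.txt the line ending in 2024-07-30. The file is read only once, which matters with 23 million lines.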
CodePudding user response:
Assuming only the date columns have 8 characters, you could use GNU sed:
$ sed -E 's/\<(....)(..)(..)\>/\1-\2-\3/g' input_file
3515034013|50008|2014-06-01|2024-07-30
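The 8-character assumption matters. The sketch below, which assumes GNU sed for the \< and \> word boundaries, shows that any other field that happens to be exactly 8 word characters long gets dashes as well. The function name add_dashes and the second input line are invented for this illustration.

```shell
#!/bin/sh
# add_dashes is a hypothetical helper name; \< and \> are GNU sed extensions.
add_dashes() {
    sed -E 's/\<(....)(..)(..)\>/\1-\2-\3/g'
}

# The sample line: only the two 8-character date fields are changed.
printf '%s\n' '3515034013|50008|20140601|20240730' | add_dashes
# prints: 3515034013|50008|2014-06-01|2024-07-30

# Invented line with an 8-character first field: it is rewritten too.
printf '%s\n' '12345678|50008|20140601|20240730' | add_dashes
# prints: 1234-56-78|50008|2014-06-01|2024-07-30
```

If other columns can be 8 characters wide, one of the field-anchored variants is the safer choice.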
CodePudding user response:
This might work for you (GNU sed):
sed -E 's/^([^|]*\|[^|]*\|....)(..)(..\|....)(..)/\1-\2-\3-\4-/' file
Pattern match and insert - where desired.
Or if the file is only 4 columns:
sed -E 's/(..)(..\|....)(..)(..)$/-\1-\2-\3-\4/' file