How to display the latest line based on the file's name or the line's position in bash

Time:12-05

I have a tricky question about how to keep only the latest log data, because my server posted the same data twice.

This is the result after I grep my folder (I have tons of data; this is trimmed to keep it simple):

...
20150630-201427.csv:20150630,CFIIASU,233,96.21786,0.44644,
20150630-201427.csv:20150630,CFIIASU_AU,65,90.71109,0.28569
20150630-201427.csv:20150630,CFIIASU_CN,68,102.19569,0.10692
20150630-201427.csv:20150630,CFIIASU_ID,37,98.02484,0.27775
20150630-201427.csv:20150630,CFIIASU_KR,39,98.42257,0.83055
20150630-201427.csv:20150630,CFIIASU_TH,24,99.94482,0.20743
20150701-151654.csv:20150630,CFIIASU,233,96.21450,0.44294
20150701-151654.csv:20150630,CFIIASU_AU,65,90.71109,0.28569
20150701-151654.csv:20150630,CFIIASU_CN,68,102.16538,0.07723
20150701-151654.csv:20150630,CFIIASU_ID,37,98.02484,0.27775
20150701-151654.csv:20150630,CFIIASU_KR,39,98.42257,0.83055
20150701-151654.csv:20150630,CFIIASU_TH,24,99.94482,0.20743
...

The data actually comes from many csv files; I picked only two of them for the example. Here are some explanations:

  1. the example comes from two files, 20150630-201427.csv and 20150701-151654.csv, and each line has 5 columns, which correspond to date, dataname, data_column1, data_column2, data_column3.
  2. these lines have the same data date 20150630 and the same datanames CFIIASU, CFIIASU_AU, etc., but the numbers in the fourth and fifth columns (data_column2 and data_column3) can differ.

How can I keep only the data from 20150701-151654.csv, based on the file name and the data date, and apply this to my whole data set?

To make it clearer: I'd like to keep only the lines from the latest csv, and which csv is the latest is determined by the file name, which in this example is 2015070. But in my whole data set I have to deal with so many 20xxxxxx.csv files that I can't check them one by one.

For the example I gave, the result should end up like this:

20150701-151654.csv:20150630,CFIIASU,233,96.21450,0.44294
20150701-151654.csv:20150630,CFIIASU_AU,65,90.71109,0.28569
20150701-151654.csv:20150630,CFIIASU_CN,68,102.16538,0.07723
20150701-151654.csv:20150630,CFIIASU_ID,37,98.02484,0.27775
20150701-151654.csv:20150630,CFIIASU_KR,39,98.42257,0.83055
20150701-151654.csv:20150630,CFIIASU_TH,24,99.94482,0.20743

Thanks in advance.

CodePudding user response:

Your question isn't clear but it sounds like this might be what you're trying to do (print all lines from the last csv mentioned in the input file):

$ tac file | awk -F':' 'NR>1 && $1!=prev{exit} {print; prev=$1}' | tac
20150701-151654.csv:20150630,CFIIASU,233,96.21450,0.44294
20150701-151654.csv:20150630,CFIIASU_AU,65,90.71109,0.28569
20150701-151654.csv:20150630,CFIIASU_CN,68,102.16538,0.07723
20150701-151654.csv:20150630,CFIIASU_ID,37,98.02484,0.27775
20150701-151654.csv:20150630,CFIIASU_KR,39,98.42257,0.83055
20150701-151654.csv:20150630,CFIIASU_TH,24,99.94482,0.20743
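The pipeline above relies on the lines of the latest csv appearing last in the input. If that ordering cannot be guaranteed, a two-pass awk is one possible alternative. This is only a sketch: the function name `lines_from_latest_file` is made up for illustration, and it assumes the file names sort chronologically as plain strings (as the `YYYYMMDD-HHMMSS.csv` names here do) and that the input is a regular file, since it is read twice.

```shell
# Hypothetical helper: print every line whose file-name prefix (the part
# before the first ':') is the lexicographically greatest one in the input.
# Assumes $1 is a regular file, because awk reads it twice.
lines_from_latest_file() {
    awk -F':' '
        # First pass: remember the greatest file name seen.
        NR == FNR { if ($1 > max) max = $1; next }
        # Second pass: print only the lines that belong to that file.
        $1 == max
    ' "$1" "$1"
}
```

It would be called as `lines_from_latest_file file`, where `file` holds the grep output shown in the question.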

or maybe this (print the last line seen for every 20150630,CFIIASU etc. pair in the input file):

$ tac file | awk -F'[:,]' '!seen[$2,$3]++' | tac
20150701-151654.csv:20150630,CFIIASU,233,96.21450,0.44294
20150701-151654.csv:20150630,CFIIASU_AU,65,90.71109,0.28569
20150701-151654.csv:20150630,CFIIASU_CN,68,102.16538,0.07723
20150701-151654.csv:20150630,CFIIASU_ID,37,98.02484,0.27775
20150701-151654.csv:20150630,CFIIASU_KR,39,98.42257,0.83055
20150701-151654.csv:20150630,CFIIASU_TH,24,99.94482,0.20743
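The per-pair approach above also depends on the newest file's lines coming last in the input. As a rough sketch that does not depend on input order, the comparison can be done on the file name itself. The function name `latest_per_key` is hypothetical, and the sketch assumes that file names compare chronologically as strings and that a line's fields are file name, date, dataname when split on `:` and `,`.

```shell
# Hypothetical helper: for each (date, dataname) pair, keep the line that
# comes from the lexicographically greatest file name.
# Reads the files given as arguments, or stdin when there are none.
latest_per_key() {
    awk -F'[:,]' '
        {
            key = $2 SUBSEP $3
            # Appending "" forces a string comparison of the file names.
            if (!(key in file) || ($1 "") > (file[key] "")) {
                file[key] = $1
                line[key] = $0
            }
        }
        # for-in order is unspecified, so the result is sorted afterwards.
        END { for (key in line) print line[key] }
    ' "$@" | LC_ALL=C sort
}
```

For the grep output in the question, `latest_per_key file` should keep one line per dataname, each taken from 20150701-151654.csv. The output is sorted by whole line, so the original line order is not preserved.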

CodePudding user response:

One way to do this would be to use the Unix command "awk" to filter the data based on the file name and date. Here's an example of how this could be done:

First, use the "awk" command to filter the data by the file name "20150701-151654.csv" and the date "20150630":

awk -F '[:,]' '$1 == "20150701-151654.csv" && $2 == "20150630"' file

This will return only the lines that match the specified file name and date.

Next, you can use the "sort" command to sort the data by the dataname field, so that all the lines with the same dataname are grouped together:

awk -F '[:,]' '$1 == "20150701-151654.csv" && $2 == "20150630"' file | sort -t "," -k2,2

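Finally, you can remove duplicate lines based on the dataname field, so that only one line per dataname is kept. The "uniq" command cannot do this here, because its -f option skips blank-separated fields and these lines are comma-separated, so use the -u option of "sort" with the same key instead:

awk -F '[:,]' '$1 == "20150701-151654.csv" && $2 == "20150630"' file | sort -t "," -u -k2,2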

This should give you the desired output of only the latest data from the file "20150701-151654.csv" with the date "20150630". Note that the file name and date are hard-coded here, so to apply this to your entire data set you would need to substitute the latest file name for each date.
