Home > Back-end >  Comparing values in two files using awk
Comparing values in two files using awk

Time:09-01

I'm doing my own projects to attempt to learn bash more, especially awk, and I've become stuck on something in regards to multi-file input in bash. I'm attempting to write a script that takes the inputs of three columns in file1.txt. If any names in file2.txt match anything in column 11 of file1.txt then do a certain command. File1.txt sample contents (only showing columns 8, 9, 10, 11):

September   23    12pm   Banana 
December    1     1am    Pineapple
May         28    7pm    Apple

File2.txt sample contents:

Grapefruit
Apple
Lemon
Banana

The command would be in regards to comparing the dates to the system clock, but I do not want to get ahead of myself (I believe converting the dates into epoch time would be the way to go). Currently this is what I came up with:

awk '{Dates[$1]=$8, $9, $10, $11 FS Names[$2]=$1;next} {a[$1, $2]} {print Dates[$1]}' file1.txt file2.txt

I'm aware that this is lacking severely. I hope I'm at least heading in the right direction.

CodePudding user response:

Your attempt is very good; if you change a couple of things you get (I assume) your intended outcome:

awk -F"\t" 'NR==FNR{Dates[$4]=$1 FS $2 FS $3; Names[$4]; next} $1 in Names {print Dates[$1]}' file1.txt file2.txt
May 28  7pm

Does that solve your problem?

CodePudding user response:

Your description of what you are attempting is a bit vague, but if I understand you want to find the dates in file1.txt where the fourth field is also contained as field one in file2.txt. Your approach with using arrays is the right approach, but if you are attempting to do what I've described above, you want to read file2.txt first and then process the records in file1.txt.

In doing so, you want to get the time as a number removing the am or pm that is attached to the third field. You can do that in awk simply by adding 0 to the field value to tell awk to interpret the field as a number (non-numeric suffixes will be discarded)

Putting that together, and using f1 for file1.txt and f2 for file2.txt and processing the data you posted and not some other file with more fields, you could do:

awk 'FNR==NR {a[$1]=1; next} $4 in a {print $1, $2, $3 0}' f2 f1

(note how file2.txt is processed before file1.txt, e.g. f2 and f1 respectively)

Where are records in file2.txt (f2 above) are processing into the a[] array indexed by the field value itself. FNR == NR is a simple conditional that says while the file-record-number (FNR) is equal to the number of records (NR) meaning that applies to the first file processed (FNR resets to 1 for each new file, NR continues sequentially). So the array is only filled with records from the first file processed (file2.txt).

The test $4 in a simply tests if the value of field four is found as an index in the a[] array. You can then process the date/time using the corresponding field numbers, 1 - 3 in the data you posted, adding 0 to the third field to interpret as a number.

Output

That will isolate the dates/times matching the two records in file2.txt that also appear in file1.txt, e.g.

$ awk 'FNR==NR {a[$1]=1; next} $4 in a {print $1, $2, $3 0}' f2 f1
September 23 12
May 28 7

With the dates and times separated, you can use a printf() call in awk to form the date into a format that can be converted to epoch. If using GNU awk or mawk, there are time and date extensions that handle converting from a datespec (normal month day year time) into a timespec (seconds since epoch), see, e.g. GNU Awk User's Guide - 9.1.5 Time Functions. For the month/number (e.g. 1 - 12) conversion you can create a simple array of months in the BEGIN section to use as a lookup table.

Let me know if I misunderstood your question or if you have further questions and I can help further.

  • Related