Is it possible to extract certain rows from a file where the first column contains an identifier that matches a value taken from another set of files?


I asked this as part of a larger question yesterday. A commenter suggested splitting it off into its own question.

I have x files. The contents of the first five files appear below. On the second row, i is a counter so that i = 0 for the first file. Next to i is time and it always increases by 0.5 when i increases by 1.

However, I have only kept every xth file. So while i = 0 for the first file that I have, i = 100 in the second file, and so on.

6 # This file is called "0.xyz" (<--the "6" is the same in all files)
i =       0, time =        0.000, k =      9000000000000
X        -7.6415350292        6.0494971539        8.1919697993
Y        -6.6418362233        5.9231018862        8.4056822626
Y        -8.0518670684        6.3158684817        9.0061271154
X        26.8252967820       20.4661074967       17.8025744066
Y        26.4477411207       20.4071029058       16.9121571912
Y        26.4399648474       21.2950722068       18.1009273227

6 # This file is called "100.xyz"
i =       100, time =        50.000, k =      2500000000000
X        -6.2423192714       -1.5704681396       -9.5648670474
Y        -5.4925100813       -1.6522059045       -8.9030589772
Y        -6.7765278574       -2.3616512405       -9.4776648590
X         4.1248924594       27.8487302083      -17.5400886312
Y         4.1238657681       26.9869907778      -17.9727402579
Y         5.0750649402       28.1292768156      -17.6848507559

6 # This file is called "200.xyz"
i =       200, time =        100.000, k =      3945000000000
X        19.0090162215       -5.9338939011        6.1931167954
Y        18.4748060757       -6.4905073540        5.6656446036
Y        19.2825591449       -6.4479943255        7.0179774953
X        11.0203415273       34.6029396705        2.7220660957
Y        11.1184002007       34.8398120338        1.8089008500
Y        10.3349649622       33.9509485292        2.5605794622

6 # This file is called "300.xyz"
i =       300, time =       150.000, k =      2341000000000
X        -7.6415350292        6.0494971539        8.1919697993
Y        -6.6418362233        5.9231018862        8.4056822626
Y        -8.0518670684        6.3158684817        9.0061271154
X        26.8252967820       20.4661074967       17.8025744066
Y        26.4477411207       20.4071029058       16.9121571912
Y        26.4399648474       21.2950722068       18.1009273227

6 # This file is called "400.xyz"
i =       400, time =        200.000, k =      2500000000000
X        -6.2423192714       -1.5704681396       -9.5648670474
Y        -5.4925100813       -1.6522059045       -8.9030589772
Y        -6.7765278574       -2.3616512405       -9.4776648590
X         4.1248924594       27.8487302083      -17.5400886312
Y         4.1238657681       26.9869907778      -17.9727402579
Y         5.0750649402       28.1292768156      -17.6848507559

What I would like to do is match the files above with data from another file (below). In the file below, each row corresponds to one of the files above via i (the Step column). I would then like to print the first three columns of the matching rows to a new file.

Although I have included file names, I would prefer to do the matching by using i, not the file names.

I know how to do this by simple iteration. What I mean is, I could count and print every xth row of the file below to a new file. However, I would like a more robust approach that matches on i explicitly, because these are very long files and rows could be missing, in which case counting would leave me with a mismatch between the files above and the file below.

    #   Step       Time        Ax                   Ay                  Az                  Bx                  By                  Bz                  Cx                  Cy                  Cz                     Final 
           0       0.000       14.8032123290        0.0000000000        0.0000000000        0.0000000000       14.8032123290        0.0000000000        0.0000000000        0.0000000000       14.8032123290          3243.9033438318
           1       0.500       14.8029498502        0.0000000000        0.0000000000        0.0000000000       14.8029498502        0.0000000000        0.0000000000        0.0000000000       14.8029498502          3243.7307919244
           2       1.000       14.8026923814        0.0000000000        0.0000000000        0.0000000000       14.8026923814        0.0000000000        0.0000000000        0.0000000000       14.8026923814          3243.5615395313
           3       1.500       14.8024398604        0.0000000000        0.0000000000        0.0000000000       14.8024398604        0.0000000000        0.0000000000        0.0000000000       14.8024398604          3243.3955453870
           4       2.000       14.8021922354        0.0000000000        0.0000000000        0.0000000000       14.8021922354        0.0000000000        0.0000000000        0.0000000000       14.8021922354          3243.2327751298
...
           100       50.000       14.8032123290        0.0000000000        0.0000000000        0.0000000000       14.8032123290        0.0000000000        0.0000000000        0.0000000000       14.8032123290          3243.9033438318
...
           200       100.000       14.8029498502        0.0000000000        0.0000000000        0.0000000000       14.8029498502        0.0000000000        0.0000000000        0.0000000000       14.8029498502          3243.7307919244
...
           300       150.000       14.8026923814        0.0000000000        0.0000000000        0.0000000000       14.8026923814        0.0000000000        0.0000000000        0.0000000000       14.8026923814          3243.5615395313
           301       150.500       14.8024398604        0.0000000000        0.0000000000        0.0000000000       14.8024398604        0.0000000000        0.0000000000        0.0000000000       14.8024398604          3243.3955453870
...
           400       200.000       14.8021922354        0.0000000000        0.0000000000        0.0000000000       14.8021922354        0.0000000000        0.0000000000        0.0000000000       14.8021922354          3243.2327751298

Example of what I would like the result to be from manipulating the file above to match the set of example files at the top of the question:

   0         0.000         14.8032123290        
   100       50.000        14.8032123290
   200       100.000       14.8029498502
   300       150.000       14.8026923814
   400       200.000       14.8021922354

If anyone has any hints on how to approach this, I would be grateful.

CodePudding user response:

You can use an awk script to do this as follows:

awk 'FNR == 1 {
if ($0 ~ /^i =/) {
    dataFile = 0;
    step[$3+0] = FILENAME;
}
else {dataFile = 1;}
}

dataFile == 1 && step[$1] {
    print $1, $2, $3;
}' *.xyz data.txt

(assumes the final file is called data.txt; change as necessary)

FNR == 1 matches the first line of every file and will either capture the step from the xyz file, or set a flag indicating that we've reached the data file. The $3+0 forces awk to convert the 3rd field to a number (i.e., drop the trailing comma), because adding zero requests an arithmetic operation.
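That coercion is easy to check in isolation (the input line here is just a mock-up of the second line of an xyz file):

```shell
# Demonstrates awk's string-to-number coercion: the field "0," becomes the
# number 0 once an arithmetic operation (+0) is requested on it.
result=$(echo 'i =       0, time =        0.000' | awk '{ print $3+0 }')
echo "$result"
```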

dataFile == 1 && step[$1] matches lines within the data file whose step value was seen in an xyz file.

NOTE: You must specify all the xyz files before the final data file, so that all the steps are collected before the data file is processed.


Sorry, when I put the above solution together, I thought the # This file is called lines were not part of the file. Modified script is here:

awk '
FNR == 2 && FILENAME != ARGV[ARGC-1] {
    step[$3+0] = FILENAME;
}

FILENAME == ARGV[ARGC-1] && step[$1] {
    print $1, $2, $3;
}' *.xyz data.txt

This version uses ARGV & ARGC to determine whether the "data" file is being processed.

  • If not the data file and line number == 2, cache the "step" value
  • If the data file, and the step is in the list, print the first 3 fields

Result:

0 0.000 14.8032123290
100 50.000 14.8032123290
200 100.000 14.8029498502
300 150.000 14.8026923814
400 200.000 14.8021922354
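The modified script can be exercised end-to-end on a throwaway dataset. Everything below (file names, k values, the shortened data file) is invented to mirror the question's layout, so treat it as a sketch rather than the asker's real data:

```shell
# Build two tiny .xyz files plus a data file, then run the modified awk.
# All names and numbers here are made up for illustration.
tmp=$(mktemp -d)

printf '6 # header\ni =       0, time =        0.000, k = 1\n'   > "$tmp/0.xyz"
printf '6 # header\ni =       100, time =       50.000, k = 2\n' > "$tmp/100.xyz"

cat > "$tmp/data.txt" <<'EOF'
    #   Step       Time        Ax
           0       0.000       14.8032123290
           1       0.500       14.8029498502
         100      50.000       14.8032123290
EOF

# The xyz files must come before the data file, as noted above.
result=$(awk '
FNR == 2 && FILENAME != ARGV[ARGC-1] {
    step[$3+0] = FILENAME;
}

FILENAME == ARGV[ARGC-1] && step[$1] {
    print $1, $2, $3;
}' "$tmp"/*.xyz "$tmp/data.txt")

echo "$result"
```

Only the rows for steps 0 and 100 survive; the step-1 row has no matching xyz file and is dropped.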

CodePudding user response:

Assumptions:

  • matches are based on both i (aka step) and time

The step / time / Ax data file:

$ cat match.dat
    #   Step       Time        Ax                   Ay                  Az                  Bx                  By                  Bz                  Cx                  Cy                  Cz                     Final
           0       0.000       14.8032123290        0.0000000000        0.0000000000        0.0000000000       14.8032123290        0.0000000000        0.0000000000        0.0000000000       14.8032123290          3243.9033438318
           1       0.500       14.8029498502        0.0000000000        0.0000000000        0.0000000000       14.8029498502        0.0000000000        0.0000000000        0.0000000000       14.8029498502          3243.7307919244
           2       1.000       14.8026923814        0.0000000000        0.0000000000        0.0000000000       14.8026923814        0.0000000000        0.0000000000        0.0000000000       14.8026923814          3243.5615395313
           3       1.500       14.8024398604        0.0000000000        0.0000000000        0.0000000000       14.8024398604        0.0000000000        0.0000000000        0.0000000000       14.8024398604          3243.3955453870
           4       2.000       14.8021922354        0.0000000000        0.0000000000        0.0000000000       14.8021922354        0.0000000000        0.0000000000        0.0000000000       14.8021922354          3243.2327751298
           100       50.000       14.8032123290        0.0000000000        0.0000000000        0.0000000000       14.8032123290        0.0000000000        0.0000000000        0.0000000000       14.8032123290          3243.9033438318
           200       100.000       14.8029498502        0.0000000000        0.0000000000        0.0000000000       14.8029498502        0.0000000000        0.0000000000        0.0000000000       14.8029498502          3243.7307919244
           300       150.000       14.8026923814        0.0000000000        0.0000000000        0.0000000000       14.8026923814        0.0000000000        0.0000000000        0.0000000000       14.8026923814          3243.5615395313
           301       150.500       14.8024398604        0.0000000000        0.0000000000        0.0000000000       14.8024398604        0.0000000000        0.0000000000        0.0000000000       14.8024398604          3243.3955453870
           400       200.000       14.8021922354        0.0000000000        0.0000000000        0.0000000000       14.8021922354        0.0000000000        0.0000000000        0.0000000000       14.8021922354          3243.2327751298

One awk idea:

awk '
FNR==NR   { if (FNR>1)                        # skip header line in 1st file
               Ax[$1 OFS $2]=$3               # use step   OFS   time as index for Ax[] array
            next
          }

$1 == "i" { gsub(/,/,"")                      # remove commas from line so we can use normal FS delimiter to pull ...
            i=$3                              # field #3 (i) and ...
            time=$6                           # field #6 (time)
            if ( (i OFS time) in Ax)          # if i   OFS   time is an index in Ax[] array ...
                print i,time,Ax[i OFS time]   # print our 3 values to stdout
          }
' match.dat [0-9]*.xyz

This generates:

0 0.000 14.8032123290
100 50.000 14.8032123290
200 100.000 14.8029498502
300 150.000 14.8026923814
400 200.000 14.8021922354

If OP needs the output displayed with pretty columns one idea would be to pipe the results to column, eg:

$ awk '...' match.dat [0-9]*.xyz | column -t
0    0.000    14.8032123290
100  50.000   14.8032123290
200  100.000  14.8029498502
300  150.000  14.8026923814
400  200.000  14.8021922354

NOTE: this code matches exact character strings, not numeric values; the strings 150.000 and 150.00, for example, would not match.
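If a numeric comparison is wanted instead, one workaround (a sketch, not part of the answer above) is to normalize both sides with +0 before building the array key, so differently formatted strings collapse to the same number:

```shell
# Sketch: +0 on both sides makes "150.000" and "150.00" yield the same key.
# The BEGIN block fakes the entry that match.dat would have supplied.
result=$(printf 'i = 150, time = 150.00, k = 1\n' | awk '
BEGIN { Ax[("150" + 0) OFS ("150.000" + 0)] = "14.8026923814" }
$1 == "i" { gsub(/,/, "")
            key = ($3 + 0) OFS ($6 + 0)   # numeric normalization of i and time
            if (key in Ax) print $3, $6, Ax[key]
          }')
echo "$result"
```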

CodePudding user response:

This might work for you (GNU sed):

sed -En '2~100s/^((\s*\S+){3}).*/\1/p' file

Turn on extended regexps and turn off implicit printing with the -En options.

Starting from line 2 (the line following the header), the GNU step address 2~100 selects every 100th line, and the substitution then keeps only the first three columns.
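The first~step address is a GNU sed extension; its behaviour is easy to verify on generated input (assuming GNU sed is available):

```shell
# seq prints lines "1".."301"; the GNU address 2~100 selects lines
# 2, 102, and 202 (and would select 302 if the input were long enough).
result=$(seq 301 | sed -n '2~100p')
echo "$result"
```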
