Sort and extract a certain number of rows from a file containing dates


I have dates in a txt file, in the format:

yyyymmdd

The raw data look like:

20171115
20171115
20180903
...
20201231

There are more than 100k rows. I am trying to keep the "newest" 10k lines in one file, and the "oldest" 10k lines in a separate file.

I guess this must be a two-step process:

  1. sort lines,

  2. then extract the 10k rows at the top (the "newest", i.e. most recent, dates) and the 10k rows towards the end of the file (the "oldest", i.e. most ancient, dates)

How could I achieve this using awk?

I even tried with Perl, no luck though, so a Perl one-liner would be highly appreciated as well.

Edit: I would prefer a clean, clever solution that I can learn from, not an optimization of my attempts.

Example of my attempt with Perl:

@dates = ('20170401', '20170721', '20200911');
@ordered = sort { &compare } @dates;
sub compare {
    $a =~ /(\d{4})(\d{2})(\d{2})/;
    $c = $3 . $2 . $1;    # $a rearranged to ddmmyyyy
    $b =~ /(\d{4})(\d{2})(\d{2})/;
    $d = $3 . $2 . $1;    # $b rearranged to ddmmyyyy
    $c <=> $d;
}
print "@ordered\n";

CodePudding user response:

This is an answer using Perl. If you want the oldest on top, you can use the standard sort order:

@dates = sort @dates;

Reverse sort order, with the newest on top:

@dates = sort { $b <=> $a } @dates;
#                  ^^^
#                   |
# numerical three-way comparison returning -1, 0 or +1

You can then extract 10000 of the entries from the top:

my $keep = 10000;
my @top = splice @dates, 0, $keep;

And 10000 from the bottom:

$keep = @dates unless(@dates >= $keep);
my @bottom = splice @dates, -$keep;

@dates will now contain the dates between the 10000 at the top and the 10000 at the bottom that you extracted.

You can then save the two arrays to files if you want:

sub save {
    my $filename=shift;
    open my $fh, '>', $filename or die "$filename: $!";
    print $fh join("\n", @_) . "\n" if(@_);
    close $fh;
}

save('top', @top);
save('bottom', @bottom);
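
Putting these pieces together, a minimal complete script might look like this (a sketch: the input filename dates.txt is an assumption; the output names top and bottom are taken from the answer above):

#!/usr/bin/perl
use strict;
use warnings;

my $keep = 10000;

# slurp the dates, one per line (input filename is an assumption)
open my $in, '<', 'dates.txt' or die "dates.txt: $!";
chomp(my @dates = <$in>);
close $in;

@dates = sort { $b <=> $a } @dates;    # newest on top

my @top = splice @dates, 0, $keep;     # 10000 newest
$keep = @dates if @dates < $keep;      # don't take more than what is left
my @bottom = splice @dates, -$keep;    # 10000 oldest

save('top', @top);
save('bottom', @bottom);

sub save {
    my $filename = shift;
    open my $fh, '>', $filename or die "$filename: $!";
    print $fh join("\n", @_) . "\n" if @_;
    close $fh;
}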

CodePudding user response:

Given that dates in this format sort correctly lexicographically, this is simple. Just use sort, then split.

Given:

cat infile
20171115
20171115
20180903
20100101
20211118
20201231

You can sort then split that input file into files of 3 lines each:

split -l3 <(sort -r infile) dates
# -l10000 for a 10,000 line split

The result:

for fn in dates*; do echo "${fn}:"; cat "$fn"; done
datesaa:
20211118
20201231
20180903
datesab:
20171115
20171115
20100101

# files are named datesaa, datesab, datesac, ... dateszz
# if you only want two blocks of 10,000 dates, 
# just throw the remaining files away.
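
For example, one way to discard everything but the first two split files (a sketch; the suffixes aa and ab are split's defaults):

for fn in dates*; do case "$fn" in datesaa|datesab) ;; *) rm "$fn";; esac; done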

Given that you may have significantly more lines than you are interested in, you can also sort to an intermediate file, then use head and tail to get the newest and oldest respectively:

sort -r infile >dates_sorted
head -n10000 dates_sorted >newest_dates
tail -n10000 dates_sorted >oldest_dates

CodePudding user response:

Assumptions:

  • dates are not unique (per OP's comment)
  • results are dumped to two files newest and oldest
  • newest entries will be sorted in descending order
  • oldest entries will be sorted in ascending order
  • there's enough memory on the host to load the entire data file into memory (in the form of an awk array)

Sample input:

$ cat dates.dat
20170415
20171115
20180903
20131115
20141115
20131115
20141115
20150903
20271115
20271105
20271105
20280903
20071115
20071015
20070903
20031115
20031015
20030903
20011115
20011125
20010903
20010903

One idea using GNU awk:

x=5

awk -v max="${x}" '
    { dates[$1]++ }
END { count=0
      PROCINFO["sorted_in"]="@ind_str_desc"      # find the newest "max" dates
      for (i in dates) {
          for (n=1; n<=dates[i]; n++) {
              if (++count > max) break
              print i > "newest"
          }
          if (count > max) break
      }
      count=0
      PROCINFO["sorted_in"]="@ind_str_asc"       # find the oldest "max" dates
      for (i in dates) {
          for (n=1; n<=dates[i]; n++) {
              if (++count > max) break
              print i > "oldest"
          }
          if (count > max) break
      }
    }
' dates.dat

NOTE: if a duplicate date shows up as rows #10,000 and #10,001, the #10,001 entry will not be included in the output.

This generates:

$ cat oldest
20010903
20010903
20011115
20011125
20030903

$ cat newest
20280903
20271115
20271105
20271105
20180903

CodePudding user response:

Here is a quick and dirty Awk attempt which collects the ten smallest and the ten largest entries from the file.

awk 'BEGIN { for(i=1; i<=10; i++) max[i] = min[i] = 0 }
NR==1 { max[1] = min[1] = $1; next }
(!max[10]) || ($1 > max[10]) {
    for(i=1; i<=10; ++i) if(!max[i] || (max[i] < $1)) break
    for(j=9; j>=i; --j) max[j+1]=max[j]
    max[i] = $1 }
(!min[10]) || ($1 < min[10]) {
    for(i=1; i<=10; ++i) if (!min[i] || (min[i] > $1)) break
    for(j=9; j>=i; --j) min[j+1]=min[j]
    min[i] = $1 }
END { for(i=1; i<=10; ++i) print max[i];
print "---"
for(i=1; i<=10; ++i) print min[i] }' file

For simplicity, this makes some naïve assumptions (numbers are all positive, there are at least 20 distinct numbers, and duplicates are accounted for).

This avoids external dependencies by using a brute-force sort in native Awk. We keep two sorted arrays min and max with ten items in each, and shift off the values which no longer fit as we populate them with the largest and smallest numbers.

It should be obvious how to extend this to 10,000.
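
For instance, a sketch with the hardcoded 10 replaced by an awk variable (keep is a name I am assuming, passed in with -v; the 9 in the shift loops becomes keep-1):

awk -v keep=10000 '
BEGIN { for(i=1; i<=keep; i++) max[i] = min[i] = 0 }
NR==1 { max[1] = min[1] = $1; next }
(!max[keep]) || ($1 > max[keep]) {
    for(i=1; i<=keep; ++i) if(!max[i] || (max[i] < $1)) break
    for(j=keep-1; j>=i; --j) max[j+1]=max[j]
    max[i] = $1 }
(!min[keep]) || ($1 < min[keep]) {
    for(i=1; i<=keep; ++i) if(!min[i] || (min[i] > $1)) break
    for(j=keep-1; j>=i; --j) min[j+1]=min[j]
    min[i] = $1 }
END { for(i=1; i<=keep; ++i) print max[i]
      print "---"
      for(i=1; i<=keep; ++i) print min[i] }' file

Note that the insertion scan is linear, so the worst case is on the order of keep comparisons per input line; at keep=10,000 a sort-based approach may well be faster in practice.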

CodePudding user response:

A command-line script ("one"-liner) with Perl

perl -MPath::Tiny=path -we'
    $f=shift; $n = shift//2;                # filename; number of lines or default
    @d = sort  (path($f)->lines);           # sort lexicographically, ascending
    $n = int @d/2 if 2*$n > @d;             # number of lines at most half
    path("bottom.txt")->spew(@d[0..$n-1]);  # write files, top/bottom $n lines
    path("top.txt")   ->spew(@d[$#d-$n 1..$#d])
' dates.txt 4

Comments

  • Needs a filename; can take the number of lines to take from the top and bottom (the default in this example is 2, for easy testing with small files). We don't need to check the filename since the library used to read it, Path::Tiny, exits with a message if there is no such file

  • For the library (-MPath::Tiny) I specify the method name (=path) only for documentation; this isn't necessary since the library is a class, so =path may simply be removed

  • Sorting is alphabetical, but that is fine with dates in this format; oldest dates come first, but that doesn't matter since we'll split off what we need. To enforce numerical sorting, and while at it to sort in descending order, use sort { $b <=> $a } @d; (see Perl's sort documentation)

  • We check whether there are enough lines in the file for the desired number ($n) to shave off the (sorted) top and bottom. If there aren't, $n is set to half the file

This is written as a command-line program ("one-liner") merely because that was asked for. That much code would be far more comfortable in a script.
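
For reference, a sketch of the same logic as a standalone script (the Path::Tiny calls are unchanged; the usage message and strict/warnings are my additions):

#!/usr/bin/perl
use strict;
use warnings;
use Path::Tiny qw(path);

my $f = shift // die "Usage: $0 file [n]\n";   # filename is required
my $n = shift // 2;                            # lines to take from each end

my @d = sort (path($f)->lines);                # lexicographic, ascending
$n = int(@d / 2) if 2 * $n > @d;               # at most half the file

path("bottom.txt")->spew(@d[0 .. $n - 1]);     # oldest $n lines
path("top.txt")->spew(@d[$#d - $n + 1 .. $#d]); # newest $n lines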

CodePudding user response:

Same assumptions as in my other answer, except the newest data is in ascending order ...

One idea using sort and head/tail:

$ sort dates.dat | tee >(head -5 > oldest) | tail -5 > newest

$ cat oldest
20010903
20010903
20011115
20011125
20030903

$ cat newest
20180903
20271105
20271105
20271115
20280903

OP can add another sort if needed (eg, tail -5 | sort -r > newest).

For large datasets OP may also want to investigate other sort options, eg, -S (allocate more memory for sorting), --parallel (enable parallel sorting), etc.
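
For example, with GNU sort (the buffer size and thread count here are illustrative, not tuned values):

sort -S 1G --parallel=4 dates.dat | tee >(head -n 10000 > oldest) | tail -n 10000 > newest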
