I have dates in a txt file, in the format:
yyyymmdd
The raw data look like:
20171115
20171115
20180903
...
20201231
There are more than 100k rows. I am trying to keep the 10k "newest" lines in one file, and the 10k "oldest" lines in a separate file.
I guess this must be a two-step process:
sort the lines,
then extract the 10k rows at the top (the "newest", i.e. the most recent dates) and the 10k rows towards the end of the file (the "oldest", i.e. the most ancient dates).
How could I achieve this using awk?
I even tried with perl, no luck though, so a perl one-liner would be highly appreciated as well.
Edit: I would prefer a clean, clever solution that I can learn from, not an optimization of my attempts.
Example with perl:
@dates = ('20170401', '20170721', '20200911');
@ordered = sort { &compare } @dates;
sub compare {
    $a =~ /(\d{4})(\d{2})(\d{2})/;
    $c = $3 . $2 . $1;
    $b =~ /(\d{4})(\d{2})(\d{2})/;
    $c = $3 . $2 . $1;
    $c <=> $d;
}
print "@ordered\n";
CodePudding user response:
This is an answer using perl. If you want the oldest on top, you can use the standard sort order:
@dates = sort @dates;
Reverse sort order, with the newest on top:
@dates = sort { $b <=> $a } @dates;
#                  ^^^
#                   |
#   numerical three-way comparison returning -1, 0 or 1
You can then extract 10000 of the entries from the top:
my $keep = 10000;
my @top = splice @dates, 0, $keep;
And 10000 from the bottom:
$keep = @dates unless(@dates >= $keep);
my @bottom = splice @dates, -$keep;
@dates will now contain the dates between the 10000 at the top and the 10000 at the bottom that you extracted.
You can then save the two arrays to files if you want:
sub save {
    my $filename = shift;
    open my $fh, '>', $filename or die "$filename: $!";
    print $fh join("\n", @_) . "\n" if(@_);
    close $fh;
}
save('top', @top);
save('bottom', @bottom);
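Putting the pieces together, here is a minimal complete script (a sketch: the input name dates.txt and the output names top and bottom are placeholders; everything else is the code from above):

#!/usr/bin/perl
use strict;
use warnings;

my $keep = 10000;

# read all dates (one yyyymmdd value per line) into memory
open my $in, '<', 'dates.txt' or die "dates.txt: $!";
chomp(my @dates = <$in>);
close $in;

# newest on top
@dates = sort { $b <=> $a } @dates;

# take 10000 from the top, then 10000 from the bottom
my @top = splice @dates, 0, $keep;
$keep = @dates unless @dates >= $keep;
my @bottom = splice @dates, -$keep;

sub save {
    my $filename = shift;
    open my $fh, '>', $filename or die "$filename: $!";
    print $fh join("\n", @_) . "\n" if @_;
    close $fh;
}

save('top', @top);
save('bottom', @bottom);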
CodePudding user response:
Given that your lines with dates will sort lexicographically, it is simple. Just use sort, then split.
Given:
cat infile
20171115
20171115
20180903
20100101
20211118
20201231
You can sort then split that input file into files of 3 lines each:
split -l3 <(sort -r infile) dates
# -l10000 for a 10,000 line split
The result:
for fn in dates*; do echo "${fn}:"; cat "$fn"; done
datesaa:
20211118
20201231
20180903
datesab:
20171115
20171115
20100101
# files are named datesaa, datesab, datesac, ... dateszz
# if you only want two blocks of 10,000 dates,
# just throw the remaining files away.
Given you may have significantly more lines than you are interested in, you can also sort to an intermediate file, then use head and tail to get the newest and oldest dates respectively:
sort -r infile >dates_sorted
head -n10000 dates_sorted >newest_dates
tail -n10000 dates_sorted >oldest_dates
CodePudding user response:
Assumptions:
- dates are not unique (per OP's comment)
- results are dumped to two files: newest and oldest
- newest entries will be sorted in descending order
- oldest entries will be sorted in ascending order
- there's enough memory on the host to load the entire data file into memory (in the form of an awk array)
Sample input:
$ cat dates.dat
20170415
20171115
20180903
20131115
20141115
20131115
20141115
20150903
20271115
20271105
20271105
20280903
20071115
20071015
20070903
20031115
20031015
20030903
20011115
20011125
20010903
20010903
One idea using GNU awk:
x=5
awk -v max="${x}" '
{ dates[$1]++ }
END { count=0
      PROCINFO["sorted_in"]="@ind_str_desc"          # find the newest "max" dates
      for (i in dates) {
          for (n=1; n<=dates[i]; n++) {
              if ( ++count > max) break
              print i > "newest"
          }
          if (count > max) break
      }
      count=0
      PROCINFO["sorted_in"]="@ind_str_asc"           # find the oldest "max" dates
      for (i in dates) {
          for (n=1; n<=dates[i]; n++) {
              if ( ++count > max) break
              print i > "oldest"
          }
          if (count > max) break
      }
}
' dates.dat
NOTE: if the same date shows up as rows #10,000 and #10,001, the #10,001 entry will not be included in the output.
This generates:
$ cat oldest
20010903
20010903
20011115
20011125
20030903
$ cat newest
20280903
20271115
20271105
20271105
20180903
CodePudding user response:
Here is a quick and dirty Awk attempt which collects the ten smallest and the ten largest entries from the file.
awk 'BEGIN { for(i=1; i<=10; i++) max[i] = min[i] = 0 }
NR==1 { max[1] = min[1] = $1; next }
(!max[10]) || ($1 > max[10]) {
    for(i=1; i<=10; i++) if(!max[i] || (max[i] < $1)) break
    for(j=9; j>=i; --j) max[j+1]=max[j]
    max[i] = $1 }
(!min[10]) || ($1 < min[10]) {
    for(i=1; i<=10; i++) if (!min[i] || (min[i] > $1)) break
    for(j=9; j>=i; --j) min[j+1]=min[j]
    min[i] = $1 }
END { for(i=1; i<=10; i++) print max[i];
    print "---"
    for(i=1; i<=10; i++) print min[i] }' file
For simplicity, this has some naïve assumptions (numbers are all positive, there are at least 20 distinct numbers, duplicates should be accounted for).
This avoids external dependencies by using a brute-force sort in native Awk. We keep two sorted arrays min and max with ten items in each, and shift off the values which no longer fit as we populate them with the largest and smallest numbers.
It should be obvious how to extend this to 10,000.
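The same bounded-memory idea also carries over to Perl. Here is a rough sketch of my own (not a line-by-line translation of the Awk above; the cutoff $k, the yyyymmdd filter, and the script name are assumptions, and re-sorting the two small lists on every insert is deliberately simple rather than fast):

#!/usr/bin/perl
use strict;
use warnings;

my $k = 10;    # raise to 10_000 for the real data

my (@max, @min);
while (my $line = <>) {
    chomp $line;
    next unless $line =~ /^\d{8}$/;    # keep only yyyymmdd lines

    # keep the k largest seen so far, largest first
    push @max, $line;
    @max = sort { $b <=> $a } @max;
    splice @max, $k if @max > $k;

    # keep the k smallest seen so far, smallest first
    push @min, $line;
    @min = sort { $a <=> $b } @min;
    splice @min, $k if @min > $k;
}

print "$_\n" for @max;
print "---\n";
print "$_\n" for @min;

Run it as perl topbottom.pl file (topbottom.pl being whatever you name the script).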
CodePudding user response:
A command-line script ("one-liner") with Perl:
perl -MPath::Tiny=path -we'
    $f=shift; $n = shift//2;                # filename; number of lines or default
    @d = sort (path($f)->lines);            # sort lexicographically, ascending
    $n = int @d/2 if 2*$n > @d;             # number of lines at most half
    path("bottom.txt")->spew(@d[0..$n-1]);  # write files, top/bottom $n lines
    path("top.txt") ->spew(@d[$#d-$n+1..$#d])
' dates.txt 4
Comments:
- Needs a filename; can take the number of lines to take from top and bottom (default in this example is 2, for easy tests with small files). We don't need to check for the filename since the library used to read it, Path::Tiny, exits with a message if there is no such file.
- For the library (-MPath::Tiny) I specify the method name (=path) only for documentation; this isn't necessary since the library is a class, so =path may simply be removed.
- Sorting is alphabetical, but that is fine with dates in this format; oldest dates come first, but that doesn't matter since we'll split off what we need. To enforce numerical sorting, and while at it to sort in descending order, use sort { $b <=> $a } @d;. See sort.
- We check whether there are enough lines in the file for the desired number of lines to shave off from the (sorted) top and bottom ($n). If there aren't, that's set to half the file.
This is written as a command-line program ("one-liner") merely because that was asked for. That much code would be far more comfortable in a script.
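For reference, roughly the same logic written out as a script (a sketch under the same assumptions; Path::Tiny must be installed, the output names bottom.txt and top.txt are kept from above, and the usage message is my own addition):

#!/usr/bin/perl
use strict;
use warnings;
use Path::Tiny qw(path);

my $f = shift // die "Usage: $0 file [n]\n";
my $n = shift // 2;                        # lines to take from top and bottom

my @d = sort (path($f)->lines);            # lexicographic sort is fine for yyyymmdd
$n = int(@d / 2) if 2 * $n > @d;           # at most half the file

path("bottom.txt")->spew(@d[0 .. $n-1]);
path("top.txt")->spew(@d[$#d-$n+1 .. $#d]);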
CodePudding user response:
Same assumptions as with my [other answer], except newest data is in ascending order ...
One idea using sort and head/tail:
$ sort dates.dat | tee >(head -5 > oldest) | tail -5 > newest
$ cat oldest
20010903
20010903
20011115
20011125
20030903
$ cat newest
20180903
20271105
20271105
20271115
20280903
OP can add another sort if needed (eg, tail -5 | sort -r > newest).
For large datasets OP may also want to investigate other sort options, eg, -S (allocate more memory for sorting), --parallel (enable parallel sorting), etc.