Print top N files by word count in two columns

Time:02-26

I would like to write a script that prints the names of the top n files from each of two directories (n being a number I pass on the command line), ordered by the number of words they contain. My biggest problem, however, is how the output should be displayed.

Say my command line looks like this:

myscript.sh 5 dir1 dir2

The output should have 2 columns: on the left the top 5 files in descending order from dir1, and on the right the top 5 files in descending order from dir2.

This is the code I have so far, but I'm missing something. I think that pr -m -t should do what I want, but I couldn't make it work.

#!/bin/bash
dir=$1
dir2=$2
for files in "$dir"
do  
    find ./reuters-topics/$dir -type f -exec wc -l {} + | sort -rn | head -n 15
done
for files in "$dir2"
do
    find ./reuters-topics/$dir2 -type f -exec wc -l {} + | sort -rn | head -n 15
    
done

CodePudding user response:

This is a solution in fish:

for i in (find . -type f); wc -l $i; end | sort -rn | head -n15 | awk '{print $2 "\t" $1}'

As you can see, the re-ordering (filename first, count second) is done by awk. Note that wc -l counts lines; use wc -w to count words. As a separator I use a tab character:

awk '{print $2 "\t" $1}'

The difference between my loop and your find call, btw, is that I do not get the "total" line in the output, because wc runs once per file. I did not test whether this (including awk) also works for file names that contain spaces.
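On the open question about spaces: awk's $2 holds only the first whitespace-separated word of the name, so a name such as "my file.txt" would be cut short. A minimal bash sketch of the same re-ordering that keeps the whole name, assuming wc output of the form "<count> <name>"; the function name swap_count_and_name is hypothetical and not part of the answer above:

```shell
#!/usr/bin/env bash

# Hypothetical helper: read "count name" lines (as printed by wc) on stdin
# and print "name<TAB>count". `read` assigns everything after the first
# field to `name`, so spaces inside file names are preserved.
swap_count_and_name() {
    local count name
    while read -r count name; do
        printf '%s\t%s\n' "$name" "$count"
    done
}

# Example: printf '  12 my file.txt\n' | swap_count_and_name
```

It can take the place of the awk stage at the end of a pipeline. File names containing newlines would still break it.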

CodePudding user response:

#!/usr/bin/env bash

_top_files_by_words_usage() {
  # Print the usage message to stderr.
  1>&2 printf "%s\n" "Usage:" "    top_files_by_words <show_count> <dir1> <dir2>"
}

top_files_by_words() {
  if (( $# != 3 )) || [[ "$1" != +([0-9]) ]]; then
    _top_files_by_words_usage
    return 1
  fi

  local -i showCount=0
  local dir1=""
  local dir2=""

  showCount="$1"
  dir1="$2"
  dir2="$3"

  shopt -s extglob
  
  if [[ ! -d "$dir1" ]]; then
    1>&2 printf "directory '%s' does not exist or is not a directory\n" "$dir1"
    return 1
  fi

  if [[ ! -d "$dir2" ]]; then
    1>&2 printf "directory '%s' does not exist or is not a directory\n" "$dir2"
    return 1
  fi

  local -a out1=()
  local -a out2=()

  IFS=$'\n' read -r -d '' -a out1 < <(find "$dir1" -type f -exec wc -w {} \; | sort -k 1gr | head -n "$showCount")
  IFS=$'\n' read -r -d '' -a out2 < <(find "$dir2" -type f -exec wc -w {} \; | sort -k 1gr | head -n "$showCount")

  local -i i=0
  local -i maxLen=0
  local -i len=0;
  for ((i = 0; i < showCount; ++i)); do
    len="${#out1[$i]}"
    if (( len > maxLen )); then
      maxLen=$len
    fi
    # len="${#out2[$i]}"
    # if (( len > maxLen )); then
    #   maxLen=$len
    # fi
  done

  for (( i = 0; i < showCount; ++i)); do
    printf "%-*.*s %s\n" "$maxLen" "$maxLen" "${out1[$i]}" "${out2[$i]}"
  done
    
  return 0
}

top_files_by_words "$@"

$ ~/tmp/count_words.bash 15 tex tikz
 2309328 tex/resume.log                                       9692402 tikz/tikz-Graphics in LaTeX with TikZ.mp4
 2242997 tex/resume_cv.log                                    2208818 tikz/tikz-Tikz-Graphs and Automata.mp4
 2242969 tex/cover_letters/resume_cv.log                       852631 tikz/tikz-Drawing Automata with TikZ in LaTeX.mp4
   73859 tex/pgfplots/plotdata/heightmap.dat                   711004 tikz/tikz-tutorial.mp4
   49152 tex/pgfplots/lena.dat                                 300038 tikz/.ipynb_checkpoints/TikZ 11 Design Principles-checkpoint.ipynb
   43354 tex/nancy.mp4                                         300038 tikz/TikZ 11 Design Principles.ipynb
   31226 tex/pgfplots/pgfplotstodo.tex                         215583 tikz/texample/bridges-of-konigsberg.svg
   26000 tex/pgfplots/plotdata/ou.dat                          108040 tikz/Visual TikZ.pdf
   20481 tex/pgfplots/pgfplotstable.tex                         82540 tikz/worldflags.pdf
   19571 tex/pgfplots/pgfplots.reference.3dplots.tex            37608 tikz/texample/india-map.tex
   19561 tex/pgfplots/plotdata/risingdrop3d_coord.dat           35798 tikz/.ipynb_checkpoints/TikZ-checkpoint.ipynb
   19561 tex/pgfplots/plotdata/risingdrop3d_vel.dat             35656 tikz/texample/periodic_table.svg
   18207 tex/pgfplots/ChangeLog                                 35501 tikz/TikZ.ipynb
   17710 tex/pgfplots/pgfplots.reference.markers-meta.tex       25677 tikz/tikz-Graphics in LaTeX with TikZ.info.json
   13800 tex/pgfplots/pgfplots.reference.axisdescription.tex    14760 tikz/tikz-Tikz-Graphs and Automata.info.json

CodePudding user response:

column formats its input into columns and accepts several files as arguments. With process substitution, <(command), those "files" can be the output of live commands instead of actual files.

#!/bin/bash

top-files() {
    local n="$1"
    local dir="$2"

    find "$dir" -type f -exec wc -l {} + |
        head -n -1 | sort -rn | head -n "$n"
}

n="$1"
dir1="$2"
dir2="$3"

column <(top-files "$n" reuters-topics/"$dir1") \
       <(top-files "$n" reuters-topics/"$dir2")
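Because column chooses its layout from the terminal width, the two lists are not guaranteed to end up as exactly one left and one right column. As a possible alternative, here is a minimal sketch that uses paste in place of column, so that line k of the first list is always paired with line k of the second. It is not the method of the answer above; the function names top_files_by_wc and side_by_side are hypothetical, and wc -w is used on the assumption that word counts are wanted:

```shell
#!/usr/bin/env bash

# Hypothetical helper: list the top N files under a directory by word count.
# Running wc once per file (\;) means no "total" line is ever printed.
top_files_by_wc() {
    local n="$1" dir="$2"
    find "$dir" -type f -exec wc -w {} \; | sort -rn | head -n "$n"
}

# Hypothetical helper: print both lists side by side, one row per rank.
# paste joins line k of its first input with line k of its second input,
# separated by a tab.
side_by_side() {
    local n="$1" dir1="$2" dir2="$3"
    paste <(top_files_by_wc "$n" "$dir1") <(top_files_by_wc "$n" "$dir2")
}

# Example: side_by_side 5 reuters-topics/dir1 reuters-topics/dir2
```

The rows are tab-separated, so long file names can push the right-hand column out of alignment; padding the left column with printf is one way to tidy that up.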