Splitting the file based on first character of word

I want to split a file based on the first character of each line, creating one output file per first character. This is what I am doing:

awk '{print > substr($0, 1, 1)}' "$File"

But awk fails with "fatal: expression for `>' redirection has null string value". The file contains some blank lines. How do I ignore the blank lines when I do the split?

The content of $File is

100009-01  -- This should go in file named 1
200009-01  -- This should go in file named 2
300009-01  -- This should go in file named 3
400009-01
500009-01
600037-01
700037-01
800037-01
900037-01
100037-01  -- This should go in file named 1
A0037-02_  -- This should go in file named A
a00037-02  -- This should go in file named a
c00037-02
B00037-02
200037-02

It should generate a file named "1", and all the lines starting with 1 should go into that file.

Thanks

CodePudding user response：

With your shown samples, please try the following awk code.

sort -k1.1 Input_file | 
awk '
!NF{ next }
{
  currentFile=substr($1,1,1)
}
prev!=currentFile{
  close(prev)
}
{
  print > (currentFile)
  prev=currentFile
}
'

Explanation: a detailed, line-by-line explanation of the above.

sort -k1.1 Input_file |         ##Sort Input_file so that lines sharing a first character are grouped together.
awk '                           ##Pipe the sorted output into the awk program.
!NF{ next }                     ##If the line is empty (or only whitespace), skip to the next line.
{
  currentFile=substr($1,1,1)    ##Set currentFile to the 1st character of the first field.
}
prev!=currentFile{              ##If prev is NOT equal to currentFile, do the following.
  close(prev)                   ##Close the previous output file so that open files do not pile up.
}
{
  print > (currentFile)         ##Print the current line into the currentFile output file.
  prev=currentFile              ##Remember currentFile in prev for the next line.
}
'
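
A caveat, sketched under assumptions: the approach above relies on sort placing all lines with the same first character next to each other. In some locales sort may interleave upper- and lower-case lines (for example a1, A2, a3). Because close() followed by print > reopens the file and truncates it, earlier lines could then be lost. The sketch below forces byte-wise ordering with LC_ALL=C. The scratch directory and the sample lines are made up for illustration, and it assumes a case-sensitive file system, so that "A" and "a" are different files.

```shell
# Hypothetical sample data in a scratch directory (names are illustrative).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' 'a1' 'A2' '' 'a3' '100009-01' '100037-01' > Input_file

# LC_ALL=C makes sort compare raw bytes, so "A..." and "a..." lines
# end up in separate, contiguous groups.
LC_ALL=C sort Input_file |
awk '
!NF { next }                          # skip blank lines
{ currentFile = substr($1, 1, 1) }    # first character of the first field
prev != currentFile { close(prev) }   # finished with the previous output file
{ print > (currentFile); prev = currentFile }
'
```

With this input, the lines a1 and a3 both end up in the file named a, and A2 goes into the file named A.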

CodePudding user response：

"The file contains some blank lines. How do I ignore the blank lines while I do the split?"

If this is the sole problem, you can simply fix your code as follows:

awk '$0!=""{print > substr($0, 1, 1)}' "$File"

Explanation: I added a condition to your action. It is true when the whole line ($0) is not equal (!=) to the empty string (""), so empty lines are ignored.

CodePudding user response：

Here is a minor update to your original code:

awk 'NF{print > substr($1, 1, 1)}' "$File"

An awk program consists of pattern { action } rules, and the action runs when the pattern evaluates to a non-zero number or a non-empty string. NF holds the number of fields in the current record (line). With NF as the pattern, awk performs the action only if the current line contains at least one non-whitespace character.

Besides that, we use $1 instead of $0. This guards against lines that start with whitespace: we take the first character of the first field rather than of the whole line.
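
To make the difference concrete, here is a small sketch using a made-up sample.txt in a scratch directory (both names are assumptions for illustration). It feeds the one-liner a line with leading spaces, a whitespace-only line and an empty line:

```shell
# Hypothetical input: leading spaces, a whitespace-only line, an empty line.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' '  100009-01' '   ' '' '200009-01' > sample.txt

# NF is 0 for the empty and whitespace-only lines, so they are skipped.
# $1 ignores the leading spaces, so the first line goes to the file "1".
awk 'NF { print > substr($1, 1, 1) }' sample.txt
```

Only the files 1 and 2 are created next to sample.txt. The line with leading spaces is written unchanged to the file 1, and no file named with a single space appears.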

CodePudding user response：

Here's how it could be done with bash:

while read -r line; do
    [ -n "$line" ] || continue    # skip blank lines; an empty file name would be an error
    printf '%s\n' "$line" >> "${line:0:1}"
done < "$File"

CodePudding user response：

I don't know how to put this into one shell script, but you can build on the following:

cut -c 1 test.txt | sort | uniq

This gives the list of first characters present in your file (these are also the names of the files you are about to create).

grep "^1" test.txt

This gives you all the lines of your file, starting with "1".

Take care: don't use "> file", because that deletes and recreates the file every time. I suggest ">> file" instead, which creates the file if it does not exist and appends otherwise.
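
A quick sketch of that difference, using a temporary file created with mktemp (the variable name demo is just for illustration):

```shell
demo=$(mktemp)               # hypothetical scratch file
echo "first"  > "$demo"      # truncates (or creates) the file, then writes
echo "second" > "$demo"      # truncates again: "first" is gone
echo "third" >> "$demo"      # appends after "second"
cat "$demo"
```

After these commands the file holds only the lines second and third.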

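So, as a shell loop, you get something like:

for a in $(cut -c 1 test.txt | sort | uniq)   # each distinct first character
do
  grep "^$a" test.txt >> "$a"                 # append the matching lines to the file named after it
done

This assumes the first characters are plain letters or digits. Characters that are special to the shell or to grep (such as * or .) would need escaping.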