Home > Back-end >  Simple one-liner to merge lines with common first field
Simple one-liner to merge lines with common first field

Time:02-19

In my work building an English language database, I often deal with text content from different sources, and need to merge lines that share the same first field. I often hack this in a text editor with a regex that captures a first field, searching across "\n", but often I have text files >10GB, so a command-line, streaming solution is preferred to in-memory.

Sample input:

apple|pear 
apple|quince 
apple cider|juice
banana|plantain
cherry|cheerful, crimson
cherry|ruddy
cherry|cerise

Desired output:

apple|pear|quince 
apple cider|juice
banana|plantain
cherry|cheerful, crimson|ruddy|cerise

The logic is to concatenate (joined by "|") all lines with the same first field.

The only delimiter is "|", and the delimiter only appears once per input line. i.e. it's effectively a 2-column text file. The file sorting does not matter, the only concern is consecutive lines with the identical first field.

I have lots of solutions and one-liners (often in awk or ruby) to process same-line content, but I run into knots when dealing with multiple lines, and would appreciate help. For some reason, multiline processing always bogs me down.

I'm sure this is can be done succinctly with awk.

CodePudding user response:

Assumptions/understandings:

  • overall file may not be sorted (by 1st field)
  • all lines with the same string in the 1st field will be listed consecutively; this should eliminate the need to maintain a large volume of data in memory with the tradeoff that we'll need a bit more typing
  • 2nd field may contain trailing white space (per sample input); this will need to be removed
  • ouput does not need to be sorted (by 1st field)

One awk idea:

awk '

function print_line() {
    if (prev != "")
       print prev,data
}

BEGIN { FS=OFS="|" }

      { if ($1 != prev) {
           print_line()
           prev=$1
           data=""
        }
        gsub(/[[:space:]] $/,"",$2)              # strip trailing white space
        data= data (data=="" ? "" : OFS) $2      # concatentate 2nd fields with OFS="|"
      }

END   { print_line() }                           # flush last set of data to stdout
' pipe.dat

This generates:

apple|pear|quince
apple cider|juice
banana|plantain
cherry|cheerful, crimson|ruddy|cerise

CodePudding user response:

Using any awk in any shell on every Unix box and assuming your input is grouped by the first field as shown in your sample input and you don't really have trailing blanks at the end of some lines:

$ cat tst.awk
BEGIN { FS=OFS="|" }
$1 != prev {
    if ( NR>1 ) {
        print out
    }
    out = prev = $1
}
{ out = out OFS $2 }
END { print out }

$ awk -f tst.awk file
apple|pear|quince
apple cider|juice
banana|plantain
cherry|cheerful, crimson|ruddy|cerise

If it's not grouped then do sort file | awk -f tst.awk and if there are trailing blanks then add { sub(/ $/,"") } as the first line of the script.

CodePudding user response:

Here is a Ruby solution that reads the file line-by-line. At the end I show how much simpler the solution could be if the file could be gulped into a string.

Let's first create an input file to work with.

str =<<~_
  apple|pear 
  apple|quince 
  apple cider|juice
  banana|plantain
  cherry|cheerful, crimson
  cherry|ruddy
  cherry|cerise
_
file_name_in = 'file_in'
File.write(file_name_in, str)
  #=> 112

Solution when file is read line-by-line

We can produce the desired output file with the following method.

def doit(file_name_in, file_name_out)  
  fin = File.new(file_name_in, "r")
  fout = File.new(file_name_out, "w")
  str = ''
  until fin.eof?
    s = fin.gets.strip
    k,v = s.split(/(?=\|)/)
    if str.empty?
      str = s
      key = k
    elsif k == key
      str << v
    else
      fout.puts(str)
      str = s
      key = k
    end
  end
  fout.puts(str)
  fin.close
  fout.close
end

Let's try it.

file_name_out = 'file_out'
doit(file_name_in, file_name_out)
puts File.read(file_name_out)

prints the following.

apple|pear|quince
apple cider|juice
banana|plantain
cherry|cheerful, crimson|ruddy|cerise

Note that

"apple|pear".split(/(?=\|)/)
  #=> ["apple", "|pear"]

The regular expression contains the positive lookahead (?=\|) which matches the zero-width location between 'e' and '|'.

Solution when file is gulped into a string

I realize (after asking) that the OP does not want to gulp the file into a string (hence my solution above) but I would like to show how much simpler the problem is if one could do so. Here is one of many ways of doing that.

def gulp_it(file_name_in, file_name_out)
  File.write(file_name_out,
    File.read(file_name_in).gsub(/^(. )\|.*[^ ]\K *\r?\n\1/, ''))
end
gulp_it(file_name_in, file_name_out)
  #=> 98
puts File.read(file_name_out)

prints

apple|pear|quince 
apple cider|juice
banana|plantain
cherry|cheerful, crimson|ruddy
cherry|cerise

Thinking about what the regex engine will be doing, this may be acceptably fast, depending on file size, of course.

Regex demo

While the link uses the PCRE engine the result would be the same using Ruby's regex engine (Onigmo). We can make the regular expression self-documenting by writing it in free-spacing mode.

/
^        # match the beginning of a line
(. )     # match one or more characters
\|.*[^ ] # match '|', then zero or more chars, then a non-space
\K       # resets the starting point of the match and discards
         # any previously-matched characters 
[ ]*     # match zero or more chars
\r?\n    # march the line terminator(s)
\1       # match the content of capture group 1
/x       # invoke free-spacing mode

(. ) matches, 'apple', 'banana' and 'cherry' because those words are at the beginning lines. One could alternatively write ([^|]*).

CodePudding user response:

If we assume s is a string containing all of the lines in the file.

s.split("\n").inject({}) { |h, x| k, v = x.split('|'); h[k] ||= []; h[k] << v.strip; h }

Will yield:

{"apple"=>["pear", "quince"], "apple cider"=>["juice"], "banana"=>["plantain"], "cherry"=>["cheerful, crimson", "ruddy", "cerise"]}

Then:

s.split("\n").inject({}) { |h, x| k, v = x.split('|'); h[k] ||= []; h[k] << v.strip; h }.map { |k, v| "#{k}|#{v.join('|')}" }

Yields:

["apple|pear|quince", "apple cider|juice", "banana|plantain", "cherry|cheerful, crimson|ruddy|cerise"]
  • Related