Adding text to a specific (row,column) in csv with sed-CodePudding

Improved question for clarity:

Hello there so I have about 2000 csv files

One master file called fileaa.csv

And 1999 description files called fileaa-1.csv, fileaa-2.csv, fileaa-4.csv... (some numbers are missing)

I want to add a 3rd column to the 2 column master file:

| link | link2 |

1| somelink.com | somelink2.com |

like so

| link | link2 | description |

1| somelink.com | somelink2.com | some description |

where the description of line 1 comes from fileaa-1.csv, which is a single-cell csv with a paragraph of text.

Does anyone know how to do this at scale? I have 100 other masters with about 2000 descriptions each.

Edit (incl. commands):

Things I couldn't try:

cat * | awk 'NR==FNR{a[NR]=$0;next}{print a[FNR],$0}' fileaa.csv fileaa-1.csv wouldn't work because of the missing numbers

awk '{print $0,NR}' fileaa.csv; \

find /mnt/media/fileaa.csv -type f -exec sed -i 's/1/fileaa-1.csv/g' {} \; because sed can't read external files inside the -exec sed command

Edit 1:

The exact contents of fileaa-1.csv are:

"Texan singer-songwriter Robert Earl Keen performs the song "What I Really Mean" acoustically with his band, live in the Magnolia Avenue Studios of KDHX, St. Louis, Missouri, February 11, 2010. The full session aired Sun, Feb. 28, 2010 on Songwriter's Showcase, heard Sundays from 10:30 a.m.-noon Central on KDHX with host Ed Becker. Sound and Video by Andy Coco and Ed Kleinberg. Discover more great music (streaming audio, photos, video and more)"

The exact input:

| link | link2 |

1| https://www.youtube.com/watch?v=lhNFZ37OfE4 | https://www.youtube.com/user/kdhx |

The exact desired output:

| link | link2 | description |

1| https://www.youtube.com/watch?v=lhNFZ37OfE4 | https://www.youtube.com/user/kdhx | "Texan singer-songwriter Robert Earl Keen performs the song "What I Really Mean" acoustically with his band, live in the Magnolia Avenue Studios of KDHX, St. Louis, Missouri, February 11, 2010. The full session aired Sun, Feb. 28, 2010 on Songwriter's Showcase, heard Sundays from 10:30 a.m.-noon Central on KDHX with host Ed Becker. Sound and Video by Andy Coco and Ed Kleinberg. Discover more great music (streaming audio, photos, video and more)" |

Edit 2:

The contents of fileaa.csv are already in order and do not need to be sorted. It is not possible for there to be a fileaa-[number].csv that does not match a row in fileaa.csv.

Edit 3:

There are no | of linefeeds in the data.

To be honest I am a complete beginner and I don't really know where to start on this one.

Any help will be appreciated ❤️

CodePudding user response：

Assumptions:

the 'paragraph' from the fileaa-*.csv files is on a single line (ie, does not include any embedded linefeeds)
assuming the sample from OP's fileaa-1.csv is one long line and what we're seeing in the question is an issue of incorrect formatting of the paragraph (ie, there are no linefeeds)
we can ignore anything on lines 2-N from the fileaa-*.csv files
we only append a field to a line in fileaa.csv if we find a matching file (ie, we don't worry about appending an empty field if the matching fileaa-*.csv files does not exist)
the finale result (ie, contents of all files) will fit in memory

Adding some additional sample data:

$ head fileaa*csv
==> fileaa-1.csv <==
"Texan singer-songwriter Robert Earl Keen performs the song "What I Really Mean" acoustically with his band, live in the Magnolia Avenue Studios of KDHX, St. Louis, Missouri, February 11, 2010. The full session aired Sun, Feb. 28, 2010 on Songwriter's Showcase, heard Sundays from 10:30 a.m.-noon Central on KDHX with host Ed Becker. Sound and Video by Andy Coco and Ed Kleinberg. Discover more great music (streaming audio, photos, video and more)"

==> fileaa-2.csv <==
"this one has a short paragraph ... 1 ... 2 ... 3"

==> fileaa-3.csv <==
and then there's this paragraph with a bunch of random characters ... as;dlkfjaw;eorifujqw4[-09hjavnd;oitjuwae[-0g9ujadg;flkjas

==> fileaa.csv <==
 |     link     |     link2     |
1| https://www.youtube.com/watch?v=lhNFZ37OfE4 | https://www.youtube.com/user/kdhx |
2| https://www.youtube.com/watch?v=AAAAAAAAAAAAAAAAA | https://www.youtube.com/user/AAAAAAAA |
3 | https://www.youtube.com/watch?v=BBBBB | https://www.youtube.com/user/BBBBBBBBBBBBBBB |
 4| https://www.youtube.com/watch?v=CCCCCCCC | https://www.youtube.com/user/CCCCCC |

NOTE: since there is no fileaa-4.csv we will not append anything to the last line (where 1st field = 4) in fileaa.csv

One awk idea:

master='fileaa'

awk '
FNR==NR { if (FNR==1)
             lines[0]=$0 " Description |"      # save header line
          else {
             split($0,a,"|")                   # get line number
             ndx=a[1] 0                        # remove spaces and leading zeros
             lines[ndx]=$0                     # save line
             max=ndx > max ? ndx : max         # keep track of the max line number
          }
          next
        }
        { split(FILENAME,a,/[-.]/)             # split filename on dual delimiters: hyphen and period
          ndx=a[2] 0                           # remove leading zeros
          lines[ndx]=lines[ndx] " " $0 " |"    # append current line to matching line from 1st file
          nextfile                             # skip the rest of the current file
        }

END     { for (i=0;i<=max;i  )
              print lines[i]
        }
' "${master}".csv "${master}"-*.csv

This generates:

 |     link     |     link2     | Description |
1| https://www.youtube.com/watch?v=lhNFZ37OfE4 | https://www.youtube.com/user/kdhx | "Texan singer-songwriter Robert Earl Keen performs the song "What I Really Mean" acoustically with his band, live in the Magnolia Avenue Studios of KDHX, St. Louis, Missouri, February 11, 2010. The full session aired Sun, Feb. 28, 2010 on Songwriter's Showcase, heard Sundays from 10:30 a.m.-noon Central on KDHX with host Ed Becker. Sound and Video by Andy Coco and Ed Kleinberg. Discover more great music (streaming audio, photos, video and more)" |
2| https://www.youtube.com/watch?v=AAAAAAAAAAAAAAAAA | https://www.youtube.com/user/AAAAAAAA | "this one has a short paragraph ... 1 ... 2 ... 3" |
3 | https://www.youtube.com/watch?v=BBBBB | https://www.youtube.com/user/BBBBBBBBBBBBBBB | and then there's this paragraph with a bunch of random characters ... as;dlkfjaw;eorifujqw4[-09hjavnd;oitjuwae[-0g9ujadg;flkjas |
 4| https://www.youtube.com/watch?v=CCCCCCCC | https://www.youtube.com/user/CCCCCC |

CodePudding user response：

This might work.

Based on the FILENAME ending in a number or not ending in a number, columns one and two are collected if the FILENAME does not end in a number and column three is collected if the FILENAME ends in a number.

After all input files are processed, columns one, two, and three are printed.

./doit.awk fileeaa*

|link|link2|Description
1| https://www.youtube.com/watch?v=lhNFZ37OfE4 | https://www.youtube.com/user/kdhx |Texan singer-songwriter Robert Earl Keen performs the song "What I Really Mean" acoustically with his band, live in the Magnolia Avenue Studios of KDHX, St. Louis, Missouri, February 11, 2010. The full session aired Sun, Feb. 28, 2010 on Songwriter's Showcase, heard Sundays from 10:30 a.m.-noon Central on KDHX with host Ed Becker. Sound and Video by Andy Coco and Ed Kleinberg. Discover more great music (streaming audio, photos, video and more)

#!/usr/local/bin/gawk -f

BEGIN { FS="|" }
FILENAME !~ /[0-9]\.csv$/ && $1 > 0 {
  join_on[$1]=$1
  c1[$1] = $2
  c2[$1] = $3
  joins  
}
FILENAME  ~ /[0-9]\.csv$/ {
  match(FILENAME , /-([0-9] )\.csv/, join_int)
  c3[join_int[1]] = $0
}
END {
  print "|link|link2|Description"
  for (j in join_on) {
    print j "|" c1[j] "|" c2[j] "|" c3[j]
  }
}