Improved question for clarity:
Hello there so I have about 2000 csv files
One master file called fileaa.csv
And 1999 description files called fileaa-1.csv, fileaa-2.csv, fileaa-4.csv...
(some numbers are missing)
I want to add a 3rd column to the 2 column master file:
| link | link2 |
1| somelink.com | somelink2.com |
like so
| link | link2 | description |
1| somelink.com | somelink2.com | some description |
where the description of line 1 comes from fileaa-1.csv
, which is a single-cell csv with a paragraph of text.
Does anyone know how to do this at scale? I have 100 other masters with about 2000 descriptions each.
Edit (incl. commands):
Things I couldn't try:
cat * | awk 'NR==FNR{a[NR]=$0;next}{print a[FNR],$0}' fileaa.csv fileaa-1.csv
wouldn't work because of the missing numbers
awk '{print $0,NR}' fileaa.csv; \
find /mnt/media/fileaa.csv -type f -exec sed -i 's/1/fileaa-1.csv/g' {} \;
because sed can't read external files inside the -exec sed command
Edit 1:
The exact contents of fileaa-1.csv
are:
"Texan singer-songwriter Robert Earl Keen performs the song "What I Really Mean" acoustically with his band, live in the Magnolia Avenue Studios of KDHX, St. Louis, Missouri, February 11, 2010. The full session aired Sun, Feb. 28, 2010 on Songwriter's Showcase, heard Sundays from 10:30 a.m.-noon Central on KDHX with host Ed Becker. Sound and Video by Andy Coco and Ed Kleinberg. Discover more great music (streaming audio, photos, video and more)"
The exact input:
| link | link2 |
1| https://www.youtube.com/watch?v=lhNFZ37OfE4 | https://www.youtube.com/user/kdhx |
The exact desired output:
| link | link2 | description |
1| https://www.youtube.com/watch?v=lhNFZ37OfE4 | https://www.youtube.com/user/kdhx | "Texan singer-songwriter Robert Earl Keen performs the song "What I Really Mean" acoustically with his band, live in the Magnolia Avenue Studios of KDHX, St. Louis, Missouri, February 11, 2010. The full session aired Sun, Feb. 28, 2010 on Songwriter's Showcase, heard Sundays from 10:30 a.m.-noon Central on KDHX with host Ed Becker. Sound and Video by Andy Coco and Ed Kleinberg. Discover more great music (streaming audio, photos, video and more)" |
Edit 2:
The contents of fileaa.csv
are already in order and do not need to be sorted. It is not possible for there to be a fileaa-[number].csv
that does not match a row in fileaa.csv
.
Edit 3:
There are no |
of linefeeds in the data.
To be honest I am a complete beginner and I don't really know where to start on this one.
Any help will be appreciated ❤️
CodePudding user response:
Assumptions:
- the 'paragraph' from the
fileaa-*.csv
files is on a single line (ie, does not include any embedded linefeeds) - assuming the sample from OP's
fileaa-1.csv
is one long line and what we're seeing in the question is an issue of incorrect formatting of the paragraph (ie, there are no linefeeds) - we can ignore anything on lines 2-N from the
fileaa-*.csv
files - we only append a field to a line in
fileaa.csv
if we find a matching file (ie, we don't worry about appending an empty field if the matchingfileaa-*.csv
files does not exist) - the finale result (ie, contents of all files) will fit in memory
Adding some additional sample data:
$ head fileaa*csv
==> fileaa-1.csv <==
"Texan singer-songwriter Robert Earl Keen performs the song "What I Really Mean" acoustically with his band, live in the Magnolia Avenue Studios of KDHX, St. Louis, Missouri, February 11, 2010. The full session aired Sun, Feb. 28, 2010 on Songwriter's Showcase, heard Sundays from 10:30 a.m.-noon Central on KDHX with host Ed Becker. Sound and Video by Andy Coco and Ed Kleinberg. Discover more great music (streaming audio, photos, video and more)"
==> fileaa-2.csv <==
"this one has a short paragraph ... 1 ... 2 ... 3"
==> fileaa-3.csv <==
and then there's this paragraph with a bunch of random characters ... as;dlkfjaw;eorifujqw4[-09hjavnd;oitjuwae[-0g9ujadg;flkjas
==> fileaa.csv <==
| link | link2 |
1| https://www.youtube.com/watch?v=lhNFZ37OfE4 | https://www.youtube.com/user/kdhx |
2| https://www.youtube.com/watch?v=AAAAAAAAAAAAAAAAA | https://www.youtube.com/user/AAAAAAAA |
3 | https://www.youtube.com/watch?v=BBBBB | https://www.youtube.com/user/BBBBBBBBBBBBBBB |
4| https://www.youtube.com/watch?v=CCCCCCCC | https://www.youtube.com/user/CCCCCC |
NOTE: since there is no fileaa-4.csv
we will not append anything to the last line (where 1st field = 4
) in fileaa.csv
One awk
idea:
master='fileaa'
awk '
FNR==NR { if (FNR==1)
lines[0]=$0 " Description |" # save header line
else {
split($0,a,"|") # get line number
ndx=a[1] 0 # remove spaces and leading zeros
lines[ndx]=$0 # save line
max=ndx > max ? ndx : max # keep track of the max line number
}
next
}
{ split(FILENAME,a,/[-.]/) # split filename on dual delimiters: hyphen and period
ndx=a[2] 0 # remove leading zeros
lines[ndx]=lines[ndx] " " $0 " |" # append current line to matching line from 1st file
nextfile # skip the rest of the current file
}
END { for (i=0;i<=max;i )
print lines[i]
}
' "${master}".csv "${master}"-*.csv
This generates:
| link | link2 | Description |
1| https://www.youtube.com/watch?v=lhNFZ37OfE4 | https://www.youtube.com/user/kdhx | "Texan singer-songwriter Robert Earl Keen performs the song "What I Really Mean" acoustically with his band, live in the Magnolia Avenue Studios of KDHX, St. Louis, Missouri, February 11, 2010. The full session aired Sun, Feb. 28, 2010 on Songwriter's Showcase, heard Sundays from 10:30 a.m.-noon Central on KDHX with host Ed Becker. Sound and Video by Andy Coco and Ed Kleinberg. Discover more great music (streaming audio, photos, video and more)" |
2| https://www.youtube.com/watch?v=AAAAAAAAAAAAAAAAA | https://www.youtube.com/user/AAAAAAAA | "this one has a short paragraph ... 1 ... 2 ... 3" |
3 | https://www.youtube.com/watch?v=BBBBB | https://www.youtube.com/user/BBBBBBBBBBBBBBB | and then there's this paragraph with a bunch of random characters ... as;dlkfjaw;eorifujqw4[-09hjavnd;oitjuwae[-0g9ujadg;flkjas |
4| https://www.youtube.com/watch?v=CCCCCCCC | https://www.youtube.com/user/CCCCCC |
CodePudding user response:
This might work.
Based on the FILENAME
ending in a number or not ending in a number, columns one and two are collected if the FILENAME
does not end in a number and column three is collected if the FILENAME
ends in a number.
After all input files are processed, columns one, two, and three are printed.
./doit.awk fileeaa*
|link|link2|Description
1| https://www.youtube.com/watch?v=lhNFZ37OfE4 | https://www.youtube.com/user/kdhx |Texan singer-songwriter Robert Earl Keen performs the song "What I Really Mean" acoustically with his band, live in the Magnolia Avenue Studios of KDHX, St. Louis, Missouri, February 11, 2010. The full session aired Sun, Feb. 28, 2010 on Songwriter's Showcase, heard Sundays from 10:30 a.m.-noon Central on KDHX with host Ed Becker. Sound and Video by Andy Coco and Ed Kleinberg. Discover more great music (streaming audio, photos, video and more)
#!/usr/local/bin/gawk -f
BEGIN { FS="|" }
FILENAME !~ /[0-9]\.csv$/ && $1 > 0 {
join_on[$1]=$1
c1[$1] = $2
c2[$1] = $3
joins
}
FILENAME ~ /[0-9]\.csv$/ {
match(FILENAME , /-([0-9] )\.csv/, join_int)
c3[join_int[1]] = $0
}
END {
print "|link|link2|Description"
for (j in join_on) {
print j "|" c1[j] "|" c2[j] "|" c3[j]
}
}