I have two lists, each containing the absolute paths of all files in a directory.
I generated each list using find "$(pwd)" -type f
List 1:
/home/ec2-user/eclipsebio_toolkit/scripts/get_gene_counts_for_miRNA_specific_chimeric.py
/home/ec2-user/eclipsebio_toolkit/scripts/get_gene_info_sample_comparisons.py
/home/ec2-user/eclipsebio_toolkit/scripts/get_miRNA_counts_for_gene_specific_chimeric.py
/home/ec2-user/eclipsebio_toolkit/scripts/get_mrna_lengths.py
/home/ec2-user/eclipsebio_toolkit/scripts/get_nonnegative_peaks.py
/home/ec2-user/eclipsebio_toolkit/scripts/get_peak_gene_ids_chimeric.py
/home/ec2-user/eclipsebio_toolkit/scripts/get_peak_gene_ids_w_output.py
/home/ec2-user/eclipsebio_toolkit/scripts/get_peaks_not_sig.py
/home/ec2-user/eclipsebio_toolkit/scripts/get_peaks_overlapping_chimeric_reads.py
/home/ec2-user/eclipsebio_toolkit/scripts/create_ribo_html_report.py
List 2:
/home/ec2-user/snakemake_eclip/scripts/count_reads_broadfeatures_frombamfi_SRmap.pl
/home/ec2-user/snakemake_eclip/scripts/create_html_report.py
/home/ec2-user/snakemake_eclip/scripts/create_idr_html_report.py
/home/ec2-user/snakemake_eclip/scripts/create_mapped_read_num.py
/home/ec2-user/snakemake_eclip/scripts/create_metagene_plot_from_saturations.py
/home/ec2-user/snakemake_eclip/scripts/create_metagene_plot_from_saturations_peaks.py
/home/ec2-user/snakemake_eclip/scripts/create_metagene_plot_from_saturations_reads.py
/home/ec2-user/snakemake_eclip/scripts/create_peak_norm_manifests.py
/home/ec2-user/snakemake_eclip/scripts/create_pureclip_html_report.py
/home/ec2-user/snakemake_eclip/scripts/create_ribo_html_report.py
I would like to find files that appear in both lists and then delete from disk (rm) only the duplicate copies found in list 1.
I have tried using awk 'NR == FNR{ a[$0] = 1;next } !a[$0]' list1 list2
to delete items only found in list 1, but this compares entire lines, so the differing directory prefixes in the absolute paths keep any two entries from matching.
CodePudding user response:
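Your awk script is almost right. Split each line on / and compare the file name (the last field, $NF) instead of the whole line, printing the matching lines from the first list: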
awk -F / 'NR == FNR{ a[$NF] = 1;next } a[$NF]' file2 file1
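To see what this prints, here is a minimal sketch on two made-up lists. The /a and /b paths and the temporary directory are illustrative assumptions, not taken from the question. The command only lists candidates; nothing is deleted.

```shell
# Hypothetical demo data: two tiny lists written to a temporary directory.
tmp=$(mktemp -d)
printf '%s\n' /a/scripts/get_mrna_lengths.py /a/scripts/create_ribo_html_report.py > "$tmp/list1"
printf '%s\n' /b/scripts/create_html_report.py /b/scripts/create_ribo_html_report.py > "$tmp/list2"

# First file (list2): remember every file name; $NF is the last /-separated field.
# Second file (list1): print the full path when its file name was seen in list2.
dups=$(awk -F / 'NR == FNR{ a[$NF] = 1;next } a[$NF]' "$tmp/list2" "$tmp/list1")
printf '%s\n' "$dups"

rm -r "$tmp"
```

With this sample data the only line printed is the list 1 path of create_ribo_html_report.py, which is the file that would be passed to rm.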
CodePudding user response:
awk -F '/' '{print $NF}' list1 list2 \
| sort \
| uniq -d \
| xargs -I {} echo rm -v /home/ec2-user/eclipsebio_toolkit/scripts/{}
Output:
rm -v /home/ec2-user/eclipsebio_toolkit/scripts/create_ribo_html_report.py
If the output looks okay, remove echo.
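The pipeline can be tried end to end on throwaway files before pointing it at real data. The sketch below is an assumption-laden rehearsal: the toolkit and eclip directory names are hypothetical stand-ins for the two script folders, and everything lives under a temporary directory.

```shell
# Hypothetical layout: two temp directories stand in for the two script folders.
tmp=$(mktemp -d)
mkdir "$tmp/toolkit" "$tmp/eclip"
touch "$tmp/toolkit/get_mrna_lengths.py" "$tmp/toolkit/create_ribo_html_report.py"
touch "$tmp/eclip/create_html_report.py" "$tmp/eclip/create_ribo_html_report.py"
find "$tmp/toolkit" -type f > "$tmp/list1"
find "$tmp/eclip" -type f > "$tmp/list2"

# Print the file names from both lists, keep the names that occur more than once,
# and remove the copy that lives under the first directory.
awk -F '/' '{print $NF}' "$tmp/list1" "$tmp/list2" \
| sort \
| uniq -d \
| xargs -I {} rm -- "$tmp/toolkit/{}"

# Show what is left in the first directory.
remaining=$(ls "$tmp/toolkit")
echo "$remaining"

rm -r "$tmp"
```

Only get_mrna_lengths.py remains in the first directory afterwards; the copy in the second directory is untouched.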
CodePudding user response:
awk -F'/' 'NR==FNR{ a[$NF]; next } $NF in a' file2 file1 | xargs rm
Always use the key in a idiom shown above for the second part of the script, instead of a[key] as you were trying to do:
awk 'NR == FNR{ a[$0] = 1;next } !a[$0]' list1 list2
because that wastes cycles and memory by storing 1 for every key that exists in your first file (a[$0] = 1) and then wastes more memory by creating an entry for every new key from the 2nd file too with !a[key].
This doesn't require you to assign any value when populating a[key], and it tests whether key exists in a by a hash lookup without adding anything else to a[]:
key in a
!(key in a)
while this does require you to initially assign a non-zero value to a[key]; the code below then does a hash lookup of key in a[], adds an entry to a[] indexed by key if that key doesn't exist, and then tests whether the value for that entry is non-zero:
a[key]
!a[key]
so just don't do that as it's a waste of time and memory.
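The difference is easy to observe. In this small sketch (the keys x, y, z are made up), nothing is ever stored in a, so any entries counted at the end were created by the test itself:

```shell
# Test three keys that were never stored, then count the entries left in a.
with_subscript=$(printf '%s\n' x y z |
    awk '{ if (!a[$0]) n++ } END { for (k in a) c++; print c + 0 }')
with_in=$(printf '%s\n' x y z |
    awk '{ if (!($0 in a)) n++ } END { for (k in a) c++; print c + 0 }')

echo "!a[key] left $with_subscript entries; !(key in a) left $with_in"
```

The first count is 3, because each a[$0] reference created an empty entry; the second is 0.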