Delete duplicates from two lists (bash)

I have two lists containing absolute paths of all files in PWD.

I generated each list using find "$(pwd)" -type f

List 1:

/home/ec2-user/eclipsebio_toolkit/scripts/get_gene_counts_for_miRNA_specific_chimeric.py
/home/ec2-user/eclipsebio_toolkit/scripts/get_gene_info_sample_comparisons.py
/home/ec2-user/eclipsebio_toolkit/scripts/get_miRNA_counts_for_gene_specific_chimeric.py
/home/ec2-user/eclipsebio_toolkit/scripts/get_mrna_lengths.py
/home/ec2-user/eclipsebio_toolkit/scripts/get_nonnegative_peaks.py
/home/ec2-user/eclipsebio_toolkit/scripts/get_peak_gene_ids_chimeric.py
/home/ec2-user/eclipsebio_toolkit/scripts/get_peak_gene_ids_w_output.py
/home/ec2-user/eclipsebio_toolkit/scripts/get_peaks_not_sig.py
/home/ec2-user/eclipsebio_toolkit/scripts/get_peaks_overlapping_chimeric_reads.py
/home/ec2-user/eclipsebio_toolkit/scripts/create_ribo_html_report.py

List 2:

/home/ec2-user/snakemake_eclip/scripts/count_reads_broadfeatures_frombamfi_SRmap.pl
/home/ec2-user/snakemake_eclip/scripts/create_html_report.py
/home/ec2-user/snakemake_eclip/scripts/create_idr_html_report.py
/home/ec2-user/snakemake_eclip/scripts/create_mapped_read_num.py
/home/ec2-user/snakemake_eclip/scripts/create_metagene_plot_from_saturations.py
/home/ec2-user/snakemake_eclip/scripts/create_metagene_plot_from_saturations_peaks.py
/home/ec2-user/snakemake_eclip/scripts/create_metagene_plot_from_saturations_reads.py
/home/ec2-user/snakemake_eclip/scripts/create_peak_norm_manifests.py
/home/ec2-user/snakemake_eclip/scripts/create_pureclip_html_report.py
/home/ec2-user/snakemake_eclip/scripts/create_ribo_html_report.py

I would like to find files that appear in both lists (same file name) and then delete those duplicates from list 1 only (rm from disk).

I have tried using awk 'NR == FNR{ a[$0] = 1;next } !a[$0]' list1 list2 to delete items only found in list 1, but this does not take the absolute path into consideration.

CodePudding user response：

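Your awk script is almost right. Use / as the field separator and compare only the last field (the file name) rather than the whole line: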

awk -F / 'NR == FNR{ a[$NF] = 1;next } a[$NF]' file2 file1
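To see what that kind of command prints, here is a small sketch using shortened copies of the two lists from the question. The scratch directory and the variable names tmp and dupes are made up for the illustration, and nothing is deleted from the real directories.

```shell
#!/usr/bin/env bash
# Hypothetical scratch directory holding shortened copies of the two lists.
tmp=$(mktemp -d)
cat > "$tmp/list1" <<'EOF'
/home/ec2-user/eclipsebio_toolkit/scripts/get_mrna_lengths.py
/home/ec2-user/eclipsebio_toolkit/scripts/create_ribo_html_report.py
EOF
cat > "$tmp/list2" <<'EOF'
/home/ec2-user/snakemake_eclip/scripts/create_html_report.py
/home/ec2-user/snakemake_eclip/scripts/create_ribo_html_report.py
EOF

# Read list2 first and remember each file name (the last /-separated field),
# then print the list1 paths whose file name was seen in list2.
dupes=$(awk -F/ 'NR == FNR { seen[$NF]; next } $NF in seen' "$tmp/list2" "$tmp/list1")
printf '%s\n' "$dupes"

# Remove only the scratch directory created above.
rm -r "$tmp"
```

With these inputs the only line printed is the create_ribo_html_report.py path from list 1, which is the one candidate for deletion.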

CodePudding user response：

awk -F '/' '{print $NF}' list1 list2 \
| sort \
| uniq -d \
| xargs -I {} echo rm -v /home/ec2-user/eclipsebio_toolkit/scripts/{}

Output:

rm -v /home/ec2-user/eclipsebio_toolkit/scripts/create_ribo_html_report.py

If the output looks okay, remove echo to actually delete the files.

CodePudding user response：

awk -F'/' 'NR==FNR{ a[$NF]; next } $NF in a' file2 file1 | xargs rm
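One caveat: by default xargs splits its input on whitespace, so a path containing a space would be passed to rm in pieces. A minimal sketch of a more defensive variant follows, run against throwaway files so it can be tried safely. The directory names toolkit and pipeline and the file names are invented for the demonstration, and paths containing newlines are still not handled.

```shell
#!/usr/bin/env bash
# Throwaway sandbox; the directory and file names are hypothetical.
sandbox=$(mktemp -d)
mkdir "$sandbox/toolkit" "$sandbox/pipeline"
touch "$sandbox/toolkit/only here.py" "$sandbox/toolkit/shared report.py"
touch "$sandbox/pipeline/shared report.py" "$sandbox/pipeline/other.py"

# Build the two lists the same way as in the question.
find "$sandbox/toolkit" -type f > "$sandbox/list1"
find "$sandbox/pipeline" -type f > "$sandbox/list2"

# Print the list1 paths whose file name also occurs in list2, then delete
# them one line at a time so that spaces inside a path are preserved.
awk -F/ 'NR == FNR { seen[$NF]; next } $NF in seen' "$sandbox/list2" "$sandbox/list1" |
while IFS= read -r path; do
    rm -v -- "$path"
done

# Record what is left in the list1 directory, then clean up the sandbox.
remaining=$(ls "$sandbox/toolkit")
echo "left in toolkit: $remaining"
rm -r "$sandbox"
```

Replacing rm -v with echo rm -v inside the loop gives a dry run first, in the same spirit as the echo trick in the previous answer.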

Always use the "key in a" idiom shown above for the second part of the script, rather than testing a[key] as you were trying to do:

awk 'NR == FNR{ a[$0] = 1;next } !a[$0]' list1 list2

because that wastes cycles and memory: a[$0] = 1 stores a 1 for every key in the first file, and the !a[$0] test then adds an entry for every key from the second file that isn't already in the array.

The following form doesn't require you to assign any value when populating a[key], and it tests whether key exists in a with a hash lookup, without adding anything to a[]:

key in a
!(key in a)

By contrast, the following form requires you to first assign a non-zero value to a[key]. Each test then does a hash lookup of key in a[], adds an entry to a[] indexed by key if that key doesn't exist, and finally tests whether the value of that entry is non-zero:

a[key]
!a[key]

so just don't do that as it's a waste of time and memory.
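The side effect can be observed directly by counting the array entries after each kind of test. This is a small sketch that assumes an awk supporting length(array), as gawk, mawk, and the BSD awk do. The keys x and y and the shell variable names are arbitrary.

```shell
#!/usr/bin/env bash
# "key in a" only looks the key up, so the array keeps its single entry.
with_in=$(awk 'BEGIN { a["x"]; if (!("y" in a)) n = length(a); print n }')

# Referencing a["y"] creates that entry as a side effect of the test.
with_ref=$(awk 'BEGIN { a["x"]; if (!a["y"]) n = length(a); print n }')

echo "entries after (key in a): $with_in"
echo "entries after a[key]:     $with_ref"
```

The first count stays at 1 while the second grows to 2, which is the extra memory the answer is warning about, multiplied by every unmatched line of the second file.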