I have two files. A file disk.txt contains 57665977 rows and database.txt 39035203 rows;
To test my script I made two example files:
$ cat database.txt
01fffca9-05c8-41a9-8539-8bb2f587cef2
02fffd0d-fbcf-4759-9478-cfd32c987101
03fffd54-8d62-4555-a4ce-370f061048d5
04fffdb6-24f9-4b98-865f-ce32bc44872c
05fffe0c-2b9d-47fa-8ee9-2d20d0b28334
06fffea1-46f2-4aa2-93b9-be627189e38b
10ffff8a-cc20-4a2b-b9b2-a3cbc2000e49
11ffffaf-fd54-49f3-9719-4a63690430d9
12ffffc6-4ea8-4336-bdf1-e2d9d71a1c29
$ cat disk.txt
01fffca9-05c8-41a9-8539-8bb2f587cef2
02fffd0d-fbcf-4759-9478-cfd32c987101
03fffd54-8d62-4555-a4ce-370f061048d5
04fffdb6-24f9-4b98-865f-ce32bc44872c
05fffe0c-2b9d-47fa-8ee9-2d20d0b28334
06fffea1-46f2-4aa2-93b9-be627189e38b
07fffeed-5a0b-41f8-86cd-e6d99834c187
08ffff24-fb12-488c-87eb-1a07072fc706
09ffff29-ba3d-4582-8ce2-80b47ed927d1
10ffff8a-cc20-4a2b-b9b2-a3cbc2000e49
What I try to accomplish is to create files for the differences.
- A file with the uniques in disk.txt (so I can delete them from disk)
- A file with the uniques in database.txt (So I can retrieve them from backup and restore)
Using comm
to retrieve differences
I used comm
to see the differences between the two files. Sadly comm
also returns the duplicates after some uniques.
$ comm -13 database.txt disk.txt
07fffeed-5a0b-41f8-86cd-e6d99834c187
08ffff24-fb12-488c-87eb-1a07072fc706
09ffff29-ba3d-4582-8ce2-80b47ed927d1
10ffff8a-cc20-4a2b-b9b2-a3cbc2000e49
$ comm -13 database.txt disk.txt
07fffeed-5a0b-41f8-86cd-e6d99834c187
08ffff24-fb12-488c-87eb-1a07072fc706
09ffff29-ba3d-4582-8ce2-80b47ed927d1
10ffff8a-cc20-4a2b-b9b2-a3cbc2000e49
using comm
on one of these large files takes 28,38s. This is really fast but is solely not a solution.
using fgrep
to strip duplicates from comm
result
I can use fgrep
to remove the duplicates from the comm
result and this works on the example.
$ fgrep -vf duplicate-plus-uniq-disk.txt duplicate-plus-uniq-database.txt
11ffffaf-fd54-49f3-9719-4a63690430d9
12ffffc6-4ea8-4336-bdf1-e2d9d71a1c29
$ fgrep -vf duplicate-plus-uniq-database.txt duplicate-plus-uniq-disk.txt
07fffeed-5a0b-41f8-86cd-e6d99834c187
08ffff24-fb12-488c-87eb-1a07072fc706
09ffff29-ba3d-4582-8ce2-80b47ed927d1
On the large files this script just crashed after a while. So it is not a viable option to solve my problem.
Using python difflib
to get uniques
I tried using this python script I got from BigSpicyPotato's answer on a different post
import difflib
with open(r'disk.txt','r') as masterdata:
with open(r'database.txt','r') as useddata:
with open(r'uniq-disk.txt','w ') as Newdata:
usedfile = [ x.strip('\n') for x in list(useddata) ]
masterfile = [ x.strip('\n') for x in list(masterdata) ]
for line in masterfile:
if line not in usedfile:
Newdata.write(line '\n')
this also works on the example. Currently this is still running and takes up alot my CPU power.. Looking at the uniq-disk file it is really slow aswell..
Question
Any faster / better option I can try in bash / python? I was aswell looking into awk
/ sed
to maybe parse the the results form comm
.
CodePudding user response:
From man comm
, *
added by me:
Compare **sorted** files FILE1 and FILE2 line by line.
You have to sort the files for comm
.
sort database.txt > database_sorted.txt
sort disk.txt > disk_sorted.txt
comm -13 database_sorted.txt disk_sorted.txt
See man sort
for various speed and memory enhancing options, like --batch-size
, --temporary-directory
--buffer-size
--parallel
.
A file with the uniques in disk.txt
A file with the uniques in database.txt
After sorting, you can implement your python program that compares line-by-line the files and write to mentioned files, just like comm
with custom output. Do not store whole files in memory.
You can also do something along this with join
or comm --output-delimiter=' '
:
join -v1 -v2 -o 1.1,2.1 disk_sorted.txt database_sorted.txt | tee >(
cut -d' ' -f1 | grep -v '^$' > unique_in_disk.txt) |
cut -d' ' -f2 | grep -v '^$' > unique_in_database.txt
CodePudding user response:
comm
does exactly what I needed. I had a white space behind line 10 of my disk.txt file. therefor comm returned it as a unique string. Please check @KamilCuk answer for more context about sorting your files and using comm.