I work on database that holds file information. Data are spread across few tables, with data from File
table, which contains file_id
there are associated data in other tables such as disk
, directory
, hashes
, etc. My current goal is to be able to find duplicate files based on collected earlier file hash data. Additionaly target is to be able to make query across two databases at once, so query result will include also value indicating source database, in this query these values are dummy and present in result as org_db
and fnd_db
as placeholders that, as I guess, their presence is significant in solving my problem so I do mention their existence.
SELECT
A.file_id org_file_id,
B.file_id fnd_file_id,
AF.directory_id org_dir_id,
BF.directory_id fnd_dir_id,
AD.disk_id org_disk,
BD.disk_id fnd_disk,
1 org_db,
1 fnd_db
FROM fhash A, file AF, file BF, directory AD, directory BD
INNER JOIN fhash B ON B.data = A.data
WHERE
A.file_id <> B.file_id AND
A.file_id IN (SELECT _id FROM file WHERE directory_id IN (SELECT _id FROM directory WHERE disk_id <> 0)) AND
B.file_id IN (SELECT _id FROM file WHERE directory_id IN (SELECT _id FROM directory WHERE disk_id <> 0)) AND
A.file_id = AF._id AND
B.file_id = BF._id AND
AF.directory_id = AD._id AND
BF.directory_id = BD._id AND
ORDER BY org_file_id
| org_file_id | fnd_file_id | org_dir_id | fnd_dir_id | org_disk | fnd_disk | org_db | fnd_db |
===============================================================================================
| 97 | 118 | 5 | 8 | 2 | 3 | 1 | 1 |
| 106 | 147 | 8 | 9 | 3 | 3 | 1 | 1 |
| 106 | 175 | 8 | 10 | 3 | 3 | 1 | 1 |
| 107 | 148 | 8 | 9 | 3 | 3 | 1 | 1 |
| 107 | 176 | 8 | 10 | 3 | 3 | 1 | 1 |
| 108 | 149 | 8 | 9 | 3 | 3 | 1 | 1 |
| 108 | 177 | 8 | 10 | 3 | 3 | 1 | 1 |
| 110 | 151 | 8 | 9 | 3 | 3 | 1 | 1 |
| 110 | 179 | 8 | 10 | 3 | 3 | 1 | 1 |
...
| 118 | 97 | 8 | 5 | 3 | 2 | 1 | 1 |
| 147 | 106 | 9 | 8 | 3 | 3 | 1 | 1 |
| 148 | 107 | 9 | 8 | 3 | 3 | 1 | 1 |
| 149 | 108 | 9 | 8 | 3 | 3 | 1 | 1 |
| 151 | 110 | 9 | 8 | 3 | 3 | 1 | 1 |
| 175 | 106 | 10 | 8 | 3 | 3 | 1 | 1 |
| 176 | 107 | 10 | 8 | 3 | 3 | 1 | 1 |
| 177 | 108 | 10 | 8 | 3 | 3 | 1 | 1 |
| 179 | 110 | 10 | 8 | 3 | 3 | 1 | 1 |
Result I got is fairly good and can be used after further processing but what i'm getting is double what I desire. For each row there is a mirror one, further in table, where in fact each of following four pairs of columns is copy of earlier row with data switched between coresponding columns.
What I'm wondering about is if, and how I could get right result straight from query with no further processing.
If I would got only two columns and no additional factors like disks, directories and databases fix would be simple as changing
A.file_id <> B.file_id AND
to
A.file_id < B.file_id AND
Additionaly these lines
A.file_id IN (SELECT _id FROM file WHERE directory_id IN (SELECT _id FROM directory WHERE disk_id <> 0)) AND
B.file_id IN (SELECT _id FROM file WHERE directory_id IN (SELECT _id FROM directory WHERE disk_id <> 0)) AND
are just sample, and depending on search criteria they are changing, these could be as well like
A.file_id IN (SELECT _id FROM file WHERE directory_id IN (SELECT _id FROM directory WHERE disk_id = 0)) AND
B.file_id IN (SELECT _id FROM file WHERE directory_id IN (SELECT _id FROM directory WHERE disk_id = 0)) AND
or
A.file_id IN (SELECT _id FROM file WHERE directory_id IN ([query gets ids of dir subdirs])) AND
B.file_id IN (SELECT _id FROM file WHERE directory_id IN (SELECT _id FROM directory WHERE disk_id <> 0)) AND
Or in other way. Rest of query stays pretty much unchanged besides further modifications to make it work across two databases.
CodePudding user response:
For this sample data I believe that using:
GROUP BY MIN(org_file_id, fnd_file_id), MAX(org_file_id, fnd_file_id)
where MIN()
and MAX()
are SQLite's scalar functions, will eliminate the duplicates:
SELECT A.file_id org_file_id,
B.file_id fnd_file_id,
AF.directory_id org_dir_id,
BF.directory_id fnd_dir_id,
AD.disk_id org_disk,
BD.disk_id fnd_disk,
1 org_db,
1 fnd_db
FROM fhash A
INNER JOIN file AF ON A.file_id = AF._id
INNER JOIN directory AD ON AF.directory_id = AD._id
INNER JOIN fhash B ON B.data = A.data
INNER JOIN file BF ON B.file_id = BF._id
INNER JOIN directory BD ON BF.directory_id = BD._id
WHERE A.file_id <> B.file_id
AND A.file_id IN (SELECT _id FROM file WHERE directory_id IN (SELECT _id FROM directory WHERE disk_id <> 0))
AND B.file_id IN (SELECT _id FROM file WHERE directory_id IN (SELECT _id FROM directory WHERE disk_id <> 0))
GROUP BY MIN(org_file_id, fnd_file_id), MAX(org_file_id, fnd_file_id)
ORDER BY org_file_id;
I also changed to proper explicit joins with ON
clauses.
See a simplified demo.