I'm writing an application in PHP that stores names in a database. These names originated on web pages, and these pages are now stored in a database table. The full HTML document is stored in a table on the database.
Part of the processing means that each of these records are loaded individually, then parsed using DOMXpath. Each name is taken in a format of "J Doe", and this is then added to a table. Before the name is added to the table, the table is checked to see if the name already exists. If it does, then we get the ID of the name and then move on with the rest of the processing. If it doesn't, it gets added.
The issue I found is that while a name may exist in the table, it would be added again as a duplicate. I reviewed the code and found that I had rouge white spaces at the beginning of the string. I stripped the string of this white space, leaving me with "J Doe" and I tried again. The script was still adding the same name twice!
I reviewed again my code. The beginning of the stored HTML that I was parsing has this:
HTTP/2 200
content-type: text/html; charset=utf-8
cache-control: public, max-age=1
content-encoding: gzip
date: Thu, 17 Feb 2022 11:38:21 GMT
strict-transport-security: max-age=7776000
x-powered-by: Next.js
vary: Accept-Encoding
I noticed it was set to use UTF-8, so part of the processing script I added this to the DOMDocument part:
new DOMDocument('1.0', 'UTF-8');
I tried again. Still the name was being added twice. I also checked to see what charset the table with the names in was using. That too is set to UTF-8
, with a utf8_unicode_ci
charset. Everytihng is set to use UTF-8!
It's at this point my head started to melt.
I wrote the following SQL query, and ran it. I expected it to return 2 results:
SELECT * FROM names WHERE name = 'J Doe';
It returned only one record, with an ID of 1
;
I checked my data, I 100% have two records for J Doe
. On the second record, I wanted to check for the hundreth time that I didn't have any white space, other than the gap between the first initial and surname. I copied the contents of the name
cell, pasted it in to the above query. This time, only one record was retrieved, and this record had the ID of 2
.
Confused, I emptied the WHERE parameter and typed in J Doe
again and ran the query. Only one record was returned, the record with an ID of 1
.
I then decided to manually edit the two records. On both records I typed in J Doe
in to their respective name
cells. I then re-ran the above SELECT query - both records were returned! I re-ran my processor so that it added the names it found, tried the query again and it would only return one of the records even though they were both there. Furthermore, I then changed the query to this to ignore any white space either side of the string:
SELECT * FROM names WHERE name LIKE '%J Doe%';
This only returned one record.
So my situation here is that to me, I have two strings which are visually identical. Both of them are J Doe
. There is no white space surrounding the string. Yet when queries are run as above, only one record is returned. In the table, while the names are visually identical, somehow on the database they are different. But not in an obvious way! I have made sure the original HTML has a charset of UTF-8. I have made sure PHP's DOMDocument loads the HTML as UTF-8. I have made sure the table this data goes in to is UTF-8.
How do I investigate the actual difference in these two identical texts? And how do I go about resolving it so that this name as well as the others being added are treated identically?
EDIT #1: Following from @patfreeze's comment, I ran this query:
SELECT * FROM names WHERE name LIKE '%Doe%'
Returned the two names! I added a space to '% Doe%'
and ran the query, only one was then returned.
I then ran the query: SELECT * FROM names WHERE name LIKE '%J%'
And again, the two records were returned. Like above, I then added a space '%J %'
and only one record was returned.
When I add the space, I'm adding it via the spacebar on my computer. One of the HTML files I'm parsing has a space that doesn't use an equivalent space, but I'm not sure now what space it is using?
EDIT #2: Following on from @Solarflare, I ran this query (which will just target the two name records by ID):
SELECT HEX(name) FROM names WHERE id IN ('1', '2');
This has identified that the white space in both names is 20
and A0
respectively.
This is what is causing my issue. I am now trying to work out how to convert the spaces so that they're uniform.
CodePudding user response:
I suspect that the culprit is non- breaking white space, char 169.
Try running
UPDATE Your_Table
SET Your_Column = REPLACE(Your_Column, NCHAR(0x00A0), '')
WHERE Id = x
To replace it with a regular space char 32.
If you want to check first you could run this query:
SELECT name
FROM names
WHERE name = 'J Doe'
OR name = CONCAT('J',NCHAR(0x00A0),'Doe');
which I would expect to return both records.
CodePudding user response:
This answer is based on the investigation that was helped by @Akina, @Patfreeze and @Solarflare in the comments.
The issue was down to the white space between the J
and the Doe
. Using:
SELECT HEX(name) FROM names WHERE id IN ('1', '2');
Identified that the spaces were indeed different.
I've amended my processing code so that these white spaces are converted in to something more uniform. That code is:
preg_replace("/[\pZ\pC] /u", " ", $name)
As I wrote this answer, @Kendle contributed with a similar solution, but at a database level. As I'm adding data to the database, it makes sense to do it before it gets to the table.