Preg match html language with lang attribute php

Time:06-20

I would like to use preg_match in PHP to parse the site language out of an HTML document.

My preg_match:

$sitelang = preg_match('!<html lang="(.*?)">!i', $result, $matches) ? $matches[1] : 'Site Language not detected';

It works when the html tag has only a simple lang attribute, without any class or id. For example: Input:

<html lang="de">

Output:

de

But when I have other HTML code like this: Input:

<html lang="en" >

Output:

en" [...]

preg_match('!<html lang="([^"]+)"!i', $result, $matches)

It also works well for your last sample.

CodePudding user response:

Standard disclaimer of using regex to parse HTML aside, there are two things you likely want. First, get rid of the closing bracket in your pattern. Once you have the close quote, the rest of the line doesn't matter. Second, make sure what's inside the quotes doesn't itself contain quotes.

Currently: open quote, then anything, then close quote:

preg_match('!<html lang="(.*?)">!i', $result, $matches)

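This means that if anything follows lang="foo" inside the tag, the capture runs past the closing quote and you get foo" followed by the rest of the tag.

Instead, match an open quote, then anything that is not a quote, then a close quote:

preg_match('!<html lang="([^"]+)"!i', $result, $matches)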
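
To see the difference between the lazy dot and the negated character class, here is a small sketch. The sample tag with an extra class attribute is an assumption made up for illustration, not taken from the question:

```php
<?php
// Assumed sample: an extra attribute follows lang inside the tag.
$html = '<html lang="en" class="no-js">';

// Lazy dot: stops only at the first occurrence of "> so it runs past the closing quote.
preg_match('!<html lang="(.*?)">!i', $html, $lazy);

// Negated class: cannot cross a double quote, so it stops at the end of the value.
preg_match('!<html lang="([^"]+)"!i', $html, $negated);

echo $lazy[1], "\n";     // en" class="no-js
echo $negated[1], "\n";  // en
```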

If you want to be more resilient, change the hard space to one or more whitespace chars:

preg_match('!<html\s+lang="([^"]+)"!i', $result, $matches)
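
A rough sketch of what the whitespace change buys you. The helper name detectLang and the sample strings are hypothetical, only there to compare a hard space against \s+:

```php
<?php
// Hypothetical helper: returns the captured lang value, or null if the pattern does not match.
function detectLang(string $pattern, string $html): ?string
{
    return preg_match($pattern, $html, $m) ? $m[1] : null;
}

$strict   = '!<html lang="([^"]+)"!i';    // exactly one space before lang
$flexible = '!<html\s+lang="([^"]+)"!i';  // one or more whitespace chars before lang

$samples = [
    '<html lang="de">',
    "<html\n  lang=\"fr\">",  // assumed sample: lang on its own line
];

foreach ($samples as $html) {
    var_dump(detectLang($strict, $html), detectLang($flexible, $html));
}
```

Both patterns still require lang to be the first attribute, so a tag with other attributes before lang is not matched by either.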

CodePudding user response:

Thank you so much, Alex.

But now I can't parse the language when the lang attribute does not come directly after the html tag name.

<html data-n-head-ssr lang="en">

Output:

Site Language not detected

Any idea?

CodePudding user response:

When parsing arbitrary HTML, preferably use an HTML parser such as DOMDocument.

$dom = new DOMDocument();
@$dom->loadHTML($html);

$lang = $dom->getElementsByTagName('html')[0]->getAttribute('lang');

See PHP demo at tio.run (used the @ to suppress errors if anything goes wrong)
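
If you would rather not use @, one possible variant is to let libxml collect the parse warnings instead. This is only a sketch: the function name langFromHtml and the empty-string fallback are my own assumptions, not part of the snippet above.

```php
<?php
// Hypothetical helper: returns the lang attribute of <html>, or '' if there is none.
function langFromHtml(string $html): string
{
    if ($html === '') {
        return '';  // loadHTML() rejects an empty string, so bail out early
    }

    // Collect libxml parse warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);  // restore the previous setting

    $root = $dom->getElementsByTagName('html')->item(0);

    // getAttribute() returns '' when the attribute is missing.
    return $root instanceof DOMElement ? $root->getAttribute('lang') : '';
}

echo langFromHtml('<html data-n-head-ssr lang="en"><head><title>x</title></head><body></body></html>'), "\n";
```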


If you insist on using regex, here is a somewhat broader pattern that matches more cases:

$pattern = '~<html\b[^><]+?\blang\s*=\s*["\']\s*\K[^"\']+~i';

$lang = preg_match($pattern, $html, $out) ? $out[0] : "";

\K resets the beginning of the reported match, so we don't need a capture group; the value ends up in $out[0].
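
As a rough illustration of what \K changes, the sketch below runs the pattern with \K next to an otherwise identical variant that uses a capture group. The variable names are made up for the comparison; the input is the tag from the follow-up question:

```php
<?php
$html = '<html data-n-head-ssr lang="en">';

// With \K: everything matched before \K is dropped, so the whole match ($withK[0]) is the value.
preg_match('~<html\b[^><]+?\blang\s*=\s*["\']\s*\K[^"\']+~i', $html, $withK);

// Without \K: the whole match starts at <html, so the value has to be read from group 1.
preg_match('~<html\b[^><]+?\blang\s*=\s*["\']\s*([^"\']+)~i', $html, $withGroup);

var_dump($withK[0] === $withGroup[1]);  // bool(true), both are "en"
```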

See regex demo at regex101 (explanation on right side) or a PHP demo at tio.run


FYI: your pattern <html lang="(.*?)"> lazily matches anything from <html lang=" up to the first ">, so it only stops at the closing quote when that quote is immediately followed by >.
