Regex of repeating string in linearized XML

Time:08-12

I have a linearized XML where I need to match two possible string values with a regex.

The two strings are:

  • Support Advisor
  • Plus Support Specialist

These strings repeat in the XML as the title of the speaker. The XML holds a transcribed conversation. The transcription itself may also contain the word "support", but I only want to match the two strings above when they are titles of the speaker. I know they are titles of the speaker when they are capitalized and surrounded by parentheses.

The two titles are always enclosed in parentheses, following the speaker's first name and last initial, and the name and title together are surrounded by quotes.

Here is an example of the linearized XML that I'm currently practicing on:

<?xml version='1.0' encoding='UTF-8'?><contact><id>555255</id><type>channel_chat</type><starttime>2022-07-19 07:44:00</starttime><endtime>2022-07-19 09:50:00</endtime><assigneeid>111</assigneeid><Subject>Support Chat</Subject><transcript><line by="Test Customer" time="2022-07-19 07:44:00">Cristina Customer</line><line by="System" time="2022-07-19 07:44:00">chat.</line><line by="Test1 (Support Advisor)" time="2022-07-19 07:44:00">Test1 Agent?</line><line by="Test2 (Plus Support Specialist)" time="2022-07-19 07:44:00">Test2 Agent.</line><line by="Test Customer " time="2022-07-19 07:44:00">Customer Support man hey there</line><line by="Test1 (Support Advisor)" time="2022-07-19 07:44:00">Test1 Agent? and i'm a plus Support specialist</line><line by="Test2 (Support Advisor)" time="2022-07-19 07:44:00">Test2 Agent. </line><line by="Test Customer " time="2022-07-19 07:44:00">Customer</line><line by="Test1 (Support Advisor)" time="2022-07-19 07:44:00">Test1 Agent?</line><line by="Test2 (Support Advisor)" time="2022-07-19 07:44:00">Test2 Agent.</line><line by="Test1 (Support Advisor)" time="2022-07-19 07:44:00">Test1 Agent?</line><line by="Test2 (Support Advisor)" time="2022-07-19 07:44:00">Test2 Agent.</line></transcript><metadata><assigneename>Test2arcia</assigneename><user_agent_ooo>False</user_agent_ooo><user_alias/><user_employee_id>22</user_employee_id><user_guru_region>Somewhere</user_guru_region><user_guru_start_date/><user_slack_handle/><user_smiley_status/><user_social_messaging_user_info/><user_squad_code>LTN</user_squad_code><user_squad_lead>Leader</user_squad_lead><user_systemembeddable_last_seen/><user_systemlast_nps_survey_date/><user_systemnps_comment/><user_systemnps_rating/><user_team_code>Nitrogen</user_team_code><user_time_lead/><user_time_period>3AM - 11AM</user_time_period><user_whatsapp/><agentchatname>Test2</agentchatname></metadata></contact>

Here are some regexes I have tried, to no avail:

(. Support.?)"

( Support.?)"

"?.( Support.?)"

CodePudding user response:

Could you use the DOM to your advantage? If so, this works.

var elem = document.querySelector("contact");
var str = elem.outerHTML;

// Ignore the previous lines; assume `str` is the input XML string.
var div = document.createElement("div");
div.innerHTML = str;

// Select only elements whose `by` attribute contains one of the parenthesized titles.
var matches = div.querySelectorAll("[by*='(Support Advisor)'], [by*='(Plus Support Specialist)']");

console.log(matches);
<?xml version='1.0' encoding='UTF-8'?>
<contact>
    <id>555255</id>
    <type>channel_chat</type>
    <starttime>2022-07-19 07:44:00</starttime>
    <endtime>2022-07-19 09:50:00</endtime>
    <assigneeid>111</assigneeid>
    <Subject>Support Chat</Subject>
    <transcript>
        <line by="Test Customer" time="2022-07-19 07:44:00">Cristina Customer</line>
        <line by="System" time="2022-07-19 07:44:00">chat.</line>
        <line by="Test1 (Support Advisor)" time="2022-07-19 07:44:00">Test1 Agent?</line>
        <line by="Test2 (Plus Support Specialist)" time="2022-07-19 07:44:00">Test2 Agent.</line>
        <line by="Test Customer " time="2022-07-19 07:44:00">Customer Support man hey there</line>
        <line by="Test1 (Support Advisor)" time="2022-07-19 07:44:00">Test1 Agent? and i'm a plus Support specialist</line>
        <line by="Test2 (Support Advisor)" time="2022-07-19 07:44:00">Test2 Agent. </line>
        <line by="Test Customer " time="2022-07-19 07:44:00">Customer</line>
        <line by="Test1 (Support Advisor)" time="2022-07-19 07:44:00">Test1 Agent?</line>
        <line by="Test2 (Support Advisor)" time="2022-07-19 07:44:00">Test2 Agent.</line>
        <line by="Test1 (Support Advisor)" time="2022-07-19 07:44:00">Test1 Agent?</line>
        <line by="Test2 (Support Advisor)" time="2022-07-19 07:44:00">Test2 Agent.</line>
    </transcript>
    <metadata>
        <assigneename>Test2arcia</assigneename>
        <user_agent_ooo>False</user_agent_ooo>
        <user_alias/>
        <user_employee_id>22</user_employee_id>
        <user_guru_region>Somewhere</user_guru_region>
        <user_guru_start_date/>
        <user_slack_handle/>
        <user_smiley_status/>
        <user_social_messaging_user_info/>
        <user_squad_code>LTN</user_squad_code>
        <user_squad_lead>Leader</user_squad_lead>
        <user_systemembeddable_last_seen/>
        <user_systemlast_nps_survey_date/>
        <user_systemnps_comment/>
        <user_systemnps_rating/>
        <user_team_code>Nitrogen</user_team_code>
        <user_time_lead/>
        <user_time_period>3AM - 11AM</user_time_period>
        <user_whatsapp/>
        <agentchatname>Test2</agentchatname>
    </metadata>
</contact>

CodePudding user response:

If I understand correctly, this regex will do:

\((?:Plus Support Specialist|Support Advisor)\).*?>(.*?)<

Extract group 1 from every match to get the text spoken under one of those titles.

Do note that parsing XML with regular expressions is not recommended.
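
If you stay with a regex anyway, here is a minimal sketch in JavaScript of how such a pattern could be applied. It assumes every speaker line looks like the sample, with `by` as the first attribute of `<line>` and the title in parentheses right before the closing quote. The names `TITLE_LINE` and `extractAgentLines` are made up for this illustration and are not from the question.

```javascript
// Sketch only: assumes <line by="Name (Title)" ...>text</line> as in the sample XML.
// Group 1 = speaker name, group 2 = title, group 3 = line text.
const TITLE_LINE =
  /<line by="([^"(]*?)\s*\((Plus Support Specialist|Support Advisor)\)"[^>]*>([^<]*)</g;

// Hypothetical helper: returns one object per line spoken by a titled agent.
function extractAgentLines(xml) {
  // matchAll walks every match of the global pattern without mutating it.
  return Array.from(xml.matchAll(TITLE_LINE), (m) => ({
    speaker: m[1],
    title: m[2],
    text: m[3],
  }));
}

const sample =
  '<transcript>' +
  '<line by="Test Customer " time="2022-07-19 07:44:00">Customer Support man hey there</line>' +
  '<line by="Test1 (Support Advisor)" time="2022-07-19 07:44:00">Test1 Agent? and i\'m a plus Support specialist</line>' +
  '<line by="Test2 (Plus Support Specialist)" time="2022-07-19 07:44:00">Test2 Agent.</line>' +
  '</transcript>';

console.log(extractAgentLines(sample));
```

Because the title has to sit inside the quoted `by` value, the customer line that merely mentions "Support" in its text is skipped, and so is the lowercase "plus Support specialist" inside the agent's own message.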
