This is the code. The error occurs where the substrings are extracted, after the regex pattern has matched:
import re
from unicodedata import normalize

def name_and_img_identificator(input_text, text):
    # Strip diacritics in NFD form, but keep the tilde on "ñ"
    input_text = re.sub(r"([^n\u0300-\u036f]|n(?!\u0303(?![\u0300-\u036f])))[\u0300-\u036f]+", r"\1", normalize("NFD", input_text), flags=re.I)
    input_text = normalize('NFC', input_text)  # -> NFC
    input_text_to_check = input_text.lower()  # Convert everything to lowercase
    #regex_patron_01 = r"\s*\¿?(?:dime los|dime las|dime unos|dime unas|dime|di|cuales son los|cuales son las|cuales son|cuales|que animes|que|top)\s*((?:\w+\s*)+)\s*(?:de series anime|de anime series|de animes|de anime|animes|anime)\s*(?:similares al|similares a|similar al|similar a|parecidos al|parecidos a|parecido al|parecido a)\s*(?:la serie de anime|series de anime|la serie anime|la serie|anime|)\s*(llamada|conocida como|cuyo nombre es|la cual se llama|)\s*((?:\w+\s*)+)\s*\??"
    # Regex in English
    regex_patron_01 = r "\ s * \ ¿? (?: tell me the | tell me some| tell me | say | which are the | which are the | which are | which | which animes | which | top) \ s * ((?: \ w \ s *) ) \ s * (?: anime series | anime series | anime | anime | anime | anime) \ s * (?: similar to | similar to | similar to | similar to | similar to | similar to | similar to | similar to) \ s * (?: the anime series | anime series | the anime series | the series | anime |) \ s * (called | known like | whose name is | which is called |) \ s * ((?: \ w \ s *) ) \ s * \ ?? "
    # Check whether the regex matches, to decide whether to enter the block below
    m = re.search(regex_patron_01, input_text_to_check, re.IGNORECASE)
    if m:
        num, anime_name = m.groups()[2]
        num = num.strip()
        anime_name = anime_name.strip()
        print(num)
        print(anime_name)
    return text

input_text_str = input("ingrese: ")
text = ""
print(name_and_img_identificator(input_text_str, text))
It gives me this error, and I don't know how to structure the regex pattern so that it extracts only those two values (substrings) from the input:
Traceback (most recent call last):
  File "serie_recommendarion_for_chatbot.py", line 154, in <module>
    print(serie_and_img_identificator(input_text_str, text))
  File "anime_recommendarion_for_chatbot.py", line 142, in name_and_img_identificator
    num, anime_name = m.groups()
ValueError: too many values to unpack (expected 2)
If I enter an input like this: 'Dame el top 8 de animes parecidos a Gundam' (in English: 'Give me the top 8 anime like Gundam')
I need it to extract:
num = '8'
anime_name = 'Gundam'
How should I fix my regex pattern in that case?
CodePudding user response:
Try extracting only the first two values; you may be missing a colon in the slice:
num, anime_name = m.groups()[:2]
That may be why you are getting the "too many values to unpack" error.
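To make the unpacking behavior concrete, here is a small sketch. The tuple below is only an illustrative stand-in for what m.groups() might return when the pattern has three capturing groups; the exact strings are an assumption. It shows that unpacking all three values fails, and that a [:2] slice keeps groups 1 and 2, so the name has to be in one of those positions for the slice to help.

```python
# Illustrative stand-in for m.groups() with three capturing groups (assumed values)
groups = ("8 de ", "", "gundam")

try:
    num, anime_name = groups  # three values into two names
except ValueError as exc:
    error_message = str(exc)
    print(error_message)

# Slicing keeps only groups 1 and 2, which here drops the name
first_two = groups[:2]
print(first_two)

# Picking the first and last group by position avoids the mismatch
num, anime_name = groups[0].strip(), groups[-1].strip()
print(num, anime_name)
```

If the middle group is made non-capturing in the pattern itself, plain m.groups() already returns exactly two values and no slicing is needed.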
Use two separate patterns for the number and the name. For simplicity, I only included a few examples.
For the number:
(?<=(which are the|which|top)\s)[0-9]+(?=\s(anime series|anime))
For the name:
(?<=(like|called|which is called)\s)[A-Za-z]+
Implementing the patterns in Spanish is left to you.
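A minimal Python sketch of the two-pattern idea, for illustration only. Python's re module accepts only fixed-width lookbehinds, so alternations of different lengths inside (?<=...) are rejected there; this sketch swaps the lookbehinds for plain prefixes with a capturing group. The function name extract_with_two_patterns and the short keyword lists are assumptions, not part of the original code.

```python
import re

# Hypothetical keyword lists; extend or translate them as needed.
NUM_PATTERN = re.compile(
    r"(?:which are the|which|top)\s+([0-9]+)(?=\s+(?:anime series|anime))",
    re.IGNORECASE,
)
NAME_PATTERN = re.compile(
    r"(?:which is called|called|like)\s+([A-Za-z]+)",
    re.IGNORECASE,
)

def extract_with_two_patterns(text):
    """Return (num, anime_name); either is None when its pattern does not match."""
    num_match = NUM_PATTERN.search(text)
    name_match = NAME_PATTERN.search(text)
    num = num_match.group(1) if num_match else None
    anime_name = name_match.group(1) if name_match else None
    return num, anime_name

print(extract_with_two_patterns("Give me the top 8 anime like Gundam"))
```

Searching for the two values independently means neither pattern has to describe the whole sentence, at the cost of not checking that the number and the name appear in a sensible order.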
CodePudding user response:
Try this out in a regex playground.
Not much has changed: the first capture group is still the number of animes, and the second group is the name of the anime itself. I just simplified the regex a bit (dropping some parts that are unnecessary for the demo). Most of it is unchanged from your version, which was actually a pretty solid regex.
Regex: \b(\d+).*(?:called|that are like|known like|whose name is|which is called)\s*((?:\w+\s*)+)\s*\??
Test with your original question - which I translated roughly to English :-)
import re
from unicodedata import normalize

def name_and_img_identificator(input_text, text):
    # Strip diacritics in NFD form, but keep the tilde on "ñ"
    input_text = re.sub(r"([^n\u0300-\u036f]|n(?!\u0303(?![\u0300-\u036f])))[\u0300-\u036f]+", r"\1",
                        normalize("NFD", input_text), flags=re.I)
    input_text = normalize('NFC', input_text)  # -> NFC
    input_text_to_check = input_text.lower()  # Convert everything to lowercase
    # Regex in English
    # original
    # note: you have extra spaces here, which regex might not like.
    # you can get rid of spaces and then it should hopefully be fine.
    # regex_patron_01 = r "\ s * \ ¿? (?: tell me the | tell me some| tell me | say | which are the | which are the | which are | which | which animes | which | top) \ s * ((?: \ w \ s *) ) \ s * (?: anime series | anime series | anime | anime | anime | anime) \ s * (?: similar to | similar to | similar to | similar to | similar to | similar to | similar to | similar to) \ s * (?: the anime series | anime series | the anime series | the series | anime |) \ s * (called | known like | whose name is | which is called |) \ s * ((?: \ w \ s *) ) \ s * \ ?? "
    # simplified
    regex_patron_01 = r'\b(\d+).*(?:called|that are like|known like|whose name is|which is called)\s*((?:\w+\s*)+)\s*\??'
    # Check whether the regex matches, to decide whether to enter the block below
    m = re.search(regex_patron_01, input_text_to_check, re.IGNORECASE)
    if m:
        num, anime_name = m.groups()[:2]
        num = num.strip()
        anime_name = anime_name.strip()
        print(num)
        print(anime_name)
    return text

#input_text_str = input("ingrese: ")
input_text_str = 'Tell me the top 8 animes that are like Gundam?'
text = ""
print(name_and_img_identificator(input_text_str, text))
CodePudding user response:
Errors in the regex pattern
- You forgot to add ?: to make this group non-capturing. Change:
regex_patron_01 = r"...(llamada|conocida como|cuyo nombre es|la cual se llama|)..."
To:
regex_patron_01 = r"...(?:llamada|conocida como|cuyo nombre es|la cual se llama|)..."
- To avoid capturing additional spaces or words, the group that captures num should use lazy quantifiers, so that it does not swallow words like "de" and leaves them for the patterns that follow. Change:
regex_patron_01 = r"...((?:\w+\s*)+)..."
To:
regex_patron_01 = r"...((?:\w+?\s*?)+)..."
- m.groups() already returns the matched strings as a tuple, so indexing it gives you a single string, which is the root cause of your error. Change:
num, anime_name = m.groups()[2]
To:
num, anime_name = m.groups()
With the changes above, the extraction succeeds:
8
gundam
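The three fixes above can be combined into a minimal sketch like the one below. It assumes the Spanish pattern from the question with its quantifiers intact, leaves out the accent-stripping step for brevity, and drops the backslash before ¿ since that character needs no escaping. The helper name extract_num_and_name is hypothetical.

```python
import re

# Spanish pattern from the question, with the three fixes applied.
PATTERN = re.compile(
    r"\s*¿?(?:dime los|dime las|dime unos|dime unas|dime|di|cuales son los"
    r"|cuales son las|cuales son|cuales|que animes|que|top)"
    r"\s*((?:\w+?\s*?)+)"  # group 1: num, with lazy inner quantifiers
    r"\s*(?:de series anime|de anime series|de animes|de anime|animes|anime)"
    r"\s*(?:similares al|similares a|similar al|similar a"
    r"|parecidos al|parecidos a|parecido al|parecido a)"
    r"\s*(?:la serie de anime|series de anime|la serie anime|la serie|anime|)"
    r"\s*(?:llamada|conocida como|cuyo nombre es|la cual se llama|)"  # non-capturing
    r"\s*((?:\w+\s*)+)"  # group 2: anime_name
    r"\s*\??"
)

def extract_num_and_name(input_text):
    """Return (num, anime_name) for a matching sentence, or None otherwise."""
    m = PATTERN.search(input_text.lower())
    if m is None:
        return None
    num, anime_name = m.groups()  # exactly two capturing groups remain
    return num.strip(), anime_name.strip()

print(extract_num_and_name("Dame el top 8 de animes parecidos a Gundam"))
```

Returning None when nothing matches lets the caller decide what to do with unrecognized input instead of failing on the unpacking line.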
Improvement
Your regex is too complicated and contains many hard-coded words that differ from language to language. My suggestion is to standardize the input format it accepts as:
Any text here (num) any text here (anime_name)
Which is already the format of your input:
Dame el top 8 de animes parecidos a Gundam
Thus you can remove that long regex and replace with this and the output would be the same:
regex_patron_01 = r"^.*?(\d+).*\s(.+)$"
Note that this requires the (anime_name) to be a single word. To support multi-word names, we have to pick a special character that marks the start of the anime name, such as a colon:
Dame el top 8 de animes parecidos a: Gundam X
Then the regex would be:
regex_patron_01 = r"^.*?(\d+).*:\s(.+)$"
Output
8
gundam x
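Both simplified patterns can be tried side by side with a short sketch. Trying the colon form first and falling back to the single-word form is an assumption made here for illustration, as is the helper name extract_simple.

```python
import re

# The two simplified patterns from this answer
SINGLE_WORD = re.compile(r"^.*?(\d+).*\s(.+)$")
WITH_COLON = re.compile(r"^.*?(\d+).*:\s(.+)$")

def extract_simple(input_text):
    """Return (num, anime_name), or None when the text contains no number and name."""
    text = input_text.lower()
    # Prefer the colon-delimited form, which allows multi-word names
    m = WITH_COLON.search(text) or SINGLE_WORD.search(text)
    return m.groups() if m else None

print(extract_simple("Dame el top 8 de animes parecidos a Gundam"))
print(extract_simple("Dame el top 8 de animes parecidos a: Gundam X"))
```

Because neither pattern names any Spanish or English keywords, the same code works for both languages as long as the input follows the agreed format.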