Question: The genres column contains the genres that are present in the games. It has all the genres written together without any space or special characters. Whatever is the major genre of the game is given first, followed by the other genres. For better understanding refer to the table below.
Game  Genres
A     ActionComedyAdventure
B     AdventureComedy
C     NarrationShooting
In the above table, the major genres for the Games A, B and C are Action, Adventure and Narration respectively.
Your job is to extract the major genre for each game and store it in a new column named “Major Genre”. (Hint: every genre name starts with an uppercase letter.)
I want to split the word "ActionAction-AdventureShooterStealth" into a list of words in the below format
['Action', 'Action-Adventure', 'Shooter', 'Stealth']
I tried the approach below, but it didn't work:
text = "ActionAction-AdventureShooterStealth"
res = text.split(',')
print(res)
CodePudding user response:
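One way to do this is with re.findall and a pattern that matches an uppercase letter followed by one or more lowercase letters, then zero or more occurrences of -[A-Z][a-z]+, i.e. a hyphen followed by another capitalized word. The group is written as (?:...) so that re.findall returns the whole match rather than only the group.
import re
string = "ActionAction-AdventureShooterStealth"
pattern = r"[A-Z][a-z]+(?:-[A-Z][a-z]+)*"
string_list = re.findall(pattern, string)
print(string_list)
Output:
['Action', 'Action-Adventure', 'Shooter', 'Stealth']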
CodePudding user response:
This regex would work with your examples: r'[A-Z][a-z]+(?:-[A-Z][a-z]+)*'
That is:
[A-Z]: an uppercase letter
[a-z]+: one or more lowercase letters
(?:-[A-Z][a-z]+)*: zero or more of: a -, then an uppercase letter, followed by one or more lowercase letters
When used with re.findall, we write (?:...) instead of simply (...) to make the group non-capturing; otherwise re.findall returns the contents of the capture groups instead of the full matches.
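To see the difference between the two group styles, here is a small sketch that runs the same pattern through re.findall with a capturing group and with a non-capturing one. The variable names are only for illustration.

```python
import re

text = "ActionAction-AdventureShooterStealth"

# Capturing group: findall returns what the group captured (an empty
# string when the group did not take part in the match), not the whole match.
capturing = re.findall(r"[A-Z][a-z]+(-[A-Z][a-z]+)*", text)

# Non-capturing group: findall returns the whole match.
non_capturing = re.findall(r"[A-Z][a-z]+(?:-[A-Z][a-z]+)*", text)

print(capturing)      # ['', '-Adventure', '', '']
print(non_capturing)  # ['Action', 'Action-Adventure', 'Shooter', 'Stealth']
```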
Demo:
import re
pattern = re.compile(r'[A-Z][a-z]+(?:-[A-Z][a-z]+)*')
pattern.findall('ActionAction-AdventureShooterStealth')
# returns: ['Action', 'Action-Adventure', 'Shooter', 'Stealth']
pattern.findall('ActionAction')
# returns: ['Action', 'Action']
pattern.findall('Action-ShooterStealth')
# returns: ['Action-Shooter', 'Stealth']
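For the original task of getting only the major genre, the same pattern can be anchored at the start of the string with match instead of findall. The following is a minimal sketch, not a definitive solution; the helper name major_genre and the games dictionary are made up for illustration and simply mirror the table in the question.

```python
import re

# The major genre is the first capitalized word, with optional hyphenated parts.
MAJOR_GENRE = re.compile(r"[A-Z][a-z]+(?:-[A-Z][a-z]+)*")

def major_genre(genres):
    """Return the first genre of a run-together genre string, or None if there is none."""
    match = MAJOR_GENRE.match(genres)
    return match.group(0) if match else None

# Hypothetical data mirroring the table in the question.
games = {
    "A": "ActionComedyAdventure",
    "B": "AdventureComedy",
    "C": "NarrationShooting",
}

major = {game: major_genre(genres) for game, genres in games.items()}
print(major)  # {'A': 'Action', 'B': 'Adventure', 'C': 'Narration'}
```

If the data lives in a pandas DataFrame (assumed here to be called df, with the column named genres), something like df["Major Genre"] = df["genres"].apply(major_genre) should create the new column.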