Using regex in python for a dynamic string-CodePudding

I have a pandas columns with strings which dont have the same pattern, something like this:

{'iso_2': 'FR', 'iso_3': 'FRA', 'name': 'France'}
{'iso': 'FR', 'iso_2': 'USA', 'name': 'United States of America'}
{'iso_3': 'FR', 'iso_4': 'FRA', 'name': 'France'}

How do I only keep the name of the country for every row? I would only like to keep "France", "United States of America", "France".

I tried building the regex pattern: something like this

r"^\W [a-z] _[0-9]\W "

But this turns out to be very specific, and if there is a slight change in the string the pattern wont work. How do we resolve this?

CodePudding user response：

As you have dictionaries in the column, you can get the values of the name keys:

import pandas as pd
df = pd.DataFrame({'col':[{'iso_2': 'FR', 'iso_3': 'FRA', 'name': 'France'},
    {'iso': 'FR', 'iso_2': 'USA', 'name': 'United States of America'},
    {'iso_3': 'FR', 'iso_4': 'FRA', 'name': 'France'}]})
df['col'] = df['col'].apply(lambda x: x["name"])

Output of df['col']:

0                      France
1    United States of America
2                      France
Name: col, dtype: object

If the column contains stringified dictionaries, you can use ast.literal_eval before accessing the name key value:

import pandas as pd
import ast

df = pd.DataFrame({'col':["{'iso_2': 'FR', 'iso_3': 'FRA', 'name': 'France'}",
    "{'iso': 'FR', 'iso_2': 'USA', 'name': 'United States of America'}",
    "{'iso_3': 'FR', 'iso_4': 'FRA', 'name': 'France'}"]})

df['col'] = df['col'].apply(lambda x: ast.literal_eval(x)["name"])

And in case your column is totally messed up, yes, you can resort to regex:

df['col'] = df['col'].str.extract(r"""['"]name['"]\s*:\s*['"]([^"'] )""")
# or to support escaped " and ':
df['col'] = df['col'].str.extract(r"""['"]name['"]\s*:\s*['"]([^"'\\] (?:\\.[^'"\\]*)*)""")>>> df['col']
                          0
0                    France
1  United States of America
2                    France

See the regex demo.