Home > database >  Extracting 'year' from a column
Extracting 'year' from a column

Time:06-01

I have a working code, but I think my logic isn't on the right path (although it works). I just need some help with optimizing it. Essentially, to see if what I did was an acceptable way of doing what I am doing or if there's a better way. I am rooting for the latter, because I know what I did isn't the "right" way.

I have a pd column of strings with "year" in it and I am trying to extract it from it. The problem is that a few entries do not have a year listed. So something like this:

Index string_values
0 String A (1995)
1 String B (1995)
2 String C (1995)
3 String D has no year
4 String E has (something in braces) AND also the year (2003)

re.search('\d{4}', df['string_values '][0]).group(0) works, but in a for loop, it throws this error (I guess when it hit the non-4-digit string): AttributeError: 'NoneType' object has no attribute 'group'. I think this because len(_temp) gives 15036 and it has the years listed. Just that it's throwing this error.

Here's the for loop:

_temp = []
for i in df['string_values']:
    year = re.search("\d{4}", i)
    if year.group():
        _temp.append(year.group())
    else:
        _temp.append(None)

Then I also tried the Try-Except way to do it, and that works - len(<var>) gives 62423, which is also the total row in the df. And here's the code:

_without_year = []
_with_year = []
for i in df['string_values']:
    year = re.search("\d{4}", i)
    try:
        if year.group():
            # _with_year.append(year.group())
            pass
    except:
        _without_year.append(i)

I just need to know if what I did is acceptable. It works, like I said. _without_year does display all the entries without the year.

The thing with the Try-Except block is that I am passing on the if condition catching the ith error.

CodePudding user response:

You can use extract to extract the year values directly:

df['string_values'].str.extract(r'(?<=\()(\d{4})(?=\))', expand=False)

Output:

0    1995
1    1995
2    1995
3     NaN
4    2003
Name: string_values, dtype: object

Note I've used forward and backward lookarounds to assert that the year occurs inside parentheses; if you don't want that but just to match a 4-digit string replace them with \b (word break) e.g.

df['string_values'].str.extract(r'\b(\d{4})\b', expand=False)

To convert the output to a list, you can use tolist:

df['string_values'].str.extract('(?<=\()(\d{4})(?=\))', expand=False).tolist()

Output:

['1995', '1995', '1995', nan, '2003']

To find the string values that don't contain a year, you can use contains to find matches and invert that to use as an index:

df[~df['string_values'].str.contains(r'(?<=\()\d{4}(?=\))')]

Output:

   Index          string_values
3       3  String D has no year
  • Related