I have a working code, but I think my logic isn't on the right path (although it works). I just need some help with optimizing it. Essentially, to see if what I did was an acceptable way of doing what I am doing or if there's a better way. I am rooting for the latter, because I know what I did isn't the "right" way.
I have a pd
column of strings with "year" in it and I am trying to extract it from it. The problem is that a few entries do not have a year listed. So something like this:
Index | string_values |
---|---|
0 | String A (1995) |
1 | String B (1995) |
2 | String C (1995) |
3 | String D has no year |
4 | String E has (something in braces) AND also the year (2003) |
re.search('\d{4}', df['string_values '][0]).group(0)
works, but in a for loop, it throws this error (I guess when it hit the non-4-digit string): AttributeError: 'NoneType' object has no attribute 'group'
. I think this because len(_temp)
gives 15036
and it has the years listed. Just that it's throwing this error.
Here's the for
loop:
_temp = []
for i in df['string_values']:
year = re.search("\d{4}", i)
if year.group():
_temp.append(year.group())
else:
_temp.append(None)
Then I also tried the Try-Except way to do it, and that works - len(<var>)
gives 62423
, which is also the total row in the df. And here's the code:
_without_year = []
_with_year = []
for i in df['string_values']:
year = re.search("\d{4}", i)
try:
if year.group():
# _with_year.append(year.group())
pass
except:
_without_year.append(i)
I just need to know if what I did is acceptable. It works, like I said. _without_year
does display all the entries without the year.
The thing with the Try-Except block is that I am pass
ing on the if
condition catching the i
th error.
CodePudding user response:
You can use extract
to extract the year values directly:
df['string_values'].str.extract(r'(?<=\()(\d{4})(?=\))', expand=False)
Output:
0 1995
1 1995
2 1995
3 NaN
4 2003
Name: string_values, dtype: object
Note I've used forward and backward lookarounds to assert that the year occurs inside parentheses; if you don't want that but just to match a 4-digit string replace them with \b
(word break) e.g.
df['string_values'].str.extract(r'\b(\d{4})\b', expand=False)
To convert the output to a list, you can use tolist
:
df['string_values'].str.extract('(?<=\()(\d{4})(?=\))', expand=False).tolist()
Output:
['1995', '1995', '1995', nan, '2003']
To find the string values that don't contain a year, you can use contains
to find matches and invert that to use as an index:
df[~df['string_values'].str.contains(r'(?<=\()\d{4}(?=\))')]
Output:
Index string_values
3 3 String D has no year