Hello,
I'm working on a column that has missing values ('year_of_release'). The data type is 'timestamp64'.
At first, I created a function that "pulls" the year numbers from a column in which years appear next to the names of some games, and then I put this data into a new column, 'years_from_titles':
import re

def get_year(row):
    # Return the first 4-digit number in the title that falls in the expected range of years.
    regex = r"\d{4}"
    match = re.findall(regex, row)
    for i in match:
        if 1970 < int(i) < 2017:
            return int(i)

gaming['years_from_titles'] = gaming['name'].apply(lambda x: get_year(str(x)))
I tested the function and it works.
Now, I'm trying to create another function, which will fill in those missing years of the original column - 'year_of_release', but only if they appear on the same row:
def year_row(row):
    if math.isnan(row['year_of_release']):
        return row['years_from_titles']
    else:
        return row['year_of_release']

gaming['year_of_release'] = gaming.apply(year_row, axis=1)
But when I run the code I get a TypeError:
/tmp/ipykernel_31/133192424.py in <module>
7 return row['year_of_release']
8
----> 9 gaming['year_of_release']=gaming.apply(year_row,axis=1)
/opt/conda/lib/python3.9/site-packages/pandas/core/frame.py in apply(self, func, axis, raw, result_type, args, **kwds)
7766 kwds=kwds,
7767 )
-> 7768 return op.get_result()
7769
7770 def applymap(self, func, na_action: Optional[str] = None) -> DataFrame:
/opt/conda/lib/python3.9/site-packages/pandas/core/apply.py in get_result(self)
183 return self.apply_raw()
184
--> 185 return self.apply_standard()
186
187 def apply_empty_result(self):
/opt/conda/lib/python3.9/site-packages/pandas/core/apply.py in apply_standard(self)
274
275 def apply_standard(self):
--> 276 results, res_index = self.apply_series_generator()
277
278 # wrap results
/opt/conda/lib/python3.9/site-packages/pandas/core/apply.py in apply_series_generator(self)
288 for i, v in enumerate(series_gen):
289 # ignore SettingWithCopy here in case the user mutates
--> 290 results[i] = self.f(v)
291 if isinstance(results[i], ABCSeries):
292 # If we have a view on v, we need to make a copy because
/tmp/ipykernel_31/133192424.py in year_row(row)
2 # but only if a year is found, on the same row, and in correspond to years_from_titles column.
3 def year_row(row):
----> 4 if math.isnan(row['year_of_release']):
5 return row['years_from_titles']
6 else:
TypeError: must be real number, not Timestamp.
If anyone knows how to overcome this I would greatly appreciate it. Thanks
CodePudding user response:
You can use the fact that NaN (and likewise NaT, the missing value for datetimes) is not equal to itself:
def year_row(row):
    if row['year_of_release'] != row['year_of_release']:
        return row['years_from_titles']
    else:
        return row['year_of_release']

gaming['year_of_release'] = gaming.apply(year_row, axis=1)
Or with Series.mask
gaming['year_of_release'] = gaming['year_of_release'].mask(gaming['year_of_release'].isna(), gaming['years_from_titles'])
Or with Series.fillna
gaming['year_of_release'] = gaming['year_of_release'].fillna(gaming['years_from_titles'])
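One thing to watch with the vectorized options: if 'year_of_release' really is a datetime column while the years pulled from the titles are plain integers, the two columns have different types. A minimal sketch of one way to keep them aligned, assuming made-up sample data in place of the real `gaming` frame and skipping the 1970-2017 range check from the question:

```python
import pandas as pd

# Hypothetical sample data standing in for the real `gaming` frame.
gaming = pd.DataFrame({
    "name": ["FIFA Soccer 2004", "Tetris", "Madden NFL 2005"],
    "year_of_release": pd.to_datetime(["2003-10-24", "1989-06-14", None]),
})

# First 4-digit run in each title, as text (NaN where no match is found).
year_text = gaming["name"].str.extract(r"(\d{4})", expand=False)

# Parse the text as a year so both sides are datetimes; unparsable values become NaT.
years_from_titles = pd.to_datetime(year_text, format="%Y", errors="coerce")

# Fill only the missing release dates, row by row, from the parsed titles.
gaming["year_of_release"] = gaming["year_of_release"].fillna(years_from_titles)
print(gaming)
```

Here a year taken from a title becomes January 1st of that year, which may or may not be what you want. If only the year matters, storing `gaming["year_of_release"].dt.year` instead could be simpler.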
CodePudding user response:
Instead of using the math module to check for missing values, here's a more pandas-specific approach.
Change this line:
if math.isnan(row['year_of_release']):
to this (a single cell is a scalar and has no .isna() method, so use the top-level pd.isna function):
if pd.isna(row['year_of_release']):
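For a single cell taken from a row, the top-level `pd.isna` function accepts scalars of any type, including Timestamp and NaT, whereas `math.isnan` only accepts real numbers. A small sketch with made-up values (the variable names are just for illustration):

```python
import math
import pandas as pd

present = pd.Timestamp("2004-01-01")  # made-up example of a filled-in cell
missing = pd.NaT                      # how pandas marks a missing datetime

# math.isnan only accepts real numbers, so a Timestamp raises TypeError.
try:
    math.isnan(present)
    math_isnan_failed = False
except TypeError:
    math_isnan_failed = True

# pd.isna works on scalars regardless of their type.
checks = {
    "present": pd.isna(present),
    "missing": pd.isna(missing),
    "nan": pd.isna(float("nan")),
}
print(math_isnan_failed, checks)
```

This reproduces the TypeError from the question on a value that is present, and shows that `pd.isna` can tell the two datetime cases apart.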