Filling in missing values using a function

Time:07-06

Hello,

I'm working on a column that has missing values ('year_of_release'). The data type is 'datetime64[ns]'.

First, I wrote a function that pulls the year out of the 'name' column, where a year appears next to the names of some games, and stored the result in a new column, 'years_from_titles':

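import re

def get_year(row):
    # Find every 4-digit run in the title
    regex = r"\d{4}"
    match = re.findall(regex, row)

    # Return the first one that falls in a plausible release-year range
    for i in match:
        if 1970 < int(i) < 2017:
            return int(i)

gaming['years_from_titles'] = gaming['name'].apply(lambda x: get_year(str(x)))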

I tested the function and it works.

Now I'm trying to write another function that fills in the missing values of the original column, 'year_of_release', using the year from 'years_from_titles' on the same row:

def year_row(row):
   if math.isnan(row['year_of_release']):
      return row['years_from_titles']
   else:
      return row['year_of_release']

gaming['year_of_release']=gaming.apply(year_row,axis=1)

But when I run the code I get a TypeError:

/tmp/ipykernel_31/133192424.py in <module>
      7         return row['year_of_release']
      8 
----> 9 gaming['year_of_release']=gaming.apply(year_row,axis=1)

/opt/conda/lib/python3.9/site-packages/pandas/core/frame.py in apply(self, func, axis, raw, result_type, args, **kwds)
   7766             kwds=kwds,
   7767         )
-> 7768         return op.get_result()
   7769 
   7770     def applymap(self, func, na_action: Optional[str] = None) -> DataFrame:

/opt/conda/lib/python3.9/site-packages/pandas/core/apply.py in get_result(self)
    183             return self.apply_raw()
    184 
--> 185         return self.apply_standard()
    186 
    187     def apply_empty_result(self):

/opt/conda/lib/python3.9/site-packages/pandas/core/apply.py in apply_standard(self)
    274 
    275     def apply_standard(self):
--> 276         results, res_index = self.apply_series_generator()
    277 
    278         # wrap results

/opt/conda/lib/python3.9/site-packages/pandas/core/apply.py in apply_series_generator(self)
    288             for i, v in enumerate(series_gen):
    289                 # ignore SettingWithCopy here in case the user mutates
--> 290                 results[i] = self.f(v)
    291                 if isinstance(results[i], ABCSeries):
    292                     # If we have a view on v, we need to make a copy because

/tmp/ipykernel_31/133192424.py in year_row(row)
      2 # but only if a year is found, on the same row, and in correspond to years_from_titles column.
      3 def year_row(row):
----> 4     if math.isnan(row['year_of_release']):
      5         return row['years_from_titles']
      6     else:

TypeError: must be real number, not Timestamp.

If anyone knows how to overcome this I would greatly appreciate it. Thanks

CodePudding user response:

You can use the fact that a missing value (NaN, or NaT for datetimes) is not equal to itself.

def year_row(row):
   if row['year_of_release'] != row['year_of_release']:
      return row['years_from_titles']
   else:
      return row['year_of_release']

gaming['year_of_release']=gaming.apply(year_row,axis=1)

Or with Series.mask

gaming['year_of_release'] = gaming['year_of_release'].mask(gaming['year_of_release'].isna(), gaming['years_from_titles'])

Or with Series.fillna

gaming['year_of_release'] = gaming['year_of_release'].fillna(gaming['years_from_titles'])
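
One thing to watch for with any of these: 'year_of_release' holds Timestamps while 'years_from_titles' holds plain year numbers, so filling one from the other mixes two kinds of values. Here is a minimal sketch of one way around that, assuming only the year itself matters: reduce the datetime column to its year with .dt.year first, then fill. The sample rows and the names release_year and filled_year are made up for illustration.

```python
import pandas as pd

# Hypothetical sample rows standing in for the real `gaming` table
gaming = pd.DataFrame({
    "name": ["FIFA 2004", "Tetris", "Madden NFL 2008"],
    "year_of_release": pd.to_datetime(["2003-10-01", None, None]),
    "years_from_titles": [2004.0, float("nan"), 2008.0],
})

# Reduce each Timestamp to its year; NaT becomes NaN
release_year = gaming["year_of_release"].dt.year

# Fill only the missing years from the title-derived column,
# then use the nullable integer dtype so years are not shown as floats
filled_year = release_year.fillna(gaming["years_from_titles"]).astype("Int64")

print(filled_year)
```

Rows that already have a release date keep their own year, and rows with no year in the title stay missing.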

CodePudding user response:

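Instead of using the math module to check for missing values, use the pandas-level check. math.isnan only accepts real numbers, which is why it raises on a Timestamp; pd.isna also handles NaT and None.

Change this line:

if math.isnan(row['year_of_release']):

to this (assuming pandas is imported as pd):

if pd.isna(row['year_of_release']):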
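
To see the difference on scalar values, here is a small sketch (the variable names are just for illustration). It checks a missing and a present Timestamp with pd.isna, and shows that math.isnan rejects a Timestamp with a TypeError like the one in the question.

```python
import math
import pandas as pd

missing = pd.NaT                      # how pandas represents a missing datetime
present = pd.Timestamp("2008-01-01")

# pd.isna works on scalars, including NaT and float NaN
missing_is_na = pd.isna(missing)
present_is_na = pd.isna(present)
nan_is_na = pd.isna(float("nan"))

# NaT, like NaN, compares unequal to itself
nat_differs_from_itself = missing != missing

# math.isnan only accepts real numbers, so a Timestamp raises TypeError
try:
    math.isnan(present)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(missing_is_na, present_is_na, nan_is_na, nat_differs_from_itself)
print(error_message)
```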