How to fill default values based on column names without typeerror - Pandas

Time:09-30

I have a dataframe like the one shown below:

obs = pd.DataFrame({'person_id' :[1,2,3],'obs_date':['12/31/2007','11/25/2009',np.nan],
       'hero_id':[2,4,np.nan],'date':['12/31/2017',np.nan,'10/06/2015'],
       'heroine_id':[1,np.nan,5],'datetime':['12/31/2027','11/25/2029',np.nan],
       'bud_source_value':[1250000,250000,np.nan],
       'prod__source_value':[10000,20000,np.nan]})

The objective is to fill 3 default values based on column names.

  1. cols ending with id - fillna with 0

  2. cols ending with value - fillna with np.nan (leave it as is)

  3. cols containing date[pandas datetime format] - fillna with 12/31/2000

I tried the following:

col = obs.columns.str
c1 = col.endswith('id')   
c2 = col.contains('value')
c3 = col.endswith('date')
c4 = col.endswith('datetime')
filled_values = np.select([c1,c2,c3,c4], [obs.fillna(0), obs.fillna(np.nan), obs.fillna("1900-01-01"), obs.fillna("1900-01-01 00:00:00")])
obs= pd.DataFrame(filled_values,columns=cols)

But I get an error message like the one below:

typeerror u10 cannot be converted to an IntegerDtype pandas

Additionally, in one of my functions, I use a line of code like the one below:

testdf['hero_id'] = testdf['hero_id'].astype('float').astype('Int64')

I am not sure whether the above line is causing the error, or how to reproduce and debug it. Can you help me avoid this error, please?
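
One plausible reading of the error, offered as an assumption rather than a confirmed diagnosis: `<U10` is NumPy's name for a 10-character Unicode string dtype, which is exactly the length of "1900-01-01". If hero_id has already been cast to the nullable Int64 dtype, filling it with a date string asks pandas to store text in an integer column. The sketch below uses a stand-alone series (the name `ids` is illustrative, not from the question) to compare a compatible and an incompatible fill value:

```python
import numpy as np
import pandas as pd

# Same cast as in the question: float -> nullable Int64
ids = pd.Series([2, 4, np.nan]).astype('float').astype('Int64')

# The date string is a 10-character Unicode value in NumPy terms
print(np.array("1900-01-01").dtype)

# An integer fill value is compatible with Int64
print(ids.fillna(0).tolist())

# A string fill value is not; pandas is expected to refuse it,
# although the exact exception text depends on the pandas version
try:
    ids.fillna("1900-01-01")
except (TypeError, ValueError) as exc:
    print(type(exc).__name__, exc)
```

If this is what happens in your pipeline, the fix is to make sure each column only receives a fill value that matches its dtype.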

CodePudding user response:

This might not be exactly what you've asked for, but I hope it helps a bit:

import pandas as pd
import numpy as np

obs = pd.DataFrame({'person_id': [1, 2, 3], 'obs_date': ['12/31/2007', '11/25/2009', np.nan],
                    'hero_id': [2, 4, np.nan], 'date': ['12/31/2017', np.nan, '10/06/2015'],
                    'heroine_id': [1, np.nan, 5], 'datetime': ['12/31/2027', '11/25/2029', np.nan],
                    'bud_source_value': [1250000, 250000, np.nan],
                    'prod__source_value': [10000, 20000, np.nan]})

for column_name in obs.columns:
    if column_name.endswith("id"):
        obs[column_name] = obs[column_name].fillna(0)
    elif "value" in column_name:
        obs[column_name] = obs[column_name].fillna(np.nan)
    elif column_name.endswith("date"):
        obs[column_name] = obs[column_name].fillna("1900-01-01")
    elif column_name.endswith("datetime"):
        obs[column_name] = obs[column_name].fillna("1900-01-01 00:00:00")


print(obs)

Output

   person_id    obs_date  hero_id        date  heroine_id             datetime  bud_source_value  prod__source_value
0          1  12/31/2007      2.0  12/31/2017         1.0           12/31/2027         1250000.0             10000.0
1          2  11/25/2009      4.0  1900-01-01         0.0           11/25/2029          250000.0             20000.0
2          3  1900-01-01      0.0  10/06/2015         5.0  1900-01-01 00:00:00               NaN                 NaN
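
Note that the loop above leaves the date columns as strings, so original and filled values end up in different formats. If real pandas datetimes are wanted, as the third rule in the question suggests, one option is to parse first and then fill with a Timestamp. A minimal sketch, assuming every date uses the MM/DD/YYYY format seen in the sample data; the `dates` frame and the `default_date` name are illustrative only:

```python
import numpy as np
import pandas as pd

dates = pd.DataFrame({'obs_date': ['12/31/2007', '11/25/2009', np.nan],
                      'date': ['12/31/2017', np.nan, '10/06/2015']})

# The 12/31/2000 default mentioned in the question, as a Timestamp
default_date = pd.Timestamp('2000-12-31')

for column_name in dates.columns:
    if 'date' in column_name:
        # Missing values become NaT during parsing, then get the default
        parsed = pd.to_datetime(dates[column_name], format='%m/%d/%Y')
        dates[column_name] = parsed.fillna(default_date)

print(dates.dtypes)
print(dates)
```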

CodePudding user response:

The idea is to build a dictionary that maps column names to their fill values and pass it to DataFrame.fillna:

c1 = dict.fromkeys(obs.filter(regex='id$').columns, 0)
c2 = dict.fromkeys(obs.filter(like='value').columns, np.nan)
c3 = dict.fromkeys(obs.filter(regex='date$').columns, "1900-01-01")
c4 = dict.fromkeys(obs.filter(regex='datetime$').columns, "1900-01-01 00:00:00") 

d = {**c1, **c2, **c3, **c4}
print (d)
{'person_id': 0, 'hero_id': 0, 'heroine_id': 0, 
 'bud_source_value': nan, 'prod__source_value': nan, 
 'obs_date': '1900-01-01',  'date': '1900-01-01', 
 'datetime': '1900-01-01 00:00:00'}

obs = obs.fillna(d)
print (obs)
   person_id    obs_date  hero_id        date  heroine_id  \
0          1  12/31/2007      2.0  12/31/2017         1.0   
1          2  11/25/2009      4.0  1900-01-01         0.0   
2          3  1900-01-01      0.0  10/06/2015         5.0   

              datetime  bud_source_value  prod__source_value  
0           12/31/2027         1250000.0             10000.0  
1           11/25/2029          250000.0             20000.0  
2  1900-01-01 00:00:00               NaN                 NaN  
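
Because the dictionary gives each column its own fill value, this approach should also cope with an id column that has been cast to the nullable Int64 dtype, as in the question: the integer column only ever sees 0, never a date string. A small sketch on a reduced frame (the names `small` and `fill_values` are illustrative):

```python
import numpy as np
import pandas as pd

small = pd.DataFrame({'hero_id': [2, 4, np.nan],
                      'date': ['12/31/2017', np.nan, '10/06/2015']})

# Same cast as in the question
small['hero_id'] = small['hero_id'].astype('float').astype('Int64')

# One fill value per column, chosen by column-name suffix
fill_values = {**dict.fromkeys(small.filter(regex='id$').columns, 0),
               **dict.fromkeys(small.filter(regex='date$').columns, "1900-01-01")}

small = small.fillna(fill_values)
print(small.dtypes)
print(small)
```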

CodePudding user response:

Filter by column name and dtype, then fillna accordingly.

First, coerce the date columns to datetime:

obs['obs_date']=pd.to_datetime(obs['obs_date'])
obs['date']=pd.to_datetime(obs['date'])
obs['datetime']=pd.to_datetime(obs['datetime'])

Filter using column names and the dtypes. That will give you different dataframes. Join the dataframes and use them to update the obs dataframe using the combine_first method.

Code below

df = obs.combine_first(
    obs.filter(regex='_id$', axis=1).fillna('0')  # columns ending with _id: fill with the string '0'
    .join(
        # columns of dtype datetime64: fill with the default date
        obs.select_dtypes(include=['datetime64[ns]']).fillna('12/31/2000')
    )
)

df = df[list(obs.columns)]  # restore the original column order

Output

   person_id             obs_date  hero_id                 date  heroine_id  \
0          1  2007-12-31 00:00:00      2.0  2017-12-31 00:00:00         1.0   
1          2  2009-11-25 00:00:00      4.0           12/31/2000           0   
2          3           12/31/2000        0  2015-10-06 00:00:00         5.0   

              datetime  bud_source_value  prod__source_value  
0  2027-12-31 00:00:00         1250000.0             10000.0  
1  2029-11-25 00:00:00          250000.0             20000.0  
2           12/31/2000               NaN                 NaN  
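
A caution when selecting columns by suffix with a regex: square brackets form a character class, so `[_id]$` matches any name whose last character is `_`, `i`, or `d`, whereas `_id$` matches only the literal suffix. Both select the same columns in the sample data, but they can differ on other frames. A short illustration with made-up column names (`paid` is hypothetical):

```python
import pandas as pd

# Empty frame; only the column names matter here
frame = pd.DataFrame(columns=['hero_id', 'paid', 'obs_date'])

# Character class: any name whose last character is '_', 'i' or 'd'
print(list(frame.filter(regex='[_id]$').columns))  # ['hero_id', 'paid']

# Literal suffix: only names ending in '_id'
print(list(frame.filter(regex='_id$').columns))    # ['hero_id']
```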