Home > Enterprise >  Adding new rows with default value based on dataframe values into dataframe
Adding new rows with default value based on dataframe values into dataframe

Time:12-19

I have data with a large number of columns:

df:
ID   col1   col2   col3 ... col100
1    a      x      0        1  
1    a      x      1        1
2    a      y      1        1
4    a      z      1        0  
...
98   a      z      1        1
100  a      x      1        0

I want to fill in the missing ID values with a default value that indicate that the data is missing here. For example here it would be ID 3 and hypothetically speaking lets say the missing row data looks like ID 100

ID   col1   col2  col3 ... col100
3    a      x     1        0
99   a      x     1        0

Expected output:

df:
ID   col1   col2   col3 ... col100
1    a      x      0        1  
1    a      x      1        1
2    a      y      1        1
3    a      x      1        0
4    a      z      1        0  
...
98   a      z      1        1
99   a      x     1        0
100  a      x      1        0

I'm also ok with the 3 and 99 being at the bottom.

I have tried several ways of appending new rows:

noresponse = df[filterfornoresponse].head(1).copy() #assume that this will net us row 100

for i in range (1, maxID):
    if len(df[df['ID'] == i) == 0: #IDs with no rows ie missing data
        temp = noresponse.copy()
        temp['ID'] = i
        df.append(temp, ignore_index = True)

This method doesn't seem to append anything.

I have also tried

pd.concat([df, temp], ignore_index = True)

instead of df.append

I have also tried adding the rows to a list noresponserows with the intention of concating the list with df:

noresponserows = []

for i in range (1, maxID):
    if len(df[df['ID'] == i) == 0: #IDs with no rows ie missing data
        temp = noresponse.copy()
        temp['ID'] = i
        noresponserows.append(temp)

But here the list always ends up with only 1 row when in my data I know there are more than one rows that need to be appended.

I'm not sure why I am having trouble appending more than once instance of noresponse into the list, and why I can't directly append to a dataframe. I feel like I am missing something here.

I think it might have to do with me taking a copy of a row in the df vs constructing a new one. The reason why I take a copy of a row to get noresponse is because there are a large amount of columns so it is easier to just take an existing row.

CodePudding user response:

Say you have a dataframe like this:

>>> df
  col1 col2  col100  ID
0    a    x       0   1
1    a    y       3   2
2    a    z       1   4

First, set the ID column to be the index:

>>> df = df.set_index('ID')
>>> df
   col1 col2  col100
ID                  
1     a    x       0
2     a    y       3
4     a    z       1

Now you can use df.loc to easily add rows.

Let's select the last row as the default row:

>>> default_row = df.iloc[-1]
>>> default_row
col1      a
col2      z
col100    1
Name: 4, dtype: object

We can add it right into the dataframe at ID 3:

>>> df.loc[3] = default_row
>>> df
   col1 col2  col100
ID                  
1     a    x       0
2     a    y       3
4     a    z       1
3     a    z       1

Then use sort_index to sort the rows lexicographically by index:

>>> df = df.sort_index()
>>> df
   col1 col2  col100
ID                  
1     a    x       0
2     a    y       3
3     a    z       1
4     a    z       1

And, optionally, reset the index:

>>> df = df.reset_index()
>>> df
   ID col1 col2  col100
0   1    a    x       0
1   2    a    y       3
2   3    a    z       1
3   4    a    z       1

  • Related