Home > Enterprise >  Create column based on row data when column doesn't exist or column is NaN in pandas
Create column based on row data when column doesn't exist or column is NaN in pandas

Time:01-09

I have a dataframe from OSM data. In this I got everything but the colour column in my area. However in other areas the column may exist. Now I want to create the column if it is missing by providing calculated colors and also want to replace any NaN values with a color code when the column exists but a row has no color value yet.

TLDR: How do I create a colum if needed and otherwise map NaN otherwise?

I already tried just doing:

import random
def setColor(_):
    r = lambda: random.randint(0,255)
    return '#XXX' % (r(),r(),r())



lines.loc[lines['colour'].isnull(),'colour'] = lines["colour"].map(setColor)

However this fails if colour doesnt exist initially.

I could run lines["colour"] = np.nan first but while that works for empty colums this doesn't work for the case when the column already partially exists. So I wonder if there is a better way.

CodePudding user response:

It's not fully clear what you want, but maybe this is close.

Given df1 and df2:

import pandas as pd
import numpy as np
import random

df1 = pd.DataFrame({'Col_01': ['x', 'y', 'z']})
df2 = pd.DataFrame({'Col_01': ['x', 'y', 'z'], 'colour': ['#D30000', '#C21807', '']})

print("df1:\n", df1)
print("df2:\n", df2)

Console output:

df1:
   Col_01
0      x
1      y
2      z
df2:
   Col_01   colour
0      x  #D30000
1      y  #C21807
2      z

With a slight change to your function (removing argument) and looping through all dataframes:

def setColor(): # change: remove the "_" here
    r = lambda: random.randint(0, 255)
    return '#XXX' % (r(),r(),r())

for df in [df1, df2]:
    if "colour" not in df:
        df["colour"] = df.apply(lambda x: setColor(), axis=1)
    else:
        df["colour"] = np.where(df["colour"] == '', setColor(), df["colour"])

print("df1:\n", df1)
print("df2:\n", df2)

Console output:

df1:
   Col_01   colour
0      x  #C0ACB3
1      y  #1FA09E
2      z  #4A35FF
df2:
   Col_01   colour
0      x  #D30000
1      y  #C21807
2      z  #D97652

It's probably self-explanatory, but the loop first looks to see if the colour column exists; if not, it adds it and creates a hex code for each row. Otherwise, if the column exists, it uses np.where() to create a hex code for blank rows, otherwise keeping hex code if it's there.

  • Related