Taking 2 columns with equal value x or differing values x/y to create a 3rd column with value x


I have this DataFrame:

import pandas as pd

technologies = {'Writer1': ['Spark', 'PySpark', 'Hadoop', 'Python'],
 'Location1': ['LA', 'RIV', 'CHV', 'LA'],
 'Area1': ['L', 'R', 'C', 'L'],
 'Writer2': ['Spark', 'Dil', 'Chee', 'Python'],
 'Location2': ['RIV', 'RIV', 'LA', 'RIV'],
 'Area2': ['R', 'R', 'L', 'R'],}
df = pd.DataFrame(technologies)

I want to create a new location column built from Location1/Location2, but only taking a location whose matching Area1/Area2 value is "R" or "C".

So I would want

technologies = {'Writer1': ['Spark', 'PySpark', 'Hadoop', 'Python'],
 'Location1': ['LA', 'RIV', 'CHV', 'LA'],
 'Area1': ['L', 'R', 'C', 'L'],
 'Writer2': ['Spark', 'Dil', 'Chee', 'Python'],
 'Location2': ['RIV', 'RIV', 'LA', 'RIV'],
 'Area2': ['R', 'R', 'L', 'R'],
 'Location3': ['RIV', 'RIV', 'CHV', 'RIV']}

Is this possible? I am stuck and can't think of what would work for so many requirements.

Any help appreciated, thank you.

EDIT: Sorry, I did not include a vital detail. I would like the location to be indexed together with Writer1/Writer2. For example, if I index PySpark with RIV, I also want Dil to index with RIV. The code should not skip a writer pair just because both are in RIV or CHV.

CodePudding user response:

Replace the non-R/C values in the Location1/Location2 columns with missing values using Series.where, then fill the missing values of s1 from s2 with Series.fillna:

df = pd.DataFrame(technologies)

s1 = df['Location1'].where(df['Area1'].isin(['R','C']))  
s2 = df['Location2'].where(df['Area2'].isin(['R','C']))
df['Location3'] = s1.fillna(s2)
print (df)
   Writer1 Location1 Area1 Writer2 Location2 Area2 Location3
0    Spark        LA     L   Spark       RIV     R       RIV
1  PySpark       RIV     R     Dil        LA     L       RIV
2   Hadoop       CHV     C    Chee        LA     L       CHV
3   Python        LA     L  Python       RIV     R       RIV
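As a side note, the same "prefer Location1 when Area1 qualifies, else Location2" logic can also be written as a pair of conditions with numpy.select. This is just an alternative sketch, not part of the answer above:

```python
import numpy as np
import pandas as pd

technologies = {'Writer1': ['Spark', 'PySpark', 'Hadoop', 'Python'],
                'Location1': ['LA', 'RIV', 'CHV', 'LA'],
                'Area1': ['L', 'R', 'C', 'L'],
                'Writer2': ['Spark', 'Dil', 'Chee', 'Python'],
                'Location2': ['RIV', 'RIV', 'LA', 'RIV'],
                'Area2': ['R', 'R', 'L', 'R']}
df = pd.DataFrame(technologies)

# Conditions are checked in order, so Location1 wins when both areas qualify
df['Location3'] = np.select(
    [df['Area1'].isin(['R', 'C']), df['Area2'].isin(['R', 'C'])],
    [df['Location1'], df['Location2']],
    default=None,
)
print(df['Location3'].tolist())  # -> ['RIV', 'RIV', 'CHV', 'RIV']
```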

Solution for multiple matches - if both areas qualify, the two locations are joined:

technologies = {'Writer1': ['Spark', 'PySpark', 'Hadoop', 'Python'],
 'Location1': ['LA', 'RIV', 'CHV', 'RIV'],
 'Area1': ['L', 'R', 'C', 'L'],
 'Writer2': ['Spark', 'Dil', 'Chee', 'Python'],
 'Location2': ['RIV', 'RIV', 'RIV', 'RIV'],
 'Area2': ['R', 'R', 'L', 'R'],}
df = pd.DataFrame(technologies)



s1 = df['Location1'].add(', ').where(df['Area1'].isin(['R','C']), '')  
s2 = df['Location2'].where(df['Area2'].isin(['R','C']), '')

df['Location3'] = s1.add(s2).str.strip(', ')
print (df)
   Writer1 Location1 Area1 Writer2 Location2 Area2 Location3
0    Spark        LA     L   Spark       RIV     R       RIV
1  PySpark       RIV     R     Dil       RIV     R  RIV, RIV
2   Hadoop       CHV     C    Chee       RIV     L       CHV
3   Python       RIV     L  Python       RIV     R       RIV
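To see how the join above works step by step, here is the same logic with the intermediate values spelled out (a standalone sketch of the answer's own technique; note that str.strip(', ') treats ', ' as a set of characters to remove from both ends, not as a substring):

```python
import pandas as pd

# Toy frame matching the multi-value example above
df = pd.DataFrame({'Location1': ['LA', 'RIV', 'CHV', 'RIV'],
                   'Area1': ['L', 'R', 'C', 'L'],
                   'Location2': ['RIV', 'RIV', 'RIV', 'RIV'],
                   'Area2': ['R', 'R', 'L', 'R']})

# Step 1: qualifying Location1 values keep a trailing ', '; the rest become ''
s1 = df['Location1'].add(', ').where(df['Area1'].isin(['R', 'C']), '')
# s1 -> ['', 'RIV, ', 'CHV, ', '']

# Step 2: qualifying Location2 values pass through; the rest become ''
s2 = df['Location2'].where(df['Area2'].isin(['R', 'C']), '')
# s2 -> ['RIV', 'RIV', '', 'RIV']

# Step 3: concatenate, then strip leftover separator characters from the ends
result = s1.add(s2).str.strip(', ')
# result -> ['RIV', 'RIV, RIV', 'CHV', 'RIV']
```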

CodePudding user response:

For a generic method to use with any number of Location/Area pairs (in order), you can use:

lst = ['R', 'C']
df = pd.DataFrame(technologies)
df2 = (df.filter(like='Location')
      .where(df.filter(like='Area').isin(lst).values)
      )
df['Location3'] = df2.stack().groupby(level=0).first()
# first() keeps the leftmost qualifying location when several match
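The stack/groupby step can be demonstrated in isolation on a tiny masked frame (hypothetical toy data, not from the question):

```python
import pandas as pd

# Masked wide frame: None marks a non-qualifying location
df2 = pd.DataFrame({'Location1': [None, 'RIV', 'CHV'],
                    'Location2': ['RIV', 'LA', None]})

# stack() pivots the columns into a (row, column) MultiIndex;
# groupby(level=0).first() then keeps the first non-missing value
# per original row, so Location1 wins when both are present
out = df2.stack().groupby(level=0).first()
print(out.tolist())  # -> ['RIV', 'RIV', 'CHV']
```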

or, using a MultiIndex:

idx = pd.MultiIndex.from_frame(df.columns.str.extract(r'(\w+)(\d+)'))
df2 = df.set_axis(idx, axis=1)
df['LocationX'] = (df2['Location'].where(df2['Area'].isin(lst)).stack()
                   .groupby(level=0).first()
                   )

output (the MultiIndex variant writes the same values to a LocationX column instead):

   Writer1 Location1 Area1 Writer2 Location2 Area2 Location3
0    Spark        LA     L   Spark       RIV     R       RIV
1  PySpark       RIV     R     Dil        LA     L       RIV
2   Hadoop       CHV     C    Chee        LA     L       CHV
3   Python        LA     L  Python       RIV     R       RIV
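The column-name split used by the MultiIndex variant can also be checked on its own. A quick standalone sketch (the regex needs `+` quantifiers, `(\w+)(\d+)`, so that backtracking splits each name into its word part and its trailing number):

```python
import pandas as pd

cols = pd.Index(['Writer1', 'Location1', 'Area1',
                 'Writer2', 'Location2', 'Area2'])
# Two capture groups -> a 2-column frame, one row per column name
parts = cols.str.extract(r'(\w+)(\d+)')
idx = pd.MultiIndex.from_frame(parts)
# idx.tolist() -> [('Writer', '1'), ('Location', '1'), ('Area', '1'),
#                  ('Writer', '2'), ('Location', '2'), ('Area', '2')]
```

Selecting `df2['Location']` on a frame with this column index then yields exactly the Location1/Location2 pair, whatever the number of pairs.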

duplicates

If there are multiple qualifying values and you want to keep them all, as concatenated values:

lst = ['R', 'C']
df = pd.DataFrame(technologies)
df2 = (df.filter(like='Location')
      .where(df.filter(like='Area').isin(lst).values)
      )
df['LocationX'] = df2.stack().groupby(level=0).agg(','.join)

output:

   Writer1 Location1 Area1 Writer2 Location2 Area2 LocationX
0    Spark        LA     L   Spark       RIV     R       RIV
1  PySpark       RIV     R     Dil       RIV     R   RIV,RIV
2   Hadoop       CHV     C    Chee        LA     L       CHV
3   Python        LA     L  Python       RIV     R       RIV

or, as multiple rows:

lst = ['R', 'C']
df = pd.DataFrame(technologies)
df2 = (df.filter(like='Location')
      .where(df.filter(like='Area').isin(lst).values)
      )
df = df.join(df2.stack().rename('LocationX').droplevel(1))

output:

   Writer1 Location1 Area1 Writer2 Location2 Area2 LocationX
0    Spark        LA     L   Spark       RIV     R       RIV
1  PySpark       RIV     R     Dil       RIV     R       RIV
1  PySpark       RIV     R     Dil       RIV     R       RIV
2   Hadoop       CHV     C    Chee        LA     L       CHV
3   Python        LA     L  Python       RIV     R       RIV