pandas insert new column based on two column header value

Time:11-28

I want to add a new column that shows the difference between two exams as a percentage.

import pandas as pd

exam_1 = {
  'Name': ['Jonn', 'Tomas', 'Fran', 'Olga', 'Veronika', 'Stephan'],
  'Mat': [85, 75, 50, 93, 88, 90],
  'Science': [96, 97, 99, 87, 90, 88],
  'Reading': [80, 60, 72, 86, 84, 77],
  'Wiritng': [78, 82, 88, 78, 86, 82],
  'Lang': [77, 79, 77, 72, 90, 92],
}

exam_2 = {
  'Name': ['Jonn', 'Tomas', 'Fran', 'Olga', 'Veronika', 'Stephan'],
  'Mat': [80, 80, 90, 90, 85, 80],
  'Science': [50, 60, 85, 90, 66, 82],
  'Reading': [60, 75, 55, 90, 85, 60],
  'Wiritng': [56, 66, 90, 82, 60, 80],
  'Lang': [80, 78, 76, 90, 77, 66],
}

df_1 = pd.DataFrame(exam_1)
df_2 = pd.DataFrame(exam_2)

#cmp = pd.merge(df_1, df_2, how="outer", on=["Name"], suffixes=("_1", "_2"))

cmp = pd.merge(
  df_1, df_2, how="outer", on=["Name"],
  suffixes=("_1", "_2")).set_index("Name").sort_index(axis=1).reset_index()

print(cmp)

The output of the above code looks like this:

       Name  Lang_1  Lang_2  Mat_1  Mat_2  Reading_1  Reading_2  Science_1  Science_2  Wiritng_1  Wiritng_2
0      Jonn      77      80     85     80         80         60         96         50         78         56
1     Tomas      79      78     75     80         60         75         97         60         82         66
2      Fran      77      76     50     90         72         55         99         85         88         90
3      Olga      72      90     93     90         86         90         87         90         78         82
4  Veronika      90      77     88     85         84         85         90         66         86         60
5   Stephan      92      66     90     80         77         60         88         82         82         80

What I want is to add a new column after each compared pair of columns. Is there a built-in function for that? The constant section (currently only Name) may change; in the future there may be 3 constant columns. I want to use a built-in function so the solution is reusable.

I tried doing it manually, but that is not reusable.

What I want exactly is below:

       Name  Lang_1  Lang_2  Lang_Res Mat_1  Mat_2  Mat_Res Reading_1  Reading_2  Reading_Res Science_1  Science_2  Science_Res Writing_1  Writing_2  Writing_Res 
0      Jonn      77      80  Lang_data   85     80  Mat_data       80         60  Reading_data       96         50  Science_data       78         56  Writing_data 
1     Tomas      79      78  Lang_data   75     80  Mat_data       60         75  Reading_data       97         60  Science_data       82         66  Writing_data 
2      Fran      77      76  Lang_data   50     90  Mat_data       72         55  Reading_data       99         85  Science_data       88         90  Writing_data 
3      Olga      72      90  Lang_data   93     90  Mat_data       86         90  Reading_data       87         90  Science_data       78         82  Writing_data 
4  Veronika      90      77  Lang_data   88     85  Mat_data       84         85  Reading_data       90         66  Science_data       86         60  Writing_data 
5   Stephan      92      66  Lang_data   90     80  Mat_data       77         60  Reading_data       88         82  Science_data       82         80  Writing_data

CodePudding user response:

You can start by making a list of every column that has the suffix _2, then use pandas.DataFrame.insert with pandas.Index.get_loc to insert each result column right after its _2 column.

Try this:

edge_cols = cmp.columns.str.extractall(r"(\w+_2)")[0].tolist()

# Insert each "<prefix>_Res" column right after its "<prefix>_2" column.
for col in edge_cols:
    cmp.insert(cmp.columns.get_loc(col) + 1, col.split("_")[0] + "_Res", col.split("_")[0] + "_Data")

# Output :

print(cmp.to_string())

       Name  Lang_1  Lang_2   Lang_Res  Mat_1  Mat_2   Mat_Res  Reading_1  Reading_2   Reading_Res  Science_1  Science_2   Science_Res  Wiritng_1  Wiritng_2   Wiritng_Res
0      Jonn      77      80  Lang_Data     85     80  Mat_Data         80         60  Reading_Data         96         50  Science_Data         78         56  Wiritng_Data
1     Tomas      79      78  Lang_Data     75     80  Mat_Data         60         75  Reading_Data         97         60  Science_Data         82         66  Wiritng_Data
2      Fran      77      76  Lang_Data     50     90  Mat_Data         72         55  Reading_Data         99         85  Science_Data         88         90  Wiritng_Data
3      Olga      72      90  Lang_Data     93     90  Mat_Data         86         90  Reading_Data         87         90  Science_Data         78         82  Wiritng_Data
4  Veronika      90      77  Lang_Data     88     85  Mat_Data         84         85  Reading_Data         90         66  Science_Data         86         60  Wiritng_Data
5   Stephan      92      66  Lang_Data     90     80  Mat_Data         77         60  Reading_Data         88         82  Science_Data         82         80  Wiritng_Data
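
The insert above fills each result column with a placeholder string. A minimal sketch of filling it with an actual percentage could look like the following. It assumes that "difference in percent" means the change from exam 1 to exam 2 relative to exam 1, which the question does not spell out, and it uses a small stand-in frame instead of the full `cmp`.

```python
import pandas as pd

# Small stand-in for the merged frame `cmp` from the question.
cmp = pd.DataFrame({
    "Name": ["Jonn", "Tomas"],
    "Lang_1": [77, 79],
    "Lang_2": [80, 78],
    "Mat_1": [85, 75],
    "Mat_2": [80, 80],
})

# Every column ending in "_2" closes a subject pair.
edge_cols = [col for col in cmp.columns if col.endswith("_2")]

for col in edge_cols:
    prefix = col[: -len("_2")]
    first, second = cmp[prefix + "_1"], cmp[col]
    # Assumption: percentage change relative to the first exam.
    pct_change = ((second - first) / first * 100).round(2)
    # Place the result directly after the "_2" column.
    cmp.insert(cmp.columns.get_loc(col) + 1, prefix + "_Res", pct_change)

print(cmp.to_string())
```

Because the list of `_2` columns is built from the column names, any number of constant columns (Name or others) can sit in the frame without changes to the loop. A score of 0 in the first exam would produce an infinite or NaN value, so that case may need separate handling.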

CodePudding user response:

If I understand correctly, you're hoping to compute a column from two other columns that are related.

What I suggest is this.

  1. Keep your basic column prefixes in a list.
prefixes = ['Lang', 'Mat', 'Reading', ...]
  2. Use these prefixes to automate the lookup and calculation for each pair of columns. Let's say we want to store the average of the _1 and _2 columns for every prefix.
for prefix in prefixes:
    column1 = df[f"{prefix}_1"]
    column2 = df[f"{prefix}_2"]
    averaged = (column1 + column2) / 2
    df.loc[:, f"{prefix}_average"] = averaged

This will add an average column for every category you have a prefix for.
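
If maintaining the prefix list by hand is undesirable, the prefixes can probably be derived from the column names instead. The sketch below is one way to do that; `add_average_columns` and its `key_cols` parameter are hypothetical names introduced for illustration, and the sample frame is a cut-down version of the question's data.

```python
import pandas as pd


def add_average_columns(df, key_cols):
    """Hypothetical helper: add '<prefix>_average' for each '<prefix>_1'/'<prefix>_2' pair."""
    out = df.copy()  # leave the caller's frame untouched
    # Derive prefixes from the "_1" columns, skipping the constant key columns.
    prefixes = [
        col[: -len("_1")]
        for col in out.columns
        if col not in key_cols and col.endswith("_1")
    ]
    for prefix in prefixes:
        if f"{prefix}_2" not in out.columns:
            continue  # no matching second column, nothing to average
        out[f"{prefix}_average"] = (out[f"{prefix}_1"] + out[f"{prefix}_2"]) / 2
    return out


df = pd.DataFrame({
    "Name": ["Jonn", "Tomas"],
    "Lang_1": [77, 79],
    "Lang_2": [80, 78],
    "Mat_1": [85, 75],
    "Mat_2": [80, 80],
})

result = add_average_columns(df, key_cols=["Name"])
print(result)
```

Passing the constant columns explicitly keeps the helper usable if more of them are added later. Note that this version appends the new columns at the end of the frame rather than next to each pair.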
