Home > Back-end >  Pandas.DataFrame - create a new column, based on whether value in another column has occur-ed or not
Pandas.DataFrame - create a new column, based on whether value in another column has occur-ed or not

Time:07-04

I'm an amateur user having some experiences VBA but trying to switch to Python because my beautiful new MBP runs VBA miserably. I'm trying to create a df column, based on whether another column value has occur-ed already. If it has, then the new column value is 0 on that row, if not 1.

For example: I want to create column C in the example below. How do I do it quickly?

      A     B        C (to create column C)
0     001   USA      1
1     002   Canada   1 
3     003   China    1
4     004   India    1
5     005   UK       1
6     006   Japan    1
7     007   USA      0
8     008   UK       0

CodePudding user response:

After creating your dataframe :

import pandas as pandas

data = [["001", "USA"], ["002", "Canada"], ["003", "China"],
["004", "India"], ["005", "UK"], ["006", "Japan"], ["007", "USA"], ["008", "UK"]]

# Create a dataframe
df = pandas.DataFrame(data, columns=["A", "B"])

You can apply a function to each value of one of the columns (in your case, the B column) and have the output of the function as the value of your column.

df["C"] = df.B.apply(lambda x: 1 if df.B.value_counts()[x] == 1 else 0)

This will check if the value in the B column appears somewhere else in the column, and will return 1 if unique and 0 if duplicated.

The dataframe looks like this :

     A       B  C
0  001     USA  0
1  002  Canada  1
2  003   China  1
3  004   India  1
4  005      UK  0
5  006   Japan  1
6  007     USA  0
7  008      UK  0

If you want the values to be recalculated each time you need to have the command

df["C"] = df.B.apply(lambda x: 1 if df.B.value_counts()[x] == 1 else 0)

executed each time after you add a row.

CodePudding user response:

You can check for duplicates on the 'B' column and set duplicates to 0. Then set any non-duplicates to 1 like this:

df = pd.DataFrame({'A':[1, 2, 3, 4, 5, 6, 7, 8], 'B':['USA', 'Canada', 'China', 'India', 'UK', 'Jpan', 'USA', 'UK']})
df.loc[df['B'].duplicated(), 'C'] = 0
df['C'] = df['C'].fillna(1).astype(int)
print(df)

Output:

   A       B  C
0  1     USA  1
1  2  Canada  1
2  3   China  1
3  4   India  1
4  5      UK  1
5  6    Jpan  1
6  7     USA  0
7  8      UK  0
  • Related