Fill columns if selected column names's substrings are same as one column's content's-CodePudding

Given a dataframe df1 as follows, for columns list: ppi_cols = ['PPI_Coal Mining_MoM', 'PPI_Oil & Gas_MoM'], if I removed PPI_ and _MoM, it will become ppi_cols = ['Coal Mining', 'Oil & Gas'], also for column names, after removed PPI: and : YoY, this column will become to Industrial Products, Coal Mining, Oil & Gas, Ferrous Metal Mining:

                            names  PPI_Coal Mining_MoM  PPI_Oil & Gas_MoM
0   PPI: Industrial Products: YoY                  NaN                NaN
1           PPI: Coal Mining: YoY                  NaN                NaN
2             PPI: Oil & Gas: YoY                  NaN                NaN
3  PPI: Ferrous Metal Mining: YoY                  NaN                NaN

Then I will compare new columns and new content of names column, if they are same, then I will fill that cell with 1.

How could I achieve that in Pandas? Thanks.

My code to create new column by replacing : with _ and YoY with MoM for names column.

df1['new_names'] = df1['names'].str.replace(': ', '_')
df1['new_names'] = df1['new_names'].str.replace('YoY', 'MoM')

The expected result:

                            names  PPI_Coal Mining_MoM  PPI_Oil & Gas_MoM
0   PPI: Industrial Products: MoM                  NaN                NaN
1           PPI: Coal Mining: MoM                  1.0                NaN
2             PPI: Oil & Gas: MoM                  NaN                1.0
3  PPI: Ferrous Metal Mining: MoM                  NaN                NaN

CodePudding user response：

What is exactly your question?

You can get the 1s using pandas.get_dummies, you can slice the columns using the existing empty columns (based on index or index union).

Example 1:

pd.get_dummies(df1['new_names'])[df1.columns[1:-1]]

output:

   PPI_Coal Mining_MoM  PPI_Oil & Gas_MoM
0                    0                  0
1                    1                  0
2                    0                  1
3                    0                  0

Example 2:

df2 = pd.get_dummies(df1['new_names'])
cols = df1.columns.intersection(df2.columns)
df1[['new_names']].join(df2[cols].replace(0, float('nan')))

output:

                      new_names  PPI_Coal Mining_MoM  PPI_Oil & Gas_MoM
0   PPI_Industrial Products_MoM                  NaN                NaN
1           PPI_Coal Mining_MoM                  1.0                NaN
2             PPI_Oil & Gas_MoM                  NaN                1.0
3  PPI_Ferrous Metal Mining_MoM                  NaN                NaN

CodePudding user response：

You can use:

cols = ['PPI_Coal Mining_MoM', 'PPI_Oil & Gas_MoM']

for col in cols:
    df1[col] = df1['new_names'].eq(col).astype(int).replace(0, np.nan)

OUTPUT

                       names       PPI_Coal Mining_MoM  PPI_Oil & Gas_MoM                     new_names
0   PPI: Industrial Products: YoY                  NaN                NaN   PPI_Industrial Products_MoM
1           PPI: Coal Mining: YoY                  1.0                NaN           PPI_Coal Mining_MoM
2             PPI: Oil & Gas: YoY                  NaN                1.0             PPI_Oil & Gas_MoM
3  PPI: Ferrous Metal Mining: YoY                  NaN                NaN  PPI_Ferrous Metal Mining_MoM

CodePudding user response：

Compare columns with numpy broadcasting:

ppi_cols = ['PPI_Coal Mining_MoM', 'PPI_Oil & Gas_MoM']

a = np.where(df1['new_names'].to_numpy()[:, None] == ppi_cols, 1, np.nan)

#get all columns names without list ppi_cols and `new_names` (if need remove new_names col)
cols = df1.columns.difference(ppi_cols   ['new_names'], sort=False)
df1 = df1[cols].join(pd.DataFrame(a, columns=ppi_cols, index=df1.index))

print (df1)
                            names  PPI_Coal Mining_MoM  PPI_Oil & Gas_MoM
0   PPI: Industrial Products: YoY                  NaN                NaN
1           PPI: Coal Mining: YoY                  1.0                NaN
2             PPI: Oil & Gas: YoY                  NaN                1.0
3  PPI: Ferrous Metal Mining: YoY                  NaN                NaN