Given a dataframe df1
as follows, for columns list: ppi_cols = ['PPI_Coal Mining_MoM', 'PPI_Oil & Gas_MoM']
, if I removed PPI_
and _MoM
, it will become ppi_cols = ['Coal Mining', 'Oil & Gas']
, also for column names
, after removed PPI:
and : YoY
, this column will become to Industrial Products, Coal Mining, Oil & Gas, Ferrous Metal Mining
:
names PPI_Coal Mining_MoM PPI_Oil & Gas_MoM
0 PPI: Industrial Products: YoY NaN NaN
1 PPI: Coal Mining: YoY NaN NaN
2 PPI: Oil & Gas: YoY NaN NaN
3 PPI: Ferrous Metal Mining: YoY NaN NaN
Then I will compare new columns and new content of names
column, if they are same, then I will fill that cell with 1
.
How could I achieve that in Pandas? Thanks.
My code to create new column by replacing :
with _
and YoY
with MoM
for names
column.
df1['new_names'] = df1['names'].str.replace(': ', '_')
df1['new_names'] = df1['new_names'].str.replace('YoY', 'MoM')
The expected result:
names PPI_Coal Mining_MoM PPI_Oil & Gas_MoM
0 PPI: Industrial Products: MoM NaN NaN
1 PPI: Coal Mining: MoM 1.0 NaN
2 PPI: Oil & Gas: MoM NaN 1.0
3 PPI: Ferrous Metal Mining: MoM NaN NaN
CodePudding user response:
What is exactly your question?
You can get the 1s using pandas.get_dummies
, you can slice the columns using the existing empty columns (based on index or index union).
Example 1:
pd.get_dummies(df1['new_names'])[df1.columns[1:-1]]
output:
PPI_Coal Mining_MoM PPI_Oil & Gas_MoM
0 0 0
1 1 0
2 0 1
3 0 0
Example 2:
df2 = pd.get_dummies(df1['new_names'])
cols = df1.columns.intersection(df2.columns)
df1[['new_names']].join(df2[cols].replace(0, float('nan')))
output:
new_names PPI_Coal Mining_MoM PPI_Oil & Gas_MoM
0 PPI_Industrial Products_MoM NaN NaN
1 PPI_Coal Mining_MoM 1.0 NaN
2 PPI_Oil & Gas_MoM NaN 1.0
3 PPI_Ferrous Metal Mining_MoM NaN NaN
CodePudding user response:
You can use:
cols = ['PPI_Coal Mining_MoM', 'PPI_Oil & Gas_MoM']
for col in cols:
df1[col] = df1['new_names'].eq(col).astype(int).replace(0, np.nan)
OUTPUT
names PPI_Coal Mining_MoM PPI_Oil & Gas_MoM new_names
0 PPI: Industrial Products: YoY NaN NaN PPI_Industrial Products_MoM
1 PPI: Coal Mining: YoY 1.0 NaN PPI_Coal Mining_MoM
2 PPI: Oil & Gas: YoY NaN 1.0 PPI_Oil & Gas_MoM
3 PPI: Ferrous Metal Mining: YoY NaN NaN PPI_Ferrous Metal Mining_MoM
CodePudding user response:
Compare columns with numpy broadcasting:
ppi_cols = ['PPI_Coal Mining_MoM', 'PPI_Oil & Gas_MoM']
a = np.where(df1['new_names'].to_numpy()[:, None] == ppi_cols, 1, np.nan)
#get all columns names without list ppi_cols and `new_names` (if need remove new_names col)
cols = df1.columns.difference(ppi_cols ['new_names'], sort=False)
df1 = df1[cols].join(pd.DataFrame(a, columns=ppi_cols, index=df1.index))
print (df1)
names PPI_Coal Mining_MoM PPI_Oil & Gas_MoM
0 PPI: Industrial Products: YoY NaN NaN
1 PPI: Coal Mining: YoY 1.0 NaN
2 PPI: Oil & Gas: YoY NaN 1.0
3 PPI: Ferrous Metal Mining: YoY NaN NaN