I converted a JSON variable to multiple paired variables. As a result, I have a dataset like
home_city_1 home_number_1 home_city_2 home_number_2 home_city_3 home_number_3 home_city_4 home_number_4
Coeur D Alene, ID 13.0 Hayden, ID 8.0 Renton, WA 2.0 NaN NaN
Spokane, WA 3.0 Amber, WA 2.0 NaN NaN NaN NaN
Sioux Falls, SD 9.0 Stone Mountain, GA 2.0 Watertown, SD 2.0 Dell Rapids, SD 2.0
Ludowici, GA 11.0 NaN NaN NaN NaN NaN NaN
This data set has 600 columns (300 * 2).
I want to convert the values with those conditions:
- Change ' ' or ',' in the home_city_# column values to '_' (under bar). For example, 'Sioux Falls, SD' to 'Sioux_Falls__SD'
- Convert missing values to 'm' (missing in home_city_#) or -1 (missing in home_number_#)
I have tried
customer_home_city_json_2 = customer_home_city_json_1.replace(',', '_')
customer_home_city_json_2 = customer_home_city_json_2 .apply(lambda x: x.replace('null', "-1"))
CodePudding user response:
Try
citys = [col for col in df.columns if 'home_city_' in col]
numbers = [col for col in df.columns if 'home_number_' in col]
df[citys] = df[citys].replace("\s|,", "_", regex=True)
df[citys] = df[citys].fillna('m')
df[numbers] = df[numbers].fillna(-1)
To perform the correct tasks you have to get columns names for 'home_city_#' and 'home_number_#'. This is done in the first two lines.
For replacing " "
and ","
with "_"
I call replace()
with regex=True
to use regular expressions. \s
(is a shortcut) and removes all whitespaces, this could be replaced also by
.
For filling the NaNs I use fillna
and set the wanted value -1
or m
. I suggestnot to mix types in a column. Therefor I use -1
for "numbers" and m
for citys.
Example
It this is you DataFrame
home_city_1 home_number_1 home_city_2 home_number_2
0 Coeur D Alene, ID 13.0 Hayden, ID 8.0
1 Spokane, WA 3.0 Amber, WA 2.0
2 Sioux Falls, SD 9.0 Stone Mountain, GA 2.0
3 Ludowici, GA 11.0 NaN NaN
the output will be
home_city_1 home_number_1 home_city_2 home_number_2
0 Coeur_D_Alene__ID 13.0 Hayden__ID 8.0
1 Spokane__WA 3.0 Amber__WA 2.0
2 Sioux_Falls__SD 9.0 Stone_Mountain__GA 2.0
3 Ludowici__GA 11.0 m -1.0
CodePudding user response:
Considering that df
is the name of your dataframe, you can try this :
city_cols = df.filter(regex='^home_city').columns
df[city_cols] = (df[city_cols]
.replace('', '-')
.replace(',', '-', regex=True)
.fillna('m'))
number_cols = df.filter(regex='^home_number').columns
df[number_cols] = df[number_cols].fillna(-1)
By using pandas.DataFrame.filter
and regex you can filter by columns that have the same prefix.