Let's say I have a pandas dataframe that looks like this:
df = pd.read_json('{"id":{"0":"21 Delta","1":"38 Bravo","2":"Charlie 37","3":"Alpha 56"},"name_1":{"0":"Tom","1":"Nick","2":"Chris","3":"David 56"},"name_2":{"0":"Peter 17","1":"Emma 53","2":"Jeff 11","3":"Oscar"},"name_3":{"0":"Jeffrey","1":"Olivier 12","2":null,"3":null},"name_4":{"0":"Henry 23","1":null,"2":null,"3":null}}')
df
id name_1 name_2 name_3 name_4
0 21 Delta Tom Peter 17 Jeffrey Henry 23
1 38 Bravo Nick Emma 53 Olivier 12 None
2 Charlie 37 Chris Jeff 11 None None
3 Alpha 56 David 56 Oscar None None
What I would like to do is to iterate over the columns in this df and check if the column name starts with name
. If so, I would like to add the number after the white space in each row of that particular column in an extra column called age_
which increments by one like so:
id name_1 name_2 name_3 name_4 age_1 age_2 age_3 age_4
0 21 Delta Tom Peter 17 Jeffrey Henry 23 None 17 None 23
1 38 Bravo Nick Emma 53 Olivier 12 None None 53 12 None
2 Charlie 37 Chris Jeff 11 None None None 11 None None
3 Alpha 56 David 56 Oscar None None 56 None None None
So far I came up with this, but I struggle how to get to the end result:
for column in df.columns:
if column.startswith("name"):
age = df[column].str.split(" ").str.get(1)
CodePudding user response:
ages = (df.filter(like="name")
.apply(lambda col: col.str.extract(r" (\d )$", expand=False))
.rename(columns=lambda c: c.replace("name", "age")))
- get the "name" involving columns
- for each of them, extract the numbers near end with a regex
- column names are still "name_*", so replace "name" with "age" there
and lastly join with the original frame to get
>>> df.join(ages)
id name_1 name_2 name_3 name_4 age_1 age_2 age_3 age_4
0 21 Delta Tom Peter 17 Jeffrey Henry 23 NaN 17 NaN 23
1 38 Bravo Nick Emma 53 Olivier 12 None NaN 53 12 None
2 Charlie 37 Chris Jeff 11 None None NaN 11 None None
3 Alpha 56 David 56 Oscar None None 56 NaN None None
CodePudding user response:
Besides Mustafa Aydın's approach. Here is the fixed version of your for-loop
for column in df.columns:
if column.startswith("name"):
age = f"age_{column[-1]}"
df[age] = df[column].str.extract(r"(\d )")
print(df)
id name_1 name_2 name_3 name_4 age_1 age_2 age_3 age_4
0 21 Delta Tom Peter 17 Jeffrey Henry 23 NaN 17 NaN 23
1 38 Bravo Nick Emma 53 Olivier 12 None NaN 53 12 NaN
2 Charlie 37 Chris Jeff 11 None None NaN 11 NaN NaN
3 Alpha 56 David 56 Oscar None None 56 NaN NaN NaN