I'm trying to create new columns in pandas from sub-strings of another column.
import pandas as pd
import re
df = {'title':['Apartment 2 roomns, 40 m²', 'House 7 rooms, 183 m²', 'House 4 rooms, 93 m²', 'Apartment 12 rooms, 275 m²']}
I'm trying with regex to capture groups:
df['Name'] = df.title.str.extract(r'(^[a-zA-Z] )', expand=True)
This one I got a good result. But I need a column with the number of rooms (without the word "rooms") and another column with the size without "m²". I tried:
df['Rooms'] = df.title.str.replace(r'(^[0-9] )\s(rooms)', r'\1') #to capture only the first group, which is the number
df['Size'] = df.title.str.replace(r'(^[0-9] )\s(m²)', r'\1') #to capture only the first group, which is the number
My output:
Name Rooms Size
0 Apartment Apartment 2 roomns, 40 m² Apartment 2 roomns, 40 m²
1 House House 7 rooms, 183 m² House 7 rooms, 183 m²
2 House House 4 rooms, 93 m² House 4 rooms, 93 m²
3 Apartment Apartment 12 rooms, 275 m² Apartment 12 rooms, 275 m²
Good output:
Name Rooms Size
0 Apartment 2 40
1 House 7 183
2 House 4 93
3 Apartment 12 275
CodePudding user response:
You can use
df["Rooms"] = df["title"].str.extract(r'(\d )\s*room', expand=False)
df['Size'] = df["title"].str.extract(r'(\d (?:\.\d )?)\s*m²', expand=False)
Output:
>>> df
title Rooms Size
0 Apartment 2 roomns, 40 m² 2 40
1 House 7 rooms, 183 m² 7 183
2 House 4 rooms, 93 m² 4 93
3 Apartment 12 rooms, 275 m² 12 275
The (\d )\s*room
regex matches and captures into Group 1 one or more digits, and then just matches zero or more whitespaces (\s*
) and then a room
string.
The (\d (?:\.\d )?)\s*m²
regex matches and captured one or more digits, and an optional string of a .
and one or more digits, and then matches zero or more whitespaces and then a m²
string.
See regex #1 demo and the regex #2 demo.