I scraped a real estate website and produced a CSV output with data requiring to be cleaned and structured. So far, my code properly organized and reformatted the data for it to work with stats software.
However, every row and then, my 'Gross area' column has the wrong value in m2. The correct value appears in another column ('Furbished').
Gross_area | Furbished |
---|---|
170 #erroneous | 190 m2 |
170 #erroneous | 190 m2 |
160 #correct | Yes |
155 #correct | No |
I tried using the np.where function. However, I could not specify the condition based on string length, which would allow me to target all '_ _ _ m2' values in column 'Furbished' and reinsert them in 'Gross_area'. It just doesn't work.
df['Gross area']=np.where(len(df['Furbished]) == 6, df['Furbished'],df['Gross area']
As an alternative, I tried setting cumulative conditions to precisely target my '_ _ _ m2' values and insert them in my 'Gross area' column. It does not work:
df['Gross area']=np.where((df['Furbished]) != 'Yes' or 'No', df['Furbished'],df['Gross area']
The outcome I seek is:
Gross_area | Furbished |
---|---|
190 m2 | 190 m2 |
190 m2 | 190m2 |
160 | Yes |
Any suggestions? Column Furbished string length criterion would be the best option, as I have other instances that would require the same treatment :)
Thanks in advance for your help!
CodePudding user response:
There is probably a better way to do this, but you could get the intended effect by a simple df.apply()
function.
df['Gross area'] = df.apply(lambda row: row['Furbished'] if len(row['Furbished']) == 6 else df['Gross area'], axis=1)
With a simple change, you can also keep the 'Gross area'
column in the right type.
df['Gross area'] = df.apply(lambda row: float(row['Furbished'][:-2]) if len(row['Furbished']) == 6 else df['Gross area'], axis=1)
CodePudding user response:
You can use pd.where
:
df['Gross_area'] = df['Furbished'].where(df['Furbished'].str.len() == 6, df['Gross_area'])
This tells you to use the value in the Furbished
column if its length is 6, otherwise use the value in the Gross_area
column.
Result:
Gross_area Furbished
0 190 m2 190 m2
1 190 m2 190 m2
2 160 #correct Yes
3 155 #correct No