I am working on a data set that contains longitude and latitude values.
I converted those values to clusters using DBSCAN.
Then I plotted the clusters just as a sanity check.
I get this:
The point at (0, 0) is obviously an issue.
So I ran this code to capture which row(s) are a problem.
a = df3.loc[(df3['latitude'] < 0.01) & (df3['longitude'] < 0.01)].index
print(a) # 1812 rows with 0.0 longitude and -2e-08 latitude
I have 1812 rows with missing data all represented as 0.0 longitude and -2e-08 latitude in the source file.
I am debating some imputation strategies, but first I want to replace the 0.0 and -2e-08 values with pd.NA or np.nan so that I can then use fillna() with whatever I ultimately decide to do.
I have tried both:
df3.replace((df3['longitude'] == 0.0), pd.NA, inplace=True)
df3.replace((df3['latitude'] == -2e-08), pd.NA, inplace=True)
print(df3['longitude'].value_counts(dropna=False), '\n')
print(df3['latitude'].value_counts(dropna=False))
and
df3.replace((df3['longitude'] < 0.01), pd.NA, inplace=True)
df3.replace((df3['latitude'] < 0.01), pd.NA, inplace=True)
print(df3['longitude'].value_counts(dropna=False), '\n')
print(df3['latitude'].value_counts(dropna=False))
In both cases the existing values remain in place, i.e., the desired substitution with pd.NA is not occurring.
What would be the correct procedure to replace the 1812 unwanted values in both the latitude and longitude columns with pd.NA or np.nan? I simply plan to impute something to replace the null values afterwards.
CodePudding user response:
Try this one out:
df3['longitude'] = df3['longitude'].apply(lambda x: np.nan if x == 0.0 else x)
df3['latitude'] = df3['latitude'].apply(lambda x: np.nan if x == -2e-08 else x)
print(df3['longitude'].value_counts(dropna=False), '\n')
print(df3['latitude'].value_counts(dropna=False))
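As a side note on why the attempts in the question did nothing: replace() looks for matching cell values, it does not accept a boolean mask as a row selector, so passing a comparison like df3['longitude'] == 0.0 to it does not select any rows. A vectorized alternative to apply() is Series.mask with a boolean condition. Below is a minimal sketch; the toy df3 and the 1e-6 tolerance are my own assumptions, since the real data is not shown. I use np.isclose instead of == because exact float comparison against -2e-08 can be fragile.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df3 (assumed values; the real frame is not shown in the question).
df3 = pd.DataFrame({
    "latitude": [40.7, -2e-08, 34.1, -2e-08],
    "longitude": [-74.0, 0.0, -118.2, 0.0],
})

# Flag the sentinel values with a small absolute tolerance (1e-6 is an assumption).
bad_lon = np.isclose(df3["longitude"], 0.0, atol=1e-6)
bad_lat = np.isclose(df3["latitude"], -2e-08, atol=1e-6)

# Series.mask replaces entries where the condition is True (with NaN by default).
df3["longitude"] = df3["longitude"].mask(bad_lon)
df3["latitude"] = df3["latitude"].mask(bad_lat)

# Count the nulls per column as a quick check.
print(df3.isna().sum())
```

Note that a filter like df3['latitude'] < 0.01 would also match every legitimately negative coordinate, so matching the sentinel values directly is probably safer if the data can contain southern or western hemisphere points.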
CodePudding user response:
The same approach with a small example:
import numpy as np
import pandas as pd
a = [1, 2, 0.0, -2e-08]
b = [1, 2, 0.0, -2e-08]
df = pd.DataFrame(zip(a, b))
df.columns = ['lat', 'long']
df.long = df.long.apply(lambda x: np.nan if x == 0.0 else x)
df.lat = df.lat.apply(lambda x: np.nan if x == -2e-08 else x)
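If the goal is to null out both coordinates for the same bad rows and then call fillna(), one possible variation is to build a single row mask and assign through .loc. This is only a sketch under my own assumptions: the 1e-6 threshold is arbitrary, it assumes no real point lies that close to (0, 0), and the median fill is just a placeholder for whatever imputation is chosen in the end.

```python
import numpy as np
import pandas as pd

# Same toy data as the example above.
df = pd.DataFrame({"lat": [1, 2, 0.0, -2e-08], "long": [1, 2, 0.0, -2e-08]})

# A row counts as missing when both coordinates are (nearly) zero.
# The 1e-6 threshold is an assumption, not something from the question.
missing = (df["lat"].abs() < 1e-6) & (df["long"].abs() < 1e-6)

# Set both columns to NaN for those rows in one assignment.
df.loc[missing, ["lat", "long"]] = np.nan
print(df)

# Placeholder imputation: fill each column with its median (illustrative only).
filled = df.fillna(df.median())
print(filled)
```

Using one mask for both columns keeps latitude and longitude consistent, so a row never ends up with one coordinate imputed and the other left as a sentinel.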