Replace specific column values with pd.NA

Time:10-26

I am working on a data set that contains longitude and latitude values.

I converted those values to clusters using DBSCAN.

Then I plotted the clusters just as a sanity check.

I get this:

[Image: scatter plot of the DBSCAN clusters]

The point at (0, 0) is obviously an issue.

So I ran this code to capture which row(s) are a problem.

a = df3.loc[(df3['latitude'] < 0.01) & (df3['longitude'] < 0.01)].index
print(a)  # 1812 rows with 0.0 longitude and -2e-08 latitude

I have 1812 rows with missing data all represented as 0.0 longitude and -2e-08 latitude in the source file.

I am debating some imputation strategies, but first I want to replace the 0.0 and -2e-08 values with pd.NA or np.nan so that I can then use fillna() with whatever I ultimately decide to do.

I have tried both:

df3.replace((df3['longitude'] == 0.0), pd.NA, inplace=True)
df3.replace((df3['latitude'] == -2e-08), pd.NA, inplace=True)
print(df3['longitude'].value_counts(dropna=False), '\n')
print(df3['latitude'].value_counts(dropna=False))

and

df3.replace((df3['longitude'] < 0.01), pd.NA, inplace=True)
df3.replace((df3['latitude'] < 0.01), pd.NA, inplace=True)
print(df3['longitude'].value_counts(dropna=False), '\n')
print(df3['latitude'].value_counts(dropna=False))

In both cases the existing values remain in place, i.e., the desired substitution with pd.NA is not occurring.

What is the correct way to replace the unwanted 1812 values in both the latitude and longitude columns with pd.NA or np.nan? I simply plan to impute something to replace the null values afterwards.

CodePudding user response:

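replace() matches cell values, not a boolean mask, so passing a comparison such as df3['longitude'] == 0.0 does not select the rows you want. Apply the condition to each column instead: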

import numpy as np

# Swap the placeholder values for NaN, one column at a time
df3['longitude'] = df3['longitude'].apply(lambda x: np.nan if x == 0.0 else x)
df3['latitude'] = df3['latitude'].apply(lambda x: np.nan if x == -2e-08 else x)
print(df3['longitude'].value_counts(dropna=False), '\n')
print(df3['latitude'].value_counts(dropna=False))
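
An exact == comparison can miss rows if the placeholders are not stored as exactly 0.0 and -2e-08. Here is a minimal sketch of a tolerance-based alternative. The small df3 below is a made-up stand-in for the real data, and the 1e-6 tolerance is an assumption you would want to tune. It flags only rows where both coordinates sit near zero and sets both to NaN in one step.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df3; the real data is not shown in the question.
df3 = pd.DataFrame({
    "latitude": [40.7, -2e-08, 51.5, -2e-08],
    "longitude": [-74.0, 0.0, -0.1, 0.0],
})

# Assumed tolerance: anything within 1e-6 degrees of zero counts as a placeholder.
TOL = 1e-6

# True only where BOTH coordinates are near zero, so genuine points that
# merely have a negative or small coordinate are left alone.
missing = (
    np.isclose(df3["latitude"], 0.0, atol=TOL)
    & np.isclose(df3["longitude"], 0.0, atol=TOL)
)

# Boolean-mask assignment: set both columns to NaN on the flagged rows.
df3.loc[missing, ["latitude", "longitude"]] = np.nan

print(df3.isna().sum())
```

After this, fillna() can be applied to either column with whatever imputation you settle on.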

CodePudding user response:

The same approach on a small example:

import numpy as np
import pandas as pd
a = [1, 2, 0.0, -2e-08]
b = [1, 2, 0.0, -2e-08]
df = pd.DataFrame(zip(a, b))
df.columns = ['lat', 'long']

df['long'] = df['long'].apply(lambda x: np.nan if x == 0.0 else x)
df['lat'] = df['lat'].apply(lambda x: np.nan if x == -2e-08 else x)
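
A vectorized variant of the same example might look like the sketch below. It assumes the placeholders are stored as exactly 0.0 and -2e-08, and it treats both values as missing in both columns, which is slightly broader than the per-column apply() above. The median fill at the end is only a placeholder for whatever imputation strategy is eventually chosen.

```python
import numpy as np
import pandas as pd

# Same toy data as in the example above.
df = pd.DataFrame({"lat": [1, 2, 0.0, -2e-08], "long": [1, 2, 0.0, -2e-08]})

# Values assumed to stand for "missing" in the source file.
placeholders = [0.0, -2e-08]

# mask() replaces every cell where the condition is True with NaN by default.
cleaned = df.mask(df.isin(placeholders))

# Illustrative imputation only: fill each column with its own median.
filled = cleaned.fillna(cleaned.median())

print(cleaned)
print(filled)
```

Unlike replace() with a boolean Series, mask() is designed to take a boolean condition, so no per-element lambda is needed.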