Taking care of missing data

I have a data frame of rent prices in São Paulo, but some "Latitude" and "Longitude" values are missing (they are stored as 0), so I want to replace those zeros with the mean. The catch is that I want to replace each missing Latitude and Longitude with the mean of the same District only.

A slice of the DataFrame is shown below.

	Price	Condo	Size	Rooms	Toilets	Suites	Parking	Elevator	District	Negotiation Type	Property Type	Latitude	Longitude
0	930	220	47	2	2	1	1	0	Artur Alvim/São Paulo	rent	apartment	-23.543138	-46.479486
1	1000	148	45	2	2	1	1	0	Artur Alvim/São Paulo	rent	apartment	-23.550239	-46.480718
2	1000	100	48	2	2	1	1	0	Artur Alvim/São Paulo	rent	apartment	-23.542818	-46.485665
3	1000	200	48	2	2	1	1	0	Artur Alvim/São Paulo	rent	apartment	-23.547171	-46.483014
4	1300	410	55	2	2	1	1	1	Artur Alvim/São Paulo	rent	apartment	-23.525025	-46.482436

How can I do it?

CodePudding user response：

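Here is a very simple approach; treat the code below as a rough sketch rather than a finished solution.

for i in df.index:
    # Rows where both coordinates are 0 are the ones with missing data.
    if df.loc[i, "Latitude"] == 0 and df.loc[i, "Longitude"] == 0:
        pass  # replace both values with the mean of df.loc[i, "District"] here

Sorry, I don't remember the pandas syntax for the replacement itself offhand, but you should get the idea.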
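
For what it's worth, the loop idea above could be made concrete roughly like this. It is only a sketch: the first two rows of the toy frame are taken from the question, while the zero rows and the "Toy District" rows are made up for illustration, and it assumes a row counts as missing when both coordinates are 0.

```python
import pandas as pd

# Toy frame: rows 0-1 come from the question, the others are invented for the example.
df = pd.DataFrame({
    "District": ["Artur Alvim/São Paulo"] * 3 + ["Toy District"] * 2,
    "Latitude": [-23.543138, -23.550239, 0.0, -23.6, 0.0],
    "Longitude": [-46.479486, -46.480718, 0.0, -46.7, 0.0],
})

# Assumption: a row is "missing" when both coordinates are 0.
missing = (df["Latitude"] == 0) & (df["Longitude"] == 0)

# Loop over the districts that contain at least one missing row.
for district in df.loc[missing, "District"].unique():
    in_district = df["District"] == district
    # Average only the rows of this district that have real coordinates.
    known = df[in_district & ~missing]
    for col in ["Latitude", "Longitude"]:
        df.loc[in_district & missing, col] = known[col].mean()
```

If a district has no row with real coordinates, the mean is NaN and the missing rows end up as NaN instead of 0.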

CodePudding user response：

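First mark the 0 placeholders as missing so they do not distort the averages, then get the mean latitude and longitude for each district:

df[['Latitude', 'Longitude']] = df[['Latitude', 'Longitude']].replace(0, float('nan'))
df_meanll = df.groupby('District').agg(long_mean=('Longitude','mean'), lat_mean=('Latitude','mean')).reset_index()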

Then attach the district means to each row with a left merge:

df = df.merge(df_meanll, on='District', how='left')

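Finally, fill the missing values from the merged mean columns (assigning the result back, since calling fillna with inplace=True on a selected column may not modify df in recent pandas versions):

df['Longitude'] = df['Longitude'].fillna(df['long_mean'])
df['Latitude'] = df['Latitude'].fillna(df['lat_mean'])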
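
A more compact variant of the same group-mean idea might use groupby(...).transform("mean"), which avoids the merge and the helper columns. This is a sketch on a toy frame (rows 0-1 are from the question, the rest are invented), and it assumes missing coordinates are stored as 0.

```python
import numpy as np
import pandas as pd

# Toy frame: rows 0-1 come from the question, the others are invented for the example.
df = pd.DataFrame({
    "District": ["Artur Alvim/São Paulo"] * 3 + ["Toy District"] * 2,
    "Latitude": [-23.543138, -23.550239, 0.0, -23.6, 0.0],
    "Longitude": [-46.479486, -46.480718, 0.0, -46.7, 0.0],
})

coords = ["Latitude", "Longitude"]

# Turn the 0 placeholders into NaN so that mean() skips them.
df[coords] = df[coords].replace(0, np.nan)

# transform("mean") returns one row per original row, holding that row's district mean.
district_means = df.groupby("District")[coords].transform("mean")

# Fill only the NaN cells; existing coordinates are left untouched.
df[coords] = df[coords].fillna(district_means)
```

Because transform keeps the original index, the means line up with df row by row, so no join on 'District' is needed.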

CodePudding user response：

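Using the pandas built-in methods .groupby, .agg, .assign, .mask, .map and .apply:

# Missing coordinates are stored as 0; turn them into NaN so "mean" skips them and fillna can find them.
df = df.assign(
    Longitude=df["Longitude"].mask(df["Longitude"] == 0),
    Latitude=df["Latitude"].mask(df["Latitude"] == 0),
)

means_mapping = (
    df
    .groupby("District")
    .agg(LongitudeMean=("Longitude", "mean"), LatitudeMean=("Latitude", "mean"))
    .transpose()
    .to_dict("list")
)  # {district: [LongitudeMean, LatitudeMean]}

df = df.assign(
    Longitude=df["Longitude"].fillna(df["District"].map(means_mapping).apply(lambda x: x[0])),
    Latitude=df["Latitude"].fillna(df["District"].map(means_mapping).apply(lambda x: x[1])),
)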
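
If the list-valued dictionary feels fragile (the position of each mean inside the list has to match the order used in .agg), one possible alternative is to map a separate Series of means per column. A minimal sketch, again on a toy frame with invented zero rows and assuming 0 marks a missing value:

```python
import pandas as pd

# Toy frame: rows 0-1 come from the question, the others are invented for the example.
df = pd.DataFrame({
    "District": ["Artur Alvim/São Paulo"] * 3 + ["Toy District"] * 2,
    "Latitude": [-23.543138, -23.550239, 0.0, -23.6, 0.0],
    "Longitude": [-46.479486, -46.480718, 0.0, -46.7, 0.0],
})

for col in ["Latitude", "Longitude"]:
    # Replace the 0 placeholders with NaN.
    values = df[col].mask(df[col] == 0)
    # Series of means indexed by district name (NaN values are skipped).
    means = values.groupby(df["District"]).mean()
    # Look up each row's district mean and use it only where the value is missing.
    df[col] = values.fillna(df["District"].map(means))
```

Here Series.map does the district lookup directly, so there is no list index to keep in sync.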