Hi I have a dataset of multiple households where all people within households have been matched between two datasources. The dataframe therefore consists of a 'household' col, and two person cols (one for each datasource). However some people (like Jonathan or Peter below) where not able to be matched and so have a blank second person column.
Household | Person_source_A | Person_source_B |
---|---|---|
1 | Oliver | Oliver |
1 | Jonathan | |
1 | Amy | Amy |
2 | David | Dave |
2 | Mary | Mary |
3 | Lizzie | Elizabeth |
3 | Peter |
As the dataframe is gigantic, my aim is to take a sample of the unmatched individuals, and then output a df that has all people within households where only sampled unmatched people exist. Ie say my random sample includes Oliver but not Peter, then I would only household 1 in the output.
My issue is I've filtered to take the sample and now am stuck making progress. Some combination of join, agg/groupBy... will work but I'm struggling. I add a flag to the sampled unmatched names to identify them which i think is helpful...
My code:
# filter to unmatched people
df_unmatched = df.filter(col('per_A').isNotNull()) & col('per_B').isNull())
# take random sample of 10%
df_unmatched_sample = df_unmatched.sample(0.1)
# add flag of sampled unmatched persons
df_unmatched_sample = df_unmatched.withColumn('sample_flag', lit('1'))
CodePudding user response:
As it pertains to your intent:
I just want to reduce my dataframe to only show the full households of households where an unmatched person exists that has been selected by a random sample out of all unmatched people
Using your existing approach you could use a join on the Household
of the sample records
# filter to unmatched people
df_unmatched = df.filter(col('per_A').isNotNull()) & col('per_B').isNull())
# take random sample of 10%
df_unmatched_sample = df_unmatched.sample(0.1).select("Household").distinct()
desired_df = df.join(df_unmatched_sample,["Household"],"inner")