I am trying to retrieve a random row from among the rows of a df whose value appears n times in the df itself, but I am running into problems.
This is the code I'm using, with n = 2.
import numpy as np
import pandas as pd

d = {
    "letters": ["a", "b", "c", "a", "b", "a", "d", "d"],
    "one": [1, 1, 1, 1, 1, 1, 1, 1],
    "two": [2, 2, 2, 2, 2, 2, 2, 2],
}
df = pd.DataFrame(d)
s = df["letters"].value_counts()
df2 = df.loc[np.where(s.to_numpy() == 2)]
rand = df2.sample(n=1, random_state=2)
This looks fine at first glance, but inspecting df2 shows that df2["letters"] has two entries, "b" and "c", and "c" clearly does not appear twice in the original df.
I guess the error is in the way I express "look only at the entries that appear n times", but I can't wrap my mind around it.
What is going on here, and how can I fix the problem?
CodePudding user response:
Use Series.map on the original letters column to look up each row's count, then filter on that:
s = df["letters"].value_counts()
df2 = df[df["letters"].map(s) == 2]
print(df2)
  letters  one  two
1       b    1    2
4       b    1    2
6       d    1    2
7       d    1    2
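As for what goes wrong in the original attempt, here is a minimal sketch, assuming the same df as in the question. value_counts() returns a Series indexed by letter, so np.where on its values gives positions inside that counts Series, not row labels of df. Passing those positions to df.loc selects unrelated rows, which is how "c" sneaks in. The variable names positions, wrong and right are only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "letters": ["a", "b", "c", "a", "b", "a", "d", "d"],
    "one": [1] * 8,
    "two": [2] * 8,
})

# Counts are indexed by letter: "a" (3) comes first, "c" (1) comes last.
s = df["letters"].value_counts()

# Positions *within s* where the count equals 2.
positions = np.where(s.to_numpy() == 2)[0]

# These positions are meaningful for s: they point at the letters with count 2.
letters_with_two = s.index[positions]

# Used as row labels of df, the same numbers pick rows that merely
# happen to carry those labels.
wrong = df.loc[positions]

# map() looks up the count for every row, giving a Series aligned with df.
counts_per_row = df["letters"].map(s)
right = df[counts_per_row == 2]

print(wrong)
print(right)
```

Because counts_per_row shares df's index, the boolean mask lines up row by row, which is what the filter needs.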
Then, if you need one random row per letter, use:
rand = df2.groupby('letters').sample(n=1, random_state = 2)
print(rand)
  letters  one  two
4       b    1    2
6       d    1    2
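If you want a single random row overall, as in the question, rather than one per letter, the filter and the sample can be wrapped together. This is only a sketch: the helper name sample_row_with_count and its parameters are hypothetical, not part of pandas, and it assumes an empty selection should raise an error.

```python
import pandas as pd


def sample_row_with_count(df, column, n, random_state=None):
    """Return one random row among rows whose `column` value occurs exactly n times."""
    counts = df[column].value_counts()
    # Keep only rows whose value in `column` appears exactly n times.
    candidates = df[df[column].map(counts) == n]
    if candidates.empty:
        raise ValueError(f"no value in {column!r} appears exactly {n} times")
    return candidates.sample(n=1, random_state=random_state)


df = pd.DataFrame({
    "letters": ["a", "b", "c", "a", "b", "a", "d", "d"],
    "one": [1] * 8,
    "two": [2] * 8,
})

row = sample_row_with_count(df, "letters", n=2, random_state=2)
print(row)
```

With n=2 the returned row is always one of the "b" or "d" rows. Which one you get depends on random_state.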