I have a conceptual problem.
I'm working through the pandas course on Kaggle to learn and practice a new skill. I tried to solve an exercise, but I don't understand why my result is different from what I expected.
Question:
"There are only so many words you can use when describing a bottle of wine. Is a wine more likely to be "tropical" or "fruity"? Create a Series descriptor_counts counting how many times each of these two words appears in the description column in the dataset. (For simplicity, let's ignore the capitalized versions of these words.)"
My answer:
tropical_count= reviews["description"].str.count(pat ="tropical").sum()
fruity_count= reviews["description"].str.count(pat ="fruity").sum()
descriptor_counts = pd.Series({"tropical":tropical_count,"fruity":fruity_count},index=["tropical","fruity"])
Kaggle answer:
n_trop = reviews.description.map(lambda desc: "tropical" in desc).sum()
n_fruity = reviews.description.map(lambda desc: "fruity" in desc).sum()
descriptor_counts = pd.Series([n_trop, n_fruity], index=['tropical', 'fruity'])
Both versions run fine, but the results are different. Does anyone know why?
My result:
tropical 3703
fruity 9259
dtype: int64
Kaggle result:
tropical 3607
fruity 9090
dtype: int64
CodePudding user response:
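The output is expected. str.count counts every occurrence of the substring in each row, whereas the in operator only tests whether the substring exists in the row, so its output is just True or False. When you then call sum, each True is treated as 1 and each False as 0, so a row that contains the word several times adds only 1 to the total. That is why the two results differ.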
Sample:
reviews = pd.DataFrame(["Ttropical are tropical so fruity words you can",
"fruity ",
"fruity fruity",
"anythi"], columns=['description'])
tropical_count= reviews["description"].str.count(pat ="tropical")
fruity_count= reviews["description"].str.count(pat ="fruity")
print (tropical_count)
0 2
1 0
2 0
3 0
Name: description, dtype: int64
print (fruity_count)
0 1
1 1
2 2
3 0
Name: description, dtype: int64
n_trop = reviews.description.map(lambda desc: "tropical" in desc)
n_fruity = reviews.description.map(lambda desc: "fruity" in desc)
print (n_trop)
0 True
1 False
2 False
3 False
Name: description, dtype: bool
print (n_fruity)
0 True
1 True
2 True
3 False
Name: description, dtype: bool
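If the goal is to reproduce the Kaggle numbers without a lambda, one option (a minimal sketch, not the course's official solution) is str.contains, which returns one boolean per row just like the in operator. The names words, occurrences and rows_with_word are made up for this illustration, and regex=False is passed so the word is matched as a plain substring:

```python
import pandas as pd

# Same toy data as in the sample above.
reviews = pd.DataFrame(
    ["Ttropical are tropical so fruity words you can",
     "fruity ",
     "fruity fruity",
     "anythi"],
    columns=["description"],
)

words = ["tropical", "fruity"]

# Total occurrences: a row that contains the word twice contributes 2.
occurrences = pd.Series(
    {w: reviews["description"].str.count(w).sum() for w in words}
)

# Rows containing the word: each row contributes at most 1.
rows_with_word = pd.Series(
    {w: reviews["description"].str.contains(w, regex=False).sum() for w in words}
)

print(occurrences)
print(rows_with_word)
```

On this toy data the first Series holds the per-occurrence totals and the second the per-row totals, which is the same distinction as between your result and the Kaggle result.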
CodePudding user response:
str.count(pat=...) counts the number of times the pattern occurs in each string, so a single row can add 2 (or more) to the total. "tropical" in desc only evaluates to True or False, so it counts each row once even if the word is repeated.
For instance, this DataFrame with two entries sums to 3 under the "count" construct:
df = pd.DataFrame({'name':['tropical','tropicaltropical']})
df.name.str.count(pat ="tropical").sum()
The "in" construct will sum to only 2, one per row.
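Both constructs can be run side by side on that same two-row frame. This is only a small sketch; count_total and in_total are names chosen here for illustration:

```python
import pandas as pd

df = pd.DataFrame({"name": ["tropical", "tropicaltropical"]})

# str.count: 1 occurrence in the first row + 2 in the second.
count_total = df["name"].str.count("tropical").sum()

# "in": True for both rows, and True sums as 1.
in_total = df["name"].map(lambda s: "tropical" in s).sum()

print(count_total, in_total)  # 3 2
```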