pandas/Python: summing data gives a different result

Time:04-27

I have a conceptual problem.

I'm working with pandas on Kaggle to learn and practice a new skill. I tried to solve an exercise, but I don't understand why the result is different from what I expected.

question:

"There are only so many words you can use when describing a bottle of wine. Is a wine more likely to be "tropical" or "fruity"? Create a Series descriptor_counts counting how many times each of these two words appears in the description column in the dataset. (For simplicity, let's ignore the capitalized versions of these words.)"

my answer:

tropical_count = reviews["description"].str.count(pat="tropical").sum()
fruity_count = reviews["description"].str.count(pat="fruity").sum()

descriptor_counts = pd.Series({"tropical": tropical_count, "fruity": fruity_count}, index=["tropical", "fruity"])

Kaggle answer:

n_trop = reviews.description.map(lambda desc: "tropical" in desc).sum()
n_fruity = reviews.description.map(lambda desc: "fruity" in desc).sum()
descriptor_counts = pd.Series([n_trop, n_fruity], index=['tropical', 'fruity'])

Everything works great, but the results are different. Does anyone know why?

my result

tropical    3703
fruity      9259
dtype: int64

Kaggle result

tropical    3607
fruity      9090
dtype: int64

CodePudding user response:

This output is expected. str.count counts every occurrence of the substring in each row, whereas the in operator only tests whether the substring occurs at all, so its result is just True or False. When you sum booleans, True is treated as 1 and False as 0, so each row contributes at most 1 and the totals differ.

Sample:

import pandas as pd

reviews = pd.DataFrame(["Ttropical are tropical so fruity words you can",
                        "fruity ",
                        "fruity fruity",
                        "anythi"], columns=['description'])

# Number of occurrences of each word per row
tropical_count = reviews["description"].str.count(pat="tropical")
fruity_count = reviews["description"].str.count(pat="fruity")
print(tropical_count)
0    2
1    0
2    0
3    0
Name: description, dtype: int64
print(fruity_count)
0    1
1    1
2    2
3    0
Name: description, dtype: int64

# True/False per row: does the word appear at least once?
n_trop = reviews.description.map(lambda desc: "tropical" in desc)
n_fruity = reviews.description.map(lambda desc: "fruity" in desc)
print(n_trop)
0     True
1    False
2    False
3    False
Name: description, dtype: bool

print(n_fruity)
0     True
1     True
2     True
3    False
Name: description, dtype: bool
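
If the goal is to reproduce the Kaggle numbers while staying with the string accessor, one option is to count rows rather than occurrences. The following is only a sketch on the same toy data as above, not the Kaggle solution itself; the variable names are made up for illustration.

```python
import pandas as pd

reviews = pd.DataFrame(
    ["Ttropical are tropical so fruity words you can",
     "fruity ",
     "fruity fruity",
     "anythi"],
    columns=["description"],
)

# Total occurrences: a word repeated within a row is counted every time.
total_occurrences = reviews["description"].str.count("fruity").sum()

# Rows containing the word at least once, written three equivalent ways.
rows_via_in = reviews["description"].map(lambda desc: "fruity" in desc).sum()
rows_via_contains = reviews["description"].str.contains("fruity", regex=False).sum()
rows_via_count = (reviews["description"].str.count("fruity") > 0).sum()

print(total_occurrences)                               # 4
print(rows_via_in, rows_via_contains, rows_via_count)  # 3 3 3
```

Note that str.count and str.contains treat the pattern as a regular expression by default; for plain words like these it makes no difference, and regex=False makes str.contains do a literal substring test like the in operator.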

CodePudding user response:

str.count(pat=...) counts the number of times the pattern occurs in each string, so a single row can add 2 (or more) to the sum. "tropical" in desc evaluates only to True or False, so it counts a row once even if the word is repeated.

For instance, this DataFrame with two entries sums to 3 under the "count" construct:

import pandas as pd

df = pd.DataFrame({'name': ['tropical', 'tropicaltropical']})
df['name'].str.count(pat="tropical").sum()

The "in" construct will sum only 2, one per row.
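
A quick way to check both totals side by side on that two-row frame (a small sketch; count_total and in_total are just illustrative names):

```python
import pandas as pd

df = pd.DataFrame({"name": ["tropical", "tropicaltropical"]})

# Occurrences per row are 1 and 2, so the sum is 3.
count_total = df["name"].str.count("tropical").sum()

# Each row yields True, so the sum is 2 (one per row).
in_total = df["name"].map(lambda s: "tropical" in s).sum()

print(count_total)  # 3
print(in_total)     # 2
```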
