Number of items in list (from pandas column) increasing for unknown reason?

I am working through an online notebook: https://medium.com/@anateresa.mdneto/starbucks-capstone-project-79f84b2a1558

and I ran into an issue. The details of that issue are not important for now.

After beating my head against it I decided to 'start from square one' to try to resolve it.

Starting point: transcript.shape --> (306534, 4) NOTE the number of rows, 306534.

There is a column of dicts in the df. I want to find out how many unique keys there are and how many times each key appears. NOTE: There are four keys - but - two of the keys are really the same except for a typo. So I ultimately want to end up with three.

I start by extracting the column and converting it to a dict so I am just working with plain Python. I also double-check the len to make sure nothing goofy happened.

TEMP = transcript['value'].to_dict()
print(len(TEMP))  # 306534   <-- So far no surprises.

TEMP is a dict of dicts in which the keys are the index and the values are the dicts that I really want. So I extract the values into a list with a comprehension:

dkeys = [v for k, v in TEMP.items()]
print(len(dkeys))  # 306534   <-- Still no surprises.

Sanity check:

from pprint import pprint
pprint(dkeys[:10])

[{'offer id': '9b98b8c7a33c4b65b9aebfe6a799e6d9'},
 {'offer id': '0b1e1539f2cc45b7b9fa7c272da2e1d7'},
 {'offer id': '2906b810c7d4411798c6938adc9daaa5'},
 {'offer id': 'fafdcd668e3743c1bb461111dcafc2a4'},
 {'offer id': '4d5c57ea9a6940dd891ad53e9dbe8da0'},
 {'offer id': 'f19421c1d4aa40978ebb69ca19b0e20d'},
 {'offer id': '2298d6c36e964ae4a3e7e9706d1fb8c2'},
 {'offer id': '3f207df678b143eea3cee63160fa8bed'},
 {'offer id': '0b1e1539f2cc45b7b9fa7c272da2e1d7'},
 {'offer id': '0b1e1539f2cc45b7b9fa7c272da2e1d7'}]

Here is where the wheels come off the wagon. Now I extract the keys from the dict to count them. I also cast as a set to get the unique values.

from collections import Counter

total_keys = [k for d in dkeys for k in d]  # every key of every dict, flattened
print(len(total_keys))
print(Counter(total_keys))
unique_keys = set(total_keys)
print(unique_keys)

340113
Counter({'amount': 138953, 'offer id': 134002, 'offer_id': 33579, 'reward': 33579})
{'offer id', 'amount', 'reward', 'offer_id'}

What?! 340113? How did that happen? My count increased by 33579. Coincidence that two of my counts in the Counter above are that same value? Somehow I doubt it.

What am I missing or doing wrong? All I am doing is extracting the keys and counting them. Why is the total number of items increasing?

After downloading the code from the links in the article, I load the files with:

import pandas as pd

file1 = "e:\\python\\pandas\\datasets\\Starbucks\\portfolio.json"
portfolio = pd.read_json(file1, orient='records', lines=True)

file2 = "e:\\python\\pandas\\datasets\\Starbucks\\profile.json"
profile = pd.read_json(file2, orient='records', lines=True)

file3 = "e:\\python\\pandas\\datasets\\Starbucks\\transcript.json"
transcript = pd.read_json(file3, orient='records', lines=True)

Answer:

The answer is simple: you are not comparing the same thing.

In the initial case you count the number of dictionaries/rows:

dkeys = list(transcript['value'])
len(dkeys)
# 306534

(NB. I took the opportunity to simplify your code, which was unnecessarily complicated.)

But then you count the total number of keys across all the dictionaries:

total_keys = [k for d in dkeys for k in d]
len(total_keys)
# 340113

Some dictionaries simply have more than one key, and you flatten/count those in your list comprehension.

[d for d in dkeys if len(d)>1]

[{'offer_id': '2906b810c7d4411798c6938adc9daaa5', 'reward': 2},
 {'offer_id': 'fafdcd668e3743c1bb461111dcafc2a4', 'reward': 2},
 {'offer_id': '9b98b8c7a33c4b65b9aebfe6a799e6d9', 'reward': 5},
 ...
]
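The gap between the two counts can be reproduced on a small hand-made list. This is only a sketch: the five dictionaries below are made-up stand-ins for transcript['value'], and the names n_dicts, n_keys and size_counts are introduced here for illustration.

```python
from collections import Counter

# Toy stand-in for list(transcript['value']); the values are invented.
dkeys = [
    {'offer id': 'a1'},
    {'amount': 3.5},
    {'offer_id': 'b2', 'reward': 2},
    {'offer id': 'c3'},
    {'offer_id': 'd4', 'reward': 5},
]

n_dicts = len(dkeys)                           # one count per row
n_keys = sum(len(d) for d in dkeys)            # one count per key, over all rows
size_counts = Counter(len(d) for d in dkeys)   # how many dicts have 1 key, 2 keys, ...

print(n_dicts)      # 5
print(n_keys)       # 7
print(size_counts)  # Counter({1: 3, 2: 2})
```

Each two-key dictionary contributes one extra item to the flattened list, so n_keys exceeds n_dicts by exactly the number of two-key dictionaries. Running the same Counter(len(d) for d in dkeys) on the real column would be a quick way to see that breakdown directly.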

I double-checked: you have 33579 dictionaries with 2 keys, and all the others have a single key. 306534 + 33579 = 340113: all are counted!
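Since the stated goal is to end up with three keys, one possible way to merge the 'offer id' / 'offer_id' typo before counting is sketched below. It assumes the only difference between the two spellings is the space versus the underscore; normalize_key is a hypothetical helper, and the toy list again stands in for the real column.

```python
from collections import Counter

# Toy stand-in for list(transcript['value']); the values are invented.
dkeys = [
    {'offer id': 'a1'},
    {'amount': 3.5},
    {'offer_id': 'b2', 'reward': 2},
    {'offer id': 'c3'},
    {'offer_id': 'd4', 'reward': 5},
]

def normalize_key(key):
    """Hypothetical helper: map 'offer id' onto 'offer_id'."""
    return key.replace(' ', '_')

# Rebuild each dict with normalized keys, leaving the values untouched.
normalized = [{normalize_key(k): v for k, v in d.items()} for d in dkeys]

key_counts = Counter(k for d in normalized for k in d)
print(key_counts)  # Counter({'offer_id': 4, 'reward': 2, 'amount': 1})
```

On the real data, given the Counter shown in the question, this should leave three keys, with 'offer_id' at 134002 + 33579 = 167581 and the other two counts unchanged.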