I am working through an online notebook: https://medium.com/@anateresa.mdneto/starbucks-capstone-project-79f84b2a1558
and I ran into an issue. Not important for now.
After beating my head against it I decided to 'start from square one' to try to resolve it.
Starting point: transcript.shape --> (306534, 4) NOTE the number of rows, 306534.
- There is a column of dicts in the df. I want to find out how many unique keys there are and how many times each key appears. NOTE: There are four keys - but - two of the keys are really the same except for a typo. So I ultimately want to end up with three.
I start by extracting the column and casting it as a dict so I am just working with python. I also double check the len to make sure nothing goofy happened.
TEMP = transcript['value'].to_dict()
print(len(TEMP)) # 306534 <-- So far no surprises.
- TEMP is a dict of dicts in which the keys are the index and the values are the dicts that I really want. So I extract the values into a list with a comprehension:
dkeys = [v for k, v in TEMP.items()]
print(len(dkeys)) # 306534 <-- Still no surprises.
Sanity check:
pprint(dkeys[:10])
[{'offer id': '9b98b8c7a33c4b65b9aebfe6a799e6d9'},
{'offer id': '0b1e1539f2cc45b7b9fa7c272da2e1d7'},
{'offer id': '2906b810c7d4411798c6938adc9daaa5'},
{'offer id': 'fafdcd668e3743c1bb461111dcafc2a4'},
{'offer id': '4d5c57ea9a6940dd891ad53e9dbe8da0'},
{'offer id': 'f19421c1d4aa40978ebb69ca19b0e20d'},
{'offer id': '2298d6c36e964ae4a3e7e9706d1fb8c2'},
{'offer id': '3f207df678b143eea3cee63160fa8bed'},
{'offer id': '0b1e1539f2cc45b7b9fa7c272da2e1d7'},
{'offer id': '0b1e1539f2cc45b7b9fa7c272da2e1d7'}]
- Here is where the wheels come off the wagon. Now I extract the keys from the dict to count them. I also cast as a set to get the unique values.
total_keys = [k for d in dkeys for k in d]
print(len(total_keys))
print(Counter(total_keys))
unique_keys = set([k for d in dkeys for k in d])
print(unique_keys)
340113
Counter({'amount': 138953, 'offer id': 134002, 'offer_id': 33579, 'reward': 33579})
{'offer id', 'amount', 'reward', 'offer_id'}
What?! 340113? How did that happen? My count increased by 33579. Coincidence that two of my counts in the Counter above are that same value? Somehow I doubt it.
What am I missing/doing wrong? All I am doing is extracting the keys and counting them. Why is the total number of items increasing?
After downloading the code from the links in the article, I load the files with:
file1 = "e:\\python\\pandas\\datasets\\Starbucks\\portfolio.json"
portfolio = pd.read_json(file1, orient='records', lines=True)
file2 = "e:\\python\\pandas\\datasets\\Starbucks\\profile.json"
profile = pd.read_json(file2, orient='records', lines=True)
file3 = "e:\\python\\pandas\\datasets\\Starbucks\\transcript.json"
transcript = pd.read_json(file3, orient='records', lines=True)
CodePudding user response:
The answer is simple: you are not comparing the same thing.
In the initial case you count the number of dictionaries/rows:
dkeys = list(transcript['value'])
len(dkeys)
# 306534
(NB. I took the opportunity to simplify your code that was unnecessarily complicated)
But then you count the cumulative number of keys across all the dictionaries:
total_keys = [k for d in dkeys for k in d]
len(total_keys)
# 340113
Some dictionaries simply have more than one key, and you flatten/count those in your list comprehension.
[d for d in dkeys if len(d)>1]
[{'offer_id': '2906b810c7d4411798c6938adc9daaa5', 'reward': 2},
{'offer_id': 'fafdcd668e3743c1bb461111dcafc2a4', 'reward': 2},
{'offer_id': '9b98b8c7a33c4b65b9aebfe6a799e6d9', 'reward': 5},
...
]
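To see this at a glance, you can tally how many keys each dictionary holds. The snippet below is only a sketch on a tiny made-up list (the `dkeys` values here are invented stand-ins, not the real Starbucks rows), but the same two lines should work on `list(transcript['value'])`:

```python
from collections import Counter

# Toy stand-in for list(transcript['value']); these values are made up.
dkeys = [
    {'offer id': 'a'},
    {'amount': 1.5},
    {'offer_id': 'b', 'reward': 2},
    {'offer id': 'c'},
    {'offer_id': 'a', 'reward': 5},
]

# How many dictionaries have 1 key, 2 keys, ...
size_counts = Counter(len(d) for d in dkeys)
print(size_counts)

n_rows = len(dkeys)                        # number of dictionaries (rows)
n_keys = sum(len(d) for d in dkeys)        # what the flattened comprehension counts
n_extra = sum(len(d) - 1 for d in dkeys)   # keys beyond the first in each dictionary
print(n_rows, n_keys, n_extra)
```

The flattened key count is always the row count plus the number of "extra" keys, which is why the two totals differ.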
I double-checked: you have 33579 dictionaries with 2 keys, and all the others have a single key. 306534 + 33579 = 340113: all are counted!
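As for ending up with three keys instead of four, one possible approach (not something from the article, just a sketch) is to normalize the key name before counting. The helper `normalize_key` is a hypothetical name, and it assumes the only typo is a space in place of an underscore, as in 'offer id' versus 'offer_id'. The data is again a made-up toy list:

```python
from collections import Counter

# Toy stand-in for list(transcript['value']); these values are made up.
dkeys = [
    {'offer id': 'a'},
    {'amount': 1.5},
    {'offer_id': 'b', 'reward': 2},
    {'offer id': 'c'},
    {'offer_id': 'a', 'reward': 5},
]

def normalize_key(key):
    """Hypothetical helper: treat 'offer id' and 'offer_id' as the same key."""
    return key.replace(' ', '_')

# Count keys after normalization, so the typo'd variant is merged.
key_counts = Counter(normalize_key(k) for d in dkeys for k in d)
print(key_counts)

unique_keys = set(key_counts)
print(unique_keys)
```

On the full data, going by the Counter output in the question, this would merge the 134002 'offer id' and 33579 'offer_id' entries into a single count of 167581, while the total stays at 340113.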