I am trying to convert my pandas DataFrame data into a different medium which is easily represented via JSON. I have chosen to do this by turning it into python dictionaries then converting it into JSON.
The problem I am encountering is that the formatted data comes out with different values than expected - the values for every month are replaced by the values from the last iteration of my for loop.
Here is a reproducible example, which is split between 2 files:
import re
import pandas as pd
import json
from help import Model # Note! this is another file help.py
jan = {'Month': ["January", "January", "January", "January"],
'Date': ['1st', '2nd', '28th', '29th'],
'a': ["j1a", "a3x", "d9c", "h9c"],
'b': ["X1", "SG", "DV", "XP"]}
dec = {'Month': ["December", "December", "December", "December"],
'Date': ['1st', '2nd', '28th', '29th'],
'a': ["d1a", "o3x", "j9c", "h9c"],
'b': ["X2", "SG", "DV", "XP"]}
a = pd.DataFrame.from_dict(jan)
b = pd.DataFrame.from_dict(dec)
dfs = [a, b]
df = pd.concat(dfs)
DateNum = []
for values in df['Date']:
    # extract the leading day number, e.g. '28th' -> '28'
    DateNum.append(re.search(r'\d+', values).group())
df['Date Num'] = DateNum
df.reset_index(drop=True, inplace=True)
dfl = df.Month.tolist()
months = []
for data in dfl:
    if data not in months:
        months.append(data)
# months = ['January', 'December']
models = []
for month in months:
    models.append(Model(month))
calendar = {}
for month in models:
    datacopy = df.copy()
    datacopy = datacopy[datacopy.Month == month.name]
    month.data = datacopy
    month.update(debug=True)
    calendar[month.name] = month.days
print(json.dumps(calendar, indent=4))
Here is the other file - help.py contains the classes Model and Day
class Model:
    """
    model for months
    """
    name = ""
    data = None
    days = {}
    def __init__(self, monthname):
        self.name = monthname
    def update(self, debug=False):
        edit = self.data # a copy of a slice from the df
        edit = edit.drop("Month", axis=1) # drop Month column
        edit = edit.set_index('Date Num').T.to_dict('list') # set Date Num column to be the index and make dict
        data_formatted = {self.name: edit} # save the dict with key as month name as data_formatted
        for k, v in data_formatted[self.name].items(): # data_formatted [month] = (day number : data)
            if debug:
                print(k, v) # e.g. k=1 v=['1st', 'a', 'n']
            day_object = Day(v) # make a day object out of the values (formatting in initializer)
            self.days[k] = day_object.data_formatted # assign the formatted value e.g. days[1] = (formatted data)
            # print(self.days[k]) # shows correct data e.g. {'date': '25th', 'a': 'a', 'b': 'n', 'c': 'x'}
class Day:
    date = ""
    a = ""
    b = ""
    data_formatted = {}
    def __init__(self, data):
        self.date = data[0]
        self.a = data[1]
        self.b = data[2]
        self.format_data()
    def format_data(self):
        self.data_formatted = {
            "date": self.date,
            "a": self.a,
            "b": self.b,
        }
The debug output shows that the data is processed in the expected order:
1 ['1st', 'j1a', 'X1']
2 ['2nd', 'a3x', 'SG']
28 ['28th', 'd9c', 'DV']
29 ['29th', 'h9c', 'XP']
1 ['1st', 'd1a', 'X2']
2 ['2nd', 'o3x', 'SG']
28 ['28th', 'j9c', 'DV']
29 ['29th', 'h9c', 'XP']
But the output of the json.dumps is different (identical to the last month in months):
{
"January": {
"1": {
"date": "1st",
"a": "d1a", - Should be j1a
"b": "X2" - should be X1
},
"2": {
"date": "2nd",
"a": "o3x", - Should be a3x
"b": "SG"
} ...
Thank you for reading this and I hope you can help me.
Here are some other notes:
- The code without the Model class is being run in an interactive python notebook - could this change things?
- The code I have provided only shows 2 months. In my case, the data from the last month (which I assume to be the last iteration) is being saved as the data for ALL the months.
Answer:
The problem is here:
month.data = datacopy
month.update(debug=True)
calendar[month.name] = month.days
That's fine the first time around, but in the next iteration, you change the data and rerun .update for month, but its .days is still the same dictionary. So, you're not just updating the dictionary for the next month, but also for all previous months.
Edit: you asked for some clarification in the comments - that's fine, it's perhaps not immediately obvious.
The problem starts here, in your Model class:
class Model:
    ...
    # this is the only place a new dictionary is created
    days = {}
    def __init__(self, monthname):
        # after __init__, each object still refers to the single days dictionary on the class
        ...
    def update(self, debug=False):
        ...
        for k, v in data_formatted[self.name].items():
            ...
            day_object = Day(v)
            # so here, you just update that one dictionary
            self.days[k] = day_object.data_formatted
I've removed the code that doesn't contribute to the problem and added some comments to explain. The key problem is that you defined days in the body of Model - that means it's a class attribute, to which all instances of the class have access, but there's only one of it.
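To see the sharing in isolation, here is a minimal sketch without pandas. The class names Shared and PerInstance and their add method are hypothetical, made up for illustration; they are not part of the code in the question. The point it tries to show is that item assignment through self.days mutates whichever dictionary attribute lookup finds, and with a class attribute that is the same dictionary for every instance.

```python
class Shared:
    # Class attribute: one dict, created once when the class body runs.
    days = {}

    def add(self, key, value):
        # Item assignment mutates the dict found by attribute lookup,
        # which here is the single class-level dict.
        self.days[key] = value


class PerInstance:
    def __init__(self):
        # Instance attribute: a fresh dict for every object.
        self.days = {}

    def add(self, key, value):
        self.days[key] = value


jan, dec = Shared(), Shared()
jan.add("1", "j1a")
dec.add("1", "d1a")  # overwrites the entry that jan also sees
shared_same = jan.days is dec.days
shared_jan_value = jan.days["1"]

jan2, dec2 = PerInstance(), PerInstance()
jan2.add("1", "j1a")
dec2.add("1", "d1a")
separate_same = jan2.days is dec2.days
separate_jan_value = jan2.days["1"]

print(shared_same, shared_jan_value)      # True d1a
print(separate_same, separate_jan_value)  # False j1a
```

This also suggests why Day.data_formatted does not cause the same trouble even though it is declared on the class: format_data assigns a new dictionary to self.data_formatted, which creates an instance attribute, rather than mutating the shared one.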
If you need each instance of Model to have a unique instance of .days, you should just create it in __init__ (and you don't need it on the class body at all):
def __init__(self, monthname):
    self.name = monthname
    self.days = {}
So, the problem is not really to do with loops, the problem is the difference between a class attribute and an object attribute.
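As a rough end-to-end check, the following sketch applies that change to a simplified Model. It is an assumption-laden stand-in, not the original code: the rows argument to update and the data dictionary are hypothetical replacements for the DataFrame slice, so the example runs without pandas, and the Day class is folded into a plain dict literal.

```python
import json


class Model:
    def __init__(self, monthname):
        self.name = monthname
        self.days = {}  # per-instance dict, created for each month

    def update(self, rows):
        # rows is a hypothetical stand-in for the DataFrame slice:
        # it maps a day number to a [date, a, b] list.
        for day_num, (date, a, b) in rows.items():
            self.days[day_num] = {"date": date, "a": a, "b": b}


# Sample values copied from the question's data.
data = {
    "January": {"1": ["1st", "j1a", "X1"], "2": ["2nd", "a3x", "SG"]},
    "December": {"1": ["1st", "d1a", "X2"], "2": ["2nd", "o3x", "SG"]},
}

calendar = {}
for name, rows in data.items():
    model = Model(name)
    model.update(rows)
    calendar[name] = model.days

print(json.dumps(calendar, indent=4))
```

With days created in __init__, the January entry keeps "j1a" and "X1" after December has been processed, because each Model writes into its own dictionary.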