I have a Dataframe with columns that look like this:
df=pd.DataFrame()
df['symbol'] = ['A','B','C']
df['json_list'] = ['[{name:S&P500, perc:25, ticker:SPY, weight:1}]',
'[{name:S&P500, perc:25, ticker:SPY, weight:0.5}, {name:NASDAQ, perc:26, ticker:NASDAQ, weight:0.5}]',
'[{name:S&P500, perc:25, ticker:SPY, weight:1}]']
df['date'] = ['2022-01-01', '2022-01-02', '2022-01-02']
df:
symbol json_list date
0 A [{name:S&P500, perc:25, ticker:SPY, weight:1}] 2022-01-01
1 B [{name:S&P500, perc:25, ticker:SPY, weight:0.5... 2022-01-02
2 C [{name:S&P500, perc:25, ticker:SPY, weight:1}] 2022-01-02
Note the json_list
column is in json format.
What I want to do is separate the DataFrame into two DataFrames. df1
contains the rows where json_list
only has one item in the list and df2
contains the rows where json_list
only has two or more items in the list.
Therefore, the two DataFrames would look like this:
df1:
symbol json_list date
0 A [{name:S&P500, perc:25, ticker:SPY, weight:1}] 2022-01-01
1 C [{name:S&P500, perc:25, ticker:SPY, weight:1}] 2022-01-02
df2:
symbol json_list date
0 B [{name:S&P500, perc:25, ticker:SPY, weight:0.5... 2022-01-02
Is there a Pythonic way to do this?
Thank you in advance.
CodePudding user response:
You can try
cond = df.json_list.str.contains('}, {')
df1 = df[~cond]
df2 = df[cond]
CodePudding user response:
You can try count the {
count in json_list
column
c = df['json_list'].str.count('{')
df1 = df[c.eq(1)]
df2 = df[c.eq(2)]
print(df1)
symbol json_list date
0 A [{name:S&P500, perc:25, ticker:SPY, weight:1}] 2022-01-01
2 C [{name:S&P500, perc:25, ticker:SPY, weight:1}] 2022-01-02
print(df2)
symbol json_list date
1 B [{name:S&P500, perc:25, ticker:SPY, weight:0.5... 2022-01-02
CodePudding user response:
There is a problem with your JSON column since there are no quotes around the strings.
Assuming that you have a properly formatted JSON object stored as a string you should be able to decode it using json.JSONDecoder.decode
In the format it is currently you can get an not very elegant approach by simply doing the following
df1=df[df['json_list'].apply(lambda s: s.count('{')) == 1]
df2=df.drop(df1.index)
CodePudding user response:
You have several options and they all go by counting the occurrences of curly brackets. Either with str.count
, str.split
or with regex re.findall
:
# use the str.split method, looking for the closing curly bracket (or the opening bracket). Since it is a split, it will return always one element more than there are (it is not counting the occurrences)
num_json = df['json_list'].apply(lambda x: len(x.split('}')) -1)
# use re.findall to look for the bracket. This will return the the number of occurrences directly
num_json = df['json_list'].apply(lambda x: len(re.findall('\}', x)))
# count the curly brackets directly
num_json = df['json_list'].apply(lambda x: x.count('}'))
df1 = df[num_json == 1]
df2 = df[num_json > 1]
At the end, you can just slice up the table according to your counts