Python: Filter Dataframe into two separate dataframes-CodePudding

I have a Dataframe with columns that look like this:

df=pd.DataFrame()
df['symbol'] = ['A','B','C']
df['json_list'] = ['[{name:S&P500, perc:25, ticker:SPY, weight:1}]',
          '[{name:S&P500, perc:25, ticker:SPY, weight:0.5}, {name:NASDAQ, perc:26, ticker:NASDAQ, weight:0.5}]',
          '[{name:S&P500, perc:25, ticker:SPY, weight:1}]']
df['date'] = ['2022-01-01', '2022-01-02', '2022-01-02']
df:
    symbol  json_list                                         date
0   A       [{name:S&P500, perc:25, ticker:SPY, weight:1}]    2022-01-01
1   B       [{name:S&P500, perc:25, ticker:SPY, weight:0.5... 2022-01-02
2   C       [{name:S&P500, perc:25, ticker:SPY, weight:1}]    2022-01-02

Note the json_list column is in json format.

What I want to do is separate the DataFrame into two DataFrames. df1 contains the rows where json_list only has one item in the list and df2 contains the rows where json_list only has two or more items in the list.

Therefore, the two DataFrames would look like this:

df1:
    symbol  json_list                                         date
0   A       [{name:S&P500, perc:25, ticker:SPY, weight:1}]    2022-01-01
1   C       [{name:S&P500, perc:25, ticker:SPY, weight:1}]    2022-01-02

df2:
    symbol  json_list                                         date
0   B       [{name:S&P500, perc:25, ticker:SPY, weight:0.5... 2022-01-02

Is there a Pythonic way to do this?

Thank you in advance.

CodePudding user response：

You can try

cond = df.json_list.str.contains('}, {')
df1 = df[~cond]
df2 = df[cond]

CodePudding user response：

You can try count the { count in json_list column

c = df['json_list'].str.count('{')
df1 = df[c.eq(1)]
df2 = df[c.eq(2)]

print(df1)
  symbol                                       json_list        date
0      A  [{name:S&P500, perc:25, ticker:SPY, weight:1}]  2022-01-01
2      C  [{name:S&P500, perc:25, ticker:SPY, weight:1}]  2022-01-02

print(df2)

  symbol                                          json_list        date
1      B  [{name:S&P500, perc:25, ticker:SPY, weight:0.5...  2022-01-02

CodePudding user response：

There is a problem with your JSON column since there are no quotes around the strings.

Assuming that you have a properly formatted JSON object stored as a string you should be able to decode it using json.JSONDecoder.decode

In the format it is currently you can get an not very elegant approach by simply doing the following

df1=df[df['json_list'].apply(lambda s: s.count('{')) == 1]
df2=df.drop(df1.index)

CodePudding user response：

You have several options and they all go by counting the occurrences of curly brackets. Either with str.count, str.split or with regex re.findall:

# use the str.split method, looking for the closing curly bracket (or the opening bracket). Since it is a split, it will return always one element more than there are (it is not counting the occurrences)
num_json = df['json_list'].apply(lambda x: len(x.split('}')) -1)
# use re.findall to look for the bracket. This will return the the number of occurrences directly
num_json = df['json_list'].apply(lambda x: len(re.findall('\}', x)))
# count the curly brackets directly
num_json = df['json_list'].apply(lambda x: x.count('}'))

df1 = df[num_json == 1]
df2 = df[num_json > 1]

At the end, you can just slice up the table according to your counts