Python not recognizing list of dictionaries-CodePudding

I am still quite new to Python and have a problem creating the code for following problem. I have a dataframe with many columns that look like this:

Some of the rows have all the regions, others only have 1 or 3, each row has unique values.

This is an example for a cell content:

{"region":"Baden-W\u00fcrttemberg","percentage":0.169558},{"region":"Bayern","percentage":0.144416},{"region":"Bremen","percentage":0.010524},{"region":"Hamburg","percentage":0.022997},{"region":"Hessen","percentage":0.074644},{"region":"Niedersachsen","percentage":0.104073},{"region":"Nordrhein-Westfalen","percentage":0.264081},{"region":"Rheinland-Pfalz","percentage":0.063535},{"region":"Saarland","percentage":0.012278},{"region":"Schleswig-Holstein","percentage":0.028065},{"region":"Brandenburg","percentage":0.011304},{"region":"Mecklenburg-Vorpommern","percentage":0.013448},{"region":"Sachsen","percentage":0.019489},{"region":"Saxony-Anhalt","percentage":0.014617},{"region":"Th\u00fcringen","percentage":0.017346},{"region":"Berlin","percentage":0.029624}

I need to create a df where each region is its own column and the row is the corresponding percentage. Tried with following code to get the columns out, where column looks as follows:

import itertools as itts
lysts = [t[1] for t in column.iteritems()]
regions = set(itts.chain.from_iterable([[d["region"] for d in lyst] for lyst in lysts]))
print(regions)

Unfortunately I always get TypeError: string indices must be integers, and after checking it looks like its not recognizing the values as dictionaries but as a pandas Series/List.

Ultimately I want to have a df where each column is a region and the rows are their corresponding percentages. Something like this e.g.

CodePudding user response：

Is this how your desired output should look like for one row?

df = pd.DataFrame([item['percentage'] for item in region_list]).T
df.columns = [item['region'] for item in region_list]

>>>
Baden-Württemberg   Bayern  Bremen  Hamburg Hessen  Niedersachsen   NordrheinWestfalen  Rheinland-Pfalz Saarland    Schleswig-Holstein  Brandenburg Mecklenburg-Vorpommern  Sachsen Saxony-Anhalt   Thüringen   Berlin
0   0.169558    0.144416    0.010524    0.022997    0.074644    0.104073    0.264081    0.063535    0.012278    0.028065    0.011304    0.013448    0.019489    0.014617    0.017346    0.029624

CodePudding user response：

It looks like your data is probably all string type (if it were a list or array it would be enclosed in brackets), which means you'll need to parse it somehow. This is what the error is telling you - I can reproduce TypeError: string indices must be integers with this line:

In [4]: sample = '{"region":"Baden-W\u00fcrttemberg","percentage":0.169558},{"region":"Bayern","percentage":0.144416},{
    ...: "region":"Bremen","percentage":0.010524},{"region":"Hamburg","percentage":0.022997},{"region":"Hessen","percent
    ...: age":0.074644},{"region":"Niedersachsen","percentage":0.104073},{"region":"Nordrhein-Westfalen","percentage":0.
    ...: 264081},{"region":"Rheinland-Pfalz","percentage":0.063535},{"region":"Saarland","percentage":0.012278},{"region
    ...: ":"Schleswig-Holstein","percentage":0.028065},{"region":"Brandenburg","percentage":0.011304},{"region":"Mecklen
    ...: burg-Vorpommern","percentage":0.013448},{"region":"Sachsen","percentage":0.019489},{"region":"Saxony-Anhalt","p
    ...: ercentage":0.014617},{"region":"Th\u00fcringen","percentage":0.017346},{"region":"Berlin","percentage":0.029624
    ...: }'

In [5]: sample["region"]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [5], in <cell line: 1>()
----> 1 sample["region"]

TypeError: string indices must be integers

So, you'll need to parse the string data in each cell with json.loads, ast.literal_eval, or pd.read_json. I'll use the last option just because it will return a dataframe, which we can easily manipulate into the format we need. But just be warned that this will be a slow operation if you have a lot of data.

Something along these lines should do the trick though. You can walk through the applied operation using sample instead of data to understand what is happening at each step:

# apply on a Series/DataFrame column will run the applied function on each cell
parsed = df.delivery_by_region.apply(
    # in each cell:
    lambda data: (
        # read the cell as json. We'll put brackets around the
        # string so it's valid json (you can't just have comma-separated dicts)
        pd.read_json("["   data   "]")
        # move the region field into the index
        .set_index("region")
        # select the percentage column to return a Series. pandas will
        # interpret a series in the result of an apply as the other dimension
        # in the result, so we'll get a DataFrame back
        .percentage
    )
)

I made a dummy mock-up of your data with random-length subsets of the string you provided. Since it's the same string repeated over and over the values are all the same with the exception of the region list, but you get the idea:

 df = pd.DataFrame(
    [
        ("},{".join(sample[:-1].split("},{")[:np.random.randint(1, 16)])  "}") 
        for i in range(20)
    ],
    columns=["delivery_by_region"],
)

Running the above function on this results in the following DataFrame:

region  Baden-Württemberg    Bayern    Bremen   Hamburg  ...  Mecklenburg-Vorpommern   Sachsen  Saxony-Anhalt  Thüringen
0                0.169558  0.144416  0.010524  0.022997  ...                     NaN       NaN            NaN        NaN
1                0.169558  0.144416  0.010524  0.022997  ...                     NaN       NaN            NaN        NaN
2                0.169558  0.144416  0.010524  0.022997  ...                0.013448       NaN            NaN        NaN
3                0.169558  0.144416  0.010524  0.022997  ...                0.013448       NaN            NaN        NaN
4                0.169558  0.144416       NaN       NaN  ...                     NaN       NaN            NaN        NaN
5                0.169558  0.144416  0.010524  0.022997  ...                     NaN       NaN            NaN        NaN
6                0.169558  0.144416  0.010524  0.022997  ...                     NaN       NaN            NaN        NaN
7                0.169558  0.144416  0.010524       NaN  ...                     NaN       NaN            NaN        NaN
8                0.169558  0.144416  0.010524  0.022997  ...                0.013448  0.019489       0.014617        NaN
9                0.169558  0.144416  0.010524  0.022997  ...                     NaN       NaN            NaN        NaN
10               0.169558  0.144416  0.010524  0.022997  ...                     NaN       NaN            NaN        NaN
11               0.169558  0.144416       NaN       NaN  ...                     NaN       NaN            NaN        NaN
12               0.169558  0.144416  0.010524  0.022997  ...                     NaN       NaN            NaN        NaN
13               0.169558  0.144416       NaN       NaN  ...                     NaN       NaN            NaN        NaN
14               0.169558  0.144416  0.010524  0.022997  ...                0.013448  0.019489       0.014617   0.017346
15               0.169558  0.144416  0.010524  0.022997  ...                     NaN       NaN            NaN        NaN
16               0.169558  0.144416  0.010524  0.022997  ...                0.013448  0.019489            NaN        NaN
17               0.169558  0.144416  0.010524  0.022997  ...                     NaN       NaN            NaN        NaN
18               0.169558  0.144416  0.010524  0.022997  ...                     NaN       NaN            NaN        NaN
19               0.169558  0.144416       NaN       NaN  ...                     NaN       NaN            NaN        NaN

[20 rows x 15 columns]