Is there a way to convert a list of dictionaries in array format into individual columns in a datafr-CodePudding

I can't use Pyspark as an FYI!

My data looks like this:

0   { "CountryOfManufacture": "China", "Tags": ["U...
1   { "CountryOfManufacture": "China", "Tags": ["U...
2   { "CountryOfManufacture": "China", "Tags": [] }
3   { "CountryOfManufacture": "Japan", "Tags": ["3...
4   { "CountryOfManufacture": "Japan", "Tags": ["1...
... ...
222 { "CountryOfManufacture": "USA", "ShelfLife": ...
223 { "CountryOfManufacture": "USA", "ShelfLife": ...
224 { "CountryOfManufacture": "USA", "ShelfLife": ...
225 { "CountryOfManufacture": "USA", "ShelfLife": ...
226 { "CountryOfManufacture": "USA", "ShelfLife": .

So there are different values contained in the dictionary. I am only interested in the first one (Country of Manufacture) and want to split this out and then add to another dataframe.

Thanks!

CodePudding user response：

If all your dictionaries have the same keys (or even if they don't! see comment from Pranav below!), then pandas.DataFrame.from_records will work well (link to docs page).

import pandas as pd

data = [{'CountryOfManufacture': 'China', 'col_2': 'a'},
        {'CountryOfManufacture': 'Japan', 'col_2': 'b'},
        {'CountryOfManufacture': 'China', 'col_2': 'c'},
        {'CountryOfManufacture': 'USA', 'col_2': 'd'}]

df = pd.DataFrame.from_records(data)
print(df.head())

#   CountryOfManufacture col_2
# 0                China     a
# 1                Japan     b
# 2                China     c
# 3                  USA     d

if you just want one column, you can select that column after, df["CountryOfManufacture"], or use the exclude keyword and provide a list of all the columns you don't want df = pd.DataFrame.from_records(data, exclude=['col_2'])

CodePudding user response：

When I try to use from_records my result looks like this:

                                        CustomFields
0  { "CountryOfManufacture": "China", "Tags": ["U...
1  { "CountryOfManufacture": "China", "Tags": ["U...
2    { "CountryOfManufacture": "China", "Tags": [] }
3  { "CountryOfManufacture": "Japan", "Tags": ["3...
4  { "CountryOfManufacture": "Japan", "Tags": ["1...

I think this is because my data is in an unusual format. My data was provided in a CSV file originally, and this was one of the columns. All other columns were in integer/float/object format, whilst this column was already in dictionary format when you viewed it in Excel.

The data you used for your example below is formatted as I would expect, but this is what mine looks like when converted into a list:

['{ "CountryOfManufacture": "China", "Tags": ["USB Powered"] }', '{ "CountryOfManufacture": "China", "Tags": ["USB Powered"] }', '{ "CountryOfManufacture": "China", "Tags": [] }', '{ "CountryOfManufacture": "Japan", "Tags": ["32GB","USB Powered"] }', '{ "CountryOfManufacture": "Japan", "Tags": ["16GB","USB Powered"] }', '{ "CountryOfManufacture": "Japan", "Tags": ["32GB","USB Powered"] }', '{ "CountryOfManufacture": "Japan", "Tags": ["16GB","USB Powered"] }', '{ "CountryOfManufacture": "Japan", "Tags": ["32GB","USB Powered"] }', '{ "CountryOfManufacture": "Japan", "Tags": ["16GB","USB Powered"] }', '{ "CountryOfManufacture": "Japan", "Tags": ["32GB","USB Powered"] }', '{ "CountryOfManufacture": "Japan", "Tags": ["16GB","USB Powered"] }', '{ "CountryOfManufacture": "Japan", "Tags": ["32GB","USB Powered"] }', '{ "CountryOfManufacture": "Japan", "Tags": ["16GB","USB Powered"] }', '{ "CountryOfManufacture": "Japan", "Tags": ["32GB","USB Powered"] }', '{ "CountryOfManufacture": "Japan", "Tags": ["16GB","USB Powered"] }', '{ "CountryOfManufacture": "China", "Tags": ["Comedy"] }', ...

As you can see, I have additional quotes outside of each dictionary list, illustrated with a single line here: ['{ "CountryOfManufacture": "China", "Tags": ["USB Powered"] }'.

Is there a way to get around this without pyspark?

Thanks!