Home > database >  Is there a way to convert a list of dictionaries in array format into individual columns in a datafr
Is there a way to convert a list of dictionaries in array format into individual columns in a datafr

Time:10-13

I can't use Pyspark as an FYI!

My data looks like this:

0   { "CountryOfManufacture": "China", "Tags": ["U...
1   { "CountryOfManufacture": "China", "Tags": ["U...
2   { "CountryOfManufacture": "China", "Tags": [] }
3   { "CountryOfManufacture": "Japan", "Tags": ["3...
4   { "CountryOfManufacture": "Japan", "Tags": ["1...
... ...
222 { "CountryOfManufacture": "USA", "ShelfLife": ...
223 { "CountryOfManufacture": "USA", "ShelfLife": ...
224 { "CountryOfManufacture": "USA", "ShelfLife": ...
225 { "CountryOfManufacture": "USA", "ShelfLife": ...
226 { "CountryOfManufacture": "USA", "ShelfLife": .

So there are different values contained in the dictionary. I am only interested in the first one (Country of Manufacture) and want to split this out and then add to another dataframe.

Thanks!

CodePudding user response:

If all your dictionaries have the same keys (or even if they don't! see comment from Pranav below!), then pandas.DataFrame.from_records will work well (link to docs page).

import pandas as pd

data = [{'CountryOfManufacture': 'China', 'col_2': 'a'},
        {'CountryOfManufacture': 'Japan', 'col_2': 'b'},
        {'CountryOfManufacture': 'China', 'col_2': 'c'},
        {'CountryOfManufacture': 'USA', 'col_2': 'd'}]

df = pd.DataFrame.from_records(data)
print(df.head())

#   CountryOfManufacture col_2
# 0                China     a
# 1                Japan     b
# 2                China     c
# 3                  USA     d

if you just want one column, you can select that column after, df["CountryOfManufacture"], or use the exclude keyword and provide a list of all the columns you don't want df = pd.DataFrame.from_records(data, exclude=['col_2'])

CodePudding user response:

When I try to use from_records my result looks like this:

                                        CustomFields
0  { "CountryOfManufacture": "China", "Tags": ["U...
1  { "CountryOfManufacture": "China", "Tags": ["U...
2    { "CountryOfManufacture": "China", "Tags": [] }
3  { "CountryOfManufacture": "Japan", "Tags": ["3...
4  { "CountryOfManufacture": "Japan", "Tags": ["1...

I think this is because my data is in an unusual format. My data was provided in a CSV file originally, and this was one of the columns. All other columns were in integer/float/object format, whilst this column was already in dictionary format when you viewed it in Excel.

The data you used for your example below is formatted as I would expect, but this is what mine looks like when converted into a list:

['{ "CountryOfManufacture": "China", "Tags": ["USB Powered"] }', '{ "CountryOfManufacture": "China", "Tags": ["USB Powered"] }', '{ "CountryOfManufacture": "China", "Tags": [] }', '{ "CountryOfManufacture": "Japan", "Tags": ["32GB","USB Powered"] }', '{ "CountryOfManufacture": "Japan", "Tags": ["16GB","USB Powered"] }', '{ "CountryOfManufacture": "Japan", "Tags": ["32GB","USB Powered"] }', '{ "CountryOfManufacture": "Japan", "Tags": ["16GB","USB Powered"] }', '{ "CountryOfManufacture": "Japan", "Tags": ["32GB","USB Powered"] }', '{ "CountryOfManufacture": "Japan", "Tags": ["16GB","USB Powered"] }', '{ "CountryOfManufacture": "Japan", "Tags": ["32GB","USB Powered"] }', '{ "CountryOfManufacture": "Japan", "Tags": ["16GB","USB Powered"] }', '{ "CountryOfManufacture": "Japan", "Tags": ["32GB","USB Powered"] }', '{ "CountryOfManufacture": "Japan", "Tags": ["16GB","USB Powered"] }', '{ "CountryOfManufacture": "Japan", "Tags": ["32GB","USB Powered"] }', '{ "CountryOfManufacture": "Japan", "Tags": ["16GB","USB Powered"] }', '{ "CountryOfManufacture": "China", "Tags": ["Comedy"] }', ...

As you can see, I have additional quotes outside of each dictionary list, illustrated with a single line here: ['{ "CountryOfManufacture": "China", "Tags": ["USB Powered"] }'.

Is there a way to get around this without pyspark?

Thanks!

  • Related