I can't use Pyspark as an FYI!
My data looks like this:
0 { "CountryOfManufacture": "China", "Tags": ["U...
1 { "CountryOfManufacture": "China", "Tags": ["U...
2 { "CountryOfManufacture": "China", "Tags": [] }
3 { "CountryOfManufacture": "Japan", "Tags": ["3...
4 { "CountryOfManufacture": "Japan", "Tags": ["1...
... ...
222 { "CountryOfManufacture": "USA", "ShelfLife": ...
223 { "CountryOfManufacture": "USA", "ShelfLife": ...
224 { "CountryOfManufacture": "USA", "ShelfLife": ...
225 { "CountryOfManufacture": "USA", "ShelfLife": ...
226 { "CountryOfManufacture": "USA", "ShelfLife": .
So there are different values contained in the dictionary. I am only interested in the first one (Country of Manufacture) and want to split this out and then add to another dataframe.
Thanks!
CodePudding user response:
If all your dictionaries have the same keys (or even if they don't! see comment from Pranav below!), then pandas.DataFrame.from_records
will work well (link to docs page).
import pandas as pd
data = [{'CountryOfManufacture': 'China', 'col_2': 'a'},
{'CountryOfManufacture': 'Japan', 'col_2': 'b'},
{'CountryOfManufacture': 'China', 'col_2': 'c'},
{'CountryOfManufacture': 'USA', 'col_2': 'd'}]
df = pd.DataFrame.from_records(data)
print(df.head())
# CountryOfManufacture col_2
# 0 China a
# 1 Japan b
# 2 China c
# 3 USA d
if you just want one column, you can select that column after, df["CountryOfManufacture"]
, or use the exclude
keyword and provide a list of all the columns you don't want df = pd.DataFrame.from_records(data, exclude=['col_2'])
CodePudding user response:
When I try to use from_records my result looks like this:
CustomFields
0 { "CountryOfManufacture": "China", "Tags": ["U...
1 { "CountryOfManufacture": "China", "Tags": ["U...
2 { "CountryOfManufacture": "China", "Tags": [] }
3 { "CountryOfManufacture": "Japan", "Tags": ["3...
4 { "CountryOfManufacture": "Japan", "Tags": ["1...
I think this is because my data is in an unusual format. My data was provided in a CSV file originally, and this was one of the columns. All other columns were in integer/float/object format, whilst this column was already in dictionary format when you viewed it in Excel.
The data you used for your example below is formatted as I would expect, but this is what mine looks like when converted into a list:
['{ "CountryOfManufacture": "China", "Tags": ["USB Powered"] }', '{ "CountryOfManufacture": "China", "Tags": ["USB Powered"] }', '{ "CountryOfManufacture": "China", "Tags": [] }', '{ "CountryOfManufacture": "Japan", "Tags": ["32GB","USB Powered"] }', '{ "CountryOfManufacture": "Japan", "Tags": ["16GB","USB Powered"] }', '{ "CountryOfManufacture": "Japan", "Tags": ["32GB","USB Powered"] }', '{ "CountryOfManufacture": "Japan", "Tags": ["16GB","USB Powered"] }', '{ "CountryOfManufacture": "Japan", "Tags": ["32GB","USB Powered"] }', '{ "CountryOfManufacture": "Japan", "Tags": ["16GB","USB Powered"] }', '{ "CountryOfManufacture": "Japan", "Tags": ["32GB","USB Powered"] }', '{ "CountryOfManufacture": "Japan", "Tags": ["16GB","USB Powered"] }', '{ "CountryOfManufacture": "Japan", "Tags": ["32GB","USB Powered"] }', '{ "CountryOfManufacture": "Japan", "Tags": ["16GB","USB Powered"] }', '{ "CountryOfManufacture": "Japan", "Tags": ["32GB","USB Powered"] }', '{ "CountryOfManufacture": "Japan", "Tags": ["16GB","USB Powered"] }', '{ "CountryOfManufacture": "China", "Tags": ["Comedy"] }', ...
As you can see, I have additional quotes outside of each dictionary list, illustrated with a single line here: ['{ "CountryOfManufacture": "China", "Tags": ["USB Powered"] }'.
Is there a way to get around this without pyspark?
Thanks!