My database has a column where all the cells have a string of data. There are around 15-20 variables, where the information is assigned to the variables with an "=" and separated by a space. The number and names of the variables can differ in the individual cells... The issue I face is that the data is separated by spaces and so are some of the variables. The variable name is in every cell, so I can't just make the headers and add the values to the data frame like a csv. The solution also needs to be able to do this process automatically for all the new data in the database.
Example:
Cell 1: TITLE="Brothers Karamazov" AUTHOR="Fyodor Dostoevsky" PAGES="520"... RELEASED="1880".
Cell 2: TITLE="Moby Dick" AUTHOR="Herman Melville" PAGES="655"... MAIN CHARACTER="Ishmael".
I want to convert these strings of data into a structured dataframe like.
TITLE | AUTHOR | PAGES | RELEASED | MAIN |
---|---|---|---|---|
Brothers Karamazov | Fyodor Dostoevsky | 520 | 1880 | NaN |
Moby Dick | Herman Meville | 655 | NaN | Ishmael |
Any tips on how to move forwards? I have though about converting it into a JSON format by using the replace() function, before turning it into a dataframe, but have not yet succeeded. Any tips or ideas are much appreciated.
Thanks,
CodePudding user response:
I guess this sample is what you need.
import pandas as pd
# Helper function
def str_to_dict(cell) -> dict:
normalized_cell = cell.replace('" ', '\n').replace('"', '').split('\n')
temp = {}
for x in normalized_cell:
key, value = x.split('=')
temp[key] = value
return temp
list_of_cell = [
'TITLE="Brothers Karamazov" AUTHOR="Fyodor Dostoevsky" PAGES="520" RELEASED="1880"',
'TITLE="Moby Dick" AUTHOR="Herman Melville" PAGES="655" MAIN CHARACTER="Ishmael"'
]
dataset = [str_to_dict(i) for i in list_of_cell]
print(dataset)
"""
[{'TITLE': 'Brothers Karamazov', 'AUTHOR': 'Fyodor Dostoevsky', 'PAGES': '520', 'RELEASED': '1880'}, {'TITLE': 'Moby Dick', 'AUTHOR': 'Herman Melville', 'PAGES': '655', 'MAIN CHARACTER': 'Ishmael'}]
"""
df = pd.DataFrame(dataset)
df.head()
"""
TITLE AUTHOR PAGES RELEASED MAIN CHARACTER
0 Brothers Karamazov Fyodor Dostoevsky 520 1880 NaN
1 Moby Dick Herman Melville 655 NaN Ishmael
"""
CodePudding user response:
Pandas lib can read them from a .csv file and make a data frame - try this:
import pandas as pd
file = 'xx.csv'
data = pd.read_csv(file)
print(data)
CodePudding user response:
- Create a Python dictionary from your database rows.
- Then create Pandas dataframe using the function: pandas.DataFrame.from_dict
Something like this:
import pandas as pd
# Assumed data from DB, structure it like this
data = [
{
'TITLE': 'Brothers Karamazov',
'AUTHOR': 'Fyodor Dostoevsky'
}, {
'TITLE': 'Moby Dick',
'AUTHOR': 'Herman Melville'
}
]
# Dataframe as per your requirements
dt = pd.DataFrame.from_dict(data)