Convert string from database to dataframe

My database has a column in which every cell holds a string of data. Each string contains around 15-20 variables; a value is assigned to its variable with "=", and the pairs are separated by spaces. The number and names of the variables can differ between cells... The issue I face is that the pairs are separated by spaces, but some of the variable names and values contain spaces too. The variable name is repeated in every cell, so I can't just define the headers once and load the values into the data frame as I would with a CSV. The solution also needs to run automatically on all new data in the database.

Example:

Cell 1: TITLE="Brothers Karamazov" AUTHOR="Fyodor Dostoevsky" PAGES="520"... RELEASED="1880".

Cell 2: TITLE="Moby Dick" AUTHOR="Herman Melville" PAGES="655"... MAIN CHARACTER="Ishmael".

I want to convert these strings of data into a structured dataframe like this:

TITLE	AUTHOR	PAGES	RELEASED	MAIN CHARACTER
Brothers Karamazov	Fyodor Dostoevsky	520	1880	NaN
Moby Dick	Herman Melville	655	NaN	Ishmael

Any tips on how to move forward? I have thought about converting it into JSON format with the replace() function before turning it into a dataframe, but have not yet succeeded. Any tips or ideas are much appreciated.

Thanks,

CodePudding user response：

I guess this sample is what you need.

import pandas as pd


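# Helper function: turn one 'KEY="value" KEY="value"' cell into a dict
def str_to_dict(cell) -> dict:
    # A closing quote followed by a space ends a pair; split there,
    # then drop the remaining quotes
    normalized_cell = cell.replace('" ', '\n').replace('"', '').split('\n')
    temp = {}
    for x in normalized_cell:
        # Split on the first '=' only, so a value containing '=' stays intact
        key, value = x.split('=', 1)
        temp[key] = value
    return temp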


list_of_cell = [
    'TITLE="Brothers Karamazov" AUTHOR="Fyodor Dostoevsky" PAGES="520" RELEASED="1880"',
    'TITLE="Moby Dick" AUTHOR="Herman Melville" PAGES="655" MAIN CHARACTER="Ishmael"'
]


dataset = [str_to_dict(i) for i in list_of_cell]

print(dataset)
"""
[{'TITLE': 'Brothers Karamazov', 'AUTHOR': 'Fyodor Dostoevsky', 'PAGES': '520', 'RELEASED': '1880'}, {'TITLE': 'Moby Dick', 'AUTHOR': 'Herman Melville', 'PAGES': '655', 'MAIN CHARACTER': 'Ishmael'}]
"""

df = pd.DataFrame(dataset)
print(df.head())
"""
                TITLE             AUTHOR PAGES RELEASED MAIN CHARACTER
0  Brothers Karamazov  Fyodor Dostoevsky   520     1880            NaN
1           Moby Dick    Herman Melville   655      NaN        Ishmael
"""

CodePudding user response：

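If the column is exported to a .csv file, pandas can read it into a data frame. Note that read_csv only loads the raw strings; each cell still has to be split into its KEY="value" pairs afterwards. Try this: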

import pandas as pd
file = 'xx.csv'
data = pd.read_csv(file)
print(data)

CodePudding user response：

Build a list of Python dictionaries from your database rows, one dictionary per row.
Then create the pandas dataframe from that list, for example with pandas.DataFrame.from_dict

Something like this:

import pandas as pd

# Assumed data from DB, structure it like this
data = [
    {
        'TITLE': 'Brothers Karamazov',
        'AUTHOR': 'Fyodor Dostoevsky'
    }, {
        'TITLE': 'Moby Dick',
        'AUTHOR': 'Herman Melville'
    }
]

# Dataframe as per your requirements
dt = pd.DataFrame.from_dict(data)
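
The question also asks for new database rows to be handled automatically. One way to sketch that, assuming the table has an increasing integer id, is to remember the highest id already processed and fetch only the rows above it. Everything here is hypothetical: the in-memory SQLite database stands in for the real one, and the table `books`, the columns `id` and `raw`, and the function `fetch_new_records` are invented for the example.

```python
import re
import sqlite3

# Assumption: every pair looks like KEY="value" with no embedded double quotes.
PAIR_RE = re.compile(r'([^="]+?)\s*=\s*"([^"]*)"')


def fetch_new_records(conn, last_id):
    """Parse rows with id > last_id; return (records, highest id seen)."""
    rows = conn.execute(
        "SELECT id, raw FROM books WHERE id > ? ORDER BY id", (last_id,)
    ).fetchall()
    records = [
        {k.strip(): v for k, v in PAIR_RE.findall(raw)} for _, raw in rows
    ]
    new_last_id = rows[-1][0] if rows else last_id
    return records, new_last_id


# Hypothetical stand-in for the real database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (id INTEGER PRIMARY KEY, raw TEXT)")
conn.execute(
    "INSERT INTO books (raw) VALUES (?)",
    ('TITLE="Brothers Karamazov" AUTHOR="Fyodor Dostoevsky" PAGES="520" RELEASED="1880"',),
)

first_batch, last_id = fetch_new_records(conn, 0)

# A row added later is returned by the next call only.
conn.execute(
    "INSERT INTO books (raw) VALUES (?)",
    ('TITLE="Moby Dick" AUTHOR="Herman Melville" PAGES="655" MAIN CHARACTER="Ishmael"',),
)
second_batch, last_id = fetch_new_records(conn, last_id)
print(len(first_batch), len(second_batch), last_id)
```

Each batch is a list of dicts in the same shape as `data` above, so it can be turned into a dataframe with `pd.DataFrame(batch)` and appended to the existing one with `pd.concat`.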