I have a dataframe with just one column with content like:
view: meta_record_extract
dimension: e_filter
type: string
hidden: yes
sql: "SELECT * FROM files"
dimension: category
type: string
...
What I tried to produce would be a dataframe with columns and the data like this:
____________________________________________________________________
view | dimension |label | type | hidden | sql |
meta_record_extract| e_filter | NaN | string| yes |"SELECT * FROM files"
NaN | category | NaN | string ...
Given that splitting the string data like
df.header[0].split(': ')[0]
was giving me label with [0] or value with [1] I tried this:
df.pivot_table(df, columns = df.header.str.split(': ')[0], values = df.header.str.split(': ')[1])
but it did not work giving the error.
Can anyone help me to achieve the result I need?
CodePudding user response:
Use str.findall()
map
, as follows:
str.findall()
helps you extract the keyword and value pairs into a list. We then map the list of keyword-value pairs into a dict for pd.Dataframe
to turn the dict into a dataframe.
(Assuming the column label of your column is Col1
):
df_extract = df['Col1'].str.findall(r'(\w ):\s*(.*)')
df_result = pd.DataFrame(map(dict, df_extract))
Result:
print(df_result)
view dimension type hidden sql
0 meta_record_extract NaN NaN NaN NaN
1 NaN e_filter NaN NaN NaN
2 NaN NaN string NaN NaN
3 NaN NaN NaN yes NaN
4 NaN NaN NaN NaN "SELECT * FROM files"
5 NaN category NaN NaN NaN
6 NaN NaN string NaN NaN
Update
To compress the rows to minimize the NaN
's, we can further use .apply()
with .dropna()
, as follows:
df_compressed = df_result.apply(lambda x: pd.Series(x.dropna().values))
Result:
print(df_compressed)
view dimension type hidden sql
0 meta_record_extract e_filter string yes "SELECT * FROM files"
1 NaN category string NaN NaN