Home > other >  txt file with braces convert to dataframe
txt file with braces convert to dataframe

Time:11-27

i have a txt file which i got from zoom api. I want to transform it to a dataframe for later cleaning. But there is a column that comes with curly braces, it doesn't read it like a dictionary

open = pd.read_csv("registrados111.txt", sep = " ")
open.columns = ["page_size","total_records","next_page_token","registrants"]
open.head(1)

enter image description here

strong text

I want to be able to clean up that smear, but it won't allow me the square brackets. when I select what type is the dataframe I get pandas.core.frame.DataFrame. and when I only select the "registrants" column I get pandas.core.series.Series

I would like to have an idea of how to display that column in several others. Because I have more than 6000 thousand records that appear like this.

The output should be a dataframe like in the image

enter image description here

I've thought about using methods like explode or using split to cut it but it doesn't allow me to read the subscripts it generates later.

df = pd.DataFrame(union['registrants']).explode('custom_questions').reset_index(drop=True)
pe= pd.json_normalize(json.loads(ap.explode("custom_questions").to_json(orient="records")))

I would really appreciate anyone who could help me or give me a guide.

These are some records that I get from the txt file

300 6139 4D1YdjmRScmpymp "{'id': '6ciHOSm4Rw', 'first_name': 'yumi', 'last_name': 'napanga', 'email': '[email protected]', 'address': '', 'city': '', 'country': 'PE', 'zip': '', 'state': '', 'phone': '', 'industry': '', 'org': '', 'job_title': '', 'purchasing_time_frame': '', 'role_in_purchase_process': '', 'no_of_employees': '', 'comments': '', 'custom_questions': [{'title': 'Departamento/ Región', 'value': 'Lima'}, {'title': 'Género', 'value': 'Femenino'}, {'title': 'Edad', 'value': 'De 18 a 35 años'}, {'title': 'Nivel de estudio', 'value': 'Secundaria'}], 'status': 'approved', 'create_time': '2022-11-24T19:57:18Z'}" 300 6139 4D1YdjmRScmpy "{'id': 'DgyhfejIug', 'first_name': 'Artur', 'last_name': '', 'email': '[email protected]', 'address': '', 'city': '', 'country': 'CL', 'zip': '', 'state': '', 'phone': '', 'industry': '', 'org': '', 'job_title': '', 'purchasing_time_frame': '', 'role_in_purchase_process': '', 'no_of_employees': '', 'comments': '', 'custom_questions': [{'title': 'Departamento/ Región', 'value': 'CL'}, {'title': 'Género', 'value': 'Masculino'}, {'title': 'Edad', 'value': 'De 18 a 35 años'}, {'title': 'Nivel de estudio', 'value': 'Técnico / Superior'}], 'status': 'approved', 'create_time': '2022-11-24T17:22:44Z'}"

CodePudding user response:

This should do the trick - assuming df is your input dataframe:

import json

df["Registrant_formatted"] = df["Registrant"].str.replace("'", "\"").apply(json.loads)

Then you can navigate through the json using str handler - for example

>>> df["Registrant_formatted"].str["custom_questions"].str[0].str["title"]
0    Departamento/ Región
Name: Registrant_formatted, dtype: object

CodePudding user response:

you can use:

import ast
df['registrants']=df['registrants'].apply(ast.literal_eval)# to avoid getting this error: JSONDecodeError: Expecting property name enclosed in double quotes

df = df.join(pd.json_normalize(df['registrants'])) #normalize registrants column
dfx = pd.json_normalize(df.pop('registrants'),record_path=['custom_questions']).set_index('title').T.reset_index(drop=True) #custom questions column. Convert questions to column.
final = df.drop(['custom_questions'],axis=1).join(dfx)
print(final)
'''
|    |   page_size |   total_records | next_page_token   | id      | first_name   | last_name   | email        | address   | city   | country   | zip   | state   | phone   | industry   | org   | job_title   | purchasing_time_frame   | role_in_purchase_process   | no_of_employees   | comments   | status   | create_time          | Departamento/ Región   | Género    | Edad            | Nivel de estudio   |
|---:|------------:|----------------:|:------------------|:--------|:-------------|:------------|:-------------|:----------|:-------|:----------|:------|:--------|:--------|:-----------|:------|:------------|:------------------------|:---------------------------|:------------------|:-----------|:---------|:---------------------|:-----------------------|:----------|:----------------|:-------------------|
|  0 |         300 |            6139 | 4D1YdjmRScmpymp   | DvJh6zm | Artur        |             | [email protected] |           |        | CL        |       |         |         |            |       |             |                         |                            |                   |            | approved | 2022-11-24T17:22:44Z | Colombia               | Masculino | De 18 a 35 años | Universitario      |
'''
  • Related