Can't make Pandas recognize columns from string

Time:03-08

I was trying to get some information about cancer from the GDC (I'm new to this, so please don't judge). Following a tutorial on the website, I managed to get a response from a request, but I don't know how to transform it into a DataFrame (so I can merge it with other data). Below is how the data looks when printed as a string, where everything is fine:

[screenshot: the response printed as a string]

Then, when I turn it into a DataFrame, everything is compressed into a single column, with \t between the values. I don't understand how to turn it into a proper Pandas DataFrame. Here is how it looks: if I print the first column, all of this is printed, meaning the columns shown in the string version above are not recognized:

[screenshot: the resulting single-column DataFrame]

This is the full code from the tutorial page that I used:

from io import BytesIO
from io import StringIO
import ast
import pandas as pd
import requests
import json

fields = [
    "file_name",
    "cases.submitter_id",
    "cases.samples.sample_type",
    "cases.disease_type",
    "cases.project.project_id"
    ]

fields = ",".join(fields)

files_endpt = "https://api.gdc.cancer.gov/files"

# This set of filters is nested under an 'and' operator.
filters = {
    "op": "and",
    "content":[
        {
        "op": "in",
        "content":{
            "field": "cases.project.primary_site",
            "value": ["Breast"]
            }
        },
        {
        "op": "in",
        "content":{
            "field": "files.experimental_strategy",
            "value": ["RNA-Seq"]
            }
        }
    ]
}

# A POST is used, so the filter parameters can be passed directly as a Dict object.
params = {
    "filters": filters,
    "fields": fields,
    "format": "TSV", #TSV
    "size": "2000"
    }

# The parameters are passed to 'json' rather than 'params' in this case
response = requests.post(files_endpt, headers = {"Content-Type": "application/json"}, json = params)
string = response.content.decode("utf-8")
df = pd.read_csv(BytesIO(response.content),on_bad_lines='skip')
print(df)

CodePudding user response:

The data you are getting appears to be tab-delimited, so the following tweak should work:

df = pd.read_csv(BytesIO(response.content), sep='\t', on_bad_lines='skip')

Giving you a dataframe starting:

              cases.0.disease_type  ...                                    id
0     Ductal and Lobular Neoplasms  ...  37175dfe-e34e-4f97-88b1-c0ba4bd5d093
1     Ductal and Lobular Neoplasms  ...  319bc898-6d70-4c38-a177-37ed7824dd7a
2     Complex Epithelial Neoplasms  ...  42c461fe-31a4-4ee4-8d17-95a5da96a8eb
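
To see the effect of the separator without calling the API, here is a minimal sketch. The sample text and its values are made up for illustration (they are not real GDC data); it only assumes the response body is tab-separated, as it appears to be above.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample that mimics a tab-separated response body.
sample_tsv = (
    "file_name\tcases.0.submitter_id\tid\n"
    "a.tsv\tCASE-1\tid-1\n"
    "b.tsv\tCASE-2\tid-2\n"
)

# The default separator is a comma, so every line lands in one column.
wrong = pd.read_csv(StringIO(sample_tsv))

# Declaring the tab separator splits each line into its fields.
right = pd.read_csv(StringIO(sample_tsv), sep="\t")

print(wrong.shape)          # (2, 1)
print(right.shape)          # (2, 3)
print(list(right.columns))  # ['file_name', 'cases.0.submitter_id', 'id']
```

The same idea should apply to the decoded text from the question: since `string` is already a str, `pd.read_csv(StringIO(string), sep='\t')` would be an alternative to wrapping `response.content` in BytesIO.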