How do I convert a list of tab-separated lines (strings), fetched from a URL with requests, into a pandas DataFrame?

I'm trying to build a pandas DataFrame from a UniProt URL:

import requests
url = 'http://www.uniprot.org/uniprot/?query=Interferon lambda receptor 1&sort=score&format=tab'
req = requests.get(url)

The output of:

req.text

is something like this:

"Entry\tEntry name\tStatus\tProtein names\tGene names\tOrganism\tLength\nQ8IU57\tINLR1_HUMAN\treviewed\tInterferon lambda receptor 1 (IFN-lambda receptor 1) (IFN-lambda-R1) (Cytokine receptor class-II member 12) (Cytokine receptor family 2 member 12) (CRF2-12) (Interleukin-28 receptor subunit alpha) (IL-28 receptor subunit alpha) (IL-28R-alpha) (IL-28RA) (Likely interleukin or cytokine receptor 2).....

To get the lines I did:

lines = req.text.splitlines()

# lines is now a list with one string per row; within each row, columns are separated by '\t'

If I use:

import re
re.split(r'\t', lines[0])

this splits the header line into its columns correctly.

Out:
['Entry',
 'Entry name',
 'Status',
 'Protein names',
 'Gene names',
 'Organism',
 'Length']

However, if I try to do this for all lines in a for loop, calling string2list(lines) raises an error: list indices must be integers or slices, not str

import re

def string2list(file):
    list = []
    for i in lines:
        re.split(r'\t', lines[i])
        list  = lines
        return list

My aim is to get a list of lists to finally use this code:

import pandas as pd
list_name = lines
df = pd.DataFrame (list_name, columns = lines[i])

Any ideas on the best approach? Is it possible to convert a list of strings into a list of lists, and if so, what is the best way to do it? Or is there a way to get the pandas DataFrame directly from the URL? Thank you in advance!

CodePudding user response：

The simplest way to load that file into a DataFrame is pd.read_csv(), which accepts a URL directly.

import pandas as pd
url = 'http://www.uniprot.org/uniprot/?query=Interferon lambda receptor 1&sort=score&format=tab'
df = pd.read_csv(url, sep='\t')
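
One caveat: the URL above contains literal spaces. requests percent-encodes them for you, but the URL opener that pandas relies on may reject them in some Python versions, so encoding the query explicitly is the safer option. A minimal sketch, assuming the same endpoint and parameters as in the question; build_query_url and tsv_text_to_frame are hypothetical helper names, and the sample text is a shortened stand-in for the real response:

```python
import io
from urllib.parse import urlencode

import pandas as pd


def build_query_url(base, query):
    """Hypothetical helper: append percent-encoded query parameters to base."""
    params = {"query": query, "sort": "score", "format": "tab"}
    return base + "?" + urlencode(params)


def tsv_text_to_frame(text):
    """Hypothetical helper: parse already-downloaded tab-separated text."""
    return pd.read_csv(io.StringIO(text), sep="\t")


url = build_query_url("http://www.uniprot.org/uniprot/", "Interferon lambda receptor 1")
# url can now be passed to pd.read_csv(url, sep="\t") or requests.get(url)

# Shortened stand-in for req.text, using values quoted in the question
sample = "Entry\tEntry name\tStatus\nQ8IU57\tINLR1_HUMAN\treviewed\n"
df = tsv_text_to_frame(sample)
```

If you already have req.text from requests, tsv_text_to_frame(req.text) avoids downloading the data a second time.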

BTW, regarding your code:

def string2list(file):
    list = []
    for i in lines:
        re.split(r'\t', lines[i])
        list  = lines
        return list

There are several problems.

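file is unused.
lines is not defined inside the function (it relies on a global variable).
i is a string, not an integer, so lines[i] fails. You probably meant re.split(r'\t', i).
list = lines is probably not what you meant: it discards the split result and also shadows the built-in list.
Your return statement is inside the for loop rather than in the outer scope, so the function returns after the first iteration.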
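
The index problem can be reproduced in isolation. A small sketch (the two sample rows are shortened stand-ins for the real data) showing that a for loop over a list yields the elements themselves, and that enumerate() is one option when a position is really needed:

```python
lines = ["Entry\tEntry name", "Q8IU57\tINLR1_HUMAN"]

# Iterating a list yields its elements (strings here), not positions
for item in lines:
    assert isinstance(item, str)

# Using an element as an index reproduces the error from the question
try:
    lines[lines[0]]
    message = None
except TypeError as exc:
    message = str(exc)

# enumerate() provides both the position and the element
pairs = [(i, line.split("\t")) for i, line in enumerate(lines)]
```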

IIUC, I think you were aiming to write something like this:

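import re

def split_lines(file):
    # Read every line of the tab-separated file
    with open(file, 'r') as f:
        lines = f.readlines()

    # Split each line on tabs into a list of column values
    results = []
    for line in lines:
        words = re.split(r'\t', line.strip())
        results.append(words)
    return results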
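
As a possible alternative, the standard-library csv module can do the splitting without a regular expression. This is only a sketch: split_tsv_text is a hypothetical name and the sample text is made up for illustration. One difference worth noting is that line.strip() removes trailing tabs, so a row whose last column is empty ends up one field short, whereas csv.reader keeps the empty field:

```python
import csv
import io


def split_tsv_text(text):
    """Hypothetical helper: return a list of rows, each a list of column strings."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    return [row for row in reader]


# Made-up sample: the last column of the data row is empty
sample = "Entry\tEntry name\tGene names\nQ8IU57\tINLR1_HUMAN\t\n"
rows = split_tsv_text(sample)

# For comparison, strip-then-split loses the trailing empty column
stripped = [line.strip().split("\t") for line in sample.splitlines()]
```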

CodePudding user response：

Thank you very much, Stuart, for suggesting `pd.read_csv()`. It does exactly what I needed, very efficiently!

Thank you also for correcting the for loop. With your input, this version worked too:

import re
import requests
import pandas as pd

def split_lines2(url):
    req = requests.get(url)
    lines = req.text.splitlines()

    results = []
    for line in lines:
        words = re.split(r'\t', line.strip())
        results.append(words)
    return results

test_x = split_lines2(my_url)  # my_url is the UniProt URL; this gives a list of lists
df = pd.DataFrame(data=test_x, columns=test_x[0])
df_drop_row_1 = df.drop(df.index[0])  # drop the header row that was also loaded as data
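
A slightly shorter variant, sketched here under the assumption that the first row is always the header, separates the header from the data before building the DataFrame. That way no row has to be dropped afterwards and the index starts at 0. rows_to_frame is a hypothetical name, and the sample rows reuse values quoted in the question:

```python
import pandas as pd


def rows_to_frame(rows):
    """Hypothetical helper: use the first row as column names, the rest as data."""
    header, *body = rows
    return pd.DataFrame(body, columns=header)


rows = [
    ["Entry", "Entry name", "Status"],
    ["Q8IU57", "INLR1_HUMAN", "reviewed"],
]
df = rows_to_frame(rows)
```

All values built this way are strings, so a numeric column such as Length would still need converting, for example with pd.to_numeric.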

Thanks again:)