I'm trying to build a pandas DataFrame from a UniProt URL:
import requests
url = 'http://www.uniprot.org/uniprot/?query=Interferon lambda receptor 1&sort=score&format=tab'
req = requests.get(url)
The output of:
req.text
is something like this:
""Entry\tEntry name\tStatus\tProtein names\tGene names\tOrganism\tLength\nQ8IU57\tINLR1_HUMAN\treviewed\tInterferon lambda receptor 1 (IFN-lambda receptor 1) (IFN-lambda-R1) (Cytokine receptor class-II member 12) (Cytokine receptor family 2 member 12) (CRF2-12) (Interleukin-28 receptor subunit alpha) (IL-28 receptor subunit alpha) (IL-28R-alpha) (IL-28RA) (Likely interleukin or cytokine receptor 2).....
To get the lines I did:
lines = req.text.splitlines()
# lines is a list of strings, one per row: ['...', '...', ...]; within each row, columns are separated by '\t'
If I use:
import re
re.split(r'\t', lines[0])
this correctly splits the first line into its columns.
Out:
['Entry',
'Entry name',
'Status',
'Protein names',
'Gene names',
'Organism',
'Length']
However, when I put this in a for loop to process all lines, calling string2list(lines) raises an error: TypeError: list indices must be integers or slices, not str
import re
def string2list(file):
    list = []
    for i in lines:
        re.split(r'\t', lines[i])
        list = lines
        return list
My aim is to get a list of lists to finally use this code:
import pandas as pd
list_name = lines
df = pd.DataFrame (list_name, columns = lines[i])
Any ideas on the best approach? Is it possible to convert a list of strings into a list of lists, and if so, what is the best way? Or is there a way to get the pandas DataFrame directly from the URL? Thank you in advance!
CodePudding user response:
The simplest way to load that file into a DataFrame is to use pd.read_csv(), which accepts a URL as input.
import pandas as pd
url = 'http://www.uniprot.org/uniprot/?query=Interferon lambda receptor 1&sort=score&format=tab'
df = pd.read_csv(url, sep='\t')
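As a rough sketch of the same call without network access: pd.read_csv() also accepts a file-like object, so tab-separated text that is already in memory can be wrapped in io.StringIO. The sample text below copies the header from the question, but the two data rows are made up for illustration and are not real UniProt entries.

```python
import io

import pandas as pd

# Hypothetical sample in the same tab-separated layout as the question's output;
# the two data rows are invented placeholders, not real UniProt records.
sample_text = (
    "Entry\tEntry name\tStatus\tProtein names\tGene names\tOrganism\tLength\n"
    "X00001\tEXAMPLE1_TEST\treviewed\tExample protein 1\tGENE1\tExample organism\t120\n"
    "X00002\tEXAMPLE2_TEST\tunreviewed\tExample protein 2\tGENE2\tExample organism\t85\n"
)

# The first line is used as the header; sep="\t" splits each row on tabs.
df = pd.read_csv(io.StringIO(sample_text), sep="\t")
print(df.shape)
print(list(df.columns))
```

With the text from the question, the equivalent call would be pd.read_csv(io.StringIO(req.text), sep='\t').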
BTW, Regarding your code:
def string2list(file):
    list = []
    for i in lines:
        re.split(r'\t', lines[i])
        list = lines
        return list
There are several problems:
- file is unused.
- lines is undefined.
- i is a string, not an integer. Therefore, you probably meant re.split(r'\t', i).
- list = lines is probably not what you meant...
- Your return statement is inside the for loop, rather than in the outer scope.
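The TypeError arises because a for loop over a list yields the elements themselves, not their positions, so i is already the line text. A small sketch with a made-up two-line list (the values are placeholders):

```python
# Hypothetical stand-in for the `lines` list from the question.
lines = ["a\tb\tc", "1\t2\t3"]

# Iterating yields each string directly, so `line` is already the row text.
for line in lines:
    print(line.split("\t"))

# Using an element as an index reproduces the error from the question.
try:
    lines[lines[0]]
except TypeError as exc:
    error_message = str(exc)
    print(error_message)

# If the position is needed as well, enumerate() provides both.
indexed = [(i, line.split("\t")) for i, line in enumerate(lines)]
```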
IIUC, I think you were aiming to write something like this:
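import re

def split_lines(file):
    # Read all lines from a local tab-separated file.
    with open(file, 'r') as f:
        lines = f.readlines()
    results = []
    for line in lines:
        # Strip the line ending, then split the row on tabs.
        words = re.split(r'\t', line.strip())
        results.append(words)
    return results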
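If the text is already in memory rather than in a file, the same loop can be condensed into a list comprehension. This is only a sketch: split_text is a hypothetical helper name, and it uses str.split('\t') instead of re.split, which is enough for a fixed single-character delimiter. Unlike line.strip(), it keeps empty trailing fields.

```python
def split_text(text):
    """Split tab-separated text into a list of rows (each a list of fields)."""
    # splitlines() removes the line endings, so no strip() is needed.
    return [line.split("\t") for line in text.splitlines()]


# Made-up sample; the second row ends with an empty field.
sample = "Entry\tStatus\tLength\nX00001\treviewed\t\n"
rows = split_text(sample)
print(rows)
```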
CodePudding user response:
Thank you very much, Stuart, for the pd.read_csv() suggestion. It does exactly what I needed in a very efficient way!
Thank you also for correcting the for loop. The following worked too:
import re
import requests
import pandas as pd

def split_lines2(url):
    req = requests.get(url)
    lines = req.text.splitlines()
    results = []
    for line in lines:
        words = re.split(r'\t', line.strip())
        results.append(words)
    return results

# my_url is the UniProt URL from the question; this gives a list of lists
test_x = split_lines2(my_url)
df = pd.DataFrame(data=test_x, columns=test_x[0])
# Drop the first row, which duplicates the header
df_drop_row_1 = df.drop(df.index[0])
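A slightly shorter variant, as a sketch: passing only the rows after the header as data avoids having to drop the duplicated first row afterwards. The list below is a made-up stand-in for test_x.

```python
import pandas as pd

# Hypothetical list of lists in the shape returned by split_lines2();
# the values are placeholders, not real UniProt data.
rows = [
    ["Entry", "Status", "Length"],
    ["X00001", "reviewed", "120"],
    ["X00002", "unreviewed", "85"],
]

# The first row becomes the column names, the rest become the data.
df = pd.DataFrame(rows[1:], columns=rows[0])

# Values split from text are strings; convert numeric columns explicitly.
df["Length"] = pd.to_numeric(df["Length"])
print(df)
```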
Thanks again:)