I have a file as follows:
>abc 123
MMKFKPNQTRTYSRYPDQWIVPGGGME
GAAVREVYEEAGVKGKLGRLLGIFEQN
NMNJ
>hik rre
MMKFKPNPGDREGFKKRAACLCFRSEQ
EDEVLLVSSQTRTYSRYPDQWIVPGGG
MEPEEE
>dmd kij
MMKFKPNQTRTYSRYPDQWIVPGGGME
>dmd 879
MMKFKPNQTRTYSRYPDQWIVPGGGME
G
I want to convert it to a data frame, with the header lines (those starting with >)
in one column and the sequence lines in another column, as follows:
Name Sequence
abc 123 MMKFKPNQTRTYSRYPDQWIVPGGGME
GAAVREVYEEAGVKGKLGRLLGIFEQN
NMNJ
hik rre MMKFKPNPGDREGFKKRAACLCFRSEQ
EDEVLLVSSQTRTYSRYPDQWIVPGGG
MEPEEE
dmd kij MMKFKPNQTRTYSRYPDQWIVPGGGME
dmd 879 MMKFKPNQTRTYSRYPDQWIVPGGGME
G
I tried the code mentioned here, but it did not work for me.
CodePudding user response:
One solution, IIUC:
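import pandas as pd

# Treat ">" as the record separator so each row holds one "name\nsequence..." chunk.
df = pd.read_csv("data.csv", lineterminator=">", header=None)
# Split each chunk once, at the first newline: name on the left, sequence on the right.
res = (df[0].str.split("\n", expand=True, n=1)
       .set_axis(["Name", "Sequence"], axis=1))
# Remove the remaining line breaks (and any other whitespace) inside the sequence.
res["Sequence"] = res["Sequence"].str.replace(r"\s+", "", regex=True)
print(res)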
Output
Name Sequence
0 abc 123 MMKFKPNQTRTYSRYPDQWIVPGGGMEGAAVREVYEEAGVKGKLGRLLGIFEQNNMNJ
1 hik rre MMKFKPNPGDREGFKKRAACLCFRSEQEDEVLLVSSQTRTYSRYPDQWIVPGGGMEPEEE
2 dmd kij MMKFKPNQTRTYSRYPDQWIVPGGGME
3 dmd 879 MMKFKPNQTRTYSRYPDQWIVPGGGMEG
Note that you need to replace "data.csv" with your file name.
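To make the steps in that chain easier to follow, here is a minimal plain-Python sketch of the same idea: split on ">", take the first line as the name, and strip whitespace from the rest. The function name parse_fasta_text and the shortened sample string are made up for illustration and are not part of the answer above.

```python
import re

def parse_fasta_text(text):
    """Hypothetical helper: split FASTA-style text into (name, sequence) pairs."""
    records = []
    # Every record starts with ">", so splitting on it yields one chunk per record.
    for chunk in text.split(">"):
        if not chunk.strip():
            continue  # skip the empty piece before the first ">"
        # The first line is the name; everything after it is sequence data.
        name, _, body = chunk.partition("\n")
        # Remove line breaks and any other whitespace inside the sequence.
        records.append((name.strip(), re.sub(r"\s+", "", body)))
    return records

# Shortened, made-up sample in the same layout as the question's file.
sample = ">abc 123\nMMKF\nGAAV\n>dmd 879\nMMKF\nG\n"
print(parse_fasta_text(sample))
```

If a data frame is needed, the resulting list of pairs can be passed to pd.DataFrame(records, columns=["Name", "Sequence"]).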
CodePudding user response:
You can try this:
import pandas as pd
import re
text = '''>abc 123
MMKFKPNQTRTYSRYPDQWIVPGGGME
GAAVREVYEEAGVKGKLGRLLGIFEQN
NMNJ
>hik rre
MMKFKPNPGDREGFKKRAACLCFRSEQ
EDEVLLVSSQTRTYSRYPDQWIVPGGG
MEPEEE
>dmd kij
MMKFKPNQTRTYSRYPDQWIVPGGGME
>dmd 879
MMKFKPNQTRTYSRYPDQWIVPGGGME
G'''
# Each record starts with ">", so split there and drop the empty piece before the first ">".
a = [chunk for chunk in text.split('>') if chunk.strip()]
df = pd.DataFrame()
df['text'] = a

def split_text(text):
    # First element is the name line, the rest are sequence lines.
    return re.split('\\n', text.strip())

def separate_text(df, col):
    Name, Sequence = [], []
    for i in range(len(df)):
        Name.append(col[i][0].strip())
        Sequence.append(''.join(col[i][1:]))
    return Name, Sequence

df['text'] = df['text'].apply(split_text)
Name, Sequence = separate_text(df, df['text'])
df['Name'] = Name
df['Sequence'] = Sequence
df = df.drop('text', axis='columns')  # drop() returns a new frame, so keep the result
print(df)
CodePudding user response:
You could read each row in as a single column, then extract what you need from there.
import pandas as pd

df = pd.read_fwf('your.csv', header=None, widths=[999999], names=['Sequence'])
dfs = (
    df.assign(Name=df['Sequence'].str.extract(r'\>(.*)', expand=False).ffill())  # assoc name with seq data rows
    .loc[lambda x: ~x['Sequence'].str.contains('>')]  # get rid of name-only rows
    .groupby('Name', as_index=False)['Sequence'].apply(''.join)  # join seq into single string by name
)
print(dfs)
Result
Name Sequence
0 abc 123 MMKFKPNQTRTYSRYPDQWIVPGGGMEGAAVREVYEEAGVKGKLGR...
1 dmd 879 MMKFKPNQTRTYSRYPDQWIVPGGGMEG
2 dmd kij MMKFKPNQTRTYSRYPDQWIVPGGGME
3 hik rre MMKFKPNPGDREGFKKRAACLCFRSEQEDEVLLVSSQTRTYSRYPD...
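Note that grouping by Name sorts the rows by name and would merge two records that happen to share the same name. If file order matters or names can repeat, a line-by-line pass is one alternative. The following is only a sketch: the function name read_fasta_lines and the sample lines are invented for illustration.

```python
def read_fasta_lines(lines):
    """Hypothetical helper: return (name, sequence) pairs in input order.

    Records with the same name are kept as separate entries.
    """
    records = []
    name, parts = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                records.append((name, "".join(parts)))
            name, parts = line[1:].strip(), []
        elif name is not None:
            parts.append(line)  # sequence fragment for the current record
    if name is not None:
        records.append((name, "".join(parts)))  # flush the last record
    return records

# Made-up sample with a repeated name to show that duplicates stay separate.
sample_lines = [">hik rre", "MMKF", "EDEV", ">abc 123", "MMKF", ">abc 123", "G"]
print(read_fasta_lines(sample_lines))
```

An open file handle can be passed instead of a list, and pd.DataFrame(records, columns=["Name", "Sequence"]) turns the result into the two-column frame.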