How to convert file in a specific pattern to a dataframe?

Time:07-31

I have a file as follows:

>abc 123
MMKFKPNQTRTYSRYPDQWIVPGGGME
GAAVREVYEEAGVKGKLGRLLGIFEQN
NMNJ

>hik rre
MMKFKPNPGDREGFKKRAACLCFRSEQ
EDEVLLVSSQTRTYSRYPDQWIVPGGG
MEPEEE

>dmd kij
MMKFKPNQTRTYSRYPDQWIVPGGGME

>dmd 879
MMKFKPNQTRTYSRYPDQWIVPGGGME
G

I want to convert it to a dataframe, with the header lines (those starting with >) in one column and the remaining lines in another column, as follows:

Name       Sequence
abc 123    MMKFKPNQTRTYSRYPDQWIVPGGGME
           GAAVREVYEEAGVKGKLGRLLGIFEQN
           NMNJ
hik rre    MMKFKPNPGDREGFKKRAACLCFRSEQ
           EDEVLLVSSQTRTYSRYPDQWIVPGGG
           MEPEEE
dmd kij    MMKFKPNQTRTYSRYPDQWIVPGGGME
dmd 879    MMKFKPNQTRTYSRYPDQWIVPGGGME
           G

I tried the code mentioned here, but it did not work for me.

CodePudding user response:

One solution, IIUC:

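import pandas as pd

# Treat ">" as the record separator so each record becomes one row
df = pd.read_csv("data.csv", lineterminator=">", header=None)
# Split each record at the first newline: name first, sequence lines after
res = (df[0].str.split("\n", expand=True, n=1)
           .set_axis(["Name", "Sequence"], axis=1))

# Remove the remaining whitespace (newlines) inside each sequence
res["Sequence"] = res["Sequence"].str.replace(r"\s+", "", regex=True)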
print(res)

Output

      Name                                                      Sequence
0  abc 123    MMKFKPNQTRTYSRYPDQWIVPGGGMEGAAVREVYEEAGVKGKLGRLLGIFEQNNMNJ
1  hik rre  MMKFKPNPGDREGFKKRAACLCFRSEQEDEVLLVSSQTRTYSRYPDQWIVPGGGMEPEEE
2  dmd kij                                   MMKFKPNQTRTYSRYPDQWIVPGGGME
3  dmd 879                                  MMKFKPNQTRTYSRYPDQWIVPGGGMEG

Note that you need to replace "data.csv" with your file name.

CodePudding user response:

You can try this:

import pandas as pd
import re

text = '''>abc 123
MMKFKPNQTRTYSRYPDQWIVPGGGME
GAAVREVYEEAGVKGKLGRLLGIFEQN
NMNJ

>hik rre
MMKFKPNPGDREGFKKRAACLCFRSEQ
EDEVLLVSSQTRTYSRYPDQWIVPGGG
MEPEEE

>dmd kij
MMKFKPNQTRTYSRYPDQWIVPGGGME

>dmd 879
MMKFKPNQTRTYSRYPDQWIVPGGGME
G'''

a = text.split('\n\n')
df = pd.DataFrame()
df['text'] = a

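def split_text(text):
    # Break one record into its lines: header first, then sequence lines
    return re.split('\\n', text)


def separate_text(df, col):
    Name, Sequence = [], []
    for i in range(len(df)):
        # Header line: replace ">" with a space, then trim the edges
        Name.append(re.sub('[^A-Za-z0-9]', ' ', col[i][0]).strip())
        # Remaining lines: concatenate into one sequence string
        Sequence.append(''.join(col[i][1:]))
    return Name, Sequence

df['text'] = df['text'].apply(split_text)

Name, Sequence = separate_text(df, df['text'])
df['Name'] = Name
df['Sequence'] = Sequence
# drop() returns a new frame, so assign the result back
df = df.drop('text', axis='columns')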
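
If the records are not always separated by a blank line, the same idea can be sketched as a plain line-by-line loop. This is only a minimal sketch under a few assumptions: every record starts with a line beginning with ">", blank lines carry no meaning, and parse_records is a hypothetical helper name, not something from pandas.

```python
import pandas as pd


def parse_records(lines):
    """Collect (name, sequence) pairs from blocks headed by a '>' line."""
    records = []
    name, parts = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # blank lines are ignored
        if line.startswith(">"):
            # A new header closes the previous record, if there was one
            if name is not None:
                records.append((name, "".join(parts)))
            name, parts = line[1:].strip(), []
        elif name is not None:
            parts.append(line)  # sequence line of the current record
    # Close the last record
    if name is not None:
        records.append((name, "".join(parts)))
    return pd.DataFrame(records, columns=["Name", "Sequence"])


sample = """>abc 123
MMKFKPNQTRTYSRYPDQWIVPGGGME
GAAVREVYEEAGVKGKLGRLLGIFEQN
NMNJ

>hik rre
MMKFKPNPGDREGFKKRAACLCFRSEQ
EDEVLLVSSQTRTYSRYPDQWIVPGGG
MEPEEE

>dmd kij
MMKFKPNQTRTYSRYPDQWIVPGGGME

>dmd 879
MMKFKPNQTRTYSRYPDQWIVPGGGME
G"""

df = parse_records(sample.splitlines())
print(df)
```

Because the function accepts any iterable of lines, an open file object can be passed in place of sample.splitlines().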

CodePudding user response:

You could read each row in as a single column, then extract what you need from there.

import pandas as pd

# One very wide fixed-width column, so each line lands in a single cell
df = pd.read_fwf('your.csv', header=None, widths=[999999], names=['Sequence'])

dfs = (
    df.assign(Name=df['Sequence'].str.extract(r'\>(.*)').ffill())    # assoc name with seq data rows
        .loc[lambda x: ~x['Sequence'].str.contains('>')]             # get rid of name-only rows
        .groupby('Name', as_index=False)['Sequence'].apply(''.join)  # join seq into single string by name
)
print(dfs)

Result

      Name                                           Sequence
0  abc 123  MMKFKPNQTRTYSRYPDQWIVPGGGMEGAAVREVYEEAGVKGKLGR...
1  dmd 879                       MMKFKPNQTRTYSRYPDQWIVPGGGMEG
2  dmd kij                        MMKFKPNQTRTYSRYPDQWIVPGGGME
3  hik rre  MMKFKPNPGDREGFKKRAACLCFRSEQEDEVLLVSSQTRTYSRYPD...
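
Note that groupby sorts the groups by name by default, which is why the rows above are not in file order. A minimal sketch of the same forward-fill idea that keeps the original order with sort=False follows. It assumes the lines are already in memory; the lines list is just the sample from the question, not read from a file.

```python
import pandas as pd

# Sample lines from the question, blank lines already dropped
lines = [
    ">abc 123",
    "MMKFKPNQTRTYSRYPDQWIVPGGGME",
    "GAAVREVYEEAGVKGKLGRLLGIFEQN",
    "NMNJ",
    ">hik rre",
    "MMKFKPNPGDREGFKKRAACLCFRSEQ",
    "EDEVLLVSSQTRTYSRYPDQWIVPGGG",
    "MEPEEE",
    ">dmd kij",
    "MMKFKPNQTRTYSRYPDQWIVPGGGME",
    ">dmd 879",
    "MMKFKPNQTRTYSRYPDQWIVPGGGME",
    "G",
]
df = pd.DataFrame({"Sequence": lines})

# Mark header rows, then carry each header down over its sequence rows
is_header = df["Sequence"].str.startswith(">")
df["Name"] = df["Sequence"].where(is_header).str[1:].ffill()

dfs = (
    df.loc[~is_header]                         # keep only sequence rows
      .groupby("Name", sort=False)["Sequence"]  # sort=False keeps file order
      .agg("".join)                             # one string per name
      .reset_index()
)
print(dfs)
```

One caveat with any groupby-based version: if the same name appears on two separate header lines, their sequences are merged into a single row.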