How to extract text from a column in pandas-CodePudding

I have column in a pandas df that has this format "1_A01_1_1_NA I want to extract the text that is between the underscores e.g. "A01" "1" "1" and "NA" , I tried to use left right and mid but the problem is that at some point the column value changes into something like this 11_B40_11_8_NA.

Pd the df has 7510 rows.

CodePudding user response：

Use str.split:

df = pd.DataFrame({'Col1': ['1_A01_1_1_NA', '11_B40_11_8_NA']})
out = df['Col1'].str.split('_', expand=True)

Output:

>>> out
    0    1   2  3   4
0   1  A01   1  1  NA
1  11  B40  11  8  NA

CodePudding user response：

The function you are looking for is Pandas.series.str.split().

You should be able to take your nasty column as a series and use the str.split("_", expand = True) method. You can see the "expand" keyword is exactly what you need to make new columns out of the results (splitting on the "_" character, not any specific index).

So, something like this:

First we need to create a little bit of nonsense like yours. (Please forgive my messy and meandering code, I'm still new)

import pandas as pd
from random import choice
import string

# Creating Nonsense Data Frame

def make_nonsense_codes():
    """
    Returns a string of nonsense like '11_B40_11_8_NA'
    """
    nonsense = "_".join(
        [
            "".join(choice(string.digits) for i in range(2)),
            "".join(
                [choice(string.ascii_uppercase),
                "".join([choice(string.digits) for i in range(2)])
                ]
            ),
            "".join(choice(string.digits) for i in range(2)),
            choice(string.digits),
            "NA"
        ]
    )
    return nonsense

my_nonsense_df = pd.DataFrame(
    {"Nonsense" : [make_nonsense_codes() for i in range(5)]}
)

print(my_nonsense_df)

#         Nonsense
# 0  25_S91_13_1_NA
# 1  80_O54_58_4_NA
# 2  01_N98_68_3_NA
# 3  88_B37_14_9_NA
# 4  62_N65_73_7_NA

Now we can select our "Nonsense" column, and use str.split().

# Wrangling the nonsense column with series.str.split()
wrangled_nonsense_df = my_nonsense_df["Nonsense"].str.split("_", expand = True)
print(wrangled_nonsense_df)

#     0    1   2  3   4
# 0  25  S91  13  1  NA
# 1  80  O54  58  4  NA
# 2  01  N98  68  3  NA
# 3  88  B37  14  9  NA
# 4  62  N65  73  7  NA