I have column in a pandas df that has this format "1_A01_1_1_NA I want to extract the text that is between the underscores e.g. "A01" "1" "1" and "NA" , I tried to use left right and mid but the problem is that at some point the column value changes into something like this 11_B40_11_8_NA.
Pd the df has 7510 rows.
CodePudding user response:
Use str.split
:
df = pd.DataFrame({'Col1': ['1_A01_1_1_NA', '11_B40_11_8_NA']})
out = df['Col1'].str.split('_', expand=True)
Output:
>>> out
0 1 2 3 4
0 1 A01 1 1 NA
1 11 B40 11 8 NA
CodePudding user response:
The function you are looking for is Pandas.series.str.split().
You should be able to take your nasty column as a series and use the str.split("_", expand = True)
method. You can see the "expand" keyword is exactly what you need to make new columns out of the results (splitting on the "_" character, not any specific index).
So, something like this:
First we need to create a little bit of nonsense like yours. (Please forgive my messy and meandering code, I'm still new)
import pandas as pd
from random import choice
import string
# Creating Nonsense Data Frame
def make_nonsense_codes():
"""
Returns a string of nonsense like '11_B40_11_8_NA'
"""
nonsense = "_".join(
[
"".join(choice(string.digits) for i in range(2)),
"".join(
[choice(string.ascii_uppercase),
"".join([choice(string.digits) for i in range(2)])
]
),
"".join(choice(string.digits) for i in range(2)),
choice(string.digits),
"NA"
]
)
return nonsense
my_nonsense_df = pd.DataFrame(
{"Nonsense" : [make_nonsense_codes() for i in range(5)]}
)
print(my_nonsense_df)
# Nonsense
# 0 25_S91_13_1_NA
# 1 80_O54_58_4_NA
# 2 01_N98_68_3_NA
# 3 88_B37_14_9_NA
# 4 62_N65_73_7_NA
Now we can select our "Nonsense"
column, and use str.split()
.
# Wrangling the nonsense column with series.str.split()
wrangled_nonsense_df = my_nonsense_df["Nonsense"].str.split("_", expand = True)
print(wrangled_nonsense_df)
# 0 1 2 3 4
# 0 25 S91 13 1 NA
# 1 80 O54 58 4 NA
# 2 01 N98 68 3 NA
# 3 88 B37 14 9 NA
# 4 62 N65 73 7 NA