I"m trying to remove characters in a df column left and right of key words that keep repeating in rows of a large DF. There are leading numbers too that change so it's not so easy. My data looks like this:
128544 20210831 2200 882.2 342259.UNITED STATES.LAKERS.NBA.csv
128545 20210831 2300 918.9 342259.UNITED STATES.LAKERS.NBA.csv
128546 20210901 0 NaN 544257.UNITED STATES.CELTICS.NBA.csv
128547 20210901 100 NaN 544257.UNITED STATES.CELTICS.NBA.csv
I need the stripped character DF "df" to look like this:
col1 col2 col3 col4
128544 20210831 2200 882.2 LAKERS
128545 20210831 2300 918.9 LAKERS
128546 20210901 0 NaN CELTICS
128547 20210901 100 NaN CELTICS
I've tried - df['col4'] = df.col4.str.replace(".NBA).*","")
and it gives me the right half but I cannot see how to strip the left side or do both at the same time.
Thank you for your help,
CodePudding user response:
Try with split
df['col4'] = df.col4.str.split('.').str[2]
CodePudding user response:
Or for a more precise match try with .str.extract
since you’re looking into regular expressions:
>>> df['col4'] = df['col4'].str.extract('\d \.UNITED STATES\.(.*)\.NBA\.csv')[0]
>>> df
col0 col1 col2 col3 col4
0 128544 20210831 2200 882.2 LAKERS
1 128545 20210831 2300 918.9 LAKERS
2 128546 20210901 0 NaN CELTICS
3 128547 20210901 100 NaN CELTICS