Home > Back-end >  Extract string between two substrings in a pandas df column
Extract string between two substrings in a pandas df column

Time:10-31

I have the following dataframe:

                                           contract
0  Future(conId=482048803, symbol='ESTX50', lastT...
1  Future(conId=497000453, symbol='XT', lastTrade...
2  Stock(conId=321100413, symbol='SXRS', exchange...
3  Stock(conId=473087271, symbol='ETHEEUR', excha...
4  Stock(conId=80268543, symbol='IJPA', exchange=...
5  Stock(conId=153454120, symbol='EMIM', exchange...
6  Stock(conId=75776072, symbol='SXR8', exchange=...
7  Stock(conId=257200855, symbol='EGLN', exchange...
8  Stock(conId=464974581, symbol='VBTC', exchange...
9  Future(conId=478135706, symbol='ZN', lastTrade...

I want to create a new "symbol" column with all symbols (ESTX50, XT, SXRS...).

In order to extract the substring between "symbol='" and the following single quote, I tried the following:

df['symbol'] = df.contract.str.extract(r"symbol='(.*?)'")

but I get a column of NaN. What am I doing wrong? Thanks

CodePudding user response:

It looks like that is a column of objects, not strings:

import pandas as pd


class Future:
    def __init__(self, symbol):
        self.symbol = symbol

    def __repr__(self):
        return f'Future(symbol=\'{self.symbol}\')'


df = pd.DataFrame({'contract': [Future(symbol='ESTX50'), Future(symbol='XT')]})

df['symbol'] = df.contract.str.extract(r"symbol='(.*?)'")
print(df)

df:

                  contract symbol
0  Future(symbol='ESTX50')    NaN
1      Future(symbol='XT')    NaN

Notice pandas considers strings to be object type so the string accessor is still allowed to attempt to perform operations. However, it cannot extract because these are not strings.

We can either convert to string first with astype:

df['symbol'] = df.contract.astype(str).str.extract(r"symbol='(.*?)'")

df:

                  contract  symbol
0  Future(symbol='ESTX50')  ESTX50
1      Future(symbol='XT')      XT

However, the faster approach is to try to extract the object property:

df['symbol'] = [getattr(x, 'symbol', None) for x in df.contract]

Or with apply (which can be slower than the comprehension)

df['symbol'] = df.contract.apply(lambda x: getattr(x, 'symbol', None))

Both produce:

                  contract  symbol
0  Future(symbol='ESTX50')  ESTX50
1      Future(symbol='XT')      XT
  • Related