Home > front end >  Convert a string column to period in pandas preserving the string
Convert a string column to period in pandas preserving the string

Time:09-17

I would like to understand if I can convert a string column to a PeriodIndex (for instance year), preserving the string (suffix).

I have the following DataFrame:

company            date                          ...       revenue         taxes
Facebook           2017-01-01 00:00:00 Total     ...       1796.00          0.00
Facebook           2018-07-01 00:00:00 Total     ...       7423.20        -11.54
Facebook Total     -                             ...       1704.00          0.00
Google             2017-12-01 00:00:00 Total     ...       1938.60      -1938.60
Google             2018-12-01 00:00:00 Total     ...       1403.47       -102.01
Google             2018-01-01 00:00:00 Total     ...       2028.00        -76.38
Google Total       -                             ...        800.00       -256.98

I'm trying to apply the PeriodIndex to date:

df['date'] = pd.PeriodIndex(df['date'].values, freq='Y')

However, nothing happens because Pandas can't convert it to a string. I can't remove the word Total from my DataFrame.

This is what I expect to achieve:

company            date                          ...       revenue         taxes
Facebook           2017 Total                    ...       1796.00          0.00
Facebook           2018 Total                    ...       7423.20        -11.54
Facebook Total     -                             ...       1704.00          0.00
Google             2017 Total                    ...       1938.60      -1938.60
Google             2018 Total                    ...       1403.47       -102.01
Google             2018 Total                    ...       2028.00        -76.38
Google Total       -                             ...        800.00       -256.98

Any way I can get around with this?

Thanks!

CodePudding user response:

Let's say there is a dummy dataframe, similiar with yours:

dictionary = {'company' : ['Facebook', 'Facebook', 'Facebook_Total','Google','Google_Total'],
              'date' : ['2019-09-14 09:00:08.279000 09:00 Total',
                       '2020-09-14 09:00:08.279000 09:00 Total',
                       '-',
                       '2021-09-14 09:00:08.279000 09:00 Total',
                       '-'],
             'revenue' : [10,20,30,40,50]}
df = pd.DataFrame(dictionary)

I used regex module to delete Total behind the year column as following:

substring = ' Total'
for i in range(len(df)):
    if re.search(substring, df['date'][i] , flags=re.IGNORECASE):
        df['date'][i] = df['date'][i].replace(' Total','')
    else: pass 

Then, I used pd.PeriodIndex as following:

for i in range(len(df)) :
    if df['date'][i] == '-':
        pass
    else:
        df['date'][i] = pd.PeriodIndex(pd.Series(df['date'][i]), freq='Y')[0]
        
for i in range(len(df)):
    if df['date'][i] == '-':
        pass
    else:
        df['date'][i] = str(df['date'][i])   ' Total'

The above code returns :

Out[1]: 
          company        date  revenue
0        Facebook  2019 Total       10
1        Facebook  2020 Total       20
2  Facebook_Total           -       30
3          Google  2021 Total       40
4    Google_Total           -       50
  • Related