Home > Net >  Pandas Dataframe/Series - Adding character to string values
Pandas Dataframe/Series - Adding character to string values

Time:12-25

I'm looking for help about the following problem.

I've got a dataframe :

df = pd.DataFrame({
                   'ID1': ['blabla-012345-blabla', 'blabla-012345-blabla', 'blabla-012345-blabla', 'blabla-012345-blabla', 'blabla-1234567891-blabla', 'blabla-123456-blabla', 'blabla-0123456789-blabla', 'blabla-0123456789-blabla', 'blabla-0123456789-blabla'],
                   'ID2': ['blabla-012345-blabla', 'blabla-123456-blabla', 'blabla-0123456789-blabla', 'blabla-1234567891-blabla', 'blabla-012345-blabla', 'blabla-0123456789-blabla', 'blabla-123456-blabla', 'blabla-0123456789-blabla', 'blabla-1234567891-blabla'],
                   'ID3': ['1234512345', '12345123456', '12345123456789', '123451234567891', '123451234567891', '123456123456789', '123456123456789', '123456789123456789', '1234567891234567891'],
                   'case': ['10', '11', '14', '15a_1', '15a_2', '15b_1', '15b_2', '18', '19']
                 })
df


    ID1                         ID2                         ID3                 case
0   blabla-012345-blabla        blabla-012345-blabla        1234512345          10
1   blabla-012345-blabla        blabla-123456-blabla        12345123456         11
2   blabla-012345-blabla        blabla-0123456789-blabla    12345123456789      14
3   blabla-012345-blabla        blabla-1234567891-blabla    123451234567891     15a_1
4   blabla-1234567891-blabla    blabla-012345-blabla        123451234567891     15a_2
5   blabla-123456-blabla        blabla-0123456789-blabla    123456123456789     15b_1
6   blabla-0123456789-blabla    blabla-123456-blabla        123456123456789     15b_2
7   blabla-0123456789-blabla    blabla-0123456789-blabla    123456789123456789  18
8   blabla-0123456789-blabla    blabla-1234567891-blabla    1234567891234567891 19

All values are strings, but my data was considered integer before imported to pandas.

'blabla' could be consider as anything (not random but useless to know to solve this problem), they complicate the way to rebuild ID3 from ID1 and ID2.

That's why I want to rebuild ID3 from itself.

Originally ID3 should be : 'ID3' = 'ID1number' 'ID2number' if ID1number<ID2number or 'ID3' = 'ID2number' 'ID1number' if ID2number<ID1number.

But as I said the IDs were considered integer before imported, so the '0' that started numbers in ID1 and ID2 just went off.

I want to reconstruct ID3 to fill the '0' that are missing. So I did those masks to do it, and try to change values corresponding to them :

mask_ok = df['ID3'].str.contains('^(\d{12}|\d{16}|\d{20})$')

mask_10_18 = df['ID3'].str.contains('^(\d{10}|\d{18})$')

mask_11_19 = df['ID3'].str.contains('^(\d{11}|\d{19})$')

mask_14 = df['ID3'].str.contains('^\d{14}$')

mask_15a = ((df['ID3'].str.contains('^\d{15}$'))
            &
            (df['ID1'].str.startswith('^blabla-0\d{5}-')
             |
             df['ID2'].str.startswith('^blabla-0\d{5}-')))

mask_15b = ((df['ID3'].str.contains('^\d{15}$'))
            &
            ~(df['ID1'].str.startswith('^blabla-0\d{5}-')
              |
              df['ID2'].str.startswith('^blabla-0\d{5}-')))

df.loc[mask_10_18] = '0' df.loc[mask_10_18][:len(df.loc[mask_10_18])/2] '0' df.loc[mask_10_18][len(df.loc[mask_10_18])/2:]

df.loc[mask_14] = '0' df.loc[mask_14][:5] '0' df.loc[mask_14][5:]

df.loc[mask_11_19|mask_15a] = '0' df.loc[mask_11_19|mask_15a]

df.loc[mask_15b] = df.loc[mask_15b][:6] '0' df.loc[mask_15b][6:]

df[~mask_ok]

What I want is that all zeros are well placed so that the dataframe look like this :

df = pd.DataFrame({
                   'ID1': ['blabla-012345-blabla', 'blabla-012345-blabla', 'blabla-012345-blabla', 'blabla-012345-blabla', 'blabla-1234567891-blabla', 'blabla-123456-blabla', 'blabla-0123456789-blabla', 'blabla-0123456789-blabla', 'blabla-0123456789-blabla'],
                   'ID2': ['blabla-012345-blabla', 'blabla-123456-blabla', 'blabla-0123456789-blabla', 'blabla-1234567891-blabla', 'blabla-012345-blabla', 'blabla-0123456789-blabla', 'blabla-123456-blabla', 'blabla-0123456789-blabla', 'blabla-1234567891-blabla'],
                   'ID3': ['012345012345', '012345123456', '0123450123456789', '0123451234567891', '0123451234567891', '1234560123456789', '1234560123456789', '01234567890123456789', '01234567891234567891'],
                   'case': ['12', '12', '16', '16', '16', '16', '16', '20', '20']
                 })
df

    ID1                         ID2                         ID3          case
0   blabla-012345-blabla        blabla-012345-blabla        012345012345            12
1   blabla-012345-blabla        blabla-123456-blabla        012345123456            12
2   blabla-012345-blabla        blabla-0123456789-blabla    0123450123456789        16
3   blabla-012345-blabla        blabla-1234567891-blabla    0123451234567891        16
4   blabla-1234567891-blabla    blabla-012345-blabla        0123451234567891        16
5   blabla-123456-blabla        blabla-0123456789-blabla    1234560123456789        16
6   blabla-0123456789-blabla    blabla-123456-blabla        1234560123456789        16
7   blabla-0123456789-blabla    blabla-0123456789-blabla    01234567890123456789    20
8   blabla-0123456789-blabla    blabla-1234567891-blabla    01234567891234567891    20

The columns case is just here to show the number of digit in ID3 but it's not in my original dataframe and I just put it there to improve visualisation. You can get rid of it for calculation.

I've got this error when I run the code :

TypeError                                 Traceback (most recent call last)
<ipython-input-30-91bff8470cf6> in <module>
     19               df['ID2'].str.startswith('^blabla-0')))
     20 
---> 21 df.loc[mask_10_18] = '0' df.loc[mask_10_18][:len(df.loc[mask_10_18])/2] '0' df.loc[mask_10_18][len(df.loc[mask_10_18])/2:]
     22 
     23 df.loc[mask_14] = '0' df.loc[mask_14][:5] '0' df.loc[mask_14][5:]

~\Anaconda3\lib\site-packages\pandas\core\frame.py in __getitem__(self, key)
   2997 
   2998         # Do we have a slicer (on rows)?
-> 2999         indexer = convert_to_index_sliceable(self, key)
   3000         if indexer is not None:
   3001             if isinstance(indexer, np.ndarray):

~\Anaconda3\lib\site-packages\pandas\core\indexing.py in convert_to_index_sliceable(obj, key)
   2208     idx = obj.index
   2209     if isinstance(key, slice):
-> 2210         return idx._convert_slice_indexer(key, kind="getitem")
   2211 
   2212     elif isinstance(key, str):

~\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in _convert_slice_indexer(self, key, kind)
   3355             if self.is_integer() or is_index_slice:
   3356                 self._validate_indexer("slice", key.start, "getitem")
-> 3357                 self._validate_indexer("slice", key.stop, "getitem")
   3358                 self._validate_indexer("slice", key.step, "getitem")
   3359                 return key

~\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in _validate_indexer(self, form, key, kind)
   5307             pass
   5308         else:
-> 5309             raise self._invalid_indexer(form, key)
   5310 
   5311     def _maybe_cast_slice_bound(self, label, side: str_t, kind):

TypeError: cannot do slice indexing on Int64Index with these indexers [1.0] of type float

I think it's because pandas is looking at my string slicing as chained slices on my dataframe.

I also tried to put .str in front of all calls to string (like .str.contains in the masks), but it doesn't work either.

How can I do this instead ?

CodePudding user response:

In my question, I was saying that the reconstruction of ID3 from ID1 and ID2 would be complicated by blabla- and -blabla. It was wrong because I can extract the numerics (and even a specific number of digit) from ID1 and ID2.

df = pd.DataFrame({
                   'ID1': ['blabla-012345-blabla', 'blabla-012345-blabla', 'blabla-012345-blabla', 'blabla-012345-blabla', 'blabla-1234567891-blabla', 'blabla-123456-blabla', 'blabla-0123456789-blabla', 'blabla-0123456789-blabla', 'blabla-0123456789-blabla'],
                   'ID2': ['blabla-012346-blabla', 'blabla-123456-blabla', 'blabla-0123456789-blabla', 'blabla-1234567891-blabla', 'blabla-012345-blabla', 'blabla-0123456789-blabla', 'blabla-123456-blabla', 'blabla-0123456799-blabla', 'blabla-1234567891-blabla'],
                   'ID3': ['1234512345', '12345123456', '12345123456789', '123451234567891', '123451234567891', '123456123456789', '123456123456789', '123456789123456789', '1234567891234567891'],
                   'case': ['10', '11', '14', '15a_1', '15a_2', '15b_1', '15b_2', '18', '19']
                 })
df['num_ID1'] = df['ID1'].str.extractall(r'(\d )').droplevel('match')
df['num_ID2'] = df['ID2'].str.extractall(r'(\d )').droplevel('match')
df.loc[df['num_ID1'].astype(int)<df['num_ID2'].astype(int), 'new_col'] = df['num_ID1']   df['num_ID2']
df.loc[df['num_ID2'].astype(int)<df['num_ID1'].astype(int), 'new_col'] = df['num_ID2']   df['num_ID1']
df

df1 = pd.DataFrame({
                   'ID1': ['blabla-012345-blabla', 'blabla-012345-blabla', 'blabla-012345-blabla', 'blabla-012345-blabla', 'blabla-1234567891-blabla', 'blabla-123456-blabla', 'blabla-0123456789-blabla', 'blabla-0123456789-blabla', 'blabla-0123456789-blabla'],
                   'ID2': ['blabla-012346-blabla', 'blabla-123456-blabla', 'blabla-0123456789-blabla', 'blabla-1234567891-blabla', 'blabla-012345-blabla', 'blabla-0123456789-blabla', 'blabla-123456-blabla', 'blabla-0123456799-blabla', 'blabla-1234567891-blabla'],
                   'ID3': ['012345012346', '012345123456', '0123450123456789', '0123451234567891', '0123451234567891', '1234560123456789', '1234560123456789', '01234567890123456799', '01234567891234567891'],
                   'case': ['10', '11', '14', '15a_1', '15a_2', '15b_1', '15b_2', '18', '19']
                 })
print(df1['ID3']==df['new_col'])

0    True
1    True
2    True
3    True
4    True
5    True
6    True
7    True
8    True
dtype: bool

Actually it's less complicated than trying to mask all cases and slice every ID3 on condition. I should now take care about the fact that 'blabla' could be random, so I need to extract only groups of a certain lenght to do it. And it will be done. Using string extraction is simpler than I though in this example.

  • Related