I'm looking for help about the following problem.
I've got a dataframe :
df = pd.DataFrame({
'ID1': ['blabla-012345-blabla', 'blabla-012345-blabla', 'blabla-012345-blabla', 'blabla-012345-blabla', 'blabla-1234567891-blabla', 'blabla-123456-blabla', 'blabla-0123456789-blabla', 'blabla-0123456789-blabla', 'blabla-0123456789-blabla'],
'ID2': ['blabla-012345-blabla', 'blabla-123456-blabla', 'blabla-0123456789-blabla', 'blabla-1234567891-blabla', 'blabla-012345-blabla', 'blabla-0123456789-blabla', 'blabla-123456-blabla', 'blabla-0123456789-blabla', 'blabla-1234567891-blabla'],
'ID3': ['1234512345', '12345123456', '12345123456789', '123451234567891', '123451234567891', '123456123456789', '123456123456789', '123456789123456789', '1234567891234567891'],
'case': ['10', '11', '14', '15a_1', '15a_2', '15b_1', '15b_2', '18', '19']
})
df
ID1 ID2 ID3 case
0 blabla-012345-blabla blabla-012345-blabla 1234512345 10
1 blabla-012345-blabla blabla-123456-blabla 12345123456 11
2 blabla-012345-blabla blabla-0123456789-blabla 12345123456789 14
3 blabla-012345-blabla blabla-1234567891-blabla 123451234567891 15a_1
4 blabla-1234567891-blabla blabla-012345-blabla 123451234567891 15a_2
5 blabla-123456-blabla blabla-0123456789-blabla 123456123456789 15b_1
6 blabla-0123456789-blabla blabla-123456-blabla 123456123456789 15b_2
7 blabla-0123456789-blabla blabla-0123456789-blabla 123456789123456789 18
8 blabla-0123456789-blabla blabla-1234567891-blabla 1234567891234567891 19
All values are strings, but my data was considered integer before imported to pandas.
'blabla' could be consider as anything (not random but useless to know to solve this problem), they complicate the way to rebuild ID3 from ID1 and ID2.
That's why I want to rebuild ID3 from itself.
Originally ID3 should be : 'ID3' = 'ID1number' 'ID2number' if ID1number<ID2number or 'ID3' = 'ID2number' 'ID1number' if ID2number<ID1number.
But as I said the IDs were considered integer before imported, so the '0' that started numbers in ID1 and ID2 just went off.
I want to reconstruct ID3 to fill the '0' that are missing. So I did those masks to do it, and try to change values corresponding to them :
mask_ok = df['ID3'].str.contains('^(\d{12}|\d{16}|\d{20})$')
mask_10_18 = df['ID3'].str.contains('^(\d{10}|\d{18})$')
mask_11_19 = df['ID3'].str.contains('^(\d{11}|\d{19})$')
mask_14 = df['ID3'].str.contains('^\d{14}$')
mask_15a = ((df['ID3'].str.contains('^\d{15}$'))
&
(df['ID1'].str.startswith('^blabla-0\d{5}-')
|
df['ID2'].str.startswith('^blabla-0\d{5}-')))
mask_15b = ((df['ID3'].str.contains('^\d{15}$'))
&
~(df['ID1'].str.startswith('^blabla-0\d{5}-')
|
df['ID2'].str.startswith('^blabla-0\d{5}-')))
df.loc[mask_10_18] = '0' df.loc[mask_10_18][:len(df.loc[mask_10_18])/2] '0' df.loc[mask_10_18][len(df.loc[mask_10_18])/2:]
df.loc[mask_14] = '0' df.loc[mask_14][:5] '0' df.loc[mask_14][5:]
df.loc[mask_11_19|mask_15a] = '0' df.loc[mask_11_19|mask_15a]
df.loc[mask_15b] = df.loc[mask_15b][:6] '0' df.loc[mask_15b][6:]
df[~mask_ok]
What I want is that all zeros are well placed so that the dataframe look like this :
df = pd.DataFrame({
'ID1': ['blabla-012345-blabla', 'blabla-012345-blabla', 'blabla-012345-blabla', 'blabla-012345-blabla', 'blabla-1234567891-blabla', 'blabla-123456-blabla', 'blabla-0123456789-blabla', 'blabla-0123456789-blabla', 'blabla-0123456789-blabla'],
'ID2': ['blabla-012345-blabla', 'blabla-123456-blabla', 'blabla-0123456789-blabla', 'blabla-1234567891-blabla', 'blabla-012345-blabla', 'blabla-0123456789-blabla', 'blabla-123456-blabla', 'blabla-0123456789-blabla', 'blabla-1234567891-blabla'],
'ID3': ['012345012345', '012345123456', '0123450123456789', '0123451234567891', '0123451234567891', '1234560123456789', '1234560123456789', '01234567890123456789', '01234567891234567891'],
'case': ['12', '12', '16', '16', '16', '16', '16', '20', '20']
})
df
ID1 ID2 ID3 case
0 blabla-012345-blabla blabla-012345-blabla 012345012345 12
1 blabla-012345-blabla blabla-123456-blabla 012345123456 12
2 blabla-012345-blabla blabla-0123456789-blabla 0123450123456789 16
3 blabla-012345-blabla blabla-1234567891-blabla 0123451234567891 16
4 blabla-1234567891-blabla blabla-012345-blabla 0123451234567891 16
5 blabla-123456-blabla blabla-0123456789-blabla 1234560123456789 16
6 blabla-0123456789-blabla blabla-123456-blabla 1234560123456789 16
7 blabla-0123456789-blabla blabla-0123456789-blabla 01234567890123456789 20
8 blabla-0123456789-blabla blabla-1234567891-blabla 01234567891234567891 20
The columns case is just here to show the number of digit in ID3 but it's not in my original dataframe and I just put it there to improve visualisation. You can get rid of it for calculation.
I've got this error when I run the code :
TypeError Traceback (most recent call last)
<ipython-input-30-91bff8470cf6> in <module>
19 df['ID2'].str.startswith('^blabla-0')))
20
---> 21 df.loc[mask_10_18] = '0' df.loc[mask_10_18][:len(df.loc[mask_10_18])/2] '0' df.loc[mask_10_18][len(df.loc[mask_10_18])/2:]
22
23 df.loc[mask_14] = '0' df.loc[mask_14][:5] '0' df.loc[mask_14][5:]
~\Anaconda3\lib\site-packages\pandas\core\frame.py in __getitem__(self, key)
2997
2998 # Do we have a slicer (on rows)?
-> 2999 indexer = convert_to_index_sliceable(self, key)
3000 if indexer is not None:
3001 if isinstance(indexer, np.ndarray):
~\Anaconda3\lib\site-packages\pandas\core\indexing.py in convert_to_index_sliceable(obj, key)
2208 idx = obj.index
2209 if isinstance(key, slice):
-> 2210 return idx._convert_slice_indexer(key, kind="getitem")
2211
2212 elif isinstance(key, str):
~\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in _convert_slice_indexer(self, key, kind)
3355 if self.is_integer() or is_index_slice:
3356 self._validate_indexer("slice", key.start, "getitem")
-> 3357 self._validate_indexer("slice", key.stop, "getitem")
3358 self._validate_indexer("slice", key.step, "getitem")
3359 return key
~\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in _validate_indexer(self, form, key, kind)
5307 pass
5308 else:
-> 5309 raise self._invalid_indexer(form, key)
5310
5311 def _maybe_cast_slice_bound(self, label, side: str_t, kind):
TypeError: cannot do slice indexing on Int64Index with these indexers [1.0] of type float
I think it's because pandas is looking at my string slicing as chained slices on my dataframe.
I also tried to put .str in front of all calls to string (like .str.contains in the masks), but it doesn't work either.
How can I do this instead ?
CodePudding user response:
In my question, I was saying that the reconstruction of ID3 from ID1 and ID2 would be complicated by blabla- and -blabla. It was wrong because I can extract the numerics (and even a specific number of digit) from ID1 and ID2.
df = pd.DataFrame({
'ID1': ['blabla-012345-blabla', 'blabla-012345-blabla', 'blabla-012345-blabla', 'blabla-012345-blabla', 'blabla-1234567891-blabla', 'blabla-123456-blabla', 'blabla-0123456789-blabla', 'blabla-0123456789-blabla', 'blabla-0123456789-blabla'],
'ID2': ['blabla-012346-blabla', 'blabla-123456-blabla', 'blabla-0123456789-blabla', 'blabla-1234567891-blabla', 'blabla-012345-blabla', 'blabla-0123456789-blabla', 'blabla-123456-blabla', 'blabla-0123456799-blabla', 'blabla-1234567891-blabla'],
'ID3': ['1234512345', '12345123456', '12345123456789', '123451234567891', '123451234567891', '123456123456789', '123456123456789', '123456789123456789', '1234567891234567891'],
'case': ['10', '11', '14', '15a_1', '15a_2', '15b_1', '15b_2', '18', '19']
})
df['num_ID1'] = df['ID1'].str.extractall(r'(\d )').droplevel('match')
df['num_ID2'] = df['ID2'].str.extractall(r'(\d )').droplevel('match')
df.loc[df['num_ID1'].astype(int)<df['num_ID2'].astype(int), 'new_col'] = df['num_ID1'] df['num_ID2']
df.loc[df['num_ID2'].astype(int)<df['num_ID1'].astype(int), 'new_col'] = df['num_ID2'] df['num_ID1']
df
df1 = pd.DataFrame({
'ID1': ['blabla-012345-blabla', 'blabla-012345-blabla', 'blabla-012345-blabla', 'blabla-012345-blabla', 'blabla-1234567891-blabla', 'blabla-123456-blabla', 'blabla-0123456789-blabla', 'blabla-0123456789-blabla', 'blabla-0123456789-blabla'],
'ID2': ['blabla-012346-blabla', 'blabla-123456-blabla', 'blabla-0123456789-blabla', 'blabla-1234567891-blabla', 'blabla-012345-blabla', 'blabla-0123456789-blabla', 'blabla-123456-blabla', 'blabla-0123456799-blabla', 'blabla-1234567891-blabla'],
'ID3': ['012345012346', '012345123456', '0123450123456789', '0123451234567891', '0123451234567891', '1234560123456789', '1234560123456789', '01234567890123456799', '01234567891234567891'],
'case': ['10', '11', '14', '15a_1', '15a_2', '15b_1', '15b_2', '18', '19']
})
print(df1['ID3']==df['new_col'])
0 True
1 True
2 True
3 True
4 True
5 True
6 True
7 True
8 True
dtype: bool
Actually it's less complicated than trying to mask all cases and slice every ID3 on condition. I should now take care about the fact that 'blabla' could be random, so I need to extract only groups of a certain lenght to do it. And it will be done. Using string extraction is simpler than I though in this example.