Python - Create variables from other variable separated by '_'

Time:08-17

I would like to do something in Python that I think is pretty simple but I have had a difficult time doing.

I am trying to create three variables by taking the three separate sections of the File_name variable, as separated by the '_' symbol. The first variable should be the first 10 digits, the second variable should be the next 5 digits, and the third variable should be the remaining digits. The last section could be 1, 2, 3, or 4 digits long. I appreciate any help!

Here is an example of what my dataframe currently consists of:

file_path                              File_name
/Users/user/Dropbox/SEC investigat...  _0000886982_18795_2687.txt
/Users/user/Dropbox/SEC investigat...  _0000072333_16855_805.txt

CodePudding user response:

Assuming you want to extract the 3 numbers as new columns:

df[['A', 'B', 'C']] = df['File_name'].str.extractall(r'(\d+)')[0].unstack('match')

If you want to be strict on the format (10 digits, 5 digits, 1-4 digits, separated by underscores):

df[['A', 'B', 'C']] = df['File_name'].str.extract(r'(\d{10})_(\d{5})_(\d{1,4})')

Output (without first column):

                    File_name           A      B     C
0  _0000886982_18795_2687.txt  0000886982  18795  2687
1   _0000072333_16855_805.txt  0000072333  16855   805
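
To check what the strict pattern captures on a single name, outside a dataframe, the same regular expression can be tried with the standard-library re module. This is only a sketch: the helper name parse_file_name is made up for illustration, and the leading underscore and trailing \.txt in the pattern assume that every name looks like the two examples in the question.

```python
import re

# Same strict layout as above: 10 digits, 5 digits, then 1-4 digits.
# The leading "_" and trailing "\.txt" are assumptions based on the sample names.
PATTERN = re.compile(r'_(\d{10})_(\d{5})_(\d{1,4})\.txt')

def parse_file_name(name):
    """Return the three digit groups as strings, or None if the name does not match."""
    match = PATTERN.fullmatch(name)
    return match.groups() if match else None

print(parse_file_name('_0000886982_18795_2687.txt'))  # ('0000886982', '18795', '2687')
print(parse_file_name('_0000072333_16855_805.txt'))   # ('0000072333', '16855', '805')
print(parse_file_name('_12_34_5.txt'))                # None
```

The groups stay strings, so leading zeros such as those in 0000886982 are preserved; converting them with int() would drop them.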

CodePudding user response:

The split function will split a string into a list, putting the splits where a specific character shows up.

>>> fn = "_0000886982_18795_2687.txt"
>>> fn.split('_')
['', '0000886982', '18795', '2687.txt']

One thing you could do is use split to first get rid of the ".txt", then use split again on the remainder to break it up.

>>> fn = "_0000886982_18795_2687.txt"
>>> parts = fn.split('.')[0].split('_')
>>> parts
['', '0000886982', '18795', '2687']

This gives you a list, though, not three separate variables. parts[1:] gives you the list from the second element onward, so you can unpack it:

>>> var1, var2, var3 = parts[1:]

If you know your file name will always begin with an underscore and end in ".txt", you can do everything in a single step by slicing off the last four characters (".txt") before splitting:

>>> var1, var2, var3 = fn[:-4].split('_')[1:]
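
If the extension is not guaranteed to be exactly four characters long, one alternative to slicing is to let pathlib strip it. The sketch below wraps that in a hypothetical helper, split_file_name, and raises an error when the name does not contain exactly three sections. Both the helper and the error check are additions for illustration, and the code still assumes the name begins with an underscore.

```python
from pathlib import PurePath

def split_file_name(name):
    """Split '_a_b_c.ext' into ('a', 'b', 'c'), dropping the extension."""
    stem = PurePath(name).stem      # '_0000886982_18795_2687'
    # Assumes a leading '_': drop the empty string that split() puts before it.
    parts = stem.split('_')[1:]
    if len(parts) != 3:
        raise ValueError(f'expected 3 sections, got {len(parts)}: {name!r}')
    return tuple(parts)

var1, var2, var3 = split_file_name('_0000886982_18795_2687.txt')
print(var1, var2, var3)  # 0000886982 18795 2687
```

Failing loudly on an unexpected name is a design choice here; plain tuple unpacking would also raise ValueError, but with a less descriptive message.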