I would like to do something in Python that I think is pretty simple, but I have had a difficult time doing it.
I am trying to create three variables from the three sections of the File_name column, which are separated by the '_' character. The first variable should be the first 10 digits, the second the next 5 digits (as in the example below), and the third the remaining digits. The last section can be 1, 2, 3, or 4 digits long. I appreciate any help!
Here is an example of what my dataframe currently consists of:
file_path | File_name |
---|---|
/Users/user/Dropbox/SEC investigat... | _0000886982_18795_2687.txt |
/Users/user/Dropbox/SEC investigat... | _0000072333_16855_805.txt |
CodePudding user response:
Assuming you want to extract the 3 numbers as new columns:
df[['A', 'B', 'C']] = df['File_name'].str.extractall(r'(\d+)')[0].unstack('match')
If you want to be strict about the format (10 digits, 5 digits, 1-4 digits, separated by underscores):
df[['A', 'B', 'C']] = df['File_name'].str.extract(r'(\d{10})_(\d{5})_(\d{1,4})')
Output (without first column):
File_name A B C
0 _0000886982_18795_2687.txt 0000886982 18795 2687
1 _0000072333_16855_805.txt 0000072333 16855 805
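For a self-contained run of the strict pattern, here is a minimal sketch. It assumes pandas is installed; the group names id10, mid5 and tail, and the extra non-matching row "notes.txt", are made up for illustration and are not part of the original data.

```python
import pandas as pd

# Sample file names copied from the question (file_path column omitted).
# The last row is a hypothetical name that does not fit the format.
df = pd.DataFrame({
    "File_name": [
        "_0000886982_18795_2687.txt",
        "_0000072333_16855_805.txt",
        "notes.txt",
    ]
})

# Named groups become the column names of the extracted frame.
# The group names here are illustrative assumptions.
pattern = r"_(?P<id10>\d{10})_(?P<mid5>\d{5})_(?P<tail>\d{1,4})\.txt$"
parts = df["File_name"].str.extract(pattern)

# Attach the three new columns to the original frame by index.
df = df.join(parts)
print(df)
```

The extracted values stay strings, so leading zeros such as those in 0000886982 are kept. A row that does not match the pattern gets missing values in all three new columns instead of raising an error.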
CodePudding user response:
The split method splits a string into a list, breaking it wherever a given separator appears.
>>> fn = "_0000886982_18795_2687.txt"
>>> fn.split('_')
['', '0000886982', '18795', '2687.txt']
One thing you could do is use split first to get rid of the ".txt", then use split again on the remainder to break it up.
>>> fn = "_0000886982_18795_2687.txt"
>>> parts = fn.split('.')[0].split('_')
>>> parts
['', '0000886982', '18795', '2687']
This gives you a list, though, not three separate variables. parts[1:] gives you the list from the second element onwards, so you could do this:
>>> var1, var2, var3 = parts[1:]
If you know your file name will always begin with an underscore and end in ".txt", you could do everything in a single step by slicing off the last four characters:
>>> var1, var2, var3 = fn[:-4].split('_')[1:]
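If the extension is not guaranteed to be ".txt", a slightly more defensive variant might look like the sketch below. The helper name split_file_name is hypothetical, and the check assumes every valid name has a leading underscore followed by exactly three underscore-separated sections.

```python
import os


def split_file_name(file_name):
    """Return the three underscore-separated sections of a name like '_A_B_C.txt'."""
    # splitext drops the final extension, whatever its length.
    stem, _ext = os.path.splitext(file_name)
    pieces = stem.split("_")
    # The leading underscore produces an empty first piece,
    # so a well-formed name yields exactly four pieces.
    if len(pieces) != 4 or pieces[0] != "":
        raise ValueError(f"unexpected file name format: {file_name!r}")
    return tuple(pieces[1:])


var1, var2, var3 = split_file_name("_0000886982_18795_2687.txt")
print(var1, var2, var3)
```

Raising on an unexpected format is a design choice: a bare unpacking such as var1, var2, var3 = ... would also fail on a malformed name, but with a less descriptive message.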