I am working with a large dataset, but the problem can be summarized using the smaller dataset below:
import pandas as pd
df = pd.DataFrame({"Filename":["fileName1_uniqueTag1", "fileName2_uniqueTag1", "fileName3_uniqueTag1", "fileName1_uniqueTag2", "fileName2_uniqueTag2", "fileName3_uniqueTag2"],
"measurement":[1336.564888, 1090.852579, 990.320323, 1202.522612, 1098.045258, 923.600277],})
print(df)
>>>
Filename measurement
0 fileName1_uniqueTag1 1336.564888
1 fileName2_uniqueTag1 1090.852579
2 fileName3_uniqueTag1 990.320323
3 fileName1_uniqueTag2 1202.522612
4 fileName2_uniqueTag2 1098.045258
5 fileName3_uniqueTag2 923.600277
There are three different file names with two unique tags for each file name in the "Filename" column. The goal is to calculate the ratio of the measurements for uniqueTag1/uniqueTag2 for each file. The result should look something like this:
Filename uniqueTag2/uniqueTag1
0 fileName1 0.899711
1 fileName2 1.006593
2 fileName3 0.932627
I can make lists of the three different file names and two different tags using:
nameList = df["Filename"].tolist()
fileNames = [] #empty list to fill with different base file names
uniqueTags = [] #empty list to fill with unique tags
for name in nameList: #iterate through list of full file names
subStrings = name.split("_") #splits each base file name at the underscore
if subStrings[0] not in fileNames: #if the base file name isn't already in the file names list...
fileNames.append(subStrings[0]) #append it
if subStrings[1] not in uniqueTags: #if the unique tag isn't already in the unique tags list...
uniqueTags.append(subStrings[1]) #append it
I think that I could access individual measurement values by making the Filenames into indexes and using df.at(), but this seems horribly messy and I'm certain there must be a much better way to do this using functionality within Pandas. Any suggestions?
CodePudding user response:
You can do something like this, which is pretty straightforward, using str.split()
:
df[['Filename','uniquetag']] = df['Filename'].str.split('_', expand=True)
tag1 = df.loc[df['uniquetag'] == 'uniqueTag1'].set_index('Filename')['measurement']
tag2 = df.loc[df['uniquetag'] == 'uniqueTag2'].set_index('Filename')['measurement']
tag2 / tag1
CodePudding user response:
Try
df[['one','two']] = df.filename.str.split("_",expand=True)
Then groupby
using those 2 columns