I have a table like this
import pandas as pd
data = [
['ACOT', '00001', '', '', 1.5, 20, 30],
['ACOT', '00002', '', '', 1.7, 20, 33],
['ACOT', '00003', '','NA_0001' ,1.4, 20, 40],
['PAN', '000090', 'canonical', '', 0.5, 10, 30],
['PAN', '000091', '', '', 0.4, 10, 30],
['TOM', '000080', 'canonical', '', 0.4, 10, 15],
['TOM', '000040', '', '', 1.7, 10, 300]
]
df = pd.DataFrame(data, columns=[
'Gene_name', 'Transcript_ID', 'canonical', 'mane', 'metrics','start','end'])
The output
Gene_name Transcript_ID canonical mane metrics start end
0 ACOT 00001 1.5 20 30
1 ACOT 00002 NA_0001 1.7 20 33
2 ACOT 00003 1.4 20 40
3 PAN 000090 canonical NA_00090 0.5 10 30
4 PAN 000091 0.4 10 30
5 TOM 000080 canonical 0.4 10 15
6 TOM 000040 1.7 10 300
I want this output
Gene_name canonical mane metrics
0 ACOT No Yes 1.4-1.5
4 PAN Yes Yes 0.5-0.4
5 TOM Yes No 1.7-0.4
If mane/canonical have a value put yes otherwise put no. Create the range between the highest and the lowest value found in the metrics column.
CodePudding user response:
EDIT: Beside my solution works I would suggest to accept the answer of @965311532 because it is a very good combination of the "pandas-way" and "pythonic" code. Much less lines and much better to read.
Now my solution:
I have to say your example code doesn't provide the exact output data. Look into mane
column for example. And the metrics
column in your desired output doesn't make sense because min-max and max-min are mixed.
My example code here produce this output as sample data:
Gene_name canonical mane metrics
0 ACOT 1.5
1 ACOT 1.7
2 ACOT NA_0001 1.4
3 PAN canonical 0.5
4 PAN 0.4
5 TOM canonical 0.4
6 TOM 1.7
The result after transformation is
canonical mane metrics
Gene_name
ACOT No No 1.4-1.7
PAN Yes No 0.4-0.5
TOM Yes No 0.4-1.7
That is the solution code.
#!/usr/bin/env python3
import pandas as pd
data = [
['ACOT', '00001', '', '', 1.5, 20, 30],
['ACOT', '00002', '', '', 1.7, 20, 33],
['ACOT', '00003', '','NA_0001' ,1.4, 20, 40],
['PAN', '000090', 'canonical', '', 0.5, 10, 30],
['PAN', '000091', '', '', 0.4, 10, 30],
['TOM', '000080', 'canonical', '', 0.4, 10, 15],
['TOM', '000040', '', '', 1.7, 10, 300]
]
df = pd.DataFrame(data, columns=[
'Gene_name', 'Transcript_ID', 'canonical', 'mane', 'metrics','start','end'])
# select only the columns we need
df = df.loc[:, ['Gene_name', 'canonical', 'mane', 'metrics']]
# helper function to used in apply()
def my_helper(group):
# temporary store the result as a dict
result = {}
# check for existance of the values
# it is easy because columnname, value and result column
# have the same "name"
for col in ['canonical', 'mane']:
result[col] = \
'Yes' if group[col].isin([col]).any() else 'No'
# I use string format here. Could also be a tuple,
# list, Series or what every you want.
result['metrics'] = '{}-{}'.format(group.metrics.min(),
group.metrics.max())
return pd.Series(result)
desired = df.groupby('Gene_name').apply(my_helper)
The important part here is to use groupby()
and then do the calculations and checks in a separate helper function.
Some details
About that line
def my_helper(group):
result = {'Gene_name': group.Gene_name.iloc[0]}
The my_helper()
function is called three times for each of your Gene_name
. The group
argument is a DataFrame
. For example in the first call its content looks like this:
Gene_name canonical mane metrics
0 ACOT 1.5
1 ACOT 1.7
2 ACOT NA_0001 1.4
To collect the results I use a dict()
here because the keys it are later used as column names for the resulting data frame.
About this line:
for col in ['canonical', 'mane']:
result[col] = \
'Yes' if group[col].isin([col]).any() else 'No'
This maybe looks a bit wired but it is IMHO more pythonic. It could be also written with some more lines like this.
if group['canonical'].isin(['canonical']).any():
result['canonical'] = 'Yes'
else:
result['canonical'] = 'No'
# and the same for 'mane'
The part .isin(['canonical'])
gives you a list of boolean values. The .any()
return True
if there is minimal one True
in that list and False
if not.
About that line
return pd.Series(result)
Here a Series
object is created out of the dict
. From the first call of my_helper()
the series looks like this:
Gene_name ACOT
canonical No
mane No
metrics 1.4-1.7
It is a list where each item has its own name. The complete code will produce three of such series objects (for each Gene_name
). The apply()
does use each series as a row and glue them together to a new data frame.
CodePudding user response:
You can do this:
f = lambda x: "Yes" if x.any() else "No" # For canonical and mane
df = df.groupby('Gene_name').agg({'canonical': f, 'mane': f, 'metrics': ['min', 'max']})
# Rename columns
df.columns = ["canonical", "mane", "metrics_min", "metrics_max"]
Output:
canonical mane metrics_min metrics_max
Gene_name
ACOT No Yes 1.4 1.7
PAN Yes Yes 0.4 0.5
TOM Yes No 0.4 1.7