I am new to python and have a pandas data frame which is the result of appending a multiple entities. Following is an example of the data frame (It can be bigger than this):
main_tag | tag | value |
---|---|---|
geneid | geneid | 121175336 |
status | status | live |
create-date | Date-std_year | 2017 |
create-date | Date-std_month | 4 |
create-date | Date-std_day | 0 |
update-date | Date-std_year | 2021 |
update-date | Date-std_month | 3 |
update-date | Date-std_day | 21 |
Org-ref_taxname | Org-ref_taxname | Gmax |
Org-ref_db | Dbtag_db | taxon1 |
Org-ref_db | Object-id_id | 3845 |
OrgName_div | OrgName_div | PLN |
geneid | geneid | 121175335 |
status | status | live |
create-date | Date-std_year | 2011 |
create-date | Date-std_month | 4 |
create-date | Date-std_day | 20 |
update-date | Date-std_year | 2021 |
update-date | Date-std_month | 4 |
update-date | Date-std_day | 21 |
Org-ref_taxname | Org-ref_taxname | Glycine |
Org-ref_db | Dbtag_db | taxon2 |
Org-ref_db | Object-id_id | 3847 |
OrgName_div | OrgName_div | LSA |
geneid | geneid | 121175334 |
status | status | live |
create-date | Date-std_year | 2011 |
create-date | Date-std_month | 6 |
create-date | Date-std_day | 20 |
update-date | Date-std_year | 2018 |
update-date | Date-std_month | 8 |
update-date | Date-std_day | 21 |
Org-ref_taxname | Org-ref_taxname | GlycineMin |
Org-ref_db | Dbtag_db | taxon3 |
Org-ref_db | Object-id_id | 38467 |
OrgName_div | OrgName_div | PLN |
As tables shows, the sub-dataframe from "geneid" to "OrgName_div" is been repeated with different values and I need the fastest way to transpose and change it to something like this:
geneid | status | create-date | update-date | Org-ref_taxname | Org-ref_db | Org-ref_db | OrgName_div |
---|---|---|---|---|---|---|---|
121175336 | live | 2017-4-2 | 2021-3-21 | Gmax | taxon1 | 3845 | PLN |
121175335 | live | 2011-4-20 | 2021-4-21 | Glycine | taxon2 | 3847 | LSA |
121175334 | live | 2011-6-20 | 2018-8-21 | GlycineMin | taxon3 | 38467 | PLN |
Here is the code that generate this data frame which is basically parsing a huge xml file:
import numpy as np
import pandas as pd
from lxml import etree
class EntrezGene:
def __init__(self, fh,tg):
self.context = etree.iterparse(fh, events=("end",),tag=tg,huge_tree=True)
def _parse(self):
for event, elem in self.context:
yield elem
elem.clear()
while elem.getprevious() is not None:
del elem.getparent()[0]
def sequence(self, elements):
_elements = {}#[]
for elem in elements:
if len(elem):
_elements[elem.tag]= dict(self.sequence(elem))
else:
_elements[elem.tag] =elem.text
return _elements
def __iter__(self):
for xml_EntrezGene in self._parse():
if len(xml_EntrezGene.attrib):
EntrezGene = {xml_EntrezGene.tag: xml_EntrezGene.attrib[next(iter(xml_EntrezGene.attrib))]} #xml_EntrezGene.attrib
mainTag = xml_EntrezGene.tag
elif len(xml_EntrezGene.text.strip()):#len(xml_EntrezGene.text) :
EntrezGene = {xml_EntrezGene.tag: xml_EntrezGene.text}
mainTag = xml_EntrezGene.tag
else:
EntrezGene = {}
mainTag = xml_EntrezGene.tag
for elem in xml_EntrezGene:
# if elem is Sequence
if len(elem):
# Append tuple(tag, value)
EntrezGene[elem.tag] = self.sequence(elem)
mainTag = elem.getparent().tag
else:
EntrezGene[elem.tag] = elem.text
mainTag = elem.getparent().tag
yield EntrezGene, mainTag
def recursive_items(dictionary):
for key, value in dictionary.items():
if type(value) is dict:
#print(key)
yield from recursive_items(value)
else:
yield (key, value)
poTags =['geneid','status','OrgName_div','Org-ref_db','Org-ref_taxname','create-date', 'update-date']
def extract_xml_inf(xmlFileName, tagNames):
df = pd.DataFrame(columns=['main_tag','tag', 'value'])
with open(xmlFileName, 'rb') as in_xml:
for record in EntrezGene(in_xml,tg=tagNames):
for key, value in record[0].items():
if isinstance(value, (dict)):
for subKey, subValue in recursive_items(value):
df = df.append({'main_tag': record[1],'tag': subKey,'value': subValue}, ignore_index=True)
else:
df = df.append({'main_tag': record[1],'tag': key,'value': value}, ignore_index=True)
#print(df)
return df
I appreciate any help.
CodePudding user response:
You could do:
df1 = df.assign(id = (df.main_tag == 'geneid').cumsum()).\
groupby(['id', 'main_tag']).aggregate({'value':'-'.join}).reset_index().\
pivot('id', 'main_tag', 'value')[df.main_tag.unique()]
print(df1)
main_tag geneid status create-date update-date Org-ref_taxname Org-ref_db OrgName_div
id
1 121175336 live 2017-4-0 2021-3-21 Gmax taxon1-3845 PLN
2 121175335 live 2011-4-20 2021-4-21 Glycine taxon2-3847 LSA
3 121175334 live 2011-6-20 2018-8-21 GlycineMin taxon3-38467 PLN
CodePudding user response:
Perhaps you can go about this differently from the start. If you need to append everything, then reshape, someone else might have a solution for you.
Before concatenating, transpose just the value columns, then concatenate them. Then you can make the modifications to the dates. I don't have your code used to join your data, so you'll have to modify this statement as needed.
# hold individual df in a list
df_hold =[]
for i in range(3):
# .T for transpose
df_hold.append(df[['value']].T)
df = pd.concat(df_hold)
# note the _c in some of the dates to remove the chance of duplicate column names
tags = ['geneid', 'status', 'Date-std_year_c', 'Date-std_month_c', 'Date-std_day_c', 'Date-std_year', 'Date-std_month',
'Date-std_day', 'Org-ref_taxname', 'Dbtag_db', 'Object-id_id', 'OrgName_div']
df.columns = tags
df['create-date'] = df.apply(lambda x: '-'.join([x['Date-std_year_c'], x['Date-std_month_c'], x['Date-std_day_c']]), axis=1)
df['update-date'] = df.apply(lambda x: '-'.join([x['Date-std_year'], x['Date-std_month'], x['Date-std_day']]), axis=1)
# drop unwanted columns
df = df.drop(['Date-std_year_c', 'Date-std_month_c', 'Date-std_day_c', 'Date-std_year', 'Date-std_month',
'Date-std_day'], axis=1).set_index('geneid', drop=True)
this is just running the first geneid three times, that is why the data is the same in each line
status Org-ref_taxname Dbtag_db Object-id_id OrgName_div create-date update-date
geneid
121175336 live Gmax taxon1 3845 PLN 2017-4-0 2021-3-21
121175336 live Gmax taxon1 3845 PLN 2017-4-0 2021-3-21
121175336 live Gmax taxon1 3845 PLN 2017-4-0 2021-3-21