Home > Back-end >  Transform n columns into rows
Transform n columns into rows

Time:12-29

i am looking for a way to disaggregate data from a single row in a pandas df.

My data looks like this

edit: n stands for an unspecified number, e.g. in my working dataset I have 8 plots giving me 8 x 2 = 16 columns I would like to transform.

data = {
        'key':['k1', 'k2'],
        'plot_name_1':['name', 'name'],
        'plot_area_1':[1,2],
        'plot_name_2':['name', 'name'],
        'plot_area_2':[1,2],
        'plot_name_n':['name', 'name'],
        'plot_area_n':[1,2]
       }

df = pd.DataFrame(data)

enter image description here

And I would like to end up here, adding an extra column to identify the plot number:

data = {
        'key':['k1','k1','k1', 'k2', 'k2', 'k2'],
        'plot_number':['1', '2', 'n','1', '2', 'n'],
        'plot_name':['name', 'name','name', 'name','name', 'name'],
        'plot_area':[1,2,1,2,1,2],
       }

df = pd.DataFrame(data)

enter image description here

CodePudding user response:

pd.wide_to_long can do this:

In [160]: pd.wide_to_long(df, stubnames=["plot_name", "plot_area"],
                          i="key", j="plot_number",
                          sep="_", suffix=r"(?:\d |n)").reset_index()
Out[160]:
  key plot_number plot_name  plot_area
0  k1           1      name          1
1  k2           1      name          2
2  k1           2      name          1
3  k2           2      name          2
4  k1           n      name          1
5  k2           n      name          2

where

  • "stubnames" are the common column prefix names to capture
  • "sep" is the separator after those prefixes
    • "_" in your case
  • "suffix" is what's expected after the separator
    • one or more digits, or literal "n" in your case; if "n" was symbolic, you can have suffix=r"\d " there
  • "i" argument is the "index" (i.e., the identifier variables)
  • "j" signifies the name under which the suffixes are gathered.

As an aside, we need to paranthesise the regex when multiple suffixes are possible because of the way pandas uses the suffix under the hood when constructing the regex:

regex = rf"^{re.escape(stub)}{re.escape(sep)}{suffix}$"

We see that suffix is interpolated directly, and an alternator (i.e., |) in it will see the left side as, e.g., not only \d but also what comes from stub & sep, too.

CodePudding user response:

One option is to reshape with pivot_longer from pyjanitor, using a regular expression to capture the groups:

# pip install pyjanitor
import pandas as pd
import janitor

(df
.pivot_longer(
    index='key', 
    names_to = ('.value', 'plot_number'), 
    names_pattern = r"(. )_(. )")
)
  key plot_number plot_name  plot_area
0  k1           1      name          1
1  k2           1      name          2
2  k1           2      name          1
3  k2           2      name          2
4  k1           n      name          1
5  k2           n      name          2

The .value determines which parts of the columns remains as headers

  • Related