Pandas create new columns based on column headers

Time:12-16

I'm trying to create several new columns in a pandas dataframe that merge portions of multiple other columns based on year (yet another column) and column headers. The layout of the dataframe currently is:

2000_A1  2000_A2  2001_A1  2001_A2  Year  Latitude  Longitude
      2        8        0        3  2000     43.65     -76.26
      3        8        9        4  2000     43.66     -76.26
      3        2        5        3  2000     43.67     -76.26
      4        6        5        1  2001     43.68     -76.26
      7        8        2        0  2001     43.69     -76.26
      1        3        1        1  2001     43.70     -76.26

I would like to create a dataframe that has two new columns (A1 and A2) that combine values based on year, as long as the year matches the column header. Originally, these were two separate dataframes that I merged (pd.merge()) based on coordinates (lat/long), which is why they're organized in different manners. Ideally, the new dataframe would look like this:

2000_A1  2000_A2  2001_A1  2001_A2  Year  A1  A2  Latitude  Longitude
      2        8        0        3  2000   2   8     43.65     -76.26
      3        8        9        4  2000   3   8     43.66     -76.26
      3        2        5        3  2000   3   2     43.67     -76.26
      4        6        5        1  2001   5   1     43.68     -76.26
      7        8        2        0  2001   2   0     43.69     -76.26
      1        3        1        1  2001   1   1     43.70     -76.26

If possible, I would like to use a for loop to create these new columns - in my actual dataset, I have about 20 years and either 7 or 9 'A's for each year, so some sort of iterative loop would be great. I'm new to python and struggling to figure out how to approach this, so any help would be much appreciated.

CodePudding user response:

You can use a combination of .melt and .pivot, optionally followed by a merge to preserve your original {year}_A{n} columns. This will likely be much faster than an iterative solution (including anything along the lines of .apply).

import pandas as pd

ff = df.copy()  # keep the original wide frame for the final merge
ident = ["Year", "Latitude", "Longitude"]
# split your variable column(s) into an identifier and value column
df = df.melt(id_vars=ident)
# parse the parts of {year}_A{n}
parts = df.variable.str.split("_")
df["var_year"] = parts.str[0].astype(int)
df["var_kind"] = parts.str[1]
df = df[df.var_year == df.Year]  # keep only values whose header year matches the row's Year
# pivot the frame to get A{n} columns (pivot takes keyword arguments only in pandas 2.x)
df = df.pivot(index=ident, columns="var_kind", values="value")
df = df.reset_index()
df = pd.merge(ff, df, on=ident)  # retain original {year}_A{n}

Result:

   2000_A1  2000_A2  2001_A1  2001_A2  Year  Latitude  Longitude  A1  A2
0        2        8        0        3  2000     43.65     -76.26   2   8
1        3        8        9        4  2000     43.66     -76.26   3   8
2        3        2        5        3  2000     43.67     -76.26   3   2
3        4        6        5        1  2001     43.68     -76.26   5   1
4        7        8        2        0  2001     43.69     -76.26   2   0
5        1        3        1        1  2001     43.70     -76.26   1   1
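
To see what the melt-and-filter step is doing, here is a minimal sketch on two rows of the sample data, one per year. The names `sample`, `long_df`, and `matched` are illustrative only and not part of the code above.

```python
import pandas as pd

# Two rows taken from the question's sample data (one per year)
sample = pd.DataFrame({
    "2000_A1": [2, 4],
    "2000_A2": [8, 6],
    "2001_A1": [0, 5],
    "2001_A2": [3, 1],
    "Year": [2000, 2001],
    "Latitude": [43.65, 43.68],
    "Longitude": [-76.26, -76.26],
})

# melt stacks every {year}_A{n} column into (variable, value) pairs,
# so each original row appears once per stacked column
long_df = sample.melt(id_vars=["Year", "Latitude", "Longitude"])

# the header year is the part before the underscore
long_df["var_year"] = long_df["variable"].str.split("_").str[0].astype(int)

# keep only the values whose header year equals the row's Year
matched = long_df[long_df["var_year"] == long_df["Year"]]
print(matched[["Year", "variable", "value"]])
```

Each original row contributes four rows to `long_df` (one per `{year}_A{n}` column), and the filter keeps the two whose header year matches, which is what the pivot then spreads back into `A1` and `A2`.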

CodePudding user response:

You could use DataFrame.apply row-wise (axis=1), reading the year from each row to build the name of the column to pick.

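def get_entries(line, new_entry="A1"):
    # The column is named "Year" (capitalised). Cast to int because a row-wise
    # apply upcasts mixed int/float rows to float (2000 becomes 2000.0).
    year = int(line["Year"])
    used_feature = f"{year}_{new_entry}"
    return line[used_feature]

df["A1"] = df.apply(lambda line: get_entries(line, "A1"), axis=1)
df["A2"] = df.apply(lambda line: get_entries(line, "A2"), axis=1)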
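
Since the question asks for a for loop that scales to about 20 years and 7 or 9 'A' columns, here is a minimal sketch of one way that might look, using a boolean mask per year instead of a row-wise apply. It assumes every source header has the form `{4-digit year}_{kind}`; the names `kinds`, `source`, and `mask` are illustrative, and rows with no matching source column are left as NaN.

```python
import re

import numpy as np
import pandas as pd

# Sample frame from the question
df = pd.DataFrame({
    "2000_A1": [2, 3, 3, 4, 7, 1],
    "2000_A2": [8, 8, 2, 6, 8, 3],
    "2001_A1": [0, 9, 5, 5, 2, 1],
    "2001_A2": [3, 4, 3, 1, 0, 1],
    "Year": [2000, 2000, 2000, 2001, 2001, 2001],
    "Latitude": [43.65, 43.66, 43.67, 43.68, 43.69, 43.70],
    "Longitude": [-76.26] * 6,
})

# Collect the suffixes ("A1", "A2", ...) from headers shaped like 2000_A1
# (assumption: a 4-digit year, an underscore, then the kind)
kinds = sorted({
    m.group(1)
    for col in df.columns
    if (m := re.fullmatch(r"\d{4}_(.+)", str(col)))
})

for kind in kinds:
    df[kind] = np.nan  # stays NaN where no {year}_{kind} column exists
    for year in df["Year"].unique():
        source = f"{year}_{kind}"
        if source in df.columns:
            mask = df["Year"] == year
            # copy the matching year's values into the combined column
            df.loc[mask, kind] = df.loc[mask, source]

print(df[["Year"] + kinds])
```

Because the new columns start out as NaN, they end up with a float dtype; if every row is filled, `df[kinds] = df[kinds].astype(int)` would convert them back to integers.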
