Pandas create new columns based on column headers

I'm trying to create several new columns in a pandas dataframe that merge portions of multiple other columns based on year (yet another column) and column headers. The layout of the dataframe currently is:

2000_A1	2000_A2	2001_A1	2001_A2	Year	Latitude	Longitude
2	8	0	3	2000	43.65	-76.26
3	8	9	4	2000	43.66	-76.26
3	2	5	3	2000	43.67	-76.26
4	6	5	1	2001	43.68	-76.26
7	8	2	0	2001	43.69	-76.26
1	3	1	1	2001	43.70	-76.26

I would like to create a dataframe that has two new columns (A1 and A2) that combine values based on year, as long as the year matches the column header. Originally, these were two separate dataframes that I merged (pd.merge()) based on coordinates (lat/long), which is why they're organized in different manners. Ideally, the new dataframe would look like this:

2000_A1	2000_A2	2001_A1	2001_A2	Year	A1	A2	Latitude	Longitude
2	8	0	3	2000	2	8	43.65	-76.26
3	8	9	4	2000	3	8	43.66	-76.26
3	2	5	3	2000	3	2	43.67	-76.26
4	6	5	1	2001	5	1	43.68	-76.26
7	8	2	0	2001	2	0	43.69	-76.26
1	3	1	1	2001	1	1	43.70	-76.26

If possible, I would like to use a for loop to create these new columns. In my actual dataset, I have about 20 years and either 7 or 9 'A's for each year, so some sort of iterative approach would be great. I'm new to Python and struggling to figure out how to approach this, so any help would be much appreciated.

CodePudding user response：

You can use a combination of .melt and .pivot, optionally followed by a merge to preserve your original {year}_A{n} columns. This will likely be much faster than an iterative solution (including row-wise approaches such as .apply).

import pandas as pd

ff = df.copy()  # keep the original wide frame for the final merge
ident = ["Year", "Latitude", "Longitude"]
# split your variable column(s) into an identifier and value column
df = df.melt(id_vars=ident)
# parse the parts of {year}_A{n}
parts = df["variable"].str.split("_")
df["var_year"] = parts.str[0].astype(int)
df["var_kind"] = parts.str[1]
# keep only the values whose header year matches the row's Year
df = df[df["var_year"] == df["Year"]]
# pivot the frame to get A{n} columns
# (pass index by keyword; recent pandas versions reject positional arguments here)
df = df.pivot(index=ident, columns="var_kind", values="value")
df = df.reset_index()
df.columns.name = None  # drop the leftover "var_kind" label on the columns
df = pd.merge(ff, df, on=ident)  # retain original {year}_A{n}

	2000_A1	2000_A2	2001_A1	2001_A2	Year	Latitude	Longitude	A1	A2
0	2	8	0	3	2000	43.65	-76.26	2	8
1	3	8	9	4	2000	43.66	-76.26	3	8
2	3	2	5	3	2000	43.67	-76.26	3	2
3	4	6	5	1	2001	43.68	-76.26	5	1
4	7	8	2	0	2001	43.69	-76.26	2	0
5	1	3	1	1	2001	43.7	-76.26	1	1
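One caveat with the approach above: .pivot raises a ValueError if two rows share the same Year, Latitude and Longitude, because the index would then contain duplicates. A minimal sketch of a variant that keys on the row position instead, assuming the sample data from the question; the names row_id, long, wide and result are illustrative only:

```python
import pandas as pd

# sample frame from the question
df = pd.DataFrame({
    "2000_A1": [2, 3, 3, 4, 7, 1],
    "2000_A2": [8, 8, 2, 6, 8, 3],
    "2001_A1": [0, 9, 5, 5, 2, 1],
    "2001_A2": [3, 4, 3, 1, 0, 1],
    "Year": [2000, 2000, 2000, 2001, 2001, 2001],
    "Latitude": [43.65, 43.66, 43.67, 43.68, 43.69, 43.70],
    "Longitude": [-76.26] * 6,
})

id_cols = ["Year", "Latitude", "Longitude"]

# expose the row label as a column so it survives the melt
long = df.rename_axis("row_id").reset_index().melt(id_vars=["row_id"] + id_cols)

# split "{year}_A{n}" headers into the year part and the A{n} part
parts = long["variable"].str.split("_", n=1, expand=True)
long["var_year"] = parts[0].astype(int)
long["var_kind"] = parts[1]

# keep only the values whose header year matches the row's Year
long = long[long["var_year"] == long["Year"]]

# one row per original row label, one column per A{n}
wide = long.pivot(index="row_id", columns="var_kind", values="value")
wide.columns.name = None

# align on the index instead of merging on float coordinates
result = df.join(wide)
print(result)
```

Joining on the index also avoids relying on exact equality of the floating-point coordinates, which matters if they were computed separately rather than copied from the same frame.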

CodePudding user response：

You could use DataFrame.apply row by row (axis=1), reading each row's year to build the name of the column to pull the value from.

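def get_entries(line, new_entry="A1"):
    # the column is named "Year"; apply(axis=1) can upcast a mixed int/float
    # row to float, so cast back to int before building the header name
    year = int(line["Year"])
    used_feature = f"{year}_{new_entry}"
    return line[used_feature]

df["A1"] = df.apply(lambda line: get_entries(line, "A1"), axis=1)
df["A2"] = df.apply(lambda line: get_entries(line, "A2"), axis=1)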
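For the roughly 20 years and 7 or 9 'A' columns mentioned in the question, the same idea can also be written as a for loop over suffixes and years that copies values with a boolean mask rather than calling apply on every row. This is a minimal sketch, assuming every value column is named {year}_{suffix} with a four-digit year; the regular expression and the names pattern, suffixes and mask are illustrative and not part of the original answer:

```python
import re
import pandas as pd

# sample frame from the question
df = pd.DataFrame({
    "2000_A1": [2, 3, 3, 4, 7, 1],
    "2000_A2": [8, 8, 2, 6, 8, 3],
    "2001_A1": [0, 9, 5, 5, 2, 1],
    "2001_A2": [3, 4, 3, 1, 0, 1],
    "Year": [2000, 2000, 2000, 2001, 2001, 2001],
    "Latitude": [43.65, 43.66, 43.67, 43.68, 43.69, 43.70],
    "Longitude": [-76.26] * 6,
})

# assumed header layout: four-digit year, underscore, suffix (e.g. "2000_A1")
pattern = re.compile(r"^\d{4}_")
suffixes = sorted({c.split("_", 1)[1] for c in df.columns if pattern.match(c)})

for suffix in suffixes:
    # start with NaN so rows without a matching year column stay missing
    df[suffix] = float("nan")
    for year in df["Year"].unique():
        source = f"{year}_{suffix}"
        if source in df.columns:
            mask = df["Year"] == year
            df.loc[mask, suffix] = df.loc[mask, source]

print(df[["Year"] + suffixes])
```

Because the new columns start out as NaN, they end up with a float dtype; if every row finds a match, .astype(int) can be applied afterwards.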