I have a DataFrame that I am adding data to, but I do not gather the data for each row at the same time. Some of the columns are assembled in pieces.
The first time I have some data for a column, I can simply assign it:
import pandas as pd
df = pd.DataFrame()
df["name"] = pd.Series(["A", "B", "C", "D", "E", "F", "G", "H", "I"])
df["data"] = pd.Series([1, 2, 3])
new_data = pd.Series([4, 5, 6])
How can I append new_data to the "data" column?
Expected output:
name data
0 A 1.0
1 B 2.0
2 C 3.0
3 D 4.0
4 E 5.0
5 F 6.0
6 G NaN
7 H NaN
8 I NaN
Things I've tried:
df["data"] = new_data
# Overwrites the column instead of appending
df["data"] = df["data"].append(new_data)
# ValueError: cannot reindex on an axis with duplicate labels
df["data"] = pd.concat([df["data"], new_data])
# ValueError: cannot reindex on an axis with duplicate labels
column = df["data"]
df.drop(columns="data")
df["data"] = pd.concat([column, new_data])
# ValueError: cannot reindex on an axis with duplicate labels
df["data"] = pd.concat([df["data"], new_data], ignore_index=True)
# This doesn't appear to modify the column (but no errors)
df["data"] = pd.concat([df["data"], new_data], axis=1)
# ValueError: Columns must be same length as key
How can I accomplish this deceptively simple task? I believe I am not understanding how indexing works in Pandas.
CodePudding user response:
Before concatenating onto the existing column, you need to drop its NaN values first:
df['data'] = pd.concat([df['data'].dropna(), new_data], ignore_index=True)
print(df)
name data
0 A 1.0
1 B 2.0
2 C 3.0
3 D 4.0
4 E 5.0
5 F 6.0
6 G NaN
7 H NaN
8 I NaN
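To see why the dropna() matters, here is a minimal sketch using the question's data. The names without_dropna and with_dropna are just illustrative. It compares the two concatenations before they are assigned back to the column:

```python
import pandas as pd

df = pd.DataFrame()
df["name"] = pd.Series(["A", "B", "C", "D", "E", "F", "G", "H", "I"])
df["data"] = pd.Series([1, 2, 3])
new_data = pd.Series([4, 5, 6])

# Without dropna: the 6 NaN placeholders stay in front of the new values,
# so the result has 12 entries and the new values land at labels 9..11.
without_dropna = pd.concat([df["data"], new_data], ignore_index=True)

# With dropna: only the 3 real values precede the new ones,
# so the new values land at labels 3..5.
with_dropna = pd.concat([df["data"].dropna(), new_data], ignore_index=True)

# Column assignment aligns on the index: labels 0..5 are filled,
# labels 6..8 have no match in with_dropna and stay NaN.
df["data"] = with_dropna
print(df)
```

This also explains the "no errors, but nothing changes" attempt in the question: assigning without_dropna keeps only labels 0..8, which are exactly the old values followed by NaN.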
If the concatenated Series is longer than the original DataFrame, you can join or concat the result back onto the original DataFrame so that the extra rows are kept:
out = (df.drop(columns='data')
         .join(pd.concat([df['data'].dropna(), new_data], ignore_index=True)
                 .to_frame('data'),
               how='outer'))
# or
out = pd.concat([df.drop(columns='data'),
                 pd.concat([df['data'].dropna(), new_data], ignore_index=True)
                   .to_frame('data')],
                axis=1)
print(out)
name data
0 A 1.0
1 B 2.0
2 C 3.0
3 D 4.0
4 E 5.0
5 F 6.0
6 G 9.0
7 H 10.0
8 I 11.0
9 NaN 11.0
10 NaN 12.0
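As a self-contained sketch of the "longer than the original" case, assume a hypothetical longer_data Series with eight new values (this is not the data from the question). The outer join then extends the frame past its original nine rows:

```python
import pandas as pd

df = pd.DataFrame({"name": list("ABCDEFGHI")})
df["data"] = pd.Series([1, 2, 3])

# Hypothetical input for illustration: 8 new values, 4 through 11.
longer_data = pd.Series(range(4, 12))

# 3 existing values + 8 new values = 11 entries, labelled 0..10.
combined = pd.concat([df["data"].dropna(), longer_data], ignore_index=True)

# The outer join keeps the union of both indexes, so rows 9 and 10 are
# added with NaN in "name".
out = df.drop(columns="data").join(combined.to_frame("data"), how="outer")
print(out)
```

A plain df["data"] = combined would silently discard the entries at labels 9 and 10, because column assignment keeps only the DataFrame's existing index.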
CodePudding user response:
Each cell in a DataFrame has a row label (its index) and a column label. The following assigns the values 4, 5, 6 to the cells at row labels 3, 4, 5 in the "data" column (note that .loc slices include both endpoints):
df.loc[3:5, "data"] = [4,5,6]
However, if the right-hand side is a Series, rows 3 to 5 will be NaN:
df.loc[3:5, "data"] = pd.Series([4, 5, 6]) # NaN
This is because pandas tries to align the labels on the left-hand side with the labels on the right-hand side. The labels on the left are 3, 4, 5. The labels on the right are 0, 1, 2 (the default when you do not specify an index). Nothing matches, hence NaN.
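If new_data is already a Series, there are two ways around the alignment, sketched below with the question's data. Either strip the labels so the values are assigned by position, or relabel the Series to match the target rows. The variable names are illustrative only:

```python
import pandas as pd

df = pd.DataFrame({"name": list("ABCDEFGHI")})
df["data"] = pd.Series([1, 2, 3])
new_data = pd.Series([4, 5, 6])  # default labels 0, 1, 2

# Option 1: convert to a NumPy array, which has no labels,
# so the values are assigned by position.
by_position = df.copy()
by_position.loc[3:5, "data"] = new_data.to_numpy()

# Option 2: give the Series the labels of the target rows,
# so label alignment finds a match for each row.
by_label = df.copy()
relabeled = pd.Series(new_data.to_numpy(), index=range(3, 6))
by_label.loc[3:5, "data"] = relabeled

# For comparison: the unmodified Series aligns on labels 0..2,
# none of which are in 3..5, so the target rows end up NaN.
misaligned = df.copy()
misaligned.loc[3:5, "data"] = new_data

print(by_position)
```

Both options assume you know which row labels to fill (here 3 to 5).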