I want to group by text and, if at least one name exists for that text, loop through the names in names_set; for each name that does not appear in the group, add a row to the original df. Example:
import pandas as pd

names_set = {'A','B','C'}
initial df:
columns = ['id','text','name','start','end']
data = [
[1,"this is text 1", 'A',0,4],
[2,"this is text 1", 'B',4,5],
[3,"this is text 1", 'C',4,5],
[3,"this is text 2", 'A',6,8],
[4,'this is text 3',None, None, None],
[5,"this is text 4", 'B',10,13],
[6,"this is text 4", 'B',1,5]
]
df1 = pd.DataFrame(data= data,columns=columns)
df1
id text name start end
0 1 this is text 1 A 0.0 4.0
1 2 this is text 1 B 4.0 5.0
2 3 this is text 1 C 4.0 5.0
3 3 this is text 2 A 6.0 8.0
4 4 this is text 3 None NaN NaN
5 5 this is text 4 B 10.0 13.0
6 6 this is text 4 B 1.0 5.0
output:
columns2 = ['id','text','name','start','end']
data2 = [
[1,"this is text 1", 'A',0,4],
[2,"this is text 1", 'B',4,5],
[3,"this is text 1", 'C',4,5],
[3,"this is text 2", 'A',6,8],
[None,"this is text 2", 'B',None,None],
[None,"this is text 2", 'C',None,None],
[4,'this is text 3',None, None, None],
[None,"this is text 4", 'A',None,None],
[5,"this is text 4", 'B',10,13],
[6,"this is text 4", 'B',1,5],
[None,"this is text 4", 'C',None,None]
]
df2 = pd.DataFrame(data= data2,columns=columns2)
df2
id text name start end
0 1.0 this is text 1 A 0.0 4.0
1 2.0 this is text 1 B 4.0 5.0
2 3.0 this is text 1 C 4.0 5.0
3 3.0 this is text 2 A 6.0 8.0
4 NaN this is text 2 B NaN NaN
5 NaN this is text 2 C NaN NaN
6 4.0 this is text 3 None NaN NaN
7 NaN this is text 4 A NaN NaN
8 5.0 this is text 4 B 10.0 13.0
9 6.0 this is text 4 B 1.0 5.0
10 NaN this is text 4 C NaN NaN
The code I have so far:
g = df1.groupby('text')
text_names_group = df1.groupby('text')['name'].agg(list)
text_names_group

for names in text_names_group:                    # the list of names for one text
    if len(names) == 1 and names[0] is None:
        continue                                  # this text has no names at all
    cur_names = set(names)
    missing_names_per_text = names_set - cur_names
So missing_names_per_text holds the names that are missing for each text, but I don't know how to add them as rows to the original df, per text.
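For reference, one way this loop could be finished is to collect the missing rows and concatenate them onto df1. This is only a sketch (the new rows keep id, start and end empty, and the result is re-sorted by text so the added rows sit next to their group):

new_rows = []
for text, names in df1.groupby('text')['name'].agg(list).items():
    if len(names) == 1 and names[0] is None:
        continue                                   # no names for this text, keep as is
    for missing in names_set - set(names):         # names not present for this text
        new_rows.append({'id': None, 'text': text, 'name': missing,
                         'start': None, 'end': None})

df_out = (pd.concat([df1, pd.DataFrame(new_rows)], ignore_index=True)
            .sort_values('text', kind='mergesort', ignore_index=True))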
Thanks!
Edit: two rows can have the same text and name but different start and end values; see the row with id 6 that I added to the input above.
CodePudding user response:
First filter only the rows whose name is in names_set, add the missing text/name combinations by reindexing with a MultiIndex, then append the rows that did not match (selected with the inverted mask ~ in boolean indexing) and join everything together with concat:
names_set = {'A','B','C'}
m = df1['name'].isin(names_set)
df2 = (df1.set_index(['text', 'name'])
          .reindex(pd.MultiIndex.from_product([df1.loc[m, 'text'].unique(),
                                               sorted(names_set)],
                                              names=['text', 'name']))
          .reset_index())
df = pd.concat([df1[~m], df2]).sort_values(['text'], ignore_index=True)
print (df)
id text name start end
0 1.0 this is text 1 A 0.0 4.0
1 2.0 this is text 1 B 4.0 5.0
2 3.0 this is text 1 C 4.0 5.0
3 3.0 this is text 2 A 6.0 8.0
4 NaN this is text 2 B NaN NaN
5 NaN this is text 2 C NaN NaN
6 4.0 this is text 3 None NaN NaN
7 NaN this is text 4 A NaN NaN
8 5.0 this is text 4 B 10.0 13.0
9 NaN this is text 4 C NaN NaN
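Note that the set_index(...).reindex(...) step assumes each (text, name) pair occurs only once; with the duplicated 'B' rows for "this is text 4" added in the edit, the MultiIndex is no longer unique and reindex fails. That case is covered by the merge-based solution at the end. A quick sanity check (just a sketch):

# True only if every (text, name) pair is unique, i.e. reindex is safe to use
print(df1.loc[m].set_index(['text', 'name']).index.is_unique)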
With real data there could be a problem with sorting by the text column (lexicographic order need not match the original order), so here is a solution that keeps the original order of texts by mapping them through an enumerate dictionary:
columns = ['id','text','name','start','end']
data = [
[1,"this is text 10", 'A',0,4],
[2,"this is text 10", 'B',4,5],
[3,"this is text 10", 'C',4,5],
[3,"this is text 20", 'A',6,8],
[4,'this is text 13',None, None, None],
[5,"this is text 14", 'B',10,13]
]
df1 = pd.DataFrame(data= data,columns=columns)
names_set = {'A','B','C'}
m = df1['name'].isin(names_set)
sorting = {v: k for k, v in enumerate(df1['text'].drop_duplicates())}
print (sorting)
{'this is text 10': 0, 'this is text 20': 1, 'this is text 13': 2, 'this is text 14': 3}
mux = pd.MultiIndex.from_product([df1.loc[m, 'text'].unique(),
                                  sorted(names_set)],
                                 names=['text', 'name'])
df2 = df1.set_index(['text', 'name']).reindex(mux).reset_index()
df = pd.concat([df1[~m], df2]).sort_values(['text'],
                                           ignore_index=True,
                                           key=lambda x: x.map(sorting))
print (df2)
text name id start end
0 this is text 10 A 1.0 0.0 4.0
1 this is text 10 B 2.0 4.0 5.0
2 this is text 10 C 3.0 4.0 5.0
3 this is text 20 A 3.0 6.0 8.0
4 this is text 20 B NaN NaN NaN
5 this is text 20 C NaN NaN NaN
6 this is text 14 A NaN NaN NaN
7 this is text 14 B 5.0 10.0 13.0
8 this is text 14 C NaN NaN NaN
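The key parameter of sort_values needs pandas 1.1 or newer; on older versions the same ordering can be reproduced with a temporary helper column, for example (a sketch):

# same ordering without sort_values(key=...): sort on a mapped helper column
df = pd.concat([df1[~m], df2])
df = (df.assign(order=df['text'].map(sorting))
        .sort_values('order')
        .drop(columns='order')
        .reset_index(drop=True))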
A solution for duplicated names per group is similar:
names_set = {'A','B','C'}
m = df1['name'].isin(names_set)
sorting = {v: k for k, v in enumerate(df1['text'].drop_duplicates())}
#print (sorting)
Create a helper DataFrame df3 with MultiIndex.to_frame, add the missing text and name rows to it, and finally use a left join with the original DataFrame:
df3 = pd.MultiIndex.from_product([df1.loc[m, 'text'].unique(),
                                  sorted(names_set)],
                                 names=['text', 'name']).to_frame(index=False)
df3 = (pd.concat([df1.loc[~m, ['text', 'name']], df3])
         .sort_values(['text'], ignore_index=True, key=lambda x: x.map(sorting)))
print (df3)
text name
0 this is text 1 A
1 this is text 1 B
2 this is text 1 C
3 this is text 2 A
4 this is text 2 B
5 this is text 2 C
6 this is text 3 None
7 this is text 4 A
8 this is text 4 B
9 this is text 4 C
df2 = df3.merge(df1, how='left')
print (df2)
text name id start end
0 this is text 1 A 1.0 0.0 4.0
1 this is text 1 B 2.0 4.0 5.0
2 this is text 1 C 3.0 4.0 5.0
3 this is text 2 A 3.0 6.0 8.0
4 this is text 2 B NaN NaN NaN
5 this is text 2 C NaN NaN NaN
6 this is text 3 None 4.0 NaN NaN
7 this is text 4 A NaN NaN NaN
8 this is text 4 B 5.0 10.0 13.0
9 this is text 4 B 6.0 1.0 5.0
10 this is text 4 C NaN NaN NaN
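If the final frame should have the same column layout as df1 (id first), just reselect the columns, e.g.:

# optional: restore df1's original column order in the merged result
df2 = df2[['id', 'text', 'name', 'start', 'end']]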