Home > Blockchain >  How do I stack multiple columns of a dataframe so data is long and remove any blanks rows before the
How do I stack multiple columns of a dataframe so data is long and remove any blanks rows before the

Time:02-28

My dataframe is currently created to be wide with many columns, after the for statement below is executed. I want to stack multiple columns of data so that the dataframe is long and remove any blank rows from col4 before the output dataframe is generated. The reason for the latter part (remove blanks before output is generated) is because the dataframe will be too large for any output to be created with the blank values included.

code:

# dataframe
df0 = pd.DataFrame(data ={'col1':[123,123,456,456],'col2':['one two three',
 'green yellow','four five six','green yellow']})

# words to search for
search_words1 = ['one','three','four','six','green yellow']

# create columns for each search word and indicate if search word is found for each row
for n in search_words1:
        df0[n] = np.where(df0['col2'].str.contains(n),n,'')

# stack all search word columns created and remove blank rows in col4 before output is generated
df0 = pd.concat([
    df0[['col1']].melt(value_name='col3'), 
    df0[['one','three','four','six','green yellow']].melt(value_name='col4')], 
    axis=1)

df0.loc[:,['col3','col4']]

current output:

    col3    col4
0   123.0   one
1   123.0   
2   456.0   
3   456.0   
4   NaN     three
5   NaN 
6   NaN 
7   NaN 
8   NaN 
9   NaN 
10  NaN     four
11  NaN 
12  NaN 
13  NaN 
14  NaN     six
15  NaN 
16  NaN 
17  NaN     green yellow
18  NaN 
19  NaN     green yellow

desired output:

      col3  col4
0   123.0   one
1   123.0   three
2   123.0   green yellow
3   456.0   four
4   456.0   six
5   456.0   green yellow     

CodePudding user response:

try this:

search_words1 = ['one','three','four','six','green yellow']
search_words1 = '|'.join(search_words1)
df0['col2'] = df0.col2.str.findall(search_words1)
df0.explode('col2')
>>>
    col1    col2
0   123     one
0   123     three
1   123     green yellow
2   456     four
2   456     six
3   456     green yellow

CodePudding user response:

you can remove all nan and blank spaces like such.

col3 = df0['col3']
col4 = df0['col4']
three = col3[col3.notna()]
four = col4[col4 != ""]
print(three, '\n', four)

out:

0    123.0
1    123.0
2    456.0
3    456.0
Name: col3, dtype: float64 
 0              one
4            three
10            four
14             six
17    green yellow
19    green yellow
Name: col4, dtype: object
  • Related