When I use a do loop to iterate through an array of words in SAS and check whether each one exists in a string, it works. When I add a second loop over daily words, FINDW does not find words that are already in the final string.
So this works as expected:
word1="";
word2="pancake";
word3="";
word4="donut";
word5="";
array word {5} $ 250 word:;
final_str="pancake";
do i = 1 to 5;
  final_str_w_removed_hyphens = translate(final_str, " ", "-");
  if findw(final_str_w_removed_hyphens, word[i], " ") = 0
  then final_str = catx("-", final_str, word[i]);
end;
It gives me the expected final string of "pancake-donut".
However, when I bring days into it (there can be multiple breakfast names every day), findw starts showing this weird double-counting behavior. The data looks like this; it describes the food we ate for breakfast on a given day:
| breakfast_foods_jan1 | breakfast_foods_jan2 | breakfast_foods_jan3 | breakfast_foods_jan4 |
|---|---|---|---|
| "breakfast-pancake" | "breakfast-donut-breakfast-pancake" | "breakfast-donut-breakfast-pancake" | "breakfast-donut-breakfast-pancake" |
I want to find all of the unique breakfast items a person ate in a year. Here is my solution:
do j=1 to 4; /* January 1st - January 4th */
  do i=1 to 5; /* there can't be more than 5 breakfast items on any day */
    if scan(breakfast_foods[j], i, "-", "d") ne "breakfast"
    then daily_breakfast_foods[i] = scan(breakfast_foods[j], i, "-", "d");
    word_find = findw(translate(all_breakfast_foods, " ", "-"), daily_breakfast_foods[i], " ");
    if word_find=0 then all_breakfast_foods =
      catx("-", all_breakfast_foods, daily_breakfast_foods[i]);
  end;
end;
This returns a final all_breakfast_foods of "pancake-donut-pancake", so it double counts pancake! I have no clue why word_find is not finding pancake when it is clearly contained in the all_breakfast_foods string.
Here is what is happening in the loop:
| daily_breakfast_foods1 | daily_breakfast_foods2 | daily_breakfast_foods3 | daily_breakfast_foods4 | daily_breakfast_foods5 |
|---|---|---|---|---|
| | donut | | pancake | |

| all_breakfast_foods_debug1 | all_breakfast_foods_debug2 | all_breakfast_foods_debug3 | all_breakfast_foods_debug4 | all_breakfast_foods_debug5 |
|---|---|---|---|---|
| pancake | pancake | pancake donut | pancake donut | pancake donut pancake |
CodePudding user response:
Your list has some leading spaces (spaces after the - delimiter) that are causing trouble. You can use the LEFT() function to remove the leading spaces, and FINDW() has the T modifier to trim trailing spaces when hunting for words.
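As a rough illustration of those two pieces in isolation, here is a minimal sketch. The variable names (`list`, `word`, `plain`, `trimmed`) and the lengths are made up for the demo and are not from the question. It pulls the third item out of a list that has a space after the delimiter, left-aligns it, and then searches with and without the T modifier so you can compare the two values in the log.

```sas
data _null_;
  /* hypothetical names and lengths; character values are blank-padded to these lengths */
  length word $ 30 list $ 200;
  list = 'pancake-donut';

  /* third item is ' pancake' (leading space); LEFT() left-aligns it */
  word = left(scan('breakfast-donut- pancake', 3, '-'));

  /* word is passed as stored, including its trailing padding */
  plain = findw(list, word, '-');

  /* T modifier trims trailing blanks from the arguments before searching */
  trimmed = findw(list, word, '-', 't');

  put word= plain= trimmed=;
run;
```

A nonzero value is the position where the word was found; zero means no match.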
So let's make your test data listing into an actual dataset.
data have ;
input (breakfast_foods_jan1-breakfast_foods_jan4) ($40./);
cards;
breakfast-pancake
breakfast-donut-pancake
breakfast-donut-pancake
breakfast-donut- pancake
;
Now we can loop over the food variables and then loop over each list of foods and build up the list of unique foods.
data want;
  set have;
  array foods breakfast_foods_jan1-breakfast_foods_jan4 ;
  length next_food $30 food_list $200 ;
  do day=1 to dim(foods);
    do item=1 to countw(foods[day],'-');
      next_food = left(scan(foods[day],item,'-'));
      if next_food ne 'breakfast' then do;
        if not findw(food_list,next_food,'-','t') then
          food_list=catx('-',food_list,next_food);
      end;
    end;
  end;
  drop day item next_food;
run;
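Result: with the test data above, food_list should come out as "pancake-donut", with pancake listed only once.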
CodePudding user response:
The fix was not to strip on line 5 like I was trying. I had to add the strip() to the findw function in word_find on line 8 as such:
word_find = findw(translate(all_breakfast_foods, " ", "-"), STRIP(daily_breakfast_foods[i]), " ");
Soooooo odd.
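One possible explanation, offered as a guess rather than a diagnosis of the original program: a SAS character variable is padded with blanks to its declared length, so an array element passed to FINDW can carry trailing blanks as part of the search word. The sketch below uses made-up names and lengths (`all_foods`, `daily_food`, `searchable`) to compare the padded and stripped word side by side in the log.

```sas
data _null_;
  /* hypothetical names and lengths, not taken from the original program */
  length all_foods daily_food $ 250;
  all_foods  = 'pancake-donut';
  daily_food = 'pancake';            /* stored blank-padded to 250 characters */

  /* same approach as the question: turn hyphens into blanks, then search by word */
  searchable = translate(all_foods, ' ', '-');

  /* word argument passed exactly as stored */
  raw_find = findw(searchable, daily_food, ' ');

  /* leading and trailing blanks removed from the word first */
  stripped_find = findw(searchable, strip(daily_food), ' ');

  put raw_find= stripped_find=;
run;
```

If the two values differ, the padding on the word argument is what the search was tripping over, which would fit with STRIP() on the word being the change that fixed it.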