Why does an array of text lines appear to have an extra level of container?

I'm reading a file using the "array of lines" mode of Dyalog's ⎕nget:

lines _ _ ← ⎕nget '/usr/share/dict/words' 1

And it appears to work:

          lines[1]
 10th

But the individual elements don't appear to be character arrays:

          line ← lines[1]
          line
 10th
          ≢ line
1
          ⍴ line

Here we see that the first line has a tally of 1 and a shape of the empty array. I can't index into it any further; lines[1][1] or line[1] is a RANK ERROR. If I use ⊂ on the RHS I can assign the value to multiple variables at once and get the same behavior for each variable. But if I do a multiple assignment without the left shoe, I get this:

          word rest ← line
          word
10th
          ≢ word
4
          ⍴ word
4

At last we have the character array I expected! Yet it was not evidently separated from anything else hidden in line; the other variable is identical:

          rest
10th
          ≢ rest
4
          ⍴ rest
4
          word ≡ rest
1

Significantly, when I look at word it has no leading space, unlike line. So it seems that the individual elements of the content vector returned by ⎕nget are further wrapped in something that doesn't show up in shape or tally, and can't be indexed into, but when I use a destructuring assignment it unwraps them. It feels rather like the multiple-values stuff in Common Lisp.

If someone could explain what's going on here, I'd appreciate it. I feel like I'm missing something incredibly basic.

CodePudding user response：

The result of reading a file with "array of lines" mode is a nested array. It is specifically a nested vector of character vectors where each character vector is a line from your text file.

For example, take \tmp\test.txt here:

my text file
has 3
lines

If we read this in, we can inspect the contents:

      (content encoding newline) ← ⎕nget'\tmp\test.txt' 1
      ≢ content     ⍝ How many lines?
3
      ≢¨content     ⍝ How long is each line?
12 5 5
      content[2]    ⍝ Indexing returns a scalar (non-simple)
┌─────┐
│has 3│
└─────┘
      2⊃content     ⍝ Use pick to get the contents of the 2nd scalar
has 3
      ⊃content[2]   ⍝ Disclose the non-simple scalar
has 3
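
The same behaviour can be reproduced without a file, which may make the multiple assignment in the question easier to follow. This is only a sketch: v is a hand-built stand-in for the content returned by ⎕NGET, and a and b are illustrative names. With a scalar on the right, strand assignment extends the scalar and gives each name the item inside it, which is why word and rest both came out as plain character vectors.

      v ← 'my text file' 'has 3' 'lines'   ⍝ hand-built stand-in for the file content
      ⍬ ≡ ⍴ v[2]    ⍝ v[2] is a scalar, so its shape is the empty vector
1
      ≡ v[2]        ⍝ depth 2: a scalar enclosing a simple vector
2
      ⍴ ⊃v[2]       ⍝ once disclosed, an ordinary 5-character vector
5
      a b ← v[2]    ⍝ scalar extension: each name receives the disclosed item
      a ≡ b
1
      ⍴ a
5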

As you probably read in the online documentation, the default behaviour of ⎕NGET is to bring in a simple (non-nested) character vector with embedded newline characters. Which characters mark a line ending in the file typically depends on the operating system.

      (content encoding newline) ← ⎕nget'\tmp\test.txt' 
      newline   ⍝ Unicode code points for line endings in this file  (Microsoft Windows)
13 10
      content
my text file
has 3       
lines       
            
      content ∊ ⎕ucs 10 13
0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1

But with "array of lines" mode, you get a nested result.
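
To see how the two results relate, here is a rough sketch that splits a simple vector into nested lines by hand. It assumes a reasonably recent Dyalog in which ⊆ with a left argument is partition, and that each line ends in a single ⎕UCS 10; s and split are illustrative names, and s is built by hand rather than read from a file.

      s ← 'my text file',(⎕UCS 10),'has 3',(⎕UCS 10),'lines',⎕UCS 10
      ≢ s                        ⍝ one simple vector of 25 characters
25
      split ← (s≠⎕UCS 10)⊆s      ⍝ cut at the newlines and drop them
      ≢ split                    ⍝ now a nested vector of 3 lines
3
      ≢¨split
12 5 5

Note that this partition silently drops empty lines, so it is not a full replacement for asking ⎕NGET for the lines directly.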

For a quick introduction to nested arrays and the array model, see Stefan Kruger's LearnAPL book.

CodePudding user response：

If you turn boxing on, it's easier to see what's happening. Each element is an enclosed character vector. Use pick ⊃ instead of bracket index [] to get the actual item.

  words ← ⊃⎕nget'/usr/share/dict/words'1
  ]box on -s=max
  ⍴words
┌→─────┐
│235886│
└~─────┘
  
  words[10]
┌─────────┐
│ ┌→────┐ │
│ │Aaron│ │
│ └─────┘ │
└∊────────┘
  
  10⊃words ⍝ use pick
┌→────┐
│Aaron│
└─────┘
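
Pick also accepts a path, which is one way to do what lines[1][1] was trying to do in the question. A small sketch, using a hand-built vector w in place of the real word list (results shown with boxing off):

      w ← 'my text file' 'has 3' 'lines'   ⍝ illustrative stand-in for words
      2 1⊃w         ⍝ 2nd item, then its 1st character
h
      (2⊃w)[1]      ⍝ same thing: pick first, then bracket index
h
      ⊃¨w           ⍝ first character of every item
mhl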