I have a question about Intel DSA (Data Streaming Accelerator).
According to the documentation, Intel DSA exposes to programmers various portals, which are MMIO registers that users could write to using ENQCMD and MOVDIR64B, i.e. the data transfer granularity is 64 bytes.
Each portal is one page in size, i.e. the MMIO area is 4096 bytes. Since each work descriptor is 64 bytes, the portal supports up to 4096 / 64 = 64 worker threads submitting work descriptors at the same time, but they must submit to different addresses within the portal (the 4096-byte page) so that they do not overwrite each other.
This design allows work submission to scale, but the portal is just an entry point; it is not the shared work queue itself. Hence, when multiple work descriptors are submitted to different addresses, the DSA then has to "move" the work descriptors into the shared work queue.
These are the points that confuse me:
- First of all, from the DSA's perspective, the portal is a 4096-byte register. If any worker thread writes to this register, its value changes; so for the DSA to recognize the change, there must be "something" in the DSA that checks whether the value of the register has changed. If my understanding is correct, what is that "something"? How is it implemented in the DSA?
- Recall that one 4096-byte portal (MMIO register) supports up to 64 submissions at a time. Suppose a work descriptor has just been submitted to address 0, the DSA detects it and starts validating it, and the DSA has not yet confirmed that the descriptor can be enqueued to the work queue. If another CPU then submits another work descriptor to the same address 0 shortly afterwards, it will overwrite the previous descriptor while the DSA is still validating it. Does the DSA therefore support a locking mechanism that prevents MMIO writes while it is validating work descriptors? Or does the DSA first copy the work descriptor (64 bytes) to device memory (also while holding the lock) and then validate the newly submitted descriptors? The latter would only require holding the lock during the copy to device memory: whenever the register value changes (or perhaps when a timer ticks and the DSA notices that the register value has changed), the lock is acquired and the one or more 64-byte slots that changed are copied to device memory. When the copy finishes, the lock is released and the timer starts. Before the next timer tick, multiple work descriptors may be submitted to the portal.
- If a locking mechanism is supported, then submissions made while the DSA holds the lock will not be written to the portal. In that case, do those submissions fail? Or are they held back temporarily and written to the portal once the lock is released?
- How does the DSA support 64 worker threads submitting work descriptors continuously while maintaining consistency, without overwriting descriptors that were submitted earlier but have not yet been validated?
- The reason I ask these questions is that I am not clear about what the software (device driver) should do and what the DSA itself should do when work descriptors are submitted.
- If I remember correctly, the DSA supports four portals. Since one portal supports up to 64 work submissions at a time, does that mean the DSA supports at most 4 * 64 = 256 work submissions at the same time?
CodePudding user response:
It makes no difference to the device which address within the portal is used. Section 9.3 of the Intel DSA Architecture Specification says that bits 11:6 of the portal address are ignored.
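To make the address arithmetic concrete, here is a minimal Python sketch of how the 64 cache-line-sized addresses inside one portal page relate to each other. It is only an illustration of the bit layout, not device code; the base address and the helper names `portal_page` and `ignored_slot` are made up for this example.

```python
PORTAL_SIZE = 4096      # one portal occupies one 4 KiB page
DESCRIPTOR_SIZE = 64    # ENQCMD and MOVDIR64B write 64 bytes at a time


def portal_page(address: int) -> int:
    """Return the page base of an address (bits 11:0 cleared)."""
    return address & ~(PORTAL_SIZE - 1)


def ignored_slot(address: int) -> int:
    """Return bits 11:6, the 64-byte slot index that the device ignores."""
    return (address >> 6) & 0x3F


base = 0x7F0000001000  # hypothetical page-aligned address of a mapped portal

# Every 64-byte-aligned address inside the page.
slots = [base + i * DESCRIPTOR_SIZE for i in range(PORTAL_SIZE // DESCRIPTOR_SIZE)]

# All of them decode to the same portal; only the ignored bits differ.
same_portal = {portal_page(a) for a in slots}
```

There are 64 such addresses, and they differ only in bits 11:6, so from the device's point of view they all name the same portal.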
There is no register associated with a portal. There's no possibility of a descriptor submitted to a portal overwriting a previous descriptor. When a descriptor arrives at a portal, it is either written into the WQ storage or rejected.
For a shared WQ, there is no reason for software to use any offset into the page other than 0. There can potentially be as many simultaneous work submissions to an SWQ as the number of hardware threads in the system. For example, for a 4-socket server, that would be a lot. (Although in practice, each thread would hopefully use a DSA instance on its own socket instead of having all of the threads use the same DSA instance.)
For a dedicated WQ, only one thread should be submitting work to a single WQ. But it doesn't matter to the device. As far as the device is concerned, any number of threads could be submitting work to a DWQ, all using the exact same portal address. The device would accept descriptors until the WQ is full and drop the rest. There would be no way for software to tell which descriptors were accepted and which were dropped, which is why an SWQ should be used instead.
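The accept-or-reject behavior can be pictured with a toy model in Python. This is a sketch of the semantics described above, not of the hardware: the class `WorkQueueModel` and its methods are invented for illustration, and it assumes (as I understand the spec) that SWQ submissions use ENQCMD, which reports whether the descriptor was accepted, while DWQ submissions use MOVDIR64B, which reports nothing back.

```python
from collections import deque


class WorkQueueModel:
    """Toy model: a descriptor is either stored in the WQ or rejected on arrival."""

    def __init__(self, size: int):
        self.size = size
        self.entries = deque()

    def _arrive(self, descriptor) -> bool:
        if len(self.entries) >= self.size:
            return False              # rejected; nothing already queued is overwritten
        self.entries.append(descriptor)
        return True

    def submit_shared(self, descriptor) -> bool:
        """ENQCMD-style submission: the caller learns whether it was accepted."""
        return self._arrive(descriptor)

    def submit_dedicated(self, descriptor) -> None:
        """MOVDIR64B-style submission: the outcome is not reported to the caller."""
        self._arrive(descriptor)


# Shared WQ: the third submission is rejected and the caller can retry it.
swq = WorkQueueModel(size=2)
results = [swq.submit_shared(f"desc-{i}") for i in range(3)]

# Dedicated WQ: the third descriptor is dropped and software never finds out.
dwq = WorkQueueModel(size=2)
for i in range(3):
    dwq.submit_dedicated(f"desc-{i}")
```

In the shared case `results` tells the submitter which descriptor to retry; in the dedicated case software has to track WQ occupancy itself, which only works if a single submitter owns the queue.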
The reason for software to use multiple addresses for work submission from a single CPU thread to a DWQ is that it allows multiple streaming writes to the same portal to be in flight at the same time.
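One simple way to do that is to rotate through the 64-byte offsets of the portal page on successive submissions. The Python sketch below only generates the offsets; the helper name `portal_offsets` is hypothetical and the actual write to the portal is left out.

```python
from itertools import cycle, islice

PORTAL_SIZE = 4096
DESCRIPTOR_SIZE = 64


def portal_offsets():
    """Yield 0, 64, 128, ..., 4032, then wrap around to 0 again."""
    return cycle(range(0, PORTAL_SIZE, DESCRIPTOR_SIZE))


# Offsets a single thread would use for its first 66 submissions.
first = list(islice(portal_offsets(), 66))
```

Since the device ignores these offset bits, the rotation changes nothing about which WQ receives the descriptor; it only gives consecutive writes from the CPU distinct target addresses.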
There are four portals per WQ, but each portal has different behavior. Software would typically only use one portal for a WQ, depending on the behavior it needs. (The difference between them is outside the scope of this answer, but is described in section 3.3 of the Intel DSA spec.)