I have a fairly fundamental question about how caches (e.g. L1, L2) are physically connected (in RTL) to cores such as the Arm Cortex-A53. How many read/write ports/buses are there, and what is their width? Is it a 32-bit bus? How do I calculate the theoretical maximum bandwidth/throughput of the L1 cache connected to an Arm Cortex-A53 running at 1400 MHz?
There is a lot of information on the web about how caches work, but I couldn't find how they are connected.
CodePudding user response:
You can get this information from the ARM documentation (which is fairly complete compared to that of other vendors):
L1 data cache:
(configurable) sizes of 8KB, 16KB, 32KB, or 64KB.
Data side cache line length of 64 bytes.
256-bit write interface to the L2 memory system.
128-bit read interface to the L2 memory system.
64-bit read path from the data L1 memory system to the datapath.
128-bit write path from the datapath to the L1 memory system.
Note that the documentation refers to a single datapath (it says so explicitly when there are multiple), so there is almost certainly one port, unless two ports share the same datapath, which would be surprising.
L2 cache:
All bus interfaces are 128-bits wide.
Configurable L2 cache size of 128KB, 256KB, 512KB, 1MB and 2MB.
Fixed line length of 64 bytes.
General information:
One to four cores, each with an L1 memory system and a single shared L2 cache.
In-order pipeline with symmetric dual-issue of most instructions.
Harvard Level 1 (L1) memory system with a Memory Management Unit (MMU).
Level 2 (L2) memory system providing cluster memory coherency, optionally including an L2 cache.
The Level 1 (L1) data cache controller generates the control signals for the associated embedded tag, data, and dirty RAMs, and arbitrates between the different sources requesting access to the memory resources. The data cache is 4-way set associative and uses a Physically Indexed, Physically Tagged (PIPT) scheme for lookup that enables unambiguous address management in the system.
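To make the geometry concrete, here is a small sketch (my own illustration, not from the ARM documentation) that derives the number of sets and the tag/index/offset split for the 32 KB, 4-way, 64-byte-line configuration described above. Under PIPT, both the index and the tag come from the physical address:

```python
# Sketch: geometry of a 4-way set-associative L1 data cache with
# 64-byte lines, assuming the 32 KB configuration mentioned above.

cache_size = 32 * 1024   # bytes (configurable: 8/16/32/64 KB)
line_size = 64           # bytes
ways = 4                 # set associativity

sets = cache_size // (line_size * ways)      # 32768 / 256 = 128 sets
offset_bits = line_size.bit_length() - 1     # log2(64)  = 6
index_bits = sets.bit_length() - 1           # log2(128) = 7

def split_address(paddr):
    """Split a physical address into (tag, set index, line offset).
    Under PIPT, index and tag are both taken from the physical address."""
    offset = paddr & (line_size - 1)
    index = (paddr >> offset_bits) & (sets - 1)
    tag = paddr >> (offset_bits + index_bits)
    return tag, index, offset

print(sets, index_bits, offset_bits)   # → 128 7 6
print(split_address(0x80001234))       # → (262144, 72, 52)
```

With the other configurable sizes (8/16/64 KB), only `sets` and `index_bits` change; the 6-bit line offset is fixed by the 64-byte line length.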
The Store Buffer (STB) holds store operations when they have left the load/store pipeline and have been committed by the DPU. The STB can request access to the cache RAMs in the DCU, request the BIU to initiate linefills, or request the BIU to write out the data on the external write channel. External data writes are through the SCU.
The STB can merge several store transactions into a single transaction if they are to the same 128-bit aligned address.
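The merging condition can be illustrated with a tiny model (hypothetical, not the actual STB logic): two stores are merge candidates only if they fall within the same 128-bit (16-byte) aligned region.

```python
# Hypothetical model of the STB merge window, not the actual RTL:
# stores can merge only if they hit the same 128-bit aligned region.

MERGE_WINDOW = 16  # bytes: one 128-bit aligned region

def same_merge_window(addr_a, addr_b):
    """True if both byte addresses fall in the same 16-byte aligned region."""
    return (addr_a // MERGE_WINDOW) == (addr_b // MERGE_WINDOW)

# Four consecutive 4-byte stores starting at a 16-byte boundary all land
# in one window, so they could leave the STB as a single transaction.
print(all(same_merge_window(0x1000, 0x1000 + 4 * i) for i in range(4)))  # True

# A store at 0x100C and one at 0x1010 straddle a window boundary.
print(same_merge_window(0x100C, 0x1010))  # False
```

This is why writing a buffer with sequential, aligned stores tends to use the external write channel more efficiently than scattered stores.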
An upper bound for the L1 bandwidth is frequency * interface_width * number_of_paths. With the 64-bit read path and 128-bit write path listed above, this gives 1400 MHz * 64 bit * 1 = 10.43 GiB/s from the L1 (reads) and 1400 MHz * 128 bit * 1 = 20.86 GiB/s to the L1 (writes). In practice, concurrency can be a problem, and it is hard to know which part of the chip will be the limiting factor.
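The calculation can be spelled out directly from the figures above:

```python
# Theoretical peak L1 bandwidth: frequency * interface_width * number_of_paths.

freq_hz = 1400e6          # 1400 MHz core clock
read_path_bits = 64       # read path: L1 memory system -> datapath
write_path_bits = 128     # write path: datapath -> L1 memory system
paths = 1                 # single datapath

GiB = 2**30

read_bw = freq_hz * (read_path_bits // 8) * paths / GiB
write_bw = freq_hz * (write_path_bits // 8) * paths / GiB

print(f"L1 read  bandwidth: {read_bw:.2f} GiB/s")   # → 10.43 GiB/s
print(f"L1 write bandwidth: {write_bw:.2f} GiB/s")  # → 20.86 GiB/s
```

Note the use of GiB (2^30 bytes) rather than GB (10^9 bytes); in GB the figures would be 11.2 GB/s and 22.4 GB/s.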
Note that many other documents are available, but this one is the most interesting. I am not sure you can get the physical RTL-level information about the caches, since I expect it to be confidential and thus not publicly available (presumably because competitors could benefit from it).