[TVM&VTA] VTA Concept and Design

VTA Concept

The Versatile Tensor Accelerator (VTA) is an open-source, RISC-like deep learning accelerator that originated at the University of Washington.
As a subproject of TVM, it provides a hardware design, drivers, a JIT runtime, and an optimizing compiler stack built on TVM.
https://tvm.apache.org/docs/topic/vta/index.html

VTA Parameter

VTA is HLS-based and exposes parameters that are applied at synthesis time, such as the tensor intrinsic shape, clock frequency, pipelining, data type widths, and on-chip buffer sizes.
The parameters live in 3rdparty/vta-hw/config/vta_config.json, and each one is described at https://tvm.apache.org/docs/topic/vta/dev/config.html.
The figure below shows how the parameters map onto the VTA core, followed by the configuration file I currently use on the ZCU104.
- During synthesis, vta_config.py generates vta_config.tcl from vta_config.json.

{
  "TARGET" : "zcu104",
  "HW_FREQ" : 300,
  "HW_CLK_TARGET" : 3,
  "HW_VER" : "0.0.1",
  "LOG_INP_WIDTH" : 3,
  "LOG_WGT_WIDTH" : 3,
  "LOG_ACC_WIDTH" : 5,
  "LOG_BATCH" : 0,
  "LOG_BLOCK" : 4,
  "LOG_UOP_BUFF_SIZE" : 15,
  "LOG_INP_BUFF_SIZE" : 15,
  "LOG_WGT_BUFF_SIZE" : 18,
  "LOG_ACC_BUFF_SIZE" : 17
}
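
Most fields in this file are log2-encoded, so it helps to expand them once. The sketch below is my own helper, not part of VTA: the struct and function names are hypothetical, and it assumes that the buffer sizes are given in bytes and that one buffer entry holds one full tensor tile (BATCH x BLOCK for inputs and accumulators, BLOCK x BLOCK for weights).

```cpp
#include <cassert>
#include <cstdint>

// Log2-encoded fields, copied from the vta_config.json listing above.
struct VtaConfig {
  int log_inp_width, log_wgt_width, log_acc_width;
  int log_batch, log_block;
  int log_inp_buff_size, log_wgt_buff_size, log_acc_buff_size;
};

// Hypothetical helper type holding the expanded values.
struct VtaDerived {
  uint32_t inp_width_bits, wgt_width_bits, acc_width_bits;
  uint32_t batch, block;
  uint32_t inp_depth, wgt_depth, acc_depth;  // buffer depth in tensor tiles
};

inline uint32_t pow2(int log2_value) {
  assert(log2_value >= 0 && log2_value < 31);
  return 1u << log2_value;
}

VtaDerived derive(const VtaConfig& c) {
  VtaDerived d{};
  d.inp_width_bits = pow2(c.log_inp_width);
  d.wgt_width_bits = pow2(c.log_wgt_width);
  d.acc_width_bits = pow2(c.log_acc_width);
  d.batch = pow2(c.log_batch);
  d.block = pow2(c.log_block);
  // Assumption: one buffer entry is one whole tile, and sizes are in bytes.
  const uint32_t inp_tile_bytes = d.inp_width_bits * d.batch * d.block / 8;
  const uint32_t wgt_tile_bytes = d.wgt_width_bits * d.block * d.block / 8;
  const uint32_t acc_tile_bytes = d.acc_width_bits * d.batch * d.block / 8;
  d.inp_depth = pow2(c.log_inp_buff_size) / inp_tile_bytes;
  d.wgt_depth = pow2(c.log_wgt_buff_size) / wgt_tile_bytes;
  d.acc_depth = pow2(c.log_acc_buff_size) / acc_tile_bytes;
  return d;
}

// The ZCU104 values from the listing above.
VtaConfig zcu104_example() {
  return VtaConfig{3, 3, 5, 0, 4, 15, 18, 17};
}
```

Under these assumptions the ZCU104 file expands to 8-bit inputs and weights, 32-bit accumulators, a 1 x 16 by 16 x 16 intrinsic, and buffer depths of 2048, 1024, and 2048 tiles for input, weight, and accumulator memory.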

VTA Overview

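VTA is a RISC-like processor for dense linear algebra. It uses a decoupled access-execute design to hide memory access latency.
VTA consists of four modules that communicate with one another through FIFO queues and local memory blocks.
- fetch module: loads the instruction stream from DRAM and decodes it.
- load module: loads input and weight tensors from DRAM into on-chip memory.
- compute module: performs linear algebra with the GEMM core and general computation with the tensor ALU. It also loads data from DRAM into the register file (the accumulator data, according to the HLS code) and micro-op kernels into the micro-op cache.
- store module: stores the results produced by the compute module to DRAM.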

VTA Architectural Overview

Instruction Set Architecture(ISA)

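The VTA ISA consists of four CISC instructions with variable execution latency. GEMM and ALU are micro-coded instructions, that is, a single instruction is implemented as a sequence of smaller operations (micro-ops).
- LOAD: loads a 2D tensor from DRAM into the input buffer, weight buffer, or register file. It can also load a micro-kernel into the micro-op cache. Dynamic padding is supported when loading input and weight tiles.
- GEMM: executes a micro-op sequence of matrix multiplications over input and weight tensors and adds the result to a register-file tensor.
- ALU: executes a micro-op sequence of matrix-matrix ALU operations on register-file tensor data (the HLS code implements MIN, MAX, ADD, and SHR).
- STORE: stores a 2D tensor from the output buffer to DRAM.
LOAD is executed by the load or compute module depending on the target memory, GEMM and ALU are executed by the compute module, and STORE is executed by the store module (details in the following sections).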

Dataflow Execution

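To synchronize concurrently running tasks, each pair of producer/consumer modules is connected by read-after-write (RAW) and write-after-read (WAR) dependence queues.
The figure and pseudo-code below describe how a module executes a given instruction (walkthrough produced with ChatGPT):
1. An instruction (insn) is popped from the instruction queue.
2. Flags are set that decide whether MODULE must wait on its producer or its consumer.
3. Flags are set that decide whether MODULE must notify its producer or its consumer once the current task is done.
4. If MODULE must wait on the producer, it waits until the producer's RAW queue holds a token. While the queue is empty it keeps waiting; otherwise it pops the token.
5. Likewise, if MODULE must wait on the consumer, it checks the consumer's write-after-read (WAR) queue.
6. Once the wait conditions are met and the task has run, MODULE pushes a token into the producer's WAR queue if it must notify the producer.
7. If it must notify the consumer, it pushes a token into the consumer's RAW queue.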
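
The steps above can be modeled in a few lines of plain C++. This is only a single-threaded sketch under my own assumptions: `Stage`, `DepFlags`, and `try_execute` are hypothetical names, and instead of spinning on an empty queue the model simply reports that the stage is blocked.

```cpp
#include <cassert>
#include <queue>

// One-bit dependence flags carried by every instruction. The names follow the
// pop_prev_dep / pop_next_dep / push_prev_dep / push_next_dep fields in the HLS code.
struct DepFlags {
  bool pop_prev, pop_next, push_prev, push_next;
};

using TokenQueue = std::queue<bool>;

// Hypothetical single-threaded model of one pipeline stage.
struct Stage {
  TokenQueue* raw_from_prev;  // producer -> this stage (RAW tokens)
  TokenQueue* war_from_next;  // consumer -> this stage (WAR tokens)
  TokenQueue* war_to_prev;    // this stage -> producer
  TokenQueue* raw_to_next;    // this stage -> consumer
  int executed;               // number of instructions completed

  // Tries to run one instruction. Returns false and changes nothing
  // if a required dependence token is not available yet.
  bool try_execute(const DepFlags& f) {
    assert(!f.pop_prev || raw_from_prev != nullptr);
    assert(!f.pop_next || war_from_next != nullptr);
    assert(!f.push_prev || war_to_prev != nullptr);
    assert(!f.push_next || raw_to_next != nullptr);
    if (f.pop_prev && raw_from_prev->empty()) return false;
    if (f.pop_next && war_from_next->empty()) return false;
    if (f.pop_prev) raw_from_prev->pop();
    if (f.pop_next) war_from_next->pop();
    ++executed;  // stand-in for the actual load/compute/store work
    if (f.push_prev) war_to_prev->push(true);
    if (f.push_next) raw_to_next->push(true);
    return true;
  }
};
```

Wiring a load stage and a compute stage to a shared pair of queues shows the handshake: compute cannot start before load has pushed a RAW token, and a second load that needs the same buffer cannot start before compute has pushed a WAR token back.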

Pipeline Expandability

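The default VTA design consists of four modules that form a 3-stage load-compute-store task pipeline.
The VTA pipeline can be extended with more stages, but this carries a cost (logic overhead), so the designers settled on a 3-stage pipeline.
- For example, the tensor ALU could be separated from the GEMM core, as in the TPU, to form a load-gemm-activate-store pipeline.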

Microarchitectural Overview

Each module is described below (see 3rdparty/vta-hw/hardware/xilinx/sources/vta.cc).

Fetch Module

The fetch module reads the instruction stream from DRAM, decodes each instruction, and sends it to the command queue of another module.
It is controlled through three registers: insn_count, insns, and control (the module is written in HLS).
- insns: address of the instruction stream in DRAM
- insn_count: number of instructions to fetch (the number of instructions moved by DMA)
- control: starts the fetch module
Decoded instructions are routed to the command queue of the load, compute, or store module according to their contents:
- STORE insn: pushed to the store CMD_QUEUE
- GEMM/ALU insn: pushed to the compute CMD_QUEUE
- LOAD insn that loads a micro-op kernel or register-file data: pushed to the compute CMD_QUEUE
- LOAD insn that loads input or weight data: pushed to the load CMD_QUEUE
When a CMD_QUEUE is full, the fetch module stalls until space frees up, so the queues should be deep enough to keep the modules running in parallel.

Code Analysis

The fetch code is simple: it reads an instruction from DRAM, checks the opcode, and pushes the instruction to the CMD_QUEUE of the appropriate module.

/*! log2 of instruction data type width */
#define VTA_LOG_INS_WIDTH 7
/*! Instruction data type width */
#define VTA_INS_WIDTH (1 << VTA_LOG_INS_WIDTH)

typedef ap_uint<VTA_INS_WIDTH> insn_T;

typedef struct {
  /*! \brief The instruction opcode */
  uint64_t opcode         : VTA_OPCODE_BIT_WIDTH;
  /*! \brief Unused in this instruction */
  uint64_t pop_prev_dep   : 1;
  /*! \brief Pop dependence token from GEMM stage */
  uint64_t pop_next_dep   : 1;
  /*! \brief Unused in this instruction */
  uint64_t push_prev_dep  : 1;
  /*! \brief Push dependence token to GEMM stage */
  uint64_t push_next_dep  : 1;
  /*! \brief Padding */
  uint64_t pad_0          : 64 - VTA_OPCODE_BIT_WIDTH - 4;
  /*! \brief Padding */
  uint64_t pad_1          : 64;
} VTAGenericInsn;

union VTAInsn {
  /*! \brief VTA generic instruction */
  VTAGenericInsn generic;
  /*! \brief VTA load/store instruction */
  VTAMemInsn mem;
  /*! \brief VTA GEMM instruction */
  VTAGemInsn gemm;
  /*! \brief VTA ALU instruction */
  VTAAluInsn alu;
};


void fetch(
  uint32_t insn_count,
  volatile insn_T *insns,
  hls::stream<insn_T> &load_queue,
  hls::stream<insn_T> &gemm_queue,
  hls::stream<insn_T> &store_queue) {
PRAGMA_HLS(HLS INTERFACE s_axilite port = insn_count bundle = CONTROL_BUS offset = VTA_FETCH_INSN_COUNT_OFFSET)
#pragma HLS INTERFACE m_axi port = insns offset = slave bundle = ins_port
#pragma HLS INTERFACE axis port = load_queue
#pragma HLS INTERFACE axis port = gemm_queue
#pragma HLS INTERFACE axis port = store_queue
#pragma HLS INTERFACE s_axilite port = return bundle = CONTROL_BUS

  INSN_DECODE: for (int pc = 0; pc < insn_count; pc++) {
#pragma HLS PIPELINE
    // Read instruction fields
    insn_T raw_insn = insns[pc];
    VTAInsn insn;
    insn.generic = *((VTAGenericInsn *) &raw_insn);
    // Do some partial decoding
    opcode_T opcode = insn.generic.opcode;
    memop_id_T memory_type = insn.mem.memory_type;
    // Push to appropriate instruction queue
    if (opcode == VTA_OPCODE_STORE) {
      store_queue.write(raw_insn);
    } else if (opcode == VTA_OPCODE_LOAD) {
      if (memory_type == VTA_MEM_ID_INP || memory_type == VTA_MEM_ID_WGT) {
        load_queue.write(raw_insn);
      } else {
        gemm_queue.write(raw_insn);
      }
    } else {
      gemm_queue.write(raw_insn);
    }
  }
}
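
The routing decision in the listing above can be isolated as a small pure function, which makes the rules easy to check without an HLS toolchain. The enums below are my own stand-ins for `VTA_OPCODE_*` and `VTA_MEM_ID_*`; their numeric values are not meant to match the real encoding.

```cpp
#include <cassert>

// Hypothetical stand-ins for the VTA opcode and memory-type constants.
enum class Opcode { Load, Store, Gemm, Alu, Finish };
enum class MemId { Uop, Wgt, Inp, Acc, Out };
enum class Target { LoadQueue, ComputeQueue, StoreQueue };

// Mirrors the if/else chain of fetch(): STORE goes to the store queue,
// LOAD of input/weight goes to the load queue, everything else
// (GEMM, ALU, FINISH, LOAD of micro-ops or accumulator data) goes to compute.
Target route(Opcode op, MemId mem) {
  if (op == Opcode::Store) return Target::StoreQueue;
  if (op == Opcode::Load && (mem == MemId::Inp || mem == MemId::Wgt)) {
    return Target::LoadQueue;
  }
  return Target::ComputeQueue;
}

// Small usage example.
void route_examples() {
  assert(route(Opcode::Load, MemId::Inp) == Target::LoadQueue);
  assert(route(Opcode::Load, MemId::Uop) == Target::ComputeQueue);
  assert(route(Opcode::Finish, MemId::Uop) == Target::ComputeQueue);
}
```

One detail that is easy to miss in the HLS code: there is no explicit FINISH branch in fetch(), so FINISH falls through to the compute queue, which is where the done flag is set.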

Compute Module

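The compute module is a RISC-like processor for tensor operations, built from a tensor ALU and a GEMM core.
It reads micro-ops from the micro-op cache and executes them. There are two kinds of micro-ops: ALU and GEMM operations.
To keep the micro-kernel footprint small, the compute module executes the micro-op sequence inside a two-level nested loop, which also avoids control flow such as conditional branches.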
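
To see why the nested loop keeps the micro-kernel small, the sketch below expands a short micro-op sequence into every index it touches. It is a simplified model with hypothetical names (`LoopSpec`, `expand_indices`), tracking a single index per micro-op, whereas the real GEMM micro-op carries separate accumulator, input, and weight indices.

```cpp
#include <cassert>
#include <vector>

// Hypothetical, simplified view of the loop fields of a GEMM/ALU instruction.
struct LoopSpec {
  int iter_out, iter_in;      // trip counts of the outer and inner loop
  int factor_out, factor_in;  // index increment per outer / inner iteration
};

// Expands a micro-op sequence (given as base indices) into the full list of
// indices visited by the two-level loop: base + o * factor_out + i * factor_in.
std::vector<int> expand_indices(const LoopSpec& loop, const std::vector<int>& uop_base) {
  assert(loop.iter_out >= 0 && loop.iter_in >= 0);
  std::vector<int> visited;
  for (int o = 0; o < loop.iter_out; ++o) {
    for (int i = 0; i < loop.iter_in; ++i) {
      for (int base : uop_base) {
        visited.push_back(base + o * loop.factor_out + i * loop.factor_in);
      }
    }
  }
  return visited;
}
```

With two micro-ops and a 2 x 3 loop, twelve tensor accesses are described by two micro-ops plus four loop fields, and no branch instruction is needed.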

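Code Analysis

The module pops an instruction and, if the instruction depends on the load or store module, blocks until it can pop a token from the corresponding DEP_QUEUE.
It then checks the opcode and performs the matching action: FINISH, a LOAD into the micro-op or accumulator memory, or a GEMM/ALU operation.
Finally, if the load or store module is waiting on this instruction, it pushes a token into the corresponding DEP_QUEUE.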

void compute(
  volatile uint32_t &done,
  volatile uop_T *uops,
  volatile bus_T *biases,
  hls::stream<insn_T> &gemm_queue,
  hls::stream<bool> &l2g_dep_queue,
  hls::stream<bool> &s2g_dep_queue,
  hls::stream<bool> &g2l_dep_queue,
  hls::stream<bool> &g2s_dep_queue,
  bus_T inp_mem[VTA_INP_BUFF_DEPTH][INP_MAT_AXI_RATIO],
  bus_T wgt_mem[VTA_WGT_BUFF_DEPTH][WGT_MAT_AXI_RATIO],
  bus_T out_mem[VTA_ACC_BUFF_DEPTH][OUT_MAT_AXI_RATIO]) {
PRAGMA_HLS(HLS INTERFACE s_axilite port = done bundle = CONTROL_BUS offset = VTA_COMPUTE_DONE_WR_OFFSET)
#pragma HLS INTERFACE m_axi port = uops offset = slave bundle = uop_port
#pragma HLS INTERFACE m_axi port = biases offset = slave bundle = data_port
#pragma HLS INTERFACE axis port = gemm_queue
#pragma HLS INTERFACE axis port = l2g_dep_queue
#pragma HLS INTERFACE axis port = s2g_dep_queue
#pragma HLS INTERFACE axis port = g2l_dep_queue
#pragma HLS INTERFACE axis port = g2s_dep_queue
#pragma HLS INTERFACE bram port = inp_mem
#pragma HLS INTERFACE bram port = wgt_mem
#pragma HLS INTERFACE bram port = out_mem
#pragma HLS INTERFACE s_axilite port = return bundle = CONTROL_BUS
#pragma HLS RESOURCE variable = inp_mem core = RAM_1P
#pragma HLS RESOURCE variable = wgt_mem core = RAM_1P
#pragma HLS RESOURCE variable = out_mem core = RAM_1P

  // Micro-op storage
  static uop_T uop_mem[VTA_UOP_BUFF_DEPTH];

  // Accumulator storage
  static bus_T acc_mem[VTA_ACC_BUFF_DEPTH][ACC_MAT_AXI_RATIO];
#pragma HLS ARRAY_RESHAPE variable = acc_mem complete dim=2
// This is necessary to obtain II=1
#pragma HLS DEPENDENCE variable = acc_mem inter false

  // Pop GEMM instruction
  insn_T raw_insn = gemm_queue.read();
  // Cast to GenericInsn
  VTAInsn insn;
  insn_T raw_copy = raw_insn;
  insn.generic = *((VTAGenericInsn *) &raw_copy);

  // Pop dependence token if instructed
  if (insn.generic.pop_prev_dep) {
    l2g_dep_queue.read();
  }
  if (insn.generic.pop_next_dep) {
    s2g_dep_queue.read();
  }

  // Set done value
  done = 0;
  // Perform action based on opcode
  if (insn.generic.opcode == VTA_OPCODE_FINISH) {
    // Set done flag if we reach a FINISH instruction
    done = 1;
  } else if (insn.generic.opcode == VTA_OPCODE_LOAD) {
    // Initialize indices
    memop_sram_T sram_idx = insn.mem.sram_base;
    memop_dram_T dram_idx = insn.mem.dram_base;
    memop_sram_T x_width =
        (insn.mem.x_pad_0 + insn.mem.x_size + insn.mem.x_pad_1);
    memop_sram_T y_offset_0 = x_width * insn.mem.y_pad_0;
    memop_sram_T y_offset_1 = x_width * insn.mem.y_pad_1;

    if (insn.mem.memory_type == VTA_MEM_ID_UOP) {
      // Perform data transfer
      memcpy(&uop_mem[sram_idx],
             (const uop_T*) &uops[dram_idx],
             insn.mem.x_size * sizeof(uop_T));
    } else if (insn.mem.memory_type == VTA_MEM_ID_ACC) {
      // Perform data transfer from DRAM
      load_pad_2d<bus_T, ACC_MAT_AXI_RATIO, VTA_ACC_ELEM_BYTES>(
          biases,
          acc_mem,
          sram_idx,
          dram_idx,
          insn.mem.y_size,
          insn.mem.x_size,
          insn.mem.x_stride,
          insn.mem.x_pad_0,
          insn.mem.x_pad_1,
          y_offset_0,
          y_offset_1);
    }
  } else if (insn.generic.opcode == VTA_OPCODE_GEMM) {
    gemm(raw_copy, uop_mem, acc_mem, inp_mem, wgt_mem, out_mem);
  } else if (insn.generic.opcode == VTA_OPCODE_ALU) {
    alu(raw_copy, uop_mem, acc_mem, inp_mem, wgt_mem, out_mem);
  }

  // Push dependence token if instructed
  if (insn.generic.push_prev_dep) {
    g2l_dep_queue.write(1);
  }
  if (insn.generic.push_next_dep) {
    g2s_dep_queue.write(1);
  }
}

GEMM core

The GEMM core executes GEMM instructions by running a micro-code sequence inside a two-level nested loop.
It performs one input-weight matrix multiplication per cycle, and the dimensions of that multiplication define the hardware tensorization intrinsic.
The tensorization intrinsic is determined by the dimensions of the input, weight, and accumulator tensors. The accumulator tensor uses a wider type to prevent overflow.
- Typically the input and weight types are low-precision (8 bits or less) and the accumulator tensor is 32 bits.
To keep the core highly utilized, the input buffer, weight buffer, and register file must each provide enough read/write bandwidth.
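
A quick back-of-the-envelope calculation shows what "one matrix multiplication per cycle" means for the configuration used in this post. The helpers below are hypothetical, and the result is only an upper bound: it assumes HW_FREQ is in MHz, that LOG_BLOCK sets both the input and output block size, and that the GEMM loop really sustains II = 1 with no stalls.

```cpp
#include <cassert>
#include <cstdint>

// Multiply-accumulates issued per cycle by a (BATCH x BLOCK_IN) by
// (BLOCK_OUT x BLOCK_IN) intrinsic, given the log2-encoded dimensions.
uint64_t macs_per_cycle(int log_batch, int log_block_in, int log_block_out) {
  assert(log_batch >= 0 && log_block_in >= 0 && log_block_out >= 0);
  assert(log_batch + log_block_in + log_block_out < 63);
  return (1ull << log_batch) * (1ull << log_block_in) * (1ull << log_block_out);
}

// Upper bound in giga-operations per second, counting each
// multiply-accumulate as two operations.
double peak_gops(uint64_t macs, double freq_mhz) {
  return 2.0 * static_cast<double>(macs) * freq_mhz * 1e6 / 1e9;
}
```

For LOG_BATCH = 0 and LOG_BLOCK = 4 this gives 256 MACs per cycle, and at 300 MHz a theoretical peak of about 153.6 GOPS. Real throughput will be lower whenever the buffers cannot feed the core.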

void gemm(
  insn_T insn_raw,
  uop_T uop_mem[VTA_UOP_BUFF_DEPTH],
  bus_T acc_mem[VTA_ACC_BUFF_DEPTH][ACC_MAT_AXI_RATIO],
  bus_T inp_mem[VTA_INP_BUFF_DEPTH][INP_MAT_AXI_RATIO],
  bus_T wgt_mem[VTA_WGT_BUFF_DEPTH][WGT_MAT_AXI_RATIO],
  bus_T out_mem[VTA_ACC_BUFF_DEPTH][OUT_MAT_AXI_RATIO]) {
#pragma HLS INLINE

  VTAGemInsn insn = *((VTAGemInsn *) &insn_raw);

  // Loop offset
  acc_idx_T dst_offset_out = 0;
  inp_idx_T src_offset_out = 0;
  wgt_idx_T wgt_offset_out = 0;

  // Outer Loop
  EXE_OUT_LOOP: for (int it_out = 0; it_out < insn.iter_out; it_out++) {
    acc_idx_T dst_offset_in = dst_offset_out;
    inp_idx_T src_offset_in = src_offset_out;
    wgt_idx_T wgt_offset_in = wgt_offset_out;

    // Inner Loop
    EXE_IN_LOOP: for (int it_in = 0; it_in < insn.iter_in; it_in++) {

      // Iterate over micro op
      READ_GEMM_UOP: for (int upc = insn.uop_bgn; upc < insn.uop_end; upc++) {
#pragma HLS PIPELINE II = 1
        // Read micro-op fields
        uop_T uop = uop_mem[upc];

        // Decode indices
        acc_idx_T dst_idx =
            uop.range(VTA_UOP_GEM_0_1, VTA_UOP_GEM_0_0) + dst_offset_in;
        inp_idx_T src_idx =
            uop.range(VTA_UOP_GEM_1_1, VTA_UOP_GEM_1_0) + src_offset_in;
        wgt_idx_T wgt_idx =
            uop.range(VTA_UOP_GEM_2_1, VTA_UOP_GEM_2_0) + wgt_offset_in;

        // Read in weight tensor
        wgt_T w_tensor[VTA_BLOCK_OUT][VTA_BLOCK_IN];
        read_tensor<bus_T, wgt_T, wgt_idx_T, VTA_BUS_WIDTH, VTA_WGT_WIDTH, VTA_BLOCK_OUT, VTA_BLOCK_IN>(wgt_idx, wgt_mem, w_tensor);
        // Read in input tensor
        inp_T i_tensor[VTA_BATCH][VTA_BLOCK_IN];
        read_tensor<bus_T, inp_T, inp_idx_T, VTA_BUS_WIDTH, VTA_INP_WIDTH, VTA_BATCH, VTA_BLOCK_IN>(src_idx, inp_mem, i_tensor);
        // Read in accum tensor
        acc_T a_tensor[VTA_BATCH][VTA_BLOCK_OUT];
        read_tensor<bus_T, acc_T, acc_idx_T, VTA_BUS_WIDTH, VTA_ACC_WIDTH, VTA_BATCH, VTA_BLOCK_OUT>(dst_idx, acc_mem, a_tensor);
        // Output tensor
        out_T o_tensor[VTA_BATCH][VTA_BLOCK_OUT];

        // Inner GEMM loop
        for (int b = 0; b < VTA_BATCH; b++) {
          for (int oc = 0; oc < VTA_BLOCK_OUT; oc++) {
            // Initialize the accumulator values
            acc_T accum = a_tensor[b][oc];
            // Dot product sum
            sum_T tmp = 0;
            // Inner matrix multiplication loop (input channel/feature)
            for (int ic = 0; ic < VTA_BLOCK_IN; ic++) {
              wgt_T w_elem = w_tensor[oc][ic];
              inp_T i_elem = i_tensor[b][ic];
              mul_T prod_dsp = i_elem * w_elem;
              tmp += (sum_T) prod_dsp;
            }
            // Update summation
            accum += (acc_T) tmp;
            // Write back result acc_mem
            a_tensor[b][oc] = insn.reset_reg ? (acc_T) 0 : accum;
            // And output vector
            o_tensor[b][oc] = (out_T) accum.range(VTA_OUT_WIDTH - 1, 0);
          }
        }

        // Write the results back into accumulator
        write_tensor<bus_T, acc_T, acc_idx_T, VTA_BUS_WIDTH, VTA_ACC_WIDTH, VTA_BATCH, VTA_BLOCK_OUT>(dst_idx, a_tensor, acc_mem);
        // Write the results back in the output buffer
        write_tensor<bus_T, out_T, acc_idx_T, VTA_BUS_WIDTH, VTA_OUT_WIDTH, VTA_BATCH, VTA_BLOCK_OUT>(dst_idx, o_tensor, out_mem);
      }
      // Update offsets
      dst_offset_in += insn.dst_factor_in;
      src_offset_in += insn.src_factor_in;
      wgt_offset_in += insn.wgt_factor_in;
    }
    // Update offsets
    dst_offset_out += insn.dst_factor_out;
    src_offset_out += insn.src_factor_out;
    wgt_offset_out += insn.wgt_factor_out;
  }
}
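
The innermost part of the listing above is easier to follow without the HLS types. The function below is my own plain C++ reference model of a single GEMM micro-op, assuming signed 8-bit inputs and weights and 32-bit accumulators (which matches LOG_INP_WIDTH = 3 and LOG_ACC_WIDTH = 5); `gemm_uop` is a hypothetical name and the tensors are flattened into row-major vectors.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Reference model of one GEMM micro-op:
//   acc[b][oc] += sum over ic of inp[b][ic] * wgt[oc][ic]
// Returns the output tile, i.e. the low 8 bits of each updated accumulator.
std::vector<int8_t> gemm_uop(const std::vector<int8_t>& inp,  // batch x block_in
                             const std::vector<int8_t>& wgt,  // block_out x block_in
                             std::vector<int32_t>& acc,       // batch x block_out
                             int batch, int block_in, int block_out,
                             bool reset_reg) {
  assert(inp.size() == static_cast<std::size_t>(batch) * block_in);
  assert(wgt.size() == static_cast<std::size_t>(block_out) * block_in);
  assert(acc.size() == static_cast<std::size_t>(batch) * block_out);
  std::vector<int8_t> out(acc.size());
  for (int b = 0; b < batch; ++b) {
    for (int oc = 0; oc < block_out; ++oc) {
      int32_t sum = 0;
      for (int ic = 0; ic < block_in; ++ic) {
        sum += static_cast<int32_t>(inp[b * block_in + ic]) *
               static_cast<int32_t>(wgt[oc * block_in + ic]);
      }
      const int32_t accum = acc[b * block_out + oc] + sum;
      // As in the HLS code, reset_reg clears the accumulator but the
      // output tile still receives the truncated, non-reset value.
      acc[b * block_out + oc] = reset_reg ? 0 : accum;
      out[b * block_out + oc] = static_cast<int8_t>(static_cast<uint8_t>(accum));
    }
  }
  return out;
}
```

Note that the weight tile is indexed as `wgt[oc][ic]`, so each output channel is a dot product of an input row with a weight row. The output is a plain truncation to the low bits, which is why a shift and clip are normally applied with the ALU before storing.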

ALU core

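The tensor ALU supports a set of standard operators for activation, normalization, and pooling.
- The documentation notes that, because VTA is a modular design, the set of supported operators can be extended to increase operator coverage.
It supports tensor-tensor operations and tensor-scalar operations (with an immediate value).
In the tensor ALU, the micro-code only specifies the data access pattern.
The tensor ALU has II = 2 because the register file lacks read ports. In addition, the register file holds 32-bit values, so performing a whole tensor-tensor operation at once is expensive; it is instead carried out as vector-vector operations over multiple cycles.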

void alu(
  insn_T insn_raw,
  uop_T uop_mem[VTA_UOP_BUFF_DEPTH],
  bus_T acc_mem[VTA_ACC_BUFF_DEPTH][ACC_MAT_AXI_RATIO],
  bus_T inp_mem[VTA_INP_BUFF_DEPTH][INP_MAT_AXI_RATIO],
  bus_T wgt_mem[VTA_WGT_BUFF_DEPTH][WGT_MAT_AXI_RATIO],
  bus_T out_mem[VTA_ACC_BUFF_DEPTH][OUT_MAT_AXI_RATIO]) {
#pragma HLS INLINE

  VTAAluInsn insn = *((VTAAluInsn *) &insn_raw);

  // Loop offset
  acc_idx_T dst_offset_out = 0;
  inp_idx_T src_offset_out = 0;

  // Outer Loop
  EXE_OUT_LOOP: for (int it_out = 0; it_out < insn.iter_out; it_out++) {
    acc_idx_T dst_offset_in = dst_offset_out;
    inp_idx_T src_offset_in = src_offset_out;

    // Inner Loop
    EXE_IN_LOOP: for (int it_in = 0; it_in < insn.iter_in; it_in++) {
      // Iterate over micro op
      READ_ALU_UOP: for (int upc = insn.uop_bgn; upc < insn.uop_end; upc++) {
#pragma HLS PIPELINE II = 2
        // Read micro-op fields
        uop_T uop = uop_mem[upc];

        // Decode
        acc_idx_T dst_idx =
            uop.range(VTA_UOP_ALU_0_1, VTA_UOP_ALU_0_0) + dst_offset_in;
        acc_idx_T src_idx =
            uop.range(VTA_UOP_ALU_1_1, VTA_UOP_ALU_1_0) + src_offset_in;

        // Read in src tensor
        acc_T src_tensor[VTA_BATCH][VTA_BLOCK_OUT];
        read_tensor<bus_T, acc_T, acc_idx_T, VTA_BUS_WIDTH, VTA_ACC_WIDTH, VTA_BATCH, VTA_BLOCK_OUT>(src_idx, acc_mem, src_tensor);
        // Read in dst tensor
        acc_T dst_tensor[VTA_BATCH][VTA_BLOCK_OUT];
        read_tensor<bus_T, acc_T, acc_idx_T, VTA_BUS_WIDTH, VTA_ACC_WIDTH, VTA_BATCH, VTA_BLOCK_OUT>(dst_idx, acc_mem, dst_tensor);
        // Output tensor
        out_T o_tensor[VTA_BATCH][VTA_BLOCK_OUT];

        // Perform ALU op over matrix elements
        for (int i = 0; i < VTA_BATCH; i++) {
          for (int b = 0; b < VTA_BLOCK_OUT; b++) {
            // Read in operands
            acc_T src_0 = dst_tensor[i][b];
            acc_T src_1 = insn.use_imm ? (acc_T) insn.imm : src_tensor[i][b];
            aluop_shr_arg_T shft_by = src_1.range(VTA_SHR_ARG_BIT_WIDTH - 1, 0);
            aluop_mul_arg_T mul_by = src_1.range(VTA_MUL_ARG_BIT_WIDTH - 1, 0);
            if (insn.alu_opcode == VTA_ALU_OPCODE_MIN || insn.alu_opcode == VTA_ALU_OPCODE_MAX) {
              // Compute Min/Max
              acc_T mix_val = src_0 < src_1 ?
                  (insn.alu_opcode == VTA_ALU_OPCODE_MIN ? src_0 : src_1) :
                  (insn.alu_opcode == VTA_ALU_OPCODE_MIN ? src_1 : src_0);
              dst_tensor[i][b] = mix_val;
              o_tensor[i][b] = (out_T) mix_val.range(VTA_OUT_WIDTH - 1, 0);
            } else if (insn.alu_opcode == VTA_ALU_OPCODE_ADD) {
              // Compute Sum
              acc_T add_val =
                  src_0.range(VTA_ACC_WIDTH - 1, 0) + src_1.range(VTA_ACC_WIDTH - 1, 0);
              dst_tensor[i][b] = add_val;
              o_tensor[i][b] = (out_T) add_val.range(VTA_OUT_WIDTH - 1, 0);
            } else if (insn.alu_opcode == VTA_ALU_OPCODE_SHR) {
              // Compute Shift Right
              acc_T shr_val = src_0 >> shft_by;
              dst_tensor[i][b] = shr_val;
              o_tensor[i][b] = (out_T) shr_val.range(VTA_OUT_WIDTH - 1, 0);
            }
          }
        }

        // Write the results back into accumulator
        write_tensor<bus_T, acc_T, acc_idx_T, VTA_BUS_WIDTH, VTA_ACC_WIDTH, VTA_BATCH, VTA_BLOCK_OUT>(dst_idx, dst_tensor, acc_mem);
        // Write the results back in the output buffer
        write_tensor<bus_T, out_T, acc_idx_T, VTA_BUS_WIDTH, VTA_OUT_WIDTH, VTA_BATCH, VTA_BLOCK_OUT>(dst_idx, o_tensor, out_mem);
      }
      // Update offsets
      dst_offset_in += insn.dst_factor_in;
      src_offset_in += insn.src_factor_in;
    }
    // Update offsets
    dst_offset_out += insn.dst_factor_out;
    src_offset_out += insn.src_factor_out;
  }
}
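
As with GEMM, the element-wise part of the ALU is easy to model in plain C++. The sketch below uses hypothetical names (`AluOp`, `alu_uop`) and makes two simplifying assumptions: the shift amount is masked to the range 0..31, and a right shift of a negative value is arithmetic, as it is on common compilers.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// The four operations handled in the listing above.
enum class AluOp { Min, Max, Add, Shr };

// Applies dst = op(dst, use_imm ? imm : src) element-wise, like one ALU micro-op.
void alu_uop(AluOp op, std::vector<int32_t>& dst, const std::vector<int32_t>& src,
             bool use_imm, int32_t imm) {
  assert(use_imm || src.size() == dst.size());
  for (std::size_t i = 0; i < dst.size(); ++i) {
    const int32_t rhs = use_imm ? imm : src[i];
    switch (op) {
      case AluOp::Min:
        dst[i] = std::min(dst[i], rhs);
        break;
      case AluOp::Max:
        dst[i] = std::max(dst[i], rhs);
        break;
      case AluOp::Add:
        // Wrap-around addition via unsigned arithmetic to avoid signed overflow.
        dst[i] = static_cast<int32_t>(static_cast<uint32_t>(dst[i]) +
                                      static_cast<uint32_t>(rhs));
        break;
      case AluOp::Shr:
        // Assumption: only the low 5 bits of the operand select the shift amount.
        dst[i] = dst[i] >> (rhs & 31);
        break;
    }
  }
}
```

With the immediate form, a ReLU can be written as MAX with 0, and a typical requantization step becomes SHR followed by MIN/MAX clipping, all without leaving the register file.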

Load and Store Modules

The load and store modules perform 2D DMA transfers between DRAM and SRAM.
Because strided access and 2D padding are supported, the overhead of re-laying out data on the CPU is reduced.

void load(
  volatile bus_T *inputs,
  volatile bus_T *weights,
  hls::stream<insn_T> &load_queue,
  hls::stream<bool> &g2l_dep_queue,
  hls::stream<bool> &l2g_dep_queue,
  bus_T inp_mem[VTA_INP_BUFF_DEPTH][INP_MAT_AXI_RATIO],
  bus_T wgt_mem[VTA_WGT_BUFF_DEPTH][WGT_MAT_AXI_RATIO]) {
#pragma HLS INTERFACE m_axi port = inputs offset = slave bundle = data_port
#pragma HLS INTERFACE m_axi port = weights offset = slave bundle = data_port
#pragma HLS INTERFACE axis port = load_queue
#pragma HLS INTERFACE axis port = g2l_dep_queue
#pragma HLS INTERFACE axis port = l2g_dep_queue
#pragma HLS INTERFACE bram port = wgt_mem
#pragma HLS INTERFACE bram port = inp_mem
#pragma HLS INTERFACE s_axilite port = return bundle = CONTROL_BUS
#pragma HLS RESOURCE variable = inp_mem core = RAM_1P
#pragma HLS RESOURCE variable = wgt_mem core = RAM_1P

  // Pop load instruction
  insn_T raw_insn = load_queue.read();
  // Cast to MemInsn
  insn_T raw_copy = raw_insn;
  VTAMemInsn insn = *((VTAMemInsn *) &raw_copy);

  // Pop dependence token if instructed
  if (insn.pop_next_dep) {
    g2l_dep_queue.read();
  }

  // Pre-processing
  memop_sram_T x_width = (insn.x_pad_0 + insn.x_size + insn.x_pad_1);
  memop_sram_T y_offset_0 = x_width * insn.y_pad_0;
#pragma HLS RESOURCE variable = y_offset_0 core = Mul_LUT latency = 4
  memop_sram_T y_offset_1 = x_width * insn.y_pad_1;
#pragma HLS RESOURCE variable = y_offset_1 core = Mul_LUT latency = 4

  if (insn.memory_type == VTA_MEM_ID_INP) {
    load_pad_2d<bus_T, INP_MAT_AXI_RATIO, VTA_INP_ELEM_BYTES>(
        inputs,
        inp_mem,
        insn.sram_base,
        insn.dram_base,
        insn.y_size,
        insn.x_size,
        insn.x_stride,
        insn.x_pad_0,
        insn.x_pad_1,
        y_offset_0,
        y_offset_1);
  } else if (insn.memory_type == VTA_MEM_ID_WGT) {
    load_2d<bus_T, WGT_MAT_AXI_RATIO, VTA_WGT_ELEM_BYTES>(
        weights,
        wgt_mem,
        insn.sram_base,
        insn.dram_base,
        insn.y_size,
        insn.x_size,
        insn.x_stride);
  }

  // Push dependence token if instructed
  if (insn.push_next_dep) {
    l2g_dep_queue.write(1);
  }
}

void store(
  volatile bus_T *outputs,
  hls::stream<insn_T> &store_queue,
  hls::stream<bool> &g2s_dep_queue,
  hls::stream<bool> &s2g_dep_queue,
  bus_T out_mem[VTA_ACC_BUFF_DEPTH][OUT_MAT_AXI_RATIO]) {
#pragma HLS INTERFACE m_axi port = outputs offset = slave bundle = data_port
#pragma HLS INTERFACE axis port = store_queue
#pragma HLS INTERFACE axis port = g2s_dep_queue
#pragma HLS INTERFACE axis port = s2g_dep_queue
#pragma HLS INTERFACE bram port = out_mem
#pragma HLS INTERFACE s_axilite port = return bundle = CONTROL_BUS
#pragma HLS RESOURCE variable = out_mem core = RAM_1P

  // Pop store instruction
  insn_T raw_insn = store_queue.read();
  // Cast to MemInsn
  insn_T raw_copy = raw_insn;
  VTAMemInsn insn = *((VTAMemInsn *) &raw_copy);

  // Pop dependence token if instructed
  if (insn.pop_prev_dep) {
    g2s_dep_queue.read();
  }

  // Initialize indices
  memop_sram_T sram_idx = insn.sram_base;
  memop_dram_T dram_idx = insn.dram_base;

  // Copy along y dimension
  for (int y = 0; y < insn.y_size; y++) {
#pragma HLS PIPELINE
    // Perform data transfer
    memcpy(
      const_cast<bus_T*>(&outputs[dram_idx * OUT_MAT_AXI_RATIO]),
      (const bus_T*) &out_mem[sram_idx][0],
      insn.x_size * VTA_OUT_ELEM_BYTES);
#pragma HLS RESOURCE variable = sram_idx core = Mul_LUT
    sram_idx += insn.x_size;
    dram_idx += insn.x_stride;
  }

  // Push dependence token if instructed
  if (insn.push_prev_dep) {
    s2g_dep_queue.write(1);
  }
}
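
The body of `load_pad_2d` is not shown in the listings above, so the following is only my own model of what a strided 2D load with padding does, not the actual implementation. `Load2D` and `load_pad_2d_model` are hypothetical names, the fields mirror the `VTAMemInsn` fields used by load(), and each element is a plain `int` rather than a whole tensor tile.

```cpp
#include <cassert>
#include <vector>

// Hypothetical struct mirroring the VTAMemInsn fields used by load().
struct Load2D {
  int sram_base, dram_base;
  int y_size, x_size, x_stride;
  int y_pad_0, y_pad_1, x_pad_0, x_pad_1;
};

// Copies a y_size x x_size window out of DRAM, where consecutive rows are
// x_stride elements apart, and writes it densely into SRAM surrounded by
// zero padding. Returns the number of SRAM elements written.
int load_pad_2d_model(const Load2D& in, const std::vector<int>& dram,
                      std::vector<int>& sram) {
  const int x_width = in.x_pad_0 + in.x_size + in.x_pad_1;
  const int rows = in.y_pad_0 + in.y_size + in.y_pad_1;
  assert(in.sram_base >= 0 && in.dram_base >= 0);
  assert(in.sram_base + rows * x_width <= static_cast<int>(sram.size()));
  assert(in.y_size == 0 ||
         in.dram_base + (in.y_size - 1) * in.x_stride + in.x_size <=
             static_cast<int>(dram.size()));
  int s = in.sram_base;
  // Top padding rows (y_offset_0 elements in the HLS code).
  for (int i = 0; i < in.y_pad_0 * x_width; ++i) sram[s++] = 0;
  for (int y = 0; y < in.y_size; ++y) {
    for (int i = 0; i < in.x_pad_0; ++i) sram[s++] = 0;  // left padding
    for (int x = 0; x < in.x_size; ++x) {
      sram[s++] = dram[in.dram_base + y * in.x_stride + x];
    }
    for (int i = 0; i < in.x_pad_1; ++i) sram[s++] = 0;  // right padding
  }
  // Bottom padding rows (y_offset_1 elements in the HLS code).
  for (int i = 0; i < in.y_pad_1 * x_width; ++i) sram[s++] = 0;
  return s - in.sram_base;
}
```

Loading a 2 x 2 window from the middle of a 4 x 4 image with one element of padding on every side produces a 4 x 4 SRAM tile with a zero border, which is exactly the layout a padded convolution tile needs. The store() loop above is the same strided walk in the opposite direction, without padding.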

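Vivado Structure Analysis

The HLS sources do define a vta top-level module, but the design actually synthesized in Vivado does not use it. Instead, the Fetch, Compute, Load, and Store modules are instantiated separately and connected with FIFO IPs. I am not sure why.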