PULP-TrainLib: Enabling On-Device Training for RISC-V Multi-Core MCUs through Performance-Driven Autotuning

Reading it from here.

Abstract

Problem:

Making Internet-of-Things (IoT) sensor nodes "smart" and "self-adaptive" requires enabling on-chip Deep Neural Network (DNN) training on Ultra-Low-Power (ULP) Microcontroller Units (MCUs). This is challenging because of the tight memory, compute, and power budgets of these devices.

Solution:

The paper introduces PULP-TrainLib, a library of DNN training primitives for RISC-V multi-core MCUs (the PULP platform). To optimize PULP-TrainLib’s kernels, the authors propose a strategy to automatically select and configure (autotune) the fastest among a set of tiling options and optimized floating-point matrix multiplication (MM) kernels, according to the tensor shapes of every DNN layer.

Results:

Intro

Current Design:

Newer Methods:

Challenges of Deploying Training Algos on MCUs:

  1. Needs forward and backward propagation (unlike inference which is only forward).
  2. Floating-point operations are more expensive on MCUs than quantized (e.g., 8-bit integer) operations common in edge inference.
  3. SGD needs many passes over the data (epochs/iterations) to converge, whereas inference is often a single pass.
  4. Single element processing: MCUs often process data one element at a time, limiting parallelism from batching.

Paper's Approach to Solving the Issues:

Related Work

On-Device Training on PULP

PULP Platform:

The MCU features a single RISC-V core, a set of IO peripherals, and a 2 MB SRAM memory (L2) accessible by the cluster side through a DMA engine. The cluster includes 8 RISC-V cores sharing a 64 kB L1 memory, accessible in a single cycle. Each core implements a basic set of standard RISC-V extensions (RV32IMFC); additionally, a custom extension (Xpulp) provides DSP-like features to reduce overhead in highly uniform workloads, including post-increment load/store operations and 2-level nested hardware loops. Finally, each CPU is granted access to a private Floating Point Unit (FPU), capable of performing complex, single-cycle DSP instructions, like Fused Multiply-Add (FMA).
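To make the cluster-side programming model concrete, here is a minimal sketch assuming the PMSIS cluster API of the PULP SDK (pi_cl_team_fork, pi_core_id); it is not taken from PULP-TrainLib, just an illustration of splitting an FMA-heavy loop across the 8 cores that share the single-cycle L1.

```c
// Minimal sketch, assuming the PMSIS cluster API of the PULP SDK.
// Splits an element-wise c[i] += a[i] * b[i] loop across the cluster cores.
#include "pmsis.h"

#define NUM_CORES 8

typedef struct { const float *a, *b; float *c; int n; } axpy_args_t;  // hypothetical

// Each core processes an interleaved slice of the vector.
static void axpy_worker(void *arg) {
    axpy_args_t *t = (axpy_args_t *)arg;
    int id = pi_core_id();
    for (int i = id; i < t->n; i += NUM_CORES) {
        t->c[i] += t->a[i] * t->b[i];   // candidate for a single FMA on the private FPU
    }
}

// Called from cluster code: fork the worker on all cores, then join.
void axpy_parallel(const float *a, const float *b, float *c, int n) {
    axpy_args_t args = { a, b, c, n };
    pi_cl_team_fork(NUM_CORES, axpy_worker, &args);
}
```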

PULP-TrainLib Library:

Interesting terms here: WG-BW and IG-BW, the Weight Gradient Backward and Input Gradient Backward steps. Together they form a layer's backward pass: WG-BW computes the gradient with respect to the weights, while IG-BW computes the gradient with respect to the layer input, which is then propagated to the previous layer.
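For intuition, here is a minimal sketch (not the PULP-TrainLib API) of the three training steps of a fully-connected layer, each of which reduces to a plain matrix/vector multiplication.

```c
// Minimal sketch, not the PULP-TrainLib API.
// Fully-connected layer y = W * x, with W of size M x N (row-major),
// x of size N, y of size M; dy is the gradient arriving from the next layer.

// FW: forward pass, y[m] = sum_n W[m][n] * x[n]
void fc_fw(const float *W, const float *x, float *y, int M, int N) {
    for (int m = 0; m < M; m++) {
        y[m] = 0.0f;
        for (int n = 0; n < N; n++) y[m] += W[m * N + n] * x[n];
    }
}

// WG-BW: weight-gradient backward, dW[m][n] = dy[m] * x[n] (outer product)
void fc_wg_bw(const float *dy, const float *x, float *dW, int M, int N) {
    for (int m = 0; m < M; m++)
        for (int n = 0; n < N; n++) dW[m * N + n] = dy[m] * x[n];
}

// IG-BW: input-gradient backward, dx[n] = sum_m W[m][n] * dy[m]
void fc_ig_bw(const float *W, const float *dy, float *dx, int M, int N) {
    for (int n = 0; n < N; n++) {
        dx[n] = 0.0f;
        for (int m = 0; m < M; m++) dx[n] += W[m * N + n] * dy[m];
    }
}
```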

Pasted image 20250530160606.png

Pasted image 20250530160803.png

PULP-TrainLib Primitives and AutoTuner

Pasted image 20250530161310.png

Loop Unrolling:

< i am yet to understand this >

The total number of instructions can be reduced by means of loop unrolling. This technique leads to a faster implementation by computing multiple outputs within the inner loop, hence exploiting data reuse. As a convention, we define the unrolling factor as U × V, where U and V are, respectively, the number of rows and columns of the output matrix concurrently computed within the inner loop of the MM. In particular, Figure 3 (b) represents a 2×1 loop unrolling, which features a 33% higher FMA/Load ratio with respect to the naïve baseline, therefore reducing the overall number of instructions.
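To make the 2×1 case concrete, a minimal sketch (not the actual PULP-TrainLib kernels) of a naïve MM inner loop versus a 2×1-unrolled one: the unrolled version reuses each loaded B element for two output rows, raising the FMA/Load ratio from 1/2 to 2/3, i.e. the ~33% improvement quoted above.

```c
// Minimal sketch (not the actual PULP-TrainLib kernels) of C = A * B,
// with A of size N x K, B of size K x M, C of size N x M, all row-major fp32.

// Naive inner loop: 2 loads per FMA (FMA/Load = 1/2).
void mm_naive(const float *A, const float *B, float *C, int N, int K, int M) {
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < M; j++) {
            float acc = 0.0f;
            for (int k = 0; k < K; k++) {
                acc += A[i * K + k] * B[k * M + j];   // 2 loads, 1 FMA
            }
            C[i * M + j] = acc;
        }
    }
}

// 2x1 unrolling: two output rows are computed together, so each loaded
// B element is reused twice (3 loads, 2 FMAs -> FMA/Load = 2/3).
void mm_unroll_2x1(const float *A, const float *B, float *C, int N, int K, int M) {
    for (int i = 0; i + 1 < N; i += 2) {
        for (int j = 0; j < M; j++) {
            float acc0 = 0.0f, acc1 = 0.0f;
            for (int k = 0; k < K; k++) {
                float b = B[k * M + j];               // loaded once, reused twice
                acc0 += A[i * K + k] * b;
                acc1 += A[(i + 1) * K + k] * b;
            }
            C[i * M + j] = acc0;
            C[(i + 1) * M + j] = acc1;
        }
    }
    // A leftover row when N is odd would need a separate pass (omitted here).
}
```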

< will come back here again >
TODO: finish learning Loop Unrolling:

AutoTuner:

Algorithm 1 (AutoTuner for PULP-TrainLib):

Pasted image 20250530162410.png

Function Tiler(L1MemSize, T): given the L1 memory size and the layer's tensor shapes T, it returns the list of feasible tile shapes that fit in L1.

AutoTuner Logic:

  1. If the original tensor shape fits in L1 memory, use it directly (single entry for TensorList).
  2. Else, call Tiler to get a list of feasible tensor shapes (tiles).
  3. For each tensor shape in TensorList:
    • For each MM kernel in the table:
      • Simulate the DNN training layer with the current MM kernel, tensor shape (tile), and number of cores using GVSOC (a behavioral event-based MCU simulator for PULP, claimed to have <5% error).
      • Store {Latency, TensorShape, MM} in PerfLog.
  4. Return the MM kernel and tile shape from PerfLog that has the lowest latency.
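A minimal sketch of this selection loop is below (hypothetical types and helper names; the real AutoTuner drives the GVSOC simulator, which appears here only as a stub prototype, and it logs every result in PerfLog, whereas this sketch just keeps the running best).

```c
// Minimal sketch of the AutoTuner selection loop; types and helpers are
// hypothetical, not PULP-TrainLib or GVSOC APIs.
#include <float.h>

#define MAX_TILES 64

typedef struct { int h, w, c; } tensor_shape_t;                                // hypothetical
typedef struct { double latency; tensor_shape_t shape; int mm_id; } perf_entry_t;

// Hypothetical helpers:
int    fits_in_l1(tensor_shape_t s, int l1_bytes);
int    tiler(int l1_bytes, tensor_shape_t s, tensor_shape_t *out, int max_tiles);
double simulate_latency(tensor_shape_t layer, tensor_shape_t tile, int mm_id, int cores);

perf_entry_t autotune(tensor_shape_t layer, int l1_bytes, int n_mm_kernels, int cores) {
    tensor_shape_t tiles[MAX_TILES];
    int n_tiles;

    if (fits_in_l1(layer, l1_bytes)) {               // step 1: no tiling needed
        tiles[0] = layer;
        n_tiles = 1;
    } else {                                          // step 2: enumerate feasible tiles
        n_tiles = tiler(l1_bytes, layer, tiles, MAX_TILES);
    }

    perf_entry_t best = { .latency = DBL_MAX };
    for (int t = 0; t < n_tiles; t++) {               // step 3: grid search over tiles...
        for (int mm = 0; mm < n_mm_kernels; mm++) {   // ...and MM kernel variants
            double lat = simulate_latency(layer, tiles[t], mm, cores);
            if (lat < best.latency) {                 // step 4: keep the fastest combination
                best = (perf_entry_t){ lat, tiles[t], mm };
            }
        }
    }
    return best;
}
```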

Pasted image 20250530162908.png