• Exploit SW optimizations first (before adding hardware)
  • Add hard matrix-multiplier blocks to the FPGA → how many? Trade-off: devoting a small fraction of area (3-5%, “Tensor Slices”) doubles throughput for ML applications without increasing overall area; no extra general-purpose routing required, and no slowdown on non-ML benchmarks (systolic-array sketch below)
  • Add compute inside the RAMs: performance improvement for only ~3% extra hardware → large reduction in data movement (bit-serial in-RAM sketch below)
  • Weightless NNs have potential (WiSARD sketch below)
  • ML for performance modeling: e.g., ML-based calibration of power models such as McPAT; a calibrated simulator can then model machines that do not exist yet (calibration sketch below)
  • Interest in ML-guided HW design → “Computer Architecture 2.0”, e.g. ML-based branch predictors? (perceptron sketch below)
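
On the hard matrix-multiplier bullet: a rough functional model of what such a block computes, not the actual Tensor Slice microarchitecture. This sketch steps an output-stationary MAC array cycle by cycle, with inputs skewed so that A[i, k] and B[k, j] meet at PE (i, j):

```python
import numpy as np

def systolic_matmul(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Cycle-by-cycle model of an output-stationary MAC array.

    A is MxK, B is KxN; PE (i, j) accumulates C[i, j]. Row i of A and
    column j of B are skewed by i and j cycles respectively, so the
    operand pair for index k reaches PE (i, j) at cycle t = i + j + k.
    """
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=A.dtype)
    for t in range(M + N + K - 2):        # cycles until the array drains
        for i in range(M):
            for j in range(N):
                k = t - i - j             # operand index arriving this cycle
                if 0 <= k < K:
                    C[i, j] += A[i, k] * B[k, j]
    return C

A = np.random.randint(0, 8, size=(4, 6))
B = np.random.randint(0, 8, size=(6, 5))
assert np.array_equal(systolic_matmul(A, B), A @ B)
```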
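
On compute-in-RAM: such proposals commonly store operands transposed (one bit-plane per RAM row) and compute bit-serially on the bitlines, so a single array operation touches every word at once. A functional sketch of that idea, assuming 8-bit unsigned operands; not any specific design's circuit:

```python
import numpy as np

def to_bitplanes(v: np.ndarray, bits: int) -> np.ndarray:
    """Transpose a vector of ints into `bits` bit-planes (one RAM row each)."""
    return np.array([(v >> b) & 1 for b in range(bits)], dtype=np.uint8)

def from_bitplanes(planes: np.ndarray) -> np.ndarray:
    return sum(planes[b].astype(np.int64) << b for b in range(planes.shape[0]))

def inram_add(pa: np.ndarray, pb: np.ndarray) -> np.ndarray:
    """Bit-serial ripple-carry add over bit-planes.

    Each loop iteration models one in-array operation on ALL words in
    parallel; only `bits` row operations are needed regardless of how
    many words the RAM holds, and no operand leaves the array.
    """
    bits, width = pa.shape
    out = np.zeros((bits, width), dtype=np.uint8)
    carry = np.zeros(width, dtype=np.uint8)
    for b in range(bits):                 # one step per bit position
        s = pa[b] ^ pb[b] ^ carry         # full-adder sum on the bitlines
        carry = (pa[b] & pb[b]) | (carry & (pa[b] ^ pb[b]))
        out[b] = s
    return out

a = np.random.randint(0, 128, size=1024)
b = np.random.randint(0, 128, size=1024)
res = from_bitplanes(inram_add(to_bitplanes(a, 8), to_bitplanes(b, 8)))
assert np.array_equal(res, a + b)
```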
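
On weightless NNs: they replace weighted neurons with RAM lookup nodes, which maps naturally onto LUT/BRAM-rich FPGAs; WiSARD is the classic example. A minimal sketch, with tuple size, wiring, and the toy data all chosen arbitrarily (n_inputs must be divisible by tuple_size):

```python
import random

class Discriminator:
    """One class: a bank of RAM nodes, each addressed by a tuple of input bits."""
    def __init__(self, n_inputs: int, tuple_size: int, rng: random.Random):
        self.mapping = list(range(n_inputs))
        rng.shuffle(self.mapping)                  # random input-to-RAM wiring
        self.tuple_size = tuple_size
        self.rams = [set() for _ in range(n_inputs // tuple_size)]

    def _addresses(self, bits):
        for r in range(len(self.rams)):
            idxs = self.mapping[r * self.tuple_size:(r + 1) * self.tuple_size]
            yield r, tuple(bits[i] for i in idxs)

    def train(self, bits):
        for r, addr in self._addresses(bits):
            self.rams[r].add(addr)                 # write a 1 at this address

    def score(self, bits) -> int:
        return sum(addr in self.rams[r] for r, addr in self._addresses(bits))

class WiSARD:
    def __init__(self, n_inputs: int, classes, tuple_size: int = 4, seed: int = 0):
        rng = random.Random(seed)
        self.discs = {c: Discriminator(n_inputs, tuple_size, rng) for c in classes}

    def train(self, bits, label):
        self.discs[label].train(bits)

    def predict(self, bits):
        return max(self.discs, key=lambda c: self.discs[c].score(bits))

# Toy usage: 8-bit patterns, two classes; training is one pass of RAM
# writes -- no multiplies, no backprop.
w = WiSARD(n_inputs=8, classes=["lo", "hi"], tuple_size=2)
w.train([0, 0, 0, 1, 0, 0, 1, 0], "lo")
w.train([1, 1, 1, 0, 1, 1, 0, 1], "hi")
print(w.predict([1, 1, 1, 1, 1, 1, 0, 1]))        # closer to the "hi" pattern
```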
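
On ML-based calibration of power models: one common approach treats the analytical estimate (e.g., from McPAT) plus performance counters as features and fits a regression against measured power. A sketch of that idea; the feature choice, coefficients, and data below are all synthetic and illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Features per workload: the analytical power estimate plus a few counters.
n = 200
mcpat_est = rng.uniform(5.0, 40.0, n)       # watts, uncalibrated model output
ipc       = rng.uniform(0.5, 3.0, n)
miss_rate = rng.uniform(0.0, 0.2, n)
X = np.column_stack([np.ones(n), mcpat_est, ipc, miss_rate])

# "Measured" power: the analytical model is assumed biased and mis-scaled;
# calibration learns the correction from measurements on real hardware.
measured = 1.8 + 0.7 * mcpat_est + 2.1 * ipc + 15.0 * miss_rate \
           + rng.normal(0, 0.5, n)

coef, *_ = np.linalg.lstsq(X, measured, rcond=None)   # least-squares fit

raw_err = np.abs(mcpat_est - measured).mean()
cal_err = np.abs(X @ coef - measured).mean()
print(f"mean abs error: raw {raw_err:.2f} W -> calibrated {cal_err:.2f} W")
```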
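
On ML-based branch predictors: the best-known example is the perceptron predictor of Jiménez & Lin. A minimal sketch with an illustrative table size; the training threshold follows the paper's formula:

```python
class PerceptronPredictor:
    """Perceptron branch predictor (Jimenez & Lin style).

    One small perceptron per branch (indexed by a PC hash); inputs are the
    global history bits encoded as +1 (taken) / -1 (not taken).
    """
    def __init__(self, n_perceptrons: int = 1024, hist_len: int = 16):
        self.hist_len = hist_len
        self.theta = int(1.93 * hist_len + 14)       # threshold from the paper
        self.w = [[0] * (hist_len + 1) for _ in range(n_perceptrons)]  # +bias
        self.history = [1] * hist_len

    def _output(self, pc: int) -> int:
        w = self.w[pc % len(self.w)]
        return w[0] + sum(wi * h for wi, h in zip(w[1:], self.history))

    def predict(self, pc: int) -> bool:
        return self._output(pc) >= 0                 # taken if non-negative

    def update(self, pc: int, taken: bool):
        y = self._output(pc)
        t = 1 if taken else -1
        w = self.w[pc % len(self.w)]
        # Train on a misprediction or when confidence is below the threshold.
        if (y >= 0) != taken or abs(y) <= self.theta:
            w[0] += t
            for i in range(self.hist_len):
                w[i + 1] += t * self.history[i]
        self.history = self.history[1:] + [t]        # shift in the outcome
```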