• Exploit SW optimizations first (before adding hardware)
  • Add hard matrix-multiplier blocks to the FPGA → how many? Trade-off: devoting a small fraction of area (3-5%, “Tensor Slices”) doubles throughput for ML applications without increasing overall area; no extra general-purpose routing required, and no slowdown on non-ML benchmarks (systolic-array sketch below)
  • Add compute inside the RAMs: performance improvement for only ~3% extra hardware → large reduction in data movement (bit-serial in-RAM sketch below)
  • Weightless NNs have potential (WiSARD sketch below)
  • ML for performance modeling: e.g., ML-based calibration of power models such as McPAT; a calibrated simulator can then model machines that do not exist yet (calibration sketch below)
  • Interest in ML-guided HW design → “Computer Architecture 2.0”, e.g. ML-based branch predictors? (perceptron sketch below)
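
On the hard matrix-multiplier bullet: a rough functional model of what such a block computes, not the actual Tensor Slice microarchitecture. This sketch steps an output-stationary MAC array cycle by cycle, with inputs skewed so that A[i, k] and B[k, j] meet at PE (i, j):

```python
import numpy as np

def systolic_matmul(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Cycle-by-cycle model of an output-stationary MAC array.

    A is MxK, B is KxN; PE (i, j) accumulates C[i, j]. Row i of A and
    column j of B are skewed by i and j cycles respectively, so the
    operand pair for index k reaches PE (i, j) at cycle t = i + j + k.
    """
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=A.dtype)
    for t in range(M + N + K - 2):        # cycles until the array drains
        for i in range(M):
            for j in range(N):
                k = t - i - j             # operand index arriving this cycle
                if 0 <= k < K:
                    C[i, j] += A[i, k] * B[k, j]
    return C

A = np.random.randint(0, 8, size=(4, 6))
B = np.random.randint(0, 8, size=(6, 5))
assert np.array_equal(systolic_matmul(A, B), A @ B)
```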
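
On compute-in-RAM: such proposals commonly store operands transposed (one bit-plane per RAM row) and compute bit-serially on the bitlines, so a single array operation touches every word at once. A functional sketch of that idea, assuming 8-bit unsigned operands; not any specific design's circuit:

```python
import numpy as np

def to_bitplanes(v: np.ndarray, bits: int) -> np.ndarray:
    """Transpose a vector of ints into `bits` bit-planes (one RAM row each)."""
    return np.array([(v >> b) & 1 for b in range(bits)], dtype=np.uint8)

def from_bitplanes(planes: np.ndarray) -> np.ndarray:
    return sum(planes[b].astype(np.int64) << b for b in range(planes.shape[0]))

def inram_add(pa: np.ndarray, pb: np.ndarray) -> np.ndarray:
    """Bit-serial ripple-carry add over bit-planes.

    Each loop iteration models one in-array operation on ALL words in
    parallel; only `bits` row operations are needed regardless of how
    many words the RAM holds, and no operand leaves the array.
    """
    bits, width = pa.shape
    out = np.zeros((bits, width), dtype=np.uint8)
    carry = np.zeros(width, dtype=np.uint8)
    for b in range(bits):                 # one step per bit position
        s = pa[b] ^ pb[b] ^ carry         # full-adder sum on the bitlines
        carry = (pa[b] & pb[b]) | (carry & (pa[b] ^ pb[b]))
        out[b] = s
    return out

a = np.random.randint(0, 128, size=1024)
b = np.random.randint(0, 128, size=1024)
res = from_bitplanes(inram_add(to_bitplanes(a, 8), to_bitplanes(b, 8)))
assert np.array_equal(res, a + b)
```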
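
On weightless NNs: they replace weighted neurons with RAM lookup nodes, which maps naturally onto LUT/BRAM-rich FPGAs; WiSARD is the classic example. A minimal sketch, with tuple size, wiring, and the toy data all chosen arbitrarily (n_inputs must be divisible by tuple_size):

```python
import random

class Discriminator:
    """One class: a bank of RAM nodes, each addressed by a tuple of input bits."""
    def __init__(self, n_inputs: int, tuple_size: int, rng: random.Random):
        self.mapping = list(range(n_inputs))
        rng.shuffle(self.mapping)                  # random input-to-RAM wiring
        self.tuple_size = tuple_size
        self.rams = [set() for _ in range(n_inputs // tuple_size)]

    def _addresses(self, bits):
        for r in range(len(self.rams)):
            idxs = self.mapping[r * self.tuple_size:(r + 1) * self.tuple_size]
            yield r, tuple(bits[i] for i in idxs)

    def train(self, bits):
        for r, addr in self._addresses(bits):
            self.rams[r].add(addr)                 # write a 1 at this address

    def score(self, bits) -> int:
        return sum(addr in self.rams[r] for r, addr in self._addresses(bits))

class WiSARD:
    def __init__(self, n_inputs: int, classes, tuple_size: int = 4, seed: int = 0):
        rng = random.Random(seed)
        self.discs = {c: Discriminator(n_inputs, tuple_size, rng) for c in classes}

    def train(self, bits, label):
        self.discs[label].train(bits)

    def predict(self, bits):
        return max(self.discs, key=lambda c: self.discs[c].score(bits))

# Toy usage: 8-bit patterns, two classes; training is one pass of RAM
# writes -- no multiplies, no backprop.
w = WiSARD(n_inputs=8, classes=["lo", "hi"], tuple_size=2)
w.train([0, 0, 0, 1, 0, 0, 1, 0], "lo")
w.train([1, 1, 1, 0, 1, 1, 0, 1], "hi")
print(w.predict([1, 1, 1, 1, 1, 1, 0, 1]))        # closer to the "hi" pattern
```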
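
On ML-based calibration of power models: one common approach treats the analytical estimate (e.g., from McPAT) plus performance counters as features and fits a regression against measured power. A sketch of that idea; the feature choice, coefficients, and data below are all synthetic and illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Features per workload: the analytical power estimate plus a few counters.
n = 200
mcpat_est = rng.uniform(5.0, 40.0, n)       # watts, uncalibrated model output
ipc       = rng.uniform(0.5, 3.0, n)
miss_rate = rng.uniform(0.0, 0.2, n)
X = np.column_stack([np.ones(n), mcpat_est, ipc, miss_rate])

# "Measured" power: the analytical model is assumed biased and mis-scaled;
# calibration learns the correction from measurements on real hardware.
measured = 1.8 + 0.7 * mcpat_est + 2.1 * ipc + 15.0 * miss_rate \
           + rng.normal(0, 0.5, n)

coef, *_ = np.linalg.lstsq(X, measured, rcond=None)   # least-squares fit

raw_err = np.abs(mcpat_est - measured).mean()
cal_err = np.abs(X @ coef - measured).mean()
print(f"mean abs error: raw {raw_err:.2f} W -> calibrated {cal_err:.2f} W")
```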
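
On ML-based branch predictors: the best-known example is the perceptron predictor of Jiménez & Lin. A minimal sketch with an illustrative table size; the training threshold follows the paper's formula:

```python
class PerceptronPredictor:
    """Perceptron branch predictor (Jimenez & Lin style).

    One small perceptron per branch (indexed by a PC hash); inputs are the
    global history bits encoded as +1 (taken) / -1 (not taken).
    """
    def __init__(self, n_perceptrons: int = 1024, hist_len: int = 16):
        self.hist_len = hist_len
        self.theta = int(1.93 * hist_len + 14)       # threshold from the paper
        self.w = [[0] * (hist_len + 1) for _ in range(n_perceptrons)]  # +bias
        self.history = [1] * hist_len

    def _output(self, pc: int) -> int:
        w = self.w[pc % len(self.w)]
        return w[0] + sum(wi * h for wi, h in zip(w[1:], self.history))

    def predict(self, pc: int) -> bool:
        return self._output(pc) >= 0                 # taken if non-negative

    def update(self, pc: int, taken: bool):
        y = self._output(pc)
        t = 1 if taken else -1
        w = self.w[pc % len(self.w)]
        # Train on a misprediction or when confidence is below the threshold.
        if (y >= 0) != taken or abs(y) <= self.theta:
            w[0] += t
            for i in range(self.hist_len):
                w[i + 1] += t * self.history[i]
        self.history = self.history[1:] + [t]        # shift in the outcome
```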