Luxonis is way more powerful and has an Intel Movidius VPU. It's not really meant to be a standalone platform, so it needs a host board (a Linux SBC like the Pi). It draws a lot more power, but it can do 60fps on small YOLO models. Its camera resolution is a lot higher as well.
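The ESP32-S3 has very little memory (512 KB of internal SRAM, plus external PSRAM on most camera boards) and no H.264/H.265 hardware encoder, so you'll be stuck with low resolution and low fps. AFAIK it only streams MJPEG, where every frame is a full JPEG, so the stream is bandwidth-hungry and you'll have to deal with high latency if you want to send that data somewhere over Wi-Fi. The big bonus is that it's low-cost and low-power, and your code runs directly on the chip (bare-metal or on a small RTOS like FreeRTOS) instead of on a full OS like Linux.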
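To get a feel for why MJPEG over Wi-Fi hurts, here's some back-of-the-envelope math. The 40 KB frame size and 5 Mbit/s effective Wi-Fi throughput are made-up placeholders, not measurements, so plug in your own numbers. The per-frame transfer time is only one piece of the total latency (capture, encoding, buffering and the receiver add more on top).

```python
# Back-of-the-envelope MJPEG math. All inputs are hypothetical placeholders:
# measure your own JPEG frame sizes and real Wi-Fi throughput.


def frame_transfer_ms(frame_kb: float, link_mbps: float) -> float:
    """Time to push one full JPEG frame over the link, in milliseconds."""
    frame_bits = frame_kb * 1024 * 8
    return frame_bits / (link_mbps * 1_000_000) * 1000


def max_fps(frame_kb: float, link_mbps: float) -> float:
    """Upper bound on frame rate when every frame is a full JPEG."""
    return 1000 / frame_transfer_ms(frame_kb, link_mbps)


if __name__ == "__main__":
    frame_kb = 40.0   # hypothetical JPEG size per frame
    link_mbps = 5.0   # hypothetical effective Wi-Fi throughput
    print(f"{frame_transfer_ms(frame_kb, link_mbps):.1f} ms per frame")
    print(f"at most {max_fps(frame_kb, link_mbps):.1f} fps")
```

With those placeholder numbers that's about 65.5 ms just to send each frame and roughly 15 fps max, before any protocol overhead. Bigger frames (higher resolution or JPEG quality) make both numbers worse linearly.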
Since there's no full OS to boot, you can do stuff like put the chip into deep sleep until something wakes it up (like a PIR sensor that detects people), and it will be streaming again within a second or two.
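Here's a rough sketch of why waking on motion matters so much for battery life. It's plain duty-cycle math: average current is the time-weighted mix of active and sleep current. The 2000 mAh battery, 250 mA streaming draw, 0.1 mA board-level sleep draw and 60 s of activity per hour are all hypothetical placeholders, not measured ESP32-S3 figures; real sleep current depends heavily on the board (regulators, camera, PIR).

```python
# Rough battery-life estimate for a wake-on-motion camera.
# Every current and timing value below is a hypothetical placeholder,
# not a datasheet or measured figure.


def average_current_ma(active_ma: float, active_s: float,
                       sleep_ma: float, period_s: float) -> float:
    """Time-weighted average current over one wake/sleep period."""
    if not 0 <= active_s <= period_s:
        raise ValueError("active_s must be between 0 and period_s")
    sleep_s = period_s - active_s
    return (active_ma * active_s + sleep_ma * sleep_s) / period_s


def battery_life_hours(capacity_mah: float, avg_ma: float) -> float:
    """Ideal runtime, ignoring regulator losses and battery derating."""
    return capacity_mah / avg_ma


if __name__ == "__main__":
    capacity = 2000.0  # hypothetical battery, mAh
    active = 250.0     # hypothetical draw while streaming, mA
    sleep = 0.1        # hypothetical board-level deep-sleep draw, mA

    always_on = battery_life_hours(capacity, active)
    # Hypothetical: the PIR wakes the board for 60 s total per hour.
    duty = average_current_ma(active, 60.0, sleep, 3600.0)
    print(f"always streaming: {always_on:.0f} h")
    print(f"wake on motion:   {battery_life_hours(capacity, duty):.0f} h")
```

With these placeholder numbers it's about 8 h always-on versus roughly 469 h (around 19.5 days) waking on motion. The absolute values don't mean much; the point is that sleep current and how often you wake up dominate the result.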
TL;DR: Luxonis stronk, but needs big batteries or to be plugged in. ESP32-S3 can run on small batteries or solar.