I think the only reason for mapping is to be able to block off 'no go' areas (no escaping out the front door!) and to be able to go home to the charger.
My previous robot vacuum did not do any mapping, but it always managed to find its way back to the charger. It'd just follow the walls until it saw the charger's IR beacon.
Clever design if you ask me. Doing a lot with a little.
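For anyone curious, the wall-following trick is tiny to implement. Here's a toy grid-world sketch (the grid and beacon cell are made up for illustration; a real robot would use bump/IR sensors rather than a map):

```python
# Minimal right-hand wall-follower on an occupancy grid.
# '#' = wall, '.' = free, 'C' = charger beacon. Purely illustrative.
GRID = [
    "#######",
    "#.....#",
    "#.###.#",
    "#.#C..#",
    "#######",
]

DIRS = [(-1, 0), (0, 1), (1, 0), (0, -1)]  # up, right, down, left

def free(r, c):
    return GRID[r][c] != "#"

def follow_wall(start, heading=1, max_steps=200):
    """Hug the right-hand wall until the charger cell is reached."""
    r, c = start
    for _ in range(max_steps):
        if GRID[r][c] == "C":
            return (r, c)
        # Prefer turning right, then straight, then left, then back.
        for turn in (1, 0, -1, 2):
            h = (heading + turn) % 4
            dr, dc = DIRS[h]
            if free(r + dr, c + dc):
                heading = h
                r, c = r + dr, c + dc
                break
    return None

print(follow_wall((1, 1)))  # reaches the charger cell at (3, 3)
```

No map, no localization: the robot just keeps a wall on one side until the beacon shows up, which is why it works with so little hardware.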
If mass produced, no part of a robot vacuum is expensive. Blower fans are ~$1. A camera is $1. A cheap WiFi MCU with a little ML accelerator and 8 MB of RAM is $1. A gyro is $1. Drive motors plus gearboxes together are $1. AC charger, $2. Plastic case, $2. Batteries are the most expensive bit (~$3), but you can afford a battery life of just 10 minutes if you can return to base frequently.
The hard part is the engineering hours to make it all work well. But you can recoup those as long as you can sell 100 million units across every nation in the world.
Yeah, agreed 100%. You might also need to factor in the cost of the charging dock, but the overall thesis is still sound.
Do you know of any cheap WiFi MCU with a little ML accelerator that we can buy off the shelf? The only one we could think of was the Jetson Orin Nano, and that's not cheap.
I am not an expert, but this seems like a case where model distillation could get the behavior you need running on a cheap end-user processor (Raspberry Pi 4/5 class). I chatted with Claude Opus about your project and got the following advice:
For the compute problem, you don't need a Jetson. The approach you want is knowledge distillation: train a large, expensive teacher model offline on a beefy GPU (cloud instance, your laptop's GPU, whatever), then distill it down into a tiny student network like a MobileNetV3-Small or EfficientNet-Lite. Quantize that student to int8 and export it to TFLite. The resulting model is 2-3 MB and runs at 10-20 FPS on a Raspberry Pi 4/5 with just the CPU - no ML accelerator needed. For even cheaper, an ESP32-S3 with a camera module can run sub-500KB models for simpler tasks. The preprocessing is trivial: resize the camera frame to 224x224, normalize pixel values, feed the tensor to the TFLite interpreter. The CNN learns its own feature extraction internally, so you don't need any classical CV preprocessing.
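For concreteness, the distillation loss itself is only a few lines. A rough numpy sketch (the temperature, blend weight, and logit values below are illustrative assumptions, not anything from your setup):

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """Soft-target KL divergence (scaled by T^2) blended with hard-label cross-entropy."""
    p_t = softmax(teacher_logits, T)   # softened teacher targets
    p_s = softmax(student_logits, T)   # softened student predictions
    kl = np.sum(p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12)), axis=-1)
    ce = -np.log(softmax(student_logits)[np.arange(len(labels)), labels] + 1e-12)
    return np.mean(alpha * (T ** 2) * kl + (1 - alpha) * ce)

# A student that matches the teacher scores a lower loss than one that doesn't.
teacher = np.array([[5.0, 1.0, 0.0], [0.5, 4.0, 0.5]])
labels = np.array([0, 1])
good = distill_loss(teacher.copy(), teacher, labels)
bad = distill_loss(-teacher, teacher, labels)
print(good < bad)  # True
```

You'd minimize this with the student on your labeled frames while the teacher's logits come from the big offline model; the T^2 factor keeps the soft-target gradient magnitude comparable across temperatures.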
Looking at your observations, I think the deeper issue is what you identified: there's not enough signal in single frames. Your validation loss not converging even after augmentation and ImageNet pretraining confirms this. The fix is exactly what you listed in your future work - feed stacked temporal frames instead of single images. A simple approach is to concatenate 3-4 consecutive grayscale frames into a multi-channel input (e.g., 224x224x4). This gives the network implicit motion, velocity, and approach-rate information without needing to compute optical flow explicitly. It's the same trick DeepMind used in the original Atari DQN paper - a single frame of Pong doesn't tell you which direction the ball is moving either.
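A rough sketch of the frame stacking, assuming a 224x224 grayscale feed and a stack depth of 4 (both just example numbers):

```python
from collections import deque
import numpy as np

STACK, H, W = 4, 224, 224

class FrameStacker:
    """Keep the last STACK grayscale frames and expose them as one H x W x STACK tensor."""
    def __init__(self):
        self.frames = deque(maxlen=STACK)

    def push(self, frame):
        if not self.frames:  # pad with copies of the first frame at startup
            self.frames.extend([frame] * STACK)
        else:
            self.frames.append(frame)
        return np.stack(self.frames, axis=-1)  # shape (H, W, STACK)

stacker = FrameStacker()
for t in range(6):
    frame = np.full((H, W), t, dtype=np.uint8)  # stand-in for a camera frame
    stacked = stacker.push(frame)

print(stacked.shape)           # (224, 224, 4)
print(stacked[0, 0].tolist())  # [2, 3, 4, 5] -- the four most recent timesteps
```

The network's first conv layer then just takes 4 input channels instead of 1, and motion falls out of the channel differences.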
On the action space: your intuition about STOP being problematic is right. It creates a degenerate attractor - once the model predicts STOP, there's no recovery mechanism. The paper you referenced that only uses STOP at goal-reached is the better design. Also consider that TURN_CW and TURN_CCW have no obvious visual signal in a single frame (which way to turn is a function of where you've been and where you're going, not just what you see right now), which is another reason temporal stacking or adding a small recurrent/memory component would help. Even a simple LSTM or state tuple fed alongside the image could encode "I've been turning left for 3 steps, maybe try something else."
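The "state tuple fed alongside the image" can be as simple as a normalized histogram of recent actions. A toy sketch (the action names and window size here are made up for illustration):

```python
from collections import Counter, deque

ACTIONS = ["FORWARD", "TURN_CW", "TURN_CCW"]

class ActionMemory:
    """One-hot-style histogram of the last k actions, fed to the net beside the image."""
    def __init__(self, k=3):
        self.recent = deque(maxlen=k)

    def push(self, action):
        self.recent.append(action)

    def encode(self):
        counts = Counter(self.recent)
        k = max(len(self.recent), 1)
        return [counts[a] / k for a in ACTIONS]  # normalized frequencies

mem = ActionMemory(k=3)
for a in ["TURN_CCW", "TURN_CCW", "TURN_CCW"]:
    mem.push(a)
print(mem.encode())  # [0.0, 0.0, 1.0] -> "I've been turning left for 3 steps"
```

Concatenating this tiny vector onto the image features gives the policy a cheap way out of turn-forever loops without needing a full recurrent network.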
For the longer term, consider a hybrid architecture: use the distilled neural net for obstacle detection and free-space classification, but pair it with classical SLAM or even simple odometry-based mapping for path planning and coverage. Pure end-to-end behavior cloning for the full navigation stack is a hard problem - even the commercial robots use learned perception with algorithmic planning. And your data collection would get easier too, because you'd only need to label "what's in front of me" rather than "what should I do," which decouples perception from decision-making and makes each piece easier to train and debug independently.
The mel spectrogram is the first stage of a speech recognition pipeline...
But perhaps you'd get better results if more of a ML speech/audio recognition pipeline were included?
E.g., the pipeline could separate out drum beats from piano notes and present them differently in the visualization?
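For reference, the mel filterbank at the front of such a pipeline is easy to roll by hand. A rough numpy sketch (band count, FFT size, and sample rate are arbitrary example values):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels=8, n_fft=512, sr=22050):
    """Triangular filters, evenly spaced on the mel scale, over FFT bins."""
    mels = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        lo, mid, hi = bins[i], bins[i + 1], bins[i + 2]
        for b in range(lo, mid):
            fb[i, b] = (b - lo) / max(mid - lo, 1)  # rising edge
        for b in range(mid, hi):
            fb[i, b] = (hi - b) / max(hi - mid, 1)  # falling edge
    return fb

# Apply to one frame of audio: power spectrum -> 8 mel band energies.
sr, n_fft = 22050, 512
t = np.arange(n_fft) / sr
frame = np.sin(2 * np.pi * 440 * t)  # a 440 Hz test tone
power = np.abs(np.fft.rfft(frame)) ** 2
mel_energies = mel_filterbank() @ power
print(mel_energies.shape)  # (8,)
print(int(np.argmax(mel_energies)))  # the low band containing 440 Hz dominates
```

Each of those 8 band energies could drive a segment of the LED strip directly; a beat detector or stem separator would then sit downstream of this same representation.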
An autoencoder network trained to minimize perceptual reconstruction loss would probably have the most 'interesting' information at the bottleneck, so that's the layer I'd feed into my LED strip.
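To sketch the idea (with plain MSE rather than a perceptual loss, on synthetic "spectrum" data, and with invented sizes): a tiny tied-weight linear autoencoder whose 3-unit bottleneck maps straight onto RGB:

```python
import numpy as np

rng = np.random.default_rng(0)
N_IN, N_BOTTLENECK, LR, STEPS = 32, 3, 0.01, 500

# Fake "spectrum frames": random mixtures of a few fixed spectral shapes,
# standing in for real mel-spectrogram columns.
basis = rng.normal(size=(4, N_IN))
X = rng.random(size=(256, 4)) @ basis

# Tied-weight linear autoencoder: encode z = x W, decode x_hat = z W^T.
W = rng.normal(scale=0.1, size=(N_IN, N_BOTTLENECK))

def loss(W):
    Z = X @ W
    return np.mean((Z @ W.T - X) ** 2)

start = loss(W)
for _ in range(STEPS):
    Z = X @ W
    err = Z @ W.T - X                                    # reconstruction error
    grad = 2 * (X.T @ (err @ W) + err.T @ X @ W) / X.size
    W -= LR * grad                                       # plain gradient descent
print(loss(W) < start)  # True: reconstruction improves over training

# The bottleneck activations are the LED drive signal: one channel per unit.
z = X[0] @ W
rgb = np.clip(np.abs(z) / (np.abs(z).max() + 1e-9), 0, 1)
print(rgb.shape)  # (3,)
```

A real version would be a small nonlinear net with a perceptual loss, but the wiring is the same: whatever survives the bottleneck is, by construction, the information the network considered worth keeping, and that's what you visualize.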
I've done this in my own solution in this space (https://thundergroove.com). I use a realtime beat detection neural network combined with similar frequency spectrum analyses to provide a set of signals that effects can use.
Effects themselves are written in embedded JavaScript and can be layered a bit like Photoshop. Currently it only supports driving Nanoleaf and WLED fixtures, though WLED gives you a huge range of options. The effect language is fully exposed, so you can easily write your own effects against the real-time audio signals.
It isn't open source though, and it still needs better onboarding and tutorials. Currently it's completely free; I haven't really decided whether I want to bother trying to monetize any of it. If I did, it would probably just be for DMX and maybe MIDI support, or perhaps for an ecosystem of portable hardware.
I was playing around with this recently, but the problem I encountered is that most AI analysis techniques, like stem separation, aren't built to work in real time.
Btrfs allows migration from ext4 with a rather good rollback strategy...
Post-migration, a complete disk image of the original ext4 disk will exist within the new filesystem, using no additional disk space due to the magic of copy-on-write.
Why isn't the repair process the same? Fix the filesystem to get everything online ASAP, and leave a complete image of the old, damaged filesystem so that other recovery processes can be tried if necessary.
Doing something 10x as big is 100x as difficult. And the last 10% takes 50% of the work. With that in mind, Starship is right on schedule. Something will be operational by 2030.
The talent has mostly gone because the US is fiercely politically divided, and Musk switching sides from the Democrats to the Republicans pretty much meant his whole staff was forced to jump ship, because he no longer aligned with their values.
It's what happens when Elon jumps into the k-hole and convinces himself that because he owns a company that successfully did a thing, his genius will make his other companies do an even better thing. He's wrong. And he can stay wrong for years, even decades.
Starship is too big for orbital payloads, and too heavy to go beyond orbit. And even if it actually achieves its target payload capacity, it takes 15 refueling missions to do anything other than an orbital mission. If it doesn't achieve target payload capacity, it's cooked.
RDP in the Windows XP days supported all kinds of tricks to cope with low bandwidth, like doing rendering on the client rather than the server.
I think most of those tricks have been disabled in modern Windows for better security (you don't want some guest user able to feed your awfully complex, not-so-robust rendering code malicious inputs...).
This needs to be augmented with a new bit of contract law that enables a new type of 'subscription' where the terms are set by law.
Those terms would include things like "payments are monthly, service automatically ends when payments end, etc."
As things stand today, plenty of consumers end subscriptions by blocking payment, which works in practice, but it opens the door to a scumbag company bulk-chasing all those unpaid subscriptions through the courts with templated cases and getting liens on millions of homes for $150 each.
For the actual cleaning, random works great.