The Rhasspy [0] author recently got hired by Mycroft to work on satellites and fully local voice control. Rhasspy requires a lot of manual work, but replacing Alexa is already possible. I’m somewhat stuck due to the current hardware availability issues, but I have a Pi 3 satellite that does wakeword detection (this is supposed to be handled by a Pi Zero 2 W in the future) and sends the voice to the MQTT broker running on a Pi 4. The data gets picked up by the Rhasspy instance also running there, which does STT and intent recognition, sends the intent to Home Assistant, and then does TTS back to the satellite.
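For anyone curious what that pipeline looks like on the wire, a rough sketch that just watches the bus (assumptions: Rhasspy's default Hermes MQTT topics, the paho-mqtt library with its 1.x-style callback API, and a broker on localhost; topic names can differ per profile):

```python
import json

def describe(topic: str, payload: bytes) -> str:
    """Render one Hermes MQTT message as a log line (siteId + topic + text)."""
    data = json.loads(payload.decode("utf-8"))
    return f"[{data.get('siteId', '?')}] {topic}: {data.get('text', '')}"

def watch(broker: str = "localhost") -> None:
    """Subscribe to each pipeline stage and print messages as they pass."""
    import paho.mqtt.client as mqtt  # third-party; paho-mqtt 1.x-style API

    client = mqtt.Client()
    client.on_message = lambda c, u, msg: print(describe(msg.topic, msg.payload))
    client.connect(broker, 1883)
    for topic in ("hermes/hotword/+/detected",  # wakeword fired on a satellite
                  "hermes/asr/textCaptured",    # STT result
                  "hermes/intent/#",            # recognized intent
                  "hermes/tts/say"):            # TTS reply on its way back
        client.subscribe(topic)
    client.loop_forever()
```

Running `watch()` while talking to a satellite shows each hop of the round trip in order.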
My main software issue currently is how to replicate the music functionality: playing music at the satellite that requested it, and lowering the volume when it recognizes the wakeword. Preselecting "commands" for band and genre names should be easily scriptable afterwards.
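The duck-on-wakeword part could be scripted against the same MQTT bus. A minimal sketch, assuming the Hermes topics, ALSA's `amixer` for volume, and paho-mqtt again; the volume levels and the choice of `sessionEnded` as the restore trigger are my guesses, not a tested setup:

```python
import subprocess

DUCKED, NORMAL = "30%", "80%"  # hypothetical levels; tune to taste

def amixer_args(level: str, control: str = "Master") -> list:
    """Build the amixer command that sets the playback volume."""
    return ["amixer", "sset", control, level]

def set_volume(level: str) -> None:
    subprocess.run(amixer_args(level), check=True)

def run(broker: str = "localhost") -> None:
    """Duck this satellite's volume while a voice session is active."""
    import paho.mqtt.client as mqtt  # third-party; paho-mqtt 1.x-style API

    def on_message(client, userdata, msg):
        if msg.topic.endswith("/detected"):  # wakeword heard: duck
            set_volume(DUCKED)
        else:                                # dialogue session over: restore
            set_volume(NORMAL)

    client = mqtt.Client()
    client.on_message = on_message
    client.connect(broker, 1883)
    client.subscribe("hermes/hotword/+/detected")
    client.subscribe("hermes/dialogueManager/sessionEnded")
    client.loop_forever()
```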
In a quiet room, I have no issues with wakeword detection using a PlayStation Eye camera (I wanted the Seeed USB microphone array, but between discovering it and starting to buy hardware, the supply chain bit once again).
Playing music from a Plex server is a major use case for me, and I have given up on Rhasspy because I couldn’t get all the pieces to work together (I have the mic array HAT and a Synology I can run recognition on). Do you have a write-up of your setup?
> Playing music from a Plex server is a major use case for me, and I have given up on Rhasspy because I couldn’t get all the pieces to work together (I have the mic array HAT and a Synology I can run recognition on). Do you have a write-up of your setup?
I have not yet managed it / worked enough on it (the lack of hardware makes everything theoretical, which kills my motivation). The way I understand it, there will either be a casting server on the satellite, or a PulseAudio/PipeWire server reachable via the network. But I have next to no experience with consumer Linux, so configuring those parts is… hard.
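For the network-audio route, one possible sketch: if the satellite's PulseAudio/PipeWire loads `module-native-protocol-tcp` (default port 4713), a client on the Pi 4 can be pointed at it through the `PULSE_SERVER` environment variable. The hostname below is made up, and I haven't verified this end to end:

```python
import os
import subprocess

def pulse_env(satellite_host: str, port: int = 4713) -> dict:
    """Environment that redirects PulseAudio clients to a remote server."""
    env = dict(os.environ)
    env["PULSE_SERVER"] = f"tcp:{satellite_host}:{port}"
    return env

def play_on_satellite(wav_path: str, satellite_host: str) -> None:
    # paplay streams the file to the satellite's sound server over TCP.
    subprocess.run(["paplay", wav_path], env=pulse_env(satellite_host), check=True)
```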
But there are many tutorials for playing multi-room audio (with Icecast or similar). I just assumed it would be easier without multi-room, since I don’t need it, but it turns out it’s not ;)
Yeah, we aren't using the Seeed array in the final Mark II, but we have used the same XMOS XVF-3510 to perform acoustic echo cancellation. That means that even with music blasting out of the speakers, you can still wake the device from across the room.
In a simple fashion, you can think of it as subtracting the audio being output from the audio coming in through the microphone.
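A toy version of that subtraction, ignoring everything that makes real AEC hard (the chip estimates the echo path adaptively with a multi-tap filter; a single fixed delay and gain here are purely illustrative):

```python
def cancel_echo(mic, playback, delay, gain):
    """Subtract a delayed, scaled copy of the playback signal from the
    microphone signal. Samples before the delay has elapsed see no echo."""
    out = []
    for i, sample in enumerate(mic):
        ref = playback[i - delay] if i >= delay else 0.0
        out.append(sample - gain * ref)
    return out
```

If the mic picks up the speaker output one sample late at half amplitude, subtracting that delayed, halved copy of the playback leaves roughly just the voice.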
[0]: https://rhasspy.readthedocs.io/en/latest/