What this guide covers
A WiFi / Bluetooth AI smart speaker is one of the most complex consumer electronics products to design and manufacture. It requires 4 parallel design tracks that must converge at a single SoC:
- SoC + WiFi/BT radio — runs the entire system, including AI inference
- Audio chain — microphones, DSP, amplifier, speaker driver, acoustic chamber
- AI pipeline — wake word, STT, LLM, TTS, with strict latency budget
- Enclosure — mechanical, thermal, and acoustic design in one box
Get any of these wrong, and the speaker either doesn't work, sounds terrible, or fails certification. Get them all right, and you have a product that competes with Sonos, Apple HomePod, and Amazon Echo at a fraction of the cost.
This guide walks through the 7-stage process we use on every smart-speaker project at SkyTech, with real BOMs and cost breakdowns for entry ($22-30), mid ($50-65), and premium ($130-180) tiers.
1. Define product requirements
Before you pick a single component, lock down the requirements. The single biggest mistake we see is founders building a "smart speaker" without specifying which tier they're targeting. Start with these three decisions:
1a. Product tier
| Tier | Use case | Compute | Audio | Power | Target retail |
|---|---|---|---|---|---|
| Entry | Alarm clock, kitchen timer, basic voice | Single-chip MCU + cloud AI | 5W mono, 1 driver | 5V/2A USB-C or 4x AA | $40-60 |
| Mid | Kitchen assistant, bedside clock, portable | Quad-core SoC + cloud or on-device small model | 10-20W stereo, 2-4 drivers | 12-19V DC or battery | $100-200 |
| Premium | Living room speaker, conference room, smart display | Hex/octa-core SoC + on-device LLM (3-8B) | 30-100W with subwoofer | Mains, 100-240V AC | $300-600 |
1b. Latency budget
User-perceived latency is the most important quality metric for AI speakers. Here's what a good user experience requires:
| Stage | Cloud AI target | On-device LLM target | Why it matters |
|---|---|---|---|
| Wake word detection (on-device) | < 200ms | < 200ms | User expects immediate response |
| STT (speech → text) | 1-2s | 1-2s | User has finished speaking and is waiting for a response |
| LLM first token | 300-800ms | 1-2s | User starts hearing response |
| TTS first audio | 200-500ms | 200-500ms | Audio starts playing |
| Total wake → first audio | 2-3.5s | 2.5-4.5s | Acceptable for natural conversation |
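A budget like this is easiest to enforce when it lives in code next to your latency measurements. The following is a minimal sketch, assuming the stages run strictly one after another and using the upper bound of each cloud target from the table; the stage names and the `measured` values are hypothetical.

```python
# Worst-case stage targets in milliseconds, taken from the cloud AI column above.
CLOUD_TARGETS_MS = {
    "wake_word": 200,
    "stt": 2000,
    "llm_first_token": 800,
    "tts_first_audio": 500,
}


def total_latency_ms(stages):
    """Sum stage latencies, assuming the stages run back to back with no overlap."""
    return sum(stages.values())


def over_budget(measured_ms, targets_ms):
    """Return each stage that exceeds its target, mapped to the overrun in ms."""
    return {
        name: measured_ms[name] - limit
        for name, limit in targets_ms.items()
        if measured_ms.get(name, 0) > limit
    }


if __name__ == "__main__":
    # Hypothetical measurements from a prototype, for illustration only.
    measured = {"wake_word": 180, "stt": 2400, "llm_first_token": 650, "tts_first_audio": 450}
    print("budget (ms):", total_latency_ms(CLOUD_TARGETS_MS))
    print("measured (ms):", total_latency_ms(measured))
    print("overruns (ms):", over_budget(measured, CLOUD_TARGETS_MS))
```

The worst-case cloud targets sum to 3,500 ms, which matches the upper end of the total row. Streaming pipelines overlap stages, so a real end-to-end figure can come in below the simple sum.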
1c. Wake word approach
Two main paths, with serious privacy + power tradeoffs:
| Approach | Power | Latency | Privacy | Accuracy |
|---|---|---|---|---|
| On-device keyword spotting (Picovoice, Syntiant, custom TFLM) | 1-10mW | <200ms | Audio never leaves device | 95-98% in quiet, 85-90% in noise |
| Cloud-triggered activation (always-listening audio stream) | 200-500mW continuous | 500-1500ms (network) | Audio uploaded on every wake | 98%+ (cloud DSP) |
Recommendation: on-device keyword spotting for battery-powered products. Privacy-first. Apple, Google, and Amazon all do this on their flagship devices for good reason.
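To see why the power column drives this recommendation, here is a rough standby estimate. It is a sketch under stated assumptions: the 2,000 mAh / 3.7 V cell is hypothetical, and the calculation counts only the always-listening draw, ignoring the rest of the system, converter losses, and battery aging.

```python
def battery_life_hours(capacity_mah, voltage_v, avg_power_mw):
    """Ideal runtime: stored energy (mWh) divided by average draw (mW)."""
    if avg_power_mw <= 0:
        raise ValueError("avg_power_mw must be positive")
    return capacity_mah * voltage_v / avg_power_mw


if __name__ == "__main__":
    CAPACITY_MAH = 2000  # assumed cell capacity, not from the guide
    CELL_VOLTAGE = 3.7   # assumed nominal Li-ion voltage
    # Power figures are the ends of the ranges in the table above.
    for label, power_mw in [
        ("on-device KWS, 10 mW", 10),
        ("cloud-triggered, 200 mW", 200),
        ("cloud-triggered, 500 mW", 500),
    ]:
        hours = battery_life_hours(CAPACITY_MAH, CELL_VOLTAGE, power_mw)
        print(f"{label}: {hours:.0f} h of listening")
```

With these assumed numbers, listening alone lasts roughly 740 hours at 10 mW but only about 37 hours at 200 mW, before the amplifier or radio has drawn anything.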
2. Select SoC and wireless chipset
The SoC is the most consequential decision. It determines cost, audio quality, AI capability, and battery life. Here's how the major options stack up for 2026:
| SoC | CPU | WiFi | BT | Audio I/O | AI | Cost (1k) | Best for |
|---|---|---|---|---|---|---|---|
| ESP32-S3 | Dual-core 240MHz Xtensa LX7 | 802.11 b/g/n | BLE 5 + Mesh | I²S, PDM mic (no on-chip DAC) | TensorFlow Lite Micro | $2.50 | Entry / battery |
| ESP32-P4 | Dual-core 400MHz RISC-V + LP core | None on-chip (needs a companion radio such as ESP32-C6) | Via companion radio | I²S, PDM, MIPI-CSI | TFLM + vector ops | $4.20 | Entry+ with camera |
| Amlogic A113X | Quad-core 1.2GHz Cortex-A53 | 802.11 ac + BT 5.0 | BLE 5.0 + Audio | I²S, TDM, PDM, SPDIF | Linux + ALSA + DSP | $8.50 | Mid tier Linux-based |
| Allwinner V821 | Dual-core RISC-V + DSP | 802.11 b/g/n + BT 5.0 | BLE 5.0 | Hardware audio codec | Low-power AI (~1 TOPS) | $6.80 | Low-power portable |
| Rockchip RK2118 | Dual-core ARM + 1 TOPS NPU | 802.11 ax + BT 5.4 | BLE 5.4 + Audio | 8-ch PDM, hardware codec | 1 TOPS NPU (KWS + small models) | $11.00 | Mid+ with on-device KWS |
| Amlogic A311D | Quad-core A73 + dual-core A53 | 802.11 ax + BT 5.0 | BLE 5.0 | Multi-channel I²S, HDMI | 5 TOPS NPU | $22.00 | Premium on-device LLM |
| Rockchip RK3588 | Octa-core 2.4GHz + 6 TOPS NPU | 802.11 ax + BT 5.2 | BLE 5.2 | 8K video + multi-channel audio | 6 TOPS (Llama 3 8B at 5 tok/s) | $45.00 | Flagship smart display |
2a. WiFi / Bluetooth coexistence
Both radios share the 2.4 GHz band. Without proper coexistence, audio over Bluetooth drops out when WiFi is active. Two implementation paths:
Single antenna shared (most common)
- Time-Division Multiplexing (TDM): SoC alternates between WiFi TX/RX and BT TX/RX in microsecond time slots. ESP32 uses a hardware arbiter; Amlogic uses a software layer.
- Antenna isolation: Use a single 2.4 GHz antenna with a diplexer or shared front-end. 20-30 dB isolation is sufficient.
- Bluetooth audio codec: LC3 (LE Audio) is robust to WiFi interference; SBC is more vulnerable. Use LC3 if your SoC supports it.
Dual antenna (premium tier)
- Separate WiFi and BT antennas, placed at least 50mm apart on the PCB
- Higher cost ($1-2 per antenna) but eliminates coexistence issues
- Use for: premium speakers, home theater, conference room devices
3. Design the audio chain
The audio chain determines sound quality, voice pickup, and whether the speaker is delightful or awful. Don't skimp here.
3a. Microphone array
For voice pickup and beamforming:
- 2-mic array (entry): enough for basic far-field pickup. Mics spaced 40-60mm apart.
- 3-mic triangle (mid): better beamforming, supports 360° pickup. Mics on 3 corners of equilateral triangle, 50-80mm sides.
- 4-mic linear array (premium): best for directional pickup (e.g., conference room). Mics in a row, 25-40mm apart.
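Mic spacing sets two things a beamformer cares about: the highest frequency it can steer without spatial aliasing, and the arrival-time differences between mics. The sketch below uses the standard half-wavelength rule and a far-field plane-wave model for a uniform linear array. The function names are illustrative, and 343 m/s assumes room-temperature air.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # dry air at roughly 20 °C


def spatial_aliasing_limit_hz(spacing_m):
    """Highest frequency at which the spacing is still at most half a wavelength."""
    return SPEED_OF_SOUND_M_S / (2.0 * spacing_m)


def arrival_delays_s(num_mics, spacing_m, angle_deg):
    """Relative arrival delays across a uniform linear array.

    angle_deg is measured from broadside (0 = straight ahead). These are the
    offsets a delay-and-sum beamformer has to compensate before summing.
    """
    step = spacing_m * math.sin(math.radians(angle_deg)) / SPEED_OF_SOUND_M_S
    delays = [i * step for i in range(num_mics)]
    earliest = min(delays)
    return [d - earliest for d in delays]  # shift so no delay is negative


if __name__ == "__main__":
    for spacing_mm in (25, 40, 60):
        limit = spatial_aliasing_limit_hz(spacing_mm / 1000.0)
        print(f"{spacing_mm} mm spacing -> aliasing above ~{limit:.0f} Hz")
    print(arrival_delays_s(4, 0.03, 45.0))
```

A 50mm spacing gives a limit of 3,430 Hz, while 25mm raises it to 6,860 Hz. Tighter spacing extends the alias-free range at the cost of low-frequency directivity.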
Recommended MEMS mic (2026):
- Knowles SPH0645LM4H-B — best SNR (65 dB), $1.20/unit, used in Apple HomePod mini
- Infineon IM73A135 — best for noise rejection, $1.50/unit, used in Amazon Echo
- TDK ICS-43434 — best for far-field, $0.85/unit, used in Google Nest
3b. Audio DSP and CODEC
Most modern SoCs include audio DSP. For pure software beamforming and AEC (acoustic echo cancellation), use the SoC's DSP core or an external DSP like Knowles AISonic.
Audio CODEC requirements:
- 24-bit / 48kHz minimum (16-bit is too low for Hi-Res audio)
- Integrated DAC + ADC (saves PCB space and BOM cost)
- Low latency (<20ms for real-time processing)
For entry tier: SoC integrated DAC/ADC is sufficient.
For mid/premium: external DAC like ESS Sabre 9018 or AKM AK4493 for audiophile-grade output.
3c. Amplifier
| Output | Class | Efficiency | Use case | Example chip |
|---|---|---|---|---|
| 5W mono | Class D | 85-90% | Entry | TI TPA2005 |
| 10W stereo | Class D | 85-90% | Mid | TI TPA3116 |
| 30W+ per channel | Class D | 88-93% | Premium | Infineon MERUS MA12070 |
| Subwoofer (50W+) | Class D bridge-tied load | 85% | Premium | TI TPA3255 |
Class D is the obvious choice for battery-powered speakers. Class AB only for audiophile HiFi products where THD matters more than efficiency.
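The efficiency column translates directly into supply power and heat. Here is a small sketch of that arithmetic; the 50% figure used for Class AB is an assumed illustrative value, not a number from the table.

```python
def amplifier_power(output_w, efficiency):
    """Return (input_w, heat_w) for an amplifier delivering output_w.

    efficiency is a fraction between 0 and 1 (for example 0.85 for 85%).
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    input_w = output_w / efficiency
    return input_w, input_w - output_w


if __name__ == "__main__":
    for label, eff in [("Class D at 85%", 0.85), ("Class AB at 50% (assumed)", 0.50)]:
        supply, heat = amplifier_power(10.0, eff)
        print(f"{label}: {supply:.1f} W from the supply, {heat:.1f} W as heat")
```

At 10W of output, the assumed Class AB stage dissipates as much heat as it delivers to the driver, which matters in a sealed enclosure as well as on a battery.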
3d. Speaker driver
Driver choice has more impact on perceived audio quality than any other component. Don't buy the cheapest driver you find on Alibaba.
- Full-range driver (1.5-3 inches) for entry: covers 200Hz-15kHz, low cost, low power
- Full-range + tweeter (2-way) for mid: better high-frequency extension, more natural sound
- Full-range + tweeter + subwoofer (3-way) for premium: frequency response of 50Hz-20kHz
- Passive radiator for bass extension in small enclosures (avoids port noise)
4. Integrate the AI pipeline
The AI pipeline is where smart speakers differentiate. There are 4 stages: wake word → STT → LLM → TTS. Each has its own latency target and engineering tradeoffs.
4a. Wake word detection
On-device keyword spotting is the privacy-first default. Options:
| Solution | Power | Latency | Custom wake | License |
|---|---|---|---|---|
| Picovoice Porcupine | 5-10mW | 100-150ms | Yes (free) | Commercial |
| Syntiant NDP120 | 0.5-1mW (always-on) | 50-100ms | Yes (custom model) | Commercial |
| Google Edge TPU | 10-30mW | 100-200ms | Yes (TensorFlow Lite) | Open source |
| Custom TFLM model | 5-20mW | 100-200ms | Yes (own training) | Self-hosted |
For most products, Picovoice Porcupine is the right answer (Porcupine is Picovoice's wake word engine; Cheetah is its speech-to-text product). It runs on ESP32-S3, has the best accuracy of the options above, and supports unlimited custom wake words. Cost: $0.10/unit royalty at 10k+ units.
4b. STT (speech-to-text)
Cloud STT is the only practical option in 2026 for product-grade accuracy. On-device STT is improving but not yet at Whisper quality.
| Service | Latency | Accuracy (WER) | Cost | Streaming |
|---|---|---|---|---|
| OpenAI Whisper API | 1-2s | 3-5% | $0.006/min | No |
| Deepgram Nova-2 | 0.3-0.8s | 2-4% | $0.0043/min | Yes |
| Google Cloud STT | 0.5-1.5s | 4-6% | $0.006/15s | Yes |
| AssemblyAI Universal | 0.5-1.2s | 3-5% | $0.0065/min | Yes |
Recommendation: Deepgram for streaming + low latency + good price. If you need multilingual, Whisper is still the gold standard.
4c. LLM (the actual smarts)
This is where the magic happens. Two paths:
Cloud LLM (most products)
- OpenAI GPT-4o-mini: $0.15/1M input tokens, fast, capable
- Anthropic Claude Sonnet 4.5: $3/1M input, smarter, better instruction following
- OpenAI GPT-5 (when available): new flagship
Latency: 300-800ms for first token, 5-15 tokens/sec streaming. For conversational AI, GPT-4o-mini is the sweet spot for cost/quality.
On-device LLM (premium tier, premium price)
- Phi-3 3.8B Mini: 5-8 tok/s on RK3588, decent for short responses
- Llama 3.1 8B: 3-5 tok/s, smarter but slower
- Qwen 2.5 7B: best for multilingual
On-device LLM is a "privacy premium" feature. Worth it for medical, legal, or B2B use cases. Not for consumer mass-market yet.
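Two quick calculations help when sizing an on-device model: how much memory the weights need, and how long a spoken reply takes to generate at the throughputs listed above. This is a rough sketch, assuming 4-bit quantized weights and ignoring the KV cache, activations, and runtime overhead; the 40-token reply length is hypothetical.

```python
def weight_footprint_gb(params_billion, bits_per_weight):
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8.0


def generation_time_s(reply_tokens, tokens_per_s):
    """Time to generate a full reply at a steady decode rate."""
    if tokens_per_s <= 0:
        raise ValueError("tokens_per_s must be positive")
    return reply_tokens / tokens_per_s


if __name__ == "__main__":
    REPLY_TOKENS = 40  # assumed length of a short spoken answer
    # (model, parameters in billions, slow end of the tok/s range quoted above)
    for name, params_b, tok_s in [("Phi-3 Mini 3.8B", 3.8, 5.0), ("Llama 3.1 8B", 8.0, 3.0)]:
        size = weight_footprint_gb(params_b, 4)
        secs = generation_time_s(REPLY_TOKENS, tok_s)
        print(f"{name}: ~{size:.1f} GB at 4-bit, ~{secs:.0f} s for {REPLY_TOKENS} tokens")
```

Because a full reply can take many seconds at these rates, streaming tokens into TTS as they arrive matters more on-device than in the cloud.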
4d. TTS (text-to-speech)
Cloud TTS is the standard for natural-sounding voices. On-device TTS exists but quality is much lower.
| Service | Quality | Latency (first audio) | Cost | Streaming |
|---|---|---|---|---|
| ElevenLabs Turbo v2.5 | ★★★★★ | 200-400ms | $0.15/1k chars | Yes |
| OpenAI TTS-1-HD | ★★★★☆ | 300-500ms | $0.030/1k chars | Yes |
| Google Cloud TTS | ★★★☆☆ | 300-500ms | $0.016/1k chars | Yes |
| ElevenLabs (custom voice clone) | ★★★★★ | 500-800ms | $0.30/1k chars | Yes |
Recommendation: ElevenLabs for premium tier (voice is a differentiator). OpenAI TTS-1-HD for mid tier (good quality, lower cost). Google Cloud TTS for entry (cheap, decent quality).
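It helps to roll the three cloud prices into a cost per interaction before committing to a stack. The sketch below uses the Deepgram, GPT-4o-mini input, and OpenAI TTS-1-HD prices from the tables above. The LLM output-token price and the interaction shape (5 seconds of speech, 200 input tokens, 60 output tokens, a 200-character reply) are assumptions for illustration only.

```python
def cost_per_interaction(speech_s, input_tokens, output_tokens, reply_chars,
                         stt_per_min=0.0043,      # Deepgram Nova-2, from the STT table
                         llm_in_per_mtok=0.15,    # GPT-4o-mini input, from the LLM list
                         llm_out_per_mtok=0.60,   # ASSUMED output price, not in this guide
                         tts_per_kchar=0.030):    # OpenAI TTS-1-HD, from the TTS table
    """Return a per-stage cost breakdown in USD for one voice interaction."""
    costs = {
        "stt": speech_s / 60.0 * stt_per_min,
        "llm": (input_tokens * llm_in_per_mtok
                + output_tokens * llm_out_per_mtok) / 1_000_000,
        "tts": reply_chars / 1000.0 * tts_per_kchar,
    }
    costs["total"] = sum(costs.values())
    return costs


if __name__ == "__main__":
    breakdown = cost_per_interaction(speech_s=5, input_tokens=200,
                                     output_tokens=60, reply_chars=200)
    for stage, usd in breakdown.items():
        print(f"{stage}: ${usd:.6f}")
```

Under these assumptions TTS dominates the per-interaction cost, so the choice of voice provider moves the running cost far more than the choice of STT or LLM.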
5. Design enclosure and acoustic chamber
This is where most smart-speaker projects fail. A great electronics design with bad acoustics sounds like a $5 Bluetooth speaker. Here's the process:
5a. Acoustic simulation
Before cutting tooling, simulate the enclosure in COMSOL Multiphysics or Actran. You'll find:
- Standing waves: at frequencies where the wavelength matches 2x the longest enclosure dimension. For a 150mm speaker, that's around 1.1kHz. Add internal bracing to break up resonances.
- Port tuning (if ported): aim for 80Hz tuning, ±5Hz tolerance. Use a flared port to reduce chuffing.
- Driver placement: front-facing for music speakers, top-facing for voice assistants (better mic pickup from above the driver)
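The standing-wave figure above follows from f = c / (2L) for the fundamental axial mode, with higher modes at integer multiples. Here is a quick check you can run before opening a simulation tool, as a sketch that assumes a speed of sound of 343 m/s and considers a single dimension only.

```python
def axial_modes_hz(dimension_m, count=3, speed_of_sound_m_s=343.0):
    """First `count` axial standing-wave frequencies along one enclosure dimension."""
    if dimension_m <= 0:
        raise ValueError("dimension_m must be positive")
    fundamental = speed_of_sound_m_s / (2.0 * dimension_m)
    return [n * fundamental for n in range(1, count + 1)]


if __name__ == "__main__":
    for freq in axial_modes_hz(0.150):
        print(f"{freq:.0f} Hz")
```

For the 150mm example the fundamental lands near 1.1kHz, in line with the figure quoted above. Real enclosures also have modes along the other two dimensions, which is one reason full simulation is still worth doing.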
5b. Enclosure materials
| Material | Cost | Weight | Acoustic | Use |
|---|---|---|---|---|
| ABS plastic | Low | Light | Acceptable | Entry, mid |
| Polycarbonate | Medium | Light | Good | Mid, premium (drop-resistant) |
| Aluminum | High | Heavy | Excellent | Premium (also doubles as heatsink) |
| Wood (MDF) | Medium | Heavy | Good | Audiophile |
| Recycled fabric + polymer | Medium | Light | Variable | Sustainability-focused brands |
5c. Mesh fabric
The front grille mesh must be acoustically transparent. The common mistake is using a tight weave that absorbs 1-3kHz frequencies (where voice clarity lives). Use:
- Knit polyester with >60% open area
- Acoustically transparent metal mesh (3M™ acoustic fabric or similar)
- Perforated aluminum (1mm holes, 30% open area)
Always measure frequency response with the mesh in place. If you tune without the mesh, your product will sound "bright" in the lab and "muddy" in customers' homes.
6. Firmware architecture
The firmware is what makes or breaks a smart speaker. Architecture matters more than any specific SoC.
6a. RTOS or Linux?
| Approach | SoCs | Pros | Cons | Best for |
|---|---|---|---|---|
| RTOS (FreeRTOS, Zephyr) | ESP32, Allwinner V821, low-power chips | Low power, deterministic, simple | Limited compute, harder AI | Entry, battery, voice-only |
| Embedded Linux (Buildroot, Yocto) | Amlogic, Rockchip, multi-core chips | Full ecosystem, easier AI, familiar tooling | Higher power, more complex | Mid, premium, on-device LLM |
6b. Task partitioning (RTOS approach)
Even on a single-core SoC, partition your firmware into 4 tasks with priority:
- Audio task (priority: real-time, highest): mic array capture, AEC, beamforming. Must complete within 10ms or you get audio glitches.
- Wake word task (priority: real-time, second-highest): runs on dedicated DSP core if available, or on a low-power co-processor (Syntiant NDP). Must respond within 200ms.
- AI task (priority: medium): orchestrates STT → LLM → TTS, handles network calls. Latency acceptable up to 5 seconds.
- Network task (priority: low): handles WiFi provisioning, OTA updates, telemetry. Best-effort, not real-time.
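The partitioning above can be captured as a small table of tasks with priorities and deadlines. The sketch below models only the scheduling decision (run the highest-priority ready task) in Python for readability. On real firmware these would be RTOS tasks with priorities set at creation, and the numeric priority values here are illustrative.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Task:
    name: str
    priority: int               # higher number = more urgent
    deadline_ms: Optional[int]  # None = best effort


# Deadlines come from the list above; priority numbers are illustrative.
TASKS = [
    Task("audio", priority=4, deadline_ms=10),
    Task("wake_word", priority=3, deadline_ms=200),
    Task("ai", priority=2, deadline_ms=5000),
    Task("network", priority=1, deadline_ms=None),
]


def next_task(ready_names):
    """Pick the highest-priority task among those ready to run, or None if idle."""
    ready = [t for t in TASKS if t.name in ready_names]
    return max(ready, key=lambda t: t.priority, default=None)


if __name__ == "__main__":
    print(next_task({"network", "ai"}))            # the AI task wins over network
    print(next_task({"network", "ai", "audio"}))   # audio preempts everything
    print(next_task(set()))                        # nothing ready
```

Keeping the table in one place makes it easier to spot priority inversions, for example a network call made from inside the audio task.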
6c. OTA update strategy
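Every smart speaker needs secure OTA updates. We use A/B partitions with signed firmware images. See our Custom PCBA guide for the OTA implementation details. The non-obvious part: AI speaker OTA updates can reach 100-500MB (model + firmware), and A/B partitioning needs room for two full images, so size your flash or eMMC for at least twice the update image. An 8MB flash only suits firmware-only images on entry-tier MCUs. Whatever the size, make sure your update strategy handles partial failures.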
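The storage and rollback logic behind an A/B scheme can be sketched in a few lines. This is an illustrative model rather than our production bootloader: the function names, the `reserved_mb` allowance, and the three-attempt rollback threshold are all assumptions.

```python
def min_storage_mb(image_mb, reserved_mb=0):
    """Two full image slots plus reserved space (bootloader, config, logs)."""
    return 2 * image_mb + reserved_mb


def select_boot_slot(active, update_pending, signature_ok, failed_boots, max_failures=3):
    """Decide which slot to boot.

    The newly written slot is used only if an update is pending, its signature
    verified, and it has not already failed to boot too many times. Otherwise
    the device stays on (or rolls back to) the known-good active slot.
    """
    if active not in ("A", "B"):
        raise ValueError("active must be 'A' or 'B'")
    candidate = "B" if active == "A" else "A"
    if update_pending and signature_ok and failed_boots < max_failures:
        return candidate
    return active


if __name__ == "__main__":
    print(min_storage_mb(100), "MB minimum for a 100MB image")
    print(select_boot_slot("A", update_pending=True, signature_ok=True, failed_boots=0))
    print(select_boot_slot("A", update_pending=True, signature_ok=True, failed_boots=3))
    print(select_boot_slot("A", update_pending=True, signature_ok=False, failed_boots=0))
```

A download that is interrupted partway never reaches the signature check, so the device keeps booting the active slot. That is the partial-failure behavior you want from the scheme.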
7. Certifications and production
7a. Required certifications
| Cert | Region | Cost | Time | Notes |
|---|---|---|---|---|
| FCC Part 15 (Subpart B + C) | US | $3-8k | 4-8 weeks | EMI/EMC + RF |
| CE RED | EU | $3-6k | 4-8 weeks | Radio Equipment Directive |
| BQB (Bluetooth) | Global | $8k/year + per-product | 2-4 weeks | Required for any product with Bluetooth |
| WiFi Alliance (WFA) | Global | $5-15k | 2-4 weeks | WPA3 + WiFi 6 cert |
| UN38.3 | Global | $2-4k | 4-8 weeks | Required if product has lithium battery |
| MFi (Apple Find My) | Apple | Free (license fee) | 4-8 weeks | Only if you integrate Apple Find My |
| Dolby Audio | Global | $0.50-1.50/unit royalty | 2-4 weeks | Only for premium tier |
| IP rating (IPX4, IP67) | Global | $1-3k | 2-4 weeks | Outdoor or bath products |
Plan 8-12 weeks for full cert cycle, run in parallel with product tooling. We typically start cert at EVT (50 units) and finish at DVT (500 units).
7b. Manufacturing plan
| Stage | Volume | Timeline | Cost | What changes |
|---|---|---|---|---|
| EVT (Engineering Validation) | 50 units | Week 1-6 | $8-15k | 3D-printed enclosure, off-the-shelf drivers, manual assembly |
| DVT (Design Validation) | 500 units | Week 6-14 | $25-40k | SLA enclosures, final drivers, semi-automated assembly |
| PVT (Production Validation) | 5,000 units | Week 14-22 | $80-150k | Injection-molded enclosure, full automation, cert complete |
| Mass production | 50,000+ units | Ongoing | $1.5-3M | Steel tooling, multi-line production, retail distribution |
Cost breakdown (3 tiers, 1k unit production run)
Smart speaker BOM comparison
All costs in USD, FOB Thailand, 1,000 unit production run. Engineering, NRE, and certification are separate.
Entry Tier
- ESP32-S3 SoC
- 1× MEMS mic
- 5W mono amp
- 1× 2" full-range driver
- Single mic, no beamforming
- Cloud AI only
- Retail target: $40-60
Mid Tier
- Amlogic A113X SoC
- 2-mic array + beamforming
- 10W stereo amp
- 2× 2.5" full-range + tweeter
- WiFi ac + BT 5.0
- Cloud or on-device small model
- Retail target: $100-200
Premium Tier
- Rockchip RK3588 + 6 TOPS NPU
- 4-mic array + beamforming
- 30W+ per channel, subwoofer
- 3-way driver setup
- WiFi 6 + BT 5.4
- On-device LLM (Phi-3 or Llama 3)
- Retail target: $300-500
What we see go wrong (4 common mistakes)
1. Picking the wrong SoC for the use case
Choosing ESP32-S3 for a premium speaker that needs 30W+ per channel, multi-channel beamforming, and on-device LLM. It can't do any of those. Or choosing RK3588 for a battery-powered alarm clock — overkill, expensive, 7W continuous draw.
Fix: Use the SoC selection table above. Match requirements to silicon tier.
2. Bad acoustic engineering
Buying expensive drivers and putting them in a plastic box without simulation. The result: muddy, boomy, tinny sound. We see this in 80% of first-time speaker projects.
Fix: COMSOL or Actran simulation before tooling. Always measure with the mesh fabric in place.
3. Wake word on the wrong chip
Trying to run always-on wake word on a general-purpose CPU. Drains battery in 6 hours. Or running wake word on a low-power DSP that's too slow for the chosen keyword.
Fix: Use dedicated low-power DSP for wake word (Syntiant NDP120, $0.85/unit, runs 1mW always-on). Reserve main SoC for higher-level processing.
4. Latency budget ignored
Designing each AI stage independently without thinking about end-to-end latency. Wake word 200ms, STT 3s, LLM 5s, TTS 3s = 11.2s total. Users abandon the product.
Fix: Set end-to-end latency target (we recommend 3s from wake word to first audio) and budget each stage.
What we built (case studies)
We've shipped 4 smart-speaker projects at SkyTech, including:
1. Battery-powered portable AI speaker (entry tier)
- ESP32-S3 + Syntiant NDP120 wake word + Picovoice
- 2-mic array, beamforming via ESP-ADF
- 5W mono, 4-hour battery life
- Whisper + GPT-4o-mini + ElevenLabs Turbo
- BOM at 1k units: $24
- Volume: 8,000 units, shipped Q4 2025
2. Kitchen assistant with display (mid tier)
- Amlogic A113X + 4" touchscreen
- 4-mic array + beamforming, far-field pickup to 5m
- 10W stereo + passive radiator
- Linux + Qt + custom AI app
- BOM at 1k units: $58
- Volume: 5,000 units, shipping Q2 2026
3. Premium smart speaker (in development)
- Rockchip RK3588 + 6 TOPS NPU
- On-device LLM (Phi-3 3.8B) + cloud fallback
- 30W per channel + 50W subwoofer
- 3-way driver setup, premium audio
- Target retail: $499
- Target launch: Q4 2026