What is the best SoC for an AI smart speaker?

It depends on tier. Entry (under $50 BOM): ESP32-S3 with on-device wake word + cloud AI. Mid ($50-100): Amlogic A113X quad-core with integrated DSP. Premium ($100+): Amlogic A311D or Rockchip RK3588 for on-device LLM (Phi-3 3.8B at 5 tokens/sec). For battery-powered portable speakers, ESP32-S3 is the winner at 50mW active.

How does WiFi / Bluetooth coexistence work?

Both WiFi and Bluetooth operate in the 2.4 GHz ISM band. The standard approach is Time-Division Multiplexing (TDM): the SoC alternates between WiFi TX/RX and Bluetooth TX/RX in microsecond time slots. ESP32 uses a hardware arbiter; Amlogic uses a software arbitration layer. Without proper coexistence, audio streaming over Bluetooth causes audible dropouts when WiFi is active.

What's the typical latency from wake word to AI response?

On a well-optimized cloud-AI speaker: wake word detection (150ms on-device) + STT (1-2s Whisper) + LLM first-token (300-800ms for short responses) + TTS first-audio (200-500ms streaming). Total: roughly 1.7-3.5 seconds from wake word to first audio out. On-device LLM (Phi-3 3.8B): same wake word + STT, but LLM first-token is 1-2s and TTS 200-500ms, for a total of roughly 2.4-4.7s. The bottleneck is always STT + LLM, not wake word.

How much does it cost to make an AI smart speaker?

Entry tier (ESP32-S3, 5W mono, basic voice): $22-30 BOM at 1k units, retail $40-60. Mid tier (Amlogic A113X, stereo 2x10W, premium audio DSP): $50-65 BOM, retail $100-150. Premium tier (A311D, 2x30W with subwoofer, on-device LLM, Apple Find My): $130-180 BOM, retail $300-500. NRE per project (tooling, cert, software): $15-40k depending on tier.

What is the MOQ for manufacturing AI speakers?

At SkyTech: 500 units for production, 50-200 for evaluation runs. The acoustic enclosure tooling is the limiting factor (typically $5-15k for soft tooling, $25-50k for an injection mold). Speaker drivers and microphones have separate MOQs from their vendors (typically 1,000-5,000 units), so we help arrange component pre-buys to bridge the gap for small initial runs.

WiFi / Bluetooth AI Speaker Development Guide: From SoC to Production

What this guide covers

A WiFi / Bluetooth AI smart speaker is one of the most complex consumer electronics products to design and manufacture. It requires 4 parallel design tracks that must converge at a single SoC:

SoC + WiFi/BT radio — runs the entire system, including AI inference
Audio chain — microphones, DSP, amplifier, speaker driver, acoustic chamber
AI pipeline — wake word, STT, LLM, TTS, with strict latency budget
Enclosure — mechanical, thermal, and acoustic design in one box

Get any of these wrong, and the speaker either doesn't work, sounds terrible, or fails certification. Get them all right, and you have a product that competes with Sonos, Apple HomePod, and Amazon Echo at a fraction of the cost.

This guide walks through the 7-stage process we use on every smart-speaker project at SkyTech, with real BOMs and cost breakdowns for entry ($22-30), mid ($50-65), and premium ($130-180) tiers.

Who this guide is for: Hardware founders building a smart speaker, voice assistant, conference room device, or AI-enhanced audio product. If you're an audio engineer, jump to the Audio Chain section. If you're a firmware engineer, jump to the AI Pipeline section. If you're a product manager, read the Cost Breakdown section.

1. Define product requirements

Before you pick a single component, lock down the requirements. The single biggest mistake we see is founders building a "smart speaker" without specifying which tier they're targeting. Here's the decision tree:

1a. Product tier

Tier	Use case	Compute	Audio	Power	Target retail
Entry	Alarm clock, kitchen timer, basic voice	Single-chip MCU + cloud AI	5W mono, 1 driver	5V/2A USB-C or 4x AA	$40-60
Mid	Kitchen assistant, bedside clock, portable	Quad-core SoC + cloud or on-device small model	10-20W stereo, 2-4 drivers	12-19V DC or battery	$100-200
Premium	Living room speaker, conference room, smart display	Hex/octa-core SoC + on-device LLM (3-8B)	30-100W with subwoofer	Mains, 100-240V AC	$300-600

1b. Latency budget

User-perceived latency is the most important quality metric for AI speakers. Here's what a good user experience requires:

Stage	Cloud AI target	On-device LLM target	Why it matters
Wake word → on-device	< 200ms	< 200ms	User expects immediate response
STT (speech → text)	1-2s	1-2s	User starts listening for response
LLM first token	300-800ms	1-2s	User starts hearing response
TTS first audio	200-500ms	200-500ms	Audio starts playing
Total wake → first audio	1.7-3.5s	2.4-4.7s	Acceptable for natural conversation

Common mistake: Spending 2 years optimizing wake word from 200ms to 50ms. Users don't notice. Spend that time on STT + LLM latency, which is where the real perception gap lives.
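As a quick sanity check on the latency table, the end-to-end range is just the sum of each stage's low and high ends. A minimal Python sketch, assuming the wake-word stage sits at its 200ms ceiling (the dictionaries and the helpers `total_latency` and `biggest_stage` are illustrative names, not part of any SDK):

```python
# Per-stage latency ranges in seconds (low, high), taken from the latency table.
# Wake word is pinned at its 200 ms ceiling as a conservative assumption.
CLOUD = {
    "wake_word": (0.2, 0.2),
    "stt": (1.0, 2.0),
    "llm_first_token": (0.3, 0.8),
    "tts_first_audio": (0.2, 0.5),
}
# Same pipeline, but with the slower on-device LLM first-token range.
ON_DEVICE = {**CLOUD, "llm_first_token": (1.0, 2.0)}


def total_latency(stages):
    """Sum the low and high ends of every stage into one (low, high) range."""
    low = sum(lo for lo, _ in stages.values())
    high = sum(hi for _, hi in stages.values())
    return round(low, 2), round(high, 2)


def biggest_stage(stages):
    """Return the stage whose worst case contributes most to the total."""
    return max(stages, key=lambda name: stages[name][1])


if __name__ == "__main__":
    for label, stages in (("cloud", CLOUD), ("on-device", ON_DEVICE)):
        low, high = total_latency(stages)
        print(f"{label}: {low}-{high}s, largest stage: {biggest_stage(stages)}")
```

Swapping in your own measured ranges shows at a glance which stage deserves optimization time.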

1c. Wake word approach

Two main paths, with serious privacy + power tradeoffs:

Approach	Power	Latency	Privacy	Accuracy
On-device keyword spotting (Picovoice, Syntiant, custom TFLM)	1-10mW	<200ms	Audio never leaves device	95-98% in quiet, 85-90% in noise
Cloud-triggered activation (always-listening audio stream)	200-500mW continuous	500-1500ms (network)	Audio streamed to the cloud continuously	98%+ (cloud DSP)

Recommendation: on-device keyword spotting for battery-powered products. Privacy-first. Apple, Google, and Amazon all do this on their flagship devices for good reason.
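To make the power gap concrete, here is a rough standby estimate in Python. The 7.4 Wh battery (about 2,000 mAh at 3.7 V) is a hypothetical example, and the calculation ignores every other load on the board, so treat the results as upper bounds:

```python
def standby_hours(battery_wh, draw_mw):
    """Hours of listening time if wake-word detection were the only load."""
    return battery_wh * 1000.0 / draw_mw


if __name__ == "__main__":
    battery_wh = 7.4  # hypothetical pack: ~2,000 mAh at 3.7 V
    for label, draw_mw in (("on-device KWS, 10 mW", 10), ("cloud stream, 500 mW", 500)):
        print(f"{label}: {standby_hours(battery_wh, draw_mw):.1f} h")
```

With these assumed numbers the on-device path lasts weeks on standby while the always-streaming path lasts well under a day.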

2. Select SoC and wireless chipset

The SoC is the most consequential decision. It determines cost, audio quality, AI capability, and battery life. Here's how the major options stack up for 2026:

SoC	CPU	WiFi	BT	Audio I/O	AI	Cost (1k)	Best for
ESP32-S3	Dual-core 240MHz Xtensa LX7	802.11 b/g/n	BLE 5 + Mesh	I²S, PDM mic (no on-chip DAC)	TensorFlow Lite Micro	$2.50	Entry / battery
ESP32-P4	Dual-core 400MHz RISC-V + LP core	None on-chip (needs a companion radio such as ESP32-C6)	None on-chip (via companion radio)	I²S, PDM, MIPI-CSI	TFLM + vector ops	$4.20	Entry+ with camera
Amlogic A113X	Quad-core 1.2GHz Cortex-A53	802.11 ac + BT 5.0	BLE 5.0 + Audio	I²S, TDM, PDM, SPDIF	Linux + ALSA + DSP	$8.50	Mid tier Linux-based
Allwinner V821	Dual-core RISC-V + DSP	802.11 b/g/n + BT 5.0	BLE 5.0	Hardware audio codec	Low-power AI (~1 TOPS)	$6.80	Low-power portable
Rockchip RK2118	Dual-core ARM + 1 TOPS NPU	802.11 ax + BT 5.4	BLE 5.4 + Audio	8-ch PDM, hardware codec	1 TOPS NPU (KWS + small models)	$11.00	Mid+ with on-device KWS
Amlogic A311D	Hexa-core (4× Cortex-A73 + 2× Cortex-A53)	802.11 ax + BT 5.0	BLE 5.0	Multi-channel I²S, HDMI	5 TOPS NPU	$22.00	Premium on-device LLM
Rockchip RK3588	Octa-core 2.4GHz + 6 TOPS NPU	802.11 ax + BT 5.2	BLE 5.2	8K video + multi-channel audio	6 TOPS (Llama 3 8B at 5 tok/s)	$45.00	Flagship smart display

2a. WiFi / Bluetooth coexistence

Both radios share the 2.4 GHz band. Without proper coexistence, audio over Bluetooth drops out when WiFi is active. Two implementation paths:

Single antenna shared (most common)

Time-Division Multiplexing (TDM): SoC alternates between WiFi TX/RX and BT TX/RX in microsecond time slots. ESP32 uses a hardware arbiter; Amlogic uses a software layer.
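Antenna sharing: Use a single 2.4 GHz antenna behind an RF switch or shared front-end (a diplexer only separates different bands, so it cannot split WiFi and BT at 2.4 GHz). 20-30 dB isolation between the two radio paths is sufficient.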
Bluetooth audio codec: LC3 (LE Audio) is robust to WiFi interference; SBC is more vulnerable. Use LC3 if your SoC supports it.

Dual antenna (premium tier)

Separate WiFi and BT antennas, placed at least 50mm apart on the PCB
Higher cost ($1-2 per antenna) but eliminates coexistence issues
Use for: premium speakers, home theater, conference room devices

Rule of thumb: If your BOM has room, use dual antennas. If you're chasing a $30 retail price, use a single antenna with proper TDM.

3. Design the audio chain

The audio chain determines sound quality, voice pickup, and whether the speaker is delightful or awful. Don't skimp here.

3a. Microphone array

For voice pickup and beamforming:

2-mic array (entry): enough for basic far-field pickup. Mics spaced 40-60mm apart.
3-mic triangle (mid): better beamforming, supports 360° pickup. Mics on 3 corners of equilateral triangle, 50-80mm sides.
4-mic linear array (premium): best for directional pickup (e.g., conference room). Mics in a row, 25-40mm apart.
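Mic spacing trades low-frequency directivity against spatial aliasing. As a rule of thumb (not a hard limit), a uniformly spaced array beamforms cleanly up to the frequency whose half-wavelength equals the spacing, and a delay-and-sum beamformer steers by delaying each mic by the extra travel time of the wavefront. The sketch below assumes sound travels at 343 m/s, a far-field source, and a 16 kHz capture rate; the function names are illustrative:

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # dry air at about 20 °C (assumption)


def spatial_aliasing_hz(spacing_mm):
    """Highest frequency a uniform array resolves without spatial aliasing."""
    return SPEED_OF_SOUND_M_S / (2.0 * spacing_mm / 1000.0)


def steering_delay_samples(spacing_mm, angle_deg, sample_rate_hz=16_000):
    """Delay between adjacent mics for a far-field source.

    angle_deg is measured from broadside (0 = straight ahead, 90 = end-fire).
    """
    delay_s = (spacing_mm / 1000.0) * math.sin(math.radians(angle_deg)) / SPEED_OF_SOUND_M_S
    return delay_s * sample_rate_hz


if __name__ == "__main__":
    for spacing in (25, 40, 60, 80):
        print(f"{spacing} mm: aliasing above ~{spatial_aliasing_hz(spacing):.0f} Hz, "
              f"max delay {steering_delay_samples(spacing, 90):.2f} samples")
```

At these spacings the end-fire delay is only a few samples at 16 kHz, which is why practical beamformers typically use fractional-delay filters or work in the frequency domain.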

Recommended MEMS mic (2026):

Knowles SPH0645LM4H-B — best SNR (65 dB), $1.20/unit, used in Apple HomePod mini
Infineon IM73A135 — best for noise rejection, $1.50/unit, used in Amazon Echo
TDK ICS-43434 — best for far-field, $0.85/unit, used in Google Nest

3b. Audio DSP and CODEC

Most modern SoCs include audio DSP. For pure software beamforming and AEC (acoustic echo cancellation), use the SoC's DSP core or an external DSP like Knowles AISonic.

Audio CODEC requirements:

24-bit / 48kHz minimum (16-bit is too low for Hi-Res audio)
Integrated DAC + ADC (saves PCB space and BOM cost)
Low latency (<20ms for real-time processing)

For entry tier: SoC integrated DAC/ADC is sufficient.
For mid/premium: external DAC like ESS Sabre 9018 or AKM AK4493 for audiophile-grade output.

3c. Amplifier

Output	Class	Efficiency	Use case	Example chip
5W mono	Class D	85-90%	Entry	TI TPA2005
10W stereo	Class D	85-90%	Mid	TI TPA3116
30W+ per channel	Class D	88-93%	Premium	Infineon MERUS MA12070
Subwoofer (50W+)	Class D bridge-tied load	85%	Premium	TI TPA3255

Class D is the obvious choice for battery-powered speakers. Class AB only for audiophile HiFi products where THD matters more than efficiency.
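To see what the efficiency column means for the power and thermal budget, a small Python sketch (function names are illustrative) converts output power and efficiency into supply draw and heat:

```python
def amp_input_power_w(output_w, efficiency):
    """Power the amplifier pulls from the supply to deliver output_w."""
    return output_w / efficiency


def amp_heat_w(output_w, efficiency):
    """Power dissipated as heat inside the amplifier."""
    return amp_input_power_w(output_w, efficiency) - output_w


if __name__ == "__main__":
    # Output power and efficiency pairs picked from the ranges in the table.
    for output_w, eff in ((5, 0.85), (20, 0.90), (60, 0.90)):
        print(f"{output_w} W out at {eff:.0%}: "
              f"{amp_input_power_w(output_w, eff):.1f} W in, "
              f"{amp_heat_w(output_w, eff):.1f} W heat")
```

These are full-power figures; music has a high crest factor, so average draw in real use is usually much lower.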

3d. Speaker driver

Driver choice has more impact on perceived audio quality than any other component. Don't buy the cheapest driver you find on Alibaba.

Full-range driver (1.5-3 inches) for entry: covers 200Hz-15kHz, low cost, low power
Full-range + tweeter (2-way) for mid: better high-frequency extension, more natural sound
Full-range + tweeter + subwoofer (3-way) for premium: covers 50Hz-20kHz
Passive radiator for bass extension in small enclosures (avoids port noise)

Audio quality rule of thumb: A $4 driver in a properly designed enclosure will outperform a $15 driver in a poorly designed one. Invest in acoustic engineering, not just components.

4. Integrate the AI pipeline

The AI pipeline is where smart speakers differentiate. There are 4 stages: wake word → STT → LLM → TTS. Each has its own latency target and engineering tradeoffs.
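One way to keep each stage honest about its latency target is to wrap it in a timeout. The Python sketch below chains STT → LLM → TTS with `asyncio.wait_for`; the `fake_*` coroutines are hypothetical stand-ins for real service clients, and the budgets are the upper ends of the cloud targets from section 1b. It treats each stage as a single await, whereas a production pipeline would stream and measure time to first token or first audio:

```python
import asyncio
import time

# Per-stage timeouts in seconds: upper ends of the cloud targets in section 1b.
STAGE_BUDGET_S = {"stt": 2.0, "llm": 0.8, "tts": 0.5}


async def run_stage(name, coro):
    """Await one stage, raising a timeout error if it overruns its budget."""
    start = time.monotonic()
    result = await asyncio.wait_for(coro, timeout=STAGE_BUDGET_S[name])
    return result, time.monotonic() - start


# Hypothetical stand-ins for real STT / LLM / TTS clients.
async def fake_stt(audio):
    await asyncio.sleep(0.01)
    return "what time is it"


async def fake_llm(text):
    await asyncio.sleep(0.01)
    return f"You asked: {text}"


async def fake_tts(text):
    await asyncio.sleep(0.01)
    return text.encode("utf-8")  # placeholder for synthesized audio bytes


async def handle_query(audio):
    """Run STT -> LLM -> TTS once the wake word has fired."""
    timings = {}
    text, timings["stt"] = await run_stage("stt", fake_stt(audio))
    reply, timings["llm"] = await run_stage("llm", fake_llm(text))
    speech, timings["tts"] = await run_stage("tts", fake_tts(reply))
    return speech, timings


if __name__ == "__main__":
    speech, timings = asyncio.run(handle_query(b"\x00" * 320))
    print(speech, {name: round(t, 3) for name, t in timings.items()})
```

Logging the per-stage timings from the field is what lets you see which stage is actually eating the budget.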

4a. Wake word detection

On-device keyword spotting is the privacy-first default. Options:

Solution	Power	Latency	Custom wake	License
Picovoice Porcupine	5-10mW	100-150ms	Yes (free)	Commercial
Syntiant NDP120	0.5-1mW (always-on)	50-100ms	Yes (custom model)	Commercial
Google Edge TPU	10-30mW	100-200ms	Yes (TensorFlow Lite)	Open source
Custom TFLM model	5-20mW	100-200ms	Yes (own training)	Self-hosted

For most products, Picovoice Porcupine is the right answer (Cheetah is Picovoice's speech-to-text engine, not its wake word engine). It runs on ESP32-S3, has the best accuracy, and supports unlimited custom wake words. Cost: $0.10/unit royalty at 10k+ units.

4b. STT (speech-to-text)

Cloud STT is the only practical option in 2026 for product-grade accuracy. On-device STT is improving but not yet at Whisper quality.

Service	Latency	Accuracy (WER)	Cost	Streaming
OpenAI Whisper API	1-2s	3-5%	$0.006/min	No
Deepgram Nova-2	0.3-0.8s	2-4%	$0.0043/min	Yes
Google Cloud STT	0.5-1.5s	4-6%	$0.006/15s	Yes
AssemblyAI Universal	0.5-1.2s	3-5%	$0.0065/min	Yes

Recommendation: Deepgram for streaming + low latency + good price. If you need multilingual, Whisper is still the gold standard.

4c. LLM (the actual smarts)

This is where the magic happens. Two paths:

Cloud LLM (most products)

OpenAI GPT-4o-mini: $0.15/1M input tokens, fast, capable
Anthropic Claude Sonnet 4.5: $3/1M input, smarter, better instruction following
OpenAI GPT-5 (when available): new flagship

Latency: 300-800ms for first token, 5-15 tokens/sec streaming. For conversational AI, GPT-4o-mini is the sweet spot for cost/quality.

On-device LLM (premium tier, premium price)

Phi-3 3.8B Mini: 5-8 tok/s on RK3588, decent for short responses
Llama 3.1 8B: 3-5 tok/s, smarter but slower
Qwen 2.5 7B: best for multilingual

On-device LLM is a "privacy premium" feature. Worth it for medical, legal, or B2B use cases. Not for consumer mass-market yet.

4d. TTS (text-to-speech)

Cloud TTS is the standard for natural-sounding voices. On-device TTS exists but quality is much lower.

Service	Quality	Latency (first audio)	Cost	Streaming
ElevenLabs Turbo v2.5	★★★★★	200-400ms	$0.15/1k chars	Yes
OpenAI TTS-1-HD	★★★★☆	300-500ms	$0.030/1k chars	Yes
Google Cloud TTS	★★★☆☆	300-500ms	$0.016/1k chars	Yes
ElevenLabs (custom voice clone)	★★★★★	500-800ms	$0.30/1k chars	Yes

Recommendation: ElevenLabs for premium tier (voice is a differentiator). OpenAI TTS-HD for mid tier (good quality, lower cost). Google TTS for entry (cheap, decent quality).

Total AI cost per query (cloud): Wake word: $0 (on-device). STT: $0.005. LLM: $0.01-0.05 (depending on model + length). TTS: $0.01-0.03. Total: $0.025-0.085 per query. At 100 queries/day, that's $75-250/month per active user.
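The per-query and monthly figures are simple sums and products, which makes them easy to re-run for your own usage assumptions. A minimal Python sketch using the per-stage costs quoted above (the names `COST_PER_QUERY`, `per_query_cost`, and `monthly_cost` are illustrative, and a 30-day month is assumed):

```python
# Per-query cloud cost ranges in USD (low, high), from the paragraph above.
COST_PER_QUERY = {
    "wake_word": (0.0, 0.0),  # on-device, no cloud cost
    "stt": (0.005, 0.005),
    "llm": (0.01, 0.05),
    "tts": (0.01, 0.03),
}


def per_query_cost(costs=COST_PER_QUERY):
    """Sum the low and high ends of each stage's cost."""
    low = sum(lo for lo, _ in costs.values())
    high = sum(hi for _, hi in costs.values())
    return round(low, 4), round(high, 4)


def monthly_cost(queries_per_day, days=30, costs=COST_PER_QUERY):
    """Monthly cloud cost range for one active user."""
    low, high = per_query_cost(costs)
    return round(low * queries_per_day * days, 2), round(high * queries_per_day * days, 2)


if __name__ == "__main__":
    print("per query:", per_query_cost())
    for queries in (10, 30, 100):
        print(f"{queries} queries/day:", monthly_cost(queries))
```

At 100 queries/day this gives about $75-255 per month, which is where the roughly $75-250 figure comes from. Lower usage assumptions scale the range down linearly.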

5. Design enclosure and acoustic chamber

This is where most smart-speaker projects fail. A great electronics design with bad acoustics sounds like a $5 Bluetooth speaker. Here's the process:

5a. Acoustic simulation

Before cutting tooling, simulate the enclosure in COMSOL Multiphysics or Actran. You'll find:

Standing waves: at frequencies where the wavelength matches 2x the longest enclosure dimension. For a 150mm speaker, that's around 1.1kHz. Add internal bracing to break up resonances.
Port tuning (if ported): aim for 80Hz tuning, ±5Hz tolerance. Use a flared port to reduce chuffing.
Driver placement: front-facing for music speakers, top-facing for voice assistants (better mic pickup from above the driver)
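The standing-wave figure follows from the half-wavelength condition f = c / (2L). A short Python sketch, assuming sound travels at 343 m/s in air (the function name is illustrative), reproduces the 150mm example and lets you check the other enclosure dimensions before simulation:

```python
SPEED_OF_SOUND_M_S = 343.0  # dry air at about 20 °C (assumption)


def axial_mode_hz(dimension_mm, order=1):
    """Frequency of the n-th axial standing wave along one internal dimension."""
    return order * SPEED_OF_SOUND_M_S / (2.0 * dimension_mm / 1000.0)


if __name__ == "__main__":
    for dim_mm in (150, 100, 80):
        modes = ", ".join(f"{axial_mode_hz(dim_mm, n):.0f} Hz" for n in (1, 2, 3))
        print(f"{dim_mm} mm: {modes}")
```

This only covers axial modes along a single dimension; a full simulation is still needed for real enclosure shapes, damping, and bracing.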

5b. Enclosure materials

Material	Cost	Weight	Acoustic	Use
ABS plastic	Low	Light	Acceptable	Entry, mid
Polycarbonate	Medium	Light	Good	Mid, premium (drop-resistant)
Aluminum	High	Heavy	Excellent	Premium (also doubles as heatsink)
Wood (MDF)	Medium	Heavy	Good	Audiophile
Recycled fabric + polymer	Medium	Light	Variable	Sustainability-focused brands

5c. Mesh fabric

The front grille mesh must be acoustically transparent. The common mistake is using a tight weave that absorbs 1-3kHz frequencies (where voice clarity lives). Use:

Knit polyester with >60% open area
Acoustically transparent metal mesh (3M™ acoustic fabric or similar)
Perforated aluminum (1mm holes, 30% open area)

Always measure frequency response with the mesh in place. If you tune without the mesh, your product will sound "bright" in the lab and "muddy" in customers' homes.

Acoustic engineering reality check: 90% of smart-speaker audio quality issues are caused by bad enclosure design, not bad drivers. A great $4 driver in a properly designed enclosure will outperform a great $15 driver in a poorly designed one. Invest in acoustic simulation (COMSOL/Actran) before tooling, not after.

6. Firmware architecture

The firmware is what makes or breaks a smart speaker. Architecture matters more than any specific SoC.

6a. RTOS or Linux?

Approach	SoCs	Pros	Cons	Best for
RTOS (FreeRTOS, Zephyr)	ESP32, Allwinner V821, low-power chips	Low power, deterministic, simple	Limited compute, harder AI	Entry, battery, voice-only
Embedded Linux (Buildroot, Yocto)	Amlogic, Rockchip, multi-core chips	Full ecosystem, easier AI, familiar tooling	Higher power, more complex	Mid, premium, on-device LLM

6b. Task partitioning (RTOS approach)

Even on a single-core SoC, partition your firmware into 4 tasks with priority:

Audio task (priority: real-time, highest): mic array capture, AEC, beamforming. Must complete within 10ms or you get audio glitches.
Wake word task (priority: real-time, second-highest): runs on dedicated DSP core if available, or on a low-power co-processor (Syntiant NDP). Must respond within 200ms.
AI task (priority: medium): orchestrates STT → LLM → TTS, handles network calls. Latency acceptable up to 5 seconds.
Network task (priority: low): handles WiFi provisioning, OTA updates, telemetry. Best-effort, not real-time.

6c. OTA update strategy

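Every smart speaker needs secure OTA updates. We use A/B partitions with signed firmware images. See our Custom PCBA guide for the OTA implementation details. The non-obvious part: AI speaker OTA payloads can reach 100-500MB when the model ships with the firmware, and A/B updates need room for two full images. Treat 8MB of flash as the floor for firmware-only images on MCU-class designs, size storage well beyond that when models are bundled, and make sure your update strategy handles partial failures.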
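Because A/B updates keep two complete images on the device, storage has to be sized from the image, not the other way around. A rough Python sketch of that check follows; the bootloader and data-partition sizes are hypothetical defaults, not values from any particular SoC's partition scheme:

```python
def min_flash_mb(image_mb, bootloader_mb=0.5, data_mb=1.0):
    """Smallest storage that fits two full image slots plus fixed overhead.

    bootloader_mb and data_mb are placeholder assumptions; use your real
    partition table.
    """
    return 2 * image_mb + bootloader_mb + data_mb


def fits(flash_mb, image_mb):
    """True if an A/B layout for this image size fits in the given storage."""
    return min_flash_mb(image_mb) <= flash_mb


if __name__ == "__main__":
    for image_mb in (3, 100, 500):
        print(f"{image_mb} MB image needs >= {min_flash_mb(image_mb)} MB, "
              f"fits in 8 MB: {fits(8, image_mb)}")
```

Under these assumptions an 8MB part holds A/B slots for an image of about 3MB, so bundled models need eMMC or a larger NOR/NAND part.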

7. Certifications and production

7a. Required certifications

Cert	Region	Cost	Time	Notes
FCC Part 15 (Subpart B + C)	US	$3-8k	4-8 weeks	EMI/EMC + RF
CE RED	EU	$3-6k	4-8 weeks	Radio Equipment Directive
BQB (Bluetooth)	Global	$8k/year + per-product	2-4 weeks	Required for any product with Bluetooth
WiFi Alliance (WFA)	Global	$5-15k	2-4 weeks	WPA3 + WiFi 6 cert
UN38.3	Global	$2-4k	4-8 weeks	Required if product has lithium battery
MFi (Apple Find My)	Apple	Free (license fee)	4-8 weeks	Only if you integrate Apple Find My
Dolby Audio	Global	$0.50-1.50/unit royalty	2-4 weeks	Only for premium tier
IP rating (IPX4, IP67)	Global	$1-3k	2-4 weeks	Outdoor or bath products

Plan 8-12 weeks for full cert cycle, run in parallel with product tooling. We typically start cert at EVT (50 units) and finish at DVT (500 units).

7b. Manufacturing plan

Stage	Volume	Timeline	Cost	What changes
EVT (Engineering Validation)	50 units	Week 1-6	$8-15k	3D-printed enclosure, off-the-shelf drivers, manual assembly
DVT (Design Validation)	500 units	Week 6-14	$25-40k	SLA enclosures, final drivers, semi-automated assembly
PVT (Production Validation)	5,000 units	Week 14-22	$80-150k	Injection-molded enclosure, full automation, cert complete
Mass production	50,000+ units	Ongoing	$1.5-3M	Steel tooling, multi-line production, retail distribution

Cost breakdown (3 tiers, 1k unit production run)

Smart speaker BOM comparison

All costs in USD, FOB Thailand, 1,000 unit production run. Engineering, NRE, and certification are separate.

Entry Tier

BOM: $22-30

ESP32-S3 SoC
1× MEMS mic
5W mono amp
1× 2" full-range driver
Single mic, no beamforming
Cloud AI only

Retail target: $40-60

Mid Tier

BOM: $50-65

Amlogic A113X SoC
2-mic array + beamforming
10W stereo amp
2× 2.5" full-range + tweeter
WiFi ac + BT 5.0
Cloud or on-device small model

Retail target: $100-200

Premium Tier

BOM: $130-180

Rockchip RK3588 + 6 TOPS NPU
4-mic array + beamforming
30W+ per channel, subwoofer
3-way driver setup
WiFi 6 + BT 5.4
On-device LLM (Phi-3 or Llama 3)

Retail target: $300-500

Cost-per-feature math: A $30 BOM speaker with good acoustic engineering and a $5/mo cloud AI subscription beats a $60 BOM speaker with bad acoustics and free AI. BOM cost is a smaller lever than people think. Audio quality and AI integration are the differentiators.
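The BOM figures exclude NRE, which is a one-time project cost that has to be spread over the units you actually ship. A minimal Python sketch of that amortization, assuming for simplicity that the BOM stays at its 1k-unit price at higher volume (in practice it usually drops); the function name is illustrative:

```python
def landed_unit_cost(bom_usd, nre_usd, units):
    """Per-unit cost once one-time NRE is amortized over the production volume."""
    return bom_usd + nre_usd / units


if __name__ == "__main__":
    bom, nre = 24.0, 15_000.0  # entry-tier BOM and the low end of the NRE range
    for units in (1_000, 8_000, 50_000):
        print(f"{units} units: ${landed_unit_cost(bom, nre, units):.2f}/unit")
```

At 1,000 units the amortized NRE is comparable to the entire entry-tier BOM, which is why small first runs rarely make money on hardware margin alone.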

What we see go wrong (4 common mistakes)

1. Picking the wrong SoC for the use case

Choosing ESP32-S3 for a premium speaker that needs 30W+ per channel, multi-channel beamforming, and on-device LLM. It can't do any of those. Or choosing RK3588 for a battery-powered alarm clock — overkill, expensive, 7W continuous draw.

Fix: Use the SoC selection table above. Match requirements to silicon tier.

2. Bad acoustic engineering

Buying expensive drivers and putting them in a plastic box without simulation. The result: muddy, boomy, tinny sound. We see this in 80% of first-time speaker projects.

Fix: COMSOL or Actran simulation before tooling. Always measure with the mesh fabric in place.

3. Wake word on the wrong chip

Trying to run always-on wake word on a general-purpose CPU. Drains battery in 6 hours. Or running wake word on a low-power DSP that's too slow for the chosen keyword.

Fix: Use dedicated low-power DSP for wake word (Syntiant NDP120, $0.85/unit, runs 1mW always-on). Reserve main SoC for higher-level processing.

4. Latency budget ignored

Designing each AI stage independently without thinking about end-to-end latency. Wake word 200ms, STT 3s, LLM 5s, TTS 3s = 11s total. Users abandon the product.

Fix: Set end-to-end latency target (we recommend 3s from wake word to first audio) and budget each stage.

What we built (case studies)

We've shipped 4 smart-speaker projects at SkyTech, including:

1. Battery-powered portable AI speaker (entry tier)

ESP32-S3 + Syntiant NDP120 wake word + Picovoice
2-mic array, beamforming via ESP-ADF
5W mono, 4-hour battery life
Whisper + GPT-4o-mini + ElevenLabs Turbo
BOM at 1k units: $24
Volume: 8,000 units, shipped Q4 2025

2. Kitchen assistant with display (mid tier)

Amlogic A113X + 4" touchscreen
4-mic array + beamforming, far-field pickup to 5m
10W stereo + passive radiator
Linux + Qt + custom AI app
BOM at 1k units: $58
Volume: 5,000 units, shipping Q2 2026

3. Premium smart speaker (in development)

Rockchip RK3588 + 6 TOPS NPU
On-device LLM (Phi-3 3.8B) + cloud fallback
30W per channel + 50W subwoofer
3-way driver setup, premium audio
Target retail: $499
Target launch: Q4 2026

Want us to review your AI speaker design? Send us your schematic, BOM, and acoustic simulation. We'll do a free 30-min feasibility review and quote within 48 hours. [email protected]
