What this guide covers

A WiFi / Bluetooth AI smart speaker is one of the most complex consumer electronics products to design and manufacture. It requires 4 parallel design tracks that must converge at a single SoC:

  1. SoC + WiFi/BT radio — runs the entire system, including AI inference
  2. Audio chain — microphones, DSP, amplifier, speaker driver, acoustic chamber
  3. AI pipeline — wake word, STT, LLM, TTS, with strict latency budget
  4. Enclosure — mechanical, thermal, and acoustic design in one box

Get any of these wrong, and the speaker either doesn't work, sounds terrible, or fails certification. Get them all right, and you have a product that competes with Sonos, Apple HomePod, and Amazon Echo at a fraction of the cost.

This guide walks through the 7-stage process we use on every smart-speaker project at SkyTech, with real BOMs and cost breakdowns for entry ($22-30), mid ($50-65), and premium ($130-180) tiers.

Who this guide is for: Hardware founders building a smart speaker, voice assistant, conference room device, or AI-enhanced audio product. If you're an audio engineer, jump to the Audio Chain section. If you're a firmware engineer, jump to the AI Pipeline section. If you're a product manager, read the Cost Breakdown section.

1. Define product requirements

Before you pick a single component, lock down the requirements. The single biggest mistake we see is founders building a "smart speaker" without specifying which tier they're targeting. Make these three decisions first:

1a. Product tier

| Tier | Use case | Compute | Audio | Power | Target retail |
|---|---|---|---|---|---|
| Entry | Alarm clock, kitchen timer, basic voice | Single-chip MCU + cloud AI | 5W mono, 1 driver | 5V/2A USB-C or 4x AA | $40-60 |
| Mid | Kitchen assistant, bedside clock, portable | Quad-core SoC + cloud or on-device small model | 10-20W stereo, 2-4 drivers | 12-19V DC or battery | $100-200 |
| Premium | Living room speaker, conference room, smart display | Hex/octa-core SoC + on-device LLM (3-8B) | 30-100W with subwoofer | Mains, 100-240V AC | $300-600 |

1b. Latency budget

User-perceived latency is the most important quality metric for AI speakers. Here's what a good user experience requires:

| Stage | Cloud AI target | On-device LLM target | Why it matters |
|---|---|---|---|
| Wake word → on-device | < 200ms | < 200ms | User expects immediate response |
| STT (speech → text) | 1-2s | 1-2s | User starts listening for response |
| LLM first token | 300-800ms | 1-2s | User starts hearing response |
| TTS first audio | 200-500ms | 200-500ms | Audio starts playing |
| Total wake → first audio | ~1.7-3.5s | ~2.4-4.7s | Acceptable for natural conversation |

Common mistake: Spending 2 years optimizing wake word from 200ms to 50ms. Users don't notice. Spend that time on STT + LLM latency, which is where the real perception gap lives.
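
Summing the per-stage ranges is a quick sanity check on the totals row. The sketch below is only illustrative: the stage names are made up for the example, and because the wake word is listed as "< 200ms" it is assumed to sit at its 200ms ceiling in both the best and worst case.

```python
# Stage latency ranges in seconds, taken from the latency budget table.
# Assumption: wake word is pinned at its 200 ms ceiling for both bounds.
CLOUD = {
    "wake_word": (0.2, 0.2),
    "stt": (1.0, 2.0),
    "llm_first_token": (0.3, 0.8),
    "tts_first_audio": (0.2, 0.5),
}
# On-device differs only in the LLM first-token range.
ON_DEVICE = {**CLOUD, "llm_first_token": (1.0, 2.0)}


def total_latency(stages):
    """Return (best_case, worst_case) seconds from wake word to first audio."""
    best = sum(low for low, _ in stages.values())
    worst = sum(high for _, high in stages.values())
    return round(best, 3), round(worst, 3)


if __name__ == "__main__":
    print("cloud:", total_latency(CLOUD))
    print("on-device:", total_latency(ON_DEVICE))
```

Under these assumptions the sums come out to roughly 1.7-3.5s for cloud and 2.4-4.7s for on-device, which makes it obvious that STT and the LLM dominate the budget.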

1c. Wake word approach

Two main paths, with serious privacy + power tradeoffs:

| Approach | Power | Latency | Privacy | Accuracy |
|---|---|---|---|---|
| On-device keyword spotting (Picovoice, Syntiant, custom TFLM) | 1-10mW | <200ms | Audio never leaves device | 95-98% in quiet, 85-90% in noise |
| Cloud-triggered activation (always-listening audio stream) | 200-500mW continuous | 500-1500ms (network) | Audio uploaded on every wake | 98%+ (cloud DSP) |

Recommendation: on-device keyword spotting for battery-powered products. Privacy-first. Apple, Google, and Amazon all do this on their flagship devices for good reason.
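
To see why the power column matters for battery products, here is a back-of-the-envelope standby estimate. It is a sketch under loud assumptions: the 2000mAh single-cell battery is hypothetical, and the calculation counts only the always-listening draw, ignoring the radio, SoC idle current, and regulator losses.

```python
def standby_hours(battery_mwh, listen_power_mw):
    """Hours of always-listening standby if nothing else drew power."""
    return battery_mwh / listen_power_mw


# Hypothetical battery: 2000 mAh single-cell Li-ion at 3.7 V nominal.
BATTERY_MWH = 3.7 * 2000

# Power ranges (mW) from the wake word approach table.
APPROACHES = {
    "on-device keyword spotting": (1, 10),
    "cloud-triggered activation": (200, 500),
}

if __name__ == "__main__":
    for name, (low_mw, high_mw) in APPROACHES.items():
        best = standby_hours(BATTERY_MWH, low_mw)
        worst = standby_hours(BATTERY_MWH, high_mw)
        print(f"{name}: {worst:.0f} to {best:.0f} hours")
```

Even with these generous simplifications, the cloud-triggered path is limited to tens of hours while on-device spotting is limited by everything else in the system.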

2. Select SoC and wireless chipset

The SoC is the most consequential decision. It determines cost, audio quality, AI capability, and battery life. Here's how the major options stack up for 2026:

| SoC | CPU | WiFi | BT | Audio I/O | AI | Cost (1k) | Best for |
|---|---|---|---|---|---|---|---|
| ESP32-S3 | Dual-core 240MHz Xtensa LX7 | 802.11 b/g/n | BLE 5 + Mesh | I²S, PDM mic (no on-chip DAC) | TensorFlow Lite Micro | $2.50 | Entry / battery |
| ESP32-P4 | Dual-core 400MHz RISC-V + LP core | None on-chip (needs companion radio) | None on-chip (needs companion radio) | I²S, PDM, MIPI-CSI | TFLM + vector ops | $4.20 | Entry+ with camera |
| Amlogic A113X | Quad-core 1.2GHz Cortex-A53 | 802.11 ac + BT 5.0 | BLE 5.0 + Audio | I²S, TDM, PDM, SPDIF | Linux + ALSA + DSP | $8.50 | Mid tier Linux-based |
| Allwinner V821 | Dual-core RISC-V + DSP | 802.11 b/g/n + BT 5.0 | BLE 5.0 | Hardware audio codec | Low-power AI (~1 TOPS) | $6.80 | Low-power portable |
| Rockchip RK2118 | Dual-core ARM + 1 TOPS NPU | 802.11 ax + BT 5.4 | BLE 5.4 + Audio | 8-ch PDM, hardware codec | 1 TOPS NPU (KWS + small models) | $11.00 | Mid+ with on-device KWS |
| Amlogic A311D | Quad-core A73 + dual-core A53 | 802.11 ax + BT 5.0 | BLE 5.0 + WiFi 6 | Multi-channel I²S, HDMI | 5 TOPS NPU | $22.00 | Premium on-device LLM |
| Rockchip RK3588 | Octa-core 2.4GHz + 6 TOPS NPU | 802.11 ax + BT 5.2 | BLE 5.2 | 8K video + multi-channel audio | 6 TOPS (Llama 3 8B at 5 tok/s) | $45.00 | Flagship smart display |

2a. WiFi / Bluetooth coexistence

Both radios share the 2.4 GHz band. Without proper coexistence, audio over Bluetooth drops out when WiFi is active. Two implementation paths:

Single antenna shared (most common)

Dual antenna (premium tier)

Rule of thumb: If your BOM has room, use dual antennas. If you're chasing a $30 retail price, use a single antenna with proper time-division multiplexing (TDM) between the two radios.

3. Design the audio chain

The audio chain determines sound quality, voice pickup, and whether the speaker is delightful or awful. Don't skimp here.

3a. Microphone array

For voice pickup and beamforming:

Recommended MEMS mic (2026):

3b. Audio DSP and CODEC

Most modern SoCs include audio DSP. For pure software beamforming and AEC (acoustic echo cancellation), use the SoC's DSP core or an external DSP like Knowles AISonic.
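
For readers who have not met beamforming before, the simplest form is delay-and-sum: shift one microphone's signal so that sound from the target direction lines up, then average. The sketch below is a minimal two-mic illustration, not what a production DSP runs. The function names and the 6cm spacing used in the usage note are hypothetical, and it uses whole-sample delays where real implementations use fractional delays and combine beamforming with AEC.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # approximate value at room temperature


def steering_delay_samples(mic_spacing_m, angle_deg, sample_rate_hz):
    """Whole-sample arrival delay between two mics for a source at
    angle_deg away from broadside (0 degrees = straight ahead)."""
    delay_s = mic_spacing_m * math.sin(math.radians(angle_deg)) / SPEED_OF_SOUND_M_S
    return round(delay_s * sample_rate_hz)


def delay_and_sum(mic_a, mic_b, delay_samples):
    """Align mic_b to mic_a and average the two channels.

    A positive delay_samples means the sound reaches mic_b that many
    samples later than mic_a. Samples that would be read from outside
    the buffer are treated as zero.
    """
    out = []
    for n in range(len(mic_a)):
        m = n + delay_samples
        b = mic_b[m] if 0 <= m < len(mic_b) else 0.0
        out.append(0.5 * (mic_a[n] + b))
    return out
```

As a usage example, `steering_delay_samples(0.06, 90, 16000)` gives the delay for a source directly to one side of a hypothetical 6cm two-mic array sampled at 16kHz. Signals from the steered direction add coherently, while sound from other directions is partially cancelled.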

Audio CODEC requirements:

For entry tier: SoC integrated DAC/ADC is sufficient.
For mid/premium: external DAC like ESS Sabre 9018 or AKM AK4493 for audiophile-grade output.

3c. Amplifier

| Output | Class | Efficiency | Use case | Example chip |
|---|---|---|---|---|
| 5W mono | Class D | 85-90% | Entry | TI TPA2005 |
| 10W stereo | Class D | 85-90% | Mid | TI TPA3116 |
| 30W+ per channel | Class D | 88-93% | Premium | Infineon MERUS MA12070 |
| Subwoofer (50W+) | Class D bridge-tied load | 85% | Premium | TI TPA3255 |

Class D is the obvious choice for battery-powered speakers. Reserve Class AB for audiophile HiFi products where THD matters more than efficiency.
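
The efficiency figures translate directly into supply power and waste heat, which is what the power supply and enclosure have to absorb. A rough sketch using the 5W entry-tier output and the 85-90% range from the amplifier table; the function names are illustrative, and the calculation assumes the amplifier runs at full rated output.

```python
def supply_power_w(output_w, efficiency):
    """Electrical power the amplifier draws to deliver output_w."""
    return output_w / efficiency


def waste_heat_w(output_w, efficiency):
    """Power dissipated as heat inside the enclosure."""
    return supply_power_w(output_w, efficiency) - output_w


if __name__ == "__main__":
    for efficiency in (0.85, 0.90):
        draw = supply_power_w(5.0, efficiency)
        heat = waste_heat_w(5.0, efficiency)
        print(f"{efficiency:.0%}: draws {draw:.2f} W, dissipates {heat:.2f} W")
```

At full output the 5W amplifier draws a little under 6W, which leaves headroom for the SoC within a 5V/2A (10W) USB-C supply.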

3d. Speaker driver

Driver choice has more impact on perceived audio quality than any other component. Don't buy the cheapest driver you find on Alibaba.

Audio quality rule of thumb: A $4 driver in a properly designed enclosure will outperform a $15 driver in a poorly designed one. Invest in acoustic engineering, not just components.

4. Integrate the AI pipeline

The AI pipeline is where smart speakers differentiate. There are 4 stages: wake word → STT → LLM → TTS. Each has its own latency target and engineering tradeoffs.

4a. Wake word detection

On-device keyword spotting is the privacy-first default. Options:

| Solution | Power | Latency | Custom wake | License |
|---|---|---|---|---|
| Picovoice Porcupine | 5-10mW | 100-150ms | Yes (free) | Commercial |
| Syntiant NDP120 | 0.5-1mW (always-on) | 50-100ms | Yes (custom model) | Commercial |
| Google Edge TPU | 10-30mW | 100-200ms | Yes (TensorFlow Lite) | Open source |
| Custom TFLM model | 5-20mW | 100-200ms | Yes (own training) | Self-hosted |

For most products, Picovoice Porcupine is the right answer (Porcupine is Picovoice's wake word engine; Cheetah is its streaming speech-to-text product). It runs on ESP32-S3, has the best accuracy, and supports unlimited custom wake words. Cost: $0.10/unit royalty at 10k+ units.

4b. STT (speech-to-text)

Cloud STT is the only practical option in 2026 for product-grade accuracy. On-device STT is improving but not yet at Whisper quality.

| Service | Latency | Accuracy (WER) | Cost | Streaming |
|---|---|---|---|---|
| OpenAI Whisper API | 1-2s | 3-5% | $0.006/min | No |
| Deepgram Nova-2 | 0.3-0.8s | 2-4% | $0.0043/min | Yes |
| Google Cloud STT | 0.5-1.5s | 4-6% | $0.006/15s | Yes |
| AssemblyAI Universal | 0.5-1.2s | 3-5% | $0.0065/min | Yes |

Recommendation: Deepgram for streaming + low latency + good price. If you need multilingual, Whisper is still the gold standard.

4c. LLM (the actual smart)

This is where the magic happens. Two paths:

Cloud LLM (most products)

Latency: 300-800ms for first token, 5-15 tokens/sec streaming. For conversational AI, GPT-4o-mini is the sweet spot for cost/quality.

On-device LLM (premium tier, premium price)

On-device LLM is a "privacy premium" feature. Worth it for medical, legal, or B2B use cases. Not for consumer mass-market yet.

4d. TTS (text-to-speech)

Cloud TTS is the standard for natural-sounding voices. On-device TTS exists but quality is much lower.

| Service | Quality | Latency (first audio) | Cost | Streaming |
|---|---|---|---|---|
| ElevenLabs Turbo v2.5 | ★★★★★ | 200-400ms | $0.15/1k chars | Yes |
| OpenAI TTS-1-HD | ★★★★☆ | 300-500ms | $0.030/1k chars | Yes |
| Google Cloud TTS | ★★★☆☆ | 300-500ms | $0.016/1k chars | Yes |
| ElevenLabs (custom voice clone) | ★★★★★ | 500-800ms | $0.30/1k chars | Yes |

Recommendation: ElevenLabs for premium tier (voice is a differentiator). OpenAI TTS-HD for mid tier (good quality, lower cost). Google TTS for entry (cheap, decent quality).

Total AI cost per query (cloud): Wake word: $0 (on-device). STT: $0.005. LLM: $0.01-0.05 (depending on model + length). TTS: $0.01-0.03. Total: $0.025-0.085 per query. At 100 queries/day over a 30-day month, that's $75-255/month per active user.
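
The per-query arithmetic is worth keeping in a small script so you can rerun it when a provider changes pricing or your usage assumptions shift. This sketch only uses the per-stage figures quoted above; the dictionary keys and function names are illustrative, and the 30-day month is an assumption.

```python
# Per-query cost ranges in USD (low, high), from the figures quoted above.
STAGE_COST_USD = {
    "wake_word": (0.0, 0.0),   # on-device
    "stt": (0.005, 0.005),
    "llm": (0.01, 0.05),       # depends on model and response length
    "tts": (0.01, 0.03),
}


def cost_per_query(stage_costs=STAGE_COST_USD):
    """Return (low, high) USD cost of one full voice query."""
    low = sum(c[0] for c in stage_costs.values())
    high = sum(c[1] for c in stage_costs.values())
    return round(low, 4), round(high, 4)


def monthly_cost(queries_per_day, days=30):
    """Return (low, high) USD per active user per month."""
    low, high = cost_per_query()
    return round(low * queries_per_day * days, 2), round(high * queries_per_day * days, 2)


if __name__ == "__main__":
    print("per query:", cost_per_query())
    print("per month at 100 queries/day:", monthly_cost(100))
```

Running the same function at 10 queries/day is a useful reminder that the heavy-user case, not the average user, drives the subscription price.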

5. Design enclosure and acoustic chamber

This is where most smart-speaker projects fail. A great electronics design with bad acoustics sounds like a $5 Bluetooth speaker. Here's the process:

5a. Acoustic simulation

Before cutting tooling, simulate the enclosure in COMSOL Multiphysics or Actran. You'll find:

5b. Enclosure materials

| Material | Cost | Weight | Acoustic | Use |
|---|---|---|---|---|
| ABS plastic | Low | Light | Acceptable | Entry, mid |
| Polycarbonate | Medium | Light | Good | Mid, premium (drop-resistant) |
| Aluminum | High | Heavy | Excellent | Premium (also doubles as heatsink) |
| Wood (MDF) | Medium | Heavy | Good | Audiophile |
| Recycled fabric + polymer | Medium | Light | Variable | Sustainability-focused brands |

5c. Mesh fabric

The front grille mesh must be acoustically transparent. The common mistake is using a tight weave that absorbs 1-3kHz frequencies (where voice clarity lives). Use:

Always measure frequency response with the mesh in place. If you tune without the mesh, your product will sound "bright" in the lab and "muddy" in customers' homes.

Acoustic engineering reality check: 90% of smart-speaker audio quality issues are caused by bad enclosure design, not bad drivers. Invest in acoustic simulation (COMSOL/Actran) before tooling, not after.

6. Firmware architecture

The firmware is what makes or breaks a smart speaker. Architecture matters more than any specific SoC.

6a. RTOS or Linux?

| Approach | SoCs | Pros | Cons | Best for |
|---|---|---|---|---|
| RTOS (FreeRTOS, Zephyr) | ESP32, Allwinner V821, low-power chips | Low power, deterministic, simple | Limited compute, harder AI | Entry, battery, voice-only |
| Embedded Linux (Buildroot, Yocto) | Amlogic, Rockchip, multi-core chips | Full ecosystem, easier AI, familiar tooling | Higher power, more complex | Mid, premium, on-device LLM |

6b. Task partitioning (RTOS approach)

Even on a single-core SoC, partition your firmware into 4 tasks with priority:

  1. Audio task (priority: real-time, highest): mic array capture, AEC, beamforming. Must complete within 10ms or you get audio glitches.
  2. Wake word task (priority: real-time, second-highest): runs on dedicated DSP core if available, or on a low-power co-processor (Syntiant NDP). Must respond within 200ms.
  3. AI task (priority: medium): orchestrates STT → LLM → TTS, handles network calls. Latency acceptable up to 5 seconds.
  4. Network task (priority: low): handles WiFi provisioning, OTA updates, telemetry. Best-effort, not real-time.
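
The four-task split above boils down to a fixed-priority table plus one rule: whenever more than one task is ready, the highest priority runs. The model below is a host-side sketch for reasoning about the design, not RTOS code. The numeric priority values are hypothetical (higher means more urgent), and the deadlines are the ones stated in the list.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Task:
    name: str
    priority: int               # hypothetical values; higher = more urgent
    deadline_ms: Optional[int]  # None means best-effort


TASKS = [
    Task("audio", priority=4, deadline_ms=10),
    Task("wake_word", priority=3, deadline_ms=200),
    Task("ai", priority=2, deadline_ms=5000),
    Task("network", priority=1, deadline_ms=None),
]


def next_task(ready_names):
    """Return the highest-priority task among those ready to run, or None."""
    ready = [task for task in TASKS if task.name in ready_names]
    return max(ready, key=lambda task: task.priority) if ready else None


if __name__ == "__main__":
    # An OTA download is in progress when a mic buffer arrives: audio wins.
    print(next_task({"network", "audio"}).name)
```

The point of writing it down this way is that the network task can never delay the audio task, no matter how long an OTA download or telemetry upload takes.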

6c. OTA update strategy

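Every smart speaker needs secure OTA updates. We use A/B partitions with signed firmware images. See our Custom PCBA guide for the OTA implementation details. The non-obvious part: AI speaker OTA payloads can reach 100-500MB (model + firmware), and an A/B scheme needs room for two full images, so size flash to the payload rather than to the firmware alone. Treat 8MB as a floor for firmware-only MCU builds, not for anything that ships a model. Your update strategy must also handle partial failures.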
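
One common way to handle partial failures is to download in chunks, verify each chunk against a digest list, and resume from the last good chunk instead of restarting. The sketch below illustrates that idea only and is not our production implementation. The chunk size, function names, and the list standing in for the inactive flash slot are all hypothetical, and it assumes the digest list arrives inside a manifest whose signature has already been verified elsewhere.

```python
import hashlib

CHUNK_SIZE = 4096  # hypothetical chunk size in bytes


def build_manifest(image, chunk_size=CHUNK_SIZE):
    """SHA-256 digest per chunk. Assumed to ship inside a signed manifest."""
    return [
        hashlib.sha256(image[i:i + chunk_size]).hexdigest()
        for i in range(0, len(image), chunk_size)
    ]


def apply_update(fetch_chunk, manifest, inactive_slot):
    """Write verified chunks into the inactive slot, resuming where it left off.

    fetch_chunk(index) returns the bytes of one chunk and may raise
    ConnectionError. Returns True only when every chunk is written and
    verified; the caller switches boot slots only in that case.
    """
    for index in range(len(inactive_slot), len(manifest)):
        try:
            chunk = fetch_chunk(index)
        except ConnectionError:
            return False  # keep what is already verified; retry later
        if hashlib.sha256(chunk).hexdigest() != manifest[index]:
            return False  # corrupted chunk: do not write it
        inactive_slot.append(chunk)
    return len(inactive_slot) == len(manifest)
```

Because the active slot is never touched until `apply_update` returns True, a dropped connection or power loss mid-download leaves the device booting the old image.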

7. Certifications and production

7a. Required certifications

| Cert | Region | Cost | Time | Notes |
|---|---|---|---|---|
| FCC Part 15 (Subpart B + C) | US | $3-8k | 4-8 weeks | EMI/EMC + RF |
| CE RED | EU | $3-6k | 4-8 weeks | Radio Equipment Directive |
| BQB (Bluetooth) | Global | $8k/year + per-product | 2-4 weeks | Required for any product with Bluetooth |
| WiFi Alliance (WFA) | Global | $5-15k | 2-4 weeks | WPA3 + WiFi 6 cert |
| UN38.3 | Global | $2-4k | 4-8 weeks | Required if product has lithium battery |
| MFi (Apple Find My) | Apple | Free (license fee) | 4-8 weeks | Only if you integrate Apple Find My |
| Dolby Audio | Global | $0.50-1.50/unit royalty | 2-4 weeks | Only for premium tier |
| IP rating (IPX4, IP67) | Global | $1-3k | 2-4 weeks | Outdoor or bath products |

Plan 8-12 weeks for full cert cycle, run in parallel with product tooling. We typically start cert at EVT (50 units) and finish at DVT (500 units).

7b. Manufacturing plan

| Stage | Volume | Timeline | Cost | What changes |
|---|---|---|---|---|
| EVT (Engineering Validation) | 50 units | Week 1-6 | $8-15k | 3D-printed enclosure, off-the-shelf drivers, manual assembly |
| DVT (Design Validation) | 500 units | Week 6-14 | $25-40k | SLA enclosures, final drivers, semi-automated assembly |
| PVT (Production Validation) | 5,000 units | Week 14-22 | $80-150k | Injection-molded enclosure, full automation, cert complete |
| Mass production | 50,000+ units | Ongoing | $1.5-3M | Steel tooling, multi-line production, retail distribution |

Cost breakdown (3 tiers, 1k unit production run)

Smart speaker BOM comparison

All costs in USD, FOB Thailand, 1,000 unit production run. Engineering, NRE, and certification are separate.

Entry Tier

$22-30

ESP32-S3 SoC
1× MEMS mic
5W mono amp
1× 2" full-range driver
Single mic, no beamforming
Cloud AI only

Retail target: $40-60

Mid Tier

$50-65

Amlogic A113X SoC
2-mic array + beamforming
10W stereo amp
2× 2.5" full-range + tweeter
WiFi ac + BT 5.0
Cloud or on-device small model

Retail target: $100-200

Premium Tier

$130-180

Rockchip RK3588 + 6 TOPS NPU
4-mic array + beamforming
30W+ per channel, subwoofer
3-way driver setup
WiFi 6 + BT 5.4
On-device LLM (Phi-3 or Llama 3)

Retail target: $300-500

Cost-per-feature math: A $30 BOM speaker with good acoustic engineering and a $5/mo cloud AI subscription beats a $60 BOM speaker with bad acoustics and free AI. BOM cost is a smaller lever than people think. Audio quality and AI integration are the differentiators.

What we see go wrong (4 common mistakes)

1. Picking the wrong SoC for the use case

Choosing ESP32-S3 for a premium speaker that needs 30W+ per channel, multi-channel beamforming, and on-device LLM. It can't do any of those. Or choosing RK3588 for a battery-powered alarm clock — overkill, expensive, 7W continuous draw.

Fix: Use the SoC selection table above. Match requirements to silicon tier.
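
Matching requirements to silicon can start as a simple filter over the SoC table. The sketch below is deliberately crude: it looks only at the 1k-unit cost and the NPU figure from that table, treats chips with no TOPS figure listed as 0.0, and uses a hypothetical function name. A real selection also has to weigh power draw, audio I/O, and radio options.

```python
# (name, cost in USD at 1k units, NPU TOPS) from the SoC selection table.
# Assumption: 0.0 TOPS where the table lists no NPU figure.
SOCS = [
    ("ESP32-S3", 2.50, 0.0),
    ("ESP32-P4", 4.20, 0.0),
    ("Allwinner V821", 6.80, 1.0),
    ("Amlogic A113X", 8.50, 0.0),
    ("Rockchip RK2118", 11.00, 1.0),
    ("Amlogic A311D", 22.00, 5.0),
    ("Rockchip RK3588", 45.00, 6.0),
]


def cheapest_soc(min_tops=0.0, max_cost_usd=float("inf")):
    """Return the name of the cheapest SoC meeting both limits, or None."""
    candidates = [
        soc for soc in SOCS
        if soc[2] >= min_tops and soc[1] <= max_cost_usd
    ]
    return min(candidates, key=lambda soc: soc[1])[0] if candidates else None


if __name__ == "__main__":
    print(cheapest_soc())               # voice-only entry product
    print(cheapest_soc(min_tops=5.0))   # on-device model needing an NPU
```

A None result is useful information too: it means the requirement and the cost target cannot both be met from this list, and one of them has to move.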

2. Bad acoustic engineering

Buying expensive drivers and putting them in a plastic box without simulation. The result: muddy, boomy, tinny sound. We see this in 80% of first-time speaker projects.

Fix: COMSOL or Actran simulation before tooling. Always measure with the mesh fabric in place.

3. Wake word on the wrong chip

Trying to run always-on wake word on a general-purpose CPU. Drains battery in 6 hours. Or running wake word on a low-power DSP that's too slow for the chosen keyword.

Fix: Use dedicated low-power DSP for wake word (Syntiant NDP120, $0.85/unit, runs 1mW always-on). Reserve main SoC for higher-level processing.

4. Latency budget ignored

Designing each AI stage independently without thinking about end-to-end latency. Wake word 200ms, STT 3s, LLM 5s, TTS 3s = 11.2s total. Users abandon the product.

Fix: Set end-to-end latency target (we recommend 3s from wake word to first audio) and budget each stage.
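
Budgeting each stage is easiest to enforce when the budget lives in code and measured numbers are checked against it. In the sketch below, the split of the 3s target across stages is a hypothetical allocation chosen to sum to 3.0s, the names are illustrative, and the "measured" values are the failure case described above.

```python
# Hypothetical per-stage split of a 3.0 s wake-to-first-audio target.
BUDGET_S = {
    "wake_word": 0.2,
    "stt": 1.5,
    "llm_first_token": 0.8,
    "tts_first_audio": 0.5,
}


def over_budget(measured_s, budget_s=BUDGET_S):
    """Return {stage: overrun in seconds} for every stage past its budget."""
    return {
        stage: round(measured_s[stage] - limit, 3)
        for stage, limit in budget_s.items()
        if measured_s[stage] > limit
    }


if __name__ == "__main__":
    # The failure case: each stage designed in isolation.
    measured = {
        "wake_word": 0.2,
        "stt": 3.0,
        "llm_first_token": 5.0,
        "tts_first_audio": 3.0,
    }
    print("total:", round(sum(measured.values()), 3), "s")
    print("overruns:", over_budget(measured))
```

A check like this can run against field telemetry so that a regression in one stage is caught before users feel it.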

What we built (case studies)

We've shipped 4 smart-speaker projects at SkyTech, including:

1. Battery-powered portable AI speaker (entry tier)

2. Kitchen assistant with display (mid tier)

3. Premium smart speaker (in development)

Want us to review your AI speaker design? Send us your schematic, BOM, and acoustic simulation. We'll do a free 30-min feasibility review and quote within 48 hours.