Implementing voice processing for smart home apps

26 Feb 2015  | Vineet Ganju, Trausti Thormundsson

The supervised, feature-based VAD produces a probability measure for the presence of the desired speech. The unsupervised filtering modules use this information to decide whether they are training the filters for the noise, the interference, or the desired speech source. Any suitable VAD can be used within this framework.
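As a rough illustration of how a VAD probability could steer filter training, the sketch below maps a speech-presence probability to the filter set that would be adapted. The function name, the two thresholds, and the "hold" state are assumptions made for this example; the article does not specify how the decision is taken.

```python
def select_training_target(p_speech, hi=0.7, lo=0.3):
    """Map a VAD speech-presence probability to the filter set to adapt.

    The thresholds `hi` and `lo` are hypothetical values chosen for this
    sketch, not figures taken from the article.
    """
    if not 0.0 <= p_speech <= 1.0:
        raise ValueError("p_speech must be a probability in [0, 1]")
    if p_speech >= hi:
        return "speech"   # confident speech: adapt the desired-source filters
    if p_speech <= lo:
        return "noise"    # confident noise: adapt the noise/interference filters
    return "hold"         # ambiguous frame: freeze adaptation


# Example: one decision per audio frame
decisions = [select_training_target(p) for p in (0.95, 0.50, 0.05)]
```

Freezing adaptation on ambiguous frames is one possible design choice; it trades slower convergence for a lower risk of training a filter on the wrong kind of signal.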

The core of the system is the unsupervised spatial filtering (USF), a BSS algorithm based on independent component analysis (ICA). The ICA algorithm seeks to model the mixing system of the desired source and the interfering sources, allowing them to be separated with linear filtering. In a two-microphone system, the USF produces four signal outputs, two for each microphone. For each microphone, one signal contains the desired source and some residual noise, and the other contains an estimate of all the interfering sources, with the desired source removed.
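The idea of separating sources by linear filtering can be illustrated with a much simpler toy model than the one the USF handles. The sketch below assumes an instantaneous (gain-only) two-source, two-microphone mixture with a known mixing matrix, and inverts it. This is not ICA: a real system has to estimate the de-mixing filters blindly, and room acoustics make the mixture convolutive rather than instantaneous. All names and values are illustrative assumptions.

```python
def apply_2x2(matrix, signals):
    """Apply a 2x2 matrix to two equally long sample lists."""
    s1, s2 = signals
    out1 = [matrix[0][0] * u + matrix[0][1] * v for u, v in zip(s1, s2)]
    out2 = [matrix[1][0] * u + matrix[1][1] * v for u, v in zip(s1, s2)]
    return out1, out2


def invert_2x2(matrix):
    """Return the inverse of a 2x2 matrix, or raise if it is singular."""
    det = matrix[0][0] * matrix[1][1] - matrix[0][1] * matrix[1][0]
    if abs(det) < 1e-12:
        raise ValueError("mixing matrix is singular; sources cannot be separated")
    return [[matrix[1][1] / det, -matrix[0][1] / det],
            [-matrix[1][0] / det, matrix[0][0] / det]]


# Hypothetical mixing gains: each microphone hears both sources.
A = [[1.0, 0.6],
     [0.5, 1.0]]
speech = [0.0, 1.0, -1.0, 0.5]
noise = [0.3, -0.2, 0.4, -0.1]

mics = apply_2x2(A, (speech, noise))        # what the two microphones record
W = invert_2x2(A)                           # de-mixing matrix (known here, learned in ICA)
est_speech, est_noise = apply_2x2(W, mics)  # recovered sources
```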

The only information the USF needs to accomplish this is when the target speech is active and when the noise is active, which comes from the VAD. It then finds the filters that de-mix the desired source and the interference sources in a fully unsupervised manner. The USF does not explicitly use the direction of the source, although this information can be used to improve the VAD decision. The locations of the microphones on the device and any mismatch between microphones also have minimal effect on the algorithm. In ICA systems, it is typically the case that if N sources are present, at least N microphones are needed to recover the original signals. However, by treating the signal as containing either 1) a target speech signal and a noise signal, or 2) a noise signal only, ICA can be used with only two microphones and an unknown number of noise sources.

The output of the USF is not used directly as the output of the system, because the USF assumes that the mixtures are a linear combination of signals generated by a finite number of spatially localized sources. This coherence assumption holds only partially for the main speech source and does not hold for real-world noise. As a result, linear filtering alone is suboptimal for real-world applications, and the signal must be compensated by non-linear, time-varying, statistically based post-filtering. Post-filtering approaches generally involve estimating spectral/temporal masks (or gains) derived from the outputs of the linear filters. While masks generally improve noise reduction, the masking effect can severely degrade signal quality if the uncertainty of the de-mixing model is not taken into account.

Spectral filtering can be based on unsupervised learning of spectral gain distributions derived from the USF output signals. A probability of speech presence or absence can then be generated, and these probabilities are used to control a spectral enhancement for each channel separately. The enhancement removes undesired interference and can at the same time remove late reverberation components, effectively performing de-reverberation.
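One simple way a speech-presence probability could control a spectral gain is sketched below. It is an assumption made for illustration, not the authors' method: each frequency bin is scaled by a gain that rises linearly from a floor value to 1 as the probability of speech rises from 0 to 1. The function name and the `floor_gain` value are hypothetical.

```python
def apply_presence_gain(magnitudes, p_speech, floor_gain=0.1):
    """Scale per-bin spectral magnitudes by a speech-presence-driven gain.

    magnitudes : spectral magnitudes for one frame, one value per bin
    p_speech   : speech-presence probability per bin, same length
    floor_gain : hypothetical minimum gain, kept above zero so that
                 noise-only bins are attenuated rather than zeroed
    """
    if len(magnitudes) != len(p_speech):
        raise ValueError("magnitudes and p_speech must have the same length")
    enhanced = []
    for mag, p in zip(magnitudes, p_speech):
        p = min(1.0, max(0.0, p))                   # clamp to a valid probability
        gain = floor_gain + (1.0 - floor_gain) * p  # linear map onto [floor_gain, 1]
        enhanced.append(mag * gain)
    return enhanced


# Bin 0 is almost certainly speech, bin 1 almost certainly interference.
frame = apply_presence_gain([1.0, 2.0], [1.0, 0.0])
```

Keeping the floor above zero is a common precaution against the signal degradation that hard masking can cause.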

Figures 6 and 7 show an example of the performance of such a system. In this test, a user was 3 m away from a two-microphone system. The level of the desired speech at the microphones was 60 dB, and the level of the interfering speech at the microphones was 50 dB. The upper channel of Figure 6 shows the received signal without any processing. The lower channel shows the processed output. Figure 7 shows the spectral content of the interference before and after processing. Under this condition, the interfering signal was reduced by around 30 dB. When the unprocessed signal was sent through a speech recognition engine, the word error rate (WER) was 95%. After processing, the WER dropped to 15%.
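For readers unfamiliar with the metric, WER is conventionally computed as the word-level edit distance (substitutions, deletions, and insertions) between the recognizer output and a reference transcript, divided by the number of words in the reference. The sketch below shows that standard calculation; the article does not say which scoring tool was used, and the example sentences are invented.

```python
def word_error_rate(reference, hypothesis):
    """Return the word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # prev[j] holds the edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of five reference words
wer = word_error_rate("turn on the kitchen light", "turn on a kitchen light")
```

Because insertions are counted, WER can exceed 100% when the recognizer outputs many spurious words.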

Figure 6: The upper channel shows the received signal without any processing. The lower channel shows the processed output.

Figure 7: The spectral content of the interference before and after processing is shown.

Acoustic echo cancellation
Acoustic echo cancellation (AEC) has existed for many years and is a necessary part of any hands-free communication system. The acoustic echo canceller removes from the microphone recordings the audio that the device itself is playing back. In its simplest form the AEC is half-duplex: when the far end is talking, it simply mutes the microphone on the near end, and vice versa when the near end is talking. In these types of systems, only one side can speak at a time.
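The half-duplex behaviour described above can be sketched as a simple gate. The version below is an assumed minimal form: it shows only the far-end-talking direction, detects far-end activity from the mean frame energy, and uses a hypothetical threshold. Practical systems add hangover timers and smoother switching.

```python
def half_duplex_gate(mic_frame, far_end_frame, threshold=0.01):
    """Mute the near-end microphone frame while the far end is active.

    `threshold` is a hypothetical mean-energy level above which the
    far end is considered to be talking.
    """
    if far_end_frame:
        far_energy = sum(x * x for x in far_end_frame) / len(far_end_frame)
    else:
        far_energy = 0.0
    if far_energy > threshold:
        return [0.0] * len(mic_frame)  # far end talking: mute the microphone
    return list(mic_frame)             # far end silent: pass the microphone through


muted = half_duplex_gate([0.5, -0.5], [0.3, 0.3])   # far end active
passed = half_duplex_gate([0.5, -0.5], [0.0, 0.0])  # far end silent
```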
