Friedrich-Alexander-Universität Erlangen-Nürnberg
Chair of Multimedia Communications and Signal Processing
Prof. Dr.-Ing. Walter Kellermann
Master Thesis
Blind Source Separation Methods for Acoustic Sources
Chengyu Huang
November 2017
Supervisor: M.Sc. Andreas Brendel
Declaration

I hereby declare that I have written the present thesis without outside help and without the use of sources other than those indicated, and that the thesis has not been submitted in the same or a similar form to any other examination authority and accepted by it as part of an examination. All passages that were adopted verbatim or in substance from other sources are marked as such.

————————————
Place, Date

————————————
Signature
Contents

Abstract
List of Abbreviations
List of Formula Symbols
1 Introduction
  1.1 Structure of the Thesis
2 Fundamentals
  2.1 Short Time Fourier Transform
  2.2 Signal Models
    2.2.1 Observation Model for Speech
    2.2.2 DoAs of Source Signals
  2.3 Gaussian Distribution
  2.4 GMM and Wrapped Phase Model
    2.4.1 GMM
    2.4.2 Wrapped Phase Model
  2.5 K-means and EM Algorithm
    2.5.1 K-means Clustering [5]
    2.5.2 EM Algorithm [5]
3 GMM Clustering with Dirichlet Prior
  3.1 Dirichlet Prior
    3.1.1 Problem of GMM Fitting
    3.1.2 Dirichlet Prior
  3.2 MAP Estimator for GMM Parameters
  3.3 EM Algorithm for Counting Acoustic Sources
    3.3.1 Method Flow
    3.3.2 EM Formula Derivation
4 Split and Merge Method
  4.1 Tolerance Region of a Gaussian Distribution
  4.2 The Proposed Split and Merge Method
    4.2.1 Removal of Gaussian Components
    4.2.2 Merging Gaussian Components
    4.2.3 Pruning Training Data
    4.2.4 Final Merging Step After Convergence
5 Infinite Gaussian Mixture Model
  5.1 GMM and IGMM
  5.2 Component Parameters
  5.3 Adaptive Rejection Sampling
  5.4 Chinese Restaurant Process
  5.5 Post-processing for the IGMM Method
6 Experimental Results
  6.1 Results with Artificial Data
    6.1.1 Test Scenarios and Artificial Training Data
    6.1.2 Convergence Condition for the EM Algorithm in the GMM Clustering Method with DP
    6.1.3 Parameters for Clustering Methods
    6.1.4 The Effect of Varying Angular Difference
    6.1.5 Source Counting Accuracy with Varying SNR Values
    6.1.6 Source Counting Accuracy with Varying Data Variance
    6.1.7 Comparison and Conclusions
  6.2 Experimental Results with Observed DoAs
    6.2.1 Test with Varying Source Numbers and Sensor Numbers
    6.2.2 Test with Varying Reverberation Time
    6.2.3 Test with Varying Sensor Distance to Circle Center
    6.2.4 Test with Varying Source Distance to Circle Center
    6.2.5 Comparison and Conclusions
7 Summary and Outlook
List of Figures
List of Tables
References
Abstract
This thesis deals with BSS-based source counting algorithms that estimate the number of speech sources from an observed mixture of DoAs. Three methods based on the assumption of source sparsity are implemented and evaluated in different reverberant acoustic scenarios. The implemented methods are: the GMM clustering method with Dirichlet prior, which models the mixture of DoAs with a GMM and estimates the model parameters using the MAP-EM algorithm; the Split and Merge method, which removes redundant clusters in a GMM fitted by the ML-EM algorithm; and a non-parametric IGMM clustering method, which determines the number of clusters automatically by using a CRP representation of the Dirichlet process. The source counting accuracy in different scenarios, which characterizes the performance of the clustering methods, is calculated and compared in the experiments.

Keywords: BSS; GMM; Split and Merge; IGMM; speech source counting; EM algorithm; MAP; ARS; CRP.
List of Abbreviations
ARS Adaptive Rejection Sampling
BSS Blind Source Separation
CRP Chinese Restaurant Process
DoA Direction of Arrival
DTFT Discrete Time Fourier Transform
DP Dirichlet Prior
EM Expectation-Maximization
MAP Maximum A Posteriori
ML Maximum Likelihood
PDF Probability Density Function
TDoA Time Difference of Arrival
TR Tolerance Region
GMM Gaussian Mixture Model
IGMM Infinite Gaussian Mixture Model
ICA Independent Component Analysis
STFT Short Time Fourier Transform
SNR Signal-to-Noise Ratio
w.r.t. with respect to
obs. observed
art. artificial
List of Formula Symbols
Notations
non-bold letter a: scalar variable
bold lowercase letter a: vector
bold uppercase letter A: matrix
Operators
(·)^T Transpose of (·)
|·| Absolute value of (·)
(·)^+ Moore-Penrose pseudo-inverse of (·)
arg min{·} / arg max{·} Argument of the minimum/maximum of (·)
∝ Proportional to
arg(·) Argument of (·)
exp(·) Exponential of (·)
∑(·) Sum of (·)
∏(·) Product of (·)
ln(·) Natural logarithm of (·)
E(·) Expectation of (·)
∼ Sampled from
‖·‖_2 Euclidean norm of (·)
Symbols
s_i source signal
q_i source location
x_j sensor signal
d_j, d_J sensor locations
d_n observed DoA
h impulse response
f_s sampling frequency
ı imaginary unit
v sound velocity
φ_i source elevation angle
θ_i source azimuth angle
∆ TDoA vector
D location matrix of the sensor pairs
N total number of observations
N Gaussian distribution
µ mean
σ standard deviation
σ² variance
Σ covariance matrix
s precision
α weights of the Gaussian components
β shape of the Gamma distribution
K number of clusters
G Gamma distribution
ν parameter of the wrapped phase model
r_nk indicator variable in k-means
γ(z_nk) responsibility in the EM algorithm
φ parameter of the Dirichlet prior
θ parameter set of the GMM
a power value
c parameter of the power value
λ_L Lagrange multiplier
r_m, r_s radii of the sensor and source circles
σ_big upper limit of σ in the GMM
z_p standard score of the Gaussian distribution
C criterion of the Split and Merge method
c_var upper limit of σ
c_lm lower limit of σ
Ψ_p Mahalanobis distance
z indicator of the observed data
α_lm lower limit of the cluster weight
l_k number of data points in component k
P power
g_l(z) squeezing function
g_e(z) envelope function
a evaluation points in ARS
C_t convergence condition of the EM algorithm
Chapter 1
Introduction
Recently, many source separation algorithms have been developed to separate the signals of multiple point sources from each other. However, many source separation methods require knowledge of the number of sources in advance [3]. This is difficult because in general only the mixture signals are available for the further separation process [11].
Blind Source Separation (BSS) is an approach to separating signals where no information about the sources or the mixing process is available [11]. In the last two decades, many BSS methods have been proposed. Some BSS methods use Independent Component Analysis (ICA), but many BSS algorithms are also based on the assumption of sparsity of the sources, i.e., that in a speech mixture only one source is active in each time-frequency bin, such that the corresponding short time Fourier spectra of the single sources do not overlap [3]. Using this assumption, the Directions of Arrival (DoAs) can be estimated in each time-frequency bin and clustered afterwards. In the end, the number of clusters yields an estimate of the number of sources.
The aim of this thesis is the implementation and evaluation of algorithms for the estimation of the number of speech sources from an observed mixture. The source counting results of these algorithms can be used in many audio processing applications such as speech recognition and automatic meeting transcription. In this thesis, all source counting algorithms are based on the sparsity assumption for speech sources. The DoAs of the observed speech signal mixture are estimated using the source separation method in [3, 2]. Three clustering methods for speech source counting are discussed and implemented in this work:
• The Gaussian Mixture Model (GMM) clustering method with Dirichlet prior proposed in [3]. The Expectation-Maximization (EM) algorithm is used to solve the Maximum A Posteriori (MAP) estimation problem for the model parameters of the GMM.
• The Split and Merge method for the EM algorithm described in [15]. This method also models the observed DoAs using a GMM and estimates the model parameters using the EM algorithm. Instead of using the Dirichlet prior, this method includes split and merge steps in the M-step of the EM algorithm to reduce the number of redundant Gaussian components. The underlying optimization problem is an ML estimation problem.
• The Infinite Gaussian Mixture Model (IGMM) [12]. This method models the observed DoAs with a GMM whose number of Gaussian components is taken to be infinite, which avoids the problem of defining the correct number of Gaussian components in advance when the source number is unknown. In the IGMM method, the model parameters are estimated using Gibbs sampling. To limit the number of Gaussian components, Rasmussen used the Chinese Restaurant Process (CRP) representation of the Dirichlet process.
After the implementation, different reverberant acoustic scenarios are designed to investigate the performance and robustness of these three methods for counting speech sources.
1.1 Structure of the Thesis
The structure of the thesis is summarized as follows:
Chapter 2 - Fundamentals introduces the fundamentals needed for the thesis, such as the principles of the STFT, the GMM and the EM algorithm. The BSS-based method for estimating DoAs of speech sources is also described in this chapter.
Chapter 3 - GMM Clustering with Dirichlet Prior describes the GMM method of [3] for estimating the number of speech sources using a Dirichlet prior; the derivation of the MAP-EM algorithm is presented at the end of this chapter.
Chapter 4 - Split and Merge Method introduces the principles of the split and merge method for reducing redundant Gaussian components in GMM clustering based on the ML-EM algorithm.
Chapter 5 - Infinite Gaussian Mixture Model introduces the IGMM method of
[12] for data clustering using Gibbs sampling and the CRP.
Chapter 6 - Experimental Results discusses and compares the source counting
results using the three clustering methods.
Chapter 7 - Summary and Outlook reviews the results and conclusions obtained in this thesis and suggests directions for future work.
Chapter 2
Fundamentals
This chapter introduces the fundamental knowledge needed to understand the following chapters on theoretical principles and experimental results. Section 2.1 introduces the basic idea of the STFT, which is an essential tool for speech signal processing. Section 2.2 presents a BSS-based DoA estimation method for multiple sparse sources proposed in [2, 3]. Sections 2.3 and 2.4 describe the definitions of the Gaussian distribution and the GMM as well as the wrapped phase model proposed by Smaragdis et al. in [14]. Finally, Section 2.5 introduces the basic ideas of the k-means and EM algorithms for data clustering.
2.1 Short Time Fourier Transform
The Fourier transform is well known for revealing the frequency content of a signal. The Fourier transform of a continuous signal $x_c(t)$ is written as
$$X_c(\omega) = \int_{-\infty}^{\infty} x_c(t)\, e^{-\imath\omega t}\, dt. \tag{2.1}$$
The DTFT of a sequence $x[n]$ is
$$X(\omega) = \sum_{n=-\infty}^{\infty} x[n]\, e^{-2\pi\imath\omega n}. \tag{2.2}$$
The DFT is obtained by sampling the DTFT at $N_\mathrm{ft}$ discrete frequencies. The DFT of the time-domain sequence $x[n]$ is written as
$$X[\kappa] = X(\kappa/N_\mathrm{ft}) = \sum_{n=0}^{N_\mathrm{ft}-1} x[n]\, e^{-\imath \frac{2\pi\kappa n}{N_\mathrm{ft}}}. \tag{2.3}$$
The Fourier transform yields frequency information that is averaged over the entire time domain; the temporal information is hidden in the transform [9]. Fig. 2.1 illustrates this with two example signals. Although the two signals are very different in the time domain, their Fourier transforms are similar, because the two signals have the same frequency content but a different temporal evolution.
Figure 2.1: Missing time information of the Fourier transform illustrated by two different signals and their magnitude Fourier transforms. (a) Two subsequent sinusoids of frequency 1 Hz and 5 Hz. (b) Superposition of the same sinusoids. [9]
For non-stationary signals, it is also important to obtain the temporal structure. In this case, the DFT can be computed over short periods of time, because speech can be considered stationary over short time intervals. This method is called the STFT and is commonly used for the analysis, modification, and synthesis of speech and audio signals.
Let $x : \mathbb{Z} \to \mathbb{R}$ be a real-valued discrete-time signal obtained by equidistant sampling with a fixed sampling rate $F_s$, and let $w : [0 : N_\mathrm{ft}-1] \to \mathbb{R}$ be a sampled window function of length $N_\mathrm{ft} \in \mathbb{N}$. The discrete STFT $X$ of the signal $x$ is given by
$$X(l, \kappa) := \sum_{n=0}^{N_\mathrm{ft}-1} x(n + lH)\, w(n)\, \exp(-2\pi\imath\kappa n / N_\mathrm{ft}), \tag{2.4}$$
where $l \in \mathbb{Z}$ is the time-frame index, $\kappa$ is the frequency index and $H$ is the hop size between frames. The complex number $X(l, \kappa)$ denotes the $\kappa$-th Fourier coefficient of the $l$-th time frame. Fig. 2.2 shows how the STFT works with a rectangular window shifted across the signal.
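To make Eq. (2.4) concrete, the following NumPy sketch computes the discrete STFT; the Hann window, the default frame parameters and the handling of the signal boundary are assumptions for illustration, not prescribed above.

```python
import numpy as np

def stft(x, n_ft=512, hop=256):
    """Discrete STFT of Eq. (2.4): frame the signal, apply the window
    w(n), and take the DFT of each frame."""
    w = np.hanning(n_ft)                      # assumed window function w(n)
    n_frames = 1 + (len(x) - n_ft) // hop     # number of full frames l
    X = np.empty((n_frames, n_ft), dtype=complex)
    for l in range(n_frames):
        frame = x[l * hop : l * hop + n_ft]   # x(n + lH) with hop size H
        X[l] = np.fft.fft(frame * w)          # DFT over the frequency index kappa
    return X
```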
2.2 Signal Models
This Section introduces a signal model as well as an BSS method of DoA estimation
for speech sources proposed by Araki et al in [3, 2]. The observation model built for
speech sources will be presented in Section 2.2.1 and the proposed DoA estimation
method will be introduced in Section 2.2.2.
2.2.1 Observation Model for Speech
The observation model includes $N_s \geq 1$ speech sources and $N_m$ sensors. The signal observed at sensor $j$ can be written as
$$x_j(t) = \sum_{i=1}^{N_s} \sum_{l} h_{ji}(l)\, s_i(t - l), \quad j = 1, \dots, N_m, \tag{2.5}$$
where $h_{ji}(l)$ is the impulse response from source $i$ to sensor $j$.
If the assumption that speech signals are sufficiently sparse holds, we can assume that only one speech source is dominant in each time-frequency bin [3].
Figure 2.2: Signal and Fourier transforms of two subsequent sinusoids of frequency 1 Hz and 5 Hz [9].
With an $F$-point STFT, Eq. (2.5) is converted into
$$X_j(f, \tau) \approx \sum_{i=1}^{N_s} H_{ji}(f)\, S_i(f, \tau), \tag{2.6}$$
where $H_{ji}$ is the frequency response from source $i$ to sensor $j$, $f \in \left\{0, \frac{1}{F}f_s, \dots, \frac{F-1}{F}f_s\right\}$ is a discrete frequency, $f_s$ is the sampling frequency and $\tau \in \{0, \dots, T-1\}$ is a time-frame index. When the assumption of source sparseness holds and $S_i(f, \tau)$ is the dominant source at the time-frequency bin $(f, \tau)$, the observation model can be written as
$$X_j(f, \tau) \approx H_{ji}(f)\, S_i(f, \tau). \tag{2.7}$$
2.2.2 DoAs of Source Signals
Figure 2.3: Definition of the DoA. A speech source $i$ is located in a spherical coordinate system. The DoA of source $i$ is represented by the unit-norm vector $\frac{q_i}{\|q_i\|}$, marked by the red arrow [2].
Fig. 2.3 illustrates the definition of the DoA of a speech source in spherical coordinates. $q_i$ is a three-dimensional vector representing the location of source $s_i$, and the DoA is defined by the unit-norm vector $\frac{q_i}{\|q_i\|}$ as
$$\frac{q_i}{\|q_i\|} = [\cos\theta_i \cos\phi_i,\ \sin\theta_i \cos\phi_i,\ \sin\phi_i]^T, \tag{2.8}$$
where $\theta_i$ is the source azimuth angle and $\phi_i$ is the source elevation angle.
Let $d_j$ denote the known location of sensor $j$. The frequency response $H_{ji}(f)$ is approximated by a far-field model (see Fig. 2.4) as
$$H_{ji}(f) \approx \exp\left[\imath 2\pi f v^{-1} d_j^T \frac{q_i}{\|q_i\|}\right], \tag{2.9}$$
Figure 2.4: Far-field model of a speech source $i$ and two sensors $j$ and $J$. The TDoA between the sensors corresponds to $(d_j - d_J)^T \frac{q_i}{\|q_i\|}$, marked by the red arrow [2].
where $v$ is the sound velocity. This expression for $H_{ji}(f)$ is derived under the assumption that the frequency response is determined only by the distance $d_j^T \frac{q_i}{\|q_i\|}$ between source $i$ and the origin $O$ and between source $i$ and sensor $j$ in the direction of sound propagation. For two sensors $j$ and $J$, the following expression is obtained:
$$\frac{H_{ji}(f)}{H_{Ji}(f)} \approx \exp\left[\imath 2\pi f v^{-1} (d_j - d_J)^T \frac{q_i}{\|q_i\|}\right]. \tag{2.10}$$
Because $X_j(f, \tau)/X_J(f, \tau) = H_{ji}(f)/H_{Ji}(f)$ and
$$\arg[X_j(f, \tau) X_J^*(f, \tau)] = \arg[X_j(f, \tau)/X_J(f, \tau)], \tag{2.11}$$
the phase difference between the signals of sensor $j$ and sensor $J$ can be written as
$$\arg[X_j(f, \tau) X_J^*(f, \tau)] = 2\pi f v^{-1} (d_j - d_J)^T \frac{q_i}{\|q_i\|}, \tag{2.12}$$
and the DoA can be calculated with
$$\frac{q_i}{\|q_i\|} = D^+ \Delta_{jJ}\, v, \tag{2.13}$$
where $\Delta_{jJ} = \frac{1}{2\pi f}\arg[X_j(f, \tau) X_J^*(f, \tau)]$ is the TDoA between sensors $j$ and $J$ and $D^+$ is the Moore-Penrose pseudo-inverse of $(d_j - d_J)^T$. If more than two sensors exist, one defines the location matrix $D = [d_1 - d_J, \dots, d_M - d_J]^T$ and the TDoA vector $\Delta_J = (\Delta_{1J}, \dots, \Delta_{MJ})$ of all sensor pairs. The DoA is then calculated by
$$\frac{q_i}{\|q_i\|} = D^+ \Delta_J\, v. \tag{2.14}$$
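As an illustration of Eqs. (2.12)-(2.14), the following sketch estimates a DoA unit vector from the STFT coefficients of one time-frequency bin; the function name, the reference-sensor convention and the default sound velocity are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def estimate_doa(X, d_locs, f, v=343.0, ref=0):
    """Estimate a DoA unit vector from one time-frequency bin.
    X      : complex STFT coefficients of all M sensors at bin (f, tau)
    d_locs : (M, 3) array of sensor positions d_j
    f      : frequency of the bin in Hz
    ref    : index J of the reference sensor"""
    M = len(X)
    pairs = [j for j in range(M) if j != ref]
    # TDoAs from the phase differences, Eq. (2.12)
    delta = np.array([np.angle(X[j] * np.conj(X[ref])) / (2 * np.pi * f)
                      for j in pairs])
    # location matrix of the sensor pairs, D = [d_1 - d_J, ...]^T
    D = np.array([d_locs[j] - d_locs[ref] for j in pairs])
    q = np.linalg.pinv(D) @ (delta * v)   # Eq. (2.14)
    return q / np.linalg.norm(q)          # normalize to a unit vector
```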
2.3 Gaussian Distribution
The Gaussian distribution is one of the most important probability distributions for continuous variables. For the case of a single real-valued variable $x$, the probability density function (PDF) of the Gaussian distribution is
$$\mathcal{N}(x|\mu, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left\{-\frac{1}{2\sigma^2}(x - \mu)^2\right\}, \tag{2.15}$$
where the parameter $\mu$ is the mean and $\sigma^2$ is the variance. The square root of the variance, $\sigma$, is called the standard deviation. The precision of a Gaussian distribution is the reciprocal of the variance, $\frac{1}{\sigma^2}$.
As a probability distribution, the Gaussian distribution satisfies the requirement
$$\mathcal{N}(x|\mu, \sigma^2) > 0. \tag{2.16}$$
If $x$ is a $D$-dimensional vector of continuous variables, the Gaussian distribution of $x$ is written as
$$\mathcal{N}(x|\mu, \Sigma) = \frac{1}{(2\pi)^{D/2}} \frac{1}{|\Sigma|^{1/2}} \exp\left\{-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right\}. \tag{2.17}$$
In this PDF the mean $\mu$ is a $D$-dimensional vector, the $D \times D$ matrix $\Sigma$ is called the covariance matrix, and $|\Sigma|$ is the determinant of $\Sigma$.
Figure 2.5: Plot of the univariate Gaussian showing the mean µ and the standard deviation σ [5].
2.4 GMM and Wrapped Phase Model
After the DoA estimation, the distribution of the observed DoAs can be plotted as a histogram. Under the assumption that the speech sources are sufficiently sparse, the histogram of the DoAs is expected to show clusters which correspond to the directions of the speech sources. Fig. 2.6 shows the histogram of observed DoAs of three speech sources. Since the observed mixture usually contains DoAs of more than one source from different directions (e.g., there are three sources in Fig. 2.6), the observed DoAs $d_n$ can be modeled by a Gaussian mixture model (GMM). This section introduces the definition of the GMM and the wrapped phase model proposed by Smaragdis in [14].
2.4.1 GMM
A GMM is a density model formed by the linear superposition of a sufficient number $K$ of Gaussian components. The GMM for the observed DoAs $d_n$ can be expressed as
$$p(d_n) = \sum_{k=1}^{K} \alpha_k\, \mathcal{N}(d_n|\mu_k, \sigma_k^2), \tag{2.18}$$
where each Gaussian density $\mathcal{N}(d_n|\mu_k, \sigma_k^2)$ is called a component of the mixture and models the DoAs of one speech source. The parameter $\alpha_k$ is the mixing coefficient,
Figure 2.6: Histogram of observed DoAs of three speech sources. The white columns represent the observed DoAs; the red stems indicate the true directions of the speech sources.
$\mu_k$ is the mean and $\sigma_k^2$ is the variance of the $k$-th Gaussian component. If both $p(d_n)$ and the individual Gaussian components are normalized, the mixing coefficients are restricted by the condition
$$\sum_{k=1}^{K} \alpha_k = 1, \tag{2.19}$$
which is obtained by integrating both sides of Eq. (2.18) with respect to $d_n$. Fig. 2.7 shows an example of a GMM.
2.4.2 Wrapped Phase Model
Figure 2.7: Plot of a mixture of three Gaussian components; the linear superposition of these Gaussian components forms a GMM.

When modeling the data, the problem of phase wrapping should be taken into consideration. According to the signal model of Araki et al. described in Section 2.2, the observed mixture $d_n$ is distributed in the interval $[-\pi, \pi)$. Each source is supposed to be modeled by a Gaussian distribution. If the mean of the observed data is close to $-\pi$ or $\pi$, the distribution wraps and becomes bimodal. In this case a Gaussian model might misrepresent the data [14]. Fig. 2.8 illustrates this phenomenon.
Figure 2.8: Histograms of observed DoAs. On the left, little phase wrapping occurs, resulting in data that can be well modeled by a Gaussian distribution. On the right, severe phase wrapping occurs, resulting in a poor fit of the data by a single Gaussian distribution [14].
To solve this problem, the observed DoAs can be modeled by replicating and adding Gaussian distributions at intervals of $2\pi$:
$$p(d_n) = \begin{cases} \displaystyle\sum_{\nu=-\infty}^{+\infty} \frac{1}{\sqrt{2\pi\sigma^2}}\, \exp\left(-\frac{(d_n + 2\pi\nu - \mu)^2}{2\sigma^2}\right) & \text{if } d_n \in [-\pi, \pi) \\ 0 & \text{otherwise.} \end{cases} \tag{2.20}$$
Figure 2.9: Illustration of the wrapped phase Gaussian distribution model. Replicating a Gaussian distribution at intervals of 2π (dashed lines) and summing the replicas in the interval [−π, π) (solid blue line) results in an accurate model of a wrapped Gaussian distribution [14].

As an example, Fig. 2.9 shows a Gaussian distribution with mean $\mu = 170°$. The dashed lines represent the replicas of this distribution, while the solid line is the sum of the replicas in the interval $[-\pi, \pi)$, which yields the wrapped distribution. The figure shows that the part of the central Gaussian that extends below $-\pi$ is added to the right replica, and the part of the right replica that extends beyond $\pi$ is represented by the central Gaussian. Therefore, the wrapped phase GMM
$$p(d_n) = \sum_{k=1}^{K} \sum_{\nu=-\infty}^{+\infty} \alpha_k\, \mathcal{N}(d_n + 2\pi\nu\,|\,\mu_k, \sigma_k^2) \tag{2.21}$$
will be used instead of Eq. (2.18), where $\nu$ is an integer parameter determined from the observed DoAs.
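A minimal sketch of Eq. (2.21) follows; truncating the infinite sum over ν is an assumption (for realistic σ, |ν| ≤ 2 already covers the interval [−π, π)).

```python
import numpy as np

def wrapped_gmm_pdf(d, alphas, mus, sigmas, nu_max=2):
    """Wrapped phase GMM density of Eq. (2.21) for DoAs d in [-pi, pi).
    The infinite sum over nu is truncated to |nu| <= nu_max."""
    d = np.atleast_1d(d)
    p = np.zeros_like(d, dtype=float)
    for alpha, mu, sigma in zip(alphas, mus, sigmas):
        for nu in range(-nu_max, nu_max + 1):
            z = d + 2 * np.pi * nu - mu
            p += alpha * np.exp(-z**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
    return p
```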
2.5 K-means and EM Algorithm
After building the GMM structure for the observed DoAs, clustering algorithms are applied to classify the data and estimate the parameters of the GMM. This section introduces the basic ideas of the k-means algorithm and the EM algorithm, which are well-known data clustering algorithms.
2.5.1 K-means Clustering [5]
As described in the previous sections, the observed DoAs are collected in a vector $d = [d_1, \dots, d_N]$. The goal of k-means clustering is to partition the data set $d$ into a given number $K$ of clusters. If $\mu_k$ denotes the center of the $k$-th cluster, we define a binary indicator variable $r_{nk} \in \{0, 1\}$ that describes which cluster the data point $d_n$ is assigned to:
$$r_{nk} = \begin{cases} 1 & \text{if } d_n \text{ is assigned to the } k\text{-th cluster} \\ 0 & \text{otherwise.} \end{cases} \tag{2.22}$$
We also define the sum of the squared distances of each data point to its assigned center $\mu_k$ as the objective function
$$J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk}\, \| d_n - \mu_k \|^2. \tag{2.23}$$
Minimizing $J$ means that each observed DoA is assigned to its nearest cluster. In order to minimize $J$, k-means clustering proceeds in three steps:

1) Initialize values for $\mu_k$.

2) Minimize $J$ with respect to $r_{nk}$, keeping the $\mu_k$ fixed. To achieve this, the $n$-th data point is assigned to the closest cluster center, which can be expressed as
$$r_{nk} = \begin{cases} 1 & \text{if } k = \arg\min_j \| d_n - \mu_j \|^2 \\ 0 & \text{otherwise.} \end{cases} \tag{2.24}$$
3) Minimize $J$ with respect to $\mu_k$, keeping the $r_{nk}$ fixed. The function $J$ is a quadratic function of $\mu_k$ and can be minimized by setting the derivative with respect to $\mu_k$ to zero, giving
$$2\sum_{n=1}^{N} r_{nk}(d_n - \mu_k) = 0, \tag{2.25}$$
from which $\mu_k$ follows directly as
$$\mu_k = \frac{\sum_n r_{nk}\, d_n}{\sum_n r_{nk}}. \tag{2.26}$$
The last two steps will be repeated until convergence.
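The three steps translate into a compact NumPy sketch; initializing the centers with randomly chosen data points and running a fixed number of iterations are assumptions, since neither is prescribed here.

```python
import numpy as np

def kmeans(d, K, n_iter=100, seed=0):
    """K-means clustering of the 1-D DoA data d into K clusters,
    following steps 1)-3) above."""
    rng = np.random.default_rng(seed)
    mu = rng.choice(d, size=K, replace=False)      # step 1: initialize centers
    for _ in range(n_iter):
        # step 2: assign each point to the closest center, Eq. (2.24)
        labels = np.argmin(np.abs(d[:, None] - mu[None, :]), axis=1)
        # step 3: recompute each center as the mean of its points, Eq. (2.26)
        for k in range(K):
            if np.any(labels == k):
                mu[k] = d[labels == k].mean()
    return mu, labels
```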
2.5.2 EM Algorithm [5]
This section introduces the EM algorithm in its general and simplest form, which aims to find the ML solution for a plain GMM. The wrapped phase GMM for the observed DoAs is more complicated because it contains additional latent variables. The detailed derivation of the EM algorithm for finding the maximum a posteriori (MAP) solution for the wrapped phase model is presented in Chapter 3.
The log-likelihood function of the GMM given the data set $d$ is written as
$$\ln p(d|\alpha, \mu, \sigma^2) = \sum_{n=1}^{N} \ln\left\{\sum_{k=1}^{K} \alpha_k\, \mathcal{N}(d_n|\mu_k, \sigma_k^2)\right\}, \tag{2.27}$$
where $\alpha$, $\mu$ and $\sigma^2$ are the model parameters representing the mixing proportions, means and variances of the Gaussian components.
To calculate the maximum likelihood solution for $\mu_k$, we set the derivative of the log-likelihood function with respect to $\mu_k$ to zero and obtain
$$0 = -\sum_{n=1}^{N} \frac{\alpha_k\, \mathcal{N}(d_n|\mu_k, \sigma_k^2)}{\sum_{j=1}^{K} \alpha_j\, \mathcal{N}(d_n|\mu_j, \sigma_j^2)}\, \frac{d_n - \mu_k}{\sigma_k^2}. \tag{2.28}$$
Here we define the responsibility
$$\gamma(z_{nk}) = \frac{\alpha_k\, \mathcal{N}(d_n|\mu_k, \sigma_k^2)}{\sum_{j=1}^{K} \alpha_j\, \mathcal{N}(d_n|\mu_j, \sigma_j^2)}. \tag{2.29}$$
With this, $\mu_k$ is calculated as
$$\mu_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, d_n, \tag{2.30}$$
where $N_k$, which represents the effective number of data points assigned to cluster $k$, is defined as
$$N_k = \sum_{n=1}^{N} \gamma(z_{nk}). \tag{2.31}$$
To find the maximum likelihood solution for $\sigma_k^2$, we set the derivative of $\ln p(d|\alpha, \mu, \sigma^2)$ with respect to $\sigma_k^2$ to zero and obtain
$$\sigma_k^2 = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})(d_n - \mu_k)^2 \tag{2.32}$$
for a single Gaussian component $k$. To respect the constraint $\sum_{k=1}^{K} \alpha_k = 1$, we use a Lagrange multiplier and maximize the quantity
$$\ln p(d|\alpha, \mu, \sigma^2) + \lambda\left(\sum_{k=1}^{K} \alpha_k - 1\right), \tag{2.33}$$
so that the maximum likelihood solution for $\alpha_k$ satisfies
$$0 = \sum_{n=1}^{N} \frac{\mathcal{N}(d_n|\mu_k, \sigma_k^2)}{\sum_{j=1}^{K} \alpha_j\, \mathcal{N}(d_n|\mu_j, \sigma_j^2)} + \lambda. \tag{2.34}$$
If we multiply both sides of (2.34) by $\alpha_k$ and sum over $k$, we find that $\lambda = -N$ and
$$\alpha_k = \frac{N_k}{N}, \tag{2.35}$$
which means that the mixing coefficient of the $k$-th Gaussian component is the ratio of the effective number of data points assigned to the $k$-th component to the total number of data points. The EM algorithm for the GMM is summarized as follows:

1. Choose an initial setting for the parameters $\mu_k$, $\sigma_k^2$ and the mixing proportions $\alpha_k$. Evaluate the initial value of the log-likelihood.
2. E step: Evaluate the responsibilities using the current parameters:
$$\gamma(z_{nk}) = \frac{\alpha_k\, \mathcal{N}(d_n|\mu_k, \sigma_k^2)}{\sum_{j=1}^{K} \alpha_j\, \mathcal{N}(d_n|\mu_j, \sigma_j^2)}. \tag{2.36}$$
3. M step: Re-estimate the parameters using the current responsibilities:
$$\mu_k^{\mathrm{new}} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, d_n, \qquad (\sigma_k^2)^{\mathrm{new}} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})(d_n - \mu_k^{\mathrm{new}})^2, \qquad \alpha_k^{\mathrm{new}} = \frac{N_k}{N}, \tag{2.37}$$
where
$$N_k = \sum_{n=1}^{N} \gamma(z_{nk}). \tag{2.38}$$
4. Evaluate the log-likelihood
$$\ln p(d|\alpha, \mu, \sigma^2) = \sum_{n=1}^{N} \ln\left\{\sum_{k=1}^{K} \alpha_k\, \mathcal{N}(d_n|\mu_k, \sigma_k^2)\right\} \tag{2.39}$$
and check for convergence of either the log-likelihood or the parameter values. If the convergence criterion is not satisfied, return to step 2.
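The four steps translate directly into code. The following NumPy sketch implements ML-EM for a 1-D GMM; the initialization and the convergence threshold are assumptions.

```python
import numpy as np

def gmm_em(d, K, n_iter=200, tol=1e-6, seed=0):
    """ML-EM for a 1-D GMM, following steps 1-4 above."""
    rng = np.random.default_rng(seed)
    mu = rng.choice(d, size=K, replace=False)   # step 1: initialization
    var = np.full(K, d.var())
    alpha = np.full(K, 1.0 / K)
    ll_old = -np.inf
    for _ in range(n_iter):
        # E step: responsibilities, Eq. (2.36), shape (N, K)
        dens = alpha * np.exp(-(d[:, None] - mu)**2 / (2 * var)) \
               / np.sqrt(2 * np.pi * var)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M step: update parameters, Eqs. (2.37), (2.38)
        Nk = gamma.sum(axis=0)
        mu = (gamma * d[:, None]).sum(axis=0) / Nk
        var = (gamma * (d[:, None] - mu)**2).sum(axis=0) / Nk
        alpha = Nk / len(d)
        # step 4: check convergence of the log-likelihood, Eq. (2.39)
        ll = np.log(dens.sum(axis=1)).sum()
        if abs(ll - ll_old) < tol:
            break
        ll_old = ll
    return alpha, mu, var
```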
In summary, this chapter introduced the fundamentals used in this thesis. Section 2.1 introduced the basic idea of the STFT. Section 2.2 presented the DoA estimation method for speech sources from [4]. To count the number of speech sources, a wrapped phase GMM is used to model the observed DoAs (introduced in Sections 2.3 and 2.4), and clustering algorithms such as k-means and the EM algorithm can be used to estimate the parameters of the GMM.
Chapter 3
GMM Clustering with Dirichlet Prior
In the previous chapter, the DoA estimation method proposed by Araki et al. in [3, 2] computes multiple DoAs from the sensor signals and sensor locations. The remaining task is to estimate the source number by combining all single measurements, which can be accomplished by a clustering algorithm. In their previous work [4], the authors applied the k-means algorithm to the observed DoAs, assuming the source number Ns to be given. The authors of [8] modeled the data with a GMM and derived an EM algorithm for fitting the GMM to the data. In the context of BSS, however, the source number is usually unknown. Furthermore, in the under-determined case, i.e., when the source number exceeds the number of sensors, the estimation can be very difficult [3].

This chapter describes the GMM fitting method with Dirichlet prior [3]. The method does not require the source number as prior knowledge and also works well in the under-determined case. A MAP estimator based on the EM algorithm is applied to estimate the model parameters.
In this chapter, the derivation and implementation of the method are presented and discussed step by step. Section 3.1 defines the Dirichlet prior and the reason for using it. In Section 3.2, the MAP estimator based on the GMM is given for the estimation of the parameters. Finally, Section 3.3 presents the derivation of the corresponding EM algorithm. The experimental results of this method are documented and compared in Chapter 6.
3.1 Dirichlet Prior
3.1.1 Problem of GMM Fitting
Since the number of components used by the EM algorithm (e.g., 10) must exceed the actual maximum source number (e.g., 5), Araki et al. describe a frequently occurring problem when applying a GMM fitting method to estimate the number of sources from the DoA histogram. Fig. 3.1 illustrates this problem: the white columns represent the observed DoAs of three speech sources, the red stems show the correct position of each source, and the GMM fitting result is plotted in black. It can be seen that the source located at 140° is fitted by two Gaussian components; the same problem also occurs for the third source located at 110°. However, one cluster, which corresponds to one speech source, should be modeled by only one Gaussian component. This problem is called overfitting.
3.1.2 Dirichlet Prior
To avoid overfitting, Araki et al. proposed to place a Dirichlet distribution on the GMM component weight vector $\alpha$:
$$p(\alpha) = \frac{1}{B(\phi)} \prod_{k=1}^{K} \alpha_k^{\phi - 1}, \tag{3.1}$$
where $\alpha = [\alpha_1, \dots, \alpha_k, \dots, \alpha_K]$ is the vector of component weights with $0 \leq \alpha_k \leq 1$ and $\sum_{k=1}^{K} \alpha_k = 1$, because the distribution is normalized. Here, $\phi$ is the parameter of the Dirichlet distribution and $B(\phi)$ is the normalizing multivariate beta function [5].
Figure 3.1: Example of a GMM fitting result without a Dirichlet prior. The white columns represent the observed DoAs, the red stems mark the correct position of each source and the black lines plot the GMM fitting result.
Due to the summation constraint $\sum_{k=1}^{K} \alpha_k = 1$, the Dirichlet distribution over $\alpha$ is confined to a probability simplex; Fig. 3.2 shows an example for $\alpha = [\alpha_1, \alpha_2, \alpha_3]$. Fig. 3.3 illustrates how the shape of the Dirichlet distribution changes with the value of the hyperparameter $\phi$; the colors indicate the density $p(\alpha)$. When $\phi = 1$, the Dirichlet distribution over the simplex becomes a uniform distribution. When $\phi < 1$, the distribution concentrates in the corners and along the boundaries of the simplex; in this case the density $p(\alpha)$ reaches its maximum at $\alpha = [1, 0, 0]$, $\alpha = [0, 1, 0]$ or $\alpha = [0, 0, 1]$, which is desirable to represent the assumption of source sparsity. When $\phi > 1$, the distribution tends to concentrate in the center of the simplex. Araki et al. proposed to set the Dirichlet parameter $\phi < 1$ to fix the overfitting problem described in Section 3.1.1; Fig. 3.4 shows that each cluster is adequately fitted by one Gaussian component when using the Dirichlet prior with $\phi = 0.01$.
Figure 3.2: The Dirichlet distribution over three variables α1, α2, α3 is confined to a probability simplex (the triangular area) [5].

(a) φ > 1 (b) φ = 1 (c) φ < 1

Figure 3.3: Shapes of the Dirichlet distribution over the simplex for different values of φ.
3.2 MAP Estimator for GMM Parameters
As described in Section 2.4.2, the observed DoAs are modeled by a wrapped phase GMM, written as
$$p(d_n) = \sum_{k=1}^{K} \alpha_k \sum_{\nu=-\infty}^{+\infty} \mathcal{N}(d_n + 2\pi\nu\,|\,\mu_k, \sigma_k^2), \tag{3.2}$$
where the mean $\mu_k$, the standard deviation $\sigma_k$ and the weight $\alpha_k$ of each Gaussian component have to be estimated. In Eq. (3.2), the component index $k$ and the index $\nu$ of the wrapped phase model are handled as latent variables.
Figure 3.4: GMM fitting example using the Dirichlet prior with φ = 0.01.
Let $\theta = [\alpha_k, \mu_k, \sigma_k]_{k=1}^{K} = [\alpha, \mu, \sigma]$ be the parameter set of the model. The observations are the DoA values $d = [d_1, d_2, \dots, d_n, \dots, d_N]$, where $N = TF$ is the number of observations, $T$ is the number of time frames and $F$ is the number of frequency bins of the STFT. In order to emphasize the observations of active speakers with high power, Araki et al. defined the power value $a(f, \tau) = |x(f, \tau)|^2$; the vector of the power values over all frequency bins and time frames is $a = [a_1, a_2, \dots, a_n, \dots, a_N]$.
The parameter set can be estimated via a MAP estimator by maximizing the posterior probability
$$\hat{\theta} = \arg\max_{\theta}\ p(\theta|d). \tag{3.3}$$
According to Bayes' theorem,
$$\arg\max_{\theta}\ p(\theta|d) = \arg\max_{\theta}\ \frac{p(d|\theta)\, p(\theta)}{p(d)} = \arg\max_{\theta}\ p(d|\theta)\, p(\theta) = \arg\max_{\theta}\ p(d, \theta). \tag{3.4}$$
The logarithm of the joint probability density function $p(d, \theta)$ is
$$L(\theta) = \log\left[p(d|\theta)\, p(\alpha)\, p(\mu)\, p(\sigma)\right] = \log p(d|\theta) + \log p(\alpha) + \mathrm{const.} \tag{3.5}$$
The power weighting function of $a$ is defined by
$$f(a_n) = \frac{c\, a_n}{\sum_{n=1}^{N} a_n}. \tag{3.6}$$
We emphasize the log-probability of the observations with this weighting function. Because the observations are independent of each other,
$$\log p(d|\theta) = \sum_{n=1}^{N} f(a_n)\, \log p(d_n|\theta), \tag{3.7}$$
so $L(\theta)$ can be written as
$$L(\theta) = \sum_{n=1}^{N} f(a_n)\, \log p(d_n|\theta) + \log p(\alpha) + \mathrm{const.} = \sum_{n=1}^{N} f(a_n)\, \log\left(\sum_{k=1}^{K} \sum_{\nu=-\infty}^{\infty} p(k, \nu, d_n|\theta)\right) + \log p(\alpha) + \mathrm{const.}, \tag{3.8}$$
where $p(d_n|\theta) = \sum_{k=1}^{K} \sum_{\nu=-\infty}^{\infty} p(k, \nu, d_n|\theta)$ follows from marginalizing out the variables $k$ and $\nu$, and the joint probability $p(k, \nu, d_n|\theta)$ is
$$p(k, \nu, d_n|\theta) = \frac{\alpha_k}{\sqrt{2\pi\sigma_k^2}}\, \exp\left(-\frac{(d_n + 2\pi\nu - \mu_k)^2}{2\sigma_k^2}\right). \tag{3.9}$$
3.3 EM Algorithm for Counting Acoustic Sources
3.3.1 Method Flow
To find the MAP solution and estimate the parameters of the GMM, Araki et al. use the EM algorithm summarized in Tab. 3.1.
EM Algorithm [3]

Initialization
1. Select the number of initial Gaussian components for the observed DoA mixture; this number should be larger than the maximum possible number of sources.
2. Initialize the means of the GMM by k-means clustering.

Repeat
1. E step: Calculate the posterior probability p(k, ν|dn, θt) using Eq. (3.15), where θt denotes the parameters calculated in the previous iteration.
2. M step: Calculate the parameters θ with Eqs. (3.22), (3.23) and (3.36).

Until the GMM parameters have converged, i.e., the parameter difference between two iterations is smaller than a given threshold.

Table 3.1: Summary of the EM algorithm [3].
3.3.2 EM Formula Derivation
As described in Section 3.2, the logarithm of the joint probability is given as
$$\log p(d, \theta) = \sum_{n=1}^{N} f(a_n)\, \log\left(\sum_{k=1}^{K} \sum_{\nu=-\infty}^{\infty} p(k, \nu, d_n|\theta)\right) + \log p(\alpha) + \mathrm{const.}, \tag{3.10}$$
where
$$p(k, \nu, d_n|\theta) = \frac{\alpha_k}{\sqrt{2\pi\sigma_k^2}}\, \exp\left(-\frac{(d_n + 2\pi\nu - \mu_k)^2}{2\sigma_k^2}\right) \tag{3.11}$$
and
$$f(a_n) = \frac{c\, a_n}{\sum_{n=1}^{N} a_n}. \tag{3.12}$$
The auxiliary function $Q$ [5] is given as the conditional expectation of $\log p(d, \theta)$ given $d_n$ and $\theta^t$:
$$Q(\theta|\theta^t) = \mathrm{E}\left[\log p(d, \theta)\,|\,d_n, \theta^t\right], \tag{3.13}$$
where $\theta^t$ is the estimate of the parameters after the $t$-th iteration. With Jensen's inequality (log is concave),
$$\log\left(\sum_{k=1}^{K} \sum_{\nu=-\infty}^{+\infty} p(k, \nu, d_n|\theta)\right) \geq \sum_{k=1}^{K} \sum_{\nu=-\infty}^{+\infty} p(k, \nu|d_n, \theta^t)\, \log \frac{p(k, \nu, d_n|\theta)}{p(k, \nu|d_n, \theta^t)}, \tag{3.14}$$
where we define
$$p(k, \nu|d_n, \theta^t) = \frac{p(k, \nu, d_n|\theta^t)}{\sum_{k=1}^{K} \sum_{\nu=-\infty}^{+\infty} p(k, \nu, d_n|\theta^t)}. \tag{3.15}$$
With (3.10) to (3.15), and dropping terms that do not depend on $\theta$, the auxiliary function $Q$ can be expanded as
$$Q(\theta|\theta^t) = \sum_n \sum_k \sum_\nu p(k, \nu|d_n, \theta^t)\, f(a_n)\, \log p(k, \nu, d_n|\theta) + \log p(\alpha). \tag{3.16}$$
The partial derivative of $Q(\theta|\theta^t)$ with respect to $\mu_k$ is
$$\frac{\partial Q(\theta|\theta^t)}{\partial \mu_k} = \sum_n \sum_\nu f(a_n)\, p(k, \nu|d_n, \theta^t)\, \frac{\partial}{\partial \mu_k} \log p(k, \nu, d_n|\theta), \tag{3.17}$$
where the last factor of (3.17) can be written, based on (3.11), as
$$\frac{\partial}{\partial \mu_k} \log p(k, \nu, d_n|\theta) = \frac{\partial}{\partial \mu_k}\left(\log \frac{\alpha_k}{\sqrt{2\pi\sigma_k^2}} - \frac{(d_n + 2\pi\nu - \mu_k)^2}{2\sigma_k^2}\right) = \frac{d_n + 2\pi\nu - \mu_k}{\sigma_k^2}. \tag{3.18}$$
Setting $\frac{\partial Q(\theta|\theta^t)}{\partial \mu_k} = 0$ and inserting (3.18) in (3.17), we have
$$\sum_n \sum_\nu f(a_n)\, p(k, \nu|d_n, \theta^t)\, \frac{d_n + 2\pi\nu - \mu_k}{\sigma_k^2} \stackrel{!}{=} 0. \tag{3.19}$$
Now we solve (3.19) for $\mu_k$:
$$\frac{\mu_k}{\sigma_k^2} \sum_n \sum_\nu f(a_n)\, p(k, \nu|d_n, \theta^t) = \sum_n \sum_\nu f(a_n)\, p(k, \nu|d_n, \theta^t)\, \frac{d_n + 2\pi\nu}{\sigma_k^2}. \tag{3.20}$$
Multiplying both sides by $\sigma_k^2$ yields
$$\mu_k \sum_n \sum_\nu f(a_n)\, p(k, \nu|d_n, \theta^t) = \sum_n \sum_\nu f(a_n)\, p(k, \nu|d_n, \theta^t)\, (d_n + 2\pi\nu). \tag{3.21}$$
Finally, we obtain
$$\mu_k^{t+1} = \frac{\sum_n \sum_\nu f(a_n)\, p(k, \nu|d_n, \theta^t)\, (d_n + 2\pi\nu)}{\sum_n \sum_\nu f(a_n)\, p(k, \nu|d_n, \theta^t)}. \tag{3.22}$$
By setting $\frac{\partial Q(\theta|\theta^t)}{\partial \sigma_k^2} = 0$, we analogously obtain the update
$$(\sigma_k^2)^{t+1} = \frac{\sum_n \sum_\nu f(a_n)\, p(k, \nu|d_n, \theta^t)\, (d_n + 2\pi\nu)^2}{\sum_n \sum_\nu f(a_n)\, p(k, \nu|d_n, \theta^t)} - (\mu_k^{t+1})^2. \tag{3.23}$$
In addition, using the method of Lagrange multipliers with the constraint $\sum_{k=1}^{K} \alpha_k = 1$, the partial derivative of $Q(\theta|\theta^t)$ with respect to $\alpha_k$ is calculated from
$$\frac{\partial}{\partial \alpha_k}\left(Q(\theta|\theta^t) - \lambda_L\left(\sum_k \alpha_k - 1\right)\right) = \frac{\partial}{\partial \alpha_k} Q(\theta|\theta^t) - \lambda_L, \tag{3.24}$$
with
$$\frac{\partial}{\partial \alpha_k} Q(\theta|\theta^t) = \underbrace{\sum_n \sum_\nu f(a_n)\, p(k, \nu|d_n, \theta^t)\, \frac{\partial}{\partial \alpha_k} \log p(k, \nu, d_n|\theta)}_{\mathrm{(I)}} + \underbrace{\frac{\partial}{\partial \alpha_k} \log p(\alpha)}_{\mathrm{(II)}}, \tag{3.25}$$
where the derivative in (I) of (3.25) is computed as
$$\frac{\partial}{\partial \alpha_k} \log p(k, \nu, d_n|\theta) = \frac{\partial}{\partial \alpha_k}\left(\log \frac{\alpha_k}{\sqrt{2\pi\sigma_k^2}} - \frac{(d_n + 2\pi\nu - \mu_k)^2}{2\sigma_k^2}\right) = \frac{\partial}{\partial \alpha_k}\left(\log \alpha_k - \log\sqrt{2\pi\sigma_k^2}\right) = \frac{1}{\alpha_k}, \tag{3.26}$$
and (II) is calculated as
$$\frac{\partial}{\partial \alpha_k} \log p(\alpha) = \frac{\partial}{\partial \alpha_k} \log\left(\frac{1}{B(\phi)} \prod_{k=1}^{K} \alpha_k^{\phi-1}\right) = \frac{\partial}{\partial \alpha_k}\left(-\log B(\phi) + \sum_{k=1}^{K} (\phi - 1)\log \alpha_k\right) = (\phi - 1)\,\frac{1}{\alpha_k}. \tag{3.27}$$
Inserting the results for (I) and (II) in (3.25) yields
$$\frac{\partial}{\partial \alpha_k} Q(\theta|\theta^t) = \sum_n \sum_\nu f(a_n)\, p(k, \nu|d_n, \theta^t)\, \frac{1}{\alpha_k} + (\phi - 1)\,\frac{1}{\alpha_k}. \tag{3.28}$$
Setting (3.28) minus $\lambda_L$ to zero,
$$\frac{\partial}{\partial \alpha_k} Q(\theta|\theta^t) - \lambda_L = \frac{1}{\alpha_k}\left((\phi - 1) + \sum_n \sum_\nu f(a_n)\, p(k, \nu|d_n, \theta^t)\right) - \lambda_L \stackrel{!}{=} 0, \tag{3.29}$$
so $\alpha_k$ depends on $\lambda_L$ as
$$\alpha_k = \frac{1}{\lambda_L}\left((\phi - 1) + \sum_n \sum_\nu f(a_n)\, p(k, \nu|d_n, \theta^t)\right). \tag{3.30}$$
The derivative with respect to $\lambda_L$ recovers the constraint,
$$\frac{\partial}{\partial \lambda_L}\left(Q(\theta|\theta^t) - \lambda_L\left(\sum_k \alpha_k - 1\right)\right) = \sum_k \alpha_k - 1 \stackrel{!}{=} 0. \tag{3.31}$$
Inserting (3.30) in (3.31), we have
$$\sum_k \frac{1}{\lambda_L}\left((\phi - 1) + \sum_n \sum_\nu f(a_n)\, p(k, \nu|d_n, \theta^t)\right) - 1 = 0. \tag{3.32}$$
Solving (3.32) for $\lambda_L$ gives
$$\lambda_L = \sum_k \left((\phi - 1) + \sum_n \sum_\nu f(a_n)\, p(k, \nu|d_n, \theta^t)\right). \tag{3.33}$$
The weight parameter $\alpha_k$ is finally calculated as
$$\alpha_k = \frac{(\phi - 1) + \sum_n \sum_\nu f(a_n)\, p(k, \nu|d_n, \theta^t)}{K(\phi - 1) + \sum_k \sum_n \sum_\nu f(a_n)\, p(k, \nu|d_n, \theta^t)} = \frac{(\phi - 1) + \sum_n \sum_\nu f(a_n)\, p(k, \nu|d_n, \theta^t)}{K(\phi - 1) + \sum_n f(a_n) \sum_k \sum_\nu p(k, \nu|d_n, \theta^t)}, \tag{3.34}$$
where $\sum_k \sum_\nu p(k, \nu|d_n, \theta^t) = 1$, so
$$\alpha_k = \frac{1}{K(\phi - 1) + \sum_n f(a_n)}\left((\phi - 1) + \sum_n \sum_\nu f(a_n)\, p(k, \nu|d_n, \theta^t)\right). \tag{3.35}$$
Because $\sum_n f(a_n) = \frac{\sum_n c\, a_n}{\sum_n a_n} = c$, this simplifies to
$$\alpha_k = \frac{1}{K(\phi - 1) + c}\left((\phi - 1) + \sum_n \sum_\nu f(a_n)\, p(k, \nu|d_n, \theta^t)\right). \tag{3.36}$$
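The update equations (3.15), (3.22), (3.23) and (3.36) can be collected into a single MAP-EM iteration, sketched below; the truncation of the sum over ν and the function interface are simplifying assumptions, not the authors' implementation.

```python
import numpy as np

def map_em_step(d, f_a, alpha, mu, var, phi, nus=(-1, 0, 1)):
    """One MAP-EM iteration for the wrapped phase GMM with Dirichlet prior.
    d: observed DoAs in [-pi, pi), f_a: power weights f(a_n),
    alpha, mu, var: current GMM parameters, phi: Dirichlet parameter."""
    nus = np.array(nus)
    K = len(alpha)
    # joint probabilities p(k, nu, d_n | theta), Eq. (3.11), shape (N, K, V)
    z = d[:, None, None] + 2 * np.pi * nus[None, None, :] - mu[None, :, None]
    joint = alpha[None, :, None] * np.exp(-z**2 / (2 * var[None, :, None])) \
            / np.sqrt(2 * np.pi * var[None, :, None])
    # E step: posteriors p(k, nu | d_n, theta), Eq. (3.15)
    post = joint / joint.sum(axis=(1, 2), keepdims=True)
    w = f_a[:, None, None] * post                    # weighted posteriors
    # M step: Eqs. (3.22), (3.23), (3.36)
    dk = d[:, None, None] + 2 * np.pi * nus[None, None, :]
    norm = w.sum(axis=(0, 2))
    mu_new = (w * dk).sum(axis=(0, 2)) / norm
    var_new = (w * dk**2).sum(axis=(0, 2)) / norm - mu_new**2
    c = f_a.sum()                                    # sum_n f(a_n) = c
    alpha_new = ((phi - 1) + norm) / (K * (phi - 1) + c)
    return alpha_new, mu_new, var_new
```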
Chapter 4
Split and Merge Method
This chapter introduces the Split and Merge method [15]. Similar to the GMM clustering method described in Chapter 3, the goal of the Split and Merge method is to estimate the number of sources and the Gaussian mixture parameters from an observed set of DoAs using the EM algorithm. The main difference between the two methods is how the overfitting problem is handled. Araki et al. applied a Dirichlet prior to the weight vector of the Gaussian components of the GMM, whereas the Split and Merge method introduces conditions to remove and merge extra Gaussian components as well as to prune data during the fitting process. The advantage of the Split and Merge method is its fast convergence compared with the standard EM algorithm used in the GMM clustering method with a Dirichlet prior. The pseudo code of the method flow is summarized in Table 4.1. We start with the Split and Merge method proposed in [15], which fits 2-D position estimates. Since the observed data in this thesis are 1-D, the author made some necessary modifications to the constraints and parameters, described in the following.
Split and Merge Method [15]

Initialization
1. Select the number of initial Gaussian components to represent the observed DoA data by a GMM; this number corresponds to the maximum number of sources.
2. Initialize the GMM by k-means clustering.

Repeat
1. Perform the standard EM algorithm for 100 iterations.
2. Estimate the number of components:
   2 (a). Remove Gaussian components.
   2 (b). Merge Gaussian components.
3. Prune the training data.

Until the difference of the GMM parameter values between two iterations is sufficiently small. Then run a final Mahalanobis distance-based merging step.

Table 4.1: Pseudo code of the Split and Merge method [15].

4.1 Tolerance Region of a Gaussian Distribution

For a Gaussian distribution with mean $\mu$ and variance $\sigma^2$, the probability mass $p_\mathrm{mass}$ is given by the cumulative distribution function, which yields the probability that the value $x$ lies in a certain interval $[\mu - z_p\sigma,\ \mu + z_p\sigma]$ [1]. Such an interval with a certain value of $p_\mathrm{mass}$ is called a tolerance interval [15]; TR is the abbreviation for Tolerance Region. The blue area in Fig. 4.1 illustrates this tolerance interval. The cumulative probability $p_\mathrm{mass}$ can be written as
$$p_\mathrm{mass} = p\left(-z_p \leq \frac{x - \mu}{\sigma} \leq z_p\right) = p\left(\frac{(x - \mu)^2}{\sigma^2} \leq z_p^2\right), \tag{4.1}$$
where $z_p$ is called the standard score of the Gaussian distribution. With $\Psi_{p_\mathrm{mass}} = z_p^2$, $p_\mathrm{mass}$ can be represented as
$$p\left(\frac{(x - \mu)^2}{\sigma^2} \leq z_p^2\right) = p\left(\frac{(x - \mu)^2}{\sigma^2} \leq \Psi_{p_\mathrm{mass}}\right). \tag{4.2}$$
Given the value of $p_\mathrm{mass}$, $z_p$ as well as $\Psi_{p_\mathrm{mass}}$ can be obtained from the z-table of the Gaussian distribution in [1].

Figure 4.1: A Gaussian distribution; the blue area represents the tolerance region $[\mu - z_p\sigma,\ \mu + z_p\sigma]$. The cumulative probability $p_\mathrm{mass}$ is calculated by integrating the PDF of the distribution over the tolerance region [1].
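A small sketch of Eqs. (4.1)-(4.2) follows, using scipy's Gaussian quantile function instead of a printed z-table (an implementation convenience, not part of [15]).

```python
from scipy.stats import norm

def psi(p_mass):
    """Squared standard score Psi_pmass = z_p^2 for a two-sided
    tolerance region containing probability p_mass, Eq. (4.2)."""
    z_p = norm.ppf(0.5 + p_mass / 2)   # two-sided Gaussian quantile
    return z_p ** 2

def in_tolerance_region(x, mu, var, p_mass=0.95):
    """Check whether x lies inside the tolerance region of N(mu, var)."""
    return (x - mu) ** 2 / var <= psi(p_mass)
```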
4.2 The Proposed Split and Merge Method
Given the vector of observed DoAs $d = [d_1, \dots, d_N]$, the GMM parameter set $\theta$ can be estimated with the EM algorithm [5]. Taseska et al. implemented the Split and Merge method to update the number of sources by removing redundant components and merging components that model the same source.
4.2.1 Removal of Gaussian Components
The first step in the Split and Merge approach is a component removal step, in which Taseska et al. defined three empirical criteria C1, C2 and C3 to decide whether the $k$-th Gaussian component in the current iteration models a source.

The first criterion C1 is a variance criterion, which aims to remove Gaussian components with significantly larger variance than the other components in the mixture, because a component with a large variance is assumed to model noise [15]. The criterion C1 is written as
$$C_1 : \sigma_k < c_\mathrm{var} \cdot \sigma_{k^*}, \qquad k^* = \arg\min_{k'} (\sigma_{k'}), \tag{4.3}$$
where $c_\mathrm{var}$ is a constant which bounds the standard deviation of the Gaussian components from above, and $k^*$ is the Gaussian component with the minimal variance in the current iteration.
The second criterion C2 aims at removing components whose standard deviation falls below the lower limit $c_\mathrm{lm}$:
$$C_2 : \sigma_k < c_\mathrm{lm}, \tag{4.4}$$
where $c_\mathrm{lm}$ is a pre-defined constant. The third criterion C3 aims to remove components whose tolerance region includes at least two means of other Gaussian components. For example, in Fig. 4.2 the blue area is the tolerance region of the middle Gaussian component with mean $\mu_2$. The means $\mu_1$ and $\mu_3$ of the other two components clearly lie within this blue area, so the middle component will be removed. Formally this criterion is written as
$$C_3 : (\mu_{k'} - \mu_k)^2 / \sigma_k^2 \leq \Psi_{p_\mu}, \tag{4.5}$$
where $\Psi_{p_\mu}$ is the Mahalanobis distance of the tolerance region defined and calculated with a probability $p_\mu = 0.9$. If the Mahalanobis distance between $\mu_k$ and $\mu_{k'}$ is smaller than $\Psi_{p_\mu}$ for at least two values of $k'$, component $k$ is removed.
Figure 4.2: Three Gaussian components with means µ1, µ2 and µ3; TR2 represents the tolerance region of the Gaussian component in the middle. According to criterion C3, this middle component will be removed.

4.2.2 Merging Gaussian Components

If the means of two Gaussian components are located close to each other, these two components are likely to model the same source and should be merged. This merging condition is defined as
$$\| \mu_{k'} - \mu_k \|_2 < c_\mu, \tag{4.6}$$
where $\| \cdot \|_2$ is the Euclidean norm and $c_\mu$ is a user-defined constant. If condition (4.6) is satisfied, components $k'$ and $k$ are merged into one component with
$$\mu_\mathrm{merge} = \frac{\mu_{k'} + \mu_k}{2}, \qquad \sigma_\mathrm{merge} = \frac{\sigma_{k'} + \sigma_k}{2}, \qquad \alpha_\mathrm{merge} = \alpha_{k'} + \alpha_k. \tag{4.7}$$
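The merging rule of Eqs. (4.6)-(4.7) can be sketched as a pairwise scan over the current component list; the scan order is an assumption.

```python
import numpy as np

def merge_components(alpha, mu, sigma, c_mu):
    """Merge pairs of components whose means are closer than c_mu,
    using the moment rules of Eq. (4.7)."""
    alpha, mu, sigma = list(alpha), list(mu), list(sigma)
    merged = True
    while merged:
        merged = False
        for k in range(len(mu)):
            for kp in range(k + 1, len(mu)):
                if abs(mu[kp] - mu[k]) < c_mu:          # condition (4.6)
                    mu[k] = (mu[kp] + mu[k]) / 2        # Eq. (4.7)
                    sigma[k] = (sigma[kp] + sigma[k]) / 2
                    alpha[k] = alpha[kp] + alpha[k]
                    del mu[kp], sigma[kp], alpha[kp]
                    merged = True
                    break
            if merged:
                break
    return np.array(alpha), np.array(mu), np.array(sigma)
```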
4.2.3 Pruning Training Data
After the component removal and merging steps, redundant data points, which do not belong to any remaining component, are removed. More precisely, if a DoA estimate $d_n$ does not lie in the tolerance region of any remaining Gaussian component, it is removed for the next iteration. This is called data pruning. Fig. 4.3 illustrates this with an example in which two Gaussian components remain after the removal and merging steps; the data points in the orange areas will be removed in this pruning step, since they do not belong to the tolerance region of any Gaussian component. The pruning condition for removing $d_n$ is
$$\frac{(d_n - \mu_k)^2}{\sigma_k^2} \geq \Psi_{p_\mathrm{mass}} \quad \text{for all components } k, \tag{4.8}$$
where $\Psi_{p_\mathrm{mass}}$ is calculated with the given cumulative probability $p_\mathrm{mass} = 0.95$.
Figure 4.3: Two Gaussian components, where TR1 and TR2 represent the tolerance regions of the components. The data in the blue areas are included in the tolerance regions, while the data in the orange areas lie outside the tolerance regions.
4.2.4 Final Merging Step After Convergence
After repeating the steps EM algorithm, removal of extra components, merging of multiple components and data pruning until convergence, a final merging step is performed to make sure that each cluster is modeled by only one Gaussian component. Two Gaussian components $k$ and $k'$ are merged into one component if one of the following conditions holds:
$$\frac{(\mu_k - \mu_{k'})^2}{\sigma_k^2} \leq \Psi_{p_\mathrm{merge}}, \quad \text{or} \tag{4.9}$$
$$\frac{(\mu_k - \mu_{k'})^2}{\sigma_{k'}^2} \leq \Psi_{p_\mathrm{merge}}. \tag{4.10}$$
Similar to $\Psi_{p_\mathrm{mass}}$ above, $\Psi_{p_\mathrm{merge}}$ is calculated in the same way with a given probability $p_\mathrm{merge} = 0.95$.
In summary, this chapter introduced the Split and Merge method [15] for data clustering. It reduces the number of iterations of the EM algorithm by removing redundant components and data points in the M-step. The performance of the Split and Merge method will be evaluated in the experimental chapter.
Chapter 5
Infinite Gaussian Mixture Model
This chapter introduces the Infinite Gaussian Mixture Model (IGMM) clustering method proposed in [12]. As described in Chapters 3 and 4, Araki et al. and Taseska et al. fitted the observed DoAs with a finite number of Gaussian components. Rasmussen considered that it is not necessary to limit the number of components a priori to a finite value [12] and pointed out that the main disadvantage of finite GMM clustering methods is that the correct number of mixture components has to be defined before clustering. Unfortunately, this is not possible in many situations, because the number of clusters is unknown.

From the experience of implementing the GMM clustering method proposed by Araki et al., we know that the number of Gaussian components K specified in advance must exceed the maximum possible number of speech sources, which is 5 in our experiments. However, if this component number is too large, it brings the side effect of overfitting. The hierarchical non-parametric IGMM method by Rasmussen avoids this problem. In the proposed method, Rasmussen derived a hierarchical IGMM and sampled the model parameters with Gibbs sampling. For parameters whose density function is very difficult to sample from, he used the adaptive rejection sampling of [6]. The Chinese restaurant process representation of the Dirichlet process is used in the proposed method to limit the number of components and to assign observations to them.
5.1 GMM and IGMM
The finite GMM with $K$ components is written as
$$p(d|\mu_1, \dots, \mu_K, s_1, \dots, s_K, \alpha_1, \dots, \alpha_K) = \sum_{k=1}^{K} \alpha_k\, \mathcal{N}(\mu_k, s_k^{-1}), \tag{5.1}$$
where $d = [d_1, \dots, d_N]$ is the training data of length $N$, and $\mu_k$, $s_k$ and $\alpha_k$ are the mean, the precision and the mixing proportion of the $k$-th Gaussian component. The IGMM is obtained by letting the number of components $K \to \infty$. Rasmussen defined an indicator $z_n$ for each observation, taking values in $1, \dots, K$; $z_n = k$ indicates that the $k$-th component has generated the observation $d_n$. Given the indicators, the likelihood of an observation is
$$p(d_n|z_n = k, \mu_k, s_k) = \mathcal{N}(d_n|\mu_k, s_k^{-1}), \tag{5.2}$$
and $p(z_n = k)$ is the conditional prior of the components.
5.2 Component Parameters
This section specifies the priors on the component parameters and hyperparameters of the IGMM; the conditional distributions of the parameters needed for Gibbs sampling are also derived. The component means $\mu_k$ are given Gaussian priors,
$$p(\mu_k|\lambda, r) \sim \mathcal{N}(\lambda, r^{-1}), \tag{5.3}$$
where the mean $\lambda$ and the precision $r$ are hyperparameters common to all components. The hyperparameters themselves are given vague Normal and Gamma priors,
$$p(\lambda) \sim \mathcal{N}(\mu_d, \sigma_d^2), \qquad p(r) \sim \mathcal{G}(1, \sigma_d^{-2}) \propto r^{-1/2} \exp(-r\sigma_d^2/2), \tag{5.4}$$
where $\mu_d$ and $\sigma_d^2$ are the mean and variance of the observed data. The shape of the Gamma prior is set to unity, corresponding to a very broad distribution. The component precisions $s_k$ are given Gamma priors,
$$p(s_k|\beta, \omega) \sim \mathcal{G}(\beta, \omega^{-1}), \tag{5.5}$$
whose shape $\beta$ and mean $\omega^{-1}$ are themselves given priors of inverse Gamma and Gamma form,
$$p(\beta^{-1}) \sim \mathcal{G}(1, 1) \;\Rightarrow\; p(\beta) \propto \beta^{-3/2} \exp(-1/(2\beta)), \qquad p(\omega) \sim \mathcal{G}(1, \sigma_d^2). \tag{5.6}$$
Multiplying (5.2) by (5.3), we obtain the conditional posterior distribution of the means,
$$p(\mu_k|z, d, s_k, \lambda, r) \sim \mathcal{N}\left(\frac{\bar{d}_k l_k s_k + \lambda r}{l_k s_k + r},\ \frac{1}{l_k s_k + r}\right), \qquad \bar{d}_k = \frac{1}{l_k} \sum_{n: z_n = k} d_n, \tag{5.7}$$
where $l_k$ is the number of observations belonging to component $k$ and $\bar{d}_k$ is the mean of these observations. Similarly, the conditional posterior distribution of the precisions is obtained by multiplying Eq. (5.5) and Eq. (5.2) conditioned on the indicators $z_n$:
$$p(s_k|z, d, \mu_k, \beta, \omega) \sim \mathcal{G}\left(\beta + l_k,\ \left[\frac{1}{\beta + l_k}\left(\omega\beta + \sum_{n: z_n = k} (d_n - \mu_k)^2\right)\right]^{-1}\right). \tag{5.8}$$
Combining (5.3) and (5.7), we obtain the conditional posteriors of the hyperparameters $\lambda$ and $r$:
$$p(\lambda|\mu_1, \dots, \mu_K, r) \sim \mathcal{N}\left(\frac{\mu_d\, \sigma_d^{-2} + r \sum_{k=1}^{K} \mu_k}{\sigma_d^{-2} + Kr},\ \frac{1}{\sigma_d^{-2} + Kr}\right), \tag{5.9}$$
$$p(r|\mu_1, \dots, \mu_K, \lambda) \sim \mathcal{G}\left(K + 1,\ \left[\frac{1}{K + 1}\left(\sigma_d^2 + \sum_{k=1}^{K} (\mu_k - \lambda)^2\right)\right]^{-1}\right). \tag{5.10}$$
Similarly, we obtain the conditional posteriors of the hyperparameters $\omega$ and $\beta$ by combining Eqs. (5.5), (5.6) and (5.8):
$$p(\omega|s_1, \dots, s_K, \beta) \sim \mathcal{G}\left(K\beta + 1,\ \left[\frac{1}{K\beta + 1}\left(\sigma_d^{-2} + \beta\sum_{k=1}^{K} s_k\right)\right]^{-1}\right), \tag{5.11}$$
$$p(\beta|s_1, \dots, s_K, \omega) \propto \Gamma\left(\frac{\beta}{2}\right)^{-K} \exp\left(\frac{-1}{2\beta}\right)\left(\frac{\beta}{2}\right)^{(K\beta - 3)/2} \prod_{k=1}^{K} (s_k\omega)^{\beta/2} \exp\left(-\frac{\beta s_k \omega}{2}\right). \tag{5.12}$$
It is very difficult to sample $\beta$ directly from the density of Eq. (5.12), but $p(\log(\beta)|s_1, \dots, s_K, \omega)$ is log-concave, so we can generate independent samples of $\log(\beta)$ using the adaptive rejection sampling technique and transform them to obtain values of $\beta$.
The mixing weights $\alpha_k$ of the components are given a symmetric Dirichlet prior with concentration parameter $\phi/K$,
$$p(\alpha_1, \dots, \alpha_K|\phi) \sim \mathrm{Dirichlet}(\phi/K, \dots, \phi/K) = \frac{\Gamma(\phi)}{\Gamma(\phi/K)^K} \prod_{k=1}^{K} \alpha_k^{\phi/K - 1}, \tag{5.13}$$
and the concentration parameter $\phi$ is given a prior of inverse Gamma shape,
$$p(\phi^{-1}) \sim \mathcal{G}(1, 1) \;\Rightarrow\; p(\phi) \propto \phi^{-3/2} \exp(-1/(2\phi)). \tag{5.14}$$
Given the mixing proportions, the joint distribution of the indicators $z_n$ is written as
$$p(z_1, \dots, z_N|\alpha_1, \dots, \alpha_K) = \prod_{k=1}^{K} \alpha_k^{l_k}, \qquad l_k = \sum_{n=1}^{N} \delta_\mathrm{Kronecker}(z_n, k). \tag{5.15}$$
We can integrate out the mixing proportions using the standard Dirichlet integral,
$$p(z_1, \dots, z_N|\phi) = \int p(z_1, \dots, z_N|\alpha_1, \dots, \alpha_K)\, p(\alpha_1, \dots, \alpha_K)\, d\alpha_1 \cdots d\alpha_K = \frac{\Gamma(\phi)}{\Gamma(\phi/K)^K} \int \prod_{k=1}^{K} \alpha_k^{l_k + \phi/K - 1}\, d\alpha_k = \frac{\Gamma(\phi)}{\Gamma(N + \phi)} \prod_{k=1}^{K} \frac{\Gamma(l_k + \phi/K)}{\Gamma(\phi/K)}. \tag{5.16}$$
Combining (5.14) and (5.16), we obtain
$$p(l_1, \dots, l_K|\phi) = \frac{\phi^K\, \Gamma(\phi)}{\Gamma(N + \phi)}, \qquad p(\phi|K, N) \propto \frac{\phi^{K - 3/2} \exp(-1/(2\phi))\, \Gamma(\phi)}{\Gamma(N + \phi)}. \tag{5.17}$$
Because the distribution $p(\log(\phi)|K, N)$ is log-concave, samples of this distribution can easily be generated using adaptive rejection sampling.
5.3 Adaptive Rejection Sampling
Gibbs sampling is a well-known Markov chain Monte Carlo (MCMC) technique to generate samples from complicated multivariate probability distributions. It is commonly used to update the value of one variable from its conditional distribution given all other variables in the system [5].

For example, suppose we want to sample from the indicator distribution $p(z) = p(z_1, \dots, z_N)$. In each step of Gibbs sampling, we draw a value from the distribution $p(z_i|z_{\setminus i})$ to replace the original value $z_i$, where $z_i$ denotes the $i$-th component of $z$, $z_{\setminus i}$ denotes $[z_1, \dots, z_{i-1}, z_{i+1}, \dots, z_N]$ and $p(z_i|z_{\setminus i})$ is the distribution of $z_i$ conditioned on all other variables. The procedure of Gibbs sampling to obtain $T$ samples of $p(z)$ is summarized in Table 5.1.
Gibbs Sampling

1. Initialize $z_i$, $i = 1, \dots, N$.
2. For $\tau = 1, \dots, T$:
   Sample $z_1^{\tau+1} \sim p(z_1|z_2^\tau, z_3^\tau, \dots, z_N^\tau)$.
   Sample $z_2^{\tau+1} \sim p(z_2|z_1^{\tau+1}, z_3^\tau, \dots, z_N^\tau)$.
   ...
   Sample $z_j^{\tau+1} \sim p(z_j|z_1^{\tau+1}, \dots, z_{j-1}^{\tau+1}, z_{j+1}^\tau, \dots, z_N^\tau)$.
   ...
   Sample $z_N^{\tau+1} \sim p(z_N|z_1^{\tau+1}, \dots, z_{N-1}^{\tau+1})$.

Table 5.1: Summary of the steps in Gibbs sampling [5].
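As a toy illustration of Table 5.1 (deliberately simpler than the IGMM sampler), the following sketch Gibbs-samples a zero-mean bivariate Gaussian with correlation ρ, where both full conditionals are known in closed form.

```python
import numpy as np

def gibbs_bivariate_gaussian(rho, T=1000, seed=0):
    """Toy Gibbs sampler for a zero-mean bivariate Gaussian with
    correlation rho, cycling through the two full conditionals."""
    rng = np.random.default_rng(seed)
    z1, z2 = 0.0, 0.0
    samples = np.empty((T, 2))
    cond_std = np.sqrt(1 - rho**2)
    for t in range(T):
        z1 = rng.normal(rho * z2, cond_std)   # z1 | z2 ~ N(rho*z2, 1-rho^2)
        z2 = rng.normal(rho * z1, cond_std)   # z2 | z1 ~ N(rho*z1, 1-rho^2)
        samples[t] = (z1, z2)
    return samples
```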
However, for a hierarchical model, Gibbs sampling from a conditional distribution can be computationally very expensive, because the conditional distribution may include the product of many terms [6]. To improve the computational efficiency of Gibbs sampling, Gilks and Wild proposed the adaptive rejection sampling (ARS) technique in [6].
To introduce ARS, we first introduce the idea of rejection sampling. Suppose we want to sample from a complex density $p(z)$ on its domain $D$; because $p(z)$ is complicated, it is very difficult to sample from directly. However, consider another function $g(z)$ with $g(z) = b\, p(z)$ for some possibly unknown constant $b$, i.e., $g(z) \propto p(z)$. To sample $T$ points from $p(z)$, we define an envelope function $g_e(z)$ with $g_e(z) \geq g(z)$ and a squeezing function $g_l(z)$ with $g_l(z) \leq g(z)$ for all $z$ in $D$. The steps of the rejection sampling algorithm are described in Table 5.2.
Non-adaptive Rejection Sampling

1. Define an envelope function $g_e(z)$ with $g_e(z) \geq g(z)$ for all $z$ in $D$.
2. Define a squeezing function $g_l(z)$ with $g_l(z) \leq g(z)$ for all $z$ in $D$.

Repeat
1. Sample a value $z^*$ from the density proportional to $g_e(z)$.
2. Sample a value $u$ independently from the uniform distribution on $[0, 1]$.
3. Perform the rejection test:
   if $u \leq g_l(z^*)/g_e(z^*)$: $z^*$ is accepted (squeeze test)
   else if $u \leq g(z^*)/g_e(z^*)$: $z^*$ is accepted
   else: $z^*$ is rejected

Until $T$ values have been accepted.

Table 5.2: Summary of the steps in non-adaptive rejection sampling [6].
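A minimal sketch of the scheme in Table 5.2 for a bounded target follows, using a constant envelope and omitting the squeezing function (both simplifications for illustration).

```python
import numpy as np

def rejection_sample(p, z_lo, z_hi, p_max, T=100, seed=0):
    """Non-adaptive rejection sampling of Table 5.2 with a constant
    envelope g_e(z) = p_max over the bounded domain [z_lo, z_hi]."""
    rng = np.random.default_rng(seed)
    out = []
    while len(out) < T:
        z = rng.uniform(z_lo, z_hi)     # sample from the flat envelope
        u = rng.uniform(0.0, 1.0)
        if u <= p(z) / p_max:           # rejection test (no squeeze step here)
            out.append(z)
    return np.array(out)
```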
However, it can be very difficult to find a suitable $g_e(z)$ on $D$. Alternatively, the envelope function can be constructed on the fly from a few evaluations of the distribution $p(z)$ [6]. The basic idea of ARS is to form a piecewise-linear upper envelope function $g'(z)$ from a set of tangents to the logarithm of the target density $p(z)$ and to improve this envelope adaptively during the sampling process. The piecewise-linear upper envelope satisfies
$$g'(z) \geq \log p(z). \tag{5.18}$$
We evaluate $\log p(z)$ and $g'(z)$ at $\tau$ points $a_1 < a_2 < \dots < a_\tau$, collected in the vector $a$; the envelope $g'(z)$ is composed of the tangents of $\log p(z)$ at these points. Fig. 5.1 plots the construction of the envelope function. Each time a point $z = a$ is rejected, it is added to the evaluation points to improve the quality of $g'(z)$. Because $g'(z)$ is piecewise-linear, we can easily draw samples from the density proportional to $\exp g'(z)$, where $\exp g'(z) \geq p(z)$. As in rejection sampling, we also define a squeezing function $g'_l(z)$ to apply the squeeze test. The steps of ARS are summarized in Table 5.3.
Figure 5.1: Plot of ARS: the envelope function is built from the tangent lines of a log-concave function, computed at a set of evaluation points a [5].
Under the assumption that p(z) is log-concave, ARS is a computationally efficient sampling method because it reduces the number of function evaluations. Each sample is more likely to be accepted as the quality of the envelope function g′(z) improves.
Adaptive Rejection Sampling
1. Select the evaluation points a.
2. Form the piecewise-exponential upper bound exp g′(z) and the
   lower bound g′l(z) using the points a.
Repeat
   1. Sample a value z∗ from the normalized density exp g′(z)/∫ exp g′(z) dz.
   2. Sample a value u independently from the uniform distribution on (0, 1).
   3. Perform the rejection test:
      if u ≤ g′l(z∗)/exp g′(z∗), accept z∗ (squeeze test);
      else if u ≤ p(z∗)/exp g′(z∗), accept z∗;
      else reject z∗, add z∗ to a and recompute the upper and
      lower bounds exp g′(z) and g′l(z).
Until the required number of points has been accepted.
Table 5.3: Summary of steps in ARS [13].
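To make the construction concrete, the following Python sketch builds the tangent-based envelope and refines it adaptively for a log-concave toy target (a standard normal on an assumed bounded support; the squeeze test and the special case of a zero tangent slope are omitted for brevity):

import numpy as np

rng = np.random.default_rng(0)
log_p  = lambda z: -0.5 * z**2   # log of the target density (up to a constant)
dlog_p = lambda z: -z            # its derivative

def hull(a, lo, hi):
    # Tangents of log p at the sorted points a: slopes m, intercepts c.
    a = np.sort(a)
    m = dlog_p(a)
    c = log_p(a) - m * a
    # Breakpoints of the upper hull = intersections of consecutive tangents.
    x = (c[:-1] - c[1:]) / (m[1:] - m[:-1])
    edges = np.concatenate(([lo], x, [hi]))
    # Mass of exp(c_i + m_i z) on each segment (m_i = 0 would need a special case).
    mass = (np.exp(c + m * edges[1:]) - np.exp(c + m * edges[:-1])) / m
    return m, c, edges, mass

def sample_envelope(m, c, edges, mass):
    # Pick a segment proportional to its mass, then invert the CDF of the
    # truncated exponential on that segment.
    i = rng.choice(len(mass), p=mass / mass.sum())
    u = rng.uniform()
    z = np.log((1 - u) * np.exp(m[i] * edges[i]) + u * np.exp(m[i] * edges[i + 1])) / m[i]
    return z, i

def ars(a, lo=-6.0, hi=6.0, n=2000):
    a = np.array(a, float)
    out = []
    while len(out) < n:
        m, c, edges, mass = hull(a, lo, hi)
        z, i = sample_envelope(m, c, edges, mass)
        if np.log(rng.uniform()) <= log_p(z) - (c[i] + m[i] * z):  # u <= p(z)/exp g'(z)
            out.append(z)          # accept
        else:
            a = np.append(a, z)    # reject: refine the envelope at z
    return np.array(out)

draws = ars([-2.0, -0.5, 0.5, 2.0])
print(draws.mean(), draws.std())   # approx. 0 and 1 for the standard normal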
5.4 Chinese Restaurant Process
To generate the indicator variables zn, Rasmussen uses the Chinese restaurant process (CRP), which can be described briefly following [16]. Suppose there is a Chinese restaurant with an infinite number of tables, each of which can seat an infinite number of customers, see Fig. 5.2.
Figure 5.2: Simple diagram of the Chinese restaurant process; the tables are marked by the big circles, the customers are represented by the small squares.
The tables are chosen according to the following random process (a minimal simulation of this seating process is sketched below):
1. The first customer sits at the first table.
2. The second customer chooses to sit with the first customer at the first table or by himself at a new, empty table.
3. ...
4. The N-th customer chooses to sit by himself at a new table with probability φ/(N − 1 + φ), or at the k-th table with other customers with probability lk/(N − 1 + φ), where lk denotes the number of customers already sitting at the k-th table.
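The following Python sketch (illustrative only; φ is the concentration parameter) simulates this seating process directly:

import numpy as np

def crp(n_customers, phi, seed=0):
    # Simulate the table choices of N customers; tables[k] = l_k.
    rng = np.random.default_rng(seed)
    tables, seating = [], []
    for n in range(n_customers):            # customer n+1, so N - 1 = n
        probs = np.array(tables + [phi]) / (n + phi)
        k = rng.choice(len(probs), p=probs) # existing table k or a new one
        if k == len(tables):
            tables.append(1)                # open a new table
        else:
            tables[k] += 1
        seating.append(k)
    return seating, tables

seating, tables = crp(1000, phi=1.0)
print(len(tables))  # the number of occupied tables grows only slowly in N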
Note that a new customer is more likely to choose a table with many customers (the rich-gets-richer phenomenon), so the number of tables grows much more slowly than the number of customers. In our case of data clustering, the customers are the observed data and the tables in the restaurant correspond to the mixture components. The rich-gets-richer phenomenon means that we expect large clusters, i.e., the number of components K is smaller than the number of observations N [16]. Rasmussen uses this characteristic of the Chinese restaurant process to realize the infinite limit of the component number, so that we can work with a finite number of clusters. The limits of the conditional priors for the indicators can be written as
• components where l−n,k > 0 :
  p(zn = k | z−n, φ) = l−n,k / (N − 1 + φ) (5.19)
• new component:
  p(zn = knew | z−n, φ) = φ / (N − 1 + φ). (5.20)
By combining the component density function Eq. (5.2) with the prior limits Eqs. (5.19) and (5.20), we obtain the conditional posteriors for the indicators:
• components where l−n,k > 0 :
  p(zn = k | z−n, µk, sk, φ) ∝ p(zn = k | z−n, φ) p(dn | µk, sk, z−n)
  ∝ [l−n,k / (N − 1 + φ)] · sk^(1/2) exp(−sk (dn − µk)²/2) (5.21)
• new component:
  p(zn = knew | z−n, λ, r, β, ω, φ) ∝ [φ / (N − 1 + φ)] · ∫ p(dn | µk, sk) p(µk, sk | λ, r, β, ω) dµk dsk. (5.22)
The integral in Eq. (5.22) is not analytically tractable. Rasmussen therefore used the method described in [10], which suggests that this integral can be approximated by sampling from a normal-gamma distribution. After obtaining the conditional posteriors for the indicators, we re-assign all observations to the components; components left without data are removed after the re-assignment.
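As an illustration of how Eqs. (5.21) and (5.22) are used within one Gibbs step, the following sketch computes the indicator probabilities for a single observation dn; the intractable integral in Eq. (5.22) is replaced here by a simple Monte Carlo average over draws (µ, s) from the normal-gamma prior, which is an assumption made for illustration and not the exact scheme of [10]:

import numpy as np

def indicator_probs(d_n, counts, mus, ss, phi, mu_prior, s_prior):
    # counts[k] = l_{-n,k}; mus, ss are the current component means and precisions;
    # mu_prior, s_prior are draws from the normal-gamma prior p(mu, s | lambda, r, beta, omega).
    N = counts.sum() + 1
    lik = np.sqrt(ss) * np.exp(-ss * (d_n - mus) ** 2 / 2)
    p_exist = counts / (N - 1 + phi) * lik                            # Eq. (5.21)
    p_new = phi / (N - 1 + phi) * np.mean(
        np.sqrt(s_prior) * np.exp(-s_prior * (d_n - mu_prior) ** 2 / 2))  # Eq. (5.22), MC estimate
    p = np.append(p_exist, p_new)
    return p / p.sum()                                                # normalized posteriors

# The indicator z_n is then resampled, e.g. via
# z_n = np.random.default_rng().choice(len(p), p=p).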
5.5 Post processing for the IGMM Method
According to the experimental results, the IGMM clustering method also suffers from the "overfitting" problem, see Fig. 5.3. In order to obtain the final number of sources, some further steps similar to the split and merge method from [17] are applied after the Gibbs sampling.
Figure 5.3: Fitting result of the IGMM for 4 sources. The observed DoAs are plotted as white bars, the red stems show the true position of each source, and the fitting result is represented by the blue lines.
First, we define a threshold αlm = 0.1 as a lower limit on the component weights αk. A component k′ is removed if

αk′ < αlm. (5.23)
This criterion aims to remove the components with small weights. Next, the components with a large variance (more than 10 times the minimal variance of all mixture components) are removed. Finally, the components with closely located means are merged into one component. The results of applying this post-processing step after the iterations are illustrated in Fig. 5.4; the overfitting problem is resolved.

Figure 5.4: Fitting result of the IGMM for 4 sources with the extra split and merge step. The observed DoAs are plotted as white bars, the red stems show the true position of each source, and the fitting result is represented by the blue lines.
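A compact sketch of these three pruning rules is given below (function and parameter names are our own; circular wrap-around of the DoA means is ignored for brevity):

import numpy as np

def postprocess(w, mu, var, alpha_lm=0.1, var_factor=10.0, merge_deg=30.0):
    w, mu, var = map(np.asarray, (w, mu, var))
    # Rules 1 and 2: drop components with small weight or large variance.
    keep = (w >= alpha_lm) & (var <= var_factor * var.min())
    w, mu = w[keep], mu[keep]
    # Rule 3: merge components whose means lie closer than merge_deg.
    merged = []                      # entries: [weight sum, weighted mean sum]
    for i in np.argsort(mu):
        if merged and abs(mu[i] - merged[-1][1] / merged[-1][0]) < merge_deg:
            merged[-1][0] += w[i]
            merged[-1][1] += w[i] * mu[i]
        else:
            merged.append([w[i], w[i] * mu[i]])
    return [(W, M / W) for W, M in merged]   # (weight, mean) of surviving components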
In summary, this chapter introduced the IGMM method [12]. The observed DoA mixture is modeled by an infinite number of Gaussians, and the model parameters are estimated using Gibbs sampling. The final number of components remains finite due to the CRP. To avoid the "overfitting" problem, a post-processing step [17] is applied to the IGMM after the iterations.
Chapter 6
Experimental Results
This chapter presents the experimental results for the three clustering methods for counting acoustic sources: the GMM clustering method with Dirichlet prior, the Split and Merge method, and the IGMM method. The source counting accuracy of these three methods is tested with both artificially generated data and observed DoAs under different experimental conditions. The test results and conclusions for artificial data are presented in Section 6.1. In Section 6.2, the observed DoAs under different acoustic scenarios are used to evaluate the performance of the three algorithms.
6.1 Results with Artificial Data
This section presents the accuracy test results of the three clustering methods with artificial data. The tests with artificial data help to establish the viability of the clustering methods, since the artificial data is an idealized representation of the observed DoAs, i.e., the artificial data is closer to Gaussian. Also, the simulation tests with artificial data are much faster than tests with observed DoAs, which allows us to run each test hundreds of times in order to obtain better statistics. Moreover, some features of the artificial data can be easily modified, e.g., the data variance, which may influence the performance of the clustering algorithms. In summary, the simulation results on artificial data help to explain the fitting results of the clustering methods on the observed DoAs.
6.1.1 Test Scenarios and Artificial Training Data
First we introduce the test scenarios and the generation process of the artificial data. The artificial data of one artificial speech source are generated according to a Gaussian distribution N(µs, σ²s), where µs is the mean and σ²s is the variance of the data. The number of data points per artificial source is set to 1000.
We test the clustering methods with up to five artificial sources. Like the observed DoAs, the artificial data lie within the interval [−180°, 180°). Each artificial source location is randomly drawn from a uniform distribution of angles within this interval. Fig. 6.1 shows an example of a histogram of one artificial source with µs = 7°.
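A sketch of this data generation (function and parameter names are our own; the wrap-around at ±180° is ignored for simplicity):

import numpy as np

def artificial_sources(n_sources, sigma_deg=15.0, n_points=1000,
                       min_sep_deg=70.0, seed=0):
    rng = np.random.default_rng(seed)
    # Draw source means uniformly from [-180, 180) until the minimal
    # angular separation between neighboring sources is respected.
    while True:
        mu = np.sort(rng.uniform(-180.0, 180.0, n_sources))
        if n_sources == 1 or np.min(np.diff(mu)) >= min_sep_deg:
            break
    # 1000 Gaussian-distributed DoAs per source.
    data = np.concatenate([rng.normal(m, sigma_deg, n_points) for m in mu])
    return mu, data

mu_true, doas = artificial_sources(3)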
Figure 6.1: Histogram of artificial data for one artificial speech source. The columns represent the normalized probability of the artificial data and the red stem shows the true source location at µs = 7°.

The accuracy of source counting is calculated by running each test 100 times and
averaging the number of correct estimates. We define a criterion for an accurate estimation: if the estimated source number K equals the true source number Ns and each estimated mean µ lies near the true mean µs with ‖µ − µs‖ < 15°, we say that the applied method has accurately estimated the source number from the data mixture. Fig. 6.2 shows an example of an accurate source count using the Split and Merge method: there is only one Gaussian component, and its mean µ = 6.56° satisfies the condition ‖µ − µs‖ < 15°.
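This criterion can be expressed as a small check (a sketch; it matches estimated to true means by sorting, which suffices when the sources are well separated):

import numpy as np

def counting_correct(mu_est, mu_true, tol_deg=15.0):
    # Correct if the component count matches and every estimated mean
    # lies within tol_deg of its corresponding true source mean.
    if len(mu_est) != len(mu_true):
        return False
    diff = np.abs(np.sort(mu_est) - np.sort(mu_true))
    return bool(np.all(diff < tol_deg))

# Accuracy = fraction of correct trials, e.g. over 100 runs:
# accuracy = np.mean([counting_correct(run_once(), mu_true) for _ in range(100)])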
Figure 6.2: Histogram of artificial data of one source and the fitting result of the Split and Merge method. The columns represent the PDF of the artificial data and the red stem shows the true source location µs = 7°. The fitting result is plotted as the blue Gaussian component with mean µk = 6.56°.
In the following sections, the source counting results for the three methods are presented. The test scenario variables comprise different source numbers, SNRs, data variances σ²s, and angular differences between neighboring sources.
6.1.2 Convergence Condition for EM algorithm in the GMM
Clustering Method with DP
To start the evaluation of the GMM clustering method with Dirichlet prior, the initialization and the convergence condition of the EM algorithm need to be specified. In this test the EM algorithm is initialized with K = 8 Gaussian components, where the means µ of the components are initialized by applying the k-means clustering method [5] to the artificial data. The weights are uniformly initialized as αk = 1/K = 0.125, and the overall variance of the data is used as the initial variance σ²k of each Gaussian component.
Araki et al. [3] did not define a convergence condition for the EM algorithm. Experimental results for the case of one source, Ns = 1, are also not given in [3]. During the implementation we found that different convergence conditions do affect the experimental results. We tested three convergence conditions, checked at the end of the M-step:

ct1 : ‖µt+1 − µt‖ < 10^−20
ct2 : ‖µt+1 − µt‖ < 10^−30
ct3 : µt+1 = µt, (6.1)
where µt are the estimated means of the Gaussian components from the previous EM iteration and µt+1 are the means calculated in the current iteration. By running the evaluation for 100 trials, we calculate the source counting accuracy on artificial data for each scenario setting. Given the experiment settings in Tab. 6.1, the source counting accuracy in relation to the convergence conditions is presented in Fig. 6.3.
We can see that the convergence conditions do affect the source counting results, especially for the cases Ns = 1 and Ns = 2. This is because the EM algorithm reaches convergence too early (after only about 500 iterations) with the conditions ct1 and ct2, which leads to the "overfitting" problem. The stricter convergence condition ct3 increases the number of iteration steps and improves the source counting accuracy. We found that with more than 2000 iterations, the results of the GMM clustering method with Dirichlet prior are consistently good. Therefore, the following tests of the GMM clustering method use the convergence condition ct3 with a maximum of 2500 iteration steps.
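The three conditions of Eq. (6.1) can be checked after each M-step, e.g. with a small helper like this (a sketch with our own naming):

import numpy as np

def em_converged(mu_new, mu_old, condition="ct3"):
    # ct1/ct2: threshold on the change of the means; ct3: exact equality.
    if condition == "ct1":
        return np.max(np.abs(mu_new - mu_old)) < 1e-20
    if condition == "ct2":
        return np.max(np.abs(mu_new - mu_old)) < 1e-30
    return np.array_equal(mu_new, mu_old)   # ct3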
Source Number SNR in dB Standard Deviation Minimal Angular Distance
1,2,3,4,5 30 15◦ 70◦
Table 6.1: Experimental variables for the evaluation of the influence of the convergence condition.
Figure 6.3: Source counting accuracy for artificial data in % using different convergence conditions, over different numbers of speech sources Ns = 1, . . . , 5.
For the Split and Merge method we use the convergence condition ct1 with ‖µt+1 − µt‖ < 10^−20 for the EM algorithm, because the fast convergence (within about 300 steps) is a main advantage of the Split and Merge method.
K Initial number of components 8
ν Parameter for wrapped phase model -1,0,1
φ Parameter of the Dirichlet prior 0.001
c Parameter for the power weight function Data length
σbig Parameter for removing components with big σ 20◦
αlm Parameter for removing components with small α 0.1
Table 6.2: Parameters of the GMM clustering method with Dirichlet prior.
6.1.3 Parameters for Clustering Methods
This section gives the parameter settings of each clustering method for the evaluation of the source counting accuracy. Note that for artificial data we do not have STFT values, so the parameter c of the power weight function, which is written as

f(an) = c an / Σ_{n=1}^{N} an , (6.2)

is set equal to the length of the artificial data (the value of c follows from the derivation of the EM algorithm in Chapter 3). In the post-processing step of the GMM method and the IGMM method, components with a large variance (above σbig) or a very small weight (below αlm) are removed. The parameters of the three methods are given in Tab. 6.2, Tab. 6.3 and Tab. 6.4.
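For reference, Eq. (6.2) amounts to the following small function (a sketch; for artificial data the power an has no STFT counterpart, so c defaults to the data length N):

import numpy as np

def power_weights(a, c=None):
    # Weight each observation n by its normalized power a_n, scaled by c.
    a = np.asarray(a, dtype=float)
    if c is None:
        c = len(a)          # artificial data: c equals the data length
    return c * a / a.sum()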
6.1.4 The Effect of Varying Angular Difference
The angular difference between neighboring artificial sources is also a factor of interest. With the experiment settings in Tab. 6.5, the estimation accuracy of the three clustering methods in relation to the minimal angular difference between neighboring sources is presented in Fig. 6.4.
K Initial number of components 8
cvar Parameter for upper limit of standard deviation 2
clm Parameter for lower limit of standard deviation 2.5°
pµ Probability mass for condition C3 in Chapter 4 0.9
cµ Parameter for merging components 23◦
pmass Probability mass for data pruning 0.95
pmerge Probability mass for the final merging step after convergence 0.95
αlm Parameter for removing components with very small weight 0.1
Table 6.3: Parameters of the Split and Merge method.
cvar Parameter for upper limit of standard deviation 10
cµ Parameter for merging components 30◦
αlm Parameter for removing components with very small weight 0.1
Table 6.4: Parameters of the IGMM method.
The performance of the Split and Merge method and the IGMM method increases with increasing minimal angular distance between neighboring sources. The angular difference between two neighboring sources affects the estimation accuracy most strongly for the case Ns = 5. The source counting accuracy for Ns = 5 improves sharply once the angular difference between sources exceeds a threshold (50° for the Split and Merge method and the IGMM method, 30° for the GMM clustering method). Since we estimate up to five speech sources in the experiment, the minimal angular difference between neighboring sources should be more than 65°.
Source Number SNR in dB Standard Deviation Minimal Angular Distance
1,2,3,4,5 30 15° 20°, 30°, 40°, 50°, 60°, 70°
Table 6.5: Experimental variables for the evaluation of the influence of the minimal angular distance between neighboring sources.
Figure 6.4: Accuracy of source number estimation in relation to the angular difference between neighboring artificial sources, for the GMM with Dirichlet prior, the Split and Merge method and the IGMM (Ns = 1, . . . , 5).
6.1.5 Source Counting Accuracy with Varying SNR Values
Speech signals in reality always contain noise; therefore, we also test the clustering methods with data of varying signal-to-noise ratio (SNR). The SNR compares the level of a desired signal to the level of the background noise and is defined as

SNR = 10 log10 (Psignal / Pnoise), (6.3)

where Psignal/Pnoise is the ratio of the signal power to the background noise power.
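Equivalently, test data at a prescribed SNR can be produced by scaling white Gaussian noise to the power implied by Eq. (6.3) (a sketch with our own naming):

import numpy as np

def add_noise(signal, snr_db, seed=0):
    rng = np.random.default_rng(seed)
    p_signal = np.mean(signal ** 2)                 # P_signal
    p_noise = p_signal / 10.0 ** (snr_db / 10.0)    # from SNR = 10 log10(Ps/Pn)
    return signal + rng.normal(0.0, np.sqrt(p_noise), size=signal.shape)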
With the experiment settings in Tab. 6.6, the estimation accuracy in relation to the SNR of the artificial data is presented in Fig. 6.5. The results show that the performance of the clustering methods improves with increasing SNR. For the cases Ns = 4 and Ns = 5, none of the methods is able to estimate the correct number of sources at SNR = 10 dB.
Source Number SNR in dB Standard Deviation Minimal Angular Distance
1,2,3,4,5 10,20,30 15◦ 70◦
Table 6.6: Experimental variables for the evaluation of the influence of the SNR values.
6.1.6 Source Counting Accuracy with Varying Data Variance
The variance σ²s is a measure of the spread of the data around its mean. We tested the clustering methods with data variances varying from 10 to 25; Tab. 6.7 gives the experimental settings for this investigation. From the corresponding source counting results in Fig. 6.6, we can see that the performance of the clustering methods decreases as the data variance increases. For the GMM clustering method with Dirichlet prior, the source counting accuracy for the cases Ns = 1 and Ns = 2 also decreases sharply once the standard deviation exceeds σ = 15°, while the other two methods are less affected.
Figure 6.5: Source counting accuracy of the clustering methods in relation to the SNR of the artificial data.
This is because the "overfitting" problem of the GMM clustering method occurs more frequently. For the case Ns = 5 with standard deviations from 15° to 20°, the Split and Merge method performs much better than the other two methods. Beyond σ = 20°, none of the methods is able to estimate the correct source number for Ns = 5. Overall, the GMM method with Dirichlet prior is more sensitive to the data variance, and the other two methods perform better in this respect.
Source Number SNR in dB Standard Deviation Minimal Angular Distance
1,2,3,4,5 30 10°, 15°, 20°, 25° 70°
Table 6.7: Experiment variables for the evaluation of the influence of the varying data variance.
Figure 6.6: Source counting accuracy of the clustering methods in relation to the variance of the artificial data.
6.1.7 Comparison and Conclusions
In conclusion, all three clustering methods show better source counting performance for data with smaller variance, larger SNR and larger minimal angular distance between artificial sources. The number of EM iterations and the convergence condition affect the performance of the GMM clustering method with Dirichlet prior, because the "overfitting" problem occurs for small numbers of EM iterations. Moreover, the GMM clustering method with Dirichlet prior is more sensitive to varying data variance than the other two methods.
The performance of all methods improves with increasing minimal angular distance between neighboring sources. To estimate 5 sources, the minimal angular distance between neighboring sources should be more than 60 degrees.
6.2 Experimental Results with Observed DoAs
This section presents the experimental results of the three clustering methods for source counting based on observed DoAs. The performance of the three methods is tested and evaluated under different acoustic scenarios. In the simulation environment, up to 5 speech sources are located in a room of dimensions 8.8 m × 3.75 m × 2.4 m [4]. The experimental setup is illustrated in Fig. 6.7. Both speech sources and sensors are arranged on circles around a center point, which we define to be located at [2.82 m, 2 m, 1.2 m]. The radius of the source circle is denoted by rs and the radius of the microphone array by rm.
In the experiment, the source locations are generated randomly from a uniform distribution of angles within [−180°, 180°). For the speech sources we choose five English speech signals sampled at 16 kHz with a duration of 5 seconds. The frame size for the STFT is 1024 and the frame shift is 256.
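For illustration, the random circular placement of the sources around the center can be sketched as follows (our own helper; the sensor positions are constructed analogously with radius rm):

import numpy as np

def circular_positions(n, radius, center=(2.82, 2.0, 1.2), seed=0):
    # Place n points on a circle of the given radius around the center,
    # at angles drawn uniformly from [-180, 180) degrees.
    rng = np.random.default_rng(seed)
    ang = np.deg2rad(rng.uniform(-180.0, 180.0, n))
    return np.stack([center[0] + radius * np.cos(ang),
                     center[1] + radius * np.sin(ang),
                     np.full(n, center[2])], axis=1)

src_xyz = circular_positions(5, radius=0.8)   # e.g. rs = 0.8 m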
The room impulse response h is generated using the room impulse response generator [7]. The experiment is performed under reverberant conditions. Fig. 6.8 compares observed DoAs with artificial DoAs; the variance of the observed DoAs is clearly larger.

Figure 6.7: Simulated experiment setup for the locations of sensors and sources. The room size is 8.8 m × 3.75 m × 2.4 m. Speech sources and sensors are located on circles with radii rs and rm. All sources and sensors are at the same height of 1.2 m.
In the experiments, the "overfitting" problem always occurs when we apply the GMM clustering method to the observed DoAs, which results in a lower source counting accuracy. We attempted to increase the number of iteration steps to 20000, but the severe "overfitting" problem persists even with the Dirichlet prior. In order to improve the performance of the GMM clustering method with Dirichlet prior, we applied the same post-processing step of the IGMM method as described in Section 5.5 to the GMM clustering method. The post-processing includes a component merging step with the same cµ as in Tab. 6.4. All other parameters are the same as the settings in Section 6.1.3.
Figure 6.8: Histogram of observed DoAs (left) and artificial data (right).
The experiment variables include varying source numbers, sensor numbers, reverberation times, sensor distances and source distances to the center. Since the number of data points in the observed DoAs is very large, each single test is very time-consuming, especially for the EM algorithm; we therefore decreased the number of trials from 100 to 20 per test when calculating the source counting accuracy.
6.2.1 Test with Varying Source Numbers and Sensor Numbers
1 Source number 1,2,3,4,5
2 Sensor number 3,4,5,6
3 Reverberation time 0.12s
4 Sensor distance to circle center 0.017m
5 Source distance to circle center 0.8m
6 Minimal angular distance between neighboring sources 70◦
7 SNR 50dB
Table 6.8: Experiment variables for the evaluation of the influence of different sensor numbers.
From the test results in Tab. 6.9 we cannot identify a relation between the source counting accuracy and the number of sensors. However, the accuracy of all three methods tends to decrease with increasing source number. For the cases Ns = 1 and Ns = 2, the IGMM method performs better (with an accuracy over 90%) than the other two methods. None of the three methods is able to estimate the correct source number for Ns = 5, which might result from the large variance of the observed DoAs.
GMM with DP Split & Merge IGMM
Nm\Ns 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
3 85 65 40 55 0 80 55 70 45 0 100 90 50 30 0
4 90 55 70 65 5 65 45 45 50 10 100 100 35 35 0
5 90 50 55 55 0 80 55 50 50 5 95 95 35 20 0
6 90 65 45 0 5 85 55 65 40 5 100 90 70 5 0
Table 6.9: Accuracy of source counting on observed DoA mixtures with varying sensor numbers Nm from 3 to 6 and source numbers Ns from 1 to 5.
6.2.2 Test with Varying Reverberation Time
In this test, mixtures of DoAs obtained under varying reverberation times are used to evaluate the three clustering methods. From Fig. 6.9 we can see that the performance of all three methods tends to degrade with increasing reverberation time. This suggests that the three clustering methods work reliably only for DoAs observed under short reverberation times, e.g., in small rooms.
1 Source number 1,2,3,4,5
2 Sensor number 5
3 Reverberation time in s 0.12,0.15,0.20,0.25,0.3
4 Sensor distance to circle center 0.017m
5 Source distance to circle center 0.8m
6 Minimal angular distance between neighboring sources 70°
7 SNR 50dB
Table 6.10: Experiment variables for the evaluation of the influence of the varying reverberation time.
6.2.3 Test with Varying Sensor Distance to Circle Center
The sensor distance to the circle center also influences the performance of the clustering methods. With the experimental variables described in Tab. 6.11, the source counting accuracy of all three clustering methods tends to decrease once the sensor distance to the center exceeds 0.05 m.
1 Source Number 1,2,3,4,5
2 Sensor Number 5
3 Reverberation Time in s 0.12
4 Sensor Distance to Circle Center in m [0.017 0.05 0.1 0.15 0.2]
5 Source Distance to Circle Center in m 0.5
6 Minimal Angular Distance between Neighboring Sources 70◦
7 SNR 50dB
Table 6.11: Experiment settings for the evaluation of the influence of the varying sensor distance to the circle center.
Figure 6.9: Source counting accuracy in relation to the reverberation time.
6.2.4 Test with Varying Source Distance to Circle Center
To investigate the influence of the source distance to the center on the source counting performance, we evaluate the three methods with source distances rs from 0.5 m to 1.5 m. The experiment variables are given in Tab. 6.12 and the results are presented in Tab. 6.13. We do not find a strong relation between the performance of the methods and the source distance. However, the performance of the GMM method with Dirichlet prior is not stable for Ns = 4 under varying source distance.
Figure 6.10: Source counting accuracy in relation to the sensor distance to the circle center in m.
6.2.5 Comparison and Conclusions
The accuracy of source counting decreases with increasing source number. The sensor number has no obvious influence on the performance of the three methods. We also do not find a conclusive explanation for the results with varying sensor distance to the center. For a better performance of the clustering methods on observed DoAs, the sensors should be located near the center.
1 Source number 1,2,3,4,5
2 Sensor number 5
3 Reverberation time 0.12s
4 Sensor distance to circle center 0.017m
5 Source distance to circle center 0.5,0.8,1.0,1.2,1.5
6 Minimal angular distance between neighboring sources 70°
7 SNR 50dB
Table 6.12: Experiment settings for the evaluation of the influence of the varying source distance to the circle center.
GMM with DP Split & Merge IGMM
rs\Ns 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
0.5m 75 65 40 70 0 65 60 70 50 10 100 90 45 15 0
0.8m 95 80 45 0 0 70 60 60 80 10 100 90 60 20 0
1m 75 70 55 85 5 50 50 80 50 15 90 85 55 20 0
1.2m 75 65 50 0 0 50 55 50 35 15 80 85 50 25 5
1.5m 60 60 45 100 0 50 50 70 55 5 90 75 35 20 15
Table 6.13: Accuracy of source counting on observed DoA mixtures with varying source distance rs.
Chapter 7
Summary and Outlook
In summary, in this thesis three clustering methods for estimating the number of acoustic sources were implemented and evaluated: the GMM clustering with Dirichlet prior [3], the Split and Merge method [15] and the IGMM clustering method [12].
Because of the large number of EM iteration steps it requires, the GMM clustering method with Dirichlet prior is the most time-consuming of the three methods. Compared with the other two methods, its performance is also more affected by varying data variance.
The Split and Merge method is the easiest method to understand. It reduces the number of iteration steps of the traditional EM algorithm and, compared with the GMM method with Dirichlet prior, handles the "overfitting" problem better. However, this method has 6 empirical parameters, so more simulations are needed to find suitable parameter values.
The IGMM method is competitive with the other two methods and has the smallest number of empirical parameters of the three. For the cases where the source number equals 1 or 2, the IGMM method performs better than the other two methods.
The IGMM method can be improved in future work. For example, Walter et al. built a wrapped-phase IGMM for weighted observed DoAs in [17], which could improve the source counting accuracy of the IGMM method on observed DoAs. In addition, processing methods for the observed DoAs could be investigated to improve the accuracy by reducing the data variance.
List of Figures
2.1 Missing time information of the Fourier transform illustrated by two
different signals and their magnitude Fourier transforms. (a) Two sub-
sequent sinusoids of frequency 1 Hz and 5 Hz. (b) Superposition of the
same sinusoids.[9] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
2.2 Signal and Fourier transform consisting of two subsequent sinusoids of
frequency 1 Hz and 5 Hz [9]. . . . . . . . . . . . . . . . . . . . . . . . . 4
2.3 Definition of the DoA. A speech source i is located in a spherical coordinate system. The DoA of i is represented by the unit-norm vector qi/‖qi‖, marked by the red arrow [2]. . . . . . 5
2.4 Far-field model of a speech source i and two sensors j and J . The TDoA
between sensors is (dj − dJ)T qi‖qi‖ marked by the red arrow [2]. . . . . . 6
2.5 Plot of the univariate Gaussian showing the mean µ and the standard
deviation σ [5]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.6 Histogram of observed DoAs of three speech sources. The white columns
represent the observed DoAs, the red stems indicate the accurate direc-
tion of speech sources. . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.7 Plot of mixtures of three Gaussian components, the linear superposition
of these Gaussian components builds a GMM. . . . . . . . . . . . . . . 10
2.8 Histogram of observed DoAs. On the left, little phase wrapping happens, which results in data that can be well modeled by a Gaussian distribution. On the right, severe phase wrapping occurs, which results in a poor fit of the data by a single Gaussian distribution [14]. . . . . . 10
2.9 Illustration of the wrapped phase Gaussian distribution model. Replicating a Gaussian distribution with a period of 2π (dashed lines) and summing the result in the interval [−π, π) (solid blue line) results in an accurate model of a wrapped Gaussian distribution [14]. . . . . . 11
3.1 Example of GMM fitting result without a Dirichlet prior. The white
columns represent the observed DoAs, the red stems are the correct
position of each source and the black lines plot the GMM fitting results. 18
3.2 The Dirichlet distribution over three variables α1, α2, α3 is confined to a probability simplex (the triangle area) [5]. . . . . . 19
3.3 Shapes of the Dirichlet distribution over the simplex with different values
of φ. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.4 GMM fitting example using the Dirichlet prior φ = 0.01. . . . . . . . . 20
4.1 Figure of a Gaussian distribution; the blue area represents the tolerance region [µ − zpσ, µ + zpσ]; the cumulative probability pmass is calculated by integrating the distribution PDF within the tolerance region [1]. . . . . . 29
4.2 Figure of three Gaussian components with their means µ1, µ2 and µ3,
TR2 represents the tolerance region of the Gaussian component in the
middle. According to the criterion C3, this middle component will be
removed. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.3 Figure of two Gaussian components, where TR1 and TR2 represent the tolerance regions of the Gaussian components. The data in the blue areas are included in the tolerance regions, while the data in the orange areas are outside the tolerance regions. . . . . . 32
5.1 Plot of ARS, the envelope function is built using the tangent lines of a
log concave function, the tangent lines are computed at a set of points
a. [5] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
5.2 CRP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
5.3 The fitting result of the IGMM for 4 sources. The observed DoAs are
plotted by white bars, the red stems show the accurate position of each
source and the fitting result is represented by the blue lines. . . . . . . 44
5.4 The fitting result of IGMM for 4 sources with the extra split and merge
step. The observed DoAs are plotted by white bars, the red stems show
the true position of each source, the fitting result is represented by the
blue lines. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
6.1 Histogram of artificial data for one artificial speech source. The columns
represent the normalized probability of the artificial data and the red
stem shows the true source location at µs = 7◦. . . . . . . . . . . . . . 47
6.2 Histogram of artificial data of one source and the data fitting result of
the Split and Merge method. The columns represent the PDF of the
artificial data and the red stem shows the true source location µs = 7◦.
The fitting result is plotted by the blue Gaussian component whose mean
is µk = 6.56◦. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
6.3 Source counting accuracy for artificial data in % using different conver-
gence conditions over different numbers of speech sources Ns = [1, 2, 3, 4, 5]. 50
6.4 Accuracy of source number estimation in relation to the angular differ-
ence between the neighboring artificial sources. . . . . . . . . . . . . . 53
6.5 Source counting accuracy of clustering methods in relation to SNR values
of artificial data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
6.6 Source counting accuracy of clustering methods in relation to the vari-
ance of artificial data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
6.7 Experimental settings . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
6.8 Artificial data and observed DoAs . . . . . . . . . . . . . . . . . . . . . 59
6.9 Source counting accuracy in relation to the reverberation time . . . . . 62
6.10 Source counting accuracy in relation to the sensor distance to the circle
center in m . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
List of Tables
3.1 Summary of the EM algorithm [3]. . . . . . 22
4.1 Pseudo code of the Split and Merge method[15]. . . . . . . . . . . . . . 28
5.1 Summary of steps in Gibbs sampling [5]. . . . . . . . . . . . . . . . . . 38
5.2 Summary of steps in non-adaptive rejection sampling.[6] . . . . . . . . 39
5.3 Summary of steps in ARS.[13] . . . . . . . . . . . . . . . . . . . . . . . 41
6.1 Experimental variables for the evaluation of the influence of the conver-
gence condition. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
6.2 Parameters of the GMM clustering method with Dirichlet prior . . . . 51
6.3 Parameters for Split and Merge . . . . . . . . . . . . . . . . . . . . . . 52
6.4 Parameters for IGMM . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
6.5 Experimental variables for the evaluation of the influence of the minimal
angular distance between neighboring sources. . . . . . . . . . . . . . . 53
6.6 Experimental variables for the evaluation of the influence of the SNR
values. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
6.7 Experiment variables for the evaluation of the influence of the varying
data variance. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
6.8 Experiment variables for the evaluation of the influence of the different
sensor numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
6.9 Accuracy of source counting on observed DoAs mixture with varying
sensor numbers Nm from 3 to 6 and source numbers Ns from 1 to 5. . 60
6.10 Experiment variables for the evaluation of the influence of the varying
reverberation time. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
6.11 Experiment variation settings for the evaluation of the influence of the
varying sensor distance to the circle center . . . . . . . . . . . . . . . . 61
6.12 Experiment variation settings for the evaluation of the influence of the
varying source distance to the circle center. . . . . . . . . . . . . . . . . 64
6.13 Accuracy of source counting on observed DoAs mixture with varying
sensor distance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
Bibliography
[1] Z-table. http://www.statisticshowto.com/tables/z-table/. Accessed: 20.10.2017.
[2] S. Araki, H. Sawada, R. Mukai, and S. Makino. DoA estimation for multiple sparse
sources with normalized observation vector clustering. In 2006 IEEE International
Conference on Acoustics Speech and Signal Processing Proceedings, volume 5, May
2006.
[3] S. Araki, T. Nakatani, H. Sawada, and S. Makino. Blind sparse source separation
for unknown number of sources using Gaussian mixture model fitting with Dirich-
let prior. In 2009 IEEE International Conference on Acoustics, Speech and Signal
Processing, pages 33–36, April 2009.
[4] S. Araki, H. Sawada, R. Mukai, and S. Makino. Underdetermined blind sparse
source separation for arbitrarily arranged multiple sensors. Signal Processing,
87(8):1833 – 1847, 2007. Independent Component Analysis and Blind Source
Separation.
[5] C. M. Bishop. Pattern recognition and machine learning. Springer, New York,
2006.
[6] W. R. Gilks and P. Wild. Adaptive rejection sampling for Gibbs sampling. Journal
of the Royal Statistical Society. Series C (Applied Statistics), 41(2):337–348, 1992.
[7] E. A. P. Habets. RIR generator from audiolabs-erlangen. https://www.audiolabs-
erlangen.de/fau/professor/habets/software/rir-generator. Accessed: 20.05.2017.
[8] M. I. Mandel, D. P. W. Ellis, and T. Jebara. An EM algorithm for localizing multiple
sound sources in reverberant environments. In B. Scholkopf, J. Platt, and T. Hoff-
man, editors, Advances in Neural Information Processing Systems, pages 953–960,
Cambridge, MA, 2007. MIT Press.
[9] M. Muller. Fundamentals of Music Processing - Audio, Analysis, Algorithms,
Applications. Springer, Berlin, Heidelberg, 2015.
[10] R. M. Neal. Markov chain sampling methods for Dirichlet process mixture models.
Journal of Computational and Graphical Statistics, 9(2):249–265, 2000.
[11] M. Pal, R. Roy, J. Basu, and M. S. Bepari. Blind source separation: A review
and analysis. In 2013 International Conference Oriental COCOSDA held jointly
with 2013 Conference on Asian Spoken Language Research and Evaluation (O-
COCOSDA/CASLRE), pages 1–5, Nov 2013.
[12] C. E. Rasmussen. The infinite Gaussian mixture model. In Advances in Neural
Information Processing Systems, pages 554–560. MIT Press, 2000.
[13] D. Sheldon. Discrete adaptive rejection sampling. School of Computer Science,
University of Massachusetts, 2013.
[14] P. Smaragdis and P. Boufounos. Learning source trajectories using wrapped-phase
hidden Markov models. In IEEE Workshop on Applications of Signal Processing
to Audio and Acoustics (WASPAA), pages 114–117, Oct. 2005.
[15] M. Taseska, A. H. Khan, and E. A. P. Habets. Speech enhancement with a low-
complexity online source number estimator using distributed arrays. In 2014 22nd
European Signal Processing Conference (EUSIPCO), pages 929–933, Sept 2014.
[16] Y. W. Teh. Dirichlet processes. In Encyclopedia of Machine Learning. Springer,
2010.
[17] O. Walter, L. Drude, and R. Haeb-Umbach. Source counting in speech mixtures
by nonparametric Bayesian estimation of an infinite Gaussian mixture model. In
2015 IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP), pages 459–463, April 2015.