梅尔频率倒谱系数（MFCC）.html

<html>
<head>
  <title>Evernote Export</title>
  <basefont face="微软雅黑" size="2" />
  <meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
  <meta name="exporter-version" content="Evernote Windows/306387 (zh-CN, DDL); Windows/10.0.0 (Win64);"/>
  <style>
    body, td {
      font-family: 微软雅黑;
      font-size: 10pt;
    }
  </style>
</head>
<body>
<a name="1330"/>

<div>
<span><div><span style="color: rgb(227, 0, 0); font-size: 10pt; font-weight: bold;">The shape of the vocal tract shows up in the envelope of the short-time power spectrum of speech, and MFCCs are a feature that accurately describes this envelope.</span><br/></div><div><br/></div><div><span style="color: rgb(227, 0, 0); font-weight: bold; font-size: 10pt;">1. Spectrogram</span></div><div><br/></div><div><span style="font-size: 10pt;">We are dealing with speech signals, so </span><span style="font-size: 10pt; color: rgb(227, 0, 0); font-weight: bold;">how we describe them matters a great deal</span><span style="font-size: 10pt;">, because different representations reflect different information about the signal.</span></div><div style="text-align: center;"><span style="font-size: 10pt;"><img src="MFCC-_files/Image.jpg" type="image/jpeg" data-filename="Image.jpg" width="560"/></span></div><div><span style="font-size: 10pt;">The utterance is split into many frames, and each frame has its own spectrum (computed with a short-time FFT). </span><span style="font-size: 10pt; color: rgb(227, 0, 0); font-weight: bold;">A spectrum shows how energy is distributed over frequency</span><span style="font-size: 10pt;">.</span></div><div><span style="font-size: 10pt;">In practice three kinds of spectrum plot are used: the </span><span style="font-size: 10pt; font-weight: bold;">linear amplitude spectrum</span><span style="font-size: 10pt;">, the </span><span style="font-size: 10pt; font-weight: bold;">logarithmic amplitude spectrum</span><span style="font-size: 10pt;">, and the </span><span style="font-size: 10pt; font-weight: bold;">auto power spectrum</span><span style="font-size: 10pt;">. (In the logarithmic amplitude spectrum, </span><span style="font-size: 10pt; color: rgb(227, 0, 0); font-weight: bold;">the amplitude of every spectral line is passed through a logarithm</span><span style="font-size: 10pt;">, so </span><span style="font-size: 10pt; color: rgb(227, 0, 0); font-weight: bold;">its vertical axis is in dB</span><span style="font-size: 10pt;"> (decibels).</span></div><div><span style="font-size: 10pt;">The purpose of this transform is to lift low-amplitude components relative to high-amplitude ones, which makes it easier to spot periodic signals buried in low-level noise.)</span></div><div style="text-align: center;"><span style="font-size: 10pt;"><img src="MFCC-_files/Image [1].jpg" type="image/jpeg" data-filename="Image.jpg" width="547"/></span></div><div><span style="font-size: 10pt;">We first plot </span><span style="font-size: 10pt; color: rgb(227, 0, 0); font-weight: bold;">the spectrum of a single frame</span><span style="font-size: 10pt;"> on a pair of axes, as in the left panel of the figure above. Next we take that left-hand spectrum and </span><span 
style="font-size: 10pt; color: rgb(227, 0, 0); font-weight: bold;">rotate it by 90 degrees</span><span style="font-size: 10pt;">, which gives the middle panel.</span></div><div><span style="font-size: 10pt;">Then these amplitudes are </span><span style="font-size: 10pt; color: rgb(227, 0, 0); font-weight: bold;">mapped to gray levels</span><span style="font-size: 10pt;"> (this can presumably be understood as quantizing the continuous amplitude into 256 levels).</span></div><div><span style="font-size: 10pt;">0 is black and 255 is white. </span><span style="font-size: 10pt; color: rgb(227, 0, 0); font-weight: bold;">The larger the amplitude, the darker the corresponding region</span><span style="font-size: 10pt;">. This gives the rightmost panel.</span></div><div><span style="font-size: 10pt; color: rgb(227, 0, 0); font-weight: bold;">Adding a time dimension</span><span style="font-size: 10pt;"> lets us display the spectrum of a whole stretch of speech rather than a single frame, and makes both static and dynamic information visible at a glance. Its advantages are listed a little further on.</span></div><div><br/></div><div><span style="font-size: 10pt;">The result is </span><span style="font-size: 10pt; color: rgb(227, 0, 0); font-weight: bold;">a spectrum plot that varies over time</span><span style="font-size: 10pt;">: the spectrogram of the speech signal.</span></div><div style="text-align: center;"><span style="font-size: 10pt;"><img src="MFCC-_files/Image [2].jpg" type="image/jpeg" data-filename="Image.jpg"/></span></div><div><span style="font-size: 10pt;">The figure below is the spectrogram of a stretch of speech; the darkest regions are the peaks of the spectrum (the formants).</span></div><div style="text-align: center;"><span style="font-size: 10pt;"><img src="MFCC-_files/Image [3].jpg" type="image/jpeg" data-filename="Image.jpg"/></span></div><div><br/></div><div><span style="font-size: 10pt;">The spectrogram has several advantages. First, </span><span style="font-size: 10pt; color: rgb(227, 0, 0); font-weight: bold;">the properties of phones</span><span style="font-size: 10pt;"> are easier to observe in it. Second, sounds are easier to identify by looking at the formants and their transitions.</span></div><div><span style="font-size: 10pt;">Hidden Markov Models implicitly model the </span><span style="font-size: 10pt; color: rgb(227, 0, 0); font-weight: bold;">spectrogram</span><span style="font-size: 10pt;"> to achieve good recognition performance.</span></div><div><span style="font-size: 10pt;">Another use is that it lets us </span><span style="font-size: 10pt; color: rgb(227, 0, 0); font-weight: bold;">intuitively evaluate a TTS system</span><span style="font-size: 
10pt;"> (text-to-speech): simply </span><span style="font-size: 10pt; color: rgb(227, 0, 0); font-weight: bold;">compare the spectrograms of the synthesized and natural speech</span><span style="font-size: 10pt;"> and check how well they match.</span></div><div><br/></div><div><font style="font-size: 10pt;"><span style="font-size: 10pt; font-weight: bold;">To summarize: the speech is split into frames, each frame is transformed to the frequency domain with an FFT, and the per-frame spectra are lined up in time order to give a </span><span style="font-size: 10pt; color: rgb(227, 0, 0); font-weight: bold;">time-frequency-energy distribution map</span><span style="font-size: 10pt; font-weight: bold;">.</span></font></div><div><br/></div><div><span style="color: rgb(227, 0, 0); font-weight: bold; font-size: 10pt;">2. Cepstrum Analysis</span></div><div><br/></div><div><span style="font-size: 10pt;">Below is the spectrum of a speech signal. The peaks mark its dominant frequency components; these </span><span style="font-size: 10pt; color: rgb(227, 0, 0); font-weight: bold;">peaks are called formants</span><span style="font-size: 10pt;">.</span></div><div><span style="font-size: 10pt; color: rgb(227, 0, 0); font-weight: bold;">Formants carry the identifying properties of a sound</span><span style="font-size: 10pt;"> (much like a personal ID card), which makes them especially important: they let us tell different sounds apart.</span></div><div style="text-align: center;"><span style="font-size: 10pt;"><img src="MFCC-_files/Image [4].jpg" type="image/jpeg" data-filename="Image.jpg" width="622"/></span></div><div><br/></div><div><span style="font-size: 10pt;">We need to extract </span><span style="font-size: 10pt; color: rgb(227, 0, 0); font-weight: bold;">not only the positions of the formants</span><span style="font-size: 10pt;"> but also how they change over time.</span></div><div><span style="font-size: 10pt; color: rgb(227, 0, 0); font-weight: bold;">The spectral envelope</span><span style="font-size: 10pt;"> is therefore what we extract: </span><span style="font-size: 10pt; color: rgb(227, 0, 0); font-weight: bold;">a smooth curve connecting the formant peaks</span><span style="font-size: 10pt;">.</span></div><div style="text-align: center;"><span style="font-size: 10pt;"><img src="MFCC-_files/Image [5].jpg" type="image/jpeg" data-filename="Image.jpg" width="636"/></span></div><div><br/></div><div><span style="font-size: 10pt;">The original spectrum consists of two parts: the </span><span style="font-size: 10pt; color: 
rgb(227, 0, 0); font-weight: bold;">envelope</span><span style="font-size: 10pt;"> and the </span><span style="font-size: 10pt; color: rgb(227, 0, 0); font-weight: bold;">spectral details</span><span style="font-size: 10pt;">.</span></div><div><span style="font-size: 10pt;">Here we work with the </span><span style="font-size: 10pt; color: rgb(227, 0, 0); font-weight: bold;">log spectrum</span><span style="font-size: 10pt;">, so the unit is dB.</span></div><div><span style="font-size: 10pt;">We now need to separate the two parts so that we can obtain the envelope.</span></div><div><br/></div><div style="text-align: center;"><span style="font-size: 10pt;"><img src="MFCC-_files/Image [6].jpg" type="image/jpeg" data-filename="Image.jpg"/></span></div><div><span style="font-weight: bold; font-size: 10pt;">Given log X[k],</span></div><div><span style="font-weight: bold; font-size: 10pt;">    how do we find:</span></div><div><span style="font-weight: bold; font-size: 10pt;">                log H[k] and log E[k]</span></div><div><span style="font-weight: bold; font-size: 10pt;">    such that:</span></div><div><span style="font-weight: bold; font-size: 10pt;">                log X[k] = log H[k] + log E[k] ?</span></div><div><br/></div><div><span style="font-size: 10pt;">To get there we need to play a mathematical trick. What is the </span><span style="font-size: 10pt; color: rgb(227, 0, 0); font-weight: bold;">trick</span><span style="font-size: 10pt;">? </span><span style="font-size: 10pt; color: rgb(227, 0, 0); font-weight: bold;">Take the FFT of the spectrum</span><span style="font-size: 10pt;">.</span></div><div><span style="font-size: 10pt;">Taking a Fourier transform of the spectrum amounts to an </span><span style="font-size: 10pt; color: rgb(227, 0, 0); font-weight: bold;">inverse Fourier transform</span><span style="font-size: 10pt;">, i.e. an inverse FFT (</span><span style="font-size: 10pt; color: rgb(227, 0, 0); font-weight: bold;">IFFT</span><span style="font-size: 10pt;">); for a real, symmetric log-magnitude spectrum the two differ only by a scale factor.</span></div><div><span style="font-size: 10pt;">Note that we work in the </span><span style="font-size: 10pt; color: rgb(227, 0, 0); font-weight: bold;">log domain</span><span style="font-size: 10pt;"> of the spectrum, which is also part of the trick.</span></div><div><span 
style="font-size: 10pt;">Taking the IFFT of the log spectrum amounts to </span><span style="font-size: 10pt; color: rgb(227, 0, 0); font-weight: bold;">describing the signal on a pseudo-frequency axis</span><span style="font-size: 10pt;">.</span></div><div style="text-align: center;"><span style="font-size: 10pt;"><img src="MFCC-_files/Image [7].jpg" type="image/jpeg" data-filename="Image.jpg"/></span></div><div><br/></div><div><span style="font-size: 10pt;">From the figure above we can see that </span><span style="font-size: 10pt; color: rgb(227, 0, 0); font-weight: bold;">the envelope consists mainly of low-frequency components.</span></div><div style="text-align: center;"><span style="color: rgb(50, 135, 18); font-weight: bold; font-size: 10pt;">(This takes a change of perspective: stop reading the horizontal axis as frequency and treat it as time.)</span></div><div><span style="font-size: 10pt;">Think of the envelope as </span><span style="font-size: 10pt; color: rgb(227, 0, 0); font-weight: bold;">a sinusoid with 4 cycles per second</span><span style="font-size: 10pt;">. On the pseudo-frequency axis we then give it </span><span style="font-size: 10pt; color: rgb(227, 0, 0); font-weight: bold;">a peak at 4 Hz</span><span style="font-size: 10pt;">.</span></div><div><span style="font-size: 10pt;">The spectral </span><span style="font-size: 10pt; color: rgb(227, 0, 0); font-weight: bold;">details are mainly high-frequency.</span></div><div><span style="font-size: 10pt;">Think of them as </span><span style="font-size: 10pt; color: rgb(227, 0, 0); font-weight: bold;">a sinusoid with 100 cycles per second</span><span style="font-size: 10pt;">. On the pseudo-frequency axis we then give them </span><span style="font-size: 10pt; color: rgb(227, 0, 0); font-weight: bold;">a peak at 100 Hz</span><span style="font-size: 10pt;">.</span></div><div><br/></div><div><span style="font-size: 10pt;">Adding the two together gives back the original spectrum.</span></div><div style="text-align: center;"><span style="font-size: 10pt;"><img src="MFCC-_files/Image [8].jpg" type="image/jpeg" data-filename="Image.jpg"/></span></div><div><span style="font-size: 
10pt;">In practice log X[k] is known, so we can also obtain </span><span style="font-size: 10pt; color: rgb(227, 0, 0); font-weight: bold;">x[k]</span><span style="font-size: 10pt;">.</span></div><div><span style="font-size: 10pt;">Since </span><span style="font-size: 10pt; color: rgb(227, 0, 0); font-weight: bold;">h[k]</span><span style="font-size: 10pt;"> is the </span><span style="font-size: 10pt; color: rgb(227, 0, 0); font-weight: bold;">low-frequency part</span><span style="font-size: 10pt;"> of x[k], passing x[k] through a </span><span style="font-size: 10pt; color: rgb(227, 0, 0); font-weight: bold;">low-pass filter</span><span style="font-size: 10pt;"> yields h[k], that is, the </span><span style="font-size: 10pt; color: rgb(227, 0, 0); font-weight: bold;">spectral envelope</span><span style="font-size: 10pt;">.</span></div><div><br/></div><div><span style="font-size: 10pt;">x[k] is in fact the </span><span style="font-size: 10pt; color: rgb(227, 0, 0); font-weight: bold;">cepstrum</span><span style="font-size: 10pt;"> </span><span style="font-size: 10pt; color: rgb(50, 135, 18); font-weight: bold;">(a coined word: reverse the first four letters of "spectrum" and you get "cepstrum").</span></div><div><span style="font-size: 10pt;">The h[k] we care about is the </span><span style="font-size: 10pt; color: rgb(227, 0, 0); font-weight: bold;">low-frequency part of the cepstrum</span><span style="font-size: 10pt;">.</span></div><div><span style="font-weight: bold; font-size: 10pt;">h[k] describes the spectral envelope and is widely used as a feature in speech recognition.</span></div><div><br/></div><div><span style="font-size: 10pt;">To </span><span style="font-size: 10pt; color: rgb(227, 0, 0); font-weight: bold;">summarize cepstrum analysis</span><span style="font-size: 10pt;">, it is the following process:</span></div><div><span style="font-size: 10pt;">1) Apply a </span><span style="font-size: 10pt; color: rgb(227, 0, 0); font-weight: bold;">Fourier transform</span><span style="font-size: 10pt;"> to the original speech signal to obtain its spectrum:</span></div><div><span style="font-size: 10pt;">                                X[k] = H[k] E[k]</span></div><div><span style="font-size: 10pt;">  </span> <span style="font-size: 10pt; color: rgb(227, 0, 0); font-weight: bold;"> </span><span style="font-size: 
10pt; color: rgb(227, 0, 0); font-weight: bold;">Considering only the magnitude</span><span style="font-size: 10pt;"> gives:</span></div><div><span style="font-size: 10pt;">                                |X[k]| = |H[k]| |E[k]|</span></div><div><br/></div><div><span style="font-size: 10pt;">2) Take the </span><span style="font-size: 10pt; color: rgb(227, 0, 0); font-weight: bold;">logarithm of both sides</span><span style="font-size: 10pt;">:</span></div><div><span style="font-size: 10pt;">                                    log |X[k]| = log |H[k]| + log |E[k]|</span></div><div><br/></div><div><span style="font-size: 10pt;">3) Then take the inverse Fourier transform of both sides to obtain:</span></div><div><span style="font-size: 10pt;">                                                    x[k] = h[k] + e[k]</span></div><div><span style="font-size: 10pt;">The technical name for this is </span><span style="font-size: 10pt; color: rgb(227, 0, 0); font-weight: bold;">homomorphic signal processing</span><span style="font-size: 10pt;">, a method whose purpose is to </span><span style="font-size: 10pt; color: rgb(227, 0, 0); font-weight: bold;">turn a nonlinear problem into a linear one</span><span style="font-size: 10pt;">.</span></div><div><br/></div><div><span style="font-size: 10pt;">Relating this to the steps above: the original speech signal is really a </span><span style="font-size: 10pt; color: rgb(227, 0, 0); font-weight: bold;">convolutional signal</span><span style="font-size: 10pt; color: rgb(50, 135, 18); font-weight: bold;"> (the vocal tract acts as a linear </span><span style="font-size: 10pt; color: rgb(227, 0, 0); font-weight: bold;">time-invariant system</span><span style="font-size: 10pt; color: rgb(50, 135, 18); font-weight: bold;">, and </span><span style="font-size: 10pt; color: rgb(227, 0, 0); font-weight: bold;">sound production</span><span style="font-size: 10pt; color: rgb(50, 135, 18); font-weight: bold;"> can be </span><span style="font-size: 10pt; color: rgb(227, 0, 0); font-weight: bold;">understood as an excitation</span><span style="font-size: 10pt; color: rgb(50, 135, 18); font-weight: bold;"> passing through that system).</span></div><div><span style="font-size: 10pt;">Step one uses the Fourier transform to turn it into a </span><span style="font-size: 10pt; color: rgb(227, 0, 0); font-weight: bold;">multiplicative signal</span><span style="font-size: 10pt; color: 
rgb(50, 135, 18); font-weight: bold;"> (</span><span style="font-size: 10pt; color: rgb(227, 0, 0); font-weight: bold;">convolution in the time domain</span><span style="font-size: 10pt; color: rgb(50, 135, 18); font-weight: bold;"> corresponds to </span><span style="font-size: 10pt; color: rgb(227, 0, 0); font-weight: bold;">multiplication in the frequency domain</span><span style="font-size: 10pt; color: rgb(50, 135, 18); font-weight: bold;">).</span></div><div><span style="font-size: 10pt;">Step two takes the logarithm, turning the </span><span style="font-size: 10pt; color: rgb(227, 0, 0); font-weight: bold;">multiplicative signal</span><span style="font-size: 10pt;"> into an </span><span style="font-size: 10pt; color: rgb(227, 0, 0); font-weight: bold;">additive signal</span><span style="font-size: 10pt;">.</span></div><div><span style="font-size: 10pt;">Step three applies the </span><span style="font-size: 10pt; color: rgb(227, 0, 0); font-weight: bold;">inverse transform</span><span style="font-size: 10pt;">, which </span><span style="font-size: 10pt; color: rgb(227, 0, 0); font-weight: bold;">keeps the signal additive</span><span style="font-size: 10pt;">, now in the form x[k] = h[k] + e[k]. Both the original signal and this result are time-domain sequences, but </span><span style="font-size: 10pt; color: rgb(227, 0, 0); font-weight: bold;">the discrete time domains they live in are clearly different</span><span style="font-size: 10pt;">, so the latter is called the </span><span style="font-size: 10pt; color: rgb(227, 0, 0); font-weight: bold;">cepstral (quefrency) domain</span><span style="font-size: 10pt;">.</span></div><div><span style="font-size: 10pt;">In short, the cepstrum is </span><span style="font-size: 10pt; color: rgb(227, 0, 0); font-weight: bold;">the spectrum obtained by taking a signal's </span><span style="font-size: 10pt; color: rgb(50, 135, 18); font-weight: bold;">Fourier transform</span><span style="font-size: 10pt; color: rgb(227, 0, 0); font-weight: bold;">, applying a </span><span style="font-size: 10pt; color: rgb(50, 135, 18); font-weight: bold;">logarithm</span><span style="font-size: 10pt; color: rgb(227, 0, 0); font-weight: bold;"> to its magnitude, and then taking the </span><span style="font-size: 10pt; color: rgb(50, 135, 18); font-weight: bold;">inverse Fourier transform</span><span style="font-size: 10pt;">.</span></div><div><br/></div><div><span style="font-size: 
10pt;">It is computed as follows:</span></div><div style="text-align: center;"><span style="font-size: 10pt;"><img src="MFCC-_files/Image [9].jpg" type="image/jpeg" data-filename="Image.jpg" width="619"/></span></div><div><br/></div><div><br/></div></span>
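<div><span style="font-size: 10pt;">The three steps of cepstrum analysis (Fourier transform, log magnitude, inverse Fourier transform) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the method of any particular toolkit: the function names real_cepstrum and spectral_envelope, the cutoff n_keep, and the small constant EPS added before the logarithm are all hypothetical choices made for this example. The low-pass step is done by keeping only the first few cepstral coefficients (plus their mirror images) and transforming back, which gives an estimate of log |H[k]|.</span></div>
<pre>
```python
import numpy as np

EPS = 1e-10  # assumed guard value so that log(0) never occurs


def real_cepstrum(frame):
    """Return x[n] = IFFT(log|FFT(frame)|) for one frame of samples."""
    log_mag = np.log(np.abs(np.fft.fft(frame)) + EPS)  # log |X[k]|
    return np.fft.ifft(log_mag).real                   # x[n] = h[n] + e[n]


def spectral_envelope(frame, n_keep=30):
    """Estimate log |H[k]| by low-pass filtering the cepstrum."""
    cep = real_cepstrum(frame)
    n = len(cep)
    keep = np.zeros(n)
    keep[:n_keep] = 1.0          # low end of the cepstrum
    keep[n - n_keep + 1:] = 1.0  # mirrored half, keeps the result real
    return np.fft.fft(cep * keep).real


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Windowed noise stands in for one frame of speech (an assumption).
    frame = rng.standard_normal(512) * np.hamming(512)
    env = spectral_envelope(frame, n_keep=30)
    print(env.shape)  # one log-magnitude value per FFT bin
```
</pre>
<div><span style="font-size: 10pt;">A smaller n_keep gives a smoother envelope, while a larger one lets more of the spectral detail through. The value 30 is only a placeholder and would need tuning for real speech.</span></div>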
</div></body></html> 