TitaNet-Large En: A speaker representation model for speaker verification

The TitaNet-Large model is a speaker representation model for speaker verification. It is composed mainly of 1D convolutions, residual connections, and Squeeze-and-Excitation layers. It has 25.3M parameters in total (~100 MB).

API Usage

To use the model, provide two 16 kHz (16000 Hz) mono-channel sound files as input. The model returns the similarity score between the two speakers, together with a verification result based on a cosine similarity threshold. The embeddings of both files can also be returned on request.
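Since the model expects 16 kHz mono input, it can help to check files before sending them. The following is a minimal sketch that assumes the inputs are PCM WAV files, which the model card does not state; the helper name `is_16khz_mono` is hypothetical and is not part of the model's API.

```python
import wave


def is_16khz_mono(path: str) -> bool:
    """Return True if the WAV file at `path` is 16000 Hz and single-channel.

    Assumption: the file is an uncompressed PCM WAV readable by the
    standard-library `wave` module. Other formats need a different reader.
    """
    with wave.open(path, "rb") as wav:
        return wav.getframerate() == 16000 and wav.getnchannels() == 1
```

Files that fail this check would need to be resampled or downmixed before use.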

Input parameters are as follows:
- sound_file1: 16 kHz mono-channel sound file.
- sound_file2: 16 kHz mono-channel sound file.
- threshold: cosine similarity threshold. If not provided, the default value of 0.7 is used.
- return_embedding: if set to True, the model also returns the embeddings of the sound files. If not provided, the default value of False is used.

Return values are as follows:
- embedding1: embedding of sound_file1, returned only if return_embedding is set to True.
- embedding2: embedding of sound_file2, returned only if return_embedding is set to True.
- similarity: cosine similarity between the two embeddings.
- verification_result: verification result obtained by comparing similarity against threshold.
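To illustrate how similarity and verification_result relate to the two embeddings, here is a minimal pure-Python sketch. It is not the model's actual implementation. The function names `cosine_similarity` and `verify` are hypothetical, and treating a similarity equal to the threshold as a match (`>=`) is an assumption; the model card does not say how ties are handled.

```python
import math
from typing import Dict, Sequence, Union


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def verify(
    embedding1: Sequence[float],
    embedding2: Sequence[float],
    threshold: float = 0.7,  # default value stated in the model card
) -> Dict[str, Union[float, bool]]:
    """Compare two embeddings and apply the threshold.

    Assumption: a similarity equal to the threshold counts as a match.
    """
    similarity = cosine_similarity(embedding1, embedding2)
    return {
        "similarity": similarity,
        "verification_result": similarity >= threshold,
    }
```

With return_embedding set to True, the returned embedding1 and embedding2 could be passed to a function like `verify` to re-evaluate the decision under a different threshold without running the model again.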

References

@INPROCEEDINGS{9746806,
  author={Koluguri, Nithin Rao and Park, Taejin and Ginsburg, Boris},
  booktitle={ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}, 
  title={TitaNet: Neural Model for Speaker Representation with 1D Depth-Wise Separable Convolutions and Global Context}, 
  year={2022},
  pages={8102-8106},
  doi={10.1109/ICASSP43922.2022.9746806}}
