TitaNet-Large En: A speaker representation model for speaker verification
TitaNet-Large is a speaker representation model for speaker verification. It is composed mainly of 1D convolutions, residual connections, and Squeeze-and-Excitation layers. It has 25.3M parameters in total (~100 MB).
API Usage
To use the model, provide two 16 kHz (16000 Hz) mono-channel sound files as input. The model returns the cosine similarity score between the two speakers, together with a verification result obtained by comparing that score against a threshold. The embeddings of both files can optionally be returned as well.
Input parameters are as follows:
- sound_file1: 16 kHz (16000 Hz) mono-channel sound file.
- sound_file2: 16 kHz (16000 Hz) mono-channel sound file.
- threshold: cosine similarity threshold. If not provided, the default value of 0.7 is used.
- return_embedding: if set to True, the model returns the embeddings of the sound files. If not provided, the default value of False is used.
Return values are as follows:
- embedding1: embedding of sound_file1, returned if return_embedding is set to True.
- embedding2: embedding of sound_file2, returned if return_embedding is set to True.
- similarity: cosine similarity between the two embeddings.
- verification_result: verification result obtained by comparing similarity against threshold.
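The decision step described above can be sketched in a few lines of Python. This is only an illustration of cosine-similarity thresholding, not the model's actual implementation: the function names `cosine_similarity` and `verify` are hypothetical, the embeddings are assumed to be already extracted as plain lists of floats, and treating a score equal to the threshold as a match (`>=`) is an assumption, since the description does not say how ties are handled.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def verify(embedding1, embedding2, threshold=0.7, return_embedding=False):
    """Hypothetical helper mirroring the documented return values."""
    similarity = cosine_similarity(embedding1, embedding2)
    result = {
        "similarity": similarity,
        # Assumption: a score equal to the threshold counts as a match.
        "verification_result": similarity >= threshold,
    }
    if return_embedding:
        result["embedding1"] = list(embedding1)
        result["embedding2"] = list(embedding2)
    return result


# Toy vectors standing in for real speaker embeddings.
same_speaker = verify([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
different_speaker = verify([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], return_embedding=True)
```

With these toy inputs, identical vectors give a similarity of about 1 and pass the default threshold of 0.7, while orthogonal vectors give a similarity of 0 and fail it. Raising the threshold makes verification stricter (fewer false accepts, more false rejects), and lowering it does the opposite.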
References
@INPROCEEDINGS{9746806,
author={Koluguri, Nithin Rao and Park, Taejin and Ginsburg, Boris},
booktitle={ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
title={TitaNet: Neural Model for Speaker Representation with 1D Depth-Wise Separable Convolutions and Global Context},
year={2022},
pages={8102-8106},
doi={10.1109/ICASSP43922.2022.9746806}}