adirik / titanet-large

Performs speaker identity verification


Run time and cost

This model runs on CPU hardware. Predictions typically complete within 4 minutes.

Readme

TitaNet-Large (en): A speaker representation model for speaker verification

TitaNet-Large is a speaker representation model for speaker verification. It is composed mainly of 1D depth-wise separable convolutions, residual connections, and Squeeze-and-Excitation layers, and has 25.3M parameters in total (~100 MB).
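The same checkpoint is also distributed through NVIDIA NeMo. As a minimal sketch, assuming the nemo_toolkit package is installed and the pretrained name titanet_large is still published on NGC (the WAV paths below are placeholders):

```python
import nemo.collections.asr as nemo_asr

# Load the pretrained TitaNet-Large speaker verification checkpoint.
speaker_model = nemo_asr.models.EncDecSpeakerLabelModel.from_pretrained(
    model_name="titanet_large"
)

# Extract a fixed-size speaker embedding from a 16 kHz mono WAV file
# ("speaker_a.wav" is a placeholder path).
embedding = speaker_model.get_embedding("speaker_a.wav")

# Decide whether two recordings come from the same speaker.
same_speaker = speaker_model.verify_speakers("speaker_a.wav", "speaker_b.wav")
print(same_speaker)
```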

API Usage

To use the model, simply provide two 16 kHz mono-channel sound files as input. The model returns the similarity score between the two speakers together with a verification result based on a cosine similarity threshold. Optionally, the embeddings of both files can be returned as well.

Input parameters are as follows:

- sound_file1: first 16 kHz mono-channel sound file.
- sound_file2: second 16 kHz mono-channel sound file.
- threshold: cosine similarity threshold; defaults to 0.7 if not provided.
- return_embedding: if set to True, the model also returns the embeddings of the sound files; defaults to False.

Return values are as follows:

- embedding1: embedding of sound_file1, returned if return_embedding is set to True.
- embedding2: embedding of sound_file2, returned if return_embedding is set to True.
- similarity: cosine similarity between the two embeddings.
- verification_result: verification decision based on the threshold.
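As a sketch of how the endpoint might be called with the Replicate Python client (the model identifier without a pinned version and the local file names are assumptions for illustration, not taken from this page):

```python
import replicate

# Open the two 16 kHz mono-channel recordings to compare.
# "speaker_a.wav" and "speaker_b.wav" are placeholder file names.
with open("speaker_a.wav", "rb") as f1, open("speaker_b.wav", "rb") as f2:
    output = replicate.run(
        "adirik/titanet-large",          # pin a specific version hash in production
        input={
            "sound_file1": f1,
            "sound_file2": f2,
            "threshold": 0.7,            # cosine similarity cutoff (default)
            "return_embedding": False,   # set True to also receive the embeddings
        },
    )

# The output contains the similarity score and the verification result,
# plus the two embeddings when return_embedding is True.
print(output)
```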

References

@INPROCEEDINGS{9746806,
  author={Koluguri, Nithin Rao and Park, Taejin and Ginsburg, Boris},
  booktitle={ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}, 
  title={TitaNet: Neural Model for Speaker Representation with 1D Depth-Wise Separable Convolutions and Global Context}, 
  year={2022},
  pages={8102-8106},
  doi={10.1109/ICASSP43922.2022.9746806}}