human voice activity
This is a lightweight neural network model that detects human speech segments in audio. Distinguishing speech from background noise is a fundamental task in most voice-related applications. The model is designed to work with very long audio files, such as movies and television shows. It runs faster than real time, processing a 30-minute audio stream in 8 seconds.
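To put the speed claim in concrete terms, the figures above can be turned into a real-time factor (processing time divided by audio duration). The snippet below is only a small arithmetic sketch: the function name `real_time_factor` is illustrative and not part of the model's API, and the result reflects only the two numbers quoted above, not any particular hardware.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Return processing time divided by audio duration.

    Values below 1.0 mean the audio is processed faster than real time.
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# Figures quoted in the description: a 30-minute stream processed in 8 seconds.
audio_seconds = 30 * 60
processing_seconds = 8

rtf = real_time_factor(processing_seconds, audio_seconds)
speedup = audio_seconds / processing_seconds

print(f"RTF = {rtf:.4f}, speedup = {speedup:.0f}x")
```

With these inputs the real-time factor is about 0.0044, which corresponds to processing roughly 225 seconds of audio per second of compute.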