Automatic speech recognition (ASR) is a commonly used machine learning (ML) technology in our daily lives and business scenarios. Applications such as voice-controlled assistants like Alexa and Siri, and voice-to-text applications like automatic subtitling for videos and transcribing meetings, are all powered by this technology. These applications take audio clips as input and convert speech signals to text, and are also referred to as speech-to-text applications.
This technology has matured in recent years, and many of the latest models achieve very good performance, such as the transformer-based models Wav2Vec2 and Speech2Text. Transformer is a sequence-to-sequence deep learning architecture originally proposed for machine translation. It has since been extended to solve all kinds of natural language processing (NLP) tasks, such as text classification, text summarization, and ASR. The transformer architecture yields excellent model performance and results on various NLP tasks; however, the models' sizes (the number of parameters) as well as the amount of data they're pre-trained on increase exponentially when pursuing better performance. It becomes very time-consuming and costly to train a transformer from scratch; for example, training a BERT model from scratch can take 4 days and cost $6,912 (for more information, see The Staggering Cost of Training SOTA AI Models). Hugging Face, an AI company, provides an open-source platform where developers can share and reuse thousands of pre-trained transformer models. With the transfer learning technique, you can fine-tune your model with a small set of labeled data for a target use case. This reduces the overall compute cost, speeds up the development lifecycle, and lessens the carbon footprint of the community.
AWS announced a collaboration with Hugging Face in 2021. Developers can easily work with Hugging Face models on Amazon SageMaker and get the best of both worlds. You can fine-tune and optimize all models from Hugging Face, and SageMaker provides managed training and inference services that offer high-performance resources and high scalability via the Amazon SageMaker distributed training libraries. This collaboration can help you accelerate the productization journey of your NLP tasks and realize business benefits.
This post shows how to use SageMaker to easily fine-tune the latest Wav2Vec2 model from Hugging Face, and then deploy the model with a custom-defined inference process to a SageMaker managed inference endpoint. Finally, you can test the model performance with sample audio clips, and review the corresponding transcription as output.
Wav2Vec2 background
Wav2Vec2 is a transformer-based architecture for ASR tasks and was released in September 2020. The following diagram shows its simplified architecture. For more details, see the original paper. As the diagram shows, the model is composed of a multi-layer convolutional neural network (CNN) as a feature extractor, which takes an input audio signal and outputs audio representations, also considered as features. They are fed into a transformer network to generate contextualized representations. This part of training can be self-supervised; the transformer can be trained with unlabeled speech and learn from it. Then the model is fine-tuned on labeled data with the Connectionist Temporal Classification (CTC) algorithm for specific ASR tasks. The base model we use in this post is Wav2Vec2-Base-960h, fine-tuned on 960 hours of Librispeech on 16 kHz sampled speech audio.
CTC is a character-based algorithm. During training, it's able to demarcate each character of the transcription in the speech automatically, so timeframe alignment between the audio signal and the transcription isn't required. For example, if the audio clip says "Hello World," we don't need to know in which second the word "hello" is located. This saves a lot of labeling effort for ASR use cases. For more information about how the algorithm works, refer to Sequence Modeling With CTC.
Solution overview
In this post, we use the SUPERB (Speech processing Universal PERformance Benchmark) dataset available from the Hugging Face Datasets library, fine-tune the Wav2Vec2 model, and deploy it as a SageMaker endpoint for real-time inference for an ASR task. SUPERB is a leaderboard to benchmark the performance of a shared model across a wide range of speech processing tasks.
The following diagram provides a high-level view of the solution workflow.
First, we show how to load and preprocess the SUPERB dataset in a SageMaker environment in order to obtain a tokenizer and feature extractor, which are required for fine-tuning the Wav2Vec2 model. Then we use SageMaker script mode for the training and inference steps, which allows you to define and use custom training and inference scripts, while SageMaker provides supported Hugging Face framework Docker containers. For more information about training and serving Hugging Face models on SageMaker, see Use Hugging Face with Amazon SageMaker. This functionality is available through the development of Hugging Face AWS Deep Learning Containers (DLCs).
The notebook and code from this post are available on GitHub. The notebook is tested in both Amazon SageMaker Studio and SageMaker notebook environments.
Data preprocessing
In this section, we walk through the steps to preprocess the data.
Process the dataset
In this post we use the SUPERB dataset, which you can load from the Hugging Face Datasets library directly using the load_dataset function. The SUPERB dataset also includes speaker_id and chapter_id; we remove these columns and keep only the audio files and transcriptions to fine-tune the Wav2Vec2 model for an ASR task, which transcribes speech to text. To speed up the fine-tuning process for this example, we only take the test dataset from the original dataset, then split it into train and test datasets. See the following code:
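The following is a minimal sketch of this step; the dataset configuration name ("asr"), the columns removed, and the 90/10 split ratio are assumptions here, so check the notebook in the GitHub repo for the exact arguments used:

```python
from datasets import load_dataset

# Load only the test split of the SUPERB ASR task to keep this example fast
# (assumption: the "asr" configuration of the superb dataset).
data = load_dataset("superb", "asr", split="test")

# Keep only the audio files and transcriptions needed for speech-to-text fine-tuning.
data = data.remove_columns(["speaker_id", "chapter_id"])

# Split the remaining data into train and test sets (assumed 90/10 ratio).
data = data.train_test_split(test_size=0.1)
train_dataset = data["train"]
test_dataset = data["test"]
```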
After we process the data, the dataset structure is as follows:
Let's print one data point from the train dataset and examine the information in each feature. 'file' is the audio file path where it's saved and cached in the local repository. 'audio' contains three parts: 'path' is the same as 'file', 'array' is the numerical representation of the raw waveform of the audio file in NumPy array format, and 'sampling_rate' shows the number of samples of audio recorded every second. 'text' is the transcript of the audio file.
Build a vocabulary file
The Wav2Vec2 model uses the CTC algorithm to train deep neural networks on sequence problems, and its output is a single letter or blank. It uses a character-based tokenizer. Therefore, we extract the distinct letters from the dataset and build the vocabulary file using the following code:
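A sketch of this step, following the common Hugging Face recipe for CTC fine-tuning, might look like the following; the special token names ([UNK], [PAD]) and the word delimiter token are assumptions:

```python
import json

def extract_all_chars(batch):
    # Concatenate all transcriptions in the batch and return the set of unique characters.
    all_text = " ".join(batch["text"])
    return {"vocab": [list(set(all_text))], "all_text": [all_text]}

# Run over the whole train split in a single batch to collect every distinct character.
vocab_result = train_dataset.map(
    extract_all_chars,
    batched=True,
    batch_size=-1,
    keep_in_memory=True,
    remove_columns=train_dataset.column_names,
)

# Build a character-to-index mapping.
vocab_dict = {char: idx for idx, char in enumerate(sorted(set(vocab_result["vocab"][0])))}

# Make the word boundary visible and add the CTC-specific tokens.
vocab_dict["|"] = vocab_dict[" "]
del vocab_dict[" "]
vocab_dict["[UNK]"] = len(vocab_dict)
vocab_dict["[PAD]"] = len(vocab_dict)

with open("vocab.json", "w") as f:
    json.dump(vocab_dict, f)
```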
Create a tokenizer and feature extractor
The Wav2Vec2 model contains a tokenizer and a feature extractor. In this step, we use the vocab.json file that we created in the previous step to create the Wav2Vec2CTCTokenizer. We use Wav2Vec2FeatureExtractor to make sure that the dataset used for fine-tuning has the same audio sampling rate as the dataset used for pre-training. Finally, we create a Wav2Vec2 processor that wraps the feature extractor and the tokenizer into one single processor. See the following code:
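A minimal sketch of this step, assuming the 16 kHz sampling rate of the pre-trained model and the special tokens defined in the vocabulary file:

```python
from transformers import Wav2Vec2CTCTokenizer, Wav2Vec2FeatureExtractor, Wav2Vec2Processor

# Character-level tokenizer built from the vocabulary file created in the previous step.
tokenizer = Wav2Vec2CTCTokenizer(
    "vocab.json",
    unk_token="[UNK]",
    pad_token="[PAD]",
    word_delimiter_token="|",
)

# Feature extractor with a 16 kHz sampling rate, matching the pre-training data.
feature_extractor = Wav2Vec2FeatureExtractor(
    feature_size=1,
    sampling_rate=16000,
    padding_value=0.0,
    do_normalize=True,
    return_attention_mask=False,
)

# Wrap the feature extractor and tokenizer into a single processor object.
processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)
```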
Prepare the train and test datasets
Next, we extract the array representation of the audio files and their sampling_rate from the dataset and process them using the processor, in order to have train and test data that can be consumed by the model:
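The following sketch shows one way to do this with datasets.map, using the processor and datasets created above; as_target_processor() is the label-encoding context available in the Transformers versions current at the time of this post:

```python
def prepare_dataset(batch):
    audio = batch["audio"]
    # Compute model inputs from the raw waveform at the audio's sampling rate.
    batch["input_values"] = processor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_values[0]
    # Encode the transcription into label ids with the character tokenizer.
    with processor.as_target_processor():
        batch["labels"] = processor(batch["text"]).input_ids
    return batch

train_dataset = train_dataset.map(prepare_dataset, remove_columns=train_dataset.column_names)
test_dataset = test_dataset.map(prepare_dataset, remove_columns=test_dataset.column_names)
```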
Then we upload the train and test data to Amazon Simple Storage Service (Amazon S3) using the following code:
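A sketch of the upload step, assuming the default SageMaker bucket and a hypothetical S3 prefix; the notebook may structure the paths differently:

```python
import sagemaker

sess = sagemaker.Session()
bucket = sess.default_bucket()      # assumption: the default SageMaker bucket
prefix = "huggingface-wav2vec2"     # assumption: S3 prefix for this example

# Save the processed datasets to disk, then upload them to S3.
train_dataset.save_to_disk("./train_dataset")
test_dataset.save_to_disk("./test_dataset")

train_input_path = sess.upload_data("./train_dataset", bucket=bucket, key_prefix=f"{prefix}/train")
test_input_path = sess.upload_data("./test_dataset", bucket=bucket, key_prefix=f"{prefix}/test")

# Also upload the vocabulary file so the training script can fetch it via the custom
# vocab_url parameter used later (assumption about how the script consumes it).
vocab_url = sess.upload_data("vocab.json", bucket=bucket, key_prefix=prefix)
```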
Fine-tune the Hugging Face model (Wav2Vec2)
We use SageMaker Hugging Face DLCs in script mode to construct the training and inference jobs, which allows you to write custom training and serving code while using the Hugging Face framework containers that are maintained and supported by AWS.
When we create a training job using script mode, the entry_point script, hyperparameters, its dependencies (inside requirements.txt), and input data (train and test datasets) are copied into the container. SageMaker then invokes the entry_point training script, where the train and test datasets are loaded, the training steps are performed, and the model artifacts are saved in /opt/ml/model in the container. After training, artifacts in this directory are uploaded to Amazon S3 for later model hosting.
You can check the training script in the GitHub repo, in the scripts/ directory.
Create an estimator and start a training job
We use the Hugging Face estimator class to train our model. When creating the estimator, you need to specify the following parameters:
- entry_point – The name of the training script. It loads data from the input channels, configures training with hyperparameters, trains a model, and saves the model.
- source_dir – The location of the training scripts.
- transformers_version – The Hugging Face Transformers library version we want to use.
- pytorch_version – The PyTorch version that's compatible with the Transformers library.
For this use case and dataset, we use one ml.p3.2xlarge instance, and the training job is able to finish in around 2 hours. You can select a more powerful instance with more memory and GPU to reduce the training time; however, it incurs more cost.
When you create a Hugging Face estimator, you can configure hyperparameters and provide a custom parameter to the training script, such as vocab_url in this example. Also, you can specify metrics in the estimator so that SageMaker parses the training logs for these metrics and sends them to Amazon CloudWatch, where you can monitor and track the training performance. For more details, see Monitor and Analyze Training Jobs Using Amazon CloudWatch Metrics.
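The following sketch pulls these pieces together, building on the variables from the upload step. The script name, hyperparameter names and values, framework versions, channel names, and metric regex patterns are assumptions made for illustration; refer to the notebook for the exact configuration:

```python
import sagemaker
from sagemaker.huggingface import HuggingFace

# Hyperparameters passed to the entry_point script (names and values are assumptions).
hyperparameters = {
    "epochs": 10,
    "train_batch_size": 8,
    "model_name": "facebook/wav2vec2-base-960h",
    "vocab_url": vocab_url,   # custom parameter pointing to the uploaded vocab.json
}

# Regex patterns that let SageMaker parse metrics from the training logs into CloudWatch
# (assumed to match the Transformers Trainer log format).
metric_definitions = [
    {"Name": "eval_loss", "Regex": "'eval_loss': ([0-9.]+)"},
    {"Name": "eval_wer", "Regex": "'eval_wer': ([0-9.]+)"},
]

huggingface_estimator = HuggingFace(
    entry_point="train.py",          # assumption: training script name in scripts/
    source_dir="./scripts",
    instance_type="ml.p3.2xlarge",
    instance_count=1,
    role=sagemaker.get_execution_role(),
    transformers_version="4.12",     # assumption: versions supported by the Hugging Face DLC
    pytorch_version="1.9",
    py_version="py38",
    hyperparameters=hyperparameters,
    metric_definitions=metric_definitions,
)

# Start the training job with the train and test channels uploaded earlier.
huggingface_estimator.fit({"train": train_input_path, "test": test_input_path})
```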
In the following figure of CloudWatch training job logs, you can see that, after 10 epochs of training, the model evaluation metric WER (word error rate) reaches around 0.17 on the subset of the SUPERB dataset. WER is a commonly used metric to evaluate speech recognition model performance, and the objective is to minimize it. You can increase the number of epochs or use the full SUPERB dataset to improve the model further.
Deploy the model as an endpoint on SageMaker and run inference
In this section, we walk through the steps to deploy the model and perform inference.
Inference script
We use the SageMaker Hugging Face Inference Toolkit to host our fine-tuned model. It provides default functions for preprocessing, predicting, and postprocessing for certain tasks. However, the default functions can't run inference for our model properly. Therefore, we define the custom functions model_fn(), input_fn(), predict_fn(), and output_fn() in the inference.py script to override the default settings with our custom requirements. For more details, refer to the GitHub repo.
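The following is a minimal sketch of what these handlers could look like; the JSON payload format is an assumption for illustration, and the actual implementation is in the scripts/ directory of the GitHub repo:

```python
# inference.py (sketch): custom handlers for the SageMaker Hugging Face Inference Toolkit.
import json
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

def model_fn(model_dir):
    # Load the fine-tuned model and processor from the model artifacts directory.
    processor = Wav2Vec2Processor.from_pretrained(model_dir)
    model = Wav2Vec2ForCTC.from_pretrained(model_dir)
    return model, processor

def input_fn(request_body, content_type):
    # Expect a JSON payload containing the raw waveform as a list of floats (assumption).
    data = json.loads(request_body)
    return data["speech_array"]

def predict_fn(speech_array, model_and_processor):
    model, processor = model_and_processor
    inputs = processor(speech_array, sampling_rate=16000, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    predicted_ids = torch.argmax(logits, dim=-1)
    # Decode the predicted token ids back into a transcription string.
    return processor.batch_decode(predicted_ids)[0]

def output_fn(prediction, accept):
    return json.dumps({"transcription": prediction})
```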
As of January 2022, the Inference Toolkit can run inference for tasks from architectures that end with 'TapasForQuestionAnswering', 'ForQuestionAnswering', 'ForTokenClassification', 'ForSequenceClassification', 'ForMultipleChoice', 'ForMaskedLM', 'ForCausalLM', 'ForConditionalGeneration', 'MTModel', 'EncoderDecoderModel', 'GPT2LMHeadModel', and 'T5WithLMHeadModel'. The Wav2Vec2 model is not currently supported.
You can check the full inference script in the GitHub repo, in the scripts/ directory.
Create a Hugging Face model from the estimator
We use the Hugging Face Model class to create a model object, which you can deploy to a SageMaker endpoint. When creating the model, specify the following parameters:
- entry_point – The name of the inference script. The methods defined in the inference script are implemented by the endpoint.
- source_dir – The location of the inference scripts.
- transformers_version – The Hugging Face Transformers library version we want to use. It should be consistent with the training step.
- pytorch_version – The PyTorch version that's compatible with the Transformers library. It should be consistent with the training step.
- model_data – The Amazon S3 location of the SageMaker model data .tar.gz file.
When you create a predictor by using the model.deploy function, you can change the instance count and instance type based on your performance requirements.
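A sketch of creating and deploying the model; the framework versions and the inference instance type are assumptions, and the estimator from the training step provides the model artifact location:

```python
import sagemaker
from sagemaker.huggingface import HuggingFaceModel

huggingface_model = HuggingFaceModel(
    entry_point="inference.py",                    # assumption: inference script in scripts/
    source_dir="./scripts",
    model_data=huggingface_estimator.model_data,   # S3 URI of the model.tar.gz from training
    role=sagemaker.get_execution_role(),
    transformers_version="4.12",                   # consistent with the training step (assumed)
    pytorch_version="1.9",
    py_version="py38",
)

# Deploy the model to a real-time endpoint; adjust instance count and type as needed.
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g4dn.xlarge",
)
```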
Inference audio files
After you deploy the endpoint, you can run prediction tests to check the model performance. You can download an audio file from the S3 bucket by using the following code:
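For example (the bucket name and object key below are placeholders, not the values from the notebook):

```python
import boto3

s3 = boto3.client("s3")
# Download a test clip that was uploaded to your bucket earlier (hypothetical key).
s3.download_file(bucket, f"{prefix}/sample_audio/sample1.flac", "sample1.flac")
```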
Alternatively, you can download a sample audio file to run the inference request:
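A sketch of an inference request against the deployed endpoint, assuming the JSON payload format defined in the custom input_fn above:

```python
import soundfile as sf

# Read the local audio file into a waveform array.
speech_array, sampling_rate = sf.read("sample1.flac")

# Send the waveform to the endpoint and print the returned transcription.
prediction = predictor.predict({"speech_array": speech_array.tolist()})
print(prediction)
```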
The predicted result is as follows:
Clean up
When you're finished using the solution, delete the SageMaker endpoint to avoid ongoing charges:
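For example:

```python
# Delete the endpoint created by model.deploy (and optionally the model and endpoint config).
predictor.delete_endpoint()
```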
Conclusion
In this post, we showed how to fine-tune the pre-trained Wav2Vec2 model on SageMaker using a Hugging Face estimator, and also how to host the model on SageMaker as a real-time inference endpoint using the SageMaker Hugging Face Inference Toolkit. For both the training and inference steps, we provided custom-defined scripts for greater flexibility, which are enabled and supported by the SageMaker Hugging Face DLCs. You can use the method from this post to fine-tune a Wav2Vec2 model with your own datasets, or to fine-tune and deploy a different transformer model from Hugging Face.
Check out the notebook and code of this project on GitHub, and let us know your comments. For more comprehensive information, see Hugging Face on SageMaker and Use Hugging Face with Amazon SageMaker.
In addition, Hugging Face and AWS announced a partnership in 2022 that makes it even easier to train Hugging Face models on SageMaker. This functionality is available through the development of Hugging Face AWS DLCs. These containers include the Hugging Face Transformers, Tokenizers, and Datasets libraries, which allow us to use these resources for training and inference jobs. For a list of the available DLC images, see Available Deep Learning Containers Images. They are maintained and regularly updated with security patches. You can find many examples of how to train Hugging Face models with these DLCs and the Hugging Face Python SDK in the following GitHub repo.
About the Author
Ying Hou, PhD, is a Machine Learning Prototyping Architect at AWS. Her main areas of interest are deep learning, computer vision, NLP, and time series data prediction. In her spare time, she enjoys reading novels and hiking in national parks in the UK.