So I’ve tested AWS Transcribe to create automatic subtitles for a video.
Setup complexity:
This is super easy: basically you create a single S3 bucket and voilĂ .
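If you want to do that step in code too, here is a minimal sketch with boto3 (the bucket name and region are placeholders, adjust to your account):

import boto3

# Sketch: create the bucket that will hold both the videos and the Transcribe output.
# "DOC-EXAMPLE-BUCKET" and the region are placeholders.
s3_client = boto3.client("s3", region_name="ap-southeast-2")
s3_client.create_bucket(
    Bucket="DOC-EXAMPLE-BUCKET",
    CreateBucketConfiguration={"LocationConstraint": "ap-southeast-2"},
)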
For the permissions, you will need:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "allowStartTranscriptionJob",
            "Effect": "Allow",
            "Action": [
                "transcribe:StartTranscriptionJob"
            ],
            "Resource": "*",
            "Condition": {
                "StringEquals": {
                    "transcribe:OutputKey": "DOC-EXAMPLE-BUCKET/prefix"
                }
            }
        },
        {
            "Sid": "allowS3",
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject"
            ],
            "Resource": "arn:aws:s3:::DOC-EXAMPLE-BUCKET/*transcribe*"
        }
    ]
}
Start a transcription job:
This is really super easy, as shown in this Lambda:
import boto3

# Get a Transcribe client
transcribe_client = boto3.client("transcribe")

# Start a job
transcribe_job = transcribe_client.start_transcription_job(
    # Important: the name must match the "^[A-Za-z0-9_.-]+" regex
    # and must be unique across the account
    TranscriptionJobName=event.object_key.split("/")[-1],
    LanguageCode="en-AU",  # You can use IdentifyLanguage=True instead
    MediaFormat=media_format,  # Not required
    Media={
        "MediaFileUri": f"s3://{event.bucket_name}/{event.object_key}",
    },
    OutputBucketName=event.bucket_name,
    OutputKey=output_key,  # Prefix for the output files
    Settings={
        "ShowSpeakerLabels": True,
        "MaxSpeakerLabels": 5,  # Mandatory if ShowSpeakerLabels is True
        "ShowAlternatives": False,
        # "MaxAlternatives": 3,  # Mandatory if ShowAlternatives is True
    },
    Subtitles={
        "Formats": [output_format],  # list of strings among "srt", "vtt"
    },
)
The full, up-to-date API details are in the AWS documentation.
Get the subtitle files:
Once the job completes, the subtitle files are available in the output bucket.
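A minimal sketch of waiting for the job and locating those files; the job name below is a placeholder matching the snippet above, and polling could be replaced with an EventBridge rule on job state changes:

import time
import boto3

transcribe_client = boto3.client("transcribe")

# Sketch only: poll until the job finishes ("my-video.mp4" is a placeholder job name).
job_name = "my-video.mp4"
while True:
    job = transcribe_client.get_transcription_job(TranscriptionJobName=job_name)
    status = job["TranscriptionJob"]["TranscriptionJobStatus"]
    if status in ("COMPLETED", "FAILED"):
        break
    time.sleep(10)

if status == "COMPLETED":
    # The S3 URIs of the generated subtitle files are returned on the job itself
    print(job["TranscriptionJob"]["Subtitles"]["SubtitleFileUris"])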
Note that there are other options, such as redacting PII, and providing custom vocabularies and models to refine the transcriptions you create.
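For example, PII redaction is just an extra parameter on the same start_transcription_job call. A minimal sketch, with placeholder names and an illustrative entity list:

# Sketch: same call as above, with content redaction enabled
transcribe_client.start_transcription_job(
    TranscriptionJobName="my-video-redacted",  # placeholder
    LanguageCode="en-AU",
    Media={"MediaFileUri": "s3://DOC-EXAMPLE-BUCKET/input/my-video.mp4"},  # placeholder
    OutputBucketName="DOC-EXAMPLE-BUCKET",
    ContentRedaction={
        "RedactionType": "PII",
        "RedactionOutput": "redacted",  # or "redacted_and_unredacted"
        "PiiEntityTypes": ["NAME", "EMAIL", "PHONE"],  # optional, omit to redact all PII types
    },
)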
Performance:
I have to admit that it performs quite well, I would say around a 90% hit rate when you speak loud and clear, with some issues on very short words and onomatopoeia.
Model Improvement:
I didn’t dive very deep into this, but from what I saw there are two features:
Custom Vocabulary
This is the easiest to set up: basically you provide the API with phrases for a specific language, or tables matching phrases to how they sound.
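A minimal sketch of the phrase-list variant, with an illustrative vocabulary name and phrases:

# Sketch: create a custom vocabulary from a simple phrase list
transcribe_client = boto3.client("transcribe")
transcribe_client.create_vocabulary(
    VocabularyName="my-project-vocabulary",  # placeholder
    LanguageCode="en-AU",
    Phrases=["Transcribe", "boto3", "onomatopoeia"],
)
# Then reference it in the job: Settings={"VocabularyName": "my-project-vocabulary", ...}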
Custom Model
This takes more effort to create: to put it simply, you will have to provide audio + transcripts to either train a new model or tune an existing one.
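I didn’t try it, but as a rough sketch the API call looks something like this (the model name, S3 location and role ARN are placeholders, and the training data has to be prepared in S3 beforehand):

# Sketch: kick off training of a custom language model from training data in S3
transcribe_client.create_language_model(
    LanguageCode="en-AU",
    BaseModelName="WideBand",  # or "NarrowBand" for telephony-style audio
    ModelName="my-custom-model",  # placeholder
    InputDataConfig={
        "S3Uri": "s3://DOC-EXAMPLE-BUCKET/training-data/",  # placeholder
        "DataAccessRoleArn": "arn:aws:iam::123456789012:role/transcribe-data-access",  # placeholder
    },
)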
Use cases identified
So I was mostly focused on how this can help with video subtitles, and even if it’s not perfect, I really think I’ll use it for the video projects I have; it’s important to me to provide subtitles.
I don’t see myself going through the hassle of training or tuning a custom model, but custom vocabularies look quite easy to set up and quite useful.
I’ve set up a small pipeline with CDK that takes files dropped into an S3 bucket and creates subtitles.
The big help it provides is getting mostly accurate subtitles with good timing; then I only have to tune them a bit.
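For reference, a minimal sketch of what that CDK wiring can look like (this is not my exact stack; the construct names, runtime and Lambda asset path are assumptions):

from aws_cdk import Stack, aws_iam as iam, aws_lambda as _lambda, aws_s3 as s3, aws_s3_notifications as s3n
from constructs import Construct


class SubtitlePipelineStack(Stack):
    """Sketch: S3 bucket -> Lambda that calls StartTranscriptionJob on new uploads."""

    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        bucket = s3.Bucket(self, "MediaBucket")

        # Lambda holding the start_transcription_job snippet shown above.
        # The asset path "lambda/" is a placeholder.
        handler = _lambda.Function(
            self, "StartTranscriptionFn",
            runtime=_lambda.Runtime.PYTHON_3_12,
            handler="index.handler",
            code=_lambda.Code.from_asset("lambda"),
        )

        # Let the function read the videos and write the subtitle files
        bucket.grant_read_write(handler)
        handler.add_to_role_policy(iam.PolicyStatement(
            actions=["transcribe:StartTranscriptionJob"],
            resources=["*"],
        ))

        # Trigger the function when a file lands under the input/ prefix
        bucket.add_event_notification(
            s3.EventType.OBJECT_CREATED,
            s3n.LambdaDestination(handler),
            s3.NotificationKeyFilter(prefix="input/"),
        )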
Cost
It’s cheap. Here is a quick idea with no custom model, assuming 4 h of video per week and being really irresponsible with storage by keeping 50 GB of video stored.
| Item name       | Quantity   | Currency | Cost (no VAT) |
|-----------------|------------|----------|---------------|
| S3 Storage      | 50 GB      | USD      | 1.20          |
| Transcribe Jobs | 4 h        | USD      | 5.76          |
| Lambdas         | <1,000,000 | –        | –             |
| Notifications   | <1,000,000 | –        | –             |
| Total           |            | USD      | 6.96          |
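The back-of-the-envelope maths behind that table, assuming batch transcription at roughly 0.024 USD/minute and S3 Standard at roughly 0.024 USD/GB-month (both vary per region):

# Sketch: rough estimate matching the table above; prices are assumptions
transcribe_price_per_min = 0.024  # USD, standard batch tier
s3_price_per_gb_month = 0.024     # USD, S3 Standard

transcribe_cost = 4 * 60 * transcribe_price_per_min  # 4 h of audio  -> 5.76 USD
storage_cost = 50 * s3_price_per_gb_month            # 50 GB stored  -> 1.20 USD

print(round(transcribe_cost + storage_cost, 2))  # 6.96 USD; Lambdas and S3 notifications stay in the free tier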