Video Transcription using Azure Speech and MoviePy

Video content is becoming more prevalent than ever before. However, extracting valuable information from videos can be time-consuming and challenging. That’s where Azure Speech, a powerful service provided by Azure AI Services, comes into play. Azure Speech offers state-of-the-art speech recognition capabilities, allowing us to transcribe spoken words in videos accurately and easily.

This blog post explores the seamless integration of Azure Speech and MoviePy, a popular Python library for video editing and manipulation, to effortlessly transcribe videos into text files. By combining these two technologies, you can automate the transcription process and extract valuable textual information from your video content in just a few simple steps.

Why use Azure Speech and MoviePy?

Azure Speech’s key features include its speech-to-text capabilities and its compatibility with various audio and video formats. It also brings the broader benefits of Azure AI Services, such as robustness, scalability, and ease of integration.

Next, the blog delves into MoviePy, providing an overview of its capabilities and its ability to handle various video file formats. It explains how MoviePy can be leveraged to extract audio from videos, an essential step for the subsequent transcription process.

Preview of Instructions for Setting Up Azure Speech with MoviePy

Instructions are provided on the necessary setup steps, including creating an Azure Speech resource in the Azure portal and installing the required Python packages, such as the Azure SDK and MoviePy library.

With the setup complete, the blog presents a step-by-step walkthrough of the transcription process. It covers loading the video using MoviePy, extracting the audio track, and sending the audio data to Azure Speech for transcription. It explains how Azure Speech leverages advanced machine learning models to accurately convert spoken words into text.

The combination of Azure Speech and MoviePy can simplify the process of transcribing videos into text files. By harnessing the power of Azure’s speech recognition capabilities and the versatility of MoviePy, users can efficiently extract and utilize textual information from their video assets.

With the detailed walkthrough and insights provided, readers will be empowered to leverage these technologies, enabling them to save time, increase productivity, and unlock the hidden potential of their video content.



Note: Azure AI Services was previously known as Azure Cognitive Services.

Azure Speech, a quick intro:

Azure Speech is a powerful service offered by Azure AI Services that provides advanced speech recognition capabilities. It is designed to convert spoken language into written text, enabling developers to transcribe audio and video content with remarkable accuracy. With Azure Speech, you can leverage cutting-edge machine learning models to extract meaningful insights from your audio and video assets.

One of its key features is its robust speech-to-text capability, which accurately converts spoken words into text, making it ideal for tasks such as transcription, voice-controlled applications, and voice assistants. Additionally, Azure Speech is compatible with a wide range of audio and video formats, allowing you to seamlessly process and transcribe content from various sources. This compatibility ensures flexibility and ease of integration, enabling you to harness the power of speech recognition across diverse applications and industries.

MoviePy, a quick intro:

MoviePy is a versatile Python library that empowers developers with a wide range of video editing and manipulation capabilities. It simplifies the process of working with video files, enabling tasks such as video trimming, concatenation, effects application, and much more.

With MoviePy, you can effortlessly extract audio tracks from videos, a crucial step for transcription purposes. It supports a variety of video file formats, including popular ones like MP4, AVI, and MOV, ensuring compatibility with a broad range of video sources. MoviePy’s intuitive API and comprehensive documentation make it accessible for both beginners and advanced users, facilitating video processing tasks with ease. Its extensibility and flexibility make it a go-to choice for video-related operations in Python projects, allowing developers to efficiently handle video files and seamlessly integrate them into their workflows.

Step-by-Step Guide:

Use MoviePy to extract audio from video:

Step 1. Create a VNet.

  • Go to the Azure portal (https://portal.azure.com)
  • Log in if you already have an account; otherwise, create one
  • Create a new subscription
  • Create a new Azure VNet by using an ARM template or by creating one manually. Azure ARM (Azure Resource Manager) templates allow you to create and deploy an entire Azure infrastructure declaratively.
  • Create a subnet where the VM with MoviePy will reside.
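The declarative approach mentioned above can be illustrated with a minimal ARM template skeleton, shown here as a Python dictionary that you could serialize to JSON for deployment. The resource names, address ranges, and apiVersion below are illustrative assumptions, not values prescribed by this walkthrough:

```python
import json

# Minimal ARM template skeleton defining a VNet with one subnet.
# All names, address prefixes, and the apiVersion are illustrative.
vnet_template = {
    "$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentTemplate.json#",
    "contentVersion": "1.0.0.0",
    "resources": [
        {
            "type": "Microsoft.Network/virtualNetworks",
            "apiVersion": "2023-04-01",  # assumed; use a current apiVersion
            "name": "moviepy-vnet",
            "location": "[resourceGroup().location]",
            "properties": {
                "addressSpace": {"addressPrefixes": ["10.0.0.0/16"]},
                "subnets": [
                    {
                        "name": "moviepy-subnet",  # subnet for the MoviePy VM
                        "properties": {"addressPrefix": "10.0.0.0/24"},
                    }
                ],
            },
        }
    ],
}

# Serialize to JSON so it can be deployed via the portal, CLI, or SDK.
template_json = json.dumps(vnet_template, indent=2)
```

Saving `template_json` to a `.json` file gives you a template you can deploy from the portal or automate later.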

Step 2. Azure Portal – Create a VM that will host MoviePy.

  • Create a new resource: Click on the “Create a resource” button on the left-hand side of the Azure portal.
  • Create a virtual machine (VM) in Azure:
    • Select Virtual Machine: In the “New” pane, search for “Virtual Machine” and select “Virtual Machine” from the list of available options.
    • Choose a deployment option: Azure offers two deployment models for virtual machines: Resource Manager and Classic. Choose the Resource Manager deployment model for the latest features and capabilities.
    • Configure the basics: Provide the necessary information such as the subscription, resource group, and virtual machine name. Select the region where you want to deploy the VM.
    • Select an image: Choose an operating system image for your VM, such as Windows or Linux. You can select from a range of pre-configured images available in the Azure Marketplace.
    • Choose a size: Select the size of the VM based on your requirements, considering factors such as CPU, memory, and storage capacity.
    • Configure optional features: Customize additional settings like networking, storage, availability options, and management options according to your needs. You can also configure advanced settings if required.
    • Set up authentication: Specify the username and password or SSH key for the VM’s login credentials. This will be used to access the VM remotely.
    • Review and create: Double-check all the configurations you have made and click on the “Create” button to start the deployment process.
    • Monitor deployment: Azure will begin provisioning the VM based on your specifications. You can monitor the progress in the Azure Portal.
    • Access and manage the VM: Once the deployment is complete, you can access and manage the VM through the Azure Portal, Azure CLI, PowerShell, or any other preferred method.
  • Connect to the VM using SSH to update it and install the software in Step 3.

Step 3. VM setup and software installation:

  • Update the virtual machine:
    • sudo apt-get update
  • Install Python pip:
    • sudo apt-get install python3-pip
  • Install MoviePy:
    • sudo pip3 install moviepy
  • Install the Azure Speech SDK:
    • sudo pip3 install azure-cognitiveservices-speech
  • Create a file using your favorite editor (I will use vi):
    • sudo vi YOURSCRIPT.py

from moviepy.editor import VideoFileClip

video_file = "YOURMOVIE FILE"
output_file = "YOURMOVIE FILE.wav"

# Load the video clip
video = VideoFileClip(video_file)

# Extract the audio from the video
audio = video.audio

# Set the desired audio parameters
audio_params = {
    "codec": "pcm_s16le",
    "fps": 16000,  # Set the desired sampling rate: 16000 Hz
    # "fps": 8000,  # Alternatively, set the sampling rate to 8000 Hz
    "ffmpeg_params": ["-ac", "1"],  # Mono audio (write_audiofile has no nchannels option)
    "bitrate": "16k",  # Set the desired bitrate
}

# Write the audio track to the WAV file
audio.write_audiofile(output_file, **audio_params)
  • The audio parameters are very important: the fps (which here is the audio sampling rate, in samples per second) and bitrate must be set so that they are compatible with the Azure Speech service.
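Before sending the extracted audio to Azure Speech, it is worth verifying that the WAV file actually has the parameters you asked MoviePy for. Below is a small self-contained sketch using Python’s standard wave module; it generates a short silent clip so it can run without a real extraction (with a real run you would point it at your extracted .wav file instead). As an aside, for uncompressed PCM the effective bit rate is fixed by sample rate × bit depth × channels (16 kHz × 16-bit × mono = 256 kbit/s), so the bitrate setting matters mainly for compressed codecs.

```python
import wave
import struct

def check_wav_params(path, expected_rate=16000, expected_channels=1):
    """Return True if the WAV file matches the rate/channel count Azure Speech expects."""
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        channels = wf.getnchannels()
        sample_width = wf.getsampwidth()  # bytes per sample; 2 for pcm_s16le
    return rate == expected_rate and channels == expected_channels and sample_width == 2

# Self-contained demo: write one second of 16 kHz mono 16-bit silence,
# then verify it. With a real extraction, check your extracted .wav instead.
demo_path = "demo_silence.wav"
with wave.open(demo_path, "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)
    wf.setframerate(16000)
    wf.writeframes(struct.pack("<h", 0) * 16000)

print(check_wav_params(demo_path))  # → True
```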

Step 4. Run the file and extract audio:

  • sudo python3 YOURSCRIPT.py


Step 5. Setup Azure Speech Service:

  • Sign in to the Azure Portal: Go to the Azure Portal website (https://portal.azure.com) and sign in using your Azure account credentials.
  • Create a new resource: Click on the “Create a resource” button (+) on the left-hand side of the Azure Portal.
  • Search and select Speech service: In the “New” pane, search for “Speech” and select “Speech” from the list of available options.
  • Configure the basics: Provide the necessary information such as the subscription, resource group, and speech service name. Select the region where you want to deploy the service.
  • Choose pricing tier: Select the pricing tier based on your requirements. Azure Speech Service offers different tiers with varying capabilities and pricing.
  • Configure additional settings: Customize additional settings such as the number of concurrent requests, location, and storage options. You can also configure advanced settings if required.
  • Review and create: Double-check all the configurations you have made and click on the “Create” button to start the deployment process.
  • Access and manage the Speech service: Once the deployment is complete, you can access and manage the Speech service through the Azure Portal.
  • Obtain the subscription key: To use the Speech service, you will need a subscription key. In the Azure Portal, navigate to the Speech service you created and find the “Keys and Endpoint” section. Obtain the subscription key from there. (You will need this key, along with the region, for the Python script.)
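Rather than pasting the subscription key directly into the script (as the placeholder in Step 6 does), a common practice is to read it from environment variables so the key never lands in source control. A minimal sketch, assuming the variable names SPEECH_KEY and SPEECH_REGION, which are my own convention rather than anything required by the SDK:

```python
import os

def load_speech_credentials():
    """Read the Azure Speech key and region from environment variables.

    SPEECH_KEY / SPEECH_REGION are assumed variable names; pick any you like.
    Raises RuntimeError if either is missing, so a typo fails fast.
    """
    key = os.environ.get("SPEECH_KEY")
    region = os.environ.get("SPEECH_REGION")
    if not key or not region:
        raise RuntimeError("Set SPEECH_KEY and SPEECH_REGION before running the script.")
    return key, region

# Usage in the transcription script:
#   speech_key, service_region = load_speech_credentials()
#   speech_config = speechsdk.SpeechConfig(subscription=speech_key, region=service_region)
```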




Step 6. Configure script and run the transcription job:

import azure.cognitiveservices.speech as speechsdk
import time

# Set up the Azure Speech configuration
speech_key = "YOURKEY"
service_region = "YOUR REGION"
speech_config = speechsdk.SpeechConfig(subscription=speech_key, region=service_region)

# Set the audio file path
audio_file = "FreddyDubon.wav"

# Set up the audio configuration
audio_config = speechsdk.audio.AudioConfig(filename=audio_file)

# Create a speech recognizer object
speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)

# Create an empty list to store the transcription results
transcriptions = []

# Track when the recognition session has finished
done = False

# Define an event handler for continuous recognition
def continuous_recognition_handler(evt):
    if evt.result.reason == speechsdk.ResultReason.RecognizedSpeech:
        transcriptions.append(evt.result.text)

# Define an event handler that flags the end of the session
def session_stopped_handler(evt):
    global done
    done = True

# Connect the handlers and start continuous recognition
speech_recognizer.recognized.connect(continuous_recognition_handler)
speech_recognizer.session_stopped.connect(session_stopped_handler)
speech_recognizer.start_continuous_recognition()

# Wait for the recognition to complete
timeout_seconds = 600  # Set a timeout value (in seconds) based on your audio file length
timeout_expiration = time.time() + timeout_seconds
while not done and time.time() < timeout_expiration:
    time.sleep(1)  # Adjust the sleep duration as needed

# Stop continuous recognition
speech_recognizer.stop_continuous_recognition()

# Combine transcriptions into a single string
transcription = " ".join(transcriptions)

# Write the transcription to a file
output_file = "transcription.txt"
with open(output_file, "w") as file:
    file.write(transcription)

print("Transcription saved to: " + output_file)

Step 7. Run the script and wait for the transcription text file to be created. Note: Depending on the size of the video, the transcription may take a while to process.
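Once transcription.txt exists, a quick sanity check can confirm that the job actually produced text. A minimal sketch (the statistics chosen here are arbitrary, and the demo writes a stand-in file so it can run on its own; with a real run you would point it at the transcription.txt written by the script above):

```python
def transcription_stats(path):
    """Return (word_count, char_count) for a transcription text file."""
    with open(path, encoding="utf-8") as f:
        text = f.read()
    words = text.split()
    return len(words), len(text)

# Self-contained demo with a stand-in file; with a real run, pass
# "transcription.txt" from the script above instead.
demo_path = "transcription_demo.txt"
with open(demo_path, "w", encoding="utf-8") as f:
    f.write("hello world this is a demo transcription")

words, chars = transcription_stats(demo_path)
print(words, chars)  # → 7 40
```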
