Foundational Concepts of Audio and Video Processing
#

Whether you are a multimedia professional or a deep learning engineer, if you work with audio and video you need to understand the core concepts behind them. This guide focuses on some of the key concepts of audio and video processing.

A digital microphone is a device that captures sound (air pressure variations) and converts it into digital signals. Internally, it contains an analog-to-digital converter (ADC) that performs the conversion from analog audio to digital audio.

A digital camera captures light (optical signals) through an image sensor (like CMOS or CCD), which generates analog electrical signals. These signals are then passed through an ADC to be converted into digital image data.

Let’s understand Image and Colors
#

An image is a grid of pixels, where each pixel represents the color and intensity of a small part of the image.

A black-and-white (grayscale) image typically has one channel (layer), where each pixel value ranges from 0 (black) to 255 (white).
A color image usually uses three channels (RGB): Red, Green, and Blue, each with a value between 0 and 255, so each pixel requires 3 bytes. If the image supports transparency, it includes a fourth channel called Alpha, which represents opacity; its value ranges from 0 (fully transparent) to 255 (fully opaque).
A pixel in an RGBA image takes 4 bytes (1 byte per channel).
A 1024×1024 RGB image has 2²⁰ pixels, and since each pixel uses 3 bytes, the total uncompressed size is 3 × 2²⁰ = 3,145,728 bytes (3 MB). With an alpha channel (RGBA) the same image takes 4 MB uncompressed, and as a grayscale image it takes only 1 MB. File formats such as PNG and JPEG compress this data, so the file on disk is usually smaller.
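
The arithmetic above can be checked with a few lines of Python. This is a minimal sketch; the function name `uncompressed_image_bytes` is made up for illustration, and it assumes 8 bits (1 byte) per channel.

```python
def uncompressed_image_bytes(width: int, height: int, channels: int,
                             bytes_per_channel: int = 1) -> int:
    """Size of the raw pixel buffer: one value per channel per pixel, no compression."""
    return width * height * channels * bytes_per_channel


# The 1024x1024 example from the text, for 1, 3 and 4 channels.
sizes = {
    "grayscale": uncompressed_image_bytes(1024, 1024, 1),
    "rgb": uncompressed_image_bytes(1024, 1024, 3),
    "rgba": uncompressed_image_bytes(1024, 1024, 4),
}

for name, size in sizes.items():
    print(f"{name}: {size:,} bytes = {size / 2**20:.0f} MB")
```

The same function works for any resolution, for example a 1920×1080 RGB frame.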

Let’s understand Sound and Air Pressure Signals.
#

Sound is a continuous analog signal created by air pressure variations. In digital audio, this signal is captured by sampling it at regular intervals.

A sample is one measurement of amplitude at a given point in time.
The number of samples per second is the sample rate (e.g., 44.1kHz means 44,100 samples per second).
Each sample can be stored using:
- 8-bit (1 byte) → 256 possible values
- 16-bit (2 bytes) → 65,536 values (more common for quality audio)
- 24-bit or 32-bit → used in high-fidelity applications
A sample can be stored in a single byte, but most modern audio uses 16-bit samples, meaning 2 bytes per sample.
Audio file size (uncompressed)
- Sample rate: 44,100 samples/sec
- Sample size: 16-bit (2 bytes); Audacity refers to this as the format
- Channels: 1 (mono, typical for voice recordings)
- Size per second = 44,100 samples × 2 bytes = 88,200 bytes ≈ 86.1 KB
- For stereo (2 channels), it doubles: 44,100 × 2 bytes × 2 channels = 176,400 bytes ≈ 172.3 KB
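
The same calculation as a small Python sketch. The helper name `pcm_bytes_per_second` is an assumption for illustration, and the result covers only the raw samples, not any file header.

```python
def pcm_bytes_per_second(sample_rate: int, bit_depth: int, channels: int) -> int:
    """Bytes of uncompressed PCM audio produced per second."""
    bytes_per_sample = bit_depth // 8
    return sample_rate * bytes_per_sample * channels


mono = pcm_bytes_per_second(44_100, 16, 1)
stereo = pcm_bytes_per_second(44_100, 16, 2)
one_minute_stereo = stereo * 60

print(f"mono:   {mono:,} bytes/s ({mono / 1024:.1f} KB)")
print(f"stereo: {stereo:,} bytes/s ({stereo / 1024:.1f} KB)")
print(f"one minute of stereo: {one_minute_stereo:,} bytes")
```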

🎧 Audio Processing Concepts
#

1. Sampling Rate
#

Number of samples taken per second (e.g., 44.1kHz, 16kHz).
Higher rate = more detail, bigger files.
Common: 44.1kHz (music), 16kHz (speech), 8kHz (telephony).

2. Bit Depth
#

Number of bits per sample (e.g., 16-bit, 24-bit).
Higher bit depth = more dynamic range.
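
As a rough illustration of "more dynamic range": for linear integer PCM, a common rule of thumb is about 6 dB per bit, which comes from 20·log10(2^bits). The sketch below assumes that idealized model; real converters and floating-point formats behave differently.

```python
import math


def dynamic_range_db(bit_depth: int) -> float:
    """Theoretical dynamic range of linear integer PCM: 20 * log10(2 ** bits)."""
    return 20 * math.log10(2 ** bit_depth)


for bits in (8, 16, 24):
    print(f"{bits}-bit: about {dynamic_range_db(bits):.1f} dB")
```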

3. Channels
#

Mono = 1 channel, Stereo = 2 channels.
Multichannel = 5.1, 7.1 surround sound.
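
Sampling rate, bit depth, and channel count are exactly the parameters a WAV file header records. The sketch below uses Python's standard `wave` module to write one second of a mono 16-bit tone into memory; the 440 Hz tone, the amplitude, and the helper name `sine_pcm16` are assumptions chosen for illustration.

```python
import io
import math
import struct
import wave

SAMPLE_RATE = 44_100  # samples per second
SAMPLE_WIDTH = 2      # bytes per sample (16-bit)
CHANNELS = 1          # mono
TONE_HZ = 440.0       # assumed test tone
DURATION_S = 1.0


def sine_pcm16(freq_hz: float, seconds: float, sample_rate: int,
               amplitude: float = 0.5) -> bytes:
    """Sample a sine wave and pack each sample as a little-endian signed 16-bit integer."""
    n_samples = int(seconds * sample_rate)
    peak = int(amplitude * 32767)
    return b"".join(
        struct.pack("<h", round(peak * math.sin(2 * math.pi * freq_hz * i / sample_rate)))
        for i in range(n_samples)
    )


buffer = io.BytesIO()
with wave.open(buffer, "wb") as wav:
    wav.setnchannels(CHANNELS)
    wav.setsampwidth(SAMPLE_WIDTH)
    wav.setframerate(SAMPLE_RATE)
    wav.writeframes(sine_pcm16(TONE_HZ, DURATION_S, SAMPLE_RATE))

wav_bytes = buffer.getvalue()

# Read the header back to confirm what was stored.
with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
    params = wav.getparams()

print(params.nchannels, params.sampwidth, params.framerate, params.nframes)
print(f"{len(wav_bytes):,} bytes in total (samples plus header)")
```

Writing to a file instead of memory only requires passing a path to `wave.open`.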

4. Bitrate
#

Data rate (e.g., 128kbps, 64kbps).
Affects audio quality and file size.
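
To make the link between bitrate and file size concrete, here is a small sketch that compares uncompressed CD-style PCM with a 128 kbps compressed stream. It assumes 1 kbps = 1,000 bits per second and ignores container overhead; the function names are illustrative.

```python
def pcm_bitrate_bps(sample_rate: int, bit_depth: int, channels: int) -> int:
    """Bitrate of uncompressed PCM in bits per second."""
    return sample_rate * bit_depth * channels


def size_bytes(bitrate_bps: int, seconds: int) -> int:
    """Approximate payload size for a stream at a given (average) bitrate."""
    return bitrate_bps * seconds // 8


cd_bitrate = pcm_bitrate_bps(44_100, 16, 2)   # stereo, 16-bit, 44.1 kHz
compressed_bitrate = 128_000                  # 128 kbps, as in the example above

ratio = cd_bitrate / compressed_bitrate
three_minutes = size_bytes(compressed_bitrate, 180)

print(f"PCM bitrate: {cd_bitrate:,} bps")
print(f"PCM is about {ratio:.1f}x the 128 kbps stream")
print(f"3 minutes at 128 kbps: {three_minutes:,} bytes")
```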

5. Audio Codecs
#

Compression algorithms (e.g., MP3, AAC, Opus, FLAC).
Lossy (MP3) vs Lossless (FLAC).

6. Waveform & Spectrogram
#

Visual tools to represent audio signals in time (waveform) or frequency (spectrogram).
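
A waveform is simply the list of samples over time; a spectrogram is built by splitting that list into short frames and computing the frequency content of each one. The sketch below computes the spectrum of a single frame with a naive discrete Fourier transform, purely for illustration. The sample rate, frame length, and test tone are assumptions, and real code would normally use an FFT routine such as `numpy.fft.rfft`.

```python
import cmath
import math


def magnitude_spectrum(frame, sample_rate):
    """Naive DFT of one frame; returns (frequency_hz, magnitude) pairs up to Nyquist."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(frame))
        spectrum.append((k * sample_rate / n, abs(coeff)))
    return spectrum


SAMPLE_RATE = 8_000   # assumed, low enough to keep the naive DFT fast
FRAME_LEN = 400       # 50 ms frame, so each bin is 20 Hz wide
TONE_HZ = 440.0       # assumed test tone

# Time domain: the waveform is the sampled sine itself.
waveform = [math.sin(2 * math.pi * TONE_HZ * i / SAMPLE_RATE) for i in range(FRAME_LEN)]

# Frequency domain: one column of a spectrogram.
spectrum = magnitude_spectrum(waveform, SAMPLE_RATE)
peak_hz, peak_magnitude = max(spectrum, key=lambda pair: pair[1])
print(f"strongest frequency in this frame: {peak_hz:.0f} Hz")
```

Repeating this for consecutive (usually overlapping) frames and stacking the results side by side gives the spectrogram image.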

7. Noise Reduction, Equalization, Normalization
#

Signal processing techniques for improving or adjusting audio.
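
Of these techniques, normalization is the simplest to show. The following is a minimal sketch of peak normalization on floating-point samples in the range -1.0 to 1.0; the function name and the sample values are made up for illustration, and loudness-based normalization works differently.

```python
def peak_normalize(samples, target_peak=1.0):
    """Scale all samples by one gain so the loudest sample reaches target_peak."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0:
        return list(samples)  # silence: nothing to scale
    gain = target_peak / peak
    return [s * gain for s in samples]


quiet = [0.0, 0.1, -0.25, 0.2]   # illustrative samples, peak at 0.25
loud = peak_normalize(quiet)
print(loud)
```

Because every sample is multiplied by the same gain, the shape of the waveform is unchanged; only its level rises.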

8. Decibel (dB)
#

A logarithmic unit that expresses the ratio between two signal levels. In digital audio, levels are often given in dBFS (decibels relative to full scale), where 0 dBFS is the loudest level the format can represent.

What is a movie?
#

A movie is a series of images (called frames) played in rapid succession to create the illusion of motion. The number of images shown per second is the frame rate; at 24 frames per second (24 fps), the human eye perceives continuous movement. Higher frame rates like 60 fps are often used for action sequences or sports, where smoother and more fluid motion is needed.

🎥 Video Processing Concepts
#

1. Resolution
#

Frame size: 1920x1080 (1080p), 1280x720 (720p), etc.
Higher res = sharper image + larger file size.

2. Frame Rate (fps)
#

Frames per second: 24, 30, 60 fps.
Higher = smoother motion, but more data. Fast action benefits from a high frame rate that captures every transition; slow-moving content such as nature scenes or documentaries works well at a lower frame rate.
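
To see why "more data" matters, the sketch below estimates the uncompressed data rate of 1080p video at the three frame rates listed. It assumes 3 bytes per pixel (8-bit RGB) and no compression; the function name is illustrative.

```python
def raw_video_bytes_per_second(width: int, height: int, fps: int,
                               bytes_per_pixel: int = 3) -> int:
    """Uncompressed video data rate: every frame stored in full."""
    return width * height * bytes_per_pixel * fps


rates = {fps: raw_video_bytes_per_second(1920, 1080, fps) for fps in (24, 30, 60)}

for fps, rate in rates.items():
    frame_time_ms = 1000 / fps  # how long each frame stays on screen
    print(f"{fps} fps: {rate:,} bytes/s, one frame every {frame_time_ms:.1f} ms")
```

Rates like these are the reason video is almost always stored and transmitted in compressed form.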

3. Aspect Ratio
#

Width to height ratio (e.g., 16:9, 4:3, 9:16 for vertical videos).
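
An aspect ratio is the resolution reduced to its lowest terms, which a greatest-common-divisor call does directly. A minimal sketch, with an illustrative function name:

```python
from math import gcd


def aspect_ratio(width: int, height: int) -> str:
    """Reduce width:height to lowest terms."""
    divisor = gcd(width, height)
    return f"{width // divisor}:{height // divisor}"


for width, height in [(1920, 1080), (1280, 720), (640, 480), (1080, 1920)]:
    print(f"{width}x{height} -> {aspect_ratio(width, height)}")
```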

4. Bitrate
#

Controls video quality & size (e.g., 1500kbps).
Can be constant (CBR) or variable (VBR).
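
Given a bitrate, the expected file size follows from the duration. The sketch below adds an audio track to the 1500 kbps video example; the 128 kbps audio bitrate and the 10-minute duration are assumptions, 1 kbps is taken as 1,000 bits per second, and container overhead is ignored. For VBR, the figure should be read as an average.

```python
def estimated_file_size_bytes(video_kbps: int, audio_kbps: int, seconds: int) -> int:
    """Approximate size of a file from its (average) video and audio bitrates."""
    total_bits_per_second = (video_kbps + audio_kbps) * 1000
    return total_bits_per_second * seconds // 8


ten_minutes = estimated_file_size_bytes(video_kbps=1500, audio_kbps=128, seconds=600)
print(f"about {ten_minutes:,} bytes ({ten_minutes / 1_000_000:.1f} MB)")
```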

🎞️ 5. Video Codecs
#

A codec (short for coder-decoder) is an algorithm that compresses and decompresses video to reduce file size without (too much) loss of quality.

👉 Summary:
#

| Codec | Compression | Support | Use Case |
| --- | --- | --- | --- |
| H.264 | Good | Universal | Standard streaming & recording |
| H.265 | Excellent | Limited (older devices may lag) | 4K, HEVC content |
| VP9 | Very Good | Web (Chrome/YouTube) | High-res web video |
| AV1 | Best | Growing (newer devices) | Future-proof streaming |

📦 6. Containers / Formats
#

A container is like a box that holds different types of data streams: video, audio, subtitles, metadata — all bundled into a single file.

Popular containers:

MP4: Most common, supports H.264, H.265. Compatible with nearly everything (browsers, phones, TVs).
MKV: Open-source, supports almost any codec, very flexible. Common for high-quality video (e.g., Blu-ray rips).
MOV: Apple’s container format. High quality, used in professional editing.
AVI: Older Microsoft format, less efficient, but still seen in legacy apps.

👉 Analogy:
Think of codec = language, and container = suitcase that holds the movie + soundtrack + extras.

🎨 7. Color Space & Chroma Subsampling
#

Color Space
#

A method of representing color. Common ones:
- RGB: Red-Green-Blue — used for monitors, raw images.
- YUV (or YCbCr): Used in video — separates luma (Y) = brightness, and chroma (U/V) = color.
  - Humans perceive brightness better than color, so we can compress color info more heavily — which brings us to…
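
The split into luma and chroma can be written out as a per-pixel conversion. The sketch below uses the full-range BT.601 coefficients found in JPEG/JFIF; this choice is an assumption, since video standards such as BT.709 use different weights and often a limited value range.

```python
def rgb_to_ycbcr(r: int, g: int, b: int) -> tuple:
    """Convert one 8-bit RGB pixel to full-range YCbCr (BT.601 / JFIF weights)."""
    y = 0.299 * r + 0.587 * g + 0.114 * b                # brightness
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b     # blue-difference chroma
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b     # red-difference chroma

    def clamp(value: float) -> int:
        return max(0, min(255, round(value)))

    return clamp(y), clamp(cb), clamp(cr)


for name, pixel in [("black", (0, 0, 0)), ("gray", (128, 128, 128)), ("white", (255, 255, 255))]:
    print(name, rgb_to_ycbcr(*pixel))
```

Colorless pixels (black, gray, white) all land on the neutral chroma value 128 and differ only in Y, which is what makes it possible to treat brightness and color separately.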

Chroma Subsampling
#

Technique to reduce file size by lowering color resolution while preserving brightness.
Formats:
- 4:4:4: no subsampling (full color + brightness)
- 4:2:2: color resolution halved horizontally
- 4:2:0: color resolution halved both horizontally and vertically; most common in compressed video (e.g., H.264)

✅ 4:2:0 means:

For every 2×2 block of 4 luma (Y) samples, only 1 chroma sample each is stored for U and V.
Result: Good visual quality + great compression.
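
The saving can be quantified by counting samples per frame. This sketch assumes 8-bit samples, even frame dimensions, and planar storage (one Y plane plus two chroma planes); the table and function names are illustrative.

```python
# (horizontal divisor, vertical divisor) applied to each of the two chroma planes
SUBSAMPLING = {"4:4:4": (1, 1), "4:2:2": (2, 1), "4:2:0": (2, 2)}


def frame_bytes(width: int, height: int, scheme: str, bytes_per_sample: int = 1) -> int:
    """Bytes for one uncompressed YUV frame under the given subsampling scheme."""
    h_div, v_div = SUBSAMPLING[scheme]
    luma_samples = width * height
    chroma_samples = 2 * (width // h_div) * (height // v_div)  # U plane + V plane
    return (luma_samples + chroma_samples) * bytes_per_sample


frame_sizes = {scheme: frame_bytes(1920, 1080, scheme) for scheme in SUBSAMPLING}
for scheme, size in frame_sizes.items():
    print(f"{scheme}: {size:,} bytes per 1080p frame")
```

Under these assumptions 4:2:0 halves the raw frame size compared with 4:4:4, before any codec is applied.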

🔁 8. Keyframes & GOP (Group of Pictures)
#

Video compression doesn’t store every full frame — it stores changes between frames to save space.

I-frame (Intra-coded frame): A full image. Like a JPEG. Can be decoded on its own.
P-frame (Predicted frame): Stores only the changes relative to an earlier frame.
B-frame (Bi-directional frame): Stores changes relative to both an earlier and a later frame, so it needs both to be decoded.

GOP = Group of Pictures
#

A sequence like: I B B P B B P ... I
Starts with an I-frame, followed by P/B-frames.
Smaller GOP = easier seeking/editing, larger GOP = better compression.
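
The seeking trade-off can be sketched with a simplified model: assume fixed-length GOPs that contain only I- and P-frames, so reaching any frame means decoding from the preceding I-frame onward. Real streams with B-frames or variable keyframe spacing are more involved, and the function name is made up for illustration.

```python
def frames_to_decode(target_frame: int, gop_size: int) -> tuple:
    """Return (index of the preceding I-frame, number of frames to decode)."""
    keyframe = (target_frame // gop_size) * gop_size
    return keyframe, target_frame - keyframe + 1


for gop_size in (30, 250):
    keyframe, count = frames_to_decode(100, gop_size)
    print(f"GOP {gop_size}: start at frame {keyframe}, decode {count} frames to show frame 100")
```

With the shorter GOP the decoder starts much closer to the target frame, at the cost of storing full I-frames more often.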

🔁 Common to Both
#

1. Transcoding
#

Re-encoding from one format/codec to another.
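
A short sketch of how this looks with FFmpeg, driven from Python. It contrasts remuxing (changing only the container, streams copied as-is) with transcoding (re-encoding the streams). The file names are hypothetical, the codec choices (`libx264`, `aac`) and the CRF value are assumptions, and the commands are only built here, not executed.

```python
import shutil
import subprocess


def remux_command(src: str, dst: str) -> list:
    """Change the container only: streams are copied without re-encoding."""
    return ["ffmpeg", "-i", src, "-c", "copy", dst]


def transcode_command(src: str, dst: str, video_codec: str = "libx264",
                      audio_codec: str = "aac", crf: int = 23) -> list:
    """Re-encode video and audio with the given codecs (quality set via CRF)."""
    return ["ffmpeg", "-i", src, "-c:v", video_codec, "-crf", str(crf),
            "-c:a", audio_codec, dst]


def run(command: list) -> None:
    """Execute a command, failing early if FFmpeg is not installed."""
    if shutil.which(command[0]) is None:
        raise RuntimeError("ffmpeg was not found on PATH")
    subprocess.run(command, check=True)


# Hypothetical file names, printed rather than executed.
print(" ".join(remux_command("input.mkv", "output.mp4")))
print(" ".join(transcode_command("input.mov", "output.mp4")))
```

Remuxing is fast and lossless but only works when the target container supports the existing codecs; transcoding is slower and, with lossy codecs, loses some quality on every pass.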

2. Compression
#

Lossy (discard some data) vs Lossless (no data loss).

3. Latency / Sync
#

Keeping audio/video aligned.

4. Streaming vs Local Playback
#

Adaptive bitrate streaming (HLS, DASH), buffering, encoding-on-the-fly.
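
The core idea of adaptive bitrate streaming is that the player keeps choosing the best rendition its current bandwidth can sustain. The sketch below shows only that selection step; the bitrate ladder, the safety margin, and the function name are hypothetical values for illustration, not taken from HLS, DASH, or any particular player.

```python
# Hypothetical bitrate ladder: (label, required bandwidth in kbps). Illustrative values only.
LADDER = [("360p", 800), ("720p", 2500), ("1080p", 5000)]


def pick_rendition(measured_kbps: float, ladder=LADDER, safety: float = 0.8) -> str:
    """Pick the highest-bitrate rendition that fits within a fraction of the measured bandwidth."""
    budget = measured_kbps * safety
    affordable = [entry for entry in ladder if entry[1] <= budget]
    if not affordable:
        # Nothing fits: fall back to the lowest rendition rather than stalling.
        return min(ladder, key=lambda entry: entry[1])[0]
    return max(affordable, key=lambda entry: entry[1])[0]


for bandwidth in (500, 4000, 10000):
    print(f"{bandwidth} kbps available -> {pick_rendition(bandwidth)}")
```

A real player repeats this decision for each segment and also takes the buffer level into account.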

🛠 Tools to Learn
#

FFmpeg – Swiss Army knife for audio/video.
Audacity – Audio editing.
HandBrake – GUI video transcoder.
OBS Studio – Recording & streaming.
Adobe Premiere / DaVinci Resolve – Professional video editing.
Python Libs – pydub, moviepy, ffmpeg-python, OpenCV for automation.


Dr. Hari Thapliyaal

Dr. Hari Thapliyal is a seasoned professional and prolific blogger with a multifaceted background that spans the realms of Data Science, Project Management, and Advait-Vedanta Philosophy. Holding a Doctorate in AI/NLP from SSBM (Geneva, Switzerland), Hari has earned Master's degrees in Computers, Business Management, Data Science, and Economics, reflecting his dedication to continuous learning and a diverse skill set. With over three decades of experience in management and leadership, Hari has proven expertise in training, consulting, and coaching within the technology sector. His extensive 16+ years in all phases of software product development are complemented by a decade-long focus on course design, training, coaching, and consulting in Project Management. In the dynamic field of Data Science, Hari stands out with more than three years of hands-on experience in software development, training course development, training, and mentoring professionals. His areas of specialization include Data Science, AI, Computer Vision, NLP, complex machine learning algorithms, statistical modeling, pattern identification, and extraction of valuable insights. Hari's professional journey showcases his diverse experience in planning and executing multiple types of projects. He excels in driving stakeholders to identify and resolve business problems, consistently delivering excellent results. Beyond the professional sphere, Hari finds solace in long meditation, often seeking secluded places or immersing himself in the embrace of nature.
