Video Transcoding at Scale: Codecs, Containers, and the Pipeline
Transcoding converts source video into optimized renditions for adaptive streaming. At scale, it becomes a distributed systems challenge involving codec selection, resolution ladders, quality metrics, and pipeline architecture.
Video transcoding is the process of converting a source video file from one encoding format to another — changing codecs, resolutions, bitrates, or container formats to produce the renditions needed for delivery. Every modern streaming experience depends on transcoding: a single 4K source file must be converted into a ladder of renditions at multiple quality levels so that players can adapt to each viewer's bandwidth and device capabilities in real time. At small scale, transcoding is a manual step — drag a file into an encoder, pick a preset, wait. At scale — thousands of videos, multiple codecs per video, continuous ingestion from production teams, vendors, and user submissions — it becomes a distributed systems problem that determines your video quality, delivery speed, and infrastructure cost.
Containers vs. codecs: the first distinction
Before diving into codec selection and pipeline architecture, it is worth clarifying a distinction that trips up even experienced engineers. A codec (compressor-decompressor) is the algorithm that compresses raw video frames into a compact representation and decompresses them during playback. H.264, H.265, VP9, and AV1 are codecs. A container (also called a wrapper or mux format) is the file format that packages the compressed video stream together with audio streams, subtitle tracks, and metadata into a single file. MP4, WebM, MOV, and MKV are containers.
The relationship is many-to-many. An MP4 container can hold H.264, H.265, or AV1 video. A WebM container holds VP9 or AV1. The same codec can live in different containers. HLS (HTTP Live Streaming) and DASH (Dynamic Adaptive Streaming over HTTP) are neither codecs nor containers — they are streaming protocols that reference segmented media files. HLS traditionally used MPEG-TS (.ts) segments but now supports fragmented MP4 (fMP4). DASH uses fMP4 exclusively. CMAF (Common Media Application Format) standardizes the fMP4 segment format so that the same physical segments can be referenced by both HLS and DASH manifests — reducing storage requirements for platforms that serve both protocols.
Understanding this hierarchy — codec encodes the pixels, container packages the streams, protocol orchestrates the delivery — prevents common mistakes like assuming that “converting from MP4 to WebM” is a simple container swap (it usually requires re-encoding because WebM does not support H.264) or that “supporting HLS” means choosing a specific codec (HLS supports multiple codecs).
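The remux-or-re-encode decision can be sketched as a lookup over a container-codec compatibility map. The map below is illustrative and deliberately incomplete; real muxers consult far more detail (profiles, levels, track types):

```python
# Illustrative (not exhaustive) container -> codec compatibility map,
# reflecting the many-to-many relationship described above.
CONTAINER_CODECS = {
    "mp4":  {"h264", "h265", "av1"},
    "webm": {"vp9", "av1"},
    "mkv":  {"h264", "h265", "vp9", "av1"},
}

def needs_reencode(src_container, dst_container, codec):
    """A container swap is a cheap remux only if the destination
    container can carry the existing codec; otherwise re-encode."""
    return codec not in CONTAINER_CODECS.get(dst_container, set())

# "Converting MP4 to WebM" with H.264 inside forces a re-encode:
print(needs_reencode("mp4", "webm", "h264"))  # True
# MP4 to MKV with the same H.264 stream is a remux:
print(needs_reencode("mp4", "mkv", "h264"))   # False
```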
The codec landscape in 2026
The codec you choose determines the quality-to-size ratio, device compatibility, encoding compute cost, and licensing obligations for your video library. Three codecs dominate the landscape, each with genuine strengths and real tradeoffs.
H.264 (AVC)
H.264, standardized in 2003, remains the universal baseline. It is supported by virtually every browser, mobile device, smart TV, and set-top box in existence. Hardware decoding is ubiquitous, which means playback is power-efficient even on low-end mobile devices. Encoding is fast and well-optimized after two decades of software and hardware development. The tradeoff is compression efficiency: at high resolutions (4K and above), H.264 requires relatively high bitrates to maintain visual quality, which translates to larger file sizes and higher bandwidth costs. Licensing is managed through a single patent pool (MPEG-LA), and the terms are well-understood. For content targeting the broadest possible audience — especially when legacy devices are in the mix — H.264 is still the safe default.
H.265 (HEVC)
H.265, also known as HEVC (High Efficiency Video Coding), delivers roughly 50% better compression than H.264 at equivalent visual quality. This means the same perceived quality at half the bitrate, or significantly better quality at the same bitrate. The catch is adoption complexity. Patent licensing for H.265 is fragmented across multiple patent pools (MPEG LA, Velos Media, and Access Advance, formerly known as HEVC Advance), which has created uncertainty around total licensing costs and slowed browser adoption. Safari supports it natively. Chrome added hardware-accelerated support in recent versions. Firefox support remains limited and platform-dependent. Hardware decoder support is widespread on modern devices (manufactured after roughly 2016) but absent on older hardware. H.265 is most effective when you can target specific platforms (Apple ecosystem, modern Android, smart TVs) or use it as one tier in a multi-codec strategy where clients that support it get the bandwidth savings and everyone else falls back to H.264.
AV1
AV1 is the royalty-free codec developed by the Alliance for Open Media (AOMedia), backed by Google, Apple, Meta, Netflix, Amazon, and others. It offers compression efficiency comparable to or better than H.265 without patent licensing fees — a significant advantage for high-volume platforms where per-play royalty costs compound. The primary challenge is encoding speed: AV1 encoding is computationally expensive, roughly 10 to 20 times slower than H.264 at equivalent quality settings. Hardware encoder support has arrived in Intel Arc GPUs, AMD RDNA 3, and Nvidia's Ada-generation (RTX 40 series) GPUs, but is not yet universal. On the decode side, browser support is strong — Chrome, Firefox, Edge, and Safari all support AV1 playback, with hardware-accelerated decoding on newer devices, including Apple's M3 and later chips. AV1 is increasingly the right choice for large libraries where the one-time encoding cost is amortized over millions of playbacks. The per-play bandwidth savings (30-50% smaller files than H.264) more than offset the encoding investment at scale.
Codec comparison
| Codec | Compression efficiency | Browser support | HW decode support | Licensing |
|---|---|---|---|---|
| H.264 (AVC) | Baseline (1×) | Universal | Universal | MPEG-LA pool (well-defined) |
| H.265 (HEVC) | ~50% better than H.264 | Safari, Chrome; Firefox limited | Post-2016 devices | Fragmented (3 patent pools) |
| AV1 | ~50-60% better than H.264 | Chrome, Firefox, Safari, Edge | Post-2022 devices (growing) | Royalty-free (AOMedia) |
In practice, most platforms operating at scale adopt a multi-codec strategy: encode every video in H.264 as the universal fallback, plus H.265 or AV1 (or both) for clients that support them. The player checks device capabilities and selects the most efficient codec available. This approach maximizes compatibility while capturing bandwidth savings for the growing majority of devices that support modern codecs.
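The client-side selection described above amounts to walking a preference list ordered by compression efficiency. A minimal sketch (the codec names and ordering are illustrative assumptions):

```python
# Preference order: most bandwidth-efficient codec first, H.264 last
# as the universal fallback, matching the multi-codec strategy above.
CODEC_PREFERENCE = ["av1", "h265", "h264"]

def pick_codec(device_supported):
    """Return the most efficient codec this device can decode."""
    for codec in CODEC_PREFERENCE:
        if codec in device_supported:
            return codec
    raise ValueError("no playable codec for this device")

print(pick_codec({"h264", "h265"}))          # h265
print(pick_codec({"h264"}))                  # h264
print(pick_codec({"h264", "h265", "av1"}))   # av1
```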
Building an ABR ladder
Adaptive bitrate (ABR) streaming requires a ladder of renditions at different resolution and bitrate combinations. The player evaluates network conditions in real time and switches between rungs to prevent buffering while maximizing visual quality. A typical ladder for a 1080p source might include:
| Resolution | Bitrate (H.264) | Bitrate (H.265/AV1) | Target scenario |
|---|---|---|---|
| 1080p (1920×1080) | 4.5–6 Mbps | 2.5–3.5 Mbps | Desktop, smart TV, strong Wi-Fi |
| 720p (1280×720) | 2–3 Mbps | 1.2–1.8 Mbps | Tablet, moderate connection |
| 480p (854×480) | 1–1.5 Mbps | 600–900 Kbps | Mobile on 4G, constrained Wi-Fi |
| 360p (640×360) | 500–800 Kbps | 300–500 Kbps | Mobile on 3G, very slow connections |
Designing the ladder involves tradeoffs. More rungs mean smoother quality transitions during playback — the player has finer-grained options to match available bandwidth. But each additional rung costs compute time to encode and storage to maintain. A four-rung ladder for a single codec produces four renditions. A four-rung ladder across three codecs produces twelve renditions from one source file. Multiply that by thousands of videos and the storage and compute implications become significant.
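Rung selection on the player side can be sketched as follows, using the H.264 bitrates from the table above. The 0.8 headroom factor is an illustrative assumption; real ABR algorithms also weigh buffer occupancy and throughput history:

```python
# H.264 ladder from the table above: (height, bitrate in kbps),
# sorted highest-first.
LADDER = [(1080, 5000), (720, 2500), (480, 1200), (360, 650)]

def pick_rung(measured_kbps, headroom=0.8):
    """Pick the highest rung whose bitrate fits within a fraction
    (headroom) of measured throughput; fall back to the lowest rung."""
    budget = measured_kbps * headroom
    for height, kbps in LADDER:
        if kbps <= budget:
            return height
    return LADDER[-1][0]  # below the lowest rung: serve it anyway

print(pick_rung(8000))  # 1080
print(pick_rung(3500))  # 720
print(pick_rung(900))   # 360
```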
Per-title encoding
Fixed-bitrate ladders are inefficient because they ignore content complexity. A talking-head video with a static background compresses dramatically better than a fast-motion sports highlight with rapid scene changes. Per-title encoding (also called content-aware encoding) analyzes each video's visual complexity — motion vectors, spatial detail, scene change frequency — and generates a custom bitrate ladder for that specific video. A low-complexity video might skip the highest bitrate rung entirely because it achieves reference quality at a lower bitrate. A high-complexity video might need additional rungs or higher bitrate targets to avoid visible artifacts.
The result is significant: per-title encoding typically reduces total storage by 20-50% compared to fixed ladders at equivalent perceived quality. For a 50 TB library, a 30% reduction saves 15 TB of storage — thousands of dollars per month in cloud storage costs alone, plus proportional bandwidth savings on every playback. The tradeoff is increased encoding complexity: the encoder must analyze the source first, run test encodes at multiple bitrate points, and select the optimal ladder. This adds minutes to the encoding process per video but pays for itself many times over in reduced storage and delivery costs.
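A toy sketch of the idea, assuming the complexity analysis has already produced a per-title bitrate at which quality saturates. Real systems derive the ladder from test encodes at multiple bitrate points rather than a single cap:

```python
def per_title_ladder(base_ladder, saturation_kbps):
    """Cap each rung at the bitrate where this title's quality
    saturates (assumed to come from complexity analysis), then drop
    rungs made redundant by the cap, i.e. rungs whose capped bitrate
    duplicates a higher-resolution rung."""
    capped = [(h, min(kbps, saturation_kbps)) for h, kbps in base_ladder]
    ladder, last = [], None
    for h, kbps in capped:
        if kbps != last:  # drop rungs that collapse together
            ladder.append((h, kbps))
            last = kbps
    return ladder

base = [(1080, 5000), (720, 2500), (480, 1200), (360, 650)]
# Low-complexity talking head: 1080p needs only 2.5 Mbps, so the
# separate 720p rung at the same bitrate becomes redundant.
print(per_title_ladder(base, 2500))   # [(1080, 2500), (480, 1200), (360, 650)]
# High-complexity content: the full fixed ladder survives.
print(per_title_ladder(base, 10000))
```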
Quality metrics: beyond bitrate
Traditional transcoding workflows specify a target bitrate — say, 5 Mbps for 1080p content — and the encoder allocates bits within that budget. The problem is that bitrate is a blunt instrument. A static presentation slide at 5 Mbps looks identical to one at 2 Mbps. A fast-motion action sequence at 5 Mbps might show visible compression artifacts. Constant bitrate (CBR) encoding wastes bandwidth on simple scenes and starves complex ones. Variable bitrate (VBR) encoding adjusts frame-by-frame but still optimizes for a bitrate target, not a quality target.
Perceptual quality metrics solve this fundamental problem. VMAF (Video Multi-Method Assessment Fusion), developed by Netflix, is a machine learning model trained on human visual perception data. It scores video quality on a 0-100 scale that correlates closely with how humans perceive quality — a VMAF score of 93 or above is generally considered indistinguishable from the uncompressed source for most viewers. SSIM (Structural Similarity Index) provides a complementary measure based on structural patterns in the image, scoring from 0 to 1 where values above 0.95 indicate excellent quality.
Quality-aware encoding uses these perceptual metrics as the target instead of bitrate: encode to VMAF 93 and let the encoder find the minimum bitrate that achieves that quality for each scene. The result is smaller files for visually simple content, appropriately sized files for complex content, and a consistently excellent viewing experience across your entire library. For the viewer, quality never drops below the perceptual threshold. For the operator, file sizes are 20-40% smaller on average compared to constant bitrate encoding at equivalent perceived quality — savings that compound across every stored rendition and every delivered segment.
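Finding the minimum bitrate that meets a quality target can be sketched as a binary search over test encodes. The `quality` lambda below is a stand-in model, not a real VMAF measurement (which would come from a tool like libvmaf); the only assumption the search needs is that quality is monotonic in bitrate:

```python
import math

def min_bitrate_for_vmaf(vmaf_of, target=93.0, lo=200, hi=8000, tol=50):
    """Binary-search the lowest bitrate (kbps) whose encode scores at
    least `target` VMAF. `vmaf_of` stands in for a test encode plus a
    VMAF measurement."""
    while hi - lo > tol:
        mid = (lo + hi) // 2
        if vmaf_of(mid) >= target:
            hi = mid   # good enough: try lower
        else:
            lo = mid   # too lossy: need more bits
    return hi

# Stub quality model for a moderately complex title (illustrative only).
quality = lambda kbps: 100 * (1 - math.exp(-kbps / 1500))
best = min_bitrate_for_vmaf(quality)
print(best, round(quality(best), 2))
```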
The transcoding pipeline architecture
A transcoding pipeline at scale is more than a queue of encoding jobs. It is a distributed system that must handle variable input formats, manage compute resources efficiently, handle failures gracefully, and complete jobs within acceptable time budgets. The architecture breaks into distinct stages, each with its own operational concerns.
Ingest and analysis
The first stage validates the source file: is the container readable, is the codec decodable, are there audio streams, what is the duration, resolution, and frame rate? A probe step (using tools like FFprobe or MediaInfo) extracts this technical metadata without decoding the full video. Validation catches corrupted uploads, unsupported formats, and mismatched metadata before any encoding compute is spent. For per-title encoding, this stage also analyzes visual complexity to determine the optimal bitrate ladder.
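A simplified validation pass over ffprobe-style JSON metadata might look like this. The field names mirror ffprobe's `-show_format -show_streams` output, but the checks are a deliberately reduced sketch:

```python
def validate_probe(probe):
    """Sanity-check ffprobe-style metadata before spending any
    encoding compute; returns a list of problems (empty = pass)."""
    problems = []
    streams = probe.get("streams", [])
    video = [s for s in streams if s.get("codec_type") == "video"]
    audio = [s for s in streams if s.get("codec_type") == "audio"]
    if not video:
        problems.append("no video stream")
    if not audio:
        problems.append("no audio stream")
    duration = float(probe.get("format", {}).get("duration", 0))
    if duration <= 0:
        problems.append("missing or zero duration")
    for s in video:
        if s.get("width", 0) < 2 or s.get("height", 0) < 2:
            problems.append("invalid video dimensions")
    return problems

probe = {"format": {"duration": "3600.0"},
         "streams": [{"codec_type": "video", "width": 1920, "height": 1080},
                     {"codec_type": "audio"}]}
print(validate_probe(probe))  # [] -> safe to encode
```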
Segmentation and parallel encoding
This is where parallelism transforms transcoding from a serial bottleneck into a scalable operation. The source video is split at keyframe boundaries (typically every 2-6 seconds, aligned with the target segment duration for ABR delivery). Each segment is dispatched to an encoding worker independently. A 60-minute video split into 2-second segments produces 1,800 segments. Distributed across 100 encoding workers, the entire video can be transcoded in the time it takes to encode roughly 18 segments sequentially — minutes instead of hours. Workers can be CPU-based (well-optimized for H.264 and H.265 with libraries like x264 and x265) or GPU-accelerated (increasingly common for AV1 with hardware encoders like NVENC and AMD AMF). The job scheduler must balance worker utilization, handle stragglers (segments that take longer due to scene complexity), and retry failed segments without re-encoding the entire video.
Packaging and manifest generation
After encoding, segments are packaged into the target streaming format. For HLS, this means writing .m3u8 playlist files that reference the segment URLs at each quality level, plus a master playlist that lists all available renditions. For DASH, the equivalent is a .mpd (Media Presentation Description) manifest. If you are targeting both protocols, CMAF packaging lets you write segments once in fMP4 format and generate both HLS and DASH manifests pointing to the same files. The packaging stage also handles DRM (Digital Rights Management) encryption if required, applying encryption to segments and embedding key server references in the manifests.
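Generating an HLS master playlist is mostly string assembly. This sketch emits only the core attributes (real packagers add frame rate, audio groups, and other fields), and the rendition dictionaries are illustrative:

```python
def master_playlist(renditions):
    """Emit a minimal HLS master playlist: one EXT-X-STREAM-INF
    entry per rendition with BANDWIDTH, RESOLUTION, and CODECS."""
    lines = ["#EXTM3U", "#EXT-X-VERSION:6"]
    for r in renditions:
        lines.append(
            f'#EXT-X-STREAM-INF:BANDWIDTH={r["kbps"] * 1000},'
            f'RESOLUTION={r["w"]}x{r["h"]},CODECS="{r["codecs"]}"'
        )
        lines.append(r["uri"])  # per-rendition media playlist
    return "\n".join(lines) + "\n"

print(master_playlist([
    {"kbps": 5000, "w": 1920, "h": 1080,
     "codecs": "avc1.640028,mp4a.40.2", "uri": "1080p/index.m3u8"},
    {"kbps": 2500, "w": 1280, "h": 720,
     "codecs": "avc1.64001f,mp4a.40.2", "uri": "720p/index.m3u8"},
]))
```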
Storage and delivery handoff
The final stage pushes encoded renditions, segments, and manifests to persistent storage (typically object storage like S3 or GCS) and optionally warms the CDN cache for high-priority content. Webhook notifications or message queue events signal downstream systems that the video is ready for delivery. Monitoring and alerting at every stage — ingest success rate, encoding duration percentiles, segment failure rate, storage write latency — are essential for operating the pipeline reliably. A failed encode on one rendition should not block delivery of the others. A corrupted source file should be detected at ingest, not after hours of encoding compute.
Cost and performance tradeoffs
Transcoding at scale involves a three-way tradeoff between encoding speed, storage cost, and delivery latency. Understanding these tradeoffs is essential for designing a pipeline that fits your operational requirements and budget.
Eager transcoding
Eager transcoding (also called pre-transcoding) generates all renditions at upload time. Every video in the library has a complete ABR ladder ready for instant delivery. The advantage is zero latency on first playback — the renditions already exist. The disadvantage is cost: you pay for compute and storage for every rendition of every video, including renditions that may never be requested. In a typical library, a significant portion of content receives minimal views. Generating a full multi-codec ABR ladder for content that is accessed once or twice is wasteful.
Just-in-time (lazy) transcoding
Lazy transcoding generates renditions only when they are first requested. The source file is stored, and renditions are created on demand. The advantage is dramatic cost savings: you only pay for compute and storage for renditions that are actually consumed. For libraries where the long tail of content receives few views, lazy transcoding can reduce total encoding compute by 60-80%. The disadvantage is latency on the first request — the viewer must wait while the rendition is generated. For short clips (under 30 seconds), this delay may be acceptable. For longer content, the delay can cause a poor first-view experience.
Hybrid approaches
The most cost-effective approach is typically a hybrid: eagerly transcode the most commonly requested renditions (the H.264 720p and 1080p rungs, which cover the majority of playback scenarios) and lazily generate the rest (4K, AV1, lower-resolution fallbacks) on first request. Popular content — identified by view count, editorial priority, or placement on high-traffic pages — gets the full eager treatment. Long-tail content gets the lazy treatment. Intelligent caching ensures that once a lazy rendition is generated, it is stored and served from the CDN for subsequent requests without re-encoding. This hybrid model reduces total encoding compute by 40-60% compared to full eager transcoding while maintaining sub-second delivery latency for the vast majority of playback scenarios.
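The eager-versus-lazy decision reduces to a simple policy function. The rendition tuples and the contents of `EAGER_SET` here are illustrative assumptions, not fixed recommendations:

```python
# The common rungs assumed to cover most playback, per the hybrid
# strategy described above.
EAGER_SET = {("h264", 720), ("h264", 1080)}

def renditions_to_encode_eagerly(all_renditions, is_popular):
    """Eagerly encode the common H.264 rungs for everything; popular
    content (by views, editorial priority, placement) gets the full
    ladder. Everything else is generated lazily on first request."""
    if is_popular:
        return list(all_renditions)
    return [r for r in all_renditions if r in EAGER_SET]

ladder = [("h264", 1080), ("h264", 720), ("h264", 480),
          ("av1", 1080), ("av1", 720)]
print(renditions_to_encode_eagerly(ladder, is_popular=False))
# [('h264', 1080), ('h264', 720)]
```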
Where Cloudinary fits
Cloudinary's transcoding pipeline handles multi-codec encoding (H.264, H.265, VP9, AV1), automatic ABR ladder generation, and quality-aware compression using perceptual metrics. Rather than specifying fixed bitrate targets, teams can configure quality-based encoding that targets a perceptual quality threshold and lets the platform find the optimal bitrate for each video's content complexity — implementing per-title encoding without building the analysis infrastructure in-house.
The platform supports both eager and lazy transcoding strategies. Encoding profiles are defined once and applied automatically to every upload. URL-based codec selection lets developers serve different codecs by changing a URL parameter — the same video URL with a different format extension returns an H.265 or AV1 rendition if it exists, or triggers lazy generation if it does not. For teams managing large libraries, this eliminates the need to pre-generate every codec and resolution combination while ensuring that any requested variant is available after a single generation.
Cloudinary's credit-based pricing ties transcoding, storage, and delivery costs into a single model, making it straightforward to forecast total cost of ownership as the library grows. The managed pipeline removes the operational burden of scaling encoding workers, managing job queues, and monitoring segment-level failures — letting engineering teams focus on their product rather than their video infrastructure.
Frequently asked questions
What is video transcoding and why does it matter at scale?
Video transcoding converts a source video from one encoding format to another, producing renditions optimized for different playback scenarios. At scale — thousands of videos with multiple codecs per video — transcoding becomes a distributed systems challenge requiring parallel processing pipelines, intelligent quality optimization, and cost-aware strategies to balance encoding speed, storage cost, and delivery quality.
Which video codec should I use: H.264, H.265, or AV1?
It depends on your audience. H.264 offers universal support and is the safe default. H.265 delivers 50% better compression but has fragmented licensing and inconsistent browser support. AV1 provides the best compression efficiency and is royalty-free, but encoding is computationally expensive. Most platforms at scale use a multi-codec strategy: H.264 as the fallback, with H.265 or AV1 served to capable devices.
What is per-title encoding and how does it improve video quality?
Per-title encoding analyzes each video's visual complexity and generates a custom bitrate ladder instead of using fixed targets. A low-motion video gets lower bitrates (because it compresses well), while a high-motion video gets higher bitrates (because it needs them). This typically reduces file sizes by 20-50% at equivalent perceived quality, saving storage and bandwidth costs at scale.
Ready to manage video assets at scale?
See how Cloudinary helps teams upload, transform, and deliver video — with a free tier to get started.