Videoconferencing - Technology Overview

Videoconferencing (VTC) was a relatively static field for numerous years—a handful of large manufacturers produced a limited number of videoconferencing codecs and infrastructure components that were built around the H.323 and SIP standards, while the relatively new field of software-based VTC systems was likewise limited to a small collection of companies.  The field has expanded drastically over the past several years, with numerous new manufacturers entering both the hardware and software sectors of the market.  This expansion has been built around improvements in technology, greater consumer acceptance of videoconferencing, and increased access to high-bandwidth internet.  While these developments helped drive down the cost of many VTC systems, they also created a more fractured and confusing marketplace than existed five years ago.

This section of the toolkit aims to describe the many components of the videoconferencing world, including software systems, hardware platforms, infrastructure elements, and peripheral devices.  As this is a potentially unwieldy topic, the technology overview has been broken down into the following sections:

Concepts and Components

Endpoints – Hardware and Software

Multi-Party Conferencing – Bridges and MCUs



Concepts and Components

Videoconferencing has long been considered a critical component of many telehealth systems, to the point that some people consider videoconferencing and telemedicine to be synonymous.  While this is undoubtedly an oversimplification of both telemedicine and videoconferencing, it is an understandable mental connection to make given that VTC has been used in an incredibly diverse range of clinical settings for decades. The fact that many states and insurance providers only reimburse for telemedicine if consultations take place over videoconferencing further reinforces the notion that VTC is a fundamental part of the telemedicine landscape.  But what is videoconferencing?

At its simplest, VTC consists of two camera/microphone/display sets that communicate through some type of intermediary, networked device.  Examples include:

  • Two smart phones running software with the built-in camera, microphone and display
  • Two laptop or desktop PCs running software with external USB webcams and microphones
  • A wall-mounted VTC platform connecting with a VTC system on a hand-moved cart
  • Software on a desktop PC that communicates with a robotic video cart
  • Any permutation of these products (with some potential caveats regarding interoperability)

In most real-world implementations, however, things quickly become significantly more complex as videoconferencing systems begin to connect disparate organizations, home users, numerous simultaneous participants, and devices from multiple manufacturers.  The simple camera/microphone/display model quickly becomes inadequate as we consider proxies, gateways, gatekeepers, multi-point control units, multiple monitors, microphone arrays, pan-tilt-zoom cameras, and software platforms.

While the above terminology may seem overwhelming to those new to videoconferencing, each component has a fairly simple description, purpose, and place within a VTC system.  These various parts will be described in the coming pages.

Encoding / Decoding / Codecs

Video teleconferencing (VTC) is a simultaneous audio and video data stream that brings people at different sites together visually and audibly. This streaming of media is built upon the concept of coding and decoding audio and video data through codecs, which perform compression (coding) for faster transmission of data on the sending side and decompression (decoding) on the receiving side of the call.

Codecs can be either hardware- or software-based, and will typically implement well-defined audio and video coding standards, such as G.711 and H.264 (please refer to the section on standards for more discussion on this topic).  Note that the term "codec" is used both for the device or software that performs the coding (codec-as-endpoint) and for the coding standard itself (codec-as-algorithm), and this dual meaning can cause some confusion when discussing this technology.

A large amount of analog data, including color, movement, and sound, is compressed and broken into small packets of data. VTC data packets travel over an Internet Protocol (IP) network, where they are reassembled and decoded at the other points in the conference.  Advances in compression and coding, significant increases in the number of consumer products that support videoconferencing, and the explosive growth of broadband connections in the home and on cellular networks have continued to drive the growth of this market.
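The compress, packetize, and reassemble flow described above can be sketched roughly as follows. This is only an illustration: `zlib` stands in for a real audio/video codec (real codecs such as H.264 are lossy and far more elaborate), and the function names `packetize` and `reassemble` and the 1,200-byte packet size are assumptions chosen for the example, not part of any VTC standard.

```python
import random
import zlib


def packetize(payload: bytes, packet_size: int = 1200) -> list[tuple[int, bytes]]:
    """Compress a payload and split it into (sequence_number, chunk) packets."""
    if packet_size <= 0:
        raise ValueError("packet_size must be positive")
    compressed = zlib.compress(payload)  # stand-in for the "coding" step
    return [
        (seq, compressed[offset:offset + packet_size])
        for seq, offset in enumerate(range(0, len(compressed), packet_size))
    ]


def reassemble(packets: list[tuple[int, bytes]]) -> bytes:
    """Put packets back in sequence order and decompress ("decode") them."""
    ordered = sorted(packets, key=lambda packet: packet[0])
    return zlib.decompress(b"".join(chunk for _, chunk in ordered))


# Hypothetical "frame" of data; packets may arrive out of order on an IP network.
frame = bytes(range(256)) * 40
packets = packetize(frame, packet_size=100)
random.Random(0).shuffle(packets)
restored = reassemble(packets)
```

The sequence number is what lets the receiving side restore the original order, regardless of the order in which the packets arrive.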

Basic Components

Generally speaking, videoconferencing requires the following primary components:

  • A video input device, such as a video camera, desktop web cam, or a camera built in to a tablet, smart phone, or laptop.
  • An audio input device, such as a stand-alone microphone, headset, or a smart phone’s voice input.
  • A video output device, such as a computer monitor, HDTV, or a smart phone or tablet display.
  • An audio output device, such as a speaker.
  • A data processing unit, such as a computer or codec that does the compressing and decompressing of data streams and ties all of the user sites together. This is often referred to as the endpoint.
  • Infrastructure for bridging the calls across the network, called a bridge or multipoint control unit (MCU).
  • A network or Internet connection.

What is an Endpoint?

Many of the people who use videoconferencing systems – physicians, nurses, patients, and administrators – may not be aware of all the components used in a typical VTC call, but have quite possibly heard the term “endpoint” used in discussions. Technically, the endpoint is a uniquely-addressable point on a network that can engage in a call with another videoconferencing endpoint. More broadly speaking, an endpoint comprises the codec, camera, and monitor used for videoconferencing.

The codec is the hardware or software portion of the endpoint that is used to code and decode audiovisual data sent in a videoconferencing session. It essentially serves as the brains of a videoconferencing endpoint, taking video data from the camera, transmitting a video signal to the monitor, taking in audio and video information from peripheral devices, and communicating with either core VTC infrastructure or other endpoints. The codec may come in a variety of shapes and sizes, ranging from large PC-like boxes to small, all-in-one units. More recently, software codecs have replaced the need for an external box, allowing computers, smart phones, and other mobile options to send and receive audiovisual data.

The cameras used in videoconferencing connect to the codecs, providing the video image that is sent to other systems and to the local monitor or display (also referred to as the “far” and “near” ends, respectively). Camera designs and functionality vary, though they can broadly be placed in two categories – fixed and pan-tilt-zoom (PTZ). Fixed cameras cannot be moved without physically changing the position of the device by hand, and are often built into monitors, phones, tablets, laptops, and all-in-one desktop units, or are USB-connected products that plug into a PC. These cameras tend to have a wide-angle lens, with some devices providing various levels of zoom and magnification. PTZ cameras can be moved through use of an electronic controller, and many of them allow a user at the remote site, or “far” end, to change where the camera is pointed.

Monitors vary widely in size, resolution, and design. Videoconferencing equipment manufacturers may include their own monitors in some product lines, while others will support the use of external, third-party monitors. The choice of monitor will often be driven by manufacturer recommendations, room and/or cart size, and budget.

The Types of Endpoints

Endpoints come in a variety of sizes and shapes, with each fitting a different intended audience and use. Manufacturers may use slightly different terms to describe their endpoints, or may use the same terms with different meanings and definitions. Below is a broad set of descriptions of the types of endpoints.

Mobile, PC, and Laptop Software

Various computing products, including mobile devices, laptops, and desktops, utilize software systems that communicate with other videoconferencing infrastructure to manage VTC calls.  This category includes standards-based products that communicate with hardware-based platforms as well as consumer-oriented products that do not interoperate with other systems. The rise in computing power on even the smallest of devices makes it possible for computationally complex encoding and decoding algorithms to be used, potentially improving performance while decreasing bandwidth needs.

Video Phone

A variety of products can fit into a broad “video phone” category. As opposed to smartphones and mobile PCs that use a software application, these videophones are small, purpose-built products that support phone calls and videoconferences. They are designed largely to fit on a desk in place of a traditional voice-only phone and are characterized by relatively small screens and cameras. The cameras are not capable of pan or tilt functionality, though they may support a basic or digital zoom feature. Additionally, these products typically lack additional inputs for other video or serial data devices.

Stand-Alone Desktop

The desktop units (not to be confused with desktop software) are all-in-one products that provide a monitor, camera, and built-in codec for use as a videoconferencing endpoint. Screen size varies with these products, though they are often the size and shape of a typical widescreen computer monitor. They often have the capacity to serve as a secondary monitor, allowing a desktop VTC unit to serve as both a computer monitor and as a videoconferencing platform. The cameras are not capable of pan or tilt functionality, though they may support a basic or digital zoom feature. Depending on the model and manufacturer, there may be options to connect either additional video products or serial data devices.

Room Unit – Cart or Mounted

The platform that often jumps to people’s minds when thinking about videoconferencing is the room-based platform. The codec and camera may be sold separately from the monitor, or may be sold as an all-in-one unit with a built-in pan-tilt-zoom camera. These models will typically offer some variety of additional inputs, though the type and number of inputs vary by manufacturer.

Depending on the model and organizational preferences, these units can often be mounted in many different ways, with common options being to mount the devices to a wall, a stationary platform or pedestal, or a mobile cart. The carts can come in a variety of formats, with additional features such as built-in Uninterruptible Power Supplies, movable writing surfaces, articulating screen mounts, and attachment points for other peripherals.

Robotic and Remote-Controlled Platforms

Much interest has been generated by motorized, robotic platforms for videoconferencing. These systems provide the ability to remotely control a mobile videoconferencing platform, allowing a distant operator to navigate through hallways and rooms while maintaining a live video connection.

Interfaces for “driving” these platforms vary from web-based controllers to PC-based software applications with dedicated hardware controllers. Web-based controllers provide freedom from a dedicated controlling station, while PC-based systems with hardware controllers allow for a more tactile controlling experience.

It is important to note that these systems, which use either a software or web-based interface for controlling the robotic platform, do not work with standards-based systems. There is not yet a standardized method of sending control information from a room-based videoconferencing endpoint to a mobile, robotic platform. This means that robotic platforms will not work with existing videoconferencing infrastructure, nor will they work with other organizations’ VTC products unless they are also using the same robotic platform or controlling systems.


Manufacturers, vendors, and media outlets use telepresence in different ways. Some manufacturers have used the term telepresence to describe a high-definition codec attached to a large monitor, while others have used it to describe platforms with wheels. Some of the loosest applications of the term have included references to a modified Roomba with a video camera as providing “telepresence” based on the fact it allowed a videoconferencing platform to be “present” as the user moved around.

For the sake of clarity here, telepresence is being defined as the use of very high-definition systems that are configured to be viewed such that the person is felt to be “present” in the room, often with the subject largely filling the screen. These telepresence systems typically include multiple monitors and cameras in a single videoconferencing installation. These installations are often built in purpose-built rooms, and can connect with other VTC platforms to create impressive, bandwidth-intensive conference centers. These systems allow the sharing of content and provide various video inputs.

All videoconferencing could, technically, be defined as providing telepresence. However, some segments of the industry are moving in a direction that increasingly emphasizes larger, more elaborate installations. As the terminology is still being defined, this will likely be an area of some confusion in the short term.

Multiparty Videoconferencing – Bridges and MCUs

Videoconferencing has increasingly become associated with VTC calls that include multiple participants connecting simultaneously on multiple different endpoints in a single, virtual meeting room, all communicating seamlessly. The bridge, or Multipoint Control Unit (MCU), is one of the key pieces of technology that can make these multi-party video calls possible.

MCUs can support many different functions. Generally, they are devices that allow multiple videoconferencing endpoints to communicate in a conference that includes three or more people, as opposed to the point-to-point video calls that are limited to two participants. It is important to note that some endpoints provide this bridging functionality for a limited number of endpoints without requiring additional hardware or a dedicated bridging device.

The products are typically licensed with a set number of ports, or live video connections, that can be engaged at once. These connections may be used simultaneously in one large multiparty videoconference, or may be split across several smaller simultaneous VTC calls.
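The port-licensing idea above comes down to simple bookkeeping, sketched here. The function name `ports_remaining`, the conference names, and the 20-port license are hypothetical values chosen for illustration; actual licensing rules vary by manufacturer.

```python
def ports_remaining(licensed_ports: int, conferences: dict[str, int]) -> int:
    """Return how many licensed ports are still free.

    `conferences` maps a conference name to its number of live video connections.
    """
    in_use = sum(conferences.values())
    if in_use > licensed_ports:
        raise ValueError(
            f"{in_use} connections exceed the {licensed_ports}-port license"
        )
    return licensed_ports - in_use


# One large conference and several small ones draw from the same pool of ports.
one_large = ports_remaining(20, {"grand_rounds": 20})
several_small = ports_remaining(20, {"cardiology": 6, "rehab": 4, "dermatology": 3})
print(one_large, several_small)  # prints "0 7"
```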


MCUs, as stated, provide the ability to help people connect in a videoconferencing system. Their exact functions will vary by manufacturer and model.

Video Switching

Video switching is the ability of a videoconferencing bridge to show the active speaker in a larger portion of the video image (or possibly the entire image), with the option of displaying participants who are not speaking in smaller squares, if at all. This is often voice-activated, meaning that the system dynamically switches the current speaker displayed in the main portion of the screen. On some systems, the active speaker may be set at each individual endpoint, or by a central chair that grants and revokes the main video image. Those acting as the video “chairperson” may also perform other functions, such as manually removing people from the call or canceling the entire conference.
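As a rough sketch of the voice-activated behavior described above, a bridge could pick the loudest endpoint as the active speaker and keep the current speaker when nobody is loud enough. The function name, the 0.0 to 1.0 audio-level scale, and the threshold value are assumptions made for illustration; real bridges use their own, typically more sophisticated, detection logic.

```python
from typing import Optional


def next_active_speaker(
    levels: dict[str, float],
    current: Optional[str],
    threshold: float = 0.2,
) -> Optional[str]:
    """Choose which endpoint to show in the main portion of the screen.

    `levels` maps an endpoint name to its current audio level (0.0 to 1.0).
    The current speaker is kept if no endpoint reaches the threshold.
    """
    loudest = max(levels, key=levels.get, default=None)
    if loudest is None or levels[loudest] < threshold:
        return current
    return loudest


speaker = next_active_speaker({"clinic": 0.1, "hospital": 0.6}, current="clinic")
print(speaker)  # prints "hospital"
```

Keeping the current speaker during quiet moments avoids the main image jumping around on background noise.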

Dynamic switching of the primary speaker is not always desirable.  Some systems also support simultaneously displaying all participants on the screen in small portions of the screen.  This is colloquially referred to as a “Hollywood Squares” or “Brady Bunch” conference. Essentially, everyone is viewed in a small square or tile in the larger screen. As more people join the conference, the number of squares increases as individual sizes decrease until a threshold is reached (between 9 and 25 participants, depending on the system).
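The way tiles shrink as participants join can be illustrated with a small calculation. This sketch assumes a simple square grid and a cap of 16 visible tiles; both are assumptions for illustration, since real bridges offer a variety of layouts and the cap differs by system (the text above gives a range of 9 to 25).

```python
import math


def tile_layout(participants: int, max_tiles: int = 16) -> tuple[int, int]:
    """Return (grid_side, visible_tiles) for a square tiled layout.

    Participants beyond `max_tiles` are assumed not to be shown.
    """
    if participants < 1:
        raise ValueError("need at least one participant")
    visible = min(participants, max_tiles)
    # Smallest grid side such that side * side >= visible.
    side = math.isqrt(visible - 1) + 1
    return side, visible


for count in (1, 4, 5, 9, 10, 30):
    side, visible = tile_layout(count)
    print(f"{count} participants: {side}x{side} grid, {visible} shown")
```

Each tile occupies roughly 1/side of the screen width, so moving from four to five participants shrinks every tile from one half to one third of the screen width.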

Some bridges will output the individual squares in a 4:3 aspect ratio, which can cause some cropping issues when the camera being used is designed around the 16:9 “widescreen” ratio.  This may impact how participants should position themselves in front of the camera.
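To get a feel for how much of the picture is affected, the arithmetic can be worked through as below. The sketch assumes the bridge keeps the full image height and trims the sides equally (a center crop); that behavior is an assumption, as a given bridge may scale or letterbox the image instead.

```python
from fractions import Fraction


def width_lost_to_crop(source=(16, 9), tile=(4, 3)) -> Fraction:
    """Fraction of the source width removed when cropping to the tile's ratio.

    Assumes the full height is kept and only the sides are trimmed.
    """
    src_ratio = Fraction(*source)
    tile_ratio = Fraction(*tile)
    if tile_ratio >= src_ratio:
        return Fraction(0)  # tile is at least as wide as the source: no side crop
    return 1 - tile_ratio / src_ratio


lost = width_lost_to_crop()
print(lost)      # prints "1/4"
print(lost / 2)  # prints "1/8" (amount trimmed from each side)
```

Under these assumptions a quarter of a 16:9 image is lost, one eighth from each edge, so participants seated near the sides of the frame may be cut off.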

Identifiers – Rooms and People

MCUs can provide two potentially significant features related to what are best described as “identifiers” – rooms and people. Rooms can be incredibly useful for managing video calls. As opposed to knowing which user at which endpoint needs to be called, or which alias to reach, all parties can agree to call into a virtual “room.”  This can be especially useful for clinicians who offer VTC services in an on-call setting.  It would be difficult for all remote participants to know who is on-call at any given time; this variability can be simplified by directing all inbound calls to be placed to the central virtual room so that there is only one number to keep on record for a given specialty service.

For example, let us consider a rehabilitation clinic with 5 physicians who have all agreed to provide telehealth services. They decide that one of them will be available on a rotating basis, each taking one day of the week to be “on call” from their own offices. Remote clinics trying to reach a doctor would be frustrated if they had to call five doctors to find the one who was providing the services each day. By establishing this virtual room, the remote clinic could call into the room, the on-call doctor could call into the room, and they would be able to seamlessly connect.

The other identifier, which identifies specific people, allows participants in a multi-party call to see a textual label for each person, allowing you to see the name of the person speaking or the endpoint description for the site from which they are connecting.

Selected Communication Mode

In a traditional point-to-point videoconference, systems would agree on the connection speed. If one person’s system could not support the call, they would possibly have to drop the call and reconnect at a lower speed. MCUs provide different data rates to different endpoints, allowing people connecting at mismatched speeds to communicate without having to reconnect on lower settings. This means that people working on higher-bandwidth lines will not have to be throttled down if a single participant on a low-speed connection joins the call.

Selected Communication Mode also allows an MCU to establish a maximum and minimum call speed. If an organization determines that it is only willing to perform video consults over high-speed connections, it can design a system that will not allow low-performing endpoints to connect. Additionally, network resources can be conserved by limiting how much bandwidth can be used at any given time, either in a single conference or across multiple conferences.
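The per-endpoint rate handling described in this section might be modeled along these lines. The function name, the endpoint names, and the kbps figures are hypothetical; this is a simplified sketch of the idea, not how any particular MCU implements it.

```python
def negotiate_rates(
    endpoint_kbps: dict[str, int],
    min_kbps: int,
    max_kbps: int,
) -> tuple[dict[str, int], list[str]]:
    """Assign each endpoint its own call rate within the MCU's limits.

    Returns (admitted, rejected): admitted maps endpoint name to the rate it
    will use; rejected lists endpoints that fall below the minimum call speed.
    """
    admitted: dict[str, int] = {}
    rejected: list[str] = []
    for name, rate in endpoint_kbps.items():
        if rate < min_kbps:
            rejected.append(name)
        else:
            # Each endpoint keeps its own rate, capped at the maximum.
            admitted[name] = min(rate, max_kbps)
    return admitted, rejected


admitted, rejected = negotiate_rates(
    {"clinic": 384, "hospital": 4096, "home": 128},
    min_kbps=256,
    max_kbps=2048,
)
print(admitted)  # prints "{'clinic': 384, 'hospital': 2048}"
print(rejected)  # prints "['home']"
```

Note that the hospital is not throttled down to the clinic's 384 kbps; it is only capped by the organization's maximum.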


Audio signals can be mixed in a videoconference, allowing multiple people to speak at the same time. While this can be a distraction, as with face-to-face conversation, it allows people to interject while a person is talking without having to wait for them to finish.
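At its most basic, mixing means adding the audio samples from each participant together. The sketch below assumes equal-length streams of signed 16-bit samples and simply clips the sum to the valid range; the function name is hypothetical and real bridges apply more refined processing.

```python
def mix_pcm16(streams: list[list[int]]) -> list[int]:
    """Mix equal-length streams of signed 16-bit samples by summing and clipping."""
    if not streams:
        return []
    length = len(streams[0])
    if any(len(stream) != length for stream in streams):
        raise ValueError("all streams must have the same number of samples")
    # zip(*streams) groups the samples that occur at the same instant.
    return [max(-32768, min(32767, sum(samples))) for samples in zip(*streams)]


mixed = mix_pcm16([[1000, -2000, 30000], [500, -500, 10000]])
print(mixed)  # prints "[1500, -2500, 32767]"
```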

Additionally, people have the ability to call into a videoconference by phone, which allows them to engage in a conversation even if they do not have access to videoconferencing equipment.

Data Transmission

MCUs may have the ability to stream serialized data on a separate data channel, allowing devices such as serial electronic stethoscopes to send data to another endpoint. This process has not yet been standardized, and many MCUs will handle this differently. The TTAC has seen endpoints from one manufacturer unable to send serial data between their own endpoints via an MCU, yet the same MCU could send the data between two of a competitor’s endpoints.


Bridges can provide secure connections, to the degree that they may disallow anyone who does not support encryption from participating in a call.  They do not inherently provide more or less security, though the ability to require encryption for all call participants may help enforce organizational policies regarding the use of encryption for VTC communications.


When multiple bridges are connected, they are referred to as being “cascaded.” This means that a multiparty, bridged call from Organization A can join another multiparty, bridged call from Organization B, and all individual endpoints will be connected. There is a limit of three MCUs between endpoints, meaning that Organization A could be cascaded with Organization B, and Organization C could be cascaded with Organization B, effectively bridging the call between A and C. It would not be possible for Organization D to join to Organization C and communicate in the bridged call with Organization A. Note that this is not a common problem to face when first designing and deploying a videoconferencing system for an organization, but may be important to consider depending on how partner networks are connected.
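The cascading limit in the example above can be checked by counting the bridges on the path between two organizations. In this sketch each organization is represented by its bridge, and the link list, function names, and breadth-first search are illustrative assumptions; only the limit of three MCUs comes from the text.

```python
from collections import deque
from typing import Optional

MAX_MCUS_ON_PATH = 3  # limit stated in the text above


def mcus_on_path(links: list[tuple[str, str]], start: str, goal: str) -> Optional[int]:
    """Return the number of bridges on the shortest cascade path, or None."""
    neighbours: dict[str, set[str]] = {}
    for a, b in links:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    queue = deque([(start, 1)])  # the starting bridge counts as one MCU
    seen = {start}
    while queue:
        bridge, count = queue.popleft()
        if bridge == goal:
            return count
        for nxt in neighbours.get(bridge, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, count + 1))
    return None


def can_cascade(links: list[tuple[str, str]], start: str, goal: str) -> bool:
    count = mcus_on_path(links, start, goal)
    return count is not None and count <= MAX_MCUS_ON_PATH


links = [("A", "B"), ("B", "C"), ("C", "D")]
print(can_cascade(links, "A", "C"))  # prints "True"  (A, B, C: three MCUs)
print(can_cascade(links, "A", "D"))  # prints "False" (A, B, C, D: four MCUs)
```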


In addition to the endpoints and MCUs used in videoconferencing, it is important to make note of the other hardware that is often associated with videoconferencing. This additional hardware is often mandatory for any serious videoconferencing deployment. There are options for third party hosting and support of the various infrastructure components if hosting it within an organization is not feasible.


The functions performed by a gatekeeper will vary by model, but can generally be broken down into a handful of categories – address translation, admission control, bandwidth control, user authentication, and zone management. Address translation allows an endpoint to be given a user-friendly alias, with internal addressing and routing handled by the gatekeeper whenever another endpoint tries to dial the alias. Admission control and bandwidth control limit how many simultaneous calls can take place and manage bandwidth allocation. Zones consist of devices associated with a single gatekeeper; zone management helps control devices registered within a single zone, as well as how a zone’s endpoints communicate with other zones.
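Two of these functions, address translation and admission control, can be illustrated with a toy model. The class and method names, the alias, and the address (taken from a range reserved for documentation) are all hypothetical; an actual gatekeeper performs these tasks through standardized signaling rather than method calls.

```python
from typing import Optional


class Gatekeeper:
    """Toy model of alias lookup plus a cap on simultaneous calls."""

    def __init__(self, max_calls: int) -> None:
        self.max_calls = max_calls
        self.aliases: dict[str, str] = {}
        self.active_calls = 0

    def register(self, alias: str, address: str) -> None:
        """Address translation: map a user-friendly alias to a network address."""
        self.aliases[alias] = address

    def place_call(self, alias: str) -> Optional[str]:
        """Return the address to dial, or None if the call is refused."""
        address = self.aliases.get(alias)
        if address is None or self.active_calls >= self.max_calls:
            return None  # unknown alias, or admission control limit reached
        self.active_calls += 1
        return address

    def end_call(self) -> None:
        self.active_calls = max(0, self.active_calls - 1)


gatekeeper = Gatekeeper(max_calls=1)
gatekeeper.register("rehab-oncall", "192.0.2.10")
first = gatekeeper.place_call("rehab-oncall")   # admitted
second = gatekeeper.place_call("rehab-oncall")  # refused: limit of one call
print(first, second)  # prints "192.0.2.10 None"
```

Callers only ever need the alias; the address behind it can change without anyone having to update their records.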


Despite the fact that standardization has been widely adopted by videoconferencing manufacturers, getting devices from different manufacturers to communicate with one another can be a challenge. Standards have changed, new standards have been introduced, and optional features of some standards have been implemented in different ways between manufacturers. Gateways help manage these communication difficulties, connecting and translating (transcoding) between endpoints, MCUs, and other network devices.


Videoconferencing often takes place across the boundaries of at least one network, with connections occurring between different organizations or between internet-based endpoints and a single organization. There are certain challenges inherent in connecting these various networks. Proxies are designed to help mitigate these problems, sitting on the “edge” of a network and managing communication with endpoints, gateways, and gatekeepers.


Telehealth programs frequently utilize peripheral devices in conjunction with videoconferencing systems.  These peripheral devices depend on a range of supporting features, inputs, and outputs being available in the videoconferencing equipment.  Connectivity options for peripheral devices vary widely, and include standard- and high-definition video, 3.5 mm audio, serial, USB, and Bluetooth. Not all of these inputs are supported on videoconferencing endpoints, and it is important to note that some bridges introduce problems when attempting to transmit serial data between endpoints.


The videoconferencing market is undergoing a shift regarding the video inputs and outputs that are supported.  These changes have significant impact on both existing telehealth peripherals and future peripheral development for healthcare.  Videoconferencing systems previously supported the use of an external standard-definition video source through either an S-Video or Composite Video connection.  Many of the new products being released to the market forgo the use of standard-definition inputs in favor of high-definition HDMI inputs.  This shift in supported inputs means that devices lacking HD outputs require the use of additional devices that upscale or convert standard definition video signals to the HDMI format.

As high-definition video outputs are introduced to telehealth peripherals, this problem will be eliminated, though it may introduce the opposite problem: some of the new peripherals may only have HDMI outputs, and connecting them to existing VTC endpoints will be impossible without utilizing a downscaling or converting device.  It is now doubly important to verify that the devices to be used in a VTC deployment are capable of working together.

Software-based videoconferencing systems have several effective options for connecting auxiliary video inputs, with both standard- and high-definition USB converters allowing video devices to plug into a PC.  This means that various imaging devices with S-Video, Composite, HDMI, DVI, and Component outputs can be plugged into a computer and will appear as a webcam to the VTC software.  Mobile-based videoconferencing systems also have an increasing number of options as manufacturers produce attachments that allow standard scopes to utilize the built-in camera.

Regardless of the connection used, these inputs can be set as the primary video source while conferencing, allowing the peripheral device’s video stream to be sent to the far site. Common imaging devices include patient exam cameras, camcorders, digital cameras with video outputs, and otoscopes.


Audio connectivity is currently fairly limited when linking medical devices to videoconferencing endpoints. VTC endpoints may support XLR, RCA, Bluetooth, or 3.5 mm inputs, but the medical devices typically provide 3.5 mm outputs. Note that some stethoscopes use a 1.5-to-3.5mm cable for connecting to other devices. Connecting to videoconferencing endpoints requires either a 3.5 mm cable, if available, or a 3.5mm-to-RCA adapter.

Videoconferencing endpoints may support other audio devices through their additional inputs. There is a wide assortment of microphone options for videoconferencing. These microphones can have different performance characteristics, with each one suited for different environments and settings.

Sounds sent through an endpoint’s various audio inputs will be subject to the same encoding and decoding that the rest of the audio goes through (namely, the G.7xx standards). As such, these sounds are prone to some of the same issues of latency, jitter, and compression. Using other methods of transmission, such as serial data for some stethoscopes, may produce a more satisfactory result.

Serial Data – RS232

At this time there is only one category of devices commonly used in telemedicine that rely on serial data transfer over VTC – electronic stethoscopes. Even within the category of electronic stethoscopes there are only two manufacturers producing devices that perform this function (though some of these have been rebranded and sold through other outlets).

These particular electronic stethoscopes work by digitizing the audio and converting it to a serialized stream of data, which is then sent to the videoconferencing endpoint. The data is not compressed within the endpoint. Upon transmission to the receiving site, the data is streamed to another electronic stethoscope device that decodes the serialized information and plays it back through the headphones. This bypasses the compression that occurs through audio channels in the VTC endpoint.

Unfortunately, as mentioned in the standards section of this document, manufacturers have taken different approaches to transmitting serialized data. This means that it is not possible to send a serialized stream from one manufacturer’s endpoint to the endpoint of a different manufacturer. Further, serial inputs are not implemented by all VTC manufacturers.

Some stethoscope manufacturers offer Bluetooth stethoscopes.  It is important to note that these products do not communicate through videoconferencing software, and require a separate application to be running on a computer for transmitting the stethoscope sounds.


Videoconferencing technology encompasses an expansive range of devices, standards, and potential challenges. While standardization has done a lot to improve the ease with which networks and systems can be connected and managed, there are still sizeable differences in exactly how the pieces all fit together.

Videoconferencing endpoints often need to work within these larger networks and systems. While it may seem daunting to consider implementing an enterprise-wide videoconferencing system, these additional elements can help improve user experience, ease management issues, and ensure the ongoing success of a videoconferencing deployment.