Valkka 1.6.1
OpenSource Video Management
Library architecture

H264 decoding is done on the CPU, using the FFmpeg library. The final operation of interpolation from a YUV bitmap into an RGB bitmap of the correct (window) size is done on the GPU, using the OpenGL shading language.

This approach is a nice compromise as it takes some advantage of the GPU by offloading the (heavy) interpolation of (large) bitmaps.
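To make the division of labour concrete, here is a rough numpy sketch (conceptual only, not Valkka code) of the YUV to RGB conversion that the fragment shader performs for every pixel; the coefficients assume full-range BT.601:

import numpy as np

def yuv_to_rgb(y, u, v):
    """Convert full-range BT.601 YUV values (0..255) into RGB."""
    y = np.asarray(y, dtype=np.float32)
    u = np.asarray(u, dtype=np.float32) - 128.0  # chroma planes are centered at 128
    v = np.asarray(v, dtype=np.float32) - 128.0
    r = y + 1.402 * v
    g = y - 0.344136 * u - 0.714136 * v
    b = y + 1.772 * u
    return np.clip(np.stack([r, g, b], axis=-1), 0, 255).astype(np.uint8)

print(yuv_to_rgb(128, 128, 128))  # mid-gray: [128 128 128]

On the GPU this conversion is combined with the bilinear interpolation to the window size, so the CPU never has to touch the (large) RGB bitmaps.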

One might think that it would be fancier to do the H264 decoding on the GPU as well, but this is a road to hell - forget about it.

Nvidia, for example, offers H264/5 decoding directly on their GPUs, but then you are dependent on their proprietary implementation, which means that:

  • You'll never know how many H264 streams the proprietary graphics driver is allowed to decode simultaneously. Such restrictions are completely artificial and they are implemented so that you would buy a more expensive "specialized" card.
  • You'll never know what H264 "profiles" the proprietary driver supports.
  • There is no way you can even find out these things - no document exists that would reveal which chipset/card supports how many simultaneous H264 streams and which H264 profiles.

So, if you're decoding only one H264 stream, it might be ok to use proprietary H264 decoding on the GPU (but on the other hand, what's the point if it's only one stream..). If you want to do some serious parallel streaming (like here), invest in CPUs instead.

Other possibilities to transfer the H264 decoding completely to the GPU are (not implemented in Valkka at the moment):

  • Use Nvidia's VDPAU (the API itself is open source) through the Mesa stack. In this case you must use the X.org drivers for your GPU.
  • Create an H264 decoder based on the OpenGL shading language. This would be a cool (and demanding) project.

In Valkka, concurrency in decoding and presenting various streams simultaneously is achieved using multithreading and mutex-protected fifos. This works roughly as follows:

Live555 thread (LiveThread)     FrameFifo       Decoding threads      OpenGLFrameFifo          OpenGLThread
                                                     
                                                                                              +-------------+ 
                                                                                              |             |
 +---------------------------+                                                                |interpolation|
 | rtsp negotiation          | -> [FIFO] ->      [AVThread] ->                                |timing       |
 | frame composition         | -> [FIFO] ->      [AVThread] ->          [Global FIFO] ->      |presentation |
 |                           | -> [FIFO] ->      [AVThread] ->                                |             |
 +---------------------------+                                                                |             |
                                                                                              +-------------+

A general-purpose "mother" class Thread has been implemented (see multithreading) for multithreading schemes; it is inherited by:

  • LiveThread, for connecting to media sources using the Live555 streaming library, see livethread
  • AVThread, for decoding streams using the FFmpeg library and uploading them to the GPU, see decoding
  • OpenGLThread, which handles direct memory access to the GPU and presents the Frames based on their timestamps, see opengl

To get a rough idea of how Live555 works, please see Live555 primer and live555 bridge. LiveThread produces frames (class Frame) that are passed to mutex-protected fifos (see queues and fifos).

Between the threads, frames are passed through series of "filters" (see available framefilters). Filters can be used to modify the media packets (their timestamps, for example) and to copy and redirect the stream. Valkka library filters should not be confused with Live555 sinks/sources/filters, nor with FFmpeg filters - those are completely different things.

For visualization of the media stream plumbing / graphs, we adopt the following notation, which you should always use when commenting your python or cpp code:

() == Thread
{} == FrameFilter
[] == FrameFifo queue

To be more informative, we use:

(N. Thread class name: variable name)
{N. FrameFilter class name: variable name}
[N. FrameFifo class name: variable name]

A typical thread / framefilter graph would then look like this:

(1.LiveThread:livethread) --> {2.TimestampFrameFilter:myfilter} --> {3.FifoFrameFilter:fifofilter} --> [4.FrameFifo:framefifo] -->> (5.AVThread:avthread) --> ...

This means that:

  • (1) LiveThread reads the rtsp camera source and passes the frames to filter (2), which corrects the frames' timestamps.
  • Filter (2) passes the frames to a special filter (3, a FifoFrameFilter) that feeds the fifo queue (4).
  • (4) FrameFifo is a class that handles a mutex-protected fifo and a stack of frames.
  • (5) AVThread, running independently on the other side of the thread border, consumes frames from the fifo (4) and decodes them.

The whole filter chain from (1) to (4) is simply a callback cascade. Because of this, the execution of LiveThread (1) is blocked until the callback chain has been completed. The callback chain ends at the "thread border", marked with "-->>". On the "other side" of the thread border, another thread is running independently.

Also, keep in mind the following rule:

  • Threads read from mutex-protected fifos (the base class for fifos is FrameFifo)
  • Threads write into filters (base class FrameFilter)
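To illustrate the callback cascade and the thread border, here is a conceptual Python sketch; the class names only mimic Valkka's real C++ classes:

import threading, collections

class FrameFilter:
    """Writing into a filter == calling it; each filter calls the next one."""
    def __init__(self, next=None):
        self.next = next
    def run(self, frame):
        frame = self.process(frame)
        if self.next:
            self.next.run(frame)  # a synchronous call: the cascade blocks the caller
    def process(self, frame):
        return frame

class TimestampFilter(FrameFilter):
    def process(self, frame):
        frame["t"] += 0.1  # e.g. correct the timestamp
        return frame

class FrameFifo:
    """A mutex-protected fifo: one thread writes, another one reads."""
    def __init__(self):
        self.deque = collections.deque()
        self.cond = threading.Condition()
    def write(self, frame):
        with self.cond:
            self.deque.appendleft(frame)
            self.cond.notify()
    def read(self):
        with self.cond:
            while not self.deque:
                self.cond.wait()
            return self.deque.pop()

class FifoFilter(FrameFilter):
    """Terminates the cascade at the thread border ("-->>")."""
    def __init__(self, fifo):
        super().__init__()
        self.fifo = fifo
    def process(self, frame):
        self.fifo.write(frame)
        return frame

fifo = FrameFifo()
chain = TimestampFilter(next=FifoFilter(fifo))  # {TimestampFilter} --> {FifoFilter} --> [fifo]
consumer = threading.Thread(target=lambda: print("got", fifo.read()))
consumer.start()
chain.run({"t": 0.0, "payload": b"H264"})  # blocks only for the duration of the cascade
consumer.join()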

In practice, Thread classes manage their own internal FrameFifo and FifoFrameFilter instances, and things become simpler:

(1.LiveThread:livethread) --> {2.TimestampFrameFilter:myfilter} -->> (3.AVThread:avthread) --> ...

An input framefilter can be requested with AVThread::getFrameFilter().

LiveThread, AVThread and OpenGLThread constructors take a parameter that defines the stack/fifo combination (FrameFifoContext, OpenGLFrameFifoContext).

In the case of LiveThread, the API user passes a separate FrameFilter to LiveThread for each requested stream. That FrameFilter serves as the starting point of the filter chain. The last filter in the chain is typically a FifoFrameFilter, i.e. a filter that feeds the (modified/filtered) frames into a fifo, which is then consumed by a decoding AVThread.
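Putting the pieces together, a minimal sketch along the lines of Valkka's Python tutorial would look as follows (the RTSP address is a placeholder, and constructor signatures may vary slightly between library versions):

import time
from valkka.core import *

# Filtergraph, in the notation introduced above (the fifos and FifoFrameFilters
# at the "-->>" thread borders are managed internally by the threads):
#
# (1.LiveThread:livethread) -->> (2.AVThread:avthread) -->> (3.OpenGLThread:glthread)

glthread   = OpenGLThread("glthread")
avthread   = AVThread("avthread", glthread.getFrameFilter())  # decoded frames go to glthread
livethread = LiveThread("livethread")

# rtsp camera (placeholder address), identified by slot number 1,
# writing into avthread's input framefilter
ctx = LiveConnectionContext(LiveConnectionType_rtsp,
                            "rtsp://user:password@192.168.1.41", 1,
                            avthread.getFrameFilter())

glthread.startCall()
avthread.startCall()
livethread.startCall()

avthread.decodingOnCall()
livethread.registerStreamCall(ctx)
livethread.playStreamCall(ctx)

# map slot 1 into a newly created x window
window_id  = glthread.createWindow()
glthread.newRenderGroupCall(window_id)
context_id = glthread.newRenderContextCall(1, window_id, 0)

time.sleep(30)  # stream for half a minute

glthread.delRenderContextCall(context_id)
glthread.delRenderGroupCall(window_id)
livethread.stopCall()
avthread.stopCall()
glthread.stopCall()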

For more details, refer to examples, doxygen documentation and the source code itself.

Remember that example we sketched on the github readme page? Using our notation, it would look like this:

 (1.LiveThread:livethread) --> {2.TimestampFrameFilter:myfilter} 
                                  |
                                  +--> {3.ForkFrameFilter:forkfilter}  
                                         |    |
                                         |    |
       through filters, to filesystem <--+    +--->> (6.AVThread:avthread) ---------------------+
                                                                                                |
                                                                                                +--> {7.ForkFrameFilter:forkfilter2}  
                                                                                                                 |    |                                                                                    
                                                                                                                 |    |
                                  (10.OpenGLThread:openglthread) <<----------------------------------------------+    +--> .. finally, to analyzing process
                              feeds the video into various x windows
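The fork at (3) can be sketched with ForkFrameFilter, which writes every incoming frame into two terminal filters. In the sketch below (again along the lines of the Python tutorial; exact signatures may vary between versions), an InfoFrameFilter, i.e. a filter that just prints out frame information, stands in for the filesystem and analysis branches:

from valkka.core import *

glthread = OpenGLThread("glthread")
avthread = AVThread("avthread", glthread.getFrameFilter())

info_filter = InfoFrameFilter("info_filter")  # stand-in for the filesystem / analysis branch
fork_filter = ForkFrameFilter("fork_filter",  # copies each frame into both branches
                              avthread.getFrameFilter(), info_filter)

livethread = LiveThread("livethread")
ctx = LiveConnectionContext(LiveConnectionType_rtsp,
                            "rtsp://user:password@192.168.1.41", 1, fork_filter)
# ... start the threads and register/play ctx as in the previous example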

Some more miscellaneous details about the architecture:

  • The AVThread decoding threads both decode (with FFmpeg) and upload the resulting YUV bitmaps to the GPU. The upload into pixel buffer objects takes place in OpenGLFrameFifo.
  • The OpenGLThread performs the final interpolation from YUV into RGB bitmaps using the OpenGL shading language.
  • Fifos are a combination of a fifo queue and a stack: each frame inserted into the fifo is taken from an internal reservoir stack. If no frames are left in the stack, this means overflow, and the fifo/stack is reset to its initial state.
  • Reserving items for the fifo beforehand and placing them into a reservoir stack avoids constant memory (de)allocations that can become a bottleneck in multithreading schemes.
  • This way we also get "graceful overflow" behaviour: in case the decoding threads or the OpenGL thread are too slow (i.e. if you have too many or too heavy streams), the pipeline overflows in a controlled way (see the sketch after this list).
  • For a complete walk-through from stream source to x window, check out Code walkthrough: rendering
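The fifo/stack combination and the "graceful overflow" behaviour can be illustrated with a toy Python sketch (conceptual only; the real FrameFifo is C++ and blocks the reader with a condition variable):

import threading

class ReservoirFifo:
    def __init__(self, n=10):
        self.lock  = threading.Lock()
        self.stack = [bytearray(1024 * 1024) for _ in range(n)]  # frames allocated once
        self.fifo  = []
    def write(self, payload):
        with self.lock:
            if not self.stack:           # overflow: reset fifo/stack to the initial state
                self.stack += self.fifo  # recycle all frames, dropping the queued ones
                self.fifo = []
                print("overflow!")
            frame = self.stack.pop()     # take a pre-allocated frame from the stack
            frame[:len(payload)] = payload  # copy payload - no memory (de)allocation
            self.fifo.insert(0, frame)
    def read(self):
        with self.lock:
            if not self.fifo:
                return None
            frame = self.fifo.pop()
            self.stack.append(frame)     # recycle the frame back into the stack
            return frame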