THIS PAGE IS UNDER CONSTRUCTION

Latency

Certain SDR applications require deterministic timing of transmitted bursts when responding to a received packet. This precise timing can be achieved using UHD's scheduled transmit functionality (via tx_metadata). However, different devices (and host configurations) have different lower bounds on round-trip latency.

This document will focus on characterising and minimising the time delay for a host computer to send data to a USRP and have the device actually transmit the data out onto the air.

The simplest scenario consists of a USRP that is listening on a receive channel for a packet from a remote node. At a random point in time, it will:

  1. 'hear' a packet from a remote node
  2. demodulate and decode the packet
  3. act upon the payload
  4. form a response packet
  5. modulate the packet
  6. transmit the baseband signal (as a burst)

Latency during the transmit (reply) phase is of interest for two reasons:

  1. Trying to minimise the turn-around time (i.e. having the USRP actually transmit the baseband burst in the shortest period of time after the data has been submitted to UHD)
  2. Being able to transmit the burst at a precise point in time after the time-stamped reception of the original incoming packet (e.g. in a TDMA scheme where one can only transmit in their assigned time slot).

Please note: there are a number of variables involved, and your own observed performance will depend on a variety of factors outside the control of your USRP.
If you require hard real-time deterministic processing at higher layers in your radio/protocol stack, please consider moving your code into the FPGA.

Receiving Samples

The receive snippet assumes you have created an rx_streamer object from your UHD usrp device instance and started streaming. To do this:

uhd::stream_args_t stream_args("fc32");                                        // Output complex floats from UHD
// We are not setting any other stream arguments (yet)
uhd::rx_streamer::sptr rx_stream = usrp->get_rx_stream(stream_args);           // Get the RX streamer

uhd::stream_cmd_t stream_cmd(uhd::stream_cmd_t::STREAM_MODE_START_CONTINUOUS); // Set 'stream_mode' member from constructor
stream_cmd.stream_now = true;
usrp->issue_stream_cmd(stream_cmd);                                            // Start streaming now

Now that the USRP is streaming, you can read samples and the accompanying metadata:

std::vector<std::complex<float> > buff(samps_per_buff); // 'samps_per_buff' is up to you to set, and has an impact on latency in your application
uhd::rx_metadata_t rx_md;

size_t num_rx_samps = rx_stream->recv(&buff.front(), buff.size(), rx_md, timeout);

rx_md.time_spec provides the time-stamp of the first sample as a time_spec struct. This can then be used to schedule burst transmission in the future.

Samples per Buffer (SPB)

This is the size of the buffer given to the UHD rx_streamer recv function. (It corresponds to the samps_per_buff variable above.)

Samples per Packet (SPP)

This is the size of packet over the wire (minus framing overhead).
One can determine the maximum number of samples in a packet by calling the streamer object's get_max_num_samps().

To manually set the number of samples in a packet, the "spp" property can be added to the stream_args_t arguments used to create the rx_streamer:

uhd::stream_args_t stream_args("fc32");                                // Output complex floats from UHD

stream_args.args["spp"] = str(boost::format("%d") % samps_per_packet); // Set the property

uhd::rx_streamer::sptr rx_stream = usrp->get_rx_stream(stream_args);   // Get the RX streamer

samps_per_packet = rx_stream->get_max_num_samps();                     // Read it back to see whether it was honoured

Please see the relevant section under Device Notes below for how SPP affects latency.

Sending a Timed Burst

This assumes you have:
  1. rx_md.time_spec from a previous recv call.
  2. sample_rate (the current sample rate) and n (the number of samples that have been processed since rx_md.time_spec was set, i.e. the delta between the first sample in the current group and the current sample that has 'triggered' a timed burst response to be transmitted).

uhd::tx_metadata_t tx_md;

tx_md.start_of_burst = true; // This is the first packet in the chain.
tx_md.end_of_burst   = true; // This is the last packet in the chain.
tx_md.has_time_spec  = true; // Send at the supplied point in time (next line)
tx_md.time_spec      = rx_md.time_spec +                     // 'Base' time from when last buffer of samples was received
                       uhd::time_spec_t(0, n, sample_rate) + // Offset from first sample in that buffer to 'current' sample that triggered burst response
                       uhd::time_spec_t(delay);              // Pre-determined delta (adjusted depending on application)

size_t num_tx_samps = tx_stream->send(&response_buff.front(), response_buff.size(), tx_md, timeout);

If you wish to send the burst as quickly as possible (i.e. not at a pre-determined point in the future), you can make the following modification:

tx_md.has_time_spec = false;
// 'tx_md.time_spec' is now ignored

Note: setting end_of_burst to true has the advantage of telling UHD that it should no longer expect samples. Therefore it will not report underruns, and the USRP will automatically switch back to receive mode if an antenna switch is in the signal path (e.g. half-duplex mode with a daughterboard).

For a simple code example of how to perform timed burst transmission, please see tx_timed_samples.cpp.
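The arithmetic behind tx_md.time_spec in the timed-burst snippet above can be sketched without UHD. The Timestamp struct and the burst_time and normalise helpers below are hypothetical names used only for illustration; in a real application uhd::time_spec_t does this work. The sketch simply adds the sample offset (n / sample_rate) and the chosen delay to the receive time-stamp:

```cpp
#include <cassert>
#include <cmath>

// Hypothetical stand-in for a time-stamp: whole seconds plus a fractional part.
// (Not part of UHD; uhd::time_spec_t plays this role in a real application.)
struct Timestamp {
    long long full_secs; // whole seconds
    double frac_secs;    // fractional seconds, kept in [0, 1)
};

// Carry any whole seconds out of the fractional part.
Timestamp normalise(long long full_secs, double frac_secs) {
    const double whole = std::floor(frac_secs);
    return Timestamp{full_secs + static_cast<long long>(whole), frac_secs - whole};
}

// rx_time:     time-stamp of the first sample in the received buffer
// n:           offset (in samples) of the sample that triggered the response
// sample_rate: samples per second
// delay:       pre-determined extra delay in seconds
Timestamp burst_time(const Timestamp& rx_time, long long n,
                     double sample_rate, double delay) {
    assert(sample_rate > 0.0);
    const double offset = static_cast<double>(n) / sample_rate;
    return normalise(rx_time.full_secs, rx_time.frac_secs + offset + delay);
}
```

For example, a buffer time-stamped at 10.75 s, a trigger 250000 samples into it at 1 Msps, and a 0.5 s delay would schedule the burst for 11.5 s.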

General Latency Guidelines

  • Operating at a higher sample rate will mean data gets to/from the USRP more quickly.
  • Choosing a smaller Samples per Buffer (SPB) will mean less time is spent filling your buffer on a recv call.
  • Choosing a smaller Samples per Packet (SPP) will mean the device spends less time filling a packet's payload before sending it down the wire.
    • On certain devices (e.g. USB), changing only SPP is not sufficient: more changes are necessary (see below).
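To put rough numbers on the guidelines above, the time spent accumulating a buffer or packet is just the number of samples divided by the sample rate. The helper below is a minimal sketch (fill_time_seconds is a hypothetical name, not a UHD function) and ignores transport and processing overhead:

```cpp
#include <cassert>
#include <cstddef>

// Time (in seconds) needed to accumulate num_samps samples at sample_rate
// samples per second. Applies equally to SPB (host buffer) and SPP (packet).
double fill_time_seconds(std::size_t num_samps, double sample_rate) {
    assert(sample_rate > 0.0);
    return static_cast<double>(num_samps) / sample_rate;
}
```

With this, 64 samples take 2.56 us to accumulate at 25 Msps but 256 us at 250 ksps, which is why the same SPB or SPP can be acceptable at one rate and too slow at another.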

Device Notes

Ethernet (N2xx)

Remember to take into account your interface's MTU when setting SPP.
For example, if you request 512 samples, this will not fit in a standard Ethernet frame: 512 * 4 bytes (I/Q sample) + overhead > 1500 byte Ethernet frame.
This will result in packets not being received by UHD and will cause recv to time out.
The alternative is to enable Jumbo Frames on your NIC to allow for the larger size. If you are communicating with your USRP via network switch(es) make sure they support Jumbo Frames and they are enabled! (Switches tend to have this feature disabled by default.)
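The MTU check described above can be sketched as follows. The function names are hypothetical, and the overhead is left as a parameter because the exact framing overhead is not given here: any value you pass in is an assumption that should be checked against your own setup. The 4 bytes per sample default comes from the example above.

```cpp
#include <cassert>
#include <cstddef>

// True if a packet carrying spp samples fits within the interface MTU.
// ASSUMPTION: overhead_bytes (headers and framing) is supplied by the caller;
// this page does not state its exact value.
bool spp_fits_mtu(std::size_t spp, std::size_t mtu_bytes,
                  std::size_t overhead_bytes, std::size_t bytes_per_sample = 4) {
    return spp * bytes_per_sample + overhead_bytes <= mtu_bytes;
}

// Largest SPP that still fits in the MTU under the same assumptions.
std::size_t max_spp_for_mtu(std::size_t mtu_bytes, std::size_t overhead_bytes,
                            std::size_t bytes_per_sample = 4) {
    assert(bytes_per_sample > 0);
    if (mtu_bytes <= overhead_bytes) {
        return 0; // no room for any payload
    }
    return (mtu_bytes - overhead_bytes) / bytes_per_sample;
}
```

Using an illustrative overhead of 64 bytes, 512 samples do not fit in a 1500 byte frame but do fit once the MTU is raised to 4000.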

You can change your NIC's MTU by issuing:

sudo ifconfig ethX mtu <size, e.g. 4000 (9000 is usually the max)>

NIC Interrupt Moderation/Mitigation

NICs 'moderate' the assertion of interrupts when transmitting/receiving packets to decrease the load on the CPU.
In most environments this is beneficial, but to improve latency we want the NIC (via its driver settings) to act immediately (at the expense of potentially higher CPU usage).

See below for a comparison of N210 results with and without adjustment of interrupt moderation.

Intel NIC (82579LM)

This chipset was used during development of the latency characterisation code, and collection of the N2xx results on this page.

Commands for re-initialising Linux driver with arguments for maximum responsiveness:

sudo modprobe -r e1000e
sudo modprobe e1000e InterruptThrottleRate=0 TxAbsIntDelay=0 TxIntDelay=0 RxAbsIntDelay=0 RxIntDelay=0

If you wish to have more verbose driver output and/or have multiple network cards (making an array of values necessary), consider:

sudo modprobe e1000e debug=3 InterruptThrottleRate=0,0,0 TxAbsIntDelay=0,0,0 TxIntDelay=0,0,0 RxAbsIntDelay=0,0,0 RxIntDelay=0,0,0

USB

USB transfers are done through libusb by filling a libusb transfer (LUT). The size of this transfer can be changed by supplying the recv_frame_size/send_frame_size arguments as device arguments to UHD when creating your device instance. Please see the UHD Transport Notes for more information.
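As a minimal sketch of supplying these arguments, the helper below builds a comma-separated key=value string of the kind UHD accepts as device arguments. The function name usb_frame_size_args is hypothetical, and the sizes used in the example are illustrative only, not recommendations:

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>

// Build a device-argument string that sets the USB transfer sizes (in bytes).
std::string usb_frame_size_args(std::size_t recv_frame_size,
                                std::size_t send_frame_size) {
    assert(recv_frame_size > 0 && send_frame_size > 0);
    std::ostringstream args;
    args << "recv_frame_size=" << recv_frame_size
         << ",send_frame_size=" << send_frame_size;
    return args.str();
}
```

The resulting string, e.g. usb_frame_size_args(4096, 4096), would then be passed to uhd::usrp::multi_usrp::make() when creating the device instance.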

Controller Notes

On the machine (Lenovo ThinkPad T430s laptop) used for testing below, two controllers were available:

  1. USB 2: USB controller: Intel Corporation 7 Series/C210 Series Chipset Family USB Enhanced Host Controller (rev 04)
    Kernel module: ehci_hcd
  2. USB 3: USB controller: Intel Corporation 7 Series/C210 Series Chipset Family USB xHCI Host Controller (rev 04)
    Kernel module: xhci_hcd

EHCI has a tunable parameter log2_irq_thresh to control interrupt moderation, but the default (0) should already give best results. See ehci.txt for more information.
XHCI does not have any such parameter.

B100

There is an extra stage in data processing on the B100 hardware that has important implications for latency:

  • Data is passed from the DDC to the VITA49 framer.
    The framer uses the SPP parameter to determine how many samples to pack into the current VITA49 frame.
  • Each VITA49 frame is submitted to a LUT buffer, whose size is determined by recv_frame_size.
    The complete buffer (and all the VITA49 frames it contains) will be transferred via USB when the buffer either:
      1. is full and cannot accept another incoming VITA49 frame, or
      2. contains at least one packet and a transfer timeout occurs (this is the flush timer and operates on a per-cycle basis)

By default, the buffer is large (16K), and the flush timer expires at 65536 cycles (~1ms at 64MHz).

Therefore, to achieve minimum latency, it would make sense to decrease the SPP as usual and also set the flush timer's period to something very small so it checks more often. It is also possible to set the timeout to 0, which will cause it to check the buffer for a frame every cycle and transfer it immediately.

Also, refer to the notes on possible USB controller issues to mitigate overruns.
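The flush timer figures above follow from dividing the cycle count by the clock rate. The sketch below shows that conversion in both directions; the function names are hypothetical and the 64 MHz clock is taken from the default quoted above:

```cpp
#include <cassert>
#include <cmath>

// Flush timer period in seconds for a given number of clock cycles.
double flush_period_seconds(unsigned long cycles, double clock_rate_hz) {
    assert(clock_rate_hz > 0.0);
    return static_cast<double>(cycles) / clock_rate_hz;
}

// Number of clock cycles (rounded to nearest) for a desired flush period.
long long flush_cycles_for(double period_seconds, double clock_rate_hz) {
    assert(period_seconds >= 0.0 && clock_rate_hz > 0.0);
    return std::llround(period_seconds * clock_rate_hz);
}
```

The default of 65536 cycles at 64 MHz works out to 1.024 ms, and a target period of 100 us would correspond to 6400 cycles.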

USRP 1

Timing is emulated on the host (it is not done on the FPGA); therefore it will not perform in a deterministic fashion as the later devices do.

Please do not expect this device to perform in the manner that the other devices do!

Embedded (E1xx)

Due to the embedded nature of the device, the FPGA-to-host interface is slower. Therefore latency will not be as low as on an N2xx Ethernet device, for example.

  • SPP should not exceed 507 samples as this is too great to transfer to the processor (it exceeds the DMA transfer size, e.g. 512 will result in rx_metadata_t::ERROR_CODE_BAD_PACKET being returned by recv).

More Tweaks

Measuring Latency

Modes of operation:

  1. Stand-alone (no daughterboard required)
  2. Interactive
    1. Pure UHD
    2. GNU Radio

Getting Ready

  • Use CMake to generate Makefile for 'responder'
  • make

'responder'

responder accepts key commands while running in interactive mode:

  • d: toggle timed (scheduled) burst mode, or leave as best-effort
  • l: allow late bursts (as opposed to not transmitting at all if it will be late)
  • Left/right arrow: change transmit delay by current step size
  • Up/down arrow: change magnitude of step size
  • q: quit

'run_tests.py' requires 'responder' to be in the same directory

  • Please run apps with --help to see all possible options!
  • sudo is not strictly necessary, but code will try to change scheduler for better performance (and this requires elevated privileges).

Automatic:

sudo ./run_tests.py

Manual examples:

N210

sudo ./responder --spb=64 --rate=25e6 --iterations=1000 --delay-min=50e-6 --delay-max=200e-6 --delay-step=5e-6 --duration=100e-6

B100

sudo ./responder --spb=64 --rate=4e6 --iterations=1000 --delay-min=3000e-6 --delay-max=5000e-6 --delay-step=50e-6 --duration=10e-6 --simulate=199 --time-mul=1e6 --test-success=3

E100

sudo ./responder --spb=64 --rate=8e6 --iterations=1000 --delay-min=50e-6 --delay-max=5000e-6 --delay-step=50e-6 --duration=10e-6 --simulate=199 --time-mul=1e6 --test-success=3

Graphing results:

find . -name "latency*.txt" | ./graph.py

Results

These results were collected in-house for each device using run_tests.py above.

We are very interested to hear how your system performs!
If you wish to send us your results for comparison (especially if they are better than those below), please email them to us!

The specifications of the test computer were:

Brand Lenovo ThinkPad T430s
CPU Quad-core GHz
NIC Intel GbE 82579LM with e1000e 2.1.4-NAPI
O/S Linux
Kernel 3.5.3
Distribution Mint
UHD 003.005.000

A standard E100 was used for the embedded tests.

N210

TEST PIC:

B100

PIX

Controller Issues

Please read the USB controller notes to put the following in context (i.e. the hardware configuration of the test machine):

During initial testing, a B100 was connected to the single external USB 2 port. While running uhd_fft at 8 Msps, hardware overruns would be reported (there are two types of overruns: those reported by the hardware itself, and those detected by UHD as lost packets, or sequence errors). Increasing the recv_frame_size to twice the default (32768) would help, however moving the gain slider rapidly (to cause additional traffic on the wire) would trigger more hardware overruns.

When the B100 was connected to the USB 3 port, no overruns were reported.

E100

Best combination of parameters for each of the following sample rates:

Sample rate (Msps)   SPB     SPP             Approximate minimum latency for greatest chance of success (us)
0.25                 32/64   64              750
1                    32      128             500
4                    64      128             500
8                    256     507 (default)   250

Latency at 250 ksps

Latency at 1 Msps

Latency at 4 Msps

Latency at 8 Msps

USRP 1

As the USRP 1 is not capable of reporting whether a burst was transmitted on time (the timed burst transmission feature is emulated on the host), the 'test' was instead conducted by observing the burst signals on a scope. responder was taken out of self-test mode and instead made to listen for a periodic incoming pulse. This simple received 'packet' then triggers the faux timed burst to be transmitted by the USRP. The 'latency' can be seen by feeding both the received and transmitted signals into the scope and measuring how long after the trigger the reply is transmitted (more importantly: how late the reply is relative to the future point in time at which we requested it be sent using tx_metadata).

5ms

PIX

Further Reading

One can refer to:

  • rx_timed_samples.cpp for the basics of doing timed reception (i.e. receiving samples at a precise point in the future)

General Performance Improvements

For N2xx, make sure you have enough kernel memory set aside for the network receive/transmit buffers.

sudo sysctl net.core.rmem_max will return the current maximum receive buffer size (net.core.wmem_max will return the maximum transmit buffer size).

To set it to something large (e.g. 50M, 128M, or even higher):

sudo sysctl -w net.core.rmem_max=256000000
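As a rough guide to what such a setting buys, the sketch below estimates how many seconds of streaming a buffer of a given size could absorb. The function name is hypothetical, and the estimate assumes 4 bytes per sample on the wire and ignores per-packet and kernel bookkeeping overhead, so treat the result as an upper bound:

```cpp
#include <cassert>
#include <cstddef>

// Approximate seconds of sample data that fit in buffer_bytes.
// ASSUMPTIONS: 4 bytes per sample by default; no per-packet or kernel overhead.
double buffer_depth_seconds(std::size_t buffer_bytes, double sample_rate,
                            std::size_t bytes_per_sample = 4) {
    assert(sample_rate > 0.0 && bytes_per_sample > 0);
    return static_cast<double>(buffer_bytes) /
           (sample_rate * static_cast<double>(bytes_per_sample));
}
```

Under these assumptions, the 256000000 byte setting above would hold about 2.56 s of samples at 25 Msps.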

If using FUSE to capture a stream to a disk (e.g. NTFS-3G), then the FUSE mount helper will consume a lot of CPU and potentially cause packets to be dropped. To avoid this, try to optimise your FUSE configuration. For example:

  • NTFS-3G: use the big_writes mount option to improve performance for large writes (and optionally re-format using a larger NTFS Cluster Size)

To Do

  • CPU governor impact (e.g. Performance vs. On-demand)
  • sudo (for FIFO/RR)
  • Pre-emptive (better) vs. server (worse) kernel
  • Why doesn't SetMaxOutput in GNU Radio have an effect?
  • ethtool -G eth0 tx 128
  • ifconfig eth0 txqueuelen 100
To add:
  • Tek screencaps
  • GNU Plot output