THIS PAGE IS UNDER CONSTRUCTION
Certain SDR applications require deterministic timing of transmitted bursts when responding to a received packet. This precise timing can be achieved using UHD's scheduled transmit functionality (via tx_metadata). However, different devices (and host configurations) will have different lower bounds on round-trip latency.
This document will focus on characterising and minimising the time delay for a host computer to send data to a USRP and have the device actually transmit the data out onto the air.
The simplest scenario consists of a USRP that is listening on a receive channel for a packet from a remote node. At a random point in time, it will:
- 'hear' a packet from a remote node
- demodulate and decode the packet
- act upon the payload
- form a response packet
- modulate the packet
- transmit the baseband signal (as a burst)
Latency during the transmit (reply) phase is of interest for two reasons:
- Trying to minimise the turn-around time (i.e. having the USRP actually transmit the baseband burst in the shortest period of time after the data has been submitted to UHD)
- Being able to transmit the burst at a precise point in time after the time-stamped reception of the original incoming packet (e.g. in a TDMA scheme where one can only transmit in their assigned time slot).
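As a rough illustration of the TDMA case, the start of the next assigned slot can be derived from a time on the device clock. This is only a sketch under assumptions that are not part of this page: all slots have equal length, frames repeat back-to-back starting at time zero, and the names `next_slot_start`, `slot_len`, `slots_per_frame` and `my_slot` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Hypothetical helper: returns the earliest start time (in seconds on the
// device clock) of our assigned slot that is not before 'earliest'.
// Assumes equal-length slots and frames that repeat from time zero.
double next_slot_start(double earliest, double slot_len, int slots_per_frame, int my_slot)
{
    assert(slot_len > 0.0 && slots_per_frame > 0);
    assert(my_slot >= 0 && my_slot < slots_per_frame);
    const double frame_len = slot_len * slots_per_frame; // duration of one TDMA frame
    const double offset    = slot_len * my_slot;         // our slot's offset within a frame
    // Index of the first frame whose copy of our slot starts at or after 'earliest'
    const double frame_idx = std::max(std::ceil((earliest - offset) / frame_len), 0.0);
    return frame_idx * frame_len + offset;
}
```

In practice `earliest` would be the receive time-stamp plus the smallest turn-around delay your host and device can reliably meet.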
Please note: there are a number of variables involved, and your own observed performance will depend on a variety of factors outside the control of your USRP.
If you require hard real-time deterministic processing at higher layers in your radio/protocol stack, please consider moving your code into the FPGA.
- Receiving Samples
- Sending a Timed Burst
- General Latency Guidelines
- Device Notes
- More Tweaks
- Measuring Latency
- Further Reading
- To Do
uhd::stream_args_t stream_args("fc32"); // Output complex floats from UHD
// We are not setting any other stream arguments (yet)
uhd::rx_streamer::sptr rx_stream = usrp->get_rx_stream(stream_args); // Get the RX streamer
uhd::stream_cmd_t stream_cmd(uhd::stream_cmd_t::STREAM_MODE_START_CONTINUOUS); // Set 'stream_mode' member from constructor
stream_cmd.stream_now = true;
usrp->issue_stream_cmd(stream_cmd); // Start streaming now
Now that the USRP is streaming, you can read samples and the accompanying metadata:
std::vector<std::complex<float> > buff(samps_per_buff); // 'samps_per_buff' is up to you to set, and has an impact on latency in your application
uhd::rx_metadata_t rx_md;
size_t num_rx_samps = rx_stream->recv(&buff.front(), buff.size(), rx_md, timeout);
rx_md.time_spec provides the time-stamp of the first sample as a time_spec struct. This can then be used to schedule burst transmission in the future.
Samples per Buffer (SPB)¶
This is the size of the buffer given to the UHD rx_streamer recv function. (Relates to the variable samps_per_buff above.)
Samples per Packet (SPP)¶
This is the size of packet over the wire (minus framing overhead).
One can determine the maximum number of samples in a packet by calling the streamer object's get_max_num_samps function:
uhd::stream_args_t stream_args("fc32"); // Output complex floats from UHD
stream_args.args["spp"] = str(boost::format("%d") % samps_per_packet); // Set the property
uhd::rx_streamer::sptr rx_stream = usrp->get_rx_stream(stream_args); // Get the RX streamer
samps_per_packet = rx_stream->get_max_num_samps(); // Read it back to see whether it was honoured
Please see the relevant Device Notes section below for how SPP affects latency.
Sending a Timed Burst¶
This assumes you have:
- rx_md.time_spec from a previous recv call
- sample_rate, the current sample rate
- n, the number of samples that have been processed since rx_md.time_spec was set (i.e. the delta between the first sample in the current group, and the current sample that has 'triggered' a timed burst response to be transmitted).
uhd::tx_metadata_t tx_md;
tx_md.start_of_burst = true; // This is the first packet in the chain.
tx_md.end_of_burst = true; // This is the last packet in the chain.
tx_md.has_time_spec = true; // Send at the supplied point in time (next line)
tx_md.time_spec = rx_md.time_spec + // 'Base' time from when last buffer of samples was received
    uhd::time_spec_t(0, n, sample_rate) + // Offset from first sample in that buffer to 'current' sample that triggered burst response
    uhd::time_spec_t(delay); // Pre-determined delta (adjusted depending on application)
size_t num_tx_samps = tx_stream->send(&response_buff.front(), response_buff.size(), tx_md, timeout);
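The arithmetic behind tx_md.time_spec can be checked in isolation. The following is a minimal sketch that uses plain seconds in a double rather than uhd::time_spec_t, so that it compiles without UHD; the function name `scheduled_tx_time` is hypothetical and is not part of the UHD API.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper mirroring the three-term sum used for tx_md.time_spec:
//   base RX time-stamp + offset of the triggering sample + chosen delay.
// All times are in seconds on the device clock.
double scheduled_tx_time(double rx_time, std::size_t n, double sample_rate, double delay)
{
    assert(sample_rate > 0.0);
    // Time elapsed between the first sample of the buffer and the sample
    // that triggered the response
    const double trigger_offset = static_cast<double>(n) / sample_rate;
    return rx_time + trigger_offset + delay;
}
```

For example, a buffer time-stamped at 1.0 s, a trigger at sample 250 at 1 Msps, and a delay of 0.5 s gives a transmit time of 1.50025 s. A single double loses resolution as the device time grows, which is one reason to keep using uhd::time_spec_t (it stores whole and fractional seconds separately) in real code.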
If you wish to send the burst as quickly as possible (i.e. not at a pre-determined point in the future), you can make the following modification:
tx_md.has_time_spec = false; // 'tx_md.time_spec' is now ignored
Note: this has the advantage of telling UHD that it should no longer expect samples. Therefore it will not report underruns, and the USRP will automatically switch back to receive mode if an antenna switch is in the signal path (e.g. half-duplex mode with a daughterboard).
For a simple code example on how to perform timed burst transmission, please see tx_timed_samples.cpp
General Latency Guidelines¶
- Operating at a higher sample rate will mean data gets to/from the USRP more quickly
- Choosing a smaller Samples per Buffer will mean less time filling your buffer on a recv call
- Choosing a smaller Samples per Packet will mean the device will spend less time filling a packet's payload before sending it down the wire.
- On certain devices (e.g. USB), changing only SPP is not sufficient; more changes are necessary (see below).
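To put numbers on the first three guidelines, the time needed to accumulate a buffer or a packet is simply the sample count divided by the sample rate. A minimal sketch follows; the helper name `fill_time_us` is hypothetical, and it only covers the fill time, not transport or host processing delays.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: time (in microseconds) needed for 'num_samps' samples
// to arrive at 'sample_rate' samples per second. This is a lower bound on the
// latency contributed by waiting for a full buffer (SPB) or packet (SPP).
double fill_time_us(std::size_t num_samps, double sample_rate)
{
    assert(sample_rate > 0.0);
    return static_cast<double>(num_samps) / sample_rate * 1e6;
}
```

With 64 samples, the fill time is 2.56 us at 25 Msps and 16 us at 4 Msps, which shows why both a higher rate and a smaller buffer help.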
Remember to take into account your interface's MTU when setting SPP.
For example, if you request 512 samples, this will not fit in a standard Ethernet frame:
512 * 4 bytes (I/Q sample) + overhead > 1500 byte Ethernet frame.
This will result in packets not being received by UHD and will cause recv to time out.
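The check above can be written down as a small calculation. This is a sketch only: the function names are hypothetical, and the per-packet overhead (network and framing headers) is left as a parameter because its exact value depends on the device and transport.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: largest number of samples whose payload fits in one
// frame of 'mtu' bytes, given 'overhead' bytes of headers per packet and
// 'bytes_per_samp' bytes per I/Q sample on the wire (4 in the example above).
std::size_t max_samps_per_frame(std::size_t mtu, std::size_t overhead, std::size_t bytes_per_samp)
{
    assert(bytes_per_samp > 0);
    if (overhead >= mtu) {
        return 0; // no room for any payload
    }
    return (mtu - overhead) / bytes_per_samp;
}

// Hypothetical helper: true if a packet carrying 'spp' samples fits in the MTU.
bool fits_in_frame(std::size_t spp, std::size_t mtu, std::size_t overhead, std::size_t bytes_per_samp)
{
    return spp <= max_samps_per_frame(mtu, overhead, bytes_per_samp);
}
```

Using a placeholder overhead of 50 bytes (an illustrative value, not a measured one), 512 samples do not fit in a 1500-byte frame but do fit once the MTU is raised to 4000 bytes.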
The alternative is to enable Jumbo Frames on your NIC to allow for the larger size. If you are communicating with your USRP via network switch(es) make sure they support Jumbo Frames and they are enabled! (Switches tend to have this feature disabled by default.)
You can change your NIC's MTU by issuing:
sudo ifconfig ethX mtu <size, e.g. 4000 (9000 is usually the max)>
NIC Interrupt Moderation/Mitigation¶
NICs 'moderate' the assertion of interrupts when transmitting/receiving packets to decrease the load on the CPU.
In most environments this is beneficial, but to improve latency we want the NIC (via its driver settings) to raise interrupts immediately (at the expense of potentially higher CPU usage).
- Description of Interrupt Coalescing
See below for a comparison of N210 results with and without adjustment of interrupt moderation.
Intel NIC (82579LM)¶
This chipset was used during development of the latency characterisation code, and collection of the E2xx results on this page.
- Under Linux, uses the e1000e driver (official source available on Sourceforge).
- The Official Driver README lists the driver arguments that can control performance.
- Intel's presentation on Interrupt Moderation in GbE controllers
Commands for re-initialising Linux driver with arguments for maximum responsiveness:
sudo modprobe -r e1000e
sudo modprobe e1000e InterruptThrottleRate=0 TxAbsIntDelay=0 TxIntDelay=0 RxAbsIntDelay=0 RxIntDelay=0
If you wish to have more verbose driver output and/or have multiple network cards (making an array of values necessary), consider:
sudo modprobe e1000e debug=3 InterruptThrottleRate=0,0,0 TxAbsIntDelay=0,0,0 TxIntDelay=0,0,0 RxAbsIntDelay=0,0,0 RxIntDelay=0,0,0
USB transfers are done through libusb by filling a LUT. The size of this transfer can be changed by supplying the recv_frame_size and send_frame_size arguments as device arguments to UHD when creating your device instance. Please see the UHD Transport Notes for more information.
On the machine (Lenovo ThinkPad T430s laptop) used for testing below, two controllers were available:
- USB 2:
USB controller: Intel Corporation 7 Series/C210 Series Chipset Family USB Enhanced Host Controller (rev 04)
- USB 3:
USB controller: Intel Corporation 7 Series/C210 Series Chipset Family USB xHCI Host Controller (rev 04)
EHCI has a tunable parameter log2_irq_thresh to control interrupt moderation, but the default (0) should already give best results. See ehci.txt for more information.
XHCI does not have any such parameter.
There is an extra stage in data processing on the B100 hardware that has important implications for latency:
- Data is passed from the DDC to the VITA49 framer.
The framer uses the SPP parameter to determine how many samples to pack into the current VITA49 frame.
- Each VITA49 frame is submitted to a LUT buffer, whose size is determined by recv_frame_size.
The complete buffer (and all the VITA49 frames it contains) will be transferred via USB when the buffer either:
  - is full and cannot accept another incoming VITA49 frame
  - contains at least one packet and a transfer timeout occurs (this is the flush timer and operates on a per-cycle basis)
By default, the buffer is large (16K), and the flush timer expires after 65536 cycles (~1 ms at 64 MHz).
Therefore, to achieve minimum latency, it would make sense to decrease the SPP as usual and also set the flush timer's period to something very small so it checks more often. It is also possible to set the timeout to 0, which will cause it to check the buffer for a frame every cycle and transfer it immediately.
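The relationship between the flush timer's cycle count and the resulting wait can be worked out directly. This is a minimal sketch; `flush_period_us` is a hypothetical name, and the clock rate must be whatever rate the flush timer actually counts at on your device (64 MHz in the default case quoted above).

```cpp
#include <cassert>

// Hypothetical helper: converts a flush timer setting in clock cycles into
// the longest time (in microseconds) a frame may wait in the buffer before
// the timer forces a USB transfer.
double flush_period_us(unsigned long cycles, double clock_hz)
{
    assert(clock_hz > 0.0);
    return static_cast<double>(cycles) / clock_hz * 1e6;
}
```

The default of 65536 cycles at 64 MHz corresponds to 1024 us, which dominates the latency at small SPP values; a setting of 0 removes this wait entirely.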
Also, refer to notes on possible USB controller issues to mitigate against overruns.
Timing is emulated on the host (it is not done on the FPGA), therefore it will not perform in a deterministic fashion as the later devices do.
Please do not expect this device to perform in the manner that the other devices do!
Due to the embedded nature of the device, the FPGA-to-host interface is slower. Therefore latency will not be as low as on an N2xx Ethernet device, for example.
- SPP should not exceed 507 samples, as anything larger exceeds the DMA transfer size to the processor (e.g. 512 will result in rx_metadata_t::ERROR_CODE_BAD_PACKET being returned by recv)
Modes of operation:
- Stand-alone (no daughterboard required)
- Pure UHD
- GNU Radio
- Use CMake to generate Makefile for 'responder'
responder accepts key commands while running in interactive mode:
- d: toggle timed (scheduled) burst mode, or leave as best-effort
- l: allow late bursts (as opposed to not transmitting at all if it will be late)
- Left/right arrow: change transmit delay by current step size
- Up/down arrow: change magnitude of step size
- q: quit
'run_tests.py' requires 'responder' to be in the same directory
- Please run apps with --help to see all possible options!
- sudo is not strictly necessary, but code will try to change scheduler for better performance (and this requires elevated privileges).
sudo ./responder --spb=64 --rate=25e6 --iterations=1000 --delay-min=50e-6 --delay-max=200e-6 --delay-step=5e-6 --duration=100e-6
sudo ./responder --spb=64 --rate=4e6 --iterations=1000 --delay-min=3000e-6 --delay-max=5000e-6 --delay-step=50e-6 --duration=10e-6 --simulate=199 --time-mul=1e6 --test-success=3
sudo ./responder --spb=64 --rate=8e6 --iterations=1000 --delay-min=50e-6 --delay-max=5000e-6 --delay-step=50e-6 --duration=10e-6 --simulate=199 --time-mul=1e6 --test-success=3
find . -name "latency*.txt" | ./graph.py
find results/ -maxdepth 1 -name "*B210*rate_10*.txt" | ./graph.py --sort="-rate -spp -spb" --output=B210_1Msps_results
These results were collected in-house for each device using
We are very interested to hear how your system performs!
If you wish to send us your results for comparison (especially if they are better than those below), please email them to us!
The specifications of the test computer were:
|Brand||Lenovo ThinkPad T430s|
|CPU||Quad-core 2.9 GHz|
|NIC||Intel GbE 82579LM with e1000e 2.1.4-NAPI|
A standard E100 was used for the embedded tests.
Please read the USB controller notes to put the following in context (i.e. the hardware configuration of the test machine):
During initial testing, a B100 was connected to the single external USB 2 port. While running
uhd_fft at 8 Msps, hardware overruns would be reported (there are two types of overruns: those reported by the hardware itself, and those detected by UHD as lost packets, or sequence errors). Increasing the
recv_frame_size to twice the default (32768) would help, however moving the gain slider rapidly (to cause additional traffic on the wire) would trigger more hardware overruns.
When the B100 was connected to the USB 3 port, no overruns were reported.
The specifications of the test computer for the B200 tests were:
|CPU||Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz|
|USB controller||Intel Corporation 7 Series/C210 Series Chipset Family USB xHCI Host Controller (rev 04)|
|Sample rate (Msps)||SPB||SPP||minimum latency for greatest chance of success (us)|
As the USRP 1 is not capable of reporting whether a burst was transmitted on time (the timed burst transmission feature is emulated on the host), the 'test' was instead conducted by observing the burst signals on a scope.
responder was taken out of self-test mode and instead made to listen for a periodic incoming pulse. This simple received 'packet' then triggers the faux timed burst to be transmitted by the USRP. The 'latency' can be seen by feeding both the received and transmitted signals into the scope and observing how long after the trigger the reply is transmitted (more importantly: how late the reply is after the future point in time when we requested it to be sent using
sudo ./responder --args "type=usrp1" --rate=1e6 --no-delay --invert --flush=4000 --no-eob
One can refer to:
- rx_timed_samples.cpp for the basics of doing timed reception (i.e. receiving samples at a precise point in the future)
General Performance Improvements¶
For N2xx, make sure you have enough kernel memory set aside for the network receive/transmit buffers.
sudo sysctl net.core.rmem_max will return the current receive buffer size (net.core.wmem_max will return the transmit buffer size).
To set it to something large (e.g. 50M, 128M, or even higher):
sudo sysctl -w net.core.rmem_max=256000000
If using FUSE to capture a stream to a disk (e.g. NTFS-3G), then the FUSE mount helper will consume a lot of CPU and potentially cause packets to be dropped. To avoid this, try to optimise your FUSE configuration. For example:
- NTFS-3G: use the big_writes mount option to improve performance for large writes (and optionally re-format using a larger NTFS Cluster Size)
- CPU governor impact (e.g. Performance vs. On-demand)
- sudo (for FIFO/RR)
- Pre-emptive (better) vs. server (worse) kernel
- Why doesn't SetMaxOutput in GNU Radio have an effect?
ethtool -G eth0 tx 128
ifconfig eth0 txqueuelen 100