OTP 22 Highlights

May 13, 2019 · by Lukas Larsson

OTP 22 has just been released. It has been a long process with three release candidates before the final release. We decided this year to try to get one month more testing of the major release and I think that the extra time has paid off. We’ve received many bug reports from the community about large and small bugs that our internal tests did not find.

This blog post will describe some highlights of what is released in OTP 22 and in OTP 21 maintenance patches.

You can download the readme describing the changes here: OTP 22 Readme. Or, as always, look at the release notes of the application you are interested in. For instance here: OTP 22 Erts Release Notes.

Compiler #

In OTP 22 we have completely re-implemented the lower levels of the Erlang compiler. Before this change the Erlang compiler consisted of a number of IRs (intermediate representations):

Erlang AST -> Core Erlang -> Kernel Erlang -> Beam Asm

When compiling an Erlang module, the code is optimized and transformed between these different IRs. In OTP 22 we have almost removed the Kernel Erlang IR and added a new IR called Beam SSA. There are a series of blog posts describing this change in greater details for those that are interested.

With this change the compile pipeline now looks like this:

Erlang AST -> Core Erlang -> Kernel Erlang -> Beam SSA -> Beam Asm

Together with the SSA rewrite a number of new optimizations have been introduced. One such is strengthening of the bit syntax. Before the change, you had to be very careful with how you wrote your binary matching in order for the binary match context optimization to work properly. There were also scenarios where it was impossible to get the optimization to trigger at all. One place in Erlang/OTP where this had a great effect was the internal string:bin_search_inv_1 function used by string:lexemes/1 and other string functions. We can see the change in the benchmark graph below (where higher is better and the turquoise line in the OTP 22 branch):

String Lexemes OTP 22 benchmark

You can read more about this optimization in PR1958 and Retiring old performance pitfalls.

Another great optimization is PR2100 which makes the compiler’s type optimization pass work across functions within the same module. For instance in the code below:

-record(myrecord, {value}).

h(#myrecord{value=Val}) ->
    #myrecord{value=Val+1}.

i(A) ->
    #myrecord{value=V} = h(#myrecord{value=A}),
    V.

The new compiler is able to detect the type of the term passed as an argument to h/1 and also the return value of h/1 so it can eliminate the record checks completely. Looking at the BEAM code (produced by erlc -S) of the h/1 function we get:

OTP 21:

    {test,is_tagged_tuple,{f,9},[{x,0},2,{atom,myrecord}]}.
    {get_tuple_element,{x,0},0,{x,1}}.
    {get_tuple_element,{x,0},1,{x,2}}.
    {gc_bif,'+',{f,0},3,[{x,2},{integer,1}],{x,0}}.
    {test_heap,3,1}.

OTP 22:

    {get_tuple_element,{x,0},1,{x,0}}.
    {gc_bif,'+',{f,0},1,[{x,0},{integer,1}],{x,0}}.
    {test_heap,3,1}.

The is_tagged_tuple instruction has been completely eliminated and as an added bonus one get_tuple_element was also removed.

However, this is only the start and we are already looking into making even better optimizations for OTP 23, building on top of the SSA rewrite.

Socket #

OTP 22 comes with a new experimental socket API. The idea behind this API is to have a stable intermediary API that users can use to create features that are not part of the higher-level gen APIs. We will also be using this API to re-implement the higher-level gen APIs in OTP 23.

Another aspect of the new socket API is that it can be used to greatly reduce the overhead that is inherent with using ports. I wrote this microbenchmark called gen_tcp2 to see what the difference could be.

Erlang/OTP 22 [erts-10.4] [source] [64-bit]

Eshell V10.4  (abort with ^G)
1> gen_tcp2:run().
              client             server
 gen_tcp:       12.4 ns/byte       12.4 ns/byte
gen_tcp2:        7.3 ns/byte        7.3 ns/byte
   ratio:       58.9 %             58.9%
ok

The results seem promising. The socket implementation of gen_tcp uses roughly 40% less CPU to send the same amount of packets. Of course, gen_tcp does a lot more than gen_tcp2 (dealing with lots of buffers, error cases and IPv6 to name a new), so it is not by any means a fair comparison. Though if an application can live without all the guarantees that come with gen_tcp, then using socket could be very good for performance.

Write concurrency in ordered_sets #

PR1952 contributed by Kjell Winblad from Uppsala University makes it possible to do updates in parallel on ets tables of the type ordered_set. This together with other improvements by Kjell Winblad and Sverker Eriksson (PR1997 and PR2190) has greatly increased the scalability of such ets tables that are the base for many applications, for instance, pg2 and the default ssl session cache.

Ordered Set Write Concurrency OTP 22 benchmark

In the benchmark above we can see that on an ordered_set table the operations per seconds possible on a 64 core machine has increased dramatically between OTP 21 and OTP 22. You can see a description of the benchmark and the results of many more benchmarks here.

The data structure used to enable write_concurrency in the ordered_set is called contention adaptive search tree. In a nutshell, the data structure keeps a shadow tree that represents the locks needed to read or write a term in the tree. When conflicts between multiple writers happen, the shadow tree is updated to have more fine-grained locks for specific branches of the tree. You can read more about the details of the algorithm in A Contention Adapting Approach to Concurrent Ordered Sets (PDF).

TLS Improvements #

In OTP 21.3 the culmination of many optimizations in the ssl application was released. For certain use-cases, the overhead of a using TLS has been significantly reduced. For instance in this TLS distribution benchmark:

TLS Dist OTP 22 benchmark

The bytes per second that the Erlang distribution over TLS is able to send has been increased from 17K to about 80K, so more than 4 times as much data as before. The throughput gain above is mostly due to better batching of distribution messages which makes it so that ssl does not have to add a lot of padding to each message sent. So it does not translate over to using ssl directly but is still a very nice performance improvement.

In OTP 22 the logging facility for ssl has been greatly improved and there is now basic server support for TLSv1.3. In order to work with TLSv1.3 you need to install an OpenSSL version that supports TLSv1.3 (for instance 1.1.1b), compile Erlang/OTP using that OpenSSL version and generate the correct certificates. Then we can start a TLSv1.3 server like this:

LOpts = [{certfile, "tls_server_cert.pem"},
	     {keyfile, "tls_server_key.pem"},
	     {versions, ['tlsv1.3']},
	     {log_level, debug}
	    ],
{ok, LSock} = ssl:listen(8443, LOpts),
{ok, CSock} = ssl:transport_accept(LSock),
{ok, S} = ssl:handshake(CSock).

And use the OpenSSL client to connect:

openssl s_client -debug -connect localhost:8443 \
  -CAfile tls_client_cacerts.pem \
  -tls1_3 -groups P-256:X25519

This will produce a huge amount of logs, but somewhere in there we can see this in Erlang:

<<< TLS 1.3 Handshake, ClientHello

and this in OpenSSL:

New, TLSv1.3, Cipher is TLS_AES_256_GCM_SHA384

which means that we have successfully created a new TLSv1.3 connection. If you want to duplicate what I’ve done you can follow these instructions.

Not all features of TLSv1.3 have been implemented, you can see which parts of the RFCs that are missing in the ssl application’s Standard Complience documentation.

Fragmented distribution messages #

In order to deal with the head of line blocking caused by sending very large messages over Erlang Distribution, we have added fragmentation of distribution messages in OTP 22. This means that large messages will now be split up into smaller fragments allowing smaller messages to be sent without being blocked for a long time.

If we run the code below that does small rpc calls every 100ms millisecond and concurrently sends many 1/2 GB terms.

1> spawn(fun() ->
           (fun F(Max) ->
             {T, _} = timer:tc(fun() ->
                 rpc:call(RemoteNode, erlang, length, [[]])
               end),
             NewMax = lists:max([Max, T]),
             [io:format("Max: ~p~n",[NewMax]) || NewMax > Max],
             timer:sleep(100),
             F(NewMax)
           end)(0)
         end).
2> D = lists:duplicate(100000000,100000000),
   [{kjell, RemoteNode} ! D || _ <- lists:seq(1,100)],
   ok.

Using two of our test machines I get a max latency of about 0.4 seconds on OTP 22, whereas on OTP 21 the max latency is around 50 seconds. So with the network at our test site the max latency is decreased by roughly 99%, which is a nice improvement.

Counter/Atomics and persistent_terms #

Three new modules, counters, atomics, and persistent_term, were added in OTP 21.2. These modules make it possible for the user to access low-level primitives of the runtime to make some spectacular performance improvements.

For instance, the cover tool was recently re-written to use counters and persistent_term. Previously it used a bunch of ets tables to keep the counters for when the code was executed, but now it uses counters and the overhead of running cover has decreased by up to 80%.

persistent_term is adding run-time support for mochiglobal and similar tools. It makes it possible to very efficiently access data globally but at the cost of making updates very expensive. In Erlang/OTP we so far use it to optimize logger backends but the use cases are numerous.

A fun (and possibly useful) use case for atomics is to create a shared mutable bit-vector. So, now we can spawn 100 processes and play flip that bit with each other:

BV = bit_vector:new(80),
[spawn(fun F() ->
            bit_vector:flip(BV, rand:uniform(80)-1),
            F()
          end) || _ <- lists:seq(1,100)],
timer:sleep(1000),
bit_vector:print(BV).

Documentation Changes #

In OTP 21.3, the version when all functions and modules were introduced was added to the documentation.

Documentation Version OTP 21.3

Sverker used some git magic to figure out when functions and modules were added and automatically updated all the reference manuals. So now it should be a lot easier to see when some functionality was introduced. Knowing when an option to functions was added is still problematic, but we are trying to be better there as well.

In OTP 22 a new documentation top section called Internal Documentation has been added to the erts and compiler applications. The sections contain the internal documentation that previously only has been available on github so that it easier to access.

More Memory optimizations #

Each major OTP release wouldn’t be complete without a set of memory allocator improvements and OTP 22 is no exception. The ones with the most potential to impact your applications are PR2046 and PR1854. Both of these optimizations should allow systems to better utilize memory carriers in high memory situations allowing your systems to handle more load.