previous article, I said that concatenating strings was an O(n)
operation on the length of the first string.
I feel the need to explain that.
In erlang, strings are lists of integer. Since they are integers, they can store any unicode characters such as the snowman ☃ which has a value of 9731.
When one wants to concatenate 2 strings, one can use string:concat/2
implemented in lib/stdlib/src/string.erl
as follows:
(All code extracts in this article are subject to the Erlang Public License.)
%% concat(String1, String2)
%% Concatenate 2 strings.
Ok, strings are lists, just concatenate lists!
The ++
operator is a BIF, a Built-In Function implemented in C in the Beam VM. It is implemented in erts/emulator/beam/erl_bif_lists.c
.
1 /*
2 * erlang:'++'/2
3 */
4
5 Eterm
6 7 8 9
10
11 static BIF_RETTYPE 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44
First, the length of A
is computed. This operation is O(n)
where n
is the length of the list. Trust me on this one, of just go look at erts/emulator/beam/utils.c
.
If either lists are empty, return the other one.
On lines 29 - 30, space is allocated on the process heap for a new list. Each list item has a size of 2 Eterm
: one for the list itself, one for the element.
The code on lines 31 to 42 just copies A
to our newly allocated list and adds B on its tail.
last = CONS(hp, CAR(list_val(list)), make_list(hp+2));
Ok, what is that?
From erts/emulator/beam/erl_term.h
, we need some more definitions:
make_list
is used to return an Eterm
tagged as a list from a pointer on the process heap.
list_val
does the opposite and returns the address on the heap of the list.
CONS
puts 2 elements in hp[0]
(the element in the list) and hp[1]
(the next list item) and returns the list at hp
as an Eterm
.
That complicated expression can now be read as:
hp[0]
the element of the list we want to copy,Eterm
following on the stack (hp+2
) since the new list is allocated as one block on the stack.When list A
has been fully copied, the last element is set to B
at the line 42 with CDR(list_val(last)) = B;
.
If you’re curious, erts/emulator/beam/sys.h
can tell you that an Eterm
has usually a size of 4 bytes.
Small integers can be stored directly in those 4 bytes. There are 26 bits usable to store a small integer so it can store up to 2²⁵ = 33554432 because of the sign bit. For reference, pile of poo which was added in october 2010 has a value of 128169.
We can consider that unicode characters can be stored directly in the list.
If a string has n
characters, it will use 2 * n
words.
A binary
on the opposite will use 3 to 6 words (depending on the data size) plus the size of the data itself.
A string stored as a list takes about 8 times the space it would take stored as a binary
.
The pointed values are not copied, only the list items themsleves.
Appending a list of records of n
tuples to a list of m
tuples will take the same time/memory as concatenating a string of length n
to one of length m
.
The only difference is that tuples are not boxed in an Eterm
. The list only contains an Eterm
as a pointer to the tuple: where it’s stored in the heap.
Of course there is one! You should use binary
and IoList
whenever possible. An IoList
is iolist = [char() | binary() | iolist()]
. They are “deep-list”.
Want to append an IoList a
to an IoList b
? Simple! Just create a new list with 2 elements: a
and b
! Concatenation has become an O(1) operation. Most IO apis accept IoLists.