Over the years, I've learned to be cautious with C++ pointers. In
particular, I'm always very careful about who owns a given pointer, and
who's in charge of calling `delete` on it. But my caution often forces me
to write deliberately inefficient functions. For example:
```cpp
vector<string> tokenize_string(const string &text);
```
Here, we have a large string `text`, and we want to split it into a vector
of tokens. This function is nice and safe, but it allocates one string
for every token in the input.
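The body isn't shown here, but for concreteness, one plausible implementation might split on whitespace like this (my sketch; the exact token rules are an assumption):

```cpp
#include <string>
#include <vector>
using namespace std;

// A possible implementation: copy each whitespace-separated
// token out of `text` into its own freshly allocated string.
vector<string> tokenize_string(const string &text) {
    vector<string> result;
    string::size_type start = 0;
    while ((start = text.find_first_not_of(" \t\n", start)) != string::npos) {
        string::size_type end = text.find_first_of(" \t\n", start);
        if (end == string::npos)
            end = text.size();
        // Each substr call allocates a new string: this is the
        // per-token cost we're worried about.
        result.push_back(text.substr(start, end - start));
        start = end;
    }
    return result;
}
```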
Now, if we were feeling reckless, we could avoid these allocations by
returning a vector of pointers into `text`:
```cpp
vector<pair<const char *, const char *>> tokenize_string2(const string &text);
```
In this version, each token is represented by two pointers into `text`:
one pointing to the first character, and one pointing just beyond the last
character.¹
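Again, the body isn't shown; a plausible whitespace-splitting version (my sketch, not the original code) would record `[begin, end)` pointer pairs instead of copying characters:

```cpp
#include <cctype>
#include <string>
#include <utility>
#include <vector>
using namespace std;

// Hypothetical body for tokenize_string2: no per-token
// allocations, just pointer pairs into `text`.
vector<pair<const char *, const char *>>
tokenize_string2(const string &text) {
    vector<pair<const char *, const char *>> result;
    const char *p = text.c_str();
    const char *end = p + text.size();
    while (p != end) {
        // Skip whitespace between tokens.
        if (isspace(static_cast<unsigned char>(*p))) { ++p; continue; }
        const char *start = p;
        while (p != end && !isspace(static_cast<unsigned char>(*p)))
            ++p;
        result.push_back(make_pair(start, p));
    }
    return result;
}
```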
But this can go horribly wrong:
```cpp
// Disaster strikes!
auto v = tokenize_string2(get_input_string());
munge(v);
```
Why does this fail? The function `get_input_string` returns a temporary
string, and `tokenize_string2` builds an array of pointers into that
string. Unfortunately, the temporary string only lives until the end of
the current expression, and then the underlying memory is released. And so
all our pointers in `v` now point into oblivion, and our program just wound
up getting featured in a CERT advisory.
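In fairness (my aside, not part of the original argument), we can sidestep the crash by binding the temporary to a named local so the string outlives the pointers, but nothing in C++ forces us to remember this:

```cpp
// Safe, but only by programmer discipline: `input` is a named
// local, so it outlives the pointers stored in `v`.
string input = get_input_string();
auto v = tokenize_string2(input);
munge(v);  // fine: `input` is still alive
```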
So personally, I'm going to prefer the inefficient `tokenize_string`
function almost every time.
Rust lifetimes to the rescue!
Going back to our original design, let's declare a type `Token`. Each
token is either a `Word` or an `Other`, and each token contains pointers
into a pre-existing string. In Rust, we can declare this as follows:
```rust
#[deriving(Show, PartialEq)]
enum Token<'a> {
    Word(&'a str),
    Other(&'a str)
}
```