Can a String and its duplicate share the same underlying memory? Is there copy-on-write in Ruby?
I have a large, frozen String and I want to change its encoding. But I don't want to copy the whole String just to do that. For context, this is to pass values to a Google Protocol Buffer which has the bytes type and only accepts Encoding::ASCII_8BIT.
big_string.freeze
MyProtobuf::SomeMessage.new(
# I would prefer not to have to copy the whole string just to
# change the encoding.
value: big_string.dup.force_encoding(Encoding::ASCII_8BIT)
)
CodePudding user response:
Can a String and its duplicate share the same underlying memory? Is there copy-on-write in Ruby?
There is nothing in the Ruby Language Specification that prevents that. There is also nothing in the Ruby Language Specification that enforces that.
In general, the Ruby Language Specification tries to stay silent on all things related to memory management, space complexity, step complexity, or time complexity. This is not exclusive to the Ruby Language Specification; most Language Specifications try to leave the implementors as much leeway as possible. In other words, Language Specifications tend to specify Syntax and Semantics and leave the Pragmatics up to the implementor. (C++ is somewhat of an exception in that it specifies space and time complexity for the algorithms in the standard library.) Even C, which is typically thought of as a language which gives you full control over everything, doesn't actually specify things like memory layouts precisely – for example, due to the definition of the term width in the standard, a uint16_t is actually allowed to occupy more than 16 bits!
Every implementor is free to implement strings however they want, as long as they comply with the semantics defined in the Ruby Language Specification.
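What the semantics do pin down is the observable behavior: a duplicate must behave as an independent String, whether or not its bytes are shared behind the scenes. A minimal sketch of that guarantee (the variable names are just for illustration):

```ruby
# Whatever an implementation does internally, these observable
# properties of String#dup have to hold.
original = "hello world".freeze
copy = original.dup    # dup does not carry over the frozen state

copy << "!"            # mutating the copy ...

puts original          # ... leaves the original untouched: hello world
puts copy              # hello world!
puts original.frozen?  # true
puts copy.frozen?      # false
```

Copy-on-write is one way to provide this cheaply, but an implementation that copies eagerly satisfies the same semantics.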
If I remember correctly, both Rubinius and TruffleRuby did, at one point, experiment with a String implementation based on Ropes. Chris Seaton, TruffleRuby's lead developer, wrote a paper about that implementation. However, I don't know if they are still using it. (I know TruffleRuby switched to Truffle Strings recently, and I am not sure what their underlying representation is … or whether they are even guaranteeing a specific underlying representation.)
There is a problem with the answer "you have to look at the specification", though: unfortunately, unlike many other programming languages, the Ruby Language Specification does not exist as a single document in a single place. Ruby does not have a single formal specification that defines what certain language constructs mean.
There are several resources, the sum of which can be considered kind of a specification for the Ruby programming language.
Some of these resources are:
- The ISO/IEC 30170:2012 Information technology — Programming languages — Ruby specification – Note that the ISO Ruby Specification was written around 2009–2010 with the specific goal that all existing Ruby implementations at the time would easily be compliant. Since YARV and MacRuby only implement Ruby 1.9 and MRI only implements Ruby 1.8 and lower and JRuby, XRuby, Ruby.NET, and IronRuby (at the time) only implemented a subset of Ruby 1.8, this means that the ISO Ruby Specification only contains features that are common to both Ruby 1.8 and Ruby 1.9. Also, the ISO Ruby Specification was specifically intended to be minimal and only contain the features that are absolutely required for writing Ruby programs. Because of that, it does for example only specify Strings very broadly (since they have changed significantly between Ruby 1.8 and Ruby 1.9). It obviously also does not specify features which were added after the ISO Ruby Specification was written, such as Ractors or Pattern Matching.
- The Ruby Spec Suite aka ruby/spec – Note that the ruby/spec is unfortunately far from complete. However, I quite like it because it is written in Ruby instead of "ISO-standardese", which is much easier to read for a Rubyist, and it doubles as an executable conformance test suite.
- The Ruby Programming Language by David Flanagan and Yukihiro 'matz' Matsumoto – This book was written by David Flanagan together with Ruby's creator matz to serve as a Language Reference for Ruby.
- Programming Ruby by Dave Thomas, Andy Hunt, and Chad Fowler – This book was the first English book about Ruby and served as the standard introduction and description of Ruby for a long time. This book also first documented the Ruby core library and standard library, and the authors donated that documentation back to the community.
- The Ruby Issue Tracking System, specifically, the Feature sub-tracker – However, please note that unfortunately, the community is really, really bad at distinguishing between Tickets about the Ruby Programming Language and Tickets about the YARV Ruby Implementation: they both get intermingled in the tracker.
- The Meeting Logs of the Ruby Developer Meetings. (Same problem: Ruby and YARV get intermingled.)
- New features are often discussed on the mailing lists, in particular the ruby-core (English) and ruby-dev (Japanese) mailing lists. (Same problem again.)
- The Ruby documentation – Again, be aware that this documentation is generated from the source code of YARV and does not distinguish between features of Ruby and features of YARV.
- In the past, there were a couple of attempts at formalizing changes to the Ruby Specification, such as the Ruby Change Request (RCR) and Ruby Enhancement Proposal (REP) processes, both of which were unsuccessful.
- If all else fails, you need to check the source code of the popular Ruby implementations to see what they actually do. Please note the plural: you have to look at multiple, ideally all, implementations to figure out what the consensus is. Only looking at one implementation cannot possibly tell you whether what you are looking at is an implementation quirk of this particular implementation or is a universally agreed-upon behavior of the Ruby Language.
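Since whether strings share memory depends on the implementation, it helps to record which implementation and version any experiment ran on. A small sketch using the standard RUBY_ENGINE, RUBY_VERSION and RUBY_DESCRIPTION constants (the mri variable is just an illustrative name):

```ruby
# String sharing is an implementation detail, so note which
# implementation and version produced an observation.
puts RUBY_ENGINE       # e.g. "ruby" for MRI/YARV, "jruby", "truffleruby"
puts RUBY_VERSION
puts RUBY_DESCRIPTION

# Hypothetical guard: only trust MRI-specific memory numbers on MRI.
mri = (RUBY_ENGINE == "ruby")
puts "MRI-specific memory numbers apply: #{mri}"
```

An observation made on one engine says nothing about the others; the same script would have to be run on each implementation of interest.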
CodePudding user response:
It seems to work just fine for me: (using MRI/YARV 1.9, 2.x, 3.x)
require 'objspace'
big_string = Random.bytes(1_000_000).force_encoding(Encoding::UTF_8)
big_string.encoding #=> #<Encoding:UTF-8>
big_string.bytesize #=> 1000000
ObjectSpace.memsize_of(big_string) #=> 1000041
dup_string = big_string.dup.force_encoding(Encoding::ASCII_8BIT)
dup_string.encoding #=> #<Encoding:ASCII-8BIT>
dup_string.bytesize #=> 1000000
ObjectSpace.memsize_of(dup_string) #=> 40
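Those 40 bytes are the size of the object slot (RVALUE) itself in Ruby; the duplicate's 1,000,000 bytes of string data are not counted against it because it points at the buffer it shares with big_string.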
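To see the "write" half of copy-on-write, you can mutate the duplicate and measure again. This is only a sketch: ObjectSpace.memsize_of is a hint specific to MRI/YARV, the exact numbers vary between versions, and the before/after names are mine.

```ruby
require 'objspace'

big_string = ("x" * 1_000_000).force_encoding(Encoding::UTF_8).freeze
dup_string = big_string.dup.force_encoding(Encoding::ASCII_8BIT)

# Before any write the duplicate can still point at the shared buffer.
before = ObjectSpace.memsize_of(dup_string)

# Writing to the duplicate means it needs a buffer of its own;
# the frozen original must not change.
dup_string.setbyte(0, "y".ord)

after = ObjectSpace.memsize_of(dup_string)

puts "dup before write: #{before} bytes"
puts "dup after write:  #{after} bytes"
puts big_string.getbyte(0).chr  # still "x"
puts dup_string.getbyte(0).chr  # now "y"
```

On MRI I would expect the second number to jump to roughly the size of the whole string, since the duplicate can no longer share the buffer. As long as you only read from the duplicate (as when handing it to the protobuf), no copy should be needed.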
Note that instead of dup / force_encoding(Encoding::ASCII_8BIT) there's also String#b, which returns a copy in binary encoding right away.
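A quick sketch of how that looks with a frozen String (the sample string is just a placeholder for the real payload):

```ruby
big_string = "h\u00e9llo".freeze  # frozen UTF-8 string

# String#b works on frozen receivers because it returns a new String.
binary = big_string.b

puts binary.encoding == Encoding::ASCII_8BIT  # true
puts binary.bytes == big_string.bytes         # true, bytes are not transcoded
puts binary.frozen?                           # false
puts big_string.encoding == Encoding::UTF_8   # true, original is untouched
```

Like force_encoding, b only relabels the bytes; it does not transcode them.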
For more in-depth information, here's a blog post from 2012 (Ruby 1.9) about copy-on-write / shared strings in Ruby:
From the author's book Ruby Under a Microscope: (p. 265)
Internally, both JRuby and MRI use an optimization called copy-on-write for strings and other data. This trick allows two identical string values to share the same data buffer, which saves both memory and time because Ruby avoids making separate copies of the same string data unnecessarily.