October 16, 2011

Mariano Martínez Peck, winner of the 2011 ESUG Innovation Technology Awards.

Mariano Martínez Peck is an Argentinian PhD student at the École des Mines, in association with RMOD-INRIA. Fuel was the winner of the 2011 edition of the ESUG Innovation Technology Awards, and Mariano is one of its developers. Here is a little talk about Fuel and his current work.

CS: What was the motivation to create a new serializer?
MMP: Well, this is an excellent question, since most people's reaction when we announced Fuel was: "Yet another object serializer?". The truth is that I am doing a PhD with Stéphane Ducasse and others, and from the very beginning of my PhD it was clear that my solution needed a well-designed, reliable, flexible, uniform and very fast serializer. I needed a serializer that I could understand, change and adapt to my needs, and, mostly (because of my PhD domain), I needed a serializer able to serialize all types of objects, including classes, compiled methods, closures, contexts, traits, etc. At the same time, it was extremely important that it be fast. The main goal of the serializer had to be performance and not, for example, portability, as is the case with other serializers. I checked all the serializers available for Pharo (since my PhD prototype is based on Pharo) and none of them met my expectations.

Stef also wanted a fast binary serializer to provide a future infrastructure for Monticello. I didn't have time to do my PhD and build the serializer at the same time, so he decided to help me by asking Tristan Bourgois to build Fuel from scratch. Just a couple of weeks later, Martin Dias, from the Universidad de Buenos Aires in Argentina, came to Lille for a four-month internship. The team decided that Martin could also work on Fuel and use it for his thesis. A few months later, when I was starting to need the serializer for my PhD, I jumped directly into the team and I have been helping them ever since. Tristan is no longer working on Fuel, so Martin and I are the current developers.

Once Martin finished his internship and came back to Argentina, ESUG decided to sponsor him through the ESUG SummerTalk project. He is the student on that project and I am currently taking the role of "mentor". So we should thank ESUG for this sponsorship.

CS: Fuel is clean, platform-agnostic and incredibly fast in some scenarios. What were the most difficult topics to resolve in the framework?
MMP: The key characteristic of Fuel is the use of a specific type of pickle algorithm. The only Smalltalk serializer we are aware of that uses such a technique is VisualWorks' Parcels. However, Parcels is better described as a serializer for managing code than as a general-purpose object graph serializer. Fuel is not focused on code loading and is highly customizable to cope with different objects. Fuel is the infrastructure on top of which you can then build other tools.

So, the pickle algorithm/logic in itself was not complicated, since it is well known and there are papers and references about it. The main challenge was how to build a truly object-oriented approach to such an algorithm: how to find the correct abstractions and hierarchies, where to put each responsibility, and all the related questions that make for a better design. It is also difficult to maintain a good design without losing performance. I think in Fuel we have a very clean, object-oriented solution while also having good performance (it is really important to have a large set of benchmarks, as we have).

Another complex topic was being able to serialize all types of objects, because you have to know which objects are "special" and how they are represented internally. How to encode and decode the objects in a stream was difficult for us as well. Neither Martin nor I are experts in streams or in optimizing code, so we have learned a lot about both in the process.

CS: What is the pickle algorithm?
MMP: I think that, sometimes, there is some confusion regarding this term. "Pickling" and "unpickling" are synonyms for "serializing" and "deserializing". In Fuel, we use the terms "serialize" and "materialize" (deserialize). In addition, we call "pickle" the algorithm or format we use to encode or decode the objects in the stream.

It is a little bit complicated to explain Fuel's pickle format in a couple of lines, but I will do my best.

Traditional pickling formats take the object graph to serialize and, while traversing it, serialize each object plus an identifier of its type into a sequence of bytes (note that the type is usually its class, but not necessarily). Unpickling then starts to read objects from the stream. For each object it reads, it has to read its type as well as determine and interpret how to materialize that encoded object. The materializer needs to determine the type, look up what it needs to do with it and perform the materialization. So, in the common case of a regular object, it will read its type, then get its class from the system and send #basicNew in order to get a new instance. Then, of course, it will fill in its instance variables. This unpickling is terribly slow because it means a lot of work for every single object. In other words, the materialization is done recursively.
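The recursive scheme described above can be sketched in a few lines. This is a toy illustration in Python, not any real serializer's code, with hypothetical names (`Node`, `pickle`, `unpickle`); the point is that the type tag and contents of every object are interleaved, so the reader must decode a type, instantiate, and recurse for each object:

```python
# Toy sketch of a traditional, recursive pickle format. Each object is
# written as (type tag, contents, inlined references), so unpickling
# must do per-object work: read the type, instantiate, then recurse.

class Node:
    """A simple graph node with a value and outgoing references."""
    def __init__(self, value, refs=None):
        self.value = value
        self.refs = refs or []

def pickle(obj):
    # The type tag is written next to every single object's contents,
    # and referenced objects are encoded inline, recursively.
    return ('Node', obj.value, [pickle(r) for r in obj.refs])

def unpickle(encoded):
    # For every object: read its type, decide how to instantiate it,
    # then recursively materialize each of its references.
    type_tag, value, encoded_refs = encoded
    assert type_tag == 'Node'
    return Node(value, [unpickle(e) for e in encoded_refs])
```

Note that this naive version cannot even handle cycles or shared subobjects; real recursive serializers add bookkeeping for that, which is part of the per-object cost the interview refers to.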

Fuel's pickle format is completely different. There is a first traversal of the graph (we call this phase "analysis") where each object is associated with a specific type, called a "cluster" in Fuel. As a result of the analysis phase, we have a list of clusters, and each cluster contains the list of objects that belong to it. After that, we proceed to serialize. However, there is another key aspect: serialization is split into instances first and references afterwards. This means that we first serialize only the instances (the nodes of the object graph) and then all the references. This is different from regular serializers, which encode both things together. Notice that if an object is all references (an object that is not variable), then nothing will be written in the "instances part" and everything will be in the "references part". In the stream, we encode how many clusters there are and how many instances each cluster has.

During materialization, we first materialize the instances. Since all the objects of a cluster have the same type, we write/read that information in the stream only once. The materialization can be done in bulk, which means we can just iterate and instantiate the objects. Once we have finished with the "instances part", we continue with the "references part". Here, we iterate and set the references for each of the materialized objects. In other words, the materialization is done iteratively.

So, the conclusion is that Fuel materialization is fast because it can be done iteratively. To do that, we need to serialize instances separately from their references. This also means that we are a little bit slower during serialization, as we need to map objects to clusters. Nonetheless, all benchmarks show that Fuel is the fastest serializer in materialization and still one of the fastest in serialization.
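The instances/references split can be sketched as follows. This is a minimal Python illustration, not Fuel's actual code: to keep it short, all objects belong to a single hypothetical `Node` cluster, and references are encoded as the integer position of the target object, so materialization is two flat loops instead of a recursion:

```python
# Toy sketch of Fuel's two-part pickle: an analysis traversal assigns
# each object a position, then instances (node contents) are written
# separately from references (encoded as integer positions).

class Node:
    """A simple graph node with a value and outgoing references."""
    def __init__(self, value):
        self.value = value
        self.refs = []

def serialize(roots):
    # Analysis phase: traverse the graph once, assigning each object a
    # position. (Fuel additionally groups objects into per-class
    # clusters; this sketch collapses that to one list.)
    objects, position = [], {}
    stack = list(roots)
    while stack:
        obj = stack.pop()
        if id(obj) in position:
            continue
        position[id(obj)] = len(objects)
        objects.append(obj)
        stack.extend(obj.refs)
    # "Instances part": only the nodes' own contents.
    instances = [obj.value for obj in objects]
    # "References part": outgoing references as integer positions.
    references = [[position[id(r)] for r in obj.refs] for obj in objects]
    return instances, references

def materialize(instances, references):
    # Bulk-instantiate every object first (the "instances part") ...
    objects = [Node(v) for v in instances]
    # ... then wire up references iteratively (the "references part").
    for obj, ref_positions in zip(objects, references):
        obj.refs = [objects[pos] for pos in ref_positions]
    return objects
```

Because every object already exists before any reference is set, cycles and shared subobjects fall out naturally, with no recursion and no per-object type dispatch during the second loop.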

CS: How do you resolve the references in a serialization?
MMP: In Fuel, we encode references using an integer that denotes the position of the referenced object inside the stream. Then, during materialization, we can read that integer and know exactly at which position the object we are looking for is located.

CS: What happens with the identity of an object? In other words, when an object is materialized, is it the same object or a clone of the original?
MMP: It depends on the object you need to serialize. For regular objects, yes, the identity changes and the materialized objects will be clones of the originals. In fact, some consider serialization a very deep copy.
Now, Fuel supports what we call "global objects". Imagine that you serialize a graph that contains a reference to Transcript. You don't want to serialize the Transcript instance and then, during materialization, get yet another Transcript instance in your system. You want to use the same one.

Global objects are not written into the stream. Instead, the serializer stores the minimal information needed to get the reference back at materialization time. In this example, we just store its global name. The same happens with the Smalltalk class pools and with classes. This means that, at materialization time, all the classes and globals have to be present in the image.
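The idea can be sketched like this. Again, this is a toy Python illustration rather than Fuel's API: a dictionary stands in for the image's global namespace, and a well-known object round-trips to the identical instance because only its name is stored:

```python
# Toy sketch of "global objects": instead of writing a well-known
# object into the stream, store only its global name, and resolve that
# name in the (receiving) image at materialization time.

GLOBALS = {}            # stands in for the image's global namespace

class FakeTranscript:   # stands in for a well-known singleton
    pass

GLOBALS['Transcript'] = FakeTranscript()

def serialize(obj):
    for name, value in GLOBALS.items():
        if value is obj:
            return ('global', name)   # globals: just the name
    return ('value', obj)             # regular objects: full contents

def materialize(encoded):
    kind, payload = encoded
    if kind == 'global':
        return GLOBALS[payload]       # resolve the name in the image
    return payload
```

A regular object round-trips to a copy, while the global round-trips to the very same instance; this also shows why the name must resolve to something in the target image.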

That is normally the expected scenario. However, Fuel does support real serialization of classes. This means that Fuel can take a class and correctly serialize it together with its method dictionary, compiled methods, superclass, subclasses, etc. Of course, this is not the default behavior (the default is to treat classes as globals), but the API lets you do that. In fact, this is needed for a small proof of concept we developed to manage Monticello packages with Fuel.

If I said ... would you answer:

Computer brand?
self isPayByEmployee
       ifTrue: [ Mac ]
       ifFalse: [ computers anyOne ]

Operating system?
self amIInMac
       ifTrue: [ MacOS ]
       ifFalse: [ Ubuntu ]

Mobile phone?
Lord of the Rings
Back to the Future

TV series?
The Big Bang Theory
No one these days... only papers and blogs.
My mother-in-law's Ford Focus. I have a special relationship with that car!

Open source?
Sure! As much as I can.
