=head1 NAME Precompilation cache in Pugs =head1 DESCRIPTION Rather than parse every .pl or .pm file pugs sees over and over again as they are needed, pugs stores the results of compilation in a cache. This gives the benefit of speed without the awkwardness of an opaque object file. =head1 PHASE 1 - GET IT WORKING =head2 Objectives In the first version of the design, we aim for simplicity, reasonable forward compatibility, but not kitchen sink flexibility. For now, this is a Pugs mechanism. One cache should support several versions of Pugs, sharing objects where possible, and allowing easy maintenance. If it gets screwed up, the admin should be able to delete the cache directory and not suffer from anything but subsequent temporary cold cache slowdowns. =head2 What goes into the cache Objects stored in the cache are compilation units after parsing. Perl 6 has separate compilation, and presumably every compilation unit has one canonical abstract representation per version of Pugs. Where this is not the case, e.g. if C blocks intends to change the compilation outcome of this unit, the author of the code should mark the unit as not cachable. XXX: notation for this. XXX: Also need to add deps for used modules that export macros. Especially macros in the prelude. Including the dependencies hashes in the hash of the source code should suffice. We now describe how cache objects are keyed in the cache, and the bytecode format of a cached object. =head2 Keying scheme Example: ~joe/.pugscache/1/14/148071fa07847bc0de8df7d75cd03072f27239c2/9600-9658 # < $HOME .pugscache $H1 $H2 $SHA1 $ReleaseRev-$ParserRev >.catdir The fast path for cache usage is successful lookup of a valid precompiled unit. The cache is (in the first insance) a filesystem directory under the user's C<$HOME>. A global cache would have been nice for disk and CPU performance on multiuser machines, but since original source code is easily deducible from the compiled version, this is a security issue we prefer to avoid. If we have pluggable backends we could potentially just allow a C backend as well. Every compilation unit entering the cache is stored in a directoy of its own. The entry location is a function of (a hash of) the source code. Inside that directory there is a file named with Pugs version, and Pugs parser version. This prevents us from loading a precompiled object for a unit that has changed, or for a parser that would have emitted a different structure than the one present in the cache. Using the hash as a directory name, without the parser version/etc lets us check in one stat if we have a cached entry or not, and then readdir for the compatible version info, instead of always having to readdir the entire cache level. (To keep file count in a given directory managable, the file may be hashed into a directory inside the cache. Optimal hashing depth may vary among systems and should be a user setting. On "modern" systems no extra dirs might actually be fastest.) For now the current keying scheme implies that the source file needs to be read from storage even if a precompiled version of it exists in the cache, because we require its hash. There may be ways to optimize that part away, but they can be added in the future. I having the compilation unit name as part of the hash key is nice, because it means we don't have to worry about name canonicalization or source files with similar names but different locations. It also means we can cache objects that aren't files at all, such as C results (as long as C<$string> is the same across runs), or units received over the network (as long as we can get their hashes before we attempt to compile them). =head2 Bytecode format The cached object is a compressed YAML document containing a serialized Haskell Pugs structure. =head3 Compression We use gzip currently as the de-facto compression mechanism. This is the easiest to deploy with Pugs: GHC bundles zlib, and Data.FastPackedString which we use anyway for file IO contains enough bindings to read gzipped files. Compression is desireable because serialized YAML for precompiled units are large: for example, the 22 KB C takes on the order of 1.2 MB to serialize, but 47 KB in gzipped form. There are better compression algorithms out there; for example, BZip2 compresses the same file to only 20 KB. If you wish to work on a patch for changing the compression scheme, it should not lose portability or deployability to the current setup (i.e., must bundle or implement whatever scheme you desire), and provide a transparent readFile function that identifies the actual object by magic number. =head3 Precompiled data C or C output a serialized form of the following structure. C or the module loader load it: data CompUnit = MkCompUnit { ver :: Int -- currently 1 , desc :: String -- e.g., the name of the contained module , glob :: (TVar Pad) -- pad for unit Env , ast :: Exp -- AST of unit } The C field is currenly set to 1. It is always the first element in the CompUnit structure, for forward compatibility. The C field is for diagnostics. C is the global pad for this unit, and finally, C is the parsed tree. =head2 Module load try { my $hash = $HASHFUNC($source); my $dir = cachedir($hash); die "no precompiled version found" unless $dir ~~ :d; for $dir.readdir.sort:{numerically} -> $fn { my ($pugsrev, $parserrev) = $fn ~~ /(\d+)-(\d+)/ orelse next; next if $XXX_handwaving($pugsrev, $parserrev); # against %?CONFIG etc. load_precompiled($fn) orelse { $fn.rm; die "error loading cached version: $!"; }; return; # success } die "no precompiled version found"; CATCH { my $compunit = $source.parse; $compunit.load; return if $compunit.nocache; cache_cleanups; write_cache($compunit, $dir); } } =head2 Maintenance There are a few cache maint tools that can be added bit by bit. They probably deserve a section of their own but for now: =over 4 =item Delete objects for Pugses older than X =item Clean least recently accessed objects (should happen as a matter of course during regular cache B usage). Proposal: use Pseudo-LRU L =item Warm up cache for an entire set of modules =item Change cache size limit =back