RE: Preprocessing of source text
- To: <misc>
- Subject: RE: Preprocessing of source text
- From: "James Hague" <jamesh@xxxxxxxxxxxxxxxx>
- Date: Thu, 14 Dec 2000 09:47:45 -0600
- In-Reply-To: <3A3812AB.D557B7D1@ultratechnology.com>
Jeff Fox wrote:
> > Overall, I'm leaning toward processing raw text with embedded
> > color tokens.
> > I'd be interested in hearing other experiences.
>
> That is the smallest step away from traditional practice and might
> be easiest. I know that after I explained the idea of compression
> by a name dictionary and pointers in aha, Sean was able to
> implement it in Flux in a day or two.
Now I have to implement this just to say I did :)
What I have been pondering, everything else aside, is whether the
direct pointer scheme is more complex (in terms of required code and
potential for bugs) than simply parsing text. Parsing Forth is a trivial
problem. Keeping a dictionary at edit time is messier, and it leaves open
the possibility of one bad link causing major problems.
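For what it's worth, the "simply parsing text" side really is a few lines of
code. A minimal sketch in plain C (everything here is illustrative, not code
from my editor) of a whitespace-delimited word scanner:

#include <ctype.h>
#include <stdio.h>

/* Return the next whitespace-delimited word in *src, NUL-terminate it
   in place, and advance *src past it.  Returns NULL at end of input. */
static char *next_word(char **src)
{
    char *p = *src;
    while (*p && isspace((unsigned char)*p))   /* skip leading blanks */
        p++;
    if (*p == '\0')
        return NULL;
    char *start = p;
    while (*p && !isspace((unsigned char)*p))  /* scan to end of word */
        p++;
    if (*p)
        *p++ = '\0';
    *src = p;
    return start;
}

int main(void)
{
    char source[] = ": square  dup * ;";
    char *cursor = source;
    char *word;
    while ((word = next_word(&cursor)) != NULL)
        printf("word: %s\n", word);
    return 0;
}

That's the whole parser; everything past it is ordinary dictionary lookup.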
My current plan is to use something along the lines of traditional blocks,
where each block is compressed individually. When a block is loaded into the
editor, it is decompressed from tokens to source. When it is saved, it is
converted back to tokens. This is instantaneous. The compiler is, in
effect, a block compiler: you pass it a tokenized block and it returns when
done. This means that blocks are not a fixed size, but that's okay (I
wrote a Windows-based editor for variable-sized blocks in an evening last
summer).
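To make the save/load step concrete, here is one way the per-block
conversion could go. This is a sketch under my own assumptions (one-byte
tokens indexing a per-block name table), not a committed format, and it
omits bounds checks:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_NAMES  256      /* one-byte tokens: at most 256 names per block */
#define MAX_TOKENS 1024

/* A tokenized block: a private name table plus a stream of byte tokens. */
struct block {
    char          *names[MAX_NAMES];
    int            name_count;
    unsigned char  tokens[MAX_TOKENS];
    int            token_count;
};

/* Find a name in the block's table, adding it if new; return its index. */
static int intern(struct block *b, const char *word)
{
    for (int i = 0; i < b->name_count; i++)
        if (strcmp(b->names[i], word) == 0)
            return i;
    b->names[b->name_count] = strdup(word);
    return b->name_count++;
}

/* Save path: convert editable source text into tokens. */
static void tokenize(struct block *b, const char *source)
{
    char *copy = strdup(source), *p = copy, *word;
    b->name_count = b->token_count = 0;
    while ((word = strtok(p, " \t\n")) != NULL) {
        p = NULL;
        b->tokens[b->token_count++] = (unsigned char)intern(b, word);
    }
    free(copy);
}

/* Load path: expand tokens back into source text for the editor. */
static void detokenize(const struct block *b, char *out, size_t outsize)
{
    out[0] = '\0';
    for (int i = 0; i < b->token_count; i++) {
        strncat(out, b->names[b->tokens[i]], outsize - strlen(out) - 1);
        strncat(out, " ", outsize - strlen(out) - 1);
    }
}

int main(void)
{
    struct block b;
    char text[256];

    tokenize(&b, ": square  dup * ;\n: cube  dup square * ;");
    detokenize(&b, text, sizeof text);
    printf("%d names, %d tokens\n%s\n", b.name_count, b.token_count, text);
    return 0;
}

A real version would also have to carry line breaks, comments, and numeric
literals through the round trip so the editor gets back exactly what was
typed; I expect that is where most of the fiddly work will be.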
The downside is that block-at-a-time tokenization gives poorer compression
than whole-program tokenization, but (1) it keeps problems more localized,
and (2) the small size of blocks lets you use smaller tokens and related
data structures, which is better for cache coherency.
Thank you for the reply, Jeff. This project has me excited.
James