-
What kind of database or directory structure could provide fast word searches and access? I'm thinking of a SEQ file for each valid character in the name of any word: eliminate files for characters that aren't used, and add files dynamically as new characters appear. (This is not intended to be a regular-expressions thread.)
-
TLDR: There's only one structure on the Commodore 1541 intended for fast random access: REL files, and only one can be open at a time. Indexing is a separate story; the documentation suggests storing indices in separate SEQ files, but hashing could improve that picture.

I don't think the suggested index structure would serve us very well, but there are alternatives.

Are you suggesting multiple SEQ files for random access? That might be more nicely done with REL files. As I understand it, those are handled by the disk drive itself, which effectively builds a second file containing an array of allocated sectors (detailed information is in the 1541 user's guide). The directory is a singly linked list, slow to look up, and probably best kept compact; so is every block except those indexed by a REL file's side sectors. The main thing keeping directory access tolerably fast is that it sits on a middle track.

As for the indexing method described, it sounds rather prone to collisions. We have no shortage of words like t[yxsa][yxsa], se[icd], or endof/endif/and/abs (so no unique letters at all in and), and let's not forget : * ?dup, which contain characters not allowed in file names.

But the idea of looking things up from inside the development system? I love that. It brings back fond memories of browsing David Jurgens' HelpPC.

Another format I'd look at for reference is texinfo: not only does it build neatly sorted indices of functions, it also stores an index of node offsets at the end of the file for fast random access.

So here's my rough outline for a possibly useful document format: convert the main documentation to texinfo, but tokenize the info file (links are particularly useful tokens to find), crunch it into sections, store the crunched sections in relative file records, and rebuild the node index to use record numbers. Then build something close to an info reader, probably reusing v's codebase for terminal handling. Of course, not all these steps need to be done at once. As an alternative to relative files, we could use raw disk access for the random reads, probably along with a utility to build block indices.
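The record-number node index at the heart of that outline can be sketched briefly. This is Python rather than anything that would run on the C64, and the record length, section names, and sizes are all made up for illustration:

```python
# Texinfo keeps a tag table of node offsets at the end of an info
# file; the outline above replaces byte offsets with REL-file record
# numbers. RECORD_LEN and the section data below are assumptions.

RECORD_LEN = 128  # REL record length, fixed when the file is created

def build_node_index(sections):
    """sections: list of (node_name, crunched_bytes).
    Assign consecutive fixed-size records to each section and
    return {node_name: first_record_number}."""
    index = {}
    record = 0
    for name, data in sections:
        index[name] = record
        record += -(-len(data) // RECORD_LEN)  # ceiling division
    return index

sections = [("Top", b"x" * 300), ("Words", b"y" * 90), ("Editor", b"z" * 128)]
index = build_node_index(sections)
# "Top" spans records 0-2, so "Words" starts at record 3, "Editor" at 4.
```

A reader would then do one index lookup plus one positioned record read per node, which is exactly the access pattern REL files are built for.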
And the next step up from that would be indexing symbols in source files. We could start easy, say by picking the words that follow : code value variable create constant, and collecting them in a file similar to ctags.

Well, that's a long way off track and possibly only tangentially related. Back to indexing words.

The most obvious index form is simply a sorted index; given random access, a binary search means O(log n) block reads. A B+tree style index could reduce the number of reads for most searches. A hash table would need just one, at the expense of losing the ordering, and Pearson hashes are easy enough to calculate.

I'd be tempted to use a simple hash to sort our words into bins that each fit in a block, then search within the block using find. I suspect that could be done by patching latest to look in our buffer. This already gives a fairly compact representation (name + 3 bytes, 18 bits of which are usable, plus one terminating NUL); the linear search once we have the block is not a huge concern. If we need to add words dynamically, we could use a byte in each block to chain overflow blocks for continued searches. An obvious downside is that this index is larger than our word set, but then the entire word set is already present elsewhere. (If we're feeling particularly crazy, we could even merge the built-in words with the in-memory dictionary.)

Perhaps even better: the original Pearson paper suggests a method for assigning specific hash values to chosen words. If the hash does that, we might make the hash itself point at the correct node, or at least nearby, and skip the separate index entirely. That might involve tricky tooling.
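The bin-per-block idea above can be sketched like this, again in Python rather than Forth. The permutation table here is generated from a seeded RNG purely as a stand-in (a real build would ship a fixed table), and the bin count is a made-up tuning parameter:

```python
import random

# Pearson hash: one table lookup and one XOR per input byte.
# A real build would ship a fixed 256-byte permutation table;
# seeding an RNG here is just a stand-in for that step.
random.seed(1541)
TABLE = list(range(256))
random.shuffle(TABLE)

def pearson(word):
    h = 0
    for ch in word.encode("ascii"):
        h = TABLE[h ^ ch]
    return h

BLOCK_BYTES = 254   # usable bytes of a 1541 sector after the track/sector link
NUM_BINS = 64       # hypothetical bin count; tune so each bin fits one block

def build_bins(words):
    """Drop each word into the bin chosen by its hash, so a lookup
    needs one block read plus a linear scan inside that block."""
    bins = [[] for _ in range(NUM_BINS)]
    for w in words:
        bins[pearson(w) % NUM_BINS].append(w)
    return bins

words = ["dup", "?dup", "swap", "over", "rot", "and", "abs", "endif", "endof"]
bins = build_bins(words)

# Entries are stored as name + 3 bytes + a terminating NUL,
# so verify every bin still fits in one block.
for b in bins:
    assert sum(len(w) + 4 for w in b) <= BLOCK_BYTES
```

Note the hash also handles names like ?dup without trouble, since it never touches the file system: one hash, one modulo, one block read.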
-
What about VLIR?
http://unusedino.de/ec64/technical/formats/geos.html
-
Maybe a PETSCII translation of the docs is in order, perhaps with sections in separate files to be read with V. This could be done at build time. Alternatively, a PDF reader in durexForth. :)
-
http://www.armory.com/~spectre/cwi/hl/ looks like fun.