-
What kind of database or directory structure could provide fast word searches and access? I'm thinking of a SEQ file for each valid character in the name of any word: eliminate files for characters that aren't used, and add files dynamically as new characters appear. (This is not intended to be a regular-expressions thread.)
-
TLDR: There's only one structure on the Commodore 1541 intended for fast random access: REL files, and only one can be open at a time. Indexing is a separate story; the documentation suggests storing indices in separate SEQ files, but hashing could improve that picture.

I don't think the suggested index structure would serve us very well, but there are alternatives.

Are you suggesting multiple SEQ files for random access? That might be more nicely done with REL files. As I understand it, those are handled by the disk drive itself, which effectively builds a second file containing an array of allocated sectors (detailed information is in the 1541 user's guide). The directory is a singly linked list, slow to look up, and probably best kept compact; so is every block except those indexed by a REL file's side sectors. The main thing keeping directory access tolerably fast is that it sits on a middle track.

As for the indexing method described, it sounds rather prone to collisions. We have no shortage of words like t[yxsa][yxsa], se[icd], or endof/endif/and/abs (so no unique letters at all in and), and let's not forget : * ?dup, which contain characters not allowed in file names.

But the idea of looking things up from inside the development system? I love that. It brings back fond memories of browsing David Jurgens' HelpPC.

Another format I'd look at for reference is texinfo: not only does it build neatly sorted indices of functions, it also stores an index of node offsets at the end of the file for fast random access.

So here's my rough outline for a possibly useful document format: convert the main documentation to texinfo, but tokenize the info file (links are particularly useful tokens to find), crunch it into sections, store the crunched sections in relative file records, and rebuild the node index to use record numbers. Then build something close to an info reader, probably reusing v's codebase for terminal handling. Of course, not all these steps need to be done at once. As an alternative to relative files, we could use raw disk access for the random reads, probably along with a utility to build block indices.
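The record-number node index at the heart of that outline can be sketched briefly. This is Python rather than anything that would run on the C64, and the record length, section names, and sizes are all made up for illustration:

```python
# Texinfo keeps a tag table of node offsets at the end of an info
# file; the outline above replaces byte offsets with REL-file record
# numbers. RECORD_LEN and the section data below are assumptions.

RECORD_LEN = 128  # REL record length, fixed when the file is created

def build_node_index(sections):
    """sections: list of (node_name, crunched_bytes).
    Assign consecutive fixed-size records to each section and
    return {node_name: first_record_number}."""
    index = {}
    record = 0
    for name, data in sections:
        index[name] = record
        record += -(-len(data) // RECORD_LEN)  # ceiling division
    return index

sections = [("Top", b"x" * 300), ("Words", b"y" * 90), ("Editor", b"z" * 128)]
index = build_node_index(sections)
# "Top" spans records 0-2, so "Words" starts at record 3, "Editor" at 4.
```

A reader would then do one index lookup plus one positioned record read per node, which is exactly the access pattern REL files are built for.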
And the next step up from that would be indexing symbols in source files. We could start easy, say by picking the words that follow : code value variable create constant, and collecting them in a file similar to ctags.

Well, that's a long way off track and possibly only tangentially related. Back to indexing words.

The most obvious index form is simply a sorted index; given random access, a binary search means O(log n) block reads. A B+tree style index could reduce the number of reads for most searches. A hash table would need just one, at the expense of losing the ordering, and Pearson hashes are easy enough to calculate.

I'd be tempted to use a simple hash to sort our words into bins that each fit in a block, then search within the block using find. I suspect that could be done by patching latest to look in our buffer. This already gives a fairly compact representation (name + 3 bytes, 18 bits of which are usable, plus one terminating NUL); the linear search once we have the block is not a huge concern. If we need to add words dynamically, we could use a byte in each block to chain overflow blocks for continued searches. An obvious downside is that this index is larger than our word set, but then the entire word set is already present elsewhere. (If we're feeling particularly crazy, we could even merge the built-in words with the in-memory dictionary.)

Perhaps even better: the original Pearson paper suggests a method for assigning specific hash values to chosen words. If the hash does that, we might make the hash itself point at the correct node, or at least nearby, and skip the separate index entirely. That might involve tricky tooling.
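The bin-per-block idea above can be sketched like this, again in Python rather than Forth. The permutation table here is generated from a seeded RNG purely as a stand-in (a real build would ship a fixed table), and the bin count is a made-up tuning parameter:

```python
import random

# Pearson hash: one table lookup and one XOR per input byte.
# A real build would ship a fixed 256-byte permutation table;
# seeding an RNG here is just a stand-in for that step.
random.seed(1541)
TABLE = list(range(256))
random.shuffle(TABLE)

def pearson(word):
    h = 0
    for ch in word.encode("ascii"):
        h = TABLE[h ^ ch]
    return h

BLOCK_BYTES = 254   # usable bytes of a 1541 sector after the track/sector link
NUM_BINS = 64       # hypothetical bin count; tune so each bin fits one block

def build_bins(words):
    """Drop each word into the bin chosen by its hash, so a lookup
    needs one block read plus a linear scan inside that block."""
    bins = [[] for _ in range(NUM_BINS)]
    for w in words:
        bins[pearson(w) % NUM_BINS].append(w)
    return bins

words = ["dup", "?dup", "swap", "over", "rot", "and", "abs", "endif", "endof"]
bins = build_bins(words)

# Entries are stored as name + 3 bytes + a terminating NUL,
# so verify every bin still fits in one block.
for b in bins:
    assert sum(len(w) + 4 for w in b) <= BLOCK_BYTES
```

Note the hash also handles names like ?dup without trouble, since it never touches the file system: one hash, one modulo, one block read.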
-
What about VLIR?
http://unusedino.de/ec64/technical/formats/geos.html
-
Maybe a PETSCII translation of the docs is in order, perhaps with sections in separate files to be read with V. This could be done at build time. Alternatively, a PDF reader in durexForth. :)
-
http://www.armory.com/~spectre/cwi/hl/ looks like fun.