This repository has been archived by the owner on Apr 30, 2020. It is now read-only.
forked from gigablast/open-source-search-engine
-
Notifications
You must be signed in to change notification settings - Fork 14
/
Copy pathdeveloper.html
2693 lines (2421 loc) · 193 KB
/
developer.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
<br><br>
<h1>Developer Documentation</h1>
FAQ is <a href=/admin.html>here</a>.
<br><br>
<b>CAUTION: This documentation is old and a lot of it is out of date. -- Matt, May 2014</b>
<br><br>
<!--- TABLE OF CONTENTS-->
<h1>Table of Contents</h1>
<table cellpadding=3 border=0>
<tr><td>I.</td><td><a href="#started">Getting Started</a> - Setting up your PC for Gigablast development.</td></tr>
<!--subtable-->
<tr><td></td><td><table cellpadding=3>
<tr><td valign=top>A.</td><td><a href="#ssh">SSH</a> - A brief introduction to setting up ssh</td></tr>
</table></td></tr>
<tr><td>II.</td><td><a href="#shell">Using the Shell</a> - Some common shell commands.</td></tr>
<tr><td>III.</td><td><a href="#git">Using GIT</a> - Source code management.</td></tr>
<tr><td>IV.</td><td><a href="#hardware">Hardware Administration</a> - Gigablast hardware resources.</td></tr>
<tr><td>V.</td><td><a href="#dirstruct">Directory Structure</a> - How files are laid out.</td></tr>
<tr><td>VI.</td><td><a href="#kernels">Kernels</a> - Kernels used by Gigablast.</td></tr>
<tr><td>VII.</td><td><a href="#coding">Coding Conventions</a> - The coding style used at Gigablast.</td></tr>
<tr><td>VIII.</td><td><a href="#debug">Debugging Gigablast</a> - How to debug gb.</td></tr>
<tr><td>IX.</td><td><a href="#code">Code Overview</a> - Basic layers of the code.</td></tr>
<!--subtable1-->
<tr><td></td><td><table cellpadding=3>
<tr><td valign=top>A.</td><td><a href="#loop">The Control Loop Layer</a><br>
<tr><td valign=top>B.</td><td><a href="#database">The Database Layer</a><br>
<!--subtable2-->
<tr><td></td><td><table cellpadding=3>
<tr><td>1.</td><td><a href="#dbclasses">Associated Classes</a></td></tr>
<tr><td>2.</td><td><a href="#database">Rdb</a> - The core database class.</td></tr>
<!--subtable3-->
<tr><td></td><td><table cellpadding=3>
<td><td>a.</td><td><a href="#addingrecs">Adding Records</a></td></tr>
<td><td>b.</td><td><a href="#mergingfiles">Merging Rdb Files</a></td></tr>
<td><td>c.</td><td><a href="#deletingrecs">Deleting Records</a></td></tr>
<td><td>d.</td><td><a href="#readingrecs">Reading Records</a></td></tr>
<td><td>e.</td><td><a href="#errorcorrection">Error Correction</a> - How Rdb deals with corrupted data on disk.</td></tr>
<td><td>f.</td><td><a href="#netadds">Adding Records over the Net</a></td></tr>
<td><td>g.</td><td><a href="#netreads">Reading Records over the Net</a></td></tr>
<td><td>i.</td><td><a href="#varkeysize">Variable Key Sizes</a></td></tr>
<td><td>j.</td><td><a href="#rdbcache">Caching Records</a></td></tr>
</table></tr>
<!--subtable3end-->
<tr><td>3. a.</td><td><a href="#posdb">Posdb</a> - The new word-position storing search index Rdb.</td></tr>
<tr><td>3. b.</td><td><a href="#indexdb">Indexdb</a> - The retired/unused tf/idf search index Rdb.</td></tr>
<tr><td>4.</td><td><a href="#datedb">Datedb</a> - For returning search results constrained or sorted by date.</td></tr>
<tr><td>5.</td><td><a href="#titledb">Titledb</a> - Holds the cached web pages.</td></tr>
<tr><td>6.</td><td><a href="#spiderdb">Spiderdb</a> - Holds the urls to be spidered, sorted by spider date.</td></tr>
<!--<tr><td>7.</td><td><a href="#urldb">Urldb</a> - Tells us if a url is in spiderdb or where it is in titledb.</td></tr>-->
<tr><td>8.</td><td><a href="#checksumdb">Checksumdb</a> - Each record is a hash of the document content. Used for deduping.</td></tr>
<tr><td>9.</td><td><a href="#sitedb">Sitedb</a> - Used for mapping a url or site to a ruleset.</td></tr>
<tr><td>10.</td><td><a href="#clusterdb">Clusterdb</a> - Used for doing site clustering and dupplicate search result removal.</td></tr>
<tr><td>11.</td><td><a href="#catdb">Catdb</a> - Used to hold DMOZ.</td></tr>
<tr><td>12.</td><td><a href="#robotdb">Robotdb</a> - Just a cache for robots.txt files.</td></tr>
</table></tr>
<!--subtable2end-->
<tr><td valign=top>C.</td><td><a href="#networklayer">The Network Layer</a><br>
<!--subtable2-->
<tr><td></td><td><table cellpadding=3>
<tr><td>1.</td><td><a href="#netclasses">Associated Classes</a></td></tr>
<tr><td>2.</td><td><a href="#udpserver">Udp server</a> - Used for internal communication.</td></tr>
<!--subtable3-->
<tr><td></td><td><table cellpadding=3>
<td><td>a.</td><td><a href="#multicasting">Multicasting</a></td></tr>
<td><td>b.</td><td><a href="#msgclasses">Message Classes</a></td></tr>
</table></tr>
<!--subtable3end-->
<tr><td>3.</td><td><a href="#tcpserver">TCP server</a> - Used by the HTTP server.</td></tr>
<tr><td>4.</td><td><a href="#tcpserver">HTTP server</a> - The web server.</td></tr>
<tr><td>5.</td><td><a href="#dns">DNS Resolver</a></td></tr>
</table></tr>
<!--subtable2end-->
<tr><td valign=top>D.</td><td><a href="#buildlayer">The Build Layer</a><br>
<!--subtable2-->
<tr><td></td><td><table cellpadding=3>
<tr><td>1.</td><td><a href="#buildclasses">Associated Files</a></td></tr>
<tr><td>2.</td><td><a href="#spiderloop">Spiderdb.cpp</a> - Most of the spider code.</td></tr>
<!--
<tr><td>3.</td><td><a href="#spidercache">The Spider Cache</a> - Caches urls from spiderdb.</td></tr>
<tr><td>4.</td><td><a href="#msg14">Msg14</a> - Indexes a url.</td></tr>
<tr><td>5.</td><td><a href="#msg15">Msg15</a> - Sets the old IndexList.</td></tr>
<tr><td>6.</td><td><a href="#msg16">Msg16</a> - Sets the new IndexList.</td></tr>
<tr><td>7.</td><td><a href="#doc">The Doc Class</a> - Converts document to an IndexList.</td></tr>
-->
<tr><td>3.</td><td><a href="#linkanalysis">Link Analysis</a></td></tr>
<!--<tr><td>9.</td><td><a href="#robotdb">Robots.txt</a></td></tr>
<tr><td>10.</td><td><a href="#termtable">The TermTable</a></td></tr>
<tr><td>11.</td><td><a href="#indexlist">The IndexList</a></td></tr>-->
<tr><td>4.</td><td><a href="#samplevector">The Sample Vector</a> - Used for deduping at spider time.</td></tr>
<tr><td>5.</td><td><a href="#summaryvector">The Summary Vector</a> - Used for removing similar search results.</td></tr>
<tr><td>6.</td><td><a href="#gigabitvector">The Gigabit Vector</a> - Used for clustering results by topic.</td></tr>
<tr><td>7.</td><td><a href="#deduping">Deduping</a> - Preventing the index of duplicate pages.</td></tr>
<tr><td>8.</td><td><a href="#indexcode">Indexing Error Codes</a> - Reasons why a document failed to be indexed.</td></tr>
<tr><td>9.</td><td><a href="#docids">DocIds</a> - How DocIds are assigned.</td></tr>
<tr><td>10.</td><td><a href="#docidscaling">Scaling To Many DocIds</a> - How we scale to many DocIds.</td></tr>
</table></tr>
<!--subtable2end-->
<tr><td valign=top>E.</td><td><a href="#searchresultslayer">The Search Results Layer</a><br>
<!--subtable2-->
<tr><td></td><td><table cellpadding=3>
<tr><td>1.</td><td><a href="#queryclasses">Associated Classes</a></td></tr>
<tr><td>1.</td><td><a href="#query">The Query Parser</a></td></tr>
<tr><td>2.</td><td>Getting the Termlists</td></tr>
<tr><td>3.</td><td>Intersecting the Termlists</a></td></tr>
<tr><td>4.</td><td><a href="#raiding">Raiding Termlist Intersections</a></td></tr>
<tr><td>5.</td><td>The Search Results Cache</a></td></tr>
<tr><td>6.</td><td>Site Clustering</a></td></tr>
<tr><td>7.</td><td>Deduping</a></td></tr>
<tr><td>8.</td><td>Family Filter</a></td></tr>
<tr><td>9.</td><td>Gigabits</a></td></tr>
<tr><td>10.</td><td>Topic Clustering</a></td></tr>
<tr><td>11.</td><td>Related and Reference Pages</a></td></tr>
<tr><td>12.</td><td><a href="#spellchecker">The Spell Checker</a></td></tr>
</table></tr>
<!--subtable2end-->
<tr><td valign=top>F.</td><td><a href="#adminlayer">The Administration Layer</a><br>
<!--subtable2-->
<tr><td></td><td><table cellpadding=3>
<tr><td>1.</td><td><a href="#adminclasses">Associated Classes</a></td></tr>
<tr><td>2.</td><td><a href="#collections">Collections</a></td></tr>
<tr><td>3.</td><td><a href="#parms">Parameters and Controls</a></td></tr>
<tr><td>4.</td><td><a href="#log">The Log System</a></td></tr>
<tr><td>5.</td><td><a href="#clusterswitch">Changing Live Clusters</a></td></tr>
</table>
<tr><td valign=top>G.</td><td><a href="#corelayer">The Core Functions Layer</a><br>
<!--subtable2-->
<tr><td></td><td><table cellpadding=3>
<tr><td>1.</td><td><a href="#coreclasses">Associated Classes</a></td></tr>
<tr><td>2.</td><td><a href="#files"></a>Files</td></tr>
<tr><td>3.</td><td><a href="#threads"></a>Threads</td></tr>
</table>
<tr><td valign=top>H.</td><td><a href="#filelist">File List</a> - List of all source code files with brief descriptions.</a><br>
</table>
<tr><td>XIV.</td><td><a href="#spam">Fighting Spam</a> - Search engine spam identification.</td></tr>
</table>
<br><br>
<a name="started"></a>
<hr><h1>Getting Started</h1>
Welcome to Gigablast. Thanks for supporting this open source software.
<a name="ssh"></a>
<br><br>
<h2>SSH - A Brief Introduction</h2>
Gigablast uses SSH extensively to administer and develop its software. SSH is a secure shell
connection that helps protect sensitive information over an insecure network. Internally, we
access our development machines with SSH, but being prompted
for a password when logging in between machines can become tiresome as well as development gb
processes will not execute unless you can ssh without a password. This is where our authorized
keys file comes in. In your user directory you have a .ssh directory and inside of that directory
there lies an authorized_keys and known_hosts file. Now what we first have to do is generate a
key for your user on the current machine you are on. Remember to press <Enter> when asked for a
passphrase, otherwise you will still have to login.<br>
<code><i> ssh-keygen -t dsa </i></code><br>
Now that should have generated an id_dsa.pub and an id_dsa file. Since you now have a key, we need to
let all of our trusted hosts know that key. Luckily ssh provides a simple way to copy your key into
the appropriate files. When you perform this command the machine will ask you to login.<br>
<code><i> ssh-copy-id -i ~/.ssh/id_dsa.pub destHost</i></code><br>
Perform this operation to all of our development machines by replacing destHost with the appropriate
machine name. Once you have done this, try sshing into each machine. You should no longer be
prompted for a password. If you were requested to enter a password/phrase, go through this section again,
but DO NOT ENTER A PASSPHRASE when ssh-keygen requests one. <b>NOTE: Make sure to copy the id to the
same machine you generated the key on. This is important for running gigablast since it requires to ssh
freely.</b>
<br>
<br>
<i>Host Key Has Changed and Now I Can Not Access the Machine</i><br>
This is easily remedied. If a machine happens to crash and the OS needs to be replaced for whatever
reason, the SSH host key for that machine will change, but your known_hosts file will still have the
old host key. This is to prevent man-in-the-middle attacks, but when we know this is a rebuilt machine
we can simply correct the problem. Just open the ~/.ssh/known_hosts file with your favorite editor.
Find the line with the name or ip address of the offending machine and delete the entire line (the
known_hosts file separates host keys be newlines).<br>
<br><br>
<a name="shell"></a>
<hr><h1>Using the Shell</h1>
<!-- I'd like to nominate to remove a majority of this section - if you need help with bash at this point, you probably
shouldn't be poking around in the code... Maybe just keep the dsh and export command descriptions? SCC -->
Everyone uses the bash shell. The navigation keystrokes work in emacs, too. Here are the common commands:
<br><br>
<table cellpadding=3 border=1>
<tr><td>Command</td><td>Description</td></tr>
<tr><td><nobr>export LD_PATH=/home/mwells/dynamicslibs/</nobr></td><td>Export the LD_PATH variable, used to tell the OS where to look for dynamic libraries.</td></tr>
<tr><td>Ctrl+p</td><td>Show the previously executed command.</td></tr>
<tr><td>Ctrl+n</td><td>Show the next executed command.</td></tr>
<tr><td>Ctrl+f</td><td>Move cursor forward.</td></tr>
<tr><td>Ctrl+b</td><td>Move cursor backward.</td></tr>
<tr><td>Ctrl+a</td><td>Move cursor to start of line.</td></tr>
<tr><td>Ctrl+e</td><td>Move cursor to end of line.</td></tr>
<tr><td>Ctrl+k</td><td>Cut buffer from cursor forward.</td></tr>
<tr><td>Ctrl+y</td><td>Yank (paste) buffer at cursor location.</td></tr>
<tr><td>Ctrl+Shift+-</td><td>Undo last keystrokes.</td></tr>
<tr><td>history</td><td>Show list of last commands executed. Edit /home/username/.bashrc to change the number of commands stored in the history. All are stored in /home/username/.history file.</td></tr>
<tr><td>!xxx</td><td>Execute command #xxx, where xxx is a number shown from the 'history' command.</td></tr>
<tr><td>ps auxww</td><td>Show all processes.</td></tr>
<tr><td>ls -latr</td><td>Show all files reverse sorted by time.</td></tr>
<tr><td>ls -larS</td><td>Show all files reverse sorted by size.</td></tr>
<tr><td>ln -s <x> <y></td><td>Make directory or file y a symbolic link to x.</td></tr>
<tr><td>cat xxx | awk -F":" '{print $1}'</td><td>Show contents of file xxx, but for each line, use : as a delimeter and print out the first token.</td></tr>
<tr><td>dsh -c -f hosts 'cat /proc/scsi/scsi'</td><td>Show all hard drives on all machines listed in the file <i>hosts</i>. -c means to execute this command concurrently on all those machines. dsh must be installed with <i>apt-get install dsh</i> for this to work. You can use double quotes in a single quoted dsh command without problems, so you can grep for a phrase, for instance.</td></tr>
<tr><td>apt-cache search xxx</td><td>Search for a package to install. xxx is a space separated list of keywords. Debian only.</td></tr>
<tr><td>apt-cache show xxx</td><td>Show details of the package named xxx. Debian only.</td></tr>
<tr><td>apt-get install xxx</td><td>Installs a package named xxx. Must be root to do this. Debian only.</td></tr>
<tr><td>adduser xxx</td><td>Add a new user to the system with username xxx.</td></tr>
</table>
<br>
<a name="git"></a>
<hr><h1>Using Git</h1>
Git is what we use to do source code control. Git allows us to have many engineers making changes to a common set of files. The basic commands are the following:<br><br>
<table border=1 cellpadding=3>
<tr>
<td><nobr>git clone <srcdir> <destDit></nobr></td>
<td>This will copy the git repository to the destination directory.</td>
</tr>
</table>
<p>More information available at <a href=http://www.github.com/>github.com</a></p>
<br/>
<h3>Setting up GitHub on a Linux Box</h3>
To make things easier, you can set up github access via ssh. Here's a quick list of commands to run:
(assuming Ubuntu/Debian)
<pre>
sudo apt-get install git-core git-doc
git config --global user.name "Your Name"
git config --global user.email "[email protected]"
git config --global color.ui true
ssh-keygen -t rsa -C "[email protected]" -f ~/.ssh/git_rsa
cat ~/.ssh/git_rsa.pub
</pre>
Copy and paste the ssh-rsa output from the above command into your Github profile's list of SSH Keys.
<pre>
ssh-add ~/.ssh/git_rsa
</pre>
If that gives you an error about inability to connect to ssh agent, run:
<pre>
eval `ssh-agent -a`
</pre>
Then test and clone!
<pre>
ssh -T [email protected]
git clone [email protected]:gigablast/open-source-search-engine
</pre>
<br><br>
<a name="hardware"></a>
<hr><h1>Hardware Administration</h1>
<h2>Hardware Failures</h2>
If a machine has a bunch of I/O errors in the log or the gb process is at a standstill, login and type "dmesg" to see the kernel's ring buffer. The error that means the hard drive is fried is something like this:<br>
<pre>
mwells@gf36:/a$ dmesg | tail -3
scsi5: ERROR on channel 0, id 0, lun 0, CDB: Read (10) 00 13 89 68 82 00 00 18 00
Info fld=0x1389688b, Current sd08:34: sense key Medium Error
I/O error: dev 08:34, sector 323627528
</pre>
If you do a <i>cat /proc/scsi/scsi</i> you can see what type of hard drives are in the server:<br>
<pre>
mwells@gf36:/a$ cat /proc/scsi/scsi
Attached devices:
Host: scsi2 Channel: 00 Id: 00 Lun: 00
Vendor: Hitachi Model: HDS724040KLSA80 Rev: KFAO
Type: Direct-Access ANSI SCSI revision: 03
Host: scsi3 Channel: 00 Id: 00 Lun: 00
Vendor: Hitachi Model: HDS724040KLSA80 Rev: KFAO
Type: Direct-Access ANSI SCSI revision: 03
Host: scsi4 Channel: 00 Id: 00 Lun: 00
Vendor: Hitachi Model: HDS724040KLSA80 Rev: KFAO
Type: Direct-Access ANSI SCSI revision: 03
Host: scsi5 Channel: 00 Id: 00 Lun: 00
Vendor: Hitachi Model: HDS724040KLSA80 Rev: KFAO
Type: Direct-Access ANSI SCSI revision: 03
</pre>
<!-- Should most of the following be kept internally at GB? Not sure it would help with open-source users... SCC -->
So for this error we should replace the <b>rightmost</b> hard drive with a spare hard drive. Usually we have spare hard drives floating around. You will know by looking at colo.html where all the equipment is stored.
<br><br>
If a drive is not showing up in /proc/scsi/scsi when it should be, it may be a bad sata channel on the motherboard, but sometimes you can see it again if you reboot a few times.
<br><br>
Often, if a motherboard failures, or a machine just stops responding to pings and cannot be revived, we designate it as a junk machine and replace its good hard drives with bad ones from other machines. Then we just need to fix the junk machine.
<br><br>
At some point we email John Wang from TwinMicro, or Nick from ACME Micro to fix/RMA the gb/g/gf machines and gk machines respectively. We have to make sure there are plenty of spares to deal with spikes in the hardware failure rate.
<br><br>
Sometimes you can not ssh to a machine but you can rsh to it. This is usually because the a drive is bad and ssh can not access some files it needs to log you in. And sometimes you can ping a machine but cannot rsh or ssh to it, for the same reason. Recently we had some problems with some machines not coming back up after reboot, that was due to a flaky 48-port NetGear switch.
<br><br>
<h2>Replacing a Hard Drive</h2>
When a hard drive goes bad, make sure it is labelled as such so we don't end up reusing it. If it is the leftmost drive slot (drive #0) then the OS will need to be reinstalled using a boot cd. Otherwise, we can just <b>umount /dev/md0</b> as root and rebuild ext2 using the command in /home/mwells/gbsetup2, <b>mke2fs -b4096 -m0 -N20000 -R stride=32 /dev/md0</b>. And to test the hard drives thoroughly, after the mke2fs, <i>nohup badblocks -b4096 -p0 -c10000 -s -w -o/home/mwells/badblocks /dev/md0 >& /home/mwells/bbout </i>. Sometimes you might even need to rebuild the software raid with <b>mkraid -c /etc/raidtab --really-force /dev/md0</b> before you do that. Once the raid is repaired, mount /dev/md0 and copy over the data from its twin to it. On the newly repaired host use <b>nohup rcp -r gbXX:/a/* /a/ &</b> where gbXX is its twin that has all the data still. DO NOT ERASE THE TWIN'S DATA. You will also have to do <b>rcp -r gbXX:/a/.antiword /a/</b> because rcp does not pick that directory up.
<br><br>
<h2>Measuring Drive Performance</h2>
Slow drives are sometimes worse than broken drives. Run <b>./gb thrutest <directoryToWriteTo> 10000000000</b> to make gigablast write out 10GB of files and then read from them. This is also a good test for checking a drive for errors. Each drive should have a write speed of about 39MB/s for a HITACHI, so the whole raid should be doing 150MB/s write, and close to 200MB/s for reading. Fuller drives will not do so well because of fragmentation. If the performance is less than what it should, then try mounting each drive individually and seeing what speed you get to isolate the culprit. Sometimes you can have the drive re-seated (just pulling it out and plugging it back in) and that does the trick. A soft reboot does not seem to work. I don't know if a power cycle works yet or not.
<br><br>
<h2>Measuring Memory Performance</h2>
Run <b>membustest 50000000 50</b> to read 50MB from main memory 50 times to see how fast the memory bus is. It should be at least 2000MB/sec, and 2500MB/sec for the faster servers. Partap also wrote <b>gb memtest</b> to see the maximum amount of memory that can be allocated, in addition to testing memory bus speeds. The max memory that can be allocated seems to be about 3GB on our 2.4.31 kernel.
<br><br>
<h2>Measuring Network Performance</h2>
Use <b>thunder 50000000</b> on the server with ip address a.b.c.d and <b>thunder 50000000 a.b.c.d</b> on the client to see how fast the client can read UDP packets from the server. Gigablast needs about a Gigabit between all machines, and they must all be on a Gigabit switch. Sometimes the 5013CM-T resets the Gigabit to 100Mbps and causes Gigablast to basically come to a hault. You can detect that by doing a <b>dmesg | grep e1000 | tail -1</b> on each machine to see what speed its network interface is running at.
<br><br>
<a name="kernels"></a>
<hr><h1>Kernels</h1>
Gigablast runs well on the Linux 2.4.31 kernel which resides in /gb/kernels/2.4.31/. The .config file is in that directory and named <b>dotconfig</b>. It does have the marvel patch, mvsata340_2.4.patch.bz2 applied so Linux can see the Marvel chip that drives the 4 sata devices on our SuperMicro 5013CM-T 1U servers. In addition we comment out the 3 'offline' statements in drives/scsi/scsi_error.c in the kernel source, and raised the 3 TIMEOUT to 30. These changes allow the kernel to deal with the 'drive offlined' messages that the Hitachi drives SCSI interface give every now and then. We still set the 'ghost lun' think in the kernel source to 10, but I am not sure if this actually helps, though.
<br><br>
Older kernels had problems with unncessary swapping and also with crashing, most likely due to the e1000 driver, which drives the 5013CM-T's Gigabit ethernet functionality. All issues seem to be worked out at this point, so any kernel crash is likely due to a hardware failure, like bad memory, or disk.
<br><br>
I would like to have larger block sizes on the ext2fs filesystem, because 4KB is just not good for fragmentation purposes. We use mostly huge files on our partitions, raid-0'ed across 4 400GB Sata drives, so it would help if we could use a block size of a few Megabytes even. I've contacted Theodore T'so, the guy he wrote e2fs, and he may do it for a decent chunk of cash.
<br><br>
<a name="coding"></a>
<hr><h1>Coding Conventions</h1>
When editing the source please do not wrap lines passed 80 columns. Also,
please make an effort to avoid excessive nesting, it makes for hard-to-read
code and can almost always be avoided.
<br><br>
Classes and files are almost exclusively named using Capitalized letters for
the beginning of each word, like MyClass.cpp. Global variables start with g_,
static variables start with s_ and member variables start with m_. In order
to save vertical space and make things more readable, left curly brackets ({'s)
are always placed on the same line as the class, function or loop declaration.
In pointer declations, the * sticks to the variable name, not the type as
in void *ptr. All variables are to be named in camel case as in lowerCamelCase.
Underscores in variable names are forbidden, with the exception of the g_, s_ and m_ prefixes, of course.
<br><br>
And try to avoid lots of nested curly brackets. Usually you do not have to nest
more than 3 levels deep if you write the code correctly.
<br><br>
Using the *p, p++ and p < pend mechanisms is generally preferred to using the
p[i], i++ and i < n mechanisms. So stick to that whenever possible.
<br><br>
Using ?'s in the code, like a = b ? d : e; is also forbidden because it is too cryptic and easy to mess up.
<br><br>
Upon error, g_errno is set and the conditions causing the error are logged.
Whenever possible, Gigablast should recover after error and continue running
gracefully.
<br><br>
Gigablast uses many non-blocking functions that return false if they block,
or true if they do not block. These type of functions will often set the
g_errno global error variable on error. These type of functions almost always
take a pointer to the callback function and the callback state as arguments.
The callback function will be called upon completion of the function.
<br><br>
Gigablast no longer uses Xavier Leroy's pthread library (LinuxThreads) because
it was buggy. Instead, the Linux clone() system call is used. That function
is pretty much exclusively Linux, and was the basis for pthreads anyway. It
uses an older version of libc which does not contain thread support for the
errno variable, so, like the old pthreads library, Gigablast's Threads.cpp
class has implemented the _errno_location() function to work around that. This
function provides a pointer to a separate thread's (process's) own errno
location so we do not contend for the same errno.
<br><br>
We also make an effort to avoid deep subroutine calls. Having to trace the
IP stack down several functions is time-consuming and should be avoided.
<br><br>
Gigablast also tries to make a consistent distinction between size and length.
Length usually refers to a string length (like strlen()) and does not include
the NULL terminator, if any. Size is the overall size of a data blob,
and includes everything.
<br><br>
When making comments please do not insert the "//" at the very first character
of the line in the first column of the screen. It should be indented just
like any other line.
<br><br>
Whenver possible, please mimice the currently used styles and conventions
and please avoid using third party libraries. Even the standard template
library has serious performance/scalability issues and should be avoided.
<br><br>
<a name="debug"></a>
<hr><h1>Debugging Gigablast</h1>
<h2>General Procedure</h2>
When Gigablast cores you generally want to use gdb to look at the core and see where it happened. You can use BitKeeper's <b>bk revtool <filename></b> to see if the class or file that cored was recently edited. You may be looking at a bug that was already fixed and only need to update to the latest version.
<h2>Using GDB</h2>
<p>We generally use gdb to debug Gigablast. If you are running gb, the gigablast process, under gdb, and it receives a signal, gdb will break and you have to tell it to ignore the signal from now now by typing <b>handle SIG39 nostop noprint</b> for signal 39, at least, and then continue execution by typing 'c' and enter. When debugging a core on a customer's machine you might have to copy your versino of gdb over to it, if they don't have one installed.</p>
<p>There is also a /gb/bin/gdbserver that you can use to debug a remote gb process, although no one really uses this except Partap used it a few times. </p>
<p>The most common way to use gdb is:
<ul>
<li><b>gdb ./gb</b> to start up gdb.
<li><b>run -c ./hosts.conf 0</b> to run gb under gdb.
<li><b>Ctrl-C</b> to break gb
<li><b>where</b> to make gdb print out the ip stack
<li><b>break MyClass.cpp:123</b> to break gb at line 123 in MyClass.cpp.
<li><b>print <variableName></b> to print out the value of a variable.
<li><b>watch <variableName></b> to set a watchpoint so gdb breaks if the variable changes value.
<li><b>set <variableName> <value></b> to set the value of a variable.
<li><b>set print elements 100000</b> to tell gdb to print out the first 100000 bytes when printing strings, as opposed to limiting it to the default 200 or so bytes.
</ul>
</p>
<p>You can also use gdb to do <i>poor man's profiling</i> by repeatedly attaching gdb to the gb pid like <b>gdb ./gb <pid></b> and seeing where it is spending its time. This is a fairly effective random sampling technique.</p>
<p>If a gb process goes into an infinite loop you can get it to save its in-memory data by attaching gdb to its pid and typing <b>print mainShutdown(1)</b> which will tell gdb to run that function which will save all gb's data to disk so you don't end up losing data.</p>
<p>To debug a core type <b>gdb ./gb <coreFilename></b> Then you can examine why gb core dumped. Please copy the gb binary and move the core to another filename if you want to preserve the core for another engineer to look at. You need to use the exact gb that produced the core in order to analyze it properly.
</p>
<p>It is useful to have the following .gdbinit file in your home directory:
<pre>
set print elements 100000
handle SIG32 nostop noprint
handle SIG35 nostop noprint
handle SIG39 nostop noprint
set overload-resolution off
</pre>
The overload-resolution gets in the way when tring to print the return value of some functions, like uccDebug() for instance.
</p>
<h2>Using Git to Help Debug</h2>
<a href="#git">Git</a> can be a useful debugging aid. By doing <b>git log</b> You can see what the latest changes made to gb were. This may allow you to isolate the bug. <b>bk diff HEAD~1 HEAD~0</b> will actually list the files that were changed from the last commit, instead of just the changesets. Substituting '1' with '2' will show the changes from the last two commits, etc.
<h2>Memory Leaks</h2>
All memory allocations, malloc, calloc, realloc and new, go through Gigablast's Mem.cpp class. That class has a hash table of pointers to any outstanding allocated memory, including the size of the allocation. When Gigablast allocates memory it will allocate a little more than requested so it can pad the memory with special bytes, called magic characters, that are verified when the memory is freed to see if the buffer was overwritten or underwritten. In addition to doing these boundary checks, Gigablast will also print out all the unfreed memory at exit. This allows you to detect memory leaks. The only way we can leak memory without detection now is if it is leaked in a third party library which Gigablast links to.
<h2>Random Memory Writes</h2>
This is the hardest problem to track down. Where did that random memory write come from? Usually, RdbTree.cpp, uses the most memory, so if the write happens it will corrupt the tree. To fight this you can turn <i>protect</i> the RdbTree by forcing RdbTree::m_useProtection to true in RdbTree::set(). This will make Gigablast protect the memory when not being accessed explicitly by RdbTree.cpp code. It does this by calling mprotect(), which can slow things down quite a bit.
<h2>Corrupted Data In Memory</h2>
Try to reproduce it under gdb immediately before the corruption occurs, then set a watchpoint on some memory that was getting corrupted. Try to determine the start of the corruption in memory. This may lead you to a buffer that was overflowing. Often times, a gb process will core in RdbTree.cpp which never happens unless the machine has bad RAM. Bad RAM also causes gb to go into an infinite loop in Msg3.cpp's RdbMap logic. In these cases try replacing the memory or using another machine.
<h2>Common Core Dumps</h2>
<table>
<tr><td>Problem</td><td>Cause</td></tr>
<tr><td>Core in RdbTree</td><td>Probably bad ram</td></tr>
</table>
<h2>Common Coding Mistakes</h2>
<table>
<tr><td>1.</td><td>Calling atoip() twice or more in the same printf() statement. atoip() outputs into a single static buffer and can not be shared this way.</td></tr>
<tr><td>2.</td><td>Not calling class constructors and destructors when mallocing/freeing class objects. Need to allow class to initialize properly after allocation and free any allocated memory before it is freed.</td></tr>
</table>
<br><br>
<a name="sa"></a>
<hr><h1>System Administration</h1>
<a name="code"></a>
<hr><h1>Code Overview</h1>
Gigablast consists of about 500,000 lines of C++ source. The latest gb version being developed as of this writing is /gb/datil/. Gigablast is a single executable, gb, that contains all the functionality needed to run a search engine. Gigablast attempts to uniformly distribute every function amongst all hosts in the network. Please see <a href="overview.html">overview.html</a> to get a general overview of the search engine from a USER perspective.
<br><br>
Matt started writing Gigablast in 2000, with the intention of making it the most efficient search engine in the world. A lot of attention was paid to making every piece of the code fast and scalable, while still readable. In July, 2013, the majority of the search engine source code was open sourced. Gigablast continues to develop and expand the software, as well as offers consulting services for companies wishing to run their own search engine.
<br><br>
<hr>
<a name="loop"></a>
<h1>The Control Loop Layer</h1>
Gigablast uses a <b>signal-based</b> architecture, as opposed to a thread-based architecture. At the heart of the Gigablast process is the I/O control loop, contained in the <a href="../latest/src/Loop.cpp">Loop.cpp</a> file. Gigablast sits in an idle state until it receives a signal on a file or socket descriptor, or a "thread" exits and sends a signal to be cleaned up. Threads are only used when necessary, which is currently for performing disk I/O, intersecting long lists of docIds to generate search results (which can block the CPU for a while) and merging database data from multiple files into a single list.
<br><br>
<a href="#threads">Threads</a> are not only hard to debug, but can make everything more latent. If you have 10 requests coming in, each one requiring 1 second of CPU time, if you thread them all it will take 10 seconds until each query can be satisfied and the results returned, but if you handle them FIFO (one at a time) then the first request is satisfied in 1 second, the second in two seconds, etc. The Threads.cpp class handles the use of threads and uses the Linux clone() call. Xavier Leroy's pthreads were found to be too buggy at the time.
<br><br>
Loop.cpp also allows other classes to call its registerCallback() function which takes a file or socket descriptor and when call a given callback when data is ready for reading or writing on that file descriptor. Gigablast accomplishes this with Linux's fcntl() call which you can use to tell the kernel to connect a particular file descriptor to a signal. The heart of gb is really sigtimedwait() which sits there idle waiting for signals. When one arrives it breaks out and the callback responsible for the associated file descriptor is called.
<br><br>
In the event that the queue of ready file descriptors is full, the kernel will deliver a SIGIO to the gb process, which performs a poll over all the registered file descriptors.
<br><br>
Gigablast will also catch other signals, like segmentation faults and HUP signals and save the data to disk before exiting. That way we do not lose data because of a core dump or an accidental HUP signal.
<br><br>
Using a signal based architecture means that we have to work with callbacks a lot. Everything is triggered by an <i>event</i> similar to GUIs, but our events are not mouse related, but disk and network. This naturally leads to the concept of a callback, which is a function to be called when a task is completed.
So you might call a function like <b>readData ( myCallback , myState , readbuf , bytesToRead )</b> and gigablast will perform the requested read operation and call <b>myCallback ( myState )</b> when it is done. This means you have to worry about state, whereas, in a thread-based architecture you do not.
<br><br>
Furthermore, all callbacks must be expressed as pointers to <b>C</b> functions, not C++ functions. So we call these C functions wrapper functions and have them immediately call their C++ counterparts. You'll see these wrapper functions clearly illustrated throughout the code.
<br><br>
Gigablast also has support for asynchronous signals. You can tell the kernel to deliver you an asynchronous signal in an I/O event, and that signal, also known as an interrupt, will interrupt whatever the CPU is doing and call gb's signal handler, sigHandlerRT(). Be careful, you cannot allocate memory when in a real-time/asynchronous signal handler. You should only call function names ending in _ass, which stands for asychronous-signal safe. The global instance of the UdpServer called g_udpServer2 is based on these asynchronous signals, mostly set up for doing fast pings without having to worry about what the CPU is doing at the time, but it really does not interrupt like it should most likely due to kernel issues. It doesn't seem any faster than the non-asynchronouys UDP server, g_udpServer.
<br><br>This signal-based architecture also makes a lot of functions return false if they block, meaning they never completed because they have to wait on I/O, or return true and set g_errno on error. So if a function returns false on you, generally, you should return false, too.
<br><br>
The Loop class is primarily responsible for calling callbacks when an event
occurs on a file/socket descriptor. Loop.cpp also calls callbacks that
registered with it from calling registerSleepCallback(). Those sleep callbacks
are called every X milliseconds, or more than X milliseconds if load is
extreme. Loop also responds to SIGCHLD signals which are called when a thread
exits. The only socket descriptor related callbacks are for TcpServer and
for UdpServer. UdpServer and TcpServer each have their own routine for calling
callbacks for an associated socket descriptor. But when the system is under
heavy load we must be careful with what callbacks we call. We should call the
highest priority (lowest niceness) callbacks before calling any of the
lowest priority (highest niceness) callbacks.
To solve this problem we require that every callback registered with Loop.cpp
return after 2 ms or more have elapsed if possible. If they still need to
be called by Loop.cpp to do more work they should return true at that time,
otherwise they should return false. We do this in order to give higher priority
callbacks a chance to be called. This is similar to the low latency patch in
the Linux kernel. Actually, low priority callbacks will try to break out after
2 ms or more, but high priority callbacks will not. Loop::doPoll() just calls
low priority
<a name="database"></a>
<hr><h1>The Database Layer</h2>
Gigablast has the following databases which each contain an instance of the
Rdb class:
<pre>
Indexdb - Used to hold the index.
Datedb - Like indexdb, but its scores are dates.
Titledb - Used to hold cached web pages.
Spiderdb - Used to hold urls sorted by their scheduled time to be spidered.
Checksumdb - Used for preventing spidering of duplicate pages.
Sitedb - Used for classifying webpages. Maps webpages to rulesets.
Clusterdb - Used to hold the site hash, family filter bit, language id of a document.
Catdb - Used to classify a document using DMOZ.
</pre>
<a name="rdb"></a>
<h2>Rdb</h2>
<b>Rdb.cpp</b> is a record-based database. Each record has a key, and optional
blob of data and an optional long (4 bytes) that holds the size of the blob of
data. If present, the size of the blob is specified right after the key. The
key is 12 bytes and of type key_t, as defined in the types.h file.
<a name="dbclasses"></a>
<h3>Associated Classes (.cpp and .h files)</h3>
<table cellpadding=3 border=1>
<tr><td>Checksumdb</td><td>DB</td><td>Rdb that maps a docId to a checksum for an indexed document. Used to dedup same content from the same hostname at build time.</td></tr>
<tr><td>Clusterdb</td><td>DB</td><td>Rdb that maps a docId to the hash of a site and its family filter bit and, optionally, a sample vector used for deduping search results. Used for site clustering, family filtering and deduping at query time. </td></tr>
<tr><td>Datedb</td><td>DB</td><td>Like indexdb, but its <i>scores</i> are 4-byte dates.</td></tr>
<tr><td>Indexdb</td><td>DB</td><td>Rdb that maps a termId to a score and docId pair. The search index is stored in Indexdb.</td></tr>
<tr><td>MemPool</td><td>DB</td><td>Used by RdbTree to add new records to tree without having to do an individual malloc.</td></tr>
<tr><td>MemPoolTree</td><td>DB</td><td>Unused. Was our own malloc routine.</td></tr>
<tr><td>Msg0</td><td>DB</td><td>Fetches an RdbList from across the network.</td></tr>
<tr><td>Msg1</td><td>DB</td><td>Adds all the records in an RdbList to various hosts in the network.</td></tr>
<tr><td>Msg3</td><td>DB</td><td>Reads an RdbList from several consecutive files in a particular Rdb.</td></tr>
<!--<tr><td>Msg34</td><td>DB</td><td>Determines least loaded host in a group (shard) of hosts.</td></tr>
<tr><td>Msg35</td><td>DB</td><td>Merge token managament functions. Currently does not work.</td></tr>-->
<tr><td>Msg5</td><td>DB</td><td>Uses Msg3 to read RdbLists from multiple files and then merges those lists into a single RdbList. Does corruption detection and repiar. Intergrates list from RdbTree into the single RdbList.</td></tr>
<tr><td>MsgB</td><td>DB</td><td>Unused. A distributed cache for caching anything.</td></tr>
<tr><td>Rdb</td><td>DB</td><td>The core database class from which all are derived.</td></tr>
<tr><td>RdbBase</td><td>DB</td><td>Each Rdb has an array of RdbBases, one for each collection. Each RdbBase has an array of BigFiles for that collection.</td></tr>
<tr><td>RdbCache</td><td>DB</td><td>Can cache RdbLists or individual Rdb records.</td></tr>
<tr><td>RdbDump</td><td>DB</td><td>Dumps the RdbTree to an Rdb file. Also is used by RdbMerge to dump the merged RdbList to a file.</td></tr>
<tr><td>RdbList</td><td>DB</td><td>A list of Rdb records.</td></tr>
<tr><td>RdbMap</td><td>DB</td><td>Maps an Rdb key to an offset into an RdbFile.</td></tr>
<tr><td>RdbMem</td><td>DB</td><td>Memory manager for RdbTree so it does not have to allocate space for every record in the three.</td></tr>
<tr><td>RdbMerge</td><td>DB</td><td>Merges multiple Rdb files into one Rdb file. Uses Msg5 and RdbDump to do reading and writing respectively.</td></tr>
<tr><td>RdbScan</td><td>DB</td><td>Reads an RdbList from an RdbFile, used by Msg3.</td></tr>
<tr><td>RdbTree</td><td>DB</td><td>A binary tree of Rdb records. All collections share a single RdbTree, so the collection number is specified for each node in the tree.</td></tr>
<tr><td>SiteRec</td><td>DB</td><td>A record in Sitedb.</td></tr>
<tr><td>Sitedb</td><td>DB</td><td>An Rdb that maps a url to a Sitedb record which contains a ruleset to be used to parse and index that url.</td></tr>
<tr><td>SpiderRec</td><td>DB</td><td>A record in spiderdb.</td></tr>
<tr><td>Spiderdb</td><td>DB</td><td>An Rdb whose records are urls sorted by times they should be spidered. The key contains other information like if the url is <i>old</i> or <i>new</i> to the index, and the priority of the url, currently from 0 to 7.</td></tr>
<tr><td>TitleRec</td><td>DB</td><td>A record in Titledb.</td></tr>
<tr><td>Titledb</td><td>DB</td><td>An Rdb where the records are basically compressed web pages, along with other info like the quality of the page. Contains an instance of the LinkInfo class.</td></tr>
<!--<tr><td>Urldb</td><td>DB</td><td>An Rdb whose records indicate if a url is in spiderdb or what particular Titledb BigFile contains the url.</td></tr>-->
</table>
<br><br>
<a name="addingrecs"></a>
<h3>Adding a Record to Rdb</h3>
<a name="rdbtree"></a>
When a record is added to an Rdb it is housed in a binary tree,
<b>RdbTree.cpp</b>,
whose size is configurable via gb.conf. For example, <indexdbMaxTreeSize>
specified how much memory in bytes that the tree for Indexdb can use. When the
tree is 90% full it is dumped to a file on disk. The records are dumped in
order of their keys. When enough files have accrued on disk, a merge is
performed to keep the number of files down and responsiveness up.
<br><br>
Each file, called a BigFile and defined by BigFile.h, can actually consist of
multiple physical files, each limited in size to 512MB. In this manner
Gigablast overcomes the 2GB file limit imposed by some kernels or disk systems.
Each physical file in a BigFile, after the first file, has a ".partX"
extension added to its filename, where X ranges from 1 to infinity. Throughout
this document, "BigFile" is used interchangably with "file".
<br><br>
If the tree can not accomodate a record add, it will return an ETRYAGAIN error.
Typically, most Gigablast routines will wait one second before retrying the
add.
<br>
<h3>Dumping an Rdb Tree</h3>
The RdbDump class is responsible for dumping the RdbTree class to disk. It
dumps a little bit of the tree at a time because it can take a few hundred
milliseconds seconds to gather a large list of records from the tree,
especially when the records are dataless (just keys). Since RdbTree::getList()
is not threaded it is important that RdbDump gets a little bit at a time so
it does not block other operations.
<br>
<a name="mergingfiles"></a>
<a name="rdbmerge"></a>
<h3>Merging Rdb Files</h3>
Rdb merges just enough files as to keep the number of files at or below the
threshold specified in gb.conf, which is <indexdbMaxFilesToMerge> for
Indexdb, for example. <b>RdbMerge.cpp</b> is used to control the merge logic.
The merge chooses which files to merge so as to minimize
the amount of disk writing for the long term. It is important to keep the
number of files low, because any time a key range of records is requested, the
number of disk seeks will be low and the database will be responsive.
<br><br>
When too many files have accumulated on disk, the database will enter "urgent
merge mode". It will show this in the log when it happens. When that happens,
Gigablast will not dump the corresponding RdbTree to disk because
it will create too many files. If the RdbTree is full then any
attempts to add data to it when Gigablast is in "urgent merge mode" will fail
with an ETRYAGAIN error. These error replies are counted on a per host basis
and displayed in the Hosts table. At this point the host is considered to be
a spider bottleneck, and most spiders will be stuck waiting to add data to
the host.
<br><br>
If Gigablast is configured to "use merge tokens", (which no longer works for some reason and has since been disabled) then any file merge operation
may be postponed if another instance of Gigablast is performing a merge on the
same computer's IDE bus, or if the twin of the host is merging. This is done
mostly for performance reasons. File merge operations tend to decrease disk
access times, so having a handy twin and not bogging down an IDE bus allows
Gigablast's load balancing algorithms to redirect requests to hosts that are
not involved with a big file merge.
<br>
<a name="deletingrecs"></a>
<h3>Deleting Rdb Records</h3>
The last bit of the 12 bytes key, key_t, is called the delbit. It is 0 if the
key is a "negative key", it is 1 if the key is a "positive key". When
performing a merge, a negative key may collide with its positive counterpart
thus annihilating one another. Therefore, deletes are performed by taking the
key of the record you want to delete, changing the low bit from a 1 to a 0,
and then adding that key to the database. Negative keys are always dataless.
<br>
<a name="readingrecs"></a>
<h3>Reading a List of Rdb Records</h3>
When a user requests a list of records, Gigablast will read the records
in the key range from each file. If no more than X bytes of records are
requested, then Gigablast will read no more than X bytes from each file. After
the reading, it merges the lists from each file into a final list. During this
phrase it will also annihilate positive keys with their negative counterparts.
<br><br>
<a name="rdbmap"></a>
In order to determine where to start reading in each file for the database,
Gigablast uses a "map file". The map file, encompassed by <b>RdbMap.cpp</b>,
records the key of the first record
that occurs on each disk page. Each disk page is PAGE_SIZE bytes, currently,
16k. In addition to recording the key of the first record that start on
each page, it records a 2 byte offset of the key into that page. The map file
is represented by the RdbMap class.
<br><br>
The Msg3::readList() routine is used retrieve a list of Rdb records from
a local Rdb. Like most Msg classes, it allows you to specify a callback
function and callback data in case it blocks. The Msg5 class contains the Msg3
class and extends it with the error correction (discussed below) and caching
capabilities. Calling Msg5::getList() also allows you to incorporate the
lists from the RdbTree in order to get realtime data. Msg3 is used mostly
by the file merge operation.
<br>
<a name="errorcorrection"></a>
<h3>Rdb Error Correction</h3>
Every time a list of records is read, either for answering a query or for
doing a file merge operation, it is checked for corruption. Right now
just the order of the keys of the records, and if those keys are out of the
requested key range, are checked. Later, checksums may
be implemented as well. If some keys are found to be out of order or out
of the requested key range, the request
for the list is forwared to that host's twin. The twin is a mirror image.
If the twin has the data intact, it returns its data across the network,
but the twin will return all keys it has in that range, not just necessarily from a single file, thus we can end up patching the corrupted data with a list that is hundreds of times bigger.
Since this
procedure also happens during a file merge operation, merging is a way of
correcting the data.
<br><br>
If there is no twin available, or the twin's data is corrupt as well, then
Gigablast attempts to excise the smallest amount of data so as to regain
integrity. Any time corruption is encountered it is noted in the log with
a message like:
<pre>
1145413139583 81 WARN db [31956] Key out of order in list of records.
1145413139583 81 WARN db [31956] Corrupt filename is indexdb1215.dat.
1145413139583 81 WARN db [31956] startKey.n1=af6da55 n0=14a9fe4f69cd6d46 endKey.n1=b14e5d0 n0=8d4cfb0deeb52cc3
1145413139729 81 WARN db [31956] Removed 0 bytes of data from list to make it sane.
1145413139729 81 WARN db [31956] Removed 6 recs to fix out of order problem.
1145413139729 81 WARN db [31956] Removed 12153 recs to fix out of range problem.
1145413139975 81 WARN net Encountered a corrupt list.
1145413139975 81 WARN net Getting remote list from twin instead.
1145413139471 81 WARN net Received good list from twin. Requested 5000000 bytes and got 5000010.
startKey.n1=af6cc0e n0=ae7dfec68a44a788 endKey.n1=ffffffffffffffff n0=ffffffffffffffff
</pre>
<br>
<a name="netreads"></a>
<a name="msg0"></a>
<h3>Getting a List of Records from Rdb in a Network</h3>
<b>Msg0.cpp</b> is used to retrieve a list of Rdb records from an abritrary
host. Indexdb is currently the only database where lists of records are
needed. All the other databases pretty much just do single key lookups. Each
database chooses how to partition its records based on the key across the
many hosts in a network. For the most part, the most significant bits of the
key are used to determine which host is responsible for storing that record.
This is why Gigablast is currently limited to running on 2, 4, 8, 16 ... or
2^n machines. Msg0 should normally be used to get records, not Msg5 or Msg3.
<br>
<a name="netadds"></a>
<a name="msg1"></a>
<h3>Adding a Record to Rdb in a Network</h3>
You can add record to a local instance of Rdb by using Rdb::addRecord() and
you can use Rdb::deleteRecord() to delete records. But in order to add a
record to a remote host you must use <b>Msg1.cpp</b>. This class
will use the key of the record to determine which host or hosts in the network
are responsible for storing the record. It will continue to send the add
request to each host until it receives a positive reply. If the host receiving
the record is down, the add operation will not complete, and will keep
retrying forever.
<br>
<a name="varkeysize"></a>
<h3>Variable Key Size</h3>
As of June 2005, Rdb supports a variable key size. Before, the key was always of class <i>key_t</i> as defined in types.h as being 12 bytes. Now the key can be 16 bytes as well. The key size, either 12 or 16 bytes, is passed into the Rdb::init() function. MAX_KEY_BYTES is currently #define'd as being 16, so if you ever use a key bigger than 16 bytes that must be updated. Furthermore, the functions for operating on key_t's, while still available, have been generalized to accept a key size parameter. These functions are in types.h as well, and the more popular ones are KEYSET() and KEYCMP() for setting and comparing variable sized keys.
<br><br>
To convert the Rdb files from a fixed-size key to a variable-sized one required modifying the files of the Associated Classes (listed above). Essentially, all variables of type key_t were changed to character pointers (char *), and all key assignment and comparison operators (and others) were changed to use the more general KEYSET() and KEYCMP() functions in types.h. To maintain backwards compatibility and ease migration, all Rdb associated classes still accept key_t parameters, but now they also accept character pointer parameters for passing in keys.
<br><br>
Some of the more common bugs from this change: since the keys are now character pointers, data owned by one class often got overwritten by another, therefore you have to remember to copy the key using KEYSET() rather than just operator on the key that the pointer points to. Another common mistake is using the KEYCMP() function without having a comparison operator immediately following in, such as < > = or !=. Also tossing the variable key size, m_ks, around and keeping it in agreement is another weak point. There were some cases of leaving a statement like <i>(char *)&key</i> alone when it should have been changed to just <i>key</i> since it was made into a <i>(char *)</i> from a <i>key_t</i>. And a case of not dealing with the 6-byte compression correctly, like replacing <i>6</i> with <i>m_ks-6</i> when we should not have, like in RdbList::constrain_r(). Whenever possible, the original code was left intact and simply commented out to aid in future debugging.
<br>
<a name="posdb"</a>
<h2>Posdb</h2>
Indexdb was replaced with Posdb in 2012 in order to store <b>word position</b> information as well as what <b>field</b> the word was contained in. Word position information is basically the position of the word in the document and starts at position <i>0</i>. A sequence of whitespace is counted as one, and a sequence of punctuation containing a comma or something else is counted as 2. An alphanumeric word is counted as one. So in the sentence "The quick, brown" the word <i>brown</i> would have a word position of 5. The <b>field</b> of the word in the document could be the title, a heading, a meta tag, the text of an inlink or just the plain document body.
<br><br>
The 18-byte key for an Posdb record has the following bitmap:
<pre>
tttttttt tttttttt tttttttt tttttttt t = termId (48bits)
tttttttt tttttttt dddddddd dddddddd d = docId (38 bits)
dddddddd dddddddd dddddd0r rrrggggg r = siterank, g = langid
wwwwwwww wwwwwwww wwGGGGss ssvvvvFF w = word postion , s = wordspamrank
pppppb1M MMMMLZZD v = diversityrank, p = densityrank
M = unused, b = in outlink text
L = langIdShiftBit (upper bit for langid)
Z = compression bits. can compress to
12 or 6 bytes keys.
G: 0 = body
1 = intitletag
2 = inheading
3 = inlist
4 = inmetatag
5 = inlinktext
6 = tag
7 = inneighborhood
8 = internalinlinktext
9 = inurl
F: 0 = original term
1 = conjugate/sing/plural
2 = synonym
3 = hyponym
</pre>
<br>
Posdb.cpp tries to rank documents highest that have the query terms closest together. If most terms are close together in the body, but one term is in the title, then there is a slight penalty. This penalty as well as the weights applied to the different density ranks, siteranks, etc are in the Posdb.h and Posdb.cpp files.
<br><br>
<a name="indexdb"></a>
<a name="indexlist"></a>
<h2>Indexdb</h2>
Indexdb has been replaced by <a href="#posdb">Posdb</a>, but the key for an Indexdb record has the following bitmap:
<pre>
tttttttt tttttttt tttttttt tttttttt t = termid (48bits)
tttttttt tttttttt ssssssss dddddddd s = ~score
dddddddd dddddddd dddddddd dddddd0Z d = docId (38 bits) Z = delbit
</pre>
When Rdb::m_useHalfKeys is on and the preceeding key as the same 6 bytes as
the following key, then the following key, called a half key, only requires
6 bytes, therefore, has the following bitmap:
<pre>
ssssssss dddddddd dddddddd dddddddd d = docId, s = ~score
dddddddd dddddd1Z Z = delbit
</pre>
Every term that Gigablast indexes, be it a word or phrase, is hashed using
the hash64() routine in the hash.h. This is a very fast and effective hashing
function. The resulting hash of the term is called the termid. It is
constrained to 48 bits.
<br><br>
Next, the score of that term is computed. Generally, the more times the term
occurs in the document the higher the score. Other factors, like incoming
links and link text and quality of the document, may contribute to the score
of the term. The score is 4 bytes but is logarithmically mapped into 8 bits,
and complemented, so documents with higher scores appear above those with
lower scores, since the Rdb is sorted in ascending order.
<br><br>
The <indexdbTruncationLimit> parameter in gb.conf, reflected as
Conf::m_indexdbTruncationLimit, allows you to specify the maximum
number of docids allowed per termid. This allows you to save disk space. Right
now it is set to 5 million, a fairly hefty number. The merge routine,
RdbList::indexMerge_r() will ensure that no returned Indexdb list violates
this truncation limit. This routine is used for reading data for resolving
queries as well as doing file merge operations.
A list of Indexdb records all with the same termid, is called a termlist or
indexlist.
<br><br>
<a name="datedb"></a>
<h2>Datedb</h2>
The key for an Datedb record has the following 16-byte bitmap:
<pre>
tttttttt tttttttt tttttttt tttttttt t = termId (48bits)
tttttttt tttttttt DDDDDDDD DDDDDDDD D = ~date
DDDDDDDD DDDDDDDD ssssssss dddddddd s = ~score
dddddddd dddddddd dddddddd dddddd0Z d = docId (38 bits)
And, similar to indexdb, datedb also has a half bit for compression down to 10 bytes:
DDDDDDDD DDDDDDDD D = ~date
DDDDDDDD DDDDDDDD ssssssss dddddddd s = ~score
dddddddd dddddddd dddddddd dddddd1Z d = docId (38 bits)
</pre>
Datedb was added along with variable-sized keys (mentioned above). It is basically the same as indexdb, but has a 4-byte date field inserted. IndexTable.cpp was modified slightly to treat dates as scores in order to provide sort by date functionality. By setting the sdate=1 cgi parameter, Gigablast should limit the termlist lookups to Datedb. Using date1=X and date2=Y cgi parameters will tell Gigablast to constrain the termlist by those dates. date1 and date2 are currently seconds since the epoch. Gigablast will search for the date of a document in this order, stopping at the first non-zero date value:<br>
1. <meta name=datenum content=123456789><br>
2. <meta name=date content="Dec 1 2006"><br>
3. The last modified date in the HTTP mime that the web server returns.
<br><br>
<a name="titledb"></a>
<a name="titlerec"></a>
<h2>Titledb</h2>
Titledb holds a cached copy of every web page in the index. The TitleRec
class is used to serialize and deserialize the information from an Rdb record
into something more useful. It also uses the freely available zlib library
to compress web pages to about 1/5th of their size. Depending on the word
density on a page and the truncation limit, Titledb is usually slightly bigger
than Indexdb. See TitleRec.h to see what kind of information is contained
in a TitleRec.
<br><br>
The key of a Titledb record, a TitleRec, has the following bitmap:
<pre>
dddddddd dddddddd dddddddd dddddddd d = docId
dddddddd hhhhhhhh hhhhhhhh hhhhhhhh h = hash of site name
hhcccccc cccccccc cccccccc cccccccD c = content hash, D = delbit
</pre>
The low bits of the top 31 bits of the docId are used to determine which
host in the network stores the Titledb record. See Titledb::getGroupId().
<br><br>
The hash of the sitename is used for doing site clustering. The hash of the
content is used for deduping identical pages from the search results.
<br><br>
<b>IMPORTANT:</b> Any time you change TITLEREC_VERSION in Titledb.h or make
any parsing change to TitleRec.cpp, you must also change it in all parent bk
respositories to keep things backwards compatible. We need to ensure that
newer versions of gb can parse the Titledb records of older versions. Because
if you increment TITLEREC_VERSION in the newer repository and then someone
else increments it for a different reason in the parent repository, then you
end up having a conflict as to what the latest version number actually stands
for. So versions need to be immediately propagated both ways to avoid this.
<a name="spiderdb"></a>
<h2>Spiderdb</h2>
NOTE: this description of spiderdb is very outdated! - matt Aug 2013.
<br><br>
Spiderdb is responsible for the spidering schedule. Every url in spiderdb is
either old or new. If it is already in the index, it is old, otherwise, it is
new. The urls in Spiderdb are further categorized by a priority from 0 to 7.
If there are urls ready to be spidered in the priority 7 queue, then they
take precedence over urls in the priority 6 queue.
<br><br>
All the urls in Spiderdb priority queue are sorted by the date they are
scheduled to be spidered. Typically, a url will not be spidered before its
time, but you can tweak the window in the Spider Controls page. Furthermore,
you can turn the spidering of new or old urls on and off, in addition to
disabling spidering based on the priority level. Link harvesting can also be
toggled based on the spider queue. This way you can harvest links only from
higher priority spider queues.
<br><br>
<a name="pagereindex"></a>
Some spiderdb records are docid based only and will not have an associated url.
These records were usually added from the <b>PageReindex.cpp</b> tool which
allows the user to store the docids of a bunch of search results directly into
spiderdb for respidering.
<br><br>
The bitmap of a key in Spiderdb:
<pre>
00000000 00000000 00000000 pppNtttt t = time to spider, p = ~ of priority
tttttttt tttttttt tttttttt ttttRRf0 R = retry #, f = forced?
dddddddd dddddddd dddddddd dddddddD d = top 32 bits of docId, D = delbit
N = 1 iff url not in titledb (isNew)
</pre>
Each Spiderdb record also records the number of times the url was tried and
failed. In the Spider Controls you can specify how many until Gigablast gives
up and deletes the url from Spiderdb, and possibly from the other databases
if it was indexed.
<br><br>
If a url was "forced", then it was added to Spiderdb even though it was
detected as already being in there. Urldb lets us know if a url is in Spiderdb
or not. So if the url is added to Spiderdb even though it is in there already
it is called "forced" url. Once a forced url is spidered then that Spiderdb
record is discarded. Otherwise, if the url was spidered and was not forced,
it is rescheduled for a future spidering, and the old Spiderdb record is
deleted and a new one is added.
<br><br>
Gigablast attempts to determine if the url changed since its last spidering.
If it has, then it will try to decrease the time between spiderings, otherwise
it will try to increase that time. The min and max respider times can be
specified in the Spider Controls page, so you can ensure Gigablast does not
wait long enough or ensure it does not wait too long between respiderings.
<br><br>
For more about the spider process see <a href="#buildlayer">The Build Layer</a>.
<br><br>
<!--
<a name="urldb"></a>
<h2>Urldb</h2>
Urldb is like a map file. A record in urldb has this bitmap:
<pre>
00000000 00000000 00000000 00000000 H = half bit
00000000 dddddddd dddddddd dddddddd d = 36 bit docId
dddddddd ddddddee eeeeefff fffffCHD e = url hash ext, f = title file # (tfn)
C = clean bit , D = delbit
</pre>
When a caller requests the cached web page for a docid they are essentially requesting the TitleRec for that docid. We have to read the TitleRec from a Titledb file. Because there can be hundreds of titledb files, checking
each one is a burden. Therefore, we ask urldb which one has it. If the title
file number (tfn) is 255, that means the url is just in the Spiderdb OR the TitleRec is in Titledb's RdbTree. The tfn is a secondary numeric identifier present in the actual filename of each titledb file. Only titledb files have this seconday id and it is the number immediately following the hyphen.
<br><br>
The purpose of urldb originally grew from the fact that looking up an old titlerec for re-spidering purposes would require looking in a few files. This was a big bottleneck for the spidering process and also made it harder to handle large query volumes while spidering in the background. Furthermore, much merging was required to keep the number of titledb big files down to a respectable few. Urldb solves this problem by mapping each docid directly to a single titledb file.
<br><br>
Urldb also greatly complicates things. It has to remain in perfect sync with titledb. Every time titledb files are merged the tfn of each TitleRec involved is changed. So right after we dump out the titledb list we add a urldb list to urldb that has the new tfns for the titledbs in that list. This is done in RdbDump::updateUrldbLoop(). Furthermore, since we often merge titledb files in a somewhat random nature, we can not just override the urldb rec for each TitleRec involved in that merge, because a more uptodate TitleRec for the same docid may be in another titledb file. Therefore, we must also read the corresponding urldb lists for the titledb lists involved in the merge, and lookup the uptodate tfn of each TitleRec. If it is indeed a match for the file being merged, then we can do the urldb override.