---
Some calculations
do intersection of two postings lists & - (see the merge sketch after this list)
pseudocode for skip-list intersection &
Positional index usage & example p38
“Proximity” intersection algo &
calculate Soundex & - Example :: Soundex of HERMAN H650 p41
Successor Variety & calculation - - e.g. READABLE p28
simple calculation from the Ubung &
edit distance calculation &
Jaccard coefficient for scoring, how? & Query q [ides of March], doc1: "Caesar died in March": JACCARD(q, d1) = 1/6 (worked sketch after this list)
tf-idf calculation & - p28
tf-idf calculation & - start calculating from p41
tf-idf algo p46 & - NEXT
Heaps’ law: meaning? formula & (formulas after this list)
Zipf’s law: meaning? formula &
Byte-wise compression & calculation - p34
Blocked Sort-Based Indexing (BSBI): idea & - algo & -
Single-Pass In-Memory Indexing (SPIMI): algo & -
ch08 p27 calculation & -
ch08 my online-photos example & calculation - NEXT
Precision and Recall
calculation p10 & -
Accuracy :: proportion of TP and TN
formula & -
calculation &
p28 calculation & - there is a corresponding exercise in the Ubung &
calculation on the slides & p37 - given two judges' assessments, compute the Kappa statistic
the calculation from the web page & - NEXT
calculate the Retrieval Status Value (Retrievalstatuswert) & calculation
calculation example & - p165, Example 12-2
the calculation with input "Tasse Kanne" & - NEXT
in the Ubung &
ch13 second-price auction & and calculation, p47
calculate PageRank weights & - p322, p23 formula & -
Multinomial Naïve Bayes: idea & formula & calculation &
ch19 calculation p16 &
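A minimal sketch of the postings-list intersection referenced above: a linear merge of two sorted docID lists (function names are mine, not from the lecture; the skip-pointer variant would only add a shortcut in the advance steps):

```python
def intersect(p1, p2):
    """Merge-intersect two sorted postings lists of docIDs in O(len(p1) + len(p2))."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1          # with skip pointers we could jump several positions here
        else:
            j += 1
    return answer

print(intersect([1, 2, 4, 11, 31, 45, 173], [2, 31, 54, 101]))  # [2, 31]
```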
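A worked check of the Jaccard scoring example above, treating query and document as sets of lowercased terms (a sketch; the lecture's exact tokenization may differ):

```python
def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two term sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 1.0

q  = "ides of march".lower().split()
d1 = "Caesar died in March".lower().split()
print(jaccard(q, d1))  # only "march" is shared: 1 / 6 ≈ 0.1667
```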
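The usual textbook forms of the two laws asked for above (k, b and c are fitted constants):

```latex
% Heaps' law: vocabulary size M as a function of the number of tokens T
M = k\,T^{b}, \qquad \text{typically } 30 \le k \le 100,\; b \approx 0.5

% Zipf's law: collection frequency of the i-th most frequent term
\mathrm{cf}_i \propto \frac{1}{i}, \qquad \text{i.e. } \mathrm{cf}_i = \frac{c}{i} \text{ for some constant } c
```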
---
Ubungen (to be merged)
important concepts from the Ubungen
u1
intersection algorithm x AND y and x OR y algo &
How should the Boolean query x AND NOT y be handled &
Why is the naive evaluation of this query normally very expensive
what can we achieve p6 &
x AND NOT y: O(x+y); x OR NOT y: O(N)
For a conjunctive query, is processing postings lists in order of size guaranteed to be optimal? &
u2
The tokenization of a document is a trivial task
stemming can lower precision
stemming increases or changes recall
// how does the stemmer affect precision, recall and vocabulary size &
Stemming decreases the size of the vocabulary.
Stemming should be applied to the documents, but not to the queries
The postings list of a stop word is usually longer than the postings list of a non-stop word
// what other approaches to stemming x3 &
What are the respective Levenshtein editing operations x3 types of operations
u3
The Jaccard coefficient works well for ranking documents
Rare terms are less informative than frequent terms.
// which words are more informative? &
The inverse document frequency (idf) has no effect on the ranking for one-term queries ! &
There is exactly one way to calculate the tf-idf weights
What minimal and maximal values can the following variables have ? -
Consider the case of a query term that is not in the set of indexed terms, what do we do ?
What is the idf of a term that occurs in every document?
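For the last question: with the standard definition, a term that occurs in every document gets idf 0 and therefore contributes nothing to the score:

```latex
\mathrm{idf}_t = \log_{10}\frac{N}{\mathrm{df}_t}
\;\xrightarrow{\;\mathrm{df}_t = N\;}\; \log_{10} 1 = 0
```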
u4
Zipf's Law states that the ith most frequent term has a collection frequency proportional to 1/i
Heaps' Law assumes that the vocabulary can grow infinitely !!!
Elias Gamma Coding is a technique for dictionary compression
Front Coding is a technique for dictionary compression
Large document collections contain many frequent and few rare terms !!!
// contains few frequent items and many rare items &
Gamma coding cannot encode the number zero. Is this a problem for compressing postings lists
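A minimal sketch of Elias gamma encoding for the gap items above (standard algorithm: unary length prefix followed by the binary offset without its leading 1; names are mine). It only encodes integers ≥ 1, which is unproblematic for gap lists since gaps are always ≥ 1:

```python
def gamma_encode(n):
    """Elias gamma code of an integer n >= 1, returned as a bit string."""
    assert n >= 1, "gamma coding cannot encode 0"
    offset = bin(n)[3:]               # binary representation without the leading 1
    length = "1" * len(offset) + "0"  # unary code for len(offset)
    return length + offset

for gap in [1, 2, 3, 4, 9, 13]:
    print(gap, gamma_encode(gap))
# 1 -> 0, 2 -> 100, 3 -> 101, 4 -> 11000, 9 -> 1110001, 13 -> 1110101
```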
u5
External sorting algorithms are used to sort lists which do not fit in main memory
It is impossible to calculate the top-k documents for a query without completely reading
the postings list of each query term
// the idea of NRA? + calculate the top-k docs without completely reading the postings lists
Why can the MapReduce algorithm as described in the lecture not be run in this case? &?? next
For optimal load balancing, the inverters in MapReduce must get segmented postings
files of similar sizes. For a new collection, the distribution is unknown , what should we do ? &
---
u6 &
a) The key measure for a search engine is user happiness.
b) The F_β measure combines both precision and recall into one number.
c) The 11-point interpolated average precision projects the precision-recall curve to a single number.
d) The Mean Average Precision (MAP) is yet another measure for evaluating the result of one query.
e) Pooling means experts manually judge the relevance of each document in the collection.
f) The kappa value is 1 if two judges always agree and 0 if they never agree.
g) Dynamic result summaries can be constructed efficiently from the positional inverted index
How many relevant documents does the
result contain and at which ranks are they?
kappa measure formula - (see below)
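The kappa measure in its usual form (P(A) = observed proportion of agreement, P(E) = agreement expected by chance):

```latex
\kappa = \frac{P(A) - P(E)}{1 - P(E)}
```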
u7 &
a) The main motivation for relevance feedback and query expansion is to decrease recall.
b) The basic idea of the Rocchio algorithm is to move the query vector towards the vectors of
relevant documents and away from the vectors of irrelevant documents.
c) Relevance feedback is widely available in web search engines, because users can easily
understand it and are willing to “tune” their query.
d) Pseudo-relevance feedback can lead to query drift. !!!
e) In thesaurus-based query expansion, we add terms to a query which are semantically related to
the query terms specified by the user.
f) Thesauri can only be built manually, there is no way to build them automatically.
In Rocchio's algorithm, what weight setting for α, β, γ does a “Find pages like this one” search
correspond to? &
Why is positive feedback likely to be more useful than negative feedback to an IR system? &
Why might only using one non-relevant document be more effective than using several? !!! &
u8
a) Probabilistic IR is about estimating the probability that a document is relevant to a query.
b) The odds of an event A are defined as O(A) = P(A) / P(¬A).
c) The classical Binary Independence Retrieval (BIR) model takes term frequencies into account.
d) The BIR model assumes that terms appear independently from each other in the documents.
e) The cluster-hypothesis states, term distributions differ between relevant and irrelevant documents.
f) The system Okapi was a much simpler predecessor of the BIR Model.
in the RSV calculation:
what can be omitted in order to save time? &
Which of the calculations done so far have to be done anew for the new query? &
The classical BIR model does not consider term frequencies.
Is there a way to take these into account
as well?
we derived the formulas for the Retrieval Status Value:
We made the assumption that terms appear independently of each other in a document.
Provide an example of two terms for which this assumption is likely to hold, and one for which it is not.
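For the RSV questions above, the standard BIR retrieval status value (as in the IIR book, with p_t = P(term present | relevant) and u_t = P(term present | non-relevant); the per-term weights c_t are query-dependent but document-independent, so they can be precomputed):

```latex
RSV_d \;=\; \sum_{t \,\in\, q \cap d} \log \frac{p_t\,(1 - u_t)}{u_t\,(1 - p_t)}
      \;=\; \sum_{t \,\in\, q \cap d} c_t
```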
u9 &
a) A statistical language model (LM) assigns probabilities to strings of symbols from some alphabet.
b) For every unigram LM, cats hunt mice and mice hunt cats have the same probability.
c) Language models are completely unrelated to Markov chains.
d) We use LMs in IR like this: (step 1) we derive a LM from each document, (step 2) we rank all
documents by the probability that their LM generates the query.
e) From a vocabulary containing V terms we can construct approximately V^n n-grams.
f) A document containing d tokens contains approximately d^n n-grams.
// contains d vocabularies not tokens
g) In practice, language models in IR consider 𝑛-grams with 𝑛 ≥ 3, i.e., at least tri-grams.
h) The shorter the query, the more important is smoothing
// longer ? or no relationship ??
u10
a) The size of the web can easily be determined by crawling the web and counting the pages.
b) In the context of web search, information needs are the only subclass of user needs.
c) Shingling is a technique for detecting near duplicates of web pages.
d) The average out-degree of all web pages is higher than their average in-degree.
// why -- they may be equal
remember !!! &
e) The PageRank algorithm ranks web pages by the number of occurrences of the query terms.
f) The PageRank of a web page is query-dependent.
g) The HITS-algorithm provides two scores per web page: an authority score and a hub score.
h) The HITS-scores of a web page are query-dependent
what happens as α becomes close to 1? &
u11
a) Classification is about assigning a document to one (or more) out of several predefined categories.
b) The accuracy of a classifier is always one minus its error rate.
c) When using supervised learning for classification, we first train a classifier on unlabeled
documents.
d) Overfitting means that a classifier is too general.
e) With n-times k-fold cross-validation, we partition a set of n labeled documents into k (nearly)
equally-sized subsets and for each subset S_i we train a classifier on the documents in S_i and
evaluate its performance by applying it to the documents in S_i.
f) Given a set of labeled documents, the sequential covering algorithm determines a set of rules for
rule-based text classification
g) With probabilistic classification, a document d is assigned to category c_i if the probability P(c_i | d)
is the maximum of the probabilities P(c_k | d) for all categories c_k.
h) Feature selection can reduce the training time of a classifier, but it cannot improve its quality.
i) The Rocchio approach for vector-based classification assumes that the document vectors in each
category are close to each other, but distant from the document vectors in the other categories.
j) The kNN classifier assigns a doc. d to the k categories whose centroid vectors are closest to the vector of d.
k) The idea of support vector machines (SVMs) is to separate the vector space using an optimal
hyperplane and to assign a document d to one of two classes depending on whether its vector lies on the
one or on the other side of the hyperplane.
l) The SVM approach can only be applied to linearly separable datasets
difference between Bernoulli and multinomial ? &
answers to the Ubungen // or refer to the answer documents
u1
calculate (NOT y) first as a new postings list, which takes O(N) time.
where N is the total number of documents in the collection
bounded by the number of documents
not optimal, just a greedy method
u2
t t t t f t
u3
f f t f
u4
see ubung pdf
u5
partition by docid as well as term for very frequent terms
use one machine first to find the distribution of terms starting with the various letters of the alphabet
u6
u7
But, in practice, it has been shown to be most useful for increasing recall
might not be conveyed properly to the IR system which results in low precision output
u8
u9
u10
u11
the multinomial model uses term frequencies whereas the Bernoulli model uses document frequencies to estimate the probabilities.
The multinomial model estimates the probability from the number of occurrences of the term in the collection;
the Bernoulli model, on the other hand, estimates the probability by the number of
documents which contain the word
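A compact sketch contrasting the two estimators described above, with add-one smoothing; the toy documents and function names are illustrative, not taken from the Übung:

```python
# toy training data: (class, tokens)
docs = [("china", "chinese beijing chinese".split()),
        ("china", "chinese chinese shanghai".split()),
        ("china", "chinese macao".split()),
        ("other", "tokyo japan chinese".split())]
vocab = {t for _, toks in docs for t in toks}

def multinomial_cond_prob(term, cls):
    """P(term | cls) from term occurrences (multinomial model, add-one smoothing)."""
    cls_tokens = [t for c, toks in docs if c == cls for t in toks]
    return (cls_tokens.count(term) + 1) / (len(cls_tokens) + len(vocab))

def bernoulli_cond_prob(term, cls):
    """P(term | cls) from document frequency (Bernoulli model, add-one smoothing)."""
    cls_docs = [toks for c, toks in docs if c == cls]
    df = sum(1 for toks in cls_docs if term in toks)
    return (df + 1) / (len(cls_docs) + 2)   # +2 in the denominator: term present / absent

print(multinomial_cond_prob("chinese", "china"))  # (5+1)/(8+6) = 3/7
print(bernoulli_cond_prob("chinese", "china"))    # (3+1)/(3+2) = 4/5
```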
---
key concepts
----lev 1 Basic Part + Boolean
frequent errors
Stemming, wildcard queries, phrase queries, Spelling Correction x3
ch1 -
what is IR? &
what is unstructured data (usually text)? & -
what are data, information and knowledge &
IR system architecture :: structure of an IR system, diagram p15 & -
usage of IR Systems &
compare IR and databases & - x6
ch2 -
Why is grep not the solution & p6 -- x3
answer the query with the help of the term-document incidence matrix? example & - p8
why not used? & -
why inverted ? Inverted Index &
what is a postings list? & - p11
postings list construction, 3 steps &
how are the dictionary and postings stored &
What is the best order for processing this query? &
how to skip ? &
why skip lists are not used now ? & x2
Westlaw characteristics & x2
how to use Bi-word index to do Phrase Queries &
Phrase Queries methods x3
key idea of Extended Bi-words &
why Bi words not used ? &
what is Positional index &
how can a positional index be used for proximity search &
advantages of the positional index ::
1. doesn't require much space (compared to indexing all biwords/phrases)
2. supports proximity search
what is the Combination scheme of those two & -
What are “good” bi-words & -
ch3
how to determine the language of a text? & -
what is &
Term, Morpheme, Inflection,
Derivation, Compound (Kompositum), Noun Phrase (NP)
problems when reading words in IR? & *5
what is &
Grammatical markings(POS Tags),Stemming,
Lemmatization,Case Folding,Stop words
key idea of Soundex &
ways of Stemming & x3
Dice’s coefficient and Jaccard diff. &
Successor Variety &
advantages of it & x2
n-gram stemmers: & ::statistics statistical 0.8
Porter's Stemmer explain & -
Stemming?or not &
Dangers of stemming &
Part-of-Speech Tagging what is &
how can we do Part-of-Speech Tagging & x2
edit distance &
ch4
what data structures can be used for dictionaries &
Wildcard queries x3 & - example :: look up [hel*o]
Wildcard queries vs Phrase Queries &
why not used
compare permuterm index and k-gram index & -
why not used ?
Spelling correction, three kinds &
it is expensive, so how to solve this?
Issues when using Spelling correction & Problems
----lev 2 Basic Rank
ch5
compare: rank or not & -
Why rank & x1
Jaccard coefficient to Scoring,how? &
problems: & -
https://datascience.stackexchange.com/questions/15862/how-to-compute-the-jaccard-similarity-in-this-example-jaccard-vs-cosine
Cosine similarity and Jaccard similarity differ.
what is tf-idf &
tf-idf weighting, how &
tf-idf formula & -
tf-idf compare & - those two
under what conditions is the weight 0? & x2
The vector space models before and now & x3
why Euclidean distance is a bad idea & -
different weightings &
lnc.ltn
ltn.bnn
what each of these means; write down the formulas (see the sketch below)
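A minimal sketch of the tf-idf pieces listed above: log term frequency ('l'), idf ('t') and cosine normalization ('c') as they appear in SMART codes such as lnc.ltn (function names and the example numbers are mine):

```python
import math

def log_tf(tf):
    """Logarithmic term-frequency weight: 1 + log10(tf), or 0 if tf == 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

def idf(df, N):
    """Inverse document frequency: log10(N / df)."""
    return math.log10(N / df)

def cosine_normalize(weights):
    """'c' normalization: divide each weight by the vector's Euclidean length."""
    length = math.sqrt(sum(w * w for w in weights))
    return [w / length for w in weights] if length else weights

# example 'ltn' weight: tf = 3 in the document, df = 100, N = 1,000,000
print(log_tf(3) * idf(100, 1_000_000))   # (1 + log10 3) * 4 ≈ 5.91
```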
ch6
why compression & x2 -
what is Lossy vs. lossless compression & - -
Heaps' law: meaning?
its meaning for IR systems &
Zipf's law: meaning?
Dictionary as a string & where the main space saving comes from
why are 3 bytes needed here &
Dictionary as a string with blocking & where the main space saving comes from
why can't the blocks be too large &
how to trade off &
Front coding &
how to store gaps & x2
Variable length code &
Variable byte code &
Gamma codes for gap encoding & calculation - p39, 13, four steps &
Gamma: some advantages & - explain - x2
Gamma is seldom used in practice & - explain
ch7
External Memory Sort &
meaning of the abbreviation &
Blocked Sort-Based Indexing (BSBI)
idea & - algo & -
Single-Pass In-Memory Indexing (SPIMI)
algo & -
Caching strategies x2 &
two kinds of compute nodes (zwei Arten von Rechner-Knoten): & -
advantages and disadvantages of distributed index construction & - NEXT
methods of distributing &
why can it be parallelized? & -
MapReduce Architecture & - NEXT
Delta index: how does it work? & when is the delta index written to disk: &
ch8
Term-at-a-Time (TAAT) and Document-at-a-Time (DAAT) &
NRA: No Random Accesses, explain how it works & -
termination condition &
p29 formulas & - x5: min-k, unseen, worst, best, high(t)
ch9
how to measure happiness? & - p5 explain x4 2+2
Precision and Recall
formula p9 & - (see the sketch at the end of this chapter)
F-Measure Precision/recall tradeoff
formula & - p13
Cost-based measure & calculation
Accuracy :: proportion of TP and TN
formula & -
calculation &
Macro average (precision) and Micro average (precision)
p19 calculation & is there a problem with the formula? apparently not
reason why the PR curve is sawtooth-shaped? & -
what is included in benchmarks & x3
Kappa statistic: why? &
Pooling strategy -- judging every document is too much work; only the retrieved (pooled) documents are judged &
A/B Testing &
what to describe in summaries &
the goal of dynamic summary generation is to pick snippets that satisfy the following conditions & p49
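A small sketch of the basic evaluation measures above, computed from the contingency counts TP, FP, FN, TN; the F formula is the general F_β with β = 1 by default (the example numbers are made up):

```python
def precision(tp, fp):        return tp / (tp + fp)
def recall(tp, fn):           return tp / (tp + fn)
def accuracy(tp, tn, fp, fn): return (tp + tn) / (tp + tn + fp + fn)

def f_measure(p, r, beta=1.0):
    """F_beta = (beta^2 + 1) * P * R / (beta^2 * P + R); beta = 1 gives the harmonic mean."""
    return (beta**2 + 1) * p * r / (beta**2 * p + r)

# example: 40 docs retrieved, 20 of them relevant, 30 relevant overall, 10000 docs in total
tp, fp, fn = 20, 20, 10
tn = 10000 - tp - fp - fn
p, r = precision(tp, fp), recall(tp, fn)
print(p, r, f_measure(p, r), accuracy(tp, tn, fp, fn))  # 0.5, ~0.667, ~0.571, 0.997
```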
----lev 3 Ranking system
ch10
four different examples with picture & - for user feedback -next
idea of Rocchio Algorithm to do feedback &
centroid, optimal query formula & (see the formula after this chapter's list)
there is a figure explaining this formula & -:
Used in practice &
what effect each parameter has & -
Positive versus negative feedback & -
When Does Relevance Feedback Work? & ??
Evaluation of feedback: how many iterations are enough &
problems, reasons why it does not work & - x3
pseudo-relevance feedback problems &
but does it work well? &
what can we do for indirect (implicit) relevance feedback &
Query expansion methods & x3
based on statistics: the term correlation matrix
calculation & - NEXT p48
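The Rocchio feedback formula referenced above, in its standard form (D_r / D_nr are the sets of known relevant / non-relevant document vectors):

```latex
\vec{q}_m \;=\; \alpha\,\vec{q}_0
          \;+\; \beta\,\frac{1}{|D_r|}\sum_{\vec{d}_j \in D_r}\vec{d}_j
          \;-\; \gamma\,\frac{1}{|D_{nr}|}\sum_{\vec{d}_j \in D_{nr}}\vec{d}_j
```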
ch11
odds of an event & f
Retrieval Status Value (Retrievalstatuswert) & NEXT TUI
formula
meaning &
test of independence (Test auf Unabhängigkeit) formula &
probabilistic retrieval model: idea &
Binary Independence Retrieval model (BIR) & full name
Okapi formula & calculation &
document scoring compared across models: & -
advantages and disadvantages of probabilistic IR & - NEXT explain x2 x2
ch12
language model: idea & -
what is the language of an LM &
what does the double-circle node correspond to &
describe the finite automaton (EA) & - explain
calculate P(Katzen fangen Mäuse) & -
why are bigram language models not used & - explain
can LMs be represented by Markov chains (kann man LM durch Markow-Ketten darstellen) & - horizon (how far we look ahead)?
retrieval function, p20 formula & -
Jelinek-Mercer Smoothing & -
multiplying small values leads to numerical inaccuracies (Multiplikation kleiner Werte führt zu Rechenungenauigkeiten) & - how to handle & (see the sketch below)
LMs vs. BIR x1
similarities & - NEXT
differences & - NEXT
LMs vs. the vector space model (Vektorraummodell)
similarities & - NEXT
differences & - NEXT
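A sketch of query-likelihood scoring with Jelinek-Mercer smoothing, summing logs to avoid the underflow mentioned above; λ and the toy texts are illustrative, not the lecture's example:

```python
import math

def lm_score(query_terms, doc_terms, collection_terms, lam=0.5):
    """log P(q | d) with Jelinek-Mercer smoothing:
       P(t | d) = lam * tf_td / |d| + (1 - lam) * cf_t / |C|; summing logs avoids underflow."""
    score = 0.0
    for t in query_terms:
        p_doc = doc_terms.count(t) / len(doc_terms)
        p_col = collection_terms.count(t) / len(collection_terms)
        score += math.log(lam * p_doc + (1 - lam) * p_col)
    return score

doc = "Katzen fangen Mäuse gern".split()
collection = doc + "Hunde fangen Katzen nicht".split()
print(lm_score("Katzen fangen Mäuse".split(), doc, collection))
```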
ch13 web basics, +++++++++
why is the Web so messy and fast-changing & x1
characteristics of web IR & x3
static vs. dynamic pages & - explain
measure the size of the web & web size estimation &
the directed-graph structure of the whole Web? & - x6
spam problem: which methods are used to cheat the ranking? & - x2
characteristics of Google's user experience: & - x3
into which three broad classes can ordinary web search queries be divided? & -
the Mercator crawler, five modules &
example in the slides & - understand it and draw the path in the figure
the duplicate problem on the web & x2
specific characteristics of web search &
characteristics of web documents, the four main properties & -
Crawling, two characteristics? &
How are ads ranked & - explain x2+n
second-price auction & and calculation, p47
Anybody can participate and bid on keywords
the two main problems:
Keyword arbitrage
Violation of trademarks
common spam techniques & x2
ch14
the significance of link analysis & -
assumptions of the PageRank process: & - x2
topic-sensitive PageRank: the simple approach &
how to handle mixed (topic) cases? &
PageRank characteristics & - x2
HITS iterative update formulas & -
computational optimization? & - hyperlink-induced topic search
idea & -
problems: & - x1
HITS characteristics & - x2
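A minimal power-iteration sketch for the PageRank formula items above; the tiny graph is illustrative, and note that some slides write the teleport probability as α = 1 − d rather than the damping factor d used here:

```python
def pagerank(links, d=0.85, iterations=50):
    """Power iteration for PR(p) = (1 - d)/N + d * sum over q linking to p of PR(q)/outdeg(q)."""
    pages = list(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {}
        for p in pages:
            incoming = sum(pr[q] / len(links[q]) for q in pages if p in links[q])
            new[p] = (1 - d) / n + d * incoming
        pr = new
    return pr

# toy graph: A -> B, A -> C, B -> C, C -> A
print(pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]}))
```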
----lev 4 ML classification
ch15
what are the basic approaches to handling spam e-mail & p7 x2
machine learning (Maschinelles Lernen), how it works & draw the diagram
*overall: evaluation of classification & x3 + 2
basic
Accuracy and Error Rate
Recall, Precision and F1
formula & -
extensions
Confusion Matrix p27 pic, draw it
cross validation x3
*Undersampling, oversampling, hybrid methods (Hybrid-Verfahren) & -- deal with data skew
for undersampling methods it is enough to name them & x4
Oversampling & x2: name them, abbreviation and full name
Feature scaling x3
*Feature selection (Feature-Auswahl) x3 &
ch16 - algorithm details, how to optimize further
form of the rules (Form der Regeln) &
*Evaluation of rules
basic measures (Grundlegende Maße), the two metrics, formulas &, calculation & p15
too large or too small?
three quality measures (Drei Qualitätsmaße), formula &, p21 ?? NEXT
combine the two values
e.g. & p26
*Sequential Covering,
idea &,
formula &, e.g. p33 &&&,
*algorithm p30
post-processing (Nachbearbeitung) x2 + x2
ch17 - the formulas take some effort
*Bayesian classification:
why Naïve & x1
why not Naïve & x2
Multinomial Naïve Bayes: idea & formula & calculation &
the formulas at three levels
*first level -
basics: P(c_j | d), the prior P(c_j), estimating P(t_k | c_j), gamma(d)
how to do smoothing?
+ the corresponding code & arrays ...
+ what does the ML model actually store, what does it learn
+ is it Bernoulli Naïve Bayes when term freq = 1? false! why
Bernoulli Naïve Bayes: idea & formula &
also three levels
smoothing? why + 2 in the denominator
remarks (Anmerkungen) & x2 advantages
current applications & x1
ch18
*Classification in the vector space (Klassifikation im Vektorraum)
Vektorraum: vector space
explain &
Rocchio classifier (Rocchio-Klassifikator)
idea &
formula &
*Problems with Rocchio & x2
kNN
process & x2
*choice of k (Wahl von k)? & x2-3
variants & x2
ch19 - SVM details
representation of the hyperplane via the plane equation (Darstellung Hyperebene mit Ebenengleichung) :: p9
procedure for more than two categories? Multiclass SVMs?, explain & p10 x3 steps
:: when there are more than two classes x2
*Training phase: perceptron (Trainingsphase: Perzeptron)
what are R and eta (the learning rate) &
procedure, explain & x3 steps
what if the learning rate is large? &
formula of the perceptron &
what is the alpha factor in the formula &
dual form &
what have we learned? &
SVM
*SVM, linear case (Linearer Fall)
expression for the distance (margin) &
simple procedure (Einfaches Verfahren) &
formula p30 & -
it can be solved with Lagrange multipliers or quadratic programming
- well-studied solution algorithms
why larger than 1? (constraint formula)
// dual formulation (Duale Formulierung) & -
example p33 &
*extended optimization procedure (Erweitertes Optimierungsverfahren)
using slack variables in the SVM :: introduce slack variables
formula & p38
influence of C, explain & -
tradeoff parameter (chosen by cross-validation)
what is m :: the number of slack variables?
condition? &
*name them - Kernel trick -- non-linear decision boundaries (Nichtlineare Entscheidungsgrenzen)
key idea &
kernel types & x4
RBF kernel -- Gaussian (Gauß'scher) kernel
Polynomial kernel
Sigmoid kernel
Cosine similarity kernel
Chi-squared kernel
how to choose? 10k
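A sketch of the primal perceptron training loop referenced above (η is the learning rate; the data and names are illustrative toy values, not from the slides):

```python
def perceptron_train(X, y, eta=1.0, epochs=100):
    """Train w, b so that sign(w·x + b) matches y in {-1, +1}; update only on mistakes."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, label in zip(X, y):
            if label * (sum(wi * xi for wi, xi in zip(w, x)) + b) <= 0:
                w = [wi + eta * label * xi for wi, xi in zip(w, x)]
                b += eta * label
                mistakes += 1
        if mistakes == 0:          # converged: the data is linearly separable
            break
    return w, b

# toy linearly separable data
X = [[2.0, 1.0], [1.0, 3.0], [-1.0, -1.0], [-2.0, 1.0]]
y = [1, 1, -1, -1]
print(perceptron_train(X, y))
```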
---
Language training
base::
generic storage...
similar. general
math
calculus..
Introduction to Calculus
regular polyhedra
Sets, Functions, Graphs and Limits
Elementary Set Theory, Subsets, Set Operations, Coordinate Systems
The Exponential and Logarithmic Functions
The Hyperbolic Functions
Symmetry of Functions, Translation and Scaling of Axes,
Equations of Lines
Parabola, Ellipse, Hyperbola
Conic Sections in Polar Coordinates
Differential Calculus
The Derivative of
take the derivative of both sides
And then we take the derivative.!!!!
-r now, we are going to take the derivative
Integral Calculus
Integrate
Table of Integrals
I will integrate the function on that region !!!!
-r now, we are going to integrate this function on the region
Method of Partial Fractions
The Definite Integral
Improper Integrals
Cylindrical Coordinates
applications of
Spring-mass System
Damping Forces
Mechanical Resonance
ch1
Information retrieval
is the task,
given a set of documents and a user query,
finding the relevant documents
// find the relevant docs , according to the collection and the queries
Automated information retrieval systems are used to reduce
what has been called information overload
// why should we build IR systems: there is a huge amount of information that we have to manage
An IR system is software that provides access to books,
journals and other documents, stores them and manages them.
// we can access these books using the system
Web search engines are the most visible IR applications.
// Medical Imaging is the most visible application in CG
the information retrieval process begins when a user enters a query into the system
Queries are formal statements of information needs,
search strings in web search engines
with different degrees of relevancy.
opposed to classical SQL queries of a database
compute a numeric score on how well each object in the database matches the query
// the key idea of our system is to compute a numeric score to rank the docs.
so as to provide a very quick access
The process may then be iterated if the user wishes to refine the query
Now the world has changed, and hundreds of millions of people engage
in information retrieval every day when they use a web search engine or search their email
web search engine :: pronunciation
unstructured text
does not have a pre-defined data model
// has no data model , when it compared to data in databank
typically text-heavy
an ambiguous word/term/statement
This results in irregularities and ambiguities that make it
difficult to understand using traditional programs
80-90% of all potentially usable business information may originate in unstructured form
Other sources have reported similar or higher percentages of unstructured data
majority of that will be unstructured
// the majority of data in our daily life is unstructured
The earliest research into business intelligence focused in on unstructured textual data
// at the very beginning, IR was used in business intelligence
the extraction and classification of unstructured text
Approaches in medicine and biomedical research
biomedical documents include self-organizing map
approaches for identifying topics among documents
ch2
what is a model
Mathematical models are used to study the properties of
the process, draw conclusions, make predictions
derived from ..
Boolean retrieval is something that is very precise
Boolean model of information retrieval (BIR)
The BIR is based on Boolean logic and classical set theory
both the documents and the user's query are conceived as sets of terms.
// bag of words model
Retrieval is based on whether or not the documents contain the query terms.
// it ignores term frequency and just checks whether the term is contained in the doc.
why inverted
postings file or inverted file
a mapping from content (such as words) to the documents containing it
named in contrast to a forward index, which maps from documents to content
//
it is named in contrast to a forward index
a forward index maps from documents to content
an inverted index maps from words to documents
The purpose of an inverted index is to allow fast full-text searches
//
to enable fast full-text search, we generate an inverted index for the collection
It is the most popular data structure used in document retrieval systems
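A tiny sketch of building an inverted index (term → sorted list of docIDs) from a toy collection, matching the word-to-documents mapping described above (the documents are made up):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of docIDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {1: "new home sales top forecasts",
        2: "home sales rise in july",
        3: "increase in home sales in july"}
index = build_inverted_index(docs)
print(index["home"])   # [1, 2, 3]
print(index["july"])   # [2, 3]
```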
Phrase Queries
A word query is a query on a word or phrase
// a phrase can be 2 or more words :: the WTO, black Alabama ...
search for documents containing an exact sentence or phrase
searching for a certain string of text
using quotation marks (") around a specific phrase to
indicate that you want to search for instances of that search query
//
there should be quotation marks around this string
ch3
a morpheme is the smallest meaningful unit in a language
// meaningful unit
stemmer
handling of word endings by reducing the words to their word roots
// find the root of words
Stemming broadens our results to include both word roots and word derivatives
to improve recall
// stemmers are used to increase recall in IR systems
A stemming algorithm is an algorithm that converts a word to a related form
conversion of plurals to singulars.
It is the most effective and widely used
it can only be applied to text in the English Language
// the porter stemmer
the user specifies a word in a query but only a variant of this word is present in a relevant document
This problem can be partially overcome with the substitution of stems for the words
A stem is the portion of a word that is left after the removal of its affixes (i.e., prefixes and suffixes)
improving retrieval performance
affix removal, table lookup, successor variety
determination of morpheme boundaries
and n-grams.
might require considerable storage space
reduce the size of indexing files
In summary, the successor variety stemming process has 3 parts:
develop a stemmer that required little or no human processing.
// the system is supposed to require little human processing
// the system should need little interaction with humans
it facilitates the construction of an attractive user interface
to test the correctness of our work
// to check if it is correct
we have achieved an 80% level of correctness
Several advantages of the Successor Variety Algorithm can be observed
-- without the need to use a dictionary
-- it is basically (domain independent)
Another advantage is that it can be used in several domains; it is basically (domain independent).
National Computer Conference and Exhibition
Part-of-Speech
The part of speech indicates how the word functions in
meaning as well as grammatically within the sentence.
// find the function of the words we are trying to tag in the sentence.
ch4
encyclopedia
Wildcard character
a wildcard character is a kind of placeholder represented by a single character
an asterisk
can be interpreted as a number of literal characters or an empty string
used in Regular expressions
doc* matches doc and document but not dodo
k gram
an n-gram is a contiguous sequence of n items from a given sample of text or speech
n-grams may also be called shingles
a type of probabilistic language model
n-gram models are widely used in statistical natural language processing.
Space–time tradeoff
an algorithm or program trades increased space usage with decreased time
Using stored knowledge or encoding stimuli reactions as "instincts"
in the DNA avoids the need for "calculation" in time-critical situations
More specific to computers, look-up tables have been implemented since the very earliest operating systems
Spelling Correction
spell checker :: to check whether the spelling of a word is correct
Levenshtein distance
k gram
Context-sensitive process
a software feature that checks for misspellings in a text.
n-grams, to recognize errors instead of correctly-spelled words
with phonetic information
Gorin wrote SPELL in assembly language, for faster action
available on mainframe computers
PCs with sufficient memory.
For example, the concatenation of "snow" and "ball" is "snowball
string theory, string concatenation is a primitive notion.
Recreational mathematics
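A standard dynamic-programming sketch of the Levenshtein distance used for spelling correction above; the three operations are insertion, deletion and substitution:

```python
def levenshtein(s, t):
    """Minimum number of insertions, deletions and substitutions to turn s into t."""
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                     # delete all of s[:i]
    for j in range(n + 1):
        d[0][j] = j                     # insert all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[m][n]

print(levenshtein("kitten", "sitting"))  # 3
```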
ch5
Jaccard with k-Grams
Jaccard similarity
//
k-grams are just a simple concept and can be used in many algorithms.
Jaccard measures the similarity of two sets, e.g. sets of k-grams
is a statistic used for comparing the similarity and diversity of sample sets.
the size of the intersection divided by the size of the union of the sample sets
If A and B are both empty, we define J(A,B) = 1.
Intersection over Union
ratio of the size of the symmetric difference
Weighted Jaccard similarity and distance
Probability Jaccard similarity and distance
Tanimoto similarity and distance
Paul Jaccard
Cosine similarity is for comparing two real-valued vectors,
Jaccard similarity is for comparing two binary vectors (sets).
ch6
Zipf's law
ch7
ch8
ch9
Kappa statistic
a statistic which measures inter-rater agreement
it is conceptually simpler to evaluate disagreement between two raters
The seminal paper introducing kappa as a new technique was published by Jacob Cohen
between two raters only; it cannot measure the agreement among more than two raters
Fleiss' kappa
The Fleiss kappa, however, is a multi-rater generalization of Scott's pi statistic
Kappa is also used to compare performance in machine learning
Statistical significance
type of intraclass correlation