Skip to main content

Discussion of search

Pinned

Comments

5 comments

  • Fletcher Penney

    Recap of text search, since this comes up in various comments/feature suggestions periodically...

     

    The requirements for search functionality in nvUltra include:

     

    1. Fast -- some users have large numbers of large documents (e.g. PDFs) and search should be as fast as possible.  Ideally it would be O(1), but I'm not sure that is possible.  So O(n) is a more reasonable goal (where n is the number of documents).  Possibly O(n*m) (where m is size of a document), depending on algorithm speed, I suppose, but even this is likely to be too slow.  nvUltra's search algorithm is O(n) -- searching a long document is just as quick as searching a short document, and depends only on the number of documents to be searched.  (Technically, it also depends on the complexity of the search -- number of terms, nesting complexity, AND/OR's, etc.  But that is a separate issue.)  Also, when typing in the search bar the results need to update with each keystroke.  This means the search will be performed for each letter typed, so it has to be fast and with minimized resource usage.

    2. Limit disk access -- This intersects with #1, but is also important to consider separately.  Reading each file to be searched, and performing the search in the raw file works, but requires loading each file into memory to perform the search, or keeping each file in memory at all times.

    3. Limit memory usage -- This intersects with #2 -- keeping the full text of each file in memory opens up additional full text search options, but vastly increases memory usage.  One can imagine a folder of 500 PDFs, each of which are 2-10 Mb (or more).  This is an unrealistic amount of RAM to tie up constantly, especially on an iOS device.

    4. Minimize unnecessary disk usage -- Index files to improve search efficiency should not take up as much space as the original files.  This would effectively double the disk requirements of your notes folders, which is also unrealistic, especially on iOS devices.

    5. Ability to search a wide range of files -- Unlike nvAlt, for example, nvUltra can search the contents of PDFs in addition to plain text files.  Over time, I may add support for additional file types, but PDFs were an important one for me.  This matters because of #2 and #3 above, as well as the fact that straight regular expression matching of the raw PDF file won't work properly.

    6. Focus on the types of searches performed most of the time -- yes.  I get it.  Everyone would love to have full regular expression support, optional case sensitivity/insensitivity, etc.  But the vast majority of the time users are simply typing a keyword.

     

    # Implications of these requirements #

    Basically this means that I can't simply search the raw text of each file for each query.  This (to my knowledge) *essentially* rules out arbitrary regular expressions, and requires an indexing approach of some sort.

    Indexing requires tradeoffs between index size on disk, memory usage, and search functionality (case sensitivity, regular expressions, etc.)

     

    # nvUltra's approach #

    Briefly, my indexing algorithm:

    * breaks text into "words" (which can be words as normally thought of in language, or sequences of numbers, or hyphenated/underscored word combinations such as "foo_bar")

    * strips diacritics (e.g. `Bücher` becomes `Bucher`)

    * performs lower-casing (for ASCII) or case-folding (for non-ASCII UTF-8) so that "FOO" and "foo" are the same

    * Indexes each word, as well as the leading sub-words for partial matching while typing.  For example, "book" is indexed as "b", "bo", "boo", and "book".  This is limited when it comes to really long words to avoid bloating the index too much.

    * The index is designed so that it will never offer a false negative (if the index reports no match for a given term, then that term is not in the document), but there is a very small chance of false positives (if the index reports a match, there is a very small chance that the term was not actually in the document -- roughly 2 in 10,000).  This is a consequence of the index design which offers many other advantages.

    * The index is much smaller than the source document, unless the document itself is also very small.  The index is designed to grow as necessary for large documents (since a 100k word novel requires a longer index than a 140 character tweet, for example).

    * The live index is stored in memory while using the app, and cached on disk when a folder is closed.  This prevents having to reindex a folder every time it is opened.  This means that a folder of 100s of PDFs can be "instantly" searched without having to store all of the PDFs in memory.

     

    # Limitations #

    * Because the search is performed on an index instead of the original text, regular expressions won't work.  To my knowledge and research, supporting regular expressions, WHILE MEETING THE ABOVE CRITERIA, is not possible.  An index sufficiently detailed to allow full regular expression support would be a 1:1 index of the source text, which defeats the purpose of indexing.

    * Similarly, some files (e.g. program source text) can be filled with a lot of "garbage characters" that is not particularly useful when indexed.  This can also include punctuation (indexing whether a document has a period, for example, is not very helpful for 99% of users).  There is a direct tradeoff between how non-letter characters are treated and the size of the index.  At this time, nvUltra indexes numbers and letters (and should index letters in any UTF-8 alphabet).  It does include hyphens and underscores as "intra-word" characters so that hyphenated words are indexed as the entire word, and as individual words ("foo-bar" is indexed as "foo-bar", "foo", and "bar").

    0
  • Fletcher Penney

    If you have suggestions for alternate search algorithms, I am definitely interested in hearing about them.  Please send links to descriptive web pages, but also any relevant publications as well.

     

    1
  • Michael Sollami

    In nvAlt, as you typed the matching snippets in each text note would be shown and highlighted in the results. I really miss that feature! Also very minor point, but the heights of each row in the results are too large.

    4
  • Daniel Schmidt

    I really feel the need of being able to search only filenames/titles.

     

    Say I prefix my work projects with "work", if I search for "work" it shows all notes containing "work" (which may be hundreds or more).

    I'd love to see an option to only show filenames containing "work".

     

    What's your take on this matter?

    0
  • Alessandro Melchiorre

    would it be possible an inline tag search à la Workflowy?

    Thanks

     

    0

Please sign in to leave a comment.

Powered by Zendesk