<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>indexing | UCSC OSPO</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/tag/indexing/</link><atom:link href="https://deploy-preview-1007--ucsc-ospo.netlify.app/tag/indexing/index.xml" rel="self" type="application/rss+xml"/><description>indexing</description><generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><lastBuildDate>Sat, 23 Aug 2025 00:00:00 +0000</lastBuildDate><image><url>https://deploy-preview-1007--ucsc-ospo.netlify.app/media/logo_hub6795c39d7c5d58c9535d13299c9651f_74810_300x300_fit_lanczos_3.png</url><title>indexing</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/tag/indexing/</link></image><item><title>End-term Blog: StatWrap: Cross-Project Searching and Classification using Local Indexing</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/northwestern/statwrap/20250823-debangi29/</link><pubDate>Sat, 23 Aug 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/northwestern/statwrap/20250823-debangi29/</guid><description>&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Heading" srcset="
/report/osre25/northwestern/statwrap/20250823-debangi29/image0_hu69efae69f006c4366342bdc2ded8b248_187729_f9e5e16b2001b9950ad995b2c786abc9.webp 400w,
/report/osre25/northwestern/statwrap/20250823-debangi29/image0_hu69efae69f006c4366342bdc2ded8b248_187729_27bc4379277ab462935158b3db96d992.webp 760w,
/report/osre25/northwestern/statwrap/20250823-debangi29/image0_hu69efae69f006c4366342bdc2ded8b248_187729_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/northwestern/statwrap/20250823-debangi29/image0_hu69efae69f006c4366342bdc2ded8b248_187729_f9e5e16b2001b9950ad995b2c786abc9.webp"
width="760"
height="392"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h1 id="introduction">&lt;strong>Introduction&lt;/strong>&lt;/h1>
&lt;p>Hello everyone!&lt;br>
I am Debangi Ghosh from India, an undergraduate student at the Indian Institute of Technology (IIT) BHU, Varanasi. As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/northwestern/statwrap/">StatWrap: Cross-Project Searching and Classification using Local Indexing&lt;/a> project, my &lt;a href="https://drive.google.com/file/d/1dxyBP2oMJwYDCKyIWzr465zNmm6UWtnI/view?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a>, under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/luke-rasmussen/">Luke Rasmussen&lt;/a>, focuses on developing a full-text search service within the StatWrap user interface. This involves evaluating different search libraries and implementing a classification system to distinguish between active and past projects.&lt;/p>
&lt;h1 id="about-the-project">&lt;strong>About the Project&lt;/strong>&lt;/h1>
&lt;p>As part of the project, I am working on enhancing the usability of StatWrap by enabling efficient cross-project search capabilities. The goal is to make it easier for researchers to discover relevant projects, notes, and assets across both current and archived work, using information that is either user-entered or passively collected by StatWrap.&lt;/p>
&lt;p>Given the sensitivity of the data involved, one of the key requirements is that all indexing and search operations must be performed locally. To address this, my responsibilities include:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Evaluating open-source search libraries&lt;/strong> suitable for local indexing and retrieval&lt;/li>
&lt;li>&lt;strong>Building the full-text search functionality&lt;/strong> directly into the StatWrap UI to allow seamless querying across projects&lt;/li>
&lt;li>&lt;strong>Ensuring reliability&lt;/strong> through the development of unit tests and comprehensive system testing&lt;/li>
&lt;li>&lt;strong>Implementing a classification system&lt;/strong> to label projects as “Active,” “Pinned,” or “Past” within the user interface&lt;/li>
&lt;/ul>
&lt;p>This project offers a great opportunity to work at the intersection of software development, information retrieval, and user-centric design—while contributing to research reproducibility and collaboration within scientific workflows.&lt;/p>
&lt;h1 id="deliverables">&lt;strong>Deliverables&lt;/strong>&lt;/h1>
&lt;p>The project has reached the end of its scope after 12 weeks of work. Here&amp;rsquo;s a breakdown:&lt;/p>
&lt;h2 id="1-descriptive-comparison-of-open-source-libraries">&lt;strong>1. Descriptive Comparison of Open-Source Libraries&lt;/strong>&lt;/h2>
&lt;p>I compared various open-source search libraries on evaluation criteria such as &lt;strong>indexing speed, search speed, memory usage, typo tolerance, fuzzy searching, partial matching, full-text queries, contextual search, Boolean support, exact word match, installation ease, maintenance, documentation&lt;/strong>, and &lt;strong>developer experience&lt;/strong>. We then decided on the weight to assign to each criterion and used the weighted scores to identify the best library to use. The results under our assigned weights are shown below:
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Evaluation" srcset="
/report/osre25/northwestern/statwrap/20250823-debangi29/image1_hu63c79919752d2305350a1cb96819590d_110608_4b5e863d88146124b333878508147eff.webp 400w,
/report/osre25/northwestern/statwrap/20250823-debangi29/image1_hu63c79919752d2305350a1cb96819590d_110608_c2220a56c480048842e8b750cc2ca56f.webp 760w,
/report/osre25/northwestern/statwrap/20250823-debangi29/image1_hu63c79919752d2305350a1cb96819590d_110608_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/northwestern/statwrap/20250823-debangi29/image1_hu63c79919752d2305350a1cb96819590d_110608_4b5e863d88146124b333878508147eff.webp"
width="760"
height="603"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
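&lt;p>The weighted comparison described above can be sketched as follows. This is only a minimal illustration, not the evaluation code used in the project: the library names, the subset of criteria, the scores, and the weights are all hypothetical placeholders.&lt;/p>

```javascript
// Hypothetical criteria weights (they sum to 1); not the weights used in the project.
const weights = { searchSpeed: 0.4, memoryUsage: 0.3, fuzzySearch: 0.3 };

// Hypothetical per-library scores on a 0-10 scale, for illustration only.
const scores = {
  libraryA: { searchSpeed: 9, memoryUsage: 8, fuzzySearch: 6 },
  libraryB: { searchSpeed: 7, memoryUsage: 9, fuzzySearch: 8 },
};

// Weighted sum of one library's criterion scores.
function weightedScore(libraryScores, criteriaWeights) {
  return Object.entries(criteriaWeights).reduce(
    (total, [criterion, weight]) => total + weight * (libraryScores[criterion] || 0),
    0
  );
}

// Rank libraries from highest to lowest weighted score.
function rankLibraries(allScores, criteriaWeights) {
  return Object.entries(allScores)
    .map(([name, s]) => ({ name, score: weightedScore(s, criteriaWeights) }))
    .sort((a, b) => b.score - a.score);
}

const ranking = rankLibraries(scores, weights);
console.log(ranking);
```

&lt;p>Changing the weights changes the ranking, which is why the weights need to be agreed on before the scores are compared.&lt;/p>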
&lt;p>These results were obtained after tuning each library&amp;rsquo;s parameters to give the best set of results.
For very large datasets, FlexSearch has the lowest memory usage, followed by MiniSearch. Our test examples were limited in size, so MiniSearch showed the better memory usage in our results.
Alongside this evaluation, I reviewed the Performance Benchmark of Full-Text-Search Libraries (Stress Test), available &lt;a href="https://nextapps-de.github.io/flexsearch/" target="_blank" rel="noopener">here&lt;/a>.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Stress Test" srcset="
/report/osre25/northwestern/statwrap/20250823-debangi29/image2_hu9b739b80416dccda0a7e0361ba4f7e36_163727_407cb964e7e05c64834433b6a84182ff.webp 400w,
/report/osre25/northwestern/statwrap/20250823-debangi29/image2_hu9b739b80416dccda0a7e0361ba4f7e36_163727_167223f62fbaf30991601d7745fad9f5.webp 760w,
/report/osre25/northwestern/statwrap/20250823-debangi29/image2_hu9b739b80416dccda0a7e0361ba4f7e36_163727_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/northwestern/statwrap/20250823-debangi29/image2_hu9b739b80416dccda0a7e0361ba4f7e36_163727_407cb964e7e05c64834433b6a84182ff.webp"
width="760"
height="384"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>The benchmark was measured in terms per second; higher values are better (except for the &amp;ldquo;Memory&amp;rdquo; test). The memory value is the amount of additional memory allocated during search.&lt;/p>
&lt;p>According to its documentation, FlexSearch performs queries up to 1,000,000 times faster than other libraries while also providing powerful search capabilities such as multi-field search (document search), phonetic transformations, partial matching, tag search, result highlighting, and suggestions.
Larger workloads can be scaled through workers, which perform updates or queries on the index in parallel on dedicated, balanced threads.&lt;/p>
&lt;h2 id="2-the-search-user-interface">&lt;strong>2. The Search User Interface&lt;/strong>&lt;/h2>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="ui" srcset="
/report/osre25/northwestern/statwrap/20250823-debangi29/image3_hu2c7c529fbdaba5c9b4f85e802acf251e_292973_5c88d9d2587c54c50da97d6c489519dc.webp 400w,
/report/osre25/northwestern/statwrap/20250823-debangi29/image3_hu2c7c529fbdaba5c9b4f85e802acf251e_292973_82065ca30e98bced61362bca45765215.webp 760w,
/report/osre25/northwestern/statwrap/20250823-debangi29/image3_hu2c7c529fbdaba5c9b4f85e802acf251e_292973_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/northwestern/statwrap/20250823-debangi29/image3_hu2c7c529fbdaba5c9b4f85e802acf251e_292973_5c88d9d2587c54c50da97d6c489519dc.webp"
width="760"
height="428"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="ui2" srcset="
/report/osre25/northwestern/statwrap/20250823-debangi29/image5_hu55f5482f96b2f6db562c5a51f9b5f629_220424_7a3499ad0fc3cd06919fcdd17194742a.webp 400w,
/report/osre25/northwestern/statwrap/20250823-debangi29/image5_hu55f5482f96b2f6db562c5a51f9b5f629_220424_5840b85d48a6e608855c8e0d96b4fe49.webp 760w,
/report/osre25/northwestern/statwrap/20250823-debangi29/image5_hu55f5482f96b2f6db562c5a51f9b5f629_220424_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/northwestern/statwrap/20250823-debangi29/image5_hu55f5482f96b2f6db562c5a51f9b5f629_220424_7a3499ad0fc3cd06919fcdd17194742a.webp"
width="760"
height="652"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h2 id="3-complete-search-execution-pipeline">&lt;strong>3. Complete Search Execution Pipeline&lt;/strong>&lt;/h2>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="ui2" srcset="
/report/osre25/northwestern/statwrap/20250823-debangi29/Flowchart__hu0123533bb7a682ac6b28d9b34fa57bc0_349775_bd4ac2fa5efb17e2b237cf8d78278398.webp 400w,
/report/osre25/northwestern/statwrap/20250823-debangi29/Flowchart__hu0123533bb7a682ac6b28d9b34fa57bc0_349775_a0e8f31fdbdc656a2886def3dca3410b.webp 760w,
/report/osre25/northwestern/statwrap/20250823-debangi29/Flowchart__hu0123533bb7a682ac6b28d9b34fa57bc0_349775_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/northwestern/statwrap/20250823-debangi29/Flowchart__hu0123533bb7a682ac6b28d9b34fa57bc0_349775_bd4ac2fa5efb17e2b237cf8d78278398.webp"
width="513"
height="760"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h2 id="4-flexsearch-features">&lt;strong>4. FlexSearch Features&lt;/strong>&lt;/h2>
&lt;h4 id="1-persistent-indexing-with-automatic-loading">1. &lt;strong>Persistent Indexing with Automatic Loading&lt;/strong>&lt;/h4>
&lt;ul>
&lt;li>&lt;strong>Index persistence&lt;/strong>: Search index automatically saves to disk and loads on startup&lt;/li>
&lt;li>&lt;strong>Fast restoration&lt;/strong>: Rebuilds FlexSearch indices from saved document store without re-scanning files&lt;/li>
&lt;li>&lt;strong>Incremental updates&lt;/strong>: Detects project changes and updates only modified content&lt;/li>
&lt;li>&lt;strong>Background processing&lt;/strong>: Index updates happen asynchronously without blocking the user interface&lt;/li>
&lt;/ul>
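&lt;p>The incremental update step listed above can be sketched as a comparison between a saved snapshot and the current state of a project. This is an assumption about one way to detect changes, not the StatWrap implementation: the function name &lt;code>diffSnapshots&lt;/code> and the path-to-timestamp shape are hypothetical.&lt;/p>

```javascript
// Compare a saved snapshot with the current state of a project.
// Both arguments are Maps from file path to last-modified timestamp (hypothetical shape).
function diffSnapshots(saved, current) {
  const added = [];
  const modified = [];
  const removed = [];

  for (const [path, mtime] of current) {
    if (!saved.has(path)) {
      added.push(path); // new file: index it
    } else if (saved.get(path) !== mtime) {
      modified.push(path); // changed file: re-index it
    }
  }
  for (const path of saved.keys()) {
    if (!current.has(path)) {
      removed.push(path); // deleted file: drop it from the index
    }
  }
  return { added, modified, removed };
}

const changes = diffSnapshots(
  new Map([['notes.md', 100], ['analysis.R', 200], ['old.txt', 50]]),
  new Map([['notes.md', 100], ['analysis.R', 250], ['new.csv', 300]])
);
console.log(changes);
```

&lt;p>Only the paths returned in the three lists need to be touched, so unchanged content is never re-scanned.&lt;/p>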
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="indexing" srcset="
/report/osre25/northwestern/statwrap/20250823-debangi29/image4_hu4893772edaa569a0d2e6454373f66573_78656_23074ee37edbb0f6abbd289ef211f756.webp 400w,
/report/osre25/northwestern/statwrap/20250823-debangi29/image4_hu4893772edaa569a0d2e6454373f66573_78656_993d6a1363d2cddf66632c4102acb8f5.webp 760w,
/report/osre25/northwestern/statwrap/20250823-debangi29/image4_hu4893772edaa569a0d2e6454373f66573_78656_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/northwestern/statwrap/20250823-debangi29/image4_hu4893772edaa569a0d2e6454373f66573_78656_23074ee37edbb0f6abbd289ef211f756.webp"
width="494"
height="760"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h4 id="2-multi-document-type-support">2. &lt;strong>Multi-Document Type Support&lt;/strong>&lt;/h4>
&lt;ul>
&lt;li>&lt;strong>Unified search&lt;/strong>: Single search interface for projects, files, people, notes, and assets&lt;/li>
&lt;li>&lt;strong>Type-specific indices&lt;/strong>: Separate FlexSearch indices optimized for each document type&lt;/li>
&lt;li>&lt;strong>Cross-reference capabilities&lt;/strong>: Documents can reference and link to each other&lt;/li>
&lt;li>&lt;strong>Flexible schema&lt;/strong>: Each document type has tailored fields for optimal search performance&lt;/li>
&lt;/ul>
&lt;h4 id="3-intelligent-file-content-indexing">3. &lt;strong>Intelligent File Content Indexing&lt;/strong>&lt;/h4>
&lt;ul>
&lt;li>&lt;strong>Configurable file size limits&lt;/strong>: Admin-controlled maximum file size for content indexing&lt;/li>
&lt;li>&lt;strong>Smart file detection&lt;/strong>: Automatically identifies text files by extension and filename patterns&lt;/li>
&lt;li>&lt;strong>Content extraction&lt;/strong>: Full-text indexing with snippet generation for search results&lt;/li>
&lt;li>&lt;strong>Performance optimization&lt;/strong>: Skips binary files and respects size constraints to maintain speed&lt;/li>
&lt;/ul>
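&lt;p>The file checks listed above can be sketched as a single predicate. This is a minimal sketch under stated assumptions: the extension list, the filename patterns, and the function names are hypothetical, and the real rules in StatWrap may differ.&lt;/p>

```javascript
// Hypothetical allow-lists; the real extensions and patterns may differ.
const TEXT_EXTENSIONS = new Set(['.txt', '.md', '.r', '.py', '.csv', '.json']);
const TEXT_FILENAMES = [/^readme$/i, /^license$/i, /^makefile$/i];

// Return the lowercase extension including the dot, or '' when there is none.
function extensionOf(fileName) {
  const dot = fileName.lastIndexOf('.');
  return dot > 0 ? fileName.slice(dot).toLowerCase() : '';
}

// Decide whether a file's content should be read and indexed.
function shouldIndexContent(fileName, sizeInBytes, maxSizeInBytes) {
  if (sizeInBytes > maxSizeInBytes) {
    return false; // respect the configured size limit
  }
  const ext = extensionOf(fileName);
  if (ext !== '') {
    return TEXT_EXTENSIONS.has(ext); // unknown extensions are treated as binary
  }
  // No extension: fall back to well-known text filenames.
  return TEXT_FILENAMES.some((pattern) => pattern.test(fileName));
}

console.log(shouldIndexContent('analysis.R', 2048, 1024 * 1024));
```

&lt;p>Checking the size first means oversized files are rejected without any further inspection.&lt;/p>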
&lt;h4 id="4-advanced-query-processing">4. &lt;strong>Advanced Query Processing&lt;/strong>&lt;/h4>
&lt;ul>
&lt;li>&lt;strong>Multi-strategy search&lt;/strong>: Combines exact matches, fuzzy search, partial matches, and contextual search&lt;/li>
&lt;li>&lt;strong>Query preprocessing&lt;/strong>: Removes stop words and applies linguistic filters&lt;/li>
&lt;li>&lt;strong>Relevance scoring&lt;/strong>: Custom scoring algorithm considering multiple factors:
&lt;ul>
&lt;li>Exact phrase matches (highest weight)&lt;/li>
&lt;li>Individual word matches&lt;/li>
&lt;li>Term frequency with logarithmic capping&lt;/li>
&lt;li>Position-based scoring (earlier matches rank higher)&lt;/li>
&lt;li>Proximity bonuses for terms appearing near each other&lt;/li>
&lt;li>Completeness penalties for missing query terms&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
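&lt;p>Some of the scoring factors listed above can be sketched as follows. This is an illustrative sketch, not the scoring algorithm used in the project: the stop-word list and the numeric weights are hypothetical, the logarithmic damping is one possible reading of the term frequency factor, and the proximity bonus is left out for brevity.&lt;/p>

```javascript
// Hypothetical stop-word list and weights, chosen only for illustration.
const STOP_WORDS = new Set(['the', 'a', 'an', 'of', 'and', 'in']);
const WEIGHTS = { phrase: 10, word: 2, position: 1, missingPenalty: 3 };

// Lowercase, split on non-alphanumerics, and drop empty tokens and stop words.
function tokenize(text) {
  return text
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((t) => t.length > 0)
    .filter((t) => !STOP_WORDS.has(t));
}

function scoreDocument(query, text) {
  const terms = tokenize(query);
  const words = tokenize(text);
  let score = 0;

  // Exact phrase match carries the highest weight (padded to match whole words).
  if (terms.length > 0) {
    const haystack = ' ' + words.join(' ') + ' ';
    if (haystack.includes(' ' + terms.join(' ') + ' ')) {
      score += WEIGHTS.phrase;
    }
  }

  for (const term of terms) {
    const positions = [];
    words.forEach((w, i) => {
      if (w === term) positions.push(i);
    });
    if (positions.length === 0) {
      score -= WEIGHTS.missingPenalty; // completeness penalty for a missing term
      continue;
    }
    // Term frequency, damped logarithmically so repeats give diminishing returns.
    score += WEIGHTS.word * (1 + Math.log(positions.length));
    // An earlier first occurrence earns a larger position bonus.
    score += WEIGHTS.position / (1 + positions[0]);
  }
  return score;
}

console.log(scoreDocument('survival analysis', 'Survival analysis of the cohort'));
```

&lt;p>Sorting results by this score in descending order puts exact phrase matches near the top and documents with missing terms near the bottom.&lt;/p>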
&lt;h4 id="5-real-time-search-suggestions">5. &lt;strong>Real-Time Search Suggestions&lt;/strong>&lt;/h4>
&lt;ul>
&lt;li>&lt;strong>Autocomplete support&lt;/strong>: Dynamic suggestions based on indexed document titles&lt;/li>
&lt;li>&lt;strong>Search history&lt;/strong>: Maintains recent searches for quick re-execution&lt;/li>
&lt;li>&lt;strong>Debounced input&lt;/strong>: Prevents excessive API calls during typing&lt;/li>
&lt;li>&lt;strong>Contextual suggestions&lt;/strong>: Suggestions adapt based on current filters and context&lt;/li>
&lt;/ul>
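&lt;p>Debounced input and title-based autocomplete, as listed above, can be sketched as follows. This is a generic sketch rather than the project code: &lt;code>debounce&lt;/code> and &lt;code>suggestTitles&lt;/code> are hypothetical helper names, and the delay and limit values are arbitrary.&lt;/p>

```javascript
// Return a wrapper that runs `fn` only after `delayMs` have passed without a new call.
function debounce(fn, delayMs) {
  let timer = null;
  return (...args) => {
    if (timer !== null) {
      clearTimeout(timer); // a newer keystroke cancels the pending call
    }
    timer = setTimeout(() => {
      timer = null;
      fn(...args);
    }, delayMs);
  };
}

// Suggest up to `limit` indexed titles that start with the typed prefix.
function suggestTitles(titles, typed, limit = 5) {
  const prefix = typed.trim().toLowerCase();
  if (prefix === '') {
    return [];
  }
  return titles
    .filter((title) => title.toLowerCase().startsWith(prefix))
    .slice(0, limit);
}

// Hypothetical titles standing in for indexed documents.
const indexedTitles = ['Survival Study', 'Survey Data', 'Asthma Cohort'];
const onInput = debounce((text) => {
  console.log(suggestTitles(indexedTitles, text));
}, 200);
```

&lt;p>Calling &lt;code>onInput&lt;/code> on every keystroke then triggers at most one suggestion lookup per pause in typing.&lt;/p>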
&lt;h4 id="6-comprehensive-filtering-system">6. &lt;strong>Comprehensive Filtering System&lt;/strong>&lt;/h4>
&lt;ul>
&lt;li>&lt;strong>Type filtering&lt;/strong>: Filter by document type (projects, files, people, etc.)&lt;/li>
&lt;li>&lt;strong>Project scoping&lt;/strong>: Limit searches to specific projects&lt;/li>
&lt;li>&lt;strong>File type filtering&lt;/strong>: Filter files by extension&lt;/li>
&lt;li>&lt;strong>Advanced search panel&lt;/strong>: Collapsible interface for power users&lt;/li>
&lt;li>&lt;strong>Filter persistence&lt;/strong>: Maintains filter state across searches&lt;/li>
&lt;/ul>
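&lt;p>The type, project, and file type filters listed above can be sketched as one function applied to the search results. The result shape and the names &lt;code>applyFilters&lt;/code> and &lt;code>matches&lt;/code> are assumptions made for this example.&lt;/p>

```javascript
// A filter value of undefined means "do not filter on this field".
const matches = (wanted, actual) => wanted === undefined || wanted === actual;

// Hypothetical result shape: { type, projectId, extension, title }.
// A result is kept only if it satisfies every filter that is set.
function applyFilters(results, filters = {}) {
  return results.filter((r) =>
    [
      matches(filters.type, r.type),
      matches(filters.projectId, r.projectId),
      matches(filters.extension, r.extension),
    ].every(Boolean)
  );
}

// Made-up results for illustration.
const sampleResults = [
  { type: 'file', projectId: 'p1', extension: '.md', title: 'notes.md' },
  { type: 'file', projectId: 'p2', extension: '.csv', title: 'data.csv' },
  { type: 'person', projectId: 'p1', extension: undefined, title: 'A. Researcher' },
];

console.log(applyFilters(sampleResults, { type: 'file', projectId: 'p1' }));
```

&lt;p>Keeping the filter object separate from the query makes it straightforward to reuse the same filters for the next search.&lt;/p>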
&lt;h4 id="7-performance-monitoring--analytics">7. &lt;strong>Performance Monitoring &amp;amp; Analytics&lt;/strong>&lt;/h4>
&lt;ul>
&lt;li>&lt;strong>Real-time metrics&lt;/strong>: Track search times, cache hit rates, and index statistics&lt;/li>
&lt;li>&lt;strong>Performance dashboard&lt;/strong>: Visual indicators for system health&lt;/li>
&lt;li>&lt;strong>Cache management&lt;/strong>: LRU cache with configurable size and TTL&lt;/li>
&lt;li>&lt;strong>Search analytics&lt;/strong>: Historical data on search patterns and performance&lt;/li>
&lt;/ul>
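&lt;p>An LRU cache with a configurable size and TTL, together with a hit rate metric, can be sketched as follows. This is a generic sketch built on a JavaScript &lt;code>Map&lt;/code>, not the cache used in the project: the class name, method names, and the injectable clock are assumptions made for illustration.&lt;/p>

```javascript
class LruCache {
  // `now` is injectable so that expiry can be tested without waiting.
  constructor(maxSize, ttlMs, now = () => Date.now()) {
    this.maxSize = maxSize;
    this.ttlMs = ttlMs;
    this.now = now;
    this.entries = new Map(); // a Map iterates in insertion order
    this.hits = 0;
    this.misses = 0;
  }

  get(key) {
    const entry = this.entries.get(key);
    if (entry === undefined || this.now() - entry.storedAt > this.ttlMs) {
      this.entries.delete(key); // drop the entry if it has expired
      this.misses += 1;
      return undefined;
    }
    // Re-insert to mark the entry as most recently used.
    this.entries.delete(key);
    this.entries.set(key, entry);
    this.hits += 1;
    return entry.value;
  }

  set(key, value) {
    this.entries.delete(key);
    this.entries.set(key, { value, storedAt: this.now() });
    if (this.entries.size > this.maxSize) {
      // The first key in iteration order is the least recently used.
      const oldest = this.entries.keys().next().value;
      this.entries.delete(oldest);
    }
  }

  // Fraction of lookups served from the cache.
  hitRate() {
    const total = this.hits + this.misses;
    return total === 0 ? 0 : this.hits / total;
  }
}

// Example: cache up to 100 result sets for one minute each.
const resultCache = new LruCache(100, 60 * 1000);
resultCache.set('cohort', ['result-1']);
console.log(resultCache.get('cohort'), resultCache.hitRate());
```

&lt;p>Using the query string plus the active filters as the cache key would keep cached results consistent with what the user sees, although that choice is also an assumption here.&lt;/p>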
&lt;h4 id="8-index-management-tools">8. &lt;strong>Index Management Tools&lt;/strong>&lt;/h4>
&lt;ul>
&lt;li>&lt;strong>Export/Import functionality&lt;/strong>: Backup and restore search indices&lt;/li>
&lt;li>&lt;strong>Full reindexing&lt;/strong>: Complete index rebuild with progress tracking&lt;/li>
&lt;li>&lt;strong>Index deletion&lt;/strong>: Clean slate functionality for troubleshooting&lt;/li>
&lt;li>&lt;strong>File size adjustment&lt;/strong>: Modify indexing constraints and rebuild affected content&lt;/li>
&lt;li>&lt;strong>Index statistics&lt;/strong>: Detailed breakdown of indexed content by type and project&lt;/li>
&lt;/ul>
&lt;h4 id="9-robust-error-handling--resilience">9. &lt;strong>Robust Error Handling &amp;amp; Resilience&lt;/strong>&lt;/h4>
&lt;ul>
&lt;li>&lt;strong>Graceful degradation&lt;/strong>: System continues operating even with partial index corruption&lt;/li>
&lt;li>&lt;strong>File system error handling&lt;/strong>: Handles missing files, permission issues, and path changes&lt;/li>
&lt;li>&lt;strong>Memory management&lt;/strong>: Prevents memory leaks during large indexing operations&lt;/li>
&lt;li>&lt;strong>Recovery mechanisms&lt;/strong>: Automatic fallback to basic search if advanced features fail&lt;/li>
&lt;/ul>
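&lt;p>The fallback behaviour listed above can be sketched as a wrapper around the search call. This is an assumption about how such a fallback could look, not the project code: &lt;code>searchWithFallback&lt;/code>, the &lt;code>advancedSearch&lt;/code> parameter, and the document shape are hypothetical.&lt;/p>

```javascript
// Run the advanced search and fall back to a basic substring scan if it throws.
// `advancedSearch` is a hypothetical function supplied by the caller.
// Hypothetical document shape: { id, text }.
function searchWithFallback(query, documents, advancedSearch) {
  try {
    return { mode: 'advanced', results: advancedSearch(query, documents) };
  } catch (error) {
    // Basic search: case-insensitive substring match over the raw text.
    const needle = query.toLowerCase();
    const results = documents.filter((doc) => doc.text.toLowerCase().includes(needle));
    return { mode: 'basic', results };
  }
}

// Made-up documents and a deliberately failing advanced search, for illustration.
const sampleDocuments = [
  { id: 1, text: 'Cohort survival analysis' },
  { id: 2, text: 'Meeting notes' },
];
const brokenSearch = () => {
  throw new Error('index unavailable');
};

console.log(searchWithFallback('survival', sampleDocuments, brokenSearch));
```

&lt;p>Returning the mode alongside the results lets the interface tell the user when reduced search quality is in effect.&lt;/p>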
&lt;h4 id="10-user-experience-enhancements">10. &lt;strong>User Experience Enhancements&lt;/strong>&lt;/h4>
&lt;ul>
&lt;li>&lt;strong>Keyboard shortcuts&lt;/strong>: Ctrl+K to focus search, Escape to clear&lt;/li>
&lt;li>&lt;strong>Result highlighting&lt;/strong>: Visual emphasis on matching terms in results&lt;/li>
&lt;li>&lt;strong>Expandable results&lt;/strong>: Drill down into detailed information for each result&lt;/li>
&lt;li>&lt;strong>Loading states&lt;/strong>: Clear feedback during indexing and search operations&lt;/li>
&lt;li>&lt;strong>Responsive tabs&lt;/strong>: Organized results by type with badge counts&lt;/li>
&lt;/ul>
&lt;h2 id="5-classification-of-active-and-past-projects">&lt;strong>5. Classification of Active and Past Projects&lt;/strong>&lt;/h2>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Active Pinned" srcset="
/report/osre25/northwestern/statwrap/20250823-debangi29/image6_huacf20425d6903f6cfe6149bc5cb1772d_171494_1d3344ebb95180438d54893a9b5683e4.webp 400w,
/report/osre25/northwestern/statwrap/20250823-debangi29/image6_huacf20425d6903f6cfe6149bc5cb1772d_171494_a0f8ee7f62445c2f5f806022268d0821.webp 760w,
/report/osre25/northwestern/statwrap/20250823-debangi29/image6_huacf20425d6903f6cfe6149bc5cb1772d_171494_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/northwestern/statwrap/20250823-debangi29/image6_huacf20425d6903f6cfe6149bc5cb1772d_171494_1d3344ebb95180438d54893a9b5683e4.webp"
width="733"
height="760"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Past" srcset="
/report/osre25/northwestern/statwrap/20250823-debangi29/image7_hu7cccff315a5d098cd440d7277689d606_85529_76660a0dce9ac0ba1fa91c959db2773c.webp 400w,
/report/osre25/northwestern/statwrap/20250823-debangi29/image7_hu7cccff315a5d098cd440d7277689d606_85529_cc2abd1a6a3019f703ca3e656e55f920.webp 760w,
/report/osre25/northwestern/statwrap/20250823-debangi29/image7_hu7cccff315a5d098cd440d7277689d606_85529_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/northwestern/statwrap/20250823-debangi29/image7_hu7cccff315a5d098cd440d7277689d606_85529_76660a0dce9ac0ba1fa91c959db2773c.webp"
width="740"
height="542"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>A classification system has been added to the user interface, similar to the &lt;strong>&amp;ldquo;Add to Favorites&amp;rdquo;&lt;/strong> option. By default, a newly added project goes to the &lt;strong>&amp;ldquo;Active&amp;rdquo;&lt;/strong> section unless it is explicitly marked as &lt;strong>&amp;ldquo;Past&amp;rdquo;&lt;/strong>. Similarly, when a project is unpinned from Favorites, it returns to the &amp;ldquo;Active&amp;rdquo; section.&lt;/p>
&lt;h1 id="conclusion-and-future-scope">&lt;strong>Conclusion and Future Scope&lt;/strong>&lt;/h1>
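&lt;p>The classification rule described above can be sketched as follows. The project shape and function names are hypothetical, and the assumption that a pinned project is shown as &amp;ldquo;Pinned&amp;rdquo; regardless of its other flags is mine rather than something stated by the project.&lt;/p>

```javascript
// Hypothetical project shape: { name, pinned, past }.
function classifyProject(project) {
  if (project.pinned) {
    return 'Pinned'; // assumption: pinning takes precedence over the other labels
  }
  if (project.past) {
    return 'Past'; // only when explicitly marked as past
  }
  return 'Active'; // default for new and unpinned projects
}

// Group project names by the section they would appear in.
function groupBySection(projects) {
  const sections = { Pinned: [], Active: [], Past: [] };
  for (const project of projects) {
    sections[classifyProject(project)].push(project.name);
  }
  return sections;
}

// Made-up projects for illustration.
const sections = groupBySection([
  { name: 'New study' },
  { name: 'Favourite study', pinned: true },
  { name: 'Archived study', past: true },
]);
console.log(sections);
```

&lt;p>Because &amp;ldquo;Active&amp;rdquo; is the fallthrough case, unpinning a project that is not marked as past moves it back to that section without any extra bookkeeping.&lt;/p>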
&lt;p>Building a comprehensive search system requires careful attention to performance, user experience, and maintainability. FlexSearch provided the foundation, but the real value came from thoughtful implementation of persistent indexing, advanced scoring, and robust error handling. The result is a search system that feels instant to users while handling complex queries across diverse document types.&lt;/p>
&lt;p>The key to success was treating search not as a single feature, but as a complete subsystem with its own data management, performance monitoring, and user interface considerations. By investing in these supporting systems, the search functionality became a central, reliable part of the application that users can depend on.&lt;/p>
&lt;p>The future scope would include:&lt;/p>
&lt;ol>
&lt;li>Using a database (for example, SQLite) instead of JSON files, which suits this use case better because of more efficient queries and atomic CRUD operations.&lt;/li>
&lt;li>Integrating any suggestions from my mentors, as well as improvements we feel are necessary.&lt;/li>
&lt;li>Developing unit tests for further functionalities and improvements.&lt;/li>
&lt;/ol>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Thank You!" srcset="
/report/osre25/northwestern/statwrap/20250823-debangi29/image_hu81a7405087771991938f164c6a45c6d2_109315_f70985a589ad6b79f8c95b36c5279852.webp 400w,
/report/osre25/northwestern/statwrap/20250823-debangi29/image_hu81a7405087771991938f164c6a45c6d2_109315_b28b9dbb6c70c33ca845fda461a64fcf.webp 760w,
/report/osre25/northwestern/statwrap/20250823-debangi29/image_hu81a7405087771991938f164c6a45c6d2_109315_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/northwestern/statwrap/20250823-debangi29/image_hu81a7405087771991938f164c6a45c6d2_109315_f70985a589ad6b79f8c95b36c5279852.webp"
width="760"
height="235"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p></description></item><item><title>Mid-term Blog: StatWrap: Cross-Project Searching and Classification using Local Indexing</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/northwestern/statwrap/20250715-debangi29/</link><pubDate>Tue, 15 Jul 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/northwestern/statwrap/20250715-debangi29/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>Hello everyone!&lt;br>
I am Debangi Ghosh from India, an undergraduate student at the Indian Institute of Technology (IIT) BHU, Varanasi. As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/northwestern/statwrap/">StatWrap: Cross-Project Searching and Classification using Local Indexing&lt;/a> project, my &lt;a href="https://drive.google.com/file/d/1dxyBP2oMJwYDCKyIWzr465zNmm6UWtnI/view?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a>, under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/luke-rasmussen/">Luke Rasmussen&lt;/a>, focuses on developing a full-text search service within the StatWrap user interface. This involves evaluating different search libraries and implementing a classification system to distinguish between active and past projects.&lt;/p>
&lt;h2 id="about-the-project">&lt;strong>About the Project&lt;/strong>&lt;/h2>
&lt;p>As part of the project, I am working on enhancing the usability of StatWrap by enabling efficient cross-project search capabilities. The goal is to make it easier for investigators to discover relevant projects, notes, and assets—across both current and archived work—using information that is either user-entered or passively collected by StatWrap.&lt;/p>
&lt;p>Given the sensitivity of the data involved, one of the key requirements is that all indexing and search operations must be performed locally. To address this, my responsibilities include:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Evaluating open-source search libraries&lt;/strong> suitable for local indexing and retrieval&lt;/li>
&lt;li>&lt;strong>Building the full-text search functionality&lt;/strong> directly into the StatWrap UI to allow seamless querying across projects&lt;/li>
&lt;li>&lt;strong>Ensuring reliability&lt;/strong> through the development of unit tests and comprehensive system testing&lt;/li>
&lt;li>&lt;strong>Implementing a classification system&lt;/strong> to label projects as “Active,” “Pinned,” or “Past” within the user interface&lt;/li>
&lt;/ul>
&lt;p>This project offers a great opportunity to work at the intersection of software development, information retrieval, and user-centric design—while contributing to research reproducibility and collaboration within scientific workflows.&lt;/p>
&lt;h2 id="progress">Progress&lt;/h2>
&lt;p>It has been more than six weeks since the project began, and significant progress has been made. Here&amp;rsquo;s a breakdown:&lt;/p>
&lt;h3 id="1-descriptive-comparison-of-open-source-libraries">1. &lt;strong>Descriptive Comparison of Open-Source Libraries&lt;/strong>&lt;/h3>
&lt;p>Compared various open-source search libraries based on evaluation criteria such as &lt;strong>indexing speed, search speed, memory usage, typo tolerance, fuzzy searching, partial matching, full-text queries, contextual search, Boolean support, exact word match, installation ease, maintenance, documentation&lt;/strong>, and &lt;strong>developer experience&lt;/strong>.&lt;/p>
&lt;h3 id="2-the-libraries">2. &lt;strong>The Libraries&lt;/strong>&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Lunr.js&lt;/strong>&lt;br>
A small, client-side full-text search engine that mimics Solr capabilities.&lt;/p>
&lt;ul>
&lt;li>Field-based search, boosting&lt;/li>
&lt;li>Supports TF-IDF, inverted index&lt;/li>
&lt;li>No built-in fuzzy search (only basic wildcards)&lt;/li>
&lt;li>Can serialize/deserialize index&lt;/li>
&lt;li>Not designed for large datasets&lt;/li>
&lt;li>Moderate memory usage and indexing speed&lt;/li>
&lt;li>Good documentation&lt;/li>
&lt;li>&lt;strong>Best for&lt;/strong>: Static websites or SPAs needing simple in-browser search&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>ElasticLunr.js&lt;/strong>&lt;br>
A lightweight, more flexible alternative to Lunr.js.&lt;/p>
&lt;ul>
&lt;li>Dynamic index (add/remove docs)&lt;/li>
&lt;li>Field-based and weighted search&lt;/li>
&lt;li>No advanced fuzzy matching&lt;/li>
&lt;li>Faster and more customizable than Lunr&lt;/li>
&lt;li>Smaller footprint&lt;/li>
&lt;li>Easy to use and maintain&lt;/li>
&lt;li>&lt;strong>Best for&lt;/strong>: Developers wanting Lunr-like features with simpler customization&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Fuse.js&lt;/strong>&lt;br>
A fuzzy search library ideal for small to medium datasets.&lt;/p>
&lt;ul>
&lt;li>Fuzzy search with typo tolerance&lt;/li>
&lt;li>Deep key/path searching&lt;/li>
&lt;li>No need to build index&lt;/li>
&lt;li>Highly configurable (threshold, distance, etc.)&lt;/li>
&lt;li>Linear scan = slower on large datasets&lt;/li>
&lt;li>Not full-text search (scoring-based match)&lt;/li>
&lt;li>Extremely easy to set up and use&lt;/li>
&lt;li>&lt;strong>Best for&lt;/strong>: Fuzzy search in small in-memory arrays (e.g., auto-suggest, dropdown filters)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>FlexSearch&lt;/strong>&lt;br>
A blazing-fast, modular search engine with advanced indexing options.&lt;/p>
&lt;ul>
&lt;li>Extremely fast search and indexing&lt;/li>
&lt;li>Supports phonetic, typo-tolerant, and partial matching&lt;/li>
&lt;li>Asynchronous support&lt;/li>
&lt;li>Multi-language + Unicode-friendly&lt;/li>
&lt;li>Low memory footprint&lt;/li>
&lt;li>Configuration can be complex for beginners&lt;/li>
&lt;li>&lt;strong>Best for&lt;/strong>: High-performance search in large/multilingual datasets&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>MiniSearch&lt;/strong>&lt;br>
A small, full-text search engine with balanced performance and simplicity.&lt;/p>
&lt;ul>
&lt;li>Fast indexing and searching&lt;/li>
&lt;li>Fuzzy search, stemming, stop words&lt;/li>
&lt;li>Field boosting and prefix search&lt;/li>
&lt;li>Compact, can serialize index&lt;/li>
&lt;li>Clean and modern API&lt;/li>
&lt;li>Lightweight and easy to maintain&lt;/li>
&lt;li>&lt;strong>Best for&lt;/strong>: Balanced, in-browser full-text search for moderate datasets&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Search-Index&lt;/strong>&lt;br>
A persistent, full-featured search engine for Node.js and browsers.&lt;/p>
&lt;ul>
&lt;li>Persistent storage with LevelDB&lt;/li>
&lt;li>Real-time indexing&lt;/li>
&lt;li>Fielded queries, faceting, filtering&lt;/li>
&lt;li>Advanced queries (Boolean, range, etc.)&lt;/li>
&lt;li>Slightly heavier setup&lt;/li>
&lt;li>Good for offline/local-first apps&lt;/li>
&lt;li>Browser usage more complex than others&lt;/li>
&lt;li>&lt;strong>Best for&lt;/strong>: Node.js apps, &lt;strong>not directly compatible with the Electron + React environment of StatWrap&lt;/strong>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="3-developer-experience-and-maintenance">3. Developer Experience and Maintenance&lt;/h3>
&lt;p>We analyzed the download trends of the search libraries using npm trends, and also reviewed their maintenance statistics to assess how frequently they are updated.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="DOWNLOADS" srcset="
/report/osre25/northwestern/statwrap/20250715-debangi29/downloads_hu3acc13cb2503d87ec01b259eecff7d9f_205568_2981b0e25cc7e6da71dd1af69f1ab499.webp 400w,
/report/osre25/northwestern/statwrap/20250715-debangi29/downloads_hu3acc13cb2503d87ec01b259eecff7d9f_205568_52b5a1c87803e2c8a2f59ad52703cd75.webp 760w,
/report/osre25/northwestern/statwrap/20250715-debangi29/downloads_hu3acc13cb2503d87ec01b259eecff7d9f_205568_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/northwestern/statwrap/20250715-debangi29/downloads_hu3acc13cb2503d87ec01b259eecff7d9f_205568_2981b0e25cc7e6da71dd1af69f1ab499.webp"
width="760"
height="362"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;br>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Maintenance" srcset="
/report/osre25/northwestern/statwrap/20250715-debangi29/Maintenance_hub392779bb7551900858e36e62009d315_166372_50f35746c2224661759e3d1f68308f5c.webp 400w,
/report/osre25/northwestern/statwrap/20250715-debangi29/Maintenance_hub392779bb7551900858e36e62009d315_166372_1f83a8585ae086eae8ad16a0d18c8fff.webp 760w,
/report/osre25/northwestern/statwrap/20250715-debangi29/Maintenance_hub392779bb7551900858e36e62009d315_166372_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/northwestern/statwrap/20250715-debangi29/Maintenance_hub392779bb7551900858e36e62009d315_166372_50f35746c2224661759e3d1f68308f5c.webp"
width="760"
height="261"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h3 id="4-comparative-analysis-after-testing">4. Comparative Analysis After Testing&lt;/h3>
&lt;p>Each search library was benchmarked against a predefined set of queries based on the same evaluation criteria.&lt;br>
The weights for each criterion are not yet finalized; we will set them during the end-term evaluation.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="COMPARATIVE ANALYSIS" srcset="
/report/osre25/northwestern/statwrap/20250715-debangi29/image_huff63b524c7af2307fdfe0ebf7a2c55bc_128809_cf08ab4466e54fc0970dac451ab583d2.webp 400w,
/report/osre25/northwestern/statwrap/20250715-debangi29/image_huff63b524c7af2307fdfe0ebf7a2c55bc_128809_4d08ea843125818ade4b1288b2ed91fd.webp 760w,
/report/osre25/northwestern/statwrap/20250715-debangi29/image_huff63b524c7af2307fdfe0ebf7a2c55bc_128809_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/northwestern/statwrap/20250715-debangi29/image_huff63b524c7af2307fdfe0ebf7a2c55bc_128809_cf08ab4466e54fc0970dac451ab583d2.webp"
width="760"
height="578"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h3 id="5-the-user-interface">5. The User Interface&lt;/h3>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="User Interface" srcset="
/report/osre25/northwestern/statwrap/20250715-debangi29/UI_hu614745e803a206ba95d1613340cef4da_263973_ad72fdc47d934ea42f989055b49d88aa.webp 400w,
/report/osre25/northwestern/statwrap/20250715-debangi29/UI_hu614745e803a206ba95d1613340cef4da_263973_51decc3c2ce6793ca567153dd67113d0.webp 760w,
/report/osre25/northwestern/statwrap/20250715-debangi29/UI_hu614745e803a206ba95d1613340cef4da_263973_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/northwestern/statwrap/20250715-debangi29/UI_hu614745e803a206ba95d1613340cef4da_263973_ad72fdc47d934ea42f989055b49d88aa.webp"
width="760"
height="475"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;br>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Debug Tools" srcset="
/report/osre25/northwestern/statwrap/20250715-debangi29/image-1_huff1ce04307fd90cec714c35adb969f67_82199_e86edc8fa7aba824f1fd8a90948c619c.webp 400w,
/report/osre25/northwestern/statwrap/20250715-debangi29/image-1_huff1ce04307fd90cec714c35adb969f67_82199_ba6358e5089040847a0e39704677cc12.webp 760w,
/report/osre25/northwestern/statwrap/20250715-debangi29/image-1_huff1ce04307fd90cec714c35adb969f67_82199_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/northwestern/statwrap/20250715-debangi29/image-1_huff1ce04307fd90cec714c35adb969f67_82199_e86edc8fa7aba824f1fd8a90948c619c.webp"
width="760"
height="482"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>The user interface offers three search modes (Basic, Advanced, and Boolean operators), each with configurable parameters. Results are sorted by relevance score (highest first) and grouped by category.&lt;/p>
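&lt;p>As a minimal sketch of the sorting and grouping described above (not the actual StatWrap code), the snippet below orders results by score and then buckets them by category. The &lt;code>SearchResult&lt;/code> shape, its field names, and the sample data are assumptions made for illustration.&lt;/p>

```typescript
// Hypothetical result shape; the field names are assumptions for illustration.
type SearchResult = { title: string; category: string; score: number };
type GroupedResults = { [category: string]: SearchResult[] };

// Sort by relevance (highest first), then bucket results by category.
// Because the input is sorted first, each bucket stays in score order.
function sortAndGroup(results: SearchResult[]): GroupedResults {
  const sorted = [...results].sort((a, b) => b.score - a.score);
  // A prototype-free object, so category names never collide with built-ins.
  const groups: GroupedResults = Object.create(null);
  for (const result of sorted) {
    const bucket = groups[result.category] ?? [];
    bucket.push(result);
    groups[result.category] = bucket;
  }
  return groups;
}

// Made-up sample results.
const grouped = sortAndGroup([
  { title: "analysis.R", category: "files", score: 2.1 },
  { title: "Survey Study", category: "projects", score: 5.4 },
  { title: "notes.md", category: "files", score: 3.7 },
]);
console.log(Object.keys(grouped));
```

&lt;p>Sorting once before grouping keeps every category list in relevance order without a second sort per group.&lt;/p>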
&lt;h3 id="6-overall-functioning">6. Overall Functioning&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Indexing Workflow&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Projects are processed sequentially&lt;/li>
&lt;li>Metadata, files, people, and notes are indexed (larger files are queued for later)&lt;/li>
&lt;li>Uses a &amp;ldquo;brute-force&amp;rdquo; recursive approach to walk through project directories
&lt;ul>
&lt;li>Skips directories like &lt;code>node_modules&lt;/code>, &lt;code>.git&lt;/code>, &lt;code>.statwrap&lt;/code>&lt;/li>
&lt;li>Identifies eligible text files for indexing&lt;/li>
&lt;li>Logs progress every 10 files&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Document Creation Logic&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Reads file content as UTF-8 text&lt;/li>
&lt;li>Builds searchable documents with filename, content, and metadata&lt;/li>
&lt;li>Auto-generates tags based on content and file type&lt;/li>
&lt;li>Adds documents to the search index and document store&lt;/li>
&lt;li>Handles errors gracefully with debug logging&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Search Functionality&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Uses field-weighted search&lt;/li>
&lt;li>Enriches results with document metadata&lt;/li>
&lt;li>Supports filtering by type or project&lt;/li>
&lt;li>Groups results by category (files, projects, people, etc.)&lt;/li>
&lt;li>Implements caching for improved performance&lt;/li>
&lt;li>Generates search statistics to monitor performance&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
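&lt;p>The indexing workflow above can be sketched roughly as follows. This is an illustrative sketch rather than the StatWrap implementation: it walks a hypothetical in-memory tree (&lt;code>TreeNode&lt;/code>) instead of the real file system so that it runs on its own, and the list of eligible extensions and the tagging rule are assumptions.&lt;/p>

```typescript
// Hypothetical in-memory stand-in for a project directory, so the sketch
// runs without touching the file system.
type TreeNode = { name: string; content?: string; children?: TreeNode[] };
type IndexedDoc = { path: string; filename: string; content: string; tags: string[] };

const SKIPPED_DIRS = ["node_modules", ".git", ".statwrap"]; // directories named in the list above
const TEXT_EXTENSIONS = [".md", ".txt", ".r", ".py", ".json"]; // assumed eligibility rule

function extensionOf(name: string): string {
  const dot = name.lastIndexOf(".");
  return dot > 0 ? name.slice(dot).toLowerCase() : "";
}

// Recursively walk the tree, skipping excluded directories and collecting
// one searchable document per eligible text file.
function walk(node: TreeNode, parentPath: string, docs: IndexedDoc[]): IndexedDoc[] {
  const path = parentPath === "" ? node.name : parentPath + "/" + node.name;
  if (node.children !== undefined) {
    if (SKIPPED_DIRS.indexOf(node.name) !== -1) return docs;
    for (const child of node.children) walk(child, path, docs);
    return docs;
  }
  const ext = extensionOf(node.name);
  if (TEXT_EXTENSIONS.indexOf(ext) === -1) return docs;
  // Tags here come from the file type only; content-based tagging is omitted.
  docs.push({ path, filename: node.name, content: node.content ?? "", tags: [ext.slice(1)] });
  // Log progress every 10 indexed files.
  if (docs.length % 10 === 0) console.log("Indexed " + docs.length + " files");
  return docs;
}

// Made-up project tree for demonstration.
const project: TreeNode = {
  name: "demo",
  children: [
    { name: "README.md", content: "Survey analysis" },
    { name: "plot.png", content: "" },
    { name: "node_modules", children: [{ name: "lib.txt", content: "skip me" }] },
    { name: "src", children: [{ name: "model.R", content: "lm(y ~ x)" }] },
  ],
};
const docs = walk(project, "", []);
console.log(docs.map((doc) => doc.path));
```

&lt;p>In a real Electron application the tree would come from the file system, and each collected document would then be added to the chosen library&amp;rsquo;s search index and to the document store.&lt;/p>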
&lt;h2 id="challenges-and-end-term-goals">Challenges and End-Term Goals&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>In-Memory Index and Metadata Storage&lt;/strong>&lt;br>
Most JavaScript search libraries (such as Fuse.js, Lunr, and MiniSearch) hold their indexes entirely in memory, which becomes a problem for large datasets. A key challenge is designing a scalable solution that supports disk persistence or lazy loading so that the application does not run out of memory.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Choosing the Relevance Weights&lt;/strong>&lt;br>
Another challenge is tuning relevance scoring by assigning appropriate weights to different kinds of matches, such as exact word matches, prefix matches, and typo-tolerant (fuzzy) matches. For instance, exact matches should rank higher than fuzzy or partial matches.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Implementing the Selected Library&lt;/strong>&lt;br>
Once a library is selected (based on speed, features, and compatibility with Electron + React), the next challenge is integrating it into StatWrap efficiently, with local indexing, accurate search results, and smooth performance even on large projects.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Classifying Active and Past Projects in the User Interface&lt;/strong>&lt;br>
To improve navigation and search scoping, we plan to introduce three project sections in the interface: &lt;strong>Pinned&lt;/strong>, &lt;strong>Active&lt;/strong>, and &lt;strong>Past&lt;/strong> projects. This classification will help users prioritize relevant content while enabling smarter indexing strategies.&lt;/p>
&lt;/li>
&lt;/ul>
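&lt;p>To make the weighting challenge concrete, here is a small sketch of how match types could be ranked. The weight values and the &lt;code>scoreTerm&lt;/code> helper are hypothetical and are not taken from any of the libraries under evaluation; a fuzzy match is assumed here to mean an edit distance of exactly one.&lt;/p>

```typescript
// Hypothetical weights: exact matches outrank prefix matches, which outrank fuzzy ones.
const WEIGHTS = { exact: 3, prefix: 2, fuzzy: 1 };

// Levenshtein edit distance, computed one row at a time.
function editDistance(a: string, b: string): number {
  const charsB = b.split("");
  let previous: number[] = [0];
  for (const _ of charsB) previous.push(previous.length); // first row: 0, 1, 2, ...
  for (const charA of a.split("")) {
    const current: number[] = [previous[0] + 1];
    charsB.forEach((charB, j) => {
      const substitution = previous[j] + (charA === charB ? 0 : 1);
      current.push(Math.min(substitution, previous[j + 1] + 1, current[j] + 1));
    });
    previous = current;
  }
  return previous[b.length];
}

// Score a single indexed term against the query, checking the strongest match type first.
function scoreTerm(query: string, term: string): number {
  const q = query.toLowerCase();
  const t = term.toLowerCase();
  if (t === q) return WEIGHTS.exact;
  if (t.indexOf(q) === 0) return WEIGHTS.prefix;
  if (editDistance(q, t) === 1) return WEIGHTS.fuzzy;
  return 0;
}

console.log(scoreTerm("index", "index"));    // exact match
console.log(scoreTerm("index", "indexing")); // prefix match
console.log(scoreTerm("indez", "index"));    // one-character typo
console.log(scoreTerm("index", "search"));   // no match
```

&lt;p>Checking match types from strongest to weakest guarantees that a term which qualifies as both an exact and a prefix match receives the higher weight.&lt;/p>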
&lt;p>Stay tuned for the next blog!&lt;/p></description></item><item><title>StatWrap: Cross-Project Searching and Classification using Local Indexing</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/northwestern/statwrap/20250614-debangi29/</link><pubDate>Sat, 14 Jun 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/northwestern/statwrap/20250614-debangi29/</guid><description>&lt;p>Hello👋! I am &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/debangi-ghosh/">Debangi Ghosh&lt;/a>, currently pursuing a degree in Mathematics and Computing at IIT (BHU) Varanasi, India. This summer, I will be working on the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/northwestern/statwrap/">StatWrap: Cross-Project Searching and Classification using Local Indexing&lt;/a> project under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/luke-rasmussen/">Luke Rasmussen&lt;/a>. You can view my &lt;a href="https://drive.google.com/file/d/1dxyBP2oMJwYDCKyIWzr465zNmm6UWtnI/view?usp=sharing" target="_blank" rel="noopener">project proposal&lt;/a> for more details.&lt;/p>
&lt;p>My project aims to address the challenges in project navigation and discoverability by integrating a robust full-text search capability within the user interface. Instead of relying on basic keyword-based search, where remembering exact terms can be difficult, we plan to implement a natural language-based full-text search. This approach involves two main stages: indexing, which is like creating a searchable map of the content, and searching, which retrieves relevant information from that map. We will evaluate and compare the available open-source libraries, then choose and implement the most effective one.
In addition, my project aims to enhance project organization by introducing a new classification system that clearly distinguishes between “Active” and “Past” projects in the user interface. This will improve clarity, reduce clutter, and provide a more streamlined experience as the number of projects grows.&lt;/p>
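&lt;p>For readers new to the idea, the two stages can be illustrated with a tiny inverted index. This is only a conceptual sketch with made-up documents and hypothetical helper names; the libraries we evaluate use more sophisticated data structures and ranking.&lt;/p>

```typescript
// Maps each term to the ids of the documents that contain it.
type InvertedIndex = { [term: string]: number[] };

function tokenize(text: string): string[] {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter((term) => term.length > 0);
}

// Stage 1 (indexing): build the searchable map of the content.
function buildIndex(documents: string[]): InvertedIndex {
  // A prototype-free object, so terms such as "constructor" are safe keys.
  const index: InvertedIndex = Object.create(null);
  documents.forEach((text, id) => {
    for (const term of tokenize(text)) {
      const ids = index[term] ?? [];
      if (ids.indexOf(id) === -1) ids.push(id);
      index[term] = ids;
    }
  });
  return index;
}

// Stage 2 (searching): return ids of documents containing every query term.
function search(index: InvertedIndex, query: string): number[] {
  const terms = tokenize(query);
  if (terms.length === 0) return [];
  return terms
    .map((term) => index[term] ?? [])
    .reduce((matches, ids) => matches.filter((id) => ids.indexOf(id) !== -1));
}

// Made-up documents for demonstration.
const index = buildIndex([
  "Survey data cleaning notes",
  "Regression model for survey data",
  "Meeting notes",
]);
console.log(search(index, "survey data"));
console.log(search(index, "notes"));
```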
&lt;p>Stay tuned for updates on my progress in the coming weeks! 🚀&lt;/p></description></item></channel></rss>