{"id":4672,"date":"2025-06-27T14:49:34","date_gmt":"2025-06-27T14:49:34","guid":{"rendered":"https:\/\/truelogic.org\/wordpress\/?p=4672"},"modified":"2026-04-08T03:00:36","modified_gmt":"2026-04-08T03:00:36","slug":"converting-text-into-contextual-search-cosine-similarity","status":"publish","type":"post","link":"https:\/\/truelogic.org\/wordpress\/2025\/06\/27\/converting-text-into-contextual-search-cosine-similarity\/","title":{"rendered":"Converting Text Into Contextual Search : Cosine Similarity"},"content":{"rendered":"\n<p>The previous blog post looked at the basics of <a href=\"https:\/\/truelogic.org\/wordpress\/2025\/06\/25\/converting-text-into-contextual-search-understanding-tf-idf\/\">converting text into vectors using TF-IDF<\/a>.<\/p>\n\n\n\n<p>From TF-IDF we can now use Cosine Similarity to find documents which match a query.<\/p>\n\n\n\n<p>In the previous post we used the example of extracting text from a PDF. We converted words into vectors from the pdf. Now when we put a query, it takes each word from that query and converts it into a vector. Then it sees how close the query vector matches the database of vectors we have.<\/p>\n\n\n\n<p>Cosine similarity is the dot product of the vectors divided by their magnitude. For example, if we have two vectors, A and B, the similarity between them is calculated as:<\/p>\n\n\n\n<figure class=\"wp-block-image size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"587\" height=\"88\" src=\"https:\/\/truelogic.org\/wordpress\/wp-content\/uploads\/2025\/06\/cosine.png\" alt=\"\" class=\"wp-image-4673\" style=\"width:840px;height:auto\" srcset=\"https:\/\/truelogic.org\/wordpress\/wp-content\/uploads\/2025\/06\/cosine.png 587w, https:\/\/truelogic.org\/wordpress\/wp-content\/uploads\/2025\/06\/cosine-300x45.png 300w\" sizes=\"auto, (max-width: 587px) 100vw, 587px\" \/><\/figure>\n\n\n\n<p>The similarity can take values between -1 and +1. 
Smaller angles between vectors produce larger cosine values, indicating greater cosine similarity. For example:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>When two vectors have the same orientation, the angle between them is 0, and the cosine similarity is 1.<\/li>\n\n\n\n<li>Perpendicular vectors have a 90-degree angle between them and a cosine similarity of 0.<\/li>\n\n\n\n<li>Opposite vectors have an angle of 180 degrees between them and a cosine similarity of -1.<\/li>\n<\/ul>\n\n\n\n<p>Graphically, it can be shown as below:<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"940\" height=\"236\" src=\"https:\/\/truelogic.org\/wordpress\/wp-content\/uploads\/2025\/06\/cosine2-940x236.jpg\" alt=\"\" class=\"wp-image-4675\" srcset=\"https:\/\/truelogic.org\/wordpress\/wp-content\/uploads\/2025\/06\/cosine2-940x236.jpg 940w, https:\/\/truelogic.org\/wordpress\/wp-content\/uploads\/2025\/06\/cosine2-620x156.jpg 620w, https:\/\/truelogic.org\/wordpress\/wp-content\/uploads\/2025\/06\/cosine2-300x75.jpg 300w, https:\/\/truelogic.org\/wordpress\/wp-content\/uploads\/2025\/06\/cosine2-768x193.jpg 768w, https:\/\/truelogic.org\/wordpress\/wp-content\/uploads\/2025\/06\/cosine2-1536x386.jpg 1536w, https:\/\/truelogic.org\/wordpress\/wp-content\/uploads\/2025\/06\/cosine2.jpg 1920w\" sizes=\"auto, (max-width: 940px) 100vw, 940px\" \/><\/figure>\n\n\n\n<p>So based on this, if we put a query of &#8220;What is atmospheric pressure?&#8221; against the sample physics PDF, we get the top 5 vectors which have the best matching scores. 
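<\/p>\n\n\n\n<p>The retrieval step itself only takes a few lines. Here is a hedged sketch using scikit-learn, with a toy corpus standing in for the per-page text of the PDF (the post&#8217;s actual corpus, and therefore its indices and scores, will differ):<\/p>

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-in for the per-page text extracted from the PDF.
pages = [
    "atmospheric pressure decreases as altitude increases",
    "newton's laws of motion describe how forces act on bodies",
    "a barometer measures the pressure of the atmosphere",
    "light refracts when passing between two media",
    "thermodynamics is the study of heat and energy transfer",
    "atmospheric pressure at sea level is about 101 kilopascals",
]

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(pages)      # one TF-IDF vector per page

query_vector = vectorizer.transform(["What is atmospheric pressure"])
scores = cosine_similarity(query_vector, doc_vectors).ravel()

top5 = np.argsort(scores)[::-1][:5]                # 5 best-matching page indices
print("Top related indices:", top5)
print("Corresponding cosine similarities:", scores[top5])
```

\n\n\n\n<p>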
Each index is the page number in the PDF where the match was found:<\/p>\n\n\n\n<p><strong>Querying top 5 documents for [&#8216;What is atmospheric pressure&#8217;]<br>Top related indices:<br>[ 935 &nbsp;652 1126 1242 &nbsp;220]<br>Corresponding cosine similarities<br>[0.34417706 0.28845559 0.28715727 0.23502315 0.22972955]<\/strong><\/p>\n\n\n\n<p>So what we have now is a very basic search engine which returns results based on the similarity of the query to the existing data. There are two big limitations to cosine similarity:<\/p>\n\n\n\n<p>1. It only considers the angle between two vectors, ignoring their magnitudes. As a result, a long page with many words can have the same importance as a short page with few words but similar content.<\/p>\n\n\n\n<p>2. It only measures word frequency and has no notion of the context or semantics of the words, making it a little dumb.<\/p>\n\n\n\n<p>For the model to gain intelligence, we will apply a BERT transformer to the data corpus. BERT (Bidirectional Encoder Representations from Transformers) was developed by Google. It pre-trains deep bidirectional representations from unlabeled text, conditioning on both the left and right contexts. By applying a BERT transformation, we take the context of a word into consideration by looking at the words to its left and right.<\/p>\n\n\n\n<p>So the word shadow carries completely different contexts in the two sentences &#8220;He looked at his shadow on the floor&#8221; and &#8220;He was a shadow of his former self&#8221;. 
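<\/p>\n\n\n\n<p>A TF-IDF model cannot see this difference. A small illustrative sketch with scikit-learn (not code from this post) shows that the two sentences still score as related, purely because they share surface tokens such as &#8220;shadow&#8221;:<\/p>

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "He looked at his shadow on the floor",
    "He was a shadow of his former self",
]

# A bag-of-words model gives "shadow" the same dimension in both sentences,
# so they register as similar even though the meanings are unrelated.
vectors = TfidfVectorizer().fit_transform(sentences)
score = cosine_similarity(vectors[0], vectors[1])[0, 0]
print(f"TF-IDF similarity: {score:.2f}")
```

\n\n\n\n<p>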
BERT is a contextual model which captures these relationships bidirectionally.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"940\" height=\"627\" src=\"https:\/\/truelogic.org\/wordpress\/wp-content\/uploads\/2025\/06\/masaaki-komori-yIHrNgkyTGc-unsplash-940x627.jpg\" alt=\"\" class=\"wp-image-4668\" srcset=\"https:\/\/truelogic.org\/wordpress\/wp-content\/uploads\/2025\/06\/masaaki-komori-yIHrNgkyTGc-unsplash-940x627.jpg 940w, https:\/\/truelogic.org\/wordpress\/wp-content\/uploads\/2025\/06\/masaaki-komori-yIHrNgkyTGc-unsplash-620x414.jpg 620w, https:\/\/truelogic.org\/wordpress\/wp-content\/uploads\/2025\/06\/masaaki-komori-yIHrNgkyTGc-unsplash-300x200.jpg 300w, https:\/\/truelogic.org\/wordpress\/wp-content\/uploads\/2025\/06\/masaaki-komori-yIHrNgkyTGc-unsplash-768x513.jpg 768w, https:\/\/truelogic.org\/wordpress\/wp-content\/uploads\/2025\/06\/masaaki-komori-yIHrNgkyTGc-unsplash-1536x1025.jpg 1536w, https:\/\/truelogic.org\/wordpress\/wp-content\/uploads\/2025\/06\/masaaki-komori-yIHrNgkyTGc-unsplash-2048x1367.jpg 2048w\" sizes=\"auto, (max-width: 940px) 100vw, 940px\" \/><\/figure>\n","protected":false},"excerpt":{"rendered":"<div class=\"mh-excerpt\"><p>The previous blog post looked at the basics of converting text into vectors using TF-IDF. 
From TF-IDF we can now use Cosine Similarity to find <a class=\"mh-excerpt-more\" href=\"https:\/\/truelogic.org\/wordpress\/2025\/06\/27\/converting-text-into-contextual-search-cosine-similarity\/\" title=\"Converting Text Into Contextual Search : Cosine Similarity\">[&#8230;]<\/a><\/p>\n<\/div>","protected":false},"author":1,"featured_media":4668,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[368],"tags":[],"class_list":["post-4672","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-gpu-and-ai"],"_links":{"self":[{"href":"https:\/\/truelogic.org\/wordpress\/wp-json\/wp\/v2\/posts\/4672","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/truelogic.org\/wordpress\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/truelogic.org\/wordpress\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/truelogic.org\/wordpress\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/truelogic.org\/wordpress\/wp-json\/wp\/v2\/comments?post=4672"}],"version-history":[{"count":6,"href":"https:\/\/truelogic.org\/wordpress\/wp-json\/wp\/v2\/posts\/4672\/revisions"}],"predecessor-version":[{"id":4709,"href":"https:\/\/truelogic.org\/wordpress\/wp-json\/wp\/v2\/posts\/4672\/revisions\/4709"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/truelogic.org\/wordpress\/wp-json\/wp\/v2\/media\/4668"}],"wp:attachment":[{"href":"https:\/\/truelogic.org\/wordpress\/wp-json\/wp\/v2\/media?parent=4672"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/truelogic.org\/wordpress\/wp-json\/wp\/v2\/categories?post=4672"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/truelogic.org\/wordpress\/wp-json\/wp\/v2\/tags?post=4672"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}