This article will go into detail about what an ‘inverted index’ is as well as explain how to build one.
What is it?
Alternate names for ‘inverted index’ are ‘postings file’ and ‘inverted file’. In computer science, this is an index data structure that stores a mapping from content, such as words or numbers, to its locations within a document or set of documents. This is in contrast to a ‘forward index’, which maps from documents to content. Put simply, it’s a hashmap-like data structure that guides you from a word to a document or a web page.
The objective of an inverted index is to permit quick full-text searches, at the cost of increased processing whenever a document is added to the database. The inverted file may be the database file itself rather than its index. It is the most popular data structure used in document retrieval systems, and it is especially useful at large scale, in search engines for instance.
Moreover, there are various general-purpose mainframe-based database management systems that utilize inverted list architectures. Such systems include the likes of ADABAS, DATACOM/DB, and Model 204.
The primary variants
There are two main types of inverted indexes:
- A record-level inverted index – Alternatively an ‘inverted file index’ or just ‘inverted file’. This variant contains a list of references to documents for each word.
- A word-level inverted index – Alternatively ‘full inverted index’ or ‘inverted list’. This variant also contains the positions of each word within a document. This form provides more functionality (like phrase searches). However, it requires additional processing power and space to create.
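To make the difference between the two variants concrete, here is a minimal sketch in Python. The sample documents, the variable names, and the `phrase_search` helper are assumptions made for illustration; they are not taken from any particular search engine.

```python
from collections import defaultdict

# Hypothetical sample documents, keyed by document ID.
docs = {1: "new york city", 2: "york is a new city"}

record_level = defaultdict(set)   # word -> set of document IDs
word_level = defaultdict(list)    # word -> list of (document ID, word position)

for doc_id, text in docs.items():
    for pos, word in enumerate(text.split(), start=1):
        record_level[word].add(doc_id)
        word_level[word].append((doc_id, pos))


def phrase_search(first, second):
    """Return IDs of documents where `second` directly follows `first`."""
    following = set(word_level.get(second, []))
    return sorted(
        {doc for doc, pos in word_level.get(first, []) if (doc, pos + 1) in following}
    )


print(sorted(record_level["new"]))    # documents that contain "new" anywhere
print(phrase_search("new", "york"))   # documents that contain the phrase "new york"
```

The record-level index can only say that both documents contain "new" and "york". The word-level index also knows the positions, so it can tell that the two words are adjacent only in the first document.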
There is a comparatively straightforward way to explain this concept, and the example will likely be familiar. All it takes is opening a textbook, any textbook, to the index section.
If you need information about a specific topic, the logical step is to open the index and search for your keyword. If you find it, the entry gives you a set of page numbers where the topic is discussed. The textbook index works as an inverted index: it directs you from a word to the pages where that word appears.
Look at it this way: finding that page with a linear search, reading page by page, could take hours. With the index, it takes mere seconds.
From here, let’s look at a more technical example. Imagine that you want to search the following texts: “hi everyone,” “this article is based on inverted index,” “which is hashmap like data structure.” If we record each occurrence as (text number, word position within the text), then the index is:
hi – (1, 1)
everyone – (1, 2)
this – (2, 1)
article – (2, 2)
is – (2, 3); (3, 2)
based – (2, 4)
on – (2, 5)
inverted – (2, 6)
index – (2, 7)
which – (3, 1)
hashmap – (3, 3)
like – (3, 4)
data – (3, 5)
structure – (3, 6)
Upon first glance, this may appear confusing. However, when you look closely at the arrangement, this seemingly random set of numbers makes much more sense.
The texts are essentially separate sentences, otherwise known as ‘documents’. The example we are using has a total of three texts. The first document is the text “hi everyone.” The word “hi” exists within document #1 starting at word 1, therefore it has an entry of (1, 1).
The word “is” exists within two documents: document #2 and #3. It is at the 3rd and 2nd positions respectively, hence (2, 3) and (3, 2). In this specific case, position is counted in words rather than characters.
It is important to note that index entries will sometimes also carry weights, frequencies, or other indicators.
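The listing above can be reproduced with a few lines of Python. This is a minimal sketch: the function name `build_positional_index` is an assumption, and the texts are split on whitespace only, with no punctuation handling.

```python
from collections import defaultdict

texts = [
    "hi everyone",
    "this article is based on inverted index",
    "which is hashmap like data structure",
]


def build_positional_index(texts):
    """Map each word to a list of (text number, word position), both 1-based."""
    index = defaultdict(list)
    for text_no, text in enumerate(texts, start=1):
        for word_no, word in enumerate(text.split(), start=1):
            index[word].append((text_no, word_no))
    return dict(index)


index = build_positional_index(texts)
for word, postings in index.items():
    print(word, "–", "; ".join(str(p) for p in postings))
```

Each printed line has the same shape as the listing, for example `is – (2, 3); (3, 2)`.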
How to build an inverted index
Building an inverted index for a search system requires you to carry out a series of steps while parsing the pages or documents.
Let’s assume you want to create a search engine for all of the documents on your computer. In this situation, you know what you are looking for. You run a program that walks the whole directory tree on your hard disks and collects the files that you want. You know that files in mp3 and jpeg format are of no use to you in this particular case. Instead, you ask your program to retrieve files in txt, doc, and pdf format.
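The collection pass described above might look like the following sketch. The function name `find_documents` and the `WANTED` set are assumptions chosen for this example; only the standard-library `os.walk` and `os.path` calls are used.

```python
import os

# Extensions worth indexing; everything else (mp3, jpeg, ...) is skipped.
WANTED = {".txt", ".doc", ".pdf"}


def find_documents(root):
    """Yield the path of every file under `root` whose extension is in WANTED."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            extension = os.path.splitext(name)[1].lower()
            if extension in WANTED:
                yield os.path.join(dirpath, name)
```

Calling `find_documents(os.path.expanduser("~"))` would walk your home directory and yield one path at a time, so the indexer can start working before the walk has finished.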
As soon as you receive a document, you can proceed to the next step.
Step #1 – Retrieve the document
If you receive a text file (.txt), the job is simple. However, if you receive a Word document (.doc) or a pdf, you will need to parse it, which you can do with a library that extracts its text. For the sake of this hypothetical scenario, let’s say that you manage to read the text successfully.
Step #2 – Remove the stop words
For this step, take the last section into consideration and ask which words carried its meaning. Only a small number of them did.
Most of the other words are a waste to index. You designate these frequently recurring words as “stop words” and remove them, so that you don’t build index entries for words like “you,” “the,” and “is,” among others. In regular use, a stop-word list typically holds somewhere between 500 and 1,000 words, though the exact size depends on the application.
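As a rough illustration, stop-word removal can be a simple set lookup. The `STOP_WORDS` set below is a tiny hypothetical sample, far shorter than a real list, and the tokenizer is just lowercasing plus a whitespace split.

```python
# A tiny, hypothetical stop-word list; real lists hold hundreds of words.
STOP_WORDS = {"you", "the", "is", "a", "an", "of", "and", "to", "in"}


def remove_stop_words(text):
    """Lowercase the text, split on whitespace, and drop stop words."""
    return [word for word in text.lower().split() if word not in STOP_WORDS]


print(remove_stop_words("You retrieve the document and index it"))
# prints ['retrieve', 'document', 'index', 'it']
```

Using a set keeps each membership check at roughly constant time, which matters when every word of every document passes through this filter.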
Step #3 – Stem to the root word
The next step is stemming. When you search for “retrieval”, you want to see documents that contain information about it. However, there is a catch: the word in the document may be “retrieve,” not “retrieval.” To relate the two words, you remove a certain part of each word that you read, usually a suffix, so that you obtain the ‘root word’. “Retrieve” may end up becoming “retriev,” as will “retrieval.”
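The sketch below shows the idea with a deliberately naive suffix stripper. The `SUFFIXES` tuple and the `min_stem` parameter are assumptions for illustration only; production systems use a full stemming algorithm such as Porter’s, which has many more rules.

```python
# Toy suffix list, checked in order; NOT a complete stemming algorithm.
SUFFIXES = ("ing", "al", "ed", "es", "s", "e")


def stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least `min_stem` letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


print(stem("retrieve"), stem("retrieval"))  # both reduce to "retriev"
```

Because both forms reduce to the same root, a query for one form finds documents that contain the other.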
Step #4 – Keep a record of document IDs
This step is the main task: indexing. Every document that you have has a unique document ID. Whenever you come across a stemmed word that is not a stop word, you record it in memory in the following form:
retriev ==> docID104007
If you end up getting the same word in a separate document, then you may write:
retriev ==> docID104007
retriev ==> docID154033
However, you will eventually have to combine them together in a single list:
retriev ==> docID104007&docID154033
You can improve this further by recording how many times the word occurs in each document. With these counts, you can rank the more important documents higher when retrieving.
retriev ==> docID104007|5|&docID154033|2|
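A possible in-memory form of these postings is a dictionary of dictionaries, sketched below. The helper names `index_document` and `format_postings` are hypothetical, and the output format simply mirrors the notation used above.

```python
from collections import Counter, defaultdict


def index_document(index, doc_id, terms):
    """Record how often each term occurs in the document `doc_id`."""
    for term, count in Counter(terms).items():
        index[term][doc_id] = count


def format_postings(term, index):
    """Render one term's postings in the notation used in this article."""
    parts = [f"docID{doc_id}|{count}|" for doc_id, count in sorted(index[term].items())]
    return f"{term} ==> " + "&".join(parts)


index = defaultdict(dict)   # term -> {document ID: occurrence count}
index_document(index, 104007, ["retriev"] * 5 + ["index"])
index_document(index, 154033, ["retriev"] * 2)

print(format_postings("retriev", index))
# prints retriev ==> docID104007|5|&docID154033|2|

# Rank documents for a term by how often the term occurs in them.
ranked = sorted(index["retriev"], key=index["retriev"].get, reverse=True)
print(ranked)
```

Ranking by raw count is the simplest possible scheme; it is shown here only to illustrate why storing the frequency is useful.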
Step #5 – Merge the terms and then store them
At this point, when all is said and done, you save the merged postings in disk files. It is best to sort the index by word so that lookups are quick and easy.
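One way to store a sorted index is one term per line, which then allows a binary search over the terms after loading. The file layout (a JSON array per line) and the function names below are assumptions made for this sketch, not a standard format.

```python
import bisect
import json


def save_index(index, path):
    """Write one JSON line per term, in sorted term order."""
    with open(path, "w", encoding="utf-8") as f:
        for term in sorted(index):
            f.write(json.dumps([term, index[term]]) + "\n")


def load_index(path):
    """Read the file back into two parallel lists: terms and postings."""
    terms, postings = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            term, doc_ids = json.loads(line)
            terms.append(term)
            postings.append(doc_ids)
    return terms, postings


def lookup(terms, postings, term):
    """Binary-search the sorted term list; return [] if the term is absent."""
    i = bisect.bisect_left(terms, term)
    if i < len(terms) and terms[i] == term:
        return postings[i]
    return []
```

Because the terms are sorted, a lookup needs only a logarithmic number of comparisons instead of a scan over every term.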
The inverted index data structure is a key component of a conventional search engine indexing algorithm. A predominant goal of a search engine implementation is to optimize the speed of the query: find the documents where word X occurs. Once a forward index has been built, which stores the list of words per document, it is inverted to construct the inverted index.
Querying the forward index requires sequential iteration through each document and each word to verify a matching document. The time, memory, and processing resources needed to carry out such a query are not always practical. Rather than list the words per document, the inverted index data structure lists the documents per word.
Once the inverted index has been created, the query can be resolved by jumping directly to the word ID in the inverted index, by way of random access.
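The contrast between the two query styles can be sketched as follows. The sample `forward_index` and the function names `scan_forward` and `invert` are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical forward index: document -> list of words it contains.
forward_index = {
    "doc1": ["hi", "everyone"],
    "doc2": ["invert", "index", "search"],
    "doc3": ["hashmap", "index"],
}


def scan_forward(forward, word):
    """Answer a query the slow way: visit every document and check its words."""
    return [doc for doc, words in forward.items() if word in words]


def invert(forward):
    """Turn document -> words into word -> documents."""
    inverted = defaultdict(list)
    for doc, words in forward.items():
        for word in dict.fromkeys(words):   # each word once per document, order kept
            inverted[word].append(doc)
    return dict(inverted)


inverted_index = invert(forward_index)

print(scan_forward(forward_index, "index"))   # scans all three documents
print(inverted_index.get("index", []))        # a single dictionary lookup
```

Both calls return the same documents, but the second one does not touch any document that lacks the word.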
In the time before computers, concordances to important books were assembled by hand. These were effectively inverted indexes accompanied by a small amount of commentary, and they required a considerable amount of effort to produce.
Bioinformatics is an interdisciplinary field that develops methods and software tools for understanding biological data. It combines biology, computer science, information engineering, mathematics, and statistics to examine and interpret that data.
In this particular field, inverted indexes play an important part in the sequence assembly of short fragments of sequenced DNA. One way to figure out the source of a fragment is to search for it against a reference DNA sequence. Sometimes there is a small number of mismatches, which can come from differences between the sequenced DNA and the reference DNA, or simply from errors. Whatever the cause, it’s possible to account for them by dividing the fragment into smaller subfragments.
There is a strong likelihood that at least one subfragment will match the reference DNA sequence exactly. Matching requires constructing an inverted index of all substrings of a specific length from the reference DNA sequence.
Human DNA contains more than 3 billion base pairs. On top of that, for every index entry we must store a DNA substring, plus a 32-bit integer for the position itself. Because of these factors, the storage requirement for such an inverted index might be in the tens of gigabytes.
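A toy version of this substring index is sketched below. The reference string, the fragment, the substring length `k=4`, and the function names are all hypothetical values chosen so the example stays small; real tools work on far longer sequences and substrings.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every length-k substring of the reference to its start positions."""
    index = defaultdict(list)
    for start in range(len(reference) - k + 1):
        index[reference[start:start + k]].append(start)
    return index


def candidate_positions(fragment, index, k):
    """Split the fragment into length-k pieces and collect the reference
    positions where the whole fragment would start if a piece matched."""
    candidates = set()
    for offset in range(0, len(fragment) - k + 1, k):
        for hit in index.get(fragment[offset:offset + k], []):
            if hit - offset >= 0:
                candidates.add(hit - offset)
    return sorted(candidates)


reference = "ACGTTGCAAGCTTAGGCAT"   # hypothetical reference sequence
index = build_kmer_index(reference, k=4)

# Hypothetical read: reference[5:17] with one base changed (T -> A).
fragment = "GCAAGCTAAGGC"
print(candidate_positions(fragment, index, k=4))   # prints [5]
```

The middle piece contains the mismatch and finds nothing, but the first and last pieces both point to position 5. As for storage, 3 billion positions at 4 bytes each already come to about 12 GB before any substrings are stored, which is consistent with the estimate of tens of gigabytes.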
Advantages & Disadvantages to an Inverted Index
When it comes to the pros and cons of an inverted index, there are more advantages than disadvantages. That said, the primary disadvantage is a prominent one that deserves acknowledgment.
The advantages that come from an inverted index are the following:
- An inverted index allows for faster full-text searches, at the cost of increased processing whenever a document is added to the database.
- It is relatively easy to develop.
- As you may recall, it’s one of the more popular data structures that document retrieval systems utilize. It is an especially prominent tool on systems of a larger scale, like search engines.
As mentioned before, there really is only one dominant disadvantage, which is:
- Large storage overhead and high maintenance costs on each update, delete, and insert.