Every seasoned computer user has come across it at least once: a folder full of documents without any structure or order. Finding a document is only possible if you already know what you are looking for.
How can Data Science help create order in the chaos?
Before we delve into the technical details, let’s have a look at the historical background of this problem. Ever since the beginning of the computer era, researchers have studied how to organize documents and texts. For example, one of the basic principles of working with documents goes back to 1957!
That year, researcher Hans Peter Luhn laid the foundation, with his work on term frequency, for one of the most widely used techniques for working with documents: TF-IDF.
Despite its age, TF-IDF is still commonly used in a wide variety of applications.
The Technology Demystified
The full name is Term Frequency – Inverse Document Frequency. TF-IDF measures how often a word occurs in a text and how rare that word is across the other available documents.
Words that appear frequently in one document but rarely in the other documents most likely give a good description of that document’s topic. By exploiting this phenomenon, the words that best describe the document can be identified.
And finally, these words can then be used to organize documents!
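To make the idea concrete, here is a minimal sketch of the scoring in plain Python. It assumes the most basic variant: term frequency is a word's share of its document, and inverse document frequency is the logarithm of the total number of documents divided by the number of documents containing the word. The function names (`tokenize`, `tf_idf`, `top_words`) and the sample file names are made up for illustration, and real-world libraries usually add smoothing and normalization on top of this.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase the text and keep runs of letters only (a deliberately crude tokenizer).
    return re.findall(r"[a-z]+", text.lower())


def tf_idf(docs):
    """docs maps a document name to its text; returns name -> {word: score}."""
    tokens = {name: tokenize(text) for name, text in docs.items()}
    n_docs = len(tokens)

    # Document frequency: in how many documents does each word occur?
    doc_freq = Counter()
    for words in tokens.values():
        doc_freq.update(set(words))

    scores = {}
    for name, words in tokens.items():
        counts = Counter(words)
        total = len(words) or 1  # avoid dividing by zero for empty documents
        scores[name] = {
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        }
    return scores


def top_words(word_scores, k=3):
    # Highest score first; ties are broken alphabetically to keep the result stable.
    return sorted(word_scores, key=lambda w: (-word_scores[w], w))[:k]


# Hypothetical folder contents, for illustration only.
example_docs = {
    "a.txt": "lasagna lasagna pasta the oven",
    "b.txt": "the invoice tax payment",
    "c.txt": "the meeting agenda notes",
}
example_scores = tf_idf(example_docs)
```

In this toy folder, "the" occurs in every document, so its inverse document frequency is log(1) = 0 and it drops out as a descriptive word, while "lasagna" is both frequent in its own document and absent elsewhere.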
Let’s go back to our example: a folder full of documents with different contents and formats. Of course, we did not always make the effort to give the files very descriptive names. Now that we have collected all these documents, it takes us ages to find our favourite lasagna recipe.
By using TF-IDF we can get the words that best describe each document. By simply matching these words we can then group and organize the documents by their contents.
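The matching step could be sketched as follows. This assumes the most descriptive words per document have already been computed, and it uses one simple rule among many possible ones: a document joins a group when it shares at least `min_shared` keywords with that group. The function name `group_by_shared_keywords` and the sample file names are hypothetical.

```python
def group_by_shared_keywords(keywords, min_shared=1):
    """keywords maps a document name to its set of descriptive words.

    Returns a sorted list of groups, each group being a sorted list of names.
    """
    groups = []  # each entry is (set of document names, pooled set of keywords)
    for name in sorted(keywords):
        words = set(keywords[name])
        merged_names = {name}
        merged_words = set(words)
        untouched = []
        for names, group_words in groups:
            if len(group_words & words) >= min_shared:
                # The document overlaps with this group: merge them.
                merged_names |= names
                merged_words |= group_words
            else:
                untouched.append((names, group_words))
        untouched.append((merged_names, merged_words))
        groups = untouched
    return sorted(sorted(names) for names, _ in groups)


# Hypothetical keywords, as they might come out of a TF-IDF step.
example_keywords = {
    "doc1.txt": {"lasagna", "oven", "cheese"},
    "doc2.txt": {"invoice", "tax"},
    "doc3.txt": {"pasta", "oven", "tomato"},
    "doc4.txt": {"tax", "payment"},
}
example_groups = group_by_shared_keywords(example_keywords)
```

With `min_shared=1` a single common word is enough to link two documents, which can chain unrelated documents together in larger collections; raising the threshold or comparing full TF-IDF vectors is a common refinement.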
Once these documents are grouped, we can apply TF-IDF again to each group and use the most descriptive words to suggest tags for each document!
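One possible reading of this tagging step, sketched below, is to pool the text of each group into a single large "document" and score the pooled texts against each other, so that the words distinguishing one group from the rest become its tags. This interpretation, the function name `suggest_tags`, and the sample groups are assumptions made for illustration.

```python
import math
import re
from collections import Counter


def suggest_tags(groups, k=2):
    """groups maps a group name to a list of document texts; returns name -> tags."""
    # Pool every group's documents into one bag of words.
    pooled = {
        name: Counter(re.findall(r"[a-z]+", " ".join(texts).lower()))
        for name, texts in groups.items()
    }
    n_groups = len(pooled)

    # Count in how many groups each word occurs.
    group_freq = Counter()
    for counts in pooled.values():
        group_freq.update(counts.keys())

    tags = {}
    for name, counts in pooled.items():
        total = sum(counts.values()) or 1
        scored = {
            word: (count / total) * math.log(n_groups / group_freq[word])
            for word, count in counts.items()
        }
        # Highest score first, alphabetical order on ties.
        tags[name] = sorted(scored, key=lambda w: (-scored[w], w))[:k]
    return tags


# Hypothetical groups, for illustration only.
example_groups = {
    "recipes": ["lasagna with cheese", "lasagna and pasta with tomato"],
    "finance": ["tax invoice and payment", "tax return with invoice"],
}
example_tags = suggest_tags(example_groups)
```

Words shared by every group, such as "with" and "and" here, score zero and never become tags, which is exactly the behaviour that makes the result easy to inspect.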
Due to its simplicity, TF-IDF provides a strong base for many other methods that deal with textual data.
It is also a very transparent technique, so we can judge quite well whether it will work for another use case and how to get the most value out of it.
Applications in the Real World
Now perhaps bringing order to a single folder on a single computer is not the most worthwhile of tasks, but when you start to think outside of the box you may come across some very interesting applications.
One example is grouping client questions to compose a list of frequently asked questions. Instead of guessing which questions are common, you can serve your customers more precisely by tailoring your support information to the questions that really matter. Another practical example is finding the most closely related question in order to suggest a solution. And perhaps not the first thing that comes to mind, but how about grouping items in a catalogue?
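Finding the most closely related question could look roughly like this. The sketch assumes each question is turned into a vector of TF-IDF weights and that closeness is measured with cosine similarity, a common but not the only choice. All function names and the sample questions are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def vectorize(words, idf):
    # Words that never occur in the known questions are ignored.
    counts = Counter(w for w in words if w in idf)
    total = sum(counts.values()) or 1
    return {w: (c / total) * idf[w] for w, c in counts.items()}


def cosine(a, b):
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def closest_question(query, questions):
    """Returns the known question most similar to the query, plus its similarity."""
    tokenized = [tokenize(q) for q in questions]
    doc_freq = Counter()
    for words in tokenized:
        doc_freq.update(set(words))
    idf = {w: math.log(len(questions) / d) for w, d in doc_freq.items()}

    vectors = [vectorize(words, idf) for words in tokenized]
    query_vector = vectorize(tokenize(query), idf)
    similarities = [cosine(query_vector, v) for v in vectors]
    best = max(range(len(questions)), key=lambda i: similarities[i])
    return questions[best], similarities[best]


# Hypothetical support questions, for illustration only.
known_questions = [
    "How do I reset my password?",
    "Where can I download my invoice?",
    "How do I change my delivery address?",
]
match, similarity = closest_question(
    "I forgot my password, how can I reset it?", known_questions
)
```

Because the weights are plain word scores, it is easy to see why a match was chosen: here the rare words "reset" and "password" dominate, while "I" and "my" occur in every known question and contribute nothing.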
Now that you have seen some examples, how would you apply document organization to your own problems? We are excited to hear about your ideas, and we will happily give some thought to their feasibility!