Keyword extraction is the automatic identification of the terms that best describe the subject of a document (Wikipedia, 2021). Generally, keyword extraction methods can be split into two groups:
In this brief article I explore a simple keyword assignment algorithm using Python. The reason for this post is that, while trying to find a valid imputer for the missing keywords in the NYT articles (GitHub), I stumbled upon multiple solutions that worked inconsistently, if not quite poorly, most of the time. My main requirement was that the keyword extractor be able to use candidate keywords generated by other keyword algorithms or retrieved from a select list of possible keywords/keyphrases, since I was already dealing with a large number of different keywords (~4.5k).
I ended up using KeyBERT (more resources and insights on different algorithms can be found here). KeyBERT is a minimal and easy-to-use keyword extraction technique that leverages BERT embeddings to create keywords and keyphrases that are most similar to a document (Grootendorst, 2021). BERT stands for Bidirectional Encoder Representations from Transformers; it is a language representation model designed to pre-train deep bidirectional representations from unlabelled text by jointly conditioning on both left and right context in all layers (Devlin et al., 2018). More about BERT can be found here.
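To make the idea of "keywords most similar to a document" concrete, here is a minimal sketch of ranking a predefined candidate list by cosine similarity to the document. It is not KeyBERT's implementation: plain word-count vectors stand in for BERT embeddings so the example runs with the standard library only, and the stop-word list, function names, and candidate list are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not the list KeyBERT uses).
STOP_WORDS = {"the", "a", "an", "to", "of", "that", "is", "be", "in", "all", "with", "and"}


def bag_of_words(text):
    """Toy stand-in for an embedding: lowercase word counts without stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidates(document, candidates, top_n=3):
    """Return the top_n (candidate, score) pairs most similar to the document."""
    doc_vec = bag_of_words(document)
    scored = [(c, cosine_similarity(doc_vec, bag_of_words(c))) for c in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


doc = ("The Biden administration is developing plans to require all foreign "
       "travelers to the United States to be vaccinated against Covid-19.")
# Hypothetical candidate list, e.g. keywords produced by another algorithm.
candidate_list = ["foreign travelers", "vaccinated", "stock market", "Biden administration"]
ranked = rank_candidates(doc, candidate_list, top_n=3)
for keyword, score in ranked:
    print(f"{keyword}: {score:.3f}")
```

With real sentence embeddings in place of word counts, candidates that share no words with the text can still score well, which is the main reason to prefer an embedding model over this toy version.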
Pros of KeyBERT:
To demonstrate the potential of KeyBERT, I will use a snippet of code from my study and let you play with it. But first, here's a brief description of the parameters:
"The Biden administration plans to require most foreign visitors to be vaccinated. Biden Plans New Policy Requiring That All Foreign Travelers to U.S. Be Vaccinated The Biden administration is developing plans to require all foreign travelers to the United States to be vaccinated against Covid-19, with limited exceptions, according to an administration official with knowledge of the developing policy. Officials say the new policy is being readied in the event that the United States eases its travel rules, which isn't expected soon."
Using KeyBERT with the following parameters on the sample text:
Based on the sample text about Biden's vaccination policy for foreign travelers, KeyBERT would extract keywords such as:
Note: In the original Flask application, users could interactively adjust these parameters and see real-time keyword extraction results. This static version shows example output.
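For readers who want to try a run like this outside the interactive application, a minimal sketch of a KeyBERT call could look like the following. The wrapper function `extract_with_keybert` is a hypothetical helper, and the parameter values are illustrative assumptions rather than the settings used in my study. It assumes the `keybert` package is installed, and the first call downloads a pretrained embedding model.

```python
def extract_with_keybert(text, candidates=None, top_n=5):
    """Return a list of (keyword, score) tuples for the given text.

    Hypothetical wrapper for illustration; requires `pip install keybert`.
    """
    # Imported lazily so the sketch can be loaded without the package installed.
    from keybert import KeyBERT

    model = KeyBERT()  # uses the library's default embedding model
    return model.extract_keywords(
        text,
        candidates=candidates,         # optional predefined keywords/keyphrases
        keyphrase_ngram_range=(1, 2),  # single words and two-word phrases (illustrative)
        stop_words="english",          # drop common English stop words
        top_n=top_n,                   # number of keywords to return
    )
```

Calling `extract_with_keybert(sample_text)` on the sample text returns the highest-scoring keywords with their similarity scores. Passing a list through `candidates` restricts the search to that list, which is the behaviour I needed for imputing keywords from an existing vocabulary.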