Keyword assignment with Python

KeyBERT

Keyword extraction is the automatic identification of terms that best describe the subject of a document (Wikipedia, 2021). Generally, we can split keyword extraction methods into two groups:

In this brief article I explore a simple keyword assignment algorithm using Python. The reason for this post is that, while trying to find a valid imputer for the missing keywords in the NYT articles (GitHub), I stumbled upon multiple solutions that most of the time worked inconsistently, if not quite poorly. My main requirement was that the keyword extractor be able to use candidate keywords generated by other keyword algorithms or retrieved from a select list of possible keywords/keyphrases, since I was already dealing with a large number of different keywords (~4.5k).

I ended up using KeyBERT (more resources and insights on different algorithms can be found here). KeyBERT is a minimal and easy-to-use keyword extraction technique that leverages BERT embeddings to create keywords and keyphrases that are most similar to a document (Grootendorst, 2021). BERT stands for Bidirectional Encoder Representations from Transformers; it is a language representation model designed to pre-train deep bidirectional representations from unlabelled text by jointly conditioning on both left and right context in all layers (Devlin et al., 2018). More about BERT can be found here.
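The idea of picking the candidates "most similar to a document" can be illustrated without any model at all. The sketch below is not KeyBERT's code: it swaps the BERT embeddings for a toy bag-of-words vector (an assumption made purely so the example runs with the standard library) and ranks candidates by cosine similarity to the document. The function names `embed`, `cosine` and `rank_candidates` are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (a stand-in for a BERT vector)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_candidates(doc, candidates, top_n=3):
    """Return the top_n (candidate, score) pairs most similar to the document."""
    doc_vec = embed(doc)
    scored = [(c, cosine(doc_vec, embed(c))) for c in candidates]
    # Highest similarity first; ties keep their original order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


doc = "Foreign travelers to the United States must be vaccinated"
candidates = ["United States", "Floods", "Hong Kong"]
ranked = rank_candidates(doc, candidates, top_n=2)
print(ranked)
```

With word counts, only candidates that share words with the document get a non-zero score. Dense embeddings such as BERT's are meant to relax that limitation, so that a label can score well even when its exact words do not appear in the text.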

About KeyBERT

Pros of KeyBERT:

Cons:

To demonstrate the potential of KeyBERT, I will use a snippet of code from my study and let you play with it. But first, here's a brief description of the parameters:

Extracting keywords from the following text:

"The Biden administration plans to require most foreign visitors to be vaccinated. Biden Plans New Policy Requiring That All Foreign Travelers to U.S. Be Vaccinated The Biden administration is developing plans to require all foreign travelers to the United States to be vaccinated against Covid-19, with limited exceptions, according to an administration official with knowledge of the developing policy. Officials say the new policy is being readied in the event that the United States eases its travel rules, which isn't expected soon."

['Politics and Government', 'Global Warming', 'United Nations', 'Johnson, Boris', 'Great Britain', 'Coronavirus (2019-nCoV)', 'Quarantine (Life and Culture)', 'Putin, Vladimir V', 'Travel and Vacations', 'Quarantines', 'Deaths (Fatalities)', 'Vaccination and Immunization', 'Demonstrations, Protests and Riots', 'Muslims and Islam', 'Terrorism', 'Defense and Military Forces', 'Human Rights and Human Rights Violations', 'Mexico', 'United States International Relations', 'United States Politics and Government', 'Biden, Joseph R Jr', 'Palestinians', 'Gaza Strip', 'Israel', 'Hamas', 'Economic Conditions and Trends', 'China', 'Social Media', 'Communist Party of China', 'Law and Legislation', 'Shortages', 'Floods', 'Civilian Casualties', 'Afghanistan War (2001- )', 'Hospitals', 'Taliban', 'AFGHANISTAN', 'Kabul (Afghanistan)', 'Brazil', 'War and Armed Conflicts', 'Refugees and Displaced Persons', 'Russia', 'International Relations', 'Australia', 'Sex Crimes', 'South Korea', 'War Crimes, Genocide and Crimes Against Humanity', 'World Health Organization', 'India', 'Embargoes and Sanctions', 'United States Defense and Military Forces', 'Iran', 'Legislatures and Parliaments', 'Disease Rates', 'Italy', 'Germany', 'Merkel, Angela', 'Europe', 'Elections', 'Great Britain Withdrawal from EU (Brexit)', 'International Trade and World Market', 'European Union', 'Macron, Emmanuel (1977- )', 'France', 'Corruption (Institutional)', 'Canada', 'Women and Girls', 'Immigration and Emigration', 'Discrimination', 'Content Type: Personal Profile', 'Japan', 'Shutdowns (Institutional)', "Coups D'Etat and Attempted Coups D'Etat", 'Modi, Narendra', 'Roman Catholic Church', 'Political Prisoners', 'Assassinations and Attempted Assassinations', 'Myanmar', 'Evacuations and Evacuees', 'Murders, Attempted Murders and Homicides', 'News and News Media', 'Coronavirus Reopenings', 'United States', 'London (England)', 'England', 'Belarus', 'Poland', 'Pfizer Inc', 'Trump, Donald J', 'Freedom of the Press', 
'Hong Kong', 'Islamic State in Iraq and Syria (ISIS)', 'Royal Families', 'AFRICA', 'Navalny, Aleksei A', 'Haiti', 'Netanyahu, Benjamin', "Women's Rights", 'AstraZeneca PLC', 'Afghan National Security Forces', 'Francis', 'Hong Kong Protests (2019)', 'internal-essential']
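A minimal sketch of how the text and the candidate list above could be passed to KeyBERT is shown here. This is not the snippet from the original study. It uses only `KeyBERT()` and `extract_keywords` with its `candidates` and `top_n` arguments from the `keybert` package. The helper names `clean_candidates` and `assign_keywords` are hypothetical, and the default embedding model is assumed to be acceptable.

```python
def clean_candidates(candidates):
    """Strip whitespace and drop empty or duplicate labels, keeping order."""
    seen = set()
    cleaned = []
    for item in candidates:
        label = item.strip()
        if label and label not in seen:
            seen.add(label)
            cleaned.append(label)
    return cleaned


def assign_keywords(doc, candidates, top_n=5):
    """Rank a fixed list of candidate labels against doc with KeyBERT.

    Returns a list of (keyword, score) tuples as produced by KeyBERT.
    The import is done lazily so the helper above works without keybert.
    """
    from keybert import KeyBERT  # assumes `pip install keybert`

    kw_model = KeyBERT()  # loads the library's default embedding model
    return kw_model.extract_keywords(
        doc,
        candidates=clean_candidates(candidates),
        top_n=top_n,
    )
```

Calling `assign_keywords(text, candidate_list, top_n=6)` would download the embedding model on first use. How candidates are handled may differ between KeyBERT releases (for example, whether labels that do not literally occur in the document are still scored), so it is worth checking the documentation of the installed version before relying on the output.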

Example Keyword Extraction Results

Using KeyBERT with the following parameters on the sample text:

Based on the sample text about Biden's vaccination policy for foreign travelers, KeyBERT would extract keywords such as:

Extracted Keywords:
['Biden, Joseph R Jr', 'Vaccination and Immunization', 'United States Politics and Government', 'Travel and Vacations', 'Coronavirus (2019-nCoV)', 'Politics and Government']

Note: In the original Flask application, users could interactively adjust these parameters and see real-time keyword extraction results. This static version shows example output.

References