KeyBERT for Keyword Extraction
Whenever large amounts of data are generated, one of the biggest challenges can be organizing this data in a useful fashion. Being able to efficiently extract keywords from blocks of text can aid tremendously in this situation. While there are many methods and algorithms for accomplishing this task, I will be focusing on KeyBERT, which uses DistilBERT (a smaller version of the well-known BERT model) for keyword extraction. This solution is simple to install and use, making it a good choice for experimenting with deep-learning-based keyword extraction. For more in-depth install and usage instructions, or more information about KeyBERT, please visit its GitHub repo at https://github.com/MaartenGr/KeyBERT.
Setting up KeyBERT is very straightforward. Install Python and pip if you have not done so already, and then simply run pip install keybert. Once installation is complete, you are ready to go. Please note that if you run into issues with the PyTorch part of the install, you may need to install PyTorch version 1.2.0 or higher separately before installing KeyBERT.
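If you want to confirm that the install worked before writing any extraction code, a quick check along these lines can help. This is only a minimal sketch using the Python standard library; the helper name is_installed is my own and is not part of KeyBERT.

```python
import importlib.util


def is_installed(package_name):
    """Return True if a top-level package can be found on the import path."""
    return importlib.util.find_spec(package_name) is not None


if __name__ == "__main__":
    # "keybert" and "torch" are the import names of KeyBERT and PyTorch.
    for name in ("keybert", "torch"):
        status = "found" if is_installed(name) else "missing"
        print(f"{name}: {status}")
```

If either package is reported as missing, rerun the pip command for that package before moving on.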
To use KeyBERT, only a few lines of code are required. First, as with any Python library, you have to import it:
from keybert import KeyBERT
Then, create a variable to hold the text you wish to extract keywords from. Here’s my example which is meant to represent possible ailments a patient might report:
text = """
I have been having irregular heart beats and I feel weak at times. When I was walking my dog, I started feeling dizzy and almost fell over. Nothing has changed about my lifestyle to cause these changes so I don't know what's going on. I also have no appetite, and feel mildly depressed.
"""
Now that we have our text, it’s time to load the KeyBERT model. We do this using the line below:
model = KeyBERT('distilbert-base-nli-mean-tokens')
Finally, we extract the keywords using this model and print them using the following lines:
keywords = model.extract_keywords(text)
print(keywords)
Now, all that’s left to do is to run the script. This should print a Python list of keywords found in the text. Here’s the output I got from running the script with the example text above:
['depressed', 'dizzy', 'weak', 'irregular', 'fell']
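The list above is what I got back, but the exact shape of the result may depend on the KeyBERT version you have installed: some releases return (keyword, score) tuples instead of bare strings. A small helper, sketched below under that assumption, turns either shape into a plain list of keywords. The function name keyword_strings is hypothetical, and the scores in the second call are invented purely for illustration.

```python
def keyword_strings(keywords):
    """Return just the keyword text from either strings or (keyword, score) pairs."""
    result = []
    for item in keywords:
        if isinstance(item, str):
            result.append(item)
        else:
            # Assume a (keyword, score) pair and keep only the keyword.
            result.append(item[0])
    return result


# Plain strings, as in the output shown above.
print(keyword_strings(['depressed', 'dizzy', 'weak', 'irregular', 'fell']))

# Tuple form; these scores are made up for illustration only.
print(keyword_strings([('depressed', 0.5), ('dizzy', 0.4)]))
```

Passing the result of model.extract_keywords(text) through a helper like this keeps the rest of your script independent of which shape you get back.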
Once you have experimented with using KeyBERT on your sample text, take a minute to evaluate how useful the results were. If they look reasonable, KeyBERT can be run on all your data to organize it by keyword into various categories. Otherwise, you can try a statistical keyword extractor like RAKE (https://github.com/vgrabovets/multi_rake) or YAKE (https://github.com/LIAAD/yake). If you are feeling adventurous, you could also try to retrain the DistilBERT model on text from your target domain to see if this improves results.
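To make the idea of organizing data by keyword a bit more concrete, here is a rough sketch of an index that maps each keyword to the documents it was extracted from. The function group_by_keyword and the extract callback are my own illustrative names. In practice extract might wrap model.extract_keywords, but a trivial stand-in is used here so the example runs without downloading a model, and the two patient notes are made-up samples.

```python
from collections import defaultdict


def group_by_keyword(documents, extract):
    """Map each keyword to the IDs of the documents it was extracted from.

    documents: dict of document ID -> text
    extract:   callable taking a text and returning a list of keyword strings
    """
    index = defaultdict(list)
    for doc_id, text in documents.items():
        for keyword in extract(text):
            index[keyword].append(doc_id)
    return dict(index)


def toy_extract(text):
    """Stand-in extractor (not KeyBERT): flags a few hard-coded symptom words."""
    vocabulary = ("dizzy", "weak", "depressed")
    lowered = text.lower()
    return [word for word in vocabulary if word in lowered]


# Made-up sample notes, used only to exercise the function.
notes = {
    "patient_1": "I started feeling dizzy and weak.",
    "patient_2": "I feel mildly depressed and weak at times.",
}
print(group_by_keyword(notes, toy_extract))
```

Keeping the extractor as a parameter also makes it easy to swap in RAKE or YAKE later and compare the resulting groupings on the same documents.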