Natural Language Processing

Common terms

Tokenization: The process of breaking text into smaller units, like words or sentences.
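
A minimal, library-free sketch of word tokenization using a regular expression; real tokenizers (e.g., in NLTK or spaCy) handle many more edge cases:

```python
import re

def tokenize(text):
    # Keep runs of word characters, and each punctuation mark as its own token.
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("NLP breaks text into tokens, like this!"))
# ['NLP', 'breaks', 'text', 'into', 'tokens', ',', 'like', 'this', '!']
```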

Normalization: Transforming text into a consistent format, such as converting all characters to lowercase.
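
A small sketch of one possible normalization step, lowercasing and stripping accents with the standard library; what counts as "normalization" varies by pipeline:

```python
import unicodedata

def normalize(text):
    # Lowercase and remove accent marks so "Café" and "cafe" compare equal.
    text = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in text if not unicodedata.combining(ch))

print(normalize("Café MENU"))  # 'cafe menu'
```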

Stop Words: Common words (e.g., “and”, “the”, “is”) that are often removed from text to focus on more meaningful words.
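
A toy example of stop-word filtering; the stop list here is illustrative only, real lists (such as NLTK's) are much longer:

```python
# Tiny illustrative stop-word list, not a real one.
STOP_WORDS = {"and", "the", "is", "a", "of", "to"}

tokens = ["the", "cat", "is", "on", "the", "mat"]
content = [t for t in tokens if t not in STOP_WORDS]
print(content)  # ['cat', 'on', 'mat']
```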

Stemming: Reducing words to their root form by removing suffixes (e.g., “running” to “run”).
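
A minimal sketch using NLTK's Porter stemmer, assuming NLTK is installed (pip install nltk):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print(stemmer.stem("running"))  # 'run'
print(stemmer.stem("studies"))  # 'studi' -- stems are not always dictionary words
```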

Lemmatization: Similar to stemming, but uses vocabulary and morphological analysis to return the base or dictionary form of a word (e.g., “better” to “good”).
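
A sketch using NLTK's WordNet lemmatizer, assuming NLTK is installed and the WordNet corpus has been downloaded:

```python
# One-time setup: import nltk; nltk.download("wordnet")
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("better", pos="a"))   # 'good' (as an adjective)
print(lemmatizer.lemmatize("running", pos="v"))  # 'run' (as a verb)
```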

Part-of-Speech (POS) Tagging: Assigning parts of speech to each word in a sentence (e.g., noun, verb, adjective).
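
A short sketch with NLTK's off-the-shelf tagger, assuming the required data packages have been downloaded:

```python
# One-time setup: import nltk; nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")
import nltk

tokens = nltk.word_tokenize("The quick brown fox jumps over the lazy dog")
print(nltk.pos_tag(tokens))
# e.g. [('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN'), ('jumps', 'VBZ'), ...]
```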

Named Entity Recognition (NER): Identifying and classifying entities in text, such as names of people, organizations, locations, dates, and more.
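
A sketch using spaCy's small English model, assuming both the library and the model are installed (pip install spacy, then python -m spacy download en_core_web_sm):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple was founded by Steve Jobs in Cupertino in 1976.")
for ent in doc.ents:
    print(ent.text, ent.label_)
# Typically: Apple ORG / Steve Jobs PERSON / Cupertino GPE / 1976 DATE
```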

Sentiment Analysis: Determining the sentiment expressed in a text, such as positive, negative, or neutral.
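
A toy lexicon-based sketch; the word lists are made up for illustration, and practical systems use trained classifiers or much larger lexicons:

```python
POSITIVE = {"good", "great", "love", "excellent"}
NEGATIVE = {"bad", "terrible", "hate", "awful"}

def sentiment(text):
    words = text.lower().split()
    # Count positive hits minus negative hits and map the score to a label.
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(sentiment("I love this great movie"))  # 'positive'
```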

Word Embeddings: Representing words in a continuous vector space where semantically similar words are close together (e.g., Word2Vec, GloVe).
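
A sketch that trains Word2Vec on a toy corpus with gensim, assuming gensim is installed; the corpus is far too small to yield meaningful vectors, it only shows the API shape:

```python
from gensim.models import Word2Vec

sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "sat", "on", "the", "rug"]]
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, seed=1)

print(model.wv["cat"].shape)         # (50,) -- each word maps to a dense vector
print(model.wv.most_similar("cat"))  # nearest words in the learned vector space
```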

Bag-of-Words (BoW): A representation of text that describes the occurrence of words within a document, disregarding grammar and word order.
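
A minimal bag-of-words count using only the standard library; scikit-learn's CountVectorizer does the same at scale:

```python
from collections import Counter

doc = "the cat sat on the mat"
bow = Counter(doc.split())  # word -> count, grammar and order are discarded
print(bow)  # Counter({'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1})
```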

TF-IDF (Term Frequency-Inverse Document Frequency): A statistical measure used to evaluate the importance of a word in a document relative to a collection of documents.
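
A small worked sketch of one common TF-IDF variant (tf = term count / document length, idf = log(N / document frequency)); libraries apply smoothing and other variants:

```python
import math

docs = [["the", "cat", "sat"], ["the", "dog", "barked"], ["the", "cat", "meowed"]]

def tf_idf(term, doc, docs):
    tf = doc.count(term) / len(doc)          # how frequent in this document
    df = sum(term in d for d in docs)        # how many documents contain it
    idf = math.log(len(docs) / df)
    return tf * idf

print(tf_idf("cat", docs[0], docs))  # ~0.135: 'cat' is somewhat distinctive
print(tf_idf("the", docs[0], docs))  # 0.0: 'the' appears in every document
```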

N-grams: Contiguous sequences of n items from a given text (e.g., bigrams are pairs of consecutive words).
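
A one-function sketch that slides a window of length n over a token list:

```python
def ngrams(tokens, n):
    # zip over n shifted copies of the token list to get consecutive windows.
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = "the quick brown fox".split()
print(ngrams(tokens, 2))  # [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox')]
```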

Parsing: Analyzing the grammatical structure of a sentence to understand relationships between words.

Syntax Tree: A tree representation of the syntactic structure of a sentence according to a given grammar.

Dependency Parsing: Analyzing the grammatical structure of a sentence to establish relationships between “head” words and words that modify those heads.

Coreference Resolution: Determining when different words or phrases refer to the same entity in a text.

Language Model: A probabilistic model that assigns probabilities to sequences of words, often used to predict the next word or generate text (e.g., GPT) or to produce contextual representations of text (e.g., BERT).
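
A toy count-based bigram language model in plain Python; modern neural language models replace these counts with learned parameters, but the goal of estimating P(next word | context) is the same:

```python
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat slept".split()

# Count how often each word follows each previous word.
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def p_next(prev, nxt):
    total = sum(counts[prev].values())
    return counts[prev][nxt] / total if total else 0.0

print(p_next("the", "cat"))  # 2/3: after 'the' we saw 'cat' twice and 'mat' once
```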

Transformer: A deep learning model architecture that relies on attention mechanisms to handle dependencies between input and output.

Attention Mechanism: A technique that allows the model to focus on specific parts of the input sequence when making predictions.
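
A NumPy sketch of scaled dot-product attention, the core operation inside the Transformer; the shapes and random inputs here are illustrative:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Compare every query against every key, scaled by sqrt of the key dimension.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # Softmax over keys turns scores into attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Each output is a weighted average of the values.
    return weights @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(3, 4)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)  # (3, 4)
```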

Seq2Seq (Sequence-to-Sequence): Models that transform one sequence into another, used in tasks like machine translation.

Recurrent Neural Network (RNN): A type of neural network designed for sequential data, such as time series or natural language.

Long Short-Term Memory (LSTM): A type of RNN that can capture long-term dependencies by mitigating the vanishing gradient problem.
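
A brief sketch of running a batch of sequences through an LSTM layer, assuming PyTorch is installed (pip install torch); the sizes are arbitrary:

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
x = torch.randn(2, 5, 8)        # batch of 2 sequences, 5 time steps, 8 features each

output, (h_n, c_n) = lstm(x)
print(output.shape)             # torch.Size([2, 5, 16]) -- hidden state at every step
print(h_n.shape, c_n.shape)     # final hidden and cell states: torch.Size([1, 2, 16]) each
```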

Bidirectional Encoder Representations from Transformers (BERT): A pre-trained Transformer encoder that builds contextual representations of words by reading text in both directions, widely used for understanding tasks such as search, classification, and question answering.
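
A short sketch of using a pre-trained BERT via the Hugging Face transformers pipeline, assuming the library is installed (pip install transformers); the model weights are downloaded on first use:

```python
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill("The capital of France is [MASK].")[:3]:
    print(pred["token_str"], round(pred["score"], 3))  # top guesses for the masked word
```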

Generative Pre-trained Transformer (GPT): A series of language models that use transformer architecture to generate human-like text.

Transfer Learning: Using a pre-trained model on one task and fine-tuning it for another related task.

Named Entity Linking (NEL): Connecting entities recognized in text to a knowledge base or database of entities.

Text Summarization: The process of creating a concise and coherent summary of a larger text document.

Word Sense Disambiguation (WSD): Determining which meaning of a word is used in a given context.
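
A sketch using NLTK's simplified Lesk algorithm, assuming NLTK with the WordNet corpus downloaded; Lesk is a classic baseline and often picks an imperfect sense:

```python
# One-time setup: import nltk; nltk.download("wordnet"); nltk.download("punkt")
from nltk.wsd import lesk
from nltk.tokenize import word_tokenize

sent = word_tokenize("I went to the bank to deposit my money")
sense = lesk(sent, "bank", pos="n")      # pick the WordNet noun sense that best fits the context
print(sense, "->", sense.definition())
```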

Speech Recognition: Converting spoken language into text.

Text-to-Speech (TTS): Converting text into spoken language.