Tokenization: The process of breaking text into smaller units, such as words or sentences.
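A minimal tokenizer can be sketched with a regular expression; this toy version keeps only alphanumeric runs and apostrophes, whereas real tokenizers (e.g., in spaCy or NLTK) handle punctuation, contractions, and Unicode far more carefully.

```python
import re

def tokenize(text):
    # Extract word-like runs; punctuation is simply dropped.
    return re.findall(r"[A-Za-z0-9']+", text)

print(tokenize("The quick brown fox jumped!"))
```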
Normalization: Transforming text into a consistent format, such as converting all characters to lowercase.
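One common normalization recipe, sketched below, combines lowercasing, accent stripping, and whitespace collapsing; which steps count as "normalization" depends on the downstream task.

```python
import re
import unicodedata

def normalize(text):
    # Decompose accented characters, then drop the combining marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(c for c in text if not unicodedata.combining(c))
    # Lowercase and collapse runs of whitespace.
    text = text.lower()
    return re.sub(r"\s+", " ", text).strip()

print(normalize("  Caf\u00e9   MENU "))
```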
Stop Words: Common words (e.g., “and”, “the”, “is”) that are often removed from text to focus on more meaningful words.
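Stop-word removal is a simple filter over a token list. The list below is a tiny illustrative sample; real stop-word lists (NLTK, spaCy) contain a few hundred entries.

```python
# Illustrative stop-word set only; production lists are much larger.
STOP_WORDS = {"and", "the", "is", "a", "of", "to", "in"}

def remove_stop_words(tokens):
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words(["The", "cat", "is", "on", "the", "mat"]))
```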
Stemming: Reducing words to their root form by removing suffixes (e.g., “running” to “run”).
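A naive suffix-stripping stemmer can be sketched in a few lines; the Porter stemmer used in practice applies a much richer set of ordered rules, but the idea is the same.

```python
def stem(word):
    # Strip a common suffix, keeping at least a 3-character stem.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stemmed = word[: -len(suffix)]
            # Undouble a trailing consonant ("runn" -> "run"), as Porter does.
            if len(stemmed) > 2 and stemmed[-1] == stemmed[-2] and stemmed[-1] not in "aeiou":
                stemmed = stemmed[:-1]
            return stemmed
    return word

print([stem(w) for w in ["running", "jumped", "cats"]])
```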
Lemmatization: Similar to stemming, but uses vocabulary and morphological analysis to return the base or dictionary form of a word (e.g., “better” to “good”).
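Lemmatization requires a vocabulary rather than just string rules; in this sketch a tiny hand-made lookup table stands in for the morphological dictionaries (e.g., WordNet) that real lemmatizers consult.

```python
# Hand-made lemma table, for illustration only.
LEMMAS = {"better": "good", "ran": "run", "mice": "mouse", "was": "be"}

def lemmatize(word):
    # Fall back to the word itself when it is not in the vocabulary.
    return LEMMAS.get(word.lower(), word)

print([lemmatize(w) for w in ["better", "mice", "cat"]])
```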
Part-of-Speech (POS) Tagging: Assigning parts of speech to each word in a sentence (e.g., noun, verb, adjective).
Named Entity Recognition (NER): Identifying and classifying entities in text, such as names of people, organizations, locations, dates, and more.
Sentiment Analysis: Determining the sentiment expressed in a text, such as positive, negative, or neutral.
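The simplest form of sentiment analysis is lexicon-based scoring, sketched below with made-up word lists; production systems typically use trained classifiers or fine-tuned language models instead.

```python
# Tiny illustrative lexicons; real sentiment lexicons are far larger.
POSITIVE = {"good", "great", "excellent", "love", "happy"}
NEGATIVE = {"bad", "terrible", "awful", "hate", "sad"}

def sentiment(text):
    words = text.lower().split()
    # Net count of positive minus negative words decides the label.
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(sentiment("I love this great movie"))
```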
Word Embeddings: Representing words in a continuous vector space where semantically similar words are close together (e.g., Word2Vec, GloVe).
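"Close together" in embedding space is usually measured with cosine similarity. The 3-dimensional vectors below are made up for illustration; Word2Vec and GloVe learn vectors of 100+ dimensions from large corpora.

```python
import math

# Toy "embeddings" -- values are invented, not learned.
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.85, 0.75, 0.2],
    "apple": [0.1, 0.2, 0.9],
}

def cosine(u, v):
    # Dot product divided by the product of the vector norms.
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Semantically similar words should score higher:
print(cosine(vectors["king"], vectors["queen"]) > cosine(vectors["king"], vectors["apple"]))
```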
Bag-of-Words (BoW): A representation of text that describes the occurrence of words within a document, disregarding grammar and word order.
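A bag-of-words representation is just a count of word occurrences, which a `Counter` captures directly; note that word order is discarded by construction.

```python
from collections import Counter

def bag_of_words(text):
    # Grammar and order are lost; only per-word counts remain.
    return Counter(text.lower().split())

print(bag_of_words("the cat sat on the mat"))
```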
TF-IDF (Term Frequency-Inverse Document Frequency): A statistical measure used to evaluate the importance of a word in a document relative to a collection of documents.
N-grams: Contiguous sequences of n items from a given text (e.g., bigrams are pairs of consecutive words).
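Extracting n-grams amounts to sliding a window of size n over a token list:

```python
def ngrams(tokens, n):
    # One tuple per window position.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Bigrams (n = 2):
print(ngrams(["natural", "language", "processing", "rocks"], 2))
```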
Parsing: Analyzing the grammatical structure of a sentence to understand relationships between words.
Syntax Tree: A tree representation of the syntactic structure of a sentence according to a given grammar.
Dependency Parsing: Analyzing the grammatical structure of a sentence to establish relationships between “head” words and words that modify those heads.
Coreference Resolution: Determining when different words or phrases refer to the same entity in a text.
Language Model: A probabilistic model of word sequences, commonly trained to predict the next word and used for generating text (e.g., GPT); masked models such as BERT instead predict hidden words to learn context.
Transformer: A deep learning model architecture that relies on attention mechanisms to handle dependencies between input and output.
Attention Mechanism: A technique that allows the model to focus on specific parts of the input sequence when making predictions.
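Scaled dot-product attention, the core operation inside transformers, can be sketched for a single query without batching or multiple heads: the query is scored against every key, the scores are softmax-normalized into weights, and the output is the weighted average of the values.

```python
import math

def softmax(xs):
    # Subtract the max for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(query, keys, values):
    # Scaled dot-product attention for one query vector -- a sketch of the
    # mechanism, without batching, masking, or multiple heads.
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    # Output: weights-averaged combination of the value vectors.
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]

# The query matches the first key, so the first value dominates the output.
out = attention([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [[10.0, 0.0], [0.0, 10.0]])
print([round(x, 3) for x in out])
```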
Seq2Seq (Sequence-to-Sequence): Models that transform one sequence into another, used in tasks like machine translation.
Recurrent Neural Network (RNN): A type of neural network designed for sequential data, such as time series or natural language.
Long Short-Term Memory (LSTM): A type of RNN that can capture long-term dependencies by mitigating the vanishing gradient problem.
Bidirectional Encoder Representations from Transformers (BERT): A pre-trained language model designed to understand the context of words in search queries and other text.
Generative Pre-trained Transformer (GPT): A series of language models that use transformer architecture to generate human-like text.
Transfer Learning: Using a pre-trained model on one task and fine-tuning it for another related task.
Named Entity Linking (NEL): Connecting entities recognized in text to a knowledge base or database of entities.
Text Summarization: The process of creating a concise and coherent summary of a larger text document.
Word Sense Disambiguation (WSD): Determining which meaning of a word is used in a given context.
Speech Recognition: Converting spoken language into text.
Text-to-Speech (TTS): Converting text into spoken language.