Exercise#1 Classification & mining

Objective

Learn to use a distance metric and try out paraphrase mining.

Part-1 Implement classification

You will be given two lists of strings, each containing a set of facts. When the user provides a query string, identify whether it belongs to the first or the second list. In addition, extract the two strings that most closely match the query. Follow the steps below:

  1. Copy and paste the facts below.
# Category : Sports
sports_facts = [
    "Football, also known as soccer in some countries, is the most popular sport in the world, with billions of fans worldwide.",
    "Basketball was invented in 1891 by Dr. James Naismith, a Canadian physical education instructor, as an indoor game to keep his students active during the winter months.",
    "Tennis is a highly competitive sport that originated in the 19th century and is played by millions of people around the world on various surfaces such as grass, clay, and hardcourt.",
    "Golf is a precision club-and-ball sport in which players use various clubs to hit balls into a series of holes on a course in as few strokes as possible."
]

# Category : History
history_facts = [
    "The Renaissance was a period of cultural rebirth that emerged in Europe during the 14th to 17th centuries, marking a transition from the Middle Ages to modernity.",
    "The Industrial Revolution, which began in Britain in the late 18th century, transformed society by introducing mechanized manufacturing processes and urbanization.",
    "The Cold War, spanning from the late 1940s to the early 1990s, was a geopolitical conflict between the United States and the Soviet Union, characterized by ideological, economic, and military competition.",
    "The French Revolution, which erupted in 1789, was a watershed moment in European history, leading to the overthrow of the monarchy and the rise of democratic principles."
]
  2. Encode the two lists separately to obtain:

    • sports_facts_embeddings
    • history_facts_embeddings
  3. Generate the embedding for the query string. Use the test strings below or create your own.

test_sentences = [
    "i like putting",
    "two strong armies came face to face",
    "hoops on the two ends of the court",
    "steam engine changed the world",
    "arts, and self expression was the highlight"
]

# Change the index to try out different strings
test_sentence = test_sentences[4]
  4. Use the appropriate util function to answer the following:
    • Which category does the query belong to?
    • Which string in the corpus is closest to the query?
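
The decision logic behind steps 2 to 4 can be sketched without the library, which may help when checking your own solution. This is a minimal sketch, not the notebook's solution: it assumes the embeddings have already been produced (in the exercise, by calling `model.encode(...)` on a `SentenceTransformer` model), it uses a pure-Python cosine similarity as a stand-in for `util.cos_sim`, and it assumes the query's category is simply the category of its single nearest fact. The helper names `cosine_similarity` and `classify`, and the 2-dimensional toy vectors, are hypothetical and for illustration only.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def classify(query_embedding, category_embeddings, top_k=2):
    """Return (best_category, top_k hits).

    category_embeddings maps a category name to a list of embeddings.
    Each hit is a (score, category, index) tuple, highest score first.
    Assumption: the category of the single closest fact wins.
    """
    hits = []
    for name, embeddings in category_embeddings.items():
        for index, embedding in enumerate(embeddings):
            score = cosine_similarity(query_embedding, embedding)
            hits.append((score, name, index))
    hits.sort(key=lambda hit: hit[0], reverse=True)
    return hits[0][1], hits[:top_k]


# Toy 2-d vectors standing in for real sentence embeddings (made up).
toy_embeddings = {
    "sports": [[1.0, 0.0], [0.9, 0.1]],
    "history": [[0.0, 1.0], [0.1, 0.9]],
}
toy_query = [0.8, 0.2]

category, top_hits = classify(toy_query, toy_embeddings)
print(category)
for score, name, index in top_hits:
    print(f"{score:.3f}  {name}[{index}]")
```

With real data, `toy_embeddings` would hold `sports_facts_embeddings` and `history_facts_embeddings`, and the index in each hit would point back into `sports_facts` or `history_facts`. Other decision rules, such as comparing the mean similarity per category, are equally reasonable.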

Part-2 Paraphrase mining

You learned about the paraphrase mining task in the last lesson. Apply util.paraphrase_mining to the corpus below:

corpus = [
    "A man is eating food.",
    "A man is eating a piece of bread.",
    "The girl is carrying a baby.",
    "A man is riding a horse.",
    "A woman is playing violin.",
    "Two men pushed carts through the woods.",
    "A man is riding a white horse on an enclosed ground.",
    "A monkey is playing drums.",
    "A cheetah is running behind its prey.",
]
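
To make the expected result easier to interpret: `util.paraphrase_mining(model, corpus)` returns a list of `[score, i, j]` triplets sorted by decreasing score, where `i` and `j` are indices into the corpus. The sketch below reproduces that output shape with a brute-force, pure-Python comparison of every pair. It is an illustration under stated assumptions, not the library's implementation: the helper names `cosine_similarity` and `mine_paraphrases` are hypothetical, and the 2-dimensional vectors are made-up stand-ins for real sentence embeddings.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mine_paraphrases(embeddings, top_k=3):
    """Score every unordered pair and return the top_k as [score, i, j]."""
    pairs = []
    for i, j in combinations(range(len(embeddings)), 2):
        pairs.append([cosine_similarity(embeddings[i], embeddings[j]), i, j])
    pairs.sort(key=lambda pair: pair[0], reverse=True)
    return pairs[:top_k]


# Toy 2-d vectors standing in for the encoded corpus (made up).
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.2, 0.8]]

top_pairs = mine_paraphrases(toy_vectors, top_k=2)
for score, i, j in top_pairs:
    print(f"{score:.3f}  sentence {i} <-> sentence {j}")
```

When applied to the real result, each triplet can be turned back into text with `corpus[i]` and `corpus[j]`.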

References

Sentence Transformers Package

Sentence Transformers Package - util

Solution

The solution to the exercise is available in sections #2 and #3 of the notebook:

ex-1-classification-mining

Open in Google Colab