w3resource

NLTK Corpus Exercises with Solution

Python NLTK Corpus [13 exercises with solution]


In linguistics, a corpus (plural: corpora), or text corpus, is a large, structured set of texts. In corpus linguistics, corpora are used for statistical analysis and hypothesis testing, such as checking occurrences or validating linguistic rules within a specific language territory.

NLTK reads corpora through corpus reader classes, each specialized to handle a specific corpus format. In addition, the nltk.corpus package automatically creates a set of corpus reader instances that can be used to access the corpora in the NLTK data package.

1. Write a Python NLTK program to list all the corpus names.
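One possible approach, as a minimal sketch: NLTK keeps downloaded corpora in a corpora folder inside its data directory, so listing that folder gives the names of the corpora installed locally. The helper names list_corpus_names and installed_corpus_names are illustrative assumptions, and the second helper assumes NLTK is installed and at least one corpus has been fetched with nltk.download().

```python
import os


def list_corpus_names(corpora_dir):
    """Return the sorted names of the entries found in an NLTK corpora folder."""
    names = set()
    for entry in os.listdir(corpora_dir):
        # A downloaded corpus may appear as "name.zip", an unpacked "name" folder, or both.
        name = entry[:-4] if entry.endswith(".zip") else entry
        names.add(name)
    return sorted(names)


def installed_corpus_names():
    """List locally installed corpora (assumes NLTK and some downloaded data)."""
    import nltk  # imported lazily so list_corpus_names works without NLTK

    # nltk.data.find raises LookupError if no corpora folder exists yet.
    return list_corpus_names(str(nltk.data.find("corpora")))
```

Calling installed_corpus_names() returns whatever happens to be installed on the machine, so the result differs from one setup to another.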

2. Write a Python NLTK program to get a list of common stop words in various languages.

3. Write a Python NLTK program to check the list of stopwords in various languages.
From Wikipedia:
In computing, stop words are words which are filtered out before or after processing of natural language data (text). Though "stop words" usually refers to the most common words in a language, there is no single universal list of stop words used by all natural language processing tools, and indeed not all tools even use such a list. Some tools specifically avoid removing these stop words to support phrase search.
Any group of words can be chosen as the stop words for a given purpose. For some search engines, these are some of the most common, short function words, such as the, is, at, which, and on. In this case, stop words can cause problems when searching for phrases that include them, particularly in names such as "The Who", "The The", or "Take That". Other search engines remove some of the most common words (including lexical words, such as "want") from a query in order to improve performance.
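The phrase-search problem described above can be illustrated with a small sketch. The stop list here is a tiny hand-picked set chosen for illustration, not NLTK's list, and the function name strip_stop_words is an assumption.

```python
# A tiny, hand-picked stop list for illustration only (an assumption, not NLTK's list).
STOP_WORDS = {"the", "is", "at", "which", "on", "who", "that"}


def strip_stop_words(query):
    """Drop stop words from a whitespace-separated query, ignoring case."""
    return [word for word in query.split() if word.lower() not in STOP_WORDS]


print(strip_stop_words("The Who live at Leeds"))  # ['live', 'Leeds']
print(strip_stop_words("The Who"))                # []
```

With this stop list, a query consisting only of the band name is filtered down to nothing, which is why some tools keep stop words when phrase search matters.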

4. Write a Python NLTK program to remove stop words from a given text.
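A minimal sketch of one way to do this, assuming NLTK is installed and the stopwords corpus has been downloaded with nltk.download('stopwords'). The helper names are illustrative, and the regular-expression tokenizer is a simplifying assumption standing in for a proper tokenizer.

```python
import re


def load_stop_words(language="english"):
    """Load NLTK's stop word list for a language (needs the 'stopwords' corpus)."""
    from nltk.corpus import stopwords  # lazy import: only needed when this is called

    return set(stopwords.words(language))


def remove_stop_words(text, stop_words):
    """Return the words of text that are not in stop_words (case-insensitive)."""
    # Simplifying assumption: a word is a run of letters and apostrophes.
    tokens = re.findall(r"[A-Za-z']+", text)
    return [token for token in tokens if token.lower() not in stop_words]
```

A typical call would be remove_stop_words(some_text, load_stop_words("english")). Any other collection of words can be passed as stop_words as well.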

5. Write a Python NLTK program to omit some given stop words from the stopwords list.
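As a hedged sketch, omitting words from a stop word list amounts to filtering the list. The function name omit_from_stop_words and the short sample list are assumptions for illustration; with NLTK installed, stopwords.words("english") from nltk.corpus could supply the full list instead.

```python
def omit_from_stop_words(stop_words, words_to_omit):
    """Return the stop word list without the given words (case-insensitive)."""
    omit = {word.lower() for word in words_to_omit}
    return [word for word in stop_words if word.lower() not in omit]


# A short stand-in list for illustration (not NLTK's actual list).
sample_stop_words = ["i", "me", "not", "no", "the", "again", "once"]
print(omit_from_stop_words(sample_stop_words, ["not", "again", "once"]))
# ['i', 'me', 'no', 'the']
```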

6. Write a Python NLTK program to find the definition and examples of a given word using WordNet.
From Wikipedia:
WordNet is a lexical database for the English language. It groups English words into sets of synonyms called synsets, provides short definitions and usage examples, and records a number of relations among these synonym sets or their members. WordNet can thus be seen as a combination of dictionary and thesaurus. While it is accessible to human users via a web browser, its primary use is in automatic text analysis and artificial intelligence applications. The database and software tools have been released under a BSD-style license and are freely available for download from the WordNet website. Both the lexicographic data (lexicographer files) and the compiler (called grind) for producing the distributed database are available.
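The synsets, definitions, and usage examples described above can be read through NLTK's WordNet interface. The sketch below assumes NLTK is installed and the wordnet corpus has been downloaded; the helper names summarize_synsets and describe_word are illustrative.

```python
def summarize_synsets(synsets):
    """Return (name, definition, examples) for each WordNet synset given."""
    return [(s.name(), s.definition(), s.examples()) for s in synsets]


def describe_word(word):
    """Look up a word in WordNet and summarize every synset it belongs to."""
    from nltk.corpus import wordnet  # lazy import: needs nltk.download('wordnet')

    return summarize_synsets(wordnet.synsets(word))
```

Calling describe_word("fight"), for example, would return one tuple per synset of that word. Some synsets have an empty list of examples.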

7. Write a Python NLTK program to find the sets of synonyms and antonyms of a given word.
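One way to sketch this: in NLTK's WordNet interface, each synset holds lemmas, and antonyms are recorded on lemmas, so synonyms and antonyms can be gathered by walking the lemmas of every synset of the word. The helper names below are assumptions, and find_synonyms_antonyms assumes the wordnet corpus has been downloaded.

```python
def collect_synonyms_antonyms(synsets):
    """Gather lemma names (synonyms) and their antonym names from synsets."""
    synonyms, antonyms = set(), set()
    for synset in synsets:
        for lemma in synset.lemmas():
            synonyms.add(lemma.name())
            # Antonyms are attached to individual lemmas, not to whole synsets.
            for antonym in lemma.antonyms():
                antonyms.add(antonym.name())
    return synonyms, antonyms


def find_synonyms_antonyms(word):
    """Look up a word in WordNet and return its synonym and antonym sets."""
    from nltk.corpus import wordnet  # lazy import: needs nltk.download('wordnet')

    return collect_synonyms_antonyms(wordnet.synsets(word))
```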

8. Write a Python NLTK program to get an overview of the tagset, details of a specific tag in the tagset, and details of several related tags, using a regular expression.

9. Write a Python NLTK program to compare the similarity of two given nouns.

10. Write a Python NLTK program to compare the similarity of two given verbs.
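The noun and verb comparisons in exercises 9 and 10 could be sketched with a single helper. This is only one possible approach: it uses WordNet's Wu-Palmer similarity (the wup_similarity method of a synset) and takes the best score over all sense pairs, which is a design assumption rather than the only reasonable choice. The helper names are illustrative, and word_similarity assumes the wordnet corpus has been downloaded.

```python
def best_similarity(synsets_a, synsets_b):
    """Return the highest Wu-Palmer score over all synset pairs, or None."""
    scores = [a.wup_similarity(b) for a in synsets_a for b in synsets_b]
    # A pair with no usable path may yield None, so drop those scores.
    scores = [score for score in scores if score is not None]
    return max(scores, default=None)


def word_similarity(word_a, word_b, pos="n"):
    """Compare two words of the same part of speech ('n' noun, 'v' verb)."""
    from nltk.corpus import wordnet  # lazy import: needs nltk.download('wordnet')

    return best_similarity(
        wordnet.synsets(word_a, pos=pos),
        wordnet.synsets(word_b, pos=pos),
    )
```

For nouns one might call word_similarity("football", "soccer"), and for verbs word_similarity("buy", "sell", pos="v"). A result of None means no comparable senses were found.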

11. Write a Python NLTK program to find the number of male and female names in the names corpus. Print the first 10 male and female names.
Note: The names corpus contains 2,943 male names (male.txt) and 5,001 female names (female.txt). It was compiled by Kantrowitz and Ross.
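A minimal sketch of the counting step, assuming NLTK is installed and the names corpus has been downloaded with nltk.download('names'). The helper names load_names and summarize_names are illustrative.

```python
def load_names():
    """Return (male, female) name lists from NLTK's names corpus."""
    from nltk.corpus import names  # lazy import: needs nltk.download('names')

    return names.words("male.txt"), names.words("female.txt")


def summarize_names(male, female, n=10):
    """Count both lists and keep the first n names of each."""
    return {
        "male_count": len(male),
        "female_count": len(female),
        "first_male": list(male[:n]),
        "first_female": list(female[:n]),
    }
```

With the corpus available, summarize_names(*load_names()) gives the two counts and the first 10 names of each list.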

12. Write a Python NLTK program to combine the labeled male and labeled female names from the names corpus, shuffle them, and print the first 15.
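As a hedged sketch, the labeling and shuffling can be done on plain lists. With NLTK installed, names.words('male.txt') and names.words('female.txt') from nltk.corpus would supply the inputs. The function name labeled_sample and the optional seed parameter are assumptions added for illustration.

```python
import random


def labeled_sample(male, female, n=15, seed=None):
    """Label each name, shuffle the combined list, and return the first n pairs."""
    labeled = [(name, "male") for name in male]
    labeled += [(name, "female") for name in female]
    # A private Random instance keeps the shuffle reproducible when a seed is given.
    random.Random(seed).shuffle(labeled)
    return labeled[:n]
```

Passing a seed makes the selection repeatable between runs. Leaving it as None gives a different order each time.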

13. Write a Python NLTK program to extract the last letter of every labeled name and create a new list that pairs each last letter with the name's label.
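A minimal sketch of the extraction step, assuming the labeled names are available as (name, label) pairs. The function name last_letter_pairs and the sample data are illustrative assumptions.

```python
def last_letter_pairs(labeled_names):
    """Return (last_letter, label) for every non-empty labeled name."""
    return [(name[-1], label) for name, label in labeled_names if name]


sample = [("Aaron", "male"), ("Abbie", "female"), ("Abe", "male")]
print(last_letter_pairs(sample))
# [('n', 'male'), ('e', 'female'), ('e', 'male')]
```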


Python: Tips of the Day

Kwargs:

*args and **kwargs are function parameters that can be very useful.

They are underused and often misunderstood.

Here is what they are and how to use them.

  • *args lets a function accept any number of positional arguments; **kwargs does the same for named (keyword) arguments.
  • Inside the function, *args arrives as a tuple of the extra positional arguments, and **kwargs arrives as a dictionary that maps each extra argument name to its value.
  • Parameters must follow this order: arg1, arg2, *args, **kwargs. You can use any subset of them, but you cannot change the order. For instance, def function(**kwargs, arg1) is a syntax error.
  • Another example: def function(*args, **kwargs) is valid because it follows the correct order.
  • Here is an example. Suppose satellites are given by name with their weights in tons, passed as keyword arguments. The code prints each name along with its weight in kilograms.
def payloads(**kwargs):
    # kwargs maps each satellite name to its weight in tons (given as a string)
    for key, value in kwargs.items():
        # 1 ton = 1000 kg
        print(key + " |||", float(value) * 1000)

payloads(NavSat1='2.5', BaysatG2='4')

Output:

NavSat1 ||| 2500.0
BaysatG2 ||| 4000.0

Because the function above works for any number of keyword arguments, **kwargs is a better fit than a fixed set of parameters. An existing dictionary can also be unpacked into keyword arguments with **:

def payloads(**kwargs):
    for key, value in kwargs.items():
        print(key + " |||", float(value) * 1000)

sats = {"Tx211": "3", "V1": "0.50"}
payloads(**sats)

Output:

Tx211 ||| 3000.0
V1 ||| 500.0
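The parameter ordering rule described in the list above (arg1, arg2, *args, **kwargs) can be sketched as follows. The function name describe and its arguments are hypothetical, chosen only to show where each argument ends up.

```python
def describe(first, second, *args, **kwargs):
    """Show how positional and keyword arguments are distributed."""
    return {
        "first": first,
        "second": second,
        "extra_positional": args,  # a tuple of the remaining positional arguments
        "extra_named": kwargs,     # a dict of the keyword arguments
    }


result = describe(1, 2, 3, 4, unit="kg", verbose=True)
print(result["extra_positional"])  # (3, 4)
print(result["extra_named"])       # {'unit': 'kg', 'verbose': True}
```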