
How to Extract Links from a Text File in Python?

In this tutorial, we’ll look at how to extract links (URLs) from a text file in Python with the help of some examples.

There are multiple ways to extract URLs from a text file using Python. Some of the commonly used methods are –

  1. Using regular expressions.
  2. Using the urllib.parse module.

Let’s now look at both methods in detail.

We’ll be working with a text file “learn.txt” which contains some words and URLs to demonstrate the usage of the above methods. This is how the file looks in a text editor.

[Image: the contents of the text file displayed in a text editor]

Extract Links from a text file using Regular Expressions

Regular expressions are commonly used to extract information from text using pattern matching. The idea is to define a pattern (or rule) and then scan the entire text to find any matches. Since URLs (links) have a pattern (for example, starting with https://, etc.) we can utilize regular expressions to extract them from a text file.

You can use Python’s built-in re module to work with regular expressions. We’ll use the re.findall() function, which returns a list of all non-overlapping matches of a pattern in a string. The following is the syntax –

Basic Syntax:

re.findall(regex,text)

Parameters: regex is the regular expression pattern that we want to match, and text is the string in which we want to search for that pattern.

For more details, refer to the documentation of Python’s re module.
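
To make the syntax concrete, here is a minimal sketch of re.findall() on a short string. The sample sentence and the patterns are made up for illustration and are not part of “learn.txt”.

```python
import re

# Hypothetical sample text, used only to illustrate re.findall()
text = "Order 66 shipped 3 items on day 12."

# r"\d+" matches one or more consecutive digits
numbers = re.findall(r"\d+", text)
print(numbers)  # ['66', '3', '12']

# With a capturing group, findall() returns the group, not the whole match
item_counts = re.findall(r"(\d+) items", text)
print(item_counts)  # ['3']
```

The second call shows why patterns meant to return the whole match, such as the URL pattern used in this tutorial, rely on non-capturing groups written as (?:...).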

Let’s now use this method to extract all the URLs from the text file “learn.txt”. First, let’s read the contents of the file to a string.

# open the text file and read its contents to a string
s = ""
with open("learn.txt", "r") as text_file:
    s = text_file.read()

# display the text
print(s)

Output:

This is a sample text file with some words and URLs. For example, you can visit the Python website at https://www.python.org to learn more about the Python programming language. You can also visit the Google website at http://www.google.com to search for information on the internet. Finally, you can visit Data Science Parichay's website at https://datascienceparichay.com to learn about data science via easy to understand tutorials and examples.

Let’s now extract all the URLs from this string (the contents of the text file) using regular expressions.

import re
# extract the URLs
urls = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*(),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', s)
# display the extracted URLs
print(urls)

Output:

['https://www.python.org', 'http://www.google.com', 'https://datascienceparichay.com']

You can see that we were able to extract all three URLs in the above text file.
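
One caveat: in “learn.txt” every URL is followed by a space, so the matches come out clean. If a URL sits at the end of a sentence, a pattern like this one will likely include the trailing period or closing bracket, since those characters are allowed by its character classes. The sketch below shows one possible way to handle that. It assumes the pattern is the widely shared form containing the class [$-_@.&+], and the helper name extract_urls and the sample sentence are hypothetical.

```python
import re

# Assumed pattern: the commonly circulated form of the URL regex
URL_PATTERN = re.compile(
    r"http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*(),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"
)

def extract_urls(text):
    """Return the URLs found in text, with trailing sentence punctuation removed."""
    return [match.rstrip(".,;:!?)") for match in URL_PATTERN.findall(text)]

# Hypothetical sample where URLs are followed directly by punctuation
sample = "Docs live at https://www.python.org. See also (http://www.google.com)."
print(extract_urls(sample))  # ['https://www.python.org', 'http://www.google.com']
```

Stripping trailing punctuation is a heuristic: it would also trim a URL that legitimately ends in one of those characters, so adjust the set of stripped characters to your data.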

Extract Links using the urllib.parse module

The urllib.parse module in Python comes with a urlparse() function that parses a URL into its components. These components correspond to the general structure of a URL: scheme://netloc/path;parameters?query#fragment.
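
As a quick illustration of these components, the sketch below parses a single URL and prints some of its parts. The URL is only an example chosen for illustration.

```python
from urllib.parse import urlparse

# Example URL chosen only to show the individual components
parts = urlparse(
    "https://docs.python.org/3/library/urllib.parse.html?highlight=urlparse#url-parsing"
)

print(parts.scheme)    # https
print(parts.netloc)    # docs.python.org
print(parts.path)      # /3/library/urllib.parse.html
print(parts.query)     # highlight=urlparse
print(parts.fragment)  # url-parsing

# A plain word has no scheme, so its scheme attribute is an empty string
print(repr(urlparse("website").scheme))  # ''
```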

The idea is to split the string into tokens (or words) and then try to parse each token as a URL. If the parsed result has a scheme (such as http or https), we treat the token as a URL and add it to our urls list.

Let’s print the contents of the text file “learn.txt”, which we read into the string s above, once more.

# text file content
print(s)

Output:

This is a sample text file with some words and URLs. For example, you can visit the Python website at https://www.python.org to learn more about the Python programming language. You can also visit the Google website at http://www.google.com to search for information on the internet. Finally, you can visit Data Science Parichay's website at https://datascienceparichay.com to learn about data science via easy to understand tutorials and examples.

Let’s now extract the URLs from the above text.

from urllib.parse import urlparse

# Extract the URLs using the urlparse() function
urls = [urlparse(url).geturl() for url in s.split() if urlparse(url).scheme]
# display the extracted URLs
print(urls)

Output:

['https://www.python.org', 'http://www.google.com', 'https://datascienceparichay.com']

We get the same result as above.
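
Note that checking only for a scheme is a loose test. A token such as mailto:team@example.com also has a scheme, and a URL followed directly by a comma keeps that comma. A stricter variant, sketched below, could strip surrounding punctuation and require an http or https scheme together with a network location. The function name extract_http_urls and the sample sentence are hypothetical.

```python
from urllib.parse import urlparse

def extract_http_urls(text):
    """Return whitespace-separated tokens that parse as http(s) URLs."""
    urls = []
    for token in text.split():
        # Drop punctuation that commonly surrounds a URL in running text
        candidate = token.strip("()[]<>\"'.,;:!?")
        try:
            parsed = urlparse(candidate)
        except ValueError:
            # urlparse() can raise ValueError on malformed input; skip such tokens
            continue
        # Require a web scheme and a non-empty network location
        if parsed.scheme in ("http", "https") and parsed.netloc:
            urls.append(candidate)
    return urls

# Hypothetical sample with a non-web scheme and trailing punctuation
sample = "Write to mailto:team@example.com or see https://www.python.org, or (http://www.google.com)."
print(extract_http_urls(sample))  # ['https://www.python.org', 'http://www.google.com']
```

Whether mailto: or ftp: links should count as links depends on your use case, so the tuple of accepted schemes is the part to adjust.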

For more on the urlparse method, refer to its documentation.

Authors

  • Piyush Raj

    Piyush is a data professional passionate about using data to understand things better and make informed decisions. In the past, he's worked as a Data Scientist for ZS and holds an engineering degree from IIT Roorkee. His hobbies include watching cricket, reading, and working on side projects.

  • Chaitanya Betha

    I'm an undergrad student at IIT Madras interested in exploring new technologies. I have worked on various projects related to Data science, Machine learning & Neural Networks, including image classification using Convolutional Neural Networks, Stock prediction using Recurrent Neural Networks, and many more machine learning model training. I write blog articles in which I would try to provide a complete guide on a particular topic and try to cover as many different examples as possible with all the edge cases to understand the topic better and have a complete glance over the topic.