How to Extract Links from a Text File in Python?

In this tutorial, we’ll try to understand how to extract links from a text file in Python with the help of some examples.

There are multiple ways to extract URLs from a text file using Python. Some of the commonly used methods are –

  1. Using regular expressions.
  2. Using the urllib.parse module.

Let’s now look at both methods in detail.

We’ll be working with a text file “learn.txt” which contains some words and URLs to demonstrate the usage of the above methods. This is how the file looks in a text editor.

[Image: the contents of the text file displayed in a text editor]

Extract Links from a text file using Regular Expressions

Regular expressions are commonly used to extract information from text using pattern matching. The idea is to define a pattern (or rule) and then scan the entire text to find any matches. Since URLs (links) have a pattern (for example, starting with https://, etc.) we can utilize regular expressions to extract them from a text file.

You can use Python’s built-in re module to work with regular expressions. We’ll use the re.findall() function to find all the URLs in a piece of text. The following is the syntax –

Basic Syntax:

re.findall(regex, text)

Parameters: regex is the regular expression pattern that we want to match, and text is the string in which we want to search for the pattern. The function returns a list of all the non-overlapping matches it finds.
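To see the call in isolation, here is a minimal sketch that runs re.findall() on a short made-up string. The sample text and the deliberately simple pattern are assumptions for illustration only; they are not the file or the pattern used later in this tutorial.

```python
import re

# Hypothetical sample text (not the contents of learn.txt)
sample = "Docs: https://docs.python.org and http://example.com are two links."

# A deliberately simple pattern: "http" or "https", then "://",
# then one or more non-whitespace characters
simple_pattern = r"https?://\S+"

# findall returns a list with one string per match
matches = re.findall(simple_pattern, sample)
print(matches)
```

Because the pattern has no capturing groups, each item in the returned list is the full matched text.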

Let’s now use this method to extract all the URLs from the text file “learn.txt”. First, let’s read the contents of the file to a string.

# open the text file and read its contents into a string
# (the utf-8 encoding is an assumption; adjust it to match your file)
with open("learn.txt", "r", encoding="utf-8") as text_file:
    s = text_file.read()

# display the text
print(s)

Output:

This is a sample text file with some words and URLs. For example, you can visit the Python website at https://www.python.org to learn more about the Python programming language. You can also visit the Google website at http://www.google.com to search for information on the internet. Finally, you can visit Data Science Parichay's website at https://datascienceparichay.com to learn about data science via easy to understand tutorials and examples.

Let’s now extract all the URLs from this string (the contents of the text file) using regular expressions.

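import re

# extract the URLs; note that inside the character class, $-_ is a range
# (from "$" to "_"), which is what lets the pattern match "/", "?", "=" and ":"
urls = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*(),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', s)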
# display the extracted URLs
print(urls)

Output:

['https://www.python.org', 'http://www.google.com', 'https://datascienceparichay.com']

You can see that we were able to extract all three URLs in the above text file.
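In the sample file every URL is followed by a space, so the matches come out clean. In other text, a URL sitting right before a period, comma, or closing parenthesis will be captured together with that character, because the pattern allows those characters. The sketch below shows one possible way to handle this: it strips common trailing punctuation and drops repeated URLs. The helper name extract_urls and the sample sentence are hypothetical, and the stripping step is a heuristic (it would also trim a URL that legitimately ends in one of those characters).

```python
import re

# Same pattern as used above, compiled once for reuse
URL_PATTERN = re.compile(
    r"http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*(),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"
)

def extract_urls(text):
    """Return URLs found in text, without trailing punctuation, in first-seen order."""
    # Heuristic: remove punctuation that usually belongs to the sentence, not the URL
    cleaned = (match.rstrip(".,;:!?)") for match in URL_PATTERN.findall(text))
    # dict.fromkeys keeps the first occurrence of each URL and preserves order
    return list(dict.fromkeys(cleaned))

# Hypothetical sample sentence with punctuation right after the URLs
sample = (
    "Visit https://www.python.org. Or (see http://example.com/a?b=1), "
    "then https://www.python.org again."
)
raw_matches = URL_PATTERN.findall(sample)
clean_urls = extract_urls(sample)
print(raw_matches)
print(clean_urls)
```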

Extract Links using the urllib.parse module

The urllib.parse module in Python comes with a urlparse() function that parses a URL into its components. These correspond to the general structure of a URL: scheme://netloc/path;parameters?query#fragment.
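As a quick illustration of that structure, the following sketch parses a made-up URL (an assumption for the example, not one from the text file) and prints each component.

```python
from urllib.parse import urlparse

# Hypothetical URL that uses every component of the general structure
parts = urlparse("https://www.example.com/docs/intro;v=1?lang=en&page=2#setup")

print(parts.scheme)    # https
print(parts.netloc)    # www.example.com
print(parts.path)      # /docs/intro
print(parts.params)    # v=1
print(parts.query)     # lang=en&page=2
print(parts.fragment)  # setup
```

A plain word such as "website" has no scheme, so its scheme attribute is an empty string. That is the property the approach in this section relies on.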

The idea is to split the string into tokens (or words) and try to parse each one as a URL. If a token parses with a non-empty scheme (such as http or https), we treat it as a URL and add it to our urls list.

Let’s print out the contents of the text file “learn.txt” that we read above again.

# text file content
print(s)

Output:

This is a sample text file with some words and URLs. For example, you can visit the Python website at https://www.python.org to learn more about the Python programming language. You can also visit the Google website at http://www.google.com to search for information on the internet. Finally, you can visit Data Science Parichay's website at https://datascienceparichay.com to learn about data science via easy to understand tutorials and examples.

Let’s now extract the URLs from the above text.

from urllib.parse import urlparse

# Extract the URLs using the urlparse() function
urls = [urlparse(url).geturl() for url in s.split() if urlparse(url).scheme]
# display the extracted URLs
print(urls)

Output:

['https://www.python.org', 'http://www.google.com', 'https://datascienceparichay.com']

We get the same result as above.
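Keep in mind that checking only for a scheme is a loose test: urlparse() reports a scheme for any token that looks like "name:", so a word such as "Note:" or an address such as "mailto:someone@example.com" would also pass. A stricter filter, sketched below under the assumption that you only want web links, also requires the scheme to be http or https and the network location to be non-empty. The helper name looks_like_web_url and the sample sentence are hypothetical.

```python
from urllib.parse import urlparse

def looks_like_web_url(token):
    """Return True if token parses as an http(s) URL with a network location."""
    try:
        parsed = urlparse(token)
    except ValueError:
        # urlparse can raise ValueError on some malformed inputs
        return False
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)

# Hypothetical sample sentence with tokens that merely contain a colon
text = "Note: visit https://www.python.org or mailto:someone@example.com today"

loose = [token for token in text.split() if urlparse(token).scheme]
strict = [token for token in text.split() if looks_like_web_url(token)]

print(loose)
print(strict)
```

With this sample, the loose check keeps "Note:" and the mailto address as well, while the strict check keeps only the https link.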

For more on the urlparse() function, refer to its documentation.

Authors

  • Piyush Raj

    Piyush is a data professional passionate about using data to understand things better and make informed decisions. He has experience working as a Data Scientist in the consulting domain and holds an engineering degree from IIT Roorkee. His hobbies include watching cricket, reading, and working on side projects.

  • Chaitanya Betha

    I'm an undergrad student at IIT Madras interested in exploring new technologies. I have worked on various projects related to Data science, Machine learning & Neural Networks, including image classification using Convolutional Neural Networks, Stock prediction using Recurrent Neural Networks, and many more machine learning model training. I write blog articles in which I would try to provide a complete guide on a particular topic and try to cover as many different examples as possible with all the edge cases to understand the topic better and have a complete glance over the topic.
