In this tutorial, we’ll look at how to extract links from a text file in Python with the help of some examples.
There are multiple ways to extract URLs from a text file using Python. Two commonly used methods are:
- Using regular expressions.
- Using the urllib.parse module.
Let’s now look at both methods in detail.
We’ll be working with a text file “learn.txt”, which contains some words and URLs, to demonstrate the usage of the above methods. The contents of the file are printed in the next section.

Extract Links from a text file using Regular Expressions
Regular expressions are commonly used to extract information from text using pattern matching. The idea is to define a pattern (or rule) and then scan the entire text to find any matches. Since URLs (links) follow a pattern (for example, they start with http:// or https://), we can use regular expressions to extract them from a text file.
You can use Python’s built-in re module to work with regular expressions. We’ll use the re.findall() function to find all the matching URLs in a text. The following is the syntax:
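Basic Syntax:
re.findall(regex, text)
Parameters: regex is the regular expression pattern that we want to match, and text is the string in which we want to search for the pattern. The function returns a list of all non-overlapping matches. If the pattern contains capturing groups, the list holds the groups rather than the full matches, which is why the URL pattern used below relies only on non-capturing groups, written as (?:...).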
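As a quick illustration of how re.findall() behaves, here is a minimal sketch on a made-up string. The sample text and the digit pattern are assumptions chosen only for this example and are not part of “learn.txt”.

```python
import re

# Hypothetical sample text, used only for illustration
text = "Order 66 shipped on day 12, and order 7 is pending."

# r"\d+" matches one or more consecutive digits
numbers = re.findall(r"\d+", text)

# Each match is returned as a string, in the order it appears in the text
print(numbers)  # ['66', '12', '7']
```

The same call works for URLs once the digit pattern is swapped for a URL pattern.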
Let’s now use this method to extract all the URLs from the text file “learn.txt”. First, let’s read the contents of the file to a string.
# open the text file and read its contents to a string
s = ""
with open("learn.txt", "r") as text_file:
    s = text_file.read()
# display the text
print(s)
Output:
This is a sample text file with some words and URLs. For example, you can visit the Python website at https://www.python.org to learn more about the Python programming language. You can also visit the Google website at http://www.google.com to search for information on the internet. Finally, you can visit Data Science Parichay's website at https://datascienceparichay.com to learn about data science via easy to understand tutorials and examples.
Let’s now extract all the URLs from the above string (the contents of the text file) using regular expressions.
import re
# extract the URLs
urls = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*(),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', s)
# display the extracted URLs
print(urls)
Output:
['https://www.python.org', 'http://www.google.com', 'https://datascienceparichay.com']
You can see that we were able to extract all three URLs in the above text file.
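In the sample file every URL is followed by a space, so the matches come out clean. In other text a URL may be followed directly by a period, comma, or closing parenthesis, and the pattern above would include that character in the match. The following is a rough sketch of one way to handle this. The extract_urls() helper and the sample string are hypothetical, and stripping trailing punctuation is only a heuristic: it would also trim a URL that legitimately ends in one of those characters.

```python
import re

# Same pattern as in the tutorial, compiled once for reuse
URL_PATTERN = re.compile(
    r"http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*(),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"
)

def extract_urls(text):
    """Return the unique URLs in text, in order of first appearance."""
    seen = []
    for match in URL_PATTERN.findall(text):
        # Heuristic: drop punctuation that usually belongs to the sentence
        url = match.rstrip(".,;:!?)")
        if url not in seen:
            seen.append(url)
    return seen

# Hypothetical sample with trailing punctuation and a repeated URL
sample = "See https://www.python.org. Also (http://www.google.com), and https://www.python.org again."
print(extract_urls(sample))  # ['https://www.python.org', 'http://www.google.com']
```

Keeping the results in a list rather than a set preserves the order in which the URLs first appear.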
Extract Links using the urllib.parse module
The urllib.parse module in Python comes with a urlparse() function that parses a URL into its components. These correspond to the general structure of a URL: scheme://netloc/path;parameters?query#fragment.
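To make the six components concrete, here is a small sketch that parses a single URL and prints each part. The URL itself is a made-up example chosen so that every component is non-empty.

```python
from urllib.parse import urlparse

# Hypothetical URL chosen to exercise every component
parts = urlparse("https://www.example.com/docs/page;v=1?lang=en&sort=asc#intro")

print(parts.scheme)    # https
print(parts.netloc)    # www.example.com
print(parts.path)      # /docs/page
print(parts.params)    # v=1
print(parts.query)     # lang=en&sort=asc
print(parts.fragment)  # intro
```

For a plain word such as "website", the scheme attribute comes back as an empty string, which is what the approach below relies on.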
The idea is to split the string into tokens (words) and try to parse each one as a URL. If a token parses with a non-empty scheme, we treat it as a URL and add it to our urls list.
Let’s print out the contents of the text file “learn.txt” that we read above again.
# text file content
print(s)
Output:
This is a sample text file with some words and URLs. For example, you can visit the Python website at https://www.python.org to learn more about the Python programming language. You can also visit the Google website at http://www.google.com to search for information on the internet. Finally, you can visit Data Science Parichay's website at https://datascienceparichay.com to learn about data science via easy to understand tutorials and examples.
Let’s now extract the URLs from the above text.
from urllib.parse import urlparse
# Extract the URLs using the urlparse() function
urls = [urlparse(url).geturl() for url in s.split() if urlparse(url).scheme]
# display the extracted URLs
print(urls)
Output:
['https://www.python.org', 'http://www.google.com', 'https://datascienceparichay.com']
We get the same result as above.
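Note that checking only for a scheme is fairly permissive: a token such as mailto:someone@example.com also has a scheme, so it would be kept. If you only want web links, one possible refinement is to require an http or https scheme together with a non-empty network location. The looks_like_web_url() helper and the sample text below are hypothetical and shown only as a sketch.

```python
from urllib.parse import urlparse

def looks_like_web_url(token):
    """Heuristic check: http(s) scheme plus a non-empty network location."""
    try:
        parsed = urlparse(token)
    except ValueError:
        # urlparse can raise on malformed input, e.g. an unbalanced "[" in the host
        return False
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)

# Hypothetical sample containing a non-web scheme
text = "Note: visit https://www.python.org or mailto:someone@example.com today"
urls = [token for token in text.split() if looks_like_web_url(token)]
print(urls)  # ['https://www.python.org']
```

On the contents of “learn.txt” shown above, this stricter check should give the same three URLs as the scheme-only version.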
For more on the urlparse() function, refer to its documentation.