Python for Beginners: Extracting Phone Numbers and Email Addresses From Websites

Addison Chen
7 min read · Jun 11, 2020


Photo by Pakata Goh on Unsplash

Have you ever wondered whether there is a quicker way to look up specific information on a website? The average human attention span is often said to be around eight seconds, so most people quickly lose interest when reading paragraph after paragraph of text.

I decided to keep myself productive by learning to code in Python, hoping it could make my life just a tiny bit easier. The first step on my path to becoming a better programmer was to build a simple data extractor program.

I started my coding journey by reading “Automate the Boring Stuff with Python” by Al Sweigart. I found that the book makes it very easy to get started on your first programs for automating simple tasks. It first goes over the fundamentals of coding and includes instructions on how to install the Python interpreter on your computer. One of its projects is a data extractor program: you copy all the text on a web page to the clipboard, and the program pulls out only the phone numbers and email addresses. The project comes from Chapter 7 of the book and is called Phone Number and Email Address Extractor. You can find the book and the project here: “Automate the Boring Stuff with Python” by Al Sweigart.

A few fundamentals that you should look into before tackling the project:

  • Know your basic data types: integers (int), floating-point numbers (float), and strings (str)
  • Learn to store values in variables and to come up with good names for each of them
  • Have a good understanding of flow control: conditional statements (if, elif, else), loops (while, for), and break
  • Know how to import modules
  • Know the basics of functions (return values and return statements)
  • Lists (understand how lists and indexing work, including len(); note that group() is a method of regex match objects, not of lists)
  • Manipulating strings — go over the string methods you will need for working with text
  • Regular expressions (regex) — useful for pattern matching and validation

Simple Workflow:

First, you will need to figure out which modules to import. In this case, we need the “re” and “pyperclip” modules. The “re” module provides regular expression matching operations, which I use to define the patterns for phone numbers and email addresses (Link to more information on re). The “pyperclip” module handles basic copying and pasting of plain text to and from the clipboard (Link to more information on pyperclip). It is a third-party module, so it has to be installed separately.
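As a rough sketch of what each module contributes (the sample string is made up for illustration, and the clipboard part is skipped if pyperclip is not installed or no clipboard is available):

```python
import re

# Hypothetical sample text standing in for what you would copy from a web page.
sample = "Call 415-555-1234 today"

# re.search returns a match object for the first match, or None if there is none.
match = re.search(r"\d{3}-\d{3}-\d{4}", sample)
found = match.group() if match else None
print(found)  # 415-555-1234

try:
    import pyperclip  # third-party; install it separately, e.g. with pip
except ImportError:
    pyperclip = None

if pyperclip is not None:
    try:
        pyperclip.copy(found)     # put the text on the clipboard
        print(pyperclip.paste())  # read it back as a string
    except pyperclip.PyperclipException:
        print("No clipboard mechanism available on this system.")
```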

Next, we need to write a few lines of code that recognize phone numbers and email addresses in the text. To do so, we write a regular expression that describes what a phone number or an email address looks like. Regular expressions can get a little complex, so I suggest learning the basics first. Luckily for us, the book provides the regular expression for phone numbers. Still, the same pattern can be written in many different ways, and regular expressions work much the same way across programming languages. Here is a link to a site that teaches you the basics of regular expressions.

Brief Summary: See below for the breakdown of the regex for matching phone numbers

1. (\d{3}|\(\d{3}\))? — Matches the area code (a three-digit number, with or without parentheses); the trailing “?” makes it optional

“\d{3}” matches exactly three digits

“|” means “or”: the area code can be either three bare digits, “\d{3}”, or three digits wrapped in parentheses, “\(\d{3}\)”

2. (\s|-|\.) — The separator

“\s” matches a single whitespace character, and “-” matches a hyphen

“\.” matches a literal period. Whichever separator appears in the text, the program later joins the pieces of the number with “-”

3. (\d{3}) — Matches the first three digits

“\d{3}” matches exactly three digits

4. (\s|-|\.)? — The separator again; the trailing “?” makes it optional

5. (\d{4}) — Matches the last four digits

“\d{4}” matches exactly four digits

6. (\s*(ext|x|ext.)\s*(\d{2,5}))? — Matches an optional extension (e.g. x95, x956, x9556, or x95556).

The expression checks for an “x”, “ext”, or “ext.” marker, followed by “\d{2,5}”, which must match 2 to 5 digits.

7. re.VERBOSE — A flag that comes from the re module

The VERBOSE flag is passed at the end of the re.compile() call. It tells Python to ignore whitespace and comments inside the pattern, which is what allows us to write the regular expression line by line with a comment next to each piece
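Putting the seven pieces together might look like the sketch below. It assembles the parts in the order they are listed in this breakdown; the variable name and the sample text are my own placeholders, not something taken from the book.

```python
import re

# re.VERBOSE lets the pattern span several lines and carry comments.
phone_regex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?              # area code, with or without parentheses
    (\s|-|\.)                       # separator
    (\d{3})                         # first three digits
    (\s|-|\.)?                      # separator
    (\d{4})                         # last four digits
    (\s*(ext|x|ext.)\s*(\d{2,5}))?  # optional extension
    )''', re.VERBOSE)

# Made-up sample text for illustration.
sample = "Office: (415) 555-1234 x95, fax: 415.555.9999"

# findall returns one tuple of groups per match; index 0 is the whole match.
found = [groups[0] for groups in phone_regex.findall(sample)]
print(found)  # ['(415) 555-1234 x95', '415.555.9999']
```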

Brief Summary: The regex for matching email addresses, broken down below:

1. ([a-zA-Z0-9_\-\.]) — Pretty self-explanatory: this matches the username

“a-zA-Z0-9” matches any lowercase letter (a-z), any capital letter (A-Z), and any digit (0-9).

“[…]” matches any one character listed inside the square brackets.

“_”, “\-”, and “\.” allow underscores, dashes, and periods in the username

2. @ + — Matches the @ sign that separates the username from the domain name

3. [a-zA-Z0-9_\-\.] — The same concept as step 1, but for the domain name

4. (\.[a-zA-Z]{2,5}) — Matches the extension at the end of the domain (e.g. .gov, .edu, .org, .com, .net, .biz, .info)

“a-zA-Z” matches lowercase letters (a-z) and capital letters (A-Z). Unlike step 1, it does not accept digits.

“{2,5}” looks for 2 to 5 characters, just in case. Realistically, “{2,4}” would be enough for the common extensions listed above.
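One way the four pieces could be combined is sketched below. The “+” quantifiers after the two character classes are my assumption (without them each class would match only a single character), and the addresses are made-up examples.

```python
import re

email_regex = re.compile(r'''(
    [a-zA-Z0-9_\-\.]+    # username: one or more allowed characters (assumed "+")
    @                    # the @ sign
    [a-zA-Z0-9_\-\.]+    # domain name (assumed "+")
    (\.[a-zA-Z]{2,5})    # dot followed by 2 to 5 letters
    )''', re.VERBOSE)

# Made-up sample text for illustration.
sample = "Write to support@example.com or jane.doe@example.org for help."

# Index 0 of each tuple returned by findall is the whole address.
emails = [groups[0] for groups in email_regex.findall(sample)]
print(emails)  # ['support@example.com', 'jane.doe@example.org']
```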

The code above defines the patterns used to find matches in whatever text you have copied to the clipboard. We then need two loops, one for phone numbers and one for email addresses. They combine a “for” loop with an “if” statement so that the program knows what to do with each match.

1. Start by declaring a variable called “text” that holds the result of the pyperclip paste function

  • “pyperclip.paste()” returns the text currently on the clipboard as a string value.
  • The first for loop takes the “phoneNumForRegex” variable and searches the text that we copied to the computer’s clipboard. For each match, the code looks at the three parts of the number (area code, middle digits, and the last four digits).
  • “join([groups[1], groups[3], groups[5]])” joins selected groups captured by the first regular expression we created into a single string. To get a better understanding of how the groups work, I suggest some trial and error: play around with the indexes and see what they output. In this case, groups[1] is the area code, groups[3] is the first three digits, and groups[5] is the last four digits.
  • “if groups[8] != '':” states that if the extension is not an empty string, the program also outputs an “x” followed by the extension digits. (Note: this condition is only met when the regular expression found an extension marker such as “x” followed by digits.)
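The steps above can be sketched as follows. To keep the example runnable without a clipboard, the text is passed in as a plain string instead of coming from pyperclip.paste(); the function name and the sample text are hypothetical, and the pattern is assembled from the pieces listed earlier.

```python
import re

phone_regex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?              # groups[1]: area code
    (\s|-|\.)                       # groups[2]: separator
    (\d{3})                         # groups[3]: first three digits
    (\s|-|\.)?                      # groups[4]: separator
    (\d{4})                         # groups[5]: last four digits
    (\s*(ext|x|ext.)\s*(\d{2,5}))?  # groups[6..8]: extension, marker, digits
    )''', re.VERBOSE)


def extract_phone_numbers(text):
    """Return every phone number in text in a uniform 000-000-0000 format."""
    matches = []
    for groups in phone_regex.findall(text):
        phone_num = '-'.join([groups[1], groups[3], groups[5]])
        # Groups that did not take part in the match come back as ''.
        if groups[8] != '':
            phone_num += ' x' + groups[8]
        matches.append(phone_num)
    return matches


numbers = extract_phone_numbers("Call 415.555.1234 ext 95 or 800-555-0199")
print(numbers)  # ['415-555-1234 x95', '800-555-0199']
```

In the real program, the argument would be the string returned by pyperclip.paste().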

Brief Summary: If your code does not yet handle the case where nothing is found, add that handling at the end. A friendly message makes it clear that the program ran but had nothing to report. In this case, I added one for when the text I copied does not contain any phone numbers or email addresses.

1. if len(matches) > 0:

    pyperclip.copy('\n'.join(matches))

    print('Copied to clipboard:')

    print('\n'.join(matches))

  • The “if” statement checks the length of the “matches” list. If it is greater than zero, at least one phone number or email address was found, so the results are copied to the clipboard and printed.

2. else:

    print('No phone numbers or emails found.')

  • The “else” branch runs if there are no matches, that is, if the length of the list is zero. It outputs the custom message I created.
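The same branching can be wrapped in a small function so that both paths are easy to try out. This is only a sketch: the function name and the optional copy parameter are hypothetical, added so the example runs without a clipboard. In the real program you would pass pyperclip.copy.

```python
def report(matches, copy=None):
    """Return the message to print; optionally hand the results to copy()."""
    if len(matches) > 0:
        joined = '\n'.join(matches)
        if copy is not None:
            copy(joined)  # e.g. pyperclip.copy
        return 'Copied to clipboard:\n' + joined
    else:
        return 'No phone numbers or emails found.'


print(report(['415-555-1234', 'support@example.com']))
print(report([]))  # No phone numbers or emails found.
```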

Note: Always include a fallback branch like this one. It also helps when you test the program against different scenarios.

Final Result:

As in the book, we will use the No Starch Press website to test the program. Select the whole web page (Ctrl+A), copy the text to the clipboard (Ctrl+C), and then run the program. See the final output below.

Considerations:

The regular expression for phone numbers provided by the book only works for the United States phone number format. I tested it with a random UK number, and the program did not output the correct phone number:

+447911123456
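A quick check shows why this number is missed: the pattern expects a separator (space, hyphen, or period) between the parts, and this number has none. The second pattern below is a hypothetical, deliberately narrow illustration that only handles the compact “+44” form with ten digits. It is not a complete UK phone number pattern.

```python
import re

us_phone_regex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?              # area code
    (\s|-|\.)                       # separator
    (\d{3})                         # first three digits
    (\s|-|\.)?                      # separator
    (\d{4})                         # last four digits
    (\s*(ext|x|ext.)\s*(\d{2,5}))?  # optional extension
    )''', re.VERBOSE)

uk_number = '+447911123456'

us_matches = us_phone_regex.findall(uk_number)
print(us_matches)  # [] because the number contains no separators

# Hypothetical pattern: "+44" followed by exactly ten digits.
uk_regex = re.compile(r'\+44\d{10}')
uk_match = uk_regex.search(uk_number)
print(uk_match.group() if uk_match else None)  # +447911123456
```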

Overall:

It was a short but fun project, and I learned a lot from it, even compared to the four years of college that I spent in a classroom. The biggest struggle I faced when writing the code was figuring out the regular expression for matching email addresses. I was able to figure it out by testing and by researching regular expressions with examples. In the end, this simple code can be reused and will make your life easier when looking for specific information on a website.

Screenshot of the No Starch Press Contact Page: https://nostarch.com/contactus
This is the final output when running the code.
