To Buy or Build Web Scraper with Selenium

Selenium Web Scraping

There are several frameworks and libraries to learn as you pick up the basics of web scraping. With a good grasp of HTTP methods such as GET and POST, and by using Selenium for web scraping, your data extraction process becomes much easier.

Selenium is a widely known tool for automating browser interactions. Combining it with other technologies like **BeautifulSoup** gives you even better results when you perform web scraping. Selenium runs your written script automatically, so there is no need for human intervention such as clicking or scrolling to drive the interaction between the script and the browser.

Interesting Read: Using Web Scraping for Lead Generation

Even though Selenium is best known as a tool for testing web applications, its uses go beyond that.

And so in this guide, we will cover Selenium web scraping using Python 3.x.


Set Up Selenium

You first need to install the Selenium package. To do so, execute this pip command in your terminal:

pip install selenium

After this, you also need to install a Selenium driver. This allows Python to control and interact with the web browser at the operating-system level. If you are doing a manual installation, make sure the driver is available via the PATH variable. Selenium drivers for Chrome, Firefox, and Edge can be downloaded from each browser vendor's site.

Starting Selenium

Let us begin by starting up your web browser:

  • Open a new browser window
  • Load any page of your choice; in this instance, we will use the limeproxies homepage

from selenium import webdriver
browser = webdriver.Firefox()
browser.get('https://www.limeproxies.com')

Doing this launches the browser in headful mode (with a visible window). If you want to switch your browser into headless mode and run it on a server, set it up like this:

from selenium import webdriver
from selenium.webdriver.firefox.options import Options
options = Options()
options.headless = True
driver = webdriver.Firefox(options=options, executable_path=DRIVER_PATH)

Selenium vs Real-Time Crawler

If you want to learn web scraping, Selenium is a great option. It's best used together with BeautifulSoup while learning HTTP protocols, how data is exchanged between server and browser, and how cookies and headers work. If you are looking for an easier way to perform web scraping, there are a variety of tools to help. Depending on the amount of data you wish to collect and your targets, a dedicated web scraping tool can save you both time and resources.

Real-Time Crawler is a tool that can be used for an easier web scraping process. Its two main functionalities are:

  • HTML Crawler API: this functionality allows you to scrape most websites as HTML
  • Data API: this is mainly for e-commerce and search engine websites, and it allows you to receive the data in structured JSON format
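Once the Data API hands you structured JSON, you work with it using Python's standard json module. A minimal sketch, assuming a hypothetical response shape (the fields below are illustrative, not the provider's actual schema):

```python
import json

# Hypothetical Data API response -- the real schema depends on the provider.
sample_response = '''
{
    "results": [
        {"url": "https://example.com/item/1", "title": "Item one", "price": 9.99},
        {"url": "https://example.com/item/2", "title": "Item two", "price": 4.5}
    ]
}
'''

# Parse the JSON text into Python objects and walk the records.
data = json.loads(sample_response)
for result in data["results"]:
    print(result["title"], result["price"])
```

Because the data already arrives structured, there is no HTML parsing step at all: you go straight from the response to usable records.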

You can easily integrate Real-Time Crawler; here is the process for Python:

import requests
from pprint import pprint

# Structure payload.
payload = {
    'source': 'universal',
    'url': 'https://stackoverflow.com/questions/tagged/python',
    'user_agent_type': 'desktop',
}

# Get response. Check the provider's docs for the exact endpoint URL
# and your own credentials.
response = requests.request(
    'POST',
    'https://realtime.oxylabs.io/v1/queries',
    auth=('user', 'pass1'),
    json=payload,
)

# This will return the JSON response with results.
pprint(response.json())

With real-time crawler and selenium, there are a lot of advantages including:

  • Easy scraping
  • You receive results only from successful requests
  • No need for extra coding
  • Automated web scraping processes
  • There is a built-in tool for proxy rotation

Selenium Web Scraping by Locating Elements


There are different functions that you can use to find elements using selenium on a page:

  • find_element_by_id
  • find_element_by_name
  • find_element_by_xpath
  • find_element_by_link_text (using the text value of a link)
  • find_element_by_partial_link_text (matching part of a hyperlink's text)
  • find_element_by_tag_name
  • find_element_by_class_name
  • find_element_by_css_selector (using a CSS selector for an id or class)

For example, let's locate the H1 tag on the limeproxies homepage using Selenium:

        ... something
        <h1 class="someclass" id="greatID">Partner Up With Proxy Experts</h1>

h1 = driver.find_element_by_tag_name('h1')
h1 = driver.find_element_by_class_name('someclass')
h1 = driver.find_element_by_xpath('//h1')
h1 = driver.find_element_by_id('greatID')

You can also use the find_elements (plural) functions to return a list of matching elements:

all_links = driver.find_elements_by_tag_name('a')

Doing this provides all the anchors on a page. Some elements, however, are not easy to access using an ID or class, so you would need XPath.
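The same "collect every anchor" idea can be tried outside the browser on saved page source (e.g. from driver.page_source) with Python's standard library. A minimal sketch; the page_source string below is a made-up example:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, mirroring what
    driver.find_elements_by_tag_name('a') returns in the browser."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

# Hypothetical saved page source, standing in for driver.page_source.
page_source = '<body><a href="/home">Home</a><p>text</p><a href="/about">About</a></body>'
collector = LinkCollector()
collector.feed(page_source)
print(collector.links)  # ['/home', '/about']
```

This is handy when you want Selenium only for fetching a rendered page and prefer to do the extraction step in plain Python afterwards.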


In Selenium, a WebElement represents an HTML element. The following are some of the most common actions:

  • element.text (access the element's text)
  • element.click() (click on the element)
  • element.get_attribute('class') (access an attribute)
  • element.send_keys('mypassword') (send text to an input)


XPath is a query syntax that helps you find an object in the DOM. It locates a node from the root element using either a relative path or an absolute path. For example:

  • / : selects from the root node. /html/body/div[1] will find the first div inside body
  • // : selects nodes anywhere in the document, irrespective of their location. (//form)[1] will find the first form element
  • [@attributename='value'] : a predicate. It finds a specific node or a node with a specific attribute value. //input[@name='email'] will find the first input element with the name "email".
    <div class="content-login">
        <form id="loginForm">
            <input type="text" name="email" value="Email Address:">
            <input type="password" name="password" value="Password:">
            <button type="submit">Submit</button>
        </form>
    </div>
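You can experiment with these predicates outside the browser, too: Python's standard xml.etree.ElementTree module supports a small XPath subset. A sketch using the form above, rewritten as well-formed XML (inputs self-closed) so the parser accepts it:

```python
import xml.etree.ElementTree as ET

# The login form from above, made well-formed XML for the stdlib parser.
html = '''
<div class="content-login">
    <form id="loginForm">
        <input type="text" name="email" value="Email Address:" />
        <input type="password" name="password" value="Password:" />
        <button type="submit">Submit</button>
    </form>
</div>
'''

root = ET.fromstring(html)
email_input = root.find(".//input[@name='email']")  # attribute predicate
first_input = root.find(".//form/input[1]")         # positional predicate
print(email_input.get("type"))   # text
print(first_input.get("name"))   # email
```

In Selenium itself, the equivalent lookup would be driver.find_element_by_xpath("//input[@name='email']") against the live page.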

Render Solutions for Slow Websites

Some websites use a lot of JavaScript to render their content, which can be tricky to deal with since they also make many AJAX calls. This issue can be solved in either of the following ways:

  • time.sleep(ARBITRARY_TIME)
  • WebDriverWait()


from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "mySuperId"))
)
This way, Selenium waits up to 10 seconds for the element to be present before timing out.


Using Selenium for web scraping makes the job easier, especially if you are new and learning the basics. Even though Selenium web scraping is efficient, you may need to perform large-scale scraping and require an already-built tool to facilitate your data extraction process. Real-Time Crawler is an example of such a tool, and in combination with Selenium you can expect great results.

About the author

Rachael Chapman

A Complete Gamer and a Tech Geek. Brings out all her thoughts and Love in Writing Techie Blogs.
