Data scraping, also called web scraping, is the process of extracting information from a website and saving it to a file on your computer. It is an efficient way to collect data from the web when a site offers no download option. In this article, we learn how to scrape data from a website using Python and Selenium.
What is Selenium?
Selenium is an open-source browser automation tool, which means it can be downloaded and used free of charge. It is mainly a functional testing tool, it can be used alongside non-functional testing tools, and it is one of the most popular automation testing tools. Automation testing is the process of converting manual test cases into test scripts that an automation tool runs for you. Because Selenium drives a real browser, the same kind of script can also collect data: a short Python script is enough to open a page and read its elements.
In this article, we go step by step through scraping a table from Wikipedia with Selenium WebDriver, putting the scraped data into a pandas DataFrame, and saving that DataFrame as a CSV file on the local computer. We use the Mozilla Firefox web driver (geckodriver) for the automation.
Step 1. Install the required modules:
pip install selenium pandas
Step 2. Import the required modules and create the web driver object
First, download the Firefox web driver (geckodriver) and extract it on your system, then pass the path of the executable to the web driver object. The data we are going to scrape comes from the Wikipedia page linked below:
https://en.wikipedia.org/wiki/List_of_Olympic_medalists_in_badminton
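# Python program to demonstrate web scraping with Selenium
import pandas as pd
# import webdriver, the By locator class and the Firefox Service class
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.service import Service
# create webdriver object; Selenium 4 takes the geckodriver path through a Service
# (recent Selenium 4 releases no longer accept executable_path on webdriver.Firefox)
service = Service(executable_path="C:\\Users\\siddh\\Downloads\\geckodriver-v0.31.0-win64\\geckodriver.exe")
driver = webdriver.Firefox(service=service)
# open the Wikipedia page
driver.get("https://en.wikipedia.org/wiki/List_of_Olympic_medalists_in_badminton")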
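If the geckodriver path is wrong, Selenium cannot start Firefox. A minimal sketch of a check you could run first is shown below. check_driver_path is a hypothetical helper, not part of Selenium, and it only confirms that a file exists at the given location.

```python
from pathlib import Path


def check_driver_path(path_text):
    """Return the path as a Path object, or raise if no file exists there.

    Hypothetical helper for illustration; Selenium itself does not provide it.
    """
    path = Path(path_text)
    if not path.is_file():
        raise FileNotFoundError(f"geckodriver not found at: {path}")
    return path
```

The returned value could then be handed on, for example as Service(executable_path=str(check_driver_path(my_path))), where my_path is wherever you extracted geckodriver.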
Step 3. Extract all elements from the table:
To keep things easy to follow, we extract each column of the table separately with the help of XPath. XPath is one of the locator strategies Selenium supports for finding web elements. If you want to learn more about locators, the link is given below:
https://www.selenium.dev/documentation/webdriver/elements/finders/
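To get a feel for how XPath addresses rows and cells before pointing it at a live page, here is a small sketch that runs without a browser. It uses the standard library's xml.etree.ElementTree, which supports only a limited subset of XPath, on a made-up two-row table (the names in SAMPLE_TABLE are placeholders, not data from Wikipedia). The point it illustrates is that XPath positions such as tr[1] and td[2] start at 1, not 0.

```python
import xml.etree.ElementTree as ET

# Made-up table for illustration only; not taken from the Wikipedia page.
SAMPLE_TABLE = """
<table>
  <tbody>
    <tr><td>Player A</td><td>Nation X</td></tr>
    <tr><td>Player B</td><td>Nation Y</td></tr>
  </tbody>
</table>
"""

root = ET.fromstring(SAMPLE_TABLE)

# Count the rows, much like counting the tr elements of the real table.
total_rows = len(root.findall("./tbody/tr"))

# XPath positions are 1-based: tr[1] is the first row, td[2] the second cell.
players = []
nations_sample = []
for i in range(1, total_rows + 1):
    players.append(root.find(f"./tbody/tr[{i}]/td[1]").text)
    nations_sample.append(root.find(f"./tbody/tr[{i}]/td[2]").text)

print(total_rows, players, nations_sample)
```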
Now we have to find the length of the table, that is, how many rows (players) it contains. To do that, inspect the table in the browser's developer tools by pressing F12 and copy the XPath of the elements you need. The absolute XPaths used here reflect the page layout when this article was written and may need updating if the page changes. We also create empty lists to hold the extracted data.
# Extracting the length of the table (number of rows):
total_element = len(driver.find_elements(By.XPATH,
    "/html/body/div[3]/div[3]/div[5]/div[1]/table[7]/tbody/tr"))
print(total_element)
# creating empty lists to save data:
medalist = []
nations = []
olympic = []
gold = []
silver = []
bronze = []
total = []
# for extracting medalist name (XPath row positions start at 1)
for i in range(1, total_element + 1):
    w = driver.find_elements(By.XPATH,
        '/html/body/div[3]/div[3]/div[5]/div[1]/table[7]/tbody/tr[' +
        str(i) + ']/td[1]/a')
    for element in w:
        medalist.append(element.text)
# for extracting nations
for i in range(1, total_element + 1):
    n = driver.find_elements(By.XPATH,
        '/html/body/div[3]/div[3]/div[5]/div[1]/table[7]/tbody/tr[' +
        str(i) + ']/td[2]/a')
    for element in n:
        nations.append(element.text)
# for extracting which olympic
for i in range(1, total_element + 1):
    o = driver.find_elements(By.XPATH,
        '/html/body/div[3]/div[3]/div[5]/div[1]/table[7]/tbody/tr[' +
        str(i) + ']/td[3]')
    for element in o:
        olympic.append(element.text)
# for extracting gold medal (uses the row count instead of a hard-coded 152)
for i in range(1, total_element + 1):
    g = driver.find_elements(By.XPATH,
        '/html/body/div[3]/div[3]/div[5]/div[1]/table[7]/tbody/tr[' +
        str(i) + ']/td[4]')
    for element in g:
        gold.append(element.text)
# for extracting silver medal
for i in range(1, total_element + 1):
    s = driver.find_elements(By.XPATH,
        '/html/body/div[3]/div[3]/div[5]/div[1]/table[7]/tbody/tr[' +
        str(i) + ']/td[5]')
    for element in s:
        silver.append(element.text)
# for extracting bronze medal
for i in range(1, total_element + 1):
    b = driver.find_elements(By.XPATH,
        '/html/body/div[3]/div[3]/div[5]/div[1]/table[7]/tbody/tr[' +
        str(i) + ']/td[6]')
    for element in b:
        bronze.append(element.text)
# for extracting total number of medals
for i in range(1, total_element + 1):
    t = driver.find_elements(By.XPATH,
        '/html/body/div[3]/div[3]/div[5]/div[1]/table[7]/tbody/tr[' +
        str(i) + ']/td[7]')
    for element in t:
        total.append(element.text)
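The seven loops above differ only in the column number and in whether they read the link inside the cell. As a sketch of how that repetition could be folded into one function, the block below introduces two hypothetical helpers, cell_xpath and extract_column (neither comes from Selenium). extract_column takes the element finder as a parameter, so it can be tried with stand-in objects; with a real driver you would pass something like lambda xp: driver.find_elements(By.XPATH, xp).

```python
# Row XPath copied from the article; it reflects the page layout at the time of writing.
ROW_XPATH = "/html/body/div[3]/div[3]/div[5]/div[1]/table[7]/tbody/tr"


def cell_xpath(row, col, link=False):
    """Build the XPath for one cell; link=True targets the <a> inside it."""
    suffix = "/a" if link else ""
    return f"{ROW_XPATH}[{row}]/td[{col}]{suffix}"


def extract_column(find_elements, total_rows, col, link=False):
    """Collect the .text of every element found in one column, row by row."""
    values = []
    for row in range(1, total_rows + 1):  # XPath rows are 1-based
        for element in find_elements(cell_xpath(row, col, link)):
            values.append(element.text)
    return values
```

With a finder named find as described, the medalist column would be extract_column(find, total_element, 1, link=True) and the olympic column extract_column(find, total_element, 3).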
Step 4. Put the extracted data into a DataFrame
Now we create a DataFrame from the lists we filled in the last step, using the Python pandas library.
df = pd.DataFrame(list(zip(medalist, nations, olympic, gold, silver, bronze, total)),
                  columns=['Medalist', 'Nation', 'Olympic', 'Gold', 'Silver', 'Bronze', 'Total'])
print(df)
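One thing to watch with zip(): it stops at the shortest list, so if one column yielded fewer values than the others (for instance because a cell had no link), rows are dropped or misaligned without any error. Below is a minimal sketch of a length check that could run before building the DataFrame. check_same_length is a hypothetical helper and the sample lists are invented.

```python
def check_same_length(columns):
    """Raise ValueError if the named lists do not all have the same length."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    return lengths


# Invented sample data: the second list is one item short.
sample_medalist = ["Player A", "Player B", "Player C"]
sample_nations = ["Nation X", "Nation Y"]

# zip() silently keeps only as many rows as the shortest list has.
rows = list(zip(sample_medalist, sample_nations))
print(len(rows))

try:
    check_same_length({"Medalist": sample_medalist, "Nation": sample_nations})
except ValueError as error:
    print(error)
```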
Step 5. Export the DataFrame to a CSV file in the working directory
The next step is to create a CSV file from this DataFrame. To do that, we export the DataFrame with df.to_csv().
df.to_csv('bdo_medalist.csv')
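By default, to_csv() also writes the DataFrame's row index as an unnamed first column. If you do not want that column in the file, pass index=False. The sketch below shows the difference on a small invented DataFrame, writing to an in-memory buffer instead of a file.

```python
import io

import pandas as pd

# Invented two-row DataFrame standing in for the scraped data.
sample_df = pd.DataFrame(
    [("Player A", "Nation X"), ("Player B", "Nation Y")],
    columns=["Medalist", "Nation"],
)

with_index = io.StringIO()
sample_df.to_csv(with_index)  # default: index column included

without_index = io.StringIO()
sample_df.to_csv(without_index, index=False)  # index column left out

# Compare the header rows of the two outputs.
header_with = with_index.getvalue().splitlines()[0]
header_without = without_index.getvalue().splitlines()[0]
print(header_with)
print(header_without)
```

For the file in this article, that would be df.to_csv('bdo_medalist.csv', index=False).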
Now you can open this CSV file in Excel or any other spreadsheet program and see the data. This is how you can scrape data using Selenium and a few Python libraries. We will look more into data scraping and saving in the coming articles.