Debug School

rakesh kumar
rakesh kumar

Posted on • Updated on

How to scrape results from table using Selenium python

How to select anchor tag text element and click using By.LINK_TEXT

How to scrape all second/third/four td element of every tr(for loop)

How to scrape data td or tr elements By TAG_NAME

How to scrape inner tag of table or tbody

How to replace $ to rs from all element of list using for loop,listcomprhension,reg exp,lambda

How to remove $ from all element of list using for loop,listcomprhension,reg exp,lambda

Learning Points

Use of list slicing to get 20 records
Use of len to know total no of elments(tr/td)
Use of append to store data in list

Step 1: install selenium using python

pip install --upgrade selenium
Enter fullscreen mode Exit fullscreen mode

Step 2: Import Required Libraries

import pandas as pd
from selenium import webdriver
import warnings
warnings.filterwarnings('ignore')
from selenium.webdriver.common.by import By
import time
Enter fullscreen mode Exit fullscreen mode

Step 3: Start a WebDriver Instance

driver=webdriver.Chrome()
Enter fullscreen mode Exit fullscreen mode

Step 4: Navigate to a Webpage
Navigate to the webpage where you want to interact with the search input bar:

driver.get('https://www.jagranjosh.com/')
Enter fullscreen mode Exit fullscreen mode

Step 5: click anchor tag text GK element

anchor_tag = driver.find_element(By.LINK_TEXT, 'GK')
anchor_tag.click()
Enter fullscreen mode Exit fullscreen mode

Step 6: click inner page anchor tag text List of all Prime Ministers of India element

anchor_tag = driver.find_element(By.LINK_TEXT, 'List of all Prime Ministers of India')
anchor_tag.click()
Enter fullscreen mode Exit fullscreen mode

Step 7:scrape the all name
How to scrape all second/third/four td element of every tr(for loop)
step1:find all tr elements by tag name
step2:apply for loop in above found all tr elements
step3:find all td elements by tag name
step5: if length of td contain more than two/three/four
step6: store all td data in list using append
step7:get only 19 element using list slicing

tr_elements = driver.find_elements(By.TAG_NAME, 'tr')

# Initialize an empty list to store the text from the second <td> in each <tr>
td2_texts = []

# Iterate through the <tr> elements and extract text from the second <td> in each
for tr in tr_elements:
    # Find all <td> elements within the current <tr>
    td_elements = tr.find_elements(By.TAG_NAME, 'td')

    # Check if there are at least two <td> elements in the current <tr>
    if len(td_elements) >= 2:
        # Extract the text from the second <td> and append it to the list
        td2_texts.append(td_elements[1].text) 
td2_texts= td2_texts[:19] 
td2_texts
Enter fullscreen mode Exit fullscreen mode

Image description

Step 8: Scrape the all born date results

tr_elements = driver.find_elements(By.TAG_NAME, 'tr')

# Initialize an empty list to store the text from the second <td> in each <tr>
td3_texts = []

# Iterate through the <tr> elements and extract text from the second <td> in each
for tr in tr_elements:
    # Find all <td> elements within the current <tr>
    td_elements = tr.find_elements(By.TAG_NAME, 'td')

    # Check if there are at least two <td> elements in the current <tr>
    if len(td_elements) >= 3:
        # Extract the text from the second <td> and append it to the list
        td3_texts.append(td_elements[2].text) 
td3_texts  
Enter fullscreen mode Exit fullscreen mode

output
Image description

Step 9: Scrape the all term_of_office results

tr_elements = driver.find_elements(By.TAG_NAME, 'tr')

# Initialize an empty list to store the text from the second <td> in each <tr>
td4_texts = []

# Iterate through the <tr> elements and extract text from the second <td> in each
for tr in tr_elements:
    # Find all <td> elements within the current <tr>
    td_elements = tr.find_elements(By.TAG_NAME, 'td')

    # Check if there are at least two <td> elements in the current <tr>
    if len(td_elements) >= 4:
        # Extract the text from the second <td> and append it to the list
        td4_texts.append(td_elements[3].text) 
td4_texts
Enter fullscreen mode Exit fullscreen mode

output

Image description

Step 10: Scrape the all remark

tr_elements = driver.find_elements(By.TAG_NAME, 'tr')

# Initialize an empty list to store the text from the second <td> in each <tr>
td5_texts = []

# Iterate through the <tr> elements and extract text from the second <td> in each
for tr in tr_elements:
    # Find all <td> elements within the current <tr>
    td_elements = tr.find_elements(By.TAG_NAME, 'td')

    # Check if there are at least two <td> elements in the current <tr>
    if len(td_elements) >= 5:
        # Extract the text from the second <td> and append it to the list
        td5_texts.append(td_elements[4].text) 
td5_texts 
Enter fullscreen mode Exit fullscreen mode

output

Image description

step11: make a dataframe

Image description

==========================================================

Write a python program to scrape the details for all billionaires from www.forbes.com. Details to be scrapped:
“Rank”, “Name”, “Net worth”, “Age”, “Citizenship”, “Source”, “Industry”.

Step 1: install selenium using python

pip install --upgrade selenium
Enter fullscreen mode Exit fullscreen mode

Step 2: Import Required Libraries

import pandas as pd
from selenium import webdriver
import warnings
warnings.filterwarnings('ignore')
from selenium.webdriver.common.by import By
import time
Enter fullscreen mode Exit fullscreen mode

Step 3: Start a WebDriver Instance

driver=webdriver.Chrome()
Enter fullscreen mode Exit fullscreen mode

Step 4: Navigate to a Webpage
Navigate to the webpage where you want to interact with the search input bar:

driver.get("https://www.forbes.com/real-time-billionaires/#583345d23d78")
Enter fullscreen mode Exit fullscreen mode

step 5: scrape the name results
How to scrape inner tag of table or tbody

element = driver.find_elements(By.XPATH, '//table[@class="ng-scope ng-table"]/tbody/tr/td[3]/div[1]/h3')
name=[]
for element in element:
    elements=element.text
    name.append(elements)
name 
Enter fullscreen mode Exit fullscreen mode

output

Image description

step 6: scrape the rank results

element = driver.find_elements(By.XPATH, '//table[@class="ng-scope ng-table"]/tbody/tr/td[2]/div/span')
rank=[]
for element in element:
    elements=element.text
    rank.append(elements)
rank  
Enter fullscreen mode Exit fullscreen mode

output

Image description

step 7: scrape the net worth results

element = driver.find_elements(By.XPATH, '//table[@class="ng-scope ng-table"]/tbody/tr/td[4]/div/span')
net_worth=[]
for element in element:
    elements=element.text
    net_worth.append(elements)
net_worth 
Enter fullscreen mode Exit fullscreen mode

Image description
replace $ to rs using 4 methods
How to replace $ to rs from all element of list using for loop,listcomprhension,reg exp,lambda **
**How to remove $ from all element of list using for loop,listcomprhension,reg exp,lambda

element = driver.find_elements(By.XPATH, '//table[@class="ng-scope ng-table"]/tbody/tr/td[4]/div/span')
net_worth=[]
for element in element:
    elements=element.text
    net_worth.append(elements)   
net_worth = ["rs" + item[1:] for item in net_worth]
net_worth
Enter fullscreen mode Exit fullscreen mode

output

Image description

=====================================

element = driver.find_elements(By.XPATH, '//table[@class="ng-scope ng-table"]/tbody/tr/td[4]/div/span')
net_worth=[]
for element in element:
    elements=element.text
    net_worth.append(elements)
for item in net_worth:
    # Remove the dollar sign "$" and add "rs" to the beginning of each item
    updated_item = "rs" + item[1:]
    updated_list.append(updated_item)
updated_list
Enter fullscreen mode Exit fullscreen mode

Image description

import re
element = driver.find_elements(By.XPATH, '//table[@class="ng-scope ng-table"]/tbody/tr/td[4]/div/span')
net_worth=[]
for element in element:
    elements=element.text
    net_worth.append(elements)
pattern = re.compile(r'\$')

# Initialize an empty list to store the modified strings
updated_list = []

# Iterate through the original list
for item in net_worth:
    # Use the sub() method to replace "$" with "rs"
    updated_item = pattern.sub('rs', item)
    updated_list.append(updated_item) 
updated_list  
Enter fullscreen mode Exit fullscreen mode

Image description

element = driver.find_elements(By.XPATH, '//table[@class="ng-scope ng-table"]/tbody/tr/td[4]/div/span')
net_worth=[]
for element in element:
    elements=element.text
    net_worth.append(elements)   
updated_list = []
updated_list = list(map(lambda x: x.replace('$', 'rs'), net_worth))
updated_list
Enter fullscreen mode Exit fullscreen mode

remove $ from all elements of list

element = driver.find_elements(By.XPATH, '//table[@class="ng-scope ng-table"]/tbody/tr/td[4]/div/span')
net_worth=[]
for element in element:
    elements=element.text
    net_worth.append(elements)   
updated_list = [item.replace('$', '') for item in net_worth]
updated_list  
Enter fullscreen mode Exit fullscreen mode

output

Image description

===================

element = driver.find_elements(By.XPATH, '//table[@class="ng-scope ng-table"]/tbody/tr/td[4]/div/span')
net_worth=[]
for element in element:
    elements=element.text
    net_worth.append(elements)   
updated_list = []
for item in net_worth:
    updated_item = item.replace('$', '')
    updated_list.append(updated_item)
updated_list 
Enter fullscreen mode Exit fullscreen mode

Image description
=======================

element = driver.find_elements(By.XPATH, '//table[@class="ng-scope ng-table"]/tbody/tr/td[4]/div/span')
net_worth=[]
for element in element:
    elements=element.text
    net_worth.append(elements)   
updated_list = []
pattern = re.compile(r'\$')
updated_list = [pattern.sub('', item) for item in net_worth]
updated_list
Enter fullscreen mode Exit fullscreen mode

Image description

element = driver.find_elements(By.XPATH, '//table[@class="ng-scope ng-table"]/tbody/tr/td[4]/div/span')
net_worth=[]
for element in element:
    elements=element.text
    net_worth.append(elements)   
updated_list = []
updated_list = list(map(lambda x: x.replace('$', ''), net_worth))
updated_list
Enter fullscreen mode Exit fullscreen mode

step 8: scrape the age results

element = driver.find_elements(By.XPATH, '//table[@class="ng-scope ng-table"]/tbody/tr/td[6]/div/span')
age=[]
for element in element:
    elements=element.text
    age.append(elements)
age 
Enter fullscreen mode Exit fullscreen mode

output

Image description

step 9: scrape the source results

element = driver.find_elements(By.XPATH, '//table[@class="ng-scope ng-table"]/tbody/tr/td[7]/div/span')
source=[]
for element in element:
    elements=element.text
    source.append(elements)
source 
Enter fullscreen mode Exit fullscreen mode

output

Image description

step 10: scrape the citizenship results

element = driver.find_elements(By.XPATH, '//table[@class="ng-scope ng-table"]/tbody/tr/td[8]/div/span')
citizenship=[]
for element in element:
    elements=element.text
    citizenship.append(elements)
citizenship 
Enter fullscreen mode Exit fullscreen mode

output

Image description

step 11: make a dataframe

import pandas as pd
df= pd.DataFrame({'Rank':rank,'Name':name,'Net_worth':net_worth,'Age':age,'Source':source,'citizenship':citizenship})
df
Enter fullscreen mode Exit fullscreen mode

Image description

Top comments (0)