Web Scraping - Cyber Security News
A web scraper that uses the following modules (pandas, requests, lxml.html)
Purpose:
Instead of visiting several different sites every day for cyber security news, wouldn't it be faster to gather the main headlines from each and have them all in one place?
Improvements still to be made
- Better error handling (a rough sketch follows this list)
- Search for / emphasize specific keywords
- Automate running the script and sending the file in the morning (see the smtplib sketch after the Export Data code)
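For the error-handling item, this is a rough sketch of what wrapping the request in a try/except could look like; the timeout value and the choice to skip a failed site rather than stop the run are my own assumptions, not part of the current code.
import requests

for p in url:
    try:
        page = requests.get(p, timeout = 10)   #timeout so a hanging site does not stall the whole run
        page.raise_for_status()                #raise an error for 4xx / 5xx responses
    except requests.exceptions.RequestException as e:
        print(f"Skipping {p}: {e}")            #report the failing site and move on instead of crashing
        continue
    #...then parse page.content with lxml as in the main loop below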
Problems Noticed:
- Not scalable once a certain number of URLs is reached
  - Scrapy would be a better option if many more sites were added
- Can easily break if a site changes the structure the xpath relies on
  - Could instead match elements that contain certain words or a certain number of characters (see the sketch after this list)
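As an illustration of that last point, xpath can match on part of a class name or on the contents of an attribute instead of hard-coding the full class string. A small sketch, reusing the Cyware selectors from the code below; the "news" keyword is only an example, not something the current code filters on.
#Match any h1 whose class merely contains "cy-card__title" instead of the full class string
title = doc.xpath('//h1[contains(@class, "cy-card__title")]/text()')
#Or keep only links whose URL contains a given word
links = [href for href in doc.xpath('//a/@href') if "news" in href]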
Modules / Libraries
Pandas - an open-source data analysis and manipulation tool for Python
Requests - makes HTTP requests simpler and more human-friendly
lxml - easy handling of XML / HTML files
After trying Scrapy and BeautifulSoup, I determined that lxml was the better fit for this project.
import pandas as pd
import requests
import lxml.html
Sites Chosen
Below are the sites I decided to use for this project, stored as a list in the variable url.
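Only the two sites handled in the code excerpts on this page are known for certain, so the list below is a partial reconstruction rather than the exact list used in the project.
#Partial reconstruction -- only the two sites handled in the excerpts below are known for certain
url = [
    "https://cyware.com/cyber-security-news-articles",
    "https://threatpost.com/",
    #...the remaining sites (the ones behind df3 and df4 in the export step) go here
]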
Data Cleaning and Storing
The following portion of the code cleans the data where needed, stores it in the required format, and records which site the data was taken from.
Comments / Explanation
The for loop iterates over the URLs we specified, using the variable "p" for the current one. For each URL it makes an HTTP request, stores the response in the variable "page", then uses lxml to parse the page content into the variable "doc".
Because each site requires its own way of selecting the elements you want, I decided that xpath was the easiest approach and stored each piece of extracted information in its own variable.
The data was then cleaned of any extra characters that were not needed. For the links in particular, duplicates were removed so that every column ends up with the same number of rows.
All of the data was then stored in a pandas DataFrame.
Because of the list comprehensions, each column ended up as a list of lists, which needed further cleaning by joining the pieces back into strings.
To make clear in the later export where the information came from, I added a row containing a short title for the site.
Finally, both DataFrames are concatenated so that the site title becomes the first row.
for p in url:
    page = requests.get(p)
    doc = lxml.html.fromstring(page.content)
    if p == "https://cyware.com/cyber-security-news-articles":
        #Select the titles, descriptions and links with xpath
        title = doc.xpath('//h1[@class="cy-card__title m-0 cursor-pointer pb-3"]/text()')
        descrip = doc.xpath('//div[@class="cy-card__description"]/text()')
        links = doc.xpath('//div[@class="cy-panel__body"]//a/@href')
        #Cleaning up the data: strip whitespace and split on commas
        titlesplit = [t.strip().split(",") for t in title]
        descripsplit = [d.strip().split(",") for d in descrip]
        #Remove duplicate links and filter out links containing "alert"
        nodupelink = list(set(links))
        linkclean = [word for word in nodupelink if "alert" not in word]
        linkssplit = [link.split(',') for link in linkclean]
        #Create a dataframe for the data
        df = pd.DataFrame({
            "Title": titlesplit,
            "Description": descripsplit,
            "Link": linkssplit
        })
        #Remove the brackets due to list type
        df["Title"] = df["Title"].str.join(', ')
        df["Description"] = df["Description"].str.join(', ')
        df["Link"] = df["Link"].str.join(', ')
        #Add the row to separate where the information is from
        Siterow = pd.DataFrame({
            "Title": "Cyware",
            "Description": " ",
            "Link": " "
        }, index = [0])
        df = pd.concat([Siterow, df]).reset_index(drop = True)
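One caveat with list(set(links)) above: a set does not preserve order, so the deduplicated links may no longer line up with their titles. If that matters, an order-preserving alternative (Python 3.7+, where dict keeps insertion order) would be:
#dict.fromkeys keeps the first occurrence of each link in its original position
nodupelink = list(dict.fromkeys(links))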
Comments / Explanation
Then for each site the code is the same except for the xpath expressions used to select the data.
As you can see, some of the sites needed no cleaning at all, which made my life easier.
Comments / Explanation
At the very end, if the URL does not match any of the cases defined above, the else branch prints a message naming the site that is not handled.
    elif p == "https://threatpost.com/":
        title2 = doc.xpath('//div[@class="c-border-layout"]//h2[@class="c-card__title"]//a/text()')
        descrip2 = doc.xpath('//div[@class="c-border-layout"]//p/text()')
        links2 = doc.xpath('//div[@class="c-border-layout"]//h2[@class="c-card__title"]//a/@href')
        #Create a dataframe for the data
        df2 = pd.DataFrame({
            "Title": title2,
            "Description": descrip2,
            "Link": links2
        })
        #Add the row to separate where the information is from
        Siterow2 = pd.DataFrame({
            "Title": "Threatpost",
            "Description": " ",
            "Link": " "
        }, index = [0])
        df2 = pd.concat([Siterow2, df2]).reset_index(drop = True)
    else:
        print(f"Something went wrong with {p}")
Export Data
Then finally it combines the DataFrames for each site and exports them as a CSV file to view.
#DataFrame.append was removed in pandas 2.0, so pd.concat is used to combine the per-site DataFrames
finaldf = pd.concat([df, df2, df3, df4])
finaldf.to_csv('Summary.csv', index = False, header = True)
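For the "send the file in the morning" improvement, running the script from a scheduler such as cron would cover the automation half, and the CSV could then be emailed with the standard library. A minimal sketch, assuming an SMTP server reachable on localhost and placeholder addresses:
import smtplib
from email.message import EmailMessage

msg = EmailMessage()
msg["Subject"] = "Daily Cyber Security Headlines"
msg["From"] = "scraper@example.com"    #placeholder sender
msg["To"] = "me@example.com"           #placeholder recipient
msg.set_content("Today's headlines are attached.")

#Attach the exported CSV
with open("Summary.csv", "rb") as f:
    msg.add_attachment(f.read(), maintype = "text", subtype = "csv", filename = "Summary.csv")

#Assumes an SMTP server is reachable on localhost
with smtplib.SMTP("localhost") as server:
    server.send_message(msg)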