Web Scraping - Cyber Security News
A web scraper that uses the following modules (pandas, requests, lxml.html)
Purpose:
Instead of visiting several different sites every day for cyber security news, wouldn't it be faster to gather the main headlines from each and have them all in one place?
Improvements still to be made
- Better error handling (a rough sketch follows this list)
- Search for / emphasize specific keywords
- Automate running the script and sending the file in the morning (see the smtplib sketch after the Export Data code)
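For the error-handling item, this is a rough sketch of what wrapping the request in a try/except could look like; the timeout value and the choice to skip a failed site rather than stop the run are my own assumptions, not part of the current code.
import requests

for p in url:
    try:
        page = requests.get(p, timeout = 10)   #timeout so a hanging site does not stall the whole run
        page.raise_for_status()                #raise an error for 4xx / 5xx responses
    except requests.exceptions.RequestException as e:
        print(f"Skipping {p}: {e}")            #report the failing site and move on instead of crashing
        continue
    #...then parse page.content with lxml as in the main loop below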
Problems Noticed:
- Not scalable once a certain number of URLs is reached
  - Scrapy would be a better option if many more sites were added
- Can easily break if a site changes the structure the xpath relies on
  - Could instead match elements that contain certain words or a certain number of characters (see the sketch after this list)
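As an illustration of that last point, xpath can match on part of a class name or on the contents of an attribute instead of hard-coding the full class string. A small sketch, reusing the Cyware selectors from the code below; the "news" keyword is only an example, not something the current code filters on.
#Match any h1 whose class merely contains "cy-card__title" instead of the full class string
title = doc.xpath('//h1[contains(@class, "cy-card__title")]/text()')
#Or keep only links whose URL contains a given word
links = [href for href in doc.xpath('//a/@href') if "news" in href]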
Modules / Libraries
Pandas - an open-source data analysis and manipulation tool for Python
Requests - makes HTTP requests simpler and more human-friendly
lxml - easy handling of XML / HTML files
After trying Scrapy and BeautifulSoup, I determined that lxml was the better fit for this project.
import pandas as pd
import requests
import lxml.html
Sites Chosen
Below are the sites I decided to use for this project, stored as a list in the variable url.
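Only the two sites handled in the code excerpts on this page are known for certain, so the list below is a partial reconstruction rather than the exact list used in the project.
#Partial reconstruction -- only the two sites handled in the excerpts below are known for certain
url = [
    "https://cyware.com/cyber-security-news-articles",
    "https://threatpost.com/",
    #...the remaining sites (the ones behind df3 and df4 in the export step) go here
]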
Data Cleaning and Storing
The following portion of the code cleans the data where needed, stores it in the required format, and records which site the data was taken from.
Comments / Explanation
The for loop iterates over the URLs we specified, using the variable "p" for the current one. For each URL it makes an HTTP request, stores the response in the variable "page", then uses lxml to parse the page content into the variable "doc".
Because each site requires its own way of selecting the elements you want, I decided that xpath was the easiest approach and stored each piece of extracted information in its own variable.
The data was then cleaned of any extra characters that were not needed. For the links in particular, duplicates were removed so that every column ends up with the same number of rows.
All of the data was then stored in a pandas DataFrame.
Because of the list comprehensions, each column ended up as a list of lists, which needed further cleaning by joining the pieces back into strings.
To make clear in the later export where the information came from, I added a row containing a short title for the site.
Finally, both DataFrames are concatenated so that the site title becomes the first row.
for p in url:
    page = requests.get(p)
    doc = lxml.html.fromstring(page.content)
    if p == "https://cyware.com/cyber-security-news-articles":
        #Select the titles, descriptions and links with xpath
        title = doc.xpath('//h1[@class="cy-card__title m-0 cursor-pointer pb-3"]/text()')
        descrip = doc.xpath('//div[@class="cy-card__description"]/text()')
        links = doc.xpath('//div[@class="cy-panel__body"]//a/@href')
        #Cleaning up the data: strip whitespace and split on commas
        titlesplit = [t.strip().split(",") for t in title]
        descripsplit = [d.strip().split(",") for d in descrip]
        #Remove duplicate links and filter out links containing "alert"
        nodupelink = list(set(links))
        linkclean = [word for word in nodupelink if "alert" not in word]
        linkssplit = [link.split(',') for link in linkclean]
        #Create a dataframe for the data
        df = pd.DataFrame({
            "Title": titlesplit,
            "Description": descripsplit,
            "Link": linkssplit
        })
        #Remove the brackets due to list type
        df["Title"] = df["Title"].str.join(', ')
        df["Description"] = df["Description"].str.join(', ')
        df["Link"] = df["Link"].str.join(', ')
        #Add the row to separate where the information is from
        Siterow = pd.DataFrame({
            "Title": "Cyware",
            "Description": " ",
            "Link": " "
        }, index = [0])
        df = pd.concat([Siterow, df]).reset_index(drop = True)
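One caveat with list(set(links)) above: a set does not preserve order, so the deduplicated links may no longer line up with their titles. If that matters, an order-preserving alternative (Python 3.7+, where dict keeps insertion order) would be:
#dict.fromkeys keeps the first occurrence of each link in its original position
nodupelink = list(dict.fromkeys(links))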
Comments / Explanation
Then for each site the code is the same except for the xpath expressions used to select the data.
As you can see, some of the sites needed no cleaning at all, which made my life easier.
Comments / Explanation
At the very end, if the URL does not match any of the cases defined above, the else branch prints a message naming the site that is not handled.
    elif p == "https://threatpost.com/":
        title2 = doc.xpath('//div[@class="c-border-layout"]//h2[@class="c-card__title"]//a/text()')
        descrip2 = doc.xpath('//div[@class="c-border-layout"]//p/text()')
        links2 = doc.xpath('//div[@class="c-border-layout"]//h2[@class="c-card__title"]//a/@href')
        #Create a dataframe for the data
        df2 = pd.DataFrame({
            "Title": title2,
            "Description": descrip2,
            "Link": links2
        })
        #Add the row to separate where the information is from
        Siterow2 = pd.DataFrame({
            "Title": "Threatpost",
            "Description": " ",
            "Link": " "
        }, index = [0])
        df2 = pd.concat([Siterow2, df2]).reset_index(drop = True)
    else:
        print(f"Something went wrong with {p}")
Export Data
Then finally it combines the DataFrames for each site and exports them as a CSV file to view.
#DataFrame.append was removed in pandas 2.0, so pd.concat is used to combine the per-site DataFrames
finaldf = pd.concat([df, df2, df3, df4])
finaldf.to_csv('Summary.csv', index = False, header = True)
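For the "send the file in the morning" improvement, running the script from a scheduler such as cron would cover the automation half, and the CSV could then be emailed with the standard library. A minimal sketch, assuming an SMTP server reachable on localhost and placeholder addresses:
import smtplib
from email.message import EmailMessage

msg = EmailMessage()
msg["Subject"] = "Daily Cyber Security Headlines"
msg["From"] = "scraper@example.com"    #placeholder sender
msg["To"] = "me@example.com"           #placeholder recipient
msg.set_content("Today's headlines are attached.")

#Attach the exported CSV
with open("Summary.csv", "rb") as f:
    msg.add_attachment(f.read(), maintype = "text", subtype = "csv", filename = "Summary.csv")

#Assumes an SMTP server is reachable on localhost
with smtplib.SMTP("localhost") as server:
    server.send_message(msg)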