
Web Scraping - Cyber Security News

A web scraper that uses the following modules (pandas, requests, lxml.html)

Purpose:

Instead of visiting each of the various sites used as examples here for daily cyber security news, wouldn't it be faster to just gather the main headlines and have them all in one place?


Improvements to still be made

  1. Better error handling

  2. Search for / emphasize specific keywords

  3. Automate running the scraper and sending the file each morning (see the sketch after this list)
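
A minimal sketch of what that automation might look like, assuming the scraper is scheduled with cron and the finished CSV is emailed using Python's standard smtplib / email modules. The server, addresses, and credentials below are placeholders, not part of the project:

# Run the scraper from cron each morning, e.g. "0 7 * * * python3 scraper.py",
# then email the CSV it produced. All addresses and credentials here are placeholders.
import smtplib
from email.message import EmailMessage

msg = EmailMessage()
msg["Subject"] = "Daily cyber security headlines"
msg["From"] = "scraper@example.com"
msg["To"] = "me@example.com"
msg.set_content("Attached is today's summary of headlines.")

# Attach the CSV produced by the scraper
with open("Summary.csv", "rb") as f:
    msg.add_attachment(f.read(), maintype="text", subtype="csv", filename="Summary.csv")

# Send through an SMTP server using STARTTLS
with smtplib.SMTP("smtp.example.com", 587) as server:
    server.starttls()
    server.login("scraper@example.com", "app-password")  # placeholder credentials
    server.send_message(msg)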

Problems Noticed:

  1. Not scalable past a certain number of URLs

    1. For more sites, Scrapy would be a better option

  2. Can easily break if a site changes its layout, since the XPath no longer matches

    1. Could make the XPath more robust by matching on words it contains or on the number of characters (see the sketch after this list)
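
As a rough illustration of that last point, XPath's contains() function can match on a fragment of a class name or on link text instead of the exact attribute value, so a small change to the site's class string would not immediately break the selector. The selectors below are only examples based on the Cyware markup used later in this post:

# Match any h1 whose class merely contains "cy-card__title",
# instead of requiring the full class string to stay identical
title = doc.xpath('//h1[contains(@class, "cy-card__title")]/text()')

# Or keep only links whose visible text contains a keyword of interest
links = doc.xpath('//a[contains(text(), "ransomware")]/@href')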

 
Modules / Libraries
Pandas - an open-source data analysis and manipulation tool for Python
Requests - makes HTTP requests simpler and more human friendly
lxml - easy handling of XML / HTML files
After trying Scrapy and BeautifulSoup, I determined that lxml was the better fit for this project.

import pandas as pd
import requests
import lxml.html

Sites Chosen
Below are the sites I decided to use for this project, stored as a list in the variable url.
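The exact list is not reproduced here; judging from the code below it contained Cyware and Threatpost plus two more sites whose branches produce df3 and df4, so it would have looked roughly like this:

url = [
    "https://cyware.com/cyber-security-news-articles",
    "https://threatpost.com/",
    # ...the two other news sites handled by the branches that build df3 and df4
]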
Data Cleaning and Storing
The following is the portion of the code where I cleaned the data where needed, made sure it was stored in the right format, and recorded which site the data was taken from.
Comments / Explanation
The for loop iterates over the URLs stored in the variable "url", with "p" holding the current URL. Each page is requested over HTTP and the response is stored in the variable "page"; lxml then parses the page content into the variable "doc".
Since each site has its own structure for selecting the elements you want, I decided that XPath was the easiest approach and stored each piece of extracted information in its own variable.
The data was then cleaned of any extra characters that were not needed. For the links in particular, duplicates were removed so that all the columns end up with the same number of rows.
All of the data was then stored in a pandas DataFrame.
Because of the list comprehensions, the data ended up as a list of lists, which needed to be cleaned further by joining each list back into a string.
To clarify in the export where the information came from, I added a row with a short title of the site.
Finally, both DataFrames were combined, with the site title as the first row.

for p in url:

    page = requests.get(p)
    doc = lxml.html.fromstring(page.content)

    if p == "https://cyware.com/cyber-security-news-articles":

        title = doc.xpath('//h1[@class="cy-card__title m-0 cursor-pointer pb-3"]/text()')
        descrip = doc.xpath('//div[@class="cy-card__description"]/text()')
        links = doc.xpath('//div[@class="cy-panel__body"]//a/@href')

        #Cleaning up the data
        titlesplit = [t.strip().split(",") for t in title]
        descripsplit = [d.strip().split(",") for d in descrip]

        nodupelink = list(set(links))
        linkclean = [word for word in nodupelink if "alert" not in word]
        linkssplit = [link.split(',') for link in linkclean]

        #Create a dataframe for the data
        df = pd.DataFrame({
            "Title": titlesplit,
            "Description": descripsplit,
            "Link": linkssplit
        })

        #Remove the brackets due to list type
        df["Title"] = df["Title"].str.join(', ')
        df["Description"] = df["Description"].str.join(', ')
        df["Link"] = df["Link"].str.join(', ')

        #Add the row to separate where the information is from
        Siterow = pd.DataFrame({
            "Title": "Cyware",
            "Description": " ",
            "Link": " "
        }, index=[0])
        df = pd.concat([Siterow, df]).reset_index(drop=True)

 
Comments / Explanation
For each of the other sites the code is the same, except for the XPath expressions used to select the data.
As you can see, some of the sites needed no cleaning at all, which made my life easier.
Comments / Explanation
At the very end, if the URL does not match any of the defined branches, the code prints a message naming the site that isn't handled.

    elif p == "https://threatpost.com/":

        title2 = doc.xpath('//div[@class="c-border-layout"]//h2[@class="c-card__title"]//a/text()')
        descrip2 = doc.xpath('//div[@class="c-border-layout"]//p/text()')
        links2 = doc.xpath('//div[@class="c-border-layout"]//h2[@class="c-card__title"]//a/@href')

        #Create a dataframe for the data
        df2 = pd.DataFrame({
            "Title": title2,
            "Description": descrip2,
            "Link": links2
        })

        #Add the row to separate where the information is from
        Siterow2 = pd.DataFrame({
            "Title": "Threatpost",
            "Description": " ",
            "Link": " "
        }, index=[0])
        df2 = pd.concat([Siterow2, df2]).reset_index(drop=True)

    else:

        print(f"Something went wrong with {p}")

 
Export Data
Finally, the DataFrames for all of the sites are combined and exported as a CSV to view.

finaldf = pd.concat([df, df2, df3, df4])

finaldf.to_csv('Summary.csv', index=False, header=True)
