Web Scraping with PowerShell

Web scraping is about as fun as the name implies. You're scraping a web page to get data that you want to do something with. It's painful, time-consuming, and sometimes a requirement.

When Invoke-RestMethod can't help because a site does not provide a public API, your only option is to scrape the data directly from the HTML.

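PowerShell provides a cool method for this. Invoke-WebRequest pulls the page down into a response object whose ParsedHtml property is a COM object that you can then work with. You can write the response to a file or store it in a variable and use it from there. Note that ParsedHtml is only available in Windows PowerShell (5.1 and earlier); the response object in PowerShell 6 and later does not include it.

This will require some knowledge of HTML.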

Let's get started.

Say you find a really good recipe. You want to hang onto it locally, but you only want the recipe and not the extra fluff on the page. You could write it down or print it, but that doesn't get you just the recipe.

First, take a look at the recipe site: 'http://damndelicious.net/2017/04/21/korean-beef-bowl-meal-prep/'. Use your browser's Inspect feature to look at the markup. You will see that some of the classes follow a standardized recipe schema, which makes building recipes on websites, and in turn scraping them, much easier.

If you're inclined to cook after this, give this one a shot. It's fantastic.

Now we need to grab the site and put it into a local variable.

$uri = 'http://damndelicious.net/2017/04/21/korean-beef-bowl-meal-prep/'

$site = Invoke-WebRequest -Uri $uri

Now if you use Get-Member on the variable, you should see it come back as an HtmlWebResponseObject.

Let's start with the recipe card, because it's fairly easy to get. We just want all the text inside the recipe in a nice, readable view.
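As a quick check, a sketch like the following fetches the page and inspects what came back. It assumes Windows PowerShell 5.1 and a working network connection, and the temp-file path is only an illustration, not something the rest of the script relies on.

```powershell
$uri  = 'http://damndelicious.net/2017/04/21/korean-beef-bowl-meal-prep/'
$site = Invoke-WebRequest -Uri $uri

# Type name of the response object
$site.GetType().Name

# Properties available on the response (ParsedHtml is the one we care about)
$site | Get-Member -MemberType Property

# Alternatively, save the raw page to a file (this path is just an example)
$outFile = Join-Path ([System.IO.Path]::GetTempPath()) 'recipe-page.html'
Invoke-WebRequest -Uri $uri -OutFile $outFile
```

Saving to a file is handy if you want to look at the raw HTML in an editor while you work out which tags and classes to target.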

Add this to your script:

$recipe = $site.ParsedHtml.getElementsByTagName('div') |
    Where-Object {$_.getAttributeNode('class').Value -eq 'recipe'}

$recipeCard = $recipe.textContent

Let's break this down:

  1. Take the site's "parsed HTML", which is a property of the response object.
  2. Get all elements with the tag name of div.
  3. Pipe this to Where-Object.
  4. Filter where the attribute 'class' is equal to 'recipe'.
  5. Store the text of the matching element in $recipeCard.

A good portion of recipe sites use this standardized schema, which makes both developing and scraping them easier.

Print $recipeCard to the screen and you should see the recipe card!

Because of the standard schema that they are using, we can get the other properties the same way. Add this big ole block to your script:

$ingredients = $recipe.getElementsByTagName('div') |
    Where-Object {$_.getAttributeNode('class').Value -eq 'ingredients'}

$instructions = $recipe.getElementsByTagName('div') |
    Where-Object {$_.getAttributeNode('class').Value -eq 'instructions'}

$meta = ($recipe.getElementsByTagName('div') |
    Where-Object {$_.getAttributeNode('class').Value -like '*time*'}).getElementsByTagName('p')

$recipeName = $recipe.getElementsByTagName('h2') |
    Where-Object {$_.getAttributeNode('itemProp').Value -eq 'name'}

Again, you can read through this and see that it does the same thing as the recipe card, just narrowed to the properties we want. The big difference is $recipeName, which looks for the attribute 'itemProp' instead of 'class'. That is purely because the site didn't add a 'name' class to the <h2> tag.
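Since the same filter shows up four times, one option is to wrap it in a small helper. This is only a sketch: Get-ChildByClass is a made-up name, not a built-in cmdlet or part of the original script, and it assumes the parent object exposes getElementsByTagName the way the ParsedHtml elements do.

```powershell
# Hypothetical helper: returns child elements of a given tag whose
# 'class' attribute matches a wildcard pattern.
function Get-ChildByClass {
    param(
        [Parameter(Mandatory)] $Parent,
        [Parameter(Mandatory)] [string] $TagName,
        [Parameter(Mandatory)] [string] $ClassPattern
    )

    $Parent.getElementsByTagName($TagName) |
        Where-Object { $_.getAttributeNode('class').Value -like $ClassPattern }
}

# Usage against the $recipe element from earlier would look like:
#   $ingredients  = Get-ChildByClass -Parent $recipe -TagName 'div' -ClassPattern 'ingredients'
#   $instructions = Get-ChildByClass -Parent $recipe -TagName 'div' -ClassPattern 'instructions'
```

Using -like means a pattern without wildcards behaves like an exact match, while '*time*' still works for the meta tags.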

Now let's make a nice custom object for us to use:

$recipeObject = [PSCustomObject]@{
    Name = $recipeName.InnerText
    Ingredients = @()
    Instructions = @()
    Url = $uri
    Card = $recipeCard
}

If you print the $instructions or $ingredients object, you will see that there are several elements inside it, so we need to cycle through these to get what we want. What we are looking for is the same thing we wanted for $recipeCard: the text, which here comes from the InnerText property.

InnerText is the text between an element's opening and closing tags, which is the stuff we want.

So let's add a foreach loop to get that for the Ingredients and Instructions:

foreach ($ingredient in $ingredients) {
    $recipeObject.Ingredients += $ingredient.InnerText
}

foreach ($instruction in $instructions) {
    $recipeObject.Instructions += $instruction.InnerText
}

Pretty straightforward on this one. For each X inside Y, add X's InnerText to the $recipeObject Instructions/Ingredients array. Because we know they are in order, we don't need to do anything else with it.

Now let's get to the $meta properties. You will notice that we did not add properties for them to the $recipeObject. That's because of the way we got the tags: we have no way of knowing which is which. So we have to massage it a bit. (For this one, we only have to do very minimal massaging. It could be worse.)
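Growing an array with += works fine for a handful of items, though it builds a new array on every pass. One alternative, sketched here with made-up stand-in objects rather than real DOM elements, is to assign the output of the loop directly:

```powershell
# Stand-in objects for illustration only; the real elements expose
# an InnerText property in the same way.
$ingredients = @(
    [PSCustomObject]@{ InnerText = 'first ingredient' }
    [PSCustomObject]@{ InnerText = 'second ingredient' }
)

$recipeObject = [PSCustomObject]@{ Ingredients = @() }

# Everything the loop body emits is collected into the array in one step.
$recipeObject.Ingredients = @(foreach ($ingredient in $ingredients) {
    $ingredient.InnerText
})

$recipeObject.Ingredients
```

The order is preserved either way, so pick whichever reads better to you.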

Add this to your script:

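foreach ($tag in $meta) {
    # The <strong> element holds the label, e.g. 'Cook Time:'
    $tagName = $($tag.getElementsByTagName('strong')).InnerText
    $tagName = $tagName -replace ':'
    # Trim so 'Cook Time' becomes 'Cook' rather than 'Cook ' with a trailing space
    $tagName = ($tagName -replace 'Time').Trim()

    # The text after the colon is the value, e.g. '20 minutes'
    $content = $($tag.InnerText).split(':')[1].trim()

    Add-Member -InputObject $recipeObject -MemberType NoteProperty -Name $tagName -Value $content
}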

What does this do? Basically the same as the others, but because we don't know which property we are working with until we parse the inner text, we can't define it on the object ahead of time.

So what we do here is cycle through each "meta" tag and then dynamically assign the property to our $recipeObject once we know what we are working with.

Because each tag follows this format: 'Cook Time: 20 minutes', we pull out the name and the content and apply them to the object using the Add-Member cmdlet.
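To see the parsing on its own, here is a small sketch that works on the sample string above instead of a live tag. The $demo object and the choice to split with a limit of 2 are my own illustration, not part of the script.

```powershell
$line = 'Cook Time: 20 minutes'

# Split on the first colon only, so a value that contains a colon stays intact.
$name, $value = $line -split ':', 2

# Drop the word 'Time' and any leftover whitespace.
$name  = ($name -replace 'Time').Trim()
$value = $value.Trim()

# Attach the pair to an object as a new property.
$demo = [PSCustomObject]@{}
Add-Member -InputObject $demo -MemberType NoteProperty -Name $name -Value $value

$demo.Cook
```

The last line prints the value stored under the dynamically named Cook property, which is '20 minutes' for this input.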

Next, we just want a good way to look at the card:

$recipeObject | Format-List

Now we have a cool recipe object that we can convert to JSON and redirect to a file, or save as a plain text file, to keep that recipe for future use.

Hopefully this helped you. If you have any questions, feel free to shoot me an email.
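Saving might look something like this. It is a sketch that uses a made-up sample object and temp-folder paths so it runs on its own. In the real script you would pipe $recipeObject instead and pick your own file names.

```powershell
# Sample object for illustration only; substitute $recipeObject in practice.
$sample = [PSCustomObject]@{
    Name        = 'Example recipe'
    Ingredients = @('first ingredient', 'second ingredient')
    Url         = 'http://example.com/recipe'
}

$folder   = [System.IO.Path]::GetTempPath()
$jsonPath = Join-Path $folder 'recipe.json'
$textPath = Join-Path $folder 'recipe.txt'

# JSON keeps the structure, so it can be loaded back into an object later.
$sample | ConvertTo-Json -Depth 3 | Set-Content -Path $jsonPath -Encoding UTF8

# Plain text keeps the same view that Format-List shows on screen.
$sample | Format-List | Out-File -FilePath $textPath

# Reading the JSON back gives an object with the same properties.
$roundTrip = Get-Content -Path $jsonPath -Raw | ConvertFrom-Json
$roundTrip.Name
```

JSON is the better choice if you plan to feed the recipe into another script later, while the text file is easier to read or print.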