Infinite loop using wget

I'm working on a project where I have to simulate traffic to certain websites. The solution had to be simple, and while Python would be the obvious choice, bash was right there, with wget doing the job in fewer lines and with fewer libraries than Python.

Requirements

  • Has to run forever
  • URLs have to be visited in random order
  • Each iteration uses a different HTTP user agent
  • Each iteration is followed by a random wait

 

WGET Options

You probably know this, but wget lets you access a website without downloading its content, along with many other options. Here are the ones I'm using.

wget --spider --recursive --delete-after --no-check-certificate --timeout=30 --tries=1 --no-cache --level=3 --max-redirect=3 -nH -U "$uagent" $line

--spider

When invoked with this option, Wget will behave as a Web spider, which means that it will not download the pages, just check that they are there.

--recursive

This means that Wget first downloads the requested document, then the documents linked from that document, then the documents linked by them, and so on.

--delete-after

This option tells Wget to delete every single file it downloads, after having done so.

--no-check-certificate

Don’t check the server certificate against the available certificate authorities. Also don’t require the URL host name to match the common name presented by the certificate.

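--timeout=30

Sets the network timeout to 30 seconds. It applies to the DNS lookup, the connection, and each read, so a stalled page is abandoned after that long (30 seconds is a lot!).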

--tries=1

Try each URL only once, then move on.

--no-cache

Disable server-side cache. In this case, Wget will send the remote server an appropriate directive (‘Pragma: no-cache’) to get the file from the remote service, rather than returning the cached version.

--level=3

Specify the maximum recursion depth, in this case 3.

--max-redirect=3

Specifies the maximum number of redirections to follow for a resource.

-nH

Disable generation of host-prefixed directories. By default, invoking Wget with ‘-r http://fly.srk.fer.hr/’ will create a structure of directories beginning with fly.srk.fer.hr/. This option disables such behavior.

-U

This option allows you to change the User-Agent line issued by Wget. Use of this option is discouraged, unless you really know what you are doing.
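With this many options, one way to keep the command readable is to hold them in a bash array, one option per line. The sketch below is only an illustration under that assumption: the array name wget_opts and the helper simulate_visit are hypothetical names, not part of the original script, and the block only prints the assembled command rather than contacting any server.

```shell
#!/bin/bash
# Options shared by every request, one per line so each is easy to find.
wget_opts=(
  --spider                # check pages instead of saving them
  --recursive             # follow links found in each page
  --delete-after          # remove anything that does get downloaded
  --no-check-certificate  # skip TLS certificate validation
  --timeout=30            # network timeout, in seconds
  --tries=1               # no retries
  --no-cache              # ask the server for a fresh copy
  --level=3               # maximum recursion depth
  --max-redirect=3        # maximum redirects per resource
  -nH                     # no host-prefixed directories
)

# simulate_visit is a hypothetical helper: $1 is the user agent, $2 the URL.
simulate_visit() {
  wget "${wget_opts[@]}" -U "$1" "$2"
}

# Print the assembled command without contacting any server.
echo "wget ${wget_opts[*]} -U <user-agent> <url>"
```

Calling simulate_visit "$uagent" "$line" would then issue the same request as the one-line command above.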

 

The code

The random wait time, between 1 and 10 seconds, comes from this snippet

sleep $(( RANDOM % 10 + 1 ))s

and the random HTTP user agent is set by

uagent=$(shuf -n 1 "$ua")

where $ua is the file holding the list of user agents you want. I got mine from this website: https://developers.whatismybrowser.com/useragents/explore/
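To see both pieces working without touching the network, here is a small sketch. The function name random_delay, the temporary file, and the three sample user-agent strings are made up for illustration; a real run would read the list from user-agents.txt instead.

```shell
#!/bin/bash
# Return a random whole number between min and max, inclusive.
random_delay() {
  local min=$1 max=$2
  echo $(( RANDOM % (max - min + 1) + min ))
}

# Hypothetical sample list standing in for user-agents.txt.
ua_file=$(mktemp)
trap 'rm -f "$ua_file"' EXIT
printf '%s\n' \
  'Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0' \
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/2.0' \
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15) ExampleBrowser/3.0' > "$ua_file"

delay=$(random_delay 1 10)       # same 1 to 10 second range as the snippet above
uagent=$(shuf -n 1 "$ua_file")   # one randomly chosen line from the list

echo "would wait ${delay}s, then send User-Agent: $uagent"
```

Each run should print a different combination of delay and user agent. Passing other bounds to random_delay changes the range without touching the arithmetic.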

#!/bin/bash

# Input files
ua="user-agents.txt"   # one user agent per line
urls="urls.txt"        # one URL per line
input="shuffled.txt"   # URL list, reshuffled at the start of each round

# Loop forever
while true; do
   shuf "$urls" > "$input" # randomize the list once per round, before reading it
   while IFS= read -r line; do
      uagent=$(shuf -n 1 "$ua") # use a different user agent for each request
      # -T is the short form of --timeout, so it is not repeated here
      wget --spider --recursive --delete-after --no-check-certificate --timeout=30 --tries=1 --no-cache --level=3 --max-redirect=3 -nH -U "$uagent" "$line"
      sleep $(( RANDOM % 10 + 1 ))s # wait 1 to 10 seconds
   done < "$input"
done
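The script writes the shuffled list to a file before reading it. A possible alternative, sketched below, feeds the loop straight from shuf through process substitution, so no intermediate file is needed. This is a dry run under assumptions: the visit function is a hypothetical stand-in for the wget call, the three example.com URLs are placeholders, and the loop is capped at two rounds so it terminates.

```shell
#!/bin/bash
# Hypothetical sample list standing in for urls.txt.
urls_file=$(mktemp)
trap 'rm -f "$urls_file"' EXIT
printf '%s\n' \
  'https://example.com/a' \
  'https://example.com/b' \
  'https://example.com/c' > "$urls_file"

visited=()

# Stand-in for the wget call: records and prints the URL instead of fetching it.
visit() {
  visited+=("$1")
  echo "would visit: $1"
}

# Two rounds instead of "while true" so the dry run ends.
for round in 1 2; do
  # shuf runs once per round; the loop body stays in the current shell.
  while IFS= read -r url; do
    visit "$url"
  done < <(shuf "$urls_file")
done

echo "total visits: ${#visited[@]}"
```

Every URL is visited once per round, in a fresh random order each time. Swapping the for loop back to while true and the visit body back to the wget command would restore the endless behavior.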
