Django middleware - Crawler detection

Sometimes you need to detect search engine crawlers on your system, so that you can handle their requests differently from those of regular visitors.

Our website gets a lot of crawler traffic, because BV FAPESP (https://bv.fapesp.br) provides useful information about research in science, technology and academia in the state of São Paulo, Brazil.

Recently, I deployed a feature that lets people save the queries and filters they run on our website. It is open to everybody who browses the site. It is based on the browser's HTTP session, which Python/Django stores server-side.

While we were tuning the Django session table in the database, we found that a lot of sessions had been created by crawlers. To take control of session creation, I wrote the middleware that follows.

I took the list of crawlers from nginx logs covering a very short timeframe, so there may be more search engines that are not listed.
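As a rough illustration of how such a list could be collected, here is a minimal sketch that counts user agents in access log lines. It assumes nginx's default "combined" log format, where the user agent is the last quoted field; the function name `count_user_agents` and the sample lines are made up for this example and are not taken from our real logs.

```python
import re
from collections import Counter

# Assumption: nginx's default "combined" log format, where the user agent
# is the last double-quoted field on each line.
USER_AGENT_AT_END = re.compile(r'"([^"]*)"\s*$')


def count_user_agents(lines):
    """Return a Counter mapping each user-agent string to its hit count."""
    counts = Counter()
    for line in lines:
        match = USER_AGENT_AT_END.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Made-up sample lines; in practice, iterate over the open access log file.
sample_lines = [
    '203.0.113.5 - - [10/Oct/2017:13:55:36 -0300] "GET / HTTP/1.1" 200 612 '
    '"-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '203.0.113.5 - - [10/Oct/2017:13:55:40 -0300] "GET /pt/ HTTP/1.1" 200 845 '
    '"-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '198.51.100.7 - - [10/Oct/2017:13:56:02 -0300] "GET / HTTP/1.1" 200 612 '
    '"-" "Mozilla/5.0 (X11; Linux x86_64)"',
]

# Most frequent user agents first; crawler names usually stand out at the top.
for user_agent, hits in count_user_agents(sample_lines).most_common():
    print(hits, user_agent)
```

Reading the most frequent entries by eye is still needed to decide which ones are crawlers; the script only does the counting.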

The use case presented here is just one among many; this middleware can be useful in other situations.

This is my middleware to detect crawlers:
  
    import re


    # Old-style middleware (MIDDLEWARE_CLASSES). With the MIDDLEWARE setting
    # of Django 1.10+, inherit from django.utils.deprecation.MiddlewareMixin.
    class DetectCrawler():
        """
        Check the user agent against a list of known crawlers.
        Sets the attribute is_crawler on the request object.
        """
        crawlers = ['Sogou','Slack-ImgProxy','IABot','Twitterbot','Sleuth','CCBot','PiplBot',
                    'Googlebot','Slurp','Twiceler','msnbot','KaloogaBot','YodaoBot',
                    'Baiduspider','Speedy Spider','DotBot','AhrefsBot','Applebot',
                    'bingbot','YandexBot','trendiction','BLEXBot','SEMrushBot','AddThis',
                    'TurnitinBot','magpie-crawler']

        def process_request(self, request):
            request.is_crawler = False

            # Requests without a User-Agent header are treated as non-crawlers.
            if 'HTTP_USER_AGENT' not in request.META:
                return

            ua = request.META['HTTP_USER_AGENT']
            for crawler in self.crawlers:
                # re.escape makes each name match literally, not as a regex pattern.
                if re.search(re.escape(crawler), ua, re.IGNORECASE):
                    request.is_crawler = True
                    return
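
To show how a view could use the `is_crawler` attribute set by the middleware, here is a minimal sketch that skips writing to the session for crawler requests. The function `save_query` and the session key `saved_queries` are hypothetical names chosen for this example, not the actual code of our feature, and the `SimpleNamespace` objects only stand in for Django's request so that the snippet runs without a project.

```python
from types import SimpleNamespace


def save_query(request, query):
    """Store a query in the session unless the request came from a crawler."""
    # getattr keeps this working even if the middleware is not installed.
    if getattr(request, "is_crawler", False):
        return False
    saved = request.session.get("saved_queries", [])
    # Assigning the key (instead of mutating the list in place) is what
    # marks a Django session as modified.
    request.session["saved_queries"] = saved + [query]
    return True


# Stand-in request objects for illustration only; a real view receives
# Django's HttpRequest, whose session behaves like a dictionary.
human = SimpleNamespace(is_crawler=False, session={})
bot = SimpleNamespace(is_crawler=True, session={})

print(save_query(human, "grants in physics"), human.session)
print(save_query(bot, "grants in physics"), bot.session)
```

Django only saves a session when it has been modified (unless `SESSION_SAVE_EVERY_REQUEST` is enabled), so leaving the session untouched for crawlers is enough to avoid creating new rows in the session table.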
	
    
