Posts

Showing posts from 2019

Django middleware - Crawler detection

Sometimes you have to detect search engine crawling activity on your system so that you can handle this type of request differently. Our website sees a lot of crawling activity, because BV FAPESP (https://bv.fapesp.br) provides useful information about research in science, technology and academia in the state of São Paulo, Brazil. Recently, I deployed a feature that lets people store the queries and filters they run on our website. It is open to everybody who navigates our system. It is based on the browser's HTTP session, which is stored server-side by Python/Django. When we were tuning the Django session table in the database, we found that a lot of sessions were being created by crawling activity. To take control of session creation, I wrote the middleware that follows. I took the list of crawlers from nginx logs over a very small timeframe, so there may be more search engines that are not listed. The use-case presented, is j
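The idea of flagging crawler requests in a middleware can be sketched as follows. This is only a minimal illustration, not the middleware from the post: the crawler list, the `is_crawler` helper and the `request.is_crawler` attribute are assumptions made for the example. It follows Django's function-based middleware shape (a factory that receives `get_response`), but it does not import Django, so it can be tried on its own.

```python
import re

# Hypothetical and incomplete list of crawler User-Agent fragments.
CRAWLER_PATTERN = re.compile(
    r"googlebot|bingbot|yandex|baiduspider|duckduckbot", re.IGNORECASE
)


def is_crawler(user_agent):
    """Return True when the User-Agent string looks like a listed crawler."""
    return bool(CRAWLER_PATTERN.search(user_agent or ""))


def crawler_detection_middleware(get_response):
    """Django-style middleware factory that flags crawler requests."""

    def middleware(request):
        user_agent = request.META.get("HTTP_USER_AGENT", "")
        # Assumed attribute name: views can check it before writing
        # anything to request.session.
        request.is_crawler = is_crawler(user_agent)
        return get_response(request)

    return middleware
```

With a flag like this in place, a view could skip writing to `request.session` when `request.is_crawler` is true, so that crawler traffic does not add rows to the session table.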

Connect Django Haystack to Solr Cloud

At BV FAPESP (www.bv.fapesp.br) we use Solr as the search engine backend, and a library called Haystack to tie Solr to Django. In 2018, my team and I wrote a Python/Django library to use with Apache Solr in cloud mode. We avoided the Django/Haystack library because it did not support some features, such as grouping, Streaming Expressions and Graph Analysis. So far so good: before the end of the project I had Solr Cloud running smoothly in production, but I still had a single Solr instance running with Haystack, because we had not re-coded the whole system and legacy code still used Haystack. To turn off the single Solr instance, we moved all documents to Solr Cloud and connected Haystack to it. This is what I document here, for myself and maybe for you, if you are trying to do the same. Step-by-step There is a Solr Cloud Python backend for Haystack, which you can find here: https://github.com/django-haystack/django-haystack/pull/1580/commits/13df4a9e69ececd5567636085df4

Dynamic data use-case for d3.js

I have been using d3.js in a production environment for about 3 years. Now, due to some feature upgrades on our website, some changes need to be made to our charts. I decided to write this down to document and share these experiments. To play with d3.js, you have to deal with several technologies: JavaScript, CSS, Ajax and a server side (Python/Django for me). For this system I also use Solr to deliver graph data. System overview The portal ( https://bv.fapesp.br ) is a standard web site that loads dynamic data from a relational database and a NoSQL database. To display analytics charts, we use d3.js. The diagram below shows the communication layer between the components that load d3.js charts on the front-end of BV FAPESP. Essentially, we use d3.js as just that: a JavaScript framework to display chart data. However, when you have to load data dynamically into a couple of charts, things need to be a bit more elaborate. The BV FAPESP portal makes use of some d3.js charts seemin

Solr - Graph traversal query

In this document I will show you how I am using some distinct technology layers to display graph data from BV FAPESP, stored inside Solr. FAPESP's virtual library (BV FAPESP) is the information system, and the source of referential data, for projects funded by the São Paulo funding agency. These projects are related to each other in a tree structure, a specialized form of a graph structure. BV FAPESP data projects structure In the relational database, the projects are linked by foreign keys, the simplest way to store the tree structure relationship. But the main data source of BV FAPESP is not a relational database but the search engine Solr. Solr uses an inverted index structure, which is very fast at query time. The initial approach At the early stages of BV FAPESP, the associated projects were indexed with some information from their parent, to allow the inverted search to get all children of a subset of projects. Everything is
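To make the tree relationship described above concrete, here is a small sketch of walking a foreign-key tree in plain Python. It is an illustration only: the `(project_id, parent_id)` rows and the `descendants` function are hypothetical and do not come from the BV FAPESP code or from Solr.

```python
from collections import defaultdict, deque


def descendants(rows, root_id):
    """Return every project below root_id, breadth first.

    rows is an iterable of (project_id, parent_id) pairs, where parent_id
    is None for a top-level project (a hypothetical layout).
    """
    children = defaultdict(list)
    for project_id, parent_id in rows:
        if parent_id is not None:
            children[parent_id].append(project_id)

    found = []
    queue = deque([root_id])
    while queue:
        current = queue.popleft()
        for child in children.get(current, []):
            found.append(child)
            queue.append(child)
    return found


# Project 1 has children 2 and 3; project 2 has child 4; 5 is unrelated.
rows = [(1, None), (2, 1), (3, 1), (4, 2), (5, None)]
print(descendants(rows, 1))  # [2, 3, 4]
```

In a relational database each level of this walk costs another lookup, which is one reason a query-time graph traversal inside the search engine is attractive.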

Atom - Jupyter / Hydrogen

I will show you how to use the Atom editor with Jupyter Notebook to debug Python/Django views using the Hydrogen plugin. On the Internet you can find some examples using this setup to debug Python, but I found nothing related to Django, so I decided to publish these notes. This is not a high performance setup, because it involves many technologies at several layers. It is more a proof of concept of the technologies now available in the programmer's toolbox. These notes were written during the development and improvement of the automated tests of the Virtual Library of FAPESP ( http://www.bv.fapesp.br ). The environment The diagram below shows an actual infrastructure to use this setup in a development environment. Technologies You are not expected to know all of the technologies in detail, but there are some tricky points. This paper is aimed at Python programmers who already use Atom and Jupyter and enjoy exploring the edges of these tools.

Solr Facet add link

This is about how to link the resulting facet data from Solr in your web application. I am using Django as the front-end framework, but this solution applies to any language. The problem You have modeled your data and indexed it in Solr. You want to show facets in your web application to make it easier for users to navigate your site. You get faceted data from Solr; so far so good. The problem arises when you need to link a facet item to any destination other than the value of the facet itself, because Solr faceted data returns only the field value and the count of occurrences. The example below is a simplification of the Solr facet web page 1 , and shows what the JSON facet response looks like: {   "facet_counts" :{     "facet_queries" :{},     "facet_fields" :{       "categories" :[         "electronics" , 14 ,         "currency" , 4 ,         "memory" , 3 ,         "connector" , 2 ,
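As the response above shows, each entry under `facet_fields` is a flat list that alternates value and count. A minimal sketch of pairing those items up and attaching a destination is given below. It is not the solution from the post: the `facet_pairs` and `facet_links` helpers and the `destinations` lookup (a mapping the application would have to maintain itself) are assumptions made for illustration.

```python
def facet_pairs(flat):
    """Turn Solr's flat [value, count, value, count, ...] list into pairs."""
    return list(zip(flat[::2], flat[1::2]))


def facet_links(flat, destinations):
    """Attach a URL to each facet item.

    destinations is a hypothetical {facet value: URL} mapping kept by the
    application; values without an entry get None.
    """
    return [
        {"label": value, "count": count, "url": destinations.get(value)}
        for value, count in facet_pairs(flat)
    ]


categories = ["electronics", 14, "currency", 4, "memory", 3, "connector", 2]
print(facet_pairs(categories))
# [('electronics', 14), ('currency', 4), ('memory', 3), ('connector', 2)]
```

Because Solr returns only the value and the count, anything else needed for the link, such as an id or a slug, has to come from outside the facet response, which is the role the `destinations` mapping plays in this sketch.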