Wednesday, June 24, 2015

Using the Python Requests module to POST documents to Solr


Method Definition


The first method accepts the Solr URL and the JSON payload, and POSTs the payload with commit=false so that the document is not committed immediately. The second method does not POST any data; it only commits any pending updates.


import requests

def post(host, data):
    # Send one JSON payload to Solr without committing it.
    headers = {"content-type": "application/json"}
    params = {"commit": "false"}
    return requests.post(host, data=data, params=params, headers=headers)

def commit(host):
    # Send an empty request that commits all pending updates.
    headers = {"content-type": "application/json"}
    params = {"commit": "true"}
    return requests.post(host, params=params, headers=headers)



Method Invocation


The most important part of the method invocation is the construction of the JSON payload:
payload = {
 "add" : {
  "doc" : str(data)
 }
}


The payload has three parts:
  1. The add command tells Solr that a create or update is going to be performed.
  2. The doc key marks the start of the document itself.
  3. The str(data) value contains the data read in from a file.
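
As a minimal sketch of these three parts, the same payload can also be assembled with Python's standard json module. The helper name build_payload and the sample document text are assumptions made for illustration; they are not part of the original code.

```python
import json

def build_payload(raw_text):
    """Wrap the JSON text of one document in a Solr "add" command.

    build_payload is a hypothetical helper, not part of the original post.
    """
    doc = json.loads(raw_text)        # parse the file contents into a dict
    command = {"add": {"doc": doc}}   # "add" -> "doc" -> the document itself
    return json.dumps(command)        # serialize the whole command as JSON

# Illustrative document text, standing in for the contents of a file
sample_text = '{"id": "1", "title": "Atmospheric methane and global change"}'
body = build_payload(sample_text)
print(body)
```

Parsing the file first means malformed input fails early in Python rather than being rejected later by Solr.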

The full code also takes into account the commit threshold:
import file_utils  # helper module from the Python Snippets post (see References)

def parse(host, dir_in, ext, threshold):

    counter = 0
    total_commits = 0

    files = file_utils.getfiles(dir_in, ext)
    # Integer division: the number of full batches that will be committed
    total_required_commits = len(files) // threshold

    for file in files:

        # READ INCOMING FILE ...
        # (the with-block closes the file, so no explicit close() is needed)
        with open(file, "r") as myfile:
            data = myfile.read().replace('\n', '')

        payload = {
            "add": {
                "doc": str(data)
            }
        }
        response = post(host, cleanse(payload))

        counter = counter + 1
        print("Post Response (status = {0}, counter = {1}-{2}, total-commits = {3}-{4})".format(response.status_code, counter, threshold, total_commits, total_required_commits))

        if counter >= threshold:
            print("About to Commit (total-docs = {0})".format(threshold))
            commit(host)
            counter = 0
            total_commits = total_commits + 1

    # Commit any documents left over after the last full batch
    if counter > 0:
        commit(host)

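def cleanse(payload):
    # Convert the dict to a string, strip the quotes around the embedded
    # document, and double-quote the "add" and "doc" keys.
    payload = str(payload)
    payload = payload.replace("'{", "{")
    payload = payload.replace("}'", "}")
    payload = payload.replace("'add'", "\"add\"")
    payload = payload.replace("'doc'", "\"doc\"")
    return payload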
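
The commit threshold logic can be looked at in isolation. The sketch below makes no network calls; it only lists the points at which a run would commit for a given number of files. The helper plan_commits is hypothetical, and it assumes one commit after every threshold documents plus one final commit if a partial batch is left over.

```python
def plan_commits(total_docs, threshold):
    """Return the 1-based document counts after which a commit is issued.

    plan_commits is a hypothetical helper for illustration only.
    """
    commit_points = []
    counter = 0
    for position in range(1, total_docs + 1):
        counter += 1
        if counter >= threshold:
            commit_points.append(position)   # a full batch was posted
            counter = 0
    if counter > 0:
        commit_points.append(total_docs)     # leftover documents
    return commit_points

print(plan_commits(10, 4))   # [4, 8, 10]
```

With 10 files and a threshold of 4, the last two documents only become visible because of the final commit.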



Payload Definition


The payload must correspond to the schema.xml file defined within the Solr core (solr_data/{core}/conf/schema.xml):
<schema name="documents" version="1.5">

<fields>
   <field name="_version_" type="long" indexed="true" stored="true"/>
   <field name="_root_" type="string" indexed="true" stored="false"/>
   <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" /> 
   <field name="filename" type="text_transcript" indexed="true" stored="true" omitNorms="true"/>
   <field name="title" type="text_transcript" indexed="true" stored="true" omitNorms="true"/>
   <field name="referenced_title" type="text_transcript" indexed="true" stored="true" omitNorms="true" multiValued="true"/>
   <field name="abstract" type="text_transcript" indexed="true" stored="true" omitNorms="true"/>
   <field name="text" type="text_transcript" indexed="true" stored="true" omitNorms="true" multiValued="true"/>
 </fields>
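
One way to check that correspondence before posting is to compare a document's keys with the field names declared in the schema. The following is only a sketch: check_doc is a hypothetical helper, SCHEMA_XML is a trimmed, well-formed copy of the excerpt above, and the check ignores dynamic fields, copy fields, and field types, all of which Solr itself takes into account.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed copy of the schema fields shown above
SCHEMA_XML = """
<schema name="documents" version="1.5">
  <fields>
    <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false"/>
    <field name="filename" type="text_transcript" indexed="true" stored="true"/>
    <field name="title" type="text_transcript" indexed="true" stored="true"/>
    <field name="referenced_title" type="text_transcript" indexed="true" stored="true" multiValued="true"/>
  </fields>
</schema>
"""

def check_doc(doc, schema_xml):
    """Return (unknown_fields, missing_required) for a document dict.

    check_doc is a hypothetical helper, not part of Solr or the original post.
    """
    fields = ET.fromstring(schema_xml).findall("./fields/field")
    declared = {f.get("name") for f in fields}
    required = {f.get("name") for f in fields if f.get("required") == "true"}
    unknown = sorted(set(doc) - declared)      # keys the schema does not declare
    missing = sorted(required - set(doc))      # required fields the doc lacks
    return unknown, missing

print(check_doc({"id": "1", "title": "x", "t_nameAdm1": ["y"]}, SCHEMA_XML))
print(check_doc({"title": "x"}, SCHEMA_XML))
```

The first call reports a field the schema does not declare; the second reports the missing required id.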


A sample payload looks like this:
payload = {
   "add" : {
      "doc" : {
         "id" : -3141779815403614,
         "filename" : "S007911130030X.xml",
         "title" : "Methane production induced by dimethylsulfide in surface water of an upwelling ecosystem",
         "abstract" : "Atmospheric oxidation of the surface of chalcopyrite has been investigated using electrochemical techniques.",
         "referenced_title" : [
            "The contribution of nano- and micro-planktonic assemblages in the surface layer (0\\u201330 m) under different hydrographic conditions in the upwelling area off Concepci\\u00f3n, central Chile",
            "Ocean-atmosphere interaction in the global biogeochemical sulfur cycle",
            "Atmospheric methane and global change"
         ]
      }
   }
}


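Note that the payload must be converted to a string before it is posted to Solr; if a dict is passed directly as data, Requests form-encodes it instead of sending JSON. The cleanse function above applies the Python str(...) function for this:
post(host, cleanse(payload))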
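
As a small illustration of what str(...) produces compared with json.dumps(...), which one of the comments below reports as working, consider the following sketch. The shortened payload and the helper is_strict_json are assumptions for illustration, and Python's JSON parser is used only as a stand-in: Solr's own parser may be more or less lenient.

```python
import json

payload = {"add": {"doc": {"id": "1", "title": "Atmospheric methane and global change"}}}

as_repr = str(payload)          # Python repr: single-quoted keys and strings
as_json = json.dumps(payload)   # strict JSON: double-quoted keys and strings

print(as_repr)
print(as_json)

def is_strict_json(text):
    """Return True if text parses with Python's strict JSON parser."""
    try:
        json.loads(text)
        return True
    except ValueError:          # json.JSONDecodeError is a ValueError
        return False

print(is_strict_json(as_repr), is_strict_json(as_json))   # False True
```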



References

  1. [Blogger] Python Snippets (includes file_utils.py referenced above)

13 comments:

  3. I tried this but it doesn't work. Do you see the error?


    def post(host, data):
        headers = { "content-type" : "application/json" }
        params = { "commit" : "false" }
        return requests.post(host, data=data, params=params, headers=headers)

    def commit(host):
        headers = { "content-type" : "application/json" }
        params = { "commit" : "true" }
        return requests.post(host, params=params, headers=headers)

    def cleanse(payload) :
        payload = str(payload)
        payload = payload.replace("'{", "{")
        payload = payload.replace("}'", "}")
        payload = payload.replace("'add'", "\"add\"")
        payload = payload.replace("'doc'", "\"doc\"")
        return payload

    payload = {
        "add" : {
            "doc" : {
                "id" : 3141779815403614,
                "filename" : "S007911130030X.xml",
                "t_nameAdm1": [
                    "cacca"
                ],
                "title" : "Methane production induced by dimethylsulfide in surface water of an upwelling ecosystem",
                "abstract" : "Atmospheric oxidation of the surface of chalcopyrite has been investigated using electrochemical techniques.",
                "referenced_title" : [
                    "The contribution of nano- and micro-planktonic assemblages in the surface layer (0\\u201330 m) under different hydrographic conditions in the upwelling area off Concepci\\u00f3n, central Chile",
                    "Ocean-atmosphere interaction in the global biogeochemical sulfur cycle",
                    "Atmospheric methane and global change"
                ]
            }
        }
    }

    response=post('http://localhost:8080/solr', cleanse(payload))
    commit('http://localhost:8080/solr')
    print response.status_code

  4. http://pastebin.com/34xYxKvi

  5. This looks like it is NOT using one of the Solr libraries. If you ARE using one of the python libraries at http://wiki.apache.org/solr/SolPython then what I am saying below may not apply.

    If this is plain HTTP, then sending requests to /solr will not work. This is the context path for the webapp, which typically sends browsers a redirect to the admin interface; it can't accept requests. Send to /solr/corename/update instead. The core name is required, and the /update handler is what can actually accept the request.

    1. Even if you are using a python library, the base URL should be /solr/corename where "corename" is the name of a core or collection in your Solr install.

    2. http://pastebin.com/34xYxKvi
      Thanks, but if I add the name of the core at line 39, like:
      response=post('http://localhost:8080/solr/interlinking', cleanse(payload))

      it returns a 404 error!

    3. I changed the URL to

      response=post('http://localhost:8080/solr/interlinking/update', cleanse(payload))


      and get:

      null:org.noggit.JSONParser$ParseException: JSON Parse Error: char=a,position=0 BEFORE='a' AFTER='dd=doc'
      at org.noggit.JSONParser.err(JSONParser.java:223)
      at org.noggit.JSONParser.next(JSONParser.java:622)
      at org.noggit.JSONParser.nextEvent(JSONParser.java:663)
      at org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.processUpdate(JsonLoader.java:112)
      at org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.load(JsonLoader.java:102)
      at org.apache.solr.handler.loader.JsonLoader.load(JsonLoader.java:66)
      at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
      at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
      at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
      at org.apache.solr.core.SolrCore.execute(SolrCore.java:1962)
      at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:777)
      at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:418)
      at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)
      at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
      at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
      at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:220)
      at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:122)
      at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:170)
      at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:103)
      at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:950)
      at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:116)
      at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:421)
      at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1074)
      at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:611)
      at org.apache.tomcat.util.net.AprEndpoint$SocketWithOptionsProcessor.run(AprEndpoint.java:2403)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
      at java.lang.Thread.run(Thread.java:745)


      So it seems the payload is not correctly formatted, but at least Solr finally gets it!

    4. Glad you had some success. Sometimes when you copy & paste code other characters get picked up that are not immediately obvious. You might want to try validating the JSON you're using at a site like this http://jsonformatter.curiousconcept.com/ and/or generating your own JSON structure that conforms to your Solr schema.

    5. I'm using your json, and it's valid:

      {
          "add" : {
              "doc" : {
                  "id" : -3141779815403614,
                  "filename" : "S007911130030X.xml",
                  "title" : "Methane production induced by dimethylsulfide in surface water of an upwelling ecosystem",
                  "abstract" : "Atmospheric oxidation of the surface of chalcopyrite has been investigated using electrochemical techniques.",
                  "referenced_title" : [
                      "The contribution of nano- and micro-planktonic assemblages in the surface layer (0\\u201330 m) under different hydrographic conditions in the upwelling area off Concepci\\u00f3n, central Chile",
                      "Ocean-atmosphere interaction in the global biogeochemical sulfur cycle",
                      "Atmospheric methane and global change"
                  ]
              }
          }
      }


      before sending it I pass it to

      def cleanse(payload) :
          payload = str(payload)
          payload = payload.replace("'{", "{")
          payload = payload.replace("}'", "}")
          payload = payload.replace("'add'", "\"add\"")
          payload = payload.replace("'doc'", "\"doc\"")
          payload = payload.replace('\n', '')
          return payload

      This is the actual code. It should be identical to the last one with only that line modified, but I'm reposting it anyway:
      new code: http://pastebin.com/p9G3GQqK

      (line changed: response=post('http://localhost:8080/solr/interlinking/update', cleanse(payload)))

  6. Found the error: you don't have to use cleanse; a plain json.dumps(...) is enough and works perfectly. Then add ?wt=json at the end of the URL, like:

    res=post('http://localhost:8983/solr/tableData/update?wt=json', document)





