
I want to download all the papers (paid or free) from Google Scholar that are cited in a particular paper. I will extract the reference that corresponds to each citation tag in that paper. What I can't figure out is how to forward the references one by one to Scholar and download them all. Any help would be appreciated. I only need the abstract of each paper, so please advise whether there is a way to access just the abstract, or whether I will have to download the whole papers.

1 Answer

  1. You should find the PDFs

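    You can use OkHttp to send a GET request to the search URL:

    import okhttp3.OkHttpClient; // assuming OkHttp 3.x (package okhttp3)
    import okhttp3.Request;
    import okhttp3.Response;

    OkHttpClient client = new OkHttpClient();

    Request request = new Request.Builder()
                     .url("https://scholar.google.com.br/scholar?q=the_paper_i_want")
                     .build();

    // Execute the call synchronously and read the response body as a string.
    Response response = client.newCall(request).execute();
    String html = response.body().string();

    This will give you an HTML page.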
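To send the references one by one, only the `q` parameter of that URL has to change. A minimal sketch, assuming each reference has already been reduced to a title string; `ScholarQuery`, `buildUrl` and `buildUrls` are hypothetical names used for illustration, and only the standard library is used:

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: builds one Scholar search URL per reference title.
class ScholarQuery {
    // Base URL taken from the request above; only the q parameter changes.
    static final String BASE = "https://scholar.google.com.br/scholar?q=";

    static String buildUrl(String title) {
        // Form-encode the title: spaces become '+', reserved characters become %XX.
        return BASE + URLEncoder.encode(title, StandardCharsets.UTF_8);
    }

    static List<String> buildUrls(List<String> titles) {
        List<String> urls = new ArrayList<>();
        for (String title : titles) {
            urls.add(buildUrl(title));
        }
        return urls;
    }
}
```

Each resulting URL can then be passed to `.url(...)` in the request above, ideally with a pause between requests.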

  2. Parse the HTML page using Jsoup (for example).

    // Requires org.jsoup.Jsoup, org.jsoup.nodes.Document and org.jsoup.select.Elements.
    Document doc = Jsoup.parse(html);
    Elements links = doc.select("a[href]");

    You might look for <a> tags whose href attribute ends with ".pdf".
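Some links carry a query string after the extension (for example ".pdf?sequence=1"), so a plain "ends with" check on the whole href can miss them. A small sketch of a check that looks only at the path part of the link; `PdfLinks` and `isPdfLink` are hypothetical names, and the code uses only the standard library:

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

// Hypothetical helper: decides whether an href points at a PDF file.
class PdfLinks {
    static boolean isPdfLink(String href) {
        if (href == null) {
            return false;
        }
        String path;
        try {
            // getPath() drops the query string and fragment, e.g. "?sequence=1".
            path = new URI(href.trim()).getPath();
        } catch (URISyntaxException e) {
            // Malformed hrefs are treated as non-PDF links.
            return false;
        }
        // Opaque URIs such as "mailto:..." have no path at all.
        return path != null && path.toLowerCase(Locale.ROOT).endsWith(".pdf");
    }
}
```

The check can be applied to the href value of each element in `links`. As an alternative within Jsoup itself, a "contains" selector such as `a[href*=.pdf]` should also match these links, at the cost of a few false positives.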

  3. Download the PDFs

    Now you can download the (free) PDFs by requesting each link you found and saving the response body to a file.
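A download step could look roughly like the sketch below. It is not the code originally linked from this answer; `PdfDownloader` and `download` are hypothetical names, it uses only the standard library rather than OkHttp, and it assumes the link is an absolute URL that serves the file directly without login or redirects to a landing page:

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper: streams the content behind a URL into a local file.
class PdfDownloader {
    // Returns the number of bytes written to target.
    static long download(String url, Path target) throws IOException {
        try (InputStream in = URI.create(url).toURL().openStream()) {
            // Overwrite the target if it already exists.
            return Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }
}
```

For example, `PdfDownloader.download(link, Paths.get("paper1.pdf"))` would save one paper; paid papers will usually return a login or landing page instead of the PDF.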

PS: Sorry for not linking to Jsoup; my reputation isn't high enough.

  • Well, I just need the abstracts of the papers. I will match each abstract with its corresponding reference, apply a cosine similarity algorithm for CBF, and finally rank the papers. As you instructed with .url("https://scholar.google.com.br/scholar?q=the_paper_i_want"), I will extract the references from the paper, and for each one I will have to update this line: the Scholar URL stays the same and only the_paper_i_want changes. Am I heading in the right direction? PS: Thanks – farooq ahmed May 09 '17 at 17:46
  • I couldn't find a way to list just the abstracts, and I couldn't read the citations either. The page calls JavaScript to show them in a pop-up, which is more complicated to automate. It is not an easy task; you may have to download the paper (PDF), check whether it fits your needs, and keep or delete it based on your analysis. I have never worked with PDFs, but I am sure there is a library to read their content. – Yuri Pourre May 09 '17 at 20:38
  • For that I am using the pdfx online tool: I send a request and it converts the PDF to XML, which makes it easier to extract only the citations and references. The problem is that it has a limit: it can't convert more than 10 PDFs in an hour. – farooq ahmed May 12 '17 at 08:29
  • I am unable to extract the <a> tags whose href ends with .pdf. Any help? PS: It doesn't end with .pdf; it ends with .pdf?sequence=1. – farooq ahmed May 14 '17 at 17:38
  • Can you please elaborate more? I couldn't understand what you are trying to do now. – Yuri Pourre May 17 '17 at 18:04
  • Yeah, don't worry, I have done what I had to do. Thank you for your help, I appreciate it! (y) – farooq ahmed May 17 '17 at 19:30