1

This is a fairly specific question, but somebody must have done this before. I would like to get the latest papers from PubMed: not papers about a certain subject, but all of them. My idea was to query by modification date (mdat). I use Biopython, and my code looks like this:

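from Bio import Entrez

Entrez.email = 'you@example.com'  # placeholder; NCBI asks for a contact address

handle = Entrez.egquery(mindate='2015/01/10', maxdate='2017/02/19', datetype='mdat')
results = Entrez.read(handle)
for row in results["eGQueryResult"]:
    # Report the record count for the PubMed database
    if row["DbName"] == "pubmed":
        print(row["Count"])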

However, this returns zero papers. If I add term='cancer', I get heaps of papers, so the query seems to need the term keyword. But I want all papers, not just papers on a certain subject. Any ideas how to do this? Thanks, carl

BioGeek
carl

2 Answers

3

term is a required parameter, so you can't omit it in your call to Entrez.egquery.
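One way to satisfy the required term while still matching everything might be PubMed's all[sb] subset filter, combined with the date parameters of the E-utilities esearch endpoint. The sketch below only builds the request URL, so it runs without network access. The helper name build_esearch_url is hypothetical, and whether all[sb] returns what you need for a given date range is an assumption to check against the E-utilities documentation.

```python
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(mindate, maxdate, term="all[sb]", datetype="mdat", retmax=100):
    """Build an esearch URL for PubMed records in a date window.

    Hypothetical helper for illustration. Dates use the YYYY/MM/DD form.
    term="all[sb]" is assumed to match every PubMed record.
    """
    params = {
        "db": "pubmed",
        "term": term,
        "datetype": datetype,   # mdat = modification date
        "mindate": mindate,
        "maxdate": maxdate,
        "retmax": retmax,       # number of PMIDs returned per request
    }
    return ESEARCH_URL + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_esearch_url("2017/02/01", "2017/02/19"))
```

The same keyword arguments could presumably be passed to Entrez.esearch in Biopython (db, term, mindate, maxdate, datetype). For very large date windows the result set may be too big to page through, in which case a local copy is likely the better route.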

If you need all the papers within a specified timeframe, you will probably need a local copy of MEDLINE and PubMed Central.

For MEDLINE, this involves getting a license. For PubMed Central, you can download the Open Access subset via FTP without a license.

BioGeek
3

This is sloppy, and I'd like to hear feedback, but here is code built on the idea that the latest PubMed ID (PMID) corresponds to the latest paper (which I'm not sure is true). It does a binary search for the latest PMID, then returns a list of the n most recent PMIDs. It does not look at dates and only returns PMIDs, so I'm not sure it's a suitable answer, but maybe the idea can be adapted.

CODE:

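import urllib.error
import urllib.request


def pmid_exists(pmid):
    """Return True if the PubMed page for this PMID loads without an HTTP error."""
    url_stem = 'https://www.ncbi.nlm.nih.gov/pubmed/'
    query = url_stem + str(pmid)
    try:
        with urllib.request.urlopen(query):
            return True
    except urllib.error.HTTPError:
        return False


def get_latest_pmid(max_exists=27239557, min_missing=-1):
    """Search for the largest PMID that exists.

    max_exists is a PMID known to exist; min_missing is a PMID known to be
    missing, or -1 while no missing PMID has been found yet.
    """
    if abs(min_missing - max_exists) <= 1:
        return max_exists

    if min_missing == -1:
        # No upper bound yet: double the guess until a missing PMID turns up.
        guess = 2 * max_exists
    else:
        # Both bounds known: bisect (integer division keeps the PMID an int).
        guess = (max_exists + min_missing) // 2

    if pmid_exists(guess):
        return get_latest_pmid(guess, min_missing)
    else:
        return get_latest_pmid(max_exists, guess)


# Start of program
if __name__ == '__main__':
    n = 5
    latest_pmid = get_latest_pmid()
    # range() excludes its end point, so add 1 to include latest_pmid itself.
    most_recent_n_pmids = list(range(latest_pmid - n + 1, latest_pmid + 1))
    print(most_recent_n_pmids)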

OUTPUT:

[28245638, 28245639, 28245640, 28245641, 28245642]
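
The search strategy can also be tried offline by swapping the HTTP check for any predicate. The sketch below is an iterative variant under the same assumption as the code above, namely that every PMID up to the latest one exists and none beyond it does (which may not hold in practice). The name find_last_true and the fake predicate are hypothetical and only serve the illustration.

```python
def find_last_true(exists, known_good):
    """Return the largest n >= known_good for which exists(n) is True.

    Assumes exists(n) is True for every n up to some cutoff and False after it.
    """
    if known_good < 1 or not exists(known_good):
        raise ValueError("known_good must be a positive value that exists")

    # Phase 1: double the upper bound until it overshoots the cutoff.
    low, high = known_good, 2 * known_good
    while exists(high):
        low, high = high, 2 * high

    # Phase 2: bisect. Invariant: exists(low) is True, exists(high) is False.
    while high - low > 1:
        mid = (low + high) // 2
        if exists(mid):
            low = mid
        else:
            high = mid
    return low


if __name__ == "__main__":
    # Stand-in for pmid_exists: pretend the latest PMID is fixed and known.
    pretend_latest = 28245642
    print(find_last_true(lambda pmid: pmid <= pretend_latest, 27239557))
```

Passing pmid_exists as the predicate would give the networked behaviour, at the cost of one HTTP request per probe.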
mitoRibo
  • Thank you so much. I really would have thought that PubMed offers some kind of bulk download. Surely there are people who are interested in the newest articles? I can't believe there is no standard way to do this. – carl Mar 03 '17 at 23:25