
Scientific Literature Searching is a Disaster Today

Tomorrow's Research

Message Number: 1502

Younger colleagues I have shared drafts of this paper with suggest that this issue is part of a larger sea change in the way research is performed and credited, with an increased emphasis on “rating” of both the journal and the work.

Folks:

The posting below looks at some critical issues related to accessing scientific articles, particularly older ones that can provide important reference points for further research.  It is by biologist Frank Heppner (birdman@uri.edu). ©2016 Frank Heppner. All rights reserved. Reprinted with permission.
 
Regards,
 
Rick Reis
UP NEXT: Tame Your Inner Critic
 
Tomorrow’s Research
 
---------- 1,960 words ----------
 
Scientific Literature Searching is a Disaster Today
 
My first scientific publication came out in 1964 when I was a Master’s candidate. I still remember the pleasure when my package of 300 reprints arrived, at a cost that was well within the budget of my $2,400 a year salary as a teaching assistant. When the reprint requests started to come in from some of the Big Names in the field I tried not to let my head swell as I signed each, “Cordially yours, Frank Heppner.” 
 
Over the ensuing 51 years, I published regularly in the whole spectrum of journals, from Science, Nature, and PNAS, to the Mindanao Journal of Science Teaching. My last paper before my retirement in 2010 was a 2009 review article about organized flight in birds in Animal Behaviour that I coauthored with a young friend from Slovenia named Iztok Lebar Bajec. I did the literature search the old-fashioned way: lots of time in the library, and boxes of copies and reprints. After that, sayonara scientific publishing.
 
Late last year however, I had reason to do a scientific literature search again, and my, my, how things had changed. My university office was long closed, so I started searching online from home. I had always “sort of” liked Google Scholar as a first-order search engine, because it was multidisciplinary, and very easy to use. Find an article, find the list of citations, call up the citations, and repeat the process. Not perfect, not complete, but very fast. 
 
So, coffee cup in hand, I settled down, called up Google Scholar and began. Enter my search terms, and here’s a nice list of references. Click on one, and presto! For $35 Elsevier will let me look at the article. Forget that. Tried a couple of other articles and on-line search engines. Same story. No payee, no lookie. So then I figured (although the System didn’t tell me so) that if I accessed the net using my university library password (which I had providentially kept active), it might be different. 
 
I logged on to the library website, and lo and behold, I still couldn’t get access to articles in most contemporary journals. Time to call the reference librarian. I found out that if I wanted to look at articles using one of the university’s on-campus computers, I’d have “access rights” to these journals, but if I wanted to use a remote (off-campus) computer, I’d have to get an additional password. Or fork over big wads of cash. Grumble. Grumble. But, okay, I signed up. What on earth do people without an institutional affiliation do? Or what if your institution can’t afford subscriptions to the big publications?
 
I got back on line at home and discovered new wonders like Scopus, ResearchGate, and Web of Science. I tried out Scopus and, just out of curiosity, found Iztok’s and my 2009 paper without much difficulty. It had about 125 references. I decided to see how easy it was to reach the papers we had cited from within the Scopus version of the paper. Cited papers written within the previous 10 years did not present many problems. They had a DOI (digital object identifier), which is, in theory, a direct link to the cited paper. It worked about 90% of the time, though at times it required three or four links to arrive at the paper. The balance were dead links, or did not actually lead to the article for some reason. A large number of the remaining cited articles, which didn’t have DOIs but did have links, could not be reached by clicking on the links in the citation; I have no idea why. What about papers that didn’t have links at all, usually older ones? You had to copy the citation to a notepad, then go back to the Scopus homepage and do a search.
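(For readers who want to test link rot for themselves, here is a minimal sketch of a DOI checker. It is my illustration, not any search engine’s toolkit; it assumes the public doi.org resolver and Python’s third-party requests package. The example DOI, 10.1000/182, is the DOI of the DOI Handbook itself.)

    # Minimal sketch: test whether a DOI resolves via the public
    # doi.org resolver. Requires the third-party "requests" package.
    import requests

    def doi_resolves(doi: str, timeout: float = 10.0) -> bool:
        """Return True if doi.org redirects the DOI to a live page."""
        try:
            # stream=True follows the redirect chain without
            # downloading the whole landing-page body.
            resp = requests.get(
                f"https://doi.org/{doi}",
                allow_redirects=True,
                timeout=timeout,
                stream=True,
            )
            ok = resp.status_code < 400
            resp.close()
            return ok
        except requests.RequestException:
            return False  # a network error counts as a dead link here

    # 10.1000/182 is the DOI of the DOI Handbook itself.
    print(doi_resolves("10.1000/182"))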
 
Just to try a little experiment, I entered my name in various formats in the “Author” search space in Scopus. “Heppner F H” had 297 listed articles, most not mine, but by a neurobiologist in Switzerland who is actually F L Heppner. “Heppner FH” had none. What a difference a space makes. “Heppner F.” had 285. “Heppner F.H.” had 6 (all mine), “Heppner Frank” had 13 (only one mine), and “Heppner Frank H.” had one (that was also mine). My, what a fussy little thing this Scopus is. I actually had about 40 papers in science that should have been picked up. I tried Google Scholar to see if it gave variable results too. “Heppner FH” yielded 17 hits, but “Heppner Frank” produced 34, more or less the reverse of Scopus. So it is not just what you enter in a search box, but how you say it, that determines how complete your search is. Google Scholar seemed to be a bit less fussy.
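(Anyone can reproduce a version of this name-format experiment against an open index. The sketch below is my illustration: it queries the free Crossref REST API rather than Scopus, which requires a subscription key, so the counts will not match mine; the point is only how much they vary with formatting.)

    # Minimal sketch: compare hit counts for different spellings of
    # one author's name, using the free Crossref REST API.
    # Requires the third-party "requests" package.
    import requests

    NAME_VARIANTS = [
        "Heppner F H",
        "Heppner FH",
        "Heppner F.",
        "Heppner F.H.",
        "Heppner Frank",
        "Heppner Frank H.",
    ]

    for name in NAME_VARIANTS:
        resp = requests.get(
            "https://api.crossref.org/works",
            params={"query.author": name, "rows": 0},  # rows=0: count only
            timeout=10,
        )
        total = resp.json()["message"]["total-results"]
        print(f"{name!r}: {total} matching works")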
What would happen if you entered a general topic, rather than an author name? I had worked for many years on the questions of how starling flocks manage to turn and wheel together, and why geese (and similar birds) fly in V formations. The names by which these phenomena were usually known were either “Organized Flight in Birds” or “Avian Flight Formations.” Our 2009 paper was a review article for Animal Behaviour called “Organized Flight in Birds.”
 
Back to Scopus. Entering “Organized Flight in Birds” produced 25 hits, none earlier than 1981. “Birds in Organized Flight” produced the same result. “Bird Flight Formations” yielded 188 hits, but none going back before 1950. Some of the hits made you scratch your head about why they were on the list, like “Reconstruction and in vivo analysis of the extinct tbx5 gene from ancient wingless moa (Aves: Dinornithiformes).” Moas can’t even fly, for goodness’ sake. Maybe they marched in an organized way.
 
However, “Birds Flying in Organized Groups” produced 5 hits, only three of which were remotely relevant. Changing it to “Birds Flying in Formation” kicked out 81 references, including “Nepotistic Alarm Calling in Siberian Jays.” I looked this paper up, and nothing in it suggested anything having to do with organized flight. I wondered what the tiny-brain algorithm of the search engine was thinking of.
 
Google Scholar produced similar results. “Organized Flight in Birds” produced 10 hits on the first page. “Avian Flight Formations” yielded the same number. However, none of the references were the same. Clearly Google Scholar’s deeply thinking algorithm didn’t work the same way as my aging brain.
Returning to this activity of my scientific youth, looking up references, was a sobering experience. After a visit to my old university library, I quickly realized that the old way, browsing through journals in the library, was dead. Most libraries, especially small ones, don’t subscribe to print journals anymore. In many cases, the old stacks of bound journals have been chucked or put into storage. The young-kid investigators (under 35) I later talked to about this had never known a system other than computerized search and, by and large, seemed to be unaware of any fundamental deficiencies of the system; it was what they’d grown up with.
 
However, from the perspective of an old fossil, I saw a lot of problems, and the fact that the youngsters may be unaware of them makes it scarier. I’m sure that if you grew up with the modern system, you would be able to do searches faster, and more easily than I could, but that is not the issue. 
 
The first surprise was that the search functions were amazingly sensitive to context and vocabulary. Enter the “wrong” choice of search terms, and you could either miss a critical reference or be overloaded with a deluge of irrelevant material. For example, if you were new or peripheral to the field and not familiar with the conventional terminology, you might search for “Structured Flocks of Birds,” which is for all practical purposes synonymous with “Organized Flocks of Birds,” but this search would yield 42 hits, only three of which were actually relevant. One partial defense, sketched below, is to expand a query with a field’s known synonyms before searching.
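(The sketch is a generic illustration of the idea; the synonym table and the OR syntax are mine, and real search engines each have their own boolean grammar, so check an engine’s documentation before relying on it.)

    # Minimal sketch: expand a topic query with synonymous phrasings
    # so one "wrong" choice of terminology is less likely to hide a
    # whole literature. The synonym table is illustrative, not complete.
    SYNONYMS = {
        "organized flight in birds": [
            "avian flight formations",
            "bird flight formations",
            "coordinated bird flocks",
            "birds flying in formation",
        ],
    }

    def expand_query(topic: str) -> str:
        """Join a topic with its synonyms into one OR-ed phrase query."""
        phrases = [topic] + SYNONYMS.get(topic.lower(), [])
        return " OR ".join(f'"{p}"' for p in phrases)

    print(expand_query("Organized Flight in Birds"))
    # "Organized Flight in Birds" OR "avian flight formations" OR ...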
 
The second startling discovery I made about these database searches is how deficient they are in older (say, more than 25 years old) references. Of the 125 references cited in our “Organized Flight in Birds” paper, 15 were published between 1953 and 1964 (including some of the landmark papers in the field), and 33 were published between 1965 and 1990. A search for “coordinated bird flocks” in Scopus failed to find the most cited paper in the field, Craig Reynolds’ 1987 paper (7,248 citations), nor did it find Ulf Grenander’s and my 1990 symposium-volume paper on bird flock simulation, which had 326 citations. At the time, I could not find an on-line version of this paper anywhere, which makes me wonder how many of the 326 people who had cited it in, say, the last ten years had actually read it. (Subsequent to writing the above, I uploaded a scanned version of this symposium chapter to ResearchGate, where it has had almost 400 reads in a few months. I wonder how many other older, significant papers are in limbo because the authors were unaware, as I was, that they could be made available by scanning and uploading.)
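(Anyone curating an older reference list can at least check which items were ever assigned a DOI. Here is a minimal sketch, again against the free Crossref API rather than Scopus; the matching is fuzzy, so the top hit must be verified by eye before trusting it.)

    # Minimal sketch: look up a free-text citation in Crossref and
    # report the best-matching DOI, if any. Requires "requests".
    from typing import Optional

    import requests

    def find_doi(citation: str) -> Optional[str]:
        """Return the DOI of the closest Crossref match, or None."""
        resp = requests.get(
            "https://api.crossref.org/works",
            params={"query.bibliographic": citation, "rows": 1},
            timeout=10,
        )
        items = resp.json()["message"]["items"]
        return items[0]["DOI"] if items else None

    # Reynolds' 1987 flocking paper, cited from memory as free text.
    print(find_doi("Reynolds 1987 Flocks herds and schools "
                   "a distributed behavioral model"))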
The reason this is important is that science as an intellectual activity is self-policing: conclusions are stated as the results of experiments, and if new data later suggest an alternate conclusion, you can always go back to the original paper and look for the cause of the discrepancy. The most famous example of this in my field is that in 1947, a man named Yeagley published a paper concluding that magnetic fields influenced pigeon homing. This was highly controversial, his experiments couldn’t be replicated, and Yeagley was consigned to the avian scrap bin. Then in 1970, an ornithologist named Keeton, discouraged by the failure to find a reasonable alternate hypothesis for pigeon homing, repeated Yeagley’s experiments, but this time he kept track of the weather conditions when the birds with magnets attached to their heads were released. Sure enough, if they were released on sunny days, the magnets made no difference to their orientation, but on cloudy days, the magnets screwed up their flight direction. That is probably why Yeagley’s experiments couldn’t be replicated: subsequent efforts did not consider sky cover either.
 
What this means is that as the old references disappear, or become difficult to find, we will lose the foundations on which many lines of investigation rest, and the opportunity to make spectacular errors will increase.
 
The difficulty of finding whole-article copies (versus just an abstract) of older articles also suggests to me that, with increased time pressure on investigators, there may be a temptation to cite an older paper (for completeness) without actually reading it.
 
To be fair, I get the impression that if you are working in a very, very narrow field of research, and there is a common consensus about vocabulary, the database literature-search system is wonderful, and much faster than wandering through the library stacks. But if you’re just following a hunch, or think that an area that seems peripheral to yours might really be useful, the probability of missing a serendipitous hit, or of being drowned in irrelevant specialty papers, seems high.
 
Younger colleagues I have shared drafts of this paper with suggest that this issue is part of a larger sea change in the way research is performed, documented, and evaluated, with selection of sites for publication decided on the basis of increasingly sophisticated metrics, many of which seem to be manipulable by “gaming the system” one way or another (or being gamed by it, depending on circumstances).
I think what I would like to happen, before it becomes too late, is for someone to do an extensive scientific examination of the questions I’ve asked here. How much can we trust our search engines? How can we inadvertently screw up a literature search so we miss something critical, just by the wrong choice of search term? Can inclusion of older print references be made a higher priority? At some point, there will be nobody left who remembers how it used to be done, and we will be condemned to the future, like it or not.