Tag Archives: quality of information

Another example of Google’s Knowledge Graph getting it wrong

Voting in the UK election has finished and the results are in, but the dust has most definitely not settled. It looks as if we in the UK are in for interesting times ahead. It would help those of us researching the various political parties and policies if Google could at least get the basics right, such as who is now the Member of Parliament for a particular constituency. I am in Reading East and we have switched from a Conservative MP to a Labour one (Matt Rodda). Out of curiosity, I tried a search in Google on Reading East constituency. This is what Google’s Knowledge Graph came up with:

Reading East Google Knowledge Graph

I took this screenshot yesterday (Friday, 9th June) at around 8 a.m. and expected to see Rob Wilson given as the MP throughout. I was impressed, though, to see that the snippet from Wikipedia correctly gives Matt Rodda as our MP. Whoever had updated that entry was pretty quick off the mark. Possibly a Labour Party worker? The rest of the information, which is taken from Google’s database of “facts”, is either wrong, confusing or nonsensical.

“Member of Parliament: Rob Wilson” – wrong. But he was MP until around 4 a.m. on 9th June, when the result of the election in Reading East was announced, so perhaps I am expecting a little too much of Google to be that quick about updating its facts.

“Major settlement: Reading” – yes, we are part of Reading, but I find it strange that it is referred to as a major settlement rather than a town.

“Number of members: 1” – not sure why that is there, as each constituency can have only one MP.

“Party: Conservative” – correct for Rob Wilson but the new MP is Labour. 

“European Parliament constituency: South East England” – correct!

The final two lines, “Replaced by:” and “Created from:”, had me totally flummoxed. The entries are the same – Reading North, Reading South, Henley. Reading North and Reading South were constituencies formed by splitting the Reading constituency in 1950. They were then merged back into Reading in 1955, re-created in 1974, and in 1983 Reading East and Reading West were formed (yes, it’s complicated!). As for Henley, it is not even in the same county. I can only think that this comes from Caversham (now part of Reading East) being part of Oxfordshire until 1911, when it probably did fall within the Henley constituency. The “Replaced by” is wrong because Reading East has not been replaced by anything. Google can’t even blame a template that has to be filled in with information at all costs, because different information appears in the Knowledge Graph depending on the constituency.

Here is the information for Aylesbury:

And the one for Guildford:

Going back to how up to date the information is, how quickly does Google update its “facts”? According to Google, Rob Wilson was still our MP in the middle of Friday afternoon. I submitted feedback using the link that Google provides at the bottom of each Knowledge Graph, but this morning (10th June) nothing had changed. I’ll update this posting when it does change.

I would hope that most people would look at the other links in the search results, in this case the latest news, but preferably a reliable, authoritative source. The list of MPs on the UK Parliament website would be an obvious choice but might take a day to be updated after an election. Just don’t rely on Google to get it right.

Seasonal opening times – never trust Google’s answers (or Bing’s)

This is my usual Christmas/New Year reminder to never trust Google’s answers (or Bing’s) on opening times of shops over the holiday season, especially if you are thinking of visiting small, local, independent shops.

I was contemplating going to our True Food Co-operative but suspected that it might still be shut. A search on my laptop for True Food Emmer Green opening times gave me a link to their website at the top of the results list. On the right-hand side was a knowledge graph with information on the shop, its opening times and reviews that had been compiled from a variety of sources. For most of it the source of the information is not given. On my mobile and tablet it is the knowledge graph that appears at the top of the results list and takes up the first couple of screens.

It claims that the shop is “Open today 10am-6pm” [today is Thursday, 29th December].

When I go to True Food’s website it clearly states near the top of the home page that they are currently closed and re-opening on 4th January 2017.

Google gets it wrong again in the knowledge graph, but so does Bing. So always check the shop’s own website, and if you are searching on your mobile or tablet please make the effort to scroll down a couple of screens to get to links to more reliable information.

How to write totally misleading headlines for social media

Or how to seriously annoy intelligent people by telling deliberate lies.

A story about renewable energy has been doing the rounds within my social media circles, especially on Facebook. It is an article from The Independent newspaper that has been eagerly shared by those with an interest in the subject. The headline reads “Britain just managed to run entirely on renewable energy for six days”.

This is what it looks like on Facebook:

The Independent article as it appears when shared on Facebook

My first thought was that, obviously, this was complete nonsense. Had all of the petrol- and diesel-powered cars in Britain been miraculously converted to electric, and hundreds of charging points installed overnight? I think we would have noticed, or perhaps I am living in a parallel universe where such things have not yet happened. So I assumed that the writer of the article, or the sub-editor, had done what some journalists are prone to do, which is to use the terms energy and electricity interchangeably. Even if they meant “electricity”, I still found the claim that all of our electricity had been generated from renewable sources for six days difficult to believe.

Look below the headline and you will see that the first sentence says “More than half of the UK’s electricity has come from low-carbon sources for the first time, a new study has found.” That is more like it. Rather than “run entirely on renewable energy” we now have “half of the UK’s electricity has come from low-carbon sources” [my emphasis in both quotes]. But why does the title make the claim when straightaway the text tells a different story? And low-carbon sources are not necessarily renewable, for example nuclear. As I keep telling people on my workshops, always click through to the original article and read it before you start sharing with your friends.

The title on the source article is very different from the Facebook version, as is the subtitle.

The headline and subtitle on the Independent article itself
We now have the title “Half of UK electricity comes from low-carbon sources for first time ever, claims new report”, which is possibly more accurate. Note that “renewable” has gone and we have “low-carbon sources” instead. Also, the subtitle muddies the waters further by referring to “coal-free”.

If you read the article in full it tells you that “electricity from low-emission sources had peaked at 50.2 per cent between July and September” and that this happened for nearly six days during the quarter. So we have half of electricity being generated by “low emission sources” but, again, that does not necessarily equate to renewables. The article does go on to say that the low-emission sources included UK nuclear (26 per cent), imported French nuclear, biomass, hydro, wind and solar. Nuclear may be low emission or low carbon but it is not renewable.

Many of the other newspapers are regurgitating almost identical content that has all the hallmarks of a press release. As usual, hardly any of them give a link to the original report but most do say it is a collaboration between Drax and Imperial College London. If you want to see more details or the full report then you have to head off to your favourite search engine to hunt it down.  It can be found on the Drax Electric Insights webpage. Chunks of the report can be read online (click on Read Reports near the bottom of the homepage) or you can download the whole thing as a PDF. There is also an option on the Electric Insights homepage that enables you to explore the data in more detail.

This just leaves the question of where the Facebook version of the headline came from. I suspected that a separate and very different headline had been specifically written for social media. I tested this by copying the URL and headline of the original article using a Chrome extension and pasting them into Facebook. Sure enough, the headline automatically changed to the misleading title.

To see exactly what is going on, and how, you need to look at the source code of the original article:

Source code of the Independent article showing the og:title meta tag

Buried in the metadata of the page and tagged “og:title” is the headline that is displayed on Facebook. This is the only place where it appears in the code. The “og:title” tag is one of the Open Graph meta tags that tell Facebook and other social media platforms what to display when someone shares the content. Thus you can have a “headline” for the web and another for Facebook that say completely different things.
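If you want to check an article yourself without digging through the raw source, a short script along these lines will pull out both the page title and the og:title so you can compare them. This is just a rough sketch using the Python standard library; the URL is a placeholder and would need to be replaced with the article you want to test.

```python
# Rough sketch: compare a page's <title> with the og:title that Facebook displays.
# The URL below is a placeholder - substitute the article you want to check.
from html.parser import HTMLParser
from urllib.request import urlopen

class TitleTags(HTMLParser):
    """Collects the <title> text and any og:title meta tag from a page."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.page_title = ""
        self.og_title = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self.in_title = True
        elif tag == "meta" and attrs.get("property") == "og:title":
            self.og_title = attrs.get("content")

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.page_title += data

url = "https://www.example.com/some-article"  # placeholder URL
html = urlopen(url).read().decode("utf-8", errors="replace")
parser = TitleTags()
parser.feed(html)
print("Page <title>:", parser.page_title.strip())
print("og:title    :", parser.og_title)
```

If the two lines of output differ markedly, you are looking at exactly the kind of social-media-only headline described above.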

Compare “Britain just managed to run entirely on renewable energy for six days” with “Half of UK electricity comes from low-carbon sources for first time ever, claims new report” and you have to admit that the former is more likely to get shared. That is how misinformation spreads. Always, always read articles in full before sharing and, if possible, try and find the original data or report. It is not always easy but we should all have learnt by now that we cannot trust politicians, corporates or the media to give us the facts and tell the full story.

Update: The original press release from Drax is “More than 50% of Britain’s electricity now low carbon according to ground-breaking new report”.

And you thought Google couldn’t get any worse

We’ve all come across examples of how Google can get things wrong: incorrect supermarket opening hours (http://www.rba.co.uk/wordpress/2015/01/02/google-gets-it-wrong-again/), false information and dubious sources used in Quick Answers (http://www.rba.co.uk/wordpress/2014/12/08/the-quality-of-googles-results-is-becoming-more-strained/), authors who die 400 years before they are born (http://googlesystem.blogspot.co.uk/2013/11/google-knowledge-graph-gets-confused.html), a photo of the actress Jane Seymour ending up in a carousel of Henry VIII’s wives (http://www.slate.com/blogs/future_tense/2013/09/23/google_henry_viii_wives_jane_seymour_reveals_search_engine_s_blind_spots.html) and many more. What is concerning is that in many cases no source is given. According to Search Engine Land (http://searchengineland.com/google-shows-source-credit-quick-answers-knowledge-graph-203293) Google doesn’t provide a source link when the information is basic factual data and can be found in many places. But what if the basic factual data is wrong? It is worrying enough that incorrect or poor quality information is being presented in the Quick Answers at the top of our results and in the Knowledge Graph to the right, but the rot could spread to the main results.

An article in New Scientist (http://www.newscientist.com/article/mg22530102.600-google-wants-to-rank-websites-based-on-facts-not-links.html) suggests that Google may be looking at significantly changing the way in which it ranks websites by counting the number of false facts in a source and ranking by “truthfulness”. The article cites a paper by Google employees that has appeared in arXiv (http://arxiv.org/abs/1502.03519), “Knowledge-Based Trust: Estimating the Trustworthiness of Web Sources”. It is heavy going, so you may prefer to stick with just the abstract:

“The quality of web sources has been traditionally evaluated using exogenous signals such as the hyperlink structure of the graph. We propose a new approach that relies on endogenous signals, namely, the correctness of factual information provided by the source. A source that has few false facts is considered to be trustworthy. The facts are automatically extracted from each source by information extraction methods commonly used to construct knowledge bases. We propose a way to distinguish errors made in the extraction process from factual errors in the web source per se, by using joint inference in a novel multi-layer probabilistic model. We call the trustworthiness score we computed Knowledge-Based Trust (KBT). On synthetic data, we show that our method can reliably compute the true trustworthiness levels of the sources. We then apply it to a database of 2.8B facts extracted from the web, and thereby estimate the trustworthiness of 119M webpages. Manual evaluation of a subset of the results confirms the effectiveness of the method.”
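Stripped of the probabilistic machinery, the underlying idea is that a source is scored by how many of the facts extracted from it check out against what is already believed to be true. The following is purely my own toy illustration of that idea, not the model described in the paper, and the “known facts” are invented for the example:

```python
# Toy illustration only: the paper uses joint probabilistic inference to separate
# extraction errors from genuine errors in the source. This naive sketch simply
# scores a source by the fraction of its checkable (subject, predicate, value)
# statements that agree with a reference set of "known" facts (invented here).
KNOWN_FACTS = {
    ("Reading East", "MP", "Matt Rodda"),
    ("United Kingdom", "capital", "London"),
}

def naive_trust(extracted):
    """Return the fraction of a source's checkable statements that are correct."""
    checkable = [s for s in extracted
                 if any(s[:2] == k[:2] for k in KNOWN_FACTS)]
    if not checkable:
        return None  # nothing we can judge this source on
    correct = sum(1 for s in checkable if s in KNOWN_FACTS)
    return correct / len(checkable)

source_a = [("Reading East", "MP", "Matt Rodda"),
            ("United Kingdom", "capital", "London")]
source_b = [("Reading East", "MP", "Rob Wilson")]

print(naive_trust(source_a))  # 1.0 - every checkable statement matches
print(naive_trust(source_b))  # 0.0 - the one checkable statement is wrong
```

Even this crude version shows where the difficulty lies: the score is only as good as the reference set of “known” facts, which is precisely what is in dispute.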

If this is implemented in some way, and based on Google’s track record so far, I dread to think how much more time we shall have to spend on assessing each and every source that appears in our results. It implies that if enough people repeat something on the web it will be deemed to be true and trustworthy, and that pages containing contradictory information may fall down in the rankings. The former is of concern because it is so easy to spread and duplicate misinformation throughout the web and social media. The latter is of concern because a good scientific review on a topic will present all points of view and inevitably contain multiple examples of contradictory information. How will Google allow for that?

It will all end in tears – ours, not Google’s.

The quality of Google’s results is very strained

I recently received an email from a friend asking whether it was acceptable for a student to cite Google as a source in their work. My friend’s instinct was to say no, but there was a problem getting beyond Google to the original source of the answer. The student had used the Google define search option to find a definition of the term “leadership”, which Google duly did, but it failed to provide the source of the definition. My response to citing Google as a source is always “No”, unless it is an example of how Google presents results or a comment on the quality (or lack of it) of the information that has been found. The answers that appear at the top of the results page, such as the definitions or the new quick answers, have been created and compiled by someone else, so Google should not get the credit for them. In addition, what is displayed by Google in response to the search will vary from day to day, and in creating these quick answers Google sometimes introduces errors or gets it completely wrong.

There have been several well documented instances of Google providing incorrect information in the knowledge graph to the right of search results and in the carousel that sometimes appears at the top of the page (see http://googlesystem.blogspot.co.uk/2013/11/google-knowledge-graph-gets-confused.html and http://www.slate.com/blogs/future_tense/2013/09/23/google_henry_viii_wives_jane_seymour_reveals_search_engine_s_blind_spots.html). The same problems beset the quick answers. For a short time, a Google search on David Schaal came up with a quick answer saying that he had died on April 11th, 2003! (As far as I am aware, he is still very much alive).

No source was given, nor was there any indication of where this information had come from. Many have questioned Google on how it selects information for quick answers and why it does not always give the source. Google’s response is that it doesn’t provide a link when the information is basic factual data (http://searchengineland.com/google-shows-source-credit-quick-answers-knowledge-graph-203293), but as we have seen the “basic factual data” is sometimes wrong.

Quick answers above the Google results have been around for a while. Type in the name of a Premier League football club and Google will give you the results for the most recent match as well as the scores and schedule for the current season. Not being a fan myself, I would have to spend some time checking the accuracy of that data, or I could, like most people, accept what Google has given me as true. Looking for flights between two destinations? Google will come up with suggestions from its Google Flights service, and this is where it starts to get really messy. I’ve played around with the flights option for several destinations. Although Google gives you an idea of which airlines fly between those two airports and possible costs, the specialist travel sites and airline websites give you a far wider range of options and cheaper deals. It is when we come to health-related queries, though, that I have major concerns over what Google is doing.

Try typing in a search along the lines of symptoms of [insert medical condition of your choice] and see what comes up. When I searched for symptoms of diabetes the quick answer that Google gave me was from Diabetes UK.

Google Quick Answer - symptoms of diabetes

At least Google gives the source for this type of query so that I can click through to the site for further information and assess the quality. In this case I am happy with the information and the website. Having worked in the past for an insulin manufacturer, I am familiar with the organisation and the work it does. It was a very different story for some of the other medical conditions I searched for.

A search for symptoms of wheat intolerance gave me a quick answer from an Australian site whose main purpose seemed to be the sale of books on food allergies and intolerances, and very expensive self-diagnosis food diaries. The quality of information and advice on the topic was contradictory and sometimes wrong. The source for the quick answer for this query varied from day to day, and the quality ranged from appalling to downright dangerous. A few days ago it was the Daily Mail that supplied the quick answer, which actually turned out to be the best of the bunch, probably because the information had been copied from an authoritative site on the topic.

Today, Google unilaterally decided that I was actually interested in gluten sensitivity and gave me information from Natural News.

Google quick answer for wheat intolerance

I shall leave you to assess whether or not this page merits being a reliable, quick answer (the link to the page is http://www.naturalnews.com/038170_gluten_sensitivity_symptoms_intolerance.html).

Many of the sources that are used for a Google quick answer appear within the first three results for my searches, and a few are listed at number four or five. This one, however, came in at number seven. Given that Google customises results, one cannot really say whether or not the page’s position in the results is relevant, or whether Google uses some other way of determining what is used. Google does not say. In all of the medical queries I tested, relevant pages from the NHS Choices website, which I expected to supply the quick answer for at least a couple of queries, were at number one or two in the results, but they never appeared as a quick answer.

Do not trust Google’s quick answers on medical queries, or anything else. Always click through to the website that has been used to provide the answer or, even better, work your way through the results yourself.

So what advice did I suggest my friend give their student? No, don’t cite Google. I already know who Google currently uses for its define command, but a quick way to find out is simply to phrase search a chunk of the definition. That took me straight to an identical definition at Oxford Dictionaries (http://www.oxforddictionaries.com/), and I hope that is the source the student cited.

UK crime data as clear as mud

I’m a nosy neighbour. I like to know what’s going on in my area: who’s bought the house next door, local planning applications, any dodgy activity going on? My husband and I are both self-employed, so there is usually at least one of us out and about in Caversham during the day. That means we have the chance to chat with our local postman, workmen digging up the road, Police Community Support Officers doing their rounds and people in the local shops, bank and post office. Crime, not surprisingly, is a major topic on our “watch list”, and just over two years ago police forces in England and Wales started to provide access to local crime statistics via online maps. The new service allowed you to drill down to ward level and view trends in burglary, robbery, theft, vehicle crime, violent crime and anti-social behaviour.

The format varied from one police force to another. For example, Thames Valley Police provided a basic map and tables of data:

Thames Valley Police 2008 crime rates

Others, such as the Metropolitan Police, included additional graphical representations of the statistics such as bar charts:

Metropolitan Police 2008 Crime Rates

None of them pinned down incidents to individual streets or addresses but they did give you an idea of the level of crime in a particular neighbourhood, how it compared with the same period the previous year and whether it was high, above average, average, below average, low or no crime. They were short, though, on detailed definitions of what each category of crime included. I looked at these maps out of personal curiosity rather than using them for any serious business application, and I made certain assumptions such as murder being included under ‘Violence against the person’. That may not have been the case.

Some police forces placed obvious links to the information on their home pages whilst others buried the data in obscure corners of their websites. The crime maps were then all moved to the CrimeMapper website – the Thames Valley Police map can still be seen at http://maps.police.uk/view/thames-valley – but that has now been integrated into the Police.uk website, which “includes street-level crime data and many other enhancements“.

All you have to do is go to http://www.police.uk/, type in your postcode, town, village or street into the search box and “get instant access to street-level crime maps and data, as well as details of your local policing team and beat meetings“. The first screen looks good with news of local meetings, events, recent tweets, YouTube videos and – as the home page promised – information on my local policing team.

Police UK page for RG4 5BE

When I focus on the map to look at the detail, there are markers for the locations of the crimes, and clicking on them gives you a brief description of the crime:

Detail on Police UK crime rates for Caversham

In this example, the detail box had details of two crimes “on or near Anglefield Road”, and this is where I started to become confused. Were the burglary and the violent crime part of the same incident or totally separate? Furthermore, if you look in the left-hand panel of the screen you will see “To protect privacy, individual addresses are not pinpointed on the map. Crimes are mapped to an anonymous point on or near the road where they occurred.” Fair enough, but I would like to know how near ‘near’ is. 100, 200, 400 yards? Half a mile, a mile? And does the focus shift from one street to another from one month to the next? If it stays put then a street could gain a crime-rate reputation that it does not deserve, but if it shifts there is no way one can compare data from one month or year to another, which brings me to my next question.

Why is there only one month’s data? Previous versions of the crime maps gave you three months’ data for the current and the previous year for comparison. There is nothing about this in the Help section of the Police UK site, but the Guardian reports:

“police forces have indicated that whenever a new set of data is uploaded – probably each month – the previous set will be removed from public view, making comparisons impossible unless outside developers actively store it.” (Crime maps are ‘worse than useless’, claim developers http://www.guardian.co.uk/technology/2011/feb/02/uk-crime-maps-developers-unhappy?CMP=twt_iph).

This means that if you want to run comparisons over time you will have to download the files and store them on your own system each month, or find someone else who is already doing it.
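For anyone who does want to keep their own archive, the workaround is straightforward in principle: fetch the published file each month and save it with a date stamp so that later comparisons are possible. The sketch below is only an illustration of that idea; the download URL is a placeholder, and the real location and format of the data would have to be taken from the site’s own documentation.

```python
# Minimal sketch of a "store it yourself" monthly archive of the crime data.
# DATA_URL is a placeholder - replace it with the site's actual download link.
import datetime
from pathlib import Path
from urllib.request import urlopen

DATA_URL = "https://www.example.org/crime-data/latest.csv"  # placeholder URL
ARCHIVE_DIR = Path("crime_archive")

def archive_latest():
    """Download the current month's file and save it with a YYYY-MM stamp."""
    ARCHIVE_DIR.mkdir(exist_ok=True)
    stamp = datetime.date.today().strftime("%Y-%m")
    target = ARCHIVE_DIR / f"crime-data-{stamp}.csv"
    with urlopen(DATA_URL) as response:
        target.write_bytes(response.read())
    return target

if __name__ == "__main__":
    print("Saved", archive_latest())  # run once a month, e.g. from a scheduled task
```

The point is not the script itself but the burden it represents: the onus of preserving public data for comparison has been shifted onto individual researchers and developers.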

The Guardian article also says:

the Information Commissioner’s Office (ICO) advised that tying crime reports to postcodes or streets with fewer than 12 addresses would render the individuals involved too identifiable. The police have also decided to remove data about murders or sexual assaults.

With respect to the latter, the help file on the Police UK site suggests otherwise:

Crimes have been grouped into six categories following advice from the Information Commissioner’s Office. This doesn’t mean that the crimes listed under ‘other’ are not seen as important. Rather it ensures that for some of the more sensitive crimes there is even greater privacy for the victims.

So which is it: are murders and sexual assaults not included at all, or are they aggregated under “other”? Jonathan Raper says on his blog Placr News (“Five reasons to be cautious about street level crime data” http://placr.co.uk/blog/2011/02/five-reasons-to-be-cautious-about-street-level-crime-data/):

Some data is redacted eg sexual offences, murder. The Metropolitan Police has already released this data to ward level though… and it is easy to cross-reference one murder in one ward to reports in the local press at the same time

Data visualisations and mashups are becoming increasingly popular and make it considerably easier to assess a situation and view trends. The Guardian Datablog (http://www.guardian.co.uk/news/datablog), for example, encourages people to take data sets, mash them up, create their own visualisations and upload a screenshot to the Guardian Datastore on Flickr (http://www.flickr.com/groups/1115946@N24/). It is vital, though, that the source of the data, whether the full data set or just a selection has been used, and whether or not it is going to be updated are all clearly spelt out. All too often one or even all of these are missing from the accompanying notes, and in some cases there are no notes at all!

An example of good practice is “UK transport mapped: Every bus stop, train station, ferry port and taxi rank in Britain” (http://www.guardian.co.uk/news/datablog/2010/sep/27/uk-transport-national-public-data-repository). The posting clearly states the source (http://data.gov.uk/dataset/nptdr) and its coverage:

“A snapshot of every public transport journey in Great Britain for a selected week in October each year. The dataset is compiled with information from many sources, including local public transport information from each of the traveline regions, also coach services from the national coach services database and rail information from the Association of Train Operating Companies”

It then goes on to specify the time period (5-11 October 2009) and the tools that were used to create the visualisation.

Another is the “Live map of London Underground trains” (http://traintimes.org.uk/map/tube/). This shows “all trains on the London Underground network in approximately real time“. The source is a live data feed from Transport for London (TfL) and the notes state that a “small number of stations are misplaced or missing; occasional trains behave oddly; some H&C and Circle stations are missing in the TfL feed.” It would be helpful to have a list of those missing stations, but the site has at least brought the issue of potential missing data to the users’ attention.

Returning to the Police.uk crime data, there are three major problems with the site for me as a researcher:

1. Are all crimes included in the database, or are some such as murders and sexual assaults excluded altogether or aggregated under “other”? More detailed and unambiguous scope notes please.

2. The street-level data is useless. The markers are not exact locations but only “near” to them; there is no definition of “near”, no information on how the position of the marker is calculated, and no indication of the geographic radius that it covers. It would be better to return to aggregated data at the ward level.

3. There are no options for comparing time periods and it seems that historical data will not be available on the web site. An ad hoc researcher will have to spend time and effort tracking down a developer or a web site that is downloading and keeping copies of all of the datasets as they are published.

The new crime data web site is a retrograde step. We need transparency and clarity rather than the muddle and confusion that has been generated by the lack of information on what is being provided.