[Wikimetrics] Fwd: [Labs-l] Tell us about the SQL that you can't get to work

Pine W

2016-12-06 18:07:40 UTC

Forwarding to Analytics, Research, and Wikimetrics in case this is of
interest to people who aren't subscribed to the Labs mailing list.

Pine

---------- Forwarded message ----------
From: Bryan Davis <***@wikimedia.org>
Date: Tue, Dec 6, 2016 at 9:28 AM
Subject: [Labs-l] Tell us about the SQL that you can't get to work
To: labs-l <labs-***@lists.wikimedia.org>

In early January there is going to be a Developer Summit in San
Francisco [0]. Chase and I are in charge of scheduling talks on the
topic "Building on Wikimedia services: APIs and Developer Resources".
One of the more interesting to me talks that has been proposed for
this is "Labsdbs for WMF tools and contributors: get more data,
faster" by Jamie Crespo [1].

I know that most of you won't be able to attend in person, but if we
can show that there is enough interest in this topic we can get the
talk scheduled in a main room and recorded so anyone can watch it
later.

An idea I just had for showing interest is to get Tool Labs
maintainers and other Labs users to describe questions that they have
tried and failed to answer using SQL queries. We can look at the kinds
of questions that come up and ask Jamie (and others) if there are some
general recommendations that can be made about how to improve
performance or understand how the bits and pieces of our data model
fit together.

To kick things off, here's an example I tried to help with over the
weekend. A Quarry user was adapting a query they had used before to
find non-redirect File namespace pages not paired with binary files on
Commons. The query they had come up with was:

SELECT DISTINCT page_title, img_name
FROM (
SELECT DISTINCT page_title
FROM page WHERE page_namespace = 6
AND page_is_redirect = 0
) AS page
LEFT JOIN (
SELECT DISTINCT img_name
FROM image
) AS image ON page_title=img_name
WHERE img_name IS NULL;

The performance of this is horrible for several reasons including the
excessive use of DISTINCT. The query was consistently killed by the 30
minute runtime limit. MaxSem and I both came up with about the same
optimization that eliminated the sub-queries and use of DISTINCT:

SELECT page_title, img_name
FROM page LEFT OUTER JOIN image ON page_title=img_name
WHERE page_namespace = 6
AND page_is_redirect = 0
AND img_name IS NULL;

This new query is not fast in any sense of the word, but it does
finish without timing out. There is still some debate about whether
the 906 rows it returned are correct or not [2].

[0]: https://www.mediawiki.org/wiki/Wikimedia_Developer_Summit
[1]: https://phabricator.wikimedia.org/T149624
[2]: https://quarry.wmflabs.org/query/14501

Bryan
--
Bryan Davis Wikimedia Foundation <***@wikimedia.org>
[[m:User:BDavis_(WMF)]] Sr Software Engineer Boise, ID USA
irc: bd808 v:415.839.6885 x6855