Discussion:
[Wikimetrics] wikimetrics policy
Dan Andreescu
2013-11-02 04:00:15 UTC
Permalink
Hi,

I just noticed someone ran a query from 2012 to 2013 as a timeseries by
hour. This... creates a *lot* of data. For the cohort they used, it's
about 1.8 million data points. Should we cap report sizes somehow? It
doesn't pose any immediate danger beyond taking up a lot of resources
and computation time, as well as IO time spent logging the results (the log
is currently acting as a rudimentary backup - perhaps this is ill-conceived).
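For scale, the 1.8 million figure is roughly what a year of hourly buckets produces; the cohort size below is hypothetical (the thread doesn't state it), chosen so the arithmetic lands near that total:

```python
# Rough scale of an hourly timeseries over one (non-leap) year.
HOURS_PER_YEAR = 365 * 24        # 8760 hourly buckets
cohort_size = 205                # hypothetical; the actual cohort size isn't stated

data_points = HOURS_PER_YEAR * cohort_size
print(data_points)               # 1795800, i.e. about 1.8 million values
```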

In this case it looks like maybe it was a mistake, so one idea is to warn
the user that they are about to generate a lot of data, and to ask them to
confirm.

Thoughts?

Dan
Dan Andreescu
2013-11-02 04:34:04 UTC
Permalink
Good suggestion from Steven:

No hourly reports over a month long; no daily reports over a year long.
Does that seem fair?
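Those limits could be enforced with a simple validation step before a report runs. This is only a sketch; the function and constant names are hypothetical, not existing Wikimetrics code:

```python
from datetime import date, timedelta

# Hypothetical caps matching the suggestion above.
MAX_HOURLY_SPAN = timedelta(days=31)   # no hourly reports over ~a month
MAX_DAILY_SPAN = timedelta(days=366)   # no daily reports over ~a year

def report_allowed(start: date, end: date, granularity: str) -> bool:
    """Return False when the requested timeseries exceeds the cap."""
    span = end - start
    if granularity == "hour":
        return span <= MAX_HOURLY_SPAN
    if granularity == "day":
        return span <= MAX_DAILY_SPAN
    return True  # coarser granularities stay uncapped
```

With these caps, the 2012-to-2013 hourly query from the first message would be rejected outright.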

Dan
Dario Taraborelli
2013-11-02 06:02:28 UTC
Permalink
so, assuming that user wasn’t me <kidding>... how about some kind of throttling for non-WMF users?

The limits sound fair anyway, but I see external researchers (and even community members interested in historical data) using this tool to collect very long data series.
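Throttling could be as simple as a per-user token bucket in front of report submission. A minimal sketch, assuming a rate of one report per minute with small bursts allowed; none of these names exist in Wikimetrics:

```python
import time
from collections import defaultdict

# Minimal per-user token bucket (a sketch, not an existing Wikimetrics feature).
REFILL_PER_SEC = 1 / 60.0   # regain one report per minute
BURST = 5                   # allow short bursts of up to 5 reports

_buckets = defaultdict(lambda: {"tokens": float(BURST), "stamp": time.monotonic()})

def allow_report(user: str) -> bool:
    """Spend one token for this user if available; otherwise refuse."""
    b = _buckets[user]
    now = time.monotonic()
    b["tokens"] = min(BURST, b["tokens"] + (now - b["stamp"]) * REFILL_PER_SEC)
    b["stamp"] = now
    if b["tokens"] >= 1:
        b["tokens"] -= 1
        return True
    return False
```

The rate and burst size would presumably differ for WMF staff and external users.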

Dario
_______________________________________________
Wikimetrics mailing list
Wikimetrics at lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikimetrics
Steven Walling
2013-11-02 06:08:28 UTC
Permalink
I think that use case is out of scope for Wikimetrics. It's getting
dangerously close to using Wikimetrics as a general data platform or
service, rather than sticking to human-readable results for
standardized metrics. It's okay to go back months or years in time, but not
simultaneously at a level of detail that can't be interpreted without
further heavy processing of the results.
--
Steven Walling,
Product Manager
https://wikimediafoundation.org/
Dario Taraborelli
2013-11-02 06:17:13 UTC
Permalink
that’s correct, the original plan was to build an API.
Dario Taraborelli
2013-11-02 06:18:02 UTC
Permalink
and that’s why we need throttling anyway
Dan Andreescu
2013-11-02 12:38:28 UTC
Permalink
Well, Dario, it was actually someone at WMF. But I don't think that should
matter much. Let's do this as a compromise:

If someone runs an hourly report longer than a month or a daily report
longer than a year, we give them a warning telling them what's going to
happen. If they say OK, we have to assume they know what they're doing and
really need the data.

I know I accidentally ran a really long query once, so we'd at least guard
against that. Like I said though, even that crazy long query last night
didn't cause any huge problems. It just used up a bit of memory and slowed
access to the wikimetrics server for a few hours. There are a couple of
simple monitoring, tracing, and backup improvements I could make in order
to alleviate that as well. So if it keeps happening despite the warning,
I'll just do that.
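The warn-then-confirm flow described above could look something like this. All names here are hypothetical; this is only a sketch of the proposed behavior, not actual Wikimetrics code:

```python
from datetime import date

# Thresholds from the thread: warn on hourly spans over ~a month,
# daily spans over ~a year.
WARN_DAYS = {"hour": 31, "day": 366}

def needs_confirmation(start: date, end: date, granularity: str) -> bool:
    """True when the report is big enough to deserve an 'are you sure?' prompt."""
    limit = WARN_DAYS.get(granularity)
    return limit is not None and (end - start).days > limit

def submit_report(start, end, granularity, confirmed=False):
    """Queue the report, unless it is large and the user hasn't confirmed."""
    if needs_confirmation(start, end, granularity) and not confirmed:
        return "warning: this will generate a lot of data; re-submit with confirmed=True"
    return "report queued"
```

Unlike a hard cap, this still lets a user who confirms run the large report.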

Dan


Jaime Anstee
2013-11-04 17:22:10 UTC
Permalink
Dan,

I think the warning is important and would help prevent this type of
query from being run by mistake. I have seen this almost happen, and with the
rate at which Sarah and our interns have been pulling data, I have
heard them wince more than once at choosing the wrong command. Anyway, I
support your idea to institute a warning.

Thanks,

Jaime
--
Jaime Anstee, Ph.D
Program Evaluation Specialist
Wikimedia Foundation
+1.415.839.6885 ext 6869
www.wikimediafoundation.org

Imagine a world in which every single human being can freely share in the
sum of all knowledge. Help us make it a reality!
https://donate.wikimedia.org