Discussion:
[Wikimetrics] user_id and user_name distinction
Dan Andreescu
2013-11-22 22:27:39 UTC
Permalink
Hi everyone,

A quick note about something that just messed me up. When uploading a
cohort to wikimetrics, you are told you can use either user_name, user_id,
or a mixture in the first column. However, this can really produce
unexpected results if you don't know how it works. I think it needs to
change, but until then, this is how it works and how it can bite you:

Let's say I have a list of users:

1,en
2,en
3,en

When it validates, it will look up user_name == 1 and, if it doesn't find
anything, it will look up user_id == 1. Then user_name == 2, user_id == 2,
user_name == 3, user_id == 3. If what you meant with the above cohort was
the users with ids 1, 2, and 3, then you might be very confused later when
you see user id 234215 in your output results. This can happen if some
user's user_name is actually 2! So, for now, until I figure out how to fix
this, it will always prefer user_names before user_ids.
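
To make the surprise concrete, here is a rough Python sketch of the lookup order described above. The `USERS` table and the `resolve_name_first` helper are made up for illustration and are not wikimetrics code; only the names-before-ids order comes from the description.

```python
# Hypothetical user table: user_id -> user_name (illustrative data only).
USERS = {1: "Alice", 2: "Bob", 3: "Carol", 234215: "2"}


def resolve_name_first(value, users=USERS):
    """Return the user_id matched for a raw cohort entry, names before ids."""
    # First pass: treat the entry as a user_name.
    for user_id, user_name in users.items():
        if user_name == value:
            return user_id
    # Second pass: fall back to treating the entry as a user_id.
    if value.isdecimal() and int(value) in users:
        return int(value)
    return None


cohort = ["1", "2", "3"]
resolved = [resolve_name_first(v) for v in cohort]
print(resolved)  # [1, 234215, 3]
```

The entry "2" resolves to the account whose name is "2" rather than to user_id 2, which is the kind of result that shows up unexpectedly in a report.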

Please let me know if this is confusing. Also, the whole problem stems
from needing to accept both user_id and user_name in the *same* upload. If
everyone agrees, I'd much rather just allow people to toggle between one or
the other. This would speed up validation and make it much clearer what is
going on.
Steven Walling
2013-11-22 22:33:17 UTC
Permalink
So, for now, until I figure out how to fix this, it will always prefer
user_names before user_ids.
I think this is an argument for making users specify up front whether the
list contains names or ids, and for not allowing mixtures. Assuming it might
be a mixture and looking for names first is almost certain to produce
inaccurate results at some point. We have ids precisely to avoid collisions
with names, to allow for renaming users, and for other such cases.
--
Steven Walling,
Product Manager
https://wikimediafoundation.org/
Dan Andreescu
2013-11-22 23:28:13 UTC
Permalink
Yep, I just learned this the hard way and made a fool of myself in front of
a bunch of people I admire. So, I'd be glad if I'm the only one that this
happens to. If nobody objects, I'm going to allow the user to select
whether their cohort contains user_ids OR user_names, and strictly prohibit
mixtures.
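
A minimal sketch of what strict, single-kind validation could look like, in Python. The `validate_cohort` function and its `kind` argument are hypothetical names chosen for this example, not the actual wikimetrics implementation; the only thing taken from the thread is that a cohort is declared as user_ids OR user_names and entries are never reinterpreted.

```python
def validate_cohort(values, kind):
    """Split raw entries into (valid, invalid) for a cohort of one declared kind.

    kind is "user_id" or "user_name"; entries are never reinterpreted.
    """
    if kind not in ("user_id", "user_name"):
        raise ValueError("kind must be 'user_id' or 'user_name'")
    valid, invalid = [], []
    for raw in values:
        entry = raw.strip()
        if kind == "user_id":
            # In id mode, anything that is not a positive integer is rejected.
            if entry.isdecimal() and int(entry) > 0:
                valid.append(int(entry))
            else:
                invalid.append(raw)
        else:
            # In name mode, "2" stays the user_name "2"; only blanks are rejected.
            if entry:
                valid.append(entry)
            else:
                invalid.append(raw)
    return valid, invalid


print(validate_cohort(["1", "2", "Bob"], "user_id"))    # ([1, 2], ['Bob'])
print(validate_cohort(["1", "2", "Bob"], "user_name"))  # (['1', '2', 'Bob'], [])
```

Because each entry is checked against one column only, a single lookup per entry is enough, which is also where a speed-up would be expected to come from.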
Dario Taraborelli
2013-11-23 00:04:15 UTC
Permalink
that works for me, thanks!

Jaime – can you give us more details on the use case for mixed cohorts that you had in mind?
_______________________________________________
Wikimetrics mailing list
Wikimetrics at lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikimetrics
Jaime Anstee
2013-11-26 22:35:33 UTC
Permalink
Missed the question back to me, sorry. Mixed cohorts might occur because
the output uses user IDs while collection is by username. Say someone runs
repeating events and has a CSV output of data for the new users who were
retained at a certain activity level from Point A to Point B. New cohort
members then opt in at Point B, and for a later Point C the person wants to
examine only those who survived from Point A plus the members who are new
at Point B. Without usernames in the output to build the active Point B
cohort separately, the Point C cohort would be a mix of qualified user ids
and new user names. There are several ways of dealing with this; it was
just the first scenario I could think of that could cause it. It seems we
still need to revisit the possibility of getting usernames as output, also
for matching to other data points, since most users and most program
leaders do not know user ids.
Jaime
--
Jaime Anstee, Ph.D
Program Evaluation Specialist
Wikimedia Foundation
+1.415.839.6885 ext 6869
www.wikimediafoundation.org

Imagine a world in which every single human being can freely share in the
sum of all knowledge. Help us make it a reality!
*https://donate.wikimedia.org <https://donate.wikimedia.org/>*



Dario Taraborelli
2013-11-26 22:46:51 UTC
Permalink
thanks for the clarification Jaime – if this is the main cause of the problem, it sounds like we should consider adding user_names to the output instead of building functionality at the input to deal with it. Dan, any thoughts?

BTW this notion of rerunning cohort analysis for members of a previous cohort who meet specific criteria is a use case that Product/Editor Engagement is also interested in. We used to call these “generated cohorts” in the old design plans for UserMetrics and I’d love it if we revisited this feature request and its relative priority.

D
LiAnna Davis
2013-11-26 23:59:00 UTC
Permalink
I would LOVE it if the output gave user names instead of user IDs. Often
the data makes me want to investigate the individual stories of
contributors who added a lot of content/made a lot of edits/etc., but
there's no way of doing that with user IDs since I can't convert user IDs
to usernames.




--
LiAnna Davis
Wikipedia Education Program Communications Manager
Wikimedia Foundation
http://education.wikimedia.org
(415) 839-6885 x6649
ldavis at wikimedia.org
Dan Andreescu
2013-11-27 01:01:07 UTC
Permalink
These things have definitely been discussed before, so it's time to get
them prioritized. CC-ed Toby directly so he can follow up:

1. wikimetrics should allow user_name to be the key in report outputs.
Right now, only user_id is allowed and this is not great. LiAnna, Jaime,
and Jessie are definitely interested in this, and have mentioned it a few
times.
2. wikimetrics should allow "generated cohorts" as implemented by user
metrics api. These are cohorts defined by reports on other cohorts. For
example, if we run report R on cohort C, then generated cohort (GC) would
be: GC = {user | user in C and R(user) is true}. Dario is definitely
interested in this, and Jaime might be as well.
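
The set definition in item 2 can be written almost literally in Python. This is only a sketch: `generated_cohort`, `report_r`, and the edit counts are invented for the example, and the real report would be whatever metric is run on the cohort.

```python
def generated_cohort(cohort, report):
    """Return GC = {user | user in C and R(user) is true}."""
    return {user for user in cohort if report(user)}


# Illustrative data: edit counts per user_id, invented for this example.
edits = {1: 0, 2: 12, 3: 5}
cohort_c = {1, 2, 3}


def report_r(user):
    """Hypothetical report: true when the user made at least 5 edits."""
    return edits.get(user, 0) >= 5


print(sorted(generated_cohort(cohort_c, report_r)))  # [2, 3]
```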
Dario Taraborelli
2013-11-27 01:16:01 UTC
Permalink
Thanks Dan. I told Toby he should subscribe to this list :)

Regarding #1 another option could be to:
• only allow user_ids as keys (after all most JSON consumers prefer to work with user_ids) but add user_names as attributes
• return both user_ids and user_names as separate columns in flat CSVs.

Either way, it sounds like this would be a great addition for Wikimetrics customers.
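
As a sketch of the two output shapes suggested above, using only the Python standard library: the rows and the `edits` metric name are invented for illustration and do not reflect the actual wikimetrics report format.

```python
import csv
import io
import json

# Illustrative report rows (invented data): user_id, user_name, metric value.
rows = [(2, "Bob", 12), (3, "Carol", 5)]

# JSON: user_id stays the key, user_name rides along as an attribute.
as_json = json.dumps(
    {str(uid): {"user_name": name, "edits": edits} for uid, name, edits in rows},
    sort_keys=True,
)

# Flat CSV: user_id and user_name as separate columns.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["user_id", "user_name", "edits"])
writer.writerows(rows)
as_csv = buffer.getvalue()

print(as_json)
print(as_csv)
```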

Dario
Steven Walling
2013-11-27 18:21:51 UTC
Permalink
Post by Dan Andreescu
1. wikimetrics should allow user_name to be the key in report outputs.
Right now, only user_id is allowed and this is not great. LiAnna, Jaime,
and Jessie are definitely interested in this, and have mentioned it a few
times.
2. wikimetrics should allow "generated cohorts" as implemented by user
metrics api. These are cohorts defined by reports on other cohorts. For
example, if we run report R on cohort C, then generated cohort (GC) would
be: GC = {user | user in C and R(user) is true}. Dario is definitely
interested in this, and Jaime might be as well.
These both seem pretty high priority to me. Thanks for bringing these up
Dan. I don't directly need usernames as output, but it would be nice for
qualitative analysis purposes. And the generated cohorts idea has been
floating around for a long time.
--
Steven Walling,
Product Manager
https://wikimediafoundation.org/
Steven Walling
2013-11-27 05:31:42 UTC
Permalink
Post by LiAnna Davis
I would LOVE it if the output gave user names instead of user IDs. Often
the data makes me want to investigate the individual stories of
contributors who added a lot of content/made a lot of edits/etc., but
there's no way of doing that with user IDs since I can't convert user IDs
to usernames.
Hey LiAnna,

I think I might have mentioned it on the analytics list before...
Special:Redirect is on all wikis and it will convert user_ids to usernames,
if you want to investigate any individual user.
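
For anyone doing this for many ids, a small Python helper can build the links. This is a sketch that assumes the `Special:Redirect/user/<id>` subpage form and uses en.wikipedia.org as an example host; `user_redirect_url` is a made-up name, and ids are local to each wiki, so the host has to match the project the id came from.

```python
def user_redirect_url(user_id, wiki_host="en.wikipedia.org"):
    """Build a Special:Redirect URL that resolves a user_id to its user page."""
    user_id = int(user_id)
    if user_id <= 0:
        raise ValueError("user_id must be a positive integer")
    # Assumed URL shape: /wiki/Special:Redirect/user/<id>
    return "https://{}/wiki/Special:Redirect/user/{}".format(wiki_host, user_id)


print(user_redirect_url(234215))
```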
--
Steven Walling,
Product Manager
https://wikimediafoundation.org/
LiAnna Davis
2013-11-27 16:58:48 UTC
Permalink
Post by Steven Walling
Hey LiAnna,
I think I might have mentioned it on the analytics list before...
Special:Redirect is on all wikis and it will convert user_ids to usernames,
if you want to investigate any individual user.
OMG! I had no idea this existed, and this will make my life so much easier.
Thanks, Steven! :)

LiAnna
--
LiAnna Davis
Wikipedia Education Program Communications Manager
Wikimedia Foundation
http://education.wikimedia.org
(415) 839-6885 x6649
ldavis at wikimedia.org
Jessie Wild
2013-11-23 01:16:10 UTC
Permalink
Good catch, Dan!

adding the upfront option of indicating names or ids seems like a
good/needed addition. I would also recommend including a definition of
each - I think some of our users might get a little confused about the
distinction just from the names (i.e., not realize that user_names are
associated with user_ids).
--
Jessie Wild
Grantmaking Learning & Evaluation
Wikimedia Foundation

Imagine a world in which every single human being can freely share in
the sum of all knowledge. Help us make it a reality!
Donate to Wikimedia <https://donate.wikimedia.org/>
Dan Andreescu
2013-11-23 19:30:51 UTC
Permalink
Ok, so I deployed the new cohort upload:
https://metrics.wmflabs.org/cohorts/upload. Hopefully the explanation and
examples make a bit more sense. And I think it's much cleaner this way.
It's also faster :)

As always, let me know if you have any trouble.
Dario Taraborelli
2013-11-23 20:04:15 UTC
Permalink
works like a charm and validation for user_ids is fast, thanks Dan!
Post by Dan Andreescu
Ok, so I deployed the new cohort upload:
https://metrics.wmflabs.org/cohorts/upload. Hopefully the explanation
and examples make a bit more sense. And I think it's much cleaner this
way. It's also faster :)
As always, let me know if you have any trouble.
Steven Walling
2013-11-23 23:14:56 UTC
Permalink
Post by Dan Andreescu
Ok, so I deployed the new cohort upload:
https://metrics.wmflabs.org/cohorts/upload. Hopefully the explanation
and examples make a bit more sense. And I think it's much cleaner this
way. It's also faster :)
As always, let me know if you have any trouble.
Awesome. :)

Works like a charm.
--
Steven Walling,
Product Manager
https://wikimediafoundation.org/
Dan Andreescu
2013-11-23 23:22:32 UTC
Permalink
A note about this: validation on really large cohorts is now fast, and
running reports on them seems pretty damn fast too, so I'm not sure
if/when we'll run out of space in labs. I'm currently working on migrating
wikimetrics to stat1001, where, in production, we should never run out of
space.

So I wanted to ask, how far back do people need their reports to stay
alive? Right now it's set to delete them after 30 days, but are people
really using them that far back? I was thinking of changing that to 7 days
unless people find it useful as is.

Dan
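
To make the 30-day versus 7-day question concrete, here is a minimal sketch of an age-based cleanup rule. It is not how wikimetrics actually stores or deletes reports: the function name `expired_report_ids`, the `created` mapping of report id to creation time, and the `keep_days` parameter are all assumptions made for illustration.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def expired_report_ids(
    created: Dict[int, datetime], now: datetime, keep_days: int
) -> List[int]:
    """Return the ids of reports created more than keep_days before now."""
    cutoff = now - timedelta(days=keep_days)
    return sorted(rid for rid, ts in created.items() if ts < cutoff)
```

Running the same function with `keep_days=30` and `keep_days=7` over the existing reports would show how many more reports the shorter window would delete.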