-
-
Notifications
You must be signed in to change notification settings - Fork 19.3k
BUG: Require sample weights to sum to less than 1 when replace = True #61582
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: Require sample weights to sum to less than 1 when replace = True #61582
Conversation
1281002 to
1d1f742
Compare
rhshadrach
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good, but the condition needs adjusted. Will also need a note in the whatsnew for 3.0 under the Other section.
pandas/core/sample.py
Outdated
| if weight_sum > 1 and replace: | ||
| raise ValueError( | ||
| "Invalid weights: If `replace`=True weights must sum to less than 1" | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not the correct condition. See the formula in the OP of the linked issue.
pandas/core/generic.py
Outdated
| When replace = True will not allow weights that add up to less | ||
| than 1, to avoid biased results. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| When replace = True will not allow weights that add up to less | |
| than 1, to avoid biased results. | |
| When replace = False will not allow ``(n * max(weights) / sum(weights)) > 1``, | |
| to avoid biased results. |
2bc3618 to
e5f9a07
Compare
9f62a65 to
11df613
Compare
|
rhshadrach
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good!
pandas/core/sample.py
Outdated
| if weights is not None: | ||
| is_max_weight_dominating = size * weights.max() > 1 | ||
| if is_max_weight_dominating and not replace: | ||
| raise ValueError( | ||
| "Invalid weights: If `replace`=False, " | ||
| "total unit probabilities have to be less than 1" | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Avoid O(n) computation if it is not needed.
- Combine with the
ifblock above; no need to checkif weights is not Nonetwice. - Suggestion on a more clear error message.
| if weights is not None: | |
| is_max_weight_dominating = size * weights.max() > 1 | |
| if is_max_weight_dominating and not replace: | |
| raise ValueError( | |
| "Invalid weights: If `replace`=False, " | |
| "total unit probabilities have to be less than 1" | |
| ) | |
| if not replace and size * weights.max() > 1: | |
| raise ValueError( | |
| "Weighted sampling cannot be achieved with replace=False. Either " | |
| "set replace=True or use smaller weights. See the docstring of sample " | |
| "for details." | |
| ) |
pandas/core/generic.py
Outdated
| Missing values in the weights column will be treated as zero. | ||
| Infinite values not allowed. | ||
| When replace = False will not allow ``(n * max(weights) / sum(weights)) > 1``, | ||
| in order to avoid biased results. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add "See the Notes below for more details." and then add a bit more detail in the notes section. I can give it a shot if you'd prefer.
| # edge case, n*max(weights)/sum(weights) == 1 | ||
| edge_variance_weights = [1] * 10 | ||
| edge_variance_weights[0] = 9 | ||
| # should not raise | ||
| obj.sample(n=2, weights=edge_variance_weights, replace=False) | ||
|
|
||
| low_variance_weights = [1] * 10 | ||
| low_variance_weights[0] = 8 | ||
| # should not raise | ||
| obj.sample(n=2, weights=low_variance_weights, replace=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you make these two separate tests.
5f7484b to
7e7a73c
Compare
rhshadrach
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the chnages. A few minor requests and we should be good to merge.
doc/source/whatsnew/v0.16.1.rst
Outdated
| df = pd.DataFrame({"col1": [9, 8, 7, 6], "weight_column": [0.5, 0.4, 0.1, 0]}) | ||
| df.sample(n=3, weights="weight_column") | ||
| df = pd.DataFrame({"col1": [9, 8, 7, 6], "weight_column": [0.5, 0.4, 0.1, 0]}) | ||
| df.sample(n=2, weights="weight_column") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The added indentation here looks incorrect.
pandas/core/generic.py
Outdated
| If weights do not sum to 1, they will be normalized to sum to 1. | ||
| Missing values in the weights column will be treated as zero. | ||
| Infinite values not allowed. | ||
| When replace = False will not allow ``(n * max(weights) / sum(weights)) > 1``, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: remove the , at the end of this line
| ----- | ||
| If `frac` > 1, `replacement` should be set to `True`. | ||
| When replace = False will not allow ``(n * max(weights) / sum(weights)) > 1``, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: remove the trailing ,.
pandas/core/generic.py
Outdated
| When replace = False will not allow ``(n * max(weights) / sum(weights)) > 1``, | ||
| since that would cause results to be biased. E.g. sampling 2 items without replacement, | ||
| with weights [100, 1, 1] would yield two last items in 1/2 of cases, instead of 1/102 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| with weights [100, 1, 1] would yield two last items in 1/2 of cases, instead of 1/102 | |
| with weights ``[100, 1, 1]`` would yield two last items in 1/2 of cases, instead of 1/102. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you also add, "This is similar to specifying n=4 without replacement on a Series with 3 elements."
f553ed7 to
c4241fb
Compare
rhshadrach
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
|
Thanks @microslaw! |
doc/source/whatsnew/vX.X.X.rstfile if fixing a bug or adding a new feature.