Commit 2a222da
bpo-37966: Fully implement the UAX #15 quick-check algorithm.
The purpose of the `unicodedata.is_normalized` function is to answer
the question `str == unicodedata.normalize(form, str)` more
efficiently than writing just that, by using the "quick check"
optimization described in the Unicode standard in UAX #15.
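As a minimal sketch of that contract (assuming Python 3.8 or later, where `unicodedata.is_normalized` exists), the fast function should always agree with the normalize-and-compare spelling. The helper name `is_normalized_slow` and the sample strings are illustrative only and do not come from the patch.

```python
import unicodedata


def is_normalized_slow(form, s):
    # Reference behaviour: normalize the whole string, then compare.
    return s == unicodedata.normalize(form, s)


# Illustrative inputs: plain ASCII, a precomposed letter, a decomposed
# letter, and the compatibility ideograph used in the timings.
samples = ["abc", "\u00e9", "e\u0301", "\uf900"]

for form in ("NFC", "NFD", "NFKC", "NFKD"):
    for s in samples:
        # Quick check is only an optimization; the answer must not change.
        assert unicodedata.is_normalized(form, s) == is_normalized_slow(form, s)
```

Only the cost differs between the two spellings, never the result.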
However, it turns out the code doesn't implement the full algorithm
from the standard, and as a result we often miss the optimization and
end up having to compute the whole normalized string after all.
Implement the standard's algorithm. This greatly speeds up
`unicodedata.is_normalized` in many cases where our partial variant
of quick-check had been returning MAYBE and the standard algorithm
returns NO.
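The standard's algorithm can be sketched in Python roughly as follows. This is an illustration of the three-valued quick check in UAX #15, not the C code in this patch. The names `QC`, `nfd_qc` and `quick_check` are hypothetical, and `nfd_qc` only approximates the NFD_Quick_Check property from data the `unicodedata` module exposes (a canonical decomposition mapping, or a precomposed Hangul syllable, is treated as NO).

```python
import unicodedata
from enum import Enum


class QC(Enum):
    YES = "yes"
    NO = "no"
    MAYBE = "maybe"


def nfd_qc(ch):
    # Assumption: approximates NFD_Quick_Check. NFD has no MAYBE values.
    if 0xAC00 <= ord(ch) <= 0xD7A3:
        return QC.NO  # Hangul syllables decompose algorithmically
    decomp = unicodedata.decomposition(ch)
    if decomp and not decomp.startswith("<"):
        return QC.NO  # canonical (untagged) decomposition mapping
    return QC.YES


def quick_check(s, qc_property=nfd_qc):
    last_ccc = 0
    result = QC.YES
    for ch in s:
        ccc = unicodedata.combining(ch)
        # Combining marks out of canonical order: definitely not normalized.
        if ccc != 0 and last_ccc > ccc:
            return QC.NO
        check = qc_property(ch)
        if check is QC.NO:
            return QC.NO  # definite answer, no need to normalize
        if check is QC.MAYBE:
            result = QC.MAYBE  # keep scanning; a later NO still wins
        last_ccc = ccc
    return result
```

Only a final MAYBE forces the slow normalize-and-compare path. A string such as `"\uf900" * 500000` yields NO at its first character, which is why the answer no longer depends on the string's length.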
In a quick test on my desktop, the existing code takes about 4.4 ms/MB
(so 4.4 ns per byte) when the partial quick-check returns MAYBE and it
has to do the slow normalize-and-compare:
    $ build.base/python -m timeit -s 'import unicodedata; s = "\uf900"*500000' \
        -- 'unicodedata.is_normalized("NFD", s)'
    50 loops, best of 5: 4.39 msec per loop
With this patch, it gets the answer instantly (58 ns) on the same 1 MB
string:
    $ build.dev/python -m timeit -s 'import unicodedata; s = "\uf900"*500000' \
        -- 'unicodedata.is_normalized("NFD", s)'
    5000000 loops, best of 5: 58.2 nsec per loop
1 parent 4025110 commit 2a222da
4 files changed
Lines changed: 47 additions & 28 deletions
File tree
- Doc/whatsnew
- Lib/test
- Misc/NEWS.d/next/Core and Builtins
- Modules