VirtualBox

source: kBuild/vendor/grep/current/TODO

Last change on this file was 3529, checked in by bird, 3 years ago

Imported grep 3.7 from grep-3.7.tar.gz (sha256: c22b0cf2d4f6bbe599c902387e8058990e1eee99aef333a203829e5fd3dbb342), applying minimal auto-props.

File size: 11.0 KB
Line 
1Things to do for GNU grep
2
3 Copyright (C) 1992, 1997-2002, 2004-2021 Free Software Foundation, Inc.
4
5 Copying and distribution of this file, with or without modification,
6 are permitted in any medium without royalty provided the copyright
7 notice and this notice are preserved.
8
9===============
10Short term work
11===============
12
13See where we are with UTF-8 performance.
14
15Merge Debian patches that seem relevant.
16
17Go through patches in Savannah.
18
19Fix --directories=read.
20
21Write better Texinfo documentation for grep. The manual page would be a
22good place to start, but Info documents are also supposed to contain a
23tutorial and examples.
24
25Some tests in tests/spencer2.tests should have failed! Need to filter out
26some bugs in dfa.[ch]/regex.[ch].
27
28Multithreading?
29
30GNU grep originally did 32-bit arithmetic. Although it has moved to
3164-bit on 64-bit platforms by using types like ptrdiff_t and size_t,
32this conversion has not been entirely systematic and should be checked.
33
34Lazy dynamic linking of libpcre. See Debian’s 03-397262-dlopen-pcre.patch.
35
36Check FreeBSD’s integration of zgrep (-Z) and bzgrep (-J) in one
37binary. Is there a possibility of doing even better by automatically
38checking the magic of binary files ourselves (0x1F 0x8B for gzip, 0x1F
390x9D for compress, and 0x42 0x5A 0x68 for bzip2)? Once what to do with
40libpcre is decided, do the same for libz and libbz2.
41
42
43
44===================
45Matching algorithms
46===================
47
48Take a look at these and consider opportunities for merging or cloning:
49
50 -- http://osrd.org/projects/grep/global-regular-expression-print-tools-grep-variants
51 -- ja-grep’s mlb2 patch (Japanese grep)
52 <http://distcache.freebsd.org/ports-distfiles/grep-2.4.2-mlb2.patch.gz>
53 -- lgrep (from lv, a Powerful Multilingual File Viewer / Grep)
54 <http://www.mt.cs.keio.ac.jp/person/narita/lv/>;
55 -- cgrep (Context grep) <https://awgn.github.io/cgrep/>
56 seems like nice work;
57 -- sgrep (Struct grep) <https://www.cs.helsinki.fi/u/jjaakkol/sgrep.html>;
58 -- agrep (Approximate grep) <https://www.tgries.de/agrep/>,
59 from glimpse;
60 -- nr-grep (Nondeterministic reverse grep)
61 <https://www.dcc.uchile.cl/~gnavarro/software/>;
62 -- ggrep (Grouse grep) <http://www.grouse.com.au/ggrep/>;
63 -- freegrep <https://github.com/howardjp/freegrep>;
64
65Check some new algorithms for matching. See, for example, Faro &
66Lecroq (cited in kwset.c).
67
68Fix the DFA matcher to never use exponential space. (Fortunately, these
69cases are rare.)
70
71
72
73============================
74Standards: POSIX and Unicode
75============================
76
77For POSIX compliance issues, see POSIX 1003.1.
78
79Current support for the POSIX [= =] and [. .] constructs is limited to
80platforms whose regular expression matchers are sufficiently
81compatible with the GNU C library so that the --without-included-regex
82option of ‘configure’ is in effect. Extend this support to non-glibc
83platforms, where --with-included-regex is in effect, by modifying the
84included version of the regex code to defer to the native version when
85handling [= =] and [. .].
86
87For Unicode, interesting things to check include the Unicode Standard
88<https://www.unicode.org/standard/standard.html> and the Unicode Technical
89Standard #18 (<https://www.unicode.org/reports/tr18/> “Unicode Regular
90Expressions”). Talk to Bruno Haible who’s maintaining GNU libunistring.
91See also Unicode Standard Annex #15 (<https://www.unicode.org/reports/tr15/>
92“Unicode Normalization Forms”), already implemented by GNU libunistring.
93
94In particular, --ignore-case needs to be evaluated against the standards.
95We may want to deviate from POSIX if Unicode provides better or clearer
96semantics.
97
98POSIX and --ignore-case
99-----------------------
100
101For this issue, interesting things to check in POSIX include the
102Open Group Base Specifications, Chapter “Regular Expressions”, in
103particular Section “Regular Expression General Requirements” and its
104paragraph about caseless matching (this may not have been fully
105thought through and that this text may be self-contradicting
106[specifically: “of either data or patterns” versus all the rest]).
107See:
108
109http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_02
110
111In particular, consider the following with POSIX’s approach to case
112folding in mind. Assume a non-Turkic locale with a character
113repertoire reduced to the following various forms of “LATIN LETTER I”:
114
115 0049;LATIN CAPITAL LETTER I;Lu;0;L;;;;;N;;;;0069;
116 0069;LATIN SMALL LETTER I;Ll;0;L;;;;;N;;;0049;;0049
117 0130;LATIN CAPITAL LETTER I WITH DOT ABOVE;Lu;0;L;0049 0307;;;;N;\
118 LATIN CAPITAL LETTER I DOT;;;0069;
119 0131;LATIN SMALL LETTER DOTLESS I;Ll;0;L;;;;;N;;;0049;;0049
120
121UTF-8 octet lengths differ between U+0049 (0x49) and U+0069 (0x69)
122versus U+0130 (0xC4 0xB0) and U+0131 (0xC4 0xB1). This implies that
123whole UTF-8 strings cannot be case-converted in place, using the same
124memory buffer, and that the needed octet-size of the new buffer cannot
125merely be guessed (although there’s a simple upper bound of five times
126the size of the input, as the longest UTF-8 encoding of any character
127is five bytes).
128
129We have
130
131 lc(I) = i, uc(I) = I
132 lc(i) = i, uc(i) = I
133 lc(İ) = i, uc(İ) = İ
134 lc(ı) = ı, uc(ı) = I
135
136where lc() and uc() denote lower-case and upper-case conversions.
137
138There are several candidate --ignore-case logics. Using the
139
140 if (lc(input_wchar) == lc(pattern_wchar))
141
142logic leads to the following matches:
143
144 \in I i İ ı
145 pat\ ----------
146 I | Y Y Y n
147 i | Y Y Y n
148 İ | Y Y Y n
149 ı | n n n Y
150
151There is a lack of symmetry between CAPITAL and SMALL LETTERs with
152this. Using the
153
154 if (uc(input_wchar) == uc(pattern_wchar))
155
156logic (which is what GNU grep currently does although this is not
157documented or guaranteed in the future), leads to the following
158matches:
159
160 \in I i İ ı
161 pat\ ----------
162 I | Y Y n Y
163 i | Y Y n Y
164 İ | n n Y n
165 ı | Y Y n Y
166
167There is a lack of symmetry between CAPITAL and SMALL LETTERs with
168this.
169
170Using the
171
172 if (lc(input_wchar) == lc(pattern_wchar)
173 || uc(input_wchar) == uc(pattern_wchar))
174
175logic leads to the following matches:
176
177 \in I i İ ı
178 pat\ ----------
179 I | Y Y Y Y
180 i | Y Y Y Y
181 İ | Y Y Y n
182 ı | Y Y n Y
183
184There is some elegance and symmetry with this. But there are
185potentially two conversions to be made per input character. If the
186pattern is pre-converted, two copies of it need to be kept and used in
187a mutually coherent fashion.
188
189Using the
190
191 if (input_wchar == pattern_wchar
192 || lc(input_wchar) == pattern_wchar
193 || uc(input_wchar) == pattern_wchar)
194
195logic (a plausible interpretation of POSIX) leads to the following
196matches:
197
198 \in I i İ ı
199 pat\ ----------
200 I | Y Y n Y
201 i | Y Y Y n
202 İ | n n Y n
203 ı | n n n Y
204
205There is a different CAPITAL/SMALL symmetry with this. But there’s
206also a loss of pattern/input symmetry that’s unique to it. Also there
207are potentially two conversions to be made per input character.
208
209Using the
210
211 if (lc(uc(input_wchar)) == lc(uc(pattern_wchar)))
212
213logic leads to the following matches:
214
215 \in I i İ ı
216 pat\ ----------
217 I | Y Y Y Y
218 i | Y Y Y Y
219 İ | Y Y Y Y
220 ı | Y Y Y Y
221
222This shows total symmetry and transitivity (at least in this example
223analysis). There are two conversions to be made per input character,
224but support could be added for having a single straight mapping
225performing a composition of the two conversions.
226
227Any optimization in the implementation of each logic must not change
228its basic semantic.
229
230
231Unicode and --ignore-case
232-------------------------
233
234For this issue, interesting things to check in Unicode include:
235
236 - The Unicode Standard, Chapter 3
237 (<https://www.unicode.org/versions/Unicode9.0.0/ch03.pdf>
238 “Conformance”), Section 3.13 (“Default Case Algorithms”) and the
239 toCasefold() case conversion operation.
240
241 - The Unicode Standard, Chapter 4
242 (<https://www.unicode.org/versions/Unicode9.0.0/ch04.pdf>
243 “Character Properties”), Section 4.2 (“Case”) and
244 the <https://www.unicode.org/Public/UNIDATA/SpecialCasing.txt>
245 SpecialCasing.txt and
246 <https://www.unicode.org/Public/UNIDATA/CaseFolding.txt>
247 CaseFolding.txt files.
248
249 - The Unicode Standard, Chapter 5
250 (<https://www.unicode.org/versions/Unicode9.0.0/ch05.pdf>
251 “Implementation Guidelines”), Section 5.18 (“Case Mappings”),
252 Subsection “Caseless Matching”.
253
254 - The Unicode case charts <https://www.unicode.org/charts/case/>.
255
256Unicode uses the
257
258 if (toCasefold(input_wchar_string) == toCasefold(pattern_wchar_string))
259
260logic for caseless matching. Consider the “LATIN LETTER I” example
261mentioned above. In a non-Turkic locale, simple case folding yields
262
263 toCasefold_simple(U+0049) = U+0069
264 toCasefold_simple(U+0069) = U+0069
265 toCasefold_simple(U+0130) = U+0130
266 toCasefold_simple(U+0131) = U+0131
267
268which leads to the following matches:
269
270 \in I i İ ı
271 pat\ ----------
272 I | Y Y n n
273 i | Y Y n n
274 İ | n n Y n
275 ı | n n n Y
276
277This is different from anything so far!
278
279In a non-Turkic locale, full case folding yields
280
281 toCasefold_full(U+0049) = U+0069
282 toCasefold_full(U+0069) = U+0069
283 toCasefold_full(U+0130) = <U+0069, U+0307>
284 toCasefold_full(U+0131) = U+0131
285
286with
287
288 0307;COMBINING DOT ABOVE;Mn;230;NSM;;;;;N;NON-SPACING DOT ABOVE;;;;
289
290which leads to the following matches:
291
292 \in I i İ ı
293 pat\ ----------
294 I | Y Y * n
295 i | Y Y * n
296 İ | n n Y n
297 ı | n n n Y
298
299This is just sad!
300
301Having toCasefold(U+0131), simple or full, map to itself instead of
302U+0069 is in contradiction with the rules of Section 5.18 of the
303Unicode Standard since toUpperCase(U+0131) is U+0049. Same thing for
304toCasefold_simple(U+0130) since toLowerCase(U+0131) is U+0069. The
305justification for the weird toCasefold_full(U+0130) mapping is
306unknown; it doesn’t even make sense to add a dot (U+0307) to a letter
307that already has one (U+0069). It would have been so simple to put
308them all in the same equivalence class!
309
310Otherwise, also consider the following problem with Unicode’s approach
311on case folding in mind. Assume that we want to perform
312
313 echo 'AßBC' | grep -i 'Sb'
314
315which corresponds to
316
317 input: U+0041 U+00DF U+0042 U+0043 U+000A
318 pattern: U+0053 U+0062
319
320Following CaseFolding.txt, applying the toCasefold() transformation to
321these yields
322
323 input: U+0061 U+0073 U+0073 U+0062 U+0063 U+000A
324 pattern: U+0073 U+0062
325
326so, according to this approach, the input should match the pattern.
327As long as the original input line is to be reported to the user as a
328whole, there is no problem (from the user’s point-of-view;
329implementation is complicated by this).
330
331However, consider both these GNU extensions:
332
333 echo 'AßBC' | grep -i --only-matching 'Sb'
334 echo 'AßBC' | grep -i --color=always 'Sb'
335
336What is to be reported in these cases, since the match begins in the
337*middle* of the original input character ‘ß’?
338
339Unicode’s toCasefold() cannot be implemented in terms of POSIX’s
340towctrans() since that can only return a single wint_t value per input
341wint_t value.
Note: See TracBrowser for help on using the repository browser.

© 2024 Oracle Support Privacy / Do Not Sell My Info Terms of Use Trademark Policy Automated Access Etiquette