Slicing strings that contain special characters

RexWzh · June 5, 2022, 08:26

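How do I slice a string that contains special characters in Julia? For example, taking everything in the string "Sandić>" up to its second-to-last character: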

txt = "Sandić>"
txt[1:end-1]

Error message:

StringIndexError: invalid index [7], valid nearby indices [6]=>'ć', [8]=>'>'

Stacktrace:
 [1] string_index_err(s::String, i::Int64)
   @ Base ./strings/string.jl:12
 [2] getindex(s::String, r::UnitRange{Int64})
   @ Base ./strings/string.jl:263
 [3] top-level scope
   @ In[100]:1
 [4] eval
   @ ./boot.jl:373 [inlined]
 [5] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)
   @ Base ./loading.jl:1196

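The cause is that ć is a non-ASCII Unicode character, which takes 2 bytes in UTF-8, so byte index 7 falls in the middle of it. This is similar to the problem in this post.

The question now: how should strings like this be sliced?
The indirect approaches I can think of are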

join(collect(txt)[1:end-1])

or, for this particular string,

txt[1:length(txt)-1] # happens to work here, but fails for "Sandićć>"
rstrip(txt, '>') # only works for this specific pattern

xgdgsc · June 5, 2022, 08:45

https://docs.julialang.org/en/v1/manual/strings/
nextind
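
For reference, a minimal sketch of how the index functions on that manual page could be applied to the original example. Picking `prevind`, `chop`, and `first` is an assumption about what the hint points at; all three are in Base and work on character boundaries rather than raw byte offsets.

```julia
txt = "Sandić>"

# Step back one character from the last valid index, then slice by byte index.
a = txt[1:prevind(txt, lastindex(txt))]   # "Sandić"

# chop drops the last character (it returns a SubString).
b = chop(txt)                             # "Sandić"

# first(s, n) takes the first n characters, counted as characters, not bytes.
c = first(txt, length(txt) - 1)           # "Sandić"

# The same calls stay valid when more multi-byte characters are present.
d = chop("Sandićć>")                      # "Sandićć"
```

Of the three, the `prevind` form is the most general, since the same pattern works for any slice endpoint and not only for dropping the final character.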

RexWzh · June 14, 2022, 03:53

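For slicing strings with Unicode characters, my strategy is to use collect to turn the string into a Vector{Char}, slice the array, and then join the result back into a string. This is an indirect approach, but it is the only one I could come up with.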

I tried nextind, and as far as I can tell it does not by itself solve the Unicode slicing problem.

How nextind works: nextind(s, i) returns the start index of the character that follows the one containing byte position i. For example:

s = "αβγμ" # each character occupies 2 bytes
s[nextind(s, 0)] # returns 'α'
s[nextind(s, 1)] # returns 'β'
s[nextind(s, 2)] # returns 'β'
s[nextind(s, 3)] # returns 'γ'
s[nextind(s, 4)] # returns 'γ'
s[nextind(s, 5)] # returns 'μ'
s[nextind(s, 6)] # returns 'μ'

So nextind only gives the start index of the character adjacent to a given position.
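
That said, the three-argument form nextind(s, i, n) advances n characters at once, so nextind(s, 0, n) is the byte index where the n-th character starts. That is enough to build a character-based slice. A minimal sketch, where `charslice` is a hypothetical helper name (not part of Base) and 1 ≤ i ≤ j ≤ length(s) is assumed:

```julia
# charslice is a hypothetical helper, not part of Base.
# nextind(s, 0, n) returns the byte index where the n-th character starts,
# and s[a:b] includes the whole character that starts at b.
function charslice(s::AbstractString, i::Integer, j::Integer)
    return s[nextind(s, 0, i):nextind(s, 0, j)]
end

s = "αβγμ"
r1 = charslice(s, 2, 3)            # "βγ"
r2 = charslice("Sandić>", 1, 6)    # "Sandić"
```

Each call scans from the start of the string, so this costs time proportional to j. For many slices of the same string, converting once with collect may still be the simpler choice.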

RexWzh · June 14, 2022, 04:13

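I ran into this while doing NER (named-entity recognition): using a prefix tree (trie) to find the substrings of a string that are words in a dictionary.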

"Prefix tree (trie)"
Base.@kwdef mutable struct PrefixTree
    isend::Bool = false # true if a dictionary word ends at this node
    children::Dict{Char,PrefixTree} = Dict{Char,PrefixTree}()
end

"Add a word to the prefix tree"
function add_node!(node::PrefixTree, word::String)::Nothing
    for c in word
        children = node.children
        haskey(children, c) || (children[c] = PrefixTree())
        node = children[c]
    end
    node.isend = true
    return nothing
end

"Search a string for dictionary words"
function search_valid_word(node::PrefixTree, word::String)
    res, n = String[], length(word)
    for i in 1:n
        # search word[i:end]
        dict = node
        for j in i:n
            haskey(dict.children, word[j]) || break # no path continues through this character
            dict = dict.children[word[j]] # move to that node
            dict.isend && push!(res, word[i:j])
        end
    end
    res
end

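The function search_valid_word indexes and slices the string by character count, so it can throw on non-ASCII Unicode characters. I ended up fixing it with the collect approach mentioned earlier: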

function search_valid_word(node::PrefixTree, word::String)
    chars = collect(word) # Vector{Char}, so positions count characters, not bytes
    res, n = String[], length(chars)
    for i in 1:n
        # search chars[i:end]
        dict = node
        for j in i:n
            haskey(dict.children, chars[j]) || break # no path continues through this character
            dict = dict.children[chars[j]] # move to that node
            dict.isend && push!(res, join(chars[i:j]))
        end
    end
    res
end
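
The collect version allocates a character array and rebuilds each match with join. An alternative sketch walks the string through its valid byte indices instead, using eachindex and nextind, so the matches can be sliced directly from the original string. `search_valid_word_idx` is a hypothetical name; minimal copies of the PrefixTree and add_node! definitions are included only so that the block runs on its own.

```julia
Base.@kwdef mutable struct PrefixTree
    isend::Bool = false
    children::Dict{Char,PrefixTree} = Dict{Char,PrefixTree}()
end

function add_node!(node::PrefixTree, word::String)::Nothing
    for c in word
        children = node.children
        haskey(children, c) || (children[c] = PrefixTree())
        node = children[c]
    end
    node.isend = true
    return nothing
end

# Hypothetical index-based variant of search_valid_word.
function search_valid_word_idx(node::PrefixTree, word::String)
    res = String[]
    for i in eachindex(word)              # only indices where a character starts
        dict = node
        j = i
        while j <= lastindex(word)
            c = word[j]
            haskey(dict.children, c) || break   # no path continues through c
            dict = dict.children[c]             # move to that node
            dict.isend && push!(res, word[i:j]) # i and j are both valid indices
            j = nextind(word, j)                # jump to the next character start
        end
    end
    return res
end

tree = PrefixTree()
add_node!(tree, "San")
add_node!(tree, "ić")
found = search_valid_word_idx(tree, "Sandić>")   # ["San", "ić"]
```

Because i and j are always character start positions, word[i:j] never lands in the middle of a multi-byte character.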

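What I don't understand is why Julia indexes and slices strings by encoded byte length rather than by character count. Is it tied to how String is stored, or is there some other design consideration?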
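
As background to this question, the manual page linked earlier in the thread describes String as UTF-8 encoded and indexed by code unit (byte). Characters have variable width, so not every byte index is the start of a character. The sketch below only inspects the example string with Base functions; the variable names are illustrative.

```julia
txt = "Sandić>"

nchars  = length(txt)               # 7 characters
nbytes  = ncodeunits(txt)           # 8 bytes, because 'ć' takes two in UTF-8
starts  = collect(eachindex(txt))   # [1, 2, 3, 4, 5, 6, 8], the valid indices
midchar = isvalid(txt, 7)           # false: byte 7 is the second byte of 'ć'
bytes   = collect(codeunits("ć"))   # [0xc4, 0x87]
```

Indexing by byte lets s[i] jump straight to a memory offset, whereas finding the n-th character of a variable-width encoding requires scanning from the start. That is why length(txt) and lastindex(txt) differ here.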