Extracting nested tags with regular expressions (Regex)

U0D764b6RH · July 3, 2019, 09:57

<div class="goal" not_only_these_elements>
  <div>something here
  <div>anything here</div>
  </div>
<div>nothing here </div>
</div>
<div class="trouble">catch me is error</div>

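How can I use a regex to extract the content of this <div class="goal">.....</div> tag (which contains several nested tags)?

How can this be done in Julialang v1.1.1? An explanation would be even better, thanks.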

<div class=\"goal\"[^>]*>[^<>]*(((?'d'<div[^>]*>)[^<>]*)+((?'-d'</div>)[^<>]*)+)*(?(d)(?!))</div>

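The expression above is a lightly modified version of one I found by searching.

woclass · July 3, 2019, 10:44

To extract tag content, just go straight to an HTML parser. Regex is still rather inconvenient for this.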

Algocircle/Cascadia.jl: A CSS Selector library in Julia

ref:

Extracting and Constructing Tables from HTML Files using Julia - Stack Overflow

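U0D764b6RH · July 3, 2019, 11:39

Using an HTML parser is probably the quickest solution.

But if a similar problem comes up next time with something that isn't HTML tags, an HTML parser won't be usable, will it? _(:3

Extracting nested structures with regex works much the same across languages, so it feels like the part where I would actually learn something.

I also saw the Julia escaping issue you raised earlier. Does the expression above have escaping or other problems when used in Julia?

If that expression is completely wrong, could I trouble you to write a working one?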

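woclass · July 3, 2019, 12:47

I tried this regex in an online PCRE tester and it doesn't work.

From a quick search, it looks like backreferences are needed to solve the pairing problem. See:

Regex Tutorial - Backreferences To Match The Same Text Again

However, the example it gives doesn't seem to work well either (RegExr: Learn, Build, & Test RegEx): it only matches the nearest closing tag.
<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)<\/\1> doesn't seem to work either.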
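To make the "nearest closing tag" behaviour concrete, here is a minimal Julia sketch. The one-line input string is made up for illustration, and the `i` and `s` flags are my assumption about how the tutorial pattern is meant to be run. In Julia's `r"..."` literals `/` is not a delimiter, so it is written unescaped here.

```julia
# Made-up input: one div nested inside another
txt = "<div>outer <div>inner</div> tail</div>"

# \1 only requires the same tag NAME as the opening tag; it knows nothing about depth.
# i = case-insensitive (the pattern uses [A-Z]), s = let . match newlines too.
re = r"<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1>"is

m = match(re, txt)
println(m.match)   # <div>outer <div>inner</div>
println(m[2])      # outer <div>inner
```

The lazy `(.*?)` stops at the first `</div>` it can reach, so the outer opening tag gets paired with the inner closing tag. A backreference checks that the names agree, not that the nesting is balanced.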

I'm not very good with regex.

/ also needs to be escaped, but it's still wrong after adding that.

The problem is probably in (?'-d'</div>)
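Some background on why that part fails: `(?'-d'...)` is .NET balancing-group syntax, which PCRE (the engine behind Julia's `Regex`) does not support. PCRE does support recursion into a named group via `(?&name)`, so a rough equivalent can be sketched as below. This is only a sketch under assumptions: the tags are well formed, `class="goal"` appears literally in the opening tag, and no `<div` text hides inside comments, scripts or attribute values. The group name `div` and the variable `goal_re` are my own choices.

```julia
txt = """
<div class="goal" not_only_these_elements>
  <div>something here
  <div>anything here</div>
  </div>
<div>nothing here </div>
</div>
<div class="trouble">catch me is error</div>
"""

# The x flag allows whitespace and comments inside the pattern.
# (?&div) re-enters the group named div, so every nested <div> must find its own </div>.
goal_re = r"""
(?=<div\b[^>]*\bclass="goal")   # only start at a div whose class is "goal"
(?<div>
  <div\b[^>]*>                  # an opening div tag
  (?: [^<]++                    # plain text
    | <(?!/?div\b)[^>]*+>       # any tag other than <div ...> or </div>
    | (?&div)                   # a nested, balanced div (recursion)
  )*+
  </div>                        # the closing tag that balances the opening one
)
"""x

m = match(goal_re, txt)
println(m.match)
```

On the sample text, `m.match` covers the whole `goal` element and stops before the `trouble` div. The possessive quantifiers (`++`, `*+`) are only there to avoid heavy backtracking when the input is unbalanced.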

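Regex really isn't suited to this. It's fine for simpler text matching, but for paired tags that can also nest recursively, a parser is the better choice.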

I wrote an entire blog entry on this subject: Regular Expression Limitations

The crux of the issue is that HTML and XML are recursive structures which require counting mechanisms in order to properly parse. A true regex is not capable of counting. You must have a context free grammar in order to count.
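The "counting" mentioned in the quote can be made explicit without any regex extensions: use a plain regex only to find tag tokens, and keep the nesting depth in an ordinary variable. A minimal Julia sketch follows; `extract_goal` is a hypothetical helper name, and the code assumes well-formed tags with no `<div` text inside comments, scripts or attribute values.

```julia
# Hypothetical helper: returns the text of the first <div ... class="goal" ...> element,
# or nothing if there is no such element or it is never closed.
function extract_goal(txt::String)
    open_rng = findfirst(r"<div\b[^>]*\bclass=\"goal\"[^>]*>", txt)
    open_rng === nothing && return nothing
    start = first(open_rng)

    depth = 0
    # Walk over every opening and closing div tag from the goal tag onwards
    for m in eachmatch(r"<div\b[^>]*>|</div>", SubString(txt, start))
        depth += startswith(m.match, "</") ? -1 : 1
        if depth == 0
            # m.offset is relative to the SubString, so shift it back into txt
            stop = start + m.offset - 1 + ncodeunits(m.match) - 1
            return txt[start:stop]
        end
    end
    return nothing   # ran out of input before the depth returned to zero
end

txt = """
<div class="goal" not_only_these_elements>
  <div>something here
  <div>anything here</div>
  </div>
<div>nothing here </div>
</div>
<div class="trouble">catch me is error</div>
"""

println(extract_goal(txt))
```

The regex here is an ordinary one with no recursion; the `depth` variable does the work that a true regular expression cannot.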

Can you provide some examples of why it is hard to parse XML and HTML with a regex? - Stack Overflow

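Of course, the regex engines in use today all have extra features added to meet all sorts of odd needs (some people even consider PCRE Turing complete).

U0D764b6RH · July 3, 2019, 13:18

Thank you very much.

So it seems the engines that do implement these features provide them as extras...

What I had been reading was about balancing groups and recursive matching.

To sum up: rather than using regex, I might as well write a parser...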

woclass · July 3, 2019, 14:38

You don't necessarily have to write a parser by hand; there are ready-made frameworks now that can generate one automatically.

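On the Julia side, you can try the one written by @thautwarm, who is an expert at this.

Some examples that use automatically generated parsers

woclass · July 3, 2019, 15:17

Adding how to use Cascadia.jl: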

txt = """
<div class="goal" not_only_these_elements>
  <div>something here
  <div>anything here</div>
  </div>
<div>nothing here </div>
</div>
<div class="trouble">catch me is error</div>
"""

using Cascadia
using Gumbo

n=parsehtml(txt)
@show eachmatch(sel"div.goal", n.root)
# 1-element Array{HTMLNode,1}:
#  HTMLElement{:div}:
# <div class="goal"not_only_these_elements="">
#   <div>
#     something here
#     <div>
#       anything here
#     </div>
#   </div>
#   <div>
#     nothing here
#   </div>
# </div>

@show eachmatch(sel"div.goal > div", n.root)
# 2-element Array{HTMLNode,1}:
#  HTMLElement{:div}:
# <div>
#   something here
#   <div>
#     anything here
#   </div>
# </div>
# 
#  HTMLElement{:div}:
# <div>
#   nothing here
# </div>

@show eachmatch(sel"div.goal > div", n.root)[2].children
# 1-element Array{HTMLNode,1}:
#  HTML Text: nothing here

print(eachmatch(sel"div.goal > div", n.root)[2].children[1])
# nothing here