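Title: 2006-5-9 Exporting Data from Text Files to Access: A Complete Walkthrough, By Stabx
Body:
QUOTE:
The main technique used is regular expressions.
The tools are EditPlus (or Search and Replace) and Access 2003.
The work splits into two parts: text operations and database operations.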
("[a-zA-Z0-9\-\!]+",")
\n\1
1. Text operations
2. Access operations
CODE:
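1. Text operations
The files to import are IT glossary files. The exact number of records is unknown, roughly 150,000+ lines, split into one text file per letter, A-Z.
Processing the files one at a time would be tedious, and I don't have the patience.
So merge the text files first: open CMD, change to the directory containing them, and run "copy *.txt glossary.txt".
Open the merged file and look for patterns. Patterns make the job easy; without them it gets hard.
Rough file contents: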
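For readers without CMD at hand, the merge step can be sketched in Python. This is only an illustration: the function name `merge_text_files` and the `latin-1` encoding are assumptions, not something the original procedure specifies.

```python
from pathlib import Path


def merge_text_files(src_dir, out_name="glossary.txt", encoding="latin-1"):
    """Concatenate every *.txt file in src_dir into out_name, in sorted order.

    The encoding is an assumption; adjust it to match the real files.
    """
    src = Path(src_dir)
    out_path = src / out_name
    # Collect the inputs before creating the output, and skip the output
    # file itself so that a second run does not merge it into itself.
    parts = sorted(p for p in src.glob("*.txt") if p.name != out_name)
    with out_path.open("w", encoding=encoding, newline="") as out:
        for part in parts:
            text = part.read_text(encoding=encoding)
            out.write(text)
            # Make sure the next file starts on a fresh line.
            if text and not text.endswith("\n"):
                out.write("\n")
    return out_path
```

Calling `merge_text_files(".")` in the glossary directory would produce a single `glossary.txt` there.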
---/---------------------------------------------------------------------------------
ASPI
{Advanced SCSI Peripheral Interface}
ASPIK
<language, specification> A multiple-style specification
language.
["Algebraic Specifications in an Integrated Software
Development and Verification System", A. Voss, Diss, U
Kaiserslautern, 1985].
(1994-11-30)
Aspirin
<language, tool> A {freeware} language from {MITRE Corp} for
the description of {neural network}s. A compiler, bpmake, is
included. Aspirin is designed for use with the {MIGRAINES}
interface.
{Version: 6.0 (ftp://ftp.cognet.ucla.edu/alexis/)}
(1995-03-08)
ASPLE
<language> A {toy language}.
["A Sampler of Formal Definitions", M. Marcotty et al,
Computing Surveys 8(2):191-276 (Feb 1976)].
(1995-02-08)
---/---------------------------------------------------------------------------------
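Very regular, right? That makes things easy.
What I need is the title and the content.
The target format is "title","content , so the pattern to produce is "*","
So: quote the titles first, then remove the line breaks, then put a line break before each term. That is all the text processing involves.
Replace the titles:
In EditPlus, press CTRL+H.
Find (regular expression): (^[a-zA-Z0-9\-\!]+.*)
Replace with: "\1","
Run the replacement.
Result: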
---/---------------------------------------------------------------------------------
"ASPI","
{Advanced SCSI Peripheral Interface}
"ASPIK","
<language, specification> A multiple-style specification
language.
["Algebraic Specifications in an Integrated Software
Development and Verification System", A. Voss, Diss, U
Kaiserslautern, 1985].
(1994-11-30)
"Aspirin","
<language, tool> A {freeware} language from {MITRE Corp} for
the description of {neural network}s. A compiler, bpmake, is
included. Aspirin is designed for use with the {MIGRAINES}
interface.
{Version: 6.0 (ftp://ftp.cognet.ucla.edu/alexis/)}
(1995-03-08)
"ASPLE","
<language> A {toy language}.
["A Sampler of Formal Definitions", M. Marcotty et al,
Computing Surveys 8(2):191-276 (Feb 1976)].
(1995-02-08)
---/---------------------------------------------------------------------------------
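The title replacement can also be tried outside the editor. Here is a minimal Python sketch of the same find/replace pair. It assumes that term lines start in the first column while definition lines are indented (the spaces after each `<br/>` in the later results suggest this). If definition lines are not indented, the pattern would quote them too. The name `quote_terms` is hypothetical.

```python
import re

# Same idea as the EditPlus pattern: a line that begins with a letter,
# digit, '-' or '!' is treated as a term line. re.MULTILINE makes ^ and $
# match at every line boundary instead of only at the ends of the text.
TERM_LINE = re.compile(r'^([a-zA-Z0-9\-!]+.*)$', re.MULTILINE)


def quote_terms(text):
    """Wrap each term line as "term"," and leave the other lines untouched."""
    return TERM_LINE.sub(r'"\1","', text)


if __name__ == "__main__":
    sample = (
        "ASPI\n\n"
        "    {Advanced SCSI Peripheral Interface}\n\n"
        "ASPLE\n\n"
        "    <language> A {toy language}.\n"
    )
    print(quote_terms(sample))
```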
Remove the line breaks. This is mainly so the data imports into Access correctly.
Find (regular expression): \n
Replace with: <br/>
Result:
---/---------------------------------------------------------------------------------
"ASPI","<br/><br/> {Advanced SCSI Peripheral Interface}<br/><br/>"ASPIK","<br/><br/> <language, specification> A multiple-style specification<br/> language.<br/><br/> ["Algebraic Specifications in an Integrated Software<br/> Development and Verification System", A. Voss, Diss, U<br/> Kaiserslautern, 1985].<br/><br/> (1994-11-30)<br/><br/>"Aspirin","<br/><br/> <language, tool> A {freeware} language from {MITRE Corp} for<br/> the description of {neural network}s. A compiler, bpmake, is<br/> included. Aspirin is designed for use with the {MIGRAINES}<br/> interface.<br/><br/> {Version: 6.0 (ftp://ftp.cognet.ucla.edu/alexis/)}<br/><br/> (1995-03-08)<br/><br/>"ASPLE","<br/><br/> <language> A {toy language}.<br/><br/> ["A Sampler of Formal Definitions", M. Marcotty et al,<br/> Computing Surveys 8(2):191-276 (Feb 1976)].<br/><br/> (1995-02-08)
---/---------------------------------------------------------------------------------
Now add line breaks back so that each record sits on its own line.
Find (regular expression): ("[a-zA-Z0-9\-\!]+",")
Replace with: \n\1
Result:
---/---------------------------------------------------------------------------------
"ASPI","<br/><br/> {Advanced SCSI Peripheral Interface}<br/><br/>
"ASPIK","<br/><br/> <language, specification> A multiple-style specification<br/> language.<br/><br/> ["Algebraic Specifications in an Integrated Software<br/> Development and Verification System", A. Voss, Diss, U<br/> Kaiserslautern, 1985].<br/><br/> (1994-11-30)<br/><br/>
"Aspirin","<br/><br/> <language, tool> A {freeware} language from {MITRE Corp} for<br/> the description of {neural network}s. A compiler, bpmake, is<br/> included. Aspirin is designed for use with the {MIGRAINES}<br/> interface.<br/><br/> {Version: 6.0 (ftp://ftp.cognet.ucla.edu/alexis/)}<br/><br/> (1995-03-08)<br/><br/>
"ASPLE","<br/><br/> <language> A {toy language}.<br/><br/> ["A Sampler of Formal Definitions", M. Marcotty et al,<br/> Computing Surveys 8(2):191-276 (Feb 1976)].<br/><br/> (1995-02-08)
---/---------------------------------------------------------------------------------
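The last two replacements (line breaks to `<br/>`, then one record per line) can be sketched together in Python. This is an illustration, not the author's tooling, and `flatten_and_split` is a hypothetical name. One caveat carries over from the patterns themselves: the title step accepts anything after the first token (`.*`), while the record-start pattern only accepts a single token between the quotes, so a multi-word title would not get its own line.

```python
import re

# Mirrors the find pattern ("[a-zA-Z0-9\-\!]+",") used above.
RECORD_START = re.compile(r'("[a-zA-Z0-9\-!]+",")')


def flatten_and_split(quoted_text):
    """Turn every line break into <br/>, then start each record on a new line."""
    flat = quoted_text.replace("\n", "<br/>")
    # In a replacement template, \n is a newline and \1 is the matched record start.
    split = RECORD_START.sub(r'\n\1', flat)
    # The first record also receives a leading newline; drop it.
    return split.lstrip("\n")


if __name__ == "__main__":
    quoted = (
        '"ASPI","\n\n {Advanced SCSI Peripheral Interface}\n\n'
        '"ASPLE","\n\n <language> A {toy language}.'
    )
    for line in flatten_and_split(quoted).split("\n"):
        print(line)
```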
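That completes the text operations; what remains is importing into Access.
Of course, these few sample records take no time at all to replace, but running the replacements over more than a hundred thousand lines is rough going.
EditPlus is also not very robust here: with too much data it cannot replace everything in one pass, so the replacement has to be run several times.
In fact, I could not complete the job with EditPlus at all; after switching to Search and Replace it went through easily.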
---/---------------------------------------------------------------------------------
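The same steps in Search and Replace
1. Replace the titles
Search: ^[a-zA-Z0-9\.\-\!]+
Replace: "%1%2%3","
2. Remove the line breaks
Search: \n
Replace: <br/>
3. Add a line break before each entry
Search: <br/>"[a-zA-Z0-9\.\-\!]+","<br/>
Replace: <br/>\n"%1%2%3","<br/>
After processing with Search and Replace, the final file is 12739 lines.
That is 7640 entries. Quite a lot: more than seven thousand terms.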
---/---------------------------------------------------------------------------------
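2. Access operations
Open the database, right-click an empty area, choose Import, and pick the file type option that includes *.txt.
If there are no encoding problems, accept all the defaults.
After the import I opened the Glossary table and found 29 fields. Apart from Field1 and Field2, though, all of them were Null.
After deleting the unneeded fields, three remained: ID, Field1, and Field2. That is exactly what I wanted.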
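The extra fields probably come from commas and double quotes inside the definitions, which a delimited-text import can read as additional columns; that is a guess, not something verified here. As an alternative sketch, Python's `csv` module can write the two columns with embedded quotes doubled and every field closed. The names `parse_entries` and `to_csv` are hypothetical, the parser assumes indented definition lines as before, and whether the Access 2003 import wizard accepts the result is untested.

```python
import csv
import io
import re

# A term line starts in the first column with a letter, digit, '-' or '!'.
TERM_LINE = re.compile(r'^[a-zA-Z0-9\-!]+.*$')


def parse_entries(text):
    """Split glossary text into (term, definition) pairs.

    Definition lines are joined with <br/>, as in the manual procedure.
    """
    entries = []
    term, body = None, []
    for line in text.splitlines():
        if TERM_LINE.match(line):
            if term is not None:
                entries.append((term, "<br/>".join(body)))
            term, body = line, []
        elif term is not None:
            body.append(line)
    if term is not None:
        entries.append((term, "<br/>".join(body)))
    return entries


def to_csv(entries):
    """Render the pairs as CSV text with every field quoted."""
    buf = io.StringIO()
    writer = csv.writer(buf, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerows(entries)
    return buf.getvalue()
```

Writing `to_csv(parse_entries(text))` to a file gives one record per line with exactly two quoted fields.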
''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
'
' subject : 2006-5-9 Exporting Data from Text Files to Access: A Complete Walkthrough, By Stabx
'
' writer : Stabx<shawl.qiu@gmail.com>
'
' blog : http://blog.csdn.net/btbtd \ http://btbtd.exblog.jp/
'
' blog/site : Phoenix.GI - P.GI / \ 绿色学院 - Green Institute
'
' date : 2006-5-10
'
''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''