Monkey on Oracle: March 2010

Monday, March 29, 2010

CKEditor Image Browser - APEX Style

FCKEditor has been used with Oracle APEX for a long while, in APEX terms it is known as HTML Editor Standard and HTML Editor Minimal items. In APEX 4.0 it looks like the new CKEditor is included, called Rich Text Editor item. It actually looks like you are given a choice between the old FCKEditor and the new CKEditor, which is a nice touch for those that have customized FCKEditor in earlier versions.

If you already have FCKEditor configured with image browsing in you current install, I would not sweat through integrating CKEditor unless you have good reason. Editors are known to change, and this particular will be included in APEX 4.0.

The problem with CKEditor native image browser is the default parameters it includes when launching a pop-up window. Additional parameters does not combine very well with APEX, to say the least. In this post I will give an example on how to adapt the native CKEditor image editor to include your own APEX image browser. This is similar to the earlier article by Carsten Czarski found here in German, or a translated version here. I will not detail how to create your own image browser, but focus on how to change the CKEditor.

Beware: I am not a Javascript expert. I am not a CKEditor expert. This is basically just a documentation of my experience of adapting CKEditor.

Prerequisites
To follow the example you should:

Download and extract ckeditor_3.2.zip to a folder named ckeditor
Download ckpackager.exe and place it under the ckeditor directory
Map the ckeditor directory in your webserver so it can be reached with a web browser. I will use "/js/ckeditor" in this example.

CKEditor comes complete with examples, source code and documentation (albeit a bit lacking). CKPackager is not as streamlined, but browsing around the subversion repository you get the general idea. In short CKPackager takes care of putting together ckeditor.js and ckeditor_basic.js after you have fiddled with the source code. I chose the exe-version of CKPackager as I do my development in Windows, but there is a jar-file available as well.

Create an APEX page to serve as an image browser/picker (it does not have to do anything yet, we'll get to that).

Give the page an alias: IMAGE_PICKER
Create an APEX page with a text area item
Put the following code in the HTML-header of the APEX page:

<script type="text/javascript" src="/js/ckeditor/ckeditor.js"></script>
   <link href="/js/ckeditor/bouvetckeditor.css" rel="stylesheet" type="text/css" />

Put the following code in the Post Element Text of the text area item:

<script type="text/javascript">
   //<![CDATA[
      // Replace the <textarea id="editor"> with an CKEditor
      // instance, using default configurations.
      CKEDITOR.replace( 'P12_TEXT',
                        {filebrowserImageBrowseUrl : 'f?p=&APP_ID.:IMAGE_PICKER:&SESSION.::NO::',
                        width: '500'
                        });
   //]]>
   </script>

Replace "/js/ckeditor" to match your own configuration
Replace "P12_TEXT" to match the item name of your text area item.

The Problematic URL
You should now be able to run the page described above, and have CKEditor magically appear where your text area item was supposed to be. When you click the image button in the CKEditor toolbar, the image dialog appears, but when you click the "Browse Server" button, the fun begins. The pop-up window spawns with an URL that has three extra parameters attached.

CKEditor which contains the item name
CKEditorFuncNum which contains som mystic numeric ID
langCode which contains the language code set for the CKEditor instance

APEX will blow up when this URL is tried (as f has no parameters matching the three). Actually APEX won't blow, because f will never be reached as mod_plsql won't recognize any procedure with the signature required. You get the idea.

Changing CKEditor Source Code
To adapt the URL to APEX style, you must get your hands dirty with Javascript. Actually I think the code of CKEditor looks quite clean, it's just so much of it :-) The source code is located in the "_source" directory, open the file "ckeditor/_source/plugins/filebrowser/plugin.js" in your favourite Javascript editor (or TextPad in my case).

In this example I will keep the original parameters (and parameter names), but if you want to change them, they are located in the browseServer function.

The addQueryString function accepts the URL and parameters, and builds the complete URL. Change the function to look something like this:

function addQueryString( url, params )
        {
                var paramString = [];
                var queryString = [];

                if ( !params )
                        return url;
                else
                {
                        for ( var i in params ) {
                                paramString.push( i );
                                queryString.push( encodeURIComponent( params[ i ] ) );
                        }
                }

                return url + paramString.join( "," ) + ":" + queryString.join( "," );
        }

The function will now add the parameter names before the colon symbol, and the parameter values after. All in the same order, all separated by a comma. APEX style :-)

Re-Packing CKEditor
Changing the code is not enough, your changes have not reached ckeditor.js yet. This is what we need the CKPackager for. It collects the source, and builds a new ckeditor.js based on the current source.

Start a command line window and browse to the ckeditor directory
Execute CKPackager like this: "CKPackager.exe ckeditor.pack -v" (or: "java -jar CKPackager.jar ckeditor.pack -v")
A new version of ckeditor.js and ckeditor_basic.js will be generated for you

ckeditor.pack contains reference to the files to be included into the ckeditor.js and ckeditor_basic.js files. The switch -v will give a verbose execution.

Reload the APEX page containing the CKEditor item (a proper reload to ensure the new version of the ckeditor.js file is loaded), and this time the image picker APEX page should pop-up, so to speak.

Returning From APEX Pop-up Window
The only thing missing now is the value to return to the CKEditor image dialog. This requires som additional code in your Apex application:

First create three application level items (unrestricted) named:

CKEDITOR
CKEDITORFUNCNUM
LANGCODE

These items will contain the parameters recieved from CKEditor.

Next, modify your image picker APEX page:

Create a new submit button on the page, and name it ADD_IMAGE, accept the rest as defaults.
Create a new After Submit Page Process, type PL/SQL, and call it close_popup
Paste the code below into the PL/SQL Process code area:

declare
      l_file_url varchar2(4000);
   begin
      l_file_url := '/path/of/image/imagename.jpg';
      htp.p('<body>');
      htp.p('<script type="text/javascript">');
      htp.p('window.opener.CKEDITOR.tools.callFunction( '||:ckeditorfuncnum||', '''||l_file_url||''');');
      htp.p('window.close();');
      htp.p('</script>');
      htp.p('</body>');
   end;

This bit of code will call the appropriate CKEditor return function, and close the pop-up window. Modify the code to suite your needs, but beware not to put any code below the last htp.p-call.

Testing the Code
To test the code so far:

Run the page containing the CKEditor item.
Click the image button in the CKEditor toolbar
Click the Browse Server button
A pop-up window with your image picker APEX page should now appear
Click the Add Images button on your APEX page, and the CKEditor image dialog URL should now reflect the URL set from your APEX image picker page

Modifying CKEditor Image Browser Parameters
So far you have a complete working example with CKEditor image browser and Oracle APEX, but what if you want to modify the parameters themselves? Earlier you modified the function assembling the complete URL based on the defatult parameters, to change the parameters you must modify the browseServer-function in the same file as before ("ckeditor/_source/plugins/filebrowser/plugin.js"). The only parameter the CKEditor really needs when returning image source, is the image source (duh) and CKEditorFuncNum.

Here is an example of the browseServer-function when parameters CKEditor and langCode is removed, and the current APEX-page id added as a parameter called "callingPage":

function browseServer( evt )
   {
      var dialog = this.getDialog();
      var editor = dialog.getParentEditor();

      editor._.filebrowserSe = this;

      var width = editor.config[ 'filebrowser' + ucFirst( dialog.getName() ) + 'WindowWidth' ]
          || editor.config.filebrowserWindowWidth || '80%';
      var height = editor.config[ 'filebrowser' + ucFirst( dialog.getName() ) + 'WindowHeight' ]
          || editor.config.filebrowserWindowHeight || '70%';

      var params = this.filebrowser.params || {};
      params.CKEditorFuncNum = editor._.filebrowserFn;
      
      // New parameter to get APEX pageId
      params.callingPage = $v('pFlowStepId');

   var url = addQueryString( this.filebrowser.url, params );
      editor.popup( url, width, height );
   }

To generate a new version of CKEditor, use the CKPackager as described earlier in the post. The parameters in the URL uses the same (APEX friendly) construct as before (pName,pName:pValue,pValue), but the number and names of the parameters have changed. The new pop-up URL should now look something like this: "http://<yourserver>/pls/apex/f?p=<app_id>:IMAGE_PICKER:<sessionid>::NO::CKEditorFuncNum,callingPage:<X>,<pageid>".

A Word of Caution
When hacking the source like this, and not making your own plugin, guess what happens when you need to upgrade to the next version of CKEditor? When you push the toothbrush too far back, you have to start all over again.

This hack is in the filebrowser-plugin, which in turn is used by the Flash- and Link-dialogs, so pay attention to changes there as well.

Not as sleek as I would like it to be, but if moving to CKEditor is on your list, this is a way to achieve just that.

As always, use it at your own risk!

Enjoy :-)

Tuesday, March 2, 2010

DOCX Part II: How to Index Document Content With Oracle Text

Here I will demonstrate how to index content of CLOB data using Oracle Text. Other than the database objects used in the example, the post addresses the use of Oracle Text in a general way, and will work for any CLOB column. There is no ground breaking code here, but it completes the example on how to handle Microsoft Open Office XML Format (DOCX) documents. The examples and content here are based on my previous post.

The code here is tested with Oracle database release 11.1.0.6, but should work with releases down to 10g. When I say "tested", remember I am a developer, so it basically means it does compile and it did give the expected result on one run trough...

See it in Action
I have included a search page in my demo application. Upload a DOCX document, and try it for yourself: Go to demo application.

About Oracle Text
Oracle Text (formerly Oracle interMedia) is Oracles full-text retrieval technology, and part of Standard, Enterprise and even XE editions of the database. See Oracle Text on oracle.com for more information.

In my case I will use it to index text content stored in a CLOB column, and utilize it's search capabilities to enable users to retrieve relevant results based on search phrases.

Indexing CLOBs
Or BLOBs, or BFILEs, for that matter. It is very easy to get up and running with a text index. The following code is all that is needed to get a text index up and running against the TEXT_CONTENT column of the SAMPLES_DOCX table:

create index samples_docx_ctx on samples_docx
(text_content)
indextype is ctxsys.context
/

That's it! If you want to get fancy, there are a number of options and ways to enhance the indexing, look at the documentation for more information on the subject (lexers, sections, stopword lists, etc.).

Searching with Oracle Text
With the text index in place, the next step is to use the index to retrieve search results. In my example I will use a technique called "Progressive Relaxation" as described by Roger Ford here. It uses a query template to implement a progressive relaxation of the search tokens.

The following script creates a package to handle and use the new text index:

create or replace package samples_docx_search_p
as
   -- result types
   type t_result_rec is record (
      score         number,
      filename      samples_docx.filename%type,
      snippet       varchar2(4000));
   type t_result_tab is table of t_result_rec;
   -- synchronize search index
   procedure sync_search_index (
      p_ctx_index_name     in          varchar2 default 'SAMPLES_DOCX_CTX'
   );
   -- search docx content
   function search_text (
      p_tokens    in          varchar2
   ) return t_result_tab pipelined;
end;
/

create or replace package body samples_docx_search_p
as
   -- synchronize search index
   procedure sync_search_index (
      p_ctx_index_name     in          varchar2 default 'SAMPLES_DOCX_CTX'
   ) is
   begin
      ctx_ddl.sync_index(p_ctx_index_name, '2M');
   end;
   -- search docx content
   function search_text (
      p_tokens    in          varchar2
   ) return t_result_tab pipelined
   as 
      l_max_rows         integer := 10;
      l_counter          integer := 0;
      l_ret_rec          t_result_rec;
      l_tokens           varchar2(4000) := lower(p_tokens); 
      l_query_template   varchar2(32000):='<query>'
                              ||chr(10)||'   <textquery> heregoesthetokens'
                              ||chr(10)||'     <progression>'
                              ||chr(10)||'       <seq><rewrite>transform((TOKENS, "{", "}", " "))</rewrite></seq>'
                              ||chr(10)||'       <seq><rewrite>transform((TOKENS, "{", "}", "AND"))</rewrite>/seq>'
                              ||chr(10)||'       <seq><rewrite>transform((TOKENS, "?{", "}", "AND"))</rewrite></seq>'
                              ||chr(10)||'       <seq><rewrite>transform((TOKENS, "{", "}", "OR"))</rewrite></seq>'
                              ||chr(10)||'       <seq><rewrite>transform((TOKENS, "?{", "}", "OR"))</rewrite></seq>'
                              ||chr(10)||'     </progression>'
                              ||chr(10)||'   </textquery>'
                              ||chr(10)||'</query>';
      l_query            varchar2(32000);
   begin
      if l_tokens is null
      then
         return;
      end if;
      -- create query temlpate based on tokens
      l_query_template := replace(l_query_template, 'heregoesthetokens', l_tokens);
      -- restructure tokens for use with snippet
      l_tokens := replace(l_tokens, ' ', ' | ');
      for c in (select rowid
                     , filename 
                  from samples_docx 
                 where contains (text_content, l_query_template, 1) > 0)
      loop
         -- crude custom score, just listing the results as they come
         l_ret_rec.score := l_counter;
         l_ret_rec.filename := c.filename;
         -- create snippet (with tokenbased highlight) of content to return to the user
         ctx_doc.set_key_type('ROWID');
         l_ret_rec.snippet :=  ctx_doc.snippet('SAMPLES_DOCX_CTX', c.rowid, l_tokens, '<b>', '</b>', false);
         l_counter := l_counter + 1;
         -- return row
         pipe row (l_ret_rec);
         -- exit when max number of rows exceeded
         exit when l_counter >= l_max_rows;
      end loop;
   end;
end;
/

Some parts of the code are more cryptic than others.

First up is the "SYNC_SEARCH_INDEX" procedure. It's sole purpose is to synchronize the search index with the TEXT_CONTENT column. When you perform DML on the TEXT_CONTENT column, changes to the text index does not automatically propagate to the search index (unless the index is explicitly told to do so). So, in this example, if you insert a new row, the TEXT_CONTENT will not show up through Oracle Text searches until you have told it to synchronize the index. Indexing can be a resource demanding operation, so it makes sense to separate table DML from the indexing job.

In short: Make sure you synchronize the text index after DML on the indexed column has been performed.

Next is the "SEARCH_TEXT" function, particularly the query template. In the past I have used Oracle Text in its most simple form, but query templates adds a new dimension to controlling your searches. Basically you define how you want to use the search tokens (search phrase) to find matches in the text index. It is written in XML-form, and the <seq>-tag states the order of which you want to process your criteria. For more information on query templates, take a look at the documentation.

A short explanation on the query template:

<rewrite>transform((TOKENS, "{", "}", " "))</rewrite>: Matches when whole phrase (tokens in order) is present within document
<rewrite>transform((TOKENS, "{", "}", "AND"))</rewrite>: Matches when each token is present within document
<rewrite>transform((TOKENS, "?{", "}", "AND"))</rewrite>: Matches when each token is present within docoument, allowing typos (fussy search)
<rewrite>transform((TOKENS, "{", "}", "OR"))</rewrite>: Matches when any of the tokens are present within document
<rewrite>transform((TOKENS, "?{", "}", "OR"))</rewrite>: Matches when any of the tokens are present within document, allowing typos (fussy search)

Although it will return matches according to the sequence, the "score" within the sequence will be random (as far as I could see at least), so the "score" column in the example above needs a bit of work. I also learned the hard way that structure of the XML in the query template is not validated, and it swallows typos without a word.

Test the Code
You can test the code directly from SQL*Plus or your favorite weapon of choice. The following code will search for "test":

select filename
,      snippet
from   table(samples_docx_search_p.search_text('test'))
order  by score
/

Wildcards
My query template does not support searching for parts of words, there are some reasons for this behavior.

Wildcard searches requires more resources
The results returned will be unpredictable. Imagine searching for "ora*", both "oracle" and "orange" will match.
Wildcard searches are more suitable for look-ups in structured data

What did I mean by that last statement? As I see it, users today are using a wide range of search engines and site searches every day, and wildcard matching are in most cases available only through the "advanced search"-link (that is rarely used, if ever). Let alone that people do not expect this kind of behavior anymore. If you want to help the user, add ajax autocomplete to your search field, and use the Oracle Text index. In the traditional back office application, it is another matter (look-ups and such).

Create a Search Page in Oracle APEX
I will not do a detailed explanation here, but creating a report based on a query will do the trick. If you choose "Yes" for "Enable search" in the wizard it will save you some work by creating the search dialog on the page. Clean up the query in the report region so that tour query looks something like this:

select filename
,      snippet
from   table(samples_docx_search_p.search_text(:PXX_REPORT_SEARCH))
order  by score

When keeping the order by, I assume that you will let the score column determine the result order. For good measure you can create a row template for the result representation in the APEX report so you can have heading and snippet spread over two rows.

Update Sample Upload Page
If you created the upload page described in my previous post, include a new page process (After Submit, and after the "extract_text"-process). The process should be of PL/SQL-type, and the code should look something like this:

samples_docx_search_p.sync_search_index;

It does not necessary have to be a page process, but it is important to synchronize the text index after DML.

How to Handle DOC and DOCX
...And PDF, XLS, PPT, etc. Well, the DOCX is the exception here. Oracle Text does not understand the format yet (unless you are running on release 11.1.0.7 or higher). If you create an Oracle Text index on a BLOB column containing other file types which it does understand, they will be indexed automagically. If you want to index the content of a DOCX document in the same index, use the technique described in my last post, convert the output (plain text) into a BLOB, and insert the BLOB into the table with the other documents. Or said in a different way: you except and handle DOCX documents differently than the known document types.

Querying an Oracle Text index works the same for BLOB and CLOB columns.

There is definitely room for improvement in the code, but it serves well enough for demonstration purposes.

Enjoy :-)

Monday, March 1, 2010

DOCX Part I: How to Extract Document Content as Plain Text

In this post I will demonstrate how to extract plain text from Microsoft Open Office XML Format (DOCX) documents stored as a BLOB data type in an Oracle database. It is part one of two posts concerning how to extract and index unstructured content stored in DOCX files. I found snippets of solutions several places, but no complete copy-paste solution, so perhaps this will save some of you a bit of work.

The solution is tested on Oracle database Enterprise Edition release 11.1.0.6 with Apex 3.2.1 running on Windows XP, but should work down to 10g Standard Edition (not Oracle XE, since part of the code is in Java).

See it In Action
I have updated my demo application with a couple of pages so you can try it out yourselves: Go to demo application. There is a limited space available at apex.oracle.com, so please take care when uploading (and it is very easy to test in your own environment).

When will You Ever Need This?
If you have an Oracle database version prior to 11.1.0.7, Oracle Text is not able to index DOCX documents (as it does DOC documents). From version 11.1.0.7 and on, Oracle uses technology from Stellent, and DOCX is indexed as the other formats. So if you want to index DOCX text content in Oracle Text in a version prior to 11.1.0.7, then this could be a way to do it.

Oracle is working on a back-port to 10.x, but I have no status when (if) this will be available. Microsoft Office 2007 has been around since.. Take a wild guess! So these things obviously takes some time.

About Microsoft Open Office XML Format
As seen from the eyes of a PL/SQL developer, that is. First of all, XML sounds promising, you can do a ton of things with a valid XML in an Oracle database. You get the first nasty surprise when opening a DOCX-document in your favorite text editor; it is a zip archive! The next is when you realize that utl_compress can't help you uncompressing it either.

So Google next, and realizing this cannot be easily done in a pure PL/SQL solution, Google yields this gem from Ted Neward. It is a DOCX (Open XML) walk through as seen from the eyes of a Java developer. Very educational.

How to Unzip in PL/SQL
You probably can, but that would probably also involve a lot of work. The easy way is to take hold of a Java method that already does what you want. After searching the net I came up with this post from peterv6i which does exactly what I want (and more, it can also add files (BLOB) to an existing zip archive).

My short version of this (as I only need to extract content), follows below. The script creates a java class with a method to extract a file from a zip archive, and a PL/SQL wrapper to the getFileFromZip method. The database user must have JAVA_DEPLOY and JAVAUSERPRIV (I think) roles.

create or replace java source named "ZIPImpl" 
AS 
import java.io.*; 
import java.util.zip.*; 
import java.sql.*; 
import oracle.sql.*; 
public class ZIPImpl 
{ 
public static void getFileFromZip(oracle.sql.BLOB srcBlob, oracle.sql.BLOB dstBlob[], java.lang.String name) { 
try { 
   OutputStream outBuffer = dstBlob[0].getBinaryOutputStream(); 
   InputStream inBuffer = srcBlob.getBinaryStream(); 
   ZipInputStream zip = new ZipInputStream(inBuffer); 
   ZipEntry entry; 
   byte[] tmpBuffer = new byte[2048]; 
   while((entry = zip.getNextEntry()) != null) { 
      if (entry.getName().compareTo(name)==0) { 
         int n; 
         while ((n = zip.read(tmpBuffer)) >= 0) 
           outBuffer.write(tmpBuffer, 0, n); 
      } 
   } 
   outBuffer.close(); 
 } 
 catch (SQLException e) { 
   System.err.println(e); 
 } 
 catch (IOException e) { 
   System.err.println(e); 
 } 
} 
}; 
/

alter java source "ZIPImpl" compile 
/      

create or replace package zip as 
   procedure unzip( 
      p_src       in          blob
   ,  p_dst       in out      blob
   ,  p_filename  in          varchar2); 
end; 
/

create or replace package body zip as 
   procedure unzip( 
      p_src       in          blob
   ,  p_dst       in out      blob
   ,  p_filename  in          varchar2) 
   as language java 
   name 'ZIPImpl.getFileFromZip(oracle.sql.BLOB, oracle.sql.BLOB[], java.lang.String)'; 
end; 
/

How to Extract Plain Text Content from DOCX
So, now you have a procedure to extract a file from a zip archive, next is to attack the content of the file. The DOCX file consists of several files, but the text content of the document resides in "word/document.xml". As the file is an XML document, I will perform an XML/XSL transformation of the file to be left with only plain text.

The script below creates a table SAMPLES_DOCX which will hold the original DOCX file stored as a BLOB, and a CLOB containing the plain text. It also creates a package with three procedures:

Store the DOCX (as uploaded into APEX_APPLICATION_FILES)
Download the original file (strictly not necessary for this exercise, but nice to have all the same)
Extract and store plain text from DOCX

The XSL used to transform the document.xml file is an (extremely) abbreviated version of the XSL posted here in the Oracle Technical Forums.

create table samples_docx (
   filename       varchar2(255) not null
,  mime_type      varchar2(255)
,  orig_file      blob
,  text_content   clob)
/

alter table samples_docx add constraint samples_docx_pk primary key (filename)
/

create or replace package sample_docx_p as
   -- get file from apex_application_files and store it in samples_docx table
   procedure store_docx (
      p_file_name    in       varchar2);
   -- download procedure for original docx file
   procedure retrieve_docx (
      p_file_name    in       varchar2);
   -- extract plain text from docx file (this is the meat)
   procedure store_text (
      p_file_name    in       varchar2);
end;
/

create or replace package body sample_docx_p as
   -- get file from apex_application_files and store it in samples_docx table
   procedure store_docx (
      p_file_name    in       varchar2)
   is
      l_file         blob;
      l_mime_type    apex_application_files.mime_type%type;
      l_name         apex_application_files.name%type;
   begin
      -- get file from apex files
      select  name
            , mime_type
            , blob_content
       into   l_name
            , l_mime_type
            , l_file
       from   apex_application_files
      where   name = p_file_name;
      -- insert record into samples table
      insert into samples_docx ( filename
                               , mime_type
                               , orig_file)
       values   (   l_name
                  , l_mime_type
                  , l_file);
      -- delete file from apex files when done
      delete from   apex_application_files
           where   name = p_file_name;
   end store_docx;
   -- download procedure for original docx file
   procedure retrieve_docx (
      p_file_name    in       varchar2)
   is
      l_file         blob;
      l_mime_type    apex_application_files.mime_type%type;
      l_name         apex_application_files.name%type;
      l_size         number;
   begin
      -- get file from apex files
      select  filename
            , mime_type
            , orig_file
            , dbms_lob.getlength(orig_file)
       into   l_name
            , l_mime_type
            , l_file
            , l_size
       from samples_docx   
      where   filename = p_file_name;
      -- return file
      owa_util.mime_header( nvl(l_mime_type,'application/octet'), false);
      htp.p('Content-length: ' || l_size);
      htp.p('Content-Disposition:  attachment; filename="'||replace(replace(substr(l_name,instr(l_name,'/')+1),chr(10),null),chr(13),null)|| '"');
      owa_util.http_header_close;
      wpg_docload.download_file(l_file);
   end retrieve_docx;
   -- perform xsl tranformation of document.xml
   function get_text (
      p_docx_xml     in       xmltype
   ) return clob
   is
   l_clob   clob;
   -- xsl monkeyed from http://forums.oracle.com/forums/thread.jspa?messageID=3368284
   -- abbreviated quite a bit, check out original posting by "user304344" for the original
   l_xsl    xmltype := xmltype('<?xml version="1.0" encoding="utf-8"?>'
                     ||chr(10)||'<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" '
                     ||chr(10)||'xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"'
                     ||chr(10)||'xmlns:v="urn:schemas-microsoft-com:vml"'
                     ||chr(10)||'exclude-result-prefixes="w v">'
                     ||chr(10)||'<xsl:output method="text" indent="no" encoding="UTF-8" version="1.0"/>'
                     ||chr(10)||'<!-- document root -->'
                     ||chr(10)||'<xsl:template match="/">'
                     ||chr(10)||'<!-- root element in document --> '
                     ||chr(10)||'<xsl:apply-templates select="w:document"/> '
                     ||chr(10)||'</xsl:template>'
                     ||chr(10)||'<!-- ****************************start document**************************** -->'
                     ||chr(10)||'<xsl:template match="w:document">'
                     ||chr(10)||'<xsl:for-each select="//w:p">'
                     ||chr(10)||'<xsl:apply-templates select="*/w:t"/> '
                     ||chr(10)||'<xsl:text>|¤¤</xsl:text> '
                     ||chr(10)||'</xsl:for-each> '
                     ||chr(10)||'</xsl:template>'
                     ||chr(10)||'<!-- get all text nodes within a para -->'
                     ||chr(10)||'<xsl:template match="*/w:t">'
                     ||chr(10)||'<xsl:value-of select="."/>'
                     ||chr(10)||'</xsl:template>'
                     ||chr(10)||'<!-- **************************** end document**************************** -->'
                     ||chr(10)||'</xsl:stylesheet>');
   begin
      -- "|¤¤" is just a hack to get linebreaks, should be an easier way to achieve this
      select replace(xmltransform(p_docx_xml, l_xsl).GetClobVal(), '|¤¤', chr(10))
        into l_clob
        from dual;
      return l_clob;
   end;   
   -- extract plain text from docx file (this is the meat)
   procedure store_text (
      p_file_name    in       varchar2)
   is
      l_docx            blob;
      l_docx_unzip      blob;
      l_doc             clob;
      l_dest_offset     integer := 1;
      l_src_offset      integer := 1;
      l_lang_context    integer := dbms_lob.default_lang_ctx;
      l_warning         integer;
   begin
      -- get original file
      select orig_file
        into l_docx
        from samples_docx
       where filename = p_file_name;
      -- create lob locators
      dbms_lob.createtemporary(l_docx_unzip,false);
      dbms_lob.createtemporary(l_doc,false);
      -- use java to unzip the docx file and retrieve document.xml
      zip.unzip(l_docx, l_docx_unzip, 'word/document.xml');
      -- convert blob to clob
      dbms_lob.converttoclob
               ( dest_lob => l_doc
               , src_blob => l_docx_unzip
               , amount => dbms_lob.lobmaxsize
               , dest_offset => l_dest_offset
               , src_offset => l_src_offset
               , blob_csid => nls_charset_id('AL32UTF8') --in my case, it is stored as UTF8
               , lang_context => l_lang_context
               , warning => l_warning
               );
      -- transform clob via xsl to get clean text
      l_doc := get_text(xmltype(l_doc));
      -- update the column containing document text
      update samples_docx
         set text_content = l_doc
       where filename = p_file_name;
      -- clean lob locators, should be repeated in exception block
      dbms_lob.freetemporary(l_docx_unzip);
      dbms_lob.freetemporary(l_doc);
   end;
end;
/

Bringing it Together in Oracle APEX
To test the code, you can create a simple upload page in APEX:

Create new page
Form
Form on a procedure
<schema name> where you installed the application
Choose "sample_docx_p.store_docx" as stored procedure name
Accept defaults (or change to your liking)
Choose desired tab options
Leave Invoking Page/Button Label blank if you do not want an invoking page
Choose branch pages (can be the same as the one you are creating)
Accept defaults
Click Finish
Click Edit Page
Choose PXX_FILE_NAME, and change "Display as" from "Text item" to "File Browse..."
Apply changes

To add post processing to extract DOCX file content:

Click Create under Processes
Choose PL/SQL
Give the process a name like "extract_text"
In the "Enter PL/SQL Page Process" dialog, paste "sample_docx_p.store_text(:PXX_FILE_NAME);" (XX being your page number)
Create process (and ensure the process comes after "Run Stored Procedure" process created by the page wizard

The last process could be implemented as a trigger on the samples_docx table.

Run the page and test by uploading a DOCX document (or test it in my demo application).

Building More
If you have a need to manipulate DOCX documents, this could probably be achieved by using the complete "ZIPImpl" as posted in this article, and using substitution variables in the same manner as I have described with RTF documents in an earlier post.

The zip/unzip may also come in handy when handling other office files (like Open Office ODF).

This is only a proof of concept, and the code is not production ready. Really! I'm not just saying that out of habit.

The next (and final) part of the DOCX "series" will address how to index the plain text content using Oracle Text.

Monkey on Oracle

Monday, March 29, 2010

CKEditor Image Browser - APEX Style

Tuesday, March 2, 2010

DOCX Part II: How to Index Document Content With Oracle Text

Monday, March 1, 2010

DOCX Part I: How to Extract Document Content as Plain Text

About me

Blog Archive

Labels

Blogroll