Data transformation

From Wikipedia, the free encyclopedia

This article is about data transformation in computer science (metadata). For statistical application, see data transformation (statistics).

In metadata, a data transformation converts data from a source data format into a destination data format.

Data transformation can be divided into two steps:

data mapping, which maps data elements from the source to the destination and captures any transformation that must occur
code generation, which creates the actual transformation program

Data element to data element mapping is frequently complicated by complex transformations that require one-to-many and many-to-one transformation rules.
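
As a minimal sketch of such rules, a mapping can be written as a list of entries, each naming its source elements, its destination elements, and a conversion between them. The field names (`id`, `full_name`, `street`, `city` and their destinations) and the helper `apply_mapping` are hypothetical, chosen only for illustration, and the one-to-many rule assumes the name contains a space.

```python
# Each rule: (source fields, destination fields, conversion function).
# All field names are hypothetical, chosen only for illustration.
MAPPING = [
    # one-to-one: copy a value, changing only its name
    (("id",), ("customer_id",), lambda v: (v,)),
    # one-to-many: split one source element into two destination elements
    # (assumes the name contains at least one space)
    (("full_name",), ("first_name", "last_name"),
     lambda name: tuple(name.split(" ", 1))),
    # many-to-one: combine two source elements into one destination element
    (("street", "city"), ("address",),
     lambda street, city: (f"{street}, {city}",)),
]


def apply_mapping(record, mapping):
    """Build a destination record by applying every rule to the source record."""
    result = {}
    for sources, destinations, convert in mapping:
        values = convert(*(record[name] for name in sources))
        result.update(zip(destinations, values))
    return result


source_record = {
    "id": 7,
    "full_name": "Jane Doe",
    "street": "1 Main Street",
    "city": "Springfield",
}
destination_record = apply_mapping(source_record, MAPPING)
print(destination_record)
```

Keeping the rules as data rather than as hand-written code is what lets a separate step turn them into a program.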

The code generation step takes the data element mapping specification and creates an executable program that can be run on a computer system. Code generation can also create transformations in easy-to-maintain computer languages such as Java or XSLT.
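
The idea can be illustrated with a small sketch that generates Python rather than Java or XSLT. The specification format (a destination field paired with an expression over a source record `src`), the field names, and the function `generate_transform_source` are assumptions made for this example, not part of any particular tool.

```python
# Hypothetical mapping specification: destination field -> Python expression
# over the source record `src`. Field names are illustrative only.
SPEC = {
    "customer_id": "src['id']",
    "address": "src['street'] + ', ' + src['city']",
}


def generate_transform_source(spec, name="transform"):
    """Emit the source text of a function that implements the specification."""
    lines = [f"def {name}(src):", "    return {"]
    for dest, expr in spec.items():
        lines.append(f"        {dest!r}: {expr},")
    lines.append("    }")
    return "\n".join(lines) + "\n"


generated = generate_transform_source(SPEC)
print(generated)  # the generated program is plain, readable source text

# Compile and load the generated program (only safe for a trusted specification).
namespace = {}
exec(generated, namespace)
transform = namespace["transform"]

output = transform({"id": 7, "street": "1 Main Street", "city": "Springfield"})
print(output)
```

Because the generated program is ordinary source text, it can be inspected and maintained like hand-written code.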

When the mapping is indirect via a mediating data model, the process is also called data mediation.
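
A minimal sketch of mediation, assuming two hypothetical record formats (here called "crm" and "billing") and an equally hypothetical mediating model with the fields `name` and `email`:

```python
# Hypothetical formats: the source maps into a shared mediating model,
# and the destination is produced from that model.
def crm_to_mediating(rec):
    return {"name": rec["full_name"], "email": rec["mail"]}


def mediating_to_billing(rec):
    return {"customer": rec["name"], "contact_email": rec["email"]}


def mediate(rec, to_mediating, from_mediating):
    """Transform indirectly: source -> mediating model -> destination."""
    return from_mediating(to_mediating(rec))


billing = mediate(
    {"full_name": "Jane Doe", "mail": "jane@example.com"},
    crm_to_mediating,
    mediating_to_billing,
)
print(billing)
```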

Transformational Languages

Numerous languages are available for performing data transformation, and they vary in their accessibility (cost) and general usefulness. Many transformational languages require a grammar to be provided. In many cases the grammar is structured using something closely resembling Backus–Naur Form (BNF). Examples of such languages include:

XSLT - the XML transformation language
TXL - prototyping language-based descriptions using source transformation

Although transformational languages are typically best suited for transformation, something as simple as regular expressions can achieve useful transformations. TextPad, for example, supports regular expressions with arguments, so that all instances of a particular pattern can be replaced with another pattern that reuses parts of the original. For example:

foo ("some string", 42, gCommon);
bar (someObj, anotherObj);

foo ("another string", 24, gCommon);
bar (myObj, myOtherObj);

could both be transformed into a more compact form like:

foobar("some string", 42, someObj, anotherObj);
foobar("another string", 24, myObj, myOtherObj);

In other words, every invocation of foo with three arguments that is followed by an invocation of bar with two arguments would be replaced with a single function invocation using some or all of the original set of arguments.
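
The same replacement can be sketched with Python's `re` module instead of TextPad's own syntax. The pattern is an assumption tailored to the sample above: it expects a quoted string and simple identifiers or numbers as arguments, and it would not cope with nested calls or with commas inside strings.

```python
import re

source = '''foo ("some string", 42, gCommon);
bar (someObj, anotherObj);

foo ("another string", 24, gCommon);
bar (myObj, myOtherObj);
'''

# foo with three arguments, followed by bar with two arguments.
# Groups 1-2 capture the first two arguments of foo (the third is dropped);
# groups 3-4 capture both arguments of bar.
pattern = re.compile(
    r'\bfoo\s*\(\s*("[^"]*")\s*,\s*(\w+)\s*,\s*\w+\s*\)\s*;\s*'
    r'bar\s*\(\s*(\w+)\s*,\s*(\w+)\s*\)\s*;'
)

# The replacement reuses the captured parts of the original text.
result = pattern.sub(r'foobar(\1, \2, \3, \4);', source)
print(result)
```

Text that the pattern does not match, such as the blank line between the two pairs of calls, is copied through unchanged.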

Another advantage of regular expressions is that they will not fail the null transform test. To apply the test, run a sample program through a transformation, written in the transformational language of choice, that makes no changes; the output should be identical to the input. Many transformational languages fail this test.
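
A minimal sketch of the test for a regular-expression transform, using Python's `re` module. The sample text and the pattern name `no_such_function` are made up for the example; the pattern is chosen so that it never occurs in the input.

```python
import re


def transform(text):
    # A rule whose pattern does not occur in the input: nothing is rewritten.
    return re.sub(r'\bno_such_function\s*\(', 'replacement(', text)


# Sample input with irregular spacing, a comment, and a call split over lines.
sample = (
    'foo   ("some string",42,   gCommon);   /* odd   spacing */\n'
    '\n'
    '\tbar (someObj,\n'
    '\t     anotherObj);\n'
)

# The null transform test: the output must be identical to the input.
unchanged = transform(sample) == sample
print(unchanged)
```

The comparison holds because a regular-expression substitution copies every unmatched character through verbatim, including whitespace and comments.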

Difficult Problems

There are many challenges in data transformation. Probably the most difficult problem to address in C++ is "unstructured preprocessor directives". These are preprocessor directives that do not enclose blocks of code with simple grammatical descriptions, for example:

void MyFunc ()
{
  if (x>17)
  { printf("test");
#ifdef FOO
  } else {
#endif
    if (gWatch)
      mTest = 42;
  }
}

A fully general solution is very hard because such preprocessor directives can essentially edit the underlying language in arbitrary ways. However, because such directives are not, in practice, used in completely arbitrary ways, one can build practical tools for handling preprocessed languages. The DMS Software Reengineering Toolkit is capable of handling structured macros and preprocessor conditionals.
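
To see how the directive in the example edits the program's structure, the following toy sketch resolves the conditional both ways. The function `preprocess` is hypothetical and handles only non-nested `#ifdef NAME` ... `#endif` pairs; it is not a real C preprocessor and does not represent how any particular tool works.

```python
def preprocess(source, defined):
    """Resolve non-nested #ifdef/#endif blocks (toy sketch: no #else, #if, or nesting)."""
    out = []
    keep = True
    for line in source.splitlines():
        stripped = line.strip()
        if stripped.startswith("#ifdef"):
            # Keep the following lines only if the named symbol is defined.
            keep = stripped.split()[1] in defined
        elif stripped.startswith("#endif"):
            keep = True
        elif keep:
            out.append(line)
    return "\n".join(out)


SOURCE = """\
void MyFunc ()
{
  if (x>17)
  { printf("test");
#ifdef FOO
  } else {
#endif
    if (gWatch)
      mTest = 42;
  }
}"""

with_foo = preprocess(SOURCE, {"FOO"})
without_foo = preprocess(SOURCE, set())
print(with_foo)
print("----")
print(without_foo)
```

With FOO defined, the inner `if (gWatch)` becomes the else branch of the outer `if`; without it, the same lines sit inside the outer block. The conditional region itself, `} else {`, is not a complete statement or block, so a grammar cannot describe it as a single unit.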