Help to create a script that recognises a text sequence

Posted almost 4 years ago by Magali Colin - Avizua

Solved

Magali Colin - Avizua

I have to do something that requires, I think, two scripting exercises.

The first part is to build a correspondence table.

For example: a file contains several thousand addresses (see the extract in the attached Excel file). I need to group the addresses by similar sequences in their wording.

The attached drawing explains what I need.

This will save time in building a mapping table, which can then be used in a join, for which I need another block (another script).

I will write another ticket (called Help to create a join block by "one text contains another text") for the join.

Thanks in advance for your help.

Best Regards

Magali

0 Votes

7 Comments

Paola Tomei posted over 1 year ago Admin

Hi Magali,

It would be good to know whether you've had a chance to review the solution and whether it meets the requirement.

Thanks

Paola

0 Votes

Paola Tomei posted almost 4 years ago Admin

You can use the ETL process to break the string into individual words, merge them with the keywords list, then join the result with the original data.

A nice feature here is that if the text contains multiple keywords, all of them are captured and listed, comma-separated, in the aggregated field.
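Outside the tool, the same idea can be sketched in a few lines of Python. This is only an illustration of the logic, not the content of the attached IOZ: the function names, the sample addresses, and the keyword list are all made up for the example.

```python
import re


def tokenize(text):
    """Split a string into lowercase word tokens."""
    return re.findall(r"\w+", text.lower())


def tag_keywords(addresses, keywords):
    """Return one (address, matched keywords) pair per input address.

    Matched keywords are comma-separated, in order of first appearance.
    Addresses without any keyword are kept, with an empty string.
    """
    keyword_set = {k.lower() for k in keywords}
    result = []
    for address in addresses:
        matched = []
        for word in tokenize(address):
            if word in keyword_set and word not in matched:
                matched.append(word)
        result.append((address, ", ".join(matched)))
    return result


# Hypothetical sample data, for illustration only.
addresses = [
    "12 Rue de la Gare, Bruxelles",
    "Avenue Louise 45, Brussels",
    "Gare du Midi, Rue de France",
    "Chaussee de Wavre 7",
]
keywords = ["gare", "louise", "rue"]

tagged = tag_keywords(addresses, keywords)
for address, found in tagged:
    print(f"{address!r} -> {found!r}")
```

Every input address appears exactly once in the output, so nothing is lost when a line contains no keyword or several of them.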

See the IOZ attached.

Attachments (1)

ioz

Key words CO....ioz
401 KB

0 Votes

Magali Colin - Avizua posted almost 4 years ago

I tried a "left" join instead of a "right" one, but in the end it does not work :-(

Any other idea?

Magali

0 Votes

Magali Colin - Avizua posted almost 4 years ago

Hi Antonio,

I understand why the result is not satisfactory: it keeps only one address line, and I need to keep all the addresses.

I made a short example in the attached IOZ to explain:

Best Regards,

Magali

Attachments (1)

ioz

Test Fuzzy M....ioz
365 KB

0 Votes

Magali Colin - Avizua posted almost 4 years ago

Thanks Antonio.

However, the "fuzzy match" matches 313 records out of 1609... The idea of trying to isolate a repeating sequence may be better when the two entries are not always in the same language.

Magali

0 Votes

Antonio Poggi posted almost 4 years ago Admin

As per screenshare, you can exploit the community Fuzzy Join block to merge datasets based on similar text. It performs a join between the first (left) and second (right) input. The field on which the join is performed must be text containing multiple terms.

The result will contain joined records based on how many terms they share, weighted by inverse document frequency (https://en.wikipedia.org/wiki/Tf%E2%80%93idf).

You can also use the fuzzy match in the Record filter, but if you have many rules, that could be tedious to set up and maintain.
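To make the weighting concrete, here is a rough Python sketch of a join scored by the inverse document frequency of shared terms. It is not the implementation of the community Fuzzy Join block: the function names, the `log(N / df)` formula, the "keep the best match per left record" rule, and the sample data are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Return the set of lowercase word tokens in a string."""
    return set(re.findall(r"\w+", text.lower()))


def idf_weights(token_sets):
    """Weight each term by log(N / df); rare terms get higher weights."""
    n = len(token_sets)
    df = Counter(term for tokens in token_sets for term in tokens)
    return {term: math.log(n / count) for term, count in df.items()}


def fuzzy_join(left, right, min_score=0.0):
    """For each left record, pick the right record with the highest score.

    The score is the sum of IDF weights of the shared terms. Left records
    with no score above min_score are kept, paired with None.
    """
    left_tokens = [tokenize(t) for t in left]
    right_tokens = [tokenize(t) for t in right]
    idf = idf_weights(left_tokens + right_tokens)
    matches = []
    for l_text, l_tok in zip(left, left_tokens):
        best, best_score = None, min_score
        for r_text, r_tok in zip(right, right_tokens):
            score = sum(idf[t] for t in l_tok & r_tok)
            if score > best_score:
                best, best_score = r_text, score
        matches.append((l_text, best, best_score))
    return matches


# Hypothetical sample data, for illustration only.
left = ["Rue de la Gare 12 Bruxelles", "Avenue Louise 45", "Chaussee de Wavre"]
right = ["Gare Centrale Bruxelles", "Louise Avenue", "Rue Neuve"]

matches = fuzzy_join(left, right)
for l_text, r_text, score in matches:
    print(f"{l_text!r} -> {r_text!r} (score {score:.2f})")
```

Because the score depends on exact shared terms, two spellings of the same place in different languages (for example "Bruxelles" and "Brussels") contribute nothing to the match in this sketch.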

1 Votes

Magali Colin - Avizua posted almost 4 years ago

These are the files

Attachments (2)

xlsx

file for new....xlsx
19 KB

Help to reco....png
71.9 KB

0 Votes