
Help to create a script that recognises a text sequence

I have to do something that requires, I think, two scripting exercises.

The first part is to build a correspondence table.

For example: a file contains several thousand addresses (see the extract in the attached Excel file). I need to group the addresses by similar sequences in their wording.

The attached drawing explains what I need.

This will save time when building a mapping table, which can then be used in a join, for which I need another block (another script).

I will write another ticket (called Help to create a join block by "one text contains another text") for the join.

Thanks in advance for your help.

Best Regards

Magali

As discussed during the screen share, you can use the community Fuzzy Join block to merge datasets based on similar text. It performs a join between the first (left) and second (right) inputs. The join field must be text containing multiple terms.

The result contains records joined according to how many terms they share, weighted by inverse document frequency (https://en.wikipedia.org/wiki/Tf%E2%80%93idf).


You can also use the fuzzy match in the Record filter, but if you have many rules, that could become tedious to set up and maintain.
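
As a rough illustration of the term-sharing join described above, here is a minimal Python sketch. It is not the Fuzzy Join block's actual implementation: the function names (`tokenize`, `idf_weights`, `fuzzy_join`), the sample addresses, and the exact weighting formula are all assumptions made for the example. It scores each left/right pair by the summed inverse document frequency of the terms they share and keeps every left record, with `None` where nothing matches.

```python
import math
from collections import Counter

def tokenize(text):
    # Lowercase and split on whitespace, treating commas as separators.
    return text.lower().replace(",", " ").split()

def idf_weights(documents):
    # Rare terms get a higher weight than terms found in many records.
    n = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(tokenize(doc)))
    return {term: math.log(n / count) + 1.0 for term, count in doc_freq.items()}

def fuzzy_join(left, right, min_score=0.0):
    # For each left record, pick the right record with the highest
    # IDF-weighted overlap of shared terms. Every left record is kept.
    weights = idf_weights(left + right)
    results = []
    for l_text in left:
        l_terms = set(tokenize(l_text))
        best, best_score = None, 0.0
        for r_text in right:
            shared = l_terms & set(tokenize(r_text))
            score = sum(weights[t] for t in shared)
            if score > best_score:
                best, best_score = r_text, score
        matched = best if best_score > min_score else None
        results.append((l_text, matched, best_score))
    return results

# Hypothetical sample data, for illustration only.
left = ["12 rue de la Paix Paris", "5 avenue Victor Hugo Lyon", "unknown place"]
right = ["rue de la Paix", "avenue Victor Hugo", "boulevard Haussmann"]
for row in fuzzy_join(left, right):
    print(row)
```

Raising `min_score` would discard weak matches that rest on a single common word.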



Hi Antonio,

I understand why the result is not satisfying: it keeps only one address line, and I need to keep all the addresses.

I made a short example in the attached IOZ to explain:

image


Best Regards,


Magali

ioz

I tried a "left" join instead of a "right" one, but in the end it does not work either :-(

Any other ideas?

Magali

You can use an ETL process to break the string into individual words, merge them with the keyword list, then join the result with the original data.

A nice feature here is that if the text contains multiple keywords, all of them are captured and listed, comma-separated, in the aggregated field.

See the IOZ attached.
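
For readers without the IOZ, the steps above can be sketched in Python roughly as follows. This is only an approximation of the idea, not the content of the attached process: the function name `tag_with_keywords`, the field names, and the sample data are assumptions. Each address is split into words, the words are matched against the keyword list, and the matches are aggregated into one comma-separated field attached to the original address, so no address is lost.

```python
def tag_with_keywords(addresses, keywords):
    # Case-insensitive lookup set built from the keyword list.
    keyword_set = {k.lower() for k in keywords}
    rows = []
    for address in addresses:
        # Break the string into individual words.
        words = address.lower().replace(",", " ").split()
        # Keep each matching keyword once, in order of appearance.
        matched = []
        for word in words:
            if word in keyword_set and word not in matched:
                matched.append(word)
        # Join back to the original record; unmatched rows get an empty field.
        rows.append({"address": address, "keywords": ", ".join(matched)})
    return rows

# Hypothetical sample data, for illustration only.
addresses = ["10 Rue Victor Hugo, Lyon", "Avenue de Paris", "Main Street"]
keywords = ["Hugo", "Lyon", "Paris"]
for row in tag_with_keywords(addresses, keywords):
    print(row)
```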


ioz

These are the files

xlsx

Thanks Antonio.

However, the "fuzzy match" finds matches for only 313 records out of 1609... The idea of trying to isolate a repeating sequence may work better when the two entries are not always in the same language.

Magali
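
One possible way to isolate a repeating sequence, sketched in Python under several assumptions: the sequence is taken to be a run of `n` consecutive words (a word n-gram), an n-gram counts as "repeating" when it appears in at least `min_count` addresses, and each address is grouped under its most widespread repeating n-gram. The function names and sample addresses are hypothetical, and this has not been tried on the attached file.

```python
from collections import Counter, defaultdict

def word_ngrams(text, n):
    # All runs of n consecutive words, lowercased.
    words = text.lower().replace(",", " ").split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def group_by_repeated_sequence(addresses, n=2, min_count=2):
    # Count in how many addresses each n-gram appears.
    counts = Counter()
    for address in addresses:
        counts.update(set(word_ngrams(address, n)))
    # Group each address under its most widespread repeating n-gram.
    # Addresses with no repeating n-gram are grouped under None.
    groups = defaultdict(list)
    for address in addresses:
        candidates = [g for g in word_ngrams(address, n) if counts[g] >= min_count]
        key = max(candidates, key=lambda g: counts[g]) if candidates else None
        groups[key].append(address)
    return dict(groups)

# Hypothetical sample data, for illustration only.
addresses = [
    "12 rue victor hugo lyon",
    "rue victor hugo 69002",
    "5 avenue foch paris",
    "avenue foch 75016",
    "somewhere else",
]
for key, members in group_by_repeated_sequence(addresses).items():
    print(key, members)
```

The resulting group keys could serve as a first draft of the mapping table, to be reviewed by hand before it is used in a join.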
