22-50015_R5_LDRD

Automated Type and Data Structure Resolution

Discerning data type information from compiled binaries using machine learning has the potential to enable better preparation against and response to malware attacks.

TOTAL APPROVED AMOUNT: $550,000 over 2 years
PROJECT NUMBER: 20A44-108
PRINCIPAL INVESTIGATOR: Jared Verba
CO-INVESTIGATOR: Sean Salinas, INL

Research was conducted to enhance the ability of reverse engineers to take compiled binaries and disassemble them to gain more information about their inner workings. This project specifically explored pathways to automatically discern data structure and data type information from compiled binaries, primarily using machine learning techniques. An additional benefit of automating the process is that binaries could be reverse engineered at greater scale without manual type resolution. The research: (1) developed a training data set using compiled binaries and known type information; (2) expanded the training set so that the machine learning algorithms could recognize a diverse set of type information from the compiled binaries; (3) experimented with machine learning processes against an unknown binary (one for which the source code is not available) so that type information could be discerned from it automatically, without user intervention; and (4) analyzed how this newly discerned type information might enable reverse engineers to focus on other interesting parts of the compiled code instead of manually resolving data types.

Research results indicated that discerning type information using machine learning algorithms was very sensitive to conditions and may only be applicable to specialized environments. The algorithms rarely gave good results. When the results were good, they were specific to the data set that was used for training. When properly trained and tuned, the algorithms gave more positive results on problems similar to their training set. However, when applied outside the norms of their training set, the algorithms gave poor results that required user verification and manual intervention. Further refinement of the solution is required before meaningful application of this technique can be developed.

Figure: Example of the sharding process, which simulates every section of memory that the binary references and keeps track of which machine instructions reference each section. Using this technique, large instruction sets were generated for each memory reference and were then used for machine learning.
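
The sharding idea described above, grouping machine instructions by the section of memory they reference, can be illustrated with a small sketch. This is not the project's implementation. The `Access` record, the fixed `region_size`, the function names, and the sample trace are all hypothetical, chosen only to show how per-region instruction sets might be assembled before being handed to a learning algorithm.

```python
from collections import defaultdict
from typing import NamedTuple


class Access(NamedTuple):
    """One simulated memory access (hypothetical record layout)."""
    insn_addr: int   # address of the instruction performing the access
    mnemonic: str    # instruction mnemonic, e.g. "mov"
    mem_addr: int    # memory address the instruction reads or writes
    size: int        # width of the access in bytes


def shard_by_region(accesses, region_size=16):
    """Group accesses into shards keyed by the base address of the
    fixed-size memory region each access starts in.

    Assumption: regions are fixed-size and aligned; an access that
    straddles a boundary is assigned to the region of its first byte.
    """
    shards = defaultdict(list)
    for acc in accesses:
        base = acc.mem_addr - (acc.mem_addr % region_size)
        shards[base].append(acc)
    return dict(shards)


def instruction_sequence(shard):
    """Return the mnemonics for one shard, ordered by instruction address."""
    return [a.mnemonic for a in sorted(shard, key=lambda a: a.insn_addr)]


# Made-up trace for illustration only.
trace = [
    Access(0x401000, "mov", 0x7000, 4),
    Access(0x401004, "add", 0x7004, 4),
    Access(0x401008, "movsd", 0x7020, 8),
    Access(0x40100C, "mov", 0x7000, 4),
]

shards = shard_by_region(trace)
for base, shard in sorted(shards.items()):
    print(hex(base), instruction_sequence(shard))
```

In this sketch each shard's instruction sequence would serve as one training sample, labeled with the known type of the data stored in that region when source-level type information is available.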