JabRef References output

The papers below are subject to ACM, IEEE, or other copyrights as noted in the paper's text

C.C.K. Mikkelsen, J. Alastruey-Benedé, P. Ibáñez-Marín and P. García-Risueño (2015), "Accelerating Sparse Arithmetic in the Context of Newton's Method for Small Molecules with Bond Constraints", In Parallel Processing and Applied Mathematics - 11th International Conference, PPAM 2015, Krakow, Poland, September 6-9, 2015. Revised Selected Papers, Part I. , pp. 160-171.

[BibTeX] [DOI] [URL] [PDF]

BibTeX:

@inproceedings{Mikkelsen2015,
  author = {Carl Christian Kjelgaard Mikkelsen and Jesús Alastruey-Benedé and Pablo Ibáñez-Marín and Pablo García-Risueño},
  title = {Accelerating Sparse Arithmetic in the Context of Newton's Method for Small Molecules with Bond Constraints},
  booktitle = {Parallel Processing and Applied Mathematics - 11th International Conference, PPAM 2015, Krakow, Poland, September 6-9, 2015. Revised Selected Papers, Part I},
  year = {2015},
  pages = {160--171},
  url = {http://dx.doi.org/10.1007/978-3-319-32149-3_16},
  doi = {10.1007/978-3-319-32149-3_16}
}

J. Albericio, P. Ibáñez, V. Viñals and J.M. Llabería (2013), "The reuse cache: downsizing the shared last-level cache", In The 46th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-46, Davis, CA, USA, December 7-11, 2013. , pp. 310-321.

[BibTeX] [DOI] [URL]

BibTeX:

@inproceedings{Albericio2013,
  author = {J. Albericio and P. Ibáñez and V. Viñals and J. M. Llabería},
  title = {The reuse cache: downsizing the shared last-level cache},
  booktitle = {The 46th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-46, Davis, CA, USA, December 7-11, 2013},
  year = {2013},
  pages = {310--321},
  url = {http://doi.acm.org/10.1145/2540708.2540735},
  doi = {10.1145/2540708.2540735}
}

A. Pedro-Zapater, C. Rodríguez, J. Segarra, R. Gran Tejero and V. Viñals-Yúfera (2020), "Ideal and Predictable Hit Ratio for Matrix Transposition in Data Caches", Mathematics., February, 2020. Vol. 8(2), pp. 184. MDPI AG.

[Abstract] [BibTeX] [DOI] [URL]

Abstract: Matrix transposition is a fundamental operation, but it may present a very low and hardly predictable data cache hit ratio for large matrices. Safe (worst-case) hit ratio predictability is required in real-time systems. In this paper, we obtain the relations among the cache parameters that guarantee the ideal (predictable) data hit ratio assuming a Least-Recently-Used (LRU) data cache. Considering our analytical assessments, we compare a tiling matrix transposition to a cache oblivious algorithm, modified with phantom padding to improve its data hit ratio. Our results show that, with an adequate tile size, the tiling version results in an equal or better data hit ratio. We also analyze the energy consumption and execution time of matrix transposition on real hardware with pseudo-LRU (PLRU) caches. Our analytical hit/miss assessment enables the usage of a data cache for matrix transposition in real-time systems, since the number of misses in the worst case is bound. In general and high-performance computation, our analysis enables us to restrict the cache resources devoted to matrix transposition with no negative impact, in order to reduce both the energy consumption and the pollution to other computations.

BibTeX:

@article{Pedro20Ideal,
  author = {Alba Pedro-Zapater and Clemente Rodríguez and Juan Segarra and Gran Tejero, Rubén and Víctor Viñals-Yúfera},
  title = {Ideal and Predictable Hit Ratio for Matrix Transposition in Data Caches},
  journal = {Mathematics},
  publisher = {MDPI AG},
  year = {2020},
  volume = {8},
  number = {2},
  pages = {184},
  url = {https://doi.org/10.3390/math8020184},
  doi = {10.3390/math8020184}
}

A. Pedro-Zapater, J. Segarra, R. Gran Tejero, V. Viñals and C. Rodríguez (2020), "Reducing the WCET and analysis time of systems with simple lockable instruction caches", PLOS ONE., March, 2020. Vol. 15(3), pp. e0229980. Public Library of Science.

[Abstract] [BibTeX] [DOI] [URL]

Abstract: One of the key challenges in real-time systems is the analysis of the memory hierarchy. Many Worst-Case Execution Time (WCET) analysis methods supporting an instruction cache are based on iterative or convergence algorithms, which are rather slow. Our goal in this paper is to reduce the WCET analysis time on systems with a simple lockable instruction cache, focusing on the Lock-MS method. First, we propose an algorithm to obtain a structure-based representation of the Control Flow Graph (CFG). It organizes the whole WCET problem as nested subproblems, which takes advantage of common branch-and-bound algorithms of Integer Linear Programming (ILP) solvers. Second, we add support for multiple locking points per task, each one with specific cache contents, instead of a given locked content for the whole task execution. Locking points are set heuristically before outer loops. Such simple heuristics adds no complexity, and reduces the WCET by taking profit of the temporal reuse found in loops. Since loops can be processed as isolated regions, the optimal contents to lock into cache for each region can be obtained, and the WCET analysis time is further reduced. With these two improvements, our WCET analysis is around 10 times faster than other approaches. Also, our results show that the WCET is reduced, and the hit ratio achieved for the lockable instruction cache is similar to that of a real execution with an LRU instruction cache. Finally, we analyze the WCET sensitivity to compiler optimization, showing for each benchmark the right choices and pointing out that O0 is always the worst option.

BibTeX:

@article{Pedro20Reducing,
  author = {Pedro-Zapater, Alba AND Segarra, Juan AND Gran Tejero, Rubén AND Viñals, Víctor AND Rodríguez, Clemente},
  title = {Reducing the WCET and analysis time of systems with simple lockable instruction caches},
  journal = {PLOS ONE},
  publisher = {Public Library of Science},
  year = {2020},
  volume = {15},
  number = {3},
  pages = {e0229980},
  url = {https://doi.org/10.1371/journal.pone.0229980},
  doi = {10.1371/journal.pone.0229980}
}

J. Segarra, J. Cortadella, R. Gran Tejero and V. Viñals Yúfera (2020), "Automatic Safe Data Reuse Detection for the WCET Analysis of Systems With Data Caches", IEEE Access., October, 2020. Vol. 8, pp. 192379-192392. IEEE.

[Abstract] [BibTeX] [DOI] [URL]

Abstract: Worst-case execution time (WCET) analysis of systems with data caches is one of the key challenges in real-time systems. Caches exploit the inherent reuse properties of programs, temporarily storing certain memory contents near the processor, in order that further accesses to such contents do not require costly memory transfers. Current worst-case data cache analysis methods focus on specific cache organizations (LRU, locked, ACDC, etc.). In this article, we analyze data reuse (in the worst case) as a property of the program, and thus independent of the data cache. Our analysis method uses Abstract Interpretation on the compiled program to extract, for each static load/store instruction, a linear expression for the address pattern of its data accesses, according to the Loop Nest Data Reuse Theory. Each data access expression is compared to that of prior (dominant) memory instructions to verify whether it presents a guaranteed reuse. Our proposal manages references to scalars, arrays, and non-linear accesses, provides both temporal and spatial reuse information, and does not require the exploration of explicit data access sequences. As a proof of concept we analyze the TACLeBench benchmark suite, showing that most loads/stores present data reuse, and how compiler optimizations affect it. Using a simple hit/miss estimation on our reuse results, the time devoted to data accesses in the worst case is reduced to 27% compared to an always-miss system, equivalent to a data hit ratio of 81%. With compiler optimization, such time is reduced to 6.5%.

BibTeX:

@article{Segarra20Automatic,
  author = {Juan Segarra and Jordi Cortadella and Gran Tejero, Rubén and Viñals Yúfera, Víctor},
  title = {Automatic Safe Data Reuse Detection for the WCET Analysis of Systems With Data Caches},
  journal = {IEEE Access},
  publisher = {IEEE},
  year = {2020},
  volume = {8},
  pages = {192379--192392},
  url = {https://doi.org/10.1109/ACCESS.2020.3032145},
  doi = {10.1109/ACCESS.2020.3032145}
}

F. Candel, A. Valero, S. Petit and J. Sahuquillo (2019), "Efficient Management of Cache Accesses to Boost GPGPU Memory Subsystem Performance", IEEE Transactions on Computers., October, 2019. Vol. 68(10), pp. 1442-1454.

[BibTeX] [DOI] [PDF]

BibTeX:

@article{Candel2019,
  author = {F. Candel and A. Valero and S. Petit and J. Sahuquillo},
  title = {Efficient Management of Cache Accesses to Boost GPGPU Memory Subsystem Performance},
  journal = {IEEE Transactions on Computers},
  year = {2019},
  volume = {68},
  number = {10},
  pages = {1442--1454},
  doi = {10.1109/TC.2019.2907591}
}

G. Desirena-López, A. Ramírez-Treviño, J.L. Briz, C.R. Vázquez and D. Gómez-Gutiérrez (2019), "Thermal-aware Real-time Scheduling Using Timed Continuous Petri Nets", ACM Trans. Embed. Comput. Syst.. New York, NY, USA, July, 2019. Vol. 18(4), pp. 36:1-36:24. ACM.

[BibTeX] [DOI] [URL] [PDF]

BibTeX:

@article{DesirenaLopez2019,
  author = {Desirena-López, G. and Ramírez-Treviño, A. and Briz, J. L. and Vázquez, C. R. and Gómez-Gutiérrez, D.},
  title = {Thermal-aware Real-time Scheduling Using Timed Continuous Petri Nets},
  journal = {ACM Trans. Embed. Comput. Syst.},
  publisher = {ACM},
  year = {2019},
  volume = {18},
  number = {4},
  pages = {36:1--36:24},
  url = {http://doi.acm.org/10.1145/3322643},
  doi = {10.1145/3322643}
}

J. Díaz, T. Monreal, P. Ibáñez, J.M. Llabería and V. Viñals (2019), "ReD: A reuse detector for content selection in exclusive shared last-level caches", Journal of Parallel and Distributed Computing. Vol. 125, pp. 106-120.

[BibTeX] [DOI] [URL] [PDF]

BibTeX:

@article{Diaz2019,
  author = {Javier Díaz and Teresa Monreal and Pablo Ibáñez and José M. Llabería and Víctor Viñals},
  title = {ReD: A reuse detector for content selection in exclusive shared last-level caches},
  journal = {Journal of Parallel and Distributed Computing},
  year = {2019},
  volume = {125},
  pages = {106--120},
  url = {http://www.sciencedirect.com/science/article/pii/S0743731518308414},
  doi = {10.1016/j.jpdc.2018.11.005}
}

A. Ferrerón, J. Alastruey-Benedé, D.S. Gracia, T.M. Arnal, P.I. Marín and V.V. Yúfera (2019), "A fault-tolerant last level cache for CMPs operating at ultra-low voltage", Journal of Parallel and Distributed Computing. Vol. 125, pp. 31-44. Elsevier.

[BibTeX] [DOI] [URL] [PDF]

BibTeX:

@article{Ferreron2019,
  author = {Alexandra Ferrerón and Jesús Alastruey-Benedé and Darío Suárez Gracia and Teresa Monreal Arnal and Pablo Ibáñez Marín and Víctor Viñals Yúfera},
  title = {A fault-tolerant last level cache for CMPs operating at ultra-low voltage},
  journal = {Journal of Parallel and Distributed Computing},
  publisher = {Elsevier},
  year = {2019},
  volume = {125},
  pages = {31--44},
  url = {http://webdiis.unizar.es/gaz/biblio/pdfs/2019_JPDC_Fault_Tolerant_LLC.pdf},
  doi = {10.1016/j.jpdc.2018.10.010}
}

M.A. Dávila Guzmán, R. Nozal, R. Gran Tejero, M. Villarroya-Gaudó, D. Suárez Gracia and J.L. Bosque (2019), "Cooperative CPU, GPU, and FPGA heterogeneous execution with EngineCL", The Journal of Supercomputing., March, 2019. Vol. 75(3), pp. 1732-1746.

[Abstract] [BibTeX] [DOI] [URL] [PDF]

Abstract: Heterogeneous systems are the core architecture of most of the high-performance
computing nodes, due to their excellent performance and energy efficiency.
However, a key challenge that remains is programmability, specifically,
releasing the programmer from the burden of managing data and devices
with different architectures. To this end, we extend EngineCL to
support FPGA devices. Based on OpenCL, EngineCL is a high-level framework
providing load balancing among devices. Our proposal fully integrates
FPGAs into the framework, enabling effective cooperation between
CPU, GPU, and FPGA. With command overlapping and judicious data management,
our work improves performance by up to 96% compared with single-device
execution and delivers energy-delay gains of up to 37%. In addition,
adopting FPGAs does not require programmers to make big changes in
their applications because the extensions do not modify the user-facing
interface of EngineCL.

BibTeX:

@article{Guzman2019,
  author = {Dávila Guzmán, María Angélica and Nozal, Raúl and Gran Tejero, Rubén and Villarroya-Gaudó, María and Suárez Gracia, Darío and Bosque, Jose Luis},
  title = {Cooperative CPU, GPU, and FPGA heterogeneous execution with EngineCL},
  journal = {The Journal of Supercomputing},
  year = {2019},
  volume = {75},
  number = {3},
  pages = {1732--1746},
  url = {https://doi.org/10.1007/s11227-019-02768-y},
  doi = {10.1007/s11227-019-02768-y}
}

J. Herruzo, S. Gonzalez-Navarro, P. Ibáñez, V. Viñals-Yufera, J. Alastruey-Benedé and O. Plata (2019), "Boosting Backward Search Throughput for FM-Index Using a Compressed Encoding", March, 2019. , pp. 577-577.

[BibTeX] [DOI] [PDF]

BibTeX:

@inproceedings{Herruzo2019,
  author = {Herruzo, Jose and Gonzalez-Navarro, Sonia and Ibáñez, Pablo and Viñals-Yufera, Víctor and Alastruey-Benedé, Jesús and Plata, Oscar},
  title = {Boosting Backward Search Throughput for FM-Index Using a Compressed Encoding},
  year = {2019},
  pages = {577--577},
  doi = {10.1109/DCC.2019.00089}
}

G.D. López, L.E.R. Anguiano, A.R. Treviño and J.L. Briz (2019), "A Flexible Framework for Real-Time Thermal-Aware Schedulers using Timed Continuous Petri Nets", Computación y Sistemas., June, 2019. Vol. 23(2), pp. 417-434.

[BibTeX] [DOI] [URL]

BibTeX:

@article{Lopez,
  author = {Gaddiel Desirena López and Laura Elena Rubio Anguiano and Antonio Ramírez Treviño and José Luis Briz},
  title = {A Flexible Framework for Real-Time Thermal-Aware Schedulers using Timed Continuous Petri Nets},
  journal = {Computación y Sistemas},
  year = {2019},
  volume = {23},
  number = {2},
  pages = {417--434},
  url = {https://www.cys.cic.ipn.mx/ojs/index.php/CyS/article/view/3204/2649},
  doi = {10.13053/cys-23-2-3204}
}

A. Navarro-Torres, J. Alastruey-Benedé, P. Ibáñez-Marín and V. Viñals-Yúfera (2019), "Memory hierarchy characterization of SPEC CPU2006 and SPEC CPU2017 on the Intel Xeon Skylake-SP", PLOS ONE., August, 2019. Vol. 14(8), pp. 1-24. Public Library of Science.

[Abstract] [BibTeX] [DOI] [URL]

Abstract: SPEC CPU is one of the most common benchmark suites used in computer
architecture research. CPU2017 has recently been released to replace
CPU2006. In this paper we present a detailed evaluation of the memory
hierarchy performance for both the CPU2006 and single-threaded CPU2017
benchmarks. The experiments were executed on an Intel Xeon Skylake-SP,
which is the first Intel processor to implement a mostly non-inclusive
last-level cache (LLC). We present a classification of the benchmarks
according to their memory pressure and analyze the performance impact
of different LLC sizes. We also test all the hardware prefetchers
showing they improve performance in most of the benchmarks. After
comprehensive experimentation, we can highlight the following conclusions:
i) almost half of SPEC CPU benchmarks have very low miss ratios in
the second and third level caches, even with small LLC sizes and
without hardware prefetching, ii) overall, the SPEC CPU2017 benchmarks
demand even less memory hierarchy resources than the SPEC CPU2006
ones, iii) hardware prefetching is very effective in reducing LLC
misses for most benchmarks, even with the smallest LLC size, and
iv) from the memory hierarchy standpoint the methodologies commonly
used to select benchmarks or simulation points do not guarantee representative
workloads.

BibTeX:

@article{NavarroTorres,
  author = {Navarro-Torres, Agustín AND Alastruey-Benedé, Jesús AND Ibáñez-Marín, Pablo AND Viñals-Yúfera, Víctor},
  title = {Memory hierarchy characterization of SPEC CPU2006 and SPEC CPU2017 on the Intel Xeon Skylake-SP},
  journal = {PLOS ONE},
  publisher = {Public Library of Science},
  year = {2019},
  volume = {14},
  number = {8},
  pages = {1--24},
  url = {https://doi.org/10.1371/journal.pone.0220135},
  doi = {10.1371/journal.pone.0220135}
}

A. Rodríguez, A. Navarro, R. Asenjo, F. Corbera, R. Gran, D. Suárez and J. Nunez-Yanez (2019), "Exploring heterogeneous scheduling for edge computing with CPU and FPGA MPSoCs", Journal of Systems Architecture. Vol. 98, pp. 27-40.

[Abstract] [BibTeX] [DOI] [URL] [PDF]

Abstract: This paper presents a framework targeted to low-cost and low-power
heterogeneous MultiProcessors that exploits FPGAs and multicore CPUs,
with the overarching goal of providing developers with a productive
programming model and runtime support to fully use all the processing
resources available. FPGA productivity is achieved using a high-level
programming model based on OpenCL, the standard for cross-platform
parallel heterogeneous programming. In this work, we focus on the
parallel_for pattern, and as part of the runtime support for this
pattern, we leverage a new scheduler that strives to maximize the
number of iterations per joule by dynamically and adaptively partitioning
the iteration space between the multicore and the accelerator when
working simultaneously. A total of 7 benchmarks are ported and optimized
for a low-cost DE1 board. The results show that the heterogeneous
solution can improve performance up to 2.9× and increases energy
efficiency up to 2.7× compared to the traditional approach of keeping
all the CPU cores idle while the accelerator computes the workload.
Our results also demonstrate two interesting insights: first, an
adaptive scheduler able to find at runtime the right chunk size for
each type of application and device configuration is an essential
component for these kinds of heterogeneous platforms, and second,
device configurations that provide higher throughput do not always
achieve better energy efficiency when only the running power (excluding
the idle power component) is considered.

BibTeX:

@article{Rodriguez2019,
  author = {Andrés Rodríguez and Angeles Navarro and Rafael Asenjo and Francisco Corbera and Rubén Gran and Darío Suárez and Jose Nunez-Yanez},
  title = {Exploring heterogeneous scheduling for edge computing with CPU and FPGA MPSoCs},
  journal = {Journal of Systems Architecture},
  year = {2019},
  volume = {98},
  pages = {27--40},
  url = {http://www.sciencedirect.com/science/article/pii/S1383762119300918},
  doi = {10.1016/j.sysarc.2019.06.006}
}

A. Valero, F. Candel, D. Suárez-Gracia, S. Petit and J. Sahuquillo (2019), "An Aging-Aware GPU Register File Design Based on Data Redundancy", IEEE Transactions on Computers., January, 2019. Vol. 68(1), pp. 4-20.

[BibTeX] [DOI] [PDF]

BibTeX:

@article{Valero2019,
  author = {A. Valero and F. Candel and D. Suárez-Gracia and S. Petit and J. Sahuquillo},
  title = {An Aging-Aware GPU Register File Design Based on Data Redundancy},
  journal = {IEEE Transactions on Computers},
  year = {2019},
  volume = {68},
  number = {1},
  pages = {4--20},
  doi = {10.1109/TC.2018.2849376}
}

F. Candel, S. Petit, A. Valero and J. Sahuquillo (2018), "Improving GPU Cache Hierarchy Performance with a Fetch and Replacement Cache", In European Conference on Parallel Processing. , pp. 235-248.

[BibTeX] [DOI]

BibTeX:

@inproceedings{candel2018improving,
  author = {Candel, Francisco and Petit, Salvador and Valero, Alejandro and Sahuquillo, Julio},
  title = {Improving GPU Cache Hierarchy Performance with a Fetch and Replacement Cache},
  booktitle = {European Conference on Parallel Processing},
  year = {2018},
  pages = {235--248},
  doi = {10.1007/978-3-319-96983-1_17}
}

A. Ferrerón, J. Alastruey-Benedé, D. Suárez-Gracia and U.R. Karpuzcu (2018), "AISC: Approximate Instruction Set Computer", CoRR. Vol. abs/1803.06955

[BibTeX] [URL]

BibTeX:

@article{ferreron2018aisc,
  author = {Ferrerón, Alexandra and Alastruey-Benedé, Jesús and Suárez-Gracia, Darío and Karpuzcu, Ulya R},
  title = {AISC: Approximate Instruction Set Computer},
  journal = {CoRR},
  year = {2018},
  volume = {abs/1803.06955},
  url = {http://arxiv.org/abs/1803.06955}
}

J.M. Herruzo, S. González Navarro, P. Ibáñez, V. Viñals Yufera, J. Alastruey and O. Plata (2018), "Accelerating Sequence Alignments Based on FM-Index Using the Intel KNL Processor", IEEE/ACM Transactions on Computational Biology and Bioinformatics. , pp. 1-1.

[BibTeX] [DOI] [PDF]

BibTeX:

@article{Herruzo2018,
  author = {J. M. Herruzo and S. González Navarro and P. Ibáñez and V. Viñals Yufera and J. Alastruey and O. Plata},
  title = {Accelerating Sequence Alignments Based on FM-Index Using the Intel KNL Processor},
  journal = {IEEE/ACM Transactions on Computational Biology and Bioinformatics},
  year = {2018},
  pages = {1--1},
  doi = {10.1109/TCBB.2018.2884701}
}

J. Nunez-Yanez, S. Amiri, M. Hosseinabady, A. Rodríguez, R. Asenjo, A. Navarro, D. Suarez and R. Gran (2018), "Simultaneous multiprocessing in a software-defined heterogeneous FPGA", The Journal of Supercomputing. , pp. 1-18. Springer.

[BibTeX] [DOI] [URL]

BibTeX:

@article{nunez2018simultaneous,
  author = {Nunez-Yanez, Jose and Amiri, Sam and Hosseinabady, Mohammad and Rodríguez, Andrés and Asenjo, Rafael and Navarro, Angeles and Suarez, Dario and Gran, Ruben},
  title = {Simultaneous multiprocessing in a software-defined heterogeneous FPGA},
  journal = {The Journal of Supercomputing},
  publisher = {Springer},
  year = {2018},
  pages = {1--18},
  url = {https://zaguan.unizar.es/record/70277/files/texto_completo.pdf},
  doi = {10.1007/s11227-018-2367-9}
}

J. Olivito, F. Serrano, J.A. Clemente, H. Mecha and J. Resano (2018), "Analysis of the reconfiguration latency and energy overheads for a Xilinx Virtex-5 field-programmable gate array", IET Computers Digital Techniques. Vol. 12(4), pp. 150-157.

[BibTeX] [DOI] [URL]

BibTeX:

@article{olivito2018analysis,
  author = {J. Olivito and F. Serrano and J. A. Clemente and H. Mecha and J. Resano},
  title = {Analysis of the reconfiguration latency and energy overheads for a Xilinx Virtex-5 field-programmable gate array},
  journal = {IET Computers Digital Techniques},
  year = {2018},
  volume = {12},
  number = {4},
  pages = {150--157},
  url = {https://zaguan.unizar.es/record/69465/files/texto_completo.pdf},
  doi = {10.1049/iet-cdt.2016.0095}
}

R.T. Possignolo, E. Ebrahimi, E.K. Ardestani, A. Sankaranarayanan, J.L. Briz and J. Renau (2018), "GPU NTC process variation compensation with voltage stacking", IEEE Transactions on Very Large Scale Integration (VLSI) Systems. Vol. 26(9), pp. 1713-1726. IEEE.

[BibTeX] [DOI] [URL]

BibTeX:

@article{possignolo2018gpu,
  author = {Possignolo, Rafael Trapani and Ebrahimi, Elnaz and Ardestani, Ehsan Khish and Sankaranarayanan, Alamelu and Briz, Jose Luis and Renau, Jose},
  title = {GPU NTC process variation compensation with voltage stacking},
  journal = {IEEE Transactions on Very Large Scale Integration (VLSI) Systems},
  publisher = {IEEE},
  year = {2018},
  volume = {26},
  number = {9},
  pages = {1713--1726},
  url = {https://users.soe.ucsc.edu/~renau/docs/tvlsi18.pdf},
  doi = {10.1109/TVLSI.2018.2831665}
}

L. Rubio-Anguiano, G. Desirena-López, A. Ramírez-Treviño and J. Briz (2018), "Energy-Efficient Thermal-Aware Scheduling for RT Tasks Using TCPN", IFAC-PapersOnLine. Vol. 51(7), pp. 236-242. Elsevier.

[BibTeX] [DOI] [URL]

BibTeX:

@article{rubio2018energy,
  author = {Rubio-Anguiano, L and Desirena-López, G and Ramírez-Treviño, A and Briz, JL},
  title = {Energy-Efficient Thermal-Aware Scheduling for RT Tasks Using TCPN},
  journal = {IFAC-PapersOnLine},
  publisher = {Elsevier},
  year = {2018},
  volume = {51},
  number = {7},
  pages = {236--242},
  url = {http://webdiis.unizar.es/~briz/papers/WODES_2018.pdf},
  doi = {10.1016/j.ifacol.2018.06.307}
}

F. Candel, A. Valero, S. Petit, D. Suárez-Gracia and J. Sahuquillo (2017), "Exploiting Data Compression to Mitigate Aging in GPU Register Files", In 2017 29th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD). , pp. 57-64.

[BibTeX] [PDF]

BibTeX:

@inproceedings{Candel2017,
  author = {Candel, Francisco and Valero, Alejandro and Petit, Salvador and Suárez-Gracia, Darío and Sahuquillo, Julio},
  title = {Exploiting Data Compression to Mitigate Aging in GPU Register Files},
  booktitle = {2017 29th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)},
  year = {2017},
  pages = {57--64}
}

J. Díaz Maag, P.E. Ibáñez Marín, T. Monreal Arnal, V. Viñals Yúfera and J.M. Llaberia Griñó (2017), "ReD: A policy based on reuse detection for demanding block selection in last-level Caches", In The Second Cache Replacement Championship: workshop schedule. , pp. 1-4.

[BibTeX] [PDF]

BibTeX:

@inproceedings{DiazMaag2017,
  author = {Díaz Maag, Javier and Ibáñez Marín, Pablo Enrique and Monreal Arnal, Teresa and Viñals Yúfera, Víctor and Llaberia Griñó, José M},
  title = {ReD: A policy based on reuse detection for demanding block selection in last-level Caches},
  booktitle = {The Second Cache Replacement Championship: workshop schedule},
  year = {2017},
  pages = {1--4}
}

J. Olivito, J. Resano and J.L. Briz (2017), "Accelerating board games through hardware/software codesign", IEEE Transactions on Computational Intelligence and AI in Games. Vol. 9(4), pp. 393-401. IEEE.

[BibTeX] [PDF]

BibTeX:

@article{Olivito2016,
  author = {Olivito, Javier and Resano, Javier and Briz, José Luis},
  title = {Accelerating board games through hardware/software codesign},
  journal = {IEEE Transactions on Computational Intelligence and AI in Games},
  publisher = {IEEE},
  year = {2017},
  volume = {9},
  number = {4},
  pages = {393--401}
}

M. Ortín-Obón, M. Tala, L. Ramini, V. Viñals-Yufera and D. Bertozzi (2017), "Contrasting laser power requirements of wavelength-routed optical NoC topologies subject to the floorplanning, placement, and routing constraints of a 3-D-stacked system", IEEE Transactions on Very Large Scale Integration (VLSI) Systems. Vol. 25(7), pp. 2081-2094. IEEE.

[BibTeX] [PDF]

BibTeX:

@article{OrtinObon2017,
  author = {Ortín-Obón, Marta and Tala, Mahdi and Ramini, Luca and Viñals-Yufera, Víctor and Bertozzi, Davide},
  title = {Contrasting laser power requirements of wavelength-routed optical NoC topologies subject to the floorplanning, placement, and routing constraints of a 3-D-stacked system},
  journal = {IEEE Transactions on Very Large Scale Integration (VLSI) Systems},
  publisher = {IEEE},
  year = {2017},
  volume = {25},
  number = {7},
  pages = {2081--2094}
}

R. Rodríguez-Rodríguez, J. Díaz, F. Castro, P. Ibáñez, D. Chaver, V. Viñals, J.C. Saez, M. Prieto-Matías, L. Piñuel, T. Monreal and others (2017), "Reuse detector: Improving the management of STT-RAM SLLCs", The Computer Journal. Vol. 61(6), pp. 856-880. Oxford University Press.

[BibTeX] [PDF]

BibTeX:

@article{RodriguezRodriguez2017,
  author = {Rodríguez-Rodríguez, Roberto and Díaz, Javier and Castro, Fernando and Ibáñez, Pablo and Chaver, Daniel and Viñals, Víctor and Saez, Juan Carlos and Prieto-Matías, Manuel and Piñuel, Luis and Monreal, T and others},
  title = {Reuse detector: Improving the management of STT-RAM SLLCs},
  journal = {The Computer Journal},
  publisher = {Oxford University Press},
  year = {2017},
  volume = {61},
  number = {6},
  pages = {856--880}
}

A. Valero, N. Miralaei, S. Petit, J. Sahuquillo and T.M. Jones (2017), "On microarchitectural mechanisms for cache wearout reduction", IEEE Transactions on Very Large Scale Integration (VLSI) Systems. Vol. 25(3), pp. 857-871. IEEE.

[BibTeX] [PDF]

BibTeX:

@article{Valero2016,
  author = {A. Valero and N. Miralaei and S. Petit and J. Sahuquillo and T. M. Jones},
  title = {On microarchitectural mechanisms for cache wearout reduction},
  journal = {IEEE Transactions on Very Large Scale Integration (VLSI) Systems},
  publisher = {IEEE},
  year = {2017},
  volume = {25},
  number = {3},
  pages = {857--871}
}

J.A. Clemente, R. Gran, A. Chocano, C. del Prado and J. Resano (2016), "Hardware Architectural Support for Caching Partitioned Reconfigurations in Reconfigurable Systems", IEEE Transactions on Very Large Scale Integration (VLSI) Systems., February, 2016. Vol. 24(2), pp. 530-543.

[Abstract] [BibTeX] [DOI]

Abstract: The efficiency of the reconfiguration process in modern field-programmable
gate arrays (FPGAs) can improve drastically if an on-chip configuration
memory is included in the system, because it can reduce both the
reconfiguration latency and its energy consumption. However, the
FPGA on-chip memory resources are very limited. Thus, it is very
important to manage them effectively in order to improve the reconfiguration
process as much as possible, even when the size of the on-chip configuration
memory is small. This paper presents a hardware implementation of
an on-chip configuration memory controller that efficiently manages
run-time reconfigurations. In order to optimize the use of the on-chip
memory, this controller includes support to deal with configurations
that have been divided into blocks of customizable size. When a reconfiguration
must be carried out, our controller provides the blocks stored on-chip
and looks for the remaining blocks by accessing to the off-chip configuration
memory. Moreover, it dynamically decides which blocks must be stored
on-chip. To this end, the designed controller implements a simple
but efficient technique that allows maximizing the benefits of the
on-chip memories. Experimental results will demonstrate that its
implementation cost is very affordable and that it introduces negligible
run-time management overheads.

BibTeX:

@article{7087395,
  author = {J. A. Clemente and R. Gran and A. Chocano and C. del Prado and J. Resano},
  title = {Hardware Architectural Support for Caching Partitioned Reconfigurations in Reconfigurable Systems},
  journal = {IEEE Transactions on Very Large Scale Integration (VLSI) Systems},
  year = {2016},
  volume = {24},
  number = {2},
  pages = {530--543},
  doi = {10.1109/TVLSI.2015.2417595}
}

A. Ferrerón-Labari, D. Suárez-Gracia, J. Alastruey-Benedé, T. Monreal-Arnal and P. Ibáñez (2016), "Concertina: Squeezing in Cache Content to Operate at Near-Threshold Voltage", IEEE Trans. Computers. Vol. 65(3), pp. 755-769. IEEE.

[Abstract] [BibTeX] [DOI] [URL]

Abstract: Scaling supply voltage to values near the threshold voltage allows
a dramatic decrease in the power consumption of processors; however,
the lower the voltage, the higher the sensitivity to process variation,
and, hence, the lower the reliability. Large SRAM structures, like
the last-level cache (LLC), are extremely vulnerable to process variation
because they are aggressively sized to satisfy high density requirements.
In this paper, we propose Concertina, an LLC designed to enable reliable
operation at low voltages with conventional SRAM cells. Based on
the observation that for many applications the LLC contains large
amounts of null data, Concertina compresses cache blocks in order
that they can be allocated to cache entries with faulty cells, enabling
use of 100 percent of the LLC capacity. To distribute blocks among
cache entries, Concertina implements a compression- and fault-aware
insertion/replacement policy that reduces the LLC miss rate. Concertina
reaches the performance of an ideal system implementing an LLC that
does not suffer from parameter variation with a modest storage overhead.
Specifically, performance degrades by less than 2 percent, even when
using small SRAM cells, which implies over 90 percent of cache entries
having defective cells, and this represents a notable improvement
on previously proposed techniques.

BibTeX:

@article{Ferreron-Labari2016,
  author = {Alexandra Ferrerón-Labari and Darío Suárez-Gracia and Jesús Alastruey-Benedé and Teresa Monreal-Arnal and Pablo Ibáñez},
  title = {Concertina: Squeezing in Cache Content to Operate at Near-Threshold Voltage},
  journal = {IEEE Trans. Computers},
  publisher = {IEEE},
  year = {2016},
  volume = {65},
  number = {3},
  pages = {755--769},
  url = {http://dx.doi.org/10.1109/TC.2015.2479585},
  doi = {10.1109/TC.2015.2479585}
}

M. Ortín-Obón, D. Suárez-Gracia, M. Villarroya-Gaudó, C. Izu and V. Viñals-Yúfera (2016), "Analysis of network-on-chip topologies for cost-efficient chip multiprocessors", Microprocessors and Microsystems.

[Abstract] [BibTeX] [DOI]

Abstract: As chip multiprocessors accommodate a growing number of cores,
they demand interconnection networks that simultaneously provide
low latency, high bandwidth, and low power. Our goal is to provide
a comprehensive study of the interactions between the interconnection
network and the memory hierarchy to enable a better co-design of
both components. We explore the implications of the interconnect
choice on overall performance by comparing the behaviour of three
topologies (mesh, torus, and ring) and their concentrated versions.
Simply choosing the concentrated mesh over the ring improves performance
by over 40% in a 64-core chip. The key strength of this work is the
holistic analysis of the network-on-chip and the memory hierarchy.
Experiments are carried out with a full-system simulator that carefully
models the processors (single and multithreaded), memory hierarchy,
and interconnection network, and executes realistic parallel and
multiprogrammed workloads. We corroborate conclusions from several
previous works: network diameter is critical, the concentrated mesh
offers the best area-energy-delay trade-off, and traffic is very
light and highly unbalanced. We also provide interesting insights
about application-specific features that are hidden when studying
only average results. We include a fairness analysis for multiprogrammed
applications, and refute the idea of the memory controller placement
greatly affecting performance.

BibTeX:

@article{OrtínObón2016,
  author = {Marta Ortín-Obón and Darío Suárez-Gracia and María Villarroya-Gaudó and Cruz Izu and Víctor Viñals-Yúfera},
  title = {Analysis of network-on-chip topologies for cost-efficient chip multiprocessors},
  journal = {Microprocessors and Microsystems},
  year = {2016},
  doi = {10.1016/j.micpro.2016.01.005}
}

A. Vilches, A. Navarro, R. Asenjo, F. Corbera, R. Gran and M.J. Garzarán (2016), "Mapping Streaming Applications on Commodity Multi-CPU and GPU On-Chip Processors", IEEE Transactions on Parallel and Distributed Systems., April, 2016. Vol. 27(4), pp. 1099-1115.

[Abstract] [BibTeX] [DOI]

Abstract: In this paper, we consider the problem of efficiently executing streaming
applications on commodity processors composed of several cores and
an on-chip GPU. Streaming applications, such as those in vision and
video analytics, consist of a pipeline of stages and are good candidates
to take advantage of this type of platform. We also consider that
characteristics of the input may change while the application is
running. Therefore, we propose a framework that adaptively finds
the optimal mapping of the pipeline stages. The core of the framework
is an analytical model coupled with information collected at runtime
used to dynamically map each pipeline stage to the most efficient
device, taking into consideration both performance and energy. Our
experimental results show that for the evaluated applications running
on two different architectures, our model always predicts the best
configuration among the evaluated alternatives, and significantly
reduces the amount of information that needs to be collected at runtime.
This best configuration has, on the average, 20 percent higher throughput
than the configuration recommended by a baseline state of the art
approach, while the ratio throughput/energy is 43 percent higher.
We have measured improvements in throughput and throughput/energy
of up to 81 and 204 percent, respectively, when the model is used
to adapt to a video that changes from low to high definition.

BibTeX:

@article{Vilches2016,
  author = {A. Vilches and A. Navarro and R. Asenjo and F. Corbera and R. Gran and M. J. Garzarán},
  title = {Mapping Streaming Applications on Commodity Multi-CPU and GPU On-Chip Processors},
  journal = {IEEE Transactions on Parallel and Distributed Systems},
  year = {2016},
  volume = {27},
  number = {4},
  pages = {1099--1115},
  doi = {10.1109/TPDS.2015.2432809}
}

A. Ferreron, R. Jagtap and R. Rusitoru (2016), "Identifying representative regions of parallel HPC applications: a cross-architectural evaluation", In 2016 IEEE International Symposium on Workload Characterization (IISWC)., September, 2016. , pp. 1-2.

[BibTeX] [DOI]

BibTeX:

@inproceedings{Xandra2017,
  author = {A. Ferreron and R. Jagtap and R. Rusitoru},
  title = {Identifying representative regions of parallel HPC applications: a cross-architectural evaluation},
  booktitle = {2016 IEEE International Symposium on Workload Characterization (IISWC)},
  year = {2016},
  pages = {1--2},
  doi = {10.1109/IISWC.2016.7581284}
}

S. Ghike, R. Gran, M.J. Garzarán and D. Padua (2015), "Languages and Compilers for Parallel Computing: 27th International Workshop, LCPC 2014, Hillsboro, OR, USA, September 15-17, 2014, Revised Selected Papers" Cham , pp. 19-35. Springer International Publishing.

[BibTeX] [DOI] [URL]

BibTeX:

@inbook{Ghike2015,
  author = {Ghike, Swapnil and Gran, Rubén and Garzarán, María J. and Padua, David},
  editor = {Brodman, James and Tu, Peng},
  title = {Languages and Compilers for Parallel Computing: 27th International Workshop, LCPC 2014, Hillsboro, OR, USA, September 15-17, 2014, Revised Selected Papers},
  publisher = {Springer International Publishing},
  year = {2015},
  pages = {19--35},
  url = {http://dx.doi.org/10.1007/978-3-319-17473-0_2},
  doi = {10.1007/978-3-319-17473-0_2}
}

R. Gran, J. Segarra, A. Pedro-Zapater, L.C. Aparicio, V. Viñals and C. Rodríguez (2015), "A predictable hardware to exploit temporal reuse in real-time and embedded systems", Journal of Systems Architecture., May, 2015. Vol. 61(5-6), pp. 227-238. Elsevier.

[Abstract] [BibTeX] [DOI] [PDF]

Abstract: In this paper we propose a new hardware data cache (FAFB, fully-associative
FIFO tagged buffers) to complement the data cache in processors.
It provides predictability when exploiting temporal reuse in array
data structures, i.e. it allows an accurate WCET analysis, which
is required in real-time systems. With our hardware proposal, compiler
transformations that exploit such reuse (essentially tiling) can
be safely applied. Moreover, our proposal has other features of particular
interest to embedded systems, where a set of well-tuned applications
run in a hardware platform which may be constrained in size, complexity
and energy consumption. In order to test the most uncommon features
of the FAFBs (predictability and effectiveness with a small size),
we perform a worst-case analysis on several kernel algorithms for
embedded and real-time computing, showing the interaction between
tiling and our hardware architecture. Our results show that the number
of data cache misses is reduced between 1.3 and 19 times on such
algorithms.

BibTeX:

@article{Gran15predictable,
  author = {Rubén Gran and Juan Segarra and Alba Pedro-Zapater and Luis C. Aparicio and Víctor Viñals and Clemente Rodríguez},
  title = {A predictable hardware to exploit temporal reuse in real-time and embedded systems},
  journal = {Journal of Systems Architecture},
  publisher = {Elsevier},
  year = {2015},
  volume = {61},
  number = {5-6},
  pages = {227--238},
  doi = {10.1016/j.sysarc.2015.05.001}
}

J. Olivito, R. Gran, J. Resano, C. González and E. Torres (2015), "Performance and energy efficiency analysis of a Reversi player for FPGAs and General Purpose Processors", Microprocessors and Microsystems. Vol. 39(2), pp. 64-73.

[Abstract] [BibTeX] [DOI] [URL]

Abstract: Board-game applications are frequently found in mobile devices
where the computing performance and the energy budget are constrained.
Since the Artificial Intelligence techniques applied in these games
are computationally intensive, the applications developed for mobile
systems are frequently simplistic, far from the level of equivalent
applications developed for desktop computers. Currently board games
are software applications executed on General Purpose Processors.
However, they exhibit a medium degree of parallelism and a custom
hardware accelerator implemented on an FPGA can take advantage of
that. We have selected the well-known Reversi game as a case study
because it is a very popular board game with simple rules but huge
computational demands. We developed and optimized software and hardware
designs for this game that apply the same classical Artificial Intelligence
techniques. The applications have been executed on different representative
platforms and the results demonstrate that the FPGA implementations
provide better performance, lower power consumption and, therefore,
impressive energy savings. These results demonstrate that FPGAs can
efficiently deal with this kind of problem.

BibTeX:

@article{Olivito2015,
  author = {Javier Olivito and Rubén Gran and Javier Resano and Carlos González and Enrique Torres},
  title = {Performance and energy efficiency analysis of a Reversi player for FPGAs and General Purpose Processors},
  journal = {Microprocessors and Microsystems},
  year = {2015},
  volume = {39},
  number = {2},
  pages = {64--73},
  url = {http://www.sciencedirect.com/science/article/pii/S0141933115000022},
  doi = {10.1016/j.micpro.2015.01.001}
}

M. Ortin, L. Ramini, M. Balboni, L. Zuolo, N. Maddalena, V. Vinals and D. Bertozzi (2015), "Partitioning Strategies of Wavelength-Routed Optical Networks-on-Chip for Laser Power Minimization", In Exploiting Silicon Photonics for Energy-Efficient High Performance Computing (SiPhotonics), 2015 Workshop on., January, 2015. , pp. 17-24.

[BibTeX] [DOI]

BibTeX:

@inproceedings{ortin2015partitioning,
  author = {Ortin, M. and Ramini, L. and Balboni, M. and Zuolo, L. and Maddalena, N. and Vinals, V. and Bertozzi, D.},
  title = {Partitioning Strategies of Wavelength-Routed Optical Networks-on-Chip for Laser Power Minimization},
  booktitle = {Exploiting Silicon Photonics for Energy-Efficient High Performance Computing (SiPhotonics), 2015 Workshop on},
  year = {2015},
  pages = {17--24},
  doi = {10.1109/SiPhotonics.2015.13}
}

J. Segarra, C. Rodríguez, R. Gran, L.C. Aparicio and V. Viñals (2015), "ACDC: Small, Predictable and High-Performance Data Cache", ACM Trans. Embed. Comput. Syst.. New York, NY, USA, February, 2015. Vol. 14(2), pp. 38:1-38:26. ACM.

[BibTeX] [DOI] [PDF]

BibTeX:

@article{Segarra15ACDC,
  author = {Segarra, Juan and Rodríguez, Clemente and Gran, Rubén and Aparicio, Luis C. and Viñals, Víctor},
  title = {ACDC: Small, Predictable and High-Performance Data Cache},
  journal = {ACM Trans. Embed. Comput. Syst.},
  publisher = {ACM},
  year = {2015},
  volume = {14},
  number = {2},
  pages = {38:1--38:26},
  doi = {10.1145/2677093}
}

A. Vilches, R. Asenjo, A. Navarro, F. Corbera, R. Gran and M. Garzarán (2015), "Adaptive Partitioning for Irregular Applications on Heterogeneous CPU-GPU Chips", Procedia Computer Science. Vol. 51, pp. 140-149.

[Abstract] [BibTeX] [DOI] [URL]

Abstract: Commodity processors are comprised of several CPU cores and
one integrated GPU. To fully exploit this type of architecture,
one needs to automatically determine how to partition the workload
between both devices. This is especially challenging for irregular
workloads, where each iteration's work is data dependent and shows
control and memory divergence. In this paper, we present a novel
adaptive partitioning strategy specially designed for irregular applications
running on heterogeneous CPU-GPU chips. The main novelty of this
work is that the size of the workload assigned to the GPU and CPU
adapts dynamically to maximize the GPU and CPU utilization while
balancing the workload among the devices. Our experimental results
on an Intel Haswell architecture using a set of irregular benchmarks
show that our approach outperforms exhaustive static and adaptive
state-of-the-art approaches in terms of performance and energy consumption.

BibTeX:

@article{Vilches2015,
  author = {Antonio Vilches and Rafael Asenjo and Angeles Navarro and Francisco Corbera and Rubén Gran and María Garzarán},
  title = {Adaptive Partitioning for Irregular Applications on Heterogeneous CPU-GPU Chips},
  journal = {Procedia Computer Science},
  year = {2015},
  volume = {51},
  pages = {140--149},
  note = {International Conference On Computational Science, ICCS 2015: Computational Science at the Gates of Nature},
  url = {http://www.sciencedirect.com/science/article/pii/S1877050915010212},
  doi = {10.1016/j.procs.2015.05.213}
}

M. Balboni, M. Ortin Obon, A. Capotondi, H. Fankem Tatenguem, A. Ghiribaldi, L. Ramini, V. Vinal, A. Marongiu and D. Bertozzi (2014), "Augmenting manycore programmable accelerators with photonic interconnect technology for the high-end embedded computing domain", In Networks-on-Chip (NoCS), 2014 Eighth IEEE/ACM International Symposium on., September, 2014. , pp. 72-79.

[Abstract] [BibTeX] [DOI]

Abstract: There is today consensus on the fact that optical interconnects can
relieve bandwidth density concerns at integrated circuit boundaries.
However, when it comes to the extension of this emerging interconnect
technology to on-chip communication as well, such consensus seems
to fall apart. The main reason consists of a fundamental lack of
compelling cases proving the superior performance and/or energy properties
yielded by devices of practical interest, when re-architected around
a photonically-integrated communication fabric. This paper takes
its steps from the consideration that manycore computing platforms
are gaining momentum in the high-end embedded computing domain in
the form of general-purpose programmable accelerators. Hence, the
performance and energy implications when augmenting these devices
with optical interconnect technology are derived by means of an accurate
benchmarking framework against an aggressively optimized electrical
counterpart.

BibTeX:

@inproceedings{7008764,
  author = {Balboni, M. and Ortin Obon, M. and Capotondi, A. and Fankem Tatenguem, H. and Ghiribaldi, A. and Ramini, L. and Vinal, V. and Marongiu, A. and Bertozzi, D.},
  title = {Augmenting manycore programmable accelerators with photonic interconnect technology for the high-end embedded computing domain},
  booktitle = {Networks-on-Chip (NoCS), 2014 Eighth IEEE/ACM International Symposium on},
  year = {2014},
  pages = {72--79},
  doi = {10.1109/NOCS.2014.7008764}
}

J.A. Clemente, J. Resano and D. Mozos (2014), "An Approach to Manage Reconfigurations and Reduce Area Cost in Hard Real-Time Reconfigurable Systems", ACM Transactions on Embedded Computing Systems., February, 2014. Vol. 13-4(4)

[BibTeX]

BibTeX:

@article{Clemente2014ManageReconfigurations,
  author = {Juan Antonio Clemente and Javier Resano and Daniel Mozos},
  title = {An Approach to Manage Reconfigurations and Reduce Area Cost in Hard Real-Time Reconfigurable Systems},
  journal = {ACM Transactions on Embedded Computing Systems},
  year = {2014},
  volume = {13-4},
  number = {4}
}

A. Ferreron, D. Suarez, J. Alastruey, T. Monreal and V. Viñals (2014), "Block Disabling Characterization and Improvements in CMPs Operating at Ultra-low Voltages", In 26th Int. Symp. on Computer Architecture and High Performance Computing (SBAC-PAD 2014).

[Abstract] [BibTeX] [PDF]

Abstract: Power density has become the limiting factor in technology scaling
as power budget limits the amount of hardware that can be active
at the same time. Reducing supply voltage to ultra-low voltage ranges
close to the threshold region has the promise of great energy savings.
However, the potential savings of voltage scaling become limited
by the correct operation of SRAM cells, which is not guaranteed below
Vddmin, the minimum voltage in which cache structures operate reliably.
Understanding the effects of operating below Vddmin requires complex
modeling, so we introduce an updated probability failure model of
SRAM cells at 22nm and explore the reliability impact of lowering
the chip voltage supply below Vddmin in shared-memory coherent chip-multiprocessors
(CMP) running a variety of parallel workloads. A microarchitectural
technique to cope with cache reliability at ultra-low voltages is
block disabling; however, in many cases, the savings in on-chip caches
do not compensate for the consumption in the rest of the system,
as the consumption increase of the off-chip memory may offset the
on-chip gain. We make the case that existing coherence mechanisms
can provide the substrate to improve energy savings with block disabling
and propose two low-complexity techniques. Taking the best of both
techniques we can scale voltage below Vddmin and reduce system energy
up to 39%, and system energy-delay up to 10%. Besides, by lowering
the CMP consumption in a power-constrained scenario, we could activate
offline cores, reaching a potential speedup between 3.7 and 4.4

BibTeX:

@inproceedings{Ferreron2014,
  author = {Alexandra Ferreron and Dario Suarez and Jesus Alastruey and Teresa Monreal and Victor Viñals},
  title = {Block Disabling Characterization and Improvements in CMPs Operating at Ultra-low Voltages},
  booktitle = {26th Int. Symp. on Computer Architecture and High Performance Computing (SBAC-PAD 2014)},
  year = {2014}
}

R. Gran, A. Shi, E. Totoni and M.J. Garzarán (2014), "Evaluation of a Feature Tracking Vision Application on a Heterogeneous Chip", In Computer Architecture and High Performance Computing (SBAC-PAD), 2014 IEEE 26th International Symposium on., October, 2014. , pp. 246-253.

[Abstract] [BibTeX] [DOI]

Abstract: Consumers of personal devices such as desktops, tablets, or smart
phones run applications based on image or video processing, as they
enable a natural computer-user interaction. The challenge with these
computationally demanding applications is to execute them efficiently.
One way to address this problem is to use on-chip heterogeneous systems,
where tasks can execute in the device where they run more efficiently.
In this paper, we discuss the optimization of a feature tracking
application, written in OpenCL, when running on an on-chip heterogeneous
platform. Our results show that OpenCL can facilitate programming
of these heterogeneous systems because it provides a unified programming
paradigm and at the same time can deliver significant performance
improvements. We show that, after optimization, our feature tracking
application runs 3.2, 2.6, and 4.3 times faster and consumes 2.2,
3.1, and 2.7 times less energy when running on the multicore, the
GPU, or both the CPU and the GPU of an Intel i7, respectively.

BibTeX:

@inproceedings{Gran2014,
  author = {R. Gran and A. Shi and E. Totoni and M. J. Garzarán},
  title = {Evaluation of a Feature Tracking Vision Application on a Heterogeneous Chip},
  booktitle = {Computer Architecture and High Performance Computing (SBAC-PAD), 2014 IEEE 26th International Symposium on},
  year = {2014},
  pages = {246--253},
  doi = {10.1109/SBAC-PAD.2014.45}
}

M. Ortín, D. Suárez, M. Villarroya, C. Izu and V. Viñals (2014), "Dynamic Construction of Circuits for Reactive Traffic in Homogeneous CMPs", In Proceedings of the Conference on Design, Automation & Test in Europe (DATE 2014). 3001 Leuven, Belgium, pp. 241:1-241:4. European Design and Automation Association.

[BibTeX] [URL] [PDF]

BibTeX:

@inproceedings{Ortin2014,
  author = {Ortín, Marta and Suárez, Darío and Villarroya, María and Izu, Cruz and Viñals, Victor},
  title = {Dynamic Construction of Circuits for Reactive Traffic in Homogeneous CMPs},
  booktitle = {Proceedings of the Conference on Design, Automation \& Test in Europe (DATE 2014)},
  publisher = {European Design and Automation Association},
  year = {2014},
  pages = {241:1--241:4},
  url = {http://dl.acm.org/citation.cfm?id=2616606.2616901}
}

M. Ortín-Obón, L. Ramini, V. Viñals and D. Bertozzi (2014), "Capturing the sensitivity of optical network quality metrics to its network interface parameters", Concurrency and Computation: Practice and Experience. Vol. 26(15), pp. 2504-2517.

[BibTeX] [DOI] [URL]

BibTeX:

@article{Ortín2014NI,
  author = {Marta Ortín-Obón and Luca Ramini and Víctor Viñals and Davide Bertozzi},
  title = {Capturing the sensitivity of optical network quality metrics to its network interface parameters},
  journal = {Concurrency and Computation: Practice and Experience},
  year = {2014},
  volume = {26},
  number = {15},
  pages = {2504--2517},
  url = {http://dx.doi.org/10.1002/cpe.3330},
  doi = {10.1002/cpe.3330}
}

M. Ortín-Obón, L. Ramini, H. Tatenguem Fankem, V. Viñals and D. Bertozzi (2014), "A Complete Electronic Network Interface Architecture for Global Contention-free Communication over Emerging Optical Networks-on-chip", In Proceedings of the 24th Edition of the Great Lakes Symposium on VLSI. New York, NY, USA, pp. 267-272. ACM.

[BibTeX] [DOI] [URL]

BibTeX:

@inproceedings{Ortin-Obon2015CompleteNI,
  author = {Ortín-Obón, Marta and Ramini, Luca and Tatenguem Fankem, Herve and Viñals, Víctor and Bertozzi, Davide},
  title = {A Complete Electronic Network Interface Architecture for Global Contention-free Communication over Emerging Optical Networks-on-chip},
  booktitle = {Proceedings of the 24th Edition of the Great Lakes Symposium on VLSI},
  publisher = {ACM},
  year = {2014},
  pages = {267--272},
  url = {http://doi.acm.org/10.1145/2591513.2591536},
  doi = {10.1145/2591513.2591536}
}

X. Qian, B. Sahelices and J. Torrellas (2014), "OmniOrder: Directory-based conflict serialization of transactions", In Computer Architecture (ISCA), 2014 ACM/IEEE 41st International Symposium on. , pp. 421-432.

[Abstract] [BibTeX] [DOI] [URL]

Abstract: Effective execution of atomic blocks of instructions (also called
transactions) can enhance the performance and programmability of
multiprocessors. Atomic blocks can be demarcated in software as in
Transactional Memory (TM) or dynamically generated by the hardware
as in aggressive implementations of strict memory consistency. In
most current designs, when two atomic blocks conflict, one is squashed
- a performance loss that is often unnecessary. To avoid this waste,
this paper presents OmniOrder, the first design that efficiently
executes conflicting atomic blocks concurrently in a directory-based
coherence environment. The idea is to keep only non-speculative data
in the caches and, when the cache coherence protocol transfers a
line, include in the message the history of speculative updates to
the line. The coherence protocol transitions are unmodified. We evaluate
OmniOrder with 64-core simulations. In a TM environment, OmniOrder
reduces the execution time of the STAMP applications by an average
of 18.4% over a scheme that squashes on conflict. In an environment
with SC enforcement with speculation, we run 11 programs that implement
concurrent algorithms. OmniOrder reduces the programs' execution
time by an average of 15.3% relative to a scheme that squashes on
conflict. Finally, OmniOrder's communication overhead of transferring
the history of speculative updates is negligible.

BibTeX:

@inproceedings{Qian2014,
  author = {Xuehai Qian and Sahelices, B. and Torrellas, J.},
  title = {OmniOrder: Directory-based conflict serialization of transactions},
  booktitle = {Computer Architecture (ISCA), 2014 ACM/IEEE 41st International Symposium on},
  year = {2014},
  pages = {421--432},
  url = {http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=6853223},
  doi = {10.1109/ISCA.2014.6853223}
}

X. Qian, B. Sahelices and D. Qian (2014), "Pacifier: Record and replay for relaxed-consistency multiprocessors with distributed directory protocol", In Computer Architecture (ISCA), 2014 ACM/IEEE 41st International Symposium on. , pp. 433-444.

[Abstract] [BibTeX] [DOI] [URL]

Abstract: Record and Deterministic Replay (R&R) of multithreaded programs on
relaxed-consistency multiprocessors with distributed directory protocol
has been a long-standing open problem. The independently developed
RelaxReplay [8] solves the problem by assuming write atomicity. This
paper proposes Pacifier, the first R&R scheme to provide a solution
without assuming write atomicity. R&R for relaxed-consistency multiprocessors
needs to detect, record and replay Sequential Consistency Violations
(SCV). Pacifier has two key components: (i) Relog, a general memory
reordering logging and replay mechanism that can reproduce SCVs in
relaxed memory models, and (ii) Granule, an SCV detection scheme
in the record phase with good precision, that indicates whether to
record with Relog. We show that Pacifier is a sweet spot in the design
space with a reasonable trade-off between hardware and log overhead.
An evaluation with simulations of 16, 32 and 64 processors with Release
Consistency (RC) running SPLASH-2 applications indicates that Pacifier
incurs 3.9% to 16% larger logs. The slowdown of Pacifier during replay
is 10.1% to 30.5% compared to native execution.

BibTeX:

@inproceedings{Qian2014a,
  author = {Xuehai Qian and Sahelices, B. and Depei Qian},
  title = {Pacifier: Record and replay for relaxed-consistency multiprocessors with distributed directory protocol},
  booktitle = {Computer Architecture (ISCA), 2014 ACM/IEEE 41st International Symposium on},
  year = {2014},
  pages = {433--444},
  url = {http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=6853225},
  doi = {10.1109/ISCA.2014.6853225}
}

L. Ramini, H.T. Fankem, A. Ghiribaldi, P. Grani, M. Ortín-Obón, A. Boos and S. Bartolini (2014), "Towards compelling cases for the viability of silicon-nanophotonic technology in future manycore systems", In Eighth IEEE/ACM International Symposium on Networks-on-Chip, NoCS 2014, Ferrara, Italy, September 17-19, 2014. , pp. 170-171.

[BibTeX] [DOI] [URL]

BibTeX:

@inproceedings{Ramini2014,
  author = {Luca Ramini and Hervé Tatenguem Fankem and Alberto Ghiribaldi and Paolo Grani and Marta Ortín-Obón and Anja Boos and Sandro Bartolini},
  title = {Towards compelling cases for the viability of silicon-nanophotonic technology in future manycore systems},
  booktitle = {Eighth IEEE/ACM International Symposium on Networks-on-Chip, NoCS 2014, Ferrara, Italy, September 17-19, 2014},
  year = {2014},
  pages = {170--171},
  url = {http://dx.doi.org/10.1109/NOCS.2014.7008778},
  doi = {10.1109/NOCS.2014.7008778}
}

M. Villarroya-Gaudó, S. Baldassarri, M. Lozano, R. Trillo, A.C. Murillo and P. Garrido (2014), "Girls' Day Experience at the University of Zaragoza: Attracting Women to Technology", In Proceedings of the XV International Conference on Human Computer Interaction. New York, NY, USA , pp. 79:1-79:8. ACM.

[BibTeX] [DOI] [URL] [PDF]

BibTeX:

@inproceedings{Villarroya-Gaudo:2014:GDE:2662253.2662332,
  author = {Villarroya-Gaudó, Maria and Baldassarri, Sandra and Lozano, Mayte and Trillo, Raquel and Murillo, Ana C. and Garrido, Piedad},
  title = {Girls' Day Experience at the University of Zaragoza: Attracting Women to Technology},
  booktitle = {Proceedings of the XV International Conference on Human Computer Interaction},
  publisher = {ACM},
  year = {2014},
  pages = {79:1--79:8},
  url = {http://doi.acm.org/10.1145/2662253.2662332},
  doi = {10.1145/2662253.2662332}
}

J. Albericio, P. Ibáñez, V. Viñals and J.M. Llabería (2013), "Exploiting reuse locality on inclusive shared last-level caches", ACM Transactions on Architecture and Code Optimization., January, 2013. Vol. 9-4, pp. 38:1-38:19. ACM.

[BibTeX] [DOI]

BibTeX:

@article{Albericio:2013:ERL:2400682.2400697,
  author = {Albericio, Jorge and Ibáñez, Pablo and Viñals, Víctor and Llabería, Jose María},
  title = {Exploiting reuse locality on inclusive shared last-level caches},
  journal = {ACM Transactions on Architecture and Code Optimization},
  publisher = {ACM},
  year = {2013},
  volume = {9-4},
  pages = {38:1--38:19},
  doi = {10.1145/2400682.2400697}
}

A. Ferrerón-Labari, M. Ortín-Obón, D. Suárez-Gracia, J. Alastruey and V. Viñals-Yúfera (2013), "Shrinking L1 Instruction Caches to Improve Energy-Delay in SMT Embedded Processors", In Proceedings of the 26th International Conference on Architecture of Computing Systems (ARCS 2013)., February, 2013. , pp. 256-267. Springer Berlin / Heidelberg.

[BibTeX]

BibTeX:

@inproceedings{Ferreron2013,
  author = {Alexandra Ferrerón-Labari and Marta Ortín-Obón and Darío Suárez-Gracia and Jesús Alastruey and Víctor Viñals-Yúfera},
  title = {Shrinking L1 Instruction Caches to Improve Energy-Delay in SMT Embedded Processors},
  booktitle = {Proceedings of the 26th International Conference on Architecture of Computing Systems (ARCS 2013)},
  publisher = {Springer Berlin / Heidelberg},
  year = {2013},
  pages = {256--267}
}

C. González, S. Sánchez, A. Paz, J. Resano, D. Mozos and A. Plaza (2013), "Use of FPGA or GPU-based architectures for remotely sensed hyperspectral image processing", Integration, the VLSI Journal. Vol. 46-2, pp. 89-103.

[BibTeX] [DOI]

BibTeX:

@article{Gonzalez13FPGAandGPU,
  author = {C. González and S. Sánchez and A. Paz and J. Resano and D. Mozos and A. Plaza},
  title = {Use of FPGA or GPU-based architectures for remotely sensed hyperspectral image processing},
  journal = {Integration, the VLSI Journal},
  year = {2013},
  volume = {46-2},
  pages = {89--103},
  doi = {10.1016/j.vlsi.2012.04.002}
}

R. Gran, J. Segarra, C. Rodriguez, L.C. Aparicio and V. Viñals (2013), "Optimizing a combined WCET-WCEC problem in instruction fetching for real-time systems", Journal of Systems Architecture., October, 2013. Vol. 59(9), pp. 667-678. Elsevier.

[BibTeX] [DOI] [PDF]

BibTeX:

@article{Gran13Optimizing,
  author = {R. Gran and J. Segarra and C. Rodriguez and L. C. Aparicio and V. Viñals},
  title = {Optimizing a combined WCET-WCEC problem in instruction fetching for real-time systems},
  journal = {Journal of Systems Architecture},
  publisher = {Elsevier},
  year = {2013},
  volume = {59},
  number = {9},
  pages = {667--678},
  doi = {10.1016/j.sysarc.2013.07.012}
}

S. López, T. Vladimirova, C. González, J. Resano, D. Mozos and A. Plaza (2013), "The Promise of Reconfigurable Computing for Hyperspectral Imaging On-Board Systems: Review and Trends", Proceedings of the IEEE., March, 2013. Vol. 101-3, pp. 698-722.

[BibTeX]

BibTeX:

@article{Lopez13Promise,
  author = {S. López and T. Vladimirova and C. González and J. Resano and D. Mozos and A. Plaza},
  title = {The Promise of Reconfigurable Computing for Hyperspectral Imaging On-Board Systems: Review and Trends},
  journal = {Proceedings of the IEEE},
  year = {2013},
  volume = {101-3},
  pages = {698--722}
}

J. Olivito, C. González and J. Resano (2013), "An FPGA-based specific processor for Blokus-Duo", In International Conference on Field-Programmable Technology 2013. Kyoto, Japan, December, 2013. , pp. 502-505.

[BibTeX]

BibTeX:

@inproceedings{Olivito2013BlokusDuo,
  author = {Olivito, J. and González, C. and Resano, J.},
  title = {An FPGA-based specific processor for Blokus-Duo},
  booktitle = {International Conference on Field-Programmable Technology 2013},
  year = {2013},
  pages = {502--505}
}

M.J.R. Ortiga, E.L. Pueyo, A. Rodríguez-Pintó, L.H. Ros, A. Pocoví, J.L. Briz and J.C. Ciria (2013), "A computed tomography approach for understanding 3D deformation patterns in complex folds", Tectonophysics., May, 2013. Vol. 593, pp. 57-72. Elsevier.

[BibTeX]

BibTeX:

@article{Ortiga13tomography,
  author = {Mª José Ramón Ortiga and Emilio Luis Pueyo and Adriana Rodríguez-Pintó and Luis Humberto Ros and Andrés Pocoví and José Luis Briz and José Carlos Ciria},
  title = {A computed tomography approach for understanding 3D deformation patterns in complex folds},
  journal = {Tectonophysics},
  publisher = {Elsevier},
  year = {2013},
  volume = {593},
  pages = {57--72}
}

M. Ortín, A. Ferrerón, J. Albericio, D. Suárez, M. Villarroya-Gaudó, C. Izu and V. Viñals (2013), "Characterization and cost-efficient selection of NoC topologies for general purpose CMPs", In Proceedings of the 2013 Interconnection Network Architecture: On-Chip, Multi-Chip. New York, NY, USA, January, 2013. , pp. 21-24. ACM.

[BibTeX] [DOI] [URL]

BibTeX:

@inproceedings{Ortin2013Topologies,
  author = {Ortín, Marta and Ferrerón, Alexandra and Albericio, Jorge and Suárez, Darío and Villarroya-Gaudó, María and Izu, Cruz and Viñals, Víctor},
  title = {Characterization and cost-efficient selection of NoC topologies for general purpose CMPs},
  booktitle = {Proceedings of the 2013 Interconnection Network Architecture: On-Chip, Multi-Chip},
  publisher = {ACM},
  year = {2013},
  pages = {21--24},
  url = {http://doi.acm.org/10.1145/2482759.2482765},
  doi = {10.1145/2482759.2482765}
}

X. Qian, H. Huang, B. Sahelices and D. Qian (2013), "Rainbow: Efficient memory dependence recording with high replay parallelism for relaxed memory model", In High Performance Computer Architecture (HPCA2013), 2013 IEEE 19th International Symposium on. , pp. 554-565.

[Abstract] [BibTeX] [DOI] [URL]

Abstract: Architectures for record-and-replay (R&R) of multithreaded applications
ease program debugging, intrusion analysis and fault-tolerance. Among
the large body of previous works, Strata enables efficient memory
dependence recording with little hardware overhead and can be applied
smoothly to snoopy protocols. However, Strata records imprecise happens-before
relations and assumes Sequential Consistency (SC) machines that execute
memory operations in order. This paper proposes Rainbow, which is
based on Strata but records near-precise happens-before relations,
reducing the number of logs and increasing the replay parallelism.
More importantly, it is the first R&R scheme that supports any relaxed
memory consistency model. These improvements are achieved by two
key techniques: (1) To compact logs, we propose expandable spectrum
(the region between two logs). It allows younger non-conflict memory
operations to be moved into older spectrum, increasing the chance
of reusing existing logs. (2) To identify the overlapped and incompatible
spectra due to reordered memory operations, we propose an SC violation
detection mechanism based on the existing logs and the extra information
can be recorded to reproduce the violations when they occur. Our
simulation results with 10 SPLASH-2 benchmarks show that Rainbow
reduces the log size by 26.6% and improves replay speed by 26.8%
compared to Strata. The SC violations are few but do exist in the
applications evaluated.

BibTeX:

@inproceedings{Qian2013,
  author = {Xuehai Qian and He Huang and Sahelices, B. and Depei Qian},
  title = {Rainbow: Efficient memory dependence recording with high replay parallelism for relaxed memory model},
  booktitle = {High Performance Computer Architecture (HPCA2013), 2013 IEEE 19th International Symposium on},
  year = {2013},
  pages = {554--565},
  url = {http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=6522349},
  doi = {10.1109/HPCA.2013.6522349}
}

A. Sankaranarayanan, E. Ardestani, J.L. Briz and J. Renau (2013), "An Energy Efficient GPGPU Memory Hierarchy With Tiny Incoherent Caches", In ISLPED 2013. Beijing, September, 2013.

[BibTeX]

BibTeX:

@inproceedings{Sankaranarayanan13Energy,
  author = {A. Sankaranarayanan and E. Ardestani and J. L. Briz and J. Renau},
  title = {An Energy Efficient GPGPU Memory Hierarchy With Tiny Incoherent Caches},
  booktitle = {ISLPED 2013},
  year = {2013}
}

J. Albericio, P. Ibáñez, R. Gran, V. Viñals and J.M. Llabería (2012), "ABS: a Low-Cost Adaptive Controller for Prefetching in a Banked Shared Last-Level Cache", ACM Transactions on Architecture and Code Optimization., January, 2012. Vol. 8-4, pp. 19:1-19:20. ACM.

[Abstract] [BibTeX] [DOI]

Abstract: Hardware data prefetch is a very well known technique for hiding memory
latencies. However, in a multicore system fitted with a shared Last-Level
Cache (LLC), prefetch induced by a core consumes common resources
such as shared cache space and main memory bandwidth. This may degrade
the performance of other cores and even the overall system performance
unless the prefetch aggressiveness of each core is controlled from
a system standpoint. On the other hand, LLCs in commercial chip multiprocessors
are more and more frequently organized in independent banks. In this
contribution, we target for the first time prefetch in a banked
LLC organization and propose ABS, a low-cost controller with a hill-climbing
approach that runs stand-alone at each LLC bank without requiring
inter-bank communication. The ABS controller operation repeats at fixed
time intervals (epochs). In each epoch a single core is selected
and its prefetch aggressiveness is changed following a previous trend.
At the end of the epoch a global performance index is evaluated and,
depending on the improvement observed against a reference epoch,
the tested change is maintained or undone. Using multiprogrammed
SPEC2K6 workloads, our analysis shows that the mechanism improves
both user-oriented metrics (Harmonic Mean of Speedups by 27% and
Fairness by 11%) and system-oriented metrics (Weighted Speedup increases
22% and Memory Bandwidth Consumption decreases 14%) over an eight-core
baseline system that uses aggressive sequential prefetch with a fixed
degree. Similar conclusions can be drawn by varying the number of
cores or the LLC size, when running parallel applications, or when
other prefetch engines are controlled.

BibTeX:

@article{Albericio2012,
  author = {Albericio, J. and Ibáñez, P. and Gran, R. and Viñals, V. and Llabería, J. M.},
  title = {ABS: a Low-Cost Adaptive Controller for Prefetching in a Banked Shared Last-Level Cache},
  journal = {ACM Transactions on Architecture and Code Optimization},
  publisher = {ACM},
  year = {2012},
  volume = {8-4},
  pages = {19:1--19:20},
  doi = {10.1145/2086696.2086698}
}

P. García-Risueño and P.E. Ibáñez (2012), "A review of High Performance Computing foundations for scientists", International Journal of Modern Physics C., May, 2012. Vol. 23(7), pp. 33.

[Abstract] [BibTeX] [DOI]

Abstract: The increase of existing computational capabilities has made simulation
emerge as a third discipline of Science, lying midway between experimental
and purely theoretical branches [1, 2]. Simulation enables the evaluation
of quantities which otherwise would not be accessible, helps to improve
experiments and provides new insights on systems which are analysed
[3-6]. Knowing the fundamentals of computation can be very useful
for scientists, for it can help them to improve the performance of
their theoretical models and simulations. This review includes some
technical essentials that can be useful to this end, and it is devised
as a complement for researchers whose education is focused on scientific
issues and not on technological respects. In this document we attempt
to discuss the fundamentals of High Performance Computing (HPC) [7]
in a way which is easy to understand without much previous background.
We sketch the way standard computers and supercomputers work, as
well as discuss distributed computing and discuss essential aspects
to take into account when running scientific calculations in computers.

BibTeX:

@article{Garcia-Risueno2012,
  author = {Pablo García-Risueño and Pablo E. Ibáñez},
  title = {A review of High Performance Computing foundations for scientists},
  journal = {International Journal of Modern Physics C},
  year = {2012},
  volume = {23(7)},
  pages = {33},
  doi = {10.1142/S0129183112300011}
}

X. Qian, B. Sahelices and J. Torrellas (2012), "BulkSMT: Designing SMT Processors for Atomic-Block Execution", In International Symposium on High Performance Computer Architecture (HPCA 2012). New Orleans, Louisiana, February, 2012.

[Abstract] [BibTeX] [PDF]

Abstract: Multiprocessor architectures that continuously execute atomic blocks
(or chunks) of instructions can improve performance and software
productivity. However, all of the prior proposals for such architectures
assume single-context cores as building blocks - rather than the
widely-used Simultaneous Multithreading (SMT) cores. As a result,
they are underutilizing hardware resources. This paper presents the
first SMT design that supports continuous chunked (or transactional)
execution of its contexts. Our design, called BulkSMT, can be used
either in a single-core processor or in a multicore of SMTs. We present
a set of BulkSMT configurations with different cost and performance.
We also describe the architectural primitives that enable chunked
execution in an SMT core and in a multicore of SMTs. Our results,
based on simulations of SPLASH-2 and PARSEC codes, show that BulkSMT
supports chunked execution cost-effectively. In a 4-core multicore
with eager chunked execution, BulkSMT reduces the execution time
of the applications by an average of 26% compared to running on single-context
cores. In a single core, the average reduction is 32%.

BibTeX:

@inproceedings{Qian2012BulkSMT,
  author = {Xuehai Qian and Benjamin Sahelices and Josep Torrellas},
  title = {BulkSMT: Designing SMT Processors for Atomic-Block Execution},
  booktitle = {International Symposium on High Performance Computer Architecture (HPCA 2012)},
  year = {2012},
  note = {(accepted)}
}

C. González, J. Resano, D. Mozos, A. Plaza and D. Valencia (2012), "FPGA Implementation of Abundance Estimation for Spectral Unmixing of Hyperspectral Data Using the Image Space Reconstruction Algorithm", IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING. Vol. 5, no. 1, pp. 248-261. IEEE.

[BibTeX] [DOI]

BibTeX:

@article{Resano2012Abundance,
  author = {Carlos González and Javier Resano and Daniel Mozos and Antonio Plaza and David Valencia},
  title = {FPGA Implementation of Abundance Estimation for Spectral Unmixing of Hyperspectral Data Using the Image Space Reconstruction Algorithm},
  journal = {IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING},
  publisher = {IEEE},
  year = {2012},
  volume = {5, no. 1},
  pages = {248--261},
  doi = {10.1109/JSTARS.2011.2171673}
}

C. González, J. Resano, D. Mozos, A. Plaza and D. Valencia (2012), "FPGA Implementation of the Pixel Purity Index Algorithm for Remotely Sensed Hyperspectral Image Analysis", IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING. Vol. 50, no. 2, pp. 374-388. IEEE.

[BibTeX] [DOI]

BibTeX:

@article{Resano2012PixelPurity,
  author = {Carlos González and Javier Resano and Daniel Mozos and Antonio Plaza and David Valencia},
  title = {FPGA Implementation of the Pixel Purity Index Algorithm for Remotely Sensed Hyperspectral Image Analysis},
  journal = {IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING},
  publisher = {IEEE},
  year = {2012},
  volume = {50, no. 2},
  pages = {374--388},
  doi = {10.1109/TVLSI.2010.2050158}
}

B. Sahelices, A. de Dios, P. Ibáñez, V. Viñals and J. Llabería (2012), "Efficient Handling of Lock Hand-off in DSM Multiprocessors with Buffering Coherence Controllers", Journal of Computer Science and Technology. Vol. 27(1), pp. 75-91. Science Press.

[Abstract] [BibTeX] [DOI]

Abstract: Synchronization in parallel programs is a major performance bottleneck
in multiprocessor systems. Shared data is protected by locks and
a lot of time is spent on the competition arising at the lock hand-off.
In order to be serialized, requests to the same cache line can either
be bounced (NACKed) or buffered in the coherence controller. In this
paper, we focus mainly on systems whose coherence controllers buffer
requests. In a lock hand-off, a burst of requests to the same line
arrive at the coherence controller. During lock hand-off only the
requests from the winning processor contribute to progress of the
computation, since the winning processor is the only one that will
advance the work. This key observation leads us to propose a hardware
mechanism we call request bypassing, which allows requests from the
winning processor to bypass the requests buffered in the coherence
controller keeping the lock line. We present an inexpensive implementation
of request bypassing that reduces the time spent on all the execution
phases of a critical section (acquiring the lock, accessing shared
data, and releasing the lock) and which, as a consequence, speeds
up the whole parallel computation. This mechanism requires neither
compiler or programmer support nor ISA or coherence protocol changes.
By simulating a 32-processor system, we show that using request bypassing
does not degrade but rather improves performance in three applications
with low synchronization rates, while in those having a large amount
of synchronization activity (the remaining four), we see reductions
in execution time and in lock stall time ranging from 14% to 39%
and from 52% to 71%, respectively. We compare request bypassing with
a previously proposed technique called read combining and with a
system that bounces requests, observing a significantly lower execution
time with the bypassing scheme. Finally, we analyze the sensitivity
of our results to some key hardware and software parameters.

BibTeX:

@article{Sahelices2012,
  author = {B. Sahelices and A. de Dios and P. Ibáñez and V. Viñals and J.M. Llabería},
  title = {Efficient Handling of Lock Hand-off in DSM Multiprocessors with Buffering Coherence Controllers},
  journal = {Journal of Computer Science and Technology},
  publisher = {Science Press},
  year = {2012},
  volume = {27(1)},
  pages = {75--91},
  doi = {10.1007/s11390-012-1207-2}
}

J. Segarra, C. Rodríguez, R. Gran, L.C. Aparicio and V. Viñals (2012), "A small and effective data cache for real-time multitasking systems", In IEEE Real-Time and Embedded Technology and Applications Symposium. Beijing, China, April, 2012. , pp. 45-54. IEEE Computer Society Press.

[Abstract] [BibTeX] [DOI] [PDF]

Abstract: In multitasking real-time systems, the WCET of each task and also
the effects of interferences between tasks in the worst-case scenario
need to be calculated. This is especially complex with data caches.
In this paper, we propose a small instruction-driven data cache (256
bytes) that effectively exploits locality. It works by preselecting
a subset of memory instructions that will have data cache replacement
permission. Selection of such instructions is based on data reuse
theory. Since each selected memory instruction replaces its own data
cache line, it prevents pollution and performance in tasks becomes
independent of the size of the associated data structures. We have
modeled several memory configurations using the Lock-MS WCET analysis
method. Our results show that, on average, our data cache effectively
services 88% of program data. Such results translate into doubling
the performance of the tested real-time multitasking experiments,
which (increasing from 75 to 89%) approaches the ideal case of always
hitting in instruction and data caches. Additionally, we show that
using partitioning on our proposed hardware only provides marginal
benefits.

BibTeX:

@inproceedings{Segarra12Small,
  author = {J. Segarra and C. Rodríguez and R. Gran and L. C. Aparicio and V. Viñals},
  title = {A small and effective data cache for real-time multitasking systems},
  booktitle = {IEEE Real-Time and Embedded Technology and Applications Symposium},
  publisher = {IEEE Computer Society Press},
  year = {2012},
  pages = {45--54},
  doi = {10.1109/RTAS.2012.11}
}

M.A. Montañés, E. Torres, J. Martínez-Rincón and J.E. Herrero-Jaraba (2012), "Real-Time GPU Color-Based Segmentation of Football Players", Journal of Real-Time Image Processing, Springer., December, 2012. Vol. 7, pp. 267-279.

[Abstract] [BibTeX] [DOI] [PDF]

Abstract: In this paper, we propose a multi camera application capable of processing
high resolution images and extracting features based on colors patterns
over graphic processing units (GPU). The goal is to work in real
time under the uncontrolled environment of a sport event like a football
match. Since football players are composed for diverse and complex
color patterns, a Gaussian Mixture Models (GMM) is applied as segmentation
paradigm, in order to analyze sport live images and video. Optimization
techniques have also been applied over the C++ implementation using
profiling tools focused on high performance. Time consuming tasks
were implemented over NVIDIA's CUDA platform, and later restructured
and enhanced, speeding up the whole process significantly. Our resulting
code is around 4-11 times faster on a low cost GPU than a highly
optimized C++ version on a central processing unit (CPU) over the
same data. Real-time operation has been achieved, processing up to 64 frames
per second. An important conclusion derived from our study is the scalability
of the application with the number of cores on the GPU.

BibTeX:

@article{Torres2012,
  author = {M. A. Montañés and E. Torres and J Martínez-Rincón and J. E. Herrero-Jaraba},
  title = {Real-Time GPU Color-Based Segmentation of Football Players},
  journal = {Journal of Real-Time Image Processing, Springer},
  year = {2012},
  volume = {7},
  pages = {267--279},
  doi = {10.1007/s11554-011-0194-9}
}

L.C. Aparicio, J. Segarra, C. Rodríguez and V. Viñals (2011), "Improving the WCET computation in the presence of a lockable instruction cache in multitasking real-time systems", Journal of Systems Architecture., August, 2011. Vol. 57(7), pp. 695-706. Elsevier.

[Abstract] [BibTeX] [DOI] [PDF]

Abstract: In multitasking real-time systems it is required to compute the WCET
of each task and also the effects of interferences between tasks
in the worst case. This is very complex with variable latency hardware,
such as instruction cache memories, or, to a lesser extent, the line
buffers usually found in the fetch path of commercial processors.
Some methods disable cache replacement so that it is easier to model
the cache behavior. The difficulty in these cache-locking methods
lies in obtaining a good selection of the memory lines to be locked
into cache. In this paper, we propose an ILP-based method to select
the best lines to be loaded and locked into the instruction cache
at each context switch (dynamic locking), taking into account both
intra-task and inter-task interferences, and we compare it with static
locking. Our results show that, without cache, the spatial locality
captured by a line buffer doubles the performance of the processor.
When adding a lockable instruction cache, dynamic locking systems
are schedulable with a cache size between 12.5% and 50% of the cache
size required by static locking. Additionally, the computation time
of our analysis method is not dependent on the number of possible
paths in the task. This allows us to analyze large codes in a relatively
short time (100 KB with 10^{65} paths in less than 3 min).

BibTeX:

@article{Aparicio10Improving,
  author = {L. C. Aparicio and J. Segarra and C. Rodríguez and V. Viñals},
  title = {Improving the WCET computation in the presence of a lockable instruction cache in multitasking real-time systems},
  journal = {Journal of Systems Architecture},
  publisher = {Elsevier},
  year = {2011},
  volume = {57},
  number = {7},
  pages = {695--706},
  doi = {10.1016/j.sysarc.2010.08.008}
}

A. Bosque, V. Viñals, P. Ibañez and J. Llabería (2011), "Filtering Directory Lookups in CMPs with Write-Through Caches", In Euro-Par 2011 Parallel Processing - 17th International Conference. LNCS 6852., September, 2011. Vol. 6852/2011, pp. 269-281. Springer.

[Abstract] [BibTeX] [PDF]

Abstract: In CMPs, coherence protocols are used to maintain data coherence among
the multiple local caches. In this paper, we focus on CMPs using
write-through local caches, and a directory-based coherence protocol
implemented as a duplicate of the local cache tags. A large fraction
of directory lookups is due to stores performed on private data local
to the processor performing the store. We propose to add a filter
before the directory in order to either reduce the associativity
of the lookups or even eliminate those that are unnecessary. When
a block from the shared cache has only one copy in the local caches,
the filter identifies the processor and allows for reducing the number
of comparisons performed in the corresponding directory lookup. When
that is not possible, the filter bits are used to code other situations
that can also reduce the number of directory lookups or their associativity.
We evaluate the filter in a CMP with 8 in-order processors with
4 threads each and a memory hierarchy with local caches and a shared
cache. We show that a filter representing 0.7% of the size of the
shared cache can avoid, on average, 97% and 93% of all comparisons
performed by directory lookups for SPLASH2 and Specweb2005, respectively.
Only for SPLASH2, there is a small performance loss of 0.3%. As a
result, on average, directory power is reduced 30.8% and 22.4% for
SPLASH2 and Specweb2005, respectively.

BibTeX:

@inproceedings{Bosque2011,
  author = {A. Bosque and V. Viñals and P. Ibañez and J.M. Llabería},
  title = {Filtering Directory Lookups in CMPs with Write-Through Caches},
  booktitle = {Euro-Par 2011 Parallel Processing - 17th International Conference. LNCS 6852},
  publisher = {Springer},
  year = {2011},
  volume = {6852/2011},
  pages = {269--281}
}

A. Bosque, V. Viñals, P. Ibañez and J. Llabería (2011), "Filtering Directory Lookups in CMPs", Microprocessors and Microsystems. Design and Verification of Complex Digital Systems. Vol. 35, no. 8, pp. 695-707. Elsevier.

[Abstract] [BibTeX] [DOI] [PDF]

Abstract: Coherence protocols consume an important fraction of power to determine
which coherence action to perform. Specifically, on CMPs with shared
cache and directory-based coherence protocol implemented as a duplicate
of local caches tags, we have observed that a big fraction of directory
lookups cause a miss, because the block looked up is not allocated
in any local cache. To reduce the number of directory lookups and
therefore the power consumption, we propose to add a filter before
the directory access. We introduce two filter implementations. In
the first one, filtering information is explicitly kept in the shared
cache for every block. In the second one, filtering information is
decoupled from the shared cache organization, so the filter size
does not depend on the shared cache size. We evaluate our filters
in a CMP with 8 in-order processors with 4 threads each and a memory
hierarchy with write-through local caches and a shared cache. We
show that, for SPLASH2 benchmarks, the proposed filters reduce the
number of directory lookups performed by 60% while power consumption
is reduced by 28%. For Specweb2005, the number of directory lookups
performed is reduced by 68% (44%), while directory power consumption
is reduced by 19% (9%) using the first (second) filter implementation.

BibTeX:

@article{Bosque2011b,
  author = {A. Bosque and V. Viñals and P. Ibañez and J.M. Llabería},
  title = {Filtering Directory Lookups in CMPs},
  journal = {Microprocessors and Microsystems. Design and Verification of Complex Digital Systems},
  publisher = {Elsevier},
  year = {2011},
  volume = {35, no. 8},
  pages = {695--707},
  doi = {10.1016/j.micpro.2011.08.006}
}

M.J. Ramón, E.L. Pueyo, J.L. Briz, A. Pocoví and J.C. Ciria (2011), "Flexural unfolding of horizons using paleomagnetic vectors", Journal of Structural Geology. Vol. 35, pp. 28-39. Elsevier.

[Abstract] [BibTeX] [DOI] [PDF]

Abstract: This paper introduces a new restoration method (Pmag3Drest) designed
for complex folded structures (non-cylindrical, non-coaxial). It
combines paleomagnetic vectors and bedding markers setting up a reference
system that allows deformed and undeformed surfaces to be related
to one another. We assume flexural conditions during the deformation.
Consequently, the stratigraphic horizons are considered to be globally
developable surfaces with total area preservation except in specific
deformation areas. Using paleomagnetism in the proposed restoration
process (Pmag3Drest) helps to locate these areas with greater accuracy.
It is similar to other approaches based on triangulations, but it
forces the available paleomagnetic data to converge with the paleomagnetic
reference vector during the restoration process. Our experiments
use computer and analog models in which the deformed and undeformed
surfaces are perfectly known. This enables us to apply the restoration
method to the deformed surface and compare the parameters of the
restored surface with those of the initial undeformed surface to
quantify the quality of the method. Paleomagnetic data anchor the
surface leading to more accurate results.

BibTeX:

@article{Briz2011Flexural,
  author = {Mª José Ramón and Emilio L. Pueyo and José Luis Briz and Andrés Pocoví and José Carlos Ciria},
  title = {Flexural unfolding of horizons using paleomagnetic vectors},
  journal = {Journal of Structural Geology},
  publisher = {Elsevier},
  year = {2011},
  volume = {35},
  pages = {28--39},
  doi = {10.1016/j.jsg.2011.11.015}
}

L.M. Ramos, J.L. Briz, P.E. Ibáñez and V. Viñals (2011), "Multi-level Adaptive Prefetching based on Performance Gradient Tracking", The Journal of Instruction-Level Parallelism., January, 2011. Vol. 13, pp. 1-14.

[Abstract] [BibTeX] [URL]

Abstract: We introduce a multi-level prefetching framework with three setups,
respectively aimed to minimize cost (Mincost), minimize losses in
individual applications (Minloss) or maximize performance with moderate
cost (Maxperf). Performance is boosted in all cases by a sequential
tagged prefetcher in the L1 cache, with an effective static degree
policy. In both cache levels (L1 and L2), we also apply prefetch
filters. In the L2 cache we use a novel adaptive policy that selects
the best prefetching degree within a fixed set of values, by tracking
the performance gradient. Mincost resorts to sequential tagged prefetching
in the L2 cache as well. Minloss relies on an accurate, home-made,
correlating prefetcher (PDFCM, Differential Finite Context Method
Prefetcher). Maxperf maximizes performance at the expense of slight
performance losses in a small number of benchmarks, by integrating
a sequential tagged prefetcher with PDFCM in the L2 cache.

BibTeX:

@article{Ramos2011,
  author = {L. M. Ramos and J. L. Briz and P. E. Ibáñez and Victor Viñals},
  title = {Multi-level Adaptive Prefetching based on Performance Gradient Tracking},
  journal = {The Journal of Instruction-Level Parallelism},
  year = {2011},
  volume = {13},
  pages = {1--14},
  url = {www.jilp.org/vol13}
}

C. González, J. Resano, A. Plaza and D. Mozos (2011), "FPGA Implementation of Endmember Extraction Algorithms from Hyperspectral Imagery: Pixel Purity Index versus N-FINDR", In Proc. of SPIE High-Performance Computing in Remote Sensing. Prague, Czech Republic.

[BibTeX]

BibTeX:

@inproceedings{Resano2011Endmember,
  author = {Carlos González and Javier Resano and Antonio Plaza and Daniel Mozos},
  title = {FPGA Implementation of Endmember Extraction Algorithms from Hyperspectral Imagery: Pixel Purity Index versus N-FINDR},
  booktitle = {Proc. of SPIE High-Performance Computing in Remote Sensing},
  year = {2011}
}

J.A. Clemente, J. Resano and D. Mozos (2011), "A Replacement Technique to Maximize Task Reuse in Reconfigurable Systems", In IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum (IPDPSW). Anchorage, USA

[BibTeX]

BibTeX:

@inproceedings{Resano2011Replacement,
  author = {Juan Antonio Clemente and Javier Resano and Daniel Mozos},
  title = {A Replacement Technique to Maximize Task Reuse in Reconfigurable Systems},
  booktitle = {IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum (IPDPSW)},
  year = {2011}
}

D. Suárez, G. Dimitrakopoulos, T. Monreal, M.G.H. Katevenis and V. Viñals (2011), "LP-NUCA: Networks-in-Cache for High-Performance Low-Power Embedded Processors", IEEE Transactions on Very Large Scale Integration (VLSI) systems.

[Abstract] [BibTeX] [PDF]

Abstract: High-end embedded processors demand complex on-chip cache hierarchies
satisfying several contradicting design requirements such as high-performance
operation and low energy consumption. This paper introduces light-power
(LP) nonuniform cache architecture (NUCA), a tiled-cache addressing
both goals. LP-NUCA places a group of small and low-latency tiles
between the L1 and the last level cache (LLC) that adapt better to
the application working sets and keep most recently evicted blocks
close to L1. LP-NUCA is built around three specialized "networks-in-cache,"
each aimed at a separate cache operation. To prove the
design feasibility, we have fully implemented LP-NUCA in a 90-nm
technology. From the VLSI implementation, we observe that the proposed
networks-in-cache incur minimal area, latency, and power overhead.
To further reduce the energy consumption, LP-NUCA employs two network-wide
techniques (miss wave stopping and sectoring) that together reduce
the dynamic cache energy by 35% without degrading performance. Our
evaluations also show that LP-NUCA improves performance with respect
to cache hierarchies similar to those found in high-end embedded
processors. Similar results have been obtained after scaling to a
32-nm technology.

BibTeX:

@article{Suarez2011,
  author = {Darío Suárez and Giorgos Dimitrakopoulos and Teresa Monreal and Manolis G. H. Katevenis and Víctor Viñals},
  title = {LP-NUCA: Networks-in-Cache for High-Performance Low-Power Embedded Processors},
  journal = {IEEE Transactions on Very Large Scale Integration (VLSI) systems},
  year = {2011}
}

J. Agustí, I. Pellejero, G. Abadal, G. Murillo, M. Urbiztondo, J. Sesé, M. Villarroya-Gaudo, M. Pina, J. Santamaría and N. Barniol (2010), "Optical vibrometer for mechanical properties characterization of silicalite-only cantilever based sensors", Microelectronic Engineering. Vol. 87(5-8), pp. 1207-1209.

[BibTeX] [DOI] [URL]

BibTeX:

@article{Agusti20101207,
  author = {J. Agustí and I. Pellejero and G. Abadal and G. Murillo and M.A. Urbiztondo and J. Sesé and M. Villarroya-Gaudo and M. Pina and J. Santamaría and N. Barniol},
  title = {Optical vibrometer for mechanical properties characterization of silicalite-only cantilever based sensors},
  journal = {Microelectronic Engineering},
  year = {2010},
  volume = {87},
  number = {5-8},
  pages = {1207--1209},
  note = {The 35th International Conference on Micro- and Nano-Engineering (MNE)},
  url = {http://www.sciencedirect.com/science/article/pii/S016793170900851X},
  doi = {10.1016/j.mee.2009.12.009}
}

L.C. Aparicio, J. Segarra, C. Rodríguez and V. Viñals (2010), "Combining prefetch with instruction cache locking in multitasking real-time systems", In Proceedings of the IEEE Int. Conf. on Embedded and Real-Time Computing Systems and Applications. Macau SAR, China, August, 2010. , pp. 319-328. IEEE Computer Society Press.

[Abstract] [BibTeX] [DOI] [PDF]

Abstract: In multitasking real-time systems it is required to compute the WCET
of each task and also the effects of interferences between tasks
in the worst case. This is complex with variable latency hardware
usually found in the fetch path of commercial processors. Some methods
disable cache replacement so that it is easier to model the cache
behavior. Lock-MS is an ILP based method to obtain the best selection
of memory lines to be locked in a dynamic locking instruction cache.
In this paper we first propose a simple memory architecture implementing
the next-line tagged prefetch, specially designed for hard
real-time systems. Then, we extend Lock-MS to add support for hardware
instruction prefetch. Our results show that the WCET of a system
with prefetch and an instruction cache with size 5% of the total
code size is better than that of a system having no prefetch and
cache size 80% of the code. We also evaluate the effects of the
prefetch penalty on the resulting WCET, showing that a system without
prefetch penalties has a worst-case performance 95% of the ideal
case. This highlights the importance of a good prefetch design. Finally,
the computation time of our analysis method is relatively short,
analyzing tasks of 96 KB with 10^{65} paths in less than
3 minutes.

BibTeX:

@inproceedings{Aparicio10Combining,
  author = {L. C. Aparicio and J. Segarra and C. Rodríguez and V. Viñals},
  title = {Combining prefetch with instruction cache locking in multitasking real-time systems},
  booktitle = {Proceedings of the IEEE Int. Conf. on Embedded and Real-Time Computing Systems and Applications},
  publisher = {IEEE Computer Society Press},
  year = {2010},
  pages = {319--328},
  doi = {10.1109/RTCSA.2010.8}
}

A. Bosque, V. Viñals, P. Ibañez and J. Llaberia (2010), "Filtering Directory Lookups in CMPs", In Proc. 13th Euromicro Conf. Digital System Design: Architectures, Methods and Tools (DSD). , pp. 207-216.

[Abstract] [BibTeX] [DOI]

Abstract: Coherence protocols consume an important fraction of power to determine
which coherence action should take place. In this paper we focus
on CMPs with a shared cache and a directory-based coherence protocol
implemented as a duplicate of local caches tags. We observe that
a big fraction of directory lookups produce a miss since the block
looked up is not cached in any local cache. We propose to add a filter
before the directory lookup in order to reduce the number of lookups
to this structure. The filter identifies whether the current block
was last accessed as a data or as an instruction. With this information,
looking up the whole directory can be avoided for most accesses.
We evaluate the filter in a CMP with 8 in-order processors with 4
threads each and a memory hierarchy with a shared L2 cache. We show
that a filter with a size of 3% of the tag array of the shared cache
can avoid more than 70% of all comparisons performed by directory
lookups with a performance loss of just 0.2% for SPLASH2 and 1.5%
for Specweb2005. On average, the number of 15-bit comparisons avoided
per cycle is 54 out of 77 for SPLASH2 and 29 out of 41 for Specweb2005.
In both cases, the filter requires less than one read of 1 bit per
cycle.

BibTeX:

@inproceedings{Bosque2010,
  author = {A. Bosque and V. Viñals and P. Ibañez and J.M. Llaberia},
  title = {Filtering Directory Lookups in CMPs},
  booktitle = {Proc. 13th Euromicro Conf. Digital System Design: Architectures, Methods and Tools (DSD)},
  year = {2010},
  pages = {207--216},
  doi = {10.1109/DSD.2010.85}
}

P. Molina-Gaudo, S. Baldassarri, M. Villarroya-Gaudo and E. Cerezo (2010), "Perception and Intention in Relation to Engineering: A Gendered Study Based on a One-Day Outreach Activity", IEEE Transactions on Education. Vol. 53(1), pp. 61-70.

[Abstract] [BibTeX] [DOI]

Abstract: This paper explores both how male and female high school pupils (15-16
years old) perceive the engineering profession and their willingness
to pursue a career in this area. A study was performed around a one-day
outreach activity, Girls' Day, organized for the first time in Spain.
During Girls' Day, students were exposed to specific activities developed
for them in engineering research labs and companies, carried out
by young female researchers and professionals. The study, based on
two questionnaires answered before and after the activity, focuses
on the differences between groups of female and male students having
differing degrees of interest in studying engineering. The educational
level of mothers, the presence of engineers in families, and perceived
family support emerged as important factors influencing the probability
of a young person's considering pursuing engineering studies. Nevertheless,
the need to expose children to outreach activities at a younger age
and to involve the students' families and teachers has become clear.
If planned properly and thoughtfully, even a single day's experience
can contribute to changing the perception of what an engineer is.

BibTeX:

@article{Molina-Gaudo2010,
  author = {Molina-Gaudo, P. and Baldassarri, S. and Villarroya-Gaudo, M. and Cerezo, E.},
  title = {Perception and Intention in Relation to Engineering: A Gendered Study Based on a One-Day Outreach Activity},
  journal = {IEEE Transactions on Education},
  year = {2010},
  volume = {53},
  number = {1},
  pages = {61--70},
  doi = {10.1109/TE.2009.2023910}
}

J. Olivito, C. González and J. Resano (2010), "FPGA implementation of a strong Reversi player", In International Conference on Field-Programmable Technology. Beijing, China, December, 2010. ISBN 978-1-4244-8980-0, pp. 507-510.

[BibTeX]

BibTeX:

@inproceedings{Olivito10Reversi,
  author = {Olivito, J. and Carlos González and Javier Resano},
  title = {FPGA implementation of a strong Reversi player},
  booktitle = {International Conference on Field-Programmable Technology},
  year = {2010},
  isbn = {978-1-4244-8980-0},
  pages = {507--510}
}

C. Gonzalez, J. Resano, D. Mozos, A. Plaza and D. Valencia (2010), "FPGA Implementation of the Pixel Purity Index Algorithm for Remotely Sensed Hyperspectral Image Analysis", EURASIP Journal on Advances in Signal Processing. Vol. 2010, pp. 1-13. HINDAWI.

[BibTeX]

BibTeX:

@article{Resano10c,
  author = {Carlos Gonzalez and Javier Resano and Daniel Mozos and Antonio Plaza and David Valencia},
  title = {FPGA Implementation of the Pixel Purity Index Algorithm for Remotely Sensed Hyperspectral Image Analysis},
  journal = {EURASIP Journal on Advances in Signal Processing},
  publisher = {HINDAWI},
  year = {2010},
  volume = {2010},
  pages = {1--13}
}

J. Clemente, C. González, J. Resano and D. Mozos (2010), "A task graph execution manager for reconfigurable multi-tasking systems", Microprocessors and Microsystems. Vol. 34(2-4), pp. 73-83. Elsevier.

[BibTeX]

BibTeX:

@article{Resano2010a,
  author = {J.A. Clemente and C. González and J. Resano and D. Mozos},
  title = {A task graph execution manager for reconfigurable multi-tasking systems},
  journal = {Microprocessors and Microsystems},
  publisher = {Elsevier},
  year = {2010},
  volume = {34},
  number = {2-4},
  pages = {73--83}
}

J.A. Clemente, J. Resano, C. Gonzalez and D. Mozos (2010), "A Hardware Implementation of a Run-Time Scheduler for Reconfigurable Systems", IEEE Transactions on Very Large Scale Integration (VLSI) Systems. Vol. 19(7), pp. 1263-1276. IEEE.

[BibTeX]

BibTeX:

@article{Resano2010b,
  author = {Juan Antonio Clemente and Javier Resano and Carlos Gonzalez and Daniel Mozos},
  title = {A Hardware Implementation of a Run-Time Scheduler for Reconfigurable Systems},
  journal = {IEEE Transactions on Very Large Scale Integration (VLSI) Systems},
  publisher = {IEEE},
  year = {2010},
  volume = {19},
  number = {7},
  pages = {1263--1276}
}

C. González, J. Resano, A. Plaza and D. Mozos (2010), "FPGA for computing the pixel purity index algorithm on hyperspectral images", In The 2010 International Conference on Engineering of Reconfigurable Systems and Algorithms (ERSA). Las Vegas, USA

[BibTeX]

BibTeX:

@inproceedings{Resano2010ComputingPixelPurity,
  author = {Carlos González and Javier Resano and Antonio Plaza and Daniel Mozos},
  title = {FPGA for computing the pixel purity index algorithm on hyperspectral images},
  booktitle = {The 2010 International Conference on Engineering of Reconfigurable Systems and Algorithms (ERSA)},
  year = {2010}
}

M.A. Montañés, E. Torres, J. Martinez and J.E. Herrero (2010), "Scalability of Color-Based Segmentation of Football Players over GPUs", In 2010 International Workshop on GPUs and Scientific Applications (GPUScA 2010) in conjunction with PACT 2010. , pp. 27-35. TR of the Department of Scientific Computing, University of Vienna, TR-10-3.

[Abstract] [BibTeX] [PDF]

Abstract: In this paper, we study the scalability of a real application to the
available number of cores in the GPU. Our application is a real-time
image processing in which a football player feature extractor based
in color patterns obtain feasible measures for tracking system. Since
football players are composed for diverse and complex color patterns,
a Gaussian Mixture Models (GMM) is applied as segmentation paradigm.
Optimization techniques have also been applied over the C++ implementation
using profiling tools focused on high performance. Time consuming
tasks were implemented over NVIDIA's CUDA platform, and later restructured
and enhanced, speeding up the whole process significantly. Our resulting
code is around 4-11 times faster on a low cost GPU than a highly
optimized C++ version on a central processing unit (CPU) over the
same data. The optimized application has been benchmarked over different
GPUs with different number of cores. Due to data dependencies performance
increase 1.4x when doubling number of cores.

BibTeX:

@inproceedings{Torres2010,
  author = {M. A. Montañés and E. Torres and J. Martinez and J. E. Herrero},
  title = {Scalability of Color-Based Segmentation of Football Players over GPUs},
  booktitle = {2010 International Workshop on GPUs and Scientific Applications (GPUScA 2010) in conjunction with PACT 2010},
  publisher = {TR of the Department of Scientific Computing, University of Vienna, TR-10-3},
  year = {2010},
  pages = {27--35}
}

C. González, J. Olivito and J. Resano (2009), "An initial specific processor for Sudoku solving", In International Conference on Field-Programmable Technology. Sydney, Australia, December, 2009. ISBN 978-1-4244-4375-8, pp. 530-533.

[BibTeX]

BibTeX:

@inproceedings{Gonzalez09Sudokus,
  author = {Carlos González and Javier Olivito and Javier Resano},
  title = {An initial specific processor for Sudoku solving},
  booktitle = {International Conference on Field-Programmable Technology},
  year = {2009},
  isbn = {978-1-4244-4375-8},
  pages = {530--533}
}

R. Gran, E. Morancho, A. Olive and J.M. Llabería (2009), "On reducing misspeculations in a pipelined scheduler", In Proc. IEEE Int. Symp. Parallel & Distributed Processing (IPDPS 2009). , pp. 1-12.

[Abstract] [BibTeX] [DOI]

Abstract: Pipelining the scheduling logic, which exposes and exploits the instruction
level parallelism, degrades processor performance. In a 4-issue processor,
our evaluations show that pipelining the scheduling logic over two
cycles degrades performance by 10% in SPEC-2000 integer benchmarks.
Such a performance degradation is due to sacrificing the ability
to execute dependent instructions in consecutive cycles. Speculative
selection is a previously proposed technique that boosts the performance
of a processor with a pipelined scheduling logic. However, this new
speculation source increases the overall number of misspeculated
instructions, and this unuseful work wastes energy. In this work
we introduce a non-speculative mechanism named Dependence Level Scheduler
(DLS) which not only tolerates the scheduling-logic latency but also
reduces the number of misspeculated instructions with respect to
a scheduler with speculative selection. In DLS, the selection of
a group of one-cycle instructions (producer-level) is overlapped
with the wake up in advance of its group of dependent instructions.
DLS is not speculative because the group of woken in advance instructions
will compete for selection only after issuing all producer-level
instructions. On average, DLS reduces the number of misspeculated
instructions with respect to a speculative scheduler by 17.9%. From
the IPC point of view, the speculative scheduler outperforms DLS
by 0.3%. Moreover, we propose two non-speculative improvements to
DLS.

BibTeX:

@inproceedings{Gran2009,
  author = {Gran, R. and Morancho, E. and Olive, A. and Llabería, J. M.},
  title = {On reducing misspeculations in a pipelined scheduler},
  booktitle = {Proc. IEEE Int. Symp. Parallel \& Distributed Processing (IPDPS 2009)},
  year = {2009},
  pages = {1--12},
  doi = {10.1109/IPDPS.2009.5160990}
}

S. Gutiérrez, O. Benedí, D. Suárez, J. Marín and V. Viñals (2009), "Processor Energy and Temperature in Computer Architecture Courses: a hands-on approach", In Workshop on Computer Architecture Education Held in conjunction with the 42nd Annual International Symposium on Microarchitecture, New York, United States, December 13, 2009.

[Abstract] [BibTeX] [PDF]

Abstract: Performance has driven the microprocessor industry for more than thirty
years. Its effort has enabled to multiply by several orders of magnitude
the computational power; e.g., the Intel 8080 was able to execute
0.64 MIPS and the newest Core i7 can execute 6400 MIPS. The cost
of this fabulous improvement has been a large rise in energy consumption.
Nowadays, we have reached a point where one of the most limiting
factor for improving performance is energy dissipation. In order
to keep the performance improvement during the next years, it is
necessary to study energy and temperature in deep. Nevertheless,
most current computer architecture curricula include neither energy
nor temperature. The lack of adequate experimental platforms contributes
to the difficulty in teaching these topics. In this paper we propose
a possible solution: to instrument a commodity PC for measuring the
processor power and temperature during the execution of real programs.
The platform is devised for teaching, but it can be used to support
research experiments as well. For example, we describe an interesting
undergraduate laboratory that analyzes the interaction between compiler
optimizations and energy. With this laboratory, students can learn
that performance optimizations usually reduce energy but may increase
power.

BibTeX:

@inproceedings{Gutierrez2009,
  author = {S. Gutiérrez and O. Benedí and D. Suárez and J.M. Marín and V. Viñals},
  title = {Processor Energy and Temperature in Computer Architecture Courses: a hands-on approach},
  booktitle = {Workshop on Computer Architecture Education Held in conjunction with the 42nd Annual International Symposium on Microarchitecture, New York, United States, December 13, 2009},
  year = {2009}
}

A. Muzahid, D. Suárez, S. Qi and J. Torrellas (2009), "SigRace: signature-based data race detection", In ISCA '09: Proceedings of the 36th annual international symposium on Computer architecture. New York, NY, USA , pp. 337-348. ACM.

[Abstract] [BibTeX] [DOI] [PDF]

Abstract: Detecting data races in parallel programs is important for both software
development and production-run diagnosis. Recently, there have been
several proposals for hardware-assisted data race detection. Such
proposals typically modify the L1 cache and cache coherence protocol
messages, and largely lose their capability when lines get displaced
or invalidated from the cache. To eliminate these shortcomings, this
paper proposes a novel, different approach to hardware-assisted data
race detection. The approach, called SigRace, relies on hardware
address signatures. As a processor runs, the addresses of the data
that it accesses are automatically encoded in signatures. At certain
times, the signatures are automatically passed to a hardware module
that intersects them with those of other processors. If the intersection
is not null, a data race may have occurred. This paper presents the
architecture of SigRace, an implementation, and its software interface.
With SigRace, caches and coherence protocol messages are unmodified.
Moreover, cache lines can be displaced and invalidated with no effect.
Our experiments show that SigRace is significantly more effective
than a state-of-the-art conventional hardware-assisted race detector.
SigRace finds on average 29% more static races and 107% more dynamic
races. Moreover, if we inject data races, SigRace finds 150% more
static races than the conventional scheme.

BibTeX:

@inproceedings{Muzahid2009,
  author = {Muzahid, Abdullah and Suárez, Dario and Qi, Shanxiang and Torrellas, Josep},
  title = {SigRace: signature-based data race detection},
  booktitle = {ISCA '09: Proceedings of the 36th annual international symposium on Computer architecture},
  publisher = {ACM},
  year = {2009},
  pages = {337--348},
  doi = {10.1145/1555754.1555797}
}

L.M. Ramos, J.L. Briz, P.E. Ibáñez and V. Viñals (2009), "Multi-level Adaptive Prefetching based on Performance Gradient Tracking", 1st. Data Prefetching Championship. Raleigh, North Carolina, February, 2009.

[BibTeX] [URL] [PDF]

BibTeX:

@inproceedings{Ramos2009,
  author = {Ramos, Luis M. and Briz, José Luis and Ibáñez, Pablo E. and Viñals, Victor},
  title = {Multi-level Adaptive Prefetching based on Performance Gradient Tracking},
  year = {2009},
  note = {Held in conjunction with the 15th International Symposium on High-Performance Computer Architecture (HPCA-15). Best Paper Award},
  url = {http://www.jilp.org/dpc/online/DPC-1%20Program.htm}
}

C. González, J. Resano and D. Mozos (2009), "FPGA Support for Satellite Computations of Hyper Spectral Images", In 19th International Conference on Field Programmable Logic and Applications (FPL). Prague, Czech Republic

[BibTeX]

BibTeX:

@inproceedings{Resano2009SupportSatellite,
  author = {Carlos González and Javier Resano and Daniel Mozos},
  title = {FPGA Support for Satellite Computations of Hyper Spectral Images},
  booktitle = {19th International Conference on Field Programmable Logic and Applications (FPL)},
  year = {2009}
}

B. Sahelices, P. Ibáñez, V. Viñals and J. Llabería (2009), "A Methodology to Characterize Critical Section Bottlenecks in DSM Multiprocessors", In Euro-Par 2009 Parallel Processing. 15th International Euro-Par Conference, Delft, The Netherlands, August 25--28, 2009. Vol. 5704, pp. 149-161. Springer Berlin / Heidelberg.

[BibTeX] [PDF]

BibTeX:

@inproceedings{Sahelices2009,
  author = {Sahelices, B. and Ibáñez, P. and Viñals, V. and Llabería, J.M.},
  title = {A Methodology to Characterize Critical Section Bottlenecks in DSM Multiprocessors},
  booktitle = {Euro-Par 2009 Parallel Processing. 15th International Euro-Par Conference, Delft, The Netherlands, August 25--28, 2009},
  publisher = {Springer Berlin / Heidelberg},
  year = {2009},
  volume = {5704},
  pages = {149--161}
}

D. Suárez, T. Monreal, F. Vallejo, R. Beivide and V. Viñals (2009), "Light NUCA: A proposal for bridging the inter-cache latency gap", In Proc. Design, Automation & Test in Europe Conference & Exhibition (DATE '09). , pp. 530-535.

[Abstract] [BibTeX] [URL] [PDF]

Abstract: To deal with the “memory wall” problem, microprocessors
include large secondary on-chip caches. But as these caches enlarge,
they originate a new latency gap between them and fast L1 caches
(inter-cache latency gap). Recently, Non-Uniform Cache Architectures
(NUCAs) have been proposed to sustain the size growth trend of secondary
caches that is threatened by wire-delay problems. NUCAs are size-oriented,
and they were not conceived to close the inter-cache latency gap.
To tackle this problem, we propose Light NUCAs (L-NUCAs) leveraging
on-chip wire density to interconnect small tiles through specialized
networks, which convey packets with distributed and dynamic routing.
Our design reduces the tile delay (cache access plus one-hop routing)
to a single processor cycle and places cache lines at a finer granularity
than conventional caches, reducing cache latency. Our evaluations
show that in general, an L-NUCA improves simultaneously performance,
energy, and area when integrated into both conventional or D-NUCA
hierarchies.

BibTeX:

@inproceedings{Suarez2009,
  author = {Darío Suárez and Teresa Monreal and Fernando Vallejo and Ramon Beivide and Víctor Viñals},
  title = {Light NUCA: A proposal for bridging the inter-cache latency gap},
  booktitle = {Proc. Design, Automation \& Test in Europe Conference \& Exhibition (DATE '09)},
  year = {2009},
  pages = {530--535},
  url = {http://www.date-conference.com/archive/conference/proceedings/PAPERS/2009/DATE09/PDFFILES/05.7_1.PDF}
}

E. Torres, P. Ibañez, V. Viñals and J.M. Llaberia (2009), "Store Buffer Design for Multibanked Data Caches", IEEE Transactions on Computers. Vol. 58(10), pp. 1307-1320.

[Abstract] [BibTeX] [DOI] [PDF]

Abstract: This paper focuses on how to design a store buffer (STB) well suited
to first-level multibanked data caches. The goal is to forward data
from in-flight stores into dependent loads within the latency of
a cache bank. Taking into account the store lifetime in the processor
pipeline and the data forwarding behavior, we propose a particular
two-level STB design in which forwarding is done speculatively from
a distributed first-level STB made of extremely small banks, whereas
a centralized, second-level STB enforces correct store-load ordering.
Besides, the two-level STB admits two simplifications that leave
performance almost unchanged. Regarding the second-level STB, we
suggest to remove its data forwarding capability, while for the first-level
STB, it is possible to: 1) remove the instruction age checking and
2) compare only the less significant address bits. Experimentation
covers both integer and floating point codes executing in dynamically
scheduled processors. Following our guidelines and running SPEC-2K
over an 8-way processor, a two-level STB with four 8-entry banks
in the first level performs similar to an ideal, single-level STB
with 128-entry banks working at the first-level cache latency. Also,
we show that the proposed two-level design is suitable for a memory-latency-tolerant
processor.

BibTeX:

@article{Torres2009,
  author = {Torres, E. and Ibañez, P. and Viñals, V. and Llaberia, J. M.},
  title = {Store Buffer Design for Multibanked Data Caches},
  journal = {IEEE Transactions on Computers},
  year = {2009},
  volume = {58},
  number = {10},
  pages = {1307--1320},
  doi = {10.1109/TC.2009.57}
}

M. Urbiztondo, I. Pellejero, M. Villarroya-Gaudo, J. Sesé, M. Pina, I. Dufour and J. Santamaría (2009), "Zeolite-modified cantilevers for the sensing of nitrotoluene vapors", Sensors and Actuators B: Chemical. Vol. 137(2), pp. 608-616.

[BibTeX] [DOI] [URL]

BibTeX:

@article{Urbiztondo2009608,
  author = {M.A. Urbiztondo and I. Pellejero and M. Villarroya-Gaudo and J. Sesé and M.P. Pina and I. Dufour and J. Santamaría},
  title = {Zeolite-modified cantilevers for the sensing of nitrotoluene vapors},
  journal = {Sensors and Actuators B: Chemical},
  year = {2009},
  volume = {137},
  number = {2},
  pages = {608--616},
  url = {http://www.sciencedirect.com/science/article/pii/S0925400509000549},
  doi = {10.1016/j.snb.2009.01.047}
}

J. Alastruey, T. Monreal, F. Cazorla, V. Viñals and M. Valero (2008), "Selection of the Register File Size and the Resource Allocation Policy on SMT Processors", In Proc. 20th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD '08)., October, 2008. , pp. 63-70.

[Abstract] [BibTeX] [DOI] [PDF]

Abstract: The performance impact of the Physical Register File (PRF) size on
Simultaneous Multithreading processors has not been extensively studied
in spite of being a critical shared resource. In this paper we analyze
the effect on performance of the PRF size for a broad set of resource
allocation policies (Icount, Stall, Flush, Flush++, Static, Dcra and
Hill-climbing) and evaluate them under two metrics: instructions
per second (IPS) for throughput and harmonic mean of weighted IPCs
(Hmean-wIPC) for fairness. We have found that resource allocation
policy and PRF size should be considered together in order to obtain
the best score in the proposed metrics. For instance, for the analyzed
2 and 4-threaded SPEC CPU2000 workloads, small PRFs are best managed
by Flush, whereas for larger PRFs, Hill-climbing and Static lead
to the best values for the throughput and fairness metrics, respectively. The
second contribution of this work is a simple procedure that, for
a given resource allocation policy, selects the PRF size that maximizes
IPS and obtains for Hmean-wIPC a value close to its maximum. According
to our results, Hill-climbing with a 320-entry PRF achieves the best
figures for 2-threaded workloads. When executing 4-threaded workloads,
Hill-Climbing with a 384-entry PRF achieves the best throughput whereas
Static obtains the best throughput-fairness balance.

BibTeX:

@inproceedings{Alastruey2008,
  author = {Alastruey, J. and Monreal, T. and Cazorla, F. and Viñals, V. and Valero, M.},
  editor = {IEEE Computer Society},
  title = {Selection of the Register File Size and the Resource Allocation Policy on SMT Processors},
  booktitle = {Proc. 20th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD '08)},
  year = {2008},
  pages = {63--70},
  doi = {10.1109/SBAC-PAD.2008.17}
}

L.C. Aparicio, J. Segarra, C. Rodríguez, J.L. Villarroel and V. Viñals (2008), "Avoiding the WCET Overestimation on LRU Instruction Cache", In Proc. 14th IEEE International Conference on Embedded and Real-Time Computing Systems and Applications (RTCSA 08). Kaohsiung, Taiwan, August, 2008. , pp. 393-398. IEEE Computer Society Press.

[Abstract] [BibTeX] [DOI] [PDF]

Abstract: The WCET computation is one of the main challenges in hard real-time
systems, since all further analysis is based on this value. The complexity
of this problem leads existing analysis methods to compute WCET bounds
instead of the exact WCET. In this work we propose a technique to
compute the exact instruction fetch contribution to the WCET (IFC-WCET)
in presence of a LRU instruction cache. We prove that an exact computation
does not need to analyze the full exponential number of possible
execution paths, but only a bounded subset of them. In the benchmark
codes we have studied, the IFC-WCET is up to 62% lower than a bound
computed with a widely used approach, and the difference between
the number of possible execution paths and the ones relevant for
the analysis is extremely large.

BibTeX:

@inproceedings{Aparicio08Avoiding,
  author = {Aparicio, L. C. and Segarra, J. and Rodríguez, C. and Villarroel, J. L. and Viñals, V.},
  title = {Avoiding the WCET Overestimation on LRU Instruction Cache},
  booktitle = {Proc. 14th IEEE International Conference on Embedded and Real-Time Computing Systems and Applications (RTCSA 08)},
  publisher = {IEEE Computer Society Press},
  year = {2008},
  pages = {393--398},
  doi = {10.1109/RTCSA.2008.10}
}

I. Pellejero, M. Urbiztondo, M. Villarroya-Gaudo, J. Sesé, M. Pina and J. Santamaría (2008), "Development of etching processes for the micropatterning of silicalite films", Microporous and Mesoporous Materials. Vol. 114(1-3), pp. 110-120.

[BibTeX] [DOI] [URL]

BibTeX:

@article{Pellejero2008110,
  author = {I. Pellejero and M. Urbiztondo and Villarroya-Gaudo, M. and J. Sesé and M.P. Pina and J. Santamaría},
  title = {Development of etching processes for the micropatterning of silicalite films},
  journal = {Microporous and Mesoporous Materials},
  year = {2008},
  volume = {114},
  number = {1-3},
  pages = {110--120},
  url = {http://www.sciencedirect.com/science/article/pii/S1387181107007445},
  doi = {10.1016/j.micromeso.2007.12.023}
}

L. Ramos, J. Briz, P. Ibáñez and V. Viñals (2008), "Low-Cost Adaptive Data Prefetching", In Euro-Par 2008 Parallel Processing. Vol. 5168, pp. 327-336. Springer Berlin / Heidelberg.

[BibTeX] [URL]

BibTeX:

@inproceedings{Ramos2008,
  author = {Ramos, Luis and Briz, José and Ibáñez, Pablo and Viñals, Víctor},
  editor = {Luque, Emilio and Margalef, Tomàs and Benítez, Domingo},
  title = {Low-Cost Adaptive Data Prefetching},
  booktitle = {Euro-Par 2008 Parallel Processing},
  publisher = {Springer Berlin / Heidelberg},
  year = {2008},
  volume = {5168},
  pages = {327--336},
  note = {Acceptance rate: 34%. SpringerLink: 10.1007/978-3-540-85451-7_36},
  url = {http://dx.doi.org/10.1007/978-3-540-85451-7_36}
}

J. Resano, D. Mozos, F. Catthoor, J.A. Clemente and C. González (2008), "Efficiently scheduling run-time reconfigurations", Transactions on Design Automation of Electronic Systems. Vol. 13(4), pp. 58:1-58:12. ACM.

[BibTeX]

BibTeX:

@article{Resano2008Scheduling,
  author = {Javier Resano and Daniel Mozos and Francky Catthoor and Juan Antonio Clemente and Carlos González},
  title = {Efficiently scheduling run-time reconfigurations},
  journal = {Transactions on Design Automation of Electronic Systems},
  publisher = {ACM},
  year = {2008},
  volume = {13},
  number = {4},
  pages = {58:1--58:12}
}

J.A. Clemente, C. González, J. Resano and D. Mozos (2008), "A Hardware Task-Graph Scheduler for Reconfigurable Multi-tasking Systems", In International Conference on ReConFigurable Computing and FPGAs (ReConFig). Cancún, Quintana Roo, México

[BibTeX]

BibTeX:

@inproceedings{Resano2008TaskGraph,
  author = {Juan Antonio Clemente and Carlos González and Javier Resano and Daniel Mozos},
  title = {A Hardware Task-Graph Scheduler for Reconfigurable Multi-tasking Systems},
  booktitle = {International Conference on ReConFigurable Computing and FPGAs (ReConFig)},
  year = {2008}
}

V. Cholvi and J. Segarra (2008), "Analysis and placement of storage capacity in large distributed video servers", Computer Communications. Vol. 31(15), pp. 3604-3612. Elsevier.

[Abstract] [BibTeX] [DOI] [URL] [PDF]

Abstract: In this paper, we study how to distribute storage capacity along a
hierarchical system with cache-servers located at each node. This
system is intended to deliver stored video streams in a video-on-demand
way, ensuring that, once started, a transmission will be completed
without any delay or quality loss. We use off-line smoothing for
videos, dividing them into CBR video parts. Also, our request rates
are distributed following a 24 h audience curve. In this system,
when a request is received, the server reserves the required bandwidth
at the required time slots, trying to serve the video as soon as
possible. We perform a detailed analysis by means of simulations
of the start-up time delay for some storage distributions. It shows
that an adequate storage distribution can increase performance about
25% with respect to a uniform distribution and about 47% with respect
to one in which all the storage is attached to the gateway routers
that connect the final users. We also analyze bandwidth usage, comparing
the behavior of these storage distributions. Finally, we present
a method which allows dynamic and transparent video reallocations
when their popularity changes.

BibTeX:

@article{Segarra08Analysis,
  author = {Vicent Cholvi and Juan Segarra},
  title = {Analysis and placement of storage capacity in large distributed video servers},
  journal = {Computer Communications},
  publisher = {Elsevier},
  year = {2008},
  volume = {31},
  number = {15},
  pages = {3604--3612},
  url = {http://www.sciencedirect.com/science/article/B6TYP-4STYTYD-6/2/2971affb87cf195fccbfe8c1021e3d6e},
  doi = {10.1016/j.comcom.2008.06.012}
}

J. Alastruey, T. Monreal, V. Viñals and M. Valero (2007), "Microarchitectural Support for Speculative Register Renaming", In Proc. IEEE International Parallel and Distributed Processing Symposium (IPDPS 2007). , pp. 1-10.

[Abstract] [BibTeX] [DOI] [PDF]

Abstract: This paper proposes and evaluates a new microarchitecture for out-of-order
processors that supports speculative renaming. We call speculative
renaming to the speculative omission of physical register allocation
along with the speculative early release of physical registers. These
renaming policies may cause a register operand not to be kept in
the physical register file (PRF). Thus, we add a low-ported auxiliary
register file (XRF) located outside the processor core that keeps
the values absent in PRF and supplies them at higher latency. To
support the location of register operands being either in PRF or
XRF, we use virtual registers. We consider omission and release policies
directed by hardware prediction. Namely, we use a single last-use
predictor that directs both speculative omission and release. We
call this mechanism SR-LUP (speculative renaming based on last-use
prediction). Two last-use predictor designs of incremental complexity
and performance are analyzed. In a 256-ROB, 8-way processor with
an 80int+80fp PRF, SR-LUP with an 11-port 256int+256fp XRF, speeds
up computations up to 11.5% and 29% for INT and FP SPEC2K benchmarks,
respectively. For FP benchmarks, if the PRF limits the clock frequency,
a conventionally managed 128int+128fp PRF can be replaced using SR-LUP
by a 64int+64fp PRF backed up with a 10-port 224int+224fp XRF, showing
19% IPS gain.

BibTeX:

@inproceedings{Alastruey2007,
  author = {Alastruey, J. and Monreal, T. and Viñals, V. and Valero, M.},
  title = {Microarchitectural Support for Speculative Register Renaming},
  booktitle = {Proc. IEEE International Parallel and Distributed Processing Symposium (IPDPS 2007)},
  year = {2007},
  pages = {1--10},
  doi = {10.1109/IPDPS.2007.370237}
}

A. Bosque, P. Ibáñez, V. Viñals, P. Stenström and J.M. Llabería (2007), "Characterization of Apache web server with Specweb2005", In MEDEA '07: Proceedings of the 2007 workshop on MEmory performance. New York, NY, USA , pp. 65-72. ACM.

[Abstract] [BibTeX] [DOI]

Abstract: Computer manufacturers offer today multicore with multi-threading
capabilities and a broad range of number of cores. An important market
today for these multicores is in the server domain. Web servers are
a class of servers which are widely used to provide access to files
and also as front-ends of more complex services. In this paper the
performance of Apache web server is characterized on multicore chips
using Specweb2005 as URL request generator. This benchmark provides
three workloads in order to characterize different usage environments.
We also compare its performance against Surge that simulates a static
web page URL request generator. We find that the L2 data miss rate
per instruction is below 1.4%, more than the 60% of the misses are
classified as cold or capacity misses and the true sharing misses
represent between 12% and 38% of all the misses. We observe that
though the data miss rate is small, accesses to main memory represent
up to 42% of the execution time. By contrast the true sharing misses
that could be up to 38% of all the misses, represent a small fraction
of time due to the small latency of cache-to-cache transfers inside
the chip.

BibTeX:

@inproceedings{Bosque2007,
  author = {Bosque, Ana and Ibáñez, Pablo and Viñals, Víctor and Stenström, Per and Llabería, Jose M.},
  title = {Characterization of Apache web server with Specweb2005},
  booktitle = {MEDEA '07: Proceedings of the 2007 workshop on MEmory performance},
  publisher = {ACM},
  year = {2007},
  pages = {65--72},
  doi = {10.1145/1327171.1327179}
}

M.V. Gaudo, G. Abadal, J. Verd, J. Teva, F. Perez-Murano, E.F. Costa, J. Montserrat, A. Uranga, J. Esteve and N. Barniol (2007), "Time-Resolved Evaporation Rate of Attoliter Glycerine Drops Using On-Chip CMOS Mass Sensors Based on Resonant Silicon Micro Cantilevers", IEEE Transactions on Nanotechnology. Vol. 6(5), pp. 509-512.

[Abstract] [BibTeX] [DOI]

Abstract: The time-resolved evaporation rate of small glycerine drops (in the
attoliter range) is determined by means of a mass sensor based on
a resonant cantilever integrated in a CMOS chip. The cantilever is
fabricated on crystalline silicon, using silicon-on-insulator (SOI)
substrates for the integration of the CMOS-MEMS. Glycerine drops
are deposited at the free end of the cantilever. The high mass sensitivity
of the sensor (8 ag/Hz) allows to determine the evaporation rate
for glycerine drops smaller than 500 aL, which are found to be below
3.2 aL/s in volume or 4 fg/s in mass.

BibTeX:

@article{Gaudo2007,
  author = {Gaudo, M. V. and Abadal, G. and Verd, J. and Teva, J. and Perez-Murano, F. and Costa, E. F. and Montserrat, J. and Uranga, A. and Esteve, J. and Barniol, N.},
  title = {Time-Resolved Evaporation Rate of Attoliter Glycerine Drops Using On-Chip CMOS Mass Sensors Based on Resonant Silicon Micro Cantilevers},
  journal = {IEEE Transactions on Nanotechnology},
  year = {2007},
  volume = {6},
  number = {5},
  pages = {509--512},
  doi = {10.1109/TNANO.2007.901477}
}

(2007), "XVIII Jornadas de Paralelismo", September, 2007.

[BibTeX]

BibTeX:

@proceedings{Ibanez07jjpar,
  editor = {Pablo Ibáñez and Enrique Torres and Juan Segarra and Jesús Alastruey and Luis Manuel Ramos},
  title = {XVIII Jornadas de Paralelismo},
  year = {2007}
}

L.M. Ramos, J.L. Briz, P.E. Ibáñez and V. Viñals (2007), "Data prefetching in a cache hierarchy with high bandwidth and capacity", SIGARCH Comput. Archit. News. New York, NY, USA Vol. 35(4), pp. 37-44. ACM.

[Abstract] [BibTeX] [DOI] [PDF]

Abstract: In this paper we evaluate four hardware data prefetchers in the context
of a high-performance three-level on chip cache hierarchy with high
bandwidth and capacity. We consider two classic prefetchers (Sequential
Tagged and Stride) and two correlating prefetchers: PC/DC, a recent
method with a superior score and low-sized tables, and P-DFCM, a
new method. Like PC/DC, P-DFCM focuses on local delta sequences,
but it is based on the DFCM value predictor. We explore different
prefetch degrees and distances. Running SPEC2000, Olden and IAbench
applications, results show that this kind of cache hierarchy turns
prefetching aggressiveness into success for the four prefetchers.
Sequential Tagged is the best, and deserves further attention to
cut its losses in some applications. PC/DC results are matched or
even improved by P-DFCM, using far fewer accesses to tables while
keeping sizes low.

BibTeX:

@article{Ramos2007,
  author = {Ramos, Luis M. and Briz, José Luis and Ibáñez, Pablo E. and Viñals, Victor},
  title = {Data prefetching in a cache hierarchy with high bandwidth and capacity},
  journal = {SIGARCH Comput. Archit. News},
  publisher = {ACM},
  year = {2007},
  volume = {35},
  number = {4},
  pages = {37--44},
  doi = {10.1145/1327312.1327319}
}

J. Segarra and V. Cholvi (2007), "Convergence of periodic broadcasting and video-on-demand", Computer Communications. Vol. 30(5), pp. 1136-1141. Elsevier.

[Abstract] [BibTeX] [DOI] [URL] [PDF]

Abstract: Research on video-on-demand transmissions is essentially divided into
periodic broadcasting methods and on-demand methods. Periodic broadcasting
is aimed to schedule transmissions off-line, so that an optimized
time schedule is achieved. On the other hand video-on-demand has
to deal with constraints at requesting times. Thus, studies on these
areas have been quite isolated. Obviously, in periodic broadcasting
all parameters are known in advance, so timetables can be accurately
adjusted and it is assumed transmissions can be arranged to use less
bandwidth than video-on-demand. In this paper, we analyze the convergence
of both paradigms, showing that the claims that argue that VoD schemes
use more bandwidth than PB ones are not necessarily true. We state
this argument by proving how to convert any periodic broadcasting
method into an on-demand one, which will use equal or less bandwidth.
Moreover, we show that this converted on-demand method can also offer
shorter serving times.

BibTeX:

@article{Segarra07Convergence,
  author = {Juan Segarra and Vicent Cholvi},
  title = {Convergence of periodic broadcasting and video-on-demand},
  journal = {Computer Communications},
  publisher = {Elsevier},
  year = {2007},
  volume = {30},
  number = {5},
  pages = {1136--1141},
  note = {Advances in Computer Communications Networks},
  url = {http://www.sciencedirect.com/science/article/B6TYP-4MR1HD7-1/2/e7ac99bb58b0117cc755a45a520cd7eb},
  doi = {10.1016/j.comcom.2006.12.007}
}

M. Villarroya-Gaudó, N. Barniol, C. Martin, F. Pérez-Murano, J. Esteve, L. Bruchhaus, R. Jede, E. Bourhis and J. Gierak (2007), "Fabrication of nanogaps for MEMS prototyping using focused ion beam as a lithographic tool and reactive ion etching pattern transfer", Microelectronic Engineering. Vol. 84(5-8), pp. 1215-1218.

[BibTeX] [DOI] [URL]

BibTeX:

@article{Villarroya20071215,
  author = {Maria Villarroya-Gaudó and Nuria Barniol and Cristina Martin and Francesc Pérez-Murano and Jaume Esteve and Lars Bruchhaus and Ralf Jede and Eric Bourhis and Jacques Gierak},
  title = {Fabrication of nanogaps for MEMS prototyping using focused ion beam as a lithographic tool and reactive ion etching pattern transfer},
  journal = {Microelectronic Engineering},
  year = {2007},
  volume = {84},
  number = {5--8},
  pages = {1215--1218},
  note = {Proceedings of the 32nd International Conference on Micro- and Nano-Engineering},
  url = {http://www.sciencedirect.com/science/article/pii/S0167931707001323},
  doi = {10.1016/j.mee.2007.01.074}
}

M. Villarroya-Gaudo, E. Figueras, J. Montserrat, J. Verd, J. Teva, G. Abadal, F. Perez-Murano, J. Esteve and N. Barniol (2006), "A platform for monolithic CMOS-MEMS integration on SOI wafers", Journal of Micromechanics and Microengineering. Vol. 16(10), pp. 2203.

[Abstract] [BibTeX] [URL]

Abstract: A new platform for micro- and nano-electromechanical systems based
on crystalline silicon as the structural layer in CMOS substrates
is presented. This platform is fabricated using silicon on insulator
(SOI) substrates, which allows the monolithic integration of the
mechanical transducer on crystalline silicon while the characteristics
of the structural layer are kept independent from the CMOS technology.
We report the design characteristics, the fabrication process and
an example of application of the CMOS SOI-MEMS platform to obtain
a mass sensor based on a crystalline silicon resonating cantilever.

BibTeX:

@article{0960-1317-16-10-038,
  author = {Villarroya-Gaudo, M. and Eduard Figueras and Josep Montserrat and Jaume Verd and Jordi Teva and Gabriel Abadal and Francesc Perez-Murano and Jaume Esteve and Nuria Barniol},
  title = {A platform for monolithic CMOS-MEMS integration on SOI wafers},
  journal = {Journal of Micromechanics and Microengineering},
  year = {2006},
  volume = {16},
  number = {10},
  pages = {2203},
  url = {http://stacks.iop.org/0960-1317/16/i=10/a=038}
}

J. Alastruey, T. Monreal, V. Viñals and M. Valero (2006), "Speculative early register release", In Proceedings of the 3rd conference on Computing frontiers (CF '06). New York, NY, USA , pp. 291-302. ACM.

[Abstract] [BibTeX] [DOI] [PDF]

Abstract: The late release policy of conventional renaming keeps many registers
in the register file assigned in spite of containing values that
will never be read in the future. In this work, we study the potential
of a novel scheme that speculatively releases a physical register
as soon as it has been read by a predicted last instruction that
references its value. An auxiliary register file placed outside the
critical paths of the processor pipeline holds the early released
values just in case they are unexpectedly referenced by some instruction.
In addition to demonstrate the feasibility of a last-use predictor,
this paper also analyzes the auxiliary register file (latency and
size) required to support a speculative early release mechanism that
uses a perfect predictor. The obtained results set the performance
bound that any real speculative early release implementation is able
to reach. We show that in a processor with a 64int+64fp register
file, a perfect early release supported by an unbounded auxiliary
register file has the potential of speeding up computations up to
23% and 47% for SPECint2000 and SPECfp2000 benchmarks, respectively.
Speculative early release can also be used to reduce register file
size without losing performance. For instance, a processor with a
conventionally managed 96int+96fp register file could be replaced
for equal IPC with a 64int+64fp register file managed with perfect
early register release and backed with a 64int+64fp auxiliary register
file, this representing a 12% IPS (Instructions Per Second) increase
if the processor frequency were constrained by the register file
access time.

BibTeX:

@inproceedings{Alastruey2006,
  author = {Alastruey, Jesús and Monreal, Teresa and Viñals, Víctor and Valero, Mateo},
  title = {Speculative early register release},
  booktitle = {Proceedings of the 3rd conference on Computing frontiers (CF '06)},
  publisher = {ACM},
  year = {2006},
  pages = {291--302},
  doi = {10.1145/1128022.1128061}
}

J. Alastruey, J.L. Briz, P. Ibáñez and V. Viñals (2006), "Software Demand, Hardware Supply", IEEE MICRO., July, 2006. Vol. 26(4), pp. 72-82.

[BibTeX] [DOI] [PDF]

BibTeX:

@article{Alastruey2006a,
  author = {Alastruey, J. and Briz, J. L. and Ibáñez, P. and Viñals, V.},
  title = {Software Demand, Hardware Supply},
  journal = {IEEE MICRO},
  year = {2006},
  volume = {26},
  number = {4},
  pages = {72--82},
  doi = {10.1109/MM.2006.80}
}

R. Gran, E. Morancho, A. Olive and J.M. Llabería (2006), "An Enhancement for a Scheduling Logic Pipelined over two Cycles", In Proc. Int. Conf. Computer Design ICCD 2006. , pp. 203-209.

[Abstract] [BibTeX] [DOI]

Abstract: Out of order processors use the dynamic scheduling logic both to expose
and to exploit parallelism. Pipelining this logic may sacrifice the
ability to execute dependent instructions in consecutive cycles.
Several previous studies have shown that pipelining the scheduling
logic over two cycles degrades performance; our evaluations, in a
4-way machine, on SPEC-2000 integer benchmarks show a performance
degradation about 11% compared to an unpipelined scheduling logic.
In this work, we present two non-speculative enhancements for a scheduling
logic pipelined over two cycles. The idea is computing in advance
which instructions will be woken-up by all instructions that are
currently competing for selection. Once all of them have been selected,
the pre-computed group of instructions can compete for selection
in next cycle. The enhancement goal is to tolerate the scheduling-loop
latency when not enough ILP is available through the scheduling of
dependent instructions in consecutive cycles. Our results in a 4-way
machine show that our two proposed enhancements perform, on average,
slightly better than two previously proposed speculative schedulers.
The performance of our proposals is within a 2.6% and 2% of an unpipelined
ideal scheduler.

BibTeX:

@inproceedings{Gran2006,
  author = {Gran, R. and Morancho, E. and Olive, A. and Llabería, J. M.},
  title = {An Enhancement for a Scheduling Logic Pipelined over two Cycles},
  booktitle = {Proc. Int. Conf. Computer Design ICCD 2006},
  year = {2006},
  pages = {203--209},
  doi = {10.1109/ICCD.2006.4380818}
}

L.M. Ramos, J.L. Briz, P.E. Ibáñez and V. Viñals (2006), "Data prefetching in a cache hierarchy with high bandwidth and capacity", In MEDEA '06: Proceedings of the 2006 workshop on MEmory performance. New York, NY, USA , pp. 37-44. ACM.

[Abstract] [BibTeX] [DOI] [PDF]

BibTeX:

@inproceedings{Ramos2006,
  author = {Ramos, Luis M. and Briz, José Luis and Ibáñez, Pablo E. and Viñals, Víctor},
  title = {Data prefetching in a cache hierarchy with high bandwidth and capacity},
  booktitle = {MEDEA '06: Proceedings of the 2006 workshop on MEmory performance},
  publisher = {ACM},
  year = {2006},
  pages = {37--44},
  doi = {10.1145/1166133.1166138}
}

B. Sahelices, A.d. Dios, P. Ibáñez, V. Viñals and J. Llabería (2006), "Speeding-up Synchronizations in DSM Multiprocessors", In Proceedings of the 12th International Euro-Par Conference (EUROPAR 2006).

[Abstract] [BibTeX] [PDF]

Abstract: Synchronization in parallel programs is a major performance bottleneck.
Shared data is protected by locks and a lot of time is spent in the
competition arising at the lock hand-off. In this period of time,
a large amount of traffic is targeted to the line holding the lock
variable. In order to be serialized, the requests to the same cache
line can either be bounced (NACKed) or buffered in the coherence
controller. In this paper we focus on systems whose coherence controllers
buffer requests. During lock hand-off only the requests from the
winning processor contribute to the computation progress, because
the winning processor is the only one that will advance the work.
This key observation leads us to propose a hardware mechanism named
Request Bypass, which allows requests from the winning processor
to bypass the requests buffered in the home coherence controller
keeping the lock line. The mechanism does not require compiler or
programmer support nor ISA or coherence protocol changes. By simulating
a 32 processor system we show that Request Bypass reduces execution
time and lock stall time up to 35% and 75%, respectively. The programs
limited by synchronization benefit the most from Request Bypass.

BibTeX:

@inproceedings{Sahelices2006,
  author = {Sahelices, B. and Dios, A. de and Ibáñez, P. and Viñals, V. and Llabería, J.M.},
  title = {Speeding-up Synchronizations in DSM Multiprocessors},
  booktitle = {Proceedings of the 12th International Euro-Par Conference (EUROPAR 2006)},
  year = {2006}
}

M. Villarroya-Gaudo, E. Figueras, J. Verd, J. Teva, G. Abadal, F. Perez-Murano, J. Montserrat, A. Uranga, J. Esteve and N. Barniol (2006), "CMOS-SOI platform for monolithic integration of crystalline silicon MEMS", Electronics Letters. Vol. 42(14), pp. 800-801.

[Abstract] [BibTeX] [DOI]

Abstract: A new platform for the fabrication of crystalline micro- and nano-electromechanical
systems fully integrable with CMOS is presented. A pre-CMOS process
on SOI wafers allows bulk silicon areas for standard CMOS processing
and areas with a stack layer of silicon and silicon oxide to be obtained,
in which a set of microelectromechanical devices can be fabricated.
An integrated resonant beam system with electrical actuation and
detection fabricated according to the presented approach is provided.

BibTeX:

@article{Villarroya2006,
  author = {Villarroya-Gaudo, M. and Figueras, E. and Verd, J. and Teva, J. and Abadal, G. and Perez-Murano, F. and Montserrat, J. and Uranga, A. and Esteve, J. and Barniol, N.},
  title = {CMOS-SOI platform for monolithic integration of crystalline silicon MEMS},
  journal = {Electronics Letters},
  year = {2006},
  volume = {42},
  number = {14},
  pages = {800--801},
  doi = {10.1049/el:20061097}
}

M. Villarroya-Gaudo, J. Verd, J. Teva, G. Abadal, E. Forsen, F.P. Murano, A. Uranga, E. Figueras, J. Montserrat, J. Esteve, A. Boisen and N. Barniol (2006), "System on chip mass sensor based on polysilicon cantilevers arrays for multiple detection", Sensors and Actuators A: Physical. Vol. 132(1), pp. 154-164.

[BibTeX] [DOI] [URL]

BibTeX:

@article{Villarroya2006154,
  author = {Villarroya-Gaudo, M. and Jaume Verd and Jordi Teva and Gabriel Abadal and Esko Forsen and Francesc Pérez Murano and Arantxa Uranga and Eduard Figueras and Josep Montserrat and Jaume Esteve and Anja Boisen and Núria Barniol},
  title = {System on chip mass sensor based on polysilicon cantilevers arrays for multiple detection},
  journal = {Sensors and Actuators A: Physical},
  year = {2006},
  volume = {132},
  number = {1},
  pages = {154--164},
  note = {The 19th European Conference on Solid-State Transducers},
  url = {http://www.sciencedirect.com/science/article/pii/S0924424706002780},
  doi = {10.1016/j.sna.2006.04.002}
}

M.J. Garzarán, M. Prvulovic, J.M. Llabería, V. Viñals, L. Rauchwerger and J. Torrellas (2005), "Tradeoffs in buffering speculative memory state for thread-level speculation in multiprocessors", ACM Trans. Archit. Code Optim.. New York, NY, USA Vol. 2(3), pp. 247-279. ACM.

[BibTeX] [DOI] [PDF]

BibTeX:

@article{Garzaran2005,
  author = {Garzarán, María Jesús and Prvulovic, Milos and Llabería, José María and Viñals, Víctor and Rauchwerger, Lawrence and Torrellas, Josep},
  title = {Tradeoffs in buffering speculative memory state for thread-level speculation in multiprocessors},
  journal = {ACM Trans. Archit. Code Optim.},
  publisher = {ACM},
  year = {2005},
  volume = {2},
  number = {3},
  pages = {247--279},
  doi = {10.1145/1089008.1089010}
}

E.F. Torres, P. Ibáñez, V. Viñals and J.M. Llaberia (2005), "Store buffer design in first-level multibanked data caches", In Proc. 32nd International Symposium on Computer Architecture (ISCA '05)., June, 2005. , pp. 469-480.

[BibTeX] [DOI] [PDF]

BibTeX:

@inproceedings{Torres2005,
  author = {Torres, E. F. and Ibáñez, P. and Viñals, V. and Llaberia, J. M.},
  title = {Store buffer design in first-level multibanked data caches},
  booktitle = {Proc. 32nd International Symposium on Computer Architecture (ISCA '05)},
  year = {2005},
  pages = {469--480},
  doi = {10.1109/ISCA.2005.47}
}

J. Verd, G. Abadal, J. Teva, M. Villarroya-Gaudo, A. Uranga, X. Borrise, F. Campabadal, J. Esteve, E.F. Costa, F. Perez-Murano, Z.J. Davis, E. Forsen, A. Boisen and N. Barniol (2005), "Design, fabrication, and characterization of a submicroelectromechanical resonator with monolithically integrated CMOS readout circuit", Journal of Microelectromechanical Systems. Vol. 14(3), pp. 508-519.

[Abstract] [BibTeX] [DOI]

Abstract: In this paper, we report on the main aspects of the design, fabrication,
and performance of a microelectromechanical system constituted by
a mechanical submicrometer scale resonator (cantilever) and the readout
circuitry used for monitoring its oscillation through the detection
of the capacitive current. The CMOS circuitry is monolithically integrated
with the mechanical resonator by a technology that allows the combination
of standard CMOS processes and novel nanofabrication methods. The
integrated system constitutes an example of a submicroelectromechanical
system to be used as a cantilever-based mass sensor with both a high
sensitivity and a high spatial resolution (on the order of 10^-18
g and 300 nm, respectively). Experimental results on the electrical
characterization of the resonance curve of the cantilever through
the integrated CMOS readout circuit are shown.

BibTeX:

@article{Verd2005,
  author = {Verd, J. and Abadal, G. and Teva, J. and Villarroya-Gaudo, M. and Uranga, A. and Borrise, X. and Campabadal, F. and Esteve, J. and Costa, E. F. and Perez-Murano, F. and Davis, Z. J. and Forsen, E. and Boisen, A. and Barniol, N.},
  title = {Design, fabrication, and characterization of a submicroelectromechanical resonator with monolithically integrated CMOS readout circuit},
  journal = {Journal of Microelectromechanical Systems},
  year = {2005},
  volume = {14},
  number = {3},
  pages = {508--519},
  doi = {10.1109/JMEMS.2005.844845}
}

M. Villarroya-Gaudo, J. Verd, J. Teva, G. Abadal, F. Perez, J. Esteve and N. Barniol (2005), "Cantilever based MEMS for multiple mass sensing", In Proc. PhD Research in Microelectronics and Electronics. Vol. 1, pp. 197-200.

[Abstract] [BibTeX] [DOI]

Abstract: A cantilever based micro electro mechanical system (MEMS) for mass
detection is presented. The sensor for multiple detections is composed
by several cantilevers in an array configuration integrated monolithically
with CMOS. Cantilevers are excited electrostatically to its resonance
frequency. The oscillation of the microcantilever is detected by
a capacitive detection technique. Mass variation is detected by resonance
frequency shifting. The mechanical transducers are fabricated after
CMOS process on polysilicon, one of the CMOS layers. Optical lithography
is used for the cantilevers definitions. Cantilevers of 50 μm
length, 1.1 μm wide and 600 nm thick have been defined. This sensor
provides a mass sensitivity of 70 ag/Hz.

BibTeX:

@inproceedings{Villarroya2005,
  author = {Villarroya-Gaudo, M. and Verd, J. and Teva, J. and Abadal, G. and Perez, F. and Esteve, J. and Barniol, N.},
  title = {Cantilever based MEMS for multiple mass sensing},
  booktitle = {Proc. PhD Research in Microelectronics and Electronics},
  year = {2005},
  volume = {1},
  pages = {197--200},
  doi = {10.1109/RME.2005.1543038}
}

M. Villarroya-Gaudo, J. Verd, J. Teva, G. Abadal, E. Figueras, F. Perez-Murano, J. Esteve and N. Barniol (2005), "Sensor based on arrays of sub-micrometer scale resonant silicon cantilevers integrated monolithically with CMOS circuitry", In Proc. Spanish Conf. Electron Devices. , pp. 603-606.

[Abstract] [BibTeX] [DOI]

Abstract: A mass sensor, based on arrays of cantilevers for multiple detection,
is presented. Excitation and readout is performed electrostatically.
Polysilicon cantilevers are integrated monolithically with CMOS in
a compatible process. Integration of arrays of cantilever allows
performing multiple detections on the same device. A multiplexing
system to select individual cantilevers is implemented as well as
a scheme based on two readout circuits for differential measurements.
A mass resolution of 5·10^-18 g has been achieved
in the first working devices.

BibTeX:

@inproceedings{Villarroya2005a,
  author = {Villarroya-Gaudo, M. and Verd, J. and Teva, J. and Abadal, G. and Figueras, E. and Perez-Murano, F. and Esteve, J. and Barniol, N.},
  title = {Sensor based on arrays of sub-micrometer scale resonant silicon cantilevers integrated monolithically with CMOS circuitry},
  booktitle = {Proc. Spanish Conf. Electron Devices},
  year = {2005},
  pages = {603--606},
  doi = {10.1109/SCED.2005.1504530}
}

M. Villarroya-Gaudo, F. Pérez-Murano, C. Martin, Z. Davis, A. Boisen, J. Esteve, E. Figueras, J. Montserrat and N. Barniol (2004), "AFM lithography for the definition of nanometre scale gaps: application to the fabrication of a cantilever-based sensor with electrochemical current detection", Nanotechnology. Vol. 15(7), pp. 771.

[Abstract] [BibTeX] [URL]

Abstract: The concept, design and fabrication of a cantilever-based sensor operating
in liquid for biochemical applications are reported. A novel approach
for detecting the deflection of a functionalized cantilever is proposed.
It consists of detecting the change of the electrochemical current
level when a voltage is applied between a deflecting cantilever,
acting as one of the electrodes, and a reference fixed electrode
placed in close proximity to the free extreme of the cantilever.
The detection is possible since the distance between the two electrodes
is smaller than 50 nm. The sensor is fabricated by using a combination
of MEMS technology and AFM-based lithography.

BibTeX:

@article{0957-4484-15-7-009,
  author = {Villarroya-Gaudo, M. and Francesc Pérez-Murano and Cristina Martin and Zachary Davis and Anja Boisen and Jaume Esteve and Eduard Figueras and Josep Montserrat and Nuria Barniol},
  title = {AFM lithography for the definition of nanometre scale gaps: application to the fabrication of a cantilever-based sensor with electrochemical current detection},
  journal = {Nanotechnology},
  year = {2004},
  volume = {15},
  number = {7},
  pages = {771},
  url = {http://stacks.iop.org/0957-4484/15/i=7/a=009}
}

T. Monreal, V. Viñals, J. Gonzalez, A. Gonzalez and M. Valero (2004), "Late allocation and early release of physical registers", IEEE Transactions on Computers. Vol. 53(10), pp. 1244-1259.

[Abstract] [BibTeX] [DOI] [PDF]

Abstract: The register file is one of the critical components of current processors
in terms of access time and power consumption. Among other things,
the potential to exploit instruction-level parallelism is closely
related to the size and number of ports of the register file. In
conventional register renaming schemes, both register allocation
and releasing are conservatively done, the former at the rename stage,
before registers are loaded with values, and the latter at the commit
stage of the instruction redefining the same register, once registers
are not used any more. We introduce VP-LAER, a renaming scheme that
allocates registers later and releases them earlier than conventional
schemes. Specifically, physical registers are allocated at the end
of the execution stage and released as soon as the processor realizes
that there will be no further use of them. VP-LAER enhances register
utilization, that is, the fraction of allocated registers having
a value to be read in the future. Detailed cycle-level simulations
show either a significant speedup for a given register file size
or a reduction in the register file size for a given performance
level, especially for floating-point codes, where the register file
pressure is usually high.

BibTeX:

@article{Monreal2004,
  author = {Monreal, T. and Viñals, V. and Gonzalez, J. and Gonzalez, A. and Valero, M.},
  title = {Late allocation and early release of physical registers},
  journal = {IEEE Transactions on Computers},
  year = {2004},
  volume = {53},
  number = {10},
  pages = {1244--1259},
  doi = {10.1109/TC.2004.79}
}

E. Torres, P. Ibáñez, V. Viñals and J. Llabería (2004), "Contents Management in First-Level Multibanked Data Caches", In 10th International Euro-Par Conference, LNCS 3149. Vol. 3149/2004, pp. 516-524. Springer Berlin / Heidelberg.

[Abstract] [BibTeX] [DOI] [PDF]

Abstract: High-performance processors will increasingly rely on multibanked
first-level caches to meet frequency requirements. In this paper
we introduce replication degree and data distribution as the main
multibanking design axes. We sample this design space by selecting
current data distribution policy proposals, measuring them on a detailed
model of a deep pipelined processor and evaluating the trade-off
introduced when the replication degree is taken into account. We
find that the best design points use data address interleaving policies
and several degrees of bank replication.

BibTeX:

@inproceedings{Torres2004,
  author = {E. Torres and P. Ibáñez and V. Viñals and J.M. Llabería},
  editor = {LNCS , Springer Berlin / Heidelberg},
  title = {Contents Management in First-Level Multibanked Data Caches},
  booktitle = {10th International Euro-Par Conference, LNCS 3149},
  publisher = {Springer Berlin / Heidelberg},
  year = {2004},
  volume = {3149/2004},
  pages = {516--524},
  doi = {10.1007/978-3-540-27866-5_68}
}

M.J. Garzaran, M. Prvulovic, J.M. Llaberia, V. Viñals, L. Rauchwerger and J. Torrellas (2003), "Tradeoffs in buffering memory state for thread-level speculation in multiprocessors", In Proc. Ninth Int. Symp. High-Performance Computer Architecture HPCA-9 2003. , pp. 191-202.

[Abstract] [BibTeX] [DOI]

Abstract: Thread-level speculation provides architectural support to aggressively
run hard-to-analyze code in parallel. As speculative tasks run concurrently,
they generate unsafe or speculative memory state that needs to be
separately buffered and managed in the presence of distributed caches
and buffers. Such state may contain multiple versions of the same
variable. In this paper, we introduce a novel taxonomy of approaches
to buffering and managing multi-version speculative memory state
in multiprocessors. We also present a detailed complexity-benefit
tradeoff analysis of the different approaches. Finally, we use numerical
applications to evaluate the performance of the approaches under
a single architectural framework. Our key insights are that support
for buffering the state of multiple speculative tasks and versions
per processor is more complexity-effective than support for merging
the state of tasks with main memory lazily. Moreover, both supports
can be gainfully combined and, in large machines, their effect is
nearly fully additive. Finally, the more complex support for future
state in main memory can boost performance when buffers are under
pressure, but hurts performance when squashes are frequent.

BibTeX:

@inproceedings{Garzaran2003b,
  author = {Garzaran, M. J. and Prvulovic, M. and Llaberia, J. M. and Viñals, V. and Rauchwerger, L. and Torrellas, J.},
  title = {Tradeoffs in buffering memory state for thread-level speculation in multiprocessors},
  booktitle = {Proc. Ninth Int. Symp. High-Performance Computer Architecture HPCA-9 2003},
  year = {2003},
  pages = {191--202},
  doi = {10.1109/HPCA.2003.1183537}
}

M.J. Garzaran, M. Prvulovic, V. Viñals, J.M. Llaberia, L. Rauchwerger and J. Torrellas (2003), "Using software logging to support multiversion buffering in thread-level speculation", In Proc. 12th Int. Conf. Parallel Architectures and Compilation Techniques PACT 2003. , pp. 170-181.

Abstract: In thread-level speculation (TLS), speculative tasks generate memory
state that cannot simply be combined with the rest of the system
because it is unsafe. One way to deal with this difficulty is to
allow speculative state to merge with memory but back up in an undo
log the data that will be overwritten. Such an undo log can be used
to roll back to a safe state if a violation occurs. This approach
is said to use future main memory (FMM), as memory keeps the most
speculative state. While the aggressive approach of FMM systems often
delivers better performance than more conservative approaches, it
also requires additional hardware support. To simplify the design
of FMM systems, we propose a software-only design for the undo log
system. We show that an FMM system with software logging is a good
design point: the design has less implementation complexity than
an FMM system with hardware logs, and it only reduces performance
moderately. In particular, in a simulated 16-processor machine, applications
take only 10% longer to execute than if the system had the logging
system fully implemented in hardware.

BibTeX:

@inproceedings{Garzaran2003c,
  author = {Garzaran, M. J. and Prvulovic, M. and Viñals, V. and Llaberia, J. M. and Rauchwerger, L. and Torrellas, J.},
  title = {Using software logging to support multiversion buffering in thread-level speculation},
  booktitle = {Proc. 12th Int. Conf. Parallel Architectures and Compilation Techniques PACT 2003},
  year = {2003},
  pages = {170--181},
  doi = {10.1109/PACT.2003.1238013}
}

E. Torres, P. Ibáñez, V. Viñals and J.M. Llabería (2003), "Counteracting Bank Misprediction in Sliced First-Level Caches", In 9th International Euro-Par Conference, LNCS 2790. Vol. 2790 Springer Berlin / Heidelberg.

Abstract: Future processors having sliced memory pipelines will rely on bank
prediction to schedule memory instructions to a first-level cache
split into banks. In a deeply pipelined processor, even a small bank
misprediction rate may degrade performance severely. The goal of
this paper is to counteract the bank misprediction penalty, so that
in spite of such bank misprediction, performance suffers little.
Our contribution is twofold: a new recovery scheme for latency misprediction,
and two policies for selectively replicating loads to all banks.
The proposals have been evaluated for 4 and 8-way superscalar processors
and a wide range of pipeline depths. The best combination of our
mechanisms improves IPC of an 8-way baseline processor up to 11%,
removing up to two thirds of the bank misprediction penalty.

BibTeX:

@inproceedings{Torres2003,
  author = {E. Torres and P. Ibáñez and V. Viñals and J.M. Llabería},
  series = {Lecture Notes in Computer Science},
  title = {Counteracting Bank Misprediction in Sliced First-Level Caches},
  booktitle = {9th International Euro-Par Conference, LNCS 2790},
  publisher = {Springer Berlin / Heidelberg},
  year = {2003},
  volume = {2790},
  doi = {10.1007/978-3-540-45209-6_83}
}

F. Dang, M.J. Garzaran, M. Prvulovic, Y. Zhang, A. Jula, H. Yu, N. Amato, L. Rauchwerger and J. Torrellas (2002), "Smartapps, an application centric approach to high performance computing: compiler-assisted software and hardware support for reduction operations", In Abstracts and CD-ROM Parallel and Distributed Processing Symposium., International, IPDPS 2002. , pp. 172-181.

BibTeX:

@inproceedings{Dang2002,
  author = {Dang, F. and Garzaran, M. J. and Prvulovic, M. and Zhang, Ye and Jula, A. and Yu, Hao and Amato, N. and Rauchwerger, L. and Torrellas, J.},
  title = {Smartapps, an application centric approach to high performance computing: compiler-assisted software and hardware support for reduction operations},
  booktitle = {Abstracts and CD-ROM Parallel and Distributed Processing Symposium., International, IPDPS 2002},
  year = {2002},
  pages = {172--181},
  doi = {10.1109/IPDPS.2002.1016572}
}

T. Monreal, V. Viñals, A. Gonzalez and M. Valero (2002), "Hardware schemes for early register release", In Proc. International Conference on Parallel Processing. , pp. 5-13.

Abstract: Register files are becoming one of the critical components of current
out-of-order processors in terms of delay and power consumption,
since their potential to exploit instruction-level parallelism is
quite related to the size and number of ports of the register file.
In conventional register renaming schemes, register releasing is
conservatively done only after the instruction that redefines the
same register is committed. Instead, we propose a scheme that releases
registers as soon as the processor knows that there will be no further
use of them. We present two early releasing hardware implementations
with different performance/complexity trade-offs. Detailed cycle-level
simulations show either a significant speedup for a given register
file size, or a reduction in register file size for a given performance
level.

BibTeX:

@inproceedings{Monreal2002,
  author = {Monreal, T. and Viñals, V. and Gonzalez, A. and Valero, M.},
  title = {Hardware schemes for early register release},
  booktitle = {Proc. International Conference on Parallel Processing},
  year = {2002},
  pages = {5--13},
  doi = {10.1109/ICPP.2002.1040854}
}

M.J. Garzaran, J.L. Briz, P.E. Ibañez and V. Viñals (2001), "Hardware prefetching in bus-based multiprocessors: pattern characterization and cost-effective hardware", In Proc. Ninth Euromicro Workshop on Parallel and Distributed Processing. , pp. 345-354.

Abstract: Data prefetching has been widely studied as a technique to hide memory
access latency in multiprocessors. Most recent research on hardware
prefetching focuses either on uniprocessors, or on distributed shared
memory (DSM) and other non bus-based organizations. However, in the
context of bus-based SMPs, prefetching poses a number of problems
related to the lack of scalability and limited bus bandwidth of these
modest-sized machines. This paper considers how the number of processors
and the memory access patterns in the program influence the relative
performance of sequential and non-sequential prefetching mechanisms
in a bus-based SMP. We compare the performance of four inexpensive
hardware prefetching techniques, varying the number of processors.
After a breakdown of the results based on a performance model, we
propose a cost-effective hardware prefetching solution for implementing
on such modest-sized multiprocessors.

BibTeX:

@inproceedings{Garzaran2001,
  author = {Garzaran, M. J. and Briz, J. L. and Ibañez, P. E. and Viñals, V.},
  title = {Hardware prefetching in bus-based multiprocessors: pattern characterization and cost-effective hardware},
  booktitle = {Proc. Ninth Euromicro Workshop on Parallel and Distributed Processing},
  year = {2001},
  pages = {345--354},
  doi = {10.1109/EMPDP.2001.905061}
}

M.J. Garzaran, M. Prvulovic, Y. Zhang, A. Jula, H. Yu, L. Rauchwerger and J. Torrellas (2001), "Architectural support for parallel reductions in scalable shared-memory multiprocessors", In Proc. International Conference on Parallel Architectures and Compilation Techniques. , pp. 243-254.

Abstract: Reductions are important and time-consuming operations in many scientific
codes. Effective parallelization of reductions is a critical transformation
for loop parallelization, especially for sparse, dynamic applications.
Unfortunately, conventional reduction parallelization algorithms
are not scalable. In this paper, we present new architectural support
that significantly speeds up parallel reduction and makes it scalable
in shared-memory multiprocessors. The required architectural changes
are mostly confined to the directory controllers. Experimental results
based on simulations show that the proposed support is very effective.
While conventional software-only reduction parallelization delivers
average speedups of only 2.7 for 16 processors, our scheme delivers
average speedups of 7.6.

BibTeX:

@inproceedings{Garzaran2001b,
  author = {Garzaran, M. J. and Prvulovic, M. and Zhang, Ye and Jula, A. and Yu, Hao and Rauchwerger, L. and Torrellas, J.},
  title = {Architectural support for parallel reductions in scalable shared-memory multiprocessors},
  booktitle = {Proc. International Conference on Parallel Architectures and Compilation Techniques},
  year = {2001},
  pages = {243--254},
  doi = {10.1109/PACT.2001.953304}
}

M. Prvulovic, M.J. Garzaran, L. Rauchwerger and J. Torrellas (2001), "Removing architectural bottlenecks to the scalability of speculative parallelization", In Proc. 28th Annual International Symposium on Computer Architecture. , pp. 204-215.

Abstract: Speculative thread-level parallelization is a promising way to speed
up codes that compilers fail to parallelize. While several speculative
parallelization schemes have been proposed for different machine
sizes and types of codes, the results so far show that it is hard
to deliver scalable speedups. Often, the problem is not true dependence
violations, but sub-optimal architectural design. Consequently, we
attempt to identify and eliminate major architectural bottlenecks
that limit the scalability of speculative parallelization. The solutions
that we propose are: low-complexity commit in constant time to eliminate
the task commit bottleneck, a memory-based overflow area to eliminate
stall due to speculative buffer overflow, and exploiting high-level
access patterns to minimize speculation-induced traffic. To show
that the resulting system is truly scalable, we perform simulations
with up to 128 processors. With our optimizations, the speedups for
128 and 64 processors reach 63 and 48, respectively. The average
speedup for 64 processors is 32, nearly four times higher than without
our optimizations.

BibTeX:

@inproceedings{Prvulovic2001,
  author = {Prvulovic, M. and Garzaran, M. J. and Rauchwerger, L. and Torrellas, J.},
  title = {Removing architectural bottlenecks to the scalability of speculative parallelization},
  booktitle = {Proc. 28th Annual International Symposium on Computer Architecture},
  year = {2001},
  pages = {204--215},
  doi = {10.1109/ISCA.2001.937450}
}

L. Ramos, P. Ibáñez, V. Viñals and J.M. Llabería (2000), "Modeling load address behaviour through recurrences", In Proc. ISPASS Performance Analysis of Systems and Software 2000 IEEE Int. Symp. , pp. 101-108.

Abstract: Addresses of load instructions exhibit regularity in their behaviour
which is modelled through several models (locality, repetitive patterns,
etc.) and exploited in processor and memory hierarchy design. Nevertheless,
sparse and symbolic applications are intensive in addressing patterns
not entirely covered by current models. In this work we introduce
a new recurrence among load pairs called “linear link”
in order to identify more regularity from such applications. A linear
link is a type of recurrence between the value read by a (producer)
load and the address issued by a (consumer) load, which is detected
tracking on-the-fly dependencies among loads. We consider a broad
workload (Nas, Olden, Perfect, Spec95 and IAbench) and conclude that
linear links together with stride recurrences can identify many address
streams in symbolic and scientific applications traversing either
dense, linked data structures or compressed forms of sparse arrays.
The two recurrence combinations identify more than 90% of the addresses
in more than a half the programs (in 24 out of 55), and more than
75% of the addresses in 90% of the programs (50 out of 55). Finally,
we show several measures related to the use of linear links as address
predictors for executing loads speculatively and for issuing data
prefetches (prediction distance ahead capacity, etc.).

BibTeX:

@inproceedings{Ramos2000a,
  author = {Ramos, L. and Ibáñez, P. and Viñals, V. and Llabería, J. M.},
  title = {Modeling load address behaviour through recurrences},
  booktitle = {Proc. ISPASS Performance Analysis of Systems and Software 2000 IEEE Int. Symp},
  year = {2000},
  pages = {101--108},
  doi = {10.1109/ISPASS.2000.842288}
}

T. Monreal, A. Gonzalez, M. Valero, J. Gonzalez and V. Viñals (1999), "Delaying physical register allocation through virtual-physical registers", In Proc. 32nd Annual International Symposium on MICRO-32 Microarchitecture. , pp. 186-192.

Abstract: Register file access time represents one of the critical delays of
current microprocessors, and it is expected to become more critical
as future processors increase the instruction window size and the
issue width. This paper presents a novel physical register management
scheme that allows for a late allocation (at the end of execution)
of registers. We show that it can provide significant savings in
number of registers and thus, it can significantly shorten the register
file access time. The approach is based on virtual-physical registers,
which we presented in a previous work, extended with a new register
allocation policy. This policy consists of an on-demand allocation
in order to maximize the register usage, combined with a stealing
mechanism that prevents older instructions from being delayed by younger
ones. This shortens the average number of cycles that each physical
register is allocated, and allows for an early execution of instructions
since they can obtain a physical register for its destination earlier
than with the conventional scheme. Early execution is especially
beneficial for branches and memory operations, since the former can
be resolved earlier and the latter can prefetch their data in advance.

BibTeX:

@inproceedings{Monreal1999,
  author = {Monreal, T. and Gonzalez, A. and Valero, M. and Gonzalez, J. and Viñals, V.},
  title = {Delaying physical register allocation through virtual-physical registers},
  booktitle = {Proc. 32nd Annual International Symposium on MICRO-32 Microarchitecture},
  year = {1999},
  pages = {186--192},
  doi = {10.1109/MICRO.1999.809456}
}

P. Ibáñez, V. Viñals, J.L. Briz and M.J. Garzarán (1998), "Characterization and improvement of load/store cache-based prefetching", In ICS '98: Proceedings of the 12th international conference on Supercomputing. New York, NY, USA , pp. 369-376. ACM.

BibTeX:

@inproceedings{Ibanez1998,
  author = {Ibáñez, Pablo and Viñals, Víctor and Briz, José Luis and Garzarán, María Jesús},
  title = {Characterization and improvement of load/store cache-based prefetching},
  booktitle = {ICS '98: Proceedings of the 12th international conference on Supercomputing},
  publisher = {ACM},
  year = {1998},
  pages = {369--376},
  doi = {10.1145/277830.277921}
}

A. Gonzalez, M. Valero, J. Gonzalez and T. Monreal (1997), "Virtual registers", In Proc. Fourth International Conference on High-Performance Computing. , pp. 364-369.

Abstract: The number of physical registers is one of the critical issues of
current superscalar out-of-order processors. Conventional architectures
allocate, in the decoding stage, a new storage location (e.g. a physical
register) for each operation that has a destination register. When
an instruction is committed, it frees the physical register allocated
to the previous instruction that had the same destination logical
register. Thus, an additional register (i.e. in addition to the number
of logical registers) is used for each instruction with a destination
register from the time it is decoded until it is committed. In this
paper, we propose a novel register organization that allocates physical
registers when instructions complete their execution. In this way,
the register pressure is significantly reduced, since the additional
register is only used from the time execution completes until the
instruction is committed. For some long-latency instructions (e.g.
load with a cache miss) and for parts of the code with a small amount
of parallelism, the savings could be very high. We have evaluated
the new scheme for a superscalar processor and obtained a significant
speedup.

BibTeX:

@inproceedings{Gonzalez1997,
  author = {Gonzalez, A. and Valero, M. and Gonzalez, J. and Monreal, T.},
  title = {Virtual registers},
  booktitle = {Proc. Fourth International Conference on High-Performance Computing},
  year = {1997},
  pages = {364--369},
  doi = {10.1109/HIPC.1997.634516}
}

P. Ibañez and V. Viñals (1996), "Performance assessment of contents management in multilevel on-chip caches", In Proc. 22nd EUROMICRO Conf. EUROMICRO 96. 'Beyond 2000: Hardware and Software Design Strategies'. , pp. 431-440.

Abstract: This paper deals with two-level on-chip cache memories. We show the
impact of three different relationships between the contents of these
levels on the system performance. In addition to the classical Inclusion
contents management, we propose two alternatives, namely Exclusion
and Demand, developing for them the necessary coherence support and
quantifying their relative performance in a design space (sizes,
latencies, ...) in agreement with the constraints imposed by integration.
Two performance metrics are considered: the second-level cache miss
ratio and the system CPI. The experiments have been carried out running
a set of integer and floating point SPEC'92 benchmarks. We conclude
showing the superiority of our improved version of Exclusion throughout
all the sizing and workload spectrum studied.

BibTeX:

@inproceedings{Ibanez1996,
  author = {Ibañez, P. and Viñals, V.},
  publisher = {IEEE Computer Society Press},
  isbn = {0-8186-7487-3},
  title = {Performance assessment of contents management in multilevel on-chip caches},
  booktitle = {Proc. 22nd EUROMICRO Conf. EUROMICRO 96. 'Beyond 2000: Hardware and Software Design Strategies'},
  year = {1996},
  pages = {431--440},
  doi = {10.1109/EURMIC.1996.546467}
}

L. Jimeno, P. Ibáñez and V. Viñals (1996), "Warm Time Sampling: Fast and Accurate Cycle-Level Simulation of Cache Memory", In 22nd Euromicro Conference. Short Contributions. , pp. 39-44.

Abstract: This paper proposes a new technique for reducing cache memory simulation
time when measuring CPI. We perform time-sampling simulation but
still use the parts of the trace that do not belong to the sample
to update the state of the memory system in order to avoid cold-start
problems at the beginning of the next simulated interval. In our
simulation environment and using this warm-up technique we achieve
a reduction by a factor of in the elapsed simulation time with
an error less than in the CPI estimation

BibTeX:

@inproceedings{Jimeno1996,
  author = {L. Jimeno and P. Ibáñez and V. Viñals},
  publisher = {IEEE Computer Society Press},
  isbn = {0-8186-7703-1},
  title = {Warm Time Sampling: Fast and Accurate Cycle-Level Simulation of Cache Memory},
  booktitle = {22nd Euromicro Conference. Short Contributions},
  year = {1996},
  pages = {39--44}
}

Created by JabRef on 27/01/2021.